HPC Observability Engineer
Posted by: EIT Professionals Corp
Role: HPC Observability Engineer (Python, HPC)
Location: Remote
Contract
Description: The client runs Grafana and InfluxDB on an on-premises Kubernetes (K8s) cluster. Telegraf ingests data from a GPU HPC cluster into InfluxDB. This engineer will help collect and visualize data for the "Terra" platform. The HPC Observability Engineer should have experience in:
Setting up and maintaining Grafana dashboards for HPC environments
Creating drill-down dashboards for servers, including metrics like memory, network, and CPU utilization
Exploring and utilizing out-of-the-box metrics from InfluxDB
Writing Python scripts, with examples, for ingesting data into InfluxDB
Developing a proof of concept with a simple Python script to monitor load
Ingesting InfiniBand packet data
Monitoring LSF jobs in various states
Visualizing server-specific and cluster-wide metrics in Grafana
Optional: Integrating third-party plugins like DDN's Lustre, Mellanox fabric, etc.
Qualifications and Skills:
B.Tech, MS, or PhD in Computer Science or a related field
5-8 years of experience with Grafana, InfluxDB, and Telegraf
Experience in Python and Bash scripting is a plus
Knowledge of Docker and Google Cloud Platform is advantageous
HPC operations experience is beneficial
Strong communication skills and ability to work independently
Proficiency in requirements analysis and automated testing
Ability to write efficient, secure, and well-documented Python code
Experience with Git and pipeline development
Awareness of modern security and development practices
Responsibilities:
Develop and leverage Grafana dashboards and Telegraf configurations
Create dashboards for server and cluster metrics
Develop and document Python scripts for data ingestion
Visualize non-native resources in Grafana
Optional: Integrate third-party plugins
Maintain high-quality code and documentation
Collaborate with teams to troubleshoot and optimize pipelines
Desired Skills:
Python (good to have)
Bash scripting (good to have)
Docker (must)
HPC operations and LSF (good to have)
Experience with DDN Lustre, Mellanox fabric (good to have)
Google Cloud Platform (good to have)
Knowledge of Git (must)
Seniority level: Mid-Senior level
Employment type: Contract
Job function: Engineering and Information Technology
Industries: IT Services and IT Consulting