JOBSEARCHER

HPC Observability Engineer

Posted by: EIT Professionals Corp

Role: HPC Observability Engineer (Python, HPC)
Location: Remote
Contract

Description:
The client has Grafana and InfluxDB services running on K8S in-house on-premises. Telegraf is used to ingest data from a GPU HPC cluster into InfluxDB. This engineer will help collect and visualize data for the “Terra” platform.

The HPC Observability Engineer should have experience in:
- Setting up and maintaining Grafana dashboards for HPC environments
- Creating drill-down dashboards for servers, including metrics like memory, network, and CPU utilization
- Exploring and utilizing out-of-the-box metrics from InfluxDB
- Writing Python scripts for data ingestion into InfluxDB with examples
- Developing a proof of concept with a simple Python script to monitor load
- Ingesting Infiniband packet data
- Monitoring LSF jobs in various states
- Visualizing server-specific and cluster-wide metrics in Grafana
- Optional: Integrating third-party plugins like DDN’s Lustre, Mellanox fabric, etc.

Qualifications and Skills:
- B.Tech, MS, or PhD in Computer Science or related field
- 5-8 years of experience with Grafana, InfluxDB, and Telegraf
- Experience in Python and Bash scripting is a plus
- Knowledge of Docker and Google Cloud Platform is advantageous
- HPC operations experience is beneficial
- Strong communication skills and ability to work independently
- Proficiency in requirements analysis and automated testing
- Ability to write efficient, secure, and well-documented Python code
- Experience with Git and pipeline development
- Awareness of modern security and development practices

Responsibilities:
- Develop and leverage Grafana dashboards and Telegraf configurations
- Create dashboards for server and cluster metrics
- Develop Python scripts for data ingestion and documentation
- Visualize non-native resources in Grafana
- Optional: Integrate third-party plugins
- Maintain high-quality code and documentation
- Collaborate with teams to troubleshoot and optimize pipelines

Desired Skills:
- Python (good to have)
- Bash scripting (good to have)
- Docker (must)
- HPC operations and LSF (good to have)
- Experience with DDN Lustre, Mellanox fabric (good to have)
- Google Cloud Platform (good to have)
- Knowledge of Git (must)

Seniority level: Mid-Senior level
Employment type: Contract
Job function: Engineering and Information Technology
Industries: IT Services and IT Consulting

This job is active and accepting applications.