Operations Engineer, HPC Networking
Job Description:
Monitor health and performance of InfiniBand and Ethernet fabrics: switches, HCAs, transceivers, links.
Investigate and resolve fabric issues: connectivity, congestion, performance regressions.
Support fabric bring-up alongside DC ops and customer-facing teams.
Run maintenance and upgrades on switches and control plane components.
Partner with cluster ops on cross-domain incidents where the line between compute and network is blurry.
Improve the tooling and runbooks so the next incident resolves faster than the last.

Requirements:
Operated InfiniBand fabrics in production: subnet manager, routing, partitioning, monitoring.
Debugged the full stack: cables, transceivers, switch firmware, HCAs, drivers, NCCL.
Brought up new fabrics from cable pull through validation.
Scripted your way through repetitive operational work (Bash, Python, Go, whatever).