Lead Infrastructure Engineer: GPU Fleet (HPC)
Company: Intelagen
Type: Contract C2C or 1099 | Full-time (Critical) | Rate: Negotiable
Location: Remote, North America (Core working hours must overlap with EST/PST business hours)

About the Role

Intelagen is scaling the next generation of AI infrastructure. We are seeking a Lead GPU Infrastructure Engineer for one of our clients to architect and own the lifecycle of our high-density GPU fleet (H200, B200, and B300). You will not be inheriting legacy systems; you will be building the software-defined systems that deliver enterprise-grade availability for massive production AI training workloads.

If you view manual "toil" as a bug to be fixed with code, and you want to write the playbook for deploying the most powerful AI hardware on the planet, we want you on our team.

CORE RESPONSIBILITIES

Fleet Architecture & Lifecycle: Own the end-to-end health of our H200, B200, and B300 nodes. You are responsible for the “Day 0” to “Day N” lifecycle, from firmware validation and bare-metal provisioning to decommissioning.

Thermal & Power Management: Lead the operational oversight of high-density liquid-cooled environments. Monitor CDU (Coolant Distribution Unit) health and secondary loop telemetry alongside GPU thermals for extreme 120kW+ racks.

Auto-Remediation & Observability: Architect a telemetry stack using Prometheus, Grafana, and NVIDIA DCGM that doesn’t just alert you to issues, but actively triggers automated remediation (e.g., automated node draining, reboots, and health validation) for common hardware regressions.

NetBox Integration: Own the migration of our inventory to NetBox DCIM. Build the API integrations that make NetBox the undisputed, authoritative source of truth for asset tracking, IPAM, and cabling for our compliance audits.

Vendor & Operator Authority: Serve as the primary technical interface for third-party facility operators and MSPs. Set the bar for SLA/KPI compliance, lead technical post-mortems, and manage escalations for cluster-level outages.

Commercial Support: Serve as the technical authority on enterprise deal cycles, supporting the Sales team with capacity planning, infrastructure deep-dives, and technical reviews for top-tier clients.

On-Call Leadership: Participate in a 24/7 on-call rotation. This role carries primary accountability for fleet availability and incident response.

TECHNICAL REQUIREMENTS

HPC & GPU Pedigree: Extensive experience managing large-scale HPC environments or production GPU fleets at a hyperscaler, neocloud, or top-tier research facility.

Hopper & Blackwell Mastery: Deep, hands-on experience with H200, B200, or B300 systems. You must intimately understand the unique power, thermal, and networking demands of Blackwell-class hardware.

Fabric & Interconnects: Expert knowledge of 400G/800G InfiniBand (ConnectX-7 NDR / ConnectX-8 XDR), NVLink, and NVSwitch architectures.

Engineering Mindset: Strong Linux internals and proven proficiency in building bulletproof infrastructure automation using Python or Go.

Observability: Deep experience deploying and scaling DCGM-based telemetry and SNMP-based environmental monitoring.

STRONG PLUS

• Liquid Cooling Experience: Direct experience with direct-to-chip liquid cooling (DLC) systems, coolant chemistry management, or immersion cooling.
• NVIDIA Mission Control: Familiarity with NVIDIA Mission Control for Blackwell-class cluster management.
• Confidential Compute: Expertise in Intel TDX or NVIDIA RIM attestation flows.
• Early-Stage Growth: Prior experience as an initial infrastructure hire responsible for building standards from the ground up.

Intelagen LLC is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.