
[Remote] AI Support Operations Engineer

NerdLevel Tech | Irvine, CA | Remote | June 4th, 2026
Note: This is a remote job open to candidates in the USA.

CyberCoders is creating the next generation of AI-optimized data center infrastructure. The Staff AI Support Operations Engineer will lead the Ops team, focusing on architecting and deploying AI compute clusters while providing expert support and building operational standards.

Responsibilities

Collaborate with engineering teams to architect, deploy, and bring new AI compute clusters online while delivering expert-level support for existing high-density GPU environments
Own NetBox and related internal systems, ensuring all infrastructure data is accurate, consistent, and reliably maintained
Build and refine internal automation using Python, Ansible, and Terraform to eliminate manual workflows and modernize fragile legacy processes
Serve as the highest technical escalation point for customer and internal issues prior to involvement from Platform or Network/Undercloud teams
Transform tribal knowledge into clear, durable SOPs and technical documentation that establish the operational "gold standard"
Raise the technical bar for the team through code reviews, architectural guidance, and mentorship as the organization scales

Skills

Enterprise-Grade Server Proficiency: Advanced operational knowledge of HPE, Dell, and SuperMicro platforms, including IPMI, BMC, iDRAC workflows, and familiarity with Redfish-based management
Core Engineering Toolkit: Mastery of Python, Ansible, and Terraform as primary tools for automation, orchestration, and infrastructure lifecycle management
Linux Performance Engineering: Strong capability in diagnosing and tuning Linux systems, resolving performance bottlenecks, and optimizing workloads at the OS level
Advanced Incident Resolution: Demonstrated experience serving as the final technical escalation point for complex, high-impact infrastructure failures
Cloud-Native Operations: Proven production experience operating and troubleshooting Kubernetes environments
Next-Generation GPU Hardware: Familiarity with NVIDIA Blackwell (B200/B300) or Hopper (H100/H200) architectures
High-Performance Fabrics: Experience with InfiniBand or RoCE networking, and modern high-throughput storage platforms such as Weka or VAST Data
Bare-Metal Provisioning: Exposure to OpenStack or Canonical MAAS for automated provisioning of physical infrastructure

Benefits

Bonus
RSUs