IOC Systems Specialist
Schedule: Onsite | 12-hour rotating shifts (4-on/3-off alternating with 3-on/4-off)

Optomi, in partnership with a leading AI cloud infrastructure organization, is seeking an IOC Systems Specialist to join their growing operations team in Fort Worth, TX. This role will provide Tier 2 operational support for high-performance computing (HPC) cloud environments focused on large-scale AI training and inference workloads. The ideal candidate will have hands-on experience supporting HPC infrastructure, Kubernetes environments, Slurm workload management, and enterprise storage platforms such as WEKA and VAST. This individual will play a key role in maintaining system stability, troubleshooting complex incidents, and supporting mission-critical infrastructure within a 24x7 IOC/NOC environment.

What the Right Candidate Will Enjoy:
- Working with cutting-edge AI and HPC infrastructure technologies!
- Supporting large-scale GPU cluster environments!
- Exposure to advanced Kubernetes, cloud, and storage technologies!
- Opportunities to contribute to operational improvements and automation initiatives!
- Joining a fast-growing organization focused on sustainable, renewable-powered AI infrastructure!
- Collaborative environment with strong technical leadership and growth opportunities!

What Type of Experience the Right Candidate Has:
- 2–5 years of experience supporting or operating HPC clusters in production environments
- Strong operational experience with WEKA and VAST storage platforms
- Hands-on experience with Kubernetes administration and troubleshooting
- Experience supporting Slurm workload manager environments
- Familiarity with HPC monitoring, observability, and alerting platforms
- Experience performing incident response and root cause analysis in complex systems
- Understanding of cloud platforms such as AWS, Azure, or GCP
- Knowledge of HPC networking and storage technologies, including InfiniBand and high-throughput interconnects

Responsibilities of the Right Candidate:
- Provide Tier 2 operational support for HPC cloud infrastructure environments
- Monitor, troubleshoot, and resolve incidents involving Kubernetes, Slurm, storage, networking, and cloud systems
- Serve as an escalation point for Tier 1 support teams
- Perform root cause analysis and coordinate with engineering teams on permanent resolutions
- Execute operational changes, upgrades, patching, and maintenance activities
- Maintain and improve operational documentation, runbooks, and knowledge base articles
- Support monitoring and observability tooling to proactively identify system issues
- Assist with operational readiness and production support for new HPC capabilities
- Mentor junior operations staff and support continuous service improvement initiatives
- Participate in on-call rotations and major incident response activities

Job Must Haves:
- Must have hands-on experience with WEKA and VAST storage environments
- 2–5 years supporting HPC clusters in production or IOC/NOC environments
- Working knowledge of Kubernetes
- Operational experience with Slurm workload manager
- Familiarity with HPC monitoring and observability tooling
- Experience with incident response and root cause analysis
- Understanding of AWS, Azure, or GCP cloud platforms
- Knowledge of HPC networking and storage infrastructure
- Ability to work onsite in Fort Worth on a rotating 12-hour shift schedule

Nice to Have Skills:
- Bare-metal Kubernetes experience
- Relevant certifications such as CKA/CKAD, RHCSA, Linux+, ITIL, or Server+
- Experience with GPU or HPC vendor technologies
- Experience supporting AI or large-scale compute environments
- Automation or scripting experience