Head of Global Operations

About TensorWave

Our mission is simple: deliver seamless, secure, reliable, and resilient AI compute at scale. We've built a versatile cloud platform that eliminates infrastructure barriers, empowering builders to focus on innovation instead of fighting their stack. Because breakthrough AI should move at the speed of ideas, not infrastructure.

About The Role

We’re looking for a Head of Global Operations to join our team during an exciting phase of growth. In this role, you’ll be responsible for standardizing 24/7 monitoring and incident response operations across all TensorWave data centers, working closely with cross-functional partners to support business objectives while upholding our standards for excellence, collaboration, and impact.

What You’ll Do

- Build the Global Operations organization, including solidifying team structure, shift models, and staffing plans to achieve 24/7 coverage
- Recruit, hire, and develop Operations engineers and shift leads who can monitor complex GPU cloud infrastructure
- Establish a culture of operational excellence, clear communication, and continuous improvement
- Define career progression paths for Global Operations team members and invest in their technical development
- Formalize real-time monitoring processes across all TensorWave data centers covering compute, networking, storage, power, and cooling systems
- Document improvements for incident detection, triage, and escalation processes with clear severity classifications and response SLAs
- Own L1 incident response including initial diagnosis, runbook execution, and coordination of engineering escalations
- Maintain and update status page communications during incidents, ensuring internal and external stakeholders receive timely updates
- Drive post-incident review participation and ensure GOC-identified lessons are fed back into runbooks and alerting
- Develop and maintain a comprehensive runbook library covering common failure modes, triage procedures, and escalation paths
- Define alert tuning strategies to minimize noise and ensure high-signal alerting across the infrastructure
- Establish GOC operational metrics and reporting, including MTTD, MTTR, alert-to-incident ratios, and escalation rates
- Collaborate with engineering teams to ensure new infrastructure deployments include monitoring and alerting coverage from day one
- Partner with Customer Experience to define the handoff between GOC real-time monitoring and customer-facing communications during incidents
- Work closely with Data Center Operations and Infrastructure Engineering to maintain escalation matrices and on-call schedules
- Coordinate with the PMO on change management processes to ensure the GOC is informed of planned maintenance and infrastructure changes
- Contribute to SLA management by ensuring monitoring and response capabilities align with customer contractual commitments

Who You Are

Required Qualifications

- 5+ years of experience working in a NOC or infrastructure operations center environment
- Demonstrated experience supporting 24/7 customer-facing operations, including shift scheduling and on-call management
- Hands-on experience with monitoring and observability platforms (e.g., Grafana, Prometheus, or similar)
- Strong understanding of incident management frameworks, including severity classification, escalation procedures, and post-incident review
- Experience writing and maintaining operational runbooks for infrastructure triage and remediation
- Working knowledge of networking fundamentals (TCP/IP, DNS, BGP, VLANs) and Linux system administration
- Proven ability to hire, train, and lead technical operations teams
- Excellent written and verbal communication skills, particularly in high-pressure incident scenarios
- Experience with ticketing and incident tracking systems (e.g., PagerDuty, ServiceNow, Jira, or equivalent)

Preferred Qualifications

- Experience building a GOC function from scratch at a cloud provider, managed services provider, or data center operator
- Familiarity with GPU infrastructure, HPC clusters, or AI/ML workloads
- Experience with Kubernetes and container orchestration platforms in a production environment
- Knowledge of data center physical infrastructure including power distribution, cooling, and cabling
- Experience operating across multiple geographically distributed data centers
- Background in cloud infrastructure providers (AWS, GCP, Azure, or neocloud/GPU cloud providers)
- ITIL certification or equivalent experience with IT service management frameworks

What We Offer

- Stock Options
- 100% paid Medical, Dental, and Vision insurance for Employees
- Company Health Savings Account Contributions
- 100% paid Short Term and Long Term Disability Insurance for Employees
- Life and Voluntary Supplemental Insurance Options
- Other Insurance Options, such as Pet & Legal Insurance
- Various Supplementary Health Benefits, such as discounted Virtual Healthcare Appointments and Serious Illness Support
- Flexible Spending Account
- 401(k)
- Employee Assistance Program
- Flexible PTO
- Paid Holidays
- Parental Leave
- Other In-Office Perks

Equal Employment Opportunity

TensorWave is an Equal Opportunity Employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of any protected status under applicable law.

Reasonable Accommodations

TensorWave provides reasonable accommodations in accordance with applicable laws. If you require accommodation during the hiring process, please contact accomodations@tensorwave.com.

Employment Eligibility

All offers of employment are contingent upon verification of identity and authorization to work in the United States, as required by law.

Background Checks

Where permitted by law, employment may be contingent upon the successful completion of a job-related background check.

Data Privacy Notice

By submitting an application, you acknowledge that TensorWave may collect, use, and retain your personal information for recruiting and employment-related purposes in accordance with applicable data privacy laws.