Senior Site Reliability Engineer - AI Infrastructure Operations

About Nscale

Nscale is the GPU cloud engineered for AI, purpose-built to deliver high-performance, cost-efficient infrastructure for AI-native startups and global enterprises. We enable organizations to accelerate innovation, reduce the complexity of AI development, and achieve meaningful business outcomes through scalable, sustainable compute.

Our culture is defined by ownership, accountability, and rapid innovation. We operate with urgency and transparency, and every team member contributes to building the infrastructure powering the future of AI.

The Opportunity

Nscale’s AI Infrastructure Operations team supports one of the most demanding AI platforms in the industry. We are looking for a Senior Site Reliability Engineer to help design, build, and operate reliable, scalable infrastructure across our GPU cloud.

This role is focused on hands-on engineering, system reliability, and operational excellence. You will work across software, systems, and infrastructure to improve performance, automate operations, and ensure platform stability at scale.

What You’ll Be Doing

- Design, build, and improve automation, tooling, and infrastructure systems supporting AI and HPC workloads
- Contribute to the development of control-plane systems and operational frameworks
- Define and implement SLOs, SLIs, and monitoring strategies to ensure system reliability
- Participate in incident response and root cause analysis, driving improvements to reduce recurrence
- Identify and address reliability and performance bottlenecks across systems
- Collaborate with Engineering, Network, and Fleet teams to improve system design and operational processes
- Drive improvements in availability, scalability, and operational efficiency
- Mentor junior engineers and contribute to a strong engineering and reliability culture

What You Bring

- 5–8+ years of experience in SRE, Systems Engineering, or Software Engineering in production environments
- Strong software engineering skills with experience building automation and infrastructure tooling
- Solid understanding of Linux systems, networking, and distributed systems
- Experience troubleshooting issues across infrastructure, OS, networking, and application layers
- Familiarity with monitoring, alerting, and observability tools
- Ability to balance reliability, performance, and delivery speed

Preferred Experience

- Experience with AI or HPC environments, including GPUs or high-performance systems
- Exposure to high-speed networking (InfiniBand/RDMA)
- Familiarity with Kubernetes, cloud platforms, or bare-metal environments
- Experience with observability systems in high-scale environments

The range below reflects the base salary for the position. Actual compensation may vary based on job-related factors such as skill set, experience, education, and location. In addition to base salary, this role may be eligible for bonus, equity, and/or commission programs. Nscale may offer a competitive benefits package including medical, dental, vision, flexible paid time off, parental leave, and retirement plan participation.

Salary Range: $100,000 USD - $165,000 USD

For information on how Nscale handles candidate personal data, please see our Employee & Candidate Privacy Notice.