DevOps / Site Reliability Engineer

Reactor
Millbrae, CA
May 11th, 2026

Computer Systems Engineers/Architects
Special Food Services

Department: Engineering
Location: San Francisco

Description

We're looking for a DevOps / SRE engineer to own the reliability, delivery, and observability of our AI platform. You'll be the person who ensures models get from a developer's branch to production without anyone losing sleep, and when something does go wrong at 2am, you'll be the one who knows where to look.

We run production across multiple Kubernetes clusters, cloud providers, and regions. Our deployment pipeline is fully automated through CI/CD and GitOps, our infrastructure is managed as code, and our observability stack gives us full visibility across every service and GPU workload. This role is about making all of that faster, more reliable, and easier to operate as we scale.

What You'll Do

- Own and evolve our CI/CD pipelines: dynamic pipeline generation across a monorepo of Go services, Python model containers, and Helm charts
- Operate and improve our GitOps deployment lifecycle: Helm releases, Kustomizations, and image automation across multiple clusters
- Build and maintain our observability stack: distributed tracing, metrics, dashboards, and alerting across all services and GPU workloads
- Define and track SLOs for core platform services, including session latency, model cold start time, and streaming reliability
- Run incident response: triage production issues, write postmortems, build runbooks, and drive reliability improvements
- Manage infrastructure-as-code across multiple cloud providers and regions: plan/apply workflows, state management, drift detection
- Operate secret management: encrypted secrets, external secret syncing, certificate automation
- Improve deployment safety: canary rollouts, health checks, startup probes, rollback automation
- Manage authentication infrastructure: OIDC federation for CI, workload identity for cloud services, cross-cloud credential management
- Participate in on-call rotation and build the tooling that makes on-call less painful

What We're Looking For

- You've run production Kubernetes clusters and been on-call for them. You've debugged node scheduling failures, OOM kills, and mysterious pod evictions at 3am
- Strong CI/CD experience: you've built and maintained pipelines for monorepos, not just single-service repos
- GitOps experience: you understand reconciliation loops, drift detection, and why image automation matters
- Infrastructure-as-code fluency with Terraform or similar across multiple environments and cloud accounts
- You know observability beyond just "set up dashboards". You've defined SLOs, built alerting that doesn't page on noise, and used traces to debug cross-service latency issues.
- Comfortable with secret management patterns (KMS, encrypted configs, external secret operators). You've thought about credential rotation and zero-trust.
- Incident response experience: you've triaged production outages, written postmortems that actually led to improvements, and built runbooks that other engineers could follow
- You write code, not just YAML. Proficiency in Go, Python, or Bash for building tooling, automation, and pipeline scripts

Nice to Have

- Experience with GPU workloads on Kubernetes: device plugins, GPU-aware scheduling, GPU monitoring
- Multi-cloud operations beyond a single provider
- Real-time or streaming workloads: low-latency systems where p99 matters more than average
- Experience with Helm chart authoring and managing complex value layering across environments
- Familiarity with real-time media or relay infrastructure
- FinOps experience: GPU cost optimization, spot/preemptible instance management

What We're Not Looking For

- Engineers who treat infrastructure-as-code as "click around in the console and import later"
- SREs who've only monitored systems but never built the deployment pipelines that ship to them
- Candidates whose CI/CD experience is limited to GitHub Actions for a single-service repo
- People who write alerts that fire every day and then get ignored

Logistics

We are based in-person in San Francisco. We are also hiring for this role in Europe for on-call coverage and timezone distribution.

Benefits

- Competitive salary and meaningful early equity
- Visa sponsorship and relocation support
- Generous health, dental, and vision coverage
