
Site Reliability Engineer

Job Description

Site Reliability Engineer
Onsite - Bay Area, CA

Relevant Skills and Experience

What You'll Do (Day-to-Day):
- Own and manage our cloud infrastructure (GCP or AWS, on-prem).
- Build, maintain, and optimize Kubernetes clusters (including GPU-backed clusters).
- Implement and improve CI/CD pipelines (GitHub Actions).
- Write and maintain Infrastructure as Code (Terraform).
- Monitor system health and performance using Grafana and other observability tools.
- Ensure high availability, reliability, and uptime across platforms.
- Handle infrastructure maintenance, upgrades, and scaling.
- Administer and improve our platform architecture and apply general security best practices across the stack.

Note: This is an internal-facing role with no customer interaction.

Must-Have:
- 4+ years in SRE, DevOps, or Infrastructure Engineering
- Solid experience with GCP or AWS (hybrid/on-prem a plus)
- Experience with Kubernetes cluster management (GPU experience a bonus)
- Hands-on with Terraform and CI/CD (GitHub)
- Experience with monitoring/observability (Grafana, etc.)
- Strong understanding of high availability and infrastructure reliability
- Familiarity with platform/cluster architecture and administration
- Security mindset and ability to apply best practices

Nice-to-Have:
- Startup experience (you enjoy building, not just maintaining)
- Experience with scalable GPU infrastructure for AI/ML
