
Senior Site Reliability Engineer / DevOps Engineer

Location: Onsite - Mountain View, CA
Experience Required: 5+ years
Infrastructure Footprint: Global production infrastructure across AWS, South America, and Europe
Role Type: Hands-on engineering role

Role Overview

Seeking a Senior Site Reliability Engineer / DevOps Engineer to design, scale, and operate highly available global infrastructure supporting production systems across multiple international regions.

This role is for an engineer with 5+ years of experience building and running production-grade cloud infrastructure. The right person understands where distributed systems fail and has learned the hard lessons that come from operating Kubernetes and cloud platforms at scale.

The ideal candidate has deep hands-on experience with Kubernetes, ArgoCD, Terraform, CI/CD pipelines, AWS infrastructure, and multi-region platform reliability. They should understand the limitations, sharp edges, and operational failure modes of these tools.

This is an onsite role working closely with platform engineering and leadership to build resilient global infrastructure.

What You’ll Do

Global Infrastructure Architecture
- Design and operate globally distributed production infrastructure across AWS regions and physical data center environments in South America and Europe
- Build highly available multi-region systems with strong disaster recovery and failover strategies
- Solve cross-region networking, latency, DNS routing, replication, and reliability challenges

Kubernetes Platform Engineering
- Build, scale, secure, and troubleshoot production Kubernetes clusters
- Handle cluster lifecycle management, upgrades, node failures, networking issues, storage problems, and control-plane troubleshooting
- Tune workloads for resiliency, scheduling efficiency, autoscaling behavior, and resource optimization
- Debug real-world Kubernetes issues, including:
  - etcd instability
  - networking overlays and CNI failures
  - ingress/controller edge cases
  - persistent volume failures
  - node pressure and eviction behavior
  - cluster upgrade regressions

GitOps / ArgoCD Operations
- Design and maintain GitOps workflows using ArgoCD
- Manage promotion pipelines across environments and regions
- Resolve drift detection issues, sync conflicts, reconciliation failures, and deployment ordering challenges
- Build safe rollback and progressive deployment strategies

Candidates should know why ArgoCD breaks, not just how to click “Sync.”

Infrastructure as Code
- Build and maintain reusable Terraform modules for multi-region infrastructure
- Manage state strategy, workspace isolation, secrets handling, and provider complexity
- Solve real-world Terraform pain points, including:
  - state corruption and locking conflicts
  - module version drift
  - provider upgrade regressions
  - dependency graph surprises
  - cross-account provisioning complexity

CI/CD Engineering
- Build and optimize production CI/CD pipelines
- Improve deployment speed, safety, and repeatability
- Troubleshoot flaky pipelines, artifact inconsistencies, race conditions, environment drift, and rollback failures

Reliability & Observability
- Establish SLIs/SLOs and production health standards
- Build alerting, monitoring, tracing, and incident response workflows
- Lead root cause analysis and postmortem improvements
- Reduce operational toil through automation

Why This Role

You’ll own foundational infrastructure decisions for globally distributed systems and help build resilient platform capabilities at international scale.

This is a hands-on engineering role for someone who wants meaningful ownership and complex technical problems.

Requirements

Required Experience
- 5+ years in Site Reliability Engineering, DevOps, or Platform Engineering
- Deep production experience with:
  - Kubernetes
  - ArgoCD
  - Terraform
  - AWS
  - CI/CD systems
  - Linux systems administration
  - Infrastructure automation

Preferred Experience
- Experience operating infrastructure across multiple continents
- Experience with hybrid cloud or physical data center integration
- Strong networking knowledge, including BGP, VPNs, routing, DNS, and load balancing
- Experience with security hardening and compliance in production systems
- Software engineering background with Go, Python, or Bash

What “Senior” Means Here

You have enough production experience to have strong opinions because you have seen failures firsthand.

You know:
- why Terraform plans sometimes lie
- why ArgoCD syncs can fail for non-obvious reasons
- why Kubernetes upgrades can ruin your week
- why “works in staging” means very little
- why multi-region failover diagrams often fail in production
- why observability usually breaks exactly when needed most

You’ve solved these problems repeatedly and improved systems because of those lessons.