{"schemaVersion":"jobsearcher.job.v1","id":"9edf32bc771cacda4cb402c6","url":"https://jobsearcher.com/jobs/9edf32bc771cacda4cb402c6","canonicalUrl":"https://jobsearcher.com/jobs/9edf32bc771cacda4cb402c6","title":"Staff DevOps Engineer","description":"This is a hands-on Staff DevOps Engineer role responsible for designing, operating, and evolving a highly available, multi-tenant platform on AWS. You will work closely with software engineering to deploy, operate, and scale production systems while driving improvements in reliability, automation, and performance.\n\nThis role requires strong ownership of infrastructure and production systems. You will also provide technical leadership and mentorship to other DevOps engineers.\n\nYou will also help introduce and operationalize AI/LLM capabilities within the platform.\n\nAI / LLM Systems (Emerging Area)\nExperience operating or integrating LLM/AI services in production environments, including tracing and evaluation (OpenTelemetry, LangSmith, LangFuse or equivalent)\nExperience managing performance, cost, and reliability of LLM workloads (latency, token usage, rate limiting, fallbacks)\nExperience using AI/agentic developer tools (e.g., Claude Code, Cursor or similar) to accelerate DevOps workflows and improve engineering efficiency\n\nWhat You'll Be Responsible For\nDesign, build, and operate scalable, highly available infrastructure in AWS\nOwn and evolve infrastructure as code (Terraform) across all environments\nOperate and optimize Aurora PostgreSQL (replication, failover, performance tuning)\nOperate ECS (Fargate), ECR, and containerized services\nOperate Kafka-based event streaming systems\nManage Auto Scaling Groups and EC2-based workloads\nDesign and maintain CI/CD pipelines (Buildkite)\nBuild automation to eliminate manual operational work\nManage and secure secrets and access (Vault, AWS Secrets Manager, IAM)\nPartner with engineering teams to improve system reliability and performance\nProvide technical leadership and mentorship to DevOps engineers\nDrive cost optimization across AWS infrastructure\nOperate systems behind Cloudflare (WAF, CDN, traffic management)\n\nProduction Reliability & Incident Ownership\nOwn production incident response end-to-end (triage, mitigation, coordination)\nLead high-severity outage response under pressure\nDrive root cause analysis (RCA) and enforce follow-ups\nContinuously improve system resilience and recovery mechanisms\n\nObservability & System Insight\nDesign and operate end-to-end observability (metrics, logs, tracing)\nBuild high-signal monitoring, alerting, and dashboards\nDefine and enforce SLIs/SLOs and alerting standards\nReduce alert fatigue and improve signal-to-noise ratio\n\nWhat You Bring\nDeep experience operating production systems on AWS (ECS/Fargate, EC2, networking, IAM)\nExpert-level Terraform experience managing infrastructure at scale\nStrong experience with containerized applications and distributed systems (e.g., Kafka)\nExperience operating multi-tenant, highly available systems\nProven ownership of production on-call and resolving critical incidents\nStrong systems fundamentals (Linux, networking, debugging)\nStrong scripting ability (Bash, Python or equivalent)\nExperience designing and operating CI/CD systems\nStrong understanding of security best practices (IAM, secrets management)\n\nNice to Have\nExperience operating multi-region or globally distributed systems\nExperience working with Cloudflare at scale\nExperience optimizing high-throughput or event-driven systems","company":"Way2b1","rawCompany":"way2b1","city":"Millbrae","state":"CA","isRemote":false,"isActive":false,"createdAt":"2026-05-06T08:47:00.156Z","occupations":[{"code":"15-1299.08","title":"Computer Systems Engineers/Architects","slug":"computer-systems-engineers-architects"},{"code":"15-1252.00","title":"Software Developers","slug":"software-developers"},{"code":"15-1244.00","title":"Network and Computer Systems Administrators","slug":"network-and-computer-systems-administrators"}],"industries":[{"code":"541512","title":"Computer Systems Design Services","slug":"computer-systems-design-services"},{"code":"518210","title":"Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services","slug":"computing-infrastructure-providers-data-processing-web-hosting-and-related-services"},{"code":"513210","title":"Software Publishers","slug":"software-publishers"}],"jobPosting":{"@context":"https://schema.org","@type":"JobPosting","title":"Staff DevOps Engineer","description":"This is a hands-on Staff DevOps Engineer role responsible for designing, operating, and evolving a highly available, multi-tenant platform on AWS. You will work closely with software engineering to deploy, operate, and scale production systems while driving improvements in reliability, automation, and performance.\n\nThis role requires strong ownership of infrastructure and production systems. You will also provide technical leadership and mentorship to other DevOps engineers.\n\nYou will also help introduce and operationalize AI/LLM capabilities within the platform.\n\nAI / LLM Systems (Emerging Area)\nExperience operating or integrating LLM/AI services in production environments, including tracing and evaluation (OpenTelemetry, LangSmith, LangFuse or equivalent)\nExperience managing performance, cost, and reliability of LLM workloads (latency, token usage, rate limiting, fallbacks)\nExperience using AI/agentic developer tools (e.g., Claude Code, Cursor or similar) to accelerate DevOps workflows and improve engineering efficiency\n\nWhat You'll Be Responsible For\nDesign, build, and operate scalable, highly available infrastructure in AWS\nOwn and evolve infrastructure as code (Terraform) across all environments\nOperate and optimize Aurora PostgreSQL (replication, failover, performance tuning)\nOperate ECS (Fargate), ECR, and containerized services\nOperate Kafka-based event streaming systems\nManage Auto Scaling Groups and EC2-based workloads\nDesign and maintain CI/CD pipelines (Buildkite)\nBuild automation to eliminate manual operational work\nManage and secure secrets and access (Vault, AWS Secrets Manager, IAM)\nPartner with engineering teams to improve system reliability and performance\nProvide technical leadership and mentorship to DevOps engineers\nDrive cost optimization across AWS infrastructure\nOperate systems behind Cloudflare (WAF, CDN, traffic management)\n\nProduction Reliability & Incident Ownership\nOwn production incident response end-to-end (triage, mitigation, coordination)\nLead high-severity outage response under pressure\nDrive root cause analysis (RCA) and enforce follow-ups\nContinuously improve system resilience and recovery mechanisms\n\nObservability & System Insight\nDesign and operate end-to-end observability (metrics, logs, tracing)\nBuild high-signal monitoring, alerting, and dashboards\nDefine and enforce SLIs/SLOs and alerting standards\nReduce alert fatigue and improve signal-to-noise ratio\n\nWhat You Bring\nDeep experience operating production systems on AWS (ECS/Fargate, EC2, networking, IAM)\nExpert-level Terraform experience managing infrastructure at scale\nStrong experience with containerized applications and distributed systems (e.g., Kafka)\nExperience operating multi-tenant, highly available systems\nProven ownership of production on-call and resolving critical incidents\nStrong systems fundamentals (Linux, networking, debugging)\nStrong scripting ability (Bash, Python or equivalent)\nExperience designing and operating CI/CD systems\nStrong understanding of security best practices (IAM, secrets management)\n\nNice to Have\nExperience operating multi-region or globally distributed systems\nExperience working with Cloudflare at scale\nExperience optimizing high-throughput or event-driven systems","datePosted":"2026-05-06T08:47:00.156Z","dateModified":"2026-05-06T08:47:00.156Z","hiringOrganization":{"@type":"Organization","name":"Way2b1","sameAs":"https://jobsearcher.com"},"jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Millbrae","addressRegion":"CA","addressCountry":"US"}},"identifier":{"@type":"PropertyValue","name":"JobSearcher","value":"9edf32bc771cacda4cb402c6"},"url":"https://jobsearcher.com/jobs/9edf32bc771cacda4cb402c6"}}