Senior Site Reliability Engineer
East Boston, MA
April 3rd, 2026
No H1 or C2C. Must be a Permanent Resident or US Citizen.

Description and Requirements

About Our Team

We are building Quantum, a next‑generation hybrid AI platform that spans Windows, Android, and cloud. As part of this vision, we are expanding the reliability engineering organization that powers cross‑device Personal AI.

We are looking for Senior Site Reliability Engineers (SREs) to help us build and evolve the foundational reliability, observability, and operations capabilities that ensure fast, safe, and dependable experiences for millions of users.

This role may support one of several teams within the SRE organization (e.g., Observability, Operations, or Service Reliability), depending on your strengths and interests.

We operate with the speed, ownership, and creative latitude of a startup, yet are supported by scale, resources, and technical depth. We are building new systems, new tooling, and new operational models from the ground up, and we are doing so with clarity, intention, and high engineering standards.

Location: Open to remote work in the US. The preferred work location is Chicago, IL.

What You Might Work On

As a Senior SRE, you may be responsible for a subset of the following, depending on team placement and skill alignment:

Reliability & Performance Engineering
- Improving the availability, scalability, and performance of distributed systems across device, edge, and cloud.
- Defining or refining SLIs, SLOs, and error budgets for critical services.
- Leading initiatives to remove single points of failure, improve resilience, and reduce operational risk.

Operational Excellence
- Participating in on‑call rotations and contributing to incident response, triage, and post-incident reviews.
- Developing automation, runbooks, and self‑healing systems to reduce alert noise and MTTR.
- Enhancing operational readiness and supporting incident prevention programs.

Observability & Insight
- Designing or improving observability systems using OpenTelemetry, Grafana, and modern signal pipelines.
- Building dashboards, analytics, and alerting that illuminate system health and AI service behavior.
- Ensuring telemetry is reliable, actionable, and tied to real‑world outcomes.

Deployments & Change Safety
- Improving reliability of CI/CD workflows, including phased rollouts, canaries, shadow testing, and safe rollback mechanisms.
- Contributing to the evolution of deployment tooling for device+edge+cloud hybrid systems.

Systems Design & Collaboration
- Influencing architectural decisions by injecting reliability, observability, and operational considerations early in design.
- Collaborating with AI/ML engineers, platform engineers, firmware teams, and product partners to deliver robust, dependable user experiences.

Basic Qualifications
- 10+ years of experience in Site Reliability Engineering, Production Engineering, DevOps, or large‑scale distributed systems operations
- Bachelor’s Degree in Computer Science, Engineering, or a related technical discipline
- Strong experience running production distributed systems at scale
- Proficiency in at least one modern programming language (e.g., Python, Go, Java, C++)
- Strong understanding of Linux systems, networking fundamentals, and system performance tuning
- Experience with monitoring/observability (metrics, logs, tracing)
- Hands‑on experience with cloud environments (Azure, AWS, or GCP)
- Experience in incident management, on‑call rotations, and postmortem processes

Preferred Qualifications
- Deep experience with Azure cloud services
- Experience with OpenTelemetry for end‑to‑end instrumentation
- Strong familiarity with Grafana, Prometheus, Loki, Tempo, or similar tools
- Experience supporting AI/ML systems, model serving, or data‑intensive workloads
- Background with hybrid architectures (device + edge + cloud)
- Experience improving deployment reliability and progressive delivery systems
- Passion for automation, reliability engineering, and reducing operational friction

What Success Looks Like
- Systems become more observable, reliable, and predictable.
- Incidents are resolved quickly, and follow‑up improvements prevent recurrence.
- Alerting becomes more accurate, actionable, and trusted.
- Deployments become safer and more consistent.
- Teams move faster because reliability foundations are strong and intuitive.