Reliability Engineer
This role supports the U.S. Air Force Cloud One Architecture and Common Shared Services contract and currently has an opening for a Reliability Engineer. The Reliability Engineer is responsible for ensuring the availability, performance, scalability, and resiliency of mission‐critical systems. This role applies software engineering principles to infrastructure and operations, with a strong emphasis on automation, monitoring, incident response, and continuous reliability improvement. The Reliability Engineer serves as the bridge between development, operations, and platform teams to ensure production systems consistently meet defined service level objectives (SLOs) while supporting rapid, safe delivery of new capabilities.

Location: This position will be hybrid remote. Candidates will be required to work onsite as needed. Candidates located near Hanscom AFB (Boston, MA) are preferred.

Requirements

System Reliability & Availability
- Design, implement, and maintain highly available, fault‐tolerant systems in cloud and hybrid environments
- Define, measure, and report Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets
- Identify reliability risks and implement mitigation strategies across the system lifecycle
- Conduct capacity planning and performance modeling to ensure systems scale to meet demand

Monitoring, Observability & Alerting
- Implement and manage monitoring, logging, and tracing solutions to provide full system observability
- Define actionable alerting thresholds that minimize noise and enable rapid incident detection
- Analyze trends and metrics to proactively identify potential reliability issues

Incident Response & Problem Management
- Participate in on‐call rotations and lead incident response activities for production systems
- Coordinate troubleshooting efforts across development, infrastructure, and security teams
- Conduct post‐incident reviews (PIRs) and develop corrective and preventive action plans
- Track recurring issues and ensure root causes are resolved

Automation & Engineering Excellence
- Automate operational tasks to reduce manual intervention and operational risk
- Develop scripts, tools, and services that improve system reliability and reduce mean time to recovery (MTTR)
- Promote "automation over toil" and standardize operational workflows

Reliability‐Focused Engineering
- Participate in architecture and design reviews with an emphasis on reliability, resiliency, and recoverability
- Validate disaster recovery (DR) and business continuity plans; test failover mechanisms
- Support chaos engineering, fault injection testing, and resilience validation where appropriate

Collaboration & Governance
- Partner with DevOps, Platform, and Security teams to ensure reliability aligns with delivery and compliance objectives
- Document system reliability standards, runbooks, and operational procedures
- Support compliance and audit activities (e.g., FedRAMP, FISMA, internal operational controls)

Required Skills
- Bachelor's degree and eight (8) or more years of experience, or Master's degree and six (6) or more years of experience. Additional experience may be accepted in lieu of a degree
- Active Secret clearance at a minimum required to start
- US citizenship required
- Experience with cloud platforms (AWS, Azure, OCI, or GCP), including managed services
- Experience with containerized environments (Docker, Kubernetes)
- Familiarity with CI/CD pipelines and deployment automation
- SLOs and error budgets
- Capacity modeling and performance testing
- Strong understanding of:
  - Distributed systems and high‐availability architectures
  - Linux/Windows system administration
  - Networking fundamentals (DNS, TCP/IP, load balancing)
- Hands‐on experience with:
  - Monitoring and observability tools (e.g., Prometheus, Grafana, ELK/Elastic, Datadog, Azure Monitor)
  - Infrastructure as Code (Terraform, ARM, CloudFormation)
  - Scripting or programming languages (Python, Bash, Go, PowerShell, or similar)
- Experience supporting incident management and on‐call operations

Preferred Skills
- Experience with USAF Cloud One or Platform 1
- Experience with Zero Trust Architecture
- Cloud certifications in AWS, Azure, Google, or Oracle clouds

Benefits

SES provides a competitive salary and the following benefits:
- Medical
- Dental
- Vision
- AD&D
- STD
- LTD
- Company paid Life Insurance
- 401k with employer contribution
- Paid Time Off
- Pet Insurance