{"schemaVersion":"jobsearcher.job.v1","id":"db04e379c4dc04934d097b9e","url":"https://jobsearcher.com/jobs/db04e379c4dc04934d097b9e","canonicalUrl":"https://jobsearcher.com/jobs/db04e379c4dc04934d097b9e","title":"Reliability Engineer","description":"This role supports the U.S. Air Force Cloud One Architecture and Common Shared Services contract and currently has an opening for a Reliability Engineer. The Reliability Engineer is responsible for ensuring the availability, performance, scalability, and resiliency of mission-critical systems. This role applies software engineering principles to infrastructure and operations, with a strong emphasis on automation, monitoring, incident response, and continuous reliability improvement. The Reliability Engineer serves as the bridge between development, operations, and platform teams to ensure production systems consistently meet defined service level objectives (SLOs) while supporting rapid, safe delivery of new capabilities.\n\nLocation: This position will be hybrid remote. Candidates will be required to work onsite as needed. 
Candidates located near Hanscom AFB (Boston, MA) are preferred.\n\nSystem Reliability & Availability\n\nDesign, implement, and maintain highly available, fault-tolerant systems in cloud and hybrid environments\n\nDefine, measure, and report Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets\n\nIdentify reliability risks and implement mitigation strategies across the system lifecycle\n\nConduct capacity planning and performance modeling to ensure systems scale to meet demand\n\nMonitoring, Observability & Alerting\n\nImplement and manage monitoring, logging, and tracing solutions to provide full system observability\n\nDefine actionable alerting thresholds that minimize noise and enable rapid incident detection\n\nAnalyze trends and metrics to proactively identify potential reliability issues\n\nIncident Response & Problem Management\n\nParticipate in on-call rotations and lead incident response activities for production systems\n\nCoordinate troubleshooting efforts across development, infrastructure, and security teams\n\nConduct post-incident reviews (PIRs) and develop corrective and preventive action plans\n\nTrack recurring issues and ensure root causes are resolved\n\nAutomation & Engineering Excellence\n\nAutomate operational tasks to reduce manual intervention and operational risk\n\nDevelop scripts, tools, and services that improve system reliability and reduce mean time to recovery (MTTR)\n\nPromote \"automation over toil\" and standardize operational workflows\n\nReliability-Focused Engineering\n\nParticipate in architecture and design reviews with an emphasis on reliability, resiliency, and recoverability\n\nValidate disaster recovery (DR) and business continuity plans; test failover mechanisms\n\nSupport chaos engineering, fault injection testing, and resilience validation where appropriate\n\nCollaboration & Governance\n\nPartner with DevOps, Platform, and Security teams to ensure reliability aligns with delivery and 
compliance objectives\n\nDocument system reliability standards, runbooks, and operational procedures\n\nSupport compliance and audit activities (e.g., FedRAMP, FISMA, internal operational controls)\n\nRequired Skills\n\nBachelor's degree and eight (8) or more years of experience, or Master's degree and six (6) or more years of experience. Additional experience may be accepted in lieu of a degree.\n\nActive Secret clearance at a minimum required to start\n\nUS citizenship required\n\nExperience with cloud platforms (AWS, Azure, OCI, or GCP), including managed services\n\nExperience with containerized environments (Docker, Kubernetes)\n\nFamiliarity with CI/CD pipelines and deployment automation\n\nSLOs and error budgets\n\nCapacity modeling and performance testing\n\nStrong understanding of:\n\nDistributed systems and high-availability architectures\n\nLinux/Windows system administration\n\nNetworking fundamentals (DNS, TCP/IP, load balancing)\n\nHands-on experience with:\n\nMonitoring and observability tools (e.g., Prometheus, Grafana, ELK/Elastic, Datadog, Azure Monitor)\n\nInfrastructure as Code (Terraform, ARM, CloudFormation)\n\nScripting or programming languages (Python, Bash, Go, PowerShell, or similar)\n\nExperience supporting incident management and on-call operations\n\nPreferred Skills\n\nExperience with USAF Cloud One or Platform 1.\n\nExperience with Zero Trust Architecture\n\nCloud certifications in AWS, Azure, Google, or Oracle clouds\n\nSES provides a competitive salary and the following benefits:\n\nMedical\n\nDental\n\nVision\n\nAD&D\n\nSTD\n\nLTD\n\nCompany paid Life Insurance\n\n401k with employer contribution\n\nPaid Time Off\n\nPet Insurance","company":"Systems Engineering Solutions Defunct","rawCompany":"systems engineering solutions defunct","city":"Bedford","state":"MA","isRemote":false,"isActive":false,"createdAt":"2026-04-09T09:41:45.467Z","occupations":[{"code":"15-1299.08","title":"Computer Systems 
Engineers/Architects","slug":"computer-systems-engineers-architects"},{"code":"15-1244.00","title":"Network and Computer Systems Administrators","slug":"network-and-computer-systems-administrators"},{"code":"15-1211.00","title":"Computer Systems Analysts","slug":"computer-systems-analysts"}],"industries":[{"code":"541512","title":"Computer Systems Design Services","slug":"computer-systems-design-services"},{"code":"541519","title":"Other Computer Related Services","slug":"other-computer-related-services"},{"code":"541513","title":"Computer Facilities Management Services","slug":"computer-facilities-management-services"}],"jobPosting":{"@context":"https://schema.org","@type":"JobPosting","title":"Reliability Engineer","description":"This role supports the U.S. Air Force Cloud One Architecture and Common Shared Services contract and currently has an opening for a Reliability Engineer. The Reliability Engineer is responsible for ensuring the availability, performance, scalability, and resiliency of mission-critical systems. This role applies software engineering principles to infrastructure and operations, with a strong emphasis on automation, monitoring, incident response, and continuous reliability improvement. The Reliability Engineer serves as the bridge between development, operations, and platform teams to ensure production systems consistently meet defined service level objectives (SLOs) while supporting rapid, safe delivery of new capabilities.\n\nLocation: This position will be hybrid remote. Candidates will be required to work onsite as needed. 
Candidates located near Hanscom AFB (Boston, MA) are preferred.\n\nSystem Reliability & Availability\n\nDesign, implement, and maintain highly available, fault-tolerant systems in cloud and hybrid environments\n\nDefine, measure, and report Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets\n\nIdentify reliability risks and implement mitigation strategies across the system lifecycle\n\nConduct capacity planning and performance modeling to ensure systems scale to meet demand\n\nMonitoring, Observability & Alerting\n\nImplement and manage monitoring, logging, and tracing solutions to provide full system observability\n\nDefine actionable alerting thresholds that minimize noise and enable rapid incident detection\n\nAnalyze trends and metrics to proactively identify potential reliability issues\n\nIncident Response & Problem Management\n\nParticipate in on-call rotations and lead incident response activities for production systems\n\nCoordinate troubleshooting efforts across development, infrastructure, and security teams\n\nConduct post-incident reviews (PIRs) and develop corrective and preventive action plans\n\nTrack recurring issues and ensure root causes are resolved\n\nAutomation & Engineering Excellence\n\nAutomate operational tasks to reduce manual intervention and operational risk\n\nDevelop scripts, tools, and services that improve system reliability and reduce mean time to recovery (MTTR)\n\nPromote \"automation over toil\" and standardize operational workflows\n\nReliability-Focused Engineering\n\nParticipate in architecture and design reviews with an emphasis on reliability, resiliency, and recoverability\n\nValidate disaster recovery (DR) and business continuity plans; test failover mechanisms\n\nSupport chaos engineering, fault injection testing, and resilience validation where appropriate\n\nCollaboration & Governance\n\nPartner with DevOps, Platform, and Security teams to ensure reliability aligns with delivery and 
compliance objectives\n\nDocument system reliability standards, runbooks, and operational procedures\n\nSupport compliance and audit activities (e.g., FedRAMP, FISMA, internal operational controls)\n\nRequired Skills\n\nBachelor's degree and eight (8) or more years of experience, or Master's degree and six (6) or more years of experience. Additional experience may be accepted in lieu of a degree.\n\nActive Secret clearance at a minimum required to start\n\nUS citizenship required\n\nExperience with cloud platforms (AWS, Azure, OCI, or GCP), including managed services\n\nExperience with containerized environments (Docker, Kubernetes)\n\nFamiliarity with CI/CD pipelines and deployment automation\n\nSLOs and error budgets\n\nCapacity modeling and performance testing\n\nStrong understanding of:\n\nDistributed systems and high-availability architectures\n\nLinux/Windows system administration\n\nNetworking fundamentals (DNS, TCP/IP, load balancing)\n\nHands-on experience with:\n\nMonitoring and observability tools (e.g., Prometheus, Grafana, ELK/Elastic, Datadog, Azure Monitor)\n\nInfrastructure as Code (Terraform, ARM, CloudFormation)\n\nScripting or programming languages (Python, Bash, Go, PowerShell, or similar)\n\nExperience supporting incident management and on-call operations\n\nPreferred Skills\n\nExperience with USAF Cloud One or Platform 1.\n\nExperience with Zero Trust Architecture\n\nCloud certifications in AWS, Azure, Google, or Oracle clouds\n\nSES provides a competitive salary and the following benefits:\n\nMedical\n\nDental\n\nVision\n\nAD&D\n\nSTD\n\nLTD\n\nCompany paid Life Insurance\n\n401k with employer contribution\n\nPaid Time Off\n\nPet Insurance","datePosted":"2026-04-09T09:41:45.467Z","dateModified":"2026-04-09T09:41:45.467Z","hiringOrganization":{"@type":"Organization","name":"Systems Engineering Solutions 
Defunct","sameAs":"https://jobsearcher.com"},"jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Bedford","addressRegion":"MA","addressCountry":"US"}},"identifier":{"@type":"PropertyValue","name":"JobSearcher","value":"db04e379c4dc04934d097b9e"},"url":"https://jobsearcher.com/jobs/db04e379c4dc04934d097b9e"}}