JOBSEARCHER

Site Reliability Engineer

Role Overview

The Site Reliability Engineer will support Cyber Data Risk & Resilience by ensuring the reliability, availability, performance, and operational visibility of critical cybersecurity platforms and services. This role is responsible for keeping production systems running, instrumenting infrastructure and application layers, building meaningful monitoring and actionable alerting, supporting incident response, and continuously improving dashboards used by engineering, operations, risk, and executive stakeholders.

Responsibilities

- Maintain and improve the reliability, availability, scalability, and performance of cybersecurity platforms, services, and supporting infrastructure
- Support day-to-day operational stability by monitoring system health, identifying risks, responding to incidents, and driving timely resolution of service-impacting issues
- Instrument infrastructure, applications, services, APIs, data pipelines, and cloud components to provide end-to-end visibility into system behavior and service health
- Design, build, and continuously refine monitoring, alerting, logging, tracing, and observability capabilities across distributed systems and cloud environments
- Develop meaningful and actionable alerts that reduce noise, improve signal quality, and enable teams to respond quickly to emerging issues
- Define and track key reliability metrics, including availability, latency, throughput, error rates, saturation, service-level indicators, service-level objectives, and operational risk indicators
- Build, maintain, and enhance dashboards for engineering, operations, product, risk, and executive stakeholders, ensuring information is accurate, timely, and decision-ready
- Continuously modify and improve executive dashboards to support regular leadership reviews of service health, reliability trends, incidents, risks, and operational performance
- Partner with engineering, cybersecurity, infrastructure, cloud, and application teams to identify reliability gaps and implement long-term improvements
- Participate in incident response, root-cause analysis, problem management, and post-incident reviews to prevent recurrence and improve operational maturity
- Automate operational tasks, health checks, reporting, deployment validation, and recovery procedures to improve efficiency and reduce manual effort
- Collaborate with application and platform teams to embed reliability, monitoring, and supportability requirements into the software development lifecycle
- Support CI/CD, DevOps, and release management practices by validating operational readiness, monitoring coverage, rollback plans, and production support requirements
- Contribute to resiliency engineering efforts, including capacity planning, performance tuning, failover validation, disaster recovery readiness, and chaos/resilience testing where applicable
- Ensure monitoring, alerting, dashboards, and operational processes align with enterprise security, risk, compliance, and governance standards

Required Qualifications

- 7 to 10+ years of experience in site reliability engineering, systems engineering, software engineering, DevOps, infrastructure engineering, or production operations
- Strong experience supporting highly available, distributed, cloud-based, or mission-critical technology platforms
- Hands-on experience with observability practices, including monitoring, alerting, logging, metrics, tracing, dashboards, and service health reporting
- Experience instrumenting applications, services, APIs, infrastructure, databases, and cloud components to enable end-to-end operational visibility
- Strong understanding of reliability engineering concepts, including SLIs, SLOs, SLAs, error budgets, incident management, capacity management, and operational readiness
- Experience designing actionable alerts that support rapid issue detection, triage, escalation, and resolution
- Experience building and maintaining operational dashboards for technical teams, support teams, and senior/executive stakeholders
- Strong scripting or programming skills using Python, Java, Bash, PowerShell, or similar languages for automation and operational tooling
- Experience with cloud platforms such as AWS, Azure, or GCP
- Experience with Infrastructure-as-Code tools such as Terraform or similar technologies
- Experience working with CI/CD pipelines, DevOps workflows, release processes, and production support models
- Experience troubleshooting distributed systems, REST services, event-driven architectures, messaging platforms, and service-to-service integrations
- Familiarity with relational and non-relational databases, such as PostgreSQL, MSSQL, MongoDB, or similar platforms
- Strong analytical, troubleshooting, and problem-solving skills with the ability to diagnose complex technical issues across multiple layers of the stack
- Strong written and verbal communication skills, including the ability to translate technical issues into clear business and executive-level updates

Preferred Skills

- Experience supporting cybersecurity, risk, resilience, compliance, or enterprise security platforms
- Experience with observability and monitoring tools such as Splunk, Grafana, Prometheus, Datadog, Dynatrace, New Relic, Azure Monitor, CloudWatch, OpenTelemetry, or similar platforms
- Experience creating executive-level service health dashboards, reliability scorecards, operational risk reporting, or incident trend reporting
- Experience developing automated health checks, synthetic monitoring, service dependency maps, and operational runbooks
- Experience with incident response, major incident management, postmortems, root-cause analysis, and problem management practices
- Experience with containerized and cloud-native environments, including Kubernetes, Docker, serverless services, or managed cloud platforms
- Experience with distributed messaging or streaming platforms such as Apache Kafka
- Familiarity with cloud-native security, governance, and policy tooling such as Azure Policy, AWS SCP, GCP constraints, or related controls
- Familiarity with Cloud Security Posture Management tools such as Wiz, Prisma, CloudGuard, or similar platforms
- Experience with cloud-based AI services such as Azure AI, AWS Bedrock, or Google Vertex AI, particularly from an operational monitoring, reliability, or governance perspective
- Experience supporting Linux and Windows environments through scripting, automation, monitoring, and operational troubleshooting
- Exposure to web technologies, APIs, front-end services, or user-facing application monitoring

Additional Skills

- Strong ownership mindset with a focus on operational excellence and service reliability
- Ability to operate effectively in fast-paced, production-focused environments with minimal supervision
- Strong ability to prioritize issues based on customer impact, business risk, service criticality, and operational urgency
- Effective collaboration skills across engineering, operations, cybersecurity, infrastructure, risk, and executive stakeholder groups
- Ability to communicate service health, operational risks, incidents, and reliability trends clearly to both technical and non-technical audiences
- Proactive and continuous-improvement mindset with a focus on automation, simplification, resilience, and measurable outcomes
- Strong attention to detail when building dashboards, defining metrics, tuning alerts, and preparing executive-level operational reporting

Rate range: $60-$65