Site Reliability Engineer

Required Technical / Functional Skills

7+ years of experience in Site Reliability Engineering (SRE), Platform Engineering, Cloud Infrastructure Engineering, or related roles within large-scale enterprise environments.
Minimum 4+ years of hands-on experience working primarily within Microsoft Azure cloud environments.
Strong expertise in Azure Kubernetes Service (AKS), including cluster lifecycle management, RBAC, network security policies, pod security standards, autoscaling, workload identity, and platform governance.
Proven experience building and supporting microservices-based applications using Java and implementing CI/CD pipelines using Azure DevOps (ADO).
Hands-on experience designing, implementing, and operating enterprise-scale observability solutions using Dynatrace.
Strong understanding and practical experience establishing Service Level Objectives (SLOs), Service Level Indicators (SLIs), Error Budgets, and reliability-focused operational practices.
Strong scripting and automation experience using Python, PowerShell, Azure Automation, and cloud-native tooling.

Roles & Responsibilities

Reliability Engineering & Platform Ownership

Define, establish, and continuously improve enterprise-wide reliability standards, including SLOs, SLIs, and Error Budgets across business-critical Azure-hosted services.
Own service reliability metrics and regularly communicate SLA compliance, operational health, and reliability improvements to business and executive stakeholders.
Partner with architecture, development, and platform teams to ensure reliability, scalability, and resiliency requirements are embedded throughout the service lifecycle.
Conduct architecture and design reviews to ensure availability targets, resilience requirements, and recovery objectives are incorporated from initial design through production deployment.
Drive adoption of reliability engineering best practices and champion proactive resilience initiatives including chaos engineering methodologies.

Incident Management & Operational Excellence

Lead major incident management activities by serving as Incident Commander for high-priority production incidents (P1/P2) and driving resolution efforts across cross-functional teams.
Own the end-to-end incident lifecycle including detection, escalation, communication, resolution management, and post-incident reviews.
Participate in structured global on-call rotations and maintain operational response objectives for mission-critical services.
Foster a blameless postmortem culture focused on continuous improvement and ensure corrective actions are tracked through completion.

Disaster Recovery & Resiliency

Design, implement, and maintain Disaster Recovery (DR) strategies across Azure environments to ensure business continuity and operational resilience.
Lead regular disaster recovery exercises, validate recovery processes, and continuously improve recovery readiness across critical workloads.
Establish and maintain Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) aligned with business requirements.

Observability & Monitoring

Design, build, and operate enterprise observability capabilities using Dynatrace to provide comprehensive visibility across Metrics, Events, Logs, and Traces (MELT).
Develop monitoring standards, dashboards, alerting frameworks, and operational reporting to improve service visibility and reduce incident response times.
Integrate monitoring and alerting platforms with enterprise tools including PagerDuty and ServiceNow to enable proactive operations.

Automation & Platform Engineering

Build automation frameworks, operational tooling, self-healing capabilities, and reusable platform services to improve operational efficiency and reduce manual effort.
Develop and maintain infrastructure automation, operational runbooks, and platform engineering capabilities using Azure-native services and scripting technologies.
Continuously identify opportunities to improve reliability, scalability, security, and operational efficiency through automation and platform enhancements.