Onsite SRE Engineer

Key Responsibilities

Provide production support for Retail Applications and Microservices built using Spring Boot architecture.
Ensure high availability, reliability, and performance of business-critical retail systems and services.
Apply Site Reliability Engineering (SRE) principles to improve system stability, scalability, and operational efficiency.
Define, implement, and monitor Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs).
Perform real-time monitoring, troubleshooting, and incident resolution for microservices and retail applications.
Use Splunk for log analysis, alerting, and operational intelligence to diagnose and resolve production issues.
Use Dynatrace for end-to-end application performance monitoring, distributed tracing, and root cause analysis.
Investigate performance bottlenecks, latency issues, and system anomalies across microservices architecture.
Build and maintain dashboards, alerts, and monitoring strategies for proactive issue detection.
Participate in incident management processes, including on-call rotations, major incident response, and post-incident reviews.
Conduct root cause analysis (RCA) and implement preventive measures to reduce recurrence of incidents.
Work closely with development, DevOps, and infrastructure teams to improve system reliability and observability.
Provide technical troubleshooting support to retail store associates and operations teams through calls or remote sessions.
Ensure effective communication and coordination during incidents involving multiple teams and stakeholders.
Drive automation and operational improvements to reduce manual intervention and improve system resilience.
Support CI/CD pipelines and deployment monitoring for microservices applications.
Analyze system logs, metrics, traces, and events to identify trends and proactively prevent outages.
Document runbooks, troubleshooting guides, and operational procedures for retail application support.
Demonstrate proactive learning, continuous improvement, and knowledge sharing within the SRE team.
Mentor team members and contribute to best practices for monitoring, observability, and reliability engineering.
Collaborate with engineering teams to improve system design for reliability, fault tolerance, and scalability.
Ensure compliance with operational standards, security guidelines, and change management processes.

Key Skills

Strong knowledge of SRE principles (SLI, SLO, SLA, Error Budgets).
Hands-on expertise in Splunk and Dynatrace monitoring tools.
Experience supporting Spring Boot microservices and distributed systems.
Strong production troubleshooting and incident management skills.
Excellent communication and stakeholder interaction skills.
Ability to work in high-pressure production environments and on-call support models.