SRE Lead Engineer

Litmus7 | Millbrae, CA | May 24th, 2026

Role Summary

We are looking for a hands-on Lead SRE Engineer to work onsite and own production reliability, observability, incident response, and operational improvements for enterprise-scale ecommerce systems.

The candidate should be technically strong, able to lead P1/P2 incident triage, work directly with client stakeholders, guide offshore teams, and drive improvements across monitoring, alerting, dashboards, runbooks, and automation.

Key Responsibilities

- Lead onsite production triage for critical incidents and coordinate with application, infrastructure, DevOps, database, network, and offshore teams.
- Monitor and support business-critical ecommerce flows such as checkout, order capture, payment, inventory, promotions, and fulfilment integrations.
- Use Dynatrace and Splunk to analyse logs, metrics, traces, service health, latency, failure rates, and downstream dependencies.
- Build and maintain dashboards for SRE operations, service owners, and leadership visibility.
- Improve alerting by reducing noise, defining meaningful thresholds, and aligning alerts with customer impact and SLOs.
- Drive root cause analysis, post-incident reviews, corrective actions, and preventive improvements.
- Create and maintain runbooks, SOPs, troubleshooting guides, and operational playbooks.
- Identify automation and AI-assisted triage opportunities to improve incident response and operational efficiency.
- Mentor SRE/support engineers and ensure smooth onsite-offshore coordination and handovers.
- Communicate incident status, business impact, risks, and next steps clearly to client stakeholders.

Required Skills

- 8+ years of experience in SRE, production support, DevOps, platform engineering, or application operations.
- Strong hands-on experience with Dynatrace and Splunk.
- Good understanding of microservices, APIs, distributed systems, Kubernetes, containers, and cloud platforms.
- Experience supporting high-volume ecommerce or enterprise production systems.
- Strong knowledge of incident management, root cause analysis, monitoring, alerting, and SLO/SLA practices.
- Ability to analyse application performance issues including latency, throughput, error rates, pod restarts, CPU/memory, database latency, and third-party dependency issues.
- Strong communication skills with the ability to explain technical issues to both engineering and leadership teams.
- Experience leading onsite-offshore coordination and mentoring engineers.

Preferred Skills

- Retail or ecommerce domain experience.
- Experience with order capture, checkout, payment, inventory, or OMS flows.
- Knowledge of Dynatrace DQL, Grail, Smartscape, Davis AI, OpenPipeline, and SLOs.
- Experience with ServiceNow, Jira, PagerDuty, Teams, or similar incident-management integrations.
- Scripting or automation experience using Python, shell scripting, or similar tools.
- Exposure to AI-assisted triage, self-healing, or runbook automation.