Incident Management Lead
Roles & Responsibilities
Major Incident Command
Own end-to-end P1/P2 incident management across on-prem and Azure environments
Lead incident bridges / war rooms with internal teams, vendors, and Microsoft Azure Support
Perform rapid triage using Azure monitoring tools within the first 15 minutes of an incident to assess impact and scope
Drive escalation decisions, including Microsoft P1 support escalation and disaster recovery (DR) activation
Communicate effectively with technical teams and senior/C-level stakeholders
Post-Incident & Continuous Improvement
Conduct blameless Post-Incident Reviews (PIRs) within 48 hours for P1 and 5 days for P2 incidents
Own incident action tracking and lead Service Improvement Plans (SIPs)
Perform trend analysis to identify recurring issues and drive root cause analysis (RCA) with Problem Management
Report key KPIs: MTTD, MTTR, recurrence rate, SLA adherence, customer impact
Process & ITSM Governance
Own and improve the ITIL 4-aligned Major Incident Management process, playbooks, and runbooks
Define the severity matrix and escalation framework across IT teams and vendors
Maintain crisis communication and executive notification protocols
Collaborate with Change Management to assess change-related incident risks
Azure Operations & Cloud Incident Management
Maintain Azure incident playbooks for AKS, Azure SQL, ExpressRoute, Entra ID, and platform outages
Work with Microsoft Technical Account Managers (TAMs) and Azure Support on escalations
Proactively monitor Azure Service Health and raise incidents pre-emptively when reported platform issues may affect services
Identify monitoring and observability gaps with SRE/Cloud teams
Capability Building
Deliver MIM training for service desk and technical teams
Run quarterly incident simulation exercises (GameDays / IncidentEx)
Required Skills
6+ years of IT Service Management experience; 3+ years in a Major Incident Manager or Incident Commander role in large enterprise environments
Strong experience in Azure incident management (Azure Monitor, Service Health, Log Analytics, Application Insights, Microsoft support escalation)
ITIL 4 certification (Foundation mandatory; Managing Professional or Specialist: High-velocity IT preferred)
Proven experience handling P1 incidents involving 20+ stakeholders across technical and executive teams
Expert in ServiceNow ITSM (Incident, Problem, Change, dashboards, reporting)
Strong data analysis and KPI reporting skills (incident trends, dashboards, executive reporting)