Incident Manager (SRE / Operations)
Job Title: Incident Manager (SRE / Operations)
Location: Philadelphia, PA (100% Onsite – Day 1)
Duration: 12+ Months
Open Positions: 14

⚠️ Critical Notes:
- 100% Onsite from Day 1 (Philadelphia, PA)
- Immediate hiring – bulk positions (14 openings)
- Virtual interview drive scheduled soon – fast turnaround required

Job Summary:
We are seeking experienced Incident Managers with strong expertise in SRE, operations engineering, and incident command. The ideal candidate will lead high-impact incident response, ensure system reliability, and drive cross-functional coordination during outages and large-scale system events.

Key Responsibilities:
- Lead incident command and management for critical production issues
- Coordinate cross-functional teams during high-severity incidents
- Drive root cause analysis (RCA) and implement preventive measures
- Manage system reliability and operational stability
- Collaborate with SRE, DevOps, and engineering teams
- Ensure effective communication with stakeholders and leadership
- Drive automation and observability improvements
- Handle large-scale change events and system outages
- Maintain incident reports, documentation, and post-mortem analysis
- Continuously improve incident response processes and frameworks

Required Skills & Experience:
- 6–8 years of experience in Incident Management / Production Support / SRE roles
- Strong expertise in:
  - Incident Command & Crisis Management
  - Site Reliability Engineering (SRE)
  - Operations Engineering
- Strong knowledge of:
  - Reliability architecture and system design
  - Automation and observability tools
- Proven ability to:
  - Lead teams during high-impact outages
  - Drive systemic problem resolution
- Excellent executive communication and stakeholder management skills

Technical Skills:
- Incident Management
- SRE / Operations Engineering
- Monitoring & Observability Tools
- Automation & Reliability Engineering

Preferred Qualifications:
- Experience in enterprise-scale production environments
- Strong analytical and problem-solving skills
- Ability to work in high-pressure, fast-paced environments

Key Deliverables:
- Rapid and effective incident resolution
- Improved system reliability and uptime
- Well-documented RCA and post-incident reports
- Strong coordination across technical and business teams