Director, Production Services Manager
Job Description

Head of Production Services Governance, Incident & Problem Management

Role Summary

The Head of Production Services Governance, Incident & Problem Management is accountable for the enterprise governance, standards, and performance of Technology Incident Management and Problem Management (including root cause analysis) across BNY’s Platforms. This leader oversees a team that sets the operating model, drives consistent execution, improves quality and speed of restoration, and strengthens auditability and regulatory credibility.

The role is the senior point of accountability for:
- Firm-wide incident/problem governance and ITIL-aligned standards
- High-severity incident command and communications frameworks
- End-to-end RCA quality and timeliness, including corrective/preventive actions
- Regulatory and client-facing incident narratives and responses
- Internal oversight engagement with groups such as ORR and ERO
- Automation and AI augmentation to modernize and scale incident/problem practices

This position partners closely with engineering, SRE/operations, cyber, resiliency, risk, compliance, and business stakeholders to ensure stability, transparency, and continuous improvement of production services.

Key Objectives
- Protect service availability and client experience by ensuring rapid restoration and disciplined incident handling.
- Improve resiliency and reduce repeat incidents through high-quality problem management, robust RCAs, and effective remediation governance.
- Strengthen governance and audit defensibility by ensuring consistent process adherence, evidence capture, and clear accountability.
- Modernize production governance through automation, AIOps capabilities, and AI-assisted workflows.
- Elevate operational excellence through measurable improvements in MTTR, recurrence, SLA adherence, and control effectiveness.

Primary Responsibilities

Enterprise Incident Management Governance (ITIL)
- Own the Incident Management practice and ensure it is implemented consistently across Platform Production Services and aligned to ITIL principles.
- Establish and maintain incident taxonomy, severity models, prioritization rules, escalation paths, and functional/organizational RACI.
- Define the Major Incident Management (MIM) framework: incident command roles, war-room orchestration, communications cadence, stakeholder engagement, and decision rights.
- Ensure end-to-end controls: accurate incident logging, categorization, impact assessment, timeline reconstruction, evidence retention, and closure criteria.
- Drive performance through standard KPIs (e.g., MTTA/MTTR, reopen rate, SLA compliance, major incident frequency, customer-impact minutes, incident backlog health).

Enterprise Problem Management & RCA Excellence (ITIL)
- Own the Problem Management practice, including proactive problem identification, trending, and prevention of recurrence.
- Establish RCA standards (methodologies such as 5 Whys, fishbone, fault tree, and “cause–trigger–control gap” framing) and ensure consistent quality across teams.
- Govern Corrective and Preventive Action (CAPA) management: remediation backlog, prioritization, due dates, owner accountability, and validation of effectiveness.
- Maintain governance for Known Errors and Workarounds, enabling faster recovery and better knowledge reuse.
- Drive systemic improvements by connecting incidents/problems to resiliency risks, architectural weaknesses, control gaps, and engineering quality.

Regulatory, Client, and Executive Communications & Responses
- Serve as accountable executive for regulatory responses and supervisory requests relating to incidents, outages, recovery actions, RCA findings, and resiliency improvements.
- Lead firm readiness for time-sensitive regulatory deliverables, ensuring accuracy, consistency, and defensible evidence.
- Coordinate and quality-assure client communications for impactful incidents (internal/external statements, timelines, cause, remediation, and prevention).
- Provide clear executive narratives and materials for senior leadership, risk committees, audit committees, and business stakeholders.

Oversight & Partnership Model (ORR, ERO, Risk, Audit, Compliance)
- Act as the primary interface to internal oversight groups (e.g., ORR, ERO, Operational Risk, Compliance, Internal Audit, and Technology Risk Management).
- Ensure incidents/problems are appropriately mapped to relevant governance constructs (e.g., operational risk events where applicable) with clear traceability.
- Lead continuous improvement of control coverage and evidence quality to support audits and examinations.
- Partner with Resiliency teams to connect operational learning to scenario testing, dependency mapping, recovery planning, and service resiliency metrics.

Standardization, Quality Assurance, and Continuous Improvement
- Build and run a Quality Management System for incident/problem practices: sampling, assurance reviews, coaching, playbooks, and maturity assessments.
- Develop and maintain standard artifacts (runbooks, major incident playbooks, comms templates, RCA templates, PIR guidance).
- Run Continual Improvement programs: trend analysis, “top drivers” remediation themes, performance benchmarking, and maturity roadmaps.
- Drive adoption of consistent tooling, workflows, and data standards across platforms.

Automation & AI Enablement (AIOps / Intelligent Operations)

This role is expected to use AI responsibly to improve speed, quality, and scale of incident/problem management while meeting security, privacy, and model-risk expectations.

Key AI and automation outcomes include:
- AI-assisted triage: classification, routing, deduplication, and severity recommendation based on history and signals.
- Correlation and probable cause insights using telemetry, topology, and change data to identify likely blast radius and suspects.
- Automation for repetitive tasks: stakeholder updates, timeline capture, evidence packaging, and post-incident documentation generation.
- RCA acceleration: AI-supported timeline reconstruction, log summarization, anomaly explanation, and “similar incident” retrieval.
- Knowledge management uplift: automated drafting of knowledge articles/workarounds; improvement suggestions based on recurrence patterns.
- Establish governance for AI usage: model transparency, human-in-the-loop controls, data handling, audit logs, and bias/quality monitoring.

Leadership & Talent Development
- Lead and develop a high-performing team of incident/problem governance professionals (e.g., problem managers, automation analysts).
- Establish role clarity, training paths, and ITIL-aligned capability development.
- Foster a culture of calm, disciplined execution during crises and a learning culture post-incident, focused on prevention, not blame.

Scope & Decision Rights
- Enterprise-level authority to define and enforce incident/problem standards and minimum controls.
- Authority to convene major incident response, direct escalations, and require timely executive updates.
- Authority to gate incident/problem closure based on quality criteria (documentation, evidence, RCA completeness, CAPA commitments).
- Joint governance with engineering/production leaders to prioritize remediation work and measure effectiveness.

Key Interfaces
- Platform Production Services leaders, SRE/Operations, Engineering, Architecture
- Cybersecurity Operations, Fraud/Financial Crime Technology (as relevant)
- Enterprise Resiliency Office (ERO)
- Office of Regulatory Relations (ORR)
- Operational Risk, Compliance, Legal, Privacy
- Internal Audit, Technology Risk Management
- Business/Product leadership and client coverage teams

Required Qualifications
- 10–15+ years in technology operations, SRE/production services, service management, or resiliency roles in complex enterprises; regulated financial services strongly preferred.
- Demonstrated leadership in Major Incident Management and Problem Management/RCA at enterprise scale.
- Strong command of ITIL practices (Incident, Problem, Monitoring & Event, Service Level, Change Enablement, Continual Improvement; familiarity with CMDB/Service Configuration is a plus).
- Proven experience driving process standardization, operating model change, and measurable performance improvements (e.g., MTTR reduction, recurrence reduction).
- Experience leading regulatory/audit-facing responses with strong evidence discipline and executive communication.

Preferred Qualifications / Certifications
- ITIL 4 Managing Professional (MP) and/or ITIL Strategic Leader (SL); ITIL Foundation minimum.
- Familiarity with ISO/IEC 20000, NIST, and resiliency/operational risk expectations in financial services (helpful but not required).
- Experience with AIOps platforms/observability tooling (e.g., event correlation, log analytics, tracing, anomaly detection).
- Experience with Agile/DevOps/SRE operating models and integrating incident/problem practices into product/platform delivery.

Core Competencies (What “Great” Looks Like)
- Crisis leadership: calm command presence, structured decision-making, clear communications under pressure.
- Governance rigor: sets standards that are pragmatic, scalable, and audit-defensible.
- Analytical excellence: uses trends and data to drive prevention, not just restoration.
- Influence without friction: partners effectively with engineering leaders to get remediation done.
- Automation mindset: removes manual steps, improves quality through workflow and tooling.
- AI fluency with controls: leverages AI safely with strong human oversight and evidence trails.

Success Metrics (Illustrative)
- Reduced major incident frequency and customer-impact minutes (YoY).
- Improved MTTR/MTTA and decreased escalations due to better routing/triage.
- Increased RCA timeliness and quality scores, fewer incomplete RCAs, higher CAPA completion on time.
- Reduced repeat incidents driven by top recurring causes.
- Improved audit/regulatory outcomes: fewer findings, faster response cycles, higher evidence quality.
- Increased automation coverage: % of incidents with AI-assisted classification/correlation; reduction in manual documentation hours.

About Us

At BNY, our culture allows us to run our company better and enables employees’ growth and success. As a leading global financial services company at the heart of the global financial system, we influence nearly 20% of the world’s investible assets. Every day, our teams harness cutting-edge AI and breakthrough technologies to collaborate with clients, driving transformative solutions that redefine industries and uplift communities worldwide.

Recognized as a top destination for innovators, BNY is where bold ideas meet advanced technology and exceptional talent. Together, we power the future of finance – and this is what is all about.
Join us and be part of something extraordinary.

About The Team

At BNY, our culture speaks for itself. Check out the latest BNY news at BNY Newsroom and BNY LinkedIn.

Here are a few of our recent awards:
- America’s Most Innovative Companies, Fortune, 2025
- World’s Most Admired Companies, Fortune, 2025
- “Most Just Companies”, Just Capital and CNBC, 2025

Our Benefits And Rewards

BNY offers highly competitive compensation, benefits, and wellbeing programs rooted in a strong culture of excellence and our pay-for-performance philosophy. We provide access to flexible global resources and tools for your life’s journey. Focus on your health, foster your personal resilience, and reach your financial goals as a valued member of our team, along with generous paid leaves, including paid volunteer time, that can support you and your family through moments that matter.

BNY is an Equal Employment Opportunity/Affirmative Action Employer - Underrepresented racial and ethnic groups/Females/Individuals with Disabilities/Protected Veterans.

BNY assesses market data to ensure a competitive compensation package for our employees. The expected base salary for this position when employment commences can be found in the Job Info section at the bottom of the posting.

Base salary offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience. Base salary is only part of the total rewards package, which may include eligibility for an annual discretionary incentive award. Subject to the terms and conditions of the applicable plans then in effect, eligible employees may enroll in a 401(k) plan as well as participate in Company-sponsored medical, dental, vision, and basic life insurance plans for the employee and the employee’s eligible dependents. Eligible employees also may receive other benefits (including various paid time off benefits, such as vacation and sick time), dependent on the position offered.
Details of participation in these benefit plans will be provided if an employee receives an offer of employment. If hired, the employee will be in an “at will” position and the Company reserves the right to modify base salary (as well as any other discretionary payments or compensation programs) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors.