Principal SRE
Job Title: Principal SRE
Duration: 12 Months with possibility to extend or convert
Type: W2 Only
Location: Charlotte, NC/Irving, TX/Phoenix, AZ/Minneapolis, MN/Iselin, NJ/San Francisco, CA – Hybrid Role (3 days onsite every week)

Objective: Improve scalability & reliability of existing systems. Drive retroactive mitigation of technical debt for critical systems.

Skills and focus of PREs:
- Observability
  - Alerts & Dashboards focused on user experience
  - Focus: Metrics, Logs, Traces
- Capacity
  - Continuous Trend & Usage analysis
  - Thresholds for new equipment purchases with lead time
- Performance
  - Measure performance from user perspective
  - Focus: Response time, Error rate, Latency
- Resiliency
  - High Availability & Fault Tolerance (within & across DC)
  - Focus: RTO should match business expectations

In this role, you will:
- Act as a Platform Reliability Engineering (PRE) expert, providing deep technical leadership in one core domain (Database, Cloud, Network, Compute/Storage, Middleware, or Application Support), while partnering across broader platform teams.
- Lead analysis and remediation of systemic reliability issues and complex production problems, translating recurring incidents into long-term engineering solutions.
- Design, influence, and implement scalable, highly reliable systems using SRE principles including SLI/SLOs, error budgets, and automation-first approaches.
- Drive observability standards across platforms, with a strong focus on metrics, logs, traces, and user-experience-based alerting and dashboards.
- Partner with application, infrastructure, cloud, and support teams to improve availability, performance, capacity, and resiliency of both new and existing systems.
- Lead or contribute to blameless post-mortems, ensuring actionable outcomes and sustained reduction of repeat incidents.
- Translate advanced technical knowledge and enterprise context into clear guidance for senior leadership on reliability risks, priorities, and investment areas.
- Mentor and guide engineers and support staff on reliability best practices, operational standards, and automation opportunities.

Required Qualifications:
- 10+ years of experience in Systems Operations, SRE, Platform Engineering, or Production Support, with deep expertise in at least one of the following domains:
  - Database platforms
  - Cloud platforms
  - Network infrastructure
  - Compute / Storage platforms
  - Middleware platforms
  - Enterprise Application Support
- Strong hands-on experience applying SRE concepts such as SLI/SLO definition, error budgets, reliability metrics, and incident-driven engineering improvements.
- Proven experience diagnosing and resolving complex, large-scale production issues across distributed systems.
- Solid understanding of observability tools and practices covering monitoring, alerting, logging, and tracing.
- Experience driving automation and self-service to reduce operational toil and manual interventions.
- Strong communication skills with the ability to influence engineers, partners, and senior leaders.

Desired Qualifications:
- Exposure to capacity management, performance engineering, and resiliency design (HA, fault tolerance, RTO/RPO).
- Experience working in environments with hybrid platforms (on-prem + cloud) and complex enterprise dependencies.
- Ability to drive technical debt remediation for critical legacy systems using structured, prioritized backlogs.
- Familiarity with IT service management, incident/problem management, and continuous improvement frameworks.
- Experience mentoring or leading senior engineers in reliability or operations-focused roles.

Job Expectations:
- Strong collaboration and partnering skills across platform, application, and support teams.
- Ability to manage multiple priorities in a fast-paced, high-impact production environment.
- Consistent delivery of high-quality reliability outcomes within expected timelines.
- High attention to detail, data-driven problem-solving, and operational rigor.
- Prior project or initiative leadership experience is highly desirable.