Senior Systems Operations Engineer
Location: Charlotte, NC; Irving, TX; Chandler, AZ
Duration: 18 months
Pay Rate: $73.50

Job/Role Description
This role supports application and middleware production operations with a Site Reliability Engineering (SRE) mindset, shifting from reactive operations to proactive reliability engineering across VM-based and container-adjacent environments, including OpenShift (OCP).
Provides senior-level application and middleware support for complex, high-availability services and acts as an escalation point for L2/L3 incidents, leading disciplined troubleshooting, recovery, and stabilization.
Embeds SRE practices into day-to-day operations by defining reliability signals, improving alert quality, driving blameless post-incident learning, and prioritizing systemic fixes and toil reduction.
Implements and continuously improves observability across applications and middleware, including logs, metrics, traces, dashboards, and actionable alerting, to improve detection, diagnosis, and mean time to resolution (MTTR).
Designs, develops, and maintains infrastructure-as-code and configuration-as-code capabilities supporting VM-based and container-adjacent workloads, including OpenShift (OCP) enablement.
Builds and supports automation for operational actions across middleware components, such as standardized status checks and start/stop/restart patterns, to enable safer self-service and reduce dependency bottlenecks.
Designs and implements intelligent automation for platform and middleware operations, including integrating AI/agent-based approaches into workflows with appropriate guardrails for triage assistance, predictive signals, and automated remediation.
Monitors configuration drift, supports automated compliance checks, and implements remediation patterns aligned to enterprise change management, security, and risk controls.
Integrates infrastructure and operational automation with CI/CD pipelines to enable repeatable, auditable deployments and safer rollouts.
Supports core platform components that enable applications and container platforms, including ingress patterns, load balancing integration, and shared supporting services.
Develops and maintains runbooks, operational documentation, and validation/testing approaches for automation and platform procedures to ensure operational readiness and consistent execution.
Participates in on-call rotations and provides operational support coverage as required, with flexibility to work in a 24/7 environment including weekends and holidays.
Delivers assigned operational engineering and automation outcomes with a strong focus on stability, resiliency, and measurable toil reduction.
Follows enterprise change management, risk, and compliance processes while continuously improving platform reliability and automation maturity through standardization, documentation, and repeatable delivery.
Supports a large portfolio of mission-critical applications and platforms, contributing to capacity building and workload management in a dynamic environment.

Required Qualifications
4+ years of Systems Engineering or Technology Infrastructure/Operations Engineering experience, or equivalent demonstrated through work experience, training, military experience, or education
4+ years of application and/or middleware production support in complex, high-availability environments, including incident response and problem management with strong root cause discipline
4+ years of hands-on automation and configuration management experience (Ansible preferred or similar), plus strong scripting skills (Python, Bash, PowerShell, or similar)
4+ years of Linux administration (RHEL preferred) and/or Windows Server administration supporting enterprise production workloads
4+ years of Git-based version control practices, including pull requests and peer review, with a focus on repeatability and code quality
Working experience with infrastructure-as-code concepts, including modular design and environment consistency
Experience supporting hybrid/private cloud platforms and container-adjacent hosting models; familiarity with OpenShift (OCP) or Kubernetes-based platforms
Experience implementing SRE operating practices, including reliability metrics, reduction of manual toil, and continuous improvement via post-incident learnings
Experience supporting common middleware platforms and shared services; ability to build automation patterns that standardize operational actions and reduce manual intervention
Familiarity with enterprise observability and operational support practices, including service health dashboards, alert engineering, and actionable telemetry
Exposure to responsible AI usage in operations, including security, validation, accuracy, and appropriate guardrails for automation and agents
Strong cross-functional communication skills and experience operating in regulated environments
Proven troubleshooting, architecture understanding, automation, observability, and scripting skills with experience in containerization and cloud platforms
Ability to understand capacity planning, identify bottlenecks, and implement effective solutions in production environments
Hands-on technical expertise with strong adaptability, learning agility, and a collaborative, team-oriented mindset
Well versed in crisis management, root cause analysis techniques, and blameless post-incident reviews
Experience with tools such as Splunk, PowerShell, Bash, and Python; familiarity with Elastic or similar observability technologies is a plus