Lead Platform Engineer
Occupations:
Computer Systems Engineers/Architects
Software Developers
Computer Systems Analysts
Computer and Information Systems Managers
Network and Computer Systems Administrators
Industries:
Software Publishers
Nonferrous Metal (except Aluminum) Production and Processing
Shoe Retailers
Computer Systems Design and Related Services
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services

Lead Platform Engineer (Monitoring & Observability)
Evansville, IN; Baltimore, MD; Wilmington, DE; Charlotte, NC; or Irving, TX
Hybrid role, onsite 3 days per week as directed.
Candidates must live within 50 miles of the corporate office located in Evansville, IN; Baltimore, MD; Wilmington, DE; Charlotte, NC; or Irving, TX.
Potential for contract extension.

Note: MUST be legally authorized to work in the United States. This role is NOT open to 3rd party providers, W2 only.

SUMMARY:
We're seeking a Lead Platform Engineer (Monitoring & Observability) to join a high-performing Monitoring Engineering team within a fast-paced financial technology organization. In this role, you will apply SRE principles to design, build, and evolve monitoring and observability capabilities that ensure the reliability, performance, and operability of core applications and infrastructure. You will partner closely with application, platform, and development teams to implement data-driven alerting, SLO/SLA-based monitoring, telemetry pipelines, dashboards, correlations, and automated remediation.
Your work will directly improve system reliability, reduce MTTR, and enhance enterprise-wide operational insight. This role requires strong analytical thinking, systems engineering discipline, and a proactive approach to identifying risks, preventing incidents, and driving continuous improvement across the production ecosystem.

KEY RESPONSIBILITIES:

Design, Build, and Maintain Monitoring & Observability Solutions
- Architect, deploy, and operate OpenTelemetry-based telemetry pipelines, including instrumentation standards, collector configurations, sampling strategies, and routing to Elastic and other backends
- Develop and maintain instrumentation, telemetry, and alerting for the Enterprise Monitoring Center using industry-leading tools such as Grafana, OpsRamp, Elastic Stack, BigPanda, AWS CloudWatch, and Azure Monitor
- Drive observability standards and best practices across multiple engineering teams through influence, documentation, and partnership rather than direct authority
- Apply SRE best practices to ensure measurable SLIs/SLOs, reliability dashboards, and health indicators for critical systems
- Integrate and manage OpenTelemetry for distributed tracing and telemetry data collection, enabling end-to-end visibility of business-critical transactions

Collaboration & Project Participation
- Collaborate with application development teams to define and document observability requirements for each project or release, ensuring accurate and actionable monitoring and tracing are in place for every step of business-critical workflows
- Embed reliability considerations early in the SDLC, including SLO definitions, instrumentation needs, and failure mode awareness
- Partner with product and engineering teams to use SLOs and error budgets to guide release decisions, prioritization, and toil reduction

Alerting & Escalation Process
- Define and maintain standardized alert payloads per engineering guidelines, ensuring alerts are actionable
- Partner with Level 2 and Level 3 support teams to reflect process changes in monitoring dashboards
- Maintain and optimize thresholds, ensuring seamless escalations via BigPanda as the central alert hub

Dashboard Creation & Maintenance
- Create and maintain intuitive, actionable dashboards for the Enterprise Monitoring Center and other finance teams
- Ensure dashboards are effectively monitored by Level 1 teams, presenting clear, actionable data that reduces MTTR

Documentation, Governance & Reliability Standards
- Develop and maintain technical documentation, runbooks, diagnostic guides, and observability standards across the enterprise
- Evaluate and refine release, deployment, and monitoring processes to support consistent, reliable delivery pipelines
- Mentor junior engineers and promote a culture focused on reliability, automation, and operational excellence

Reliability Engineering, Automation & Continuous Improvement
- Build automation frameworks for monitoring, alerting, self-healing workflows, and incident response to reduce toil and improve MTTR
- Drive system optimization through capacity analysis, performance tuning, and proactive detection of reliability risks
- Contribute to the automation of routine operational tasks to improve system reliability and engineer quality of life
- Advocate for and implement observability best practices across engineering teams
- Define, implement, and operationalize SLIs, SLOs, and error budgets for critical services
- Participate in and improve incident response processes, including detection, triage, escalation, and recovery

QUALIFICATIONS:

Education:
Bachelor's degree in computer science, IT, or related field.

Experience:
At least 5 years of experience in software, systems, or reliability engineering roles, with multiple years of hands-on experience owning production observability, monitoring, and SLOs in distributed systems.

Required Skills:
- Deep experience building scalable, reliable monitoring and observability solutions, including instrumentation, alerting, dashboarding, and configuration across large, complex environments
- Hands-on expertise and proficiency with modern monitoring and observability tools (e.g., OpsRamp, Grafana, Elastic, CloudWatch, Azure Monitor, BigPanda (AIOps)), and strong knowledge of metrics, logs, traces, and OpenTelemetry
- Strong scripting and programming capability (Bash, PowerShell, and one or more languages such as Python, C-family, or JavaScript) to automate telemetry, alerting, and platform workflows
- Strong expertise with cloud platforms (AWS and/or Azure) and container orchestration systems (Kubernetes, Docker)
- Deep hands-on experience with Elastic Observability (APM, Logs, Metrics, Traces)
- Understanding of distributed systems fundamentals, including networking, security, databases, DevSecOps principles, and performance/capacity engineering
- Strong communication skills, with the ability to clearly explain complex technical topics to both technical and non-technical audiences
- Exceptional problem-solving and troubleshooting abilities, especially in high-pressure or time-sensitive environments
- Effective prioritization and multitasking, able to manage competing deadlines while maintaining quality and focus
- Proven cross-functional collaboration, working seamlessly with diverse teams in large, complex IT environments and driving continuous improvement across systems

Preferred Qualifications:
- Experience with CI/CD pipelines and tools like Jenkins, GitHub, GitLab CI, or CircleCI
- Experience querying, manipulating, and visualizing time-series data
- Familiarity with Infrastructure as Code tools (e.g., Ansible, Terraform)
- Knowledge of microservices architecture and event-driven systems
- Working knowledge of REST APIs, JSON, and ServiceNow
- Experience with cloud monitoring, particularly AWS or Azure

We are an equal opportunity employer, and we are an organization that values diversity. We welcome applications from all qualified candidates, including minorities and persons with disabilities.

req OMF-REQ-0005386