Cloud Observability Engineer

intepros | Florida, NY | April 9th, 2026
Senior Cloud Observability Engineer - Hybrid - Onsite 4 days a week
Open to candidates in New York City, NY; Lake Mary, FL; or Pittsburgh, PA

We are seeking a Senior Cloud Observability Engineer to design, implement, and optimize enterprise-grade monitoring and observability solutions for mission-critical applications deployed in Microsoft Azure. This role operates at the intersection of cloud engineering, application performance monitoring, network visibility, and site reliability engineering.

The ideal candidate brings deep hands-on expertise across APM, infrastructure monitoring, and network performance monitoring platforms, along with strong Azure-native monitoring experience. This individual will collaborate closely with application engineering, cloud platform, DevOps/SRE, network, and security teams to deliver unified, scalable, and compliance-aligned observability capabilities across the enterprise.

Key Responsibilities
- Architect and implement end-to-end observability across application, infrastructure, network, and user experience layers within Azure-hosted environments.
- Integrate Azure-native services (Azure Monitor, Log Analytics, Application Insights) with enterprise monitoring platforms to deliver unified telemetry and correlation.
- Configure and optimize enterprise monitoring tools such as AppDynamics, Dynatrace, New Relic, ThousandEyes, NetScout, SolarWinds, Datadog, Prometheus/Grafana, or equivalent.
- Instrument applications for distributed tracing, code-level diagnostics, service mapping, and business transaction monitoring.
- Establish telemetry standards (metrics, logs, traces) aligned with SLIs/SLOs and reliability objectives.
- Design dashboards, synthetic tests, health checks, and alerting strategies that improve signal-to-noise ratio and reduce alert fatigue.
- Implement network performance monitoring and digital experience monitoring, including path visibility, BGP/DNS testing, and multi-hop user experience validation.
- Embed monitoring into CI/CD pipelines and infrastructure-as-code workflows to ensure new services meet observability standards.
- Support incident response and post-incident reviews with data-driven analysis and root cause insights.
- Conduct capacity planning, trend analysis, and performance optimization recommendations.
- Ensure observability solutions meet security and compliance requirements; support audit evidence and documentation.
- Develop runbooks, escalation processes, and knowledge transfer materials to support operational excellence.

Required Qualifications
- 5+ years of experience implementing enterprise observability and monitoring solutions in cloud or hybrid environments.
- Strong hands-on experience with:
  - Application Performance Monitoring (APM): AppDynamics, Dynatrace, New Relic (or equivalent).
  - Application Insights
- Solid understanding of distributed tracing, OpenTelemetry concepts, metrics pipelines, and log aggregation.
- Experience building dashboards, defining alert policies, and tuning thresholds and anomaly detection.
- Scripting and automation skills (PowerShell, Python, or Bash).
- Strong networking fundamentals (DNS, BGP, HTTP, TLS, TCP/IP) with the ability to correlate network and application telemetry.
- Experience supporting production incident response and performance troubleshooting.
- Excellent documentation and cross-functional communication skills.

Preferred Qualifications
- Experience in regulated industries (financial services, healthcare, government).
- Familiarity with SIEM/SOAR and log aggregation platforms (Splunk, Elastic).
- Integration experience with ITSM tools such as ServiceNow.
- Experience embedding observability into infrastructure-as-code (ARM, Bicep, Terraform).
- Exposure to SRE practices, including SLIs, SLOs, error budgets, and reliability reviews.
- Programming experience in Java, .NET (C#), or Python for instrumentation, automation, or custom telemetry integration.

What You'll Bring
- A systems-thinking mindset with the ability to see across application, infrastructure, and network layers.
- Strong analytical skills and the ability to translate telemetry into actionable insights.
- A collaborative approach to partnering with engineering, operations, and security teams.
- A commitment to reliability, scalability, and operational excellence.
