Site Reliability Engineer
Senior Site Reliability Engineer

Start Date: 2-3 weeks needed for onboarding
Location: Ideally Chicago or Hartford; open to remote
Duration: 12+ Month Contract

Job Role: Lead the SRE project plan and implementation for distributed applications across GCP and Azure, covering APIs, data pipelines, messaging/event-driven systems, and external data platforms.

Required experience: 7+ years in SRE with proven Azure, GCP observability, Grafana stack, GKE, AKS, OpenTelemetry, and instrumentation implementation experience. Hands-on experience mocking up OTEL, Prometheus, and metrics-related targets for this project.

Must have skills:
- Technical: Prometheus, Grafana, Kubernetes, Loki, Tempo, GCP or Azure logging
- Logging & Tracing: Distributed tracing, W3C Trace Context headers implementation, log aggregation standards, correlation IDs across systems/applications
- Structured Logging: JSON format with specific fields (trace_id, service.name, log.level, customer.id, request.id)
- Experience monitoring batch/data pipelines (Cloud Composer, Dataproc, ETL workflows), including job failures and scheduling issues
- Infrastructure: CI/CD pipelines, AI tools like GitHub Copilot, etc.
- Observability Tools & Query Languages: PromQL for querying metrics (Grafana)
- Strong experience with Kubernetes (GKE, AKS), including namespace management, RBAC, and deploying/maintaining SRE tools via code (Java/Python, Bash, YAML, Helm)
- OpenTelemetry (OTEL): Instrumentation, collectors, data collection from GCP services
- Alerting and Incident Management: Implementing structured processes for handling failures, and conducting reviews that focus on fixing system issues

Additional skills:
- Experience monitoring external managed services such as MongoDB, Kafka, and Cloud SQL
- Azure-based monitoring
- On-call systems: designing and writing on-call rotation policies and rules (xMatters, PagerDuty, Opsgenie, etc.)
- AI training and hands-on experience

Job Description:
- Design and implement comprehensive SRE monitoring for distributed applications
- Implement distributed tracing and logging using W3C Trace Context headers and OpenTelemetry standards across all applications
- Create drill-down Grafana dashboards with correlation between metrics, logs, and traces
- Integrate GCP and Azure Monitoring, Logging, and Trace with the existing OpenTelemetry standards set by enterprise teams
- Implement zero-code instrumentation for monitoring and traceability
- Define and work with core SRE models such as SLIs, SLOs, and error budgets
- Design reliability-focused metrics dashboards (latency, request rate, errors, duration, availability)
- Build service health dashboards with drill-down capabilities and error message analysis
- Develop and maintain SRE automation/scripts within GKE namespaces for monitoring, deployment, and troubleshooting
- Configure Apigee monitoring and API performance tracking for applications, working with enterprise teams