
SME SRE Observability

Job Title: SME – SRE Observability Engineer
Location: Minnesota (Onsite – 4 to 5 days/week)

Job Summary:
We are seeking an experienced Subject Matter Expert (SME) in Site Reliability Engineering (SRE) with a strong focus on Observability. The ideal candidate will be responsible for designing, implementing, and optimizing observability frameworks to ensure high system reliability, performance, and scalability in a production environment.

Key Responsibilities:
- Lead the design and implementation of observability solutions including metrics, logging, and tracing.
- Act as an SME for SRE best practices, ensuring system reliability, availability, and performance.
- Develop and maintain dashboards, alerts, and monitoring strategies.
- Collaborate with development, DevOps, and infrastructure teams to improve system visibility.
- Perform root cause analysis (RCA) and drive incident resolution.
- Optimize system performance and reliability through proactive monitoring.
- Implement automation to improve operational efficiency and reduce manual intervention.
- Define and track SLIs, SLOs, and SLAs.

Required Skills & Qualifications:
- Strong experience in Site Reliability Engineering (SRE) concepts and practices.
- Deep expertise in Observability tools (e.g., Prometheus, Grafana, ELK Stack, Datadog, Splunk, or similar).
- Experience with cloud platforms (AWS, Azure, or GCP).
- Proficiency in scripting/programming (Python, Bash, or similar).
- Hands-on experience with monitoring, alerting, and logging frameworks.
- Strong troubleshooting and performance tuning skills.
- Experience with CI/CD pipelines and automation tools.

Preferred Qualifications:
- Experience working in high-availability, distributed systems.
- Knowledge of containerization and orchestration tools (Docker, Kubernetes).
- Prior experience as an SRE SME or Lead.