Site Reliability Engineer (SRE) with Apache Druid
Job Title: Site Reliability Engineer (SRE) – Apache Druid Platform
Job Location: Austin, TX / Sunnyvale, CA (Onsite)
Type of Hire: Full Time
Mode of Interview: Virtual

We are seeking a highly motivated Site Reliability Engineer (SRE) with strong experience supporting large-scale distributed data platforms in cloud-native environments. The ideal candidate will have hands-on expertise with Apache Druid, Kubernetes (EKS), AWS infrastructure, observability tooling, and production reliability engineering practices.

This role focuses on ensuring scalability, availability, operational efficiency, and reliability for mission-critical analytics and ingestion platforms operating at production scale.

Key Responsibilities

Manage and support large-scale distributed data platform infrastructure running on Kubernetes (EKS) and AWS.
Administer and troubleshoot Apache Druid clusters, including ingestion pipelines, broker performance, historical nodes, deep storage integrations, and query reliability.
Support Kafka-based streaming and ingestion ecosystems for high-volume analytics workloads.
Improve platform reliability, scalability, and operational efficiency through automation and proactive monitoring.
Develop automation scripts and operational tooling using Python and Bash for health checks, deployment validation, recovery workflows, and operational tasks.
Work with observability and monitoring platforms such as Prometheus, Grafana, Datadog, Splunk, ELK Stack, OpenTelemetry, and CloudWatch.
Participate in incident response, root cause analysis (RCA), postmortems, and production troubleshooting for distributed systems.
Support Kubernetes lifecycle activities including deployments, upgrades, storage migrations, scaling, and cluster troubleshooting.
Build and maintain CI/CD pipelines using Jenkins and GitHub Actions to streamline deployment and infrastructure operations.
Collaborate closely with platform engineering, DevOps, infrastructure, and application teams to improve operational readiness and reliability standards.
Implement Infrastructure as Code (IaC) practices using Terraform and cloud-native automation methodologies.

Required Skills

Strong experience in Site Reliability Engineering (SRE), Platform Engineering, or Production Infrastructure Support.
Hands-on experience with Apache Druid in production environments.
Strong Kubernetes experience, preferably Amazon EKS.
Experience supporting AWS cloud infrastructure and cloud-native distributed systems.
Experience with Kafka ecosystems and distributed data ingestion architectures.
Strong knowledge of observability, monitoring, and logging platforms.
Experience with automation scripting using Python and Bash.
Experience with CI/CD tools such as Jenkins and GitHub Actions.
Knowledge of Infrastructure as Code (Terraform preferred).
Strong troubleshooting and incident management skills across distributed systems environments.

Preferred Qualifications

Experience supporting large-scale analytics or real-time data platforms.
Exposure to Apache Airflow or workflow orchestration platforms.
Understanding of high-availability architectures and reliability engineering principles.
Experience working in enterprise production environments with operational SLAs and reliability metrics.