Site Reliability Engineer
Senior Site Reliability Engineer (SRE / Infrastructure)

Role Overview

We’re hiring a Senior SRE to build and scale the infrastructure behind a high-growth, production system. You’ll ensure reliability, performance, and scalability as the platform grows from early traction to large-scale usage.

This role focuses on designing resilient systems, improving observability, and automating operations so engineering teams can move quickly and safely.

What You’ll Do

- Own reliability, scalability, and performance of production systems
- Build and manage cloud infrastructure (primarily AWS/GCP + Linux)
- Design and operate Kubernetes clusters and containerized workloads
- Improve CI/CD pipelines and deployment workflows
- Lead incident response, on-call practices, and root cause analysis
- Build observability systems (monitoring, logging, alerting)
- Partner with engineers to design resilient systems (databases, pipelines, async systems)
- Automate infrastructure and operational workflows using IaC

Requirements

- 5+ years in SRE, DevOps, or infrastructure-focused engineering
- Strong experience with cloud platforms (AWS/GCP) and Infrastructure as Code (e.g., Terraform)
- Production experience with Kubernetes
- Experience with monitoring/observability tools (e.g., Prometheus, ELK, Datadog)
- Strong understanding of distributed systems, networking, and reliability best practices
- Comfortable coding/scripting (e.g., Python, Go, or similar)

Nice to Have

- Experience scaling high-availability systems
- Familiarity with CI/CD and modern deployment strategies (canary, blue/green)
- Background in data pipelines, async systems, or large-scale applications
- Exposure to Go, Rust, C++, or TypeScript
- Interest in applying AI to infrastructure or operations