Sr. Site Reliability Engineer
Standard Template Labs is an AI-native startup reimagining the future of IT Service and Configuration Management. Backed by leading investors, we're leveraging AI to transform how enterprises manage and engage with their IT ecosystems.

About The Role

We’re looking for a Senior Site Reliability Engineer (SRE) to own the reliability, performance, and scalability of our AI-native platform. You’ll operate at the intersection of software engineering and infrastructure, building systems that keep our platform highly available, observable, and resilient in production.

This is a hands-on engineering role where you’ll write production code (primarily in Python) while also owning on-call operations and incident response.

Responsibilities

Reliability & Production Ownership
- Own the availability, latency, and performance of critical production systems
- Participate in and improve a 24/7 on-call rotation, responding to incidents and driving resolution
- Lead incident response, root cause analysis (RCA), and postmortems
- Design systems that fail gracefully and recover automatically

Automation & Engineering (Python-heavy)
- Write production-grade Python code to:
  - Automate infrastructure workflows
  - Build internal reliability tools
  - Improve deployment, rollback, and recovery systems
- Eliminate manual operational work through automation and self-healing systems

Observability & Monitoring
- Design and implement:
  - Metrics, logging, tracing
  - Alerting systems (reduce noise, improve signal)
- Build dashboards and tooling to give real-time visibility into system health

Infrastructure & Scalability
- Operate and improve systems running on:
  - Cloud platforms (AWS/GCP/Azure)
  - Containers (Docker, Kubernetes)
- Scale systems to handle enterprise workloads and high-throughput traffic
- Improve deployment pipelines, CI/CD, and infrastructure-as-code

Reliability Engineering & Resilience
- Define and enforce:
  - SLAs / SLOs / error budgets
- Conduct:
  - Load testing
  - Chaos testing
- Build resilient systems that can tolerate failure

Collaboration
- Partner with product and backend engineers to:
  - Improve system reliability
  - Embed observability into services
- Help teams design production-ready systems from day one

Qualifications

Core Requirements
- Strong software engineering background (not just ops)
- Proficiency in Python (required) for building tools and services
- Experience operating production systems at scale

Infrastructure & Systems
- Experience with:
  - Kubernetes / Docker
  - Cloud platforms (AWS/GCP/Azure)
  - Distributed systems

Reliability & Operations
- Experience with:
  - On-call rotations and incident response
  - Monitoring tools (Grafana, Prometheus, etc.)
  - Debugging production issues under pressure

Nice to Have
- Experience with:
  - AI/ML systems or data pipelines
  - Event-driven architectures
  - High-availability systems

What we offer
- Build foundational product features for an AI-first enterprise platform
- The opportunity to take ownership of critical systems that scale to millions of users
- A culture that values craftsmanship, autonomy, and technical excellence
- Competitive compensation, equity, and benefits package

Work from our Flatiron District, Manhattan office, where you’ll be side-by-side with the founding team in a supportive, collaborative setting. Our team works on-site five days a week, growing and building together, and the location is easy to reach with plenty of public transportation options.

As an equal opportunity employer, we don’t tolerate discrimination or harassment of any kind, whether based on race, ethnicity, age, gender identity, citizenship, religion, sexual orientation, disability, pregnancy, veteran status, or any other protected characteristic as outlined by federal, state, or local laws.

The reasonably estimated yearly salary for this role is $160,000 to $250,000 USD.