Site Reliability Engineer
Harrison Clarke partners exclusively with venture-backed technology companies building category-defining products. We are currently conducting a confidential retained search on behalf of one of our flagship portfolio companies, a well-capitalised, mission-driven technology firm that has been operating at scale since 2014, with millions of global users and a reputation for rigorous engineering.

The Role

As Senior Site Reliability Engineer, you will own the infrastructure foundation that the entire engineering organization depends on. This isn't a support function; it's a strategic one. You'll work at the intersection of reliability, scalability, and developer experience, ensuring that a high-availability distributed system stays fast, resilient, and ready to grow.

What You'll Be Doing

- Maintaining, improving, and securing cloud infrastructure across AWS and GCP, alongside Linux systems at scale
- Partnering directly with engineering teams to streamline deployment, packaging, and troubleshooting of complex distributed applications
- Driving CI/CD maturity using Jenkins and adjacent tooling
- Building, operating, and continuously improving Kubernetes clusters in production; serving as the internal authority on container orchestration
- Leading application migration efforts onto Kubernetes in close collaboration with development squads
- Owning internal platform services including Prometheus and ELK
- Monitoring high-availability environments and responding to incidents with urgency and rigour; conducting thorough, blameless post-mortems
- Participating in architecture and code reviews, setting the bar for infrastructure best practice
- Evaluating emerging technologies and making pragmatic decisions on adoption
- Identifying and eliminating toil through intelligent automation

What You Bring

- 5+ years in cloud-based systems operations as an SRE or DevOps engineer
- Hands-on experience with infrastructure as code and configuration management
- Strong command of SRE methodologies: SLOs/SLIs/error budgets, capacity planning, disaster recovery testing
- Deep understanding of networking fundamentals
- Proven track record managing production workloads with sophisticated monitoring and alerting
- Comfort with on-call responsibilities and a systematic approach to incident management
- Proficiency in at least one scripting or programming language
- A collaborative, low-ego working style: you raise the team up, especially under pressure

Bonus Points

- Ability to read and reason about Go, Rust, C++, or TypeScript
- Experience applying AI-driven approaches to operational automation