Senior Site Reliability Engineer

Sanlam | Bellville, TX | June 5th, 2026

Computer Systems Engineers/Architects | Computer Systems Design and Related Services

Who are we?

Sanlam Fintech is a newly established digital-first business within the Sanlam Group on a mission to democratize financial advice and solutions for everyone across the African continent. We exist to pioneer inclusive financial confidence, helping people build strong foundations to bridge the gap in generational wealth. Our culture is one of agility and constant deployment; we believe in learning fast, learning cheap and learning forward. Our aim is to provide a work environment where knowledge workers can accelerate the development of their ideas and bring innovation to market, while also providing a compelling career and development proposition that will enable them to realize their dreams.

Position Overview

The Site Reliability Engineer (SRE) at Sanlam Fintech is responsible for ensuring the reliability, scalability, and performance of our cloud-native infrastructure and services. This role bridges software engineering and operations, applying engineering principles to solve complex infrastructure challenges. The SRE will focus on building and maintaining resilient systems on AWS, implementing comprehensive observability solutions, and driving automation across the infrastructure lifecycle.

Operating in a DevOps environment, the SRE takes full ownership of the systems they build and operate, ensuring high availability and optimal customer experience.
They work closely with Software Engineers, Platform Engineers, and DevSecOps teams to deliver infrastructure solutions that support Sanlam Fintech business objectives and uphold our commitment to operational excellence.

What will you do?

Reliability & Resilience
* Build highly available, fault-tolerant systems on AWS
* Define SLIs, SLOs and error budgets to track and improve reliability
* Plan and implement disaster recovery strategies (RTO/RPO)
* Lead incident response and root cause analysis
* Build self-healing systems with automated fixes for common failures
* Run chaos engineering tests to find and fix weaknesses

Observability & Monitoring
* Set up metrics, logs and traces for full system visibility
* Build dashboards and alerts for fast incident detection
* Implement distributed tracing to spot performance issues
* Set monitoring standards and maintain operational runbooks
* Publish regular uptime and operational metrics reports

Infrastructure Automation
* Write and maintain Infrastructure as Code using Terraform and CloudFormation
* Automate provisioning, configuration and deployments with DevOps/Platform teams
* Build and manage CI/CD pipelines using GitHub Actions
* Implement GitOps practices and self-service automation to reduce manual work

Cloud Infrastructure & Architecture
* Design and optimise serverless solutions (Lambda, API Gateway, Step Functions)
* Manage and optimise Kubernetes clusters
* Implement cloud-native patterns like event-driven and microservices architectures
* Optimise cloud costs and evaluate new AWS services

Software Engineering & Development
* Build clean, well-structured automation tools and scripts
* Apply Clean Architecture and Domain-Driven Design to infrastructure code
* Improve internal tools to boost developer productivity
* Use AI tools (Claude, GPT) to automate routine tasks

Collaboration & Knowledge Sharing
* Work with cross-functional teams using Jira, Confluence and JSM
* Participate in on-call rotations and incident handoffs
* Mentor junior engineers in SRE practices
* Document decisions, procedures and run blameless postmortems

Qualification and Experience

Required Experience
* 5+ years of experience in systems engineering, DevOps, or site reliability engineering roles
* 3+ years of hands-on experience with AWS cloud services in production environments
* 2+ years of experience with Infrastructure as Code (Terraform and/or CloudFormation)
* Demonstrated experience in incident management and on-call responsibilities
* Track record of implementing automation that reduced operational toil

Educational Background
* Bachelor's degree in Computer Science, Information Technology, Engineering or related field; or equivalent practical experience
* Relevant professional certifications are advantageous but not required

What will make you successful in this role?

Cloud Platforms & Infrastructure
* Strong expertise in AWS services including EC2, ECS, EKS, Lambda, API Gateway, Step Functions, S3, RDS, DynamoDB, CloudWatch and networking services such as VPC, Route53 and ALB/NLB
* Deep understanding of serverless architecture patterns and best practices
* Experience with Kubernetes cluster management, deployment strategies and service mesh concepts
* Knowledge of cloud security best practices including IAM, security groups and encryption

Infrastructure as Code & Automation
* Proficiency in Terraform for multi-environment infrastructure management
* Experience with AWS CloudFormation for native AWS resource provisioning
* Strong scripting skills in Python for automation and tooling development
* Experience with configuration management tools and practices

Observability & Monitoring
* Expertise in Datadog, CloudWatch and OTEL for full-stack observability including APM, infrastructure monitoring, log management and synthetic testing and monitoring
* Experience designing and implementing SLI/SLO frameworks
* Proficiency in creating effective dashboards, alerts and runbooks
* Understanding of distributed tracing and correlation across services

Development & Version Control
* Strong experience with GitHub for version control, code review and CI/CD workflows
* Understanding of Clean Architecture principles and their application to infrastructure code
* Familiarity with Domain-Driven Design concepts for complex system design
* Experience building and maintaining CI/CD pipelines using GitHub Actions

Tools & Platforms
* Proficiency with Atlassian suite (Jira, Confluence) for project management and documentation
* Experience leveraging AI tools (Claude, GPT) for code generation, documentation, and problem-solving
* Familiarity with containerisation technologies (Docker) and orchestration platforms
* Experience with Linux system administration and troubleshooting

Nice To Have Skills

The following skills are desirable and will strengthen a candidate's application:
* Experience with additional cloud providers (Azure and GCP) for multi-cloud strategies
* Knowledge of FinOps practices and cloud cost optimisation techniques
* Experience with chaos engineering tools (AWS Fault Injection Simulator, Gremlin and Chaos Monkey)
* Familiarity with service mesh technologies (Istio and AWS App Mesh)
* Experience with database reliability engineering and performance tuning
* Knowledge of compliance frameworks relevant to financial services (POPIA and PCI-DSS)
* Contributions to open-source projects or community involvement
* AWS certifications (Solutions Architect, DevOps Engineer or SysOps Administrator)
* Kubernetes certifications (CKA and CKAD)
* Experience with event-driven architectures using AWS EventBridge, SNS, SQS or Kafka

Knowledge and Skills
* IT Data Analysis
* IT product enhancements
* Software design and deployments
* Platform management and integration
* Business Requirements

Personal Attributes
* Organisational savvy - Contributing through others
* Manages complexity - Contributing through others
* Plans and aligns - Contributing through others
* Optimises work processes - Contributing through others

Build a successful career with us

We're all about building strong, lasting relationships with our employees.
We know that you have hopes for your future - your career, your personal development and achieving great things. We pride ourselves in helping our employees to realise their worth. Through its five business clusters - Sanlam Fintech, Sanlam Life and Savings, Sanlam Investment Group, Sanlam Allianz, Santam, as well as MiWay and the Group Office - the group provides many opportunities for growth and development.

Core Competencies
* Being resilient - Contributing through others
* Collaborates - Contributing through others
* Cultivates innovation - Contributing through others
* Customer focus - Contributing through others
* Drives results - Contributing through others

Turnaround time

The shortlisting process will only start once the application due date has been reached. The time taken to complete this process will depend on how far you progress and the availability of managers.

Our commitment to transformation

The Sanlam Group is committed to achieving transformation and embraces diversity. This commitment is what drives us to achieve a diverse, inclusive and equitable workplace as we believe that these are key components to ensuring a thriving and sustainable business in South Africa. The Group's Employment Equity plan and targets will be considered as part of the selection process.
