Site Reliability Engineer (SRE) - AWS Cloud (Terraform & Ansible Focus)
We are seeking a Site Reliability Engineer (SRE) with deep expertise in AWS cloud infrastructure, Infrastructure as Code (IaC), and large-scale production operations. This role is heavily focused on designing, deploying, automating, and optimizing cloud-native infrastructure in AWS using Terraform and Ansible.
You will work at the intersection of software engineering and cloud operations to build resilient, scalable, secure, and highly automated systems that power mission‑critical applications.
Key Responsibilities
AWS Cloud Architecture & Deployments
Design, implement, and maintain scalable, secure, and highly available infrastructure in AWS
Lead large‑scale AWS deployments across multi‑account, multi‑region environments
Architect and optimize solutions using services such as:
EC2, EKS, ECS, Lambda
VPC, Route 53, CloudFront
RDS, DynamoDB, S3
IAM, KMS, Secrets Manager
Implement solutions aligned with the AWS Well‑Architected Framework and AWS best practices
Infrastructure as Code (Terraform Focus)
Develop and maintain reusable, modular Terraform code
Build CI/CD‑driven infrastructure pipelines
Manage Terraform state securely (remote backends, locking, environment separation)
Enforce policy‑as‑code and guardrails
Review and optimize Terraform modules for performance and maintainability
Configuration Management & Automation (Ansible Focus)
Design and maintain Ansible playbooks and roles
Automate configuration management and application deployments
Integrate Ansible with CI/CD pipelines
Ensure idempotent, secure, and maintainable automation
Reliability & Operations
Define and implement SLOs, SLIs, and error budgets
Lead incident response, root cause analysis (RCA), and postmortems
Improve observability using logging, monitoring, and tracing tools
Optimize system performance, cost, and resilience
Build self‑healing infrastructure and automation‑first solutions
Participate in recurring on‑call shifts (required)
DevOps & CI/CD
Design and maintain CI/CD pipelines for infrastructure and applications
Promote GitOps workflows
Integrate automated testing, security scanning, and compliance validation
Security & Compliance
Implement least‑privilege IAM policies
Automate security controls within Terraform and Ansible
Ensure compliance with internal and regulatory standards
Implement infrastructure security best practices (network segmentation, encryption, patching)
Required Qualifications
4+ years of experience in DevOps, Cloud Engineering, or Site Reliability Engineering
3+ years of hands‑on AWS experience in production environments
Deep expertise in:
Terraform (advanced modules, workspaces, state management)
Ansible (roles, playbooks, dynamic inventories)
Strong experience with:
CI/CD platforms (GitHub Actions, GitLab CI, Jenkins, etc.)
Kubernetes (EKS preferred)
Linux systems administration
Networking fundamentals (VPC design, DNS, load balancing)
Proficiency in at least one scripting/programming language (Python, Bash, Go)
Experience with monitoring and observability tools (Prometheus, Grafana, Datadog, etc.)
Strong understanding of distributed systems and reliability engineering principles
Preferred Qualifications
Experience with multi‑account AWS environments using Organizations and Control Tower
Experience implementing GitOps workflows
AWS certifications (Solutions Architect, DevOps Engineer, etc.)
Experience with service mesh technologies
Cost optimization and FinOps experience
Experience in highly regulated environments (HIPAA, SOC 2, PCI)