Site Reliability Engineer (89322-1)

Dice is the leading career destination for tech experts at every stage of their careers. Our client, Key Business Solutions, Inc., is seeking the following. Apply via Dice today!

Location: Alpharetta, GA
Duration: 12+ Months

Role Description

Skill Set: Expertise in UNIX and Linux administration, AWS/Azure cloud monitoring, Terraform/Ansible, and Prometheus/Grafana observability experience.
Work Location: Alpharetta
Experience required for role: 6+ years

Requirements:
- Production experience in SRE / infrastructure / ops for large-scale systems
- Strong programming/scripting skills (Python, Go, Java, or equivalent)
- Deep experience with containerization (Docker) and orchestration (Kubernetes, etc.)
- Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
- Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
- Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
- Networking and systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
- Solid experience in capacity planning, performance tuning, scaling, and incident response
- Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
- Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
- Excellent communication, documentation, and cross-team collaboration skills
- Proven track record of reducing operational toil via automation

Experience: 6+ years of experience as a Site Reliability Engineer or in a similar role, with hands-on experience in supporting IaaS platforms with networking and system engineering knowledge.

Responsibilities:
- Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
- Design and build automation for core platform capabilities, reducing manual toil
- Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
- Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
- Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
- Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting
- Optimize cost vs. performance tradeoffs in large-scale compute environments
- Harden systems for security, compliance, auditability, and data governance
- Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
- Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms
- Maintain runbooks, operational playbooks, documentation, and training materials
- Participate in on-call rotations and respond to production incidents 24/7 as needed
- Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability

Skills:
- Digital: Python
- Digital: Docker
- Digital: Kubernetes
- Digital: Site Reliability Engineering (SRE)

Experience Required: 6-8

Skills:
Category: SkillCategoryTest1_MN
Name: Digital: Site Reliability Engineering (SRE)
Required: Yes
Importance: 1
Experience: 4-7 years