Senior Site Reliability Engineer
Job Title: Senior Site Reliability Engineer (SRE) – AI & Automation Focus
Location: Dallas, TX
Work Setup: Hybrid
Employment Type: Contract (Short-term, with potential extension)

About the Role

We’re looking for a Senior Site Reliability Engineer (SRE) who enjoys solving complex production issues and building smarter, more automated systems.

In this role, you won’t just “keep the lights on”; you’ll help modernize operations using automation and AI, making systems more reliable, scalable, and easier to manage.

This is a senior-level position where you’ll work across multiple systems and teams, helping drive a shift toward automation-first and AI-assisted operations.

What You’ll Do

Keep Systems Running Smoothly
Maintain and improve high-availability production systems in the cloud
Take part in on-call rotations and lead incident response when needed
Investigate issues, identify root causes, and help prevent them from happening again
Build dashboards, alerts, and reliability metrics (SLIs/SLOs)
Troubleshoot across Java apps, Kubernetes, and cloud infrastructure
Partner with engineering and security teams to reduce risks

Build Smarter, Automated Operations
Design and implement automation tools and AI-driven workflows
Help create systems that can:
    Detect and analyze incidents automatically
    Suggest root causes and possible fixes
    Assist or automate resolution (with proper safeguards)
Gradually reduce manual work by introducing intelligent automation
Ensure all automation includes human oversight for critical actions

Tech Stack

Cloud: Azure
Containers: Kubernetes, Docker
Languages: Java, Python, Bash
CI/CD: GitHub Actions
Monitoring: Dynatrace (or similar tools)
Automation: Ansible and modern AI/agent-based frameworks

What You Bring

7+ years of experience in Site Reliability Engineering or Production Support
Strong hands-on experience with:
    Cloud platforms (Azure preferred)
    Kubernetes & containerized environments
    Java-based production systems
    CI/CD pipelines
    Monitoring and observability tools
Proven track record of automating manual processes
Solid understanding of SRE best practices (SLIs, SLOs, error budgets)

Nice to Have

Experience reducing on-call workload through automation
Exposure to AI/ML or intelligent automation tools
Background working in regulated industries (e.g., healthcare, finance)
Familiarity with distributed systems or multi-agent architectures

What Success Looks Like

Less manual work and fewer repetitive alerts
Faster and more efficient incident resolution
Improved system uptime and reliability
Clear, consistent communication during incidents
Increased use of automation and AI in daily operations

Who You Are

You take ownership of systems end-to-end
You think beyond tickets and focus on long-term solutions
You’re always looking for ways to automate and improve processes
You’re curious about how AI can enhance engineering workflows, and you use it responsibly