Senior Reliability Engineer

Job Description

About the Role

As Senior Reliability Engineer, you will ensure the resilience and availability of Kohl’s systems and applications, collaborate closely with development teams, contribute to architectural designs, conduct risk assessments and design for failure, and implement robust monitoring and failover mechanisms.

What You’ll Do

- Drive error budget and Service Level Objective (SLO) adoption across products
- Drive incident response efforts, perform root cause analysis, and implement preventative measures to enhance system reliability
- Establish consistent practices that elevate Kohl’s operational excellence through automation and process improvements
- Follow the software lifecycle and drive reliability, observability, and efficiency across product teams within an assigned domain
- Identify repeated toil and find opportunities for automation and risk reduction
- Participate in an on-call rotation to respond to production incidents, and conduct blameless retros and root-cause analyses (RCAs) to drive a culture of continuous improvement
- Proactively identify failures before they cause outages, using chaos engineering techniques such as testing edge cases, analyzing failure modes, and conducting design reviews
- Advise on capacity planning and provide continuous assessments of system behavior and consumption
- Work with product managers to identify and prioritize work on reliability best practices (e.g., leveraging SLIs, SLOs, and error budgets)
- Mentor and assist engineers on the team
- Additional tasks may be assigned

What Skills You Have

Required

- Bachelor's degree or equivalent in MIS, Computer Science, or a related field
- 4+ years of experience in software development
- Strong programming skills in one or more languages (Java, Python, Go, or Node.js)
- In-depth knowledge of systems architecture, operating system internals, and network fundamentals
- In-depth knowledge of application design patterns, event-driven architecture, database schemas, and testing strategies
- Experience with multi-region application troubleshooting and performance tuning
- Working experience with one cloud platform (GCP, AWS, or Azure)
- Working experience with monitoring techniques and tools (e.g., CloudWatch, Grafana, Prometheus, OpenTelemetry, tracing)

Preferred

- In-depth knowledge of containerization and container orchestration (e.g., Docker, Kubernetes, Rancher)
- Experience with one or more configuration management systems (e.g., Chef, Ansible, Puppet)
- Passion for and experience with AI and ML methodologies (MLOps)
- Experience writing infrastructure as code (e.g., Terraform, OpenTofu)