Systems Design Engineer - Site Reliability Engr
Full-time
- Our mission is to build great products that accelerate next-generation computing experiences – the building blocks for the data center, artificial intelligence, PCs, gaming and embedded.
- This position focuses on the operational aspects of large-scale GPU-accelerated AI (Artificial Intelligence) and HPC (High Performance Computing) cluster systems within AMD. The SRE will work closely with the CPE Platform Engineering (PE) and Data Center Operations (DCOps) teams as internal and external systems are brought up for customers.
- Role and Responsibilities: This SRE role primarily involves learning the AMD GPU cluster systems, assisting in the bring-up of these systems, and developing automation to keep them operational. It also involves working with the other DCGPU and DSG teams to incorporate their requirements and address any issues on the systems.
- This includes some involvement with rack-and-stack data center operations, at-scale software installation and configuration management, and at-scale system provisioning. The work helps build and operate an on-prem cloud service for internal AMD stakeholders that serves as a model for customer adoption.
- Experience with virtualization and containerization, including systems like KVM, Docker, Podman, OpenShift, and Kubernetes.
Expired 13 days ago (Inactive Job)