{"schemaVersion":"jobsearcher.job.v1","id":"b0e703732fb80e0a285c28db","url":"https://jobsearcher.com/jobs/b0e703732fb80e0a285c28db","canonicalUrl":"https://jobsearcher.com/jobs/b0e703732fb80e0a285c28db","title":"Technical Program Manager- AI Cluster Validation","description":"WHAT YOU DO AT AMD CHANGES EVERYTHING\r\nAt AMD, our mission is to build great products that accelerate next-generation computing experiences-from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges-striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.\r\nTechnical Program Manager- AI Cluster Validation\r\nTHE ROLE\r\nWe are seeking a Technical Program Manager to lead execution of AI cluster engineering programs with deep focus on GPU platforms, rack-level solutions, and AI Cluster validation. This role is responsible for driving end-to-end delivery from GPU + server integration through rack bring-up, scale testing, failure analysis, and system debug closure, ensuring platform readiness for hyperscale and enterprise AI deployments.\r\nThis role operates at the intersection of hardware, firmware, networking, and scale-test execution, and requires strong technical depth combined with disciplined program execution.\r\nTHE PERSON\r\nYou are a hands-on TPM who thrives in complex, fast-moving ecosystems, and can connect deep technical details to crisp program plans, executive reporting, and customer outcomes. You are comfortable driving execution in bring-up and EVT/DVT/PVT working closely with engineers to root-cause issues, unblock debug, and make data-driven tradeoffs to keep programs moving. You bring urgency, ownership, and clarity to ambiguous problem spaces and can communicate effectively from lab floor to executive review.\r\nKEY RESPONSIBILITIES\r\nProgram Leadership & Execution\r\n\r\nDefine, plan, and drive program plans for AI infrastructure systems validation and readiness, including server integration, rack bring-up, and cluster-scale deployment readiness.\r\nCreate and maintain core PM artifacts: schedules, dependency maps, resource forecasts, risk/issue logs, and program dashboards/status reports.\r\nIdentify and drive mitigation plans for issues/risks, including cross-team escalations and corrective actions across multiple engineering areas.\r\nDrive regular execution reviews with engineering teams and provide concise, data-driven updates to senior leadership.\r\n\r\nGPU & Platform Execution\r\n\r\nOwn program execution for GPU-based AI platforms, spanning system bring-up, qualification, scale readiness, and deployment validation across server, rack, and cluster levels.\r\nDrive alignment across GPU, CPU, firmware, BIOS/BMC, and system teams to ensure readiness for scale testing and customer workloads.\r\nTrack platform issues, and debug dependencies; ensure risks are clearly documented, owned, and mitigated.\r\n\r\nAI Rack / Cluster Validation\r\n\r\nOwn program planning and execution for multi-node and multi-rack scale testing, including test strategy, scheduling, coverage tracking, and readiness gates.\r\nLead end-to-end delivery of rack-level AI solutions, including compute trays, switch trays, cabling, power, cooling, and management infrastructure.\r\nEnsure rack bring-up plans are executable, resourced, and gated with clear entry/exit criteria across EVT, DVT, and scale phases.\r\nDrive coordination across lab operations, infrastructure, and engineering teams to unblock rack access, power, networking, and test readiness.\r\nPartner with scale, performance, and automation teams to ensure workloads, stress tests, and regressions plans are ready before hardware arrives.\r\n\r\nDebug, Failure Analysis & Risk Management\r\n\r\nAct as the execution lead for platform debug, coordinating across engineering teams to ensure fast triage, root-cause analysis, and resolution of system-level issues.\r\nTrack high-impact failures (GPU, HSIO, FW, rack, network) through debug forums ensuring clear ownership and closure plans.\r\nBalance debug depth vs. program timelines, escalating tradeoffs when needed and ensuring leadership has a clear view of risk and impact.\r\n\r\nREQUIRED QUALIFICATIONS\r\n\r\nExperience leading complex hardware or AI infrastructure programs with ownership across bring-up, validation, and deployment phases.\r\nStrong technical understanding of GPU-based AI systems, rack architectures, and datacenter infrastructure.\r\nProven ability to manage ambiguity, drive debug execution, and lead cross-functional teams without direct authority.\r\nStrong written and verbal communication skills, including executive-level status reporting.\r\nProficiency with program management and execution tools (Jira, Confluence, dashboards, Excel/PowerPoint).\r\n\r\nPREFERRED QUALIFICATIONS\r\n\r\nHands-on experience with GPU cluster scale testing, system stress, or performance validation.\r\nFamiliarity with rack-level bring-up, power/cooling constraints, networking, and failure modes at scale.\r\nExperience working through hardware/firmware debug cycles in pre-production or customer-facing environments.\r\n\r\nACADEMIC CREDENTIALS\r\n\r\nBachelor's or master's degree in systems, EE, CS, or related engineering discipline.\r\nPMP, Scrum Master, or equivalent program management training.\r\n\r\nLOCATION\r\nAustin, TX\r\nThis role is not eligible for visa sponsorship.\r\nLI-JE1\r\nBenefits offered are described: AMD benefits at a glance.\r\nAMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.\r\nAMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD's \"Responsible AI Policy\" is available here.\r\nThis posting is for an existing vacancy.","company":"Advanced Micro Devices","rawCompany":"advanced micro devices","city":"Austin","state":"TX","isRemote":false,"isActive":false,"createdAt":"2026-05-21T07:35:16.036Z","occupations":[{"code":"15-1299.09","title":"Information Technology Project Managers","slug":"information-technology-project-managers"},{"code":"11-3021.00","title":"Computer and Information Systems Managers","slug":"computer-and-information-systems-managers"},{"code":"15-1299.08","title":"Computer Systems Engineers/Architects","slug":"computer-systems-engineers-architects"}],"industries":[{"code":"334111","title":"Electronic Computer Manufacturing","slug":"electronic-computer-manufacturing"},{"code":"541512","title":"Computer Systems Design Services","slug":"computer-systems-design-services"},{"code":"513210","title":"Software Publishers","slug":"software-publishers"}],"jobPosting":{"@context":"https://schema.org","@type":"JobPosting","title":"Technical Program Manager- AI Cluster Validation","description":"WHAT YOU DO AT AMD CHANGES EVERYTHING\r\nAt AMD, our mission is to build great products that accelerate next-generation computing experiences-from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges-striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.\r\nTechnical Program Manager- AI Cluster Validation\r\nTHE ROLE\r\nWe are seeking a Technical Program Manager to lead execution of AI cluster engineering programs with deep focus on GPU platforms, rack-level solutions, and AI Cluster validation. This role is responsible for driving end-to-end delivery from GPU + server integration through rack bring-up, scale testing, failure analysis, and system debug closure, ensuring platform readiness for hyperscale and enterprise AI deployments.\r\nThis role operates at the intersection of hardware, firmware, networking, and scale-test execution, and requires strong technical depth combined with disciplined program execution.\r\nTHE PERSON\r\nYou are a hands-on TPM who thrives in complex, fast-moving ecosystems, and can connect deep technical details to crisp program plans, executive reporting, and customer outcomes. You are comfortable driving execution in bring-up and EVT/DVT/PVT working closely with engineers to root-cause issues, unblock debug, and make data-driven tradeoffs to keep programs moving. You bring urgency, ownership, and clarity to ambiguous problem spaces and can communicate effectively from lab floor to executive review.\r\nKEY RESPONSIBILITIES\r\nProgram Leadership & Execution\r\n\r\nDefine, plan, and drive program plans for AI infrastructure systems validation and readiness, including server integration, rack bring-up, and cluster-scale deployment readiness.\r\nCreate and maintain core PM artifacts: schedules, dependency maps, resource forecasts, risk/issue logs, and program dashboards/status reports.\r\nIdentify and drive mitigation plans for issues/risks, including cross-team escalations and corrective actions across multiple engineering areas.\r\nDrive regular execution reviews with engineering teams and provide concise, data-driven updates to senior leadership.\r\n\r\nGPU & Platform Execution\r\n\r\nOwn program execution for GPU-based AI platforms, spanning system bring-up, qualification, scale readiness, and deployment validation across server, rack, and cluster levels.\r\nDrive alignment across GPU, CPU, firmware, BIOS/BMC, and system teams to ensure readiness for scale testing and customer workloads.\r\nTrack platform issues, and debug dependencies; ensure risks are clearly documented, owned, and mitigated.\r\n\r\nAI Rack / Cluster Validation\r\n\r\nOwn program planning and execution for multi-node and multi-rack scale testing, including test strategy, scheduling, coverage tracking, and readiness gates.\r\nLead end-to-end delivery of rack-level AI solutions, including compute trays, switch trays, cabling, power, cooling, and management infrastructure.\r\nEnsure rack bring-up plans are executable, resourced, and gated with clear entry/exit criteria across EVT, DVT, and scale phases.\r\nDrive coordination across lab operations, infrastructure, and engineering teams to unblock rack access, power, networking, and test readiness.\r\nPartner with scale, performance, and automation teams to ensure workloads, stress tests, and regressions plans are ready before hardware arrives.\r\n\r\nDebug, Failure Analysis & Risk Management\r\n\r\nAct as the execution lead for platform debug, coordinating across engineering teams to ensure fast triage, root-cause analysis, and resolution of system-level issues.\r\nTrack high-impact failures (GPU, HSIO, FW, rack, network) through debug forums ensuring clear ownership and closure plans.\r\nBalance debug depth vs. program timelines, escalating tradeoffs when needed and ensuring leadership has a clear view of risk and impact.\r\n\r\nREQUIRED QUALIFICATIONS\r\n\r\nExperience leading complex hardware or AI infrastructure programs with ownership across bring-up, validation, and deployment phases.\r\nStrong technical understanding of GPU-based AI systems, rack architectures, and datacenter infrastructure.\r\nProven ability to manage ambiguity, drive debug execution, and lead cross-functional teams without direct authority.\r\nStrong written and verbal communication skills, including executive-level status reporting.\r\nProficiency with program management and execution tools (Jira, Confluence, dashboards, Excel/PowerPoint).\r\n\r\nPREFERRED QUALIFICATIONS\r\n\r\nHands-on experience with GPU cluster scale testing, system stress, or performance validation.\r\nFamiliarity with rack-level bring-up, power/cooling constraints, networking, and failure modes at scale.\r\nExperience working through hardware/firmware debug cycles in pre-production or customer-facing environments.\r\n\r\nACADEMIC CREDENTIALS\r\n\r\nBachelor's or master's degree in systems, EE, CS, or related engineering discipline.\r\nPMP, Scrum Master, or equivalent program management training.\r\n\r\nLOCATION\r\nAustin, TX\r\nThis role is not eligible for visa sponsorship.\r\nLI-JE1\r\nBenefits offered are described: AMD benefits at a glance.\r\nAMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.\r\nAMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD's \"Responsible AI Policy\" is available here.\r\nThis posting is for an existing vacancy.","datePosted":"2026-05-21T07:35:16.036Z","dateModified":"2026-05-21T07:35:16.036Z","hiringOrganization":{"@type":"Organization","name":"Advanced Micro Devices","sameAs":"https://jobsearcher.com"},"jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Austin","addressRegion":"TX","addressCountry":"US"}},"identifier":{"@type":"PropertyValue","name":"JobSearcher","value":"b0e703732fb80e0a285c28db"},"url":"https://jobsearcher.com/jobs/b0e703732fb80e0a285c28db"}}