Reliability, Availability and Serviceability Expert, Datacenter AI Products Development
- We are looking for a product development engineer to serve as the subject-matter expert (SME) driving key aspects of RAS/Resilience features, from chip to module to server, for our next-generation products for AI applications.
- We expect you to bring deep knowledge of, and experience in, RAS/Resilience testing, characterization, analysis, benchmarking, and risk assessment for large AI training or HPC cluster systems with InfiniBand or enhanced Ethernet.
- Serve as the focal-point SME for manufacturing test requirements, test methodology, test plans, and test flows for AI system RAS/Resilience features, ensuring good test coverage and successful production ramp-ups.
- Own the troubleshooting and root-causing of AI system RAS/Resilience-related failures at the factory and in the field.
- Lead the data analysis of RAS/Resilience logs to refine, revise and overhaul test methodology and manufacturing flows; influence and drive software tools/infrastructure required for new product development, validation, and productization.
Active Job
Updated 12 days ago