Sr. Staff Software Engineer, AI Training Infrastructure
Full-time
- The team also works closely with the open source community and has many open source committers (TensorFlow, Horovod, Ray, Hadoop, etc.)
- Additionally, this team focuses on technologies like LLMs, GNNs, incremental learning, online learning, and advanced LLM agents for training infrastructure.
- You will design and implement high-performance AI training pipelines and data I/O, work with open source teams to identify and resolve issues in popular libraries like Hugging Face, Horovod, and PyTorch, debug and optimize deep learning training, and provide advanced support for internal AI teams in areas like model parallelism, data parallelism, ZeRO, automatic mixed precision, and kernel fusion.
- Finally, you will assist in and guide the development of containerized pipeline orchestration infrastructure, including developing and distributing stable base container images, providing advanced profiling and observability, and updating internally maintained versions of deep learning frameworks and their companion libraries like TensorFlow, PyTorch, DeepSpeed, GNNs, Flash Attention, and more.
- Java, Go, Rust, Scala
Active Job
Updated 6 days ago