Software Engineer, SystemML - Scaling / Performance
$177,008 - $251,000 a year
Full-time
- The team develops and owns the software stack around NCCL (NVIDIA Collective Communications Library), which enables multi-GPU and multi-node data communication through HPC-style collectives.
- Currently, one of the team’s focus areas is building software benchmarks, performance tuners, and software stacks around NCCL and PyTorch to improve full-stack distributed ML performance (e.g. large-scale GenAI/LLM training), from the trainer down to the inter-GPU and network communication layers.
- We are seeking engineers to provide technical leadership in GenAI/LLM scaling and performance.
- Tech-lead the overall distributed ML enablement and performance work on Meta's large-scale GPU training infrastructure, with a focus on GenAI/LLM scaling
- Experience working with DL frameworks such as PyTorch, Caffe2, or TensorFlow
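For context on the collectives mentioned above: the workhorse of multi-GPU training is the all-reduce, which NCCL commonly implements as a ring algorithm. Below is a minimal pure-Python sketch of a ring all-reduce (sum) that simulates the per-rank buffers and the two phases (reduce-scatter, then all-gather); it is an illustration of the general technique, not NCCL's actual implementation.

```python
def ring_allreduce(bufs):
    """Simulate a ring all-reduce (sum) over p = len(bufs) ranks.

    Phase 1 (reduce-scatter): after p-1 steps, each rank holds the
    fully summed values for exactly one chunk of the buffer.
    Phase 2 (all-gather): those finished chunks circulate around the
    ring until every rank holds the complete summed buffer.
    Assumes the buffer length is divisible by the number of ranks.
    """
    p = len(bufs)
    n = len(bufs[0])
    assert n % p == 0, "buffer length must divide evenly into p chunks"
    c = n // p                       # elements per chunk
    bufs = [list(b) for b in bufs]   # work on copies of the rank buffers

    def sl(i):                       # slice covering chunk i (mod p)
        i %= p
        return slice(i * c, i * c + c)

    # Reduce-scatter: in step s, rank r sends its chunk (r - s) to rank
    # (r + 1), which adds it into the same chunk of its own buffer.
    for s in range(p - 1):
        outgoing = [bufs[r][sl(r - s)] for r in range(p)]
        for r in range(p):
            src = (r - 1) % p
            dst = sl(src - s)
            for k, v in enumerate(outgoing[src]):
                bufs[r][dst.start + k] += v

    # All-gather: in step s, rank r forwards its finished chunk
    # (r + 1 - s) to rank (r + 1), which overwrites its stale copy.
    for s in range(p - 1):
        outgoing = [bufs[r][sl(r + 1 - s)] for r in range(p)]
        for r in range(p):
            src = (r - 1) % p
            bufs[r][sl(src + 1 - s)] = outgoing[src]

    return bufs
```

Each step moves only n/p elements per rank, for a total of 2(p-1)·n/p elements sent per rank; this near-bandwidth-optimal behavior is why ring collectives scale well to large GPU counts, and it is the layer the benchmarking and tuning work described above targets.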
Updated 2 days ago