Machine Learning Engineer - Training Systems

Location: Palo Alto
Employment Type: Full time
Department: Research

At Rhoda AI, we're building the full-stack foundation for the next generation of humanoid robots, from high-performance, software-defined hardware to the foundational models and video world models that control it. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling scenarios unseen in training. We work at the intersection of large-scale learning, robotics, and systems, with a research team that includes researchers from Stanford, Berkeley, Harvard, and beyond. We're not building a feature; we're building a new computing platform for physical work, and with over $400M raised, we're investing aggressively in the R&D, hardware development, and manufacturing scale-up to make that a reality.

We're looking for a Staff / Principal ML Systems Engineer to own training systems performance end-to-end. You will define how our models train at scale, driving efficiency, scalability, and correctness across large-scale multimodal training. This is a core systems role, not infrastructure support. Your work directly determines how efficiently we use compute, how well models scale across thousands of GPUs, and how quickly research can iterate.

What You'll Do

Own training performance end-to-end
- Diagnose and improve performance of large-scale multimodal training (vision, video, proprioception, actions, language)
- Build systematic performance attribution: step-time decomposition (compute vs communication vs input pipeline), scaling curves across cluster sizes, and bottleneck identification and prioritization
- Drive measurable gains in:
  - Distributed efficiency (comm/compute overlap, bucketization, topology-aware mapping, parallelism strategies)
  - Compute efficiency (kernel hotspots, operator fusion, attention optimization, framework/runtime overhead)
  - Memory efficiency (activation checkpointing, sequence packing/bucketing, fragmentation reduction)

Design training systems (not just tune them)
- Define and evolve parallelism strategies: data / tensor / pipeline / sharding / hybrid approaches
- Improve execution efficiency through communication scheduling and overlap, graph capture and execution optimization, and runtime-level improvements
- Contribute to and extend training frameworks where needed

Make performance observable and measurable
- Establish source-of-truth performance metrics: step-time breakdowns, MFU / throughput / scaling efficiency
- Build tools to identify bottlenecks quickly, track performance across model families, and compare scaling behavior across configurations
- Develop regression detection: microbenchmarks, performance baselines, and automated detection of efficiency regressions

Partner deeply with researchers
- Work side-by-side with research scientists and research engineers, with no silos
- Translate model innovations into scalable, efficient implementations
- Advise on training tradeoffs for robotics world models: long-horizon sequences, rollout/evaluation cadence, multimodal and variable-length data

Collaborate on cluster-level efficiency
- Work with infrastructure/SRE teams to improve utilization across large distributed jobs, impact of network and collective performance on training, and topology-aware job placement and scaling behavior

What We're Looking For
- Proven track record improving large-scale distributed training performance
- Deep hands-on experience with modern ML stacks (PyTorch required; JAX a plus)
- Strong understanding of data / tensor / pipeline parallelism, sharded training (FSDP / ZeRO-style), communication patterns and overlap strategies, and scaling behavior across large GPU clusters
- Strong systems intuition: ability to reason across compute, communication, and memory bottlenecks
- Exceptional debugging and measurement ability: turn "training is slow" into clear bottlenecks, experiments, and validated improvements
- High ownership mindset and comfort in a fast-moving environment

Nice To Have (But Not Required)
- GPU kernel or compiler-level experience (CUDA, Triton, graph capture, operator fusion)
- Experience with multimodal or video training (variable-length sequences, packing/bucketing)
- Experience working on large-scale training frameworks or distributed runtimes
- Familiarity with cluster topology, networking, and large-scale scheduling effects

Why This Role
- Direct leverage on research velocity: every efficiency gain you make accelerates model iteration across the entire research team
- Own the scalability and performance of large-scale multimodal training for real-world embodied intelligence, not static benchmarks
- Improvements you make compound across every training run the company executes: high ownership, high impact, small elite team