
Research Scientist / Engineer - Multimodal Pre-training

Rhoda AI · Palo Alto, CA · May 25th, 2026
Location: Palo Alto
Employment Type: Full time
Department: Research

At Rhoda AI, we're building the full-stack foundation for the next generation of humanoid robots, from high-performance, software-defined hardware to the foundational models and video world models that control it. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling scenarios unseen in training. We work at the intersection of large-scale learning, robotics, and systems, with a research team that includes researchers from Stanford, Berkeley, Harvard, and beyond. We're not building a feature; we're building a new computing platform for physical work. With over $400M raised, we're investing aggressively in R&D, hardware development, and manufacturing scale-up to make that a reality.

We're looking for Research Scientists and Research Engineers to push the frontier of large-scale pre-training for our video action model. Our approach formulates robot control as video prediction: we pre-train causal video generation models on web-scale video data, then adapt them to predict robot actions from real-world demonstrations. You'll work on the core architectures, training objectives, and scaling strategies that determine how well our models learn from internet-scale video.
We hire across levels, from senior to staff, and welcome both research-track and engineering-track candidates.

What You'll Do
- Design and train large-scale causal video generation models on web-scale video data
- Develop and validate training objectives, model architectures, and data mixtures for video prediction at scale
- Research scaling laws and data efficiency for web-scale video pretraining
- Investigate what properties of web video transfer most effectively to robotic control and action prediction
- Build systematic evaluations to measure video generation quality, long-horizon prediction fidelity, and downstream robot task performance
- Run rigorous ablations and benchmarking to understand what drives model quality at scale
- Collaborate closely with data & evaluation, post-training, and training systems teams to translate research ideas into working systems
- Publish and present work at top-tier ML and robotics venues (especially valued for RS track)

What We're Looking For
- Strong background in large-scale generative modeling: either video generation (autoregressive video models, diffusion transformers, causal video architectures) or language model pretraining (LLMs, autoregressive transformers at scale)
- Hands-on experience training large generative models from scratch at scale
- Deep understanding of autoregressive modeling, causal architectures, and scaling behavior
- Fluency with modern ML frameworks (PyTorch required; JAX a plus)
- Ability to design experiments, interpret results, and iterate quickly
- Strong research taste: ability to identify high-leverage questions and cut through noise
- Comfort operating in a fast-moving, ambiguous startup environment
- Staff-level candidates are expected to define technical direction and drive research strategy independently; senior/MTS candidates execute complex projects with strong fundamentals and growing scope

Nice to Have (But Not Required)
- PhD in ML, CS, Robotics, or a related field, or equivalent research/industry experience
- Strong publication record at NeurIPS, ICML, ICLR, CVPR, CoRL, etc. (especially valued for RS track)
- Prior work specifically on video generation models (autoregressive video, diffusion transformers, world models, or causal video architectures)
- Experience with large-scale autoregressive language model pretraining and scaling
- Familiarity with web-scale video datasets and video data curation pipelines
- Prior work connecting video generation to control, action prediction, or robotic learning
- Familiarity with distributed training and multi-node infrastructure

Why This Role
- Work on a fundamentally different approach to robot learning: web-scale video pretraining rather than robot-data-only VLA models
- Your models give our robots the ability to understand and predict the visual world from internet-scale supervision
- Direct collaboration with data, post-training, and deployment teams, with no silos
- High ownership and fast iteration in a small, elite team