Lead Machine Learning Engineer, LLM Infrastructure
Description

About the Role

We are seeking a Lead ML Engineer, LLM Post-Training Infrastructure to join the Salesforce AI Research Incubation Team. In this role, you will own the infrastructure and engineering systems that support LLM post-training, large-scale evaluation, and model deployment. You will build scalable, reliable pipelines for training orchestration, rollout generation, reward and feedback pipelines, experiment management, and model iteration, helping translate research ideas into production-grade systems.

This is an engineering-first role focused on ML infrastructure, distributed systems, and training/evaluation workflows rather than developing new model architectures or algorithms. You will work closely with research scientists, agent engineers, and platform teams to operationalize post-training and feedback-driven learning methods into robust, reusable systems.

This is a lead-level individual contributor role with deep ownership of model-facing infrastructure and strong cross-functional influence.

Key Responsibilities

- Design, build, and maintain infrastructure for LLM post-training, evaluation, and deployment.
- Own scalable pipelines for training orchestration, rollout generation, reward and feedback processing, checkpointing, and experiment management.
- Build reliable systems for feedback-driven model improvement, including human or AI feedback loops, large-scale offline evaluation, and regression detection.
- Partner closely with research scientists to turn new post-training methods into reusable engineering workflows.
- Collaborate with agent engineers and platform teams to integrate training and evaluation systems with production model and agent stacks.
- Optimize distributed training and inference workloads for reliability, throughput, cost efficiency, and observability.
- Drive best practices for reproducibility, versioning, monitoring, deployment, and operational excellence across ML systems.

Required Qualifications

- 5+ years of experience in software engineering, ML systems, or distributed infrastructure.
- Strong proficiency in Python and experience building production systems or large-scale ML pipelines.
- Hands-on experience building infrastructure for model training, post-training, evaluation, or serving.
- Experience designing reliable, scalable systems for distributed and GPU-based workloads.
- Strong debugging skills across systems, pipelines, and model-facing failures.
- Experience building infrastructure for LLM post-training, including RLHF, preference optimization, reward modeling, or related feedback-driven training workflows.
- Experience working cross-functionally with research scientists and engineers.
- Familiarity with cloud platforms (AWS, GCP) and containerized environments (Docker, Kubernetes).

Preferred Qualifications

- Experience with rollout systems, large-scale evaluation loops, or training data/feedback pipelines.
- Familiarity with distributed training frameworks and modern ML infrastructure stacks.
- Experience supporting agent-based learning, simulation environments, or iterative model improvement systems.
- Prior experience working closely with AI research or incubation teams.

Why Join Us?

- Own the systems that turn research models into production AI capabilities.
- Work at the intersection of AI research and large-scale engineering systems.
- Shape how models are trained, deployed, evaluated, and evolved.
- Competitive compensation, benefits, and strong long-term growth opportunities.

For roles in San Francisco and Los Angeles: Pursuant to the San Francisco Fair Chance Ordinance and the Los Angeles Fair Chance Initiative for Hiring, Salesforce will consider for employment qualified applicants with arrest and conviction records.