
Member of Technical Staff - Post Training (Sonoma)

Cerebro
Sonoma, CA
April 29th, 2026
We're building the post-training and evaluation layer for AI agents.

As models gain the ability to use tools, browse, and operate software, a new problem is emerging:

They don't reliably work.

They fail mid-task, behave inconsistently, and are extremely hard to evaluate or improve in real-world environments.

We're solving this by building:
- High-fidelity, resettable environments for training agents
- Evaluation systems that can automatically grade behaviour
- Training loops that continuously improve real-world performance

Our systems are already used by leading AI teams to train and evaluate agentic models.

The Role

We're hiring a Founding MTS (Post-Training / Applied ML) to build and scale systems that make models actually useful in production.

This is not a pure research role.

You'll focus on:
- Taking models that "kind of work" → making them reliable, measurable, and deployable
- Designing and running training + evaluation loops at scale
- Rapid experimentation on real-world tasks and environments

You'll operate in the space between:
- ML engineering
- Post-training / RL
- Agent systems

What You'll Work On

- Building post-training pipelines for agent behaviour
  - Fine-tuning, RL, dataset iteration
  - Improving multi-step task completion
- Designing evaluation systems that reflect real-world success
  - LLM-as-judge, programmatic evals, hybrid approaches
- Running tight experiment loops:
  - Identify failure modes
  - Generate data
  - Retrain
  - Measure improvements
- Improving:
  - Reliability across long-horizon tasks
  - Tool use and environment interaction
  - Consistency and robustness
- Shipping systems that are used daily to train real models

Who This Is For

We're looking for applied builders, not pure researchers.

Strong candidates often come from:
- Applied ML / post-training teams at frontier labs
- Early-stage AI startups working on agents or LLM products
- Infra teams working on evaluation, fine-tuning, or deployment

You might have:
- Experience with:
  - LLM fine-tuning / RLHF / RLAIF
  - Agent systems or tool use
  - Evaluation frameworks or benchmarking
- A track record of:
  - Shipping ML systems into production
  - Running fast, iterative experiments
- Comfort working in messy, real-world problem spaces
- Strong engineering instincts alongside ML knowledge

What Makes This Role Different

- Applied, not academic: You're judged on whether the system works, not papers
- Tight feedback loops: You'll see the impact of your work immediately
- Real-world complexity: Not toy benchmarks, but messy, dynamic environments
- High ownership: You'll define core systems and how they evolve
- Upstream of the ecosystem: Your work improves how entire teams train agents

Why This Matters

The biggest gap in AI right now isn't pretraining; it's post-training and reliability.
- Models can generate actions
- But they can't consistently complete tasks
- And we don't have good ways to measure or fix that

We're building the systems that close that gap.

If successful, this unlocks:
- Reliable AI agents
- End-to-end automation
- Production-grade AI systems