Agentic Workflow Evaluation Consultant | Remote

Stem Sync Ai | Remote | May 29th, 2026

Data Scientists | Management, Scientific, and Technical Consulting Services

Frontier Model Evaluator (Academic & Domain Expert)
Remote | W2 Contract | Up to $1,920 Referral Bonus | 30+ hrs/week

Quick Snapshot

Embedded within a leading frontier-model lab's GenAI team, working directly on benchmark design and model evaluation for cutting-edge LLM development.
Design and validate real-world, domain-specific agentic tasks with executable Python test suites to surface reasoning and problem-solving failures in target models.
Analyze model and agent behavior to classify failure types, distinguishing logical reasoning gaps from other performance issues.
Open to professors, retired academics, and PhD candidates across STEM, finance, law, economics, business, and quantitative disciplines.
W2 employment through an established enterprise staffing partner: a structured role with payroll, benefits, and compliance support.
Minimum 30 hours/week commitment during weekdays; work is remote and task-driven, suited to researchers with flexible schedules.
Referral program available: earn up to $1,920 per successful referral, with no cap on referrals.

Requirements

Current or retired professor, or PhD student (or candidate), in a STEM field (ML, CS, mathematics, physics, engineering, statistics, biology, chemistry, data science) or a quantitative/professional domain (finance, economics, law, accounting, business).
Degree, or PhD in progress, from a top-tier university in your field.
Hands-on Python proficiency demonstrated through research, industry work, GitHub projects, or coursework; theoretical familiarity alone does not qualify.
Ability to design rigorous, real-world domain problems targeting specific capability gaps in large language models or agentic systems.
Ability to build complete task specifications, including golden solutions and executable test cases, within an agentic development environment.
Ability to evaluate model outputs systematically and classify failure modes with precision.
Prior experience in model evaluation, data annotation, or LLM/agent training is a strong plus.

Easy apply to proceed.

