{"schemaVersion":"jobsearcher.job.v1","id":"7e8ffef4cdd824be60cebbb0","url":"https://jobsearcher.com/jobs/7e8ffef4cdd824be60cebbb0","canonicalUrl":"https://jobsearcher.com/jobs/7e8ffef4cdd824be60cebbb0","title":"Evaluations Engineer","description":"About the Role\nWe are looking for strong engineers to join our team and own the leaderboards that appear on Vals AI.\n\nYou will be responsible for testing and benchmarking new models as they are released on tasks in law, tax, coding, finance, and more. You will analyze error modes of models, evaluate their strengths and weaknesses, and work with our communications team to release results.\n\nOur results are used by startups, enterprises, and research labs alike. We work with all the major foundation model labs, some of the largest financial institutions, and hospital systems in the world. Our work has been featured by the Wall Street Journal, Washington Post, and Bloomberg.\n\nWe are building the standard for evaluating the ability of LLMs to perform real‑world tasks. You will contribute directly to the leaderboards that make this possible.\n\nWhat You’ll Do\n\nEvaluate new LLM model releases across the Vals AI suite of benchmarks\n\nWork directly with both open‑source and closed‑source foundation model labs in evaluating model performance\n\nUse tools like Docent to analyze common failure modes and patterns in model performance\n\nWork directly with our social media team to post interesting findings and results\n\nAdd new models and maintain integrations in our model library\n\nHelp improve and maintain the infrastructure we use to run benchmarks (agentic and non‑agentic).\n\nCollaborate closely with our research team on the creation of new benchmarks\n\nThis role follows the rhythm of model releases. 
Expect intense sprints in the days following a major launch, and calmer stretches in between releases.\n\nRequirements\n\nFamiliarity with LLMs: You should already be familiar with the space: the current leading models, their relative performance, and how to use large language models in practice.\n\nStrong engineering fundamentals: You can build and ship quickly with high quality. You should have a track record of building things of significant scope (at jobs, side projects, open source, etc.).\n\nPython expertise: Significant experience in Python, especially in a professional setting.\n\nTeam collaboration: Experience working in development sprints, Git workflows, and pull request reviews.\n\nLocation: We are an in‑person team based in San Francisco. We will support your relocation or transportation as needed.\n\nNice‑to‑Haves\n\nPrevious experience with benchmarking large language models, or creating benchmarks\n\nPrevious experience working at a startup or starting your own company\n\nTechnical writing experience and ability\n\nMachine learning research experience\n\nWhat We Offer\n\nHighly competitive salary and meaningful ownership. Excellence is well rewarded.\n\nRelocation and transportation support\n\nHealth/dental insurance coverage\n\nLunch and dinner provided, free snacks/coffee/drinks\n\n401K plan\n\nUnlimited PTO\n\nAbout Us\nFounding team: The core methodology behind this platform comes from NLP evaluation research we did at Stanford. We raised a $5M seed from some of the top institutional and angel investors in the valley. Our team has prior work experience at NVIDIA, Meta, Microsoft, Palantir and HRT. Collectively, we have over 300 citations in our published work. Our early team includes Stanford PhDs, ex‑Jane Street quants, and the first designer at Snorkel.\n\nTech stack: We use Python for most things at Vals. Our platform is built on Django, with a React frontend. 
All of the infra is on AWS using CDK for IaC.\n\nWhat We're Looking For\n\nLearning velocity: The role encompasses a wide variety of tasks. Rather than expecting you to be an expert on Day 1, we are looking for someone who can learn new skills and technologies extremely quickly.\n\nOwnership: Working in a small, talent‑dense team, we expect everyone to show initiative to build where it's needed, not where it's asked. We strive for autonomy over consensus. This is especially true for this role.\n\nIntensity: The LLM landscape is constantly changing. Foundation model labs are continuously pushing the frontier. The unicorn companies that will emerge from this technology shift are being built now. Those that win will have an incredibly high speed of execution.\n\nSolution‑oriented mindset: We're looking for people who see opportunities to craft solutions at each juncture, not those who pass hard problems to others or admit defeat.\n\nFurther Reading:\n\nHugging Face blog on evaluation\n\nAnthropic’s blog on challenges in evaluation\n\nNew York Times article on issues in benchmarking\n\nStanford HAI report showing hallucinations in legal tech tools\n\nReferral Bonus\nKnow someone who would be a good fit? Connect them with rayan@vals.ai. If we hire them and they stay on for 90 days you’ll get a $10,000 referral bonus and Vals AI merch! 
Please mention the bonus in your email.","company":"Petsapp","rawCompany":"petsapp","city":"Millbrae","state":"CA","isRemote":false,"isActive":false,"createdAt":"2026-06-20T03:36:06.586Z","occupations":[{"code":"15-1252.00","title":"Software Developers","slug":"software-developers"},{"code":"17-2112.02","title":"Validation Engineers","slug":"validation-engineers"},{"code":"17-2199.00","title":"Engineers, All Other","slug":"engineers-all-other"}],"industries":[{"code":"513210","title":"Software Publishers","slug":"software-publishers"},{"code":"541511","title":"Custom Computer Programming Services","slug":"custom-computer-programming-services"},{"code":"541990","title":"All Other Professional, Scientific, and Technical Services","slug":"all-other-professional-scientific-and-technical-services"}],"jobPosting":{"@context":"https://schema.org","@type":"JobPosting","title":"Evaluations Engineer","description":"About the Role\nWe are looking for strong engineers to join our team and own the leaderboards that appear on Vals AI.\n\nYou will be responsible for testing and benchmarking new models as they are released on tasks in law, tax, coding, finance, and more. You will analyze error modes of models, evaluate their strengths and weaknesses, and work with our communications team to release results.\n\nOur results are used by startups, enterprises, and research labs alike. We work with all the major foundation model labs, some of the largest financial institutions, and hospital systems in the world. Our work has been featured by the Wall Street Journal, Washington Post, and Bloomberg.\n\nWe are building the standard for evaluating the ability of LLMs to perform real‑world tasks. 
You will contribute directly to the leaderboards that make this possible.\n\nWhat You’ll Do\n\nEvaluate new LLM model releases across the Vals AI suite of benchmarks\n\nWork directly with both open‑source and closed‑source foundation model labs in evaluating model performance\n\nUse tools like Docent to analyze common failure modes and patterns in model performance\n\nWork directly with our social media team to post interesting findings and results\n\nAdd new models and maintain integrations in our model library\n\nHelp improve and maintain the infrastructure we use to run benchmarks (agentic and non‑agentic).\n\nCollaborate closely with our research team on the creation of new benchmarks\n\nThis role follows the rhythm of model releases. Expect intense sprints in the days following a major launch, and calmer stretches in between releases.\n\nRequirements\n\nFamiliarity with LLMs: You should already be familiar with the space: the current leading models, their relative performance, and how to use large language models in practice.\n\nStrong engineering fundamentals: You can build and ship quickly with high quality. You should have a track record of building things of significant scope (at jobs, side projects, open source, etc.).\n\nPython expertise: Significant experience in Python, especially in a professional setting.\n\nTeam collaboration: Experience working in development sprints, Git workflows, and pull request reviews.\n\nLocation: We are an in‑person team based in San Francisco. We will support your relocation or transportation as needed.\n\nNice‑to‑Haves\n\nPrevious experience with benchmarking large language models, or creating benchmarks\n\nPrevious experience working at a startup or starting your own company\n\nTechnical writing experience and ability\n\nMachine learning research experience\n\nWhat We Offer\n\nHighly competitive salary and meaningful ownership. 
Excellence is well rewarded.\n\nRelocation and transportation support\n\nHealth/dental insurance coverage\n\nLunch and dinner provided, free snacks/coffee/drinks\n\n401K plan\n\nUnlimited PTO\n\nAbout Us\nFounding team: The core methodology behind this platform comes from NLP evaluation research we did at Stanford. We raised a $5M seed from some of the top institutional and angel investors in the valley. Our team has prior work experience at NVIDIA, Meta, Microsoft, Palantir and HRT. Collectively, we have over 300 citations in our published work. Our early team includes Stanford PhDs, ex‑Jane Street quants, and the first designer at Snorkel.\n\nTech stack: We use Python for most things at Vals. Our platform is built on Django, with a React frontend. All of the infra is on AWS using CDK for IaC.\n\nWhat We're Looking For\n\nLearning velocity: The role encompasses a wide variety of tasks. Rather than expecting you to be an expert on Day 1, we are looking for someone who can learn new skills and technologies extremely quickly.\n\nOwnership: Working in a small, talent‑dense team, we expect everyone to show initiative to build where it's needed, not where it's asked. We strive for autonomy over consensus. This is especially true for this role.\n\nIntensity: The LLM landscape is constantly changing. Foundation model labs are continuously pushing the frontier. The unicorn companies that will emerge from this technology shift are being built now. Those that win will have an incredibly high speed of execution.\n\nSolution‑oriented mindset: We're looking for people who see opportunities to craft solutions at each juncture, not those who pass hard problems to others or admit defeat.\n\nFurther Reading:\n\nHugging Face blog on evaluation\n\nAnthropic’s blog on challenges in evaluation\n\nNew York Times article on issues in benchmarking\n\nStanford HAI report showing hallucinations in legal tech tools\n\nReferral Bonus\nKnow someone who would be a good fit? 
Connect them with rayan@vals.ai. If we hire them and they stay on for 90 days you’ll get a $10,000 referral bonus and Vals AI merch! Please mention the bonus in your email.","datePosted":"2026-06-20T03:36:06.586Z","dateModified":"2026-06-20T03:36:06.586Z","hiringOrganization":{"@type":"Organization","name":"Petsapp","sameAs":"https://jobsearcher.com"},"jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Millbrae","addressRegion":"CA","addressCountry":"US"}},"identifier":{"@type":"PropertyValue","name":"JobSearcher","value":"7e8ffef4cdd824be60cebbb0"},"url":"https://jobsearcher.com/jobs/7e8ffef4cdd824be60cebbb0"}}