
Research Scientist

San Francisco (On-Site)

About the Role

We are looking for exceptional researchers and research engineers to design and build the next generation of AI benchmarks. You will create high-impact, challenging evaluations that push the boundaries of what we can measure in foundation models. This role is perfect for someone with deep research expertise who wants to see their work directly influence how the world evaluates AI systems.

You will lead the design and development of novel benchmarks that assess the real-world capabilities of LLMs. Our benchmark shapes how foundation models are developed and how generative AI applications are built. We work with all the major foundation model labs and with some of the largest financial institutions and hospital systems in the world. Our work has been featured by the Wall Street Journal, Washington Post, and Bloomberg.

We are building the standard for evaluating the ability of LLMs to perform real-world tasks. You will be at the forefront of defining what that standard looks like.

What You'll Do

- Design and develop novel, high-impact benchmarks that assess challenging real-world capabilities
- Conduct research to ensure our benchmarks are valid, reliable, and meaningful
- Collaborate with foundation model labs and enterprises to understand evaluation needs
- Analyze model performance across benchmarks and communicate findings
- Publish research findings and contribute to the broader evaluation research community
- Work closely with the infrastructure team to implement your benchmark designs at scale
- Stay current with the latest developments in LLM capabilities and evaluation methodologies

Requirements

- Advanced research experience: Master's degree or PhD in Computer Science, NLP, Machine Learning, or a related field. Undergraduates with very strong research backgrounds may also be considered.
- Publication track record: Published papers in reputable venues (NeurIPS, ICML, ACL, EMNLP, etc.) with a focus on NLP, ML evaluation, or benchmarking
- Research methodology: Strong understanding of experimental design, statistical analysis, and evaluation frameworks
- Technical skills: Proficiency in Python for research and experimentation
- Communication: Ability to clearly communicate complex research ideas to both technical and non-technical audiences
- Collaboration: Experience working in research teams and integrating feedback
- Portfolio: Demonstrated track record of impactful research work
- Location: We are an in-person team based in San Francisco. We will support your relocation or transportation as needed.

Nice to Haves

- Experience specifically in LLM evaluation or benchmarking research
- Familiarity with foundation model architectures and capabilities
- Experience working with industry partners or in applied research settings
- Background in areas like human-computer interaction, psychology, or domain-specific evaluation
- Experience at early-stage startups or research labs
- Contributions to open-source evaluation tools or datasets
