Forward Deployed Engineer

Judgment Labs builds infrastructure for Agent Behavior Monitoring (ABM). While traditional observability focuses on logging exceptions and latency, our ABM surfaces behavioral anomalies such as instruction drift and context retrieval loss in scaled production environments.

Hundreds of teams building autonomous agents rely on Judgment to understand how their systems are behaving post-deployment. Instead of reactive incident triage, they cluster patterns across conversations and workflows, correlate regressions to specific interaction types, and pinpoint where reliability breaks down in their usage context.

We've raised $30M+ across two rounds in the past five months. Our investors include Lightspeed, SV Angel, Valor Equity Partners, Nova Global, Chris Manning, Michael Ovitz, Michael Abbott, Cory Levy, Kevin Hartz, and others.

The Role:

Forward Deployed Engineers at Judgment Labs instrument our ABM infrastructure directly into customer production systems. You act as a trusted partner in agent reliability: working inside live codebases, analyzing traces from real-world usage, and diagnosing failures in running environments while integrating monitoring and evaluation into mission-critical agent workflows. This is deep technical work: you need to move fast in unfamiliar stacks, form accurate hypotheses from incomplete data, and ship instrumentation that holds up under production load.

Most days look like this: you go on-site and instrument our SDK in a new customer's codebase in the morning, spend the afternoon analyzing trace data to surface failure clusters, and close out with a stakeholder check-in where you translate what you found into something the Head of AI can act on. You're running 2–3 of these deployments simultaneously, each at a different stage, each with a different team on the other side.
You define what "quality" means for each customer's domain, and then you make it measurable.

You'll be at the forefront of Judgment, interacting daily with enterprise customers alongside our GTM, product, and research teams: reasoning about agent behavior, translating high-level goals into concrete ABM deployments, and owning outcomes end-to-end across real production environments. The customers you'll work with are AI-native startups. Their engineers have opinions, their infra teams have constraints, and their ops and product leads want to know why Judgment matters to them specifically. You figure that out fast and make it land. The scope, autonomy, and 0→1 execution this role demands make it a proving ground for people who want to build or lead a technical company.

What You'll Do:

Tracing & Deployment
- Instrument Judgment's SDK across customer codebases, diagnosing integration issues across diverse and often unfamiliar agent architectures
- Manage 2–3 customer deployments simultaneously, owning the technical lifecycle from scoping to go-live
- Configure trace pipelines, set up span hierarchies, and ensure observability coverage is diagnostic, not just present

Evals, Behaviors & Judges
- Write domain-specific evals tailored to each customer's vertical, defining what a correct, safe, or high-quality agent response looks like for their use case
- Analyze trace data to surface failure patterns, cluster failure modes, and prioritize what gets measured
- Build and tune agent-as-a-judge models using real customer feedback, iterating until the judge reflects the customer's actual quality bar
- Design behaviors and scorers that work in production: low false positive rate, interpretable outputs, actionable signal

Customer & Deal Management
- Own the customer relationship through the full deployment lifecycle: technical scoping, milestone communication, escalation handling, and renewal
- Interface across the full customer org: engineers who want to understand the SDK, infra teams with real deployment constraints, and product and ops leads who need to understand why Judgment makes their job faster
- Educate customers on what Judgment is and how it fits into their stack, not with a sales pitch but by showing them what their agent is actually doing and what they should be monitoring
- Translate customer needs into product feedback for research and engineering
- Go deep on eval design in one conversation and explain a precision/recall tradeoff to a non-technical exec in the next

What Great Looks Like:

Customers deploy in days, not weeks. They realize technical value from our tracing and behaviors immediately. You work closely with them and visit on-site when it matters until they have a clear picture of how their agents fail.

Their judges reflect their domain. You accurately capture the failure modes that matter for their industry, form factor, and criticality. You keep iterating until the eval signal is well-calibrated and the customer trusts it.

Their team understands their failure patterns. Engineers and stakeholders are aligned on how and why their agents fail, and they have alerting to know when this happens in production.

They expand. More eval volume, more engineers using Judgment as part of their standard workflow, ABM embedded across the full product development lifecycle. Judgment becomes an essential part of how they build and operate AI.

Who You Are:

You're comfortable with technical peers who push back. Your customers are AI-native: their engineers have built agents, read the papers, and will ask hard questions. You earn credibility through depth, drawing on insights from our research team on why certain methods do or don't work.

You can run multiple things at once without dropping them. 2–3 active deployments means 2–3 different teams, codebases, and timelines in your head simultaneously. You've figured out how to juggle competing priorities without letting anything slip.

You think in systems.
When an agent misbehaves, your instinctive curiosity kicks in. You don't just flag it; you evaluate why and come back with a theory.

You have a commercial instinct. You understand that a customer who doesn't understand their failure modes doesn't renew. You can speak to both technical and non-technical stakeholders without breaking a sweat, and without coming off as salesy.

You're genuinely curious about this space. You have opinions about what makes a good eval, why LLM-as-a-judge is hard, and where agent observability is going. Those opinions came from doing the work, not reading about it.

Why Judgment?

Agents can't work without this. Today's agents hallucinate, drift, and break in production. We're building the infrastructure that fixes this: the monitoring layer that makes agents self-improving.

We're wired to win. We're a team of fewer than 20, but we ship like 50+ on the daily. You'll be working with olympiad medalists, debate champions, and competitive athletes who bring that same intensity to company building.

Fast track to founding. Our engineers interface directly with customers, ship code into their environments, and use their feedback to dictate what's next on the roadmap. Everyone on the team is either an ex-founder or a founder-to-be.

We make sure our people do their best work. If you deserve a spot on the team, money will never get in the way of it. Full benefits, Equinox, and a private chef to take care of you. We sprint hard but we play hard; ask us about our Smash/Mario Kart tournaments.