ML Infrastructure Engineer
San Francisco, CA (On-Site M-F)

Our client is a fast-growing, Series B AI startup building the infrastructure layer that connects complex enterprise data with large language models. Backed by top-tier investors, they're processing data for customers ranging from startups to Fortune 10 enterprises.

About the Role

Own and scale the training and inference stack at a high-growth AI data processing company. You'll work closely with ML researchers to ensure models ship quickly and reliably, building everything from serving infrastructure to data pipelines, with the goal of making infrastructure a non-issue for the products being served. This is a high-impact IC role for a strong generalist who understands ML mechanics end-to-end and thrives in a fast-paced, founder-led environment.

Responsibilities

- Build and maintain model serving infrastructure, improving inference speed, monitoring, and reliability to ensure it never becomes a bottleneck for customers
- Set up and improve training infrastructure for models ranging from 300M to 30B parameters across 1-to-3 node environments
- Develop observability, logging, and monitoring systems across the full ML stack
- Build internal data pipelines and tooling that enable ML researchers to move faster from experiment to production
- Architect infrastructure to arbitrate inference across multiple cloud providers, optimizing for accuracy, latency, and cost

Qualifications

- 3+ years of experience in ML infrastructure engineering, with a focus on model serving and training infrastructure
- Hands-on experience with multi-node training and serving environments (1-to-3 node training; single-to-double node serving)
- Strong proficiency in Python and containerization/orchestration tools (Docker, Kubernetes, Helm)
- Deep familiarity with PyTorch and production ML deployment workflows
- Comfort operating as an IC in an early-stage, high-ownership environment

Required Skills

- Experience optimizing inference performance and cost across cloud providers
- Prior work in document processing, OCR, or unstructured data pipelines
- Track record building internal ML tooling that accelerates research-to-production cycles
- Experience at a fast-scaling AI startup or similarly high-intensity engineering environment