{"schemaVersion":"jobsearcher.job.v1","id":"9a2a6e2dbd59feed4127649f","url":"https://jobsearcher.com/jobs/9a2a6e2dbd59feed4127649f","canonicalUrl":"https://jobsearcher.com/jobs/9a2a6e2dbd59feed4127649f","title":"Engineer, Inference & Model serving","description":"Job Description: ML Model Serving Engineer. Want to build the layer that actually makes AI usable in real time? You'll join a team focused on inference, where performance is the product. This is about delivering low-latency, high-throughput systems across LLMs, speech, and vision models running in production, not offline experiments. They're building real-time AI systems that need to respond instantly, reliably, and at scale. That means solving hard problems around batching, GPU efficiency, memory constraints, and system-level bottlenecks that most teams never fully crack. You'll sit at the core of the platform, working across model serving, infrastructure, and performance optimisation. A big part of the role is pushing current tooling beyond its limits: extending frameworks, profiling bottlenecks, and designing systems that hold up under real-world load. This is not about training models. It's about making them fast, efficient, and production-ready. 
What you'll work on: building high-performance serving systems for LLM, speech, and vision models; scaling inference to production workloads with strict latency requirements; optimising GPU utilisation and execution efficiency; implementing techniques like continuous batching, KV cache optimisation, speculative decoding, and prefill/decode separation; improving frameworks such as vLLM, TensorRT-LLM, Triton, and SGLang; and profiling and debugging performance across GPU, memory, and system layers. What you'll bring: strong experience with ML inference or model serving systems; a deep understanding of latency and throughput optimisation in production; solid Python and PyTorch skills, plus a systems or performance engineering mindset; and familiarity with distributed systems and production infrastructure. Exposure to CUDA, GPU profiling tools, or systems like Kubernetes and Ray is useful, but the key is knowing how to make models run efficiently at scale. 
You'll join a highly technical team with experience across major AI labs and big tech. The environment is pragmatic, focused on solving real performance problems rather than abstract research. There's real ownership here. You'll help define how next-generation AI systems are served. Package: $220,000 - $320,000 base + equity. San Francisco, onsite 3 days per week. If you're interested in working on the part of AI that actually determines whether it works in the real world, this is worth exploring. All applicants will receive a response.","company":"Techire Ai","rawCompany":"techire ai","city":"Millbrae","state":"CA","isRemote":false,"isActive":false,"createdAt":"2026-06-02T05:37:32.242Z","occupations":[{"code":"15-1299.08","title":"Computer Systems Engineers/Architects","slug":"computer-systems-engineers-architects"},{"code":"15-1252.00","title":"Software Developers","slug":"software-developers"},{"code":"15-1221.00","title":"Computer and Information Research Scientists","slug":"computer-and-information-research-scientists"}],"industries":[{"code":"541512","title":"Computer Systems Design Services","slug":"computer-systems-design-services"},{"code":"518210","title":"Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services","slug":"computing-infrastructure-providers-data-processing-web-hosting-and-related-services"},{"code":"513210","title":"Software Publishers","slug":"software-publishers"}],"jobPosting":{"@context":"https://schema.org","@type":"JobPosting","title":"Engineer, Inference & Model serving","description":"Job Description: ML Model Serving Engineer. Want to build the layer that actually makes AI usable in real time? You'll join a team focused on inference, where performance is the product. This is about delivering low-latency, high-throughput systems across LLMs, speech, and vision models running in production, not offline experiments. They're building real-time AI systems that need to respond instantly, reliably, and at scale. 
That means solving hard problems around batching, GPU efficiency, memory constraints, and system-level bottlenecks that most teams never fully crack. You'll sit at the core of the platform, working across model serving, infrastructure, and performance optimisation. A big part of the role is pushing current tooling beyond its limits: extending frameworks, profiling bottlenecks, and designing systems that hold up under real-world load. This is not about training models. It's about making them fast, efficient, and production-ready. 
What you'll work on: building high-performance serving systems for LLM, speech, and vision models; scaling inference to production workloads with strict latency requirements; optimising GPU utilisation and execution efficiency; implementing techniques like continuous batching, KV cache optimisation, speculative decoding, and prefill/decode separation; improving frameworks such as vLLM, TensorRT-LLM, Triton, and SGLang; and profiling and debugging performance across GPU, memory, and system layers. What you'll bring: strong experience with ML inference or model serving systems; a deep understanding of latency and throughput optimisation in production; solid Python and PyTorch skills, plus a systems or performance engineering mindset; and familiarity with distributed systems and production infrastructure. Exposure to CUDA, GPU profiling tools, or systems like Kubernetes and Ray is useful, but the key is knowing how to make models run efficiently at scale. 
You'll join a highly technical team with experience across major AI labs and big tech. The environment is pragmatic, focused on solving real performance problems rather than abstract research. There's real ownership here. You'll help define how next-generation AI systems are served. Package: $220,000 - $320,000 base + equity. San Francisco, onsite 3 days per week. If you're interested in working on the part of AI that actually determines whether it works in the real world, this is worth exploring. All applicants will receive a response.","datePosted":"2026-06-02T05:37:32.242Z","dateModified":"2026-06-02T05:37:32.242Z","hiringOrganization":{"@type":"Organization","name":"Techire Ai","sameAs":"https://jobsearcher.com"},"jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Millbrae","addressRegion":"CA","addressCountry":"US"}},"identifier":{"@type":"PropertyValue","name":"JobSearcher","value":"9a2a6e2dbd59feed4127649f"},"url":"https://jobsearcher.com/jobs/9a2a6e2dbd59feed4127649f"}}