{"schemaVersion":"jobsearcher.job.v1","id":"713ffb70ca87a5fca882f724","url":"https://jobsearcher.com/jobs/713ffb70ca87a5fca882f724","canonicalUrl":"https://jobsearcher.com/jobs/713ffb70ca87a5fca882f724","title":"Machine Learning Engineer (LLM inference)","description":"MLE (LLM inference)\r\nAbout Us\r\nGMI Cloud is a fast-growing AI infrastructure company backed by Headline VC and one of only six cloud providers worldwide to earn NVIDIA’s prestigious Reference Platform Cloud Partner designation. We operate 8 of our own GPU clusters across the U.S. and Asia, delivering a full spectrum of services from GPU compute service to AI model inference API solutions. As an NVIDIA Reference Platform Cloud Partner, our infrastructure meets the highest standards for performance, security, and scalability in AI deployments. We empower AI startups and enterprises to “build AI without limits,” providing everything they need to prototype, train, and deploy AI models quickly and reliably.\r\nAbout this role\r\nWe are hiring a Machine Learning Engineer, LLM Optimization to build a world-leading inference optimization team and make GMI Cloud the industry benchmark for LLM serving performance.\r\nThis role is for engineers who want to work at the frontier of AI systems. You will drive the research, validation, and productionization of the most advanced inference optimization techniques, and turn them into real competitive advantage across GMI’s inference platform.\r\nOur goal is to make GMI the company that leads the industry in how fast we discover, evaluate, combine, and operationalize the best optimization strategies for real customer workloads. 
That means not only adopting the latest advances, but also defining best practices, developing our own optimization methodologies, and building the internal framework that keeps GMI ahead of the curve.\r\nYou will focus on B200-first optimization, with support for H200 evolution, across core domains including quantization, speculative decoding, KV cache and memory management, prefill/decode disaggregation, and system-level inference optimization. You will work closely with platform and infrastructure teams to transform cutting-edge ideas into measurable gains in latency, throughput, cost efficiency, and production scalability.\r\nKey Responsibilities\r\nDrive frontier research and engineering in LLM inference optimization, building GMI’s industry-leading capabilities in performance, efficiency, and scalability.\r\nDevelop next-generation optimization strategies for large-scale LLM serving across model execution, runtime systems, and production inference platforms.\r\nAdvance state-of-the-art techniques in quantization and precision optimization to improve throughput, latency, memory efficiency, and cost-performance across modern GPU systems.\r\nPush the frontier of speculative decoding and related acceleration methods, including both systems and model-level approaches for faster generation.\r\nLead innovation in KV cache and memory optimization, improving long-context serving efficiency, memory utilization, and multi-tenant performance.\r\nDevelop advanced architectures for prefill/decode disaggregation and other distributed inference optimization strategies for large-scale production environments.\r\nDrive system-level optimization across scheduling, batching, routing, gateway orchestration, adapter serving, and end-to-end inference efficiency.\r\nBuild scalable optimization frameworks, performance methodologies, and engineering practices that allow GMI to stay ahead of the industry as models, hardware, and serving patterns evolve.\r\nTurn cutting-edge optimization ideas 
into production-ready capabilities that improve real-world customer workloads across latency, throughput, quality, and cost.\r\nCollaborate closely with platform, infrastructure, and product teams to make inference optimization a core technical advantage of GMI Cloud.\r\nRequired Skills\r\nStrong hands-on experience with LLM inference systems and performance optimization.\r\nSolid understanding of inference metrics and tradeoffs, including TTFT, ITL, throughput, goodput, tail latency, GPU utilization, memory efficiency, and quality/cost tradeoffs.\r\nExperience with one or more modern serving stacks such as SGLang, vLLM, TensorRT-LLM, Triton, or similar systems.\r\nDeep familiarity with GPU-based inference, model serving architecture, and production bottlenecks around compute, memory bandwidth, KV-cache behavior, and scheduling.\r\nStrong experimentation skills: able to design benchmarks, interpret results, debug regressions, and produce actionable conclusions rather than isolated microbenchmark wins.\r\nComfortable working across research-style validation and production engineering, with a bias toward measurable impact in real customer scenarios.\r\nStrong coding and systems skills in Python, with practical experience in profiling, observability, and performance debugging.\r\nClear communication skills and the ability to explain technical tradeoffs to both engineers and cross-functional stakeholders.\r\nPreferred Qualifications\r\n1+ years of hands-on experience in LLM inference optimization, ML systems optimization, or closely related areas.\r\nExperience working on optimization for large-scale model serving, such as latency reduction, throughput improvement, memory efficiency, or cost-performance tuning.\r\nFamiliarity with one or more major areas of inference optimization, including quantization, speculative decoding, KV cache optimization, prefill/decode disaggregation, or system-level serving optimization.\r\nExperience with modern LLM serving stacks, GPU 
inference systems, or production ML infrastructure is a strong plus.","company":"Gmi Cloud","rawCompany":"gmi cloud","city":"Mountain View","state":"CA","isRemote":false,"isActive":false,"createdAt":"2026-05-10T01:30:49.998Z","occupations":[{"code":"15-1299.08","title":"Computer Systems Engineers/Architects","slug":"computer-systems-engineers-architects"},{"code":"15-1252.00","title":"Software Developers","slug":"software-developers"},{"code":"15-1221.00","title":"Computer and Information Research Scientists","slug":"computer-and-information-research-scientists"}],"industries":[{"code":"518210","title":"Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services","slug":"computing-infrastructure-providers-data-processing-web-hosting-and-related-services"},{"code":"513210","title":"Software Publishers","slug":"software-publishers"},{"code":"541512","title":"Computer Systems Design Services","slug":"computer-systems-design-services"}],"jobPosting":{"@context":"https://schema.org","@type":"JobPosting","title":"Machine Learning Engineer (LLM inference)","description":"MLE (LLM inference)\r\nAbout Us\r\nGMI Cloud is a fast-growing AI infrastructure company backed by Headline VC and one of only six cloud providers worldwide to earn NVIDIA’s prestigious Reference Platform Cloud Partner designation. We operate 8 of our own GPU clusters across the U.S. and Asia, delivering a full spectrum of services from GPU compute service to AI model inference API solutions. As an NVIDIA Reference Platform Cloud Partner, our infrastructure meets the highest standards for performance, security, and scalability in AI deployments. 
We empower AI startups and enterprises to “build AI without limits,” providing everything they need to prototype, train, and deploy AI models quickly and reliably.\r\nAbout this role\r\nWe are hiring a Machine Learning Engineer, LLM Optimization to build a world-leading inference optimization team and make GMI Cloud the industry benchmark for LLM serving performance.\r\nThis role is for engineers who want to work at the frontier of AI systems. You will drive the research, validation, and productionization of the most advanced inference optimization techniques, and turn them into real competitive advantage across GMI’s inference platform.\r\nOur goal is to make GMI the company that leads the industry in how fast we discover, evaluate, combine, and operationalize the best optimization strategies for real customer workloads. That means not only adopting the latest advances, but also defining best practices, developing our own optimization methodologies, and building the internal framework that keeps GMI ahead of the curve.\r\nYou will focus on B200-first optimization, with support for H200 evolution, across core domains including quantization, speculative decoding, KV cache and memory management, prefill/decode disaggregation, and system-level inference optimization. 
You will work closely with platform and infrastructure teams to transform cutting-edge ideas into measurable gains in latency, throughput, cost efficiency, and production scalability.\r\nKey Responsibilities\r\nDrive frontier research and engineering in LLM inference optimization, building GMI’s industry-leading capabilities in performance, efficiency, and scalability.\r\nDevelop next-generation optimization strategies for large-scale LLM serving across model execution, runtime systems, and production inference platforms.\r\nAdvance state-of-the-art techniques in quantization and precision optimization to improve throughput, latency, memory efficiency, and cost-performance across modern GPU systems.\r\nPush the frontier of speculative decoding and related acceleration methods, including both systems and model-level approaches for faster generation.\r\nLead innovation in KV cache and memory optimization, improving long-context serving efficiency, memory utilization, and multi-tenant performance.\r\nDevelop advanced architectures for prefill/decode disaggregation and other distributed inference optimization strategies for large-scale production environments.\r\nDrive system-level optimization across scheduling, batching, routing, gateway orchestration, adapter serving, and end-to-end inference efficiency.\r\nBuild scalable optimization frameworks, performance methodologies, and engineering practices that allow GMI to stay ahead of the industry as models, hardware, and serving patterns evolve.\r\nTurn cutting-edge optimization ideas into production-ready capabilities that improve real-world customer workloads across latency, throughput, quality, and cost.\r\nCollaborate closely with platform, infrastructure, and product teams to make inference optimization a core technical advantage of GMI Cloud.\r\nRequired Skills\r\nStrong hands-on experience with LLM inference systems and performance optimization.\r\nSolid understanding of inference metrics and tradeoffs, including TTFT, ITL, 
throughput, goodput, tail latency, GPU utilization, memory efficiency, and quality/cost tradeoffs.\r\nExperience with one or more modern serving stacks such as SGLang, vLLM, TensorRT-LLM, Triton, or similar systems.\r\nDeep familiarity with GPU-based inference, model serving architecture, and production bottlenecks around compute, memory bandwidth, KV-cache behavior, and scheduling.\r\nStrong experimentation skills: able to design benchmarks, interpret results, debug regressions, and produce actionable conclusions rather than isolated microbenchmark wins.\r\nComfortable working across research-style validation and production engineering, with a bias toward measurable impact in real customer scenarios.\r\nStrong coding and systems skills in Python, with practical experience in profiling, observability, and performance debugging.\r\nClear communication skills and the ability to explain technical tradeoffs to both engineers and cross-functional stakeholders.\r\nPreferred Qualifications\r\n1+ years of hands-on experience in LLM inference optimization, ML systems optimization, or closely related areas.\r\nExperience working on optimization for large-scale model serving, such as latency reduction, throughput improvement, memory efficiency, or cost-performance tuning.\r\nFamiliarity with one or more major areas of inference optimization, including quantization, speculative decoding, KV cache optimization, prefill/decode disaggregation, or system-level serving optimization.\r\nExperience with modern LLM serving stacks, GPU inference systems, or production ML infrastructure is a strong plus.","datePosted":"2026-05-10T01:30:49.998Z","dateModified":"2026-05-10T01:30:49.998Z","hiringOrganization":{"@type":"Organization","name":"Gmi Cloud","sameAs":"https://jobsearcher.com"},"jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Mountain 
View","addressRegion":"CA","addressCountry":"US"}},"identifier":{"@type":"PropertyValue","name":"JobSearcher","value":"713ffb70ca87a5fca882f724"},"url":"https://jobsearcher.com/jobs/713ffb70ca87a5fca882f724"}}