Machine Learning Solutions Engineer (ML + Infrastructure Focus)
Who We Are

Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production with less friction.

Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in.

We serve solo researchers, startups, and large enterprises. Lightning AI operates globally with offices in New York City, San Francisco, Seattle, and London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.

Our Values

- Move Fast: We act with speed and precision, breaking down big challenges into achievable steps.
- Focus: We complete one goal at a time with care, collaborating as a team to deliver features with precision.
- Balance: Sustained performance comes from rest and recovery. We ensure a healthy work-life balance to keep you at your best.
- Craftsmanship: Innovation through excellence. Every detail matters, and we take pride in mastering our craft.
- Minimal: Simplicity drives our innovation. We eliminate complexity through discipline and focus on what truly matters.

What We're Looking For

Lightning is looking for a Machine Learning Solutions Engineer with a focus on ML and Infrastructure to join our Sales team in New York. As a Machine Learning Solutions Engineer, you will operate at the intersection of machine learning, distributed systems, and cloud infrastructure.
You will partner with customers to design and deploy end-to-end AI systems, spanning:

- Model development and training
- GPU infrastructure and cluster design
- Distributed inference and production deployment

This role goes beyond traditional ML solutions engineering: you will act as a technical architect, helping customers make critical decisions across compute, orchestration, and system design.

The role is hybrid out of our New York City office hub, with an in-office requirement of at least 3 days per week and occasional team and company offsites. We are not able to provide visa sponsorship for this role at this time.

What You’ll Do

Customer Architecture & Technical Leadership

- Partner with customers to understand ML workloads, infrastructure constraints, and scaling requirements
- Architect end-to-end solutions across:
  - Data pipelines (CPU → GPU workflows)
  - Distributed training (multi-node, multi-GPU)
  - High-throughput inference systems
- Translate business goals (latency, cost, throughput) into technical system design decisions

GPU & Infrastructure Design

- Design and optimize workloads across GPU clusters (H100, H200, B200, etc.)
- Advise on:
  - Training vs inference cluster design
  - Interconnect choices (Ethernet vs InfiniBand / RDMA vs RoCE)
  - Storage strategies (local NVMe vs networked / object storage)
- Model and optimize for:
  - Tokens/sec, tokens/$
  - Throughput vs latency tradeoffs
  - GPU utilization and scheduling efficiency

Kubernetes & Platform Systems

- Design and support deployments on Kubernetes (EKS, GKE, on-prem clusters)
- Work with:
  - GPU scheduling (time-slicing, MIG, bin-packing)
  - Autoscaling and workload orchestration
  - Helm-based deployments and multi-tenant environments
- Help customers balance:
  - Raw Kubernetes flexibility vs platform abstraction (Lightning)

Demos, POCs, and Execution

- Build and deliver technical demos and POCs that showcase:
  - Distributed training workflows
  - Scalable inference endpoints
  - End-to-end ML pipelines on Lightning AI
- Scope and lead POCs aligned to customer success metrics (latency, cost, reliability)

Cross-Functional Impact

- Act as the bridge between customers, product, and engineering
- Provide feedback on:
  - Platform gaps in infrastructure, orchestration, and performance
  - Emerging patterns in GPU usage and distributed systems
- Influence roadmap across ML workflows and infrastructure capabilities

Enablement & Thought Leadership

- Create technical content:
  - Architecture guides (e.g., high-throughput LLM inference systems)
  - Best practices for GPU utilization and scaling
- Educate customers on modern AI infrastructure patterns

What You’ll Need

ML + Systems Expertise

- 3–6+ years of experience in:
  - Machine Learning / AI Engineering
  - Solutions Engineering / Sales Engineering / ML Consulting
- Strong understanding of:
  - Training vs inference workloads
  - Model optimization (quantization, batching, caching, etc.)

GPU & Distributed Systems

- Experience working with:
  - GPU clusters (NVIDIA stack preferred)
  - Distributed training or inference systems
- Familiarity with:
  - NCCL, CUDA, or GPU performance profiling
  - Networking concepts (RDMA, RoCE, InfiniBand, high-throughput systems)

Kubernetes & Cloud Platforms

- Hands-on experience with:
  - Kubernetes (EKS, GKE, or on-prem)
  - Slurm
  - Containerization (Docker)
- Exposure to:
  - GPU scheduling in Kubernetes environments
  - Multi-tenant or production ML deployments

Programming & Tooling

- Strong Python skills (PyTorch preferred)
- Experience building:
  - ML pipelines
  - APIs or inference services
- Familiarity with Lightning AI, PyTorch Lightning, or similar frameworks is a plus

Customer-Facing Excellence

- Ability to:
  - Explain complex infrastructure and ML tradeoffs clearly
  - Run technical discovery and uncover quantifiable success metrics
- Experience working cross-functionally with:
  - Sales, product, and engineering teams

Compensation

The annual base pay range for this role is $150,000 - $195,000, in addition to a variable pay component and meaningful equity.

Benefits And Perks

We offer a comprehensive and competitive benefits package designed to support our employees’ health, well-being, and long-term success.
Benefits may vary by location, team, and role.

Benefits Include

- Comprehensive medical, dental and vision coverage (U.S.); Private medical and dental insurance (U.K.)
- Retirement and financial wellness support (U.S.); Pension contribution (U.K.)
- Generous paid time off, plus holidays
- Paid parental leave
- Professional development support
- Wellness and work-from-home stipends
- Flexible work environment

At Lightning AI, we are committed to fostering an inclusive and diverse workplace. We believe that diverse teams drive innovation and create better products. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other protected characteristic. We are dedicated to building a culture where everyone can thrive and contribute to their fullest potential.