Software Engineer
We’re looking for an Inference Engineer to design and optimize the systems that power our models in production.

This role sits at the intersection of:
- ML systems
- distributed systems
- hardware-aware performance engineering

You’ll take cutting-edge models and make them fast, scalable, and efficient in real-world environments.

What You’ll Work On

Inference Systems & Serving
- Design and build low-latency inference pipelines for large multimodal models
- Implement advanced serving techniques such as:
  - continuous batching
  - KV cache optimization
- Work with modern inference frameworks (e.g. vLLM, SGLang, TensorRT-LLM, Triton)

Performance Optimization
- Optimize inference across:
  - model level (quantization, architecture-aware tuning)
  - hardware level (GPU / accelerator utilization, kernel optimization)
- Improve latency, throughput, and cost efficiency for production systems
- Profile and debug bottlenecks using tools like Nsight, nsys, or similar

Distributed & Real-Time Systems
- Build high-throughput, distributed inference infrastructure
- Design systems for real-time workloads with strict latency constraints
- Optimize multi-GPU / multi-node inference using:
  - tensor parallelism
  - pipeline parallelism
  - distributed scheduling

Infrastructure & Observability
- Develop robust monitoring, benchmarking, and evaluation systems
- Track metrics such as:
  - GPU utilization
- Build tooling to support rapid iteration and production reliability

Research → Production
- Work closely with research teams to productionize new model architectures
- Translate experimental ideas into high-performance serving systems
- Contribute to the design of next-generation inference stacks

Why This Role
- Work on cutting-edge AI systems that go beyond current model limitations
- Solve hard systems problems at the core of how modern AI runs
- Join a team that values:
  - speed
  - ownership
  - technical excellence

Compensation & Benefits
- Competitive salary + equity
- Full medical, dental, and vision coverage
- In-office meals and a highly collaborative environment

How to Apply

If you’re excited about building high-performance inference systems and pushing the limits of real-time AI, we’d love to hear from you.