
Software Engineer

Techire. Millbrae, CA
April 19th, 2026

We’re looking for an Inference Engineer to design and optimize the systems that power our models in production.

This role sits at the intersection of:
- ML systems
- distributed systems
- hardware-aware performance engineering

You’ll take cutting-edge models and make them fast, scalable, and efficient in real-world environments.

What You’ll Work On

Inference Systems & Serving
- Design and build low-latency inference pipelines for large multimodal models
- Implement advanced serving techniques such as:
  - continuous batching
  - KV cache optimization
- Work with modern inference frameworks (e.g. vLLM, SGLang, TensorRT-LLM, Triton)

Performance Optimization
- Optimize inference across:
  - model level (quantization, architecture-aware tuning)
  - hardware level (GPU / accelerator utilization, kernel optimization)
- Improve latency, throughput, and cost efficiency for production systems
- Profile and debug bottlenecks using tools like Nsight, nsys, or similar

Distributed & Real-Time Systems
- Build high-throughput, distributed inference infrastructure
- Design systems for real-time workloads with strict latency constraints
- Optimize multi-GPU / multi-node inference using:
  - tensor parallelism
  - pipeline parallelism
  - distributed scheduling

Infrastructure & Observability
- Develop robust monitoring, benchmarking, and evaluation systems
- Track metrics such as:
  - GPU utilization
- Build tooling to support rapid iteration and production reliability

Research → Production
- Work closely with research teams to productionize new model architectures
- Translate experimental ideas into high-performance serving systems
- Contribute to the design of next-generation inference stacks

Why This Role
- Work on cutting-edge AI systems that go beyond current model limitations
- Solve hard systems problems at the core of how modern AI runs
- Join a team that values:
  - speed
  - ownership
  - technical excellence

Compensation & Benefits
- Competitive salary + equity
- Full medical, dental, and vision coverage
- In-office meals and a highly collaborative environment

How to Apply

If you’re excited about building high-performance inference systems and pushing the limits of real-time AI, we’d love to hear from you.