Inference Engineer (Alameda)

We're partnered with an AI infrastructure company building next-generation systems for large-scale AI workloads.

Their platform is rethinking how inference runs at scale: intelligently orchestrating workloads across heterogeneous hardware to unlock major gains in performance, efficiency, and cost. The team is solving some of the hardest problems in modern AI infrastructure: inference scheduling, KV cache management, runtime optimization, memory efficiency, and low-latency serving across distributed systems.

They're looking for engineers who care deeply about how models execute in production: not just training models, but making them fast, scalable, and reliable under real-world load.

What You'll Work On
- Designing and optimizing large-scale inference pipelines
- Improving latency, throughput, and concurrency under production workloads
- Building inference runtimes and serving infrastructure
- Optimizing batching, scheduling, and request orchestration
- Managing KV cache allocation, reuse, placement, and eviction strategies
- Improving prefill/decode performance and memory efficiency
- Profiling bottlenecks across model, runtime, and distributed system layers
- Collaborating closely with compiler, kernel, and systems engineers

What They're Looking For
- Strong systems engineering fundamentals
- Experience building or scaling ML inference / model serving systems
- Deep understanding of performance optimization and memory behavior
- Experience with runtimes such as vLLM, TensorRT-LLM, or custom serving infrastructure
- Strong understanding of transformer architectures and attention mechanisms
- Familiarity with batching, scheduling, concurrency, and cache management
- Strong Python and/or C++ engineering skills

Why Join
- Work on cutting-edge inference infrastructure and AI systems problems
- Build systems designed for next-generation AI scale
- Small, highly technical engineering team
- Significant ownership and technical impact
- Opportunity to shape foundational infrastructure for future AI workloads