Member of Technical Staff, ML Infrastructure & Inference
Overview
We are a cutting-edge AI infrastructure company building a scalable cloud platform designed for next-generation machine learning workloads ($80M Series A).

As AI systems continue to grow in complexity, traditional infrastructure models are facing limitations in efficiency, scalability, and cost. The platform addresses these challenges through a hardware-agnostic architecture that dynamically maps workloads across diverse accelerator environments, enabling higher utilization and better performance across multi-vendor systems.

The company is also developing production-grade infrastructure for agentic AI applications, allowing customers to deploy and manage workloads through simple APIs without handling low-level optimization or hardware orchestration.

The Role
The team is seeking a Member of Technical Staff focused on ML systems and inference infrastructure.

In this role, you will build and optimize large-scale inference systems that serve modern AI models efficiently in production environments.
You'll work across runtime behavior, scheduling, memory management, and system optimization to improve latency, throughput, and scalability.

This opportunity is well suited to engineers who understand how modern models execute at scale and enjoy solving deep performance challenges across the inference stack.

Responsibilities
- Design and optimize end-to-end inference pipelines from request intake through response generation
- Build scalable inference runtimes optimized for latency, throughput, and concurrency
- Improve batching, scheduling, and queueing strategies under real-world production workloads
- Develop efficient KV cache allocation, reuse, and eviction strategies
- Optimize prefill and decode execution paths, including attention and memory performance
- Debug and profile bottlenecks across models, runtimes, and distributed systems
- Partner with compiler, kernel, networking, and infrastructure teams to improve system-wide performance

Required Qualifications
- Strong software engineering and systems fundamentals
- Experience building or operating ML inference or model serving systems
- Understanding of runtime performance, memory usage, and system behavior under load

Preferred Qualifications
- Experience with inference frameworks such as TensorRT-LLM, vLLM, or custom serving infrastructure
- Strong understanding of transformer architectures and attention mechanisms
- Experience with batching, scheduling, and concurrency optimization in inference systems
- Familiarity with KV cache management and memory placement strategies
- Experience tuning latency- and throughput-sensitive systems
- Strong programming skills in Python and C++

Based onsite in SF

Keywords: ML Systems, Inference Infrastructure, LLM Inference, Model Serving, Distributed Systems, GPU Infrastructure, AI Infrastructure, Inference Runtime, TensorRT-LLM, vLLM, Transformer Architecture, Attention Mechanisms, KV Cache, Memory Optimization, Latency Optimization, Throughput Optimization, Concurrency Control, Batching, Scheduling Systems, Runtime Optimization, Performance Profiling, Scalable Inference, Distributed Inference, CUDA, PyTorch