JOBSEARCHER

Software Engineer, High Performance Computing

ARCHIVED
Eventual
Millbrae, CA
June 25th, 2026


Your Role

As a Systems Engineer on the Dataloading team, you'll build the layer that turns multi-petabyte video corpora into dict[str, Tensor], already on the GPU, at line rate. We work with the top labs training Physical AI on the newest generation of hardware (H100, B200, GB200, NVL72, with Vera Rubin on the horizon), on billions of dollars' worth of compute, in collaboration with partners that are the largest public AI companies on Earth. Our job is to keep those GPUs fed: rank-aware sampling, NVMe caching, video and sensor co-loading, random access into clips, decode pipelining. Streaming alone can already saturate a B200; the hard part is enabling the complex sampling patterns researchers actually need without giving up a single percentage point of MFU.

This is a systems engineering role for someone who feels physical pain when a system is slow. You won't need GPU experience on day one; we'll uplevel you on NVL72, CUDA, and SLURM. We will need you to bring real expertise on what happens between NVMe, network, memory, and CPU, and a deep instinct for where bytes go.

Key Responsibilities

- Design and build the video-native dataloader: rank-aware, NVMe-cached, with random access into clips, returning tensors directly to the GPU.
- Profile and optimize the full data path: object store → NVMe → page cache → host RAM → device RAM. Eliminate every avoidable copy and stall.
- Saturate the latest hardware (B200, GB200, NVL72) on real customer training jobs. Push toward Vera Rubin bandwidth requirements.
- Own performance benchmarks against customer baselines (custom DataLoaders, DALI, decord, LeRobot) and against our own historical numbers, so that regressions get caught at PR time.
- Partner with researchers at our partner labs to land the loader in their training stack and measure MFU end-to-end.
- Work cross-team with Storage Infrastructure on the index/format boundary and with Visual Understanding on the model-output ingestion path.

What We Look For

- Obsession with systems-level performance. You can recite Jeff Dean's "numbers every programmer should know" in your sleep. You eat flamegraphs for breakfast.
- Strong opinions on io_uring: love it or hate it, you've earned the opinion.
- You live and breathe Rust, C++, or C. You reach for them when it matters and you know why.
- Strong familiarity with operating systems: page cache, scheduling, syscalls, NUMA, memory hierarchies.
- A sense for where bytes actually go: NVMe vs. memory vs. network vs. PCIe vs. NVLink, and the throughput and latency budgets of each.

Nice To Have

- Experience working with GPUs is a plus, but you don't need it on day one.
- Experience working with SLURM, Kubernetes for GPU workloads, or other HPC schedulers.
- Hands-on CUDA experience.
- Deep expertise in memory and caching subsystems: page cache tuning, hugepages, NUMA pinning, GPU-Direct Storage.
- Experience with video decode pipelines (PyAV, decord, NVDEC) or PyTorch DataLoader internals.
- Contributions to open-source systems projects in Rust/C++.

Perks & Benefits

- In-person, tight-knit team, 4 days/week in our SF Mission office.
- Competitive comp and meaningful startup equity.
- Catered lunches and dinners for SF employees.
- Commuter benefit.
- Team-building events and poker nights.
- Health, vision, and dental coverage.
- Flexible PTO.
- Latest Apple equipment.
- 401(k) plan with match.

If slow systems evoke emotional pain for you and you want to spend the next few years making the most expensive GPU clusters on the planet earn their keep, we'd love to talk.