Inference Performance Engineer

Midjourney · Alameda, CA · May 10th, 2026


About Midjourney

Midjourney is an independent research lab exploring new mediums of thought and expanding the imaginative powers of the human species. We are a small, self-funded team focused on design, human infrastructure, and AI. Our compute-to-headcount ratio is among the highest in the world, which means you will have more immediate influence over global AI infrastructure here than at any Big Tech firm.

About the Role

As Inference Performance Engineer, you own the performance and execution health of the entire inference pipeline, and the toolchain that lets the rest of the team move quickly on top of it. You have immediate influence over how the product feels, what it costs to run, and how fast research can move.

In this role, you join a small team that is core to our business: you take research proofs of concept and turn them into systems that are scalable, reliable, and performant in production, and you keep us ahead of an evolving landscape of new accelerators, serving libraries, and model architectures.

You will benchmark every layer of the stack, qualify new hardware and serving libraries for production, and own the debugging tools that let us quickly identify and fix regressions. Performance gains directly reduce user-perceived latency, lower compute costs, and increase research iteration speed. We are aiming for sub-second image generation, enabling real-time generative experiences.

The work involves complex systems engineering: extracting high performance from diverse accelerator families, qualifying new hardware and libraries, and building custom instrumentation for generative pipelines. We don't tie ourselves to a single vendor, and our scale and market influence mean vendors regularly bring us new accelerator architectures early.
Our work is at the cutting edge of the industry, and you get to sharpen it.

What You'll Do

- Own end-to-end performance benchmarking of the inference pipeline: latency, throughput, batch behaviour, accelerator utilisation, memory pressure, and tail behaviour under load.
- Qualify new accelerators and new serving libraries for production readiness.
- Profile, find, and fix performance regressions across the stack: kernels, schedulers, batching logic, networking, and orchestration.
- Design microbenchmarks that let us continuously track the performance of the codebase.
- Shorten the loop between "research has a candidate model" and "we know what it costs to serve at scale," so model decisions can be made with real numbers.
- Pair with the rest of the inference team on production debugging when latency or efficiency regressions hit.

What We're Looking For

- An inference performance engineer who is a problem-solver at their core, with strong performance instincts and a deep ability to analyze flame graphs, profiles, and traces.
- Hands-on experience with GPU inference: kernel-level profiling, batching strategies, memory layout, attention math, and quantisation tradeoffs.
- Experience writing CUDA or Triton kernels, including custom kernels (bonus points for kernels on exotic hardware!).
- Deep familiarity with GPU-based model serving (PyTorch, JAX, TensorRT, or equivalent custom stacks).
- Experience qualifying new hardware or library versions in a production setting, including writing the benchmarks that decide go or no-go.
- Self-directed and proactive. You write production-ready code for the team; tooling must be robust enough for shared use, not just local development.
- Comfort with fast-moving codebases. You leave them more measurable than you found them.
- You take care of the people around you, not just the code. Caring about humanity starts with caring about your teammates. We hire engineers who actively help the people next to them grow.

We care more about what you have built than where you have built it. Great people at Midjourney have started in high school just as often as they have come from notable companies and universities. If you feel like you’ve got most of the skills we’re looking for but are worried that it might not be enough, reach out anyway. Let’s chat.

Nice to Have

- Experience with diffusion model inference, multi-stage model pipelines, or multimodal serving.
- Distributed inference experience (tensor parallelism, pipeline parallelism, sharded weights, distributed KV cache).
- Contributions to open-source ML inference projects.

Why Midjourney

- Speed of a startup, freedom and resources of a research lab. No investors, no quarterly reporting cycle, no committees deciding the roadmap.
- Tiny team, large ambitions. Decisions show up in production within days.
- We move at the speed of thought. Iterate fast, isolate variables, ship.
- Hardware-agnostic by design. Inference architecture choices are real architecture choices, not procurement choices.
- Your work directly determines how the product feels and how much it costs to run. On a team this small, those effects show up immediately and at scale.
- Flexible location. London or San Francisco preferred for time-zone overlap, but strong remote candidates are welcome.

*The wage range for this role takes into account the wide range of factors that are considered in making compensation decisions, including but not limited to skill sets, experience and training, certifications and licensure, location, and other business or organizational needs.

Equal Opportunity Employer

We provide and promote equal opportunity in employment, compensation, and other terms and conditions of employment without discrimination because of race, color, creed, religion, national origin, ancestry, citizenship status, sex or gender, gender identity or gender expression (including transgender status), sexual orientation, marital status, military service and veteran status, physical or mental disability, family medical history, genetic information or other protected medical condition, political affiliation, or any other characteristic protected by and in accordance with applicable laws.
