HPC Engineer

About Periodic Labs

The most important scientific discoveries of our time won’t happen in a traditional lab. We’re an AI and physical sciences company building state-of-the-art models to accelerate breakthroughs across materials, energy, and beyond. Backed by world-class investors and growing rapidly, we operate at the pace the frontier requires. Our team brings deep expertise, genuine ownership, and an insatiable drive to push the boundaries of what’s scientifically possible.

About The Role

As HPC Engineer at Periodic Labs, you will design, build, and operate the high-performance computing infrastructure that powers our AI and scientific research. Our models demand extreme compute at scale: large GPU and CPU clusters, high-speed interconnects, low-latency parallel storage, and workload schedulers that make every cycle count. You will work directly with researchers and infrastructure engineers to ensure our compute environment is fast, reliable, and optimized for scientific discovery at the frontier.

This is a deeply hands-on role. You will architect and tune systems, automate provisioning, diagnose performance bottlenecks, and design for resilience at scale. You’ll partner with research and ML teams to understand their workloads and shape an HPC environment that removes friction and accelerates science.

What You’ll Do

- Design, deploy, and operate large-scale GPU and CPU clusters for AI training, scientific simulation, and research workloads
- Manage and optimize high-speed interconnect fabrics (InfiniBand, RoCE) and parallel filesystems (Lustre, GPFS, WEKA, or equivalent) for maximum throughput and minimum latency
- Own workload scheduling and resource management using Slurm, Kubernetes, or similar systems, tuning for throughput, fairness, and researcher productivity
- Implement and maintain automated cluster provisioning, configuration management, and lifecycle tooling using Ansible, Terraform, or custom orchestration
- Monitor cluster health, performance, and utilization; build dashboards and alerting to proactively identify and resolve bottlenecks
- Partner with research and ML engineering teams to profile workloads, diagnose performance issues, and tune hardware and software stacks for specific computational demands
- Design and implement backup, disaster recovery, and fault-tolerance strategies for research data and compute infrastructure
- Evaluate and integrate new hardware (GPUs, accelerators, networking) and software technologies as the field evolves
- Establish standards and runbooks for HPC operations, capacity planning, and incident response
- Collaborate with security and infrastructure teams to implement access controls, network segmentation, and compliance controls appropriate for a research environment

You Will Thrive in This Role If You Have

- Experience designing and operating large-scale HPC or GPU clusters in research, cloud, or enterprise environments
- Deep knowledge of high-speed interconnects such as InfiniBand (HDR/NDR) or RoCE, including fabric management, tuning, and troubleshooting
- Hands-on experience with parallel and distributed storage systems (Lustre, GPFS, WEKA, BeeGFS, or similar): configuration, performance tuning, and capacity management
- Experience with workload managers and schedulers such as Slurm, PBS Pro, LSF, or Kubernetes-based HPC orchestration
- Linux systems administration at scale, including kernel tuning, NUMA optimization, CPU and memory affinity, and GPU driver management
- Infrastructure automation using Ansible, Terraform, or equivalent; you treat infrastructure as code
- Experience with GPU computing environments including CUDA, NCCL, MPI, and multi-node distributed training or simulation setups
- Performance profiling, benchmarking, and tuning of computational workloads across CPU, GPU, memory, network, and storage
- Experience with monitoring and observability tooling (Prometheus, Grafana, or equivalent) in large, heterogeneous compute environments
- Ability to collaborate with researchers or data scientists to understand workload requirements and translate them into infrastructure decisions

Especially Strong Candidates May Also Have

- Experience operating GPU clusters for large-scale AI or ML training workloads such as multi-node transformer training
- Familiarity with AI accelerators beyond GPUs, such as TPUs, Trainium, or custom ASIC environments
- Experience in mixed on-prem and cloud HPC environments, including burst-to-cloud or hybrid scheduling patterns
- Background in scientific computing domains such as computational chemistry, physics simulation, or bioinformatics
- Experience with containerized HPC environments (Singularity/Apptainer, Docker, or container-aware schedulers)
- Knowledge of network security, access control, and compliance requirements for regulated research data
- Contributions to open-source HPC tooling or published work on HPC system design or performance

Mechanics

- Minimum education: Bachelor’s degree or an equivalent combination of education and training or experience
- Location: Our lab is located in Menlo Park and we prefer folks to be located in Menlo Park or San Francisco but can be flexible based on role
- Compensation: The annual base compensation range for this role is $350,000-$450,000
- Visa sponsorship: Yes, we sponsor visas and will do everything we can to assist in this process with our legal support.

We’re building a team of the world’s best: the scientists, engineers, and problem-solvers who don’t just follow the frontier, they define it. If you’re driven to bring AI to life in the physical world and make discoveries that have never been made before, you belong here.