HPC / GPU Engineer
HPC / GPU Cluster Engineer - European - REMOTE

We’re working with a market-leading Neo-Cloud company specialising in AI infrastructure and GPU engineering. Operating globally and expanding at pace, they are redefining how high-performance compute is delivered at scale.

As part of continued growth, they are strengthening their HPC team and are looking for an experienced HPC / GPU Cluster Engineer to help design, operate, and optimise large-scale GPU environments used by AI and machine learning workloads worldwide.

This is a highly technical, hands-on engineering role focused on GPU cluster engineering, performance optimisation, and reliability. You’ll work closely with platform, SRE, and AI teams to ensure GPU infrastructure performs reliably and efficiently at scale.

If you’re passionate about GPU performance, scalable infrastructure, and HPC engineering, this is a rare opportunity to work at the forefront of AI infrastructure.

Key Responsibilities
- Design, deploy, and operate large-scale GPU clusters
- Manage and optimise GPU scheduling and resource utilisation
- Performance benchmarking and tuning for AI / HPC workloads
- Work with SLURM, Kubernetes, and hybrid orchestration environments
- Troubleshoot performance, scalability, and reliability issues
- Contribute to SRE practices: monitoring, automation, incident response
- Improve cluster observability, resilience, and operational tooling

Skills & Requirements
- Strong experience with GPU cluster engineering at scale
- Hands-on knowledge of SLURM and/or Kubernetes for GPU workloads
- Experience with performance benchmarking, profiling, and optimisation
- Comfortable operating production systems in high-availability environments
- Familiar with SRE principles (monitoring, alerting, automation, reliability)
- Solid Linux background and strong scripting skills (Python, Bash, etc.)
- Experience supporting AI, ML, or HPC workloads in production

Nice to Have
- Experience with multi-tenant GPU platforms or cloud-scale environments
- Knowledge of NVIDIA GPU ecosystems (CUDA, NCCL, drivers, firmware)
- Exposure to InfiniBand, high-speed networking, or distributed storage