Principal Cluster Engineer, Training Infrastructure
Imagine a future where everyone has instant, low-cost access to intelligence. We’re building a fully featured European AI cloud - with everything one needs to train, experiment with, and deploy AI models at scale. Our GPUs run on 100% renewable energy, powering a more sustainable AI ecosystem.

We’re ambitious, curious, and gutsy doers. Low hierarchy, high ownership, and a big vision. After raising $64M in Series A, we’re only getting started.

Join Verda while it’s still being built - not once it’s finished.

About the role

We’re looking for a Principal Cluster Engineer to own and evolve our InfiniBand-connected GPU training infrastructure. This is a highly technical role focused on building and operating large-scale AI and HPC clusters that power the next generation of machine learning workloads.

You will work closely with ML researchers, cloud platform teams, datacenter operations, and procurement to ensure Verda’s GPU infrastructure is fast, reliable, and ready to support cutting-edge training workloads.

In this role you will architect and operate large-scale InfiniBand fabrics, push storage and compute performance to their limits, build automation and observability tooling, and help define the technical and operational standards the team works to.

You’ll play a key role in translating customer and product requirements into real infrastructure capabilities, ensuring clusters are designed for performance, reliability, and scale.

Why Verda

- Competitive cash and equity package, plus benefits (healthcare, lunch, wellbeing, etc.)
- Profitable operations with rapid, sustained growth
- A genuine once-in-a-lifetime opportunity to join one of Finland’s few true explosive growth stories, shaping a category-defining AI cloud from the ground up
- Work alongside world-class engineers, researchers, and partners across the global AI ecosystem
- A small, high-performing team of around 70 people representing 27 nationalities

Practicalities

- Location: Remote - EU
- Start Date: As soon as possible
- Contract Type: Full-time
- Working Language: English

Your responsibilities

- Design, deploy, and continuously improve large-scale InfiniBand-connected GPU training clusters
- Drive cluster-level storage performance, translating customer SLAs into internal throughput and IOPS performance targets
- Build and maintain automation for cluster provisioning, OS imaging, firmware management, and day-two operations using Python
- Contribute to infrastructure-as-code and CI/CD pipelines for cluster and platform management
- Establish and own performance baselines across compute, network fabric, and storage layers
- Identify, diagnose, and resolve performance bottlenecks across the full cluster stack
- Implement and maintain observability tooling including metrics, alerting, and anomaly detection systems
- Work closely with datacenter operations, cloud platform teams, ML researchers, and procurement to translate requirements into infrastructure architecture
- Participate in the on-call rotation and help maintain production reliability of the training clusters

Your key competencies

- 7+ years of hands-on infrastructure or systems engineering experience
- Experience operating large-scale HPC or AI training clusters (1000+ GPU nodes)
- Strong production experience with InfiniBand fabrics
- Experience working with NVIDIA GPU hardware in training workloads (Hopper or newer preferred)
- Proven experience leading or tech-leading engineering teams, setting technical direction, reviewing work, and mentoring engineers
- Experience with automation and scripting (Python preferred)
- Experience working with infrastructure-as-code tools such as Terraform, Ansible, or Salt

Nice-to-haves

- Experience with the NVIDIA HPC software stack or UFM
- Knowledge of NCCL and debugging distributed GPU training workloads
- Experience tuning Linux kernels or using eBPF for performance optimization in HPC environments

Success criteria for this role in the next 6-12 months

- Optimized production AI/HPC clusters with measurable improvements in reliability, performance, and job success rates
- Implemented automation and tooling that significantly reduces operational overhead and speeds up incident resolution
- Established strong operational practices for monitoring, alerting, capacity planning, and incident management
- Built strong collaboration with datacenter operations, ML researchers, and cloud platform teams to translate workload requirements into infrastructure improvements
- Mentored engineers and helped build deeper internal expertise in GPU cluster operations and performance engineering

What the process looks like

- Introduction chat with the TA Partner (45 mins): Learn more about Verda and share your career aspirations.
- Conversation with the CTO (30 mins): A focused discussion with our CTO to explore technical vision, infrastructure strategy, and how your experience aligns with the future of Verda’s AI platform.
- Technical interview with the team (60 mins): Learn about the role and its requirements, dive deeper into your expertise, and discuss technical challenges.
- Final interview (45 mins): Meet with our COO for a culture-fit conversation.

What’s next

Apply sooner rather than later. This job ad will be removed when we’ve found the right person.

Please submit your application through our Careers page. We don’t accept applications sent by email.