JOBSEARCHER

HPC Infrastructure & Scheduler Engineer – PBS, Linux, Python, Bash - Remote, Europe

A large global organisation are looking for an HPC Infrastructure & Scheduler Engineer who can design, build, and operate infrastructure integrations around HPC job schedulers, primarily PBS Professional (PBS Pro/OpenPBS). This role sits at the intersection of systems engineering, automation, and platform integration.

This will be an initial 3-6 months+ contract with the opportunity for extensions.

The role can be worked 100% remotely with one possible trip required to either Stockholm or Gothenburg.

Key Skills & Responsibilities:
- Strong Linux systems engineering (RHEL/Rocky/SLES)
- Deep experience with HPC schedulers: PBS Professional, Torque, Slurm, or similar
- Scripting and automation:
  - Python, Bash (required)
  - Go or Rust (nice to have)
- Experience with distributed systems and cluster operations

HPC-Specific Experience:
- MPI workloads (OpenMPI, MPICH)
- GPU scheduling (NVIDIA stack, MIG/MPS concepts)
- Parallel file systems (Lustre strongly preferred)
- Understanding of job scheduling concepts: queues, priorities, backfill, fairshare, reservations

Infrastructure & Integration:
- Configuration management (Ansible, Puppet, or similar)
- CI/CD pipelines for infrastructure
- APIs and service integration patterns
- Exposure to cloud platforms and hybrid HPC models

- Built custom PBS hooks or scheduler extensions in production
- Designed hybrid HPC + Kubernetes or cloud bursting architectures
- Solved real scaling problems (10k+ cores, multi-petabyte storage)
- Experience with security/compliance in HPC environments (STIGs, NIST, etc.)
- Strong debugging instincts across system layers (network → storage → scheduler)

Scheduler Integration & Automation:
- Build and maintain integrations with PBS Professional and/or OpenPBS
- Develop hooks, prolog/epilog scripts, and custom scheduling logic
- Automate job lifecycle workflows (submission → execution → teardown)
- Extend scheduler capabilities via APIs, CLI tooling, and event-driven systems

Infrastructure Engineering:
- Design and manage HPC environments (bare metal, VM, hybrid cloud)
- Integrate scheduler with:
  - High-performance storage (Lustre, NFS, object stores)
  - Networking (InfiniBand, Ethernet fabrics)
  - Identity systems (LDAP, Kerberos, RBAC)
- Optimize node provisioning, boot workflows, and image management

Platform Integrations:
- Bridge HPC schedulers with modern platforms:
  - Kubernetes (e.g., batch offload, hybrid scheduling)
  - MLOps stacks (e.g., ClearML, Kubeflow)
  - Cloud bursting workflows (AWS, Azure, GCP)
- Build tooling for data locality, environment parity, and job portability