HPC Kubernetes Architect (AI / GPU Platforms)

Location: Dallas, TX (Hybrid – 3/2) | Relocation available
Type: Direct Hire
• $175K–$250K base + performance bonus
• 100% company-paid benefits

Overview

We are seeking an HPC Kubernetes Architect to lead the design and delivery of GPU-accelerated container platforms supporting next-generation AI, machine learning, and high-performance computing workloads.

This organization operates at the forefront of large-scale compute infrastructure, building platforms that power scientific research, advanced simulation, and data-intensive innovation. This role sits at the intersection of Kubernetes, HPC, and GPU infrastructure, driving architecture decisions that directly impact performance, scalability, and multi-tenant platform efficiency.

This is a customer-facing architecture role with ownership across the full solution lifecycle, from early discovery and requirements definition through design, proof-of-concept, deployment, and long-term optimization. You will serve as a trusted advisor to both internal stakeholders and external users, shaping how GPU-based Kubernetes platforms are built and scaled across complex environments.

Key Responsibilities

Architecture & Customer Engagement
• Serve as the primary architectural lead for GPU-accelerated Kubernetes platforms supporting HPC and AI/ML workloads
• Translate complex workload requirements into scalable, production-ready reference architectures
• Lead discovery sessions, technical design workshops, and performance benchmarking engagements
• Guide customers through platform adoption, integration, and long-term optimization strategies
• Present architectural solutions and act as a subject matter expert in Kubernetes-based HPC environments

Kubernetes & GPU Platform Engineering
• Design and optimize Kubernetes clusters for GPU-intensive workloads in on-prem and hybrid environments
• Implement and tune NVIDIA ecosystem components, including GPU Operator, DCGM, MIG, and device plugins
• Optimize GPU scheduling and utilization through Kubernetes extensions (Volcano, Slurm integration, scheduler plugins)
• Develop and extend Kubernetes operators and controllers (Go/Python) to automate infrastructure services

Infrastructure Integration (Compute, Storage, Network)
• Architect end-to-end platform integration across compute, storage, and networking layers
• Integrate high-performance storage solutions (Lustre, GPFS, Ceph, VAST) into Kubernetes environments
• Design and support high-performance networking (InfiniBand, RDMA, RoCE, NVLink) for distributed workloads
• Define multi-tenant architectures with strong isolation, security, and resource governance (RBAC, OPA/Gatekeeper)

Observability, Automation & Performance
• Implement observability frameworks using Prometheus, Grafana, DCGM Exporter, and OpenTelemetry
• Drive workload profiling, benchmarking, and performance tuning across distributed compute environments
• Support GitOps-based deployment models using ArgoCD, FluxCD, Helm, and Kustomize
• Partner with HPC and ML teams to validate performance and scalability at production scale

Ecosystem & Product Collaboration
• Collaborate with internal product, engineering, and operations teams to influence platform roadmap
• Engage with key ecosystem partners (NVIDIA, networking and storage vendors) to integrate emerging technologies
• Provide forward-looking guidance on GPU architectures, interconnect evolution, and orchestration trends

Required Experience
• Extensive experience designing and operating Kubernetes platforms in HPC or GPU-intensive environments
• Deep expertise with the NVIDIA GPU ecosystem (GPU Operator, MIG, DCGM, device plugins)
• Strong understanding of Kubernetes internals, including CRDs, RBAC, scheduling, and custom controllers
• Experience integrating distributed storage systems for high-performance workloads
• Strong knowledge of high-performance networking (InfiniBand, RDMA, RoCE) in containerized environments
• Proven ability to design scalable, secure, and highly available distributed compute platforms
• Proficiency in Go or Python for infrastructure automation or operator development
• Experience with workload benchmarking, profiling, and performance optimization
• Strong communication skills with the ability to translate complex technical concepts into actionable solutions

Preferred Experience
• Experience delivering end-to-end customer solutions from design through deployment and adoption
• Familiarity with HPC workload orchestration tools (Slurm, Kubernetes schedulers, Apptainer/Singularity)
• Exposure to GitOps and infrastructure-as-code practices in Kubernetes environments
• Contributions to open-source Kubernetes or GPU ecosystem projects
• Experience advising on long-term platform strategy and emerging technology adoption
• Relevant certifications such as CKA, CKAD, CKS, or cloud architecture certifications (AWS, Azure)

Why This Role
• High-impact role shaping next-generation AI and HPC infrastructure
• Direct influence on platform architecture, performance, and scalability at scale
• Strong visibility across engineering, product, and customer environments
• Backed by significant investment and long-term growth in AI compute platforms