LLM Inference & GPU Systems Consultant
Job Title: LLM Inference & GPU Systems Consultant
Location: Charlotte, NC
Interview: Video Interview

Description:
We are seeking an AI Infrastructure Runtime Engineer to build and maintain large-scale on-prem LLM infrastructure. This is an enterprise private GenAI environment running on NVIDIA H200 GPU clusters and an OpenShift AI deployment ecosystem. You will manage production inference internally, including self-hosting open-source LLMs like Llama. We are focused exclusively on inference; this role involves no model training infrastructure or fine-tuning pipelines.

Key Responsibilities
NVIDIA GPU Runtime Optimization: Drive runtime efficiency across the token generation pipeline, specifically prefill/decode optimization and KV cache management.
Inference Serving: Deploy and manage inference engines, including vLLM and TensorRT-LLM.
Hardware Utilization: Tune GPU throughput, batching strategies, and latency. Manage workload orchestration using RunAI and Kubernetes GPU orchestration.
Model Lifecycle Management: Oversee the complete Hugging Face model lifecycle, including model onboarding, deployment, and retirement.
Platform Operations: Operate and maintain the OpenShift AI ecosystem as the primary container platform for GenAI workloads.

Required Qualifications
8+ years of experience working as an LLM Systems Engineer or AI Infrastructure Runtime Engineer.
8+ years of hands-on experience with NVIDIA H200 clusters and runtime optimization techniques (KV cache, prefill/decode).
Proficiency in OpenShift AI and GPU orchestration tools like RunAI.
Strong experience with modern inference frameworks, specifically vLLM and TensorRT-LLM.
Proven track record managing the Hugging Face deployment lifecycle.
Must be onsite at the client in Charlotte, NC at least 3 days/week.