On-Prem Cloud Engineer

Title: Cloud Engineer
Location: Brevard, Charlotte
Experience: 5 to 8 yrs
Employment Type: W2, C2C
Must Have: Arize AI, Claude Cowork, GCP, Terraform

Technical Skills Required:
vLLM, TensorRT-LLM, Triton Inference Server, SGLang, Inference Optimization, Continuous Batching, Speculative Decoding, KV Cache / Prefix Caching, FP8/AWQ/GPTQ, Tensor Parallelism, Kubernetes ML Serving, KServe, OpenShift AI, Helm/Operators, GPU Orchestration, Run:AI, Performance Benchmarking, CUDA/NCCL/MIG, Prometheus/Grafana, ML Observability, GuideLLM, Locust.

Responsibilities:
Build, configure, and operate on-prem Kubernetes/OpenShift AI platforms for deploying and serving GenAI models and LLM inference workloads.
Design and optimize high-performance inference stacks using vLLM, TensorRT-LLM, Triton Inference Server, SGLang, and advanced techniques (continuous batching, speculative decoding, KV caching).
Manage GPU orchestration and capacity using Run:AI, MIG, CUDA/NCCL, and tensor parallelism to maximize utilization and throughput.
Deploy and operate Kubernetes ML serving frameworks (KServe, Helm, Operators) for scalable, reliable model serving.
Drive inference optimization and benchmarking, leveraging FP8, AWQ, GPTQ, and performance tools such as GuideLLM and Locust.
Implement observability and ML monitoring using Prometheus, Grafana, and Arize AI, ensuring SLA/SLO compliance for GenAI services.
Collaborate with ML and research teams to onboard new models, tune inference performance, and productionize GenAI use cases.