On-prem Platform Engineer

LLM Inference & Optimization
· vLLM, TensorRT-LLM, Triton Inference Server, SGLang
· Inference optimization techniques:
  · Continuous batching
  · Speculative decoding
  · KV cache / Prefix caching
· Model optimization: FP8, AWQ, GPTQ

Distributed & GPU Systems
· Tensor parallelism and large model scaling
· CUDA, NCCL, GPU architecture
· GPU partitioning & optimization (MIG)

Kubernetes & ML Serving
· Kubernetes-based ML serving platforms
· KServe, OpenShift AI
· Helm charts, Operators, platform automation

GPU Orchestration
· Run:ai or similar GPU scheduling/orchestration platforms
· Multi-tenant GPU workload management

Platform Engineering
· Experience building internal AI/ML platforms (on-prem or hybrid)
· Strong automation and system design mindset

Observability & Performance
· Prometheus, Grafana
· ML observability (model latency, throughput, drift, resource utilization)
· Performance benchmarking and tuning

Good to Have / Preferred Skills
· Experience with LLMOps / GenAI pipelines
· Exposure to hybrid cloud (on-prem + GCP/Azure integration)
· Familiarity with Inferentia / alternative accelerators
· Knowledge of service mesh / networking in GPU clusters

· Build, configure, and operate on-prem Kubernetes/OpenShift AI platforms for deploying and serving GenAI models and LLM inference workloads.
· Design and optimize high-performance inference stacks using vLLM, TensorRT-LLM, Triton Inference Server, SGLang, and advanced techniques (continuous batching, speculative decoding, KV caching).
· Manage GPU orchestration and capacity using Run:ai, MIG, CUDA/NCCL, and tensor parallelism to maximize utilization and throughput.
· Deploy and operate Kubernetes ML serving frameworks (KServe, Helm, Operators) for scalable, reliable model serving.
· Drive inference optimization and benchmarking, leveraging FP8, AWQ, GPTQ, and performance tools such as GuideLLM and Locust.
· Implement observability and ML monitoring using Prometheus, Grafana, Arize AI, ensuring SLA/SLO compliance for GenAI services.
· Collaborate with ML and research teams to onboard new models, tune inference performance, and productionize GenAI use cases.