Platform Engineer (AI/LLM Infrastructure)
Role: Platform Engineer (AI/LLM Infrastructure)
Location: Santa Clara, CA (3 days onsite per week)

Day to Day Job Duties:
- Lead the design, implementation, and operation of scalable infrastructure platforms supporting AI/LLM-based solutions for enterprise clients
- Act as a hands-on technical lead (player-coach), contributing to development while guiding a team of engineers
- Own end-to-end infrastructure architecture below the application layer, including compute, container orchestration, CI/CD, observability, and security
- Partner directly with clients and stakeholders to design, present, and deliver robust AI infrastructure solutions
- Architect and manage production-grade Kubernetes environments (AKS/EKS), including cluster operations and RBAC
- Design and operationalize RAG pipelines, including ingestion, chunking, embedding workflows, and vector database management
- Lead GPU infrastructure provisioning and optimization (NVIDIA A100/H100 or similar)
- Drive Infrastructure-as-Code adoption using Terraform and GitOps practices (ArgoCD/Flux)
- Build and maintain CI/CD pipelines using GitHub Actions and Azure DevOps
- Establish observability standards using Datadog, OpenTelemetry, and ELK/OpenSearch
- Lead incident response, on-call processes, and post-mortem analysis
- Ensure strong security posture and lead InfoSec review processes
- Coordinate delivery across multiple teams and client engagements

Basic Qualifications:
- 5–8 years of experience in Platform Engineering, SRE, or Infrastructure Engineering
- 3+ years of proven experience delivering and leading infrastructure for AI/LLM-based production systems
- Strong hands-on expertise in Kubernetes, Docker, and Helm
- 3+ years of experience with Terraform and GitOps (ArgoCD/Flux)
- 3+ years of experience with Azure (Key Vault, Monitor, DevOps Pipelines)
- 3+ years of experience leading client-facing technical engagements
- 3+ years of experience managing multiple concurrent projects or teams
- 3+ years of hands-on experience with incident management and SLA-driven environments
- 3+ years of experience leading security/InfoSec reviews
- Strong understanding of vector databases, RAG pipelines, and LLM inference systems
- 3+ years of experience with CI/CD and container registry management

Degree:
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.

Nice to Have (But Not Required):
- Experience with AWS in addition to Azure
- Familiarity with Azure API Management and AKS
- Experience with Pulumi (Python/TypeScript)
- Knowledge of NIM deployment and lifecycle management
- Python scripting for infrastructure automation
- Experience with load testing tools (k6, Locust, JMeter)
- Exposure to FinOps and cost optimization practices