MLOps/LLMOps Engineer (LLM, DevOps, Cloud SME)
San Francisco, Bay Area, CA
Duration: Six months, may extend to 12 months
Must be in the Greater Bay Area, or in California
Domain: Utilities
MUST be a US Citizen or GC holder

Operationalizing Large Language Models requires specialized expertise beyond traditional MLOps practices. LLMs present unique operational challenges including significantly larger computational requirements, complex data pipelines, specialized infrastructure needs, and unique performance optimization requirements. This specialized role ensures GenAI solutions can scale effectively from proof-of-concept to enterprise-wide deployment in a utility environment.

- Ensures GenAI solutions move successfully from prototype to production with proper operational support
- Establishes specialized monitoring for model performance, inference latency, and data quality
- Enables efficient scaling of LLM solutions across multiple business units
- Creates high-performance deployment architectures that balance speed, cost, and reliability
- Develops operational data pipelines to continuously improve model performance with new utility-specific data

Key Responsibilities:
- Design and implement LLM-specific deployment architectures with Docker containers for both batch and real-time inference
- Configure GPU infrastructure on-premises or in the cloud with appropriate CI/CD pipelines for model updates
- Build comprehensive monitoring and observability systems with appropriate logging, metrics, and alerts
- Implement load balancing and scaling solutions for LLM inference, including model sharding if necessary
- Create automated workflows for model retraining, versioning, and deployment
- Optimize infrastructure costs through intelligent resource allocation, spot instances, and efficient compute strategies
- Collaborate with client's Cyber team on implementing appropriate security controls for GenAI applications
- Develop automated testing frameworks to ensure consistent output quality across model updates

Expected Skillset:
- DevOps + ML: Expertise in Kubernetes, Docker, CI/CD tools, and MLflow or similar platforms
- Cloud & Infrastructure: Understanding of GPU instance options, cloud services (AWS/Azure/GCP), and optimization techniques
- Automation: Proficiency in Python, Bash, and infrastructure-as-code tools like Terraform or Ansible
- LLM-Specific Frameworks: Experience with tools like TensorBoard, MLflow, or equivalent for scaling LLMs
- Performance Optimization: Knowledge of techniques to monitor and improve inference speed, throughput, and cost
- Collaboration: Ability to work effectively across technical teams while adhering to enterprise architecture standards