AIOps
Role: Technical Lead – AI Operations & Support
Location: Remote (Quarterly on-site requirement in Dallas, TX)
Duration: One-year contract, open-ended

Overview

We are seeking a Technical Lead to oversee our AI Operations (AIOps) and AI Support function. This individual will serve as a senior technical liaison between architecture teams and internal engineering groups, providing strategic thought leadership, hands-on technical guidance, and operational oversight for AI systems in production.

The Technical Lead will be responsible for ensuring the reliability, performance, and observability of machine learning, generative AI, and agentic systems. This includes monitoring model health, detecting drift, assessing groundedness and hallucinations, and ensuring production-ready operational standards across AI platforms.

Key Responsibilities

- Act as the technical authority for AI Operations and Support, overseeing monitoring, alerting, and incident response for AI systems in production
- Partner closely with architecture and engineering teams to guide AI systems from development through production deployment and ongoing operational support
- Lead model monitoring efforts, including detection of data drift, performance degradation, hallucinations, and groundedness issues
- Support both traditional ML models and Generative AI solutions, ensuring best practices for observability and operational readiness
- Monitor and support generative AI applications such as chatbots, RAG/grounded solutions, and internally developed AI products
- Evaluate and support the rollout of Amazon Quick (formerly Quick Suite) to enable self-service development through a digital workspace
- Lead operational readiness for the launch and scale of AWS AgentCore, including monitoring agent task performance, success rates, latency, reasoning steps, and semantic evaluations
- Oversee a growing portfolio of production AI models, each with established alerting and support processes
- Serve as an escalation point for production issues, with a strong focus on diagnosing data pipeline and data source failures
- Provide technical leadership over a dedicated support team handling day-to-day ticketing and incidents
- Communicate effectively with both engineering teams and executive leadership, clearly articulating system health, incidents, root causes, and remediation plans

Technical Environment

- Cloud-native architecture built on AWS
- Heavy usage of AWS infrastructure services, CloudWatch, and AWS AI platforms
- Generative AI frameworks including LangChain (primary) and limited use of CrewAI
- AI services such as Amazon Bedrock, Amazon Quick, and AWS AgentCore

Requirements

- Deep AWS experience and architectural expertise, including core compute, storage, networking, IAM, CloudWatch, and AI services (Bedrock, Quick/Quick Suite; AgentCore is a plus)
- Hands-on AIOps and production support experience across ML, Generative AI, and agent-based systems
- Strong understanding of model monitoring fundamentals, including data drift, performance degradation, alerting strategies, hallucination detection, and agent performance metrics
- Experience supporting Generative AI systems in production, including chatbots and RAG/grounded architectures
- Expertise in agent and workflow monitoring, including task success rates, latency, reasoning steps, and semantic evaluation
- Proven ability to provide technical leadership, mentor teams, and drive operational best practices
- Excellent communication skills, with the ability to engage both deeply technical stakeholders and executive leadership

Nice to Have

- AWS certification(s)
- Early experience with AWS AgentCore