Engineer, Platform Engineering - AI

Overview

Job Purpose

We are on a mission as a team. We are problem solvers and partners, always starting with our customers to solve their challenges and create opportunities. Our start-up roots keep us nimble, flexible, and moving fast. We take ownership and make decisions. We all work for one company and work together to drive growth across the business. We engage in robust debates to find the best path, and then we move forward as one team. We take pride in what we do, acting with integrity and passion, so that our customers can perform better. We are experts and enthusiasts - combining ever-expanding knowledge with leading technology to consistently deliver results, solutions and opportunities for our customers and stakeholders. Every day we work toward transforming global markets.

The AI Platform Engineer is responsible for the technical implementation, maintenance, and optimization of AI/ML infrastructure. This hands-on role focuses on GPU cluster deployment, container image management, platform tooling development, and deep technical troubleshooting. In addition, the engineer deploys and maintains AI-enabled workflow automation tools across LLM, MCP, and agentic capabilities, ensuring these systems operate efficiently and securely within a containerized architecture. This includes deploying and maintaining vector store infrastructure, implementing end-to-end RAG workflows, tuning agent memory systems, and hosting and managing MCP servers. The engineer also deploys and operates Agentic AI systems, including multi-agent orchestration frameworks and tool-use pipelines.

The engineer serves as a core technical contributor on the AI Platform Operations team, translating architectural decisions into working infrastructure and enabling advanced, automated workflows across the platform.

Responsibilities

- Deploy, configure, and maintain GPU clusters and associated infrastructure
- Design, build, and maintain the workflow automation platform that uses AI capabilities (LLM/MCP/Agentic capabilities)
- Manage NVIDIA driver versions, CUDA toolkits, and container runtimes
- Build and maintain approved container images with ML frameworks (PyTorch, TensorFlow, etc.)
- Implement monitoring, alerting, and observability for GPU infrastructure
- Deploy and maintain vector store infrastructure for RAG pipelines, agent memory, and semantic search
- Implement and maintain end-to-end RAG workflows, including document ingestion, chunking, embedding generation, and retrieval optimization
- Maintain and tune agent memory systems, including short-term context windows, long-term persistent memory stores, and episodic memory retrieval patterns
- Deploy, operate, and maintain Agentic AI systems, including multi-agent orchestration frameworks and tool-use pipelines
- Deploy, host, and maintain MCP servers within the containerized platform infrastructure
- Manage MCP server configurations, versioning, access controls, and integration with agentic workflows
- Monitor MCP server health, performance, and availability; respond to incidents and perform root cause analysis
- Develop automation and tooling to improve platform reliability and efficiency
- Provide L2/L3 technical support and vendor escalation for complex issues
- Implement security controls including network policies, RBAC, and secrets management
- Execute change requests and maintain technical documentation
- Respond to and assist in production operations in a 24/7 environment
- Provide technical analysis, resolve problems, and propose solutions
- Provide support to, and coordinate with, developers, operations staff, release engineers, and end-users
- Educate and mentor team members and operations staff
- Participate in a weekly on-call rotation for after-hours support

Knowledge and Experience

- 3+ years in infrastructure engineering, systems administration, or DevOps
- 3+ years of scripting and automation experience (Python, Ansible, GitOps)
- 3+ years of hands-on experience with Kubernetes in production
- 2+ years of experience with Linux administration
- Direct experience with GPU infrastructure (NVIDIA preferred)
- 1+ years of experience using CUDA
- 1+ years of experience using MCPs
- 1+ years of experience with vector databases and embedding infrastructure
- 1+ years of experience with RAG pipeline design and deployment
- 1+ years of experience with agent memory patterns (in-context, external stores, retrieval-augmented memory)
- 1+ years of experience with agentic AI systems using orchestration frameworks
- 1+ years of experience with semantic search, embedding models, and ANN search techniques
- 1+ years of experience working with workflow/orchestration automation tools
- Experience with enterprise monitoring and observability tools
- Ability to work in a service-oriented team environment
- Project management, organization, and time management skills
- Customer-focused and dedicated to the best possible user experience
- Ability to communicate effectively with both technical and business resources
- Fluent speaking, reading, and writing in English

Desired Knowledge and Experience

- 1+ years of experience with AI developer toolkits (NVIDIA drivers, CUDA, cuDNN, and NCCL)
- 1+ years of experience with Run:AI, NVIDIA AI Enterprise, or DGX systems
- 1+ years of experience with n8n
- 1+ years of experience with GitHub Actions

Intercontinental Exchange, Inc. is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to legally protected characteristics.