DevOps Architect

Our client is seeking a highly experienced, hands-on DevOps Architect to lead the design, scalability, and reliability of mission-critical cloud infrastructure. This role supports a strategic initiative to build and scale a centralized AI platform that will enable high-volume, real-time workflows across the organization.

This is a high-impact, high-visibility role focused on building resilient, production-grade systems where uptime, performance, and reliability are critical. This position combines architecture, consulting, and hands-on engineering. The ideal candidate brings deep experience designing and operating distributed systems at scale, with a strong focus on reliability and performance in production environments.

Key Responsibilities

- Architect and scale highly available, fault-tolerant cloud infrastructure
- Design systems to support high-throughput, low-latency workloads
- Build resilient architectures with failover, redundancy, and disaster recovery strategies
- Establish and improve observability, monitoring, and alerting capabilities
- Develop and mature performance and load testing frameworks
- Optimize auto-scaling and resource management strategies
- Design for failure, including mitigation of external service dependencies
- Partner with engineering teams to improve deployment reliability and operational excellence
- Drive best practices across DevOps and Site Reliability Engineering (SRE)

Required Qualifications

- 8+ years of experience building and operating large-scale distributed systems
- Deep expertise in cloud architecture and services (compute, storage, networking)
- Strong background in SRE principles, including scalability, reliability, and performance engineering
- Proven experience supporting production-critical systems with high uptime requirements
- Hands-on coding experience (e.g., Python or similar)
- Experience with containerized environments and modern CI/CD pipelines
- Strong understanding of system bottlenecks, scaling patterns, and failure modes

Preferred Qualifications

- Experience with AI/ML or large-scale data platforms
- Familiarity with modern AI infrastructure or inference platforms
- Experience with observability and monitoring tools
- Knowledge of cloud security and data protection best practices
- Exposure to multi-region or globally distributed architectures

Key Challenges

- Scaling platforms from early-stage to enterprise-wide adoption
- Designing around third-party service limitations and dependencies
- Advancing system maturity across testing, monitoring, and scaling capabilities
- Ensuring high availability for business-critical, real-time workflows