DevOps Architect
Our client is seeking a highly experienced, hands-on DevOps Architect to lead the design of mission-critical cloud infrastructure and ensure its scalability and reliability. This role supports a strategic initiative to build and scale a centralized AI platform that will enable high-volume, real-time workflows across the organization.
This is a high-impact, high-visibility role focused on building resilient, production-grade systems where uptime, performance, and reliability are critical.
This position combines architecture, consulting, and hands-on engineering. The ideal candidate brings deep experience designing and operating distributed systems at scale, with a strong focus on reliability and performance in production environments.
Key Responsibilities
Architect and scale highly available, fault-tolerant cloud infrastructure
Design systems to support high-throughput, low-latency workloads
Build resilient architectures with failover, redundancy, and disaster recovery strategies
Establish and improve observability, monitoring, and alerting capabilities
Develop and mature performance and load testing frameworks
Optimize auto-scaling and resource management strategies
Design for failure, including mitigating outages and degradation in external service dependencies
Partner with engineering teams to improve deployment reliability and operational excellence
Drive best practices across DevOps and Site Reliability Engineering (SRE)
Required Qualifications
8+ years of experience building and operating large-scale distributed systems
Deep expertise in cloud architecture and services (compute, storage, networking)
Strong background in SRE principles, including scalability, reliability, and performance engineering
Proven experience supporting production-critical systems with high uptime requirements
Hands-on coding experience (e.g., Python or similar)
Experience with containerized environments and modern CI/CD pipelines
Strong understanding of system bottlenecks, scaling patterns, and failure modes
Preferred Qualifications
Experience with AI/ML or large-scale data platforms
Familiarity with modern AI infrastructure or inference platforms
Experience with observability and monitoring tools
Knowledge of cloud security and data protection best practices
Exposure to multi-region or globally distributed architectures
Key Challenges
Scaling platforms from early-stage use to enterprise-wide adoption
Designing around third-party service limitations and dependencies
Advancing system maturity across testing, monitoring, and scaling capabilities
Ensuring high availability for business-critical, real-time workflows