Site Reliability Engineer (Santa Clara)
Location: Santa Clara, CA (onsite from day 1)
Duration: 1+ yr

We're looking for a seasoned SRE to support the multifaceted and fast-paced Infrastructure, Planning and Processes organization, where you will work as a Senior Site Reliability Engineer (Contract). You will join a fast-paced crew building and scaling a self-service compute platform that lets engineering teams provision Kubernetes clusters, workloads and VMs on demand across public cloud and on-prem Kubernetes. You will own the platform end to end: control plane, reliability and uptime, runtime, and deploy pipelines.

The team works with various other business units within Software, such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence and Driverless Cars, to cater to their infrastructure and systems needs.

As an SRE, you'll also work with other teams, such as software engineering, to deploy these new products and manage our infrastructure and its associated processes and systems. Keen attention to detail, problem-solving abilities, and a solid knowledge base are essential.

What you'll be doing:
- Design and operate a multi-cluster Kubernetes platform that provisions machines, workloads, and cloud instances on demand, including the controllers, CRDs, and ingress/DNS/TLS automation behind them.
- Build and harden the platform's microservices: CI/CD, SSO, RBAC, secret encryption, and real-time monitoring workflows.
- Integrate AI tooling into workloads, and build agents and tools that help SRE teams scale and operate efficiently.
- Own the production release path: Helm-driven deployments, multi-arch container builds, staged rollouts, and clean rollback playbooks.
- Instrument the platform with audit logging, usage analytics, and automation that lets the SRE team support a large user base.

What we need to see:
- 6+ years of DevOps/SRE experience operating production Kubernetes, either on-premises or in the cloud, with depth in CRDs, operators, ingress, and cluster networking.
- Experience integrating AI tools with workflows.
- Strong Python or Go and an understanding of TypeScript/React; comfortable moving across backend services, frontend UX, and infrastructure-as-code.
- Production experience with cloud provisioning (AWS or equivalent), identity federation (OIDC/SAML), and secret management.
- Solid grounding in relational databases, caching layers, and async networking patterns (SSH tunnels, WebSockets, message queues).
- BS/MS in CS or equivalent, with a track record of shipping internal developer platforms and CI/CD pipelines.

Ways to stand out from the crowd:
- Prior work on Linux, multi-tenant environments, virtual machines, Kubernetes administration, and orchestration.
- Deep experience in agentic workflows, skills, and tooling such as CLIs and MCPs.
- Comfort building AI-assisted tooling on top of platform telemetry: automated runbooks, anomaly detection, or LLM-driven ops workflows.
- Knowledge and prior usage of CI tools like Jenkins/GitLab CI, CD tools like Argo or Flux, and monitoring tools like Prometheus/Grafana, VictoriaMetrics, Datadog, Splunk or Kibana.
- Strong proponent of documentation and root-causing issues: you leave behind runbooks and docs that let the next engineer ship on day one.