Cloud DevOps Engineer
Occupations:
Computer Systems Engineers/Architects
Software Developers
Network and Computer Systems Administrators
Computer Systems Analysts
Information Security Engineers

Industries:
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services
Web Search Portals, Libraries, Archives, and Other Information Services
Software Publishers
Computer Systems Design and Related Services
Employment Services

Cloud Engineer – Observability and SRE (Grade 10)
Bay Area, CA (onsite role)
Max pay rate: $65/hr W2 + benefits
7-month initial duration

Position Summary
The Grade 10 Cloud Engineer within the Customer’s Cloud Collaboration Technology Group will play a key role in building and operating scalable observability and infrastructure platforms supporting Webex microservices. This role requires strong hands-on expertise in Kubernetes, cloud infrastructure, and observability systems, along with the ability to operate independently and to own components end-to-end in production environments. Candidates will demonstrate extensive use of generative AI tools for code generation and production system troubleshooting.

Key Responsibilities
• Design, develop, and operate observability platforms – to perform logging, metrics, and/or tracing – for Webex microservices.
• Manage and optimize Kubernetes clusters across multi-region environments.
• Own CI/CD pipelines using Argo CD and Helm.
• Implement infrastructure as code (IaC) using Terraform on AWS.
• Operate monitoring ecosystems, including but not limited to:
  o OpenSearch/ELK,
  o Prometheus,
  o Grafana,
  o Splunk, and
  o Kafka.
• Build automation to detect and remediate production issues.
• Ensure security compliance through vulnerability patching.
• Collaborate cross-functionally to improve reliability.
• Participate in on-call rotations and incident response.
• Contribute to distributed system design and operations.

Required Skills

General Abilities
• Bachelor’s degree in computer science or related field

General Technical Skills
• At least eight (8) years of experience in a DevOps and/or SRE platform engineering role
• Incident response and on-call operations: Demonstrated experience in a 24/7 production environment, including but not limited to:
  o Triaging alerts
  o Leading incident response
  o Writing post-incident reviews
  o Maintaining SLA commitments across large-scale distributed systems
• IaC and automation: Proficiency with Terraform, Ansible, and/or equivalent IaC tooling for provisioning and managing cloud infrastructure at scale on AWS
• Scripting and development: Working proficiency in Python, Golang, and/or Bash for building automation scripts, operational tooling, and/or CI/CD pipeline integrations (e.g., Drone, GitHub Actions, Argo CD)

Specific Technical Skills
• Kubernetes and container orchestration: Production experience operating and troubleshooting workloads on Kubernetes at large scale (i.e., hundreds of deployments and thousands of pods), including but not limited to:
  o Helm chart management
  o Pod scheduling
  o Resource tuning
  o Multi-cluster operations
• Observability stack expertise: Hands-on experience – performing pipeline design, query optimization, and/or capacity planning for high-volume environments – in at least two (2) of the following:
  o OpenSearch/Elasticsearch
  o Prometheus/Mimir
  o Grafana
  o Loki
  o Splunk
  o Logstash

Desired Skills
• Apache Kafka/AWS MSK: Experience in at least one (1) of the following:
  o Operating or tuning Kafka clusters at scale
  o Managing the following across high-throughput streaming pipelines: topic configurations, ACLs, consumer lag, and/or schema registries
• Splunk administration: Experience deploying, managing, and/or migrating Splunk Enterprise environments with Kubernetes-based log shipping architectures, including but not limited to:
  o Forwarder management,
  o Search optimization,
  o Index lifecycle, and/or
  o Integration
• OpenTelemetry and distributed tracing: Experience with deploying OpenTelemetry for data collection and application performance monitoring
• Security frameworks and container hardening: Familiarity with at least one (1) of the following (for vulnerability remediation at scale):
  o Government or industry security certification standards; examples: FedRAMP, STIG, IL5, ISO 27001, SOC 2
  o Container image hardening practices
  o Security scanning tools (e.g., Anchore, Grype)
• AI-augmented operations: Experience using LLMs, AI coding assistants, and/or custom AI agents (e.g., MCP servers, Copilot, Claude) to:
  o Accelerate engineering workflows,
  o Automate runbooks, and/or
  o Assist with incident triage
• Deployment pipelines (Argo CD/Helm bundles): Experience with at least one (1) of the following across multi-region clusters:
  o GitOps-style deployment workflows
  o Argo CD application management
  o Helm bundle patterns
  o Blue/green or canary release strategies
• Cost optimization and capacity planning: Experience in at least one (1) of the following in large-scale logging and/or metrics platforms:
  o Right-sizing cloud resources
  o Analyzing spending across AWS services
  o Optimizing data retention policies (ISM/ILM)
  o Reducing storage costs