DevOps Engineer
Job Summary
We are seeking a highly capable Senior DevOps Engineer / Platform Engineer to build, operationalize, and scale the infrastructure and deployment foundation for a strategic site-builder / network automation platform. This role will focus on creating reliable CI/CD pipelines, production-grade Kubernetes deployment patterns, managed database services, observability, environment reproducibility, secrets management, and Infrastructure as Code across development, testing, staging, and production environments.
This engineer will play a critical role in moving the platform from an early-stage, partially manual operating model into a repeatable, supportable, and production-ready DevOps model. The environment includes Kubernetes-hosted services, AWS managed services, workflow orchestration with Temporal, integration with Nautobot, Argo-based promotion flows, and the supporting tooling required for debugging, snapshotting, local development, and production support.
This is a hands-on engineering role for someone who can design the right platform patterns, implement them directly, and establish a durable operating model between development and DevOps teams.
Key Responsibilities
Platform Deployment & CI/CD
• Design, implement, and maintain CI/CD pipelines for testing, staging, and production environments.
• Build and maintain deployment workflows that support safe and seamless promotion across environments.
• Improve and maintain Argo-based deployment workflows to enable controlled release progression from test to staging to production.
• Establish baseline deployment mechanisms for the site-builder application and related services.
• Standardize Kubernetes application packaging and deployment patterns, with a strong preference toward Helm-based lifecycle management for complex services and third-party components.
• Migrate existing deployments to Helm charts where appropriate.
Kubernetes & Runtime Platform Engineering
• Support the deployment and ongoing operation of services running in Kubernetes.
• Improve runtime reliability, resiliency, and troubleshooting for distributed services operating inside shared Kubernetes clusters.
• Investigate and harden service-to-service connectivity patterns, especially for workflow components such as workers connecting to the Temporal engine.
• Partner with development teams to define production-grade runtime requirements, resource sizing, restart policies, and platform support boundaries.
Infrastructure as Code & Cloud Services
• Design and implement fully declarative Infrastructure as Code for managed cloud services, especially in AWS.
• Provision and maintain managed data services such as RDS/PostgreSQL and MongoDB-compatible document databases across all environments.
• Eliminate manual infrastructure setup where possible and replace it with reproducible, version-controlled deployment patterns.
• Prepare the platform for future scale across multiple environments and regions through repeatable IaC and GitOps-aligned practices.
Data Services, Snapshots & Developer Enablement
• Set up and maintain RDS, MongoDB, Redis/cache services, and related dependencies for all environments.
• Build tooling and operational processes for:
◦ production and staging database snapshots,
◦ restoring snapshots into development environments,
◦ enabling local debugging and development from realistic data states.
• Support creation of local and development environments, including Minikube-based environment-as-code approaches that mirror production behavior as closely as practical.
• Improve platform reproducibility so engineers can quickly stand up close-to-production development environments.
Workflow Orchestration & Temporal Support
• Lead the setup, deployment, and operational support of Temporal for workflow orchestration.
• Support production operations for Temporal, including troubleshooting performance issues, restarts, scaling concerns, and resource shortages.
• Establish maintainable deployment patterns for Temporal using supported packaging and lifecycle management approaches.
• Partner with engineering teams to ensure workflow platform reliability and upgradeability over time.
Observability, Reliability & Incident Readiness
• Design and maintain observability across testing, staging, and production using tools such as Prometheus and Grafana.
• Define and implement monitoring for:
◦ service and cluster utilization,
◦ CPU, memory, storage,
◦ IOPS / throughput metrics,
◦ database connections and session counts,
◦ cache hit / miss / coverage metrics,
◦ RDS and MongoDB utilization,
◦ service health and alerting.
• Build and maintain logging, tracing, and correlation capabilities, separated appropriately by environment.
• Create tools to support deep debugging and operational inspection, including raw database reads, cleanup of unused volumes, and emergency cache invalidation.
Security, Access & Secrets Management
• Maintain secrets management processes across environments.
• Build tooling for short-lived internal token generation and long-lived secret rotation.
• Support secure access from deployed services to active production devices and southbound systems.
• Help establish credential management patterns for southbound integrations and device-facing access.
• Partner with related teams to define safe operational limits and controls for service integrations.
External Integrations & Platform Support
• Support integration patterns with Nautobot and help define safe client-side behaviors such as rate limiting, retry/backoff, and service protection mechanisms.
• Partner with application teams to understand and mitigate integration issues such as rate limiting or request rejection.
• Support staging and testing by enabling virtual device environments where needed.
• Contribute to end-to-end acceptance testing and production readiness activities.
Operating Model & Cross-Functional Execution
• Help define an effective operating model between Development and DevOps, whether via RACI, embedded Agile delivery, or a hybrid support model.
• Support deployment readiness, incident management, environment ownership boundaries, and lifecycle responsibilities.
• Work closely with software engineering, infrastructure, application owners, and partner teams to drive production readiness and sustainable operations.
Required Qualifications
• Bachelor’s degree in Computer Science, Engineering, Information Systems, or equivalent practical experience.
• 7+ years of experience in DevOps, Platform Engineering, SRE, or Infrastructure Engineering roles.
• Strong hands-on experience with Kubernetes in production environments.
• Strong experience building and maintaining CI/CD pipelines for multi-environment software delivery.
• Strong experience with ArgoCD, GitOps workflows, or equivalent deployment tooling.
• Strong experience with Helm and Kubernetes package/deployment lifecycle management.
• Experience with AWS managed services, especially RDS/PostgreSQL, document databases, and related infrastructure.
• Strong experience with Infrastructure as Code, such as Terraform and/or similar declarative tooling.
• Experience with Prometheus, Grafana, and modern observability practices.
• Experience with Redis/cache services, secrets management, and operational debugging.
• Strong Linux, networking, and distributed systems troubleshooting skills.
• Strong scripting and automation skills in one or more languages such as Python, Bash, or Go.
• Proven ability to work cross-functionally and operate effectively in environments where ownership boundaries are still evolving.
Preferred Qualifications
• Experience with Temporal deployment and production operations.
• Experience supporting developer platforms with local environment reproducibility using Minikube, kind, or similar tools.
• Experience with MongoDB / DocumentDB operations and restore workflows.
• Experience integrating with Nautobot, NetBox, or similar infrastructure source-of-truth platforms.
• Experience operating in shared-cluster environments with multi-team tenancy and constrained access models.
• Experience designing platform patterns for internal products that must scale across regions or multiple deployment footprints.
• Familiarity with network automation or infrastructure orchestration platforms is a plus.
What Success Looks Like
• CI/CD pipelines are reliable, repeatable, and support safe promotion across all environments.
• Kubernetes deployments are standardized, maintainable, and production-ready.
• Managed infrastructure is defined as code rather than through manual setup.
• Temporal, databases, cache layers, and observability tooling are stable and supportable.
• Development teams can reproduce realistic environments locally for faster debugging and delivery.
• Secrets, access patterns, and operational tooling are mature enough to support production-scale operations.
• The DevOps operating model is clearly defined and enables faster deployments with less operational risk.