DevOps Engineer
About the Project & Team
At the Stellar Development Foundation (SDF), a small team of roughly 8 engineers is incubating a novel distributed systems prototype. Not a blockchain as you know it, something new. This is a rare opportunity for a zero-to-one build where much of the infrastructure and tooling does not exist yet. You will be building it.

We operate as a fast-moving, highly technical team with no layers of management. You will work directly with the protocol architects and core systems engineers to take this completely greenfield project to production. If you thrive on the urgency of a new launch and want the agility of an early-stage startup with the stability of SDF, this is the place to be.

About the Role
We need a dedicated DevOps Engineer to own the deployment, release, and observability lifecycle of our software and network.

Unlike a core systems role, your focus isn't writing the C++ or Rust consensus logic; it is building the robust, automated infrastructure that takes that logic from a GitHub repository to a globally distributed, highly available network. Right now, we can run the system locally. Your job is to build the pipelines, orchestration, and monitoring required to run it reliably in a hostile, distributed production environment.

You will own the operational baseline of the project, ensuring that our deployment processes are reproducible, our network topology is easily configurable, and our system health is perfectly transparent to the rest of the engineering team.

Where You'll Make an Impact
Deployment & Orchestration: Design and implement multi-region cluster deployments across bare metal and cloud environments. Your goal is fully automated, reproducible infrastructure (Infrastructure as Code).
CI/CD & Release Engineering: Build and maintain rigorous CI/CD pipelines. Ensure deterministic builds, manage versioning, and integrate automated E2E and smoke tests that exercise the system upon every merge.
Observability & Alerting: Architect the monitoring stack. Build out comprehensive Prometheus and Grafana dashboards for node health, consensus progress, and system throughput. Set up intelligent alerting for network anomalies.
Network Operations & Chaos Testing: Define the configuration management for node identity and network topology. Work with the core engineers to actively chaos-test the network: partitioning nodes, simulating localized outages, and verifying automated recovery.
Operational Readiness: Write the runbooks and establish the procedures for network upgrades, state snapshots, and incident response.

About You
Deep experience in a DevOps, SRE, or Infrastructure role managing complex, highly available systems.
Exceptional command of Infrastructure as Code (Terraform, Ansible) and container orchestration (Kubernetes, Docker).
Extensive experience building robust CI/CD pipelines (GitHub Actions, GitLab CI, etc.) for compiled languages.
Mastery of modern observability stacks (Prometheus, Grafana, ELK/Loki, Sentry).
Strong scripting skills (Python, Bash, etc.) and the ability to comfortably read and navigate Rust, Go, and C++ build systems and codebases to understand system behavior, even if you aren't writing core features.
A methodical approach to operational stability and a deep appreciation for system correctness.

Strong Nice-to-Haves
Experience operating blockchain networks, validator nodes, or other Paxos/Raft-inspired distributed systems.
Experience with bare-metal provisioning and tuning network I/O for high-performance distributed databases.

What the First Few Months Look Like
Month 1: Standardize the local and single-node deployment process. Audit and upgrade our current CI pipelines. Map out the path to a fully automated multi-node deployment.
Months 2-3: Deliver a fully automated deployment pipeline for our private testnet. Launch the V1 observability stack so the team has a real-time view into system health. Integrate automated end-to-end smoke tests into the deployment process.
Beyond: Advanced chaos testing, multi-region topological scaling, load testing infrastructure, and refining the operational runbooks as we prepare for external partners to join the network.