Astro/Airflow Engineer
Must Have Technical/Functional Skills
- 5-8+ years building/operating data or platform systems; 3+ years running Airflow in production at scale (hundreds to thousands of DAGs and high task throughput)
- Deep Airflow expertise: DAG design and testing, idempotency, deferrable operators/sensors, dynamic task mapping, task groups, datasets, pools/queues, SLAs, retries/backfills, cross-DAG dependencies
- Strong Kubernetes experience running Airflow and supporting services: Helm, autoscaling, node/pod tuning, topology spread, network policies, PDBs, and blue/green or canary strategies
- Observability and SRE practices: Prometheus/Grafana/StatsD, centralized logging, alert design, capacity/throughput modeling, performance tuning
- Security/compliance: SSO/OIDC, RBAC, secrets management (Vault/Secrets Manager), auditing, least-privilege connection management, and change control
- Proven incident leadership, runbook creation, and platform roadmap execution; excellent cross-functional communication
- Experience operating and leading migrations to/from Airflow
- OpenLineage/Marquez adoption; Great Expectations or other data quality frameworks; data contracts
- Cost optimization and capacity planning for schedulers and workers; spot instance strategies
- Multi-region HA/DR for Airflow metadata DB; backup/restore and disaster drills
- Building internal developer platforms/portals (e.g., Backstage) for self-service pipelines
- Contributions to Apache Airflow or provider packages; familiarity with recent AIPs/Airflow 2.7+ features

- Architect, deploy, and operate production-grade Airflow on Kubernetes, including all components and user application dependencies, with a focus on upgrades, capacity planning, HA, security, and performance tuning
- Operate a multi-scheduler ecosystem: determine when to use Airflow, distributed compute schedulers, or lightweight task runners based on workload requirements; provide a unified developer experience across schedulers
- Build automation infrastructure: Terraform modules and Helm charts with GitOps-driven CI/CD for environment provisioning, upgrades, and zero-downtime rollouts
- Standardize the developer experience: DAG repo templates, shared operator libraries, connection and secrets management, dependency packaging, code ownership, linting, unit testing, and pre-commit hooks
- Implement comprehensive observability: metrics collection, dashboards, distributed tracing, SLA/latency monitoring, intelligent alerting, and runbook automation
- Enable resilient workflow patterns: build idempotency frameworks, retry/backoff strategies, deferrable operators and sensors, dynamic task mapping, and data-aware scheduling
- Ensure reliability at enterprise scale: architect and tune resource allocation (pools, queues, concurrency limits) to support high-throughput workloads; optimize large-scale backfill strategies; develop comprehensive runbooks and lead incident response/postmortems
- Partner with teams across the organization to provide enablement, documentation, and self-service tooling

Salary Range: $110,000-$120,000 a year