
Staff Database Reliability Engineer

Job Description:

- Own the data tier end-to-end
- Design schemas and access patterns that scale, tune Aurora for latency and throughput, and set the standards for how engineers interact with our databases
- Review migrations for safety at scale: locks, backfills, concurrent index builds, NOT VALID constraints
- Catch N+1 patterns and missing select_related/prefetch_related in review
- Establish conventions for QuerySet usage and physical schema design (indexes, constraints, partitioning)
- Scale review through automation, not heroics: author AGENTS.md files and DNA scaffolding that encode our conventions, configure AI review bots (Claude Code, Cursor, etc.) to flag risky migrations and ORM anti-patterns, and iterate on those configs as new failure modes emerge
- Capacity planning as traffic and engineering throughput grow
- Zero-downtime schema migrations and cutovers
- Multi-AZ resilience within a single region: Aurora writer/reader placement, failover behavior and RTO/RPO, ElastiCache and OpenSearch AZ topology, RabbitMQ survivability across AZs
- Backups, PITR, failover testing, retention
- Own the CDC pipeline (Aurora → DMS → S3 Parquet → Snowflake)
- DMS task design and tuning, replication slot hygiene on the Postgres side
- Schema evolution as Django migrations roll through, so a column rename doesn't silently break the warehouse at 6 AM
- Parquet layout and partitioning, reliability of the Snowflake handoff
- Automated checks that flag migrations likely to break downstream consumers
- Drive observability across three complementary tools: pganalyze, CloudWatch, Honeycomb

Requirements:

- Deep PostgreSQL: EXPLAIN (ANALYZE, BUFFERS), MVCC, bloat, lock contention, vacuum/autovacuum
- Aurora Serverless v2 / Limitless experience strongly preferred (storage model, reader/writer split, ACU scaling)
- Strong ORM fluency (Django, SQLAlchemy, ActiveRecord, or similar): predict the SQL a query will generate, spot N+1 problems on sight, and know how to control eager loading (joins vs. batched IN queries), column projection, aggregations, and subqueries
- Single-region multi-AZ design: practical understanding of what it does and doesn't protect against
- Production CDC experience, ideally AWS DMS: comfortable with logical replication, slot hygiene, schema evolution, and Parquet-based data lakes feeding Snowflake (or BigQuery/Redshift)
- Hands-on with pganalyze (or Datadog DBM / Performance Insights / pg_stat_statements pipelines), CloudWatch (custom metrics, composite alarms, Logs Insights), and Honeycomb (or another high-cardinality tracing tool); comfortable with OpenTelemetry and opinionated about what makes a trace useful
- Real experience making AI coding and review tools useful for a team: writing AGENTS.md files, configuring review agents, versioning and iterating on prompts and configs
- OpenSearch at scale: sizing, sharding, JVM tuning, rolling upgrades, snapshots
- Production Redis: persistence tradeoffs, cluster mode, hot keys, thundering herds
- At least one production message broker (SQS, RabbitMQ, Kafka): delivery semantics, idempotency, failure modes
- Strong automation and IaC background: real code (Python, Go, or similar) and Terraform
- Track record leading cross-team initiatives, writing design docs that hold up, influencing without authority
- Comfortable in a high-growth environment where the right answer for 50 engineers isn't the right answer for 100
- Pragmatic outlook during incidents, focused on preventing the next one

Benefits:

- Some of the nicest and smartest teammates you'll ever work with
- Competitive salaries
- Comprehensive healthcare benefits
- Exciting and motivating equity
- Flexible PTO
- 401k
- Parental Leave
- Commuter Benefits (SF office employees)
- WFH Stipend
