Staff Database Reliability Engineer
Occupations:
Database Architects
Database Administrators
Computer Systems Engineers/Architects
Data Warehousing Specialists
Software Developers
Industries:
Private Households
Traveler Accommodation
Other Residential Care Facilities
Child Care Services
Residential Intellectual and Developmental Disability, Mental Health, and Substance Abuse Facilities
Job Description:
Own the data tier end-to-end
Design schemas and access patterns that scale, tune Aurora for latency and throughput, and set the standards for how engineers interact with our databases
Review migrations for safety at scale: locks, backfills, concurrent index builds, NOT VALID constraints
Catch N+1 patterns and missing select_related/prefetch_related in review
Establish conventions for QuerySet usage and physical schema design (indexes, constraints, partitioning)
Scale review through automation, not heroics: author AGENTS.md files and DNA scaffolding that encode our conventions, configure AI review bots (Claude Code, Cursor, etc.) to flag risky migrations and ORM anti-patterns, and iterate on those configs as new failure modes emerge
Capacity planning as traffic and engineering throughput grow
Zero-downtime schema migrations and cutovers
Multi-AZ resilience within a single region: Aurora writer/reader placement, failover behavior and RTO/RPO, ElastiCache and OpenSearch AZ topology, RabbitMQ survivability across AZs
Backups, PITR, failover testing, retention
Own the CDC pipeline (Aurora → DMS → S3 Parquet → Snowflake)
DMS task design and tuning, replication slot hygiene on the Postgres side
Schema evolution as Django migrations roll through, so a column rename doesn't silently break the warehouse at 6 AM
Parquet layout and partitioning, reliability of the Snowflake handoff
Automated checks that flag migrations likely to break downstream consumers
Drive observability across three complementary tools: pganalyze, CloudWatch, Honeycomb
Requirements:
Deep PostgreSQL - EXPLAIN (ANALYZE, BUFFERS), MVCC, bloat, lock contention, vacuum/autovacuum
Aurora Serverless v2 / Limitless experience strongly preferred (storage model, reader/writer split, ACU scaling)
Strong ORM fluency (Django, SQLAlchemy, ActiveRecord, or similar) - predict the SQL a query will generate, spot N+1 problems on sight, and know how to control eager loading (joins vs. batched IN queries), column projection, aggregations, and subqueries
Single-region multi-AZ design - practical understanding of what it does and doesn't protect against
Production CDC experience, ideally AWS DMS - comfortable with logical replication, slot hygiene, schema evolution, and Parquet-based data lakes feeding Snowflake (or BigQuery/Redshift)
Hands-on with pganalyze (or Datadog DBM / Performance Insights / pg_stat_statements pipelines), CloudWatch (custom metrics, composite alarms, log insights), and Honeycomb (or another high-cardinality tracing tool) - comfortable with OpenTelemetry and opinionated about what makes a trace useful
Real experience making AI coding and review tools useful for a team - writing AGENTS.md files, configuring review agents, versioning and iterating on prompts and configs
OpenSearch at scale - sizing, sharding, JVM tuning, rolling upgrades, snapshots
Production Redis - persistence tradeoffs, cluster mode, hot keys, thundering herds
At least one production message broker (SQS, RabbitMQ, Kafka) - delivery semantics, idempotency, failure modes
Strong automation and IaC background - real code (Python, Go, or similar) and Terraform
Track record leading cross-team initiatives, writing design docs that hold up, influencing without authority
Comfortable in a high-growth environment where the right answer for 50 engineers isn't the right answer for 100
Pragmatic outlook during incidents - focused on preventing the next one
Benefits:
Some of the nicest and smartest teammates you'll ever work with
Competitive salaries
Comprehensive healthcare benefits
Exciting and motivating equity
Flexible PTO
401k
Parental Leave
Commuter Benefits (SF office employees)
WFH Stipend