{"schemaVersion":"jobsearcher.job.v1","id":"ac7721a1fc6ab934f687f707","url":"https://jobsearcher.com/jobs/ac7721a1fc6ab934f687f707","canonicalUrl":"https://jobsearcher.com/jobs/ac7721a1fc6ab934f687f707","title":"Staff Database Reliability Engineer","description":"About the role\nWe're hiring a Staff Database Reliability Engineer to own the strategy, architecture, and operational excellence of our data infrastructure. This is an expert-level IC role with deep influence on engineering direction, partnering closely with platform, backend, and DevOps engineers.\n\nWhy this role matters\nYou will own the data tier end-to-end. Design schemas and access patterns that scale, tune Aurora for latency and throughput, and set the standards for how engineers interact with our databases. When a migration script seizes up mid-deploy and writes start queueing behind an ACCESS EXCLUSIVE lock, your runbooks and automation resolve the incident quickly.\n\nMake the Django ORM a strength, not a liability:\n\nReview migrations for safety at scale — locks, backfills, concurrent index builds, NOT VALID constraints\n\nCatch N+1 patterns and missing select_related/prefetch_related in review\n\nEstablish conventions for QuerySet usage and physical schema design (indexes, constraints, partitioning)\n\nScale review through automation, not heroics — author AGENTS.md files and DNA scaffolding that encode our conventions, configure AI review bots (Claude Code, Cursor, etc.) 
to flag risky migrations and ORM anti-patterns, and iterate on those configs as new failure modes emerge\n\nLead major infrastructure initiatives:\n\nCapacity planning as traffic and engineering throughput grow\n\nZero-downtime schema migrations and cutovers\n\nMulti-AZ resilience within a single region — Aurora writer/reader placement, failover behavior and RTO/RPO, ElastiCache and OpenSearch AZ topology, RabbitMQ survivability across AZs\n\nBackups, PITR, failover testing, retention\n\nOwn the CDC pipeline (Aurora → DMS → S3 Parquet → Snowflake):\n\nDMS task design and tuning, replication slot hygiene on the Postgres side\n\nSchema evolution as Django migrations roll through — so a column rename doesn't silently break the warehouse at 6 AM\n\nParquet layout and partitioning, reliability of the Snowflake handoff\n\nAutomated checks that flag migrations likely to break downstream consumers\n\nDrive observability across three complementary tools:\n\npganalyze — query‑level performance, index advisor, schema insights - the go‑to for 'why is this ORM query slow'\n\nCloudWatch — infrastructure metrics and alarms for Aurora, OpenSearch, ElastiCache, SQS, DMS\n\nHoneycomb — high‑cardinality tracing that ties slow DB calls back to users, flags, deploys, and flows\n\nShape how the three fit together, including Django‑side instrumentation and trace attributes on ORM queries\n\nBuild tooling and guardrails:\n\nMigration review automation and CI checks for risky patterns\n\nSlow query pipelines fed from pganalyze\n\nSelf‑service dashboards so teams understand their own query footprint\n\nSupport and evolve the rest of the stack:\n\nOpenSearch — index design, sharding, mapping changes, reindexing strategy, Django‑side indexing pipelines\n\nRedis — caching patterns, eviction, sizing, Django cache framework, Celery/RQ usage, avoiding hot keys and thundering herds\n\nSQS + RabbitMQ — queue design, DLQs, visibility timeouts, exchange/queue topology, AZ mirroring, consumer 
backpressure, Celery behavior under load\n\nWhat makes you a great fit\nCore expertise:\n\nDeep PostgreSQL — EXPLAIN (ANALYZE, BUFFERS), MVCC, bloat, lock contention, vacuum/autovacuum. Aurora Serverless v2 / Limitless experience strongly preferred (storage model, reader/writer split, ACU scaling)\n\nStrong ORM fluency (Django, SQLAlchemy, ActiveRecord, or similar) — predict the SQL a query will generate, spot N+1 problems on sight, and know how to control eager loading (joins vs. batched IN queries), column projection, aggregations, and subqueries\n\nSingle-region multi‑AZ design — practical understanding of what it does and doesn't protect against\n\nData movement and observability:\n\nProduction CDC experience, ideally AWS DMS — comfortable with logical replication, slot hygiene, schema evolution, and Parquet‑based data lakes feeding Snowflake (or BigQuery/Redshift)\n\nHands‑on with pganalyze (or Datadog DBM / Performance Insights / pg_stat_statements pipelines), CloudWatch (custom metrics, composite alarms, log insights), and Honeycomb (or another high‑cardinality tracing tool) — comfortable with OpenTelemetry and opinionated about what makes a trace useful\n\nAI‑assisted workflow:\n\nReal experience making AI coding and review tools useful for a team — writing AGENTS.md files, configuring review agents, versioning and iterating on prompts and configs\n\nThe rest of the stack:\n\nOpenSearch at scale — sizing, sharding, JVM tuning, rolling upgrades, snapshots\n\nProduction Redis — persistence tradeoffs, cluster mode, hot keys, thundering herds\n\nAt least one production message broker (SQS, RabbitMQ, Kafka) — delivery semantics, idempotency, failure modes\n\nEngineering and leadership:\n\nStrong automation and IaC background — real code (Python, Go, or similar) and Terraform\n\nTrack record leading cross‑team initiatives, writing design docs that hold up, influencing without authority\n\nComfortable in a high‑growth environment where the right answer for 50 engineers 
isn't the right answer for 100\n\nPragmatic outlook during incidents — focused on preventing the next one\n\nFull‑Time US Employee Benefits Include\n\nSome of the nicest and smartest teammates you’ll ever work with\n\nCompetitive salaries\n\nComprehensive healthcare benefits\n\nExciting and motivating equity\n\nFlexible PTO\n\n401k\n\nParental Leave\n\nCommuter Benefits (SF office employees)\n\nWFH Stipend\n\nCompensation\n$200k-$250k base + equity\n\nWe consider several factors when determining compensation, including location, experience, and other job‑related factors.\n\nAt Scribe, we celebrate our differences and are committed to creating a workplace where all employees feel supported and empowered to do their best work. We believe this benefits not only our employees but our product, customers, and community as well. Scribe is proud to be an Equal Opportunity Employer.","company":"Scribehowcom","rawCompany":"scribehowcom","city":"Millbrae","state":"CA","isRemote":false,"isActive":false,"createdAt":"2026-06-17T04:22:16.501Z","occupations":[{"code":"15-1243.00","title":"Database Architects","slug":"database-architects"},{"code":"15-1242.00","title":"Database Administrators","slug":"database-administrators"},{"code":"15-1243.01","title":"Data Warehousing Specialists","slug":"data-warehousing-specialists"}],"industries":[{"code":"541512","title":"Computer Systems Design Services","slug":"computer-systems-design-services"},{"code":"513210","title":"Software Publishers","slug":"software-publishers"},{"code":"541511","title":"Custom Computer Programming Services","slug":"custom-computer-programming-services"}],"jobPosting":{"@context":"https://schema.org","@type":"JobPosting","title":"Staff Database Reliability Engineer","description":"About the role\nWe're hiring a Staff Database Reliability Engineer to own the strategy, architecture, and operational excellence of our data infrastructure. 
This is an expert-level IC role with deep influence on engineering direction, partnering closely with platform, backend, and DevOps engineers.\n\nWhy this role matters\nYou will own the data tier end-to-end. Design schemas and access patterns that scale, tune Aurora for latency and throughput, and set the standards for how engineers interact with our databases. When a migration script seizes up mid-deploy and writes start queueing behind an ACCESS EXCLUSIVE lock, your runbooks and automation resolve the incident quickly.\n\nMake the Django ORM a strength, not a liability:\n\nReview migrations for safety at scale — locks, backfills, concurrent index builds, NOT VALID constraints\n\nCatch N+1 patterns and missing select_related/prefetch_related in review\n\nEstablish conventions for QuerySet usage and physical schema design (indexes, constraints, partitioning)\n\nScale review through automation, not heroics — author AGENTS.md files and DNA scaffolding that encode our conventions, configure AI review bots (Claude Code, Cursor, etc.) 
to flag risky migrations and ORM anti-patterns, and iterate on those configs as new failure modes emerge\n\nLead major infrastructure initiatives:\n\nCapacity planning as traffic and engineering throughput grow\n\nZero-downtime schema migrations and cutovers\n\nMulti-AZ resilience within a single region — Aurora writer/reader placement, failover behavior and RTO/RPO, ElastiCache and OpenSearch AZ topology, RabbitMQ survivability across AZs\n\nBackups, PITR, failover testing, retention\n\nOwn the CDC pipeline (Aurora → DMS → S3 Parquet → Snowflake):\n\nDMS task design and tuning, replication slot hygiene on the Postgres side\n\nSchema evolution as Django migrations roll through — so a column rename doesn't silently break the warehouse at 6 AM\n\nParquet layout and partitioning, reliability of the Snowflake handoff\n\nAutomated checks that flag migrations likely to break downstream consumers\n\nDrive observability across three complementary tools:\n\npganalyze — query‑level performance, index advisor, schema insights - the go‑to for 'why is this ORM query slow'\n\nCloudWatch — infrastructure metrics and alarms for Aurora, OpenSearch, ElastiCache, SQS, DMS\n\nHoneycomb — high‑cardinality tracing that ties slow DB calls back to users, flags, deploys, and flows\n\nShape how the three fit together, including Django‑side instrumentation and trace attributes on ORM queries\n\nBuild tooling and guardrails:\n\nMigration review automation and CI checks for risky patterns\n\nSlow query pipelines fed from pganalyze\n\nSelf‑service dashboards so teams understand their own query footprint\n\nSupport and evolve the rest of the stack:\n\nOpenSearch — index design, sharding, mapping changes, reindexing strategy, Django‑side indexing pipelines\n\nRedis — caching patterns, eviction, sizing, Django cache framework, Celery/RQ usage, avoiding hot keys and thundering herds\n\nSQS + RabbitMQ — queue design, DLQs, visibility timeouts, exchange/queue topology, AZ mirroring, consumer 
backpressure, Celery behavior under load\n\nWhat makes you a great fit\nCore expertise:\n\nDeep PostgreSQL — EXPLAIN (ANALYZE, BUFFERS), MVCC, bloat, lock contention, vacuum/autovacuum. Aurora Serverless v2 / Limitless experience strongly preferred (storage model, reader/writer split, ACU scaling)\n\nStrong ORM fluency (Django, SQLAlchemy, ActiveRecord, or similar) — predict the SQL a query will generate, spot N+1 problems on sight, and know how to control eager loading (joins vs. batched IN queries), column projection, aggregations, and subqueries\n\nSingle-region multi‑AZ design — practical understanding of what it does and doesn't protect against\n\nData movement and observability:\n\nProduction CDC experience, ideally AWS DMS — comfortable with logical replication, slot hygiene, schema evolution, and Parquet‑based data lakes feeding Snowflake (or BigQuery/Redshift)\n\nHands‑on with pganalyze (or Datadog DBM / Performance Insights / pg_stat_statements pipelines), CloudWatch (custom metrics, composite alarms, log insights), and Honeycomb (or another high‑cardinality tracing tool) — comfortable with OpenTelemetry and opinionated about what makes a trace useful\n\nAI‑assisted workflow:\n\nReal experience making AI coding and review tools useful for a team — writing AGENTS.md files, configuring review agents, versioning and iterating on prompts and configs\n\nThe rest of the stack:\n\nOpenSearch at scale — sizing, sharding, JVM tuning, rolling upgrades, snapshots\n\nProduction Redis — persistence tradeoffs, cluster mode, hot keys, thundering herds\n\nAt least one production message broker (SQS, RabbitMQ, Kafka) — delivery semantics, idempotency, failure modes\n\nEngineering and leadership:\n\nStrong automation and IaC background — real code (Python, Go, or similar) and Terraform\n\nTrack record leading cross‑team initiatives, writing design docs that hold up, influencing without authority\n\nComfortable in a high‑growth environment where the right answer for 50 engineers 
isn't the right answer for 100\n\nPragmatic outlook during incidents — focused on preventing the next one\n\nFull‑Time US Employee Benefits Include\n\nSome of the nicest and smartest teammates you’ll ever work with\n\nCompetitive salaries\n\nComprehensive healthcare benefits\n\nExciting and motivating equity\n\nFlexible PTO\n\n401k\n\nParental Leave\n\nCommuter Benefits (SF office employees)\n\nWFH Stipend\n\nCompensation\n$200k-$250k base + equity\n\nWe consider several factors when determining compensation, including location, experience, and other job‑related factors.\n\nAt Scribe, we celebrate our differences and are committed to creating a workplace where all employees feel supported and empowered to do their best work. We believe this benefits not only our employees but our product, customers, and community as well. Scribe is proud to be an Equal Opportunity Employer.","datePosted":"2026-06-17T04:22:16.501Z","dateModified":"2026-06-17T04:22:16.501Z","hiringOrganization":{"@type":"Organization","name":"Scribehowcom","sameAs":"https://jobsearcher.com"},"jobLocation":{"@type":"Place","address":{"@type":"PostalAddress","addressLocality":"Millbrae","addressRegion":"CA","addressCountry":"US"}},"identifier":{"@type":"PropertyValue","name":"JobSearcher","value":"ac7721a1fc6ab934f687f707"},"url":"https://jobsearcher.com/jobs/ac7721a1fc6ab934f687f707"}}