Senior Bioinformatics Data Engineer

Job Summary

We are seeking a Data/Platform Engineer to accelerate a biomarker data lake modernization within Computational Discovery. The project replaces a fragmented ETL landscape with a unified AWS-native platform: Dagster orchestration, dbt transformations, Iceberg Bronze/Silver on S3, and Redshift Gold serving downstream AI and visualization consumers. Core infrastructure is live; this role provides the engineering capacity to drive real data through production pipelines across multiple oncology clinical studies.

The successful candidate will embed as a technical partner with the lead engineer, working across the full stack from vendor ingestion through gold-layer delivery. This is a hands-on role that demands strong data quality instincts, a drive to automate repetitive workflows, and the ability to build scalable processes around AI-assisted development.

Key Responsibilities

- Build and maintain Dagster-orchestrated ingestion pipelines for genomics vendors (Caris, Predicine, Tempus, Olink, CellCarta), including IO managers, Iceberg writers, and row-level accounting.
- Develop and harden dbt Silver-to-Gold transformations: real-data test coverage, store-failures patterns, staging/intermediate/mart models, and macro consolidation.
- Implement clinical data ingestion paths (SDTM and ADaM), reconciliation logic, and subject-dimension routing.
- Deliver platform infrastructure: FastAPI endpoints, CI/CD pipelines, containerized deployments, observability instrumentation, and Redshift performance tuning.
- Extract transformation rules from legacy R and PySpark code and reconcile against new platform implementations.
- Identify repetitive processes and convert them into automated workflows, guardrails, or reusable tooling.
- Participate in adversarial design and code reviews, identifying edge cases and pushing back on suboptimal patterns.
- Collaborate with the lead engineer on design decisions and jointly own delivery velocity through paired working sessions and PR reviews.
- Ensure all work meets reproducibility standards: CI on every PR, automated tests, no ad-hoc notebook-based production processes.

Minimum Qualifications

- AI-native engineering practice: demonstrated experience building systems and workflows around AI coding agents (Claude Code, Cursor, Codex, or equivalent), not just prompting them. You recognize when a repeated process should become an automated pipeline, when agent output needs guardrails, and when to build infrastructure that makes future work faster. Surface-level tool usage is insufficient.
- Education: Bachelor's or Master's degree in Computer Science, Data Engineering, Bioinformatics, or related field.
- Experience: 5+ years of professional experience in data engineering with shipped production pipelines on AWS (S3, ECS/Fargate, Redshift or equivalent MPP).
- Strong proficiency in Python and SQL with working knowledge of modern data engineering libraries.
- Advanced proficiency with dbt and a workflow orchestration tool (Dagster, Airflow, or Prefect).
- Data quality instinct: track record of catching silent failures, questioning data correctness assumptions, and noticing lossy joins or incomplete deliveries.
- Solid understanding of lakehouse architecture patterns, ETL processes, and schema design for complex multi-modal datasets.
- Ability to handle PHI-adjacent clinical data under contractor policy (background check, compliance training, VPN access).
- Willingness to work within legacy codebases (R, PySpark) to extract business rules and validate new implementations.
- Excellent communication skills and ability to work in an embedded pair model with tight feedback loops.

Preferred Qualifications

- Direct experience with Apache Iceberg, AWS Glue Catalog, or lakehouse table formats.
- Comfort reading genomic data (VAF, HGVS nomenclature, VCFs, CNV/fusion semantics) or demonstrated ability to ramp on unfamiliar scientific domains quickly.
- Familiarity with clinical data standards including SDTM, ADaM, and CDISC.
- Pharma, clinical research, or life sciences background.
- Experience with containerization (Docker/ECS) and infrastructure-as-code (CloudFormation).
- Proficiency in R for interoperability with bioinformatics teams.

Kaztronix is an equal opportunity employer and does not discriminate on the basis of race, color, national origin, sex, age, religion, disability, veteran status or any other consideration made unlawful by federal, state or local laws. In addition, all human resource actions in such areas as compensation, employee benefits, transfers, layoffs, training and development are to be administered objectively, without regard to race, color, religion, age, sex, national origin, disability, veteran status or any other consideration made unlawful by federal, state or local laws. By applying to the position, you acknowledge that your information will be used by Kaztronix in processing your application.