Solutions Applied Data Scientist, Healthcare
Company Overview

We are building Protege to solve the biggest unmet need in AI: getting access to the right training data. The process today is time-intensive, incredibly expensive, and often ends in failure. The Protege platform facilitates the secure, efficient, and privacy-centric exchange of AI training data.

Solving AI's data problem is a generational opportunity. We're backed by world-class investors and already powering partnerships with some of the most ambitious teams in AI. The company that succeeds will be one of the largest in AI, and in tech.

We're a lean, fast-moving, high-trust team of builders who are obsessed with velocity and impact. Our culture is built for people who thrive on ambiguity, own outcomes, and want to shape the future of data and AI.

Role Overview

We are hiring a Solutions Applied Data Scientist to help design, construct, and validate complex healthcare data cohorts used for AI model training. This role sits within the delivery organization, working closely with Solutions Leads and delivery engineers to solve complex data challenges that arise during customer projects.

Solutions Leads own the customer relationship and overall delivery of projects.
The Solutions Applied Data Scientist serves as their technical partner for more complex data problems, including cohort construction, multi-source dataset assembly, feasibility analysis, and data validation.

You will help translate research generated by Protege's Data Lab and customer requirements into practical dataset definitions, determine whether those requirements can be met with available data, and build the SQL and analysis needed to construct the resulting datasets.

You will also collaborate with delivery engineers when solutions require changes to data pipelines, infrastructure, or large-scale data movement.

This is a highly applied role focused on solving real-world dataset challenges, not research or model development.

The ideal candidate enjoys solving messy real-world data problems, working directly with large healthcare datasets, writing complex SQL, and collaborating closely with cross-functional teams. Our environment has a lot going on as we grow, so we're looking for someone energized by the fast pace of the industry and our company!

What You'll Do

Technical Escalation & Delivery Collaboration

During delivery projects, Solutions Leads may encounter complex data challenges that require deeper analysis or technical problem-solving.
You will act as a technical partner, helping solve problems such as:
- Complex cohort definitions that require multi-source joins
- Linking datasets across different data partners
- Investigating unexpected gaps or anomalies in delivered data
- Evaluating whether requested variables or labels exist in available datasets
- Determining whether a dataset can realistically satisfy model requirements

You will work collaboratively with Solutions Leads to unblock delivery challenges while keeping projects moving toward successful completion.

When solutions require infrastructure or pipeline changes, you will partner with the Solutions Engineer and internal platform engineering teams to implement the required workflows.

Cohort Definition & Dataset Construction

Work with Solutions Leads to translate customer requirements into concrete dataset logic. You will help ensure that datasets accurately represent the intended population and meet customer specifications.

Responsibilities include:
- Writing complex SQL queries to construct cohorts
- Implementing inclusion and exclusion logic
- Joining datasets across multiple data sources
- Validating linkage between datasets
- Identifying and resolving inconsistencies or missing fields
- Partnering with Solutions Leads to resolve complex data questions that arise during project delivery
- Escalating to, or collaborating with, delivery engineers when dataset construction requires pipeline changes or large-scale data processing

Data Quality Validation & Completeness Analysis

Before complex datasets are delivered to customers, you will work closely with Solutions Leads to validate that they meet required standards and agreed acceptance criteria.
You will also review bespoke QA methodology and suggest platform improvements to Product and Engineering to decrease custom work across engagements.

Responsibilities include:
- Performing data completeness analysis
- Investigating missing or anomalous data
- Verifying cohort logic results
- Validating row counts and dataset structure
- Creating summary statistics and validation outputs

Data Feasibility

Many customer projects involve AI researchers who are defining the healthcare datasets required to train or evaluate models. You will work with these customer teams to translate research goals into practical dataset specifications.

Responsibilities include:
- Reviewing dataset requests from AI researchers and model development teams
- Helping clarify and refine requirements for model training or evaluation datasets
- Evaluating whether requested variables or labels exist in available data sources
- Identifying proxy variables or alternative dataset structures when ideal variables are unavailable
- Assessing feasibility of requested cohort definitions given real-world data constraints
- Explaining data limitations, tradeoffs, and potential biases to technical stakeholders
- Iterating with researchers to converge on datasets that are both scientifically meaningful and operationally feasible

This role requires someone who is comfortable engaging with technically sophisticated stakeholders while grounding conversations in the realities of messy, real-world data.

Data Partner & Source Data Analysis

Many datasets originate from external healthcare data partners. You will help analyze partner datasets to:
- Understand schema and field availability
- Assess data quality and completeness
- Identify required transformations
- Evaluate feasibility of cohort logic

This work helps ensure that projects are grounded in what data actually exists.

Delivery Tooling & Workflow Improvements

As delivery patterns emerge, you will help develop tools and reusable workflows that improve efficiency.

Examples include:
- Reusable SQL templates for cohort construction
- Automated validation checks
- Scripts for dataset preparation
- Tools that reduce manual delivery work

This role is an important bridge between manual dataset delivery and scalable data infrastructure.

What Success Looks Like

30 days: Learn the delivery motion and source-data reality. Build working knowledge of Solutions workflows, healthcare data partners, common cohort patterns, and how complex requests get escalated. Shadow active projects, understand existing QA approaches, and start contributing to scoped feasibility and validation work.

60 days: Own scoped technical escalations and create early leverage. Independently support complex cohort-definition and dataset-construction work, write and validate SQL / Python workflows, and help Solutions Leads answer hard feasibility questions with clear tradeoffs.

90 days: Become a trusted technical partner across delivery. Handle the hardest dataset problems with limited oversight, improve QA and repeatability, and propose workflow or platform improvements that reduce bespoke work across engagements.

What You Bring

- Experience working with large structured healthcare datasets
- Strong SQL and Python skills and experience writing complex queries
- Experience using Claude Code / Codex
- Experience joining and transforming large datasets
- Experience performing data validation and exploratory analysis
- Strong Python skills for data analysis and scripting
- Experience working with structured file formats (CSV, Parquet, etc.)
- Ability to translate ambiguous requirements into concrete data logic
- Strong communication skills and ability to collaborate with technical and non-technical stakeholders

Protege Values

- We pass the loved ones' test: integrity isn't negotiable, even when it's costly
- We always find a way: obstacles are expected, giving up isn't
- We go fast and grow fast: velocity is a competitive advantage and we treat it that way
- We practice kindness and candor: hard conversations happen here, and they happen with care
- We deliver together: no silos, no lone heroes, no passengers
- We own the outcome: full accountability, continuous improvement, mastery over time