JOBSEARCHER

Principal Big Data Site Reliability Developer (US Citizenship Required) US REMOTE

Ll Oefentherapie | Remote | May 23rd, 2026
This role requires U.S. Citizenship and eligibility for a Federal Security Clearance.

Our Team

Building off our Cloud momentum, Oracle has formed a new organization - Oracle Health Data, Analytics Platform. This team will focus on product development and product strategy for Oracle Health, while building out a complete platform supporting modernized, automated healthcare. This is a net new line of business, constructed with an entrepreneurial spirit that promotes an energetic and creative environment. We are unencumbered and will need your contribution to make it a world-class engineering center with a focus on excellence.

Oracle Health Data, Analytics Platform has a rare opportunity to play a critical role in how Oracle Health products impact and disrupt the healthcare industry by transforming how healthcare and technology intersect.

You will have the opportunity to:
- Reach billions of people with our products & services
- Create technology that truly impacts the world
- Have an immediate impact on developing technology
- Enjoy unlimited growth potential with inspiring work
- Work with the best minds in the industry
- Enjoy working in an open, diverse, and productive environment

About The Job

This role provides technical leadership for the core data platforms behind Oracle Health's Data & Analytics Platform. As a Principal Site Reliability Engineer (SRE), you will own shared, mission-critical systems used by multiple products and teams.

You will lead the design and operation of large-scale, stateful distributed platforms, including Hadoop ecosystem components (HDFS, YARN, HBase) deployed on Oracle Big Data Service (BDS), Kafka, and Storm.
These multi-tenant platforms are deployed and operated through Ansible- and Terraform-based automation and require strong architectural ownership to manage scale, change, and broad blast radius.

What You'll Do

Platform Ownership & Technical Leadership
- Own the end-to-end reliability, scalability, and operability of shared data platforms
- Define platform standards, architectural direction, and operational guardrails
- Influence cross-team technical decisions and long-term platform strategy
- Drive long-term platform evolution and influence reliability strategy across the data ecosystem

Architecture & Design
- Lead platform architecture and design reviews
- Clearly articulate system behavior, dependencies, and failure modes
- Make principled trade-offs between reliability, performance, cost, and complexity
- Provide guidance and guardrails that enable downstream teams to use platforms safely and effectively

Operations Engineering
- Establish capacity models, scaling strategies, and operational best practices
- Design platforms that behave predictably under load, failure, and change
- Own platform lifecycle events: upgrades, expansions, decommissioning, and recovery

Distributed Systems Expertise
- Operate and evolve stateful distributed systems where data placement, replication, and recovery are critical
- Reason about failure modes such as backpressure, rebalancing, region movement, replication lag, and rolling upgrades

Security
- Operate and maintain Kerberized platforms, including authentication, authorization, and secure service-to-service communication
- Treat security as a first-class architectural concern

Automation
- Design and evolve an Ansible- and Terraform-driven automation framework
- Treat automation as production software: versioned, reviewed, tested, and improved
- Eliminate operational toil by encoding reliability and safety into the platform

Incident Leadership & Prevention
- Serve as the ultimate escalation point for complex or ambiguous incidents
- Focus on eliminating entire classes of failure, not just resolving individual issues

Representation
- Represent SRE and platform engineering in high-visibility and sensitive forums
- Communicate clearly with engineering leadership and partner teams