Principal Infrastructure & Site Reliability Engineer (US REMOTE)

Oracle | Remote | May 12th, 2026
Job Description

Join Oracle's Health Data Intelligence (HDI) team as a Software Engineer 4, focused on Site Reliability Engineering for large-scale healthcare analytics platforms. In this role, you will design, build, and operate highly reliable, scalable infrastructure and data pipelines that power mission-critical analytics globally.

You will also contribute to the next evolution of cloud operations by advancing automation, observability, and AI-assisted reliability practices. This includes exploring the use of Generative AI and intelligent automation to improve incident response, system resilience, and operational efficiency.

You will work within a collaborative team to deliver robust solutions that handle massive datasets with precision and performance, while continuously improving system reliability and operational excellence.

U.S. citizenship is required for this position, as the successful candidate will be required to obtain (and maintain) a U.S. government security clearance after hire.

Required Skills

Infrastructure & Reliability
- Experience building and operating high-availability, fault-tolerant systems
- Strong understanding of distributed systems, performance monitoring, and resiliency patterns
- Experience with incident response, root-cause analysis, and production troubleshooting

AI-Native Engineering (NEW)
- Hands-on experience applying Generative AI or Agentic AI (e.g., LangChain, AutoGPT, custom agents) to:
  - Infrastructure lifecycle management
  - Observability and anomaly detection
  - Incident response and remediation automation
- Ability to design or integrate AI-driven workflows for operational efficiency and reliability
- Familiarity with building or integrating autonomous agents for DevOps/SRE use cases

Cloud & Multi-Cloud Ecosystems
- Strong experience with multi-cloud environments (OCI, AWS/Azure)
- Deep understanding of cloud infrastructure design, deployment, and resource optimization
- Experience managing hybrid or cross-cloud architectures

DevOps/SRE Practices
- Advanced competency in:
  - CI/CD pipelines (Jenkins, Kubernetes)
  - Infrastructure as Code (Terraform)
  - Observability tools (Prometheus, Grafana)
- Strong focus on automation-first operations

Data Technologies
- Proficiency in Data Warehousing platforms (e.g., Vertica, Snowflake)
- Experience with ETL frameworks and large-scale data processing
- Understanding of columnar storage systems

BI & Reporting
- Experience supporting or integrating BI tools (Tableau, Power BI, Oracle Analytics)

Programming & Tools
- Strong proficiency in Python, Java, or Go
- Experience with Docker, Kubernetes, and shell scripting

Problem-Solving
- Strong troubleshooting skills with ability to perform root-cause analysis
- Experience resolving complex production issues in distributed systems

Responsibilities

Work with the Site Reliability Engineering (SRE) team to take shared ownership of services and platform components. Develop a strong understanding of end-to-end system architecture, dependencies, and production behavior.

- Design, build, and operate reliable, scalable, and secure infrastructure supporting large-scale analytics workloads
- Improve system reliability through automation, monitoring, and performance optimization
- Contribute to the adoption of AI-assisted approaches for operations, including:
  - Enhancing observability and alerting
  - Supporting automated incident detection and remediation
  - Exploring intelligent automation for infrastructure lifecycle management
- Partner with development teams to enhance service architecture, scalability, and operability
- Participate in on-call rotations and act as an escalation point for complex production issues
- Perform root cause analysis and implement long-term fixes to prevent recurrence
- Apply knowledge of distributed systems to troubleshoot issues and optimize system performance
- Drive continuous improvement in DevOps/SRE practices, including CI/CD, Infrastructure as Code, and automation at scale

Develop & Maintain
- Implement and optimize infrastructure for Oracle HDI Analytics Platform
- Ensure system uptime, reliability, and scalability

AI-Driven Automation (NEW)
- Design and implement GenAI-powered or agent-based solutions for:
  - Observability and anomaly detection
  - Incident triage and remediation
  - Infrastructure provisioning and lifecycle management
- Build tools and frameworks that enable self-service and autonomous operations

Data Pipeline Execution
- Build and optimize scalable data pipelines using Vertica and ETL frameworks

Operational Excellence
- Apply DevOps/SRE practices to automate deployments and operations
- Enhance observability using Prometheus/Grafana and AI-driven insights

Cloud Integration
- Support multi-cloud initiatives across OCI, AWS, and Azure
- Optimize cost, performance, and compliance across environments

Incident Response
- Participate in on-call rotations
- Implement preventative and automated remediation solutions

Collaboration
- Work closely with engineers to execute technical roadmaps
- Contribute to code reviews and infrastructure improvements

What You Bring
- 10 years of software engineering experience, with 8 years in cloud infrastructure, SRE, or DevOps
- Proven ownership of production system reliability in cloud environments

Core Expertise
- Cloud infrastructure design and automation
- Distributed systems and performance optimization
- Data warehousing and ETL frameworks

AI-Native Experience
- Demonstrated experience applying GenAI / LLMs / agentic frameworks to infrastructure or operations
- Experience building or integrating AI-powered automation for DevOps/SRE workflows
- Familiarity with tools like LangChain, AutoGPT, or custom AI agents

Technical Skills
- Terraform, Docker, Kubernetes
- Observability stacks (Prometheus, Grafana)
- Python, Java, or Go

Additional Strengths
- Strong problem-solving mindset with a focus on automation and scalability
- Experience improving system reliability through intelligent automation

Preferred Qualifications
- Experience in healthcare or regulated environments (HIPAA, compliance frameworks)
- Familiarity with Oracle HDI or large-scale analytics platforms
- Experience working in environments requiring security clearance
- Experience building self-healing or autonomous infrastructure systems

Why Join Oracle HDI?
- Own and shape AI-native SRE and automation strategy for a mission-critical platform
- Work on large-scale, data-intensive healthcare systems
- Be part of Oracle's investment in AI-driven infrastructure and healthcare innovation
- Build the future of autonomous, self-healing cloud platforms
- Collaborate with top-tier engineers solving complex, real-world problems

Benefits
- Medical, dental, and vision insurance, including expert medical opinion
- Short term disability and long term disability
- Life insurance and AD&D
- Supplemental life insurance (Employee/Spouse/Child)
- Health care and dependent care Flexible Spending Accounts
- Pre-tax commuter and parking benefits
- 401(k) Savings and Investment Plan with company match
- Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation.
- 11 paid holidays
- Paid sick leave: 72 hours of paid sick leave upon date of hire. Refreshes each calendar year. Unused balance will carry over each year up to a maximum cap of 112 hours.
- Paid parental leave
- Adoption assistance
- Employee Stock Purchase Plan
- Financial planning and group legal
- Voluntary benefits including auto, homeowner and pet insurance

Career Level – IC4

Disclaimer

Certain US customer or client-facing roles may be required to comply with applicable requirements, such as immunization and occupational health mandates.

About Us

Only Oracle brings together the data, infrastructure, applications, and expertise to power everything from industry innovations to life-saving care. And with AI embedded across our products and services, we help customers turn that promise into a better future for all. Discover your potential at a company leading the way in AI and cloud solutions that impact billions of lives.

True innovation starts when everyone is empowered to contribute. That's why we're committed to growing a workforce that promotes opportunities for all with competitive benefits that support our people with flexible medical, life insurance, and retirement options. We also encourage employees to give back to their communities through our volunteer programs.

We're committed to including people with disabilities at all stages of the employment process. If you require accessibility assistance or accommodation for a disability at any point, let us know by emailing accommodation-request_mb@oracle.com or by calling 1-888-404-2494 in the United States.

Oracle is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans' status, or any other characteristic protected by law. Oracle will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.
