Cloud Data SRE (Spark / Data Platform) - Visa Independent
Cloud Data SRE (Spark / Data Platform), 6-8 Years Experience

Role Overview

We are looking for an experienced Cloud Data SRE with 6-8 years of relevant experience to support, manage, and optimize Spark-based data workloads in production. This role is not development-focused; instead, it emphasizes production support, troubleshooting, system reliability, platform migration, and operational excellence across Spark and data ecosystem components.

Key Responsibilities

Production Support & Incident Management
- Provide on-call support for production alerts and critical issues.
- Perform log analysis, debug application failures, and drive quick resolution.
- Handle incident management, root-cause analysis, and permanent remediation.
- Conduct alert retrospectives, reduce noise, and fine-tune alert thresholds.

Monitoring & Operational Excellence
- Monitor Spark jobs, data pipelines, and underlying infrastructure across Hadoop/Kubernetes/serverless platforms.
- Manage server health, Hadoop cluster nodes, and disk utilization.
- Configure resource parameters and optimize Spark job performance.
- Support developers by helping diagnose and resolve job issues.

Data & Platform Management
- Manage data access, quotas, file permissions, and HDFS/Kube resources.
- Handle data management operations including data copy, DR, retention planning, and utilization checks.

Tooling & Automation
- Build/maintain tools for automation, reporting, dashboarding, and incident analysis.
- Improve operational efficiency through scripts, utilities, and internal platforms.

Migration Projects
- Migrate projects from:
  - Legacy schedulers to Data Platform
  - Hadoop HDFS to ACOS
  - YARN / Kubernetes to Serverless Spark
- Support data and compute migration initiatives end-to-end.

Required Experience
- 6-8 years of experience in Data SRE / Production Support roles.
- Strong knowledge of:
  - Spark job execution & tuning
  - Hadoop ecosystem (HDFS, YARN)
  - Kubernetes basics
  - Serverless Spark environments
- Hands-on experience with monitoring, troubleshooting, alerting, and incident response.
- Comfort with shell scripting / Python for automation (nice to have, not mandatory; the role is not coding-heavy).