Site Reliability Engineer

The Site Reliability Engineer (SRE) Lead is responsible for ensuring the reliability, scalability, performance, and security of enterprise Linux-based systems. This role combines deep technical expertise in Linux administration with leadership in automation, observability, incident management, and infrastructure engineering. The SRE Lead drives operational excellence by implementing best practices, improving system resilience, and mentoring engineering teams.

Platform Engineering and Operations
- Lead the administration, monitoring, and performance tuning of Oracle Enterprise Linux (OEL) environments in a large-scale enterprise ecosystem.
- Oversee the design, build, and lifecycle management of Linux servers, including storage, virtualization, and associated infrastructure.
- Manage high availability (HA) configurations, clustering, and load-balanced environments to ensure minimal downtime.
- Drive capacity planning, performance optimization, and system scalability initiatives.

Reliability and Automation (SRE Practices)
- Define and implement SRE principles, including SLIs, SLOs, and error budgets.
- Lead initiatives for infrastructure automation (provisioning, configuration, patching) using tools such as Ansible.
- Build and maintain self-healing systems, reducing manual intervention and improving system resilience.
- Develop automation for system installation, configuration, and deployment pipelines.

System Administration and Infrastructure Management
- Install, configure, and maintain Oracle Enterprise Linux (OEL) operating systems and related software stacks.
- Manage Logical Volume Manager (LVM) configurations, including volume groups and filesystem expansion.
- Administer distributed file systems, NFS servers/clients, and automount configurations.
- Maintain network services such as DNS, NTP, LDAP/Kerberos, SMTP (sendmail/postfix), and OpenSSH.
- Troubleshoot and support network protocols (TCP/IP, HTTP, HTTPS, RPC).

Monitoring, Incident Management and Support
- Implement and enhance monitoring, alerting, and observability frameworks for proactive issue detection.
- Lead incident response, root cause analysis (RCA), and postmortem reviews.
- Drive continuous improvement by identifying systemic issues and implementing preventive solutions.
- Oversee break/fix operations, ensuring timely resolution and minimal business impact.

Security and Compliance
- Ensure systems are secure, hardened, and compliant with enterprise security standards.
- Manage patching, vulnerability remediation, and OS upgrades.
- Partner with security teams to implement best practices for access control, auditing, and encryption.

Leadership and Collaboration
- Provide technical leadership and mentorship to SRE and infrastructure teams.
- Collaborate with application, DevOps, and platform teams to improve system reliability and deployment processes.
- Define and enforce operational standards, runbooks, and best practices.
- Drive cross-functional initiatives to enhance platform stability and efficiency.

Documentation and Governance
- Maintain comprehensive documentation for architecture, processes, and operational procedures.
- Ensure adherence to change management and incident governance frameworks.
- Standardize operational workflows across environments.

Required Qualifications
- 5+ years of experience in Linux system administration in enterprise environments.
- Strong expertise in Oracle Enterprise Linux (OEL) systems.
- Proven experience in high availability systems, virtualization, and storage management.
- Hands-on experience with automation and configuration management tools (Ansible preferred).
- Proficiency in at least one scripting/programming language (Bash, Python preferred).
- Strong experience with infrastructure troubleshooting, performance tuning, and incident management.
- Solid understanding of enterprise infrastructure (compute, storage, network).
- Excellent analytical, problem-solving, and organizational skills.
- Strong communication and collaboration skills in a global team environment.