SRE Architect
Title: Senior Observability / SRE Engineer (Splunk | Linux | Python)
Location: Remote
Duration: Long term

We are looking for a highly experienced Senior Observability & Site Reliability Engineer to support large-scale enterprise platforms and mission-critical applications. The ideal candidate will have deep hands-on experience in building and operating end-to-end monitoring, logging, and alerting solutions across distributed environments.

This role involves close collaboration with development, infrastructure, and operations teams to ensure platform reliability, performance visibility, and incident response effectiveness.

Key Responsibilities
- Design, implement, and maintain enterprise observability solutions using Splunk Enterprise, including dashboards, alerts, and data ingestion pipelines
- Develop and enhance monitoring frameworks for infrastructure, applications, and web platforms
- Automate operational processes using Linux shell scripting and Python
- Implement intelligent alerting strategies to reduce noise and improve incident response efficiency
- Provide L3 production support for business-critical applications and infrastructure
- Support cloud and containerized deployments across AWS and Kubernetes environments
- Collaborate with engineering teams to standardize logging and telemetry practices
- Drive root cause analysis, post-incident reviews, and continuous reliability improvements
- Build operational runbooks, disaster recovery procedures, and service continuity plans
- Integrate monitoring and deployment workflows with CI/CD tools such as Jenkins, Git, and TeamCity
- Support database monitoring and performance analysis across SQL Server, Oracle, DB2, and MySQL platforms
- Participate in ITIL-based change, incident, and problem management processes

Required Skills
- Strong hands-on expertise in Splunk engineering, administration, and architecture
- Advanced experience in Linux / Unix environments
- Proficiency in Python, shell scripting, and automation frameworks
- Experience with AWS cloud services and Kubernetes / Docker platforms
- Knowledge of monitoring tools such as Nagios and custom observability solutions
- Experience supporting high-availability web platforms and distributed systems
- Strong troubleshooting and production incident management skills
- Understanding of CI/CD pipelines and deployment automation
- Familiarity with ITIL processes and service management tools like ServiceNow

Preferred Qualifications
- Splunk certifications (Power User / Admin / Architect)
- Experience building large-scale telemetry platforms
- Background in financial services or high-transaction enterprise environments
- Experience designing intelligent alerting and automated incident workflows

Experience Level
- 15+ years in production engineering / SRE / observability roles
- Prior experience supporting mission-critical enterprise systems