JOBSEARCHER

Staff SRE

Job Description:
- Build tools and frameworks to monitor systems and ensure the highest level of uptime in production environments
- Mentor the SRE team on best practices
- Develop a culture of innovation
- Take the lead in enhancing our 24/7 on-call and incident management process
- Build and maintain runbooks
- Contribute to the design and documentation of cloud services and SOPs
- Influence service design by working closely with architects, DBAs, developers, DevOps, and data engineers to build in reliability, scalability, and cost optimization early in the development process
- Lead blameless post-mortems
- Take ownership of publishing RCA documents for internal and external consumption
- Lead initiatives with Service Owners to define SLOs and build SLIs to ensure systems are meeting their SLAs
- Research and evaluate new cloud technologies and vendor offerings to enhance product stability and manageability
- Reduce operational toil and maintain a high degree of automation by adopting IaC-first and GitOps principles
- Acquire and maintain a significant understanding of Lytx production services to ensure timely resolution of production incidents

Requirements:
- 8+ years of experience as an SRE in an AWS environment at a medium to large scale organization
- 6+ years of hands-on experience implementing and managing observability tools (Prometheus, New Relic, Grafana, etc.)
- High degree of proficiency in programming, preferably using Python, Groovy, and Bash
- Hands-on experience managing database technologies (SQL and NoSQL)
- 5+ years of experience building infrastructure deployment pipelines using Git, Terraform, Helm, Jenkins/Jenkins X/ArgoCD, etc.
- Proficient in designing production environments in AWS cloud using various AWS services (VPCs, EKS, IAM, AMI, EC2, CloudWatch, CloudTrail, Control Tower, GuardDuty, MSK, S3, Glacier, Gateways, Direct Connects, Route53, RDS, ALBs, Autoscaling, etc.)
- Extensive experience with Linux systems and various protocols and technologies (HTTP, REST, TCP/IP, SSL, DNS, SMTP, SSH, NTP, Load Balancing, SQL/NoSQL, Message Brokers, Nginx, Vault, ELK, etc.)
- Hands-on experience with Kubernetes and various container and cloud-native technologies
- Significant experience participating in, implementing, and managing a 24/7 on-call rotation for an SRE team, creating runbooks, building support procedures, and proactively monitoring systems across geographical locations
- Ability to work well under pressure within a technically challenging environment

Benefits:
- Medical, dental and vision insurance
- Health Savings Account
- Flexible Spending Accounts
- Telehealth
- 401(k) and 401(k) match
- Life and AD&D insurance
- Short-Term and Long-Term Disability
- FTO or PTO
- Employee Well-Being program
- 11 paid holidays plus 1 inclusive holiday per year
- Volunteer Time Off
- Employee Referral program
- Education Reimbursement Program
- Employee Recognition and Appreciation program
- Additional perk and voluntary benefit programs