Site Reliability Engineer
As SREs, we are responsible for ensuring that our platforms are stable and healthy. We break down barriers to running our products by fostering developer run ownership and empowering developers to build resilient products. During the application build phase, we support our developers in software run principles, including operational design, automation, capacity planning, certificate lifecycle management, and monitoring, that lead to fault-tolerant, scalable products. We see the big picture and help create and enforce operations standards while facilitating an agile, learning culture. We support daily operations with a sharp focus on triage and root cause analysis, understanding the business impact of our products and following up with blameless post-mortems.

The goal of every Business Operations team is to engage early in the development lifecycle, and to proactively manage production and change activities to maximize customer experience and increase the overall value of supported applications. Business Operations teams also focus on risk management, tying all our activities together with an overarching responsibility for compliance and risk mitigation across all our environments. Ultimately, the role of Business Operations is to align Product and Customer Focused priorities with Operational needs by providing continuous feedback throughout the lifecycle.

Team Specific Skills:

Role Qualifications
The ideal candidate will have experience in many of these areas:
• BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics), or equivalent practical experience.
• Appetite for change and pushing the boundaries of what can be done with automation.
Be curious about new technology, infrastructure, and practices to scale our architecture and prepare for future growth.
• Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
• Interest in designing, analyzing, and troubleshooting large-scale distributed systems.
• Willingness and ability to learn and take on challenging opportunities and to work as a member of a diverse, geographically distributed, matrix-based project team.
• Ability to balance doing things right with fixing things quickly. Flexible and pragmatic, while working towards improving the long-term health of the system.
• Comfortable collaborating with cross-functional teams to ensure that expected system behavior is understood and that monitoring exists to detect anomalies.

Qualifications:
• Experience with cryptography and certificate lifecycle management, data structures, scripting, pipeline management, and software design.
• Experience working across development, operations, and product teams to prioritize needs and build relationships is a must.
• Experience in an SRE role or related field.
• Experience using Excel, Access, Notepad++, and other desktop tools to manage and analyze large data sets.
• Proven expertise in relational database management systems (RDBMS) such as PostgreSQL and Oracle.
• Proficiency in SQL, PL/SQL, and PostgreSQL-specific features.
• Strong understanding of database architecture, performance tuning, and query optimization.
• Experience with monitoring tools such as Splunk and Dynatrace.
• Experience in production support environments and ITIL processes.
• Experience with industry-standard CI/CD tools like Git/Bitbucket, Jenkins, Maven, Artifactory, Groovy and Chef.
Experience designing and implementing an effective and efficient CI/CD flow that gets code from dev to prod with high quality and minimal manual effort is required.

Great to have:
• Awareness of security implementations, certificate management