Remote Senior DevOps Engineer
Our team brings a huge amount of cutting-edge, specialized expertise in Machine Learning and Speech Technologies, which are used daily by hundreds of millions of people worldwide. We already have several major projects underway and are looking to strengthen our team with a DevOps/SRE Engineer!

- Minimum 5 years of experience in a DevOps and/or Site Reliability Engineering role
- Strong hands-on experience with Linux system administration
- Extensive experience deploying, operating, and scaling Kubernetes in both cloud and bare-metal environments
- Deep expertise and practical experience with at least one major cloud provider (preferably Google Cloud Platform)
- Experience with ML inference on GPU/CPU is a strong plus
- Proven experience implementing SRE practices and building observability stacks using Grafana, Prometheus, and Loki
- Strong adherence to GitOps, Infrastructure as Code (IaC), and CI/CD principles
- Advanced expertise in Terraform, Ansible, and Python
- Comfortable working in high-uncertainty environments: we are building a new product, requirements evolve quickly, and the ability to rapidly learn new technologies and patterns is essential
- Proactive mindset: ability to look beyond DevOps tasks and actively debug and understand the product
- Strategic thinking: ability to choose technologies and architectural approaches based on long-term goals rather than short-term compromises

- Deploy, operate, and evolve a microservices-based platform running in Kubernetes clusters across AWS, GCP, and on-prem (Rancher)
- Operate and support GPU-based ML inference services (Triton Inference Server, vLLM) deployed on RunPod, Scaleway, and Nebius
- Build and maintain Docker images for all microservices and ensure a stable service lifecycle
- Maintain and scale development and production Kubernetes clusters, and actively participate in deployment debugging, incident investigation, and performance troubleshooting
- Develop, maintain, and evolve custom Helm charts for each service
- Design and operate CI/CD pipelines using GitHub (code and pipelines) and GitLab for on-prem customer deployments
- Ensure platform compliance with SOC 2 requirements and actively contribute to improving security and compliance processes
- Manage cluster access via NetBird VPN, implementing role-based access control using group policies
- Deploy and manage infrastructure using IaC practices with Terraform and Ansible
- Develop and continuously improve observability systems:
  - Grafana & Prometheus for metrics
  - ELK stack for centralized log storage and analysis
- Continuously optimize infrastructure in the areas of IaC, IAM, Observability, and CI/CD
- Work with a technology stack including: Python, Kubernetes, Linux, Docker, GitHub CI/CD, PostgreSQL, ClickHouse, Kafka, Superset, Terraform, Ansible

- Experienced team: Aiphoria was formed by a team of enthusiastic professionals who created award-winning devices, voice assistants, and other AI-driven products for BigTech corporations.
- Cutting-edge technologies: we build technology using our areas of expertise, including Computer Vision, Speech Technologies, Natural Language Understanding, and Generative AI, including LLMs and diffusion models.
- Rapid career progression, facilitated by our team of seasoned senior professionals who come from prestigious, industry-leading companies.
- Remote work opportunities.
- Prominent clients, with an opportunity for you to work on different projects and/or to be involved in developing our own proprietary products.
- Competitive compensation surpassing market standards.
- A company with entrepreneurial spirit. We offer a unique mix of a secure workplace, thanks to our big clients, and a true start-up culture!