Software Engineer (Go, Kubernetes)
About the Role

We are seeking a Software Engineer (L3/L4) to help build scalable infrastructure and platform services for next-generation AI/ML workflows. This role focuses on distributed systems, Kubernetes-native services, observability, and performance optimization across GPU and ARM-based environments.

The ideal candidate has strong Golang development experience, deep familiarity with cloud-native infrastructure, and exposure to AI/ML platforms, CUDA acceleration, or GPU-enabled workloads.

Responsibilities

- Design, develop, and maintain backend platform services using Golang.
- Build and optimize Kubernetes-native infrastructure supporting AI/ML training and inference workloads.
- Develop scalable systems for orchestration, scheduling, telemetry, monitoring, and observability.
- Improve reliability, performance, and scalability of distributed systems running GPU-intensive applications.
- Work on infrastructure automation, deployment pipelines, and containerized environments.
- Integrate observability tooling including metrics, logging, tracing, and alerting frameworks.
- Collaborate with AI/ML engineers to support model training, inference, and workflow orchestration.
- Optimize infrastructure for CUDA-enabled GPUs and ARM-based compute platforms.
- Troubleshoot production issues across distributed environments and improve system resiliency.
- Contribute to architecture discussions, technical design reviews, and engineering best practices.

Required Qualifications

Software Engineer Level 3

- 3+ years of software engineering experience.
- Strong programming experience in Golang.
- Hands-on experience with Kubernetes and containerized applications.
- Experience building distributed systems or cloud-native backend services.
- Familiarity with observability tools such as Prometheus, Grafana, OpenTelemetry, ELK, or similar platforms.
- Experience with Linux systems and cloud infrastructure.
- Understanding of CI/CD pipelines and infrastructure automation.

Software Engineer Level 4

- 5+ years of software engineering experience.
- Proven experience designing and scaling distributed platform infrastructure.
- Strong expertise in the Kubernetes ecosystem and production-grade cloud-native systems.
- Experience leading technical initiatives or owning major platform components.
- Strong debugging, performance tuning, and systems optimization skills.
- Ability to mentor engineers and drive architectural decisions.

Preferred Qualifications

- Experience with AI/ML frameworks such as PyTorch, TensorFlow, JAX, Ray, Kubeflow, or Triton Inference Server.
- Exposure to GPU computing, CUDA, NCCL, or distributed training infrastructure.
- Experience supporting AI/ML workflows in production environments.
- Familiarity with ARM architectures and performance optimization.
- Knowledge of high-performance computing (HPC) or large-scale inference/training systems.
- Experience with service mesh, networking, and Kubernetes operators/controllers.
- Exposure to cloud platforms such as AWS, GCP, or Azure.

Technical Skills

- Languages: Golang, Python, Bash
- Platforms: Kubernetes, Docker, Linux
- Observability: Prometheus, Grafana, OpenTelemetry, Loki, Jaeger
- AI/ML: PyTorch, TensorFlow, CUDA, GPU orchestration
- Infrastructure: Helm, Terraform, GitOps, CI/CD
- Cloud: AWS, GCP, Azure

Nice to Have

- Experience with multi-cluster Kubernetes environments.
- Familiarity with MLOps pipelines and model lifecycle management.
- Contributions to open-source cloud-native or AI infrastructure projects.
- Experience building internal developer platforms or platform engineering tooling.