Systems and Infrastructure Engineer

Key Responsibilities

Deploy & Maintain Bare Metal and Virtual Hosts: Provision, configure, and manage physical hardware (servers, storage, switches) and virtual servers across global labs and data centers (on-prem and hybrid cloud) for high-availability workloads.
Own Enterprise Storage Infrastructure: Design, implement, and optimize SAN/NAS storage (e.g., Dell EMC, NetApp) for global R&D teams, including replication, disaster recovery, and performance tuning across time zones.
Manage Global MDM Workflows: Implement and scale mobile device management (MDM) tooling (e.g., Intersos, Jamf) for globally dispersed secure labs, ensuring compliance and device lifecycle management.
Automate Infrastructure Operations: Reduce manual effort through infrastructure-as-code (Terraform, Ansible) and CI/CD pipelines for bare metal and virtual machine provisioning, patching, and global inventory management. Network-focused automation is out of scope for this role.
Diagnose Critical Infrastructure Incidents: Resolve physical hardware failures (server crashes, storage corruption) and global infrastructure outages (e.g., data center downtime) within SLAs; lead post-mortems for R&D teams.
Optimize Global Infrastructure Performance: Tune bare metal and cloud workloads (e.g., Kubernetes, HPC) for low-latency operation using monitoring tools (PRTG, Datadog, Prometheus) and storage performance analysis (e.g., I/O latency, throughput).

Required Qualifications & Experience

5+ years of enterprise systems infrastructure experience managing bare metal and virtual servers (physical hardware deployment, OS installation, patching) in global environments (on-prem, hybrid cloud).
Direct hands-on experience with enterprise storage (SAN/NAS) for global workloads (e.g., data replication across regions, high-availability storage clusters).
Proven ability to automate bare metal and virtual operations using IaC (Terraform, Ansible) and scripting (Python, Shell) for server provisioning, inventory tracking, and global infrastructure scaling.
Incident investigation and diagnosis: Experience diagnosing physical infrastructure incidents (server hardware failures, storage corruption, network connectivity loss) in multi-geo environments (e.g., Tokyo, Singapore, US East).
Global infrastructure deployment experience: Managed infrastructure across 3+ time zones with cross-team coordination (R&D, engineering, security), including management and integration of identity and authorization (AuthZ) across cloud and corporate environments.
MDM implementation expertise: Deployed and maintained mobile device management for enterprise-scale device fleets (10K+ endpoints).
Operational monitoring proficiency: Used tools such as PRTG, Datadog, and Prometheus to track server and storage health, uptime, and performance metrics globally.
Hybrid infrastructure knowledge: Experience with combined bare metal and cloud environments (e.g., AWS EC2 instances with and without virtualization, GCP bare metal nodes).
Document & Scale Knowledge: Create runbooks, architecture diagrams, and operational guides for consistency.

Preferred Qualifications & Experience

R&D Hardware Agility: Experience deploying bare metal infrastructure for AI/ML workloads (e.g., optimizing storage for GPU clusters, low-latency data pipelines), isolated R&D environments, malicious environments (malware, honeypots, etc.), or temporary environments for forensic research.
Global Hardware Operations: Proven success in documenting processes for non-English-speaking teams around the world (e.g., APAC/NA).
HPC Storage Optimization: Experience tuning storage performance for high-performance computing (HPC) environments (e.g., reducing I/O latency for 10K+ nodes).
Certifications: AWS Certified SysOps Administrator - Associate (SOA-C02), Azure Fundamentals (AZ-900), or CompTIA Server+ (with hardware deployment focus).