Site Reliability Engineer

Ifg International Financial Group | San Jose, CA | April 23rd, 2026

Computer Systems Engineers/Architects | Land Subdivision

Job Title: Senior Azure Infrastructure & Cost Optimization Engineer
Location: Onsite: Redmond, WA / Bay Area, CA
Contract on T4 Top IT Firm
Duration: 6 months (through end of June) + extension potential

All potential applicants are encouraged to read the complete job description before applying.

What This Role Actually Is

This is not a general cloud engineering role. It is a specialized FinOps and Azure infrastructure optimization role focused on identifying cost inefficiencies and landing measurable savings across high‑impact Azure services. The team is specifically targeting 5× savings in Cosmos DB and meaningful reductions across compute, Redis, storage, event streaming, and Postgres.

Core Responsibilities

1. Azure Cost & Utilization Analysis
- Analyze spend patterns using Azure Advisor and Azure Cost Management + Billing.
- Identify top cost drivers, inefficiencies, and waste.
- Build clear baseline → root cause → recommendations → savings forecast reports.

2. Optimize High‑Impact Azure Services (Primary Focus Areas)
You will work hands‑on to tune and rightsize:
- Cosmos DB: RU/s optimization; indexing & partitioning strategy; autoscale vs. manual throughput; TTL, point‑read usage, query efficiency. Target: up to 5× cost reduction.
- Compute (VMs / VMSS / AKS): right‑sizing vCPU/RAM; node pool strategy; limits/requests cleanup; HPA/VPA evaluation; Reservations, Spot, Savings Plans.
- Azure Cache for Redis: SKU sizing, eviction policies, memory pressure tuning.
- Storage (Blobs, Files): tiering, lifecycle management, retention optimization.
- Event Hubs: throughput units, consumer efficiency, capture strategy.
- Azure Database for Postgres: instance sizing, query tuning, autovacuum/I/O alignment.
These represent the largest and most addressable cost categories.

3. Microservices & PaaS Footprint Review
- Evaluate microservices for idle workloads, over‑provisioned instances, unnecessary persistence layers, and excessive replicas.
- Identify opportunities to consolidate or scale down.

4. Capacity Planning & Scaling with Engineering Leads
- Work with leads to define realistic SLO‑aligned capacity baselines.
- Optimize autoscaling triggers and eliminate chronic over‑allocation.
- Ensure sustainable long‑term cost governance.

5. Deliver Clear, Actionable Recommendations
Your output is not academic analysis. It must be:
- Practical
- Quantified
- Prioritized by impact
- Low‑risk to implement
Examples:
- Rightsize VMSS SKU
- Reduce RU/s on least‑used partitions
- Fix Redis memory pressure
- Adjust node pool strategy
- Introduce storage lifecycle policies
- Optimize Postgres IOPS vs. workload

6. Operate Independently After Onboarding
- Own the full investigation → diagnosis → recommendation → validation pipeline.
- Communicate findings clearly to senior technical and non‑technical stakeholders.

Must‑Have Skills (Top 3)
- Azure Infrastructure Cost Optimization & FinOps expertise: strong with Azure Advisor, Cost Management + Billing, reservations, tagging, anomaly detection, dashboards.
- Hands‑on optimization experience with Cosmos DB, Compute (VMs/VMSS/AKS), Redis, Storage, Event Hubs, and Azure Postgres: ability to materially reduce spend across these services.
- Infrastructure performance analysis & capacity/utilization engineering: deep understanding of CPU/memory behavior, microservice patterns, and scaling mechanisms.

Required Experience
- 5+ years with Azure infrastructure at scale.
- Strong understanding of AKS, VMSS, and microservices resource behavior.
- Proven track record of measurable cost savings in large Azure environments.
- Ability to translate complex technical findings into clear action plans.
- Excellent stakeholder communication skills.

Nice to Have
- Terraform / Bicep experience.
- Observability stack knowledge: Azure Monitor, Log Analytics, App Insights.
- JVM/container tuning background for services running on AKS or VMSS.
- Experience implementing FinOps governance practices.

Success Metrics
- Cosmos DB spend reduced by up to 5× without performance regression.
- Compute savings of 30–50% through rightsizing and scale adjustments.
- Storage & data savings of ~25% through lifecycle and tiering.
- Improved governance (tag coverage, budgets, alerting).

Please let me know if this role interests you, and send me your updated resume. Feel free to reach out if you have any questions.

Thanks
