Senior AI-Native DevOps / Operations Engineer (AMER)

About Valency

Valency Systems is a small, dynamic team of engineers, scientists, and researchers building the global hub for the agentic research era. We're based in Berkeley, California, and we're building something that matters. If you care about open science, advancing research at the speed of thought, and using AI to accelerate discovery, we'd love to talk.

Our team is hybrid. We come together in person 3 days a week, with the option for 2 days of remote work.

The Position

We're hiring an AI-native DevOps / Operations Engineer to help build and operate the platform behind Valency. This is not a narrow infrastructure maintenance role. We want builders who can design and harden production systems, improve CI/CD and release quality, raise reliability and response times, and create the observability, analytics, and guardrails needed to safely operate a rapidly evolving platform.

This role sits at the intersection of platform engineering, cloud infrastructure, production operations, and AI-era software delivery. You will help close the loop from agentically written software to reliable, performant systems in production. That means better tests, better release controls, stronger guardrails, richer production telemetry, clearer workflows for human approval, and tighter feedback into product and engineering.

This is an especially strong fit for someone who has helped scale high-growth SaaS systems, likes building from first principles, and wants to experience that kind of growth again in a new context.

What You'll Own

- Design, build, and improve the production platform powering Valency
- Tighten CI/CD processes so changes are tested, gated, observable, and safe to ship
- Improve production reliability, latency, deployment safety, and incident response
- Build the operational feedback loops that help engineering and product teams act on real production behavior
- Establish the right logging, analytics, tracing, alerting, and workflow instrumentation as the platform scales
- Define and implement guardrails for agent-involved software delivery and operations
- Introduce human-in-the-loop approval flows where autonomy needs stronger controls
- Improve cost efficiency across cloud infrastructure and platform operations
- Help shape security, compliance, and auditability foundations for SOC 2, ISO 27001, and FedRAMP-oriented environments
- Contribute to the long-term platform engineering direction as the team grows and specializes

As the senior engineer on-site, you will:

- Own production operations and operational excellence for this function
- Lead incident response expectations for the role
- Establish the operating model the broader team will scale on
- Work onsite in the SF Bay Area

What Success Looks Like

In the first 6-12 months, you will help Valency begin tracking and materially improve:

- Deployment frequency and release confidence
- Change failure rate and rollback quality
- MTTR and incident handling
- p95 / p99 latency and system responsiveness
- Uptime and service reliability
- Alert quality and signal-to-noise ratio
- Infrastructure cost efficiency
- Operational visibility into agent workflows and production behavior
- Guardrail coverage for agent-authored or agent-assisted changes

What You'll Work With

Today the platform makes use of AWS and adjacent infrastructure including:

- ECS / Fargate
- EKS / container orchestration environments
- RDS
- S3
- Cloudflare
- CloudWatch
- Queues, caches, schedulers, and batch / background processing systems

We currently use GitHub Actions and expect this person to help evolve that into a stronger long-term platform engineering and delivery foundation.

Our observability and analytics stack is still open for innovation. We want someone who is comfortable evaluating the tradeoffs and building the right system as complexity grows.

What Makes This Role AI-Native

This is not "DevOps, but with AI in the title." You will help build the operational system around software and workflows that increasingly involve agents. That includes:

- Tracing workflows across agent-driven and human-driven systems
- Developing production guardrails to keep systems from going off the rails
- Designing approval paths for high-risk or high-impact actions
- Turning production signals into actionable inputs for product and engineering
- Helping close the loop between what the system is doing, how users experience it, and how the platform should evolve

We do not require prior experience operating AI-native systems at scale. We do require strong judgment, strong production systems experience, and a willingness to build the right AI-era operating model.

Responsibilities

- Own and improve CI/CD pipelines, release controls, and deployment workflows
- Build and maintain highly reliable AWS-based production systems
- Improve observability across logs, metrics, traces, events, and workflow state
- Instrument platform behavior so system issues, regressions, and slowdowns are quickly visible and actionable
- Create operational analytics that help close the loop between engineering, product, and customer experience
- Drive cost engineering and infrastructure efficiency as the system scales
- Build safer operating patterns for agent-assisted code changes and operational actions
- Implement testing, validation, approval, and rollback mechanisms that reduce operational risk
- Improve batch, queue, cache, and job-processing reliability and monitoring
- Support incident response, root cause analysis, postmortems, and follow-through
- Partner with external vendors and partners when needed
- Help define platform standards, reliability practices, and operational maturity across the company

What We're Looking For

Required

- 8+ years of progressively increasing responsibility operating important production systems
- Demonstrated success shipping and running high-reliability systems in production
- Deep AWS experience in real production environments
- Strong background in software engineering and testing, not just infrastructure administration
- Experience designing or significantly improving CI/CD systems and release processes
- Experience building or operating logging, monitoring, alerting, and observability systems
- Experience improving production reliability, performance, and operational response
- Comfort with container-based systems and orchestration platforms
- Strong hands-on ability in at least some of: Python, Go, Elixir, CDK
- Strong judgment around guardrails, operational safety, and change management
- Ability to work in ambiguity and build systems that do not yet fully exist

Strongly Preferred

- Startup experience, especially in fast-scaling environments
- Experience at high-scale SaaS companies that have gone through periods of rapid growth
- Experience owning or materially influencing platform engineering functions
- Experience with cost engineering / FinOps in AWS-heavy environments
- Experience designing systems for compliance-oriented environments
- Experience with SOC 2, ISO 27001, or FedRAMP-related operational requirements
- Experience evaluating or implementing modern observability and workflow tracing stacks
- Experience creating human-in-the-loop approval systems for sensitive production workflows

Why This Role

- You will help define how an AI-native research platform is actually operated in production
- You will work on systems that connect agents, researchers, product behavior, and infrastructure reality
- You will have broad scope across infrastructure, reliability, analytics, and operational guardrails
- You will help build the production foundation for a category-defining company at an early stage
- You will not inherit a frozen stack; you will help choose and build the right one

Compensation, Benefits & Equity

We offer a competitive salary, benefits, and meaningful equity in a company building something important from the ground up.

Work Authorization: Candidates must be legally authorized to work in the United States.