Remote Senior Software Engineer, AI Evaluation & Benchmarks (Python)

G2i | Houston, TX | Remote | June 21st, 2026

Before Applying

This role is open to contractors in accepted locations only. Please confirm your country is on the list before applying; we're unable to process applications from unlisted locations. List of accepted countries and locations.

For US applicants: This is a 1099 independent contractor role. It is not compatible with F-1 OPT, STEM OPT, or any visa status that requires W-2 employment, guaranteed hours, or employer sponsorship. We are unable to provide offer letters or employment verification for this role.

What You'll Be Doing

Design and build the coding benchmarks and evaluation pipelines used to test frontier AI models on real software engineering work:

- Design coding benchmarks that evaluate frontier models on real-world programming tasks: reasoning, debugging, and production-quality code
- Build and maintain scalable data pipelines for evaluation workflows
- Analyze model-generated code for correctness, reliability, and edge-case failures
- Construct structured evaluation scenarios across large repos and multi-language environments
- Provide detailed technical feedback on model performance and failure patterns
- Contribute to evaluation frameworks that set the bar for how coding ability is measured

End result: benchmarks that meaningfully separate what frontier models can and can't do, and shape how the next generation is trained and improved.

AI coding evaluation in one line: Design task → build harness → run model → analyze failures → feed findings back into the benchmark → evaluations that actually distinguish strong models from weak ones.

What You'll Need

- 4+ years of professional software engineering experience (non-negotiable)
- Expert Python: clean, performant, well-tested code
- Hands-on experience working in large, complex codebases
- Proven experience designing and implementing LLM coding benchmarks and evaluation data pipelines
- Strong command of Git and modern development workflows
- Track record at a high-growth tech company or top-tier software organization
- Strong written English communication

Identity verification: Applicants will be required to verify their identity and confirm they have valid documentation to work as an independent contractor in their country of residence.

Nice to have

- Senior or Lead-level profile with a history of technical ownership
- Bachelor's or Master's in CS, ML, or related field (or equivalent professional experience)
- Proficiency in additional languages: JavaScript, Go, C++, or others
- CI/CD experience and writing robust unit tests (pytest, Mocha, JUnit)
- Background in security engineering or significant open-source contributions
- Familiarity with AI/ML evaluation methodologies or model benchmarking

Logistics

- Location: Fully remote; work from anywhere on the accepted locations list
- Compensation: $80–$100/hr based on location and seniority
- Contract length: 3 months, with potential for extension
- Hours: Full-time availability preferred; hours vary by project and are not guaranteed week to week
- Engagement: 1099 independent contractor
- Payment: Weekly via PayPal or Stripe

⚠️ Important: Hours are project-dependent and can vary week to week. We recommend keeping other work options open alongside this engagement rather than relying on it as your sole source of income.
