Contract — AI Agent Evaluation
Contract · Mid-level
About this role
Evaluate and benchmark Aaru's synthetic research agents. Design evaluation protocols, run A/B tests, and measure agent accuracy against human researchers.
Three-month contract with the possibility of extension. Remote-friendly.
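As one illustration of the kind of analysis this role involves (a sketch only, not Aaru's actual evaluation protocol — the sample counts below are hypothetical), an A/B comparison of agent vs. human accuracy can be run as a two-proportion z-test:

```python
# Illustrative sketch: two-proportion z-test comparing agent accuracy
# against a human-researcher baseline on a shared question set.
# Numbers are hypothetical, not real benchmark results.
from math import sqrt, erf

def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """Return (z, two-sided p) for H0: both groups have equal accuracy."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)   # pooled accuracy under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail
    return z, p

# Hypothetical example: agents answer 172/200 items correctly, humans 158/200.
z, p = two_proportion_z(172, 200, 158, 200)
```

In practice a library routine (e.g. from statsmodels or SciPy) would replace the hand-rolled test; the stdlib-only version is shown here just to keep the sketch self-contained.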
Requirements
Experience with LLM evaluation or user research. Strong statistical analysis skills. Proficiency in Python. Available to start immediately.