Data Scientist — LLM Evaluation
Full-time · Mid-level
About this role
Design and implement evaluation frameworks for large language models. Build benchmarks, run experiments, and measure model quality across multiple dimensions.
Your work determines which models ship and which don't.
Requirements
- Strong statistics background, including experience with statistical testing
- Experience with LLM evaluation or NLP benchmarking
- Proficiency in Python