Bookmark: Evaluating LLMs is a minefield

lqdev👽 12/11/2023

https://www.cs.princeton.edu/~arvindn/talks/evaluating_llms_minefield/

...many things can go wrong when we are trying to evaluate LLMs’ performance on a certain task or behavior in a certain scenario.

It has big implications for reproducibility: both for research on LLMs and research that uses LLMs to answer a question in social science or any other field.

Permalink: /feed/evaluating-llms-minefield/

Tags: #ai #llms #evaluation
