Dom.Vin
October 7, 2025

Got to really dig into GDPval from OpenAI

Previous AI evaluations, like challenging academic tests and competitive coding challenges, have been essential for pushing the boundaries of model reasoning, but they often fall short of the kind of tasks many people handle in their everyday work.

It’s an incredible dataset. Looking at the range of tasks, it’s hard not to be a little blown away. You can see the full prompts and input files over on Hugging Face. The obvious design questions are around where these prompts are executed and what user interface, if any, will wrap them.

Important caveat:

The current version of the evaluation is one-shot, so it doesn’t capture cases where a model would need to build context or improve through multiple drafts—for example, revising a legal brief after client feedback or iterating on a data analysis after spotting an anomaly.

I suspect this is where the real value will be tested: iterative tasks that don’t start from simple, clean, consistent input files. Looking forward to the next version.