Dom.Vin
AI Design Journal

Got to really dig into GDPval from OpenAI:

Previous AI evaluations like challenging academic tests and competitive coding challenges have been essential in pushing the boundaries of model reasoning capabilities, but they often fall short of the kind of tasks that many people handle in their everyday work.

It’s an incredible dataset. Looking at the range of tasks, it’s hard not to be a little blown away. You can see the full prompts and input files over on Hugging Face. The obvious design questions are around where these prompts are executed and what user interface will wrap them. If any.
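
If you want to poke at the tasks yourself, here's a quick sketch using the Hugging Face datasets library. I'm assuming the hub id is openai/gdpval and the default split is "train"; verify both, and the field names, against the dataset card.

```python
# A quick way to browse the GDPval tasks locally. Assumptions: the dataset
# id is openai/gdpval and the default split is "train"; check the dataset
# card for the exact id and schema before relying on either.
from datasets import load_dataset

tasks = load_dataset("openai/gdpval", split="train")
print(tasks.column_names)  # inspect the schema first
print(tasks[0])            # one full task record: prompt plus file references
```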

Important caveat:

The current version of the evaluation is also one-shot, so it doesn’t capture cases where a model would need to build context or improve through multiple drafts—for example, revising a legal brief after client feedback or iterating on a data analysis after spotting an anomaly.

I suspect this is where the real value will be tested: over iterative tasks that aren’t powered by simple, clean, consistent input files. Looking forward to the new version.

AI is not conscious. Or maybe it is and we’re not. I’m not sure it’s a particularly interesting distinction.

I do find it interesting to frame AI as an aggregation of our collective consciousness. The words of history, the ideas of today, the iterative sum of a network of interwoven cultures, synthesized down into a shared identity. That’s kinda cool.

Can AI finally clean my inbox? by Azeem Azhar:

In 1954, Dwight Eisenhower articulated a truth that most of us live viscerally: “What is important is seldom urgent and what is urgent is seldom important” — the ringing telephone, the knock on the door, the crisis meeting that devoured afternoons earmarked for deep thinking. Other people’s priorities have always had a peculiar talent for masquerading as emergencies.

Most of us spend so little time in the important-not-urgent quadrant that we may have lost the ability to articulate what is truly important to us.

AI learned how to chat, so OpenAI built a chat interface. Then it learned how to act; who’s going to build an act interface?

The Jetsons Dream, by my friend and colleague David Geere, makes the case for redefining our first principles in response to the cataclysmic technological change that’s underway:

The Jetsons premiered in 1962 with chatty appliances, airborne commutes, and a house-bot named Rosie tidying life’s loose ends. Six decades later we’re still hunched over rectangles, arguing with voice assistants that forget what we said two sentences ago.

Amongst the industry hum of Jony and Sam, he sketches out a fork in the road for both Apple and OpenAI:

Here’s what I think is really happening. Sam Altman sees himself not just as the next Steve Jobs, but as something more — the person who finally delivers on technology’s broken promises. He’s got the AI. He’s got the ambition. And now he’s got the designer who made the last revolution beautiful.

Things are aligning. The iPhone moment in AI is coming.

AI Horseless Carriages by Pete Koomen:

Whenever a new technology is invented, the first tools built with it inevitably fail because they mimic the old way of doing things. “Horseless carriage” refers to the early motor car designs that borrowed heavily from the horse-drawn carriages that preceded them.

Whenever paradigms shift this hard and this fast, there will be an inevitable lag while the industry waits for visionaries to redefine its first principles. Pete makes a compelling case that we can mitigate some of this lag by exposing more of the inner configuration of these new agentic systems and giving users more control.

My core contention in this essay is this: when an LLM agent is acting on my behalf I should be allowed to teach it how to do that by editing the System Prompt.

Most AI apps should be agent builders, not agents.
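
Pete grounds this in email drafting, and the idea is easy to make concrete. Here's a minimal sketch of an agent-builder shape, assuming a hypothetical email assistant where the system prompt is user-editable state rather than a constant buried in the app. All the names here are invented for illustration.

```python
# A minimal sketch of the "agent builder" idea for a hypothetical
# email-drafting assistant: the system prompt is data the user owns
# and edits, not a hard-coded string. File and function names are invented.
import json
from pathlib import Path

PROMPT_FILE = Path("my_email_agent.json")

DEFAULT_SYSTEM_PROMPT = (
    "Draft replies in my voice: brief, direct, no corporate filler. "
    "Never commit me to a meeting without flagging it for review."
)

def load_system_prompt() -> str:
    """The user's own instructions, persisted across sessions."""
    if PROMPT_FILE.exists():
        return json.loads(PROMPT_FILE.read_text())["system_prompt"]
    return DEFAULT_SYSTEM_PROMPT

def save_system_prompt(prompt: str) -> None:
    """Called from a settings screen: the user edits, the agent obeys."""
    PROMPT_FILE.write_text(json.dumps({"system_prompt": prompt}))

def draft_reply_messages(email_body: str) -> list[dict]:
    """Assemble the messages the app would send to any chat-completions API."""
    return [
        {"role": "system", "content": load_system_prompt()},
        {"role": "user", "content": f"Draft a reply to:\n\n{email_body}"},
    ]
```

In this framing the settings screen becomes the real product surface: edit the prompt and you’ve rebuilt the agent.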

The distinction between developer-land and user-land is blurring. Kent Beck’s take from a couple of years ago expressed the anxiety that developers are facing:

The value of 90% of my skills just dropped to $0. The leverage for the remaining 10% went up 1000x. I need to recalibrate.

We all need to recalibrate.

GPT-4.1 Prompting Guide from OpenAI:

We expect that getting the most out of this model will require some prompt migration. GPT-4.1 is trained to follow instructions more closely and more literally than its predecessors, which tended to more liberally infer intent from user and system prompts.
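
To make the migration concrete, here's a minimal sketch using OpenAI's Python client. The before-and-after prompts are my own invented illustration of the shift from implied to spelled-out instructions, not examples taken from the guide.

```python
# A sketch of "prompt migration": GPT-4.1 follows instructions more
# literally, so prompts that leaned on inferred intent need to spell
# things out. Both prompts below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

# A prompt that relied on the previous model inferring intent:
legacy_system_prompt = "Summarise customer emails."

# A migrated prompt that makes the previously implied behaviour explicit:
migrated_system_prompt = (
    "Summarise each customer email in exactly three bullet points. "
    "If the email contains a complaint, start the first bullet with "
    "'COMPLAINT:'. Do not include any text outside the bullet points."
)

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": migrated_system_prompt},
        {"role": "user", "content": "Hi, my order arrived two weeks late..."},
    ],
)
print(response.choices[0].message.content)
```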

The One-Shot Paradigm with Agents by Helena Zhang:

Over the past two months, we have studied the designs behind powerful AI agents like Cursor and Claude Code. These tools have created new ways for AI to interact with codebases.

It’s extraordinary how tools like Cursor have embedded themselves so centrally into so many devs’ workflows without many of us having a clear understanding of how they actually work.
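
For what it’s worth, the core pattern is simpler than the mystique suggests. Here's a toy sketch of the tool-use loop these agents are built on; a generic illustration, not Cursor’s or Claude Code’s actual implementation, with model_step standing in for whatever LLM call drives the loop.

```python
# A toy agent loop: the model proposes a tool call, the harness executes
# it, and the result is fed back until the model decides it is done.
# Generic illustration only; all names here are invented.
from pathlib import Path

def read_file(path: str) -> str:
    return Path(path).read_text()

def write_file(path: str, content: str) -> str:
    Path(path).write_text(content)
    return f"wrote {len(content)} chars to {path}"

TOOLS = {"read_file": read_file, "write_file": write_file}

def run_agent(model_step, task: str, max_turns: int = 10) -> str:
    """model_step: any callable mapping the transcript to the next action."""
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = model_step(transcript)  # in practice, an LLM API call
        if action["type"] == "finish":
            return action["summary"]
        result = TOOLS[action["tool"]](**action["args"])
        transcript.append({"role": "tool", "content": result})
    return "ran out of turns"
```

In the real tools, model_step is a frontier-model API call with tool definitions attached; the interesting design decisions are in what goes into TOOLS and what the transcript is allowed to contain.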

Chatbot Arena:

Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating the alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences.

Check out the Prompt to Leaderboard feature. Very cool.
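
Under the hood, arena-style leaderboards turn pairwise human votes into ratings. Chatbot Arena has used Elo-style and Bradley-Terry models for this; the toy Elo update below is a minimal sketch of the idea, not their exact pipeline.

```python
# A toy Elo update: one human preference vote nudges two model ratings
# apart. Illustrative of arena-style ranking, not Chatbot Arena's code.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return new ratings after one pairwise vote; changes are zero-sum."""
    e_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - e_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - e_a))
    return rating_a, rating_b

# Two models start level; a stream of votes moves them apart.
ratings = {"model-a": 1000.0, "model-b": 1000.0}
for a_won in [True, True, False, True]:  # hypothetical vote log
    ratings["model-a"], ratings["model-b"] = update(
        ratings["model-a"], ratings["model-b"], a_won
    )
print(ratings)
```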

The Bitter Lesson: Rethinking How We Build AI Systems:

Recently, I was tending to my small garden when it hit me - a perfect analogy for this principle. My plants don’t need detailed instructions to grow. Given the basics (water, sunlight, and nutrients), they figure out the rest on their own. This is exactly how effective AI systems work.

When we over-engineer AI solutions, we’re essentially trying to micromanage that plant, telling it exactly how to grow each leaf. Not only is this inefficient, but it often leads to brittle systems that can’t adapt to new situations.

This beautifully articulates something I’ve been wrestling with for a while. What’s the best optimisation strategy for refining agentic experiences? To what extent can throwing compute at a problem solve for complexity?

Ankit is really onto something here, wonderful framing.