Another Crazy Day in AI: How to Know If Your AI Model Will Succeed — or Fail
- Wowza Team
- May 13
- 4 min read

Hello, AI Enthusiasts.
Made it through the day? Here’s a little something to wind down with that isn’t more email replies.
Microsoft researchers say it’s not enough to know how AI performs—we need to know why. Their new system aims to predict AI behavior across tasks and make evaluation more human-readable.
Meanwhile, cybersecurity just got bumped from the #1 spot in tech budgets. Generative AI has officially taken the wheel.
And while we’re on the topic of who’s in the driver’s seat… economists are calling out the U.S. auto industry for stalling where it should’ve sped up.
That’s the latest. You’re (almost) up to speed.
Here's another crazy day in AI:
Microsoft advances AI evaluation methods
Generative AI now a bigger budget item than cybersecurity
The economics behind America’s auto market
Some AI tools to try out
TODAY'S FEATURED ITEM: Predicting AI Performance Before Deployment

Image Credit: Wowza (created with Ideogram)
What if we could predict AI success before testing?
As AI systems take on increasingly complex and critical roles, simply knowing whether a model performs well is no longer enough. We need to understand why it performs the way it does — and anticipate how it might behave on new, unfamiliar tasks.
In a recent post on the Microsoft Research Blog, Lexin Zhou and Xing Xie share insights from their groundbreaking study, “General Scales Unlock AI Evaluation with Explanatory and Predictive Power.” Supported by Microsoft’s Accelerating Foundation Models Research (AFMR) program, this research proposes a new way to evaluate AI models that goes beyond surface-level scores — aiming instead to predict future performance and explain results in human terms. The full study is co-authored by a broad team of researchers across Microsoft, Cambridge, and international institutions.
Instead of relying on traditional benchmarks, the team created ADeLe — an ability-based evaluation system that assesses what a task demands cognitively and compares it to a model’s skillset.
A closer look at what the study introduces:
A framework grounded in 18 cognitive and knowledge-based scales, adapted from human-centered assessment practices
A method for rating task difficulty and linking it to a model’s capabilities through structured evaluation
Profiles of AI model “abilities” that help explain how — and why — certain systems perform better on specific tasks
An evaluation of 15 large language models, highlighting patterns in reasoning, abstraction, and subject knowledge
Findings that challenge the completeness of some popular benchmarks, many of which test only narrow difficulty ranges or mix in unrelated demands
A predictive system that forecasts model performance on new tasks with around 88% accuracy
The ADeLe framework brings something new to the table: a way to make sense of AI performance in terms we can interpret. Rather than averaging a model’s score across a benchmark and calling it done, this approach builds a fuller picture — one that recognizes the range of skills a task might require and matches them with what the model is actually equipped to do. That context matters, especially when these systems are being used in more complex or consequential environments.
The team’s work doesn’t aim to replace traditional benchmarks, but to strengthen how we understand them. In doing so, it opens up new possibilities for more thoughtful evaluation — not just comparing models against one another, but assessing whether any given model is suited for a specific use case. As AI systems continue to scale and diversify, that kind of clarity could help researchers and practitioners make better decisions about development, safety, and deployment.
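For readers who like to see the idea in code, here is a rough, purely illustrative sketch of the intuition behind ability-based prediction: rate a task's demands on a few cognitive scales, compare them to a model's estimated ability levels, and turn the gap into a predicted chance of success. The scale names, numbers, and scoring formula below are our own toy assumptions for illustration, not the 18 scales or the actual predictor from the Microsoft paper.

```python
from math import exp

# Hypothetical ability profile for a model: for each scale, the demand level
# (0-10) at which we assume the model's success rate falls to about 50%.
model_abilities = {"reasoning": 6.5, "abstraction": 5.0, "domain_knowledge": 7.0}

# Hypothetical demand ratings for a new, unseen task on the same scales.
task_demands = {"reasoning": 7.0, "abstraction": 4.0, "domain_knowledge": 5.5}

def success_probability(abilities, demands, slope=1.0):
    """Toy estimate of P(success): each scale contributes a logistic term that
    is close to 1 when ability comfortably exceeds demand and close to 0 when
    demand far exceeds ability; the terms are multiplied together."""
    p = 1.0
    for scale, demand in demands.items():
        margin = abilities[scale] - demand
        p *= 1.0 / (1.0 + exp(-slope * margin))
    return p

print(f"Predicted chance of success: {success_probability(model_abilities, task_demands):.2f}")
```

The point of the sketch is only to show why this kind of profile is more interpretable than a single benchmark average: you can see which scale (here, the reasoning demand slightly above the model's level) is dragging the prediction down.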
Read the full blog here.
Read the full paper here.
OTHER INTERESTING AI HIGHLIGHTS:
Generative AI Now a Bigger Budget Item Than Cybersecurity
/John K. Waters, on Campus Technology
A new global survey by AWS finds that generative AI has leapfrogged cybersecurity in 2025 tech budget priorities. With 90% of organizations now exploring or deploying generative AI, the technology is moving from experimental to essential. While security still plays a vital role in AI governance, the shift highlights how companies are racing to integrate generative tools into core workflows. Notably, nearly half of respondents are already using generative AI in production environments.
Read more here.
The Economics Behind America’s Auto Market
/Andrey Fradkin and Seth Benzell, on Substack at Empiricrafting
Is the U.S. auto industry serving consumer welfare — or corporate margins? In this podcast episode of Justified Posteriors, economists Seth Benzell and Andrey Fradkin dig into new research analyzing decades of pricing, competition, and innovation in the American auto sector. Their conversation raises questions about market power, pricing strategies, and whether more competition could have accelerated progress. It’s a data-driven reflection on how structure shapes outcomes in major industries.
Read more here.
SOME AI TOOLS TO TRY OUT:
DeckSpeed – Create personalized slides from conversations—no templates needed.
rehearsal.so – Practice real conversations with AI and compete with friends.
FirstQuadrant – Automates your sales process, from follow-ups to closing.
That’s a wrap on today’s Almost Daily craziness.
Catch us almost every day—almost! 😉
EXCITING NEWS:
The Another Crazy Day in AI newsletter is on LinkedIn!!!

Leveraging AI for Enhanced Content: As part of our commitment to exploring new technologies, we use AI to help curate and refine our newsletters. This enriches our content and keeps us at the forefront of digital innovation, ensuring you stay informed with the latest trends and developments.