
Rethinking AI Intelligence: Insights from Melanie Mitchell

Published 4 December, 2025

The methods used to evaluate artificial intelligence (AI) are under scrutiny, according to computer scientist and professor Melanie Mitchell, who delivered a keynote address at the NeurIPS conference in December 2025. In her talk, titled “On the Science of ‘Alien Intelligences’: Evaluating Cognitive Capabilities in Babies, Animals, and AI,” Mitchell argues that current methods for assessing AI lack rigor and relevance, and she calls for new approaches informed by developmental and comparative psychology.

Mitchell, known for her book Artificial Intelligence: A Guide for Thinking Humans, emphasizes that intelligence is a multifaceted concept. Researchers discussing AI capabilities often have different aspects in mind, such as reasoning, abstraction, or world modeling. She prefers the term “cognitive capabilities,” which she sees as a more precise way to state what an evaluation is actually measuring.

One of Mitchell's key points is the inadequacy of the benchmarks used to assess AI systems. Traditional evaluations typically run a system on a fixed set of tasks and report its accuracy. Many current AI systems excel on these benchmarks, but she notes that high scores do not necessarily translate into effective performance in real-world applications. For instance, excelling in an examination does not guarantee that an AI will perform well as a lawyer or in other practical settings.

Mitchell points out that the methodologies used in psychological research can provide valuable insights for AI evaluation. She argues that AI research has largely overlooked experimental methodologies that have been developed to study nonverbal agents, such as infants and animals. These methods often involve controlled experiments and variations in stimuli to explore robustness and failure modes, which can yield deeper insights than success alone.
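To make the idea of varying the stimuli concrete, here is a minimal Python sketch. It does not come from Mitchell's talk: the addition task, the three prompt variants, and the two toy "models" (`shortcut_model` and `parsing_model`) are hypothetical stand-ins invented for illustration. The sketch scores each system on the original benchmark prompts and on reworded or reordered versions of the same problems, so that a system relying on surface form stands out.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical toy benchmark: addition problems given as operand pairs.
PROBLEMS: List[Tuple[int, int]] = [(2, 3), (4, 1), (7, 5), (6, 9)]

# Each variant renders the same underlying problem with a different surface form.
VARIANTS: Dict[str, Callable[[int, int], str]] = {
    "original": lambda a, b: f"{a} + {b} =",
    "swapped": lambda a, b: f"{b} + {a} =",
    "reworded": lambda a, b: f"What is the sum of {a} and {b}?",
}

# Hypothetical "shortcut" system: it has memorized the exact original prompts.
MEMORIZED = {VARIANTS["original"](a, b): str(a + b) for a, b in PROBLEMS}


def shortcut_model(prompt: str) -> str:
    # Answers only when the prompt matches a memorized string exactly.
    return MEMORIZED.get(prompt, "?")


def parsing_model(prompt: str) -> str:
    # Hypothetical system that extracts the numbers and adds them.
    numbers = [int(n) for n in re.findall(r"\d+", prompt)]
    return str(sum(numbers))


def accuracy_by_variant(model: Callable[[str], str]) -> Dict[str, float]:
    # Score the model separately on every surface variant of the same problems.
    results: Dict[str, float] = {}
    for name, render in VARIANTS.items():
        correct = sum(model(render(a, b)) == str(a + b) for a, b in PROBLEMS)
        results[name] = correct / len(PROBLEMS)
    return results


if __name__ == "__main__":
    for label, model in [("shortcut", shortcut_model), ("parsing", parsing_model)]:
        print(label, accuracy_by_variant(model))
```

Both toy systems score perfectly on the original prompts, so a single accuracy figure cannot tell them apart. Only the varied prompts show that one of them depends on the exact wording.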

Applying Psychological Insights to AI Research

Mitchell cites the example of Clever Hans, a horse that seemed to perform arithmetic by tapping its hoof. A psychologist discovered that the horse was actually responding to subtle facial cues from the questioner rather than demonstrating any numerical understanding. The episode illustrates the importance of skeptical inquiry, a practice she believes is lacking in AI studies.

She also shares insights from research on infants, where initial findings suggested that babies possess an innate moral sense. However, subsequent investigations revealed that the results depended heavily on the framing of the stimuli presented to the babies. This underscores the need for careful experimental design to avoid drawing incorrect conclusions.

Mitchell advocates for a culture of skepticism in AI research, suggesting that researchers should not only critique others’ work but also critically examine their own hypotheses. This mindset is vital for scientific progress and can help refine the understanding of AI capabilities.

The Role of Replication in Scientific Progress

Another significant lesson from psychology that Mitchell believes AI researchers should adopt is the importance of replication. She notes that replicating studies is often undervalued within the AI community. Papers that focus on replication and incremental improvements frequently face criticism for lacking novelty. This attitude, she argues, undermines the scientific method, which relies on building on existing knowledge.

As discussions around artificial general intelligence (AGI) continue to evolve, Mitchell expresses skepticism about the clarity of the term. She suggests that definitions of AGI vary widely, which highlights the difficulty of measuring progress toward such a nebulous goal. Historically, aspirations for AGI included both human-level intelligence and physical capabilities. As research has progressed, however, the focus has shifted toward cognitive aspects, which remain complex and intertwined with physical abilities.

Mitchell’s insights at NeurIPS serve as a reminder that the evaluation of AI must adapt and incorporate diverse methodologies from psychology. By moving beyond traditional benchmarks and embracing a more nuanced understanding of intelligence, researchers can better assess the capabilities of AI systems, ultimately fostering more effective and reliable technologies.


Editorial
