Why small AI models can’t keep up with large ones

DATE POSTED: February 18, 2025

The rise of large language models (LLMs) has been nothing short of transformative. These AI systems excel at complex reasoning, breaking down problems into structured, logical steps known as chain-of-thought (CoT) reasoning. However, as AI research pushes for efficiency, a key question emerges: Can smaller models inherit these advanced reasoning capabilities through distillation from larger models?

A new study by Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, and Radha Poovendran, from the University of Washington, Carnegie Mellon University, and Western Washington University, suggests the answer is more complicated than previously thought. In the study, titled “Small Models Struggle to Learn from Strong Reasoners,” the researchers identify what they call the Small Model Learnability Gap: a phenomenon in which small models (≤3B parameters) struggle to benefit from the intricate reasoning of their larger counterparts. Instead, these models perform better when trained on shorter, simpler reasoning steps or when distilled from other small models.

This finding challenges the conventional belief that bigger is always better when it comes to AI knowledge transfer. The study also proposes a new approach to AI distillation—one that mixes reasoning complexity to help smaller models learn more effectively.

Why small AI models struggle with complex reasoning

LLMs like GPT-4o, Claude 3 Opus, and Gemini are trained on massive datasets and optimized to process intricate reasoning chains. Their step-by-step explanations enhance problem-solving accuracy in fields like mathematics, logical inference, and structured decision-making.

Naturally, AI researchers have attempted to “shrink” this intelligence into smaller models—fine-tuning them using outputs from larger models. The idea is straightforward: train a smaller model on long, detailed reasoning traces generated by a larger AI, hoping it will absorb the same structured logic.
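
In practice, this vanilla distillation setup is just supervised fine-tuning on teacher-generated traces. The sketch below shows the idea with Hugging Face Transformers; the model choice, the toy `traces` list, and the hyperparameters are illustrative assumptions, not details taken from the study:

```python
# Minimal sketch of standard CoT distillation: fine-tune a small "student"
# model on (question, teacher-generated reasoning trace) pairs.
# Model name, data, and hyperparameters are placeholders, not from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "Qwen/Qwen2.5-3B-Instruct"  # assumption: any small (<=3B) causal LM
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)

# Hypothetical distillation data: long chain-of-thought traces from a large teacher.
traces = [
    {"question": "What is 17 * 24?",
     "teacher_cot": "Step 1: 17 * 20 = 340. Step 2: 17 * 4 = 68. "
                    "Step 3: 340 + 68 = 408. Answer: 408"},
]

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
student.train()
for ex in traces:
    text = ex["question"] + "\n" + ex["teacher_cot"] + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    # Plain next-token (supervised) loss over the full teacher trace.
    out = student(**batch, labels=batch["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```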

But the study finds this approach often backfires.

  • Small models fail to internalize long reasoning steps: When trained on lengthy and intricate explanations, smaller models struggle to generalize, leading to performance drops.
  • They learn better from simpler reasoning chains: Training small models on shorter, more concise reasoning sequences improves their ability to process logical steps.
  • Bigger is not always better for teaching AI: Large model-generated reasoning chains don’t always improve smaller models’ reasoning—sometimes they hinder it.

This effect is particularly evident in math-related tasks, where structured problem-solving plays a crucial role. The research team evaluated small models across various benchmarks, including MATH, GSM8K, AIME, AMC, and OlympiadBench, finding that complex reasoning distillation often led to diminished performance.
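
Scoring on these benchmarks ultimately comes down to whether the model's final answer matches the reference. As a rough illustration only (not the paper's actual evaluation harness), a minimal scorer might extract the text after an "Answer:" marker and compute accuracy:

```python
# Toy sketch of benchmark scoring: compare the model's final "Answer:" value
# against the reference. Real MATH/GSM8K harnesses use more careful extraction.
import re

def extract_answer(text: str):
    match = re.search(r"Answer:\s*([^\n]+)", text)
    return match.group(1).strip() if match else None

def accuracy(predictions, references):
    correct = sum(extract_answer(p) == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)

preds = ["Step 1: ... Answer: 408", "Reasoning ... Answer: 12"]
refs = ["408", "13"]
print(accuracy(preds, refs))  # 0.5
```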

The fix: Mix Distillation

To address this learning bottleneck, the researchers propose a Mix Distillation approach. Instead of exclusively training small models on long CoT sequences or distilling from large models, this method balances reasoning complexity by combining multiple reasoning styles.

Their strategy consists of two configurations, illustrated in the sketch after the list:

  1. Mix-Long: A combination of short and long reasoning chains, ensuring that small models are exposed to both detailed and simplified logic.
  2. Mix-Large: A blend of reasoning steps from large and small models, optimizing knowledge transfer without overwhelming the smaller models.
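
A simple way to picture Mix Distillation is as a data-mixing step before fine-tuning. The sketch below assumes a plain weighted blend of two trace pools; the `mix_distillation_data` helper and the 1:4 long-to-short split are hypothetical choices for illustration, not the ratios reported by the authors:

```python
# Illustrative sketch of assembling a Mix Distillation training set.
# The 1:4 long-to-short weighting is an assumption, not the paper's ratio.
import random

def mix_distillation_data(long_cot, short_cot, long_weight=0.2, seed=0):
    """Return a shuffled training set mixing long and short CoT traces (Mix-Long).
    For Mix-Large, pass large-teacher traces as `long_cot` and
    small-teacher traces as `short_cot` instead."""
    rng = random.Random(seed)
    n_long = int(long_weight * (len(long_cot) + len(short_cot)))
    mixed = rng.sample(long_cot, min(n_long, len(long_cot))) + list(short_cot)
    rng.shuffle(mixed)
    return mixed

# Hypothetical traces; in practice these would be teacher-generated solutions.
long_cot = [{"q": "AIME problem ...", "cot": "Step 1 ... Step 12 ... Answer: 42"}] * 50
short_cot = [{"q": "GSM8K problem ...", "cot": "3 + 4 = 7. Answer: 7"}] * 200

train_set = mix_distillation_data(long_cot, short_cot)
print(len(train_set), "mixed examples")
```

The mixed set is then used for the same supervised fine-tuning shown earlier, so the student sees detailed and concise reasoning side by side rather than only one style.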

Experiments show that Mix Distillation significantly improves small model reasoning compared to training on single-source data.

For instance:

  • Qwen2.5-3B-Instruct improved by 8+ points on MATH and AMC benchmarks using Mix-Long, compared to training on only long CoT data.
  • The same model gained 7+ points using Mix-Large, compared to direct distillation from a large teacher model.

The takeaway? Small models don’t need to imitate large models verbatim—they need a carefully curated mix of reasoning complexity.

Featured image credit: Kerem Gülen/Midjourney
