DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
by DeepSeek · free · Last verified 2026-03-17
This paper introduces Group Relative Policy Optimization (GRPO), a memory-efficient reinforcement learning algorithm. GRPO enables scalable RLHF-style training by replacing the learned critic (value) model with a baseline computed from the rewards of a group of sampled responses, a technique used to enhance the mathematical reasoning of models like DeepSeekMath.
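The critic-less idea above can be sketched briefly: for each prompt, sample a group of responses, score them with a reward model, and standardize each reward against the group's mean and standard deviation to get an advantage. This is a minimal illustration of the group-relative baseline, not the paper's full training loop; the function name and example rewards are chosen here for illustration.

```python
# Minimal sketch of GRPO-style group-relative advantage estimation.
# Each response's advantage is its reward standardized against the
# group of responses sampled for the same prompt, replacing a
# learned critic/value baseline.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the group mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: rewards for a group of G=4 responses to one prompt
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Advantages sum to ~0; correct responses get positive advantage.
```

Because the baseline is recomputed per group from sampled rewards, no separate value network needs to be trained or held in memory, which is the source of the memory savings the summary describes.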
https://arxiv.org/abs/2402.03300

Overall Grade: B (Above Average)
- Adoption: B+
- Quality: A
- Freshness: A
- Citations: B+
- Engagement: F
Specifications
- License
- Open Access
- Pricing
- free
- Capabilities
- Group Relative Policy Optimization (GRPO), Critic-less Reinforcement Learning, Memory-Efficient Policy Optimization, Scalable RLHF Training, Advanced Mathematical Reasoning, Step-by-Step Problem Solving, Language Model Fine-Tuning
- Integrations
- Use Cases
- API Available
- No
- Tags
- reinforcement-learning, grpo, math-reasoning, deepseek, policy-optimization, llm-training, rlhf, language-models, ai-research, critic-less-rl
- Added
- 2026-03-17
- Completeness
- 0.8%
Index Score: 66.1
- Adoption: 72
- Quality: 89
- Freshness: 88
- Citations: 78
- Engagement: 0