DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning
Abstract
Dynamic cross-modal coordination is integrated into reinforcement learning with verifiable rewards to improve visual reasoning in multimodal large language models by measuring attention shifts and aligning token roles during chain-of-thought reasoning.
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a leading paradigm for enhancing visual reasoning in Multimodal Large Language Models (MLLMs). However, existing RLVR methods optimize primarily for the reasoning outcome, fundamentally overlooking the fine-grained cross-modal coordination required during the generation process. Through token-level analyses and controlled interventions, we reveal that during Chain-of-Thought (CoT) reasoning, MLLMs frequently fail to dynamically alternate between extracting visual evidence and synthesizing textual context, a coordination breakdown that is causally linked to reasoning failures. Motivated by these findings, we propose DyCo-RL, which integrates dynamic cross-modal coordination into RLVR optimization. Specifically, DyCo-RL uses the Fisher-Rao geodesic distance to measure within-modality attention shifts, assigning tokens to either visually-oriented or text-oriented functional roles. It then evaluates the alignment between a token's actual attention allocation and its assigned role, leveraging this score for alignment-guided advantage reweighting during policy optimization. Extensive experiments demonstrate that the algorithm-agnostic DyCo-RL, when applied to Qwen2.5-VL-3B/7B, consistently improves four representative RLVR algorithms across seven benchmarks spanning visual-centric and mathematical reasoning.
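For readers unfamiliar with the Fisher-Rao geodesic distance mentioned in the abstract, the following is a minimal sketch for discrete distributions, using the standard closed form d(p, q) = 2 arccos(sum_i sqrt(p_i q_i)). How DyCo-RL actually extracts and normalizes attention is not described on this page, so the helper `within_modality_shift`, its arguments, and the idea of renormalizing attention inside one modality before comparing consecutive steps are assumptions made for illustration only.

```python
import math


def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two discrete distributions.

    Both inputs are renormalized to sum to 1, so unnormalized attention
    mass restricted to one modality can be passed in directly.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    sum_p, sum_q = sum(p), sum(q)
    if sum_p <= 0 or sum_q <= 0:
        raise ValueError("each distribution needs positive total mass")
    # Bhattacharyya coefficient: inner product of the square-root vectors.
    bc = sum(math.sqrt((pi / sum_p) * (qi / sum_q)) for pi, qi in zip(p, q))
    # Clamp against floating-point drift before taking arccos.
    bc = min(1.0, max(0.0, bc))
    return 2.0 * math.acos(bc)


def within_modality_shift(attn_prev, attn_curr, indices):
    """Hypothetical helper: attention shift inside one modality.

    `indices` selects the key positions belonging to that modality
    (for example, the image tokens); this selection is an assumption.
    """
    p = [attn_prev[i] for i in indices]
    q = [attn_curr[i] for i in indices]
    return fisher_rao_distance(p, q)


# Toy example: positions 0-1 are image tokens, positions 2-3 are text tokens.
prev_step = [0.4, 0.1, 0.3, 0.2]
curr_step = [0.1, 0.4, 0.3, 0.2]
visual_shift = within_modality_shift(prev_step, curr_step, [0, 1])
text_shift = within_modality_shift(prev_step, curr_step, [2, 3])
```

In this toy case the attention over the image tokens is redistributed while the attention over the text tokens is unchanged, so `visual_shift` is clearly larger than `text_shift`. The distance is bounded between 0 (identical distributions) and pi (disjoint support).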
Community
Why do visual reasoning failures persist even after RLVR training?
We find that reasoning failures are often associated with breakdowns in dynamic cross-modal coordination.
The issue is often neither a visual perception error nor a text reasoning error alone, but a failure of this coordination. During Chain-of-Thought generation, successful reasoning requires models to continuously alternate between looking at visual evidence and reasoning over previously established textual context. Existing RLVR methods optimize final outcomes but largely ignore this token-level behavior.
Through token-level analyses and causal interventions, we show that reasoning failures frequently occur when visually-oriented tokens stop attending to relevant image content, or when text-oriented tokens fail to remain grounded in prior reasoning history.
To address this problem, we introduce DyCo-RL, a plug-and-play RLVR framework that explicitly rewards effective cross-modal coordination. DyCo-RL identifies token functional roles using Fisher–Rao attention dynamics and reweights policy optimization according to role-attention alignment. The resulting models exhibit substantially stronger reasoning performance across diverse visual and mathematical reasoning benchmarks.
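The role assignment and reweighting described above can be illustrated with a small sketch. This page does not give the actual formulas, so everything here is an assumption: the rule in `assign_role` (larger within-modality shift wins), the score in `alignment_score` (share of attention on the modality matching the role), and the centered linear weighting with a hypothetical strength parameter `beta` in `reweight_advantages`. It is not the authors' implementation.

```python
from statistics import fmean


def assign_role(visual_shift, text_shift):
    """Assumed rule: the modality with the larger attention shift sets the role."""
    return "visual" if visual_shift >= text_shift else "text"


def alignment_score(role, visual_mass):
    """Assumed score in [0, 1].

    `visual_mass` is the share of the token's attention placed on image
    tokens; the remainder is treated as attention on text tokens.
    """
    return visual_mass if role == "visual" else 1.0 - visual_mass


def reweight_advantages(advantage, scores, beta=0.5):
    """Scale one sequence-level advantage into per-token advantages.

    Weights are centered on the mean score, so they average to 1 and the
    overall advantage scale is preserved (an illustrative design choice).
    """
    mean_score = fmean(scores)
    return [advantage * (1.0 + beta * (s - mean_score)) for s in scores]


# Toy example with two generated tokens.
roles = [assign_role(1.2, 0.3), assign_role(0.1, 0.9)]
visual_mass = [0.7, 0.6]
scores = [alignment_score(r, m) for r, m in zip(roles, visual_mass)]
token_advantages = reweight_advantages(1.0, scores)
```

Here the first token is treated as visually-oriented and attends mostly to the image, so it is well aligned and receives a weight above 1. The second token is treated as text-oriented but still attends mostly to the image, so its weight falls below 1. Because the weights are applied multiplicatively, a method of this shape can sit on top of any RLVR algorithm that already produces an advantage estimate.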