arxiv:2606.08035

DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

Published on Jun 6
· Submitted by
Hangui Lin
on Jun 11

Abstract

Dynamic cross-modal coordination is integrated into reinforcement learning with verifiable rewards to improve visual reasoning in multimodal large language models by measuring attention shifts and aligning token roles during chain-of-thought reasoning.

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a leading paradigm for enhancing visual reasoning in Multimodal Large Language Models (MLLMs). However, existing RLVR methods optimize primarily for the reasoning outcome, fundamentally overlooking the fine-grained cross-modal coordination required during the generation process. Through token-level analyses and controlled interventions, we reveal that during Chain-of-Thought (CoT) reasoning, MLLMs frequently fail to dynamically alternate between extracting visual evidence and synthesizing textual context, a coordination breakdown that is causally linked to reasoning failures. Motivated by these findings, we propose DyCo-RL, which integrates dynamic cross-modal coordination into RLVR optimization. Specifically, DyCo-RL uses the Fisher-Rao geodesic distance to measure within-modality attention shifts, assigning tokens to either visually-oriented or text-oriented functional roles. It then evaluates the alignment between a token's actual attention allocation and its assigned role, leveraging this score for alignment-guided advantage reweighting during policy optimization. Extensive experiments demonstrate that the algorithm-agnostic DyCo-RL, when applied to Qwen2.5-VL-3B/7B, consistently improves four representative RLVR algorithms across seven benchmarks spanning visual-centric and mathematical reasoning.
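
To make the attention-shift measurement concrete, here is a minimal sketch of the Fisher-Rao geodesic distance between two discrete distributions, d(p, q) = 2 arccos(sum_i sqrt(p_i q_i)), applied separately to the visual and textual parts of an attention vector. The distance formula is standard. Everything else is an assumption made for illustration and is not taken from the paper: the function names, the comparison of consecutive decoding steps over a shared set of context positions, and the rule that a token takes the role of the modality whose attention shifted more.

```python
import math


def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two discrete distributions.

    Inputs are non-negative weights over the same positions; each is
    normalized to sum to 1 before the distance is computed.
    """
    sp, sq = sum(p), sum(q)
    if sp <= 0 or sq <= 0:
        raise ValueError("each input needs positive total mass")
    # Bhattacharyya coefficient of the normalized distributions.
    bc = sum(math.sqrt((a / sp) * (b / sq)) for a, b in zip(p, q))
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return 2.0 * math.acos(min(1.0, max(0.0, bc)))


def within_modality_shift(attn_prev, attn_curr, mask):
    """Attention shift restricted to the positions selected by `mask`."""
    prev = [a for a, m in zip(attn_prev, mask) if m]
    curr = [a for a, m in zip(attn_curr, mask) if m]
    return fisher_rao_distance(prev, curr)


def assign_role(attn_prev, attn_curr, is_visual):
    """Hypothetical rule: the modality with the larger shift sets the role."""
    is_text = [not v for v in is_visual]
    visual_shift = within_modality_shift(attn_prev, attn_curr, is_visual)
    text_shift = within_modality_shift(attn_prev, attn_curr, is_text)
    return "visual" if visual_shift > text_shift else "text"


# Two image positions followed by two text positions (toy values).
is_visual = [True, True, False, False]
attn_prev = [0.4, 0.1, 0.25, 0.25]
attn_curr = [0.1, 0.4, 0.25, 0.25]
role = assign_role(attn_prev, attn_curr, is_visual)
```

In this toy case the attention over image positions moves from the first patch to the second while the attention over text positions is unchanged, so the hypothetical rule labels the token as visually-oriented. The paper's actual thresholding or normalization may differ.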

Community


Why do visual reasoning failures persist even after RLVR training?

We find that the issue is often not a visual perception error or a text reasoning error alone, but a breakdown of dynamic cross-modal coordination. During Chain-of-Thought generation, successful reasoning requires the model to switch continually between looking at visual evidence and reasoning over previously established textual context. Existing RLVR methods optimize final outcomes but largely ignore this token-level behavior.

Through token-level analyses and causal interventions, we show that reasoning failures frequently occur when visually-oriented tokens stop attending to relevant image content, or when text-oriented tokens fail to remain grounded in prior reasoning history.

To address this problem, we introduce DyCo-RL, a plug-and-play RLVR framework that explicitly rewards effective cross-modal coordination. DyCo-RL identifies token functional roles using Fisher–Rao attention dynamics and reweights policy optimization according to role-attention alignment. Applied to Qwen2.5-VL-3B/7B, it consistently improves four representative RLVR algorithms across seven benchmarks spanning visual-centric and mathematical reasoning.
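
The reweighting step described above can be sketched as follows. This is only an illustration under stated assumptions, not the paper's formulation: the alignment score is taken to be the share of a token's attention that falls on the modality matching its role, and the weight is a hypothetical linear function of how far that score sits from the sequence mean, controlled by an invented parameter `alpha`. All function names are likewise hypothetical.

```python
def alignment_score(attn, is_visual, role):
    """Share of attention mass on the modality matching the token's role.

    Assumed definition for illustration; the paper's score may differ.
    """
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention needs positive total mass")
    visual_mass = sum(a for a, v in zip(attn, is_visual) if v) / total
    return visual_mass if role == "visual" else 1.0 - visual_mass


def reweight_advantages(advantages, scores, alpha=0.5):
    """Scale per-token advantages by a hypothetical alignment weight.

    Tokens whose score is above the sequence mean get a weight above 1,
    tokens below the mean get a weight below 1. `alpha` is an assumed
    strength parameter; with scores in [0, 1] and alpha <= 1 the weight
    stays positive.
    """
    mean_score = sum(scores) / len(scores)
    return [
        adv * (1.0 + alpha * (s - mean_score))
        for adv, s in zip(advantages, scores)
    ]


# Toy example: two tokens with equal advantage, one well aligned.
is_visual = [True, False, False]
score_a = alignment_score([0.9, 0.05, 0.05], is_visual, "visual")
score_b = alignment_score([0.9, 0.05, 0.05], is_visual, "text")
weighted = reweight_advantages([1.0, 1.0], [score_a, score_b], alpha=0.5)
```

Because the weight multiplies whatever per-token advantage the base algorithm produces, a scheme of this shape can sit on top of different RLVR algorithms without changing their reward or baseline computation. Note that this sketch scales the magnitude of an advantage regardless of its sign, which is a simplification.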

