LongTraceRL-8B

Model Description

LongTraceRL-8B is an 8-billion parameter reasoning model trained with reinforcement learning on long-context multi-hop QA tasks using trajectory-based tiered distractors and entity-level rubric rewards.

Model Details

Base Model: DeepSeek-R1-0528-Qwen3-8B
Parameters: 8B
Architecture: Qwen3 (36 layers, hidden size 4096, GQA with 8 KV groups)
Training Method: GRPO with entity-level rubric reward
Context Length: 128K prompt + 32K response
Language: English

Training Details

Training Data: 2,815 long-context multi-hop QA samples (LongTraceRL Dataset)
Training Steps: 200
Learning Rate: 2e-6 (constant)
Global Batch Size: 128
GRPO Group Size: 8
Rubric Reward Weight (η): 0.3
Framework: Slime (Megatron-LM + SGLang)

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("THU-KEG/LongTraceRL-8B")
tokenizer = AutoTokenizer.from_pretrained("THU-KEG/LongTraceRL-8B")

Citation

Downloads last month: -; Downloads are not tracked for this model. How to track

Video Preview

Reinforcement Learning

Model tree for THU-KEG/LongTraceRL-8B

Base model

deepseek-ai/DeepSeek-R1-0528-Qwen3-8B

Finetuned

(58)

this model

Dataset used to train THU-KEG/LongTraceRL-8B

Collection including THU-KEG/LongTraceRL-8B

LongTraceRL

Collection

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards • 4 items • Updated 1 day ago • 1