LongTraceRL-8B

Paper Code

Model Description

LongTraceRL-8B is an 8-billion parameter reasoning model trained with reinforcement learning on long-context multi-hop QA tasks using trajectory-based tiered distractors and entity-level rubric rewards.

Model Details

  • Base Model: DeepSeek-R1-0528-Qwen3-8B
  • Parameters: 8B
  • Architecture: Qwen3 (36 layers, hidden size 4096, GQA with 8 KV groups)
  • Training Method: GRPO with entity-level rubric reward
  • Context Length: 128K prompt + 32K response
  • Language: English

Training Details

  • Training Data: 2,815 long-context multi-hop QA samples (LongTraceRL Dataset)
  • Training Steps: 200
  • Learning Rate: 2e-6 (constant)
  • Global Batch Size: 128
  • GRPO Group Size: 8
  • Rubric Reward Weight (η): 0.3
  • Framework: Slime (Megatron-LM + SGLang)

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("THU-KEG/LongTraceRL-8B")
tokenizer = AutoTokenizer.from_pretrained("THU-KEG/LongTraceRL-8B")

Citation


Downloads last month

-

Downloads are not tracked for this model. How to track
Video Preview
loading

Model tree for THU-KEG/LongTraceRL-8B

Finetuned
(58)
this model

Dataset used to train THU-KEG/LongTraceRL-8B

Collection including THU-KEG/LongTraceRL-8B