This model is a fine-tuned version of Qwen/Qwen3-1.7B, adapted to the GSM8K dataset with Group Relative Policy Optimization (GRPO) using the Verl framework.
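As a rough illustration of the idea behind GRPO: several completions are sampled for the same prompt, and each one is scored relative to the others in its group rather than against a learned value function. The sketch below shows only that normalization step. The function name `group_relative_advantages`, the `eps` constant, the choice of population standard deviation, and the example rewards are all assumptions made for illustration; this is not the Verl implementation, and real implementations differ in details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for a single prompt.

    Hypothetical helper for illustration only; not taken from Verl.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up example: four sampled answers to one problem, two of them correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct completions end up with positive advantages and incorrect ones with negative advantages. When all completions in a group receive the same reward, every advantage is zero and that group contributes no learning signal.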
You can use this model directly with the Hugging Face transformers library:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "Makrrr/Qwen3-1.7B-GSM8K-GRPO-verl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Recommended for H100/A100 (bfloat16) or try torch.float16 for other GPUs
    device_map="auto",  # Automatically load model onto available GPU(s)
)
model.eval() # Set model to evaluation mode
prompt = "Katy makes coffee using teaspoons of sugar and cups of water in the ratio of 7:13. If she used a total of 120 teaspoons of sugar and cups of water, calculate the number of teaspoonfuls of sugar she used.\nSolution:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=256,  # Generate up to 256 new tokens for the response
        num_return_sequences=1,
        do_sample=True,  # Set to False for greedy (deterministic) decoding
        temperature=0.7,  # Must be > 0 when sampling; lower values give less random output
        top_p=0.9,  # Nucleus sampling threshold
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)
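For the sample prompt above, the expected answer can be worked out by hand: sugar is 7 parts of 7 + 13 = 20, so 120 × 7/20 = 42 teaspoons. The sketch below checks a generated response against that value. It assumes the final answer is the last number appearing in the response, which this card does not confirm; `expected_sugar` and `last_number` are hypothetical helpers, and the sample response string is made up.

```python
import re
from fractions import Fraction


def expected_sugar(total=120, sugar_part=7, water_part=13):
    """Sugar's share of the total under the 7:13 ratio from the prompt."""
    return Fraction(total * sugar_part, sugar_part + water_part)


def last_number(text):
    """Return the last number in text as a Fraction, or None if there is none.

    Assumption: the model states its final answer last.
    """
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return Fraction(matches[-1].replace(",", ""))


# Made-up response text, for illustration only.
sample_response = "Sugar is 7/20 of the total, so 7/20 * 120 = 42 teaspoons."
print(last_number(sample_response) == expected_sugar())
```

In practice, `sample_response` would be replaced by the `response` string decoded from the model.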
This model was trained using the Verl framework with the following configuration:
After completing 15 epochs of training, the model achieved the following final validation metric on the GSM8K test set:
mean@1: 0.8377558756633814 (approximately 83.78% average reward)
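To make the metric concrete, here is a minimal sketch of how a mean@1 score could be computed, assuming one sampled response per problem and a reward of 1 for a correct final answer and 0 otherwise. The card does not state the reward definition, so that part is an assumption, and `mean_at_1` is a hypothetical helper. The GSM8K test split has 1,319 problems, and the reported value matches 1105/1319 to the printed precision, which would be consistent with 1,105 problems solved under such a 0/1 reward.

```python
def mean_at_1(rewards):
    """Average reward over problems, with one sampled response per problem."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    return sum(rewards) / len(rewards)


# Inferred split, not stated in the card: 1105 correct and 214 incorrect
# responses out of the 1319 GSM8K test problems.
rewards = [1.0] * 1105 + [0.0] * 214
score = mean_at_1(rewards)
print(f"mean@1 = {score:.4f}")  # mean@1 = 0.8378
```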