BigScience: LMs for Historical Texts

non-profit

https://bigscience.huggingface.co/

BigscienceW

bigscience

Activity Feed Request to join this org

AI & ML interests

historical texts, named-entity recognition, big science

Recent Activity

stefan-it submitted a paper 17 days ago

GLiNER-Relex: A Unified Framework for Joint Named Entity Recognition and Relation Extraction

stefan-it submitted a paper 25 days ago

Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling

clefourrier authored a paper 3 months ago

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

View all activity

stefan-it

submitted a paper to Daily Papers 17 days ago

GLiNER-Relex: A Unified Framework for Joint Named Entity Recognition and Relation Extraction

Paper • 2605.10108 • Published 19 days ago • 1

stefan-it

submitted a paper to Daily Papers 25 days ago

Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling

Paper • 2604.28075 • Published 30 days ago • 20

clefourrier

authored a paper 3 months ago

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Paper • 2603.12180 • Published Mar 12 • 65

stefan-it

submitted a paper to Daily Papers 4 months ago

FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale

Paper • 2601.22146 • Published Jan 29 • 12

christopher

authored a paper 6 months ago

Economies of Open Intelligence: Tracing Power & Participation in the Model Ecosystem

Paper • 2512.03073 • Published Nov 27, 2025 • 7

stefan-it

authored a paper 7 months ago

SindBERT, the Sailor: Charting the Seas of Turkish NLP

Paper • 2510.21364 • Published Oct 24, 2025 • 2

christopher

authored a paper 7 months ago

The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models

Paper • 2510.13996 • Published Oct 15, 2025 • 9

stefan-it

authored a paper 7 months ago

The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models

Paper • 2510.13996 • Published Oct 15, 2025 • 9

christopher

posted an update 8 months ago

Post

764

Something very cool is cooking at

Lichess

1 reply

versae

authored a paper 8 months ago

BOE-XSUM: Extreme Summarization in Clear Language of Spanish Legal Decrees and Notifications

Paper • 2509.24908 • Published Sep 29, 2025 • 3

stefan-it

authored a paper 9 months ago

Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian

Paper • 2509.05668 • Published Sep 6, 2025 • 8

davanstrien

posted an update 9 months ago

Post

2697

I fine-tuned a smol VLM to generate specialized art history metadata!

https://huggingface.co/davanstrien/iconclass-vlm: Qwen2.5-VL-3B trained using SFT to generate ICONCLASS codes (think Dewey Decimal for art!)

Trained with TRL + HF Jobs - single UV script, no GPU needed!

Space to explore predictions on a test set: davanstrien/iconclass-predictions

Blog soon!

1 reply

davanstrien

posted an update 12 months ago

Post

3745

Inspired by Hugging Face's official MCP server, I've developed a complementary tool that exposes my semantic search API to enhance discovery across the HF platform.

Key capabilities:

- AI-powered semantic search for models and datasets
- Parameter count analysis via safetensors metadata
- Trending content discovery
- Find similar models/datasets functionality
- 11 tools total for enhanced ecosystem navigation

The semantic search goes beyond simple keyword matching, understanding context and relationships between different models and datasets.

Example query: "Find around 10 reasoning Hugging Face datasets published in 2025 focusing on topics other than maths and science. Show a link and a short summary for each dataset." (results in video!)

https://github.com/davanstrien/hub-semantic-search-mcp

1 reply

clefourrier

posted an update about 1 year ago

Post

2659

Always surprised that so few people actually read the FineTasks blog, on
✨how to select training evals with the highest signal✨

If you're serious about training models without wasting compute on shitty runs, you absolutely should read it!!

An high signal eval actually tells you precisely, during training, how wel & what your model is learning, allowing you to discard the bad runs/bad samplings/...!

The blog covers in depth prompt choice, metrics, dataset, across languages/capabilities, and my fave section is "which properties should evals have"👌
(to know on your use case how to select the best evals for you)

Blog: HuggingFaceFW/blogpost-fine-tasks

2 replies

davanstrien

posted an update about 1 year ago

Post

2416

Came across a very nice submission from @marcodsn for the reasoning datasets competition (https://huggingface.co/blog/bespokelabs/reasoning-datasets-competition).

The dataset distils reasoning chains from arXiv research papers in biology and economics. Some nice features of the dataset:

- Extracts both the logical structure AND researcher intuition from academic papers
- Adopts the persona of researchers "before experiments" to capture exploratory thinking
- Provides multi-short and single-long reasoning formats with token budgets - Shows 7.2% improvement on MMLU-Pro Economics when fine-tuning a 3B model

It's created using the Curator framework with plans to scale across more scientific domains and incorporate multi-modal reasoning with charts and mathematics.

I personally am very excited about datasets like this, which involve creativity in their creation and don't just rely on $$$ to produce a big dataset with little novelty.

Dataset can be found here: marcodsn/academic-chains (give it a like!)

davanstrien

posted an update about 1 year ago

Post

1795

I've created a v1 dataset ( davanstrien/reasoning-required) and model ( davanstrien/ModernBERT-based-Reasoning-Required) to help curate "wild text" data for generating reasoning examples beyond the usual code/math/science domains.

- I developed a "Reasoning Required" dataset with a 0-4 scoring system for reasoning complexity
- I used educational content from HuggingFaceFW/fineweb-edu, adding annotations for domains, reasoning types, and example questions

My approach enables a more efficient workflow: filter text with small models first, then use LLMs only on high-value content.

This significantly reduces computation costs while expanding reasoning dataset domain coverage.

stefan-it

posted an update about 1 year ago

Post

4801

Wohoo 🥳 I have finished my 2025 GPU workstation build and I am very excited to train new awesome open source models on it.

I built my last GPU workstation 5 years ago featuring an AMD Ryzen 5900X, 64GB of G.SKILL Trident Z RGB on an ASRock X570 Taichi cooled by an Alphacool Eisbär 420. GPU was a Zotac RTX 3090 AMP Extreme. Unfortunately, I was never satisfied with the case - some Fractal Define 7, as it is definitely too small, airflow is not optimal as I had to open the front door all the time and it also arrived with a partly damaged side panel.

For my new build, I've used the following components: an outstanding new AMD Ryzen 9950X3D with 64GB of Corsair Dominator Titanium (what a name). As a huge Noctua fan - warm greetings to my Austrian neighbors - I am using the brand new Noctua NH-D15 G2 on an ASRock X870E Taichi in an amazing Lian Li LANCOOL III chassis. One joke that only NVIDIA Blackwell users will understand: you definitely need a tempered glass panel to check if your GPU cables/connectors start melting 😂 And the best is yet to come: I returned my previously bought Zotac RTX 5090 Solid to the eBay seller (because of... missing ROPs, only NVIDIA Blackwell users will again understand) and bought a Zotac 5090 AMP Extreme INFINITY (yes, the long name indicates that this is the flagship model from Zotac) from a more trustworthy source (NBB in Germany).

I am so happy to start training and fine-tuning new open source models - stay tuned!!!

3 replies

clefourrier

posted an update about 1 year ago

Post

2747

Gemma3 family is out! Reading the tech report, and this section was really interesting to me from a methods/scientific fairness pov.

Instead of doing over-hyped comparisons, they clearly state that **results are reported in a setup which is advantageous to their models**.
(Which everybody does, but people usually don't say)

For a tech report, it makes a lot of sense to report model performance when used optimally!
On leaderboards on the other hand, comparison will be apples to apples, but in a potentially unoptimal way for a given model family (like some user interact sub-optimally with models)

Also contains a cool section (6) on training data memorization rate too! Important to see if your model will output the training data it has seen as such: always an issue for privacy/copyright/... but also very much for evaluation!

Because if your model knows its evals by heart, you're not testing for generalization.

stefan-it

posted an update about 1 year ago

Post

1045

🇹🇷 😍 I'm very happy to finally announce my new Turkish LM called "BERT5urk":

stefan-it/bert5urk

It is a 1.42B T5-based model, trained with UL2 pretraining objective on the Turkish part of the awesome HuggingFaceFW/fineweb-2 dataset.

Feel free to check it out!

1 reply

davanstrien

posted an update over 1 year ago

Post

3025

📊 Introducing "Hugging Face Dataset Spotlight" 📊

I'm excited to share the first episode of our AI-generated podcast series focusing on nice datasets from the Hugging Face Hub!

This first episode explores mathematical reasoning datasets:

- SynthLabsAI/Big-Math-RL-Verified: Over 250,000 rigorously verified problems spanning multiple difficulty levels and mathematical domains
- open-r1/OpenR1-Math-220k: 220,000 math problems with multiple reasoning traces, verified for accuracy using Math Verify and Llama-3.3-70B models.
- facebook/natural_reasoning: 1.1 million general reasoning questions carefully deduplicated and decontaminated from existing benchmarks, showing superior scaling effects when training models like Llama3.1-8B-Instruct.

Plus a bonus segment on bespokelabs/bespoke-manim!

https://www.youtube.com/watch?v=-TgmRq45tW4

AI & ML interests

Recent Activity

Team members 10

bigscience-historical-texts's activity