AI & ML interests

historical texts, named-entity recognition, big science

Recent Activity

christopher
posted an update 8 months ago
Something very cool is cooking at Lichess

davanstrien
posted an update 9 months ago
davanstrien
posted an update 12 months ago
Inspired by Hugging Face's official MCP server, I've developed a complementary tool that exposes my semantic search API to enhance discovery across the HF platform.

Key capabilities:

- AI-powered semantic search for models and datasets
- Parameter count analysis via safetensors metadata
- Trending content discovery
- Find similar models/datasets functionality
- 11 tools total for enhanced ecosystem navigation

The semantic search goes beyond simple keyword matching, understanding context and relationships between different models and datasets.

Example query: "Find around 10 reasoning Hugging Face datasets published in 2025 focusing on topics other than maths and science. Show a link and a short summary for each dataset." (results in video!)

https://github.com/davanstrien/hub-semantic-search-mcp
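
As an illustration of the "parameter count via safetensors metadata" idea, here is a minimal sketch, not the tool's actual implementation. It relies only on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header listing each tensor's shape) and counts parameters without reading any weights. The helper names `count_parameters` and `make_toy_file`, and the toy tensor names, are made up for this example.

```python
import json
import struct
from math import prod


def count_parameters(safetensors_bytes: bytes) -> int:
    """Count parameters by reading only the safetensors JSON header."""
    # The first 8 bytes hold the header length as a little-endian unsigned 64-bit int.
    (header_len,) = struct.unpack("<Q", safetensors_bytes[:8])
    header = json.loads(safetensors_bytes[8 : 8 + header_len])
    total = 0
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata, not a tensor
            continue
        total += prod(info["shape"])  # a scalar with shape [] counts as 1
    return total


def make_toy_file() -> bytes:
    """Build a tiny in-memory safetensors payload (hypothetical tensor names)."""
    header = {
        "__metadata__": {"format": "pt"},
        "embed.weight": {"dtype": "F32", "shape": [10, 4], "data_offsets": [0, 160]},
        "head.bias": {"dtype": "F32", "shape": [4], "data_offsets": [160, 176]},
    }
    header_bytes = json.dumps(header).encode("utf-8")
    # 176 zero bytes stand in for the raw tensor data.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + bytes(176)


print(count_parameters(make_toy_file()))  # prints 44 (10*4 + 4)
```

Because only the header is needed, the same counting logic could in principle be applied after fetching just the first few kilobytes of a remote file; how the linked tool actually does this is not stated in the post.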
clefourrier
posted an update about 1 year ago
Always surprised that so few people actually read the FineTasks blog, on
✨how to select training evals with the highest signal✨

If you're serious about training models without wasting compute on shitty runs, you absolutely should read it!!

A high-signal eval actually tells you precisely, during training, how well & what your model is learning, allowing you to discard the bad runs/bad samplings/...!

The blog covers in depth prompt choice, metrics, dataset, across languages/capabilities, and my fave section is "which properties should evals have" 👌
(to know on your use case how to select the best evals for you)

Blog: HuggingFaceFW/blogpost-fine-tasks
davanstrien
posted an update about 1 year ago
Came across a very nice submission from @marcodsn for the reasoning datasets competition (https://huggingface.co/blog/bespokelabs/reasoning-datasets-competition).

The dataset distils reasoning chains from arXiv research papers in biology and economics. Some nice features of the dataset:

- Extracts both the logical structure AND researcher intuition from academic papers
- Adopts the persona of researchers "before experiments" to capture exploratory thinking
- Provides multi-short and single-long reasoning formats with token budgets
- Shows 7.2% improvement on MMLU-Pro Economics when fine-tuning a 3B model

It's created using the Curator framework with plans to scale across more scientific domains and incorporate multi-modal reasoning with charts and mathematics.

I personally am very excited about datasets like this, which involve creativity in their creation and don't just rely on $$$ to produce a big dataset with little novelty.

Dataset can be found here: marcodsn/academic-chains (give it a like!)
davanstrien
posted an update about 1 year ago
I've created a v1 dataset (davanstrien/reasoning-required) and model (davanstrien/ModernBERT-based-Reasoning-Required) to help curate "wild text" data for generating reasoning examples beyond the usual code/math/science domains.

- I developed a "Reasoning Required" dataset with a 0-4 scoring system for reasoning complexity
- I used educational content from HuggingFaceFW/fineweb-edu, adding annotations for domains, reasoning types, and example questions

My approach enables a more efficient workflow: filter text with small models first, then use LLMs only on high-value content.

This significantly reduces computation costs while expanding reasoning dataset domain coverage.
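
The filter-first workflow can be sketched roughly as follows. This is an illustration under stated assumptions, not the released model or pipeline: `cheap_reasoning_score` is a hypothetical keyword-based stand-in for a small classifier that returns a 0-4 score, and the threshold of 2 is an arbitrary choice for the example.

```python
from typing import Callable, Iterable, List


def cheap_reasoning_score(text: str) -> int:
    """Hypothetical stand-in for a small classifier; returns a 0-4 score."""
    cues = ("because", "therefore", "however", "if", "why")
    words = text.lower().split()
    # Count cue words after stripping surrounding punctuation.
    hits = sum(1 for w in words if w.strip(".,;:!?") in cues)
    return min(hits, 4)


def select_for_llm(
    texts: Iterable[str],
    scorer: Callable[[str], int],
    threshold: int = 2,  # assumed cut-off, chosen only for this example
) -> List[str]:
    """Stage 1: keep only texts the cheap scorer rates at or above the threshold."""
    return [t for t in texts if scorer(t) >= threshold]


docs = [
    "The sky is blue.",
    "Prices rose because supply fell; therefore demand shifted.",
]
# Stage 2 (an expensive LLM call) would run only on the selected texts.
selected = select_for_llm(docs, cheap_reasoning_score)
print(len(selected))  # prints 1
```

In a real setup the scorer would be replaced by an actual small model, and the cost saving comes from the LLM never seeing the texts rejected in stage 1.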
stefan-it
posted an update about 1 year ago
Woohoo 🥳 I have finished my 2025 GPU workstation build and I am very excited to train new awesome open source models on it.

I built my last GPU workstation 5 years ago featuring an AMD Ryzen 5900X, 64GB of G.SKILL Trident Z RGB on an ASRock X570 Taichi cooled by an Alphacool Eisbär 420. GPU was a Zotac RTX 3090 AMP Extreme. Unfortunately, I was never satisfied with the case, a Fractal Define 7: it is definitely too small, airflow is not optimal (I had to keep the front door open all the time), and it also arrived with a partly damaged side panel.

For my new build, I've used the following components: an outstanding new AMD Ryzen 9950X3D with 64GB of Corsair Dominator Titanium (what a name). As a huge Noctua fan - warm greetings to my Austrian neighbors - I am using the brand new Noctua NH-D15 G2 on an ASRock X870E Taichi in an amazing Lian Li LANCOOL III chassis. One joke that only NVIDIA Blackwell users will understand: you definitely need a tempered glass panel to check if your GPU cables/connectors start melting 😂 And the best is yet to come: I returned my previously bought Zotac RTX 5090 Solid to the eBay seller (because of... missing ROPs, only NVIDIA Blackwell users will again understand) and bought a Zotac 5090 AMP Extreme INFINITY (yes, the long name indicates that this is the flagship model from Zotac) from a more trustworthy source (NBB in Germany).

I am so happy to start training and fine-tuning new open source models - stay tuned!!!
clefourrier
posted an update about 1 year ago
Gemma3 family is out! Reading the tech report, and this section was really interesting to me from a methods/scientific fairness pov.

Instead of doing over-hyped comparisons, they clearly state that **results are reported in a setup which is advantageous to their models**.
(Which everybody does, but people usually don't say)

For a tech report, it makes a lot of sense to report model performance when used optimally!
On leaderboards, on the other hand, comparison will be apples to apples, but in a potentially suboptimal way for a given model family (just as some users interact suboptimally with models)

It also contains a cool section (6) on training data memorization rate! Important to see if your model will output the training data it has seen as such: always an issue for privacy/copyright/... but also very much for evaluation!

Because if your model knows its evals by heart, you're not testing for generalization.
stefan-it
posted an update about 1 year ago
🇹🇷 😍 I'm very happy to finally announce my new Turkish LM called "BERT5urk":

stefan-it/bert5urk

It is a 1.42B T5-based model, trained with the UL2 pretraining objective on the Turkish part of the awesome HuggingFaceFW/fineweb-2 dataset.

Feel free to check it out!
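
For readers unfamiliar with the objective: UL2 mixes several denoising tasks, one of which is T5-style span corruption. The toy sketch below shows only that one ingredient, with hand-picked spans instead of sampled ones, and is not the training code for this model. The `span_corrupt` helper is hypothetical; the `<extra_id_N>` sentinel naming follows the convention used by T5 tokenizers.

```python
from typing import List, Tuple


def span_corrupt(
    tokens: List[str], spans: List[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Replace each [start, end) span with a sentinel; return (inputs, targets)."""
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # mask the span in the input
        targets.append(sentinel)             # target: sentinel, then the masked tokens
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel ends the target
    return inputs, targets


tokens = "the quick brown fox jumps over".split()
inputs, targets = span_corrupt(tokens, [(1, 2), (4, 6)])
print(" ".join(inputs))   # the <extra_id_0> brown fox <extra_id_1>
print(" ".join(targets))  # <extra_id_0> quick <extra_id_1> jumps over <extra_id_2>
```

The model is trained to generate the target sequence from the corrupted input; UL2 varies span lengths and corruption rates across its denoisers, which this fixed-span sketch does not attempt to reproduce.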
davanstrien
posted an update over 1 year ago
📊 Introducing "Hugging Face Dataset Spotlight" 📊

I'm excited to share the first episode of our AI-generated podcast series focusing on nice datasets from the Hugging Face Hub!

This first episode explores mathematical reasoning datasets:

- SynthLabsAI/Big-Math-RL-Verified: Over 250,000 rigorously verified problems spanning multiple difficulty levels and mathematical domains
- open-r1/OpenR1-Math-220k: 220,000 math problems with multiple reasoning traces, verified for accuracy using Math Verify and Llama-3.3-70B models.
- facebook/natural_reasoning: 1.1 million general reasoning questions carefully deduplicated and decontaminated from existing benchmarks, showing superior scaling effects when training models like Llama3.1-8B-Instruct.

Plus a bonus segment on bespokelabs/bespoke-manim!

https://www.youtube.com/watch?v=-TgmRq45tW4