---
language:
- ru
pipeline_tag: sentence-similarity
tags:
- russian
- fill-mask
- pretraining
- embeddings
- masked-lm
- tiny
- feature-extraction
- sentence-similarity
- sentence-transformers
- transformers
license: mit
widget:
- text: Миниатюрная модель для [MASK] разных задач.
---

This is an updated version of [cointegrated/rubert-tiny](https://huggingface.co/cointegrated/rubert-tiny): a small Russian BERT-based encoder with high-quality sentence embeddings. This [post in Russian](https://habr.com/ru/post/669674/) gives more details.

The differences from the previous version include:
- a larger vocabulary: 83828 tokens instead of 29564;
- longer supported sequences: 2048 tokens instead of 512;
- sentence embeddings that approximate LaBSE more closely than before;
- meaningful segment embeddings (tuned on the NLI task);
- the model is focused only on Russian.

The model should be used as is to produce sentence embeddings (e.g. for KNN classification of short texts) or fine-tuned for a downstream task.
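
The KNN classification use case mentioned above can be sketched roughly as follows. This is a minimal illustration, not part of the original model card: the helper names `dot` and `knn_predict`, the labels, and the 3-dimensional toy vectors are all hypothetical stand-ins for real 312-dimensional embeddings. It assumes the embeddings are L2-normalized, so that the dot product equals cosine similarity.

```python
from collections import Counter


def dot(u, v):
    """Dot product; equals cosine similarity when both vectors are unit-length."""
    return sum(a * b for a, b in zip(u, v))


def knn_predict(query, train_vectors, train_labels, k=3):
    """Return the majority label among the k training vectors most similar to query."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: dot(query, pair[0]),
        reverse=True,
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Toy unit-length vectors (hypothetical); real sentence embeddings have 312 values.
train_vectors = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.6, 0.8],
]
train_labels = ["greeting", "greeting", "weather", "weather"]

predicted = knn_predict([0.6, 0.8, 0.0], train_vectors, train_labels, k=3)
print(predicted)
```

In practice, the training vectors and the query would be embeddings of labeled short texts and of the text to classify, respectively.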

Sentence embeddings can be produced as follows:

```python
# pip install transformers sentencepiece
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("cointegrated/rubert-tiny2")
model = AutoModel.from_pretrained("cointegrated/rubert-tiny2")
# model.cuda()  # uncomment if you have a GPU


def embed_bert_cls(text, model, tokenizer):
    """Return the L2-normalized [CLS] embedding of a single text as a NumPy array."""
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # The hidden state of the first ([CLS]) token serves as the sentence embedding.
    embeddings = model_output.last_hidden_state[:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()


print(embed_bert_cls('привет мир', model, tokenizer).shape)  # 'привет мир' means "hello world"
# (312,)
```

Alternatively, you can use the model with `sentence_transformers`:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('cointegrated/rubert-tiny2')
sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(embeddings)
```
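
Once sentences are embedded, a common next step is to compare them pairwise. The sketch below is an illustration added for clarity, not part of the original card: `cosine_similarity`, `similarity_matrix`, and the toy vectors are hypothetical, and the code makes no assumption about whether the vectors are already normalized. It is written in plain Python, so it should also accept the rows of the `embeddings` array from the snippet above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities, returned as a list of rows."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy vectors (hypothetical) standing in for rows of `embeddings`.
toy_embeddings = [
    [3.0, 4.0, 0.0],
    [6.0, 8.0, 0.0],
    [0.0, 0.0, 2.0],
]
matrix = similarity_matrix(toy_embeddings)
for row in matrix:
    print([round(x, 3) for x in row])
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.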

To run inference with [vLLM](https://docs.vllm.ai/en/latest/), use the vLLM-optimized version of this model: [WpythonW/rubert-tiny2-vllm](https://huggingface.co/WpythonW/rubert-tiny2-vllm).