🏆
Code is acknowledged Philosophical Poem!

Starred repositories

[TMLR] A curated list of language modeling researches for code (and other software engineering activities), plus related datasets.

2,960 193 Updated Oct 15, 2025

Code for the paper "Greed is All You Need: An Evaluation of Tokenizer Inference Methods"

Python 10 1 Updated Nov 26, 2024

Simple-to-use scoring function for arbitrarily tokenized texts.

Python 46 5 Updated Feb 19, 2025

Codebase for EMNLP Findings Submission titled: Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models

Python 9 1 Updated May 30, 2025

🛸 Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy

Python 1,397 173 Updated May 26, 2025

✨✨Latest Advances on Multimodal Large Language Models

16,483 1,065 Updated Oct 16, 2025

A Keras TensorFlow 2.0 implementation of BERT, ALBERT and adapter-BERT.

Python 810 196 Updated Jan 13, 2023

⭐️ NLP Algorithms with transformers lib. Supporting Text-Classification, Text-Generation, Information-Extraction, Text-Matching, RLHF, SFT etc.

Jupyter Notebook 2,388 405 Updated Sep 29, 2023

How to train an LLM tokenizer

Python 153 29 Updated Jul 13, 2023

This repository contains the official release of the model "BanglaBERT" and associated downstream finetuning code and datasets introduced in the paper titled "BanglaBERT: Language Model Pretraining…

Python 245 34 Updated Jan 24, 2023

Supercharge Your Model Training

Python 5,419 455 Updated Oct 6, 2025

[NeurIPS-2024] 📈 Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies https://arxiv.org/abs/2407.13623

Python 87 5 Updated Sep 26, 2024
Python 44 4 Updated Feb 5, 2023

A tool for extracting plain text from Wikipedia dumps

Python 3,928 1,005 Updated May 23, 2024

PyTorch implementation of BERT in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"

Python 109 29 Updated Nov 1, 2018

A PyTorch implementation of Transformer in "Attention is All You Need"

Python 106 30 Updated Dec 6, 2020

Production infrastructure for machine learning at scale

Go 8,034 603 Updated Jun 12, 2024

DziriBERT: a Pre-trained Language Model for the Algerian Dialect

Python 165 11 Updated Dec 28, 2022

repository for Publicly Available Clinical BERT Embeddings

Python 734 151 Updated Aug 25, 2020

BERT models pretrained on the CORD-19 Kaggle dataset

15 4 Updated Jun 8, 2020

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Python 2,363 349 Updated Mar 23, 2024

A BERT model for scientific text.

Python 1,647 232 Updated Feb 22, 2022

Bringing BERT into modernity via both architecture changes and scaling

Python 1,546 127 Updated Jun 30, 2025

Builds wordpiece(subword) vocabulary compatible for Google Research's BERT

Python 231 48 Updated Dec 4, 2020

End-to-End recipes for pre-training and fine-tuning BERT using Azure Machine Learning Service

Jupyter Notebook 400 126 Updated Jun 12, 2023

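A survey of the information extraction field by the natural language processing research team at Beihang University's Advanced Innovation Center for Big Data. It covers subtasks such as entity recognition, relation extraction, and attribute extraction, and surveys both academia and industry for each subtask.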

472 69 Updated Apr 29, 2022

Online playground for OpenAI tokenizers

TypeScript 1,375 157 Updated Apr 24, 2025

The best way to start a full-stack, typesafe Next.js app

TypeScript 28,112 1,395 Updated Oct 11, 2025

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

Python 16,247 1,263 Updated Oct 6, 2025

A feature-rich command-line audio/video downloader

Python 130,995 10,520 Updated Oct 15, 2025