I built the algorithm behind ChatGPT from scratch — here's what I learned

Source: DEV Community
By Tushar Singla | First Year BTech CSE (AI/ML) Student

"What I cannot create, I do not understand." — Richard Feynman

That quote hit different when I was staring at my screen at 2am, watching my tokenizer learn the word "the" by merge #17. Let me explain.

The origin story

Every time you type something into ChatGPT, Claude, or any LLM, something happens before the AI even sees your message. Your text gets tokenized.

"Hello, how are you?" → [15496, 11, 703, 389, 345, 30]

Those numbers are what the model actually sees. Not your words. And the thing doing this conversion? A tokenizer.

I wanted to understand exactly how it works. Not from a YouTube video. Not from a HuggingFace tutorial. I wanted to build one myself, from scratch, in pure Python.

So I did. Meet TewToken — a bilingual BPE tokenizer trained on English + Hindi text, built with zero ML libraries.

pip install git+https://github.com/tusharinqueue/tewtoken.git

Wait, what even is tokenization?

Computers do not understand language.
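To make the "merge #17" moment above concrete, here is a minimal sketch of the byte-pair encoding (BPE) training loop in pure Python. This is my own illustrative toy, not TewToken's actual code; the corpus, function names, and merge count are all made up for the example. The idea is exactly what the article describes: start from characters, repeatedly count adjacent symbol pairs, and merge the most frequent pair into a new token until whole words like "the" emerge.

```python
# Toy BPE training sketch (illustrative only, not TewToken's implementation).
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus (word -> frequency)."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Hypothetical corpus: word frequencies, each word split into characters.
words = {tuple("the"): 5, tuple("then"): 2, tuple("hat"): 3}

merges = []
for _ in range(3):  # a real tokenizer runs thousands of merges
    counts = get_pair_counts(words)
    best = max(counts, key=counts.get)  # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(words, best)

print(merges)  # first merges: ('t', 'h'), then ('th', 'e') -> "the"
```

On this tiny corpus, "the" appears as a single token after just two merges; on real bilingual text it takes far longer, which is why watching it finally show up at merge #17 felt like a milestone.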