The Irony of Language Models That Don't Speak Your Language

Source: DEV Community
This is a personal project and article. The opinions expressed here are my own and do not reflect the opinions of AWS or Amazon. This project is not an AWS product and is not endorsed by or affiliated with AWS.

AI is plugged in everywhere now, and it is genuinely a breakthrough technology. However, there is a key element that turns out to be the elephant in the room, largely absent from the major headlines: LLMs are fundamentally centered on high-resource languages, and English most of all. Over 92% of the training tokens in leading LLMs are English (Brown et al., NeurIPS 2020). Of the roughly 7,000 languages spoken worldwide, most LLMs meaningfully support only about 50. And by "support" here, we mean only that the model is capable of answering something at all (we're not talking about accuracy). The remaining languages, spoken by billions of people, are either poorly represented through low-quality machine-translated English content, or absent entirely. This means that when a Thai farmer asks about crop