
Making large language models like GPT-3 and GPT-4 work for languages around the world is a big challenge. Models like ChatGPT use natural language processing to recognize sentiment, summarize, translate, and generate responses or recommendations based on the data they analyze. However, these models are only as good as the data they are fed. One paper found that just 20 of the roughly 7,000 languages spoken globally account for the bulk of NLP research. This means that low-resource languages, which appear far less often as text on the Internet, remain effectively unintelligible to AI.

Researchers like Ruth-Ann Armstrong are attempting to create new datasets that represent low-resource languages. Armstrong collected around 650 examples of Jamaican Patois and labeled each one as entailment, contradiction, or neutral; a group of Catalan researchers did the same for Catalan. This takes a lot of manual work, because these languages barely show up in corpora like Common Crawl. Researchers are also trying to evaluate how well big language models like GPT-3 do on Catalan, a language spoken in an autonomous community of Spain. By word count, GPT-3's training set is roughly 92% English, 1.4% German, and 0.7% Spanish, with only about 0.01% Catalan. Despite this, it still performs surprisingly well. The underlying problem is the amount of data available: Common Crawl reports that about 0.2335% of its crawl is Catalan, while GPT-3 read only around 140 pages of Catalan. It also means that speakers of these languages are dependent on the performance, or the goodwill, of a few institutions or companies.
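To make the labeling scheme concrete, here is a minimal Python sketch of what one record in such a natural language inference (NLI) dataset might look like. The `NLIExample` class and the premise/hypothesis sentences are invented placeholders for illustration, not actual entries from Armstrong's dataset.

```python
from dataclasses import dataclass

@dataclass
class NLIExample:
    premise: str      # a sentence in the low-resource language (placeholder text below)
    hypothesis: str   # a second sentence whose relation to the premise is labeled
    label: str        # one of: "entailment", "contradiction", "neutral"

examples = [
    NLIExample(
        premise="Di pickney dem a play outside.",          # invented patois-style example
        hypothesis="The children are outdoors.",
        label="entailment",
    ),
    NLIExample(
        premise="Di pickney dem a play outside.",
        hypothesis="Everyone is asleep inside the house.",
        label="contradiction",
    ),
]

# Print each labeled pair to show the structure annotators work with.
for ex in examples:
    print(f"{ex.label:13s} | {ex.premise} -> {ex.hypothesis}")
```

Building a few hundred such pairs by hand is exactly the kind of labor-intensive annotation described above, since no large web corpus exists to harvest them from automatically.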

In response, Thomas Wolf, co-founder of Hugging Face, helped launch the BigScience project, which built BLOOM, an open-source multilingual model that includes low-resource languages. The project partnered with local communities to gather data, so it knows where the data comes from and how it was obtained.
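Because BLOOM is released openly, anyone can load a checkpoint and generate text in the languages it covers. Below is a rough sketch using the Hugging Face `transformers` library; the small `bigscience/bloom-560m` checkpoint and the Catalan prompt are illustrative choices, not a recommendation from the article.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small public BLOOM checkpoint and its tokenizer.
model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate a short continuation of a Catalan prompt.
prompt = "Bon dia! Com estàs?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```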

Even for the target audience of English speakers, there are reasons to want all languages to be well represented. When voice assistants like Siri first came out, they had difficulty understanding certain accents; expanding the training data to include more accents helped. The same logic applies as these technologies are built for more languages. And if these models are going to be everywhere, you need to be able to trust the people building them. If you trust Microsoft, that's fine. But if you don't trust them, then it's your language that is at stake. Catalan speakers know this well: a small language, or one with a moderately small number of speakers, may have a very, very small digital footprint and is bound to just… disappear.