No Language Left Behind
I'll believe it when I actually see it. I'm a native speaker of a reasonably small language spoken by about a million people, and never have I ever seen a good automatic translation for it. The only translations that are good are the ones that have been manually entered, and those that match the structure of the manually entered ones. I think the sentiment is laudable and wish godspeed to the people working on this, but for the time being I don't see it becoming a reality yet. When Google Translate regularly struggles even with big pairs such as German-English-German, I have reservations about someone making it work for languages where datasets are orders of magnitude smaller.
It's an extremely difficult problem indeed. A lot of people on the team speak low-resource languages too (my native language as well!), so definitely resonate with what you're saying. My overall feeling is: yeah it's hard, and after decades we can't even do German translation perfectly. But if we don't work on it, it's not gonna happen. I really hope that people who are excited about technology for more languages can use what we've open sourced.
> But if we don't work on it, it's not gonna happen.
That’s exactly right. There’s too much bias in society that if something isn’t perfect, then why bother? Nothing is perfect, so with that attitude there can be no progress. Thank you for doing important work!
Personally I'm hoping that globalisation prunes out as many languages as possible before we end up with brain implants automatically translating everything for us and no one can communicate without these chips.
That's a silly thing to wish for: like wishing global warming kills off as many species of animals as possible, in order to simplify zoology.
Becoming bilingual is one thing. Completely extinguishing a language is a totally different matter. It is usually associated with migrating away from the geographic area of the language and/or physically losing speakers (old age, wars, genocides, etc.)
You can check the list at https://en.wikipedia.org/wiki/List_of_languages_by_time_of_e...
To give an example and be blunt: I do not expect the official language of any European country to go extinct during our lifetime unless that country is destroyed, which obviously would not be a good thing.
As for brain implants, I won't hold my breath.
There are several reasons a language can go extinct; the ones you mentioned are among them, but there are others you didn't consider. The most common way language death occurs is through contact with a prestige language, which creates social pressure that makes the non-prestige language less and less commonly spoken[1]. A community becoming bilingual is part of this process -- it doesn't always result in language death, but it is an early stage of that form of language death.
As for the official languages of European countries (a strange line to draw), Icelandic is already in a state where younger Icelandic natives speak to each other in English because smartphones do not support their native language. It is entirely possible that Icelandic will be in danger of extinction this century[2].
[1]: https://www.youtube.com/watch?v=t3qbYFvOHwk [2]: https://www.nytimes.com/2017/04/22/world/europe/iceland-icel...
Icelandic won't disappear soon. (It's a bit of a mistake to say Latin disappeared, but I digress.)
It might take some work (and more media being produced in it, surely) but I don't think it's at risk
> because smartphones do not support their native language
Nowhere does the article say that. And most smartphones do have support (including an Icelandic keyboard).
What is not supported is voice-activated devices:
> "“Not being able to speak Icelandic to voice-activated fridges, interactive robots and similar devices would be yet another lost field,"
(I guess we've skipped over the discussion about whether language death can occur without genocide.)
> It's a bit of a mistake to say Latin disappeared, but I digress
Latin is a dead language. It's not extinct (because it's used for ceremonial reasons by the Catholic Church) but it has no native speakers. This is not a remotely controversial statement. Old Church Slavonic has a similar role in Slavic Orthodox Churches and is also a dead language for the same reason.
> Nowhere does the article say that. And most smartphones do have support (including an Icelandic keyboard).
In my defense, I can't read the article -- I linked it because this Tom Scott video[1] uses that article as a reference for the Icelandic example he gave where he explicitly says that Icelandic is not supported by modern smartphones. My phone does have Icelandic support now so I guess the statement was only true at the time he said it?
Thanks for clarifying
You're of course right about "Latin is dead" (for practical purposes). But there are lots of nuances that get lost in that statement.
Latin is dead in the same way as Middle English is dead. The Latin that died is the one from around the time of the late Roman Empire, which was "photographed" and frozen in time as Ecclesiastical Latin (an oversimplification, of course).
And of course the modern Romance languages derived from it, but there's no exact moment where those "flipped the switch" and became modern French, Spanish, Catalan, Portuguese, etc. So in a way it could be argued that they're Latin 3.0 w/ DLCs (which, I'll totally grant, is a big stretch).
By your criteria no language that has survived to now will disappear without a major global calamity. Effectively, Latin is not a language in which a native speaker can go through their life and leave behind future native speakers.
Regarding Icelandic, I know little about it except that, written properly, it uses a lot of non-ASCII characters. I already know many speakers of dialects of Latin-derived languages who developed their writing habits when SMS limits for Unicode messages were significantly smaller than for plain-ASCII ones, with long-term repercussions. I'm not sure why facts like that would need to be in the article for us to discuss them here?
So, Ukrainian?
> prunes out as many languages as possible
Would you feel the same way if that includes all languages you know, including all versions of English?
Yes, I’d be happy to learn a new language if it was the step to one unified language. Of course there is no way to back that up since it will never happen but I at least assume I’d do it if that was the case.
Would you not feel even slightly unhappy that your children (or grandchildren) would not be able to read the original versions of Shakespeare, Dickens, or Austen or that they wouldn't be able to watch the original versions of movies and shows you've enjoyed? They would only be able to watch and read translations, and all of the linguistic artistry would be lost to them.
It's not just about becoming bilingual, a population becoming bilingual in a "prestige language" is the first stage of language death (though of course it doesn't /always/ lead to language death).
No not really. English has shifted so far that those works are hardly as understandable now as a modern translated work.
That isn't the case for every language, I'm just trying to come up with an equivalent explanation of why wishing for languages to die out is a bad thing.
How would you feel about being the one that has to teach your children a language that will hinder their prospects instead of one that will help them succeed, just so that the speakers of the bigger language feel good about themselves that they are good people or something?
I could not think of a more obvious way of telling everyone that you are monolingual than implying that knowing another language is a burden. Children are not burdened by being multilingual.
It is true that social pressures kill off local languages, but it's usually not because parents don't want to teach their children their mother tongue; it's that people stop using the language to communicate because of the influence of the "prestige language". My parents (and all of the parents in the immigrant community I live in) went to great pains to teach their children their native language.
English is, of course, my second language, maybe that's why you didn't quite grasp what I was saying.
The burden is not in knowing another language, the burden is in making the "less useful" language your main one.
Your parents and all of the parents in the immigrant community you live in are very happy that English is now your main language which you acquired through school, friends, tv etc. Or maybe they put you in your-language-only schools, made sure you socialized with your-language-only friends and watched your-language-only media?
I don't think my prospects are hindered in any way by being a native speaker of a fairly small language. If anything, I prefer having English as a second language.
How would you feel if your kids only learned Chinese[1], and not a word of English?
[1] I’m assuming you’re not Chinese
I'm still a bit disappointed that Esperanto isn't the official language of the EU.
I speak a medium-resource language with 11 million speakers. Google Translate works so poorly with it that translations are often nonsensical. But DeepL works so well with it that translations are often indistinguishable from native speaking translations. I'm a big believer that the model can make a huge difference.
On the other hand, as a non-native Japanese learner, it is very obvious when Japanese text has been DeepL-translated because it often makes 敬語/register and context mistakes (and translating Japanese to English it does even worse because it struggles with null-subject languages). I am sure a native Japanese speaker would be able to see even more mistakes than I can.
DeepL seems to handle grammar a bit better (ex. run-on sentences) but for whatever reason, it struggles with basic vocabulary sometimes. Also, when it does make mistakes, they change the meaning subtly enough to render the translation unusable.
Google Translate does the same in many languages, to the point that it will often reverse the meaning of a sentence. I honestly feel like these tools are still mostly useful when you don't really need to know what the text means.
> never have I ever seen a good automatic translation for it.
>
> When Google Translate regularly struggles even with big pairs such as German-English-German, I have reservations about someone making it work for languages where datasets are orders of magnitude smaller.
I speak a language where I've never seen any translation for it... and when translated manually, my mum totally butchers the meaning lol.
Either way, any work in this area is more than welcome, but damn it's a hard problem.
There's a section where you can try reading translated children's books. See if your language is supported and how good the translation is.
Burmese and Cambodian are 100% useless on Google/Bing Translate, but the children's book translations on the example page are really, really good.
Surprisingly, translations of the books into Russian seem considerably better than into English (at least for the first three books I tried)
There's a large tradition of having texts translated into Russian, whereas English speakers would very rarely read anything translated from another language.
"Tradition" sounds a bit funny knowing that a lot of Russian book publishing was/is published without author permission. Russian publishing isn't well known for following copyright laws :V
I'm pretty sure that Russian (Soviet?) publishers observed whatever copyright laws there were at the time. It's just that international copyright law is a recent phenomenon. The USA was also a very late signatory of international book copyright treaties, AFAIK.
Blog post: https://ai.facebook.com/blog/nllb-200-high-quality-machine-t...
Paper: https://research.facebook.com/publications/no-language-left-...
Github: https://github.com/facebookresearch/fairseq/tree/nllb/
Also note comments from hello_im_angela (= Angela Fan) and jw4ng (= Jeff Wang). Those are the HN accounts for Angela and Jeff from No Language Left Behind.
Hey all, I work on this project. Full list of languages can be found here: https://github.com/facebookresearch/flores/tree/main/flores2...
As well as in the research paper: https://research.facebook.com/publications/no-language-left-...
The analogy I like the most is that they've found the "shape" of languages in high dimensions, and if you rotate the shape for English the right way, you get an unreasonably good fit for the shape of Spanish -- and the same again for all the other languages.
We're at a point where it's now possible to determine the shape of every language, provided there are enough speakers of the language left who are both able and willing to help.
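For readers curious what that "rotation" analogy looks like concretely, below is a minimal sketch of classic cross-lingual embedding alignment via orthogonal Procrustes (the MUSE-style trick). This is only an illustration of the analogy, not NLLB's actual method, and the embedding matrices here are random stand-ins for real pretrained vectors of dictionary word pairs.

    # Orthogonal Procrustes alignment of two monolingual embedding spaces.
    # X and Y are stand-ins: row i of X is the English vector of a dictionary
    # word, row i of Y the Spanish vector of its translation.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 300                      # embedding dimension
    n_pairs = 5000               # size of the seed bilingual dictionary
    X = rng.normal(size=(n_pairs, d))
    Y = rng.normal(size=(n_pairs, d))

    # Solve min_W ||X W - Y||_F with W orthogonal (a pure rotation/reflection).
    # Closed-form solution: W = U V^T from the SVD of X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    W = U @ Vt

    def translate(x_vec, target_vocab_vecs):
        # Map a source-language vector into the target space, then return the
        # index of the nearest target-language word by cosine similarity.
        mapped = x_vec @ W
        sims = target_vocab_vecs @ mapped / (
            np.linalg.norm(target_vocab_vecs, axis=1) * np.linalg.norm(mapped) + 1e-9
        )
        return int(np.argmax(sims))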
<Snark> Once done, Facebook can then commodify their dissent, and sell it back to them in their native language. </Snark>
Anyone who knows or is learning another language can easily tell you that the "warping" methodology of MTL is insufficient. There was a really good video by Tom Scott [1] that talked about this, but the short version is that there are critical bits of language that live in context and are inferred by speakers. Any accurate MTL needs nearly full context, both on the page and in the cultural moment, in addition to probably needing to ask questions of the author.
So, if I had a corpus of all the literature from 1800-1850 digitized, the context would be sufficiently different as to be a new language?
It seems to me that the happy accident of doing this research at the start of getting all human knowledge digitized is part of the unreasonable effectiveness of this overall technique.
Had it happened in 200 years, it might not have worked, right?
Darmok and Jalad at Tanagra.
The shape analogy doesn't really apply to modern language models. Each word gets its own context-dependent high-dimensional point. With everything being context dependent, simple transformations like rotations are impossible. A more accurate picture is that any concept expressible in language now has its own high-dimensional representation, which can then be decoded into any other language.
> REAL-WORLD APPLICATION
> Translating Wikipedia for everyone
Hmmm.
While there is very definitely utility in doing things like this, I do kinda fear the "poisoning the well" effects of feeding (even partially) AI-generated data into extremely common AI data sources.
There's some info on it in a blog post[1] and the MediaWiki "Content translation" page[2], but does anyone know of any studies on the quality of the translations produced? I can absolutely see it being a huge time-saver for people who are essentially fluent in both (there's a lot of semi-mechanical drudgery in translating stuff like this that could be mostly eliminated)... but people are pretty darn good at choosing the easy option of trusting whatever they're given rather than being as careful as they should be. It kinda feels like it runs the risk of passively encouraging people to trust the machine's choice over their own, as long as it isn't obviously nonsense, and the cumulative effect could be rather large after a while.
[1]: https://diff.wikimedia.org/2021/11/16/content-translation-to...
Yeah, I really hope they don't do this. I live in a country where I don't speak the language well, so I am using Google Translate and DeepL [0] all day every day. The quality of translations of real-world text is so incredibly variable. There is literally no way to know when it will suddenly reverse the meaning of a sentence, or produce something that sounds like it makes sense, but in terms of meaning bears no relation to the input at all.
A machine-translated Wikipedia would not be a trustworthy source of information at all, yet would look like one. I think that does significantly more harm than good.
[0] Suggestions for better alternatives welcomed.
On top of that, a lot of language-specific content has to include sources in that same language.
(As an example, it would be absurd for the Lithuanian Wikipedia to include sources in Japanese -- that would be neither usable nor useful for that Wikipedia's readers and editors...)
Jeff Wang here with my fellow Meta AI colleague Angela Fan from No Language Left Behind, seeing the comments flowing through. If you want to ask us anything, go for it!
Hi Jeff,
I currently host the largest collection of bilingual Manx[0] <-> English texts (~1MM words). How would I formally get in contact to chat about the steps to make machine translation available (and would there be grant opportunities available for further production of machine-readable data?)
Could you send me an email, please? The address is on page 1 of our paper: https://research.facebook.com/publications/no-language-left-...
Regarding grants: we have offered compute grants previously with the Workshop on Machine Translation (last year: https://www.statmt.org/wmt21/flores-compute-grants.html, this year: https://statmt.org/wmt22/large-scale-multilingual-translatio...) and we have an RFP, but it's currently focused on African languages: https://ai.facebook.com/research/request-for-proposals/trans...
Done, let me know if it didn't go through
Thank you for your exciting work and for coming onto HN to respond to questions.
I am a former professional translator (Japanese to English) and am now supervising research at the University of Tokyo on the use of machine translation in second-language education. As I have written in a few papers and essays [1], advances in MT have raised serious questions for language teachers. The ready availability of MT today, including on Facebook and Instagram, means that language students use it a lot while studying. We don’t know yet, though, how that use of MT might affect our students’ acquisition of other languages or their motivation to keep studying those languages.
One of the hurdles educators and researchers face is finding out how MT is being used in the real world. Most education in modern languages is focused on giving students language skills that they will be able to use later in work, education, and daily life, and textbooks and other learning materials are typically shaped around real-world situations. We are now struggling to adapt those materials for the age of MT, because data on the actual use of MT is very hard to get.
Like Google, Microsoft, Baidu, DeepL, and others, Meta must have huge amounts of data on how your users are using MT to communicate. Any information and insights about that MT usage that you can share with world—just as you have generously shared your NLLB models—would be most welcome.
I've learned two languages with the help of MT. I'm sure you've interviewed people like me, but I get excited about the potential of MT for language learning, so I'd like to share my thoughts.
When I learned Spanish, I spent a lot of time chatting on Facebook with native speakers, and using Google Translate as "training wheels" to help me formulate sentences, and understand words and phrases I hadn't learned yet. It worked pretty well at the time (2012) except in cases of slang and typos that Google couldn't handle. I also used it a lot to help me translate blog posts from English to Spanish. Eventually, I graduated from the training wheels and was able to use Spanish fluently without the help of MT. More than once, while not using MT, I was told that I spoke Spanish with a "Google Translate accent", which I'm sure was more of a reference to my grammar than my accent, since my spoken practice was 100% with native speakers.
When I learned Hungarian (2019-now), at the beginning, Google Translate wasn't good enough to use for much more than getting a rough understanding of formal text, so I learned in a more traditional way at a school and with native speakers. Then the pandemic prevented me from doing both of those. I started chatting with native speakers on Facebook, but it was very difficult without MT and involved a lot of asking my conversation partners for translations and explanations. Progress was frustratingly slow. Then I discovered DeepL's MT, which was extremely good with Hungarian. I started using it for chat conversations and emails, and people were shocked that I was managing to communicate with them so fluently. My progress in actually learning the language for myself accelerated dramatically. I've become conversational (B2/C1) in Hungarian in 2.5 years with very little in-person practice. Often, it takes native English speakers 5 years of in-person practice to reach that level. I'm convinced that MT played a key role in my ability to learn quickly.
When I use MT, I have a simple rule: I have to understand each word of a translation before I send it. So I carefully read the translation, making sure that I understand each word. Sometimes that means I have to look up individual words/grammar before sending a message (I often use Wiktionary for that, because it shows etymology), and other times, it means that I'll replace unfamiliar words or phrases in a translation with words and phrases from my own vocabulary. Over time I rely on MT less and less because my own vocabulary becomes stronger. I really believe that the key to learning a language quickly is to start USING the language as quickly as possible. Once you're using a language, your brain automatically starts picking up the skill. With traditional language learning, using a language can be very difficult in the beginning until you've reached a conversational level, but with MT, you can start using a language before you know everything.
For Spanish, I almost never use MT anymore. Sometimes I use it as a quick dictionary for an unfamiliar word, but my Spanish level is C2 and I use Spanish every day so it feels natural. I'm not ever translating in my head anymore.
For Hungarian, I'm still using MT often, but I don't need it during conversation (either written or spoken). Besides using it to translate things I don't know, I also find it useful for inputting Hungarian characters that are a pain to type with my US keyboard, and for conjugating words correctly when I know the root but am struggling for the correct ending. Often I'll know what I want to say in Hungarian, but I'll open DeepL and type in English, then adjust the translation to use the words I want before I copy and paste the Hungarian. I'm essentially using MT as a guide to help me craft my sentences even when I know what I want to say.
In summary, MT is awesome for language learning and for assisting language skills in development.
Thank you! Those are really interesting and valuable comments. I haven’t, in fact, heard many stories like yours, especially with such clear insights about how you have been able to use MT constructively in your language learning.
Most of the discussions I’ve had about MT have focused on language learning in school contexts. In Japan and most other countries (though often not in English-speaking countries), all children have to study at least one foreign language in school. As with all compulsory education, low motivation and poor study skills are a constant problem. In such contexts, MT seems to many teachers and students just to be a way to cheat on classwork. And since very few of today’s veteran teachers were able to use MT when we were young and studying languages, we don’t understand how we can guide even our motivated students on using it productively. I will be sure to share your insights with my colleagues.
A couple of comments:
> More than once, while not using MT, I was told that I spoke Spanish with a "Google Translate accent", which I'm sure was more of a reference to my grammar than my accent....
This can happen with traditional methods of language learning, too. The language of textbooks, like the output of MT, usually reflects the standard written language, which can be very different from how people actually speak, especially in the case of languages with large dialect and register variations.
> When I use MT, I have a simple rule, that I have to understand each word of a translation before I send it.
That sounds like an excellent rule. I will pass on that advice to educators I know who are trying to figure out how to guide their students on the use of MT.
Many thanks again.
If your goal is to make inclusive translation more widely available why license the models under a non-commercial license? This basically makes it impossible to use legally (or at least without a lot of legal risk) for essentially anyone due to the vague definition of what's commercial. Is Facebook hurting for money and looking to commercially license this model on request?
This enables any researcher to use our code freely and build on top of it for their own research. We are not intending to commercially license our project.
Okay, but why?
If your aim is to make this technology more widely available and, as you claim, "give people the opportunity to access and share web content in their native language, and communicate with anyone, anywhere, regardless of their language preferences", then why make it so that the model essentially can't be used for anything useful? It doesn't really make any sense.
Even the use case which you're promoting on your front page -- the Wikimedia Foundation's Content Translation -- is illegal under the non-commercial license in certain jurisdictions! For example, see here: https://www.techdirt.com/2014/03/27/german-court-says-creati...
Even using it for research would be illegal as it's also not exactly "personal use".
Hey Jeff, I’m a native speaker of Dhivehi — the language spoken by the people of Maldives. Since I couldn’t find a full list of supported languages I was wondering if Dhivehi is / would be integrated.
Dhivehi is currently not supported, unfortunately. We view this as a starting point and are committed to expanding to many other languages as in the spirit of our project name.
Full list of currently supported languages can be found here: https://github.com/facebookresearch/flores/tree/main/flores2...
I'm curious how much work it takes to prepare training data for a language. From anecdotal experience, I've always been able to learn some basic survival skills in a new language by studying the translations of about 20 key phrases for a week or so, which give me the ability to combine them into a few hundred different phrases and survive most daily transactions. So I always imagine that training a language model is similar, just on a much larger scale. It seemed to me that there could be a standard text that includes a lot of important topics and contexts, which just needs to be manually translated into a target language and then fed to the model. I imagine it being about the size of a large book, so I imagine that adding a new language to a model would cost a similar amount to paying to have a book translated. Obviously the size of the input text would have an effect on how good the model's translations are, and domain specific translations would require more specific input. While having a full translation of an entire library seems like a good way to train a model that's used to translate everything, it seems like a small percentage of the library would be enough to produce native-level translations for most domains.
How far off are my intuitions on this? What are the costs of adding a new language to a model like this? Is there a ballpark dollar amount per language?
Without any supervised training data, it's pretty difficult to create a very good translation model. For many languages, data might only be available in religious domains, such as the Bible, or not available at all. We created a dataset called NLLB-Seed for this reason --- it's approximately 6K sentences available for 39 languages that translate a broad set of topics from Wikipedia. We found that with a dataset like NLLB-Seed, we're able to have sufficient supervised signal to jumpstart our automatic dataset creation pipeline. Of course, the more high-quality aligned data, the better the model's performance, but our project explores how we can make models more efficient at learning even when the training data is small.
Importantly, models can learn from other languages that are similar. If we train separate models for each direction on small amounts of data, the performance is significantly worse than grouping languages in one large multilingual model.
These initiatives are always couched in "inclusion" rhetoric (the very name of your project is telling); I don't doubt for a second that it's a genuine sentiment, but I strongly suspect your team hasn't thought through the full, self-defeating implications of universal language translation.
The problem is that it increases the risk of monoculture to 100%. Without language barriers, cultural diversity is lost, not gained, since you have winner-take-all effects[0]. Instead of helping revive languages, it'll make American ideas, mores, morality (Puritanism), philosophies, and political values more dominant worldwide.
To be clear, this will increase economic opportunity, but will inevitably kill cultural diversity.
Is your team considering or studying this?
[0]: https://www.sampaxinos.com/mental-models/the-winner-takes-al... (or see Taleb's works)
Hi, I'm putting together an online event called 31 Days of AI for Book-Lovers to coincide with US National Book Month, October 2022. I was struck by the specific call-out to translating literature on your demo page and would like to feature a specifically book-related application of NLLB on one of 'anchor days'. Can someone work with me on this?
Hi, I'm looking but can't seem to find instructions on how to do tokenization. Where is the SPM model -- is it "flores200_sacrebleu_tokenizer_spm.model" or something else? And is it direct, or SPM -> dict? And how do you prime the model for a specific language pair?
We tokenize with the flores-200 spm model, correct. To generate from the model, check out the instructions here: https://github.com/facebookresearch/fairseq/tree/nllb/exampl...
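In case it helps other readers, here is a minimal sketch of just the SPM step with that model file, using the standard sentencepiece Python package; the fairseq preprocessing that follows (binarization against the released dictionary, adding language tokens) is covered by the instructions linked above.

    # Tokenize text with the FLORES-200 SentencePiece model.
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="flores200_sacrebleu_tokenizer_spm.model")

    pieces = sp.encode("No language left behind.", out_type=str)  # subword pieces
    ids = sp.encode("No language left behind.", out_type=int)     # vocabulary ids
    print(pieces)
    print(sp.decode(pieces))  # round-trips back to the original text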
Are all the 200x200 translations going directly or is English (or another language) used as an intermediate for some of them?
All translation directions are direct from language X to language Y, with no intermediary. We evaluate the quality through 40,602 different translation directions using FLORES-200. 2,440 directions contain supervised training data created through our data effort, and the remaining 38,162 are zero-shot.
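A quick sanity check on those figures, assuming the 40,602 directions are all ordered pairs among 202 evaluated languages:

    # Direction counts quoted above (assumes 202 evaluated languages).
    n_langs = 202
    total_directions = n_langs * (n_langs - 1)   # 40,602 ordered directions
    zero_shot = total_directions - 2_440         # 38,162 directions with no supervised data
    print(total_directions, zero_shot)           # -> 40602 38162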
What is the greatest insight you gained and could share with non-experts from working on this project?
I gained a deeper understanding of what it truly means to be inclusive. Every language is unique just like everybody, and making sure content works for all and includes as many people as possible is really, really hard, but through this project I'm hopeful we are taking it one step further.
> Every language is unique just like everybody
TBH it just sounds like you've redefined the word "unique".
Gangi þér vel! (Icelandic for "good luck!")
I wonder how it differs from what Yandex.Translate did back in 2016: [0]
> The affinity of languages allows one common model to be trained for their translation. That is, "under the hood" of the translator, the same neural network translates into Russian from Yakut, Tatar, Chuvash and other Turkic languages. This approach is called many-to-one, that is, "from many languages into one." This is a more versatile tool than the classic bilingual neural network. And most importantly, it is the many-to-one approach that makes it possible to use knowledge about the structure and vocabulary of the Turkic languages, learned from the rich material of Turkish or Tatar, to translate languages like Chuvash or Yakut, which are less "resource-rich" but no less important for the cultural diversity of the planet.
> In order to create a unified model for translating Turkic languages, Yandex developed a synthetic common script. Any Turkic language is transliterated into it, so that, for example, the Tatar "дүрт" ("four"), written in Cyrillic, becomes similar to the Turkish dört ("four") -- not only from a person's point of view, but also at the level of string similarity for a computer.
This way they added support for Turkic and Uralic languages, which are very underrepresented on the Internet. But I don't know what the quality of their translation is: even though I live in a region where Mari (an indigenous Uralic language) is spoken and my wife is Mari, neither of us, sadly, speaks the language.
[0] https://techno-yandex-ru.translate.goog/machine-translation/...
We represent all languages in their natural script, rather than transliterating them into a common synthetic one.
Regarding Mari: extremely interesting language, exciting to hear that you are from that region. We are interested in working on this one (likely in the "Hill Mari" variant), but currently do not support it.
As a native Swiss German speaker, my native language is not only low resource in general, but has the additional difficulty of not having a standardized orthography (many native speakers will exclusively write in Standard German, and use Swiss German only for spoken communication).
So you have a language with some economic opportunity (a few million speakers in a fairly wealthy country) but no clearly defined written interface, and an ambivalent attitude of many speakers towards the very idea of writing the language.
sooo real. Many low-resource languages have many different natural variants, can be written in multiple scripts, don't have as much written standardization, or are mainly oral. As part of the creation of our benchmark, FLORES-200, we tried to support languages in multiple scripts (if they are naturally written like that) and explored translating regional variants (such as Moroccan Arabic, not just Arabic).
As an aside, the question of how to think about language standardization is really complex. We wrote some thoughts in Appendix A of our paper: https://research.facebook.com/publications/no-language-left-...
Another avenue for machine translation is to use audio instead of text. There is much more audio data available and being generated on a daily basis, especially for cases like yours it would be very useful.
Similar issue with Scots, which has many variant orthographies but is frequently written in mostly-English anyway.
This only makes the problem behind the NLLB project even more interesting to solve
Not a single Mesoamerican language is present: Maya, Náhuatl, Otomí, Zapotec, etc. And these languages are big; they are spoken by millions and even have literature. Náhuatl and Maya are spoken in Central America.
Are there online corpora, like Wikipedia, that could be used to train the models? Are those under a permissive enough license to be used for model training?
If there are spoken, with enough budget, a library of voices could be recorded. I think you’d prefer that collection to be gathered and maintained by a non-profit rather than Meta.
For náhuatl, I found this: Wikipedia in nahuatl https://nah.wikipedia.org/wiki/Cal%C4%ABxatl
I’m wondering if 7065 articles is enough to train the model.
Note that very recently Google has done something very similar: "Building Machine Translation Systems for the Next Thousand Languages": https://arxiv.org/abs/2205.03983 https://ai.googleblog.com/2022/05/24-new-languages-google-tr...
The Facebook paper has some direct comparison to that work.
Evaluation was important to us, and we really wanted to have a benchmark that covers all 200 languages
Hopefully the Scots language model wasn't trained on Wikipedia.
I'm not entirely sure why low resource languages are seen as such a high priority for AI research. It seems that by definition there's little payoff to solving translation for them.
I don't really remember the exact numbers anymore, but covering only the top 5 languages will cover maybe 40% of the world population, while covering the top 200 languages (many of them low resource) will cover maybe 90% of the world population.
Some numbers (though you cannot directly derive such cumulative figures from them): https://en.wikipedia.org/wiki/List_of_languages_by_total_num...
Some more numbers from here: https://www.sciencedirect.com/science/article/pii/S016763931...
"96% of the world’s languages are spoken by only 4% of its people."
Although this statement is more about the tail of the approximately 7,000 languages.
It doesn't sound like you're considering that people are very often fluent in a major language in addition to their regional one?
I am. That's why I mentioned that you cannot infer my statements directly from the numbers you find on Wikipedia etc. You cannot simply add up those numbers.
"Low-resource language" isn't just a euphemism for "language almost nobody speaks". There are many languages that are widely spoken but nonetheless are hard to obtain training data for. Getting something like Wikipedia going for a minority language can be a difficult chicken-and-egg problem because users will use English for its completeness/recency, despite their limited fluency, and the native-language Wikipedia remains neglected. So you can end up in a situation where users use one language for social media and another for news/research, and Facebook is in a unique position to care about the former.
Aside from the fact that being able to generalise a model with very little training data is an important AI research problem to solve, language death is a serious concern and is being accelerated due to the fact that many languages are not supported at all by modern technology (leading to "prestige language" pressures that are a known cause of historical language death).
For instance, Icelandic is not supported by any modern smartphone platform, which has led to Icelandic natives communicating with each other in English, and very little information is translated to Icelandic[1,2].
That being said, I am worried that having translations that are "too good" could also act to accelerate language death as the importance of keeping languages alive will seem less significant (to non-language-nerds) if we can translate works written in that language to any other language with very small datasets. Luckily I'm not convinced that AI models will be able to produce convincing and consistent translations for a long time -- languages are so different in so many ways that I can't see how adding more dimensions and parameters to a model would account for them.
[1]: https://youtu.be/qYlmFfsyLMo?t=141 [2]: https://www.nytimes.com/2017/04/22/world/europe/iceland-icel...
The point is that there are lots of humans who speak these languages and use tech. They just don’t use Wikipedia so getting a good translation corpus going was harder.
And it's both cumulative across all those languages (see above), cheap/amortized (if you can do a good multilingual NMT for 50 languages, how hard can 50+1 languages be?), and many of those languages are likely to grow both in terms of sheer population and in GDP. (Think about Asian or African countries like Indonesia or Nigeria.) The question isn't why are FB & Google investing so much in powerful multilingual models which handle hundreds of languages, but why aren't other entities as well?
What other entities would really have access to the text resources that FB & Google have? Outside of a few other large companies, I can't imagine many.
Surely the fact that they did all the high-resource languages first and are only now getting round to the less-popular ones demonstrates that that is not, in fact, the case?
I think the reason low resource languages are prioritized is to compensate for the fact that AI research normally has a tendency to marginalize these languages.
yes, but what principles justify the importance placed on low resource languages?
Low resource in this context means that there are few resources available to train a neural network with, not that there are few speakers. Although many low resource languages have relatively few speakers, there are also ones with tens of millions of speakers.
The reason for emphasis is in my opinion twofold: 1) Allowing these people to use the fancy language technology in their own language is good in and of itself. 2) Training neural networks on fewer resources is more difficult than using more resources and therefore a fun and interesting challenge.
Plus presumably we learn more from solving harder problems, and we prepare for one day needing to translate some alien language in a hurry.
The examples given are, with native-speaker numbers, Assamese (15 million), Catalan (4 million) and Kinyarwanda (10 million). These alone add up to more than the population of Australia.
Furthermore, Facebook considers the internet to consist of Facebook and Wikipedia (Zero).
I view this as just another extension of their Next Billion initiative, an effort to ensure that another billion people are monopolised by Facebook.
That's the payoff.
We think it's important for AI to truly support everyone in the world. A world where AI only serves a subset of the population is not ideal. In machine translation, this means supporting as many languages as possible at high quality. We also imagine a future where anyone will be able to communicate with anyone else seamlessly; this also means solving translation for all languages.
Wouldn't that also entail a bot speaking in any language?
Text to speech is a separate problem.
Small data, big meaning is much more important than big data, little meaning. Much closer to real intelligence.
Cynical answer: It's good PR.
hi @btheshoe, I work on the data part of this project. As others mentioned, the amount of data available for a language is not correlated with the number of speakers of that language, which explains the potential impact of focusing on these.
I'll know AI translators are any good when the United Nations starts using them
"Skills required: United Nations translators are required to have a perfect command of their main language and an excellent knowledge of, in most cases, two other official languages"
My ex is a translator at an embassy, and she always said that AI translators are a godsend.
On one hand, they make the work easier, as translators can focus more on correcting the AI-produced text and on the author's meaning, while eliminating a lot of plumbing.
On the other hand, they have increased the amount of business, because much more text is translated than at any other point in history, and that requires validation in most business, legal and even personal contexts. Without AI translators, those translations would not have happened in the first place.
Most media translators consider MTL worse than nothing, because editing it is actually harder than just doing the translation yourself. This can especially be an issue for neural MTL, because the output is both fluent (it looks natural) and inaccurate.
I'm surprised this didn't occur to me until after I posted because it fits with my general feeling that AIs will be nothing more than collaborative tools for the foreseeable future.
An organization built out of pure prestige, with no concept of monetary profit, has zero pressure to stop employing their classmates as translators, ever.
Does this mean that Facebook's advertising system will finally start rejecting ads calling for genocide in Myanmar, and that they will finally flag comments expressing the same intent? As recently as March of this year there were reports that Facebook accepted ads that said "The current killing of the Kalar is not enough, we need to kill more!" or "They are very dirty. The Bengali/Rohingya women have a very low standard of living and poor hygiene. They are not attractive".
Full story: https://abcnews.go.com/Business/wireStory/kill-facebook-fail...
These were submitted to test Facebook's systems, because there's a good reason not to trust their promises on this front. Facebook was used extensively to propagate hate speech in Myanmar during the crisis of 2017, with their moderation tools and hate speech detection system letting through a ton of hateful content with real-world consequences, in the course of an actual ethnic cleansing campaign.
Other references: "Facebook Admits It Was Used to Incite Violence in Myanmar" https://www.nytimes.com/2018/11/06/technology/myanmar-facebo... (2018)
"Violent hate speech continues to thrive on Facebook in Myanmar, AP report finds" https://www.cbsnews.com/news/myanmar-facebook-violent-hate-s... (9 months ago)
The issue here wasn’t that Facebook didn’t have resources for a basic translation tool (able to translate open death threats) but that Burmese had inconsistent encoding. That delayed the translation effort.
https://www.localizationlab.org/blog/2019/3/25/burmese-font-...
What are hardware requirements to run this?
I see the mixture model is ~ 300 GB and was trained on 256 GPUs.
I assume distilled versions can easily be run on one GPU.
We release several smaller models as well: https://github.com/facebookresearch/fairseq/tree/nllb/exampl... that are 1.3B and 615M parameters. These are usable on smaller GPUs. To create these smaller models but retain good performance, we use knowledge distillation. If you're curious to learn more, we describe the process and results in Section 8.6 of our paper: https://research.facebook.com/publications/no-language-left-...
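For anyone who wants to try one of the smaller checkpoints on a single GPU, here is a hedged sketch assuming the Hugging Face Transformers port of the distilled 600M model under the hub name facebook/nllb-200-distilled-600M (the fairseq instructions linked above remain the canonical route). Language codes follow FLORES-200, e.g. eng_Latn, fra_Latn.

    # Translate one sentence English -> French with a distilled NLLB-200 checkpoint.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model_name = "facebook/nllb-200-distilled-600M"
    tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    if torch.cuda.is_available():
        model = model.to("cuda")

    inputs = tokenizer("No language left behind.", return_tensors="pt").to(model.device)

    # Forcing the first decoder token to the target-language code selects the direction.
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
        max_length=64,
    )
    print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])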
"All models are licensed under CC-BY-NC 4.0" :
So, to clarify, does this mean that companies cannot use these models in the course of business, or is it more about selling the translation results directly?
What is a "low resource language"?
hey there, I work on this project. We categorize a language as low-resource if there are fewer than 1M publicly available, de-duplicated bitext samples.
also see section 3, table 1 in the paper: https://research.facebook.com/publications/no-language-left-...
hey, this sounds silly but I can't seem to find a list of all 200 languages covered. I've looked at the website and the blog post and neither has a readily available link. Seems like a major oversight. There is of course a drop-down in both, but the languages there are a lot fewer than 200. I'm particularly interested in a list of the 55 African languages, for example.
We have a full list here (copy pastable): https://github.com/facebookresearch/flores/tree/main/flores2... and Table 1 of our paper (https://research.facebook.com/publications/no-language-left-...) has a complete list as well.
Nice to see Esperanto made the cut — the only artificial language to do so, AFAICT.
I was happy to see that as well!
ha yes, that's correct. If you have thoughts on specific constructed languages where having translation would really help people, let us know!
thank you!
Looking at the list, I see a lack of Native American languages. Did anyone try to contact the tribes during this?
We interviewed speakers of low-resource languages from all over the world to understand the human need for this kind of technology --- what do people actually want, how would they use it, and what's the quality they would find useful? Many low-resource languages lack data online, but are spoken by millions. However, many indigenous languages are spoken by smaller numbers of people, and we are definitely interested in partnering with local communities to co-develop technology and have been actively investigating these collaborations but don't have much to share yet.
I'll take that in good faith, but I will say Facebook has been a particular pain for many tribal folks, given its real-name policy and banning people who it thinks are using a fake name. Yellow Horse was one case that was widely reported, but there are others -- mostly anything that takes the form Adjective Noun. Had a rather painful thread with someone claiming to be a Facebook employee who defended this practice. I haven't heard of anyone reaching out, and Lord knows we could have used the help, because COVID has been a particular disaster for language preservation even with an extremely high vaccination rate.
I do admit I'm a bit bitter, given that another of the big Silicon Valley companies (Apple) claims it specifically helps the TCUs (Tribal Colleges and Universities) when I can find no one who knows about this help, other than them taking our money for products at the same price as other accredited educational institutions.
I was unfamiliar with this issue. Was their name Yellow Horse in English? Or was it supposed to be written in a language not available so an English translation was used?
I have a feeling that if it was written in the original language it would go through, since many English names also have adjective noun original meanings like ‘beautiful flower’.
> Was their name Yellow Horse in English?
Yes.
> I have a feeling that if it was written in the original language it would go through, since many English names also have adjective noun original meanings like 'beautiful flower'.
The legal name is in English, so anyone expecting it to be written in the original language is expecting too much.
My concern with this is that in low resource languages the unavoidable biases of the ML models might overpower their own organic development.
We shrug off all the little quirks of machine translated text because it usually gets the point across, and we recognize them as quirks because most of what we read was written by real people with no such quirks. But when most of what you read contain those quirks, I fear those will quickly become the standard way of writing and even speaking in those languages.
This already happens in the wild without machine translation, with pidgins. If you want to see real-life pidgin in action, watch Korean and English-speaking gamers interact in FPS games. This has been common at the borders of cultures where two languages interact.
Point being, I'm not sure if language purity is more valuable than functionally allowing a language's people to interact with things they couldn't otherwise. Put another way, should we leave these people locked out of many online resources they can't read because we fear corrupting their language? Give these people the option and let them decide. Language evolves over time anyway.
People present these as the choice between 0 (“locked out”) and 1.
In real world instances (the proverbial 80%), it’s more often transforming a 0.4 (“don’t know much english”) into a 0.7. And the people who get away with near 0 knowledge will usually have no critical need for translation, or an access to other means (an actual translator, social help etc.) when really needed.
My mental image is grandmas reading online news, where machine translation would be a blessing and a curse. Or grade-school kids trying to look for some help on a topic, where I'd wish they got more time with the original text to at least learn somewhat, rather than only getting a rough translation full of errors.
For interpersonal communication, people adjust, that’s what has been happening for centuries now.
> This happens without machine translation in the wild already with pidgin.
I said nothing about purity, I said organic evolution, which this is an example of. If the actual speakers want to develop a pidgin, fine, I just think it should be a decision made by people and not models.
In a worst case you can end up with the Scots Wikipedia situation, where some power editor created a bunch of pages using an entirely fabricated, overly stereotypical language and that influenced what people thought Scots actually was.
This is one of the examples we keep in mind, and that's also why we can't 100% trust public dataset labels. This motivated us to train a Language IDentification (LID) system for all the languages we wanted to handle, in order to build the monolingual dataset. More details in the paper ;) Or here, if you have questions.
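To make the LID idea concrete, here is a minimal sketch using the off-the-shelf fastText language-identification model (lid.176.bin, covering 176 languages) -- an illustration of the general approach, not the NLLB-specific LID system described in the paper.

    # Predict the top-3 most likely languages of a sentence with fastText LID.
    import fasttext

    # Download first, e.g. from https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
    model = fasttext.load_model("lid.176.bin")

    labels, probs = model.predict("Þetta er setning á íslensku.", k=3)
    for label, prob in zip(labels, probs):
        # Labels look like "__label__is"; strip the prefix to get the ISO code.
        print(label.replace("__label__", ""), round(float(prob), 3))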
I think it will interesting when it runs into a language (e.g. Dakota) where the women and men speak differently. Should be an interesting test.
Doesn't seem to be a big issue for Arabic, where verbs are gendered (so in the sentence "I am going to the store", the verb "to go" will be either masculine or feminine, reflecting the speaker's gender).
> so in the sentence "I am going to the store", the verb "to go" will be either masculine or feminine, reflecting the speaker's gender
But there the rules are the same for everyone. This is not true in general; there are languages where men and women speak according to different rules.
Here's a selection from Empires of the Word:
> These works [written by women] are usually written in Emesal, 'the fine tongue', a separate dialect of Sumerian, well documented in scribal dictionaries. In dialogue works this dialect is used for the speech of goddesses. It differs from standard Sumerian, Emegir, 'the princely tongue', both in vocabulary (including the names of many gods) and also in pronunciation (consonants by and large being articulated farther forward in the mouth); it differs not at all in its grammar. For example, when the goddess Inanna is affecting to repel the advances of an importunate suitor, she cries:
> kuli Mulila šu bamu emeše daŋen amaŋu lulaše ta munaben amaŋu Gašangale lulaše ta munaben
> Friend of Enlil, let me free! Let me go to my house! What lie shall I tell my mother? What lie shall I tell my mother Ningal?
> Both Enlil and Ningal are, of course, gods. In Emegir this would have been (with the differences highlighted):
> kuli Enlila šu bamu eŋuše gaŋen amaŋu lulaše ana munaben amaŋu Ningale lulaše ana munaben
Arabic is the 5th or 6th most spoken language. I think the concern for low resource languages is that nuances like that won't get picked up.
That's fair, I was mostly just responding to the parent comment's point about language models running into potential difficulties in languages where the men and women speak differently (though I don't speak Dakota, so the gender-specific differences there may be more pronounced than in Arabic, where there's also the "default"/neutral option of just picking the masculine version of verbs unless you know the subject(s) are female).
That's not what I meant. It isn't the words that are gendered, but the way the speaker talks that is gendered. My old boss was taught to speak by her uncles. Her female relatives teased her since she talked like a man.
Won't people trying to learn a low-resource language as a second language also bring their own influence?
So they have a system that can translate to languages for which there isn't as much data as English, Spanish, etc. Waiting for a Twitter thread from a native speaker of one of these "low resource languages" to let us know how good the actual translations are. Cynically, I'd venture that they hired some native speakers to cherry pick their best translations for the story books. But mostly this just seems like a nice bit of PR (calling it a "breakthrough", etc.). I can't imagine this is going to help anyone who actually speaks a random, e.g., Nilo-Saharan language.
If you're curious to try the system yourself, it's actually being used to help Wikipedia editors write articles for low-resource language Wikipedias: https://twitter.com/Wikimedia/status/1544699850960281601
How is the license of the models (CC NC) compatible with licenses used in Wikipedia? Did you sign an special agreement with the Wikimedia Foundation?
Twitter may not be representative, IMHO, because of the short texts. The first problem is reliable language detection, and Twitter is quite often wrong there.
In this work we tried to rely not only on automated evaluation scores but also on human evaluation, for exactly this reason: we wanted to have a better understanding of how our model actually performs and how that correlates with automated scores.
> Essential cookies
> These cookies are required to use Meta Products. They’re necessary for these sites to work as intended.
What cookies does Facebook "need" to serve a simple article?
Facebook translations are horrifying for the mainstream languages already. They go from completely wrong to kinda understandable but still wrong.
Looks like they're investing to get better. The model is also available and they called for contributions to improve it.
Why would I help them? If it was public data sure.
Look, I fucking hate Facebook to the point that I can't really be objective about their research. Whenever I see a section on ethical implications or impacts I just think about shit like Myanmar or the insurrection and laugh (cry).
But this is a shallow dismissal that doesn't add anything valuable to the discussion.
"Oh they made their _terrible_ (probably state of the art) machine translation _better_??! Those monsters!!"
I know DeepL doesn't do low-resource languages, but it would be interesting to see a translation quality comparison between the two.
I was two sentences in before I realized the headline wasn’t “No Luggage Left Behind”
this is actually our recurring joke for our team meeting offsites!
I wonder if spy agencies have already developed, but not published, high-quality SMT methods for lots of minority and little-known languages. :-(
(Edit: and speech-to-text models.)
"No Language Left Behind" - really?
Did the people at Meta think about the Signed Languages of the Deaf?
I didn't find a mention. Even Ctrl-F deaf didn't yield anything.
So, so many words, but not a hint of a demo. It's just magic, according to Facebook. Couldn't they at least have a crappy demo to break?
tl;dr: Now your words can be misconstrued by far more people than before, because AIs will translate the misunderstandings into as many languages as possible.
So glad it's Facebook doing this and not some other weird company, when translating and delivering information to every culture on the planet it's good to have a trustworthy, ethical company without any past (or heck, even any current, ongoing) issues in spreading misinformation around the globe and contributing to the rise of fascism across the world while profiting massively off of it and denying any culpability, making sure it all goes smoothly.
Great! Facebook no longer have to provide content moderation in all the various corners of the world where they could accidentally enable the dissemination of misinformation and hate speech in minority languages. They can simply transform it into English and run it back through the existing moderation tooling!
Understanding foreign culture is about reading automated translations of online comments into your native language. It has nothing to do with putting in the effort to learn a language and understand the nuances, current events and issues of the culture it's embedded in.
The ESL (English as a single language) speakers over at Facebook don't even need to understand foreign cultures, because they already know everyone in the world needs to spend their lives staring into the Metaverse. So grateful that they are working on the world's fattest pipeline for exporting Anglophone culture to every corner of the planet!