AI and Endangered Languages: Friend or Foe?

Image: Joshua Hoehne

Endangered languages, many at risk of disappearing within a generation, are becoming the focus of new initiatives that incorporate artificial intelligence to aid in their documentation and preservation.

Fittingly, ChatGPT wrote that sentence.

Artificial intelligence is renowned for its ability to produce coherent, fluent, and creative responses. But when it comes to preserving endangered languages, the jury is still out.

Of the roughly 7,000 languages spoken across the globe, at least 40% are in danger of disappearing by 2100. Every two weeks, a language dies. To try to stave off their decline, the United Nations has declared 2022-2032 the International Decade of Indigenous Languages.

A language becomes endangered when it is no longer transmitted from one generation to the next. This is often a long-term impact of colonisation and the pressure to adopt dominant languages.

Anna Luisa Daigneault, a PhD student focusing on AI and indigenous language revitalisation, explains: “North and South America were colonised by different European countries over time. We see the impact of Spanish and Portuguese in South America, and English and some French in North America. Those are the legacies of the colonial empires that took over those regions and pushed aside indigenous languages.

“Language assimilation can be forced through violence and oppression, but can also be more of a gradual shift.

“There is also a phenomenon called chain endangerment. This happens when smaller, local languages are taken over by regional languages, those regional languages are taken over by other regional languages, and those ones are taken over by larger colonial languages.”

Yulha Lhawa is a computational linguist, a Language Revitalization Mentor, and a speaker of Khroskyabs, an endangered language from the Tibetan Plateau.

She explains that the break in intergenerational transmission is often sociocultural. For example, governments sometimes do not support endangered languages, or people feel they need to make personal choices that mean abandoning their mother tongue.

“In my part of the world, it is a life necessity to speak a few languages. At home in my community, we speak my mother tongue, and then we have a larger ethnic group where we speak more standardized Tibetan - these languages are very different. If you interact with other communities you have to be able to speak the local variety of Mandarin. And of course, school is all in Mandarin.”

She recounts a memory of arriving at boarding school and not being able to register because she didn’t speak Mandarin.

If there are few speakers of a language, how does a community ensure its longevity?

Over the years, linguists have been creative in their attempts to archive language. The Rosetta Project, for example, run by the Long Now Foundation, is a global collection of language specialists and native speakers working to develop a contemporary version of the historic Rosetta Stone.

The project aims to create a near-permanent archive of 1,500 languages that will enable comparative linguistic research and education, and may help recover lost languages in the future. 

In 2004, a ‘Rosetta disc’ inscribed with 6,500 pages of language translations was launched into space and is now on a 6.44-year orbit around the sun.

Danielle Engelman, director of programmes at the Long Now Foundation, said: “Language is critical to long term thinking because language encodes culture. 

“It took Long Now eight years to actually gather at least a subset of data on all the world’s languages. 

“One thing we know about language and words is that they’re living and don’t stay still. So, as you try to gather all of the world’s information and understanding of itself you first run into the question of how it will be understood in the future.”

The Living Tongues Institute for Endangered Languages was established in 2005 and stands at the intersection of linguistics and community activism.

The project supports communities in safeguarding their languages through campaigning, education, and technology.

Daigneault, who also works as lead digital curator for Living Tongues, describes AI as a double-edged sword when it comes to language preservation.

“On the plus side, AI could speed up language documentation with new tools like transcribing phonetics. If it’s used ethically and with community guidance, it could also help create educational materials, but that could only happen once there is a corpus of data that has been documented correctly.

“On the flip side, there are a lot of errors that could come out of the process. Languages that are understudied can contain entirely different sound and grammar systems that take linguists many years to understand. If you take an AI model that has been trained on English or another dominant language, it really won’t work when you apply it to a language that hasn’t been studied well.”

Daigneault points out that the problems that could arise are concerning. There have been cases of people using AI translation to cross borders, flee war zones, or get medical attention, and then realising that the tools are mistranslating.

In 2022, the Federal Emergency Management Agency (FEMA) sent unintelligible disaster relief information to Alaska Natives after their communities were hit by Typhoon Merbok.

Dr Anna Belew, executive director of the Endangered Languages Project, highlights this danger of AI ‘hallucinating’ when dealing with fragile languages.

Belew stresses: “Language endangerment is really a socio-political problem, and AI has never to my knowledge fixed any of those. It actually exacerbates them in many ways.

“All AI knows what to do is read what exists, feed it back to you, and amplify those patterns. There is a danger that instead of solving these problems it will actually perpetuate them.”

For example, Belew highlights the media's propensity to use fatalistic language and perpetuate saviour mentalities when discussing endangered languages, with words such as ‘dying’, ‘extinct’, or ‘saving’. This can harm the well-being, rights, and aspirations of Indigenous and minoritised people. Unfortunately, these words and phrases are fed into AI large language models (LLMs), which in turn perpetuate their use, as exemplified by the first sentence of this article.

Belew states that language revitalisation efforts should always respect the community, and that AI often feels disrespectful.

“If I could include short bullet points for when AI could actually be helpful, they would include: Is it developed for and controlled by the community? Is it being used to support human efforts?” 

Protocols around how language should be treated are becoming increasingly important in indigenous communities. AI is unable to discern which language should not be shared publicly or should be treated in a certain way; it simply ingests all of it and spits it back out without knowing these protocols or respecting community preferences.

In essence, the cases that matter most are those where AI is used by and for indigenous communities.

Image: Emiliano Vittoriosi

Belew adds: “The Endangered Languages Project draws back from places that use AI when it is replicating systems of colonial harm. We are very strong against extractive types of AI production.

“Language preservation is very difficult work that is a response to trauma in a community; language endangerment never happens when things are stable.

“It’s about the people. It’s about the experience of people being ripped away from their communities and healing from historical trauma. And that’s why we do it. We only use technology to support that human connection element.”