Can A.I. Help Solve the Mystery of Lost Languages?

Many things set humans apart from other species, but one of the most important is language. The ability to combine a limited set of elements in essentially infinite ways is a trait that “in the past was often viewed as the core trait of modern man, the source of human creativity, cultural enrichment, and complex social structure,” as linguist Noam Chomsky once said.

But as important as language has been in human evolution, we still don’t know much about how it evolved. While dead languages like Latin left behind an abundance of written records and descendants through which we can better understand them, some languages have been lost to history.

Researchers have been able to reconstruct some lost languages, but deciphering them can be a long and tedious process. For example, the ancient script Linear B was “solved” over half a century after it was discovered, and some of those who worked on it did not see the work completed. An older script called Linear A, the writing system of the Minoan civilization, remains undeciphered.

However, modern linguists have a powerful tool: artificial intelligence. By training AI to find the patterns in undeciphered languages, researchers can reconstruct them and uncover the secrets of ancient times. A recent, novel neural approach from researchers at the Massachusetts Institute of Technology (MIT) has already shown success in deciphering Linear B and could one day lead to the decipherment of other lost languages.

Resurrection of the Dead (Languages)

As with skinning a cat, there is more than one way to decipher a lost language. In some cases, the language has no written records, so linguists try to reconstruct it by tracing the evolution of sounds through its descendants. Such is the case with Proto-Indo-European, the hypothetical ancestor of numerous languages in Europe and Asia.

In other cases, archaeologists discover written records, as was the case with Linear B. After archaeologists discovered tablets on the island of Crete, researchers puzzled over the writings for decades and finally deciphered them. Unfortunately, this is not currently possible with Linear A because the researchers don’t have nearly as much source material to study. But this may not be necessary.

A project by researchers at MIT illustrates both the difficulty of decipherment and the potential of AI to revolutionize the field. The researchers developed a neural approach to deciphering lost languages “which is informed by patterns of language change documented in historical linguistics.” As spelled out in a 2019 paper, earlier AI for deciphering languages had to be tailored to a specific language, but this model does not.

“If you look at a commercially available translator or translation product,” says Jiaming Luo, lead author of the paper, “all of these technologies have access to a large amount of so-called parallel data. You can think of them as Rosetta stones, but in very large quantities.”

A parallel corpus is a collection of the same texts in two different languages. For example, imagine a series of sentences in English alongside their French translations. Even if you don’t speak French, by comparing the two sets of sentences and observing patterns you can map words in one language to the corresponding words in the other.
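To make the idea concrete, here is a minimal sketch of how word mappings can fall out of a parallel corpus by counting which words tend to appear together. The four sentence pairs are invented for illustration, and the function name guess_translations and the scoring rule (a Dice coefficient over sentence co-occurrence) are assumptions chosen for simplicity, not the method used by commercial translators or by the MIT team.

```python
from collections import Counter, defaultdict

# Toy parallel corpus, invented for illustration: English / French pairs.
PAIRS = [
    ("the dog sleeps", "le chien dort"),
    ("the cat sleeps", "le chat dort"),
    ("the dog eats", "le chien mange"),
    ("the cat eats", "le chat mange"),
]

def guess_translations(pairs):
    """Map each source word to the target word it co-occurs with most tightly."""
    cooc = defaultdict(Counter)  # cooc[s][t] = sentence pairs containing both s and t
    src_count = Counter()        # sentence pairs containing source word s
    tgt_count = Counter()        # sentence pairs containing target word t
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        for s in s_words:
            for t in t_words:
                cooc[s][t] += 1

    def dice(s, t):
        # 1.0 when s and t always appear together and never apart.
        return 2 * cooc[s][t] / (src_count[s] + tgt_count[t])

    return {s: max(cooc[s], key=lambda t: dice(s, t)) for s in cooc}

if __name__ == "__main__":
    for english, french in sorted(guess_translations(PAIRS).items()):
        print(english, "->", french)
```

With only four sentence pairs the counts are already enough to separate "dog" from "cat"; real systems rely on millions of pairs precisely because natural text is far noisier than this toy example.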

“If you train a person to see more than 40 million parallel sentences,” explains Luo, “I am confident that you can find a translation.”

But English and French are living languages with centuries of cultural overlap. Deciphering a lost language is far more difficult.

“We don’t have the luxury of parallel data,” explains Luo. “So we have to rely on certain linguistic knowledge about how language evolves, how words evolve into their descendants.”

In order to create a model that can be used regardless of the languages involved, the team set constraints based on trends that can be observed as languages evolve.

“We have to rely on linguistic insights at two levels,” says Luo. “One is the character level: all we know is that when words evolve, they usually evolve from left to right. You can think of this evolution as operating on a kind of string. Perhaps a Latin string is ABCDE, which would most likely change to ABD or ABC. Either way, the original order is preserved in some sense. We call that monotonic.”
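Luo’s ABCDE example can be checked mechanically. The sketch below covers only the simplest case, in which characters are dropped but never reordered or replaced; the helper name keeps_order is hypothetical, and the team’s actual model is a neural network that the article does not describe at this level of detail.

```python
def keeps_order(ancestor: str, descendant: str) -> bool:
    """Return True if every character of `descendant` appears in `ancestor`
    in the same left-to-right order (i.e. it is a subsequence)."""
    remaining = iter(ancestor)
    # `ch in remaining` advances the iterator past the match, so each later
    # character must be found further to the right in the ancestor.
    return all(ch in remaining for ch in descendant)

if __name__ == "__main__":
    for candidate in ("ABD", "ABC", "DBA"):
        print(candidate, keeps_order("ABCDE", candidate))
```

ABD and ABC pass because their letters keep the order they had in ABCDE, while DBA fails because it would require the characters to swap places.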

At the vocabulary level (the words that make up a language), the team used a technique called “one-to-one mapping”.

“In other words, if you pull out all of the Latin vocabulary and all of the Italian vocabulary, you see a kind of one-to-one correspondence,” Luo offers as an example. “The Latin word for ‘dog’ is likely to become the Italian word for ‘dog,’ and the Latin word for ‘cat’ is likely to become the Italian word for ‘cat.’”
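A one-to-one constraint of this kind can be illustrated with a tiny assignment problem. In the sketch below, the three-word vocabularies, the function name best_one_to_one, the use of difflib’s string similarity as a score, and the brute-force search over permutations are all assumptions made to keep the example short; the article does not say how the MIT model enforces the constraint.

```python
from difflib import SequenceMatcher
from itertools import permutations

LATIN = ["canis", "cattus", "lupus"]    # dog, cat, wolf
ITALIAN = ["lupo", "gatto", "cane"]     # wolf, cat, dog (deliberately shuffled)

def similarity(a: str, b: str) -> float:
    """Crude surface similarity between two spellings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()

def best_one_to_one(source, target):
    """Pair every source word with a distinct target word so that the total
    similarity is as high as possible. Brute force: only for tiny lists."""
    best_score, best_pairs = -1.0, None
    for candidate in permutations(target, len(source)):
        score = sum(similarity(s, t) for s, t in zip(source, candidate))
        if score > best_score:
            best_score, best_pairs = score, dict(zip(source, candidate))
    return best_pairs

if __name__ == "__main__":
    for latin, italian in best_one_to_one(LATIN, ITALIAN).items():
        print(latin, "->", italian)
```

Because each Italian word can be used only once, a word that loosely resembles several Latin words cannot be claimed by all of them, which is the intuition behind the constraint.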

To test the model, the team used several datasets. They translated the ancient language Ugaritic into Hebrew and Linear B into Greek and, to confirm the effectiveness of the model, performed cognate detection (identifying words with a common ancestor) in the Romance languages Spanish, Italian, and Portuguese.

It was the first known attempt to automatically decipher Linear B, and the model correctly translated 67.3% of the cognates. The system also improved on previous models for translating Ugaritic. Since the two languages come from different families, this shows that the model is flexible as well as more accurate than previous systems.

The future

Linear A remains one of the great mysteries of language, and cracking this old nut would be a remarkable feat for AI. This is currently completely theoretical for several reasons, says Luo.

First, Linear A offers less data than Linear B. There is also the matter of working out what kind of script Linear A is.

“I would say the unique challenge for Linear A is that you have a lot of pictorial or logographic characters or symbols,” says Luo. “And usually when you have a lot of these symbols, it gets a lot harder.”

As an example, Luo compares English and Chinese.

“English has 26 letters if you don’t count capitalization, and Russian has 33. These are called alphabetic systems. So all you have to do is figure out a map for those 26 or 30-odd characters,” he says.

“But for Chinese, you have to deal with thousands of them,” he continues. “I think an estimate of the minimum number of characters one needs to know to read a newspaper would be around 3,000 or 5,000. Linear A is not Chinese, but due to its pictorial or logographic symbols and the like, it is definitely more difficult than Linear B.”

Although Linear A has not yet been deciphered, the success of MIT’s novel neural approach in automatically deciphering Linear B without the need for a parallel corpus is a promising sign.
