PI : Spoken Language Technologies for PI languages | GETALP : Groupe d'Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

ANR BLANC 2009-2012
Description: Despite factors that, at first glance, might make a language appear less important, and thus unnecessary as a target for human language technologies (HLT), good reasons exist for developing speech recognition systems for literally all languages in the world. First, the diversity of languages is the basis of the world's rich cultural diversity. However, in today’s globalized world, languages are frequently disappearing. The ongoing extinction of many languages is caused in part by a switch to more prevalent languages that might give their speakers an economic advantage. The lack of HLT systems for these languages only accelerates their extinction, whereas HLT could help to stop this trend by making the less prevalent languages more attractive to their original speakers. A second reason why HLT should be available for all languages is that the political importance of a language can be very volatile. In today’s world, language is one of the few remaining barriers that hinder human-to-human interaction. Events such as armed conflicts or natural disasters might make it important to be able to communicate with speakers of a less prevalent language, e.g. for humanitarian workers in a disaster area. Here, readily available technology such as speech translation systems can be highly beneficial. Such technology might be far from perfect, but when the alternative in an emergency is to have no translation system at all for an unknown language, an imperfect system will be of great use. Therefore, HLT needs to be developed especially for under-resourced languages!
Nowadays, almost all techniques and methods in spoken language technologies, in particular automatic speech recognition (ASR) systems, use statistical approaches. Given the statistical nature of these methods, a large amount of resources (vocabularies, text corpora, transcribed speech corpora, phonetic dictionaries) is required to train the models and to test the performance of the systems. Consequently, building an ASR system for a new language currently requires a large speech corpus containing hours of signal recorded by hundreds of speakers (for acoustic modelling) and a text corpus of millions of words (for language modelling). However, these crucial resources are not directly available for under-resourced languages. Thus, a methodology for building them rapidly is necessary and, in the meantime, so are strategies that exploit a minimal amount of resources.
From a scientific point of view, the interest and originality of this project lie in proposing viable, innovative methods that go far beyond the simple retraining or adaptation of acoustic and linguistic models. A significant breakthrough is needed to develop ASR systems for Π-languages. For instance, we plan to question the use of the word as the fundamental unit for language modelling: low-complexity language models could be obtained by using sub-word units (morphemes, syllables or even characters), which might be of interest when little training data is available. Concerning acoustic modelling, the originality of this project lies in the proposal of large-coverage multilingual acoustic models based on our knowledge of the world's speech sound systems. Another goal is to explore more “language-independent” ASR systems that would adapt themselves (with little or no supervision) to the audio stream they receive as input.
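The motivation for sub-word units can be illustrated with a minimal sketch. It is not the project's tooling: the two toy sentences and the helper names `tokenize` and `oov_rate` are assumptions made for illustration only. The sketch compares how many test tokens were never seen in training (the out-of-vocabulary rate) when the modelling unit is the word and when it is the character.

```python
def tokenize(text, unit="word"):
    """Split text into modelling units (hypothetical helper).

    unit="word": whitespace-separated words.
    unit="char": single characters, with "_" marking word boundaries.
    """
    words = text.split()
    if unit == "word":
        return words
    if unit == "char":
        return list("_".join(words))
    raise ValueError(f"unknown unit: {unit}")


def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens."""
    if not test_tokens:
        return 0.0
    vocab = set(train_tokens)
    unseen = sum(1 for token in test_tokens if token not in vocab)
    return unseen / len(test_tokens)


# Toy data, invented for this example; a real study would use a text corpus.
TRAIN = "the speaker records the sentence and the system transcribes the recording"
TEST = "the system records the transcribed sentences"

for unit in ("word", "char"):
    train = tokenize(TRAIN, unit)
    test = tokenize(TEST, unit)
    print(f"{unit}: {len(set(train))} training units, "
          f"OOV rate {oov_rate(train, test):.2f}")
```

On such a tiny sample the character inventory is not smaller than the word list, but it is bounded by the size of the writing system, whereas the word vocabulary keeps growing with the corpus. Inflected forms absent from the training text are out-of-vocabulary at the word level yet fully covered at the character level, which is the effect a sub-word language model would try to exploit when data is scarce.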
From an operational point of view, this project aims to provide a free, open-source ASR development kit for Π-languages. This goal is realistic since some of its elements have already been developed in preliminary form by partners of this consortium: text data collection and filtering tools (LIG, LIA), ASR training and decoding tools (LIA), and tools for mapping phones from a source language to a target language (LIG). We plan to distribute and evaluate this development kit by deploying ASR systems for new under-resourced languages with very poor resources (Khmer and Lao, for instance).
If successful, this project would lead to a dynamic user group around our development kit, composed of research teams or individuals doing ASR research for their own language. It is also important to note that some under-resourced languages could show very strong economic potential as they develop: Bengali, Malay and Vietnamese, for instance, are among the 20 most spoken languages in the world. Other languages may be of great interest for governmental or non-governmental projects involving global security (see the example of the Iraqi dialect in US DARPA projects) or humanitarian issues. Finally, with the aim of saving endangered languages (some of which are mostly spoken and not written), the possibility of rapidly developing ASR systems to transcribe them is an important step towards their preservation and would facilitate access to audio content in these languages (notably languages from Africa).