GEOD History and Themes
GEOD began its work in 1997 in the field of speech and dialogue research, with the aim of designing software for interaction and spoken communication and of equipping systems with a reliable and efficient language component.
Since the 1990s, means of communication (mobile phones, the Internet) and media for the electronic dissemination of information (digital radio and television broadcasting) have grown steadily, while digital information processing and computer technology have made enormous progress. This development has opened up promising prospects for many applications in human-machine communication and technology-mediated human-human spoken communication, as well as for specific medical applications such as remote monitoring of patients at home (smart housing). In parallel, thanks to cheap storage, due in part to highly efficient compression algorithms, the body of audio and video documents continues to grow. Virtually all multimedia information is now available in digital form, and exploiting it opens the way to new applications for content-based indexing and retrieval of documents.
Since 1997, GEOD’s main research objectives have been (a) the development of large-vocabulary continuous speech recognition systems that are robust in both speech acquisition and recognition, (b) the development of multimodal human-machine dialogue systems, and (c) multimodal interaction in perceptual environments and spaces. Achieving these objectives also requires (d) transversal research on corpora and learning tools.
Robust multilingual continuous speech recognition
Acoustic preprocessing is the central problem in speech recognition: systems must be usable in real conditions and remain robust in severe acoustic conditions. GEOD focused its efforts on three techniques: locating the speaker with a microphone array (multi-sensor acquisition and cross-correlation of the signals), blind source separation, and cancellation of acoustic reverberation by estimating the room frequency response.
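As an illustration of the cross-correlation idea behind speaker localisation with a microphone array, the following minimal Python sketch estimates the delay between two microphone signals and converts it into an arrival angle. It is not GEOD’s implementation: the two-microphone far-field geometry, the function names (estimate_delay, arrival_angle) and all parameter values are assumptions made for the example.

```python
import math


def estimate_delay(ref, sig, max_lag):
    """Return the lag (in samples) that maximises the cross-correlation.

    A positive lag means `sig` is a delayed copy of `ref`.
    Hypothetical helper written for this example only.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, x in enumerate(ref):
            m = n + lag
            if 0 <= m < len(sig):
                score += x * sig[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def arrival_angle(lag, sample_rate, mic_distance, speed_of_sound=343.0):
    """Convert a lag in samples into a far-field arrival angle in degrees.

    Assumes two microphones `mic_distance` metres apart and a plane wave.
    """
    delay = lag / sample_rate
    # Clamp to [-1, 1] so that noisy lags never break the arcsine.
    ratio = max(-1.0, min(1.0, delay * speed_of_sound / mic_distance))
    return math.degrees(math.asin(ratio))


# Toy signals: the second microphone hears the same pulse 3 samples later.
mic_a = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
mic_b = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0]
lag = estimate_delay(mic_a, mic_b, max_lag=5)
angle = arrival_angle(lag, sample_rate=16000, mic_distance=0.2)
```

With more than two sensors, the same pairwise delays can be combined to narrow down the speaker position; real recordings would also need filtering and a more robust correlation measure than this plain sum of products.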
In the field of acoustic modelling for continuous speech recognition, GEOD has concentrated its research on techniques based on hidden Markov models (HMMs), and more specifically on the statistical optimisation of phoneme models (context-dependent and context-independent) by training on large multi-speaker speech corpora: BREF80 and BREF120, two LIMSI databases, and BRAF100, a database recorded at CLIPS with 10,000 sentences and 100 speakers (see the “Transversal research” action).
The team is also studying acoustic modelling for voice servers, with the aim of limiting the loss in recognition performance caused by the signal distortion that speech compression introduces.
GEOD is also developing a recognition system for Vietnamese, a language with six tones, as part of CLIPS’ scientific cooperation with the Hanoi Polytechnic Institute (IPH).
On language models and theme detection, the approach chosen by GEOD was to use the Web as the source for the training corpus from which the vocabulary is extracted and the language model is built. A minimal-block filtering technique, dependent on the vocabulary, was developed (D. Vaufreydaz); it relies on linguistic lexicons and dictionaries (BDLEX 50000 and ABU, the Association des Bibliophiles Universels, were used).
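The text above only names the minimal-block filtering technique, so the following Python sketch shows one plausible reading of it, under a stated assumption: a block is kept when it is a run of at least min_len consecutive words that all belong to the vocabulary. The function name minimal_blocks, the min_len parameter and the toy data are hypothetical and are not taken from GEOD’s work.

```python
def minimal_blocks(tokens, vocabulary, min_len=3):
    """Keep runs of at least `min_len` consecutive in-vocabulary tokens.

    Assumed interpretation of "minimal-block filtering"; out-of-vocabulary
    tokens (markup debris, typos, foreign words) end the current run.
    """
    blocks, current = [], []
    for tok in tokens:
        if tok in vocabulary:
            current.append(tok)
        else:
            if len(current) >= min_len:
                blocks.append(current)
            current = []
    # Flush the final run, which has no out-of-vocabulary token after it.
    if len(current) >= min_len:
        blocks.append(current)
    return blocks


# Toy example: "zzz" and "qqq" stand in for noise found in Web pages.
vocab = {"the", "cat", "sat", "on", "mat"}
page = "the cat sat zzz on mat the qqq cat".split()
kept = minimal_blocks(page, vocab, min_len=3)
```

The retained blocks can then be used as training text for n-gram counts; a larger min_len gives cleaner but less abundant data.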
Since 2001, GEOD’s research has shifted from large-vocabulary recognition towards multilingualism.
GEOD’s research activities in this theme have focused on the development of multi-speaker acoustic models and language models for the laboratory’s French automatic continuous speech recognition system. The originality of the approach lies in harvesting a large number of websites in a given language and filtering the retrieved text so that it can be used to compute statistical language models. An adaptation of this methodology to under-resourced languages marks a trend towards multilingualism that is becoming increasingly important in this research. Applications to Vietnamese, Khmer and Mexican Spanish (Castilian) have been considered and have produced very encouraging results. Extensions of this research theme towards “enriched transcription” (speaker segmentation, detection of areas of interest, detection of audio “jingles”, etc.) for content-based information retrieval in databases were also carried out as part of the team’s participation in several international evaluation campaigns. Finally, work on biometric applications has been carried out, taking into account the often multimodal nature of the field.
Perceptual environments: speech and sound as a component of multimodal interaction
In collaboration with the TIMC laboratory, GEOD is developing the general concept of an “intelligent living space”. This involves designing rooms equipped with several types of sensors and managed by a computer system that analyses the signals in real time so as to intervene automatically according to the needs, requests and expectations of the people present. The use of sound sensors, with specific processing of speech signals and everyday noises, is an innovative approach.
This theme describes GEOD’s research work on speech and sound in the application context of perceptual spaces, and more particularly in the framework of cooperation with the TIMC laboratory on Intelligent Housing for Health (HIS). On the TIMC premises, an apartment (30 m²) has been equipped to serve as a prototype HIS. Various algorithms for the detection and classification of everyday sounds have been developed and validated for detecting distress situations of a patient under remote medical monitoring. Similarly, a language model for the GEOD speech recognition system has been adapted to recognise distress calls in this environment. Some developments for “smart room” applications are also presented.
Development of multimodal human-machine dialogue systems
The theme of human-machine dialogue encompasses oral interaction and multimodal interaction (speech and gestures). Modelling human-machine dialogue poses theoretical problems because human dialogue cannot be considered a fully planned activity: at any moment the interlocutors can digress or break off, and they use strategies that they adapt during the interaction according to the goals to be achieved and the opportunities offered by the situation.
Between 1997 and 1999, GEOD concentrated on the first four of the following five modules in the schematic representation of a dialogue system: management of the goals related to the task; understanding of the speaker’s utterances; interpretation of those utterances in the dialogue situation and relative to the goals to be achieved; control and management of the dialogue; and, finally, generation of outputs (text, speech or graphics).
This theme describes GEOD’s research work on human-machine dialogue. The main advances have been in game theory and in discourse representation theory (SDRT, Segmented Discourse Representation Theory). The analysis and exploitation of corpora continued in order to study speakers’ expectations, their modes of understanding and their behaviour when faced with expressive conversational agents. For this purpose, various dialogue situations were simulated, notably within the framework of the ACE project (Expressive Conversational Agent). Finally, the DCR method was further developed to obtain a validated procedure for the automatic evaluation of dialogue systems. All this research has focused on application areas provided by the PVE (Portail Vocal d’Entreprise) project, the aim of which is to develop natural spoken dialogue services for a company’s internal life (e.g. organising meetings, managing personal agendas). Work on dialogue with several speakers has also been initiated, which places the team on original ground.
Transversal research related to the axes described above concerns linguistic resources (written and oral corpora) and, more generally, the tools and methodologies needed to acquire, produce or manage these resources, as well as sophisticated methods for phonetic alignment and machine learning.