The GETALP team creates resources of several kinds (corpora, lexical resources, software platforms, etc.) that are usually made available to the public. Here is a selection of such resources created by GETALP or with GETALP as a partner.
Resources
Title | Dates | Licence | Availability | Purpose/Summary |
---|---|---|---|---|
Audiocite.net | 2024 | CC BY-NC-SA 4.0 | https://openslr.org/139/ | Audiocite.net is a corpus of read French speech downloaded in November 2021 from the Audiocite.net website. |
TArC: Tunisian Arabish Corpus | 2024 (2021-) | CC BY-NC-SA 4.0 | https://github.com/eligugliotta/tarc | A semi-automatically annotated corpus of Tunisian Arabic, including classification of tokens (Tunisian Arabic, foreign language, emotag), transcription into Arabic script, tokenization, and morpho-syntactic tagging. |
LeBenchmark | 2024 (2022-) | Apache-2.0 | https://huggingface.co/LeBenchmark | A reproducible and multifaceted benchmark for evaluating speech SSL models, focusing on four French language tasks: Speech Recognition (ASR), Spoken Language Understanding (SLU), Speech Translation (AST), and Emotion Recognition (AER). More information is available on the project page: http://lebenchmark.com |
PxCorpus | 2023 (2022-) | Creative Commons Attribution 4.0 International | DOI 10.5281/zenodo.6482586 | PxCorpus (A Spoken Drug Prescription Dataset in French for Spoken Language Understanding and Dialogue) is the first distributed spoken medical drug prescription corpus in French. It contains 4 hours of transcribed and annotated dialogues from 55 participants (experts and non-experts) and is designed for training NLP models for Spoken Language Understanding and dialogue systems. Transcriptions were human-verified and aligned with semantic labels. This was a collaboration with CHU Grenoble and Calystène SA. |
DBnary | 2025 (2012-) | Creative Commons Attribution-ShareAlike 3.0 | http://kaiko.getalp.org/ | DBnary is an effort to provide multilingual lexical data extracted from Wiktionary. The extracted data is made available as LLOD (Linguistic Linked Open Data). This data set won the Monnet challenge in 2012. Since then, a new version has been issued twice a month, and new languages and data are added continuously. |
BDBRUIT | 2005 | ELRA | https://catalogue.elra.info/en-us/repository/browse/ELRA-S0033/ | A French speech database dedicated to the study of the perturbations of speech production due to noisy environments, and especially the Lombard effect. Environment: 4 noise conditions and the reference condition (quiet). The 2 noises used (a "white noise" and a "cocktail-party noise") were both played to the speaker's ears, first without and then with feedback of the speaker's own voice. 10 speakers: 5 male and 5 female (4 CD-ROMs, approximately 2.2 Gigabytes). |
BDSONS | 2005 | ELRA | https://catalogue.elra.info/en-us/repository/browse/ELRA-S0005/ | The BDSONS database is a French speech database with two subsets: evaluation and acoustic modelling. The corpus comprises 32 speakers: 16 male and 16 female (7 CD-ROMs, approximately 3.5 Gigabytes). |
BRAF100 | 2005 | ELRA | https://catalogue.elra.info/en-us/repository/browse/ELRA-S0197/ | GETALP (GEOD at that time) recorded a large oral database (BRAF100: Base pour la Reconnaissance Automatique du Français, with 100 speakers and around 30h of speech) under a cooperation contract with ISL (Interactive Systems Laboratories) at the University of Karlsruhe for the joint development of a French speech database. This corpus has been distributed as part of the GlobalPhone corpus through ELRA. |
THERADIA WoZ | 2025 | End User License Agreement (EULA) | https://catalog.elra.info/ (summer 2025) | Ecological corpus specifically designed for the audiovisual detection of affective states in the healthcare domain. The corpus data come from natural interactions in French with a virtual assistant, operated remotely by a human acting as a Wizard-of-Oz, involving 61 senior individuals, including both neurotypical and Mild Cognitive Impairment (MCI) participants, and 52 young individuals. The participants’ expressions were fully transcribed and partially annotated based on the dimensions of recent appraisal theory models and 23 affective labels derived from the literature on achievement affects. |
Vlexique 2.0 | 2024 | Creative Commons Attribution Non Commercial Share Alike 4.0 International | https://doi.org/10.5281/zenodo.14069226 | Vlexique 2.0 is a collection of French verbal paradigms, in phonemic and orthographic notation. They are suited for both computational and manual analysis. The dataset follows the Paralex Standard (https://www.paralex-standard.org). |
mtgpy fr: a constituency parser for French | 2024 | Creative Commons Attribution Share Alike 4.0 International | https://doi.org/10.5281/zenodo.13121535 | A constituency parser for French released with projective and discontinuous pretrained models. |
FlauberTagger | 2024 | Creative Commons Attribution Share Alike 4.0 International | https://doi.org/10.5281/zenodo.10697867 | A morphological tagger for French released with pretrained models. |
mind-the-gap-py | 2021 | MIT License | https://doi.org/10.5281/zenodo.4775955 | A discontinuity-capable transition-based parser released with pretrained models for English and German. |
Gazeplay | 2020 (2017-) | GPL-3.0 license | gazeplay software (demonstration, git repository): free and open source software for people with disabilities that brings together several mini-games that can be played using an eye-tracker | |
Jibiki Platform | 2020 (2004-) | LGPL-2.1 license | The Jibiki platform is a Java framework used to build collaborative dictionary/corpus development web sites | |
FlauBERT | 2020 | Multiple | Unsupervised Language Model Pre-training for French | |
Accolé | 2019 (2017-) | Creative Commons BY-NC-SA | http://lig-accole.imag.fr | Interactive platform for machine translation error analysis. |
UFSAC | 2019 (2014-) | Apache 2.0 | http://github.com/getalp/UFSAC | Unification of Sense Annotated Corpora and Tools: a corpus format that can be used for training or testing a disambiguation system. |
Corpus Vocadom@A4H and VocADom@DOMUS | 2018 (2017-) | End-user License Agreement (EULA). Not for commercial use. | Corpus recorded during the VocADom project in the Amiqual4Home and DOMUS apartments of the LIG | |
GDEF dictionary | 2018 (2004-) | © Association Franco-Estonienne de Lexicographie | Great Estonian-French dictionary. GETALP contributed the tools and expertise used to collaboratively build this dictionary. | |
Wikipedia Company Corpus | 2018 | EULA | https://gricad-gitlab.univ-grenoble-alpes.fr/getalp/wikipediacompanycorpus | 51K descriptions of companies collected from freely available Wikipedia articles. The dataset has been primarily collected for NLG and NLU. |
DiLAF dictionaries | 2018 | Bilingual dictionaries between African languages and French | ||
Mboshi-French | 2017 | MIT | GitHub | Speech corpus collected during a realistic language documentation process: 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. |
Translation Augmented LibriSpeech Corpus | 2017 | Creative Commons BY 4.0 | LibriSpeech corpus: English-French aligned corpus from English audio books (an augmentation of the LibriSpeech ASR corpus) | |
SPEECH-COCO corpus | 2017 | Creative Commons BY 4.0 | Descriptions of images in natural language (an augmentation of MS-COCO with speech) | |
Recola | 2016 (2015-) | End User License Agreement (EULA) | https://diuf.unifr.ch/main/diva/recola/ | Multimodal corpus (audio, video, physio) of natural and spontaneous social-emotional behaviours. This corpus was used to organize two editions of the Audio Visual Emotion Challenge (AVEC). |
METEOR-E tool | 2016 (2015-) | Tool for evaluating the performance of machine translation systems (an extension of METEOR) | ||
Jibiki French-Japanese dictionary | 2016 | Creative Commons | https://jibiki.fr/ | Jibiki.fr project collectively builds a high-quality, wide-coverage French-Japanese dictionary and an aligned bilingual corpus. |
multivec tool | 2016 | Apache-2.0 license | Multilingual tool for building vector word representations | |
Seq2seq toolkit | 2015 | Apache 2.0 | GitHub | Sequence-to-sequence system (for NMT). 104 forks, more than 1000 downloads. |
WCE-LIG tool | 2015 | GPL-3.0 license | Tool to estimate the quality of an automatic translation at word level | |
Sweet-Home | 2014 (2011-) | End-user License Agreement (EULA). Not for commercial use. | http://sweet-home-data.imag.fr/ | Corpus recorded during the Sweet-Home project in the DOMUS apartment of the LIG. This corpus consists of three sets: the multimodal corpus, the home automation control corpus, and the interaction corpus. |
Iban corpus & tools | 2014 | Attribution-ShareAlike 2.0 | openslr & Kaldi master GitHub | This package has Iban text and speech corpora used for Automatic Speech Recognition (ASR) experiments. |
WCE-SLT-LIG | 2014 | Creative Commons BY 4.0 (?) | WCE-SLT-LIG corpus: assessment corpus of French-English speech systems | |
ALFFA-Public | 2014 | MIT Licence | ALFFA_PUBLIC corpus: multilingual corpus in African languages to build speech recognition systems | |
Cirdo-set corpus | 2014 | N/A | Corpus of calls for help recorded during falls in the DOMUS apartment as part of the Cirdo project | |
HIS corpus | 2010 | End-user License Agreement (EULA). Not for commercial use. | Corpus recorded at the end of the RESIDE-HIS and DESDHIS projects (PhD thesis of Anthony Fleury) in the HIS apartment of the TIMC-AFIRM team | |
C-Star 2 | 2001 | C-Star 2 consortium (French-Korean demonstration videos): development of a first analyser from French to a pivot language, based on the analysis of relevant islands using finite-state automata | ||
NESPOLE! | 1999 | NESPOLE! consortium (English-Italian demonstration videos): development of analysers from French to a pivot language (based on the detection of task-relevant islands with finite-state automata) and of generators from a pivot language to French (concatenative generation from speech acts and their associated arguments) | ||
MotÀMot dictionary | French-Khmer bilingual dictionary |
More pointers to GETALP resources
These collections reference some of the resources that GETALP created or contributed to: