Demos/Resources

The GETALP team creates resources of several kinds (corpora, lexical resources, software platforms, etc.) that are usually made available to the public. Here is a selection of resources created by GETALP or with GETALP as a partner.


Resources

| Title | Dates | Licence | Availability | Purpose/Summary |
|---|---|---|---|---|
| Audiocite.net | 2024 | CC BY-NC-SA 4.0 | https://openslr.org/139/ | Audiocite.net is a corpus of read French speech downloaded in November 2021 from the Audiocite.net website. |
| TArC: Tunisian Arabish Corpus | 2024 (2021-) | CC BY-NC-SA 4.0 | https://github.com/eligugliotta/tarc | A semi-automatically annotated corpus of Tunisian Arabic, including classification of tokens (Tunisian Arabic, foreign language, emotag), transcription into Arabic script, tokenization, and morpho-syntactic tagging. |
| LeBenchmark | 2024 (2022-) | Apache-2.0 | https://huggingface.co/LeBenchmark | A reproducible and multifaceted benchmark for evaluating speech SSL models, focusing on four French language tasks: Speech Recognition (ASR), Spoken Language Understanding (SLU), Speech Translation (AST), and Emotion Recognition (AER). More information is available on the project page: http://lebenchmark.com |
| PxCorpus | 2023 (2022-) | Creative Commons Attribution 4.0 International | DOI 10.5281/zenodo.6482586 | PxCorpus (A Spoken Drug Prescription Dataset in French for Spoken Language Understanding and Dialogue) is the first distributed spoken medical drug prescription corpus in French. It contains 4 hours of transcribed and annotated dialogues from 55 participants (experts and non-experts) and is designed for training NLP models for Spoken Language Understanding and dialogue systems. Transcriptions were human-verified and aligned with semantic labels. This was a collaboration with CHU Grenoble and Calystène SA. |
| DBnary | 2025 (2012-) | Creative Commons Attribution-ShareAlike 3.0 | http://kaiko.getalp.org/ | DBnary is an effort to provide multilingual lexical data extracted from Wiktionary. The extracted data is made available as LLOD (Linguistic Linked Open Data). This data set won the Monnet challenge in 2012. Since then, a new version has been issued twice a month, and new languages and data are added continuously. |
| BDBRUIT | 2005 | ELRA | https://catalogue.elra.info/en-us/repository/browse/ELRA-S0033/ | A French speech database dedicated to the study of the perturbations of speech production due to noisy environments, and especially the Lombard effect. Environment: 4 noise conditions and the reference condition (quiet). The 2 noises used (a "white noise" and a "cocktail-party noise") were each played to the speaker's ears, first without and then with feedback of the speaker's own voice. 10 speakers: 5 male and 5 female (4 CD-ROMs, approximately 2.2 gigabytes). |
| BDSONS | 2005 | ELRA | https://catalogue.elra.info/en-us/repository/browse/ELRA-S0005/ | The BDSONS database is a French speech database with two subsets: evaluation and acoustic modelling. The corpus comprises 32 speakers: 16 male and 16 female (7 CD-ROMs, approximately 3.5 gigabytes). |
| BRAF100 | 2005 | ELRA | https://catalogue.elra.info/en-us/repository/browse/ELRA-S0197/ | GETALP (GEOD at that time) recorded a large oral database (BRAF100: Base pour la Reconnaissance Automatique du Français, with 100 speakers and around 30 hours of speech) under a cooperation contract with ISL (Interactive Systems Laboratories) at the University of Karlsruhe for the joint development of a French speech database. This corpus is distributed as part of the GlobalPhone corpus through ELRA. |
| THERADIA WoZ | 2025 | End User License Agreement (EULA) | https://catalog.elra.info/ (summer 2025) | Ecological corpus specifically designed for the audiovisual detection of affective states in the healthcare domain. The corpus data come from natural interactions in French with a virtual assistant operated remotely by a human acting as a Wizard-of-Oz. It involves 61 senior individuals, including both neurotypical and Mild Cognitive Impairment (MCI) participants, and 52 young individuals. The participants’ expressions were fully transcribed and partially annotated based on the dimensions of recent appraisal theory models and 23 affective labels derived from the literature on achievement affects. |
| Vlexique 2.0 | 2024 | Creative Commons Attribution Non Commercial Share Alike 4.0 International | https://doi.org/10.5281/zenodo.14069226 | Vlexique 2.0 is a collection of French verbal paradigms, in phonemic and orthographic notation. They are suited for both computational and manual analysis. The dataset follows the Paralex Standard (https://www.paralex-standard.org). |
| mtgpy fr: a constituency parser for French | 2024 | Creative Commons Attribution Share Alike 4.0 International | https://doi.org/10.5281/zenodo.13121535 | A constituency parser for French released with projective and discontinuous pretrained models. |
| FlauberTagger | 2024 | Creative Commons Attribution Share Alike 4.0 International | https://doi.org/10.5281/zenodo.10697867 | A morphological tagger for French released with pretrained models. |
| mind-the-gap-py | 2021 | MIT License | https://doi.org/10.5281/zenodo.4775955 | A discontinuity-capable transition-based parser released with pretrained models for English and German. |
| Gazeplay | 2020 (2017-) | GPL-3.0 license | | Gazeplay software (demonstration, git repository): free and open source software for people with disabilities that brings together several mini-games that can be played using an eye-tracker. |
| Jibiki Platform | 2020 (2004-) | LGPL-2.1 license | | The Jibiki platform is a Java framework used to build collaborative dictionary/corpus development web sites. |
| FlauBERT | 2020 | Multiple | | Unsupervised language model pre-training for French. |
| Accolé | 2019 (2017-) | Creative Commons BY-NC-SA | http://lig-accole.imag.fr | Interactive platform for machine translation error analysis. |
| UFSAC | 2019 (2014-) | Apache 2.0 | http://github.com/getalp/UFSAC | Unification of Sense Annotated Corpora and Tools: a corpus format that can be used for training or testing a disambiguation system. |
| Corpus VocADom@A4H and VocADom@DOMUS | 2018 (2017-) | End-user License Agreement (EULA). Not for commercial use. | | Corpus recorded during the VocADom project in the Amiqual4Home and DOMUS apartments of the LIG. |
| GDEF dictionary | 2018 (2004-) | © Association Franco-Estonienne de Lexicographie | | Great Estonian-French dictionary. GETALP contributed the tools and expertise used to collaboratively build this dictionary. |
| Wikipedia Company Corpus | 2018 | EULA | https://gricad-gitlab.univ-grenoble-alpes.fr/getalp/wikipediacompanycorpus | 51K descriptions of companies collected from freely available Wikipedia articles. The dataset has been primarily collected for NLG and NLU. |
| DiLAF dictionaries | 2018 | | | Bilingual dictionaries between African languages and French. |
| Mboshi-French | 2017 | MIT | GitHub | Speech corpus collected during a realistic language documentation process: 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. |
| Translation Augmented LibriSpeech Corpus | 2017 | Creative Commons BY 4.0 | | English-French aligned corpus built from English audio books (an extension of the LibriSpeech ASR corpus). |
| SPEECH-COCO corpus | 2017 | Creative Commons BY 4.0 | | Descriptions of images in natural language (an extension of MS-COCO with added speech). |
| Recola | 2016 (2015-) | End User License Agreement (EULA) | https://diuf.unifr.ch/main/diva/recola/ | Multimodal corpus (audio, video, physio) of natural and spontaneous social-emotional behaviours. This corpus was used to organize two editions of the Audio Visual Emotion Challenge (AVEC). |
| METEOR-E tool | 2016 (2015-) | | | Tool for evaluating the performance of machine translation systems (an extension of METEOR). |
| Jibiki French-Japanese dictionary | 2016 | Creative Commons | https://jibiki.fr/ | The Jibiki.fr project collectively builds a high-quality, wide-coverage French-Japanese dictionary and an aligned bilingual corpus. |
| multivec tool | 2016 | Apache-2.0 license | | Multilingual tool for building vector word representations. |
| Seq2seq toolkit | 2015 | Apache 2.0 | GitHub | Sequence-to-sequence system (for NMT). 104 forks, more than 1000 downloads. |
| WCE-LIG tool | 2015 | GPL-3.0 license | | Tool to estimate the quality of an automatic translation at word level. |
| Sweet-Home | 2014 (2011-) | End-user License Agreement (EULA). Not for commercial use. | http://sweet-home-data.imag.fr/ | Corpus recorded during the Sweet-Home project in the DOMUS apartment of the LIG. This corpus consists of three sets: the multimodal corpus, the home automation control corpus, and the interaction corpus. |
| Iban corpus & tools | 2014 | Attribution-ShareAlike 2.0 | openslr & Kaldi master GitHub | This package contains Iban text and speech corpora used for Automatic Speech Recognition (ASR) experiments. |
| WCE-SLT-LIG | 2014 | Creative Commons BY 4.0 (?) | | WCE-SLT-LIG corpus: evaluation corpus for French-English speech systems. |
| ALFFA-Public | 2014 | MIT Licence | | ALFFA_PUBLIC corpus: multilingual corpus in African languages for building speech recognition systems. |
| Cirdo-set corpus | 2014 | N/A | | Corpus of calls for help during falls, recorded in the DOMUS apartment during the Cirdo project. |
| HIS corpus | 2010 | End-user License Agreement (EULA). Not for commercial use. | | Corpus recorded at the end of the RESIDE-HIS and DESDHIS projects (thesis of Anthony Fleury) in the HIS apartment of the TIMC-AFIRM team. |
| C-Star 2 | 2001 | | | C-Star 2 consortium (French-Korean demonstration videos): development of a first analyser from French to a pivot language, based on the analysis of relevant islands using finite state automata. |
| NESPOLE! | 1999 | | | NESPOLE! consortium (English-Italian demonstration videos): development of analysers from French to a pivot language (based on the detection of task-relevant islands with finite state automata) and of generators from a pivot language to French (concatenative generation from speech acts and their associated arguments). |
| MotÀMot dictionary | | | | French-Khmer bilingual dictionary. |

More pointers to GETALP resources

These collections reference some of the resources that GETALP created or contributed to:

    Groupe d'Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole