The GETALP team creates resources of several kinds (corpora, lexical resources, software platforms, etc.) that are usually made available to the public. Here is a selection of resources created by GETALP or with GETALP as a partner.

Resources

Title	Dates	Licence	Availability	Purpose/Summary
Audiocite.net	2024	CC BY-NC-SA 4.0	https://openslr.org/139/	Audiocite.net is a corpus of read French speech downloaded in November 2021 from the Audiocite.net website.
TArC: Tunisian Arabish Corpus	2024 (2021-)	CC BY-NC-SA 4.0	https://github.com/eligugliotta/tarc	A semi-automatically annotated corpus of Tunisian Arabic, including classification of tokens (Tunisian Arabic, foreign language, emotag), transcription into Arabic script, tokenization, and morpho-syntactic tagging.
LeBenchmark	2024 (2022-)	Apache-2.0	https://huggingface.co/LeBenchmark	A reproducible and multifaceted benchmark for evaluating speech SSL models, focusing on four French language tasks: Speech Recognition (ASR), Spoken Language Understanding (SLU), Speech Translation (AST), and Emotion Recognition (AER). More information is available on the project page: http://lebenchmark.com
PxCorpus	2023 (2022-)	Creative Commons Attribution 4.0 International	DOI 10.5281/zenodo.6482586	PxCorpus (A Spoken Drug Prescription Dataset in French for Spoken Language Understanding and Dialogue) is the first distributed spoken medical drug prescription corpus in French. It contains 4 hours of transcribed and annotated dialogues from 55 participants (experts and non-experts) and is designed for training NLP models for Spoken Language Understanding and dialogue systems. Transcriptions were human-verified and aligned with semantic labels. This was a collaboration with CHU Grenoble and Calystène SA.
DBnary	2025 (2012-)	Creative Commons Attribution-ShareAlike 3.0	http://kaiko.getalp.org/	DBnary is an effort to provide multilingual lexical data extracted from Wiktionary. The extracted data is made available as LLOD (Linguistic Linked Open Data). This data set won the Monnet challenge in 2012. Since then, a new version has been issued twice a month, and new languages and data are added continuously.
BDBRUIT	2005	ELRA	https://catalogue.elra.info/en-us/repository/browse/ELRA-S0033/	A French speech database dedicated to the study of the perturbations of speech production due to noisy environments, and especially the Lombard effect. Environment: 4 noise conditions and the reference condition (quiet). The 2 noises used (a "white noise" and a "cocktail-party noise") were each played to the speaker's ears, first without and then with feedback of the speaker's own voice. 10 speakers: 5 male and 5 female (4 CD-ROMs, approximately 2.2 Gigabytes).
BDSONS	2005	ELRA	https://catalogue.elra.info/en-us/repository/browse/ELRA-S0005/	The BDSONS Database is a French speech database with two subsets: evaluation and acoustic modelling. The corpora comprise 32 speakers: 16 male and 16 female (7 CD-ROMs, approximately 3.5 Gigabytes).
BRAF100	2005	ELRA	https://catalogue.elra.info/en-us/repository/browse/ELRA-S0197/	GETALP (GEOD at that time) recorded a large oral database (BRAF100: Base pour la Reconnaissance Automatique du Français, with 100 speakers and around 30h of speech) within a cooperation contract with ISL (Interactive Systems Laboratories) at the University of Karlsruhe for the joint development of a French speech database. This corpus has been distributed as part of the GlobalPhone corpus on ELRA.
THERADIA WoZ	2025	End User License Agreement (EULA)	https://catalog.elra.info/ (summer 2025)	Ecological corpus specifically designed for the audiovisual detection of affective states in the healthcare domain. The corpus data come from natural interactions in French with a virtual assistant, operated remotely by a human acting as a Wizard-of-Oz, involving 61 senior individuals, including both neurotypical and Mild Cognitive Impairment (MCI) participants, and 52 young individuals. The participants’ expressions were fully transcribed and partially annotated based on the dimensions of recent appraisal theory models and 23 affective labels derived from the literature on achievement affects.
Vlexique 2.0	2024	Creative Commons Attribution Non Commercial Share Alike 4.0 International	https://doi.org/10.5281/zenodo.14069226	Vlexique 2.0 is a collection of French verbal paradigms, in phonemic and orthographic notation. They are suited for both computational and manual analysis. The dataset follows the Paralex Standard (https://www.paralex-standard.org).
mtgpy fr: a constituency parser for French	2024	Creative Commons Attribution Share Alike 4.0 International	https://doi.org/10.5281/zenodo.13121535	A constituency parser for French released with projective and discontinuous pretrained models.
FlauberTagger	2024	Creative Commons Attribution Share Alike 4.0 International	https://doi.org/10.5281/zenodo.10697867	A morphological tagger for French released with pretrained models.
mind-the-gap-py	2021	MIT License	https://doi.org/10.5281/zenodo.4775955	A discontinuity-capable transition based parser released with pretrained models for English and German.
Gazeplay	2020 (2017-)	GPL-3.0 license		GazePlay: free and open-source software for people with disabilities that brings together several mini-games that can be played using an eye-tracker
Jibiki Platform	2020 (2004-)	LGPL-2.1 license		The Jibiki platform is a Java framework used to build collaborative dictionary/corpus development web sites
FlauBERT	2020	Multiple		Unsupervised Language Model Pre-training for French
Accolé	2019 (2017-)	Creative Commons BY-NC-SA	http://lig-accole.imag.fr	Interactive platform for machine translation error analysis.
UFSAC	2019 (2014-)	Apache 2.0	http://github.com/getalp/UFSAC	Unification of Sense Annotated Corpora and Tools: a format of corpus that can be used for training or testing a disambiguation system.
Corpus Vocadom@A4H and VocADom@DOMUS	2018 (2017-)	End-user License Agreement (EULA). Not for commercial use.		Corpus recorded during the VocADom project in the Amiqual4Home and DOMUS apartments of the LIG
GDEF dictionary	2018 (2004-)	© Association Franco-Estonienne de Lexicographie		Great Estonian-French dictionary. GETALP contributed the tools and expertise used to collaboratively build this dictionary.
Wikipedia Company Corpus	2018	EULA	https://gricad-gitlab.univ-grenoble-alpes.fr/getalp/wikipediacompanycorpus	51K descriptions of companies collected from freely available Wikipedia articles. The dataset has been primarily collected for NLG and NLU.
DiLAF dictionaries	2018			Bilingual dictionaries pairing African languages with French
Mboshi-French	2017	MIT	GitHub	Speech corpus collected during a realistic language documentation process: 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations.
Translation Augmented LibriSpeech Corpus	2017	Creative Commons BY 4.0		English-French aligned corpus built from English audiobooks (an augmentation of the LibriSpeech ASR corpus)
SPEECH-COCO corpus	2017	Creative Commons BY 4.0		Natural-language descriptions of images (an augmentation of MS-COCO with added speech)
Recola	2016 (2015-)	End User License Agreement (EULA)	https://diuf.unifr.ch/main/diva/recola/	Multimodal corpus (audio, video, physio) of natural and spontaneous social-emotional behaviours. This corpus was used to organize two editions of the Audio Visual Emotion Challenge (AVEC).
METEOR-E tool	2016 (2015-)			Tool for evaluating the performance of machine translation systems (an extension of METEOR)
Jibiki French-Japanese dictionary	2016	Creative Commons	https://jibiki.fr/	Jibiki.fr project collectively builds a high-quality, wide-coverage French-Japanese dictionary and an aligned bilingual corpus.
multivec tool	2016	Apache-2.0 license		Multilingual tool for building vector word representations
Seq2seq toolkit	2015	Apache 2.0	GitHub	Sequence-to-sequence system (for NMT). 104 forks, more than 1000 downloads.
WCE-LIG tool	2015	GPL-3.0 license		Tool to estimate the quality of an automatic translation at word level
Sweet-Home	2014 (2011-)	End-user License Agreement (EULA). Not for commercial use.	http://sweet-home-data.imag.fr/	Corpus recorded during the Sweet-Home project in the DOMUS apartment of the LIG. This corpus consists of three sets: the multimodal corpus, the home automation control corpus, and the interaction corpus.
Iban corpus & tools	2014	Attribution-ShareAlike 2.0	openslr & Kaldi master GitHub	This package contains Iban text and speech corpora used for Automatic Speech Recognition (ASR) experiments.
WCE-SLT-LIG	2014	Creative Commons BY 4.0 (?)		WCE-SLT-LIG corpus: evaluation corpus for French-English speech translation systems
ALFFA-Public	2014	MIT Licence		ALFFA_PUBLIC corpus: multilingual corpus in African languages to build speech recognition systems
Cirdo-set corpus	2014	N/A		Corpus of calls for help recorded during falls in the DOMUS apartment as part of the Cirdo project
HIS corpus	2010	End-user License Agreement (EULA). Not for commercial use.		Corpus recorded at the end of the RESIDE-HIS and DESDHIS projects (thesis of Anthony Fleury) in the HIS apartment of the TIMC-AFIRM team
C-Star 2	2001			C-Star 2 consortium (French-Korean demonstration videos): implementation of a first analyser from French to a pivot language, based on the analysis of relevant islands using finite-state automata
NESPOLE!	1999			NESPOLE! consortium (English-Italian demonstration videos): implementation of analysers from French to a pivot language (based on the detection of task-relevant islands with finite-state automata) and of generators from a pivot language to French (concatenative generation from speech acts and their associated arguments)
MotÀMot dictionary				French-Khmer bilingual dictionary

More pointers to GETALP resources

These collections reference some of the resources that GETALP created or contributed to:

GETALP : Groupe d'Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole
