Jobs | GETALP: Groupe d'Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

PhD positions

Personalized context-aware voice command system: 3-year contract, starting in September 2017

Internships

Title: Using Language Models to Check Documents' “Language Quality”

Subject offered in: M2R Informatique, Project

Supervisor(s):

Hervé Blanchon (Herve.Blanchon@imag.fr), LIG-GETALP

Didier Schwab (didier.schwab@imag.fr), LIG-GETALP

Keywords: Natural Language Processing, Language Model, Written Document Quality Estimation
Project duration: 6 months
Maximum number of students: 1
Available places: 1
Query performed on: 7 November 2017, at 12:11

Description: Paid internship with a company

Supervisors

Hervé BLANCHON (GETALP-LIG)
Annelise JOST (Altica)
Jérôme GOULIAN (GETALP-LIG)
Clarisse BAYOL (Altica)
Didier SCHWAB (GETALP-LIG)

Keywords

Natural Language Processing
Language Model
Written Document Quality Estimation

Profile of the Candidate

M2 in computer science with an interest in natural language processing and machine learning

Context

A language model for a given natural language L estimates the probability that a sequence of words is a valid sequence in L. The higher the probability, the more likely it is that the sequence is correct in L. Conversely, a low probability suggests that the sequence is incorrect in L.
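
As an illustration of this idea, here is a minimal sketch of a bigram language model with add-one smoothing. The toy corpus, the function names and the smoothing choice are assumptions made for the example; they are not part of the internship subject. The sketch only shows that a well-formed word order receives a higher log-probability than a scrambled one.

```python
import math
from collections import Counter

# Toy training corpus (hypothetical); a real model would be trained
# on a large corpus of the language L.
CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

def tokenize(sentence):
    # Whitespace tokenization with sentence boundary markers.
    return ["<s>"] + sentence.lower().split() + ["</s>"]

def train_bigram(corpus):
    # Count contexts (every token except the final </s>) and bigrams.
    unigrams, bigrams = Counter(), Counter()
    vocab = set()
    for sentence in corpus:
        tokens = tokenize(sentence)
        vocab.update(tokens)
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams, vocab

def log_prob(sentence, model):
    # Sum of log P(word | previous word) with add-one smoothing.
    unigrams, bigrams, vocab = model
    v = len(vocab) + 1  # +1 reserves some mass for unknown words (approximation)
    tokens = tokenize(sentence)
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + v))
    return total

model = train_bigram(CORPUS)
good = log_prob("the cat sat on the rug", model)
bad = log_prob("rug the on sat cat the", model)
print(f"well-formed: {good:.2f}  scrambled: {bad:.2f}")
```

Both sentences contain the same words, so the difference in score comes only from word order, which is the kind of signal the project intends to exploit.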

Altica, a company located in Voiron (20 km from Grenoble), specializes in translating documents for its clients. Altica would like to check the quality of both the source document provided by the client and the translation produced by Altica, using a fast and efficient workflow, so that appropriate action can be taken.

Originality of the proposed subject

To our knowledge, language models have not yet been used for the task we propose. On the one hand, for a given source document provided by a client, we would like to give the translator an approximate idea of the lexical and grammatical quality of the source, and to locate the problematic segments as precisely as possible so that the translator can either correct them or ask the client for a correction. On the other hand, for a given translation produced by Altica, we would like to give the translator an approximate idea of the lexical and grammatical quality of the translation, and to locate the problematic segments as precisely as possible so that the translator can correct them.

Expected results

Practical:

become familiar with the notion of language model
build language models using different approaches: n-gram LM, log-linear LM, feed-forward neural LM, recurrent neural network LM
evaluate the usability of the proposed LMs in measuring the language quality of a document (spelling, syntax, …)
evaluate the possibility of using the proposed LMs to identify problematic segments of a document (for the sake of building a user-friendly interface for a verification tool)
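
One possible way to approach the last item can be sketched as follows: compute the surprisal (negative log-probability, in bits) of each word given its predecessor, and flag the words whose surprisal exceeds a threshold. The toy corpus, the add-one smoothed bigram model, the helper names and the 3.5-bit threshold are all assumptions chosen for this example; a real system would use one of the stronger models listed above and a threshold tuned on data.

```python
import math
from collections import Counter

# Hypothetical reference corpus standing in for a large in-domain corpus.
CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

def tokenize(sentence):
    # Whitespace tokenization with sentence boundary markers.
    return ["<s>"] + sentence.lower().split() + ["</s>"]

# Bigram counts; contexts are all tokens except the final </s>.
unigrams, bigrams = Counter(), Counter()
for line in CORPUS:
    toks = tokenize(line)
    unigrams.update(toks[:-1])
    bigrams.update(zip(toks, toks[1:]))
VOCAB_SIZE = len(set(unigrams) | {"</s>"}) + 1  # +1 for unknown words

def surprisals(sentence):
    """Return (word, surprisal in bits) for each word given its predecessor."""
    toks = tokenize(sentence)
    out = []
    for prev, word in zip(toks, toks[1:]):
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + VOCAB_SIZE)
        out.append((word, -math.log2(p)))
    return out

def flag_words(sentence, threshold_bits=3.5):
    """Return (position, word) pairs whose surprisal exceeds the threshold."""
    return [
        (i, word)
        for i, (word, bits) in enumerate(surprisals(sentence))
        if bits > threshold_bits
    ]

clean = flag_words("the cat sat on the mat")
flagged = flag_words("the cat sat the on mat")
print("clean sentence:", clean)
print("swapped words: ", flagged)
```

In this toy setting the sentence seen in the corpus produces no flags, while swapping "on" and "the" flags the words around the swap. A verification tool could highlight such positions for the translator.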

Theoretical:

State of the art: language modelling approaches
State of the art: language modelling usages
State of the art: language modelling and document checking
Language modelling approaches: Evaluation of pros, cons, suitability for the task

References

Dan Jurafsky. (2017). CS 124: From Languages to Information, week 2 (language modelling). https://web.stanford.edu/class/cs124/

Joshua Goodman. (2003). The state of the art in language modeling. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Tutorials – Volume 5 (NAACL-Tutorials ’03), Vol. 5. Association for Computational Linguistics, Stroudsburg, PA, USA, 4-4. DOI: https://doi.org/10.3115/1075168.1075172

Graham Neubig. (2017). Neural Machine Translation and Sequence-to-sequence Models: A Tutorial. http://arxiv.org/abs/1703.01619. Section 3.

Dengliang Shi. (2017). A Study on Neural Network Language Modeling. https://arxiv.org/abs/1708.07252

Oren Melamud, Ido Dagan, Jacob Goldberger. (2017). A Simple Language Model based on PMI Matrix Approximations. https://arxiv.org/abs/1707.05266

Youssef Oualil & Dietrich Klakow (2017). A Neural Network Approach for Mixing Language Models. 5710-5714. https://arxiv.org/abs/1708.06989

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu. (2016). Exploring the Limits of Language Modeling. https://arxiv.org/abs/1602.02410