Thesis offers
- Personalized voice command system in context – 3-year contract, starting September 2017
Internships
Titre : Using Language Models to Check Documents “Language Quality”
Subject proposed in: M2R Informatique, Project
Supervisor(s):
- Hervé Blanchon (Herve.Blanchon@imag.fr), LIG-GETALP
- Didier Schwab (didier.schwab@imag.fr), LIG-GETALP
Keywords: Natural Language Processing, Language Model, Written Document Quality Estimation
Project duration: 6 months
Maximum number of students: 1
Available places: 1
Supervisors
- Hervé BLANCHON (GETALP-LIG)
- Annelise JOST (Altica)
- Jérôme GOULIAN (GETALP-LIG)
- Clarisse BAYOL (Altica)
- Didier SCHWAB (GETALP-LIG)
Keywords
- Natural Language Processing
- Language Model
- Written Document Quality Estimation
Profile of the Candidate
M2 in computer science with an interest in natural language processing and machine learning
Context
A language model for a given natural language L assigns a probability to a word sequence of interest, estimating how likely that sequence is to be a valid sequence of L. The higher the score, the more likely it is that the sequence is correct in L; conversely, a low score suggests that the sequence is incorrect in L.
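Purely as an illustration of this idea (the toy corpus and the two example sentences below are invented for the sketch, not data from the project), a minimal bigram model in Python assigns a higher log-probability to a well-formed word order than to a scrambled one:

# Illustrative only: a toy bigram language model with add-one smoothing.
from collections import Counter
import math

corpus = "the client sent the document . the translator checked the document .".split()
vocab = set(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def log_prob(sentence):
    """Log-probability of a word sequence under the bigram model (chain rule)."""
    words = sentence.split()
    lp = 0.0
    for prev, cur in zip(words, words[1:]):
        # P(cur | prev) with add-one (Laplace) smoothing
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + len(vocab))
        lp += math.log(p)
    return lp

print(log_prob("the translator checked the document ."))   # relatively high score
print(log_prob("document the checked translator the ."))   # lower: unlikely word order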
The Altica company, located in Voiron (20 km from Grenoble), specializes in translating documents for its clients. Altica would like to be able to check, with a fast and efficient workflow, the quality of both the source document provided by the client and the translated document produced by Altica, so that appropriate actions can be taken.
Originality of the proposed subject
To our knowledge, language models have not yet been used for the task we propose. On the one hand, for a given source document provided by a client, we would like to give the translator an approximate idea of the lexical and grammatical quality of the source and to locate the problematic segments in this source as precisely as possible, so that the translator may correct them or ask the client to provide a correction. On the other hand, for a given translation produced by Altica, we would like to give the translator an approximate idea of the lexical and grammatical quality of the translation and to locate the problematic segments as precisely as possible, so that the translator may correct them.
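As a rough sketch of how per-token language-model scores could be turned into located segments (the flag_segments helper, the threshold value and the example scores below are hypothetical illustrations, not part of the subject), one could group consecutive low-probability tokens:

# Hypothetical sketch: flag candidate problematic segments from per-token
# language-model log-probabilities. `token_logprobs` would come from any of
# the LMs studied during the internship; here it is just an assumed input.

def flag_segments(tokens, token_logprobs, threshold=-8.0):
    """Group consecutive low-scoring tokens into candidate problematic segments."""
    segments, current = [], []
    for tok, lp in zip(tokens, token_logprobs):
        if lp < threshold:          # token is unexpectedly unlikely in context
            current.append(tok)
        elif current:               # end of a low-probability run
            segments.append(" ".join(current))
            current = []
    if current:
        segments.append(" ".join(current))
    return segments

# Example with made-up scores: the misspelled word gets a very low log-probability.
tokens = "the translater checked the document".split()
scores = [-2.1, -11.5, -3.0, -1.8, -2.4]
print(flag_segments(tokens, scores))   # ['translater']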
Expected results
Practical:
- become familiar with the notion of language model
- build language models using different approaches: n-gram LM, log-linear LM, feed-forward neural LM, recurrent neural network LM (a minimal feed-forward example is sketched after this list)
- evaluate the usability of the proposed LMs in measuring the language quality of a document (spelling, syntax, …)
- evaluate the possibility of using the proposed LMs to identify problematic segments of a document (with a view to building a user-friendly interface for a verification tool)
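As a minimal sketch of one of the approaches listed above, assuming PyTorch is available (the toy corpus, the architecture sizes and the training settings are illustrative choices, not prescribed by the subject), a Bengio-style feed-forward neural LM could look like this:

# Minimal sketch of a feed-forward neural language model in PyTorch.
# Toy corpus, vocabulary handling and hyperparameters are illustrative only.
import torch
import torch.nn as nn

corpus = "the client sent the document and the translator checked the document".split()
vocab = sorted(set(corpus))
stoi = {w: i for i, w in enumerate(vocab)}
context_size, emb_dim, hidden = 2, 16, 32

# Training examples: predict a word from its two preceding words.
X = torch.tensor([[stoi[corpus[i]], stoi[corpus[i + 1]]]
                  for i in range(len(corpus) - context_size)])
y = torch.tensor([stoi[corpus[i + context_size]]
                  for i in range(len(corpus) - context_size)])

class FFLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(len(vocab), emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(context_size * emb_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, len(vocab)))
    def forward(self, ctx):
        # Concatenate the context embeddings, then output logits over the vocabulary.
        return self.mlp(self.emb(ctx).view(ctx.size(0), -1))

model, loss_fn = FFLM(), nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=0.05)
for _ in range(200):                       # tiny training loop on the toy data
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
print("final training loss:", loss.item())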
Theoretical:
- State of the art: language modelling approaches
- State of the art: language modelling usages
- State of the art: language modelling and document checking
- Language modelling approaches: evaluation of pros, cons, and suitability for the task
References
Dan Jurafsky. (2017). CS 124: From Languages to Information, week 2 (language modelling). https://web.stanford.edu/class/cs124/
Joshua Goodman. (2003). The state of the art in language modeling. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Tutorials – Volume 5 (NAACL-Tutorials ’03), Vol. 5. Association for Computational Linguistics, Stroudsburg, PA, USA, 4-4. DOI: https://doi.org/10.3115/1075168.1075172
Graham Neubig. (2017). Neural Machine Translation and Sequence-to-sequence Models: A Tutorial. http://arxiv.org/abs/1703.01619. Section 3.
Dengliang Shi. (2017). A Study on Neural Network Language Modeling. https://arxiv.org/abs/1708.07252
Oren Melamud, Ido Dagan, Jacob Goldberger. (2017). A Simple Language Model based on PMI Matrix Approximations. https://arxiv.org/abs/1707.05266
Youssef Oualil, Dietrich Klakow. (2017). A Neural Network Approach for Mixing Language Models. pp. 5710-5714. https://arxiv.org/abs/1708.06989
Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu. (2016). Exploring the Limits of Language Modeling. https://arxiv.org/abs/1602.02410