Thesis Offers


Title: Using Language Models to Check Documents “Language Quality”

Subject offered in: M2R Informatique, Project

Supervisor(s):

        Herve Blanchon


        Didier Schwab


Keywords: Natural Language Processing, Language Model, Written Document Quality Estimation
Project duration: 6 months
Maximum number of students: 1
Places available: 1
Query run on: 7 November 2017, at 12:11


 Paid internship with a company




Profile of the Candidate

M2 student in computer science with an interest in natural language processing and machine learning.


A language model for a given natural language L estimates the probability that a sequence of words is a valid sequence in L. The higher the probability, the more likely it is that the sequence is correct in L. Conversely, a low probability suggests that the sequence is incorrect in L.
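As a minimal sketch of this idea, the Python example below trains a bigram model with add-one smoothing on a toy corpus and compares the log-probability of a fluent sentence with that of a scrambled one. The corpus, the function names `train_bigram_lm` and `log_prob`, and the choice of smoothing are assumptions made for illustration; they are not taken from the project description.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count unigrams and bigrams over tokenized sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def log_prob(tokens, unigrams, bigrams):
    """Natural-log probability of a token sequence under add-one (Laplace) smoothing."""
    # For simplicity the vocabulary size includes the padding symbols.
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for prev, word in zip(padded, padded[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


# Toy training corpus (hypothetical, for illustration only).
corpus = [
    "the translator checks the document".split(),
    "the client sends the document".split(),
    "the translator corrects the translation".split(),
]
unigrams, bigrams = train_bigram_lm(corpus)

fluent = "the translator checks the translation".split()
scrambled = "translation the checks translator the".split()
fluent_score = log_prob(fluent, unigrams, bigrams)
scrambled_score = log_prob(scrambled, unigrams, bigrams)
print(f"fluent: {fluent_score:.2f}  scrambled: {scrambled_score:.2f}")
```

On this toy data the fluent sentence receives the higher (less negative) log-probability, which matches the behaviour described above. A realistic system would need a much larger corpus and a stronger smoothing scheme or a neural model.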

Altica, a company located in Voiron (20 km from Grenoble), specializes in translating documents for its clients. Altica would like to check the quality of both the source document provided by the client and the translation produced by Altica, using a fast and efficient workflow, so that appropriate action can be taken.

Originality of the proposed subject

To our knowledge, language models have not yet been used for the task we propose. On the one hand, for a given source document provided by a client, we would like to give the translator an approximate idea of the lexical and grammatical quality of the source, and to locate the problematic segments in it as precisely as possible, so that the translator can either correct them or ask the client to provide a correction. On the other hand, for a given translation produced by Altica, we would like to give the translator an approximate idea of the lexical and grammatical quality of their translation, and to locate the problematic segments in it as precisely as possible, so that they can correct them.

Expected results


  • become familiar with the notion of a language model
  • build language models using different approaches: n-gram LM, log-linear LM, feed-forward neural LM, recurrent neural network LM
  • evaluate the usability of the proposed LMs for measuring the language quality of a document (spelling, syntax, …)
  • evaluate the possibility of using the proposed LMs to identify problematic segments of a document (with a view to building a user-friendly interface for a verification tool)
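One possible way to approach the last two points is sketched below: a language model assigns each token a surprisal (negative log2 probability), the mean surprisal gives a document-level perplexity, and tokens whose surprisal exceeds a threshold are flagged as candidate problem spots. The bigram model, the toy corpus, the helper names and the threshold of 3 bits are all hypothetical choices made for illustration, not part of the proposal.

```python
import math
from collections import Counter


def train_bigram_counts(sentences):
    """Count unigrams and bigrams over tokenized sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def token_surprisals(tokens, unigrams, bigrams):
    """Return (word, surprisal in bits) for each token, using add-one smoothing."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens
    result = []
    for prev, word in zip(padded, tokens):
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        result.append((word, -math.log2(p)))
    return result


def perplexity(surprisals):
    """Perplexity is 2 raised to the mean surprisal in bits."""
    mean_bits = sum(bits for _, bits in surprisals) / len(surprisals)
    return 2 ** mean_bits


def flag_tokens(surprisals, threshold_bits):
    """Indices of tokens whose surprisal exceeds the (hypothetical) threshold."""
    return [i for i, (_, bits) in enumerate(surprisals) if bits > threshold_bits]


# Toy training corpus (hypothetical, for illustration only).
corpus = [
    "the translator checks the document".split(),
    "the client sends the document".split(),
    "the translator corrects the translation".split(),
]
unigrams, bigrams = train_bigram_counts(corpus)

clean = token_surprisals("the translator checks the document".split(), unigrams, bigrams)
noisy = token_surprisals("the translator checks teh document".split(), unigrams, bigrams)

THRESHOLD_BITS = 3.0  # assumed value; would need tuning on real data
clean_flagged = flag_tokens(clean, THRESHOLD_BITS)
flagged = flag_tokens(noisy, THRESHOLD_BITS)
print("perplexity clean:", perplexity(clean), "noisy:", perplexity(noisy))
print("flagged tokens:", [noisy[i][0] for i in flagged])
```

With this toy corpus and threshold, nothing is flagged in the clean sentence, while the misspelled token "teh" and the word that follows it are flagged in the noisy one: an unseen word also makes its successor hard to predict. This suggests that a verification interface would probably highlight a short span around each high-surprisal token rather than a single word.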


  • State of the art: language modelling approaches
  • State of the art: uses of language modelling
  • State of the art: language modelling and document checking
  • Language modelling approaches: evaluation of pros, cons, and suitability for the task


Dan Jurafsky. (2017). CS 124: From Languages to Information, week 2 (language modelling). https://web.stanford.edu/class/cs124/

Joshua Goodman. (2003). The state of the art in language modeling. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Tutorials, Volume 5 (NAACL-Tutorials ’03). Association for Computational Linguistics, Stroudsburg, PA, USA, p. 4. DOI: https://doi.org/10.3115/1075168.1075172

Graham Neubig. (2017). Neural Machine Translation and Sequence-to-sequence Models: A Tutorial. http://arxiv.org/abs/1703.01619. Section 3.

Dengliang Shi. (2017). A Study on Neural Network Language Modeling. https://arxiv.org/abs/1708.07252

Oren Melamud, Ido Dagan, Jacob Goldberger. (2017). A Simple Language Model based on PMI Matrix Approximations. https://arxiv.org/abs/1707.05266

Youssef Oualil, Dietrich Klakow. (2017). A Neural Network Approach for Mixing Language Models. pp. 5710-5714. https://arxiv.org/abs/1708.06989

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu. (2016). Exploring the Limits of Language Modeling. https://arxiv.org/abs/1602.02410