Title: Using Language Models to Check Documents’ “Language Quality”
Proposed topic in: M2R Informatique, Project
Person(s) responsible:
- Hervé Blanchon
- Didier Schwab
Keywords: Natural Language Processing, Language Model, Written Document Quality Estimation
Project duration: 6 months
Maximum number of students: 1
Places available: 1
Interview on: 07 November 2017, at 12.11 p.m.
- Hervé BLANCHON (GETALP-LIG)
- Annelise JOST (Altica)
- Jérôme GOULIAN (GETALP-LIG)
- Clarisse BAYOL (Altica)
- Didier SCHWAB (GETALP-LIG)
Profile of the Candidate
M2 in computer science with an interest in natural language processing and machine learning
A language model for a given natural language L computes the probability that a sequence of words is a valid sequence in L. The higher the probability, the more likely it is that the sequence is correct in L; conversely, a low probability suggests that the sequence is incorrect in L.
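As an illustration of this scoring idea, here is a minimal sketch of a bigram language model with add-one smoothing. The toy corpus, the function names (`train_bigram_lm`, `log_prob`) and the choice of smoothing are assumptions made for the example; they are not part of the project specification.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count contexts and bigrams over tokenized sentences (hypothetical helper)."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        contexts.update(padded[:-1])                  # every token that has a successor
        bigrams.update(zip(padded[:-1], padded[1:]))  # (previous word, word) pairs
        vocab.update(tokens)
    # One extra slot in the event space stands for any unknown word.
    return contexts, bigrams, len(vocab) + 1

def log_prob(tokens, model):
    """Natural-log probability of a sentence under add-one smoothing."""
    contexts, bigrams, v = model
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + v))
        for prev, word in zip(padded[:-1], padded[1:])
    )

# Toy training corpus (an assumption; a real model needs far more data).
corpus = [
    "the translator checks the document".split(),
    "the client sends the document".split(),
    "the translator corrects the translation".split(),
]
model = train_bigram_lm(corpus)

well_formed = log_prob("the translator checks the translation".split(), model)
scrambled = log_prob("translation the checks translator the".split(), model)
print(well_formed, scrambled)
```

On this toy corpus the well-formed sentence receives a higher log-probability than the scrambled one, which is the behaviour described above.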
Altica, a company located in Voiron (20 km from Grenoble), specializes in translating documents for its clients. Altica would like a fast and efficient workflow to check the quality of both the source document provided by the client and the translation produced by Altica, so that appropriate action can be taken.
Originality of the proposed subject
To our knowledge, language models have not yet been used for the task we propose. On the one hand, given a source document provided by a client, we would like to give the translator an approximate idea of its lexical and grammatical quality and to locate the problematic segments as precisely as possible, so that the translator can correct them or ask the client to provide a correction. On the other hand, given a translation produced by Altica, we would like to give the translator an approximate idea of the lexical and grammatical quality of the translation and to locate its problematic segments as precisely as possible, so that the translator can correct them.
- become familiar with the notion of language model
- build language models using different approaches: n-gram LM, log-linear LM, feed-forward neural LM, recurrent neural network LM
- evaluate the usability of the proposed LMs in measuring the language quality of a document (spelling, syntax, …)
- evaluate the possibility of using the proposed LMs to identify problematic segments of a document (for the sake of building a user-friendly interface for a verification tool)
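One possible way to approach the last objective is to compute a per-token surprisal and flag the tokens that the model finds unexpected. The sketch below assumes a smoothed bigram model, a toy corpus and an arbitrary threshold of 3 bits; the names `train`, `surprisals` and `flag_tokens` are hypothetical, and the project may well settle on a different model or criterion.

```python
import math
from collections import Counter

def train(sentences):
    """Collect context and bigram counts from tokenized sentences (hypothetical helper)."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        contexts.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))
        vocab.update(tokens)
    return contexts, bigrams, len(vocab) + 1  # +1 slot for unknown words

def surprisals(tokens, model):
    """Surprisal in bits of each token given the previous one (add-one smoothing)."""
    contexts, bigrams, v = model
    padded = ["<s>"] + tokens
    return [
        -math.log2((bigrams[(prev, word)] + 1) / (contexts[prev] + v))
        for prev, word in zip(padded[:-1], padded[1:])
    ]

def flag_tokens(tokens, model, threshold):
    """Return (position, token) pairs whose surprisal exceeds the threshold."""
    scores = surprisals(tokens, model)
    return [(i, tok) for i, (tok, s) in enumerate(zip(tokens, scores)) if s > threshold]

# Toy corpus and threshold are assumptions chosen for illustration only.
corpus = [
    "the translator checks the document".split(),
    "the client sends the document".split(),
    "the translator corrects the translation".split(),
]
model = train(corpus)
sentence = "the translator checks teh document".split()
flagged = flag_tokens(sentence, model, threshold=3.0)
print(flagged)
```

In this example the misspelled token "teh" is flagged, and so is the token that follows it, because its context has never been seen. A verification tool would therefore probably need to merge adjacent flags into a single highlighted segment.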
- State of the art: language modelling approaches
- State of the art: language modelling usages
- State of the art: language modelling and document checking
- Language modelling approaches: Evaluation of pros, cons, suitability for the task
Dan Jurafsky. (2017). CS 124: From Languages to Information, week 2 (language modelling). https://web.stanford.edu/class/cs124/
Joshua Goodman. (2003). The state of the art in language modeling. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Tutorials – Volume 5 (NAACL-Tutorials ’03), Vol. 5. Association for Computational Linguistics, Stroudsburg, PA, USA, 4-4. DOI: https://doi.org/10.3115/1075168.1075172
Graham Neubig. (2017). Neural Machine Translation and Sequence-to-sequence Models: A Tutorial. http://arxiv.org/abs/1703.01619. Section 3.
Dengliang Shi. (2017). A Study on Neural Network Language Modeling. https://arxiv.org/abs/1708.07252
Oren Melamud, Ido Dagan, Jacob Goldberger. (2017) A Simple Language Model based on PMI Matrix Approximations. https://arxiv.org/abs/1707.05266
Youssef Oualil & Dietrich Klakow (2017). A Neural Network Approach for Mixing Language Models. 5710-5714. https://arxiv.org/abs/1708.06989
Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu. (2016). Exploring the Limits of Language Modeling. https://arxiv.org/abs/1602.02410