Title: Using Language Models to Check Documents' “Language Quality”

Proposed topic in: M2R Informatique, Project

Person(s) responsible:

      • Hervé Blanchon (Herve.Blanchon@imag.fr), LIG-GETALP
      • Didier Schwab (didier.schwab@imag.fr), LIG-GETALP

Keywords: Natural Language Processing, Language Model, Written Document Quality Estimation
Project duration: 6 months
Maximum number of students: 1
Places available: 1
Interview on: 07 November 2017, at 12:11 p.m.


Description
Paid internship with a company

Supervisors

  • Hervé Blanchon (Herve.Blanchon@imag.fr), LIG-GETALP
  • Didier Schwab (didier.schwab@imag.fr), LIG-GETALP

Keywords

  • Natural Language Processing
  • Language Model
  • Written Document Quality Estimation

Profile of the Candidate

M2 in computer science with an interest in natural language processing and machine learning

Context

A language model for a given natural language L computes the probability that a given sequence of words is a valid sequence of L. The higher the score, the more likely it is that the sequence is correct in L; conversely, a low score suggests that the sequence is incorrect in L.
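
As a rough illustration (a toy sketch in Python; the corpus, the smoothing choice, and the example sentences are hypothetical), a bigram model with add-one smoothing scores a fluent sentence noticeably higher than a scrambled one:

    # Toy bigram language model with add-one smoothing (hypothetical corpus).
    import math
    from collections import Counter

    corpus = [
        "the client sent the source document".split(),
        "the translator checked the document".split(),
    ]

    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)

    def log_prob(sentence):
        """Add-one smoothed log-probability of a tokenized sentence."""
        tokens = ["<s>"] + sentence + ["</s>"]
        return sum(
            math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
            for prev, word in zip(tokens, tokens[1:])
        )

    # A fluent sentence scores higher (less negative) than a scrambled one.
    print(log_prob("the translator checked the source document".split()))
    print(log_prob("document the checked translator source the".split()))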

The Altica company, located in Voiron (20 km from Grenoble), specializes in translating documents for its clients. Altica would like to be able to check the quality of both the source document provided by the client and the translation produced by Altica, using a fast and efficient workflow, in order to take appropriate action.

Originality of the proposed subject

To our knowledge, language models have not yet been used for the task we propose. On the one hand, for a given source document provided by a client, we would like to give the translator an approximate idea of the lexical and grammatical quality of the source and to locate the problematic segments in it as precisely as possible, so that the translator may correct them or ask the client to provide a correction. On the other hand, for a given translation produced by Altica, we would like to give the translator an approximate idea of the lexical and grammatical quality of the translation and to locate the problematic segments in it as precisely as possible, so that the translator may correct them.
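
One way such scores could be turned into segment-level diagnostics is sketched below, assuming a sentence-scoring function like the toy log_prob above; the window size and the number of reported spans are arbitrary illustrations, not tuned values:

    # Sketch: rank sliding windows by LM score to locate candidate problematic
    # segments. Assumes a sentence-scoring function such as the log_prob toy
    # above; window size and top_k are illustrative defaults.

    def worst_segments(tokens, score_fn, window=5, top_k=2):
        """Return the top_k lowest-scoring windows as (start, end, score) triples."""
        scored = []
        for start in range(max(1, len(tokens) - window + 1)):
            chunk = tokens[start:start + window]
            scored.append((start, start + len(chunk), score_fn(chunk)))
        scored.sort(key=lambda item: item[2])  # least probable windows first
        return scored[:top_k]

    tokens = "the client sent sent the the source document to the translator".split()
    for start, end, score in worst_segments(tokens, log_prob):
        print(f"suspicious span {start}:{end} {tokens[start:end]} (score {score:.1f})")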

Expected results

Practical:

  • become familiar with the notion of language model
  • build language models using different approaches: n-gram LM, log-linear LM, feed-forward neural LM, recurrent neural network LM (a minimal sketch of the feed-forward case follows this list)
  • evaluate the usability of the proposed LMs in measuring the language quality of a document (spelling, syntax, …)
  • evaluate the possibility of using the proposed LMs to identify problematic segments of a document (with a view to building a user-friendly interface for a verification tool)
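
As one concrete starting point for the feed-forward neural approach mentioned in the list above, a minimal sketch is given here, assuming PyTorch is available; the vocabulary size, context length and layer sizes are hypothetical placeholders, and real training data would come from a corpus rather than random tensors:

    # Sketch of a feed-forward (Bengio-style) neural language model in PyTorch.
    import torch
    import torch.nn as nn

    class FeedForwardLM(nn.Module):
        def __init__(self, vocab_size, context_size=3, embed_dim=64, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, context_ids):
            # context_ids: (batch, context_size) indices of the previous words
            emb = self.embed(context_ids).flatten(start_dim=1)
            return self.out(torch.tanh(self.hidden(emb)))  # logits over next word

    # Toy usage: one training step on random data (placeholders for real n-grams).
    vocab_size = 1000
    model = FeedForwardLM(vocab_size)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    contexts = torch.randint(0, vocab_size, (8, 3))  # batch of 3-word contexts
    targets = torch.randint(0, vocab_size, (8,))     # next words to predict
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(contexts), targets)
    loss.backward()
    optimizer.step()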

Theoretical:

  • State of the art: language modelling approaches
  • State of the art: language modelling usages
  • State of the art: language modelling and document checking
  • Language modelling approaches: Evaluation of pros, cons, suitability for the task
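
For the comparison of approaches, perplexity on held-out text is the usual intrinsic measure. A minimal sketch, reusing a sentence-scoring function like the toy log_prob above (the held-out sentences are hypothetical placeholders):

    # Sketch: compare LM variants by perplexity on held-out sentences.
    import math

    def perplexity(sentences, score_fn):
        """exp of the average negative log-probability per predicted token."""
        total_logprob, total_tokens = 0.0, 0
        for sent in sentences:
            total_logprob += score_fn(sent)
            total_tokens += len(sent) + 1  # include the end-of-sentence transition
        return math.exp(-total_logprob / total_tokens)

    held_out = [
        "the translator checked the source document".split(),
        "the client sent the document".split(),
    ]
    print(f"bigram LM perplexity: {perplexity(held_out, log_prob):.1f}")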

References

Dan Jurafsky. (2017). CS 124: From Languages to Information, week 2 (language modelling). https://web.stanford.edu/class/cs124/

Joshua Goodman. (2003). The state of the art in language modeling. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Tutorials – Volume 5 (NAACL-Tutorials ’03), Vol. 5. Association for Computational Linguistics, Stroudsburg, PA, USA, 4-4. DOI: https://doi.org/10.3115/1075168.1075172

Graham Neubig. (2017). Neural Machine Translation and Sequence-to-sequence Models: A Tutorial. http://arxiv.org/abs/1703.01619. Section 3.

Dengliang Shi. (2017). A Study on Neural Network Language Modeling. https://arxiv.org/abs/1708.07252

Oren Melamud, Ido Dagan, Jacob Goldberger. (2017). A Simple Language Model based on PMI Matrix Approximations. https://arxiv.org/abs/1707.05266

Youssef Oualil, Dietrich Klakow. (2017). A Neural Network Approach for Mixing Language Models. https://arxiv.org/abs/1708.06989

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, Yonghui Wu. (2016). Exploring the Limits of Language Modeling. https://arxiv.org/abs/1602.02410
