## Neural Networks (NN) and Recurrent Neural Networks (RNN)
The term "Neural Networks" describes a class of machine learning algorithms that act as general function approximators.
These algorithms (or networks) learn a specific mapping between an input and an output.
So-called Recurrent Neural Networks are a subclass of NNs that maintain an internal state, which allows them to access their previous output in the next iteration.
This allows RNNs to parse a sequence of inputs and produce output sequences, e.g. text.
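
The recurrence can be sketched as a single update rule applied along the sequence. The following minimal NumPy example is illustrative only: the dimensions, random weights and the `tanh` nonlinearity are my own assumptions, not a specific published architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 4-dimensional inputs, 8-dimensional hidden state.
input_size, hidden_size = 4, 8

# Randomly initialized weights; in a real network these are learned.
W_xh = rng.normal(0, 0.1, (hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(0, 0.1, (hidden_size, hidden_size))  # hidden -> hidden (the recurrence)
b_h = np.zeros(hidden_size)

def rnn_step(x, h_prev):
    """One RNN step: the new state depends on the input AND the previous state."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b_h)

# Process a sequence: the hidden state carries information across iterations.
h = np.zeros(hidden_size)
sequence = [rng.normal(size=input_size) for _ in range(5)]
for x in sequence:
    h = rnn_step(x, h)

print(h.shape)  # (8,)
```

The key point is that `h` is fed back into `rnn_step`, which is exactly the internal state that gives the network access to its own history.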

## Attention
Plain RNNs have only a limited memory of previously generated output. This results in certain issues when generating output with a homogeneous structure: if an RNN learns, e.g., to generate a sentence of the form <Noun> <Verb> <Noun>, it may end up in a <Noun> <Verb> <Noun> loop.
In comparison to "plain" NNs, attention-based networks address this issue by providing the NN with access to the full input on top of the last output, using a filter matrix to "blend in" certain features.
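
One common instantiation of such a blending filter is dot-product attention. The following minimal sketch is illustrative (toy vectors, unscaled scores); it is one standard formulation, not the mechanism of any specific model discussed here.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def attend(query, inputs):
    """Dot-product attention: score every input position against the query,
    then blend all inputs into one context vector with the resulting weights."""
    scores = inputs @ query        # one relevance score per input position
    weights = softmax(scores)      # normalized "filter" over the input
    context = weights @ inputs     # weighted blend of all input states
    return context, weights

# Three input positions with 2-dimensional representations (toy values).
inputs = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
query = np.array([3.0, -1.0])      # most similar to the first position

context, weights = attend(query, inputs)
print(weights)   # the first position receives the largest weight
```

Because the weights always sum to one, the model can softly select which parts of the full input to attend to at each output step, instead of relying only on its last output.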

\TODO{Introduce at least 2 attention models here!}

## Neural Machine Translation (NMT)
Neural machine translation describes a branch of machine translation where NNs are used to translate text.
These networks are trained on input texts and their corresponding translations, which allows them to translate similar texts.
NMT systems are commonly used, for example, in Google Translate \cite{Wu:Schuster:16} and can produce good results on short passages.
With enough training data they can even translate articles or larger texts.\TODO{state of machine translation?}

## Domain Control Mechanism (DCM)
A given text always represents a certain perspective to a specific audience.
This means that the language used differs and that words may have a different meaning depending on the context and the audience.
Since different jargon or terminology can actually conflict, every translation process needs to address the target domain.
A domain control mechanism is a specification that provides this information.
This can be done in several ways, e.g. by using domain-specific dictionaries or language models.
For human translators this is often a simple document called a style sheet or style guide that lists this information, as shown in figure \ref{fig:styleguide} on page \pageref{fig:styleguide}.
### Side Constraints
A similar approach is used in NMT under the term side constraints.
This refers to a method where the known vocabulary is extended by a set of tags.
Each of these tags stands for a specific trait; in the case of translation systems this may refer to a domain or a feature of the text.
The underlying NN learns to accept the token as part of the input and can be trained to express a certain behavior depending on the token.
For example, politeness control uses the tokens `<V>` for formal and `<T>` for informal address.
The naming is taken from \cite{Brown:60} and refers to the Latin pronouns "tu" for informal and "vos" for formal address.
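
Since side constraints live entirely on the data side, they can be illustrated in a few lines of Python. The `tag_source` helper and the exact tag spelling below are illustrative assumptions, not taken from a specific framework.

```python
# Side constraints are implemented purely in the training data: each source
# sentence is prefixed with a tag before training or translation, and the
# model learns to condition its output on that tag.

POLITENESS_TAGS = {"formal": "<V>", "informal": "<T>"}

def tag_source(sentence: str, politeness: str) -> str:
    """Prepend the side-constraint token so the NMT model sees it as input."""
    return f"{POLITENESS_TAGS[politeness]} {sentence}"

print(tag_source("How are you?", "formal"))    # <V> How are you?
print(tag_source("How are you?", "informal"))  # <T> How are you?
```

At inference time the same tag is prepended to steer the model toward the desired register.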

## Byte-Pair-Encoding (BPE)
One of the main challenges in current NMT systems is the choice of the available vocabulary.
If too many words are included, the network tends to forget rarely used words or becomes too general.
On the other hand, if the vocabulary is too small, the network cannot learn complex grammar forms.
Byte Pair Encoding is an approach that solves this by splitting long words into multiple symbols.
Instead of a list of actual words, the dictionary consists of frequently used character pairs or subwords.
This allows the network to learn common base words as well as pre- and suffixes.
\shortcite{Sennrich:15} has shown that this results in an overall improvement and allows the model to learn more complex grammar forms.
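
The merge procedure can be sketched in a few lines of Python, following the reference sketch given in \shortcite{Sennrich:15}; the toy vocabulary and the number of merge operations below are illustrative.

```python
import collections
import re

def get_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[a, b] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the pair by a single merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Toy corpus: words split into characters, "</w>" marks the end of a word.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(10):                    # learn 10 merge operations
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)

print(sorted(vocab))   # frequent words collapse into single symbols
```

After a few merges, frequent words such as "low" and "newest" become single symbols, while rare words remain split into reusable subword units.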
# Thesis Statement
## Qualifying side constraints as a "style guide" for machine translation systems
The quality and availability of any machine translation system depend on the number of sectors it can be used in.
Side constraints offer a robust and easy-to-implement way of extending an NMT system to several domains, as shown by \TODO{kobus paper} and \cite{Sennrich:16}.
I will evaluate different attention models and try to reproduce the improvements through side constraints on more distant language pairs.
## Research Question
My thesis will specifically focus on the following four questions.
### How can mixed data be merged and prepared for use in common NMT frameworks?
To train models with data from multiple domains, it is necessary to merge, label and normalize the text corpora before training.
As further described in the [methods] section, I will use [PhraseApp](https://phraseapp.rocks) to import and preprocess all data sets.
In the thesis I will evaluate this approach and briefly compare it to other common tools referenced in the papers.
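
As a minimal sketch of what this merge-label-normalize step could look like: the corpus format (lists of sentence pairs per domain) and the domain tag spelling are my own assumptions, not PhraseApp's actual export format.

```python
def prepare(corpora: dict) -> list:
    """Merge several domain corpora into one training set, prefixing each
    source sentence with its domain tag and normalizing whitespace."""
    merged = []
    for domain, pairs in corpora.items():
        tag = f"<{domain}>"                # side-constraint token per domain
        for src, tgt in pairs:
            src = " ".join(src.split())    # normalize whitespace
            tgt = " ".join(tgt.split())
            merged.append((f"{tag} {src}", tgt))
    return merged

# Toy example with two domains and one sentence pair each.
corpora = {
    "finance": [("Quarterly  report", "Quartalsbericht")],
    "tech": [("Restart the server ", "Starte den Server neu")],
}
for src, tgt in prepare(corpora):
    print(src, "->", tgt)
# <finance> Quarterly report -> Quartalsbericht
# <tech> Restart the server -> Starte den Server neu
```

The output of such a step is a single labeled corpus that common NMT frameworks can consume directly.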
### Which attention model pairs best with side constraints as a domain control mechanism?
\TODO{state the obvs points from 2 architecture}
### How well do side constraints as a domain control mechanism perform on distant language pairs?
Language pairs from the same language family have a similar structure, so side constraints may give them a larger boost than more distant language pairs.
To evaluate this question I will include distant language pairs as well as languages from the same family and test them with and without side constraints.
### How well do side constraints as a domain control mechanism perform on an authentic industry data set?
I will optionally evaluate whether side constraints lead to an improvement in translation quality on a newly introduced, modern, multi-feature, multi-language text corpus.
This data is provided by PhraseApp, a translation management platform that allows companies to localize their applications in a simple manner.