Arabic Translation System: Seq2Seq Statistical learning

Building and modeling an Arabic machine translation system using deep neural networks with PyTorch.

In this post we discuss a sequence-to-sequence model for Arabic machine translation.

Introduction

Deep Neural Networks (DNNs) are very powerful machine learning models. They have been used in many natural language processing and computer vision tasks and have achieved great performance.

We can see how well these models work in products such as Alexa, which can literally read a web page for you, or Google Home when it responds to your commands, and of course we shouldn't forget Facebook's face recognition and Snapchat's ability to manipulate faces. These products are built with the power of neural networks.

Neural networks come in many architectures with different use cases that can be applied to different problems. One example is the Convolutional Neural Network (CNN), which is very powerful for visual image recognition and object detection and, according to some of the research literature, for voice recognition applications too.

In a previous post of mine I explained the theoretical aspects of how a Convolutional Neural Network (CNN) works in recognizing and classifying sequences of handwritten digits versus a Recurrent Neural Network (RNN); both of them showed very good results.

Although deep neural networks are very powerful models that can capture complex information when trained well on large samples, they have a weak point when the input and output vary in length.

For example, when we deal with an image classification problem we have samples of a fixed size; if that is not the case, we pre-process the images to normalize the training samples. We then apply a multi-layer perceptron (MLP) or a convolutional network (CNN), which encodes the information in the training samples into a fixed-size vector and makes predictions by computing a fixed-size probability vector.

Classification VS. Object Detection

RNNs are a type of neural network specialized in processing sequential data, and they can scale well to long sequences of values. An RNN is very powerful because it shares its weights across the entire model over the time-step iterations.

Examples of problems that are best expressed with sequence-to-sequence modeling:

  • Speech Recognition
  • Machine Translation
  • Question Answering
  • Named Entity Recognition
  • POS Tagging

Theory

Let's talk a bit about the theory behind RNNs and the mechanism by which they remember state: the network builds a mapping between successive states as it iterates step by step over the input sequence.

Given a sequence of input tokens
\begin{equation}
(x_{1}, x_{2}, x_{3}, ... x_{T})
\end{equation}

The RNN should be able to compute a sequence of output tokens
\begin{equation}
(y_{1}, y_{2}, y_{3}, ... y_{T})
\end{equation}
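
One common way to write the recurrence that produces these outputs is sketched below. The hidden state h_{t} and the weight matrices W^{hx}, W^{hh} and W^{yh} are notation assumed here for illustration; the post itself does not define them.
\begin{equation}
h_{t} = \tanh(W^{hx} x_{t} + W^{hh} h_{t-1}), \qquad y_{t} = \mathrm{softmax}(W^{yh} h_{t})
\end{equation}

The same three matrices are reused at every time step t, which is what sharing the weights means in practice.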

We have already mentioned that the weights are shared across time steps, and you may want to understand the rationale behind these shared parameters and why they are important. For instance, assume we want to translate the following sentences:

It was raining in Egypt yesterday.
In Egypt it was raining yesterday.
Yesterday it was raining in Egypt.

If we ask the machine to translate them, it should produce:

كانت تمطر في مصر البارحة

If one or more of these three sentences was not seen during training, the translation should still be the same even though the words occur at different time steps (different positions). Parameter sharing is a great advantage here: the same weights are applied at every time step, and each hidden state is a function of the current input and the previous hidden state, so what the model learns generalizes across sequences of different forms.

Moreover, suppose we train a traditional fully connected feed-forward network (FC) on the same problem. Such a network would have separate weights for each input position, so it would in effect learn an independent language model for each position and might not generalize (a dummy mapping).
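
To make the idea of shared parameters concrete, here is a minimal sketch of a scalar RNN in plain Python. The function name rnn_forward and the weight values are made up for illustration and are not part of the model trained in this post. The point is only that one set of weights processes sequences of any length.

```python
import math


def rnn_forward(tokens, w_xh, w_hh, b_h, h0=0.0):
    """Run a scalar RNN over a sequence, reusing the same weights at every step."""
    h = h0
    states = []
    for x in tokens:
        # The new state depends on the current input and the previous state.
        h = math.tanh(w_xh * x + w_hh * h + b_h)
        states.append(h)
    return states


# Illustrative weights (hypothetical values, not learned).
params = {"w_xh": 0.5, "w_hh": 0.8, "b_h": 0.0}

# The same three parameters handle a short and a long sequence.
short_states = rnn_forward([1.0, 2.0], **params)
long_states = rnn_forward([1.0, 2.0, 3.0, 4.0, 5.0], **params)
```

A fully connected network would instead need a separate weight for every input position, so its parameter count would grow with the sequence length.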

Visualization of each step using GRU unit

Practice

We’ll need a unique index per word to use as the inputs and targets of the network later. To keep track of all this we will use a helper class called Dictionary, which has word → index (word_to_idx) and index → word (idx_to_word) dictionaries, as well as a count of each word (words_count) that can be used later to replace rare words.

In the next snippet we build a Dictionary data structure to represent the vocabulary of both the input and the output language; then we use a Reader instance to read and tokenize the data in a more organized way.

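import pandas as pd
import torch

import tools  # the author's helper module (not shown here); provides cleaner_job

SOS = 0  # start-of-sentence token index
EOS = 1  # end-of-sentence token index
FILES = {'ar': 'ara.txt', 'en': 'eng.txt'}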


class Dictionary:
    def __init__(self, name):
        print('Building Dictionary for lang %s' %(name))
        self.name = name
        self.word_to_idx = {}
        self.idx_to_word = {0: "<S>", 1: "</S>"}
        self.words_count = {}
        self.n_words = 2  # Count SOS, EOS special tokens

    def add_word(self, word):
        if word not in self.word_to_idx:
            self.word_to_idx[word] = self.n_words
            self.words_count[word] = 1
            self.idx_to_word[self.n_words] = word
            self.n_words += 1
        else:
            self.words_count[word] += 1

    def add_sentence(self, sentence):
        for word in sentence.split():
            self.add_word(word)


class Reader:

    def __init__(self, lang, path, max_len=10, min_len=1, max_chars=40):
        self.lang = lang
        self.path = path
        self.sentences = []
        self.max_len = max_len
        self.min_len = min_len
        self.max_chars = max_chars

    def read(self):
        print('Reading Sentences for language %s ...' %(self.lang))
        # Read as UTF-8 explicitly so the Arabic text decodes the same on every platform.
        with open(self.path, 'r', encoding='utf-8') as reader:
            self.sentences += [tools.cleaner_job(line) for line in reader]
        return self.sentences

    def get_tokenized(self):
        print('Tokenizing %s ...' %(self.lang))
        return [[w for w in s.split()] for s in self.sentences]
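
The word counts kept by Dictionary are meant for replacing rare words later, but the post does not show that step. A minimal sketch of how it could look is below. The function name replace_rare_words, the `<UNK>` placeholder and the min_count threshold are assumptions made for this example.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical placeholder token, not defined in the post


def replace_rare_words(sentences, min_count=2):
    """Replace every word seen fewer than min_count times with UNK."""
    counts = Counter(word for s in sentences for word in s.split())
    return [
        " ".join(w if counts[w] >= min_count else UNK for w in s.split())
        for s in sentences
    ]


sentences = ["I also went", "I went home", "Tom went home"]
cleaned = replace_rare_words(sentences, min_count=2)
# cleaned == ["I <UNK> went", "I went home", "<UNK> went home"]
```

Doing this before building the dictionary keeps the vocabulary smaller, at the cost of losing the rare words themselves.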

With the Reader base class written, we extend it a little to handle the input and output sentences together.

To read the data we split each file into lines, and the lines of the two files are later zipped into pairs.

class SentenceReader(Reader):

    def __init__(self, input_lang_map, outpt_lang_map, max_length=10):
        super().__init__(input_lang_map, outpt_lang_map, max_len=max_length)
        # Each map holds a single {language: path} entry.
        in_lang, in_path = next(iter(input_lang_map.items()))
        out_lang, out_path = next(iter(outpt_lang_map.items()))
        self.input = Reader(lang=in_lang, path=in_path, max_len=max_length)
        self.outpt = Reader(lang=out_lang, path=out_path, max_len=max_length)
        self.lang_map = {in_lang: self.input, out_lang: self.outpt}
        self.input.read()
        self.outpt.read()
        self.to_remove = set()
        self.to_have = set()
        self.__filter_sentences()
        
    def read_sentences(self, lang):
        reader = self.lang_map.get(lang)
        if reader is None:
            raise AttributeError("Invalid Language Attribute")
        return reader.sentences
        
    def get_tokenized(self, lang):
        print('Tokenizing %s ...' %(lang))
        sentences = self.lang_map.get(lang).sentences
        return [[w for w in s.split()] for s in sentences]
    
    def __filter_sentences(self):
        max_len, min_len = self.max_len, self.min_len
        to_have = self.to_have
        to_remove = self.to_remove

        # Drop a pair when either side is too long or too short.
        for reader in [self.input, self.outpt]:
            for i, sentence in enumerate(reader.sentences):
                n_words = len(sentence.split())
                if n_words >= max_len or n_words < min_len:
                    to_remove.add(i)
                else:
                    to_have.add(i)

        # Select the same sorted indices in both languages so the pairs stay aligned.
        keep = sorted(to_have - to_remove)
        for reader in [self.input, self.outpt]:
            reader.sentences = pd.Series(reader.sentences)[keep].tolist()
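
The filtering rule used by SentenceReader can also be sketched as a standalone function without pandas. The name filter_pairs and the sample sentences are chosen for this example only. A pair is kept when both sides have at least min_len words and fewer than max_len words.

```python
def filter_pairs(src_sentences, tgt_sentences, min_len=1, max_len=10):
    """Keep aligned pairs whose two sides both satisfy min_len <= words < max_len."""
    def ok(sentence):
        return min_len <= len(sentence.split()) < max_len

    kept = [(s, t) for s, t in zip(src_sentences, tgt_sentences) if ok(s) and ok(t)]
    # Return two parallel lists so index i still refers to the same pair.
    return [s for s, _ in kept], [t for _, t in kept]


src = ["Shut the door", "He tried to absorb as much", ""]
tgt = ["أغلق الباب", "حاول أن يستوعب", "لا"]
kept_src, kept_tgt = filter_pairs(src, tgt, min_len=1, max_len=4)
# Only the first pair survives: the second source sentence is too long
# and the third one is empty.
```

Filtering both languages with the same indices matters because a sentence dropped on one side only would shift every later pair out of alignment.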

The next snippet defines a function that accepts the input-language and output-language file paths and builds the dataset from them.


def prepare_dataset(in_lang_path, out_lang_path, max_length=10):
    print('Preparing Dataset')
    
    reader = SentenceReader({'en': in_lang_path}, {'ar': out_lang_path}, max_length=max_length)

    pairs = [(in_sent, out_sent) for (in_sent, out_sent) in zip(reader.read_sentences('en'), 
                                                                reader.read_sentences('ar'))]

    print('Building Language Dictionary')

    in_dictionary = Dictionary('english')
    ou_dictionary = Dictionary('arabic')
    for in_sent, out_sent in pairs:
        in_dictionary.add_sentence(in_sent)
        ou_dictionary.add_sentence(out_sent)

    print('Encoding Tokenized Sentences to unique identifiers')
    in_word_to_idx = in_dictionary.word_to_idx
    ou_word_to_idx = ou_dictionary.word_to_idx
    pairs_encoded = [([in_word_to_idx[in_w]  for in_w in in_sent], 
                      [ou_word_to_idx[out_w] for out_w in out_sent]) 
                                             for (in_sent, out_sent) 
                                             in zip(reader.get_tokenized('en'), 
                                                    reader.get_tokenized('ar'))]
    
    return {
        'pairs': pairs, 
        'pairs_encoded': pairs_encoded, 
        'in_dictionary': in_dictionary, 
        'out_dictionary': ou_dictionary,
    }
    
def pad_sequence(indices):
    max_len = 0
    for seq in indices:
        if len(seq) > max_len:
            max_len = len(seq)
            
    for seq in indices:
        seq += [EOS] * (max_len - len(seq))

def index_to_tensor(pairs):
    input_idxs = []
    output_idxs = []
    tensors = {}
    
    for pair in pairs.copy():
        input_idxs += [pair[0] + [EOS]]
        output_idxs += [[SOS] + pair[1] + [EOS]]
        
    for indices in [input_idxs, output_idxs]:
        pad_sequence(indices)
        
    print(input_idxs[:10])
    print(output_idxs[:10])
        
    tensors['input'] = torch.tensor(input_idxs)
    tensors['target'] = torch.tensor(output_idxs)
    
    return tensors
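
As a quick illustration of the padding step, here is a small sketch that returns new lists instead of modifying the input in place. The function name padded and the sample batch are assumptions for this example. Like pad_sequence above, it pads with the EOS index.

```python
EOS = 1  # same end-of-sentence index used in the post


def padded(sequences, pad_value=EOS):
    """Return copies of the sequences, right-padded to the longest length."""
    max_len = max((len(seq) for seq in sequences), default=0)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


batch = [[5, 6, 7, EOS], [8, EOS]]
result = padded(batch)
# result == [[5, 6, 7, 1], [8, 1, 1, 1]]
```

Every row now has the same length, which is what torch.tensor needs to build a rectangular tensor. Reusing EOS as the padding value is simple, though a dedicated padding index is a common alternative when padded positions should be ignored in the loss.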

data = loader.prepare_dataset('./eng.txt', './ara.txt', max_length=16)
Preparing Dataset
Reading Sentences for language en ...
Reading Sentences for language ar ...
Building Language Dictionary
Building Dictionary for lang english
Building Dictionary for lang arabic
Encoding Tokenized Sentences to unique identifiers
Tokenizing en ...
Tokenizing ar ...

import random

data_text = [random.choice(data.get('pairs')) for i in range(10)]
data_encoded = [random.choice(data.get('pairs_encoded')) for i in range(10)]

for item in data_text:
    print(item)

for item in data_encoded:
    print(item)

('Im sure Tom will tell us the truth ', 'متأكد بأن توم سيخبرنا بالحقيقة ')
('Watch out Theres a big hole there ', 'انتبه هناك حفرة كبيرة هناك ')
('May I borrow your dictionary ', 'هل لي أن أستعير قاموسك؟ ')
('Im sure Ill find a good gift for Tom ', 'أنا متأكد أني سأجد هدية جيدة لتوم ')
('I dont know if I have the time ', 'لا أعلم إن كان لدي ما يكفي من الوقت ')
('Dont forget these ', 'لا تنس هذه ')
('Michael Jackson died ', 'توفي مايكل جاكسون ')
('Who knows what might happen in the future ', 'من يعلم ما قد يحصل مستقبلًا؟ ')
('I also went ', 'ذهبت أيضاً ')
('May I interrupt ', 'هل لي أن أقاطع؟ ')
([403, 802, 115, 1921], [627, 1722, 4256])
([159, 472, 8], [232, 88, 1441])
([11, 310, 332, 90, 2306], [1563, 1562, 2008])
([235, 55, 67], [77, 775, 18, 108])
([2986, 2987, 115, 1235, 2988], [7294, 3936, 7295])
([317, 3290, 115, 3291, 396, 694], [8257, 8258, 2953])
([1155, 451, 115, 833], [816, 716, 3437])
([19, 3116, 396, 451, 8, 396, 2841], [2630, 1397, 209, 7657])
([19, 559, 124, 804, 1550, 127], [861, 6070, 6071])
([11, 362, 124, 667, 3728, 454], [162, 411, 724, 9740, 9741])

Neural Network Modeling Using PyTorch

HIDDEN_SIZE = 128
INPUT_SIZE = data.get('in_dictionary').n_words
OUTPUT_SIZE = data.get('out_dictionary').n_words

tensors = data.get('pairs_encoded')

dictionaries = {'output': data.get('out_dictionary'),
                'input' : data.get('in_dictionary')}

model = Seq2seqModel(iterations=1000000, lr=0.01, hidden_size=HIDDEN_SIZE, dictionaries=dictionaries,
                     input_size=INPUT_SIZE, output_size=OUTPUT_SIZE, max_length=16)


""" 
------ Training ------
"""

model.train(tensors=tensors) 


""" 
------ Evaluating ------ 
"""
sentences = data.get('pairs')
model.evaluate(sentences)

Input  | Shut the door 
Output | أغلق الباب 
Pred   | أغلق  الباب  </S>
========================================================================
Input  | Im afraid Tom doesnt want to talk to you 
Output | أخشى أن توم لا يريد التحدث معك 
Pred   | أخشى  أن  لا  لا  </S>
========================================================================
Input  | Do you remember the day when you and I first met 
Output | هل تذكر اليوم الذي تقابلنا فيه أنا وأنت أول مرة؟ 
Pred   | هل  تذكر  اليوم  الذي  تقابلنا  فيه  اليوم  </S>
========================================================================
Input  | I dont care about profit 
Output | أنا لا أهتم للربح 
Pred   | لا  أهتم  أهتم  للربح  </S>
========================================================================
Input  | Tom had to make a decision 
Output | توجب على توم أن يتخذ قرارا 
Pred   | توجب  أن  توم  يتخذ  يتخذ  قرارا  </S>
========================================================================
Input  | He tried to absorb as much of the local culture as possible 
Output | حاول أن يستوعب أكبر قدر من الثقافة المحلية قدر الإمكان 
Pred   | حاول  أن  يستوعب  أكبر  قدر  الثقافة  الثقافة  المحلية  المحلية  قدر  </S>
========================================================================
Input  | In Japan a new school year starts in April 
Output | في اليابان السنة الدراسية الجديدة تبدأ في أبريل 
Pred   | في  هناك  السنة  الدراسية  تبدأ  في  تبدأ  </S>
========================================================================
Input  | Dont play baseball here 
Output | لا تلعب كرة القاعدة هنا 
Pred   | لا  تلعب  كرة  القاعدة  هنا  </S>
========================================================================
Input  | Whats important is not the goal but the journey 
Output | ما هو مهم ليس الهدف ولكن الرحلة 
Pred   | ليس  ما  مهم  الهدف  ولكن  الرحلة  </S>
========================================================================
Input  | I figured youd be impressed 
Output | توقعت أنك ستنبهر 
Pred   | توقعت  أن  ستنبهر  </S>
========================================================================
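
The Seq2seqModel class used above is not listed in this post. As a rough idea of what such a model can contain, here is a minimal sketch of a GRU encoder-decoder with greedy decoding in PyTorch. The class names Encoder and Decoder, the function greedy_translate and the toy vocabulary sizes are assumptions for illustration, and the untrained weights produce arbitrary output.

```python
import torch
import torch.nn as nn

SOS, EOS = 0, 1  # special token indices, as in the post


class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) -> final hidden state: (1, batch, hidden)
        _, hidden = self.gru(self.embedding(src))
        return hidden


class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token, hidden):
        # token: (batch, 1) -> logits over the output vocabulary: (batch, vocab)
        output, hidden = self.gru(self.embedding(token), hidden)
        return self.out(output.squeeze(1)), hidden


def greedy_translate(encoder, decoder, src, max_length=16):
    """Decode one sentence of shape (1, src_len), taking the top word at each step."""
    with torch.no_grad():
        hidden = encoder(src)
        token = torch.tensor([[SOS]])
        result = []
        for _ in range(max_length):
            logits, hidden = decoder(token, hidden)
            token = logits.argmax(dim=1, keepdim=True)
            if token.item() == EOS:
                break
            result.append(token.item())
    return result


torch.manual_seed(0)
encoder = Encoder(vocab_size=20, hidden_size=8)   # toy input vocabulary
decoder = Decoder(vocab_size=30, hidden_size=8)   # toy output vocabulary
src = torch.tensor([[5, 6, 7, EOS]])
prediction = greedy_translate(encoder, decoder, src, max_length=16)
```

The encoder compresses the whole input sentence into its final hidden state, and the decoder starts from that state and emits one output word per step until it predicts the end-of-sentence token or reaches max_length.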
