Learning an Attentional LSTM Model
- This tutorial uses the Seattle Public Library collection inventory dataset
- Each book's title and subjects are used to build a representation for the book
- The representation is learned with a simple attentional LSTM/GRU autoencoder
- Similar books are then clustered based on their titles using a simple k-means algorithm
A Brief Note on the Embedding Layer
- The embedding layer maps discrete input words to dense vectors, which are far more efficient to compute with than sparse one-hot inputs
- The output of the embedding layer is fed to an encoder
- LSTM-type encoder
- has 3 gates that protect and control the cell state:
- Input gate: defines how much of the newly computed state to let through
- Forget gate: decides what information is kept and what is thrown away
- Output gate: decides which parts of the updated cell state are exposed as the output (hidden state)
- GRU-type encoder
- same as the LSTM except it has only 2 gates, so it carries fewer trainable weights (see the sketch after this list):
- Reset gate: determines how to combine the new input with the previously saved state
- Update gate: defines how much of the previous memory to keep around
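As a rough illustration of the 3-gate vs 2-gate difference, here is a minimal Keras sketch (layer sizes are illustrative and independent of the model trained later) that stacks an embedding layer with an LSTM encoder and with a GRU encoder, then compares their parameter counts:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, GRU

vocab_size, embed_size, maxlen = 30, 50, 30  # illustrative sizes only

# embedding followed by an LSTM encoder (3 gates)
lstm_encoder = Sequential([
    Embedding(vocab_size, embed_size, input_length=maxlen),
    LSTM(embed_size, return_sequences=True),
])

# embedding followed by a GRU encoder (2 gates)
gru_encoder = Sequential([
    Embedding(vocab_size, embed_size, input_length=maxlen),
    GRU(embed_size, return_sequences=True),
])

# the GRU encoder ends up with noticeably fewer trainable parameters than the LSTM encoder
print(lstm_encoder.count_params(), gru_encoder.count_params())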
Now Reading our Dataset
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import matplotlib.pyplot as plt

root_dir = "../input"
batch_size = 64
vocab_size = 30
maxlen = 30
nrows = 450000  # due to the Kaggle kernel limit, set this to the number of training rows needed

library = pd.read_csv(os.path.join(root_dir, 'library-collection-inventory.csv'),
                      usecols=['Title', 'Author', 'Publisher', 'Subjects'])
library.dropna(inplace=True)
# randomly subsample nrows rows (sampled with replacement)
library = library.iloc[np.random.randint(0, library.shape[0], size=nrows)]
Now to the Important Part of This Tutorial: Model Definition
- If the dataset is small, a GRU will tend to perform better than an LSTM: training the weights of 2 gates (as explained previously) is faster than training 3, and the smaller model makes overfitting less likely (though you can certainly still overfit)
- For small datasets, keep the number of trainable parameters small to avoid overfitting, and add a dropout layer
Now to the model part (using Keras, as it is easy to use)
Attention Layer Implementation
- The attention layer (Bahdanau et al., 2015) part is adapted from https://github.com/bfelbo/DeepMoji
Why Attention?
- As explained previously, an LSTM/GRU unit keeps its previous output as internal state, so the unit carries information about the previous context
- However, by the time the last word of the input sentence is being encoded, the context of (for example) the first word has largely vanished from the state stored in the GRU/LSTM unit; in text classification, machine translation, or chatbots this can be important information, so we want to keep track of the whole sentence context while encoding the current word
- The attention layer creates a context vector from the outputs of all previously encoded words while decoding the current word
- In other words, the decoding of the current word attends to the previously encoded words (a small NumPy sketch of this weighted average follows)
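Concretely, the layer implemented below learns a single weight vector, scores every encoded timestep with it, turns the scores into a softmax distribution over timesteps, and returns the weighted average of the encoder outputs. A minimal NumPy sketch of that computation (array shapes and names are illustrative):
import numpy as np

timesteps, channels = 5, 4
x = np.random.rand(timesteps, channels)  # encoder outputs, one row per timestep
W = np.random.rand(channels, 1)          # the single learned attention vector

logits = x.dot(W).ravel()                     # one score per timestep
weights = np.exp(logits - logits.max())       # 'max trick' for numerical stability
weights = weights / weights.sum()             # softmax over the timesteps
context = (x * weights[:, None]).sum(axis=0)  # attention-weighted average, shape (channels,)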
from keras import initializers
from keras.engine import InputSpec, Layer
from keras import backend as K
class AttentionWeightedAverage(Layer):
    """
    Computes a weighted average of the different channels across timesteps.
    Uses 1 parameter per channel to compute the attention value for a single timestep.
    """

    def __init__(self, return_attention=False, **kwargs):
        self.init = initializers.get('uniform')
        self.supports_masking = True
        self.return_attention = return_attention
        super(AttentionWeightedAverage, self).__init__(**kwargs)

    def build(self, input_shape):
        self.input_spec = [InputSpec(ndim=3)]
        assert len(input_shape) == 3
        self.W = self.add_weight(shape=(input_shape[2], 1),
                                 name='{}_W'.format(self.name),
                                 initializer=self.init)
        self.trainable_weights = [self.W]
        super(AttentionWeightedAverage, self).build(input_shape)

    def call(self, x, mask=None):
        # computes a probability distribution over the timesteps
        # uses 'max trick' for numerical stability
        # reshape is done to avoid issue with Tensorflow
        # and 1-dimensional weights
        logits = K.dot(x, self.W)
        x_shape = K.shape(x)
        logits = K.reshape(logits, (x_shape[0], x_shape[1]))
        ai = K.exp(logits - K.max(logits, axis=-1, keepdims=True))

        # masked timesteps have zero weight
        if mask is not None:
            mask = K.cast(mask, K.floatx())
            ai = ai * mask
        att_weights = ai / (K.sum(ai, axis=1, keepdims=True) + K.epsilon())
        weighted_input = x * K.expand_dims(att_weights)
        result = K.sum(weighted_input, axis=1)
        if self.return_attention:
            return [result, att_weights]
        return result

    def get_output_shape_for(self, input_shape):
        return self.compute_output_shape(input_shape)

    def compute_output_shape(self, input_shape):
        output_len = input_shape[2]
        if self.return_attention:
            return [(input_shape[0], output_len), (input_shape[0], input_shape[1])]
        return (input_shape[0], output_len)

    def compute_mask(self, input, input_mask=None):
        if isinstance(input_mask, list):
            return [None] * len(input_mask)
        else:
            return None
The Model Class
- A scikit-learn-style wrapper is built around our GRU/LSTM autoencoder to make it easier to use (and to allow grid search for the best parameters)
- Feel free to play with the architecture to get better results; this is just a simple baseline for learning attention-based models
- Also, to get better results you can feed the embedding layer with a pretrained word2vec or GloVe model and set its trainable parameter to False (using the following wrapper: https://gist.github.com/mostafaelaraby/1e4b07992a57cb3be4393676e551c7ee); a minimal sketch of the frozen-embedding idea follows this list
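As a minimal sketch of the frozen-embedding idea, assuming an embedding_matrix has already been built from the pretrained vectors with one row per word index (the names and sizes here are only placeholders):
import numpy as np
from keras.layers import Embedding

vocab_size, embed_size = 30, 50                        # must match the tokenizer and the matrix width
embedding_matrix = np.zeros((vocab_size, embed_size))  # placeholder for the precomputed pretrained vectors

frozen_embedding = Embedding(vocab_size, embed_size,
                             weights=[embedding_matrix],
                             trainable=False)  # keep the pretrained vectors fixed during training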
from keras.layers import Embedding, GlobalMaxPool1D, Dense, Input, Dropout, SpatialDropout1D, RepeatVector
from keras.models import Model
from keras.layers.recurrent import LSTM, GRU
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers.core import *
from sklearn.base import BaseEstimator, ClassifierMixin

# scikit-learn-style class so grid search and other sklearn utilities can be used with the Keras model
class NeuralClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, attention=False, type='lstm', embed_size=50, vocab_size=10000,
                 nepochs=5, patience=4, batch_size=16, maxlen=100):
        self.attention = attention
        self.type = type
        self.vocab_size = vocab_size
        self.nepochs = nepochs
        self.patience = patience
        self.batch_size = batch_size
        self.maxlen = maxlen
        self.embed_size = embed_size
        self.file_path = 'autoencoder_' + self.type + '.best.hdf5'
        self.val_size = 0.3
        self.tokenizer = None
        self.autoencoder = None
        self.history = None

    def get_params(self, deep=True):
        return {"attention": self.attention, "type": self.type, "vocab_size": self.vocab_size,
                "nepochs": self.nepochs, "patience": self.patience, "batch_size": self.batch_size,
                "maxlen": self.maxlen, "embed_size": self.embed_size}

    def set_params(self, **parameters):
        # sklearn expects this method name for grid search
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

    def init_model(self, out_dim=20):
        if 'lstm' in self.type:
            return self.lstm_model(out_dim)
        elif 'gru' in self.type:
            return self.gru_model(out_dim)

    def lstm_model(self, out_dim):
        # encoder: embedding -> LSTM -> (attention or max pooling) -> dense bottleneck
        inp = Input(shape=(self.maxlen,))
        embedding = Embedding(self.vocab_size, self.embed_size)(inp)
        encoder = LSTM(self.embed_size, return_sequences=True, dropout=0.1,
                       recurrent_dropout=0.1, name='LSTM_encoder')(embedding)
        if self.attention:
            encoder = AttentionWeightedAverage()(encoder)
        else:
            encoder = GlobalMaxPool1D()(encoder)
        encoder = Dense(self.embed_size, activation="relu")(encoder)
        # decoder: repeat the bottleneck vector and reconstruct a bag-of-words target
        decoded = RepeatVector(self.maxlen)(encoder)
        decoder = LSTM(self.embed_size, return_sequences=False, dropout=0.1,
                       recurrent_dropout=0.1, name='LSTM_decoder')(decoded)
        decoder = Dense(out_dim, activation="softmax", name="decoder_out")(decoder)
        autoencoder = Model(inputs=inp, outputs=decoder)
        autoencoder.compile(optimizer='rmsprop', loss='categorical_crossentropy')
        return autoencoder

    def gru_model(self, out_dim):
        inp = Input(shape=(self.maxlen,))
        embedding = Embedding(self.vocab_size, self.embed_size)(inp)
        embedding = SpatialDropout1D(0.1)(embedding)
        encoder = GRU(self.embed_size, return_sequences=True, name='GRU_encoder')(embedding)
        if self.attention:
            encoder = AttentionWeightedAverage()(encoder)
        else:
            encoder = GlobalMaxPool1D()(encoder)
        encoder = Dense(self.embed_size, activation="relu")(encoder)
        decoded = RepeatVector(self.maxlen)(encoder)
        decoder = GRU(self.embed_size, return_sequences=False, dropout=0.1,
                      recurrent_dropout=0.1, name='GRU_decoder')(decoded)
        decoder = Dense(out_dim, activation="softmax", name="decoder_out")(decoder)
        autoencoder = Model(inputs=inp, outputs=decoder)
        autoencoder.compile(optimizer='rmsprop', loss='categorical_crossentropy')
        return autoencoder

    def tokenize_set(self, train_sentences, test_sentences):
        # fit the tokenizer on the training sentences and pad everything to maxlen
        self.tokenizer = Tokenizer(num_words=self.vocab_size)
        self.tokenizer.fit_on_texts(list(train_sentences))
        list_tokenized_train = self.tokenizer.texts_to_sequences(train_sentences)
        list_tokenized_test = self.tokenizer.texts_to_sequences(test_sentences)
        X_t = pad_sequences(list_tokenized_train, maxlen=self.maxlen)
        X_te = pad_sequences(list_tokenized_test, maxlen=self.maxlen)
        return X_t, X_te

    def fit(self, X, sentences):
        checkpoint = ModelCheckpoint(self.file_path, monitor='val_loss', verbose=1,
                                     save_best_only=True, save_weights_only=True, mode='min')
        early = EarlyStopping(monitor="val_loss", mode="min", patience=self.patience)
        callbacks_list = [early, checkpoint]
        # reconstruction target: binary bag-of-words of the input sentences
        y_train = self.tokenizer.texts_to_matrix(sentences, mode='binary')
        self.autoencoder = self.init_model(out_dim=y_train.shape[1])
        self.history = self.autoencoder.fit(X, y_train, batch_size=self.batch_size,
                                            epochs=self.nepochs, validation_split=self.val_size,
                                            callbacks=callbacks_list, shuffle=True)
        return self

    def predict(self, X):
        # reload the best weights saved by the checkpoint before predicting
        self.autoencoder.load_weights(self.file_path)
        return self.autoencoder.predict([X], batch_size=1024, verbose=1)
Preparing the Input Training Data
list_titles = library.apply(lambda x: x['Title'] + ' ' + x['Subjects'], axis=1).fillna("_na_").values
Train Attentional GRU autoencoder Model
model_with_attention = NeuralClassifier(attention=True, batch_size=batch_size, type='gru_attention', vocab_size=vocab_size, maxlen=maxlen, nepochs=20)
train_features, _ = model_with_attention.tokenize_set(list_titles, [])
model_with_attention.fit(train_features, list_titles)
Train GRU Model without attention
model_without_attention = NeuralClassifier(attention=False, batch_size=batch_size, type='gru', vocab_size=vocab_size, maxlen=maxlen, nepochs=20)
# reuse the tokenizer fitted above so both models see identical inputs
model_without_attention.tokenizer = model_with_attention.tokenizer
model_without_attention.fit(train_features, list_titles)
Validation Loss plotting of the autoencoder
def show_loss(history, name):
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title(name)
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

show_loss(model_with_attention.history, 'GRU Model With Attention')
show_loss(model_without_attention.history, 'GRU Model')
Clustering Books Based on Their Titles and Subjects Using KMeans
from string import punctuation
exclude = set(punctuation)
cleaner = lambda title: ''.join(ch for ch in title if ch not in exclude)

from sklearn.cluster import KMeans
from wordcloud import WordCloud

kmeans = KMeans()
feats = model_with_attention.predict(train_features)
# fit_transform returns the distance of each book to every cluster center
y_kmeans = kmeans.fit_transform(feats)
cluster_indx = 'Cluster_Index'
library[cluster_indx] = kmeans.labels_
library['distance'] = y_kmeans.min(axis=1)
library.sort_values(by='distance', inplace=True, axis=0)
labels = library[cluster_indx].unique()
wc = WordCloud(background_color="white", max_words=7, width=600, height=300)
for label in labels:
    word_cloud = wc.generate_from_frequencies({book_name: 1 for book_name in library[library[cluster_indx] == label]['Title'].apply(lambda x: x[:15]).values.tolist()})
    plt.figure(figsize=(20, 10))
    plt.imshow(word_cloud, interpolation="bilinear")
    plt.title('Label name ' + str(label) + ' having ' + str(library[library[cluster_indx] == label].shape[0]) + ' books')
    plt.show()
Some Statistics on the Resulting Clusters
- Number of book types published by each author/publisher (usually the same author/publisher tends to publish within a specific type of books)
import seaborn as sns

def plot_n_cluster_per(aggr_col):
    sns.set(font_scale=1)
    f, ax = plt.subplots(figsize=(16, 32))
    max_n_observations = 10
    n_groups = library.groupby(aggr_col)[cluster_indx].nunique()
    n_groups = n_groups.nlargest(max_n_observations)
    colors_cw = sns.color_palette('cubehelix_r', len(n_groups))
    sns.barplot(n_groups.values, n_groups.keys(), palette=colors_cw[::-1])
    ax.set(title='Type of Books published by the same ' + aggr_col)

plot_n_cluster_per('Author')
plot_n_cluster_per('Publisher')
library.to_csv('output_clusters.csv', index=False)
References
- Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In Y. Bengio & Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1409.0473