Another Twitter sentiment analysis with Python — Part 10 (Neural Network with Doc2Vec/Word2Vec/GloVe)
This is the 10th part of my ongoing Twitter sentiment analysis project. You can find the previous posts at the links below.
- Part 1: Data cleaning
- Part 2: EDA, Data visualisation
- Part 3: Zipf’s Law, Data visualisation
- Part 4: Feature extraction (count vectorizer), N-gram, confusion matrix
- Part 5: Feature extraction (Tfidf vectorizer), machine learning model comparison, lexical approach
- Part 6: Doc2Vec
- Part 7: Phrase modeling + Doc2Vec
- Part 8: Dimensionality reduction (Chi2, PCA)
- Part 9: Neural Networks with Tfidf vectors
In the previous post, I implemented neural network modelling with Tf-idf vectors, but found that the neural network did not perform well with high-dimensional sparse data. In this post, I will see whether feeding it document vectors from Doc2Vec models, or word vectors, makes any difference compared with Tf-idf vectors.
*In addition to the short code blocks I attach, you can find a link to the whole Jupyter Notebook at the end of this post.
Neural Networks with Doc2Vec
Before I jump into neural network modelling with the vectors I got from Doc2Vec, I would like to give you some background on how I got these document vectors. I implemented Doc2Vec using the Gensim library in the 6th part of this series.
There are three different methods used to train Doc2Vec: Distributed Bag of Words (DBOW), Distributed Memory with mean (DMM), and Distributed Memory with concatenation (DMC). These models were trained on 1.5 million tweets over 30 epochs, and the output of each model is a 100-dimensional vector for each tweet. After I got document vectors from each model, I tried concatenating them in two combinations, DBOW + DMM and DBOW + DMC (so the concatenated document vectors have 200 dimensions), and saw an improvement in performance compared with any single method on its own. Using different training methods and concatenating the results to improve performance was already demonstrated by Le and Mikolov (2014) in their research paper.
Finally, I applied phrase modelling to detect bigram and trigram phrases as a pre-step of Doc2Vec training, and tried different combinations across n-grams. When tested with a logistic regression model, I got the best result from the ‘unigram DBOW + trigram DMM’ document vectors.
I will first start by loading Gensim’s Doc2Vec, and define a function to extract document vectors, then load the doc2vec model I trained.
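As a minimal sketch of the concatenation step, assume each trained model exposes a lookup from a document tag to its 100-dimensional vector. Plain dictionaries stand in for the two Gensim models here, and the function name `get_concat_vectors` and the tag names are hypothetical, chosen only for illustration.

```python
import numpy as np

def get_concat_vectors(lookup1, lookup2, tags, size):
    """Stack the vectors of two models side by side for each document tag.

    lookup1 / lookup2 are stand-ins (assumption) for the per-model
    document-vector lookups of two trained Doc2Vec models.
    `size` is the dimension of the concatenated vector.
    """
    vecs = np.zeros((len(tags), size))
    for row, tag in enumerate(tags):
        vecs[row] = np.concatenate([lookup1[tag], lookup2[tag]])
    return vecs

# Random vectors in place of trained ones, for illustration only.
rng = np.random.RandomState(0)
tags = ["all_0", "all_1", "all_2"]
dbow = {t: rng.rand(100) for t in tags}
dmm = {t: rng.rand(100) for t in tags}

train_vecs = get_concat_vectors(dbow, dmm, tags, 200)
print(train_vecs.shape)  # (3, 200)
```

The first 100 columns of each row come from the first model and the last 100 from the second, so the classifier sees both representations of the same tweet at once.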
When fed to a simple logistic regression, the concatenated document vectors (unigram DBOW + trigram DMM) yield 75.90% training set accuracy and 75.76% validation set accuracy.
I will try different numbers of hidden layers and hidden nodes to compare the performance. In the code block below, you can see that I first define the seed as “7” without yet setting the random seed; “np.random.seed()” is called at the start of each model instead. This makes the results from different model structures reproducible.
*Side Note (reproducibility): To be honest, this took me a while to figure out. I first tried setting the random seed before importing Keras, and ran one model after another. However, if I defined the same model structure again after it had run, I couldn’t get the same result. I also realised that if I restart the kernel and re-run the code blocks from the start, I get the same result as in the previous kernel session. So I figured that running a model advances the state of the random number generator, and that is why I cannot get the same result from the same structure if I run it twice in a row in the same kernel. That is why I set the random seed every time I try a different model. For your information, I am running Keras with the Theano backend, using only the CPU and not a GPU. If you are on the same setting, this should work. I explicitly specified the backend as Theano by launching Jupyter Notebook from the command line as follows: “KERAS_BACKEND=theano jupyter notebook”
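The behaviour described in the side note can be reproduced with NumPy alone. In this sketch, `init_weights` is a hypothetical stand-in for the random weight initialisation a model performs; it is not Keras code.

```python
import numpy as np

seed = 7

def init_weights(shape):
    # Stand-in (assumption) for the random initialisation of a model's weights.
    return np.random.uniform(-0.05, 0.05, size=shape)

np.random.seed(seed)
first_run = init_weights((4, 3))
second_run = init_weights((4, 3))   # the generator state has advanced

np.random.seed(seed)                # reset before the "next model"
third_run = init_weights((4, 3))

print(np.array_equal(first_run, second_run))  # False
print(np.array_equal(first_run, third_run))   # True
```

Setting the seed once at the top of a notebook fixes only the first draw; resetting it before each model makes every model start from the same generator state.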
Please note that not all of the dependencies loaded in the cell below are used in this post; some are imported for later use.
Defining the different model structures involves quite repetitive code, so I will just give you two examples so that you can understand how to define model structures with Keras.
After trying 12 different models with a range of hidden layers (from 1 to 3) and a range of hidden nodes per hidden layer (64, 128, 256, 512), below is the result I got. The best validation accuracy (79.93%) is from “model_d2v_09” at epoch 7, which has 3 hidden layers with 256 hidden nodes each.
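To make the search grid concrete, the sketch below enumerates the 12 combinations and counts the trainable parameters of each fully connected network. The single output unit and the numbering order are assumptions; the order was picked so that index 09 lands on the 3 x 256 architecture named above, and the original notebook may have numbered the models differently.

```python
from itertools import product

INPUT_DIM = 200   # concatenated document vectors
OUTPUT_DIM = 1    # assumed: one sigmoid unit for binary sentiment

def dense_param_count(input_dim, hidden_sizes, output_dim):
    """Weights plus biases of a fully connected network."""
    sizes = [input_dim] + list(hidden_sizes) + [output_dim]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

node_options = (64, 128, 256, 512)
layer_options = (1, 2, 3)

# Hypothetical naming scheme: vary the layer count fastest.
architectures = {}
for idx, (nodes, layers) in enumerate(product(node_options, layer_options), start=1):
    architectures["model_d2v_{:02d}".format(idx)] = [nodes] * layers

for name, hidden in architectures.items():
    print(name, hidden, dense_param_count(INPUT_DIM, hidden, OUTPUT_DIM))
```

Under these assumptions the 3 x 256 network has 183,297 parameters, which is small compared with the 1.5 million training tweets.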
Now that I know which model gives me the best result, I will run the final model, “model_d2v_09”, but this time with callback functions in Keras. I was not very familiar with callback functions in Keras before I received a comment on my previous post. After I got the comment, I did some digging and found all the useful functions in Keras callbacks. Thanks to @rcshubha for the comment. With my final Doc2Vec model below, I used “checkpoint” and “early_stop”. You can set the “checkpoint” function with options; with the parameter setting below, “checkpoint” keeps the best-performing model seen so far, and saves a new model only when an epoch outperforms the one already saved. I defined “early_stop” to monitor validation accuracy: if no epoch beats the best validation accuracy so far for 5 epochs in a row, training stops.
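The interplay of the two callbacks can be illustrated in plain Python. This is a sketch of the logic only, not Keras code: `run_with_callbacks` is a hypothetical helper, and the accuracy values are made up for illustration rather than taken from the results in this post.

```python
def run_with_callbacks(val_accuracies, patience=5):
    """Simulate best-only checkpointing and early stopping.

    Returns (best_epoch, best_accuracy, last_epoch_run) for a list of
    per-epoch validation accuracies.
    """
    best_acc = float("-inf")
    best_epoch = None
    wait = 0
    last_epoch = len(val_accuracies)
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            # "checkpoint": this epoch becomes the saved best model.
            best_acc, best_epoch = acc, epoch
            wait = 0
        else:
            # "early_stop": count epochs without improvement.
            wait += 1
            if wait >= patience:
                last_epoch = epoch
                break
    return best_epoch, best_acc, last_epoch

# Made-up validation accuracies, for illustration only.
history = [0.70, 0.74, 0.76, 0.75, 0.77, 0.76, 0.76, 0.75, 0.76, 0.74, 0.73, 0.72]
print(run_with_callbacks(history))  # (5, 0.77, 10)
```

Training stops at epoch 10 because epochs 6 to 10 all fail to beat epoch 5, while the saved model is still the one from epoch 5. That gap is why the model in memory and the model on disk can give different evaluation results.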
If I evaluate the model I just ran, it gives me the same result as the last epoch.
But if I load the model saved at the best epoch, it gives me the result from that epoch.
from keras.models import load_model
loaded_model = load_model('d2v_09_best_weights.07-0.7993.hdf5')
If you remember the validation accuracy with the same vector representation of the tweets with a logistic regression model (75.76%), you can see that feeding the same information to a neural network yields a significantly better result. It’s amazing to see how a neural network can boost the performance of dense vectors, but the best validation accuracy is still lower than that of the Tfidf vectors + logistic regression model, which gave me 82.92% validation accuracy.
If you have read my posts on Doc2Vec, or are familiar with Doc2Vec, you might know that you can also extract word vectors for each word from the trained Doc2Vec model. I will move on to Word2Vec and try different methods to see if any of them can outperform the Doc2Vec result (79.93%), and ultimately the Tfidf + logistic regression model (82.92%).
To make use of word vectors extracted from the Doc2Vec models, I can no longer use the concatenated vectors of different n-grams, since they do not share the same vocabulary. Thus below, I load the unigram DMM model and concatenate its vectors with those of unigram DBOW, giving a 200-dimensional vector for each word in the vocabulary.
Before I try neural networks with document representations computed from word vectors, I will fit logistic regressions with various methods of document representation. I will then define a neural network model using the one that gives me the best validation accuracy.
I will also give you a summary of the results from all the different word vectors fit with logistic regression as a table.
Word vectors extracted from Doc2Vec models (Average/Sum)
There are a number of ways to build a document representation from individual word vectors. One obvious choice is to average them. For every word in a tweet, check whether the trained Doc2Vec model has a vector for that word; if so, add it to a running sum and count it. Finally, divide the summed vector by the count to get the averaged word vector for the whole document, which has the same dimension (200 in this case) as the individual word vectors.
Another method is just to sum the word vectors without averaging them. This might distort the vector representation of the document if some tweets have only a few words in the Doc2Vec vocabulary while others have most of their words in it. But I will try both summing and averaging and compare the results.
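Both aggregation methods can be sketched in a few lines. The function name `doc_vector` and the toy 3-dimensional vectors are hypothetical; a dictionary stands in for the word-vector lookup of a trained model.

```python
import numpy as np

def doc_vector(tokens, word_vectors, size, aggregation="mean"):
    """Combine the vectors of in-vocabulary tokens into one document vector."""
    total = np.zeros(size)
    count = 0
    for word in tokens:
        if word in word_vectors:        # skip out-of-vocabulary words
            total += word_vectors[word]
            count += 1
    if aggregation == "mean" and count > 0:
        total /= count
    return total

# Toy 3-dimensional vectors (the post uses 200 dimensions).
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "day": np.array([3.0, 2.0, 0.0]),
}
tweet = "what a good day".split()

print(doc_vector(tweet, toy_vectors, 3, "mean"))  # [2. 1. 1.]
print(doc_vector(tweet, toy_vectors, 3, "sum"))   # [4. 2. 2.]
```

The mean divides by the number of words that were found, so its scale does not depend on tweet length, whereas the sum grows with every matched word. A tweet with no known words yields a zero vector in both cases.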
The validation accuracy with averaged word vectors of unigram DBOW + unigram DMM is 71.74%, which is significantly lower than with document vectors extracted from unigram DBOW + trigram DMM (75.76%). It is also lower than the 75.51% validation accuracy that, as I know from the 6th part of this series, document vectors extracted from unigram DBOW + unigram DMM give me.
I also tried scaling the vectors using scikit-learn’s scale function, and saw a significant improvement in computation time and a slight improvement in accuracy.
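For reference, this is roughly what standardisation does to a feature matrix. It is a NumPy sketch that assumes scikit-learn's `scale` is used with its defaults (each column centred to zero mean and divided by its standard deviation); `standardize` is a hypothetical name.

```python
import numpy as np

def standardize(features):
    """Column-wise zero mean and unit variance."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0.0] = 1.0            # leave constant columns unscaled
    return (features - mean) / std

# Two columns on very different scales.
raw = np.array([[1.0, 200.0],
                [2.0, 400.0],
                [3.0, 600.0]])
scaled = standardize(raw)

print(scaled.mean(axis=0))  # approximately [0. 0.]
print(scaled.std(axis=0))   # approximately [1. 1.]
```

A common explanation for the faster fit is that iterative solvers tend to converge in fewer steps when all features are on a comparable scale. That is a general observation, not something measured in this post.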
Let’s see how summed word vectors perform compared to the averaged counterpart.
The summation method gave me higher accuracy without scaling compared to the average method. But the simple logistic regression with the summed vectors took more than 3 hours to run. So again I tried scaling these vectors.
Surprising! With scaling, logistic regression fitting only took 3 minutes! That’s quite a difference. Validation accuracies with scaled word vectors are 72.42% with averaging and 72.51% with summing.
Word vectors extracted from Doc2Vec models with TFIDF weighting (Average/Sum)
In the 5th part of this series, I already explained what TF-IDF is. TF-IDF is a way of weighting each word by calculating the product of relative term frequency and inverse document frequency. Since it gives one scalar value for each word in the vocabulary, it can also be used as a weighting factor for each word vector. Correa Jr. et al. (2017) implemented this Tf-idf weighting in their paper “NILC-USP at SemEval-2017 Task 4: A Multi-view Ensemble for Twitter Sentiment Analysis”.
In order to get the Tfidf value for each word, I first fit and transform the training set with TfidfVectorizer and create a dictionary containing (“word”, “tfidf value”) pairs. I also defined a function, “get_w2v_general”, to get either averaged or summed word vectors from a given Word2Vec model. Finally, I multiply each word vector by its corresponding Tfidf value. In the code below, the Tfidf multiplication part took quite a bit of time. To be honest, I am still not sure why it took so long to compute the Tfidf weighting of the word vectors, but after 5 hours it finally finished. You will also see later that I tried another method of weighting, which took less than 10 seconds. If you know why, any insight would be appreciated.
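The weighting step itself can be sketched without scikit-learn. This example assumes the per-word scalar is a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1; that choice, along with the names `idf_weights` and `weight_vectors`, is an assumption for illustration and may differ from what the notebook computes.

```python
import math
from collections import Counter

import numpy as np

def idf_weights(documents):
    """One scalar per word: smoothed inverse document frequency (assumed weighting)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.split()))   # count each word once per document
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

def weight_vectors(word_vectors, weights):
    """Scale each word vector by its weight; words without a weight are dropped."""
    return {w: vec * weights[w] for w, vec in word_vectors.items() if w in weights}

docs = ["good day", "bad day", "good good movie"]
weights = idf_weights(docs)

# Toy 2-dimensional vectors, for illustration only.
toy_vectors = {"good": np.array([1.0, 1.0]), "movie": np.array([1.0, 1.0])}
weighted = weight_vectors(toy_vectors, weights)

print(weighted["movie"][0] > weighted["good"][0])  # True: the rarer word gets the larger weight
```

Written this way, the weighting is a single pass over the vocabulary with one dictionary lookup and one multiplication per word.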
Then I can call the “get_w2v_general” function in the same way I did with the unweighted word vectors above, and fit a logistic regression model. The validation accuracy is 70.57% for mean and 70.32% for sum. The results are not what I expected, especially after 5 hours of waiting. Weighting the word vectors with Tfidf values lowered the validation accuracy by around 2 percentage points for both averaging and summing.
Word vectors extracted from Doc2Vec models with custom weighting (Average/Sum)
In the 3rd part of this series, I defined a custom metric called “pos_normcdf_hmean”, which is borrowed from the presentation by Jason Kessler at PyData 2017 Seattle. If you want to know more about the calculation, you can either check my previous post or watch Jason Kessler’s presentation. To give you a high-level intuition: by calculating the harmonic mean of the CDF (Cumulative Distribution Function) transformed values of a term’s frequency rate within the whole document and its frequency within a class, you get a meaningful metric which shows how strongly each word is related to a certain class.
I used this metric to visualise tokens in the 3rd part of the series, and used it again to create a custom lexicon for classification in the 5th part. I will now use it as a weighting factor for the word vectors and see how it affects the performance.
The validation accuracy is 73.27% for mean and 70.94% for sum. Unlike Tfidf weighting, the custom weighting actually gave me a performance boost when used with the averaging method. But with summing, it performed no better than the word vectors without weighting.
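The sketch below follows the high-level description above: a normal CDF transform of two per-term quantities, followed by their harmonic mean. The exact definitions of the two quantities are assumptions based on that description, and the term counts are made up for illustration.

```python
import math

import numpy as np

def normcdf(values):
    """Normal CDF of each value, using the sample's own mean and standard deviation."""
    mean, std = values.mean(), values.std()
    z = (values - mean) / (std * math.sqrt(2.0))
    return np.array([0.5 * (1.0 + math.erf(v)) for v in z])

def pos_normcdf_hmean(pos_counts, neg_counts):
    pos_counts = np.asarray(pos_counts, dtype=float)
    neg_counts = np.asarray(neg_counts, dtype=float)
    # Assumed definitions:
    pos_rate = pos_counts / (pos_counts + neg_counts)  # share of a term's uses that are positive
    pos_freq = pos_counts / pos_counts.sum()           # term's share of all positive tokens
    a, b = normcdf(pos_rate), normcdf(pos_freq)
    return 2.0 * a * b / (a + b)                       # harmonic mean

# Made-up counts for four terms, for illustration only.
terms = ["love", "thanks", "the", "sad"]
pos = [90, 60, 500, 5]
neg = [10, 15, 500, 95]

scores = dict(zip(terms, pos_normcdf_hmean(pos, neg)))
for term, score in scores.items():
    print(term, round(float(score), 3))
```

Each score lies between 0 and 1 and can be used as a per-word multiplier for the word vectors, in the same way as any other scalar weight. The harmonic mean stays low unless both transformed quantities are reasonably high.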
Word vectors extracted from pre-trained GloVe (Average/Sum)
GloVe is another kind of word representation in vectors proposed by Pennington et al. (2014) from the Stanford NLP Group.
The difference between Word2Vec and GloVe is how the two models compute the word vectors. In Word2Vec, the word vectors are a kind of by-product of a shallow neural network trained to predict either the centre word given its surrounding words, or vice versa. With GloVe, the word vectors are what the model directly optimises for, and it computes them using a term co-occurrence matrix and dimensionality reduction.
The good news is that you can now easily load and use the pre-trained GloVe vectors from Gensim thanks to its latest update (Gensim 3.2.0). In addition to some pre-trained word vectors, new datasets have also been added, and these can be easily downloaded using the downloader API. If you want to know more about this, please check this blog post by RaRe Technologies.
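To show the kind of statistic GloVe starts from, here is a small sketch that counts co-occurrences within a fixed window. It only builds raw counts; the actual GloVe model also down-weights distant pairs and then fits word vectors to these counts, which this example leaves out. The function name `cooccurrence_counts` is hypothetical.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context) pair appears within `window` tokens."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = ["i love this movie", "i love this song"]
counts = cooccurrence_counts(corpus, window=1)

print(counts[("love", "this")])   # 2
print(counts[("this", "movie")])  # 1
```

Word2Vec never materialises such a table; it sees the same word and context pairs one at a time while training its network.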
The Stanford NLP Group has made their pre-trained GloVe vectors publicly available, and among them, there are GloVe vectors trained specifically with Tweets. This sounds like something definitely worth trying. They have four different versions of Tweet vectors each with different dimensions (25, 50, 100, 200) trained on 2 billion Tweets. You can find more detail on their website.
For this post, I will use the 200-dimensional pre-trained GloVe vectors. (You might need to update Gensim if your version is lower than 3.2.0.)
By using pre-trained GloVe vectors, I can see that the validation accuracy improved significantly. So far the best validation accuracy was from the averaged word vectors with custom weighting, which gave me 73.27%; in comparison, GloVe vectors yield 76.27% and 76.60% for average and sum respectively.
Word vectors extracted from pre-trained Google News Word2Vec (Average/Sum)
With the newly updated Gensim, I can also load the famous pre-trained Google News word vectors. These word vectors were trained using the Word2Vec model on the Google News dataset (about 100 billion words) and published by Google. The model contains 300-dimensional vectors for 3 million words and phrases. You can find more detail in the Google project archive.
The validation accuracy is 74.96% for mean and 74.92% for sum. Although this is a better result than the word vectors extracted from my custom-trained Doc2Vec models, it fails to outperform the GloVe vectors, even though the Google News word vectors have a larger dimension.
However, these vectors were trained on Google News, while the GloVe vectors I used were trained specifically on Tweets, so it is hard to compare the two directly. What if Word2Vec is trained specifically on Tweets?
Separately trained Word2Vec (Average/Sum)
I know I have already tried word vectors extracted from Doc2Vec models, but what if I train separate Word2Vec models? Even though the Doc2Vec models gave good document-level representations, would pure Word2Vec learn the word vectors more effectively?
To answer my own question, I trained two Word2Vec models, one using CBOW (Continuous Bag Of Words) and one using Skip Gram. In terms of parameters, I used the same settings as for Doc2Vec.
- size of vectors: 100 dimensions
- negative sampling: 5
- window: 2
- minimum word count: 2
- alpha: 0.065 (decrease alpha by 0.002 per epoch)
- number of epochs: 30
With the above settings, I defined the CBOW model by passing “sg=0”, and the Skip Gram model by passing “sg=1”.
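The learning-rate setting in the list above implies a simple linear schedule, which can be written out as plain arithmetic. This sketch assumes the decrement is applied after each epoch, so that the first epoch runs at the starting alpha; the constant names are hypothetical.

```python
START_ALPHA = 0.065
ALPHA_STEP = 0.002
EPOCHS = 30

# Learning rate used in each epoch, rounded to avoid floating-point noise.
alphas = [round(START_ALPHA - ALPHA_STEP * epoch, 3) for epoch in range(EPOCHS)]

print(alphas[0])    # 0.065
print(alphas[-1])   # 0.007
```

Under that assumption the rate stays positive for all 30 epochs, ending at 0.007 in the last epoch.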
Once I have the results from the two models, I concatenate the vectors of the two models for each word, so that the concatenated vectors give a 200-dimensional representation of each word.
Please note that in the 6th part, where I trained Doc2Vec, I used “LabeledSentence” function imported from Gensim. This has now been deprecated, thus for this post I used “TaggedDocument” function instead. The usage is the same.
The concatenated vectors of the unigram CBOW and unigram Skip Gram models yielded 76.50% and 76.75% validation accuracy with the mean and sum methods respectively. These results are even higher than the results I got from GloVe vectors.
But please do not take this as a general statement. This is an empirical finding in this particular setting.
Separately trained Word2Vec with custom weighting (Average/Sum)
As a final step, I will apply the custom weighting I have implemented above and see if this affects the performance.
Finally, I get the best-performing word vectors. Averaged word vectors (from separately trained Word2Vec models) weighted with the custom metric yielded the best validation accuracy of 77.97%! Below is the table of all the results I tried above.
Neural Network with Word2Vec
The best-performing word vectors with logistic regression were chosen to feed to a neural network model. This time I did not try various architectures. Based on what I observed during trials of different architectures with Doc2Vec document vectors, the best-performing architecture was the one with 3 hidden layers and 256 hidden nodes in each hidden layer.
I will finally fit a neural network with early stopping and a checkpoint, so that I can save the weights that perform best on validation accuracy.
from keras.models import load_model
loaded_w2v_model = load_model('w2v_01_best_weights.10-0.8048.hdf5')
It took quite some time for me to try different settings and different calculations, but I learned some valuable lessons through all the trial and error. Specifically trained Word2Vec with carefully engineered weighting can even outperform Doc2Vec in the classification task.
In the next post, I will try a more sophisticated neural network model, the Convolutional Neural Network. Again, I hope this will give me some boost in performance.
Thank you for reading, and you can find the Jupyter Notebook at the link below.
Another Twitter sentiment analysis with Python — Part 10 (Neural Network with… was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Gurupriyan is a software engineer and technology enthusiast who has been working in the field for the last 6 years, currently focusing on mobile app development and IoT.