Another Twitter sentiment analysis with Python-Part 2

This blog post is the second part of the Twitter sentiment analysis project I am currently doing for my capstone project at General Assembly London. You can find the first part here.

Re-cleaning the data

Before I move on to EDA and data visualisation, I have made some changes to the data cleaning step, because of flaws in the data cleaning function I defined in the previous post.

The first issue I noticed is that, during the cleaning process, negation words are split into two parts, and the ‘t’ after the apostrophe vanishes when I filter out tokens shorter than two characters. This makes words like “can’t” end up the same as “can”, which is not a trivial matter for sentiment analysis.
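
To illustrate with a made-up example: WordPunctTokenizer splits “can’t” at the apostrophe, and the one-character filter then drops both the apostrophe and the trailing “t”.

from nltk.tokenize import WordPunctTokenizer

tok = WordPunctTokenizer()
tokens = tok.tokenize("i can't stand it")
# WordPunctTokenizer output: ['i', 'can', "'", 't', 'stand', 'it']
filtered = [x for x in tokens if len(x) > 1]
print(filtered)  # ['can', 'stand', 'it'] - the negation is gone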

The second issue is that some URL links don’t start with “http”; sometimes people paste links in the “www.websitename.com” form. This wasn’t handled by the URL regex pattern I originally defined, ‘https?://[A-Za-z0-9./]+’. Another problem with this pattern is that it only matches letters, numbers, periods, and slashes, so it fails to catch the rest of a URL if it contains any other special character such as “=”, “_”, “~”, etc.

The third issue is with the regex pattern for Twitter IDs. In the previous cleaning function I defined it as ‘@[A-Za-z0-9]+’, but with a little googling I found out that a Twitter ID can also contain underscores; apart from the underscore, the only characters allowed are letters and numbers.
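
To quickly check the updated ID and URL patterns, here is a small test on an invented example (the handle and links below are made up for illustration):

import re

pat1 = r'@[A-Za-z0-9_]+'    # Twitter ID, underscore now allowed
pat2 = r'https?://[^ ]+'    # 'http(s)' URLs, cut at the first space
www_pat = r'www.[^ ]+'      # 'www.' URLs without the http prefix

sample = "@some_user check www.example.com/page?id=1 and https://example.com/a_b~c"
cleaned = re.sub(r'|'.join((pat1, pat2)), '', sample)
cleaned = re.sub(www_pat, '', cleaned)
print(cleaned)  # ' check  and '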

Below is the updated data cleaning function. The order of the cleaning steps is:

  1. Souping
  2. BOM removing
  3. URL address (‘http’ pattern) and Twitter ID removing
  4. URL address (‘www.’ pattern) removing
  5. Lower-casing
  6. Negation handling
  7. Removing numbers and special characters
  8. Tokenizing and joining

import pandas as pd  
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import re
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer
tok = WordPunctTokenizer()

pat1 = r'@[A-Za-z0-9_]+'
pat2 = r'https?://[^ ]+'
combined_pat = r'|'.join((pat1, pat2))
www_pat = r'www.[^ ]+'
negations_dic = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",
                 "haven't":"have not", "hasn't":"has not", "hadn't":"had not", "won't":"will not",
                 "wouldn't":"would not", "don't":"do not", "doesn't":"does not", "didn't":"did not",
                 "can't":"can not", "couldn't":"could not", "shouldn't":"should not", "mightn't":"might not",
                 "mustn't":"must not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')

def tweet_cleaner_updated(text):
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
    try:
        bom_removed = souped.decode("utf-8-sig").replace(u"\ufffd", "?")
    except:
        bom_removed = souped
    stripped = re.sub(combined_pat, '', bom_removed)
    stripped = re.sub(www_pat, '', stripped)
    lower_case = stripped.lower()
    neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], lower_case)
    letters_only = re.sub("[^a-zA-Z]", " ", neg_handled)
    # The substitution above creates unnecessary white spaces;
    # tokenizing and re-joining removes them
    words = [x for x in tok.tokenize(letters_only) if len(x) > 1]
    return (" ".join(words)).strip()

After I updated the cleaning function, I re-cleaned the whole 1.6 million entries in the dataset. You can find the details of the cleaning process in my previous post.

After the cleaning, I exported the cleaned data as a CSV file, then loaded it as a data frame, which looks as below.

csv = 'clean_tweet.csv'
my_df = pd.read_csv(csv,index_col=0)
my_df.head()
my_df.info()

It looks like there are some null entries in the data, so let’s investigate further.

my_df[my_df.isnull().any(axis=1)].head()
np.sum(my_df.isnull().any(axis=1))
my_df.isnull().any(axis=0)

It seems like 3,981 entries have a null value in the text column. This is strange, because the original dataset had no null entries, so any nulls in the cleaned dataset must have been introduced during the cleaning process.

df = pd.read_csv("./trainingandtestdata/training.1600000.processed.noemoticon.csv",header=None)
df.iloc[my_df[my_df.isnull().any(axis=1)].index,:].head()

Looking at these entries in the original data, it seems the only text they contained was either a Twitter ID or a URL address. Since that is exactly the information I decided to discard for the sentiment analysis, I will drop these null rows and update the data frame.

my_df.dropna(inplace=True)
my_df.reset_index(drop=True,inplace=True)
my_df.info()

Word Cloud

The first text visualisation I chose is the somewhat controversial word cloud. A word cloud represents word usage in a document by resizing individual words proportionally to their frequency and presenting them in a random arrangement. There has been a lot of debate around word clouds, and I somewhat agree with the people who argue against using them for data analysis. The main concerns are that they support only the crudest sort of textual analysis, they are often applied to situations where textual analysis is not appropriate, and they leave viewers to figure out the context of the data by themselves without providing any narrative.

But in the case of tweets, textual analysis is the most important analysis, and a word cloud gives a general idea of which words are frequent in the corpus in a quick and dirty way. So I will give it a go, and later figure out what other methods can be used for text visualisation.

For the word cloud, I used the Python library wordcloud.

neg_tweets = my_df[my_df.target == 0]
neg_string = []
for t in neg_tweets.text:
    neg_string.append(t)
neg_string = pd.Series(neg_string).str.cat(sep=' ')
from wordcloud import WordCloud

wordcloud = WordCloud(width=1600, height=800,max_font_size=200).generate(neg_string)
plt.figure(figsize=(12,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Some of the big words can be interpreted as quite neutral, such as “today”, “now”, etc. Some of the smaller words make sense in negative tweets, such as “damn”, “ugh”, “miss”, “bad”, etc. But “love” appears in a rather big size, so I wanted to see what was happening.

for t in neg_tweets.text[:200]:
    if 'love' in t:
        print(t)

OK, even though these tweets contain the word “love”, the sentiment is negative, because they mix emotions like “love” with “miss”, or use “love” in a sarcastic way.

pos_tweets = my_df[my_df.target == 1]
pos_string = []
for t in pos_tweets.text:
    pos_string.append(t)
pos_string = pd.Series(pos_string).str.cat(sep=' ')
wordcloud = WordCloud(width=1600, height=800, max_font_size=200, colormap='magma').generate(pos_string)
plt.figure(figsize=(12,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Again I see some neutral words in big sizes, such as “today” and “now”, but words like “haha”, “love”, and “awesome” also stand out.

Interestingly, the word “work” was quite big in the negative word cloud, but also quite big in the positive one. It might imply that many people express negative sentiment towards work, but many others feel positive about it.
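
One quick, rough way to check that impression is to count how many tweets in each class contain “work” (a plain substring match, so it also counts “working”, “worked”, and so on):

print(my_df[my_df.target == 0].text.str.contains('work').sum())
print(my_df[my_df.target == 1].text.str.contains('work').sum())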

Preparation for more data visualisation

In order to implement a couple of data visualisations in the next step, I need term frequency data: which words are used in the tweets, and how many times each is used in the entire corpus. I used a count vectorizer to calculate the term frequencies. Even though the count vectorizer is normally used to fit, train, and predict, at this stage I will just be extracting the term frequencies for visualisation.

There are parameter options available for the count vectorizer, such as removing stop words or limiting the maximum number of terms. However, in order to get a full picture of the dataset first, I kept the stop words and did not limit the maximum number of terms.
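
For reference, restricting the vocabulary would look something like the sketch below (the 10,000-term cap is an arbitrary number for illustration); in this post I stick with the defaults:

from sklearn.feature_extraction.text import CountVectorizer

# illustration only: drop English stop words and keep the 10,000 most frequent terms
cvec_limited = CountVectorizer(stop_words='english', max_features=10000)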

from sklearn.feature_extraction.text import CountVectorizer
cvec = CountVectorizer()
cvec.fit(my_df.text)
len(cvec.get_feature_names())

OK, it looks like the count vectorizer has extracted 264,936 words out of the corpus.

For the part below, I had to go through a lot of trial and error because of memory overload. If the code just takes a long time but keeps running, that is fine, but here it wasn’t a matter of how long the block took to run: my MacBook Pro simply gave up and either killed the kernel or froze. After numerous attempts, I finally succeeded by processing the data in batches.

The mistake I kept making was in how I sliced the document matrix. I tried slicing it as ‘document_matrix.toarray()[start_index:end_index]’, and I finally realised that because I was first converting the whole document_matrix to a dense array and only then slicing it, no matter how I changed the batch size my poor MacBook Pro couldn’t handle the request. After I changed the slicing to ‘document_matrix[start_index:end_index].toarray()’, my MacBook Pro did a wonderful job for me.
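
In short: slice the sparse matrix first, then densify only that slice. A minimal sketch of the difference, with made-up batch boundaries:

# memory-hungry: densifies the entire matrix before slicing
# batch = document_matrix.toarray()[start_index:end_index]

# memory-friendly: slice the sparse matrix, then densify only the small batch
start_index, end_index = 0, 10000   # example boundaries for one batch
batch = document_matrix[start_index:end_index].toarray()
batch_term_counts = np.sum(batch, axis=0)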

Also, your MacBook can be an excellent lap warmer on cold winter days while the task below is processing. In Asia, we call this “two birds with one stone”. You’re welcome.

Cats are the best at finding warm spots in winter!
document_matrix = cvec.transform(my_df.text)
my_df[my_df.target == 0].tail()
%%time
neg_batches = np.linspace(0, 798179, 100).astype(int)
i = 0
neg_tf = []
while i < len(neg_batches)-1:
    batch_result = np.sum(document_matrix[neg_batches[i]:neg_batches[i+1]].toarray(), axis=0)
    neg_tf.append(batch_result)
    if (i % 10 == 0) | (i == len(neg_batches)-2):
        print(neg_batches[i+1], "entries' term frequency calculated")
    i += 1
my_df.tail()
%%time
pos_batches = np.linspace(798179, 1596019, 100).astype(int)
i = 0
pos_tf = []
while i < len(pos_batches)-1:
    batch_result = np.sum(document_matrix[pos_batches[i]:pos_batches[i+1]].toarray(), axis=0)
    pos_tf.append(batch_result)
    if (i % 10 == 0) | (i == len(pos_batches)-2):
        print(pos_batches[i+1], "entries' term frequency calculated")
    i += 1
neg = np.sum(neg_tf,axis=0)
pos = np.sum(pos_tf,axis=0)
term_freq_df = pd.DataFrame([neg,pos],columns=cvec.get_feature_names()).transpose()
term_freq_df.head()
term_freq_df.columns = ['negative', 'positive']
term_freq_df['total'] = term_freq_df['negative'] + term_freq_df['positive']
term_freq_df.sort_values(by='total', ascending=False).iloc[:10]

OK, the term frequency data frame has been created! As you can see, the most frequent words are all stop words like “to”, “the”, etc. You might wonder why I went through all this heavy processing without removing the stop words or limiting the maximum number of terms, only to get term frequencies dominated by stop words. It is because I really want to test Zipf’s law with this result, and for that I need the stop words! You will see what I mean.

I will tell you more about Zipf’s law in the next post, and I promise there will be more visualisation. You will see that the above term frequency calculation wasn’t for nothing.

As always, you can find the Jupyter notebook at the link below.

https://github.com/tthustla/twitter_sentiment_analysis_part2/blob/master/Capstone_part3-Copy1.ipynb


