Twitter Sentiment Analysis
July 16, 2014
I finally got around to working on this problem, however simple it may be.
The algorithm was proposed by another ‘Data Science’ course participant. I haven’t implemented the algorithm from this paper;
I can explore that later.
The simple algorithm discussed in the forums is as follows.
1. Find all words in a tweet that exist in a master list. This list already associates a valence score with each word. Scores can be positive or negative numbers.
2. Find the scores of these words and add them. This is the total score of the tweet.
3. Find all words from the tweet that don’t exist in the master list. These are the non-sentimental words.
4. If such a non-sentimental word occurs in a tweet with a positive score, add 1 to a value associated with this word. If the non-sentimental word occurs in a tweet with a negative score, or if the score is ‘0’, subtract 1 from the value associated with the word. The effect on the sentiment of treating a score of ‘0’ as negative (the else branch of the if statement) is not explored. As I mentioned, this is a simple algorithm.
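The four steps above can be sketched on toy data as follows. This is only a minimal illustration, not the implementation from the forums: the master list `MASTER_LIST`, the sample tweets, and the helper names `tweet_score` and `update_unknown_words` are all made up for the example, and each word keeps a single net value, as step 4 describes.

```python
from collections import defaultdict

# Hypothetical master list; a real one would be loaded from a file.
MASTER_LIST = {"good": 3, "love": 3, "bad": -3, "awful": -3}


def tweet_score(words, master):
    """Steps 1-2: sum the scores of the words found in the master list."""
    return sum(master[w] for w in words if w in master)


def update_unknown_words(words, master, values):
    """Steps 3-4: move every word missing from the master list by +1 or -1."""
    total = tweet_score(words, master)
    delta = 1 if total > 0 else -1  # a score of 0 is treated as negative
    for w in words:
        if w not in master:
            values[w] += delta
    return total


values = defaultdict(int)
for tweet in ["good coffee", "awful traffic", "love coffee", "traffic again"]:
    update_unknown_words(tweet.lower().split(), MASTER_LIST, values)

print(dict(values))  # {'coffee': 2, 'traffic': -2, 'again': -1}
```

Note that "traffic again" contains no word from the master list, so its score is 0 and both of its words are pushed in the negative direction. This is the unexplored case mentioned in step 4.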
This is accomplished by using a dictionary of words, with each word associated with a list of two values: one for the positive accumulator and one for the negative accumulator. The value reported for a word is the positive accumulator minus the negative accumulator.
```python
import json
import os
import re
import sys


class Sentiment(object):
    def __init__(self):
        # argv[1] is the sentiment score file, argv[2] is the tweet file.
        self.sent_path, self.tweet_path = sys.argv[1], sys.argv[2]
        for path in (self.sent_path, self.tweet_path):
            if not (os.path.isfile(path) and os.access(path, os.R_OK)):
                sys.exit("Either files are missing or they are not readable")
        self.scores = {}
        # word -> [positive accumulator, negative accumulator]
        self.nonsentimentalwords = {}

    def loadscores(self):
        with open(self.sent_path, 'r') as sent_file:
            for line in sent_file:
                term, score = line.split("\t")  # The file is tab-delimited.
                self.scores[term] = int(score)  # Convert the score to an integer.

    def score(self, text):
        # Total score of a tweet: the sum of the scores of its known words.
        count = 0
        for s in text.lower().split():
            count += self.scores.get(s, 0)
        return count

    def addnonsentimentalwords(self, text):
        # Register words missing from the master list with zeroed accumulators.
        for s in text.lower().split():
            if s not in self.scores and s not in self.nonsentimentalwords:
                self.nonsentimentalwords[s] = [0, 0]

    def scorenonsentimentalwords(self, text, count):
        # A positive tweet bumps the first accumulator; a negative or
        # zero score bumps the second.
        for s in text.lower().split():
            if s in self.nonsentimentalwords:
                if count > 0:
                    self.nonsentimentalwords[s][0] += 1
                else:
                    self.nonsentimentalwords[s][1] += 1

    def analyze(self):
        with open(self.tweet_path, 'r') as f:
            for data in f:
                try:
                    d = json.loads(data)
                    if d.get('text') and d.get('lang') == 'en':
                        # Keep only letters and whitespace.
                        tex = re.sub(r"[^A-Za-z\s]", "", d['text'])
                        count = self.score(tex)
                        self.addnonsentimentalwords(tex)
                        self.scorenonsentimentalwords(tex, count)
                except (ValueError, KeyError, TypeError):
                    print("Error")
        for key, value in self.nonsentimentalwords.items():
            print(str(key) + " " + str(value[0] - value[1]))


if __name__ == '__main__':
    sentiment = Sentiment()
    sentiment.loadscores()
    sentiment.analyze()
```
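To try the script without real data, the two input files can be mocked up. The sketch below writes them in the formats the code reads: a tab-delimited score file and a tweet file with one JSON object per line, of which only the `text` and `lang` fields are used. The file names, the helper `write_sample_inputs`, and the sample tweets are assumptions made for illustration.

```python
import json
import os
import tempfile


def write_sample_inputs(directory):
    """Write a tiny score file and tweet file; return both paths."""
    sent_path = os.path.join(directory, "sample_scores.txt")
    tweet_path = os.path.join(directory, "sample_tweets.txt")

    # One "term<TAB>score" pair per line.
    with open(sent_path, "w") as f:
        f.write("good\t3\nbad\t-3\n")

    # One JSON object per line; the script only reads 'text' and 'lang'.
    tweets = [
        {"text": "Good coffee today!", "lang": "en"},
        {"text": "Bad traffic again", "lang": "en"},
        {"text": "Buen dia", "lang": "es"},  # skipped by the script: not English
    ]
    with open(tweet_path, "w") as f:
        for t in tweets:
            f.write(json.dumps(t) + "\n")
    return sent_path, tweet_path


sent_path, tweet_path = write_sample_inputs(tempfile.mkdtemp())
print(sent_path, tweet_path)
```

The script would then be run with the score file as the first argument and the tweet file as the second. It prints one line per non-sentimental word, followed by that word's net value.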