Frequency of occurrence of a term in a tweet dataset

I am a novice Python coder and this algorithm is simple, but I am still overjoyed that my Python coding skills are improving.

Number of occurrences of a term / Total number of unique words
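To make the formula concrete, here is a minimal sketch of the intended computation on two made-up tweets; the real script below builds the same kind of dictionary from a tweet file.

tweets = ["the cat saw the dog", "the dog barked"]

counts = {}
for tweet in tweets:
    for term in tweet.lower().split():
        counts[term] = counts.get(term, 0) + 1

# "the" occurs 3 times and there are 5 unique terms, so its frequency is 3/5 = 0.6
unique_terms = len(counts)
for term, count in counts.items():
    print(term, "%.6f" % (count / float(unique_terms)))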

import json
import os.path
import re
import sys


class Frequency(object):

    def __init__(self):
        # The tweet file is passed as the first command-line argument.
        if not (os.path.isfile(sys.argv[1]) and os.access(sys.argv[1], os.R_OK)):
            print("Either the file is missing or it is not readable")
            sys.exit(1)
        self.allterms = {}


    def totalterms(self, text):
        # Register every unique term in the tweet; counts start at zero.
        for s in text.split():
            if s.lower() not in self.allterms:
                self.allterms[s.lower()] = 0

    def calculatefrequency(self, text):
        # Count every occurrence of a term already registered by totalterms.
        for s in text.split():
            if s.lower() in self.allterms:
                self.allterms[s.lower()] += 1
                
    def analyze(self):
        with open(sys.argv[1], 'r') as f:
            for data in f:
                try:
                    d = json.loads(data)
                    # Only keep English tweets that actually contain text.
                    if d.get('text') and d.get('lang') == 'en':
                        # Strip everything except letters and whitespace.
                        tex = re.sub(r"[^A-Za-z\s]", "", d['text'])
                        self.totalterms(tex)
                        self.calculatefrequency(tex)
                except (ValueError, KeyError, TypeError):
                    print("Error")
        # Normalise once, after all tweets have been read:
        # frequency = occurrences of a term / total number of unique terms.
        total_unique = len(self.allterms)
        for key in self.allterms:
            self.allterms[key] = self.allterms[key] / float(total_unique)
        for key, value in self.allterms.items():
            print(key + " " + ("%.6f" % value))
            
                  
if __name__ == '__main__':
    frequency = Frequency()
    frequency.analyze()
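The script takes the tweet file (one JSON object per line) as its only command-line argument, so it is run with something like python frequency.py tweets.json, where both file names are just placeholders. A sample of the output, each term followed by its computed frequency: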
nigga 0.000027
old 0.000027
worldcup 0.000002
list 0.000027
it 0.000002
years 0.000027
see 0.025000
done 0.000004
have 0.025000
shit 0.000027
rt 0.000002
from 0.025000
also 0.000002
top 0.000027
had 0.000002
guitarmandan 0.000002
to 0.000004
win 0.000002
you 0.050000
today 0.000027
me 0.025027
fr 0.000781
someone 0.000002
but 0.000781
moment 0.025000
germany 0.000002
no 0.025000
not 0.025781
come 0.000027
cool 0.000027
a 0.000027
on 0.000027
like 0.000027
of 0.000027
hes 0.000027
well 0.000004
chance 0.025000
calling 0.000027
caring 0.025000
the 0.025027

Twitter Sentiment Analysis

I finally got around to working on this problem, however simple it may be.

The algorithm was proposed by another ‘Data Science’ course participant; I haven’t implemented the algorithm from this paper.

I can explore that later.

The simple algorithm discussed in the forums is as follows.

1. Find all the words in a tweet that exist in a master list. The list associates a valence score with each word; scores can be positive or negative numbers.

2. Look up the scores of these words and add them up. This is the total score of the tweet.

3. Find all the words in the tweet that don’t exist in the master list. These are the non-sentimental words.

4. If such a non-sentimental word occurs in a tweet with a positive score, add one to a value associated with this word. If it occurs in a tweet with a negative score, or with a score of 0, subtract one from the value associated with the word. The effect of treating a score of 0 the same as a negative score (the else branch of the check) is not explored; as I mentioned, this is a simple algorithm.

This is accomplished with a dictionary that maps each word to a list of two values: a positive accumulator and a negative accumulator.
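As a rough illustration of steps 1 to 4 on a single tweet, here is a minimal sketch; the tiny scores dictionary and the example tweet are made up and only stand in for the real master list and dataset.

scores = {"good": 3, "bad": -3}   # stand-in for the master list of valence scores
accumulators = {}                 # word -> [positive hits, negative hits]

tweet = "service was good but the queue was bad".split()

# Steps 1 and 2: total valence score of the tweet.
total = sum(scores.get(word, 0) for word in tweet)

# Steps 3 and 4: credit or debit every word that is not in the master list.
for word in tweet:
    if word not in scores:
        accumulators.setdefault(word, [0, 0])
        if total > 0:
            accumulators[word][0] += 1
        else:
            accumulators[word][1] += 1

for word, (pos, neg) in accumulators.items():
    print(word, pos - neg)

With this example tweet the total score is 0, so every non-sentimental word is debited, exactly as step 4 prescribes.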

import json
import sys
import os.path
import re

class Sentiment(object):

    def __init__(self):
        # sys.argv[1] is the sentiment (scores) file, sys.argv[2] is the tweet file.
        if not (os.path.isfile(sys.argv[1]) and os.access(sys.argv[1], os.R_OK)
                and os.path.isfile(sys.argv[2]) and os.access(sys.argv[2], os.R_OK)):
            print("Either the files are missing or they are not readable")
            sys.exit(1)
        self.nonsentimentalwords = {}

    def loadscores(self):
        # The sentiment file is tab-delimited: one "term<TAB>score" pair per line.
        self.scores = {}
        with open(sys.argv[1], 'r') as sent_file:
            for line in sent_file:
                term, score = line.split("\t")
                self.scores[term] = int(score)

    def score(self, text):
        # Sum the valence scores of every word in the tweet that is in the master list.
        count = 0
        for s in text.split():
            if s.lower() in self.scores:
                count += self.scores[s.lower()]
        return count
   
    def scorenonsentimentalwords(self, text, count):
        # Credit or debit each non-sentimental word with the tweet's overall score.
        for s in text.split():
            word = s.lower()
            if word not in self.scores and word in self.nonsentimentalwords:
                if count > 0:
                    self.nonsentimentalwords[word][0] += 1
                else:
                    self.nonsentimentalwords[word][1] += 1
   
    def addnonsentimentalwords(self, text):
        # Register every word that is neither in the master list nor already seen.
        for s in text.split():
            word = s.lower()
            if word not in self.scores and word not in self.nonsentimentalwords:
                self.nonsentimentalwords[word] = [0, 0]  # [positive, negative] accumulators
                
    def analyze(self):
        with open(sys.argv[2], 'r') as f:
            for data in f:
                try:
                    d = json.loads(data)
                    # Only score English tweets that actually contain text.
                    if d.get('text') and d.get('lang') == 'en':
                        # Strip everything except letters and whitespace.
                        tex = re.sub(r"[^A-Za-z\s]", "", d['text'])
                        count = self.score(tex)
                        self.addnonsentimentalwords(tex)
                        self.scorenonsentimentalwords(tex, count)
                except (ValueError, KeyError, TypeError):
                    print("Error")
        for key, value in self.nonsentimentalwords.items():
            print(key + " " + str(value[0] - value[1]))
            
                  
if __name__ == '__main__':
    sentiment = Sentiment()
    sentiment.loadscores()
    sentiment.analyze()
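To run it, the tab-delimited word list comes first and the tweet file (one JSON object per line) second, so something like python sentiment.py wordlist.txt tweets.json, where the file names are only placeholders for whichever valence list and tweet dataset you are using.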