Apache Mahout

I followed this tutorial. Mahout seems to be an easy way to test Machine Learning algorithms using the Java API.

But I would use this R code instead of the one shown in the tutorial to convert the MovieLens dataset to CSV format.

r <- file("u.data", "r")
w <- file("u1.csv", "w")

# Read the tab-separated u.data file in chunks and replace runs of
# whitespace with commas. Without the n argument, readLines() would
# slurp the whole file in a single pass and the loop would be pointless.
while (length(data <- readLines(r, n = 1000)) > 0) {
  writeLines(gsub("\\s+", ",", data), w)
}

close(r)
close(w)
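Each row of u.data holds tab-separated user id, item id, rating, and timestamp fields, so a row like "196 242 3 881250949" becomes "196,242,3,881250949".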

Time series forecast

This code counts how many values from the testing dataset fall within the 95% confidence interval of the forecast.

library(forecast)
library(lubridate)  # for the year() function below

dat <- read.csv("~/Desktop/gaData.csv")
training <- dat[year(dat$date) < 2012, ]
testing  <- dat[year(dat$date) > 2011, ]
tstrain  <- ts(training$visitsTumblr)

fit <- bats(tstrain)
fc  <- forecast(fit, h = 235)

# Count the test values that fall inside the 95% interval, i.e. between
# the "95%" columns of fc$lower and fc$upper.
count <- 0
for (i in 1:nrow(fc$upper)) {
  lo <- fc$lower[i, 2]  # Lo 95
  hi <- fc$upper[i, 2]  # Hi 95
  print(paste(testing$visitsTumblr[i], lo, hi))
  if (testing$visitsTumblr[i] > lo & testing$visitsTumblr[i] < hi) {
    count <- count + 1
  }
}
print(count)

The forecast object holds this data (the Lo 95 and Hi 95 columns are the ones I use).

> fc
    Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
366       207.4397 -124.2019 539.0813 -299.7624 714.6418
367       197.2773 -149.6631 544.2177 -333.3223 727.8769
368       235.5405 -112.0582 583.1392 -296.0658 767.1468
369       235.5405 -112.7152 583.7962 -297.0707 768.1516
370       235.5405 -113.3710 584.4520 -298.0736 769.1546
371       235.5405 -114.0256 585.1065 -299.0747 770.1556
372       235.5405 -114.6789 585.7599 -300.0739 771.1548
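As a side note, the same count can be computed without the explicit loop; a minimal sketch, using the fc and testing objects from above:

# Vectorized version: compare the first 235 test values against the
# 95% interval columns directly.
actual <- head(testing$visitsTumblr, 235)
sum(actual > fc$lower[, 2] & actual < fc$upper[, 2])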

Web application stack

RESThub

JHipster

I really need to keep up with these new stacks.

Lasso fit

The code I was given

set.seed(3523)
library(caret)  # for createDataPartition()
library(AppliedPredictiveModeling)
data(concrete)
inTrain <- createDataPartition(concrete$CompressiveStrength, p = 3/4)[[1]]
training <- concrete[inTrain, ]
testing  <- concrete[-inTrain, ]

This is the data

> head(as.matrix(training))
    Cement BlastFurnaceSlag FlyAsh Water Superplasticizer CoarseAggregate
47   349.0              0.0      0 192.0              0.0          1047.0
55   139.6            209.4      0 192.0              0.0          1047.0
56   198.6            132.4      0 192.0              0.0           978.4
58   198.6            132.4      0 192.0              0.0           978.4
63   310.0              0.0      0 192.0              0.0           971.0
115  362.6            189.0      0 164.9             11.6           944.7
    FineAggregate Age CompressiveStrength
47          806.9   3               15.05
55          806.9   7               14.59
56          825.5   7               14.64
58          825.5   3                9.13
63          850.6   3                9.87
115         755.8   7               22.90

Lasso fit and plot

library(lars)

predictors <- as.matrix(training)[, -9]  # drop CompressiveStrength (column 9)
lasso.fit <- lars(predictors, training$CompressiveStrength, type = "lasso", trace = TRUE)
headings <- names(training)[-9]
plot(lasso.fit, breaks = FALSE)
legend("topleft", headings, pch = 8, lty = 1:length(headings), col = 1:length(headings))

[Plot: lasso coefficient paths]

According to this plot, the last coefficient to be set to zero as the penalty increases is Cement. I think this is correct, but I may revise it.
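The plot can be cross-checked numerically: coef() on a lars fit returns the coefficients at each step of the path, and the predictor that enters the path first is the one zeroed last as the penalty grows. A minimal sketch:

# Coefficient matrix along the lars path: rows are steps, columns are
# predictors. Step 1 is the fully penalized model (all zeros).
path <- coef(lasso.fit)
rowSums(path != 0)              # how many predictors are active at each step
colnames(path)[path[2, ] != 0]  # the first predictor to enter the path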

Random Forests

I am just posting R code at this time. The explanation is missing, but I am making some progress.

library(ElemStatLearn)
library(randomForest)
data(vowel.train)
data(vowel.test)
> head(vowel.train)
y x.1 x.2 x.3 x.4 x.5 x.6 x.7 x.8 x.9 x.10
1 1 -3.639 0.418 -0.670 1.779 -0.168 1.627 -0.388 0.529 -0.874 -0.814
2 2 -3.327 0.496 -0.694 1.365 -0.265 1.933 -0.363 0.510 -0.621 -0.488
3 3 -2.120 0.894 -1.576 0.147 -0.707 1.559 -0.579 0.676 -0.809 -0.049
4 4 -2.287 1.809 -1.498 1.012 -1.053 1.060 -0.567 0.235 -0.091 -0.795
5 5 -2.598 1.938 -0.846 1.062 -1.633 0.764 0.394 -0.150 0.277 -0.396
6 6 -2.852 1.914 -0.755 0.825 -1.588 0.855 0.217 -0.246 0.238 -0.365
vowel.train$y <- factor(vowel.train$y)
set.seed(33833)
# The response must be referenced as y, not vowel.train$y; otherwise the
# "." on the right-hand side would include y itself as a predictor.
fit.rf <- randomForest(y ~ ., data = vowel.train)
plot(fit.rf)
varImpPlot(fit.rf)

[Plot: fit.rf error rate by number of trees]

I was asked to find the order of variable importance, which this graph shows.

[Plot: varImpPlot variable importance]
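The same ordering can also be read off numerically; a minimal sketch:

# Mean decrease in Gini impurity per predictor, sorted with the most
# important variable first.
imp <- importance(fit.rf)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]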

Principal Component Analysis

library(caret)
library(AppliedPredictiveModeling)
set.seed(3433)
data(AlzheimerDisease)
adData = data.frame(diagnosis,predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training = adData[ inTrain,]
testing = adData[-inTrain,]

I recently studied predictive analytics techniques as part of a course, where I was given the code shown above. I generated the following two predictive models to compare their accuracy figures. This might be easy for experts, but I found it tricky, so I am posting the code here for my reference.

Non-PCA

training1 <- training[, grepl("^IL|^diagnosis", names(training))]
test1 <- testing[, grepl("^IL|^diagnosis", names(testing))]

modelFit <- train(diagnosis ~ ., method = "glm", data = training1)
confusionMatrix(test1$diagnosis, predict(modelFit, test1))

Confusion Matrix and Statistics

          Reference
Prediction Impaired Control
  Impaired        2      20
  Control         9      51

               Accuracy : 0.6463
                 95% CI : (0.533, 0.7488)
    No Information Rate : 0.8659
    P-Value [Acc > NIR] : 1.00000

                  Kappa : -0.0702
 Mcnemar's Test P-Value : 0.06332

            Sensitivity : 0.18182
            Specificity : 0.71831
         Pos Pred Value : 0.09091
         Neg Pred Value : 0.85000
             Prevalence : 0.13415
         Detection Rate : 0.02439
   Detection Prevalence : 0.26829
      Balanced Accuracy : 0.45006

       'Positive' Class : Impaired

PCA

training2 <- training[, grepl("^IL", names(training))]
test2 <- testing[, grepl("^IL", names(testing))]

preProc <- preProcess(training2, method = "pca", thresh = 0.8)
trainpca <- predict(preProc, training2)
testpca <- predict(preProc, test2)

modelFitpca <- train(training1$diagnosis ~ ., method = "glm", data = trainpca)
confusionMatrix(test1$diagnosis, predict(modelFitpca, testpca))

Confusion Matrix and Statistics

          Reference
Prediction Impaired Control
  Impaired        3      19
  Control         4      56

               Accuracy : 0.7195
                 95% CI : (0.6094, 0.8132)
    No Information Rate : 0.9146
    P-Value [Acc > NIR] : 1.000000

                  Kappa : 0.0889
 Mcnemar's Test P-Value : 0.003509

            Sensitivity : 0.42857
            Specificity : 0.74667
         Pos Pred Value : 0.13636
         Neg Pred Value : 0.93333
             Prevalence : 0.08537
         Detection Rate : 0.03659
   Detection Prevalence : 0.26829
      Balanced Accuracy : 0.58762

       'Positive' Class : Impaired
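So on this split, PCA preprocessing lifts the accuracy from 0.6463 to 0.7195. To see how many principal components the 0.8 variance threshold keeps, the preProcess object can be inspected; a quick check, assuming the fitted preProc object from above:

# Number of components retained to capture 80% of the variance
preProc$numComp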

Open Government Data Platform India

[Screenshot: Open Government Data Platform India]

I found many datasets on this site, but most are not useful: some are just junk, and others are unsuitable for predictive analytics.

But I found one that I actually used. The y-axis labels are smudged, but that can be fixed.

[Plot: airline traffic statistics by year]
The data is JSON, which I parsed with the code below.

library(RJSONIO)
library(ggplot2)
library(reshape2)
library(grid)

# Set the working directory to the script's location (this trick only
# works when the script is source()'d).
this.dir <- dirname(parent.frame(2)$ofile)
setwd(this.dir)

airlines <- fromJSON("json")
df <- sapply(airlines$data, unlist)
df <- data.frame(t(df))
# The first element of the parsed JSON describes the fields; the second
# component of each descriptor is the column label.
colnames(df) <- sapply(airlines[[1]][1:10], function(f) f[[2]])

df.melted <- melt(df, id = "YEAR")
print(class(df.melted$value))
# Keep the values numeric: reformatting them with format() would turn
# them back into strings and give a discrete, smudged y axis.
df.melted$value <- as.numeric(df.melted$value)

print(ggplot(data = df.melted, aes(x = YEAR, y = value, color = variable)) +
        geom_point() +
        scale_y_continuous(labels = scales::comma) +  # no scientific notation
        ylab("") +
        theme(axis.text.x = element_text(angle = 90, hjust = 0.9),
              axis.text.y = element_text(size = 7.5),
              plot.margin = unit(c(3, 1, 0.5, 1), "cm"),
              legend.text = element_text(size = 6)))

This is the sample data.

> head(df)
     YEAR INTERNATIONAL  ACM (IN NOS) DOMESTIC ACM (IN NOS) TOTAL ACM (IN NOS)
1 1995-96                       92515                314727             407242
2 1996-97                       94884                324462             419346
3 1997-98                       98226                317531             415757
4 1998-99                       99563                325392             424955
5 1999-00                       99701                368015             467716
6 2000-01                      103211                386575             489786
  INTERNATIONAL PAX (IN NOS) DOMESTIC PAX (IN NOS) TOTAL PAX (IN NOS)
1                   11449756              25563998           37013754
2                   12223660              24276108           36499768
3                   12782769              23848833           36631602
4                   12916788              24072631           36989419
5                   13293027              25741521           39034548
6                   14009052              28017568           42026620
  INTERNATIONAL FREIGHT (IN MT) DOMESTIC FREIGHT (IN MT) TOTAL FREIGHT (IN MT)
1                        452853                   196516                649369
2                        479088                   202122                681210
3                        488175                   217405                705580
4                        474660                   224490                699150
5                        531844                   265570                797414
6                        557772                   288373                846145

Decision Tree

This is a technique used to analyze data for prediction. I came across this when I was studying Machine Learning.

Tree models are computationally intensive techniques for recursively partitioning response variables into subsets based on their relationship to one or more (usually many) predictor variables.

> head(data)
  file_id time cell_id    d1    d2 fsc_small fsc_perp fsc_big    pe chl_small
1     203   12       1 25344 27968     34677    14944   32400  2216     28237
2     203   12       4 12960 22144     37275    20440   32400  1795     36755
3     203   12       6 21424 23008     31725    11253   32384  1901     26640
4     203   12       9  7712 14528     28744    10219   32416  1248     35392
5     203   12      11 30368 21440     28861     6101   32400 12989     23421
6     203   12      15 30032 22704     31221    13488   32400  1883     27323
  chl_big     pop
1    5072    pico
2   14224   ultra
3       0    pico
4   10704   ultra
5    5920 synecho
6    6560    pico
library(caret)
library(rpart)

training <- createDataPartition(data$pop, times = 1, p = 0.5, list = FALSE)
train <- data[training, ]
test  <- data[-training, ]  # rows held out for testing

fol <- formula(pop ~ fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small)
model <- rpart(fol, method="class", data=train)
print(model)
n= 36172

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 36172 25742 pico (0.0014 0.18 0.29 0.25 0.28)
  2) pe>=41300 5175 660 nano (0 0.87 0 0 0.13) *
 11) chl_small>=5001.5 9856 783 synecho (0.0052 0.054 0.0051 0.92 0.015)
  6) chl_small>=38109.5 653 133 nano (0.078 0.8 0 0.055 0.07) *
  7) chl_small< 38109.5 9203 166 synecho (0 0.0015 0.0054 0.98 0.011) *
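To gauge how the tree does on the held-out half, the test rows can be scored; a minimal sketch, assuming the train/test split above:

# Predict classes for the held-out rows and measure the accuracy.
pred <- predict(model, newdata = test, type = "class")
mean(pred == test$pop)
table(pred, test$pop)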

Python vs Java Streams and lambda

I ported the solution to the first Facebook Qualification Round problem to Java 8.

The main idea is to count the frequency of each letter, then assign the value 26 to the most frequent letter, 25 to the next, and so on. If two letters are tied for most frequent, it doesn't matter which of them gets which value, since the sum will be the same. The Java code below follows this idea.

I haven't tested it thoroughly, but this is almost as beautiful as Python, though Java is more verbose.

import java.util.Map;
import java.util.TreeMap;

public class WordCount {

    private int x = 26;

    public static void main(String... argv) {
        WordCount wc = new WordCount();
        wc.count();
    }

    private void count() {
        // Keep only lowercase letters and whitespace.
        String s = "__mainn__".replaceAll("[^a-z\\s]", "");
        System.out.println(s);

        // Count the frequency of each letter. The combiner (Map::putAll)
        // is only safe for this sequential stream; a parallel stream would
        // need a combiner that sums overlapping counts.
        final Map<Character, Integer> count = s.chars()
                .map(Character::toLowerCase)
                .collect(TreeMap::new,
                         (m, c) -> m.merge((char) c, 1, Integer::sum),
                         Map::putAll);

        count.forEach((k, v) -> System.out.println(k + "-" + v));

        // Multiply each letter's frequency by its value: 26 for the most
        // frequent letter, 25 for the next, and so on. Should stop when
        // x reaches 0; not handled here.
        count.entrySet().stream()
                .sorted((l, r) -> r.getValue().compareTo(l.getValue()))
                .forEach(e -> count.merge(e.getKey(), x--, Math::multiplyExact));

        System.out.println(count);

        // Sum the weighted counts (summing as doubles doesn't change the result).
        System.out.println(count.values().stream().mapToDouble(Integer::doubleValue).sum());
    }
}
mainn
a-1
i-1
m-1
n-2
{a=25, i=24, m=23, n=52}
124.0

Processed 0.25 TB on Amazon EMR clusters

I did that by provisioning one m1.medium master node and 15 m1.xlarge core nodes. This is easy and relatively cheap.
Since I use Pig, I don't have to write MapReduce jobs myself, though I should learn to code MR jobs in the future.

This command stores the result in a file. I used to count the records in the stored file, but I realized I don't have to, because the command itself prints how many records it writes.

STORE variable INTO '/user/hadoop/file' USING PigStorage();