September | 2014 | MindSpace

Web application stack

September 26, 2014 Leave a comment

RestHUB

JHipster

I really need to keep up with these new stacks.

Filed under Java, REST

Lasso fit

September 26, 2014 Leave a comment

The code I was given

set.seed(3523)
library(AppliedPredictiveModeling)
data(concrete)
inTrain = createDataPartition(concrete$CompressiveStrength, p = 3/4)[[1]]
training = concrete[ inTrain,]
testing = concrete[-inTrain,]

This is the data

<- head(as.matrix(training))
    Cement BlastFurnaceSlag FlyAsh Water Superplasticizer CoarseAggregate
47   349.0              0.0      0 192.0              0.0          1047.0
55   139.6            209.4      0 192.0              0.0          1047.0
56   198.6            132.4      0 192.0              0.0           978.4
58   198.6            132.4      0 192.0              0.0           978.4
63   310.0              0.0      0 192.0              0.0           971.0
115  362.6            189.0      0 164.9             11.6           944.7
    FineAggregate Age CompressiveStrength
47          806.9   3               15.05
55          806.9   7               14.59
56          825.5   7               14.64
58          825.5   3                9.13
63          850.6   3                9.87
115         755.8   7               22.90

Lasso fit and plot

predictors <- as.matrix(training)[,-9]
lasso.fit <- lars(predictors,training$CompressiveStrength,type="lasso",trace=TRUE)
headings <- names(training[-(9:10)])
plot(lasso.fit, breaks=FALSE)
legend("topleft", headings,pch=8, lty=1:length(headings),col=1:length(headings))

According to this graph the last coefficient to be set to zero as the penalty increases is Cement. I think this is correct but I may change this.

Filed under Machine Learning, R

RandomForests

September 22, 2014 Leave a comment

I am just posting R code at this time. The explanation is missing but I am making some progress.

library(ElemStatLearn)
library(randomForest)
data(vowel.train)
data(vowel.test)

> head(vowel.train)
y x.1 x.2 x.3 x.4 x.5 x.6 x.7 x.8 x.9 x.10
1 1 -3.639 0.418 -0.670 1.779 -0.168 1.627 -0.388 0.529 -0.874 -0.814
2 2 -3.327 0.496 -0.694 1.365 -0.265 1.933 -0.363 0.510 -0.621 -0.488
3 3 -2.120 0.894 -1.576 0.147 -0.707 1.559 -0.579 0.676 -0.809 -0.049
4 4 -2.287 1.809 -1.498 1.012 -1.053 1.060 -0.567 0.235 -0.091 -0.795
5 5 -2.598 1.938 -0.846 1.062 -1.633 0.764 0.394 -0.150 0.277 -0.396
6 6 -2.852 1.914 -0.755 0.825 -1.588 0.855 0.217 -0.246 0.238 -0.365

vowel.train$y <- factor(vowel.train$y)
set.seed(33833)
fit.rf <- randomForest(vowel.train$y ~ .,data=vowel.train)
plot(fit.rf)
varImpPlot(fit.rf)

I was asked to find the order of variable importance which this graph shows.

Filed under Machine Learning, R

Principal Component Analysis

September 18, 2014 Leave a comment

library(caret)
library(AppliedPredictiveModeling)
set.seed(3433)
data(AlzheimerDisease)
adData = data.frame(diagnosis,predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training = adData[ inTrain,]
testing = adData[-inTrain,]

I recently studied Predictive Analytics techniques as part of a course. I was given the code shown above. I generated the following two predictive models to compare their accuracy figures. This might be easy for experts but I found it tricky. So I post the code here for my reference.

Non-PCA

training1 <- training[,grepl("^IL|^diagnosis",names(training))]

 test1 <- testing[,grepl("^IL|^diagnosis",names(testing))]

 modelFit <- train(diagnosis ~ .,method="glm",data=training1)

 confusionMatrix(test1$diagnosis,predict(modelFit, test1))

Confusion Matrix and Statistics

Reference

Prediction Impaired Control

Impaired 2 20

Control 9 51

Accuracy : 0.6463

95% CI : (0.533, 0.7488)

No Information Rate : 0.8659

P-Value [Acc > NIR] : 1.00000

Kappa : -0.0702

Mcnemar’s Test P-Value : 0.06332

Sensitivity : 0.18182

Specificity : 0.71831

Pos Pred Value : 0.09091

Neg Pred Value : 0.85000

Prevalence : 0.13415

Detection Rate : 0.02439

Detection Prevalence : 0.26829

Balanced Accuracy : 0.45006

‘Positive’ Class : Impaired

PCA

training2 <- training[,grepl("^IL",names(training))]

preProc <- preProcess(training2,method="pca",thresh=0.8)

test2 <- testing[,grepl("^IL",names(testing))]

trainpca <- predict(preProc, training2)

testpca <- predict(preProc, test2)

modelFitpca <- train(training1$diagnosis ~ .,method="glm",data=trainpca)

confusionMatrix(test1$diagnosis,predict(modelFitpca, testpca))

Confusion Matrix and Statistics

Reference
Prediction Impaired Control
Impaired 3 19
Control 4 56

Accuracy : 0.7195
95% CI : (0.6094, 0.8132)
No Information Rate : 0.9146
P-Value [Acc > NIR] : 1.000000

Kappa : 0.0889
Mcnemar’s Test P-Value : 0.003509

Sensitivity : 0.42857
Specificity : 0.74667
Pos Pred Value : 0.13636
Neg Pred Value : 0.93333
Prevalence : 0.08537
Detection Rate : 0.03659
Detection Prevalence : 0.26829
Balanced Accuracy : 0.58762

‘Positive’ Class : Impaired

Filed under Machine Learning, R

Open Government Data Platform India

September 14, 2014 1 Comment

I found many datasets in this site but many of them are not useful. Some of them are just junk and others are not useful for predictive analytics.

But I found one that I actually used. The y-axis labels are smudged but that can be fixed.

The data is JSON which I parsed.

library(RJSONIO)
library(ggplot2)
library(reshape2)
library(grid)
this.dir <- dirname(parent.frame(2)$ofile) 
setwd(this.dir)

airlines   = fromJSON("json")
df <- sapply(airlines$data,unlist)
df <- data.frame(t(df))
colnames(df) <- c( (airlines[[1]][[1]])[[2]], (airlines[[1]][[2]])[[2]], (airlines[[1]][[3]])[[2]], (airlines[[1]][[4]])[[2]], (airlines[[1]][[5]])[[2]], (airlines[[1]][[6]])[[2]], (airlines[[1]][[7]])[[2]], (airlines[[1]][[8]])[[2]], (airlines[[1]][[9]])[[2]],(airlines[[1]][[10]])[[2]] )

df.melted <- melt(df, id = "YEAR")
print(class(df.melted$value))
df.melted$value<-as.numeric(df.melted$value) 
df.melted$value <- format(df.melted$value, scientific = FALSE)
print(ggplot(data = df.melted, aes(x = YEAR, y = value, color = variable)) +geom_point() + theme(axis.text.x = element_text(angle = 90,hjust = 0.9))  + theme(axis.text.y = element_text(angle = 360, hjust = 1, size=7.5, vjust=1))+ theme(plot.margin =unit(c(3,1,0.5,1), "cm")) + ylab("")  + theme(legend.text=element_text(size=6)))

This is the sample data.

 head(df)
     YEAR INTERNATIONAL  ACM (IN NOS) DOMESTIC ACM (IN NOS) TOTAL ACM (IN NOS)
1 1995-96                       92515                314727             407242
2 1996-97                       94884                324462             419346
3 1997-98                       98226                317531             415757
4 1998-99                       99563                325392             424955
5 1999-00                       99701                368015             467716
6 2000-01                      103211                386575             489786
  INTERNATIONAL PAX (IN NOS) DOMESTIC PAX (IN NOS) TOTAL PAX (IN NOS)
1                   11449756              25563998           37013754
2                   12223660              24276108           36499768
3                   12782769              23848833           36631602
4                   12916788              24072631           36989419
5                   13293027              25741521           39034548
6                   14009052              28017568           42026620
  INTERNATIONAL FREIGHT (IN MT) DOMESTIC FREIGHT (IN MT) TOTAL FREIGHT (IN MT)
1                        452853                   196516                649369
2                        479088                   202122                681210
3                        488175                   217405                705580
4                        474660                   224490                699150
5                        531844                   265570                797414
6                        557772                   288373                846145

Filed under R

Decision Tree

September 4, 2014 4 Comments

This is a technique used to analyze data for prediction. I came across this when I was studying Machine Learning.

Tree models are computationally intensive techniques for recursively partitioning response variables into subsets based on their relationship to one or more (usually many) predictor variables.

> head(data)
  file_id time cell_id    d1    d2 fsc_small fsc_perp fsc_big    pe chl_small
1     203   12       1 25344 27968     34677    14944   32400  2216     28237
2     203   12       4 12960 22144     37275    20440   32400  1795     36755
3     203   12       6 21424 23008     31725    11253   32384  1901     26640
4     203   12       9  7712 14528     28744    10219   32416  1248     35392
5     203   12      11 30368 21440     28861     6101   32400 12989     23421
6     203   12      15 30032 22704     31221    13488   32400  1883     27323
  chl_big     pop
1    5072    pico
2   14224   ultra
3       0    pico
4   10704   ultra
5    5920 synecho
6    6560    pico

training <- createDataPartition(data$pop, times=1,p=.5,list=FALSE)
train <- data[training,]
test <- data[,training]

fol <- formula(pop ~ fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small)
model <- rpart(fol, method="class", data=train)
print(model)

n= 36172

node), split, n, loss, yval, (yprob)
* denotes terminal node

1) root 36172 25742 pico (0.0014 0.18 0.29 0.25 0.28)
2) pe=41300 5175 660 nano (0 0.87 0 0 0.13) *
11) chl_small=5001.5 9856 783 synecho (0.0052 0.054 0.0051 0.92 0.015)
6) chl_small>=38109.5 653 133 nano (0.078 0.8 0 0.055 0.07) *
7) chl_small< 38109.5 9203 166 synecho (0 0.0015 0.0054 0.98 0.011) *

Filed under Machine Learning, R

MindSpace

Apache Mahout

Time series forecast

Web application stack

Lasso fit

The code I was given

This is the data

Lasso fit and plot

RandomForests

Principal Component Analysis

Non-PCA

PCA

Open Government Data Platform India

Decision Tree

Blogroll