## Apache Mahout

I followed this tutorial. Mahout seems to be an easy way to test machine learning algorithms through its Java API.

However, I would use the following R code instead of the one shown in the tutorial to convert the MovieLens dataset to CSV format.

```r
r <- file("u.data", "r")
w <- file("u1.csv", "w")

# Read the tab-delimited ratings line by line and replace each
# run of whitespace with a comma.
while (length(data <- readLines(r, n = 1)) > 0) {
  writeLines(gsub("\\s+", ",", data), w)
}

close(r)
close(w)
```
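
Since gsub() operates on whole character vectors, the loop can also be replaced by a single vectorized call. A minimal sketch on an in-memory sample (for the real file, `lines` would come from readLines("u.data")):

```r
# Two sample u.data rows (tab-separated: user, item, rating, timestamp).
lines <- c("196\t242\t3\t881250949", "186\t302\t3\t891717742")

# One vectorized substitution instead of a line-by-line loop.
csv <- gsub("\\s+", ",", lines)
# writeLines(csv, "u1.csv") would then write the converted file.
```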


## Time series forecast

This code counts how many values from the testing dataset fall within the 95% Confidence Interval range.

```r
library(forecast)
library(lubridate)  # for the year() function below

training <- dat[year(dat$date) < 2012, ]
testing  <- dat[year(dat$date) > 2011, ]

tstrain <- ts(training$visitsTumblr)
fit <- bats(tstrain)
fc <- forecast(fit, h = 235)

# Columns 1 and 2 of fc$lower / fc$upper hold the 80% and 95% bounds.
sum <- 0
for (i in 1:nrow(fc$upper)) {
  lo95 <- fc$lower[i, 2]
  hi95 <- fc$upper[i, 2]
  print(paste(testing$visitsTumblr[i], lo95, hi95))
  if (testing$visitsTumblr[i] > lo95 & testing$visitsTumblr[i] < hi95) {
    sum <- sum + 1
  }
}
print(sum)
```

The forecast object carries the interval bounds (Lo 95 and Hi 95) that I use:

```
> fc
    Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
366       207.4397 -124.2019 539.0813 -299.7624 714.6418
367       197.2773 -149.6631 544.2177 -333.3223 727.8769
368       235.5405 -112.0582 583.1392 -296.0658 767.1468
369       235.5405 -112.7152 583.7962 -297.0707 768.1516
370       235.5405 -113.3710 584.4520 -298.0736 769.1546
371       235.5405 -114.0256 585.1065 -299.0747 770.1556
372       235.5405 -114.6789 585.7599 -300.0739 771.1548
```

## Lasso fit

### The code I was given

```r
set.seed(3523)
library(AppliedPredictiveModeling)
library(caret)  # for createDataPartition()
data(concrete)

inTrain <- createDataPartition(concrete$CompressiveStrength, p = 3/4)[[1]]
training <- concrete[inTrain, ]
testing <- concrete[-inTrain, ]
```


### This is the data

```
> head(as.matrix(training))
    Cement BlastFurnaceSlag FlyAsh Water Superplasticizer CoarseAggregate
47   349.0              0.0      0 192.0              0.0          1047.0
55   139.6            209.4      0 192.0              0.0          1047.0
56   198.6            132.4      0 192.0              0.0           978.4
58   198.6            132.4      0 192.0              0.0           978.4
63   310.0              0.0      0 192.0              0.0           971.0
115  362.6            189.0      0 164.9             11.6           944.7
    FineAggregate Age CompressiveStrength
47          806.9   3               15.05
55          806.9   7               14.59
56          825.5   7               14.64
58          825.5   3                9.13
63          850.6   3                9.87
115         755.8   7               22.90
```


### Lasso fit and plot

```r
library(lars)

predictors <- as.matrix(training)[, -9]
lasso.fit <- lars(predictors, training$CompressiveStrength,
                  type = "lasso", trace = TRUE)

headings <- names(training)[-9]
plot(lasso.fit, breaks = FALSE)
legend("topleft", headings, pch = 8,
       lty = 1:length(headings), col = 1:length(headings))
```

According to this graph, the last coefficient to be set to zero as the penalty increases is Cement. I think this is correct, but I may revise it.

## RandomForests

I am just posting the R code at this time. The explanation is missing, but I am making some progress.

```r
library(ElemStatLearn)
library(randomForest)

data(vowel.train)
data(vowel.test)
```

```
> head(vowel.train)
  y    x.1   x.2    x.3   x.4    x.5   x.6    x.7    x.8    x.9   x.10
1 1 -3.639 0.418 -0.670 1.779 -0.168 1.627 -0.388  0.529 -0.874 -0.814
2 2 -3.327 0.496 -0.694 1.365 -0.265 1.933 -0.363  0.510 -0.621 -0.488
3 3 -2.120 0.894 -1.576 0.147 -0.707 1.559 -0.579  0.676 -0.809 -0.049
4 4 -2.287 1.809 -1.498 1.012 -1.053 1.060 -0.567  0.235 -0.091 -0.795
5 5 -2.598 1.938 -0.846 1.062 -1.633 0.764  0.394 -0.150  0.277 -0.396
6 6 -2.852 1.914 -0.755 0.825 -1.588 0.855  0.217 -0.246  0.238 -0.365
```

```r
vowel.train$y <- factor(vowel.train$y)
set.seed(33833)
fit.rf <- randomForest(y ~ ., data = vowel.train)

plot(fit.rf)
varImpPlot(fit.rf)
```
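
To read the importance ranking off programmatically rather than from varImpPlot, importance(fit.rf) returns the importance scores, and sorting them gives the order. A sketch with a stand-in named vector (the values below are made up for illustration; the real ones come from importance(fit.rf)):

```r
# Stand-in for importance(fit.rf)[, 1]: names are vowel predictors,
# values are hypothetical MeanDecreaseGini scores.
imp <- c(x.1 = 89.2, x.2 = 61.3, x.3 = 34.9, x.4 = 33.1, x.5 = 50.7)

# Variables in decreasing order of importance.
ranking <- names(sort(imp, decreasing = TRUE))
```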


I was asked to find the order of variable importance, which this graph shows.

## Principal Component Analysis

```r
library(caret)
library(AppliedPredictiveModeling)

set.seed(3433)
data(AlzheimerDisease)  # loads diagnosis and predictors separately
adData <- data.frame(diagnosis, predictors)

inTrain <- createDataPartition(adData$diagnosis, p = 3/4)[[1]]
training <- adData[inTrain, ]
testing <- adData[-inTrain, ]
```

I recently studied predictive analytics techniques as part of a course, where I was given the code shown above. I generated the following two predictive models to compare their accuracy. This might be easy for experts, but I found it tricky, so I am posting the code here for my reference.

### Non-PCA

```r
training1 <- training[, grepl("^IL|^diagnosis", names(training))]
test1 <- testing[, grepl("^IL|^diagnosis", names(testing))]

modelFit <- train(diagnosis ~ ., method = "glm", data = training1)
confusionMatrix(test1$diagnosis, predict(modelFit, test1))
```


```
Confusion Matrix and Statistics

          Reference
Prediction Impaired Control
  Impaired        2      20
  Control         9      51

               Accuracy : 0.6463
                 95% CI : (0.533, 0.7488)
    No Information Rate : 0.8659
    P-Value [Acc > NIR] : 1.00000

                  Kappa : -0.0702
 Mcnemar's Test P-Value : 0.06332

            Sensitivity : 0.18182
            Specificity : 0.71831
         Pos Pred Value : 0.09091
         Neg Pred Value : 0.85000
             Prevalence : 0.13415
         Detection Rate : 0.02439
   Detection Prevalence : 0.26829
      Balanced Accuracy : 0.45006

       'Positive' Class : Impaired
```
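
The headline numbers can be recomputed by hand from the table above, which helped me check I was reading it correctly:

```r
# Counts copied from the confusion matrix above (rows = Prediction, cols = Reference).
cm <- matrix(c(2, 9, 20, 51), nrow = 2,
             dimnames = list(Prediction = c("Impaired", "Control"),
                             Reference = c("Impaired", "Control")))

accuracy <- sum(diag(cm)) / sum(cm)   # (2 + 51) / 82
nir <- max(colSums(cm)) / sum(cm)     # majority class (Control) proportion
```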

### PCA

```r
training2 <- training[, grepl("^IL", names(training))]
test2 <- testing[, grepl("^IL", names(testing))]

preProc <- preProcess(training2, method = "pca", thresh = 0.8)
trainpca <- predict(preProc, training2)
testpca <- predict(preProc, test2)

modelFitpca <- train(training1$diagnosis ~ ., method = "glm", data = trainpca)
confusionMatrix(test1$diagnosis, predict(modelFitpca, testpca))
```

```
Confusion Matrix and Statistics

          Reference
Prediction Impaired Control
  Impaired        3      19
  Control         4      56

               Accuracy : 0.7195
                 95% CI : (0.6094, 0.8132)
    No Information Rate : 0.9146
    P-Value [Acc > NIR] : 1.000000

                  Kappa : 0.0889
 Mcnemar's Test P-Value : 0.003509

            Sensitivity : 0.42857
            Specificity : 0.74667
         Pos Pred Value : 0.13636
         Neg Pred Value : 0.93333
             Prevalence : 0.08537
         Detection Rate : 0.03659
   Detection Prevalence : 0.26829
      Balanced Accuracy : 0.58762

       'Positive' Class : Impaired
```
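
The preProcess step with thresh = 0.8 keeps just enough principal components to explain 80% of the variance. Base R's prcomp illustrates the same idea on a toy matrix (all names here are illustrative, not from the Alzheimer data):

```r
set.seed(1)
X <- matrix(rnorm(100 * 5), ncol = 5)
X[, 2] <- X[, 1] + rnorm(100, sd = 0.1)  # make two columns nearly collinear

p <- prcomp(X, center = TRUE, scale. = TRUE)
cumvar <- cumsum(p$sdev^2) / sum(p$sdev^2)
k <- which(cumvar >= 0.8)[1]          # components kept at an 80% threshold
scores <- p$x[, 1:k, drop = FALSE]    # analogue of predict(preProc, training2)
```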

## Decision Tree

A decision tree is a technique used to analyze data for prediction. I came across it while studying machine learning.

Tree models are computationally intensive techniques that recursively partition observations into subsets based on the relationship between the response and one or more (usually many) predictor variables.

```
> head(data)
  file_id time cell_id    d1    d2 fsc_small fsc_perp fsc_big    pe chl_small
1     203   12       1 25344 27968     34677    14944   32400  2216     28237
2     203   12       4 12960 22144     37275    20440   32400  1795     36755
3     203   12       6 21424 23008     31725    11253   32384  1901     26640
4     203   12       9  7712 14528     28744    10219   32416  1248     35392
5     203   12      11 30368 21440     28861     6101   32400 12989     23421
6     203   12      15 30032 22704     31221    13488   32400  1883     27323
  chl_big     pop
1    5072    pico
2   14224   ultra
3       0    pico
4   10704   ultra
5    5920 synecho
6    6560    pico
```

```r
library(caret)
library(rpart)

training <- createDataPartition(data$pop, times = 1, p = 0.5, list = FALSE)
train <- data[training, ]
test <- data[-training, ]

fol <- formula(pop ~ fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small)
model <- rpart(fol, method = "class", data = train)
print(model)
```

```
n= 36172

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 36172 25742 pico (0.0014 0.18 0.29 0.25 0.28)
 2) pe=41300 5175 660 nano (0 0.87 0 0 0.13) *
11) chl_small=5001.5 9856 783 synecho (0.0052 0.054 0.0051 0.92 0.015)
 6) chl_small>=38109.5 653 133 nano (0.078 0.8 0 0.055 0.07) *
 7) chl_small< 38109.5 9203 166 synecho (0 0.0015 0.0054 0.98 0.011) *
```
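
rpart ships with R, so the same recursive-partitioning call can be tried end-to-end on the built-in iris data as a stand-in for the plankton dataset above:

```r
library(rpart)

# Same pattern as above: a class-method tree fitted from a formula.
fol <- formula(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width)
model <- rpart(fol, method = "class", data = iris)

preds <- predict(model, iris, type = "class")
train_acc <- mean(preds == iris$Species)  # resubstitution accuracy
```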

## Machine learning on Big data

I just watched the webinar conducted by SkyTree. One slide in particular covered the evolution of machine learning.

Let $J(\theta) = \theta^3$. Furthermore, let $\theta = 1$ and $\epsilon=0.01$. You use the formula

$(J(\theta+\epsilon) - J(\theta-\epsilon)) / (2\epsilon)$

to approximate the derivative. What value do you get using this approximation? (When $\theta=1$, the true, exact derivative is
$d/d\theta \, J(\theta) = 3$.)

The Octave code that I used to solve this is

```
>> ((1+0.01)^3 - (1-0.01)^3)/(2*0.01)
ans = 3.0001
```
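
The same two-sided (central) difference works for any cost function, so it can be wrapped up for reuse; here is an R version of the Octave one-liner (the function and argument names are my own):

```r
# Central-difference approximation of dJ/dtheta.
num_grad <- function(J, theta, eps = 0.01) {
  (J(theta + eps) - J(theta - eps)) / (2 * eps)
}

J <- function(theta) theta^3
approx_deriv <- num_grad(J, 1)  # close to 3.0001; the exact derivative is 3
```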

## Logistic Regression

I just finished the code for a logistic regression classification model. I once worked on a card transaction processing system that had a requirement to identify whether a transaction was fraudulent. We did not use any classification model, but if we had had a training set of historical data, we could have used this learning algorithm.
Most of the time, it is the lack of skills that affects the software projects I have been involved with.

### Cost function for Logistic Regression

$J(\theta) = \frac{1}{m} \sum_{i=1}^{m}\left[-y^{(i)} \log( h_\theta(x^{(i)})) - (1-y^{(i)})\log( 1 - h_\theta(x^{(i)}))\right]$

$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} ( h_\theta(x^{(i)}) - y^{(i)})\, x^{(i)}_j$
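
These two formulas are enough for a batch gradient-descent fit. A self-contained sketch on synthetic data (the variable names, learning rate, and iteration count are my own choices, not from the course):

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Synthetic binary outcome generated from known coefficients c(-1, 2).
set.seed(42)
m <- 200
X <- cbind(1, rnorm(m))                  # intercept column plus one feature
y <- rbinom(m, 1, sigmoid(X %*% c(-1, 2)))

theta <- c(0, 0)
alpha <- 0.1
for (iter in 1:5000) {
  h <- sigmoid(X %*% theta)
  grad <- t(X) %*% (h - y) / m           # the gradient formula above
  theta <- theta - alpha * grad          # batch gradient-descent update
}

# J(theta): the cost function above, evaluated at the fitted theta.
cost <- -mean(y * log(h) + (1 - y) * log(1 - h))
```

After enough iterations, theta should land near the generating coefficients c(-1, 2).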