sigmoid function

What is a sigmoid function?

It is described at http://mathworld.wolfram.com/SigmoidFunction.html

This simple code is the standard way to plot it. I am using Octave.

x = -10:0.1:10;               % grid of inputs from -10 to 10 in steps of 0.1
a = 1.0 ./ (1.0 + exp(-x));   % sigmoid: 1 / (1 + e^(-x))
figure;
plot(x, a, '-', 'linewidth', 3)
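The same function in Python, as a small sketch (not part of the original Octave snippet): it is easy to spot-check that sigmoid(0) is exactly 0.5 and that the curve saturates toward 0 and 1.

```python
import math

# Sigmoid: 1 / (1 + e^(-x)), the same formula as the Octave code above.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))  # 0.5
```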

[Figure: plot of the sigmoid function]

The Caltech-JPL Summer School on Big Data Analytics

This treasure trove of videos teaches many Machine Learning subjects. It is not intended to be a typical Coursera course, as there are no deadlines or tests.

There is much to write about what I am learning from these videos, but for now I am recording these measures for assessing the costs and benefits of a classification model as a reference.

[Figure: measures for assessing the costs and benefits of a classification model]
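As a quick sketch of such measures (my assumption is that the usual confusion-matrix quantities are meant), accuracy, precision, and recall can all be computed from the TP/FP/TN/FN counts:

```python
# Confusion-matrix measures from true/false positive/negative counts.
def classification_measures(tp, fp, tn, fn):
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # fraction correct
        "precision": tp / (tp + fp),                   # positive predictive value
        "recall": tp / (tp + fn),                      # true positive rate
    }
```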

Frequent Itemsets

I am reading Chapter 6 on Frequent Itemsets. I hope to understand the A-priori algorithm.
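As I understand it so far, the core A-priori idea is that an itemset can be frequent only if all of its subsets are frequent, so candidate pairs need only be built from items that survived the first pass. A minimal sketch of the two-pass version for pairs:

```python
from itertools import combinations

# A-priori for frequent pairs: pass 1 counts single items, pass 2 counts
# only pairs whose members were both frequent in pass 1.
def apriori_pairs(baskets, support):
    # Pass 1: count individual items.
    counts = {}
    for basket in baskets:
        for item in basket:
            counts[item] = counts.get(item, 0) + 1
    frequent = {i for i, c in counts.items() if c >= support}

    # Pass 2: count candidate pairs built only from frequent items.
    pair_counts = {}
    for basket in baskets:
        for pair in combinations(sorted(set(basket) & frequent), 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1
    return {p for p, c in pair_counts.items() if c >= support}
```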

Apache Mahout

I followed this tutorial. Mahout seems to be an easy way to test Machine Learning algorithms using the Java API.

But I would use this R code instead of the one shown in the tutorial to convert the MovieLens dataset to CSV format.

r <- file("u.data", "r")
w <- file("u1.csv", "w")

# Replace each run of whitespace (tabs in u.data) with a comma.
while (length(data <- readLines(r)) > 0) {
	writeLines(gsub("\\s+", ",", data), w)
}
close(r)
close(w)
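The same conversion in Python, as a sketch (the u.data and u1.csv file names come from the R snippet above): split each whitespace-delimited line and rewrite it as CSV.

```python
import csv
import re

# Convert a whitespace-delimited file (e.g. MovieLens u.data) to CSV.
def whitespace_to_csv(src, dst):
    with open(src) as fin, open(dst, "w", newline="") as fout:
        writer = csv.writer(fout)
        for line in fin:
            writer.writerow(re.split(r"\s+", line.strip()))
```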

Time series forecast

This code counts how many values from the testing dataset fall within the 95% confidence interval of the forecast.

library(forecast)
library(lubridate)  # for the year() function below
dat <- read.csv("~/Desktop/gaData.csv")
training <- dat[year(dat$date) < 2012,]
testing <- dat[year(dat$date) > 2011,]
tstrain <- ts(training$visitsTumblr)

fit <- bats(tstrain)
fc <- forecast(fit, h = 235)
lo95 <- fc$lower[, 2]  # Lo 95 column
hi95 <- fc$upper[, 2]  # Hi 95 column
count <- 0
for (i in seq_along(testing$visitsTumblr)) {
    print(paste(testing$visitsTumblr[i], lo95[i], hi95[i]))
    if (testing$visitsTumblr[i] > lo95[i] & testing$visitsTumblr[i] < hi95[i]) {
        count <- count + 1
    }
}
print(count)

The forecast object holds this kind of data; the Lo 95 and Hi 95 columns are the ones I use.

> fc
    Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
366       207.4397 -124.2019 539.0813 -299.7624 714.6418
367       197.2773 -149.6631 544.2177 -333.3223 727.8769
368       235.5405 -112.0582 583.1392 -296.0658 767.1468
369       235.5405 -112.7152 583.7962 -297.0707 768.1516
370       235.5405 -113.3710 584.4520 -298.0736 769.1546
371       235.5405 -114.0256 585.1065 -299.0747 770.1556
372       235.5405 -114.6789 585.7599 -300.0739 771.1548
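The same coverage check, sketched in Python: given observed values and lower/upper bounds (stand-ins for the Lo 95 and Hi 95 columns above), count the fraction falling inside the interval.

```python
# Fraction of observed values strictly inside their prediction interval.
def interval_coverage(observed, lower, upper):
    hits = sum(1 for y, lo, hi in zip(observed, lower, upper) if lo < y < hi)
    return hits / len(observed)
```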

Lasso fit

The code I was given:

set.seed(3523)
library(AppliedPredictiveModeling)
library(caret)  # createDataPartition() comes from caret
data(concrete)
inTrain = createDataPartition(concrete$CompressiveStrength, p = 3/4)[[1]]
training = concrete[ inTrain,]
testing = concrete[-inTrain,]

This is the data:

> head(as.matrix(training))
    Cement BlastFurnaceSlag FlyAsh Water Superplasticizer CoarseAggregate
47   349.0              0.0      0 192.0              0.0          1047.0
55   139.6            209.4      0 192.0              0.0          1047.0
56   198.6            132.4      0 192.0              0.0           978.4
58   198.6            132.4      0 192.0              0.0           978.4
63   310.0              0.0      0 192.0              0.0           971.0
115  362.6            189.0      0 164.9             11.6           944.7
    FineAggregate Age CompressiveStrength
47          806.9   3               15.05
55          806.9   7               14.59
56          825.5   7               14.64
58          825.5   3                9.13
63          850.6   3                9.87
115         755.8   7               22.90

Lasso fit and plot

library(lars)  # lars() fits the full lasso path
predictors <- as.matrix(training)[,-9]  # drop CompressiveStrength (column 9)
lasso.fit <- lars(predictors, training$CompressiveStrength, type="lasso", trace=TRUE)
headings <- names(training)[-9]         # predictor names for the legend
plot(lasso.fit, breaks=FALSE)
legend("topleft", headings, pch=8, lty=1:length(headings), col=1:length(headings))

[Figure: lasso coefficient paths from plot(lasso.fit)]

According to this graph, the last coefficient to be set to zero as the penalty increases is Cement. I think this is correct, but I may revise it.
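The same question can be sketched with scikit-learn's lasso_path on synthetic data (the concrete dataset is not loaded here): feature 0 has the largest true effect, so its coefficient should be the last to reach zero as the penalty grows.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.1, size=200)

alphas, coefs, _ = lasso_path(X, y)  # alphas are returned in decreasing order
# For each feature: the largest alpha at which its coefficient is non-zero.
entry = [alphas[np.nonzero(coefs[j])[0][0]] if coefs[j].any() else 0.0
         for j in range(X.shape[1])]
last_to_zero = int(np.argmax(entry))  # the feature that survives the longest
```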

Random Forests

I am just posting R code at this time. The explanation is missing, but I am making some progress.

library(ElemStatLearn)
library(randomForest)
data(vowel.train)
data(vowel.test)
> head(vowel.train)
y x.1 x.2 x.3 x.4 x.5 x.6 x.7 x.8 x.9 x.10
1 1 -3.639 0.418 -0.670 1.779 -0.168 1.627 -0.388 0.529 -0.874 -0.814
2 2 -3.327 0.496 -0.694 1.365 -0.265 1.933 -0.363 0.510 -0.621 -0.488
3 3 -2.120 0.894 -1.576 0.147 -0.707 1.559 -0.579 0.676 -0.809 -0.049
4 4 -2.287 1.809 -1.498 1.012 -1.053 1.060 -0.567 0.235 -0.091 -0.795
5 5 -2.598 1.938 -0.846 1.062 -1.633 0.764 0.394 -0.150 0.277 -0.396
6 6 -2.852 1.914 -0.755 0.825 -1.588 0.855 0.217 -0.246 0.238 -0.365
vowel.train$y <- factor(vowel.train$y)  # classification, not regression
set.seed(33833)
fit.rf <- randomForest(y ~ ., data=vowel.train)  # y ~ ., not vowel.train$y ~ ., or y is also used as a predictor
plot(fit.rf)
varImpPlot(fit.rf)

[Figure: plot(fit.rf), error rate versus number of trees]

I was asked to find the order of variable importance, which this graph shows.

[Figure: varImpPlot(fit.rf), variable importance]
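A scikit-learn sketch of the same variable-importance idea on synthetic data (the vowel dataset is not loaded here): feature 0 determines the label, so it should rank first, the way varImpPlot ranks predictors above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(33833)
X = rng.normal(size=(300, 5))
y = (X[:, 0] > 0).astype(int)  # only feature 0 carries signal

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]  # most important first
```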