Principal Component Analysis

This post records my current understanding of Principal Component Analysis. I will update it later.

The code is on GitHub and it runs, but I think the eigenvalues could be wrong. I have to test it further.

These are the two main functions.


import numpy as np
from numpy.linalg import eigh  # assumed source of eigh; scipy.linalg.eigh behaves the same here

# getmean() and validate() are helpers defined elsewhere in the repository.

def estimateCovariance( data ):
    """Compute the covariance matrix for a given dataset.
    """
    print(data)
    mean = getmean( data )
    print(mean)
    # Centre every observation on the mean.
    dataZeroMean = [x - mean for x in data]
    print(dataZeroMean)
    # The covariance matrix is the mean of the outer products of the centred rows.
    covar = [np.outer(x, x) for x in dataZeroMean]
    result = getmean( covar )
    print(result)
    return result

def pca(data, k=2):
    """Computes the top `k` principal components, corresponding scores, and all eigenvalues.
    """
    d = estimateCovariance( data )

    # eigh returns the eigenvalues of a symmetric matrix in ascending order.
    eigVals, eigVecs = eigh(d)

    validate( eigVals, eigVecs )
    # Reorder so that the largest eigenvalue comes first.
    inds = np.argsort(eigVals)[::-1]
    topComponent = eigVecs[:,inds[:k]]
    print('\nTop Component: \n{0}'.format(topComponent))

    correlatedDataScores = [np.dot( x, topComponent ) for x in data]
    print('\nScores : \n{0}'
          .format('\n'.join(map(str, correlatedDataScores))))
    print('\n eigenvalues: \n{0}'.format(eigVals[inds]))
    return topComponent,correlatedDataScores,eigVals[inds]
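
Since the eigenvalues are the part still in doubt, one way to gain confidence is to cross-check them against two independent NumPy routines. The following is a minimal, self-contained sketch and not the code from the repository: the snake_case function names and the random sample data are assumptions made for illustration. It checks that the mean of outer products equals np.cov with bias=True, and that the eigenvalues equal the squared singular values of the centred data divided by the number of rows.

```python
import numpy as np

def estimate_covariance(data):
    """Covariance as the mean outer product of the centred rows (divides by n)."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    return centered.T.dot(centered) / data.shape[0]

def pca(data, k=2):
    """Return the top-k components, the scores and all eigenvalues (descending)."""
    data = np.asarray(data, dtype=float)
    eig_vals, eig_vecs = np.linalg.eigh(estimate_covariance(data))
    order = np.argsort(eig_vals)[::-1]      # eigh sorts ascending, so reverse
    components = eig_vecs[:, order[:k]]
    scores = data.dot(components)           # uncentred data, as in the functions above
    return components, scores, eig_vals[order]

# Made-up data: 200 rows, 3 columns with clearly different spreads.
rng = np.random.RandomState(0)
sample = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])

components, scores, eig_vals = pca(sample, k=2)

# Cross-check 1: the covariance estimate agrees with NumPy's biased estimator.
cov_ok = np.allclose(estimate_covariance(sample),
                     np.cov(sample, rowvar=False, bias=True))

# Cross-check 2: the eigenvalues agree with the singular values of the centred data.
centered = sample - sample.mean(axis=0)
singular = np.linalg.svd(centered, compute_uv=False)
eig_ok = np.allclose(eig_vals, singular ** 2 / sample.shape[0])

print(cov_ok, eig_ok)
print(eig_vals)
```

If both checks print True for real data as well, the eigenvalues themselves are probably fine and any remaining discrepancy is more likely to come from a different normalisation (dividing by n instead of n - 1) in the reference values.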

JPA and Spring @Transactional and JBoss Arquillian

JBoss Arquillian is a test framework that lets us execute tests from the IDE as part of the development process. Its key parts are the deployment API and the container adapters, which deploy our code and run the tests inside a container, automatically and repeatedly.

I have written about Arquillian here.
In this post I will show how a simple Arquillian test for a JPA transaction saves countless wasted hours. I actually spent a few hours finding out that enabling the wrong transaction manager produces log lines almost identical to the section below, which misleads one into thinking that transactions are in effect. With the wrong transaction manager no rows are actually committed to the database, even though the logs show messages that suggest the data was committed.

This is the correct set of log messages that show that JpaTransactionManager takes effect.

DEBUG: org.springframework.transaction.annotation.AnnotationTransactionAttributeSource - Adding transactional method 'TestImpl.test' with attribute: PROPAGATION_REQUIRED,ISOLATION_DEFAULT; ''
DEBUG: org.springframework.orm.jpa.JpaTransactionManager - Creating new transaction with name [com.jpa.test.TestImpl.test]: PROPAGATION_REQUIRED,ISOLATION_DEFAULT; ''
DEBUG: org.hibernate.internal.SessionImpl - Opened session at timestamp: 14421449169
TRACE: org.hibernate.internal.SessionImpl - Setting flush mode to: AUTO
TRACE: org.hibernate.internal.SessionImpl - Setting cache mode to: NORMAL
DEBUG: org.springframework.orm.jpa.JpaTransactionManager - Opened new EntityManager [org.hibernate.ejb.EntityManagerImpl@8f64d] for JPA transaction
DEBUG: org.hibernate.engine.transaction.spi.AbstractTransactionImpl - begin
DEBUG: org.hibernate.engine.jdbc.internal.LogicalConnectionImpl - Obtaining JDBC connection
DEBUG: org.springframework.jdbc.datasource.SimpleDriverDataSource - Creating new JDBC Driver Connection to [jdbc:hsqldb:mem:dataSource]
DEBUG: org.hibernate.engine.jdbc.internal.LogicalConnectionImpl - Obtained JDBC connection
DEBUG: org.hibernate.engine.transaction.internal.jdbc.JdbcTransaction - initial autocommit status: true
DEBUG: org.hibernate.engine.transaction.internal.jdbc.JdbcTransaction - disabling autocommit
DEBUG: org.springframework.orm.jpa.JpaTransactionManager - Exposing JPA transaction as JDBC transaction [org.springframework.orm.jpa.vendor.HibernateJpaDialect$HibernateConnectionHandle@423d24]
TRACE: org.springframework.transaction.support.TransactionSynchronizationManager - Bound value [org.springframework.jdbc.datasource.ConnectionHolder@805780] for key [org.springframework.jdbc.datasource.SimpleDriverDataSource@faa27c] to thread [http-nio-8080-exec-5]
TRACE: org.springframework.transaction.support.TransactionSynchronizationManager - Bound value [org.springframework.orm.jpa.EntityManagerHolder@7a72fc] for key [org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean@1e0cc0a] to thread [http-nio-8080-exec-5]
TRACE: org.springframework.transaction.support.TransactionSynchronizationManager - Initializing transaction synchronization
TRACE: org.springframework.transaction.interceptor.TransactionInterceptor - Getting transaction for [com.jpa.test.TestImpl.test]
INFO : jpa - TransactionSynchronizationManager.isActualTransactionActive()true
INFO : jpa - INMEMORY_DB [id=id, street=Street, area=Area, state=State, country=LO, pin=1]
TRACE: org.springframework.transaction.support.TransactionSynchronizationManager - Retrieved value [org.springframework.orm.jpa.EntityManagerHolder@7a72fc] for key [org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean@1e0cc0a] bound to thread [http-nio-8080-exec-5]
TRACE: org.hibernate.engine.spi.IdentifierValue - ID unsaved-value strategy UNDEFINED
TRACE: org.hibernate.event.internal.AbstractSaveEventListener - Transient instance of: com.jpa.test.INMEMORY_DB
TRACE: org.hibernate.event.internal.DefaultPersistEventListener - Saving transient instance
DEBUG: org.hibernate.event.internal.AbstractSaveEventListener - Generated identifier: id, using strategy: org.hibernate.id.Assigned
TRACE: org.hibernate.event.internal.AbstractSaveEventListener - Saving [com.jpa.test.INMEMORY_DB#id]
TRACE: org.hibernate.engine.spi.ActionQueue - Adding an EntityInsertAction for [com.jpa.test.INMEMORY_DB] object
TRACE: org.hibernate.engine.spi.ActionQueue - Adding insert with no non-nullable, transient entities: [EntityInsertAction[com.jpa.test.INMEMORY_DB#id]]
TRACE: org.hibernate.engine.spi.ActionQueue - Adding resolved non-early insert action.
TRACE: org.hibernate.action.internal.UnresolvedEntityInsertActions - No unresolved entity inserts that depended on [[com.jpa.test.INMEMORY_DB#id]]
TRACE: org.hibernate.action.internal.UnresolvedEntityInsertActions - No entity insert actions have non-nullable, transient entity dependencies.
TRACE: org.springframework.transaction.interceptor.TransactionInterceptor - Completing transaction for [com.jpa.test.TestImpl.test]
DEBUG: org.springframework.orm.jpa.JpaTransactionManager - Initiating transaction commit
DEBUG: org.springframework.orm.jpa.JpaTransactionManager - Committing JPA transaction on EntityManager [org.hibernate.ejb.EntityManagerImpl@8f64d]
DEBUG: org.hibernate.engine.transaction.spi.AbstractTransactionImpl - committing

The source code has a copy of log4j.xml that enables the appropriate logging. But this method is not repeatable: it is hard to check the log messages manually every time we change the configuration or add new code. That is what unit tests are for. Arquillian container tests deploy our code into a container and execute the tests from the IDE, so the developer does not have to deploy and test the code manually. All that is required is a good regression test suite.

Arquillian uses the dependency arquillian-transaction-spring to make the test method transactional.

Some of the dependencies in the pom.xml and in this Arquillian test are redundant, but all the required ones are present.

package com.jpa.test;

import static org.junit.Assert.assertEquals;

import java.io.File;
import java.util.List;
import java.util.logging.Logger;

import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import javax.persistence.TypedQuery;
import javax.transaction.SystemException;

import org.jboss.arquillian.container.test.api.Deployment;
import org.jboss.arquillian.junit.Arquillian;
import org.jboss.arquillian.spring.integration.test.annotation.SpringConfiguration;
import org.jboss.arquillian.transaction.api.annotation.Transactional;
import org.jboss.shrinkwrap.api.ShrinkWrap;
import org.jboss.shrinkwrap.api.formatter.Formatters;
import org.jboss.shrinkwrap.api.spec.JavaArchive;
import org.jboss.shrinkwrap.api.spec.WebArchive;
import org.jboss.shrinkwrap.resolver.api.maven.Maven;
import org.junit.Test;
import org.junit.runner.RunWith;


@RunWith(Arquillian.class)
@SpringConfiguration("applicationContext.xml")
public class ShrinkWrappedJPATest {

	private static Logger l = Logger.getLogger("jpa");
		
	@PersistenceContext(unitName="testingSetup")
	private EntityManager entityManager;

    @Deployment
    public static WebArchive createWebArchive() {

        final WebArchive war = ShrinkWrap.create(WebArchive.class, "ShrinkWrapJPA.war");

        JavaArchive jar = ShrinkWrap.create(JavaArchive.class)
                                    .addPackage("com.jpa.test");

        war.addAsLibrary(jar);
        war.addAsResource("applicationContext.xml");
        war.addAsResource("arquillian.xml");
        war.addAsResource("log4j.xml");
        war.addAsResource("schema.sql");
        war.addAsResource("test-data.sql");
        war.addAsResource("persistence.xml", "META-INF/persistence.xml");
        loadDependencies( war );

        // Log the archive layout so the packaged files can be checked.
        l.info(war.toString(Formatters.VERBOSE));
        return war;
    }

        
    
    
    private static void loadDependencies( final WebArchive war ){

        // Resolve each artifact without its transitive dependencies and add
        // the resulting jars to WEB-INF/lib. Coordinates and versions are unchanged.
        File[] libraries = Maven.resolver().resolve(
                "org.springframework:spring-orm:4.1.6.RELEASE",
                "org.hibernate:hibernate-core:4.1.7.FINAL",
                "org.hibernate:hibernate-entitymanager:4.1.7.FINAL",
                "org.springframework:spring-expression:4.2.0.RELEASE",
                "org.springframework:spring-web:4.2.0.RELEASE",
                "org.springframework:spring-core:4.1.6.RELEASE",
                "org.springframework:spring-context:4.1.6.RELEASE",
                "org.springframework:spring-jdbc:4.1.6.RELEASE",
                "org.springframework:spring-tx:4.1.6.RELEASE",
                "org.hsqldb:hsqldb:2.3.1",
                "commons-dbcp:commons-dbcp:1.4",
                "aopalliance:aopalliance:1.0",
                "org.jboss.arquillian.extension:arquillian-service-deployer-spring-3:1.0.0.Beta3",
                "org.springframework:spring-beans:4.1.6.RELEASE",
                "org.springframework:spring-aop:4.2.0.RELEASE",
                "org.jboss.arquillian.extension:arquillian-transaction-api:1.0.1.Final",
                "org.jboss.arquillian.extension:arquillian-transaction-impl-base:1.0.1.Final")
            .withoutTransitivity().asFile();

        war.addAsLibraries(libraries);

    }
    
    @Test
    @Transactional(manager="transactionManager")
    public void save() throws Exception {

        INMEMORY_DB a = new INMEMORY_DB();
        a.setId("id");
        a.setStreet("Street");
        a.setArea("Area");
        a.setState("State");
        a.setCountry("LO");
        a.setPin(1);
        entityManager.persist( a );
        // JUnit's assertEquals takes the expected value first.
        assertEquals(2, getAddressCount());
    }

 
	public int getAddressCount(){
		TypedQuery<INMEMORY_DB> query =
				entityManager.createQuery("SELECT c FROM INMEMORY_DB c", INMEMORY_DB.class);
		List<INMEMORY_DB> results = query.getResultList();	
		return results.size();
	}

}

Aesthetics of Matplotlib graphs

The graph shown in my earlier post is not clear and it looks wrong. I have improved it to some extent using this code. Matplotlib has many more powerful features than the ones I used earlier. I have commented out the code that annotates and displays the actual points in the graph. I couldn't draw the tick marks properly so that the red graph is clearly shown, because the data range wasn't easy to work with. There is probably some feature that I have not yet explored.


import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

def main():
    columns = ['SecondsSinceLaunch',
               'BeforeSize',
               'AfterSize',
               'TotalSize',
               'RealTime']
    rows = []
    with open("D:\\performance\\data.txt", "r") as f:
        for line in f:
            # Keep the first five whitespace-separated fields of each line.
            rows.append(line.split()[:5])
    # Build the frame once; appending row by row is slow and
    # DataFrame.append was removed in pandas 2.0.
    gclog = pd.DataFrame(rows, columns=columns)
    print(gclog)
    #gclog.time = pd.to_datetime(gclog['SecondsSinceLaunch'], format='%Y-%m-%dT%H:%M:%S.%f')
    # convert_objects() no longer exists; coerce every column to numbers instead.
    gclog = gclog.apply(pd.to_numeric, errors='coerce')
    fig, ax = plt.subplots(figsize=(17, 14), facecolor='white', edgecolor='white')
    ax.tick_params(labelcolor='darkblue', labelsize=10)
    for axis, ticks in [(ax.xaxis, np.arange(10, 8470, 100)), (ax.yaxis, np.arange(10, 9125, 300))]:
        axis.set_ticks_position('none')
        axis.set_ticks(ticks)
        axis.label.set_color('#999999')
    ax.grid(color='#999999', linewidth=1.0, linestyle='-')
    plt.xticks(rotation=70)
    fig.subplots_adjust(bottom=0.15)
    # Hide the frame around the plot.
    for position in ['bottom', 'top', 'left', 'right']:
        ax.spines[position].set_visible(False)
    ax.set_xlabel('AfterSize')
    ax.set_ylabel('TotalSize')
    # set_xlim/set_ylim take only the lower and upper bounds; the tick step is set above.
    ax.set_xlim(10, 8470)
    ax.set_ylim(10, 9125)
    # Sort whole rows by AfterSize so that every x value stays paired with its own y value.
    gclog = gclog.sort_values('AfterSize')
    plt.plot(gclog.AfterSize, gclog.TotalSize, c="red")
#     for i,j in zip(gclog.AfterSize,gclog.TotalSize):
#         ax.annotate('(' + str(i) + ',' + str(j) + ')',xy=(i, j))

    plt.show()
if __name__=="__main__":
    main()
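
One feature that may help with the tick marks is matplotlib.ticker.MaxNLocator, which picks a bounded number of evenly spaced "nice" tick positions from the data range instead of a hand-written np.arange. The sketch below is only an illustration of that idea under stated assumptions: the six (AfterSize, TotalSize) pairs are made up, and the Agg backend is selected so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line to use plt.show()
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import numpy as np

# Made-up (AfterSize, TotalSize) pairs, deliberately unsorted.
after_size = np.array([830.0, 120.0, 4100.0, 2600.0, 8400.0, 560.0])
total_size = np.array([1500.0, 900.0, 5200.0, 3300.0, 9100.0, 1200.0])

# Sort both arrays with the same index so the pairs stay together.
order = np.argsort(after_size)
x, y = after_size[order], total_size[order]

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x, y, c="red")

# Let matplotlib choose roughly ten readable ticks per axis.
ax.xaxis.set_major_locator(MaxNLocator(nbins=10))
ax.yaxis.set_major_locator(MaxNLocator(nbins=10))

ax.margins(0.05)                          # small padding around the data
ax.grid(color="#999999", linewidth=0.5)
ax.set_xlabel("AfterSize")
ax.set_ylabel("TotalSize")

fig.canvas.draw()                         # or fig.savefig(...) to write a file
print(ax.get_xticks())
```

Because the locator adapts to the data range, the same code keeps working when the log file covers a different range of sizes.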

figure_1

In fact, after finishing the Principal Component Analysis section of a Machine Learning course I took recently, I realized that beautiful 3D graphs can be drawn.

cube

Arquillian Unit Tests

I have contributed an article to DZone about Arquillian using this source code.

Aesthetics of ‘R’ Plots

I submitted a fresh pull request to adoptopenjdk to rectify the graphs in parsing-java-micro-benchmarking-harness-data-using-dplyr-part-2 after the lead developer of gs-collections pointed out that the titles of the two graphs were switched.

It is very important that we do not mislead with data.

g <- ggplot(jc, aes(x = IDX, y = V1)) +
		theme_bw() +
		geom_ribbon(aes(ymin = V1 - error, ymax = V1 + error), fill = "lightseagreen",
				linetype = 2, alpha= 0.1) +
		geom_line(color = "lightblue", size = 0.6) +
		geom_errorbar(aes(ymin = V1 - error, ymax = V1 + error), width = 0.1,
				color = "darkorange3") +
		labs(x = "Iterations", y = "Goldmansachs collections") + geom_text(aes(label = V1), size=4, family="Times", lineheight=.8)

ggplotgc

ggplotjc

Task duration estimation using betaPERT distribution and Monte Carlo Analysis

I have been researching this seemingly simple topic for many months. After reading many articles, browsing books and asking other experts, I have a basic idea about it. I also find it ludicrous that the famed project managers at my companies do not seem to know this simple distribution even after appearing for the PMP exams. Many managers here do not care a hoot for such technical items.

According to betaPERT

“The Beta-PERT methodology allows to parametrize a generalized Beta distribution based on expert opinion regarding a pessimistic estimate (minimum value), a most likely estimate (mode), and an optimistic estimate (maximum value).”

I will add more explanations later based on what I understand.


library(ggplot2)

opt <- 50 # Optimistic estimate
lik <- 100 # Likely estimate
pes <- 500 # Pessimistic estimate

lambda <- 4 # PERT weighting for likely

# PERT estimate is then
print( (opt + lambda*lik + pes)/(2 + lambda) )

# Mapping to the beta distribution
s1 <- 1+lambda*(lik-opt)/(pes-opt)
s2 <- 1+lambda*(pes-lik)/(pes-opt)

# Generate 1000 samples from the beta distribution that is scaled to the PERT parameters

persondays <- opt + (pes-opt) * rbeta(1000, s1, s2)

# Look at the output
png("D:/R/persondayssimulation.png")
hist(persondays)
dev.off()

print( summary(persondays))
# Compare to the PERT estimate
print( mean(persondays) )

gg <- ggplot(data.frame(persondays),
		     aes(x = persondays))

gg <- gg + geom_histogram(aes(y = ..density..),
		                  color = "black",
						  fill = "white", 
		                  binwidth = 15)
				  
gg <- gg + geom_density(fill = "mediumvioletred",
		                alpha = 0.5)
gg <- gg + theme_bw() 

ggsave("D:\\R\\density.png",
		width=6,
		height=6,
		dpi=100)
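
For readers who prefer Python, the same simulation can be sketched with NumPy. This is a hedged translation of the R script above rather than a separate method: the estimates and the weighting are copied from it, while the seed and the larger sample size of 100,000 are assumptions chosen so that the simulated mean settles close to the PERT estimate.

```python
import numpy as np

opt, lik, pes = 50.0, 100.0, 500.0   # optimistic, likely, pessimistic estimates
lam = 4.0                            # PERT weighting for the likely estimate

# Classic PERT estimate: (opt + 4 * lik + pes) / 6 when lam is 4.
pert_estimate = (opt + lam * lik + pes) / (2 + lam)

# Shape parameters of the beta distribution on [0, 1].
s1 = 1 + lam * (lik - opt) / (pes - opt)
s2 = 1 + lam * (pes - lik) / (pes - opt)

# Sample the beta distribution and rescale it to [opt, pes].
rng = np.random.default_rng(42)      # seed chosen only for repeatability
persondays = opt + (pes - opt) * rng.beta(s1, s2, size=100_000)

print("PERT estimate :", round(pert_estimate, 2))
print("simulated mean:", round(float(persondays.mean()), 2))
print("90th percentile:", round(float(np.percentile(persondays, 90)), 2))
```

The mean of a beta distribution with these shape parameters is s1 / (s1 + s2), and s1 + s2 equals 2 + lambda, so the rescaled mean works out to exactly the PERT estimate. The simulation adds what the single number cannot give: percentiles for the task duration.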

density

The book I was asked to read to understand Monte Carlo analysis is Introducing Monte Carlo Methods with R.

Credit should go to Mango for sending me a private mail about this.

Parsing Java Micro-benchmarking Harness data using dplyr – Part 2

I will add explanations later because I first have to determine whether the statistical measures calculated are correct. This post is based on the previous blog post.

Update : I think the measures are correctly plotted.

Types of Error bars used to plot the diagram

Error bars | Type | Description
Standard error (SEM) | Inferential | A measure of how variable the mean will be, if you repeat the whole study many times.
Confidence interval (CI), usually 95% CI | Inferential | A range of values you can be 95% confident contains the true mean.
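
To make the two measures in the table concrete, here is a minimal Python sketch. The 20 scores are made-up values standing in for the iterations of one benchmark, and the critical value 2.861 is the two-sided 99% Student t quantile for 19 degrees of freedom, hard-coded so that the sketch needs only the standard library (it corresponds to qt(0.995, df = 19) in R).

```python
import math
import statistics

# Made-up scores standing in for 20 benchmark iterations (not real JMH output).
scores = [101.2, 99.8, 100.5, 102.1, 98.7, 100.9, 101.6, 99.3, 100.1, 100.8,
          99.5, 101.0, 100.3, 98.9, 101.8, 100.6, 99.9, 100.2, 101.4, 99.6]

n = len(scores)
mean = statistics.mean(scores)
sd = statistics.stdev(scores)        # sample standard deviation, like sd() in R

# Standard error of the mean: how much the mean itself would vary.
sem = sd / math.sqrt(n)

# 99% confidence interval for the mean, valid for n = 20 only.
T_CRIT_99_DF19 = 2.861
half_width = T_CRIT_99_DF19 * sem
ci_low, ci_high = mean - half_width, mean + half_width

print("mean = %.3f, sd = %.3f, sem = %.3f" % (mean, sd, sem))
print("99%% CI for the mean: [%.3f, %.3f]" % (ci_low, ci_high))
```

The interval describes uncertainty about the mean, so it forms a single band around the mean rather than a band around each individual measurement.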

The parsing will not work if JMH changes the default format of the output file.

library(stringr)
library(dplyr)
library(ggplot2)

data <- read.table("D:\\jmh\\jmh.txt",sep="\t")

final <-data %>%
	    select(V1) %>%	
		filter(grepl("^Iteration", V1)) %>%  
        mutate(V1 = str_extract(V1, "\\d+\\.\\d*"))

final <- mutate(final,IDX = 1:n())

jc <- final %>%
		filter(IDX < 21)


gc <- final %>%
		filter(IDX > 20)

gc <- mutate(gc,IDX = 1:n())

jc <- data.frame(sapply(jc, function(x) as.numeric(as.character(x))))
gc <- data.frame(sapply(gc, function(x) as.numeric(as.character(x))))


print(summary(jc$V1))
error <- qt(0.995,df=length(jc$V1)-1)*sd(jc$V1)/sqrt(length(jc$V1))
error1 <- mean(jc$V1)-error
error2 <- mean(jc$V1)+error

q <- qplot(geom = "line",jc$IDX,jc$V1, colour='red')+geom_errorbar(aes(x=jc$IDX, ymin=jc$V1-sd(jc$V1), ymax=jc$V1+sd(jc$V1)), width=0.25)+ 
		geom_ribbon(aes(x=jc$IDX, y=jc$V1, ymin=error1, ymax=error2),fill="ivory2",alpha = 0.4)+ 
		xlab('Iterations') + ylab("Java Collections")+theme_bw() 

ggsave("D:\\jmh\\jc.png", width=6, height=6, dpi=100)

#Using error <- qt(0.995,df=length(jc$V1)-1)*sd(jc$V1)/sqrt(length(jc$V1)) 
g <- ggplot(jc, aes(x = IDX, y = V1)) +
		theme_bw() +
		geom_ribbon(aes(ymin = V1 - error, ymax = V1 + error), fill = "gray60",
				alpha = 0.3) +
		geom_line(color = "blue", size = 1) +
		geom_errorbar(aes(ymin = V1 - error, ymax = V1 + error), width = 0.25,
				color = "red") +
		labs(x = "Iterations", y = "Java collections")

ggsave("D:\\jmh\\ggplotjc.png", width=6, height=6, dpi=100)


print(summary(gc$V1))
error <- qt(0.995,df=length(gc$V1)-1)*sd(gc$V1)/sqrt(length(gc$V1))
error1 <- mean(gc$V1)-error
error2 <- mean(gc$V1)+error

q1 <- qplot(geom = "line",gc$IDX,gc$V1, colour='red')+geom_errorbar(aes(x=gc$IDX, ymin=gc$V1-sd(gc$V1), ymax=gc$V1+sd(gc$V1)), width=0.25)+ 
		geom_ribbon(aes(x=gc$IDX, y=gc$V1, ymin=error1, ymax=error2),fill="ivory2",alpha = 0.4)+ 
		xlab('Iterations') + ylab("Goldmansachs Collections")+theme_bw() 


ggsave("D:\\jmh\\gc.png", width=6, height=6, dpi=100)

#Using error <- qt(0.995,df=length(gc$V1)-1)*sd(gc$V1)/sqrt(length(gc$V1)) 
g1 <- ggplot(gc, aes(x = IDX, y = V1)) +
		theme_bw() +
		geom_ribbon(aes(ymin = V1 - error, ymax = V1 + error), fill = "gray60",
				alpha = 0.3) +
		geom_line(color = "blue", size = 1) +
		geom_errorbar(aes(ymin = V1 - error, ymax = V1 + error), width = 0.25,
				color = "red") +
		labs(x = "Iterations", y = "Goldmansachs collections")

ggsave("D:\\jmh\\ggplotgc.png", width=6, height=6, dpi=100)
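
The extraction step of the script above can also be sketched in Python. The sample lines are illustrative only; they assume, as the R code does, that every measurement line starts with "Iteration" and contains one decimal score. The function name and the per_benchmark parameter are hypothetical.

```python
import re

# First decimal number on the line, the same pattern as in the R script.
SCORE = re.compile(r"\d+\.\d*")

def parse_iterations(lines, per_benchmark):
    """Collect the score of every 'Iteration' line and split the scores
    into consecutive groups of per_benchmark values."""
    scores = []
    for line in lines:
        if line.startswith("Iteration"):
            match = SCORE.search(line)
            if match:
                scores.append(float(match.group()))
    return [scores[i:i + per_benchmark]
            for i in range(0, len(scores), per_benchmark)]

# Illustrative input: two benchmarks with two measured iterations each.
sample = [
    "# Warmup Iteration   1: 90.100 ops/s",
    "Iteration   1: 101.250 ops/s",
    "Iteration   2: 99.875 ops/s",
    "Result: 100.563 ops/s",
    "Iteration   1: 250.500 ops/s",
    "Iteration   2: 251.000 ops/s",
]

first, second = parse_iterations(sample, per_benchmark=2)
print(first)
print(second)
```

Splitting by a fixed count mirrors the IDX < 21 and IDX > 20 filters in the R code, so it shares the same weakness: it breaks if the number of iterations per benchmark changes.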

jc

gc

The R user forum suggested the following code to improve the aesthetics of the plot. The 99% confidence interval shown in the plots above is not correct, but the curves and error bars are correct.

g1 <- ggplot(gc, aes(x = IDX, y = V1)) +
		theme_bw() +
		geom_ribbon(aes(ymin = V1 - error, ymax = V1 + error), fill = "gray60",
				alpha = 0.3) +
		geom_line(color = "blue", size = 1) +
		geom_errorbar(aes(ymin = V1 - error, ymax = V1 + error), width = 0.25,
				color = "red") +
		labs(x = "Iterations", y = "Goldmansachs collections")

ggplot creates these two graphs, so we should use the ggplot code instead of the qplot code.

ggplotjc

ggplotgc

Update : See this