Porting Python BoxPlot code to “R”

My previous two entries explained how I am attempting to port the Python code used to create the graph in the DZone article to “R”.

The author of that article published the data that he used to generate the graph. The data consists of several files. I have taken one file and tried to create boxplots from it. I will improve this by combining all the files, parsing them and generating a combined boxplot. But first I coded this “R” script to parse one file and generate a boxplot.

The data from one of the files looks like this.


timeStamp,elapsed,label,responseCode,responseMessage,threadName,dataType,success,Latency
1346999466187,32,Home page - anon,200,OK,Anonymous Browsing 1-2,text,true,31
1346999466182,37,Login form,200,OK,Node save 3-1,text,true,36
1346999466184,35,Home page - anon,200,OK,Anonymous Browsing 1-11,text,true,32
1346999466182,37,Home page - anon,200,OK,Anonymous Browsing 1-1,text,true,34
1346999466189,30,Home page - anon,200,OK,Anonymous Browsing 1-4,text,true,27
1346999466185,46,Home page - anon,200,OK,Anonymous Browsing 1-5,text,true,34
1346999466185,44,Search,200,OK,Search 4-1,text,true,35
1346999466188,28,Home page - anon,200,OK,Anonymous Browsing 1-3,text,true,26
1346999466182,33,Home page - anon,200,OK,Anonymous Browsing 1-7,text,true,32
1346999466182,36,Login Form,200,OK,Perform Login/View Account 5-1,text,true,35
1346999466182,35,Home page - anon,200,OK,Anonymous Browsing 1-10,text,true,33
1346999466182,34,Login Form,200,OK,Authenticated Browsing 2-1,text,true,32
1346999466184,33,Home page - anon,200,OK,Anonymous Browsing 1-6,text,true,31
1346999466182,37,Home page - anon,200,OK,Anonymous Browsing 1-9,text,true,35

It is very easy to parse this and create an “R” data frame.


# TODO: Box Plots
# 
# Author: radhakrishnan
###############################################################################


library(plyr)

options("scipen"=100, "digits"=4)

data <- read.table("~/Documents/Learn R Statistics/R/jmeter_results/4-overall-summary.csv",sep=",",header=T)

head(data)

#I don't think I need 'ddply' here but it serves the purpose.
#It groups the data based on the 'label' and returns the two relevant columns
data <- ddply( data , .variables = "label" , .fun = function(x) x[,c("label","elapsed")])

uniquelables <- as.character(unique(data$label))
lists <- replicate( length(uniquelables),list())

j = 1


for (i in uniquelables ){
  lists[j] = as.list(as.data.frame(data[data$label %in% i,'elapsed']))
  j = j + 1
}

boxplot(lists)
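
The grouping loop above can also be written with base R alone. This is a minimal sketch, assuming the same 'label' and 'elapsed' columns; the small inline data frame is made up and only stands in for the JMeter file.

```r
# Hypothetical stand-in for the JMeter results file (same column names).
results <- data.frame(
  label   = c("Home page - anon", "Login form", "Home page - anon", "Search", "Search"),
  elapsed = c(32, 37, 35, 44, 40)
)

# split() returns a named list with one vector of 'elapsed' values per label.
elapsed_by_label <- split(results$elapsed, results$label)

# boxplot() accepts such a list directly; the list names become the axis labels.
boxplot(elapsed_by_label, ylab = "elapsed")
```

Because the list is named, each box is labelled with its request type, which the numbered list built by the loop does not give.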

Figure: BoxPlot from one file
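
The planned next step, combining all the files into one boxplot, could look roughly like this. It is only a sketch: 'read_all_results' is a hypothetical helper, and it assumes that every file in the directory is a CSV with the header shown earlier.

```r
# Read every CSV file in a directory and stack the rows into one data frame.
# 'read_all_results' is a hypothetical helper name.
read_all_results <- function(dir) {
  files  <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  frames <- lapply(files, read.csv, stringsAsFactors = FALSE)
  do.call(rbind, frames)
}

# Usage, assuming a directory of result files with identical headers:
# combined <- read_all_results("jmeter_results")
# boxplot(elapsed ~ label, data = combined)
```

rbind() requires the same columns in every file, so a file with a different header would stop the script rather than silently produce a wrong plot.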

Error bars using ‘R’

I believe our measurements are uncertain and we need to show the errors in our capacity measurement plots. I suspect that we are making fundamental mistakes in how we gather performance statistics and draw graphs, which is all the more reason to show these uncertainties. Our management and clients should not be misled by the lack of skills of our capacity planners.

This code and the graph are used to learn one aspect of showing such errors. I have yet to investigate the types of errors and their statistical significance.

If there is a mistake I will make corrections to this blog entry.

Updated : Code and graph.

 this.dir <- dirname(parent.frame(2)$ofile) 
setwd(this.dir)
 #Reference values plotted on x-axis. These are constant.
 #These values could be time of day. So every day at the same
 #time we could collect other measurements
 referenceset <- data.frame(c(5,10,15,20,25,30,35,40,50,60))
 colnames( referenceset) <- c("reference")

 #These are the sets of measurements. So every day at the same
 #time we could collect several samples. This is simulated now.
 sampleset <- data.frame( matrix(sample(1:2, c(20000), replace = TRUE), ncol = 2000) )
 
 sampleset <- cbind( sampleset, referenceset )
 
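 #Names of the sample columns only (everything except 'reference')
 samplecols <- setdiff(colnames(sampleset), "reference")
 n <- length(samplecols)

 #Calculate mean of each row, i.e. of all samples taken at one reference value
 sampleset$mean <- apply(sampleset[,samplecols],1,mean)

 #Calculate Standard Deviation of each row
 sampleset$sd <- apply(sampleset[,samplecols],1,sd)

 #Calculate Standard Error
 sampleset$se <- sampleset$sd / sqrt(n)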
 
 #print(sampleset)

png("errorbars.png", width = 500, height = 510)

plot( sampleset$reference,
      sampleset$mean,
      las=1,
      ylab="Mean of 'y' values",
      xlab="x",
      ylim=c(0,3),
      type="l",
      lwd=1,
      col="blue"
    )

arrows(sampleset$reference,
       sampleset$mean-sampleset$se,
       sampleset$reference,
       sampleset$mean+sampleset$se,
       code = 3,
       angle=90,
       length=0.2)

dev.off()


Figure: errorbars.png
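
The bars drawn here are standard errors: the sample standard deviation divided by the square root of the number of samples. The sketch below works through that on a handful of made-up measurements; 'standard_error' is a helper name introduced only for this example.

```r
# Standard error of the mean: sample standard deviation over sqrt(n).
standard_error <- function(x) sd(x) / sqrt(length(x))

# Made-up measurements taken at one reference point.
measurements <- c(2, 4, 4, 4, 5, 5, 7, 9)

m  <- mean(measurements)
se <- standard_error(measurements)

# Lower end, centre and upper end of the error bar.
c(lower = m - se, mean = m, upper = m + se)
```

Collecting more samples at the same reference point shrinks the bar, since the standard error falls with the square root of the sample count.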

Videos and articles to view and read

    Memory Barriers: a Hardware View for Software Hackers
    http://shipilev.net/pub/talks/devoxx-Nov2013-benchmarking.pdf
    Caching in: understand, measure and use your CPU Cache more effectively
    http://lwn.net/Articles/552095/

Performance tuning article

The Java magazine carries an article on Performance tuning.

When I read this line I stopped and posted it here. This is exactly what is happening in our production systems. Aha !

The bottleneck is almost certainly where you think it is.
Trust in your amazing analytical skills—they will lead you to the right culprit, which is usually any code that wasn’t written by you.

Calculating memorypool sizes

I have chosen some values from these guidelines from Charlie Hunt’s book. This is the latest ‘R’ code. The last blog entry has the old code.

I am using some of these general rules just as a foundation for further calculation. Generally our capacity planning teams do not have any baseline. I have not investigated the actual justification for some of these figures.

Guidelines for Calculating Java Heap Sizing

Java heap - 3x to 4x old generation space occupancy
after full garbage collection

Permanent Generation - 1.2x to 1.5x permanent generation space
occupancy after full garbage collection

Young Generation - 1x to 1.5x old generation space
occupancy after full garbage collection

Old Generation - 2x to 3x old generation space occupancy
after full garbage collection (size implied from overall Java heap size
minus the young generation size)
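
As a worked example, the guidelines above can be turned into a small table of lower and upper bounds. The occupancy figures are made up for illustration and 'recommend' is a hypothetical helper; only the multipliers come from the guidelines.

```r
# Occupancy after a full garbage collection, in MB (made-up values).
oldgen_mb  <- 30
permgen_mb <- 40

# Lower and upper recommendation for a pool, given its multipliers.
recommend <- function(occupancy_mb, low, high) {
  c(low = occupancy_mb * low, high = occupancy_mb * high)
}

sizes <- rbind(
  "Java heap"            = recommend(oldgen_mb, 3, 4),
  "Permanent generation" = recommend(permgen_mb, 1.2, 1.5),
  "Young generation"     = recommend(oldgen_mb, 1, 1.5),
  "Old generation"       = recommend(oldgen_mb, 2, 3)
)
print(sizes)
```

In the graphs the occupancy would be the mean over all 'Full GC' lines rather than a single number.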


library(stringr)

this.dir <- dirname(parent.frame(2)$ofile) 
setwd(this.dir)

data <- read.table("D:\\GC Analysis\\gc.log-node1",sep="\n")

parse.mean <- function(){
	fullgc.timestamp( fullgc.read() )
}


# Grep Full GC lines
fullgc.read <- function(){
	return (data$V1[ grep ("(.*)Full GC(.*)", data[,1])])
}

fullgc.data <- function(timedata,memorydata){

   memorydata["Time"] <- "NA"
   memorydata$Time <- unlist(timedata)
   return (fixdate(memorydata))
}

fullgc.timestamp <- function(input){

	time <- str_extract(input,"T[^.]*")
	timeframex<-data.frame(time)
	timeframey<-subset(timeframex, timeframex[1] != "NA")
	timeframey<-substring(timeframey$time,2,9)
	timeframey <- data.frame(timeframey)
	colnames( timeframey ) <- c("Time")
    return (timeframey)	
}

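fullgc.memorypool.mean <- function(input){
	data <- str_extract(input,"PSYou.*[/)]")
	data <- str_extract_all(data,"[0-9]+")
	# Each Full GC line yields 12 numbers: before, after and capacity (in KB)
	# for the young generation, old generation, whole heap and permanent generation
	data <- data.frame(matrix(unlist(data),ncol=12,byrow=T))
	colnames(data)[2] <- c("YoungGenAfterFullGC")
	# Column 5 is the old generation after the collection; column 8 is the whole heap
	colnames(data)[5] <- c("OldGenAfterFullGC")
	colnames(data)[11] <- c("PermGenAfterFullGC")
	return (data)
}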

fixdate <- function(filtereddata){

	# Parse the time of day. POSIXct values are plain seconds, so they are
	# safe to compare and to shift by whole days.
	rawtime <- as.POSIXct(strptime(filtereddata$Time,"%H:%M:%S"))
	filtereddata$Time1 <- rawtime

	# 'Time' holds only the time of day. Whenever it goes backwards the log
	# has crossed midnight, so this entry and all later ones move forward a day.
	offset <- 0
	if (length(rawtime) > 1) {
		for(timedate in 2:length(rawtime)) {
			if(rawtime[timedate] < rawtime[timedate-1]){
				offset <- offset + 86400
			}
			filtereddata$Time1[timedate] <- rawtime[timedate] + offset
		}
	}
	return(filtereddata)
}

fullgc.youngandoldgen.graph <- function(filtereddata){
	print(filtereddata)

	# The log reports occupancy in KB; convert to MB to match the axis label.
	# as.character() first, so this works whether the columns are factors or strings.
	younggen <- as.numeric(as.character(filtereddata$YoungGenAfterFullGC)) / 1024
	oldgen   <- as.numeric(as.character(filtereddata$OldGenAfterFullGC)) / 1024
	oldgenmean <- mean(oldgen)

	png("younggenrecommendation.png", width = 1700, height = 510)
	options("scipen"=100, "digits"=4)
	plot( filtereddata$Time1,
	      younggen,
	      col="darkblue",type="l",
	      ylab="MB",
	      xlab="",
	      las=2,
	      lwd=5.5,
	      cex.lab=1,
	      cex.axis=0.8,
	      xaxt="n",
	      ylim=range(c(younggen, oldgen)))
	axis(1, at = as.numeric(filtereddata$Time1), labels = filtereddata$Time, las = 2,cex.axis=1.2)
	mtext("Time", side=1, line=-1, cex=1.3)
	abline(h=oldgenmean,col="yellow")
	title("Younggen(1.5*Mean of Oldgen) and Total heap(4 * oldgen mean) recommendation",cex.main=1.5)
	box("figure", col="blue")
	points( filtereddata$Time1, oldgen, col="yellow", lwd=5.5, type="l")
	legend("topleft", lty=c(1,1),lwd=c(3.5,3.5),
	       c(paste("Younggen = Oldgenmean * 1.5 = ",
	               signif(oldgenmean,digits=2)," * 1.5 = ",
	               signif(oldgenmean*1.5,digits=2),"MB"),
	         paste("Total heap ",signif(oldgenmean*4,digits=2),"MB")),
	       fill=c("darkblue"))

	dev.off()
}

fullgc.permandoldgen.graph <- function(filtereddata){

	# The log reports occupancy in KB; convert to MB to match the axis label.
	# as.character() first, so this works whether the columns are factors or strings.
	permgen <- as.numeric(as.character(filtereddata$PermGenAfterFullGC)) / 1024
	oldgen  <- as.numeric(as.character(filtereddata$OldGenAfterFullGC)) / 1024

	png("permgenrecommendation.png", width = 1700, height = 510)
	options("scipen"=100, "digits"=4)
	plot( filtereddata$Time1,
	      permgen,
	      col="darkblue",type="l",
	      ylab="MB",
	      xlab="",
	      las=2,
	      lwd=5.5,
	      cex.lab=1,
	      cex.axis=0.8,
	      xaxt="n",
	      ylim=range(c(permgen, oldgen)))
	axis(1, at = as.numeric(filtereddata$Time1), labels = filtereddata$Time, las = 2,cex.axis=1.2)
	mtext("Time", side=1, line=-1, cex=1.3)
	abline(h=mean(oldgen),col="yellow")
	abline(h=mean(permgen),col="darkblue")
	title("Permgen(1.5*mean)/Oldgen(2*mean)  recommendation",cex.main=1.5)
	box("figure", col="blue")
	points( filtereddata$Time1, oldgen, col="yellow", lwd=5.5, type="l")
	# Legend colours follow the plotted lines: old gen is yellow, perm gen is dark blue
	legend("topleft", lty=c(1,1),lwd=c(3.5,3.5),
	       c(paste("Oldgen ",signif(mean(oldgen)*2,digits=2),"MB"),
	         paste("Permgen ",signif(mean(permgen)*1.5,digits=2),"MB")),
	       fill=c("yellow","darkblue"))

	dev.off()
}

Figure: permgenrecommendation.png

Figure: younggenrecommendation.png

Permanent/old generation space occupancy

Figure: permgenrecommendation.png

This graph is generated from the garbage collection logs using ‘R’. The legend shows the general
recommendation based on the mean occupancy (Charlie Hunt’s book on Java performance).

  1. 1.2x to 1.5x permanent generation space occupancy after full garbage collection
  2. 2x to 3x old generation space occupancy after full garbage collection

Some values of time in the x-axis are too close together for reasons that I couldn’t understand. My previous blog entry has the old code and graph. I will post the modified code.
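
The x-axis depends on turning time-of-day stamps into one continuously increasing time line, even when the log runs past midnight. Below is a minimal sketch of that step in isolation, assuming the events are in chronological order and never more than 24 hours apart. 'to_seconds' and 'unwrap_days' are names introduced here and are not part of the posted code.

```r
# Convert "HH:MM:SS" strings to seconds since midnight.
to_seconds <- function(hms) {
  parts <- matrix(as.numeric(unlist(strsplit(hms, ":"))), ncol = 3, byrow = TRUE)
  parts[, 1] * 3600 + parts[, 2] * 60 + parts[, 3]
}

# Shift every event by one day for each midnight crossed before it.
unwrap_days <- function(seconds_of_day) {
  # A negative step between consecutive events means a new day started.
  wraps <- c(0, cumsum(diff(seconds_of_day) < 0))
  seconds_of_day + wraps * 86400
}

times <- c("23:50:00", "23:59:30", "00:05:00", "12:00:00", "00:00:10")
unwrap_days(to_seconds(times))
```

The result is strictly increasing only if the comparison uses the raw time of day of neighbouring events; comparing against an already shifted value would add a day at every step.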

Analyzing the garbage collection log using ‘R’

I know that I am publishing ‘R’ code like this in a hurry. But I plan to add more explanations later on. The comments in the code are missing.

I obtained a garbage collection log from a production JVM and isolated the ‘Full GC’ lines. The goal is to draw graphs of utilization, find the mean and recommend a size for the memory pools. I refer to Charlie Hunt’s book on Java Performance.


2013-10-04T19:36:16.497+0530: 2.152: [GC [PSYoungGen: 65537K->3456K(382272K)] 65537K->3456K(1256128K), 0.0090610 secs] [Times: user=0.04 sys=0.00, real=0.00 secs]
2013-10-04T19:36:16.506+0530: 2.161: [Full GC (System) [PSYoungGen: 3456K->0K(382272K)] [PSOldGen: 0K->3285K(873856K)] 3456K->3285K(1256128K) [PSPermGen: 12862K->12862K(25984K)], 0.0512630 secs] [Times: user=0.05 sys=0.00, real=0.06 secs]
2013-10-04T19:36:25.830+0530: 11.485: [GC [PSYoungGen: 327680K->22711K(382272K)] 330965K->25997K(1256128K), 0.0234720 secs] [Times: user=0.15 sys=0.02, real=0.02 secs]
2013-10-04T19:36:26.910+0530: 12.565: [GC [PSYoungGen: 84665K->28038K(382272K)] 87950K->31324K(1256128K), 0.0214230 secs] [Times: user=0.20 sys=0.03, real=0.02 secs]
2013-10-04T19:36:26.931+0530: 12.586: [Full GC (System) [PSYoungGen: 28038K->0K(382272K)] [PSOldGen: 3285K->30437K(873856K)] 31324K->30437K(1256128K) [PSPermGen: 39212K->39212K(69504K)], 0.2616280 secs] [Times: user=0.25 sys=0.01, real=0.26 secs]
2013-10-04T19:36:39.614+0530: 25.269: [GC [PSYoungGen: 327680K->15462K(382272K)] 358117K->45900K(1256128K), 0.0361220 secs] [Times: user=0.34 sys=0.00, real=0.03 secs]
2013-10-04T19:36:45.141+0530: 30.796: [GC [PSYoungGen: 343142K->30879K(382272K)] 373580K->61317K(1256128K), 0.1013610 secs] [Times: user=1.30 sys=0.00, real=0.10 secs]
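
Each ‘Full GC’ line in this sample carries twelve pool numbers: before, after and capacity for the young generation, the old generation, the whole heap and the permanent generation. The sketch below pulls them out of one of the lines above with base R only, so that it runs without extra packages. The column names are chosen here for illustration.

```r
line <- "2013-10-04T19:36:16.506+0530: 2.161: [Full GC (System) [PSYoungGen: 3456K->0K(382272K)] [PSOldGen: 0K->3285K(873856K)] 3456K->3285K(1256128K) [PSPermGen: 12862K->12862K(25984K)], 0.0512630 secs] [Times: user=0.05 sys=0.00, real=0.06 secs]"

# Keep the part from PSYoungGen up to the last closing parenthesis.
pools <- regmatches(line, regexpr("PSYoungGen.*\\)", line))

# Pull out every run of digits, in order of appearance (values are in KB).
numbers <- as.numeric(regmatches(pools, gregexpr("[0-9]+", pools))[[1]])

names(numbers) <- c("YoungBefore", "YoungAfter", "YoungCapacity",
                    "OldBefore",   "OldAfter",   "OldCapacity",
                    "HeapBefore",  "HeapAfter",  "HeapCapacity",
                    "PermBefore",  "PermAfter",  "PermCapacity")

numbers[c("YoungAfter", "OldAfter", "PermAfter")]
```

In this line the old generation and the whole heap hold the same amount after the collection because the young generation is empty, so the two positions are easy to mix up.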


library(stringr)

this.dir <- dirname(parent.frame(2)$ofile) 
setwd(this.dir)

data <- read.table("D:\\GC Analysis\\gc.log-node1",sep="\n")
#print(data)

parse.mean <- function(){
	fullgc.timestamp( fullgc.read() )
}


# Grep Full GC lines
fullgc.read <- function(){
	return (data$V1[ grep ("(.*)Full GC(.*)", data[,1])])
}

fullgc.data <- function(timedata,memorydata){

   memorydata["Time"] <- "NA"
   memorydata$Time <- unlist(timedata)
   return (memorydata)
}

fullgc.timestamp <- function(input){

	time <- str_extract(input,"T[^.]*")
	timeframex<-data.frame(time)
	timeframey<-subset(timeframex, timeframex[1] != "NA")
	timeframey<-substring(timeframey$time,2,9)
	timeframey <- data.frame(timeframey)
	colnames( timeframey ) <- c("Time")
    return (timeframey)	
}

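fullgc.oldgen.mean <- function(input){
	data <- str_extract(input,"PSOld.*[/)]")
	data <- str_extract_all(data,"[0-9]+")
	# Each Full GC line yields 9 numbers: before, after and capacity (in KB)
	# for the old generation, whole heap and permanent generation
	data <- data.frame(matrix(unlist(data),ncol=9,byrow=T))
	# Column 2 is the old generation after the collection; column 5 is the whole heap
	colnames(data)[2] <- c("OldGenAfterFullGC")
	colnames(data)[8] <- c("PermGenAfterFullGC")
	return (data)
}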

subset.removezerohours <- function( input ){
	return( subset(input,as.POSIXlt(strptime(input$Time,"%H:%M:%S"))$hour != 0))
}

fullgc.permgen.graph <- function(filtereddata){

	png("permgenrecommendation.png", width = 1720, height = 490)

	# Parse the time of day. POSIXct values are plain seconds, so they are
	# safe to compare and to shift by whole days.
	rawtime <- as.POSIXct(strptime(filtereddata$Time,"%H:%M:%S"))
	filtereddata$Time1 <- rawtime
	print(filtereddata)

	# Whenever the time of day goes backwards the log has crossed midnight,
	# so this entry and all later ones move forward a day.
	offset <- 0
	if (length(rawtime) > 1) {
		for(timedate in 2:length(rawtime)) {
			if(rawtime[timedate] < rawtime[timedate-1]){
				offset <- offset + 86400
			}
			filtereddata$Time1[timedate] <- rawtime[timedate] + offset
		}
	}

	# The log reports occupancy in KB; convert to MB to match the axis label.
	# as.character() first, so this works whether the column is a factor or strings.
	permgen <- as.numeric(as.character(filtereddata$PermGenAfterFullGC)) / 1024

	plot( filtereddata$Time1,
	      permgen,
	      col="darkblue",type="l",
	      ylab="MB",
	      xlab="",
	      las=2,
	      lwd=5.5,
	      cex.lab=1,
	      cex.axis=1,
	      xaxt="n")
	axis(1, at = as.numeric(filtereddata$Time1), labels = filtereddata$Time, las = 2,cex.axis=1.3)
	mtext("Time", side=1, line=-1, cex=1.3)
	title("Permgen mean and recommendation",cex.main=1.5)

	dev.off()
}