Processed 0.25 TB on Amazon EMR clusters

I did that by provisioning 1 m1.medium master node and 15 m1.xlarge core nodes. This is easy and relatively cheap.
Since I work with Pig, I don't have to design my MapReduce jobs myself, though I still have to learn how to code MR jobs directly at some point.

This command stores the result in a file. I used to count the records in the file separately, but I realized I don't have to, because the command itself prints how many records it writes.

store variable INTO '/user/hadoop/file' USING PigStorage();

Pig JOIN

This execution cost me $1.76 for about an hour. The number of machines is the same as in the previous post.

-- ntriples (subject, predicate, object) comes from the load in the previous post
X = FILTER ntriples BY (subject matches '.*business.*');
-- copy of X with renamed fields so the self-join's columns don't collide
y = FOREACH X GENERATE subject AS subject2, predicate AS predicate2, object AS object2 PARALLEL 50;
-- self-join on subject, then drop duplicate join results
j = JOIN X BY subject, y BY subject2 PARALLEL 50;
j = DISTINCT j PARALLEL 50;
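
The joined relation then needs a STORE (or DUMP) to actually run; a minimal sketch, assuming the output went to the 'join-results' path that the counting script below loads:

-- assumed output step; 'join-results' is the path the counting script below reads
STORE j INTO 'join-results' USING PigStorage();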

[Screenshot]

Counting the records in the file.

-- group everything into a single bag and count the tuples
FILE = LOAD 'join-results';
FILE_C = GROUP FILE ALL;
FILE_COUNT = FOREACH FILE_C GENERATE COUNT(FILE);
DUMP FILE_COUNT;  -- print the count

Cluster configuration

[Screenshot]

So this is the real deal. The Pig Job mentioned in the previous post failed when the actual file was processed on the EMR cluster. It succeeded only after I resized the cluster and added more heap space.

I used 1 m1.small master node, 10 m1.small core nodes, and 5 m1.small task nodes. I don't think this many nodes are needed to process this file; the increased heap alone, without the task nodes, would probably have been sufficient.
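
As an aside, the per-task heap can also be raised from inside a Pig script instead of through the cluster settings; a minimal sketch, assuming the Hadoop 1 style property name that EMR AMIs of this vintage typically use:

-- assumed property name for older Hadoop/EMR versions; raises the child JVM heap of each task
SET mapred.child.java.opts '-Xmx1024m';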

[Screenshots]

Big Data analysis on the cloud

I was given this dataset (http://km.aifb.kit.edu/projects/btc-2010/). I believe it is RDF. More importantly, I executed some Pig jobs on it locally, and this is how it worked for me. The main point here is how it helped me learn about Pig and the MapReduce jobs it generates.

The data is in quads like this.

<http://openean.kaufkauf.net/id/businessentities/GLN_7654990000088> <http://www.w3.org/2000/01/rdf-schema#isDefinedBy> <http://openean.kaufkauf.net/id/businessentities/> <http://openean.kaufkauf.net/id/businessentities/GLN_6406510000068> .
<http://openean.kaufkauf.net/id/businessentities/GLN_3521100000068> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/goodrelations/v1#BusinessEntity> <http://openean.kaufkauf.net/id/businessentities/GLN_6406510000068> .
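
A minimal sketch of loading quads like these in Pig, assuming every term is separated by a single space and using a placeholder input path (real N-Quads literals can contain spaces, so this is only an approximation):

-- placeholder path; the trailing "." of each quad lands in a dummy field
quads = LOAD 'btc-2010-sample.nq' USING PigStorage(' ')
        AS (subject:chararray, predicate:chararray, object:chararray, context:chararray, dot:chararray);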

After processing by another Pig script, this is the data I started working with:

(<http://openean.kaufkauf.net/id/businessentities/GLN_7612688000000>,3)
(<http://openean.kaufkauf.net/id/businessentities/GLN_7615990000096>,3)
(<http://openean.kaufkauf.net/id/businessentities/GLN_7634640000088>,3)
(<http://openean.kaufkauf.net/id/businessentities/GLN_7636150000008>,3)
(<http://openean.kaufkauf.net/id/businessentities/GLN_7636690000018>,3)
(<http://openean.kaufkauf.net/id/businessentities/GLN_7654990000088>,1)
(<http://openean.kaufkauf.net/id/businessentities/GLN_7657220000032>,3)
(<http://openean.kaufkauf.net/id/businessentities/GLN_7658940000098>,3)
(<http://openean.kaufkauf.net/id/businessentities/GLN_7659150000014>,3)
(<http://openean.kaufkauf.net/id/businessentities/GLN_7662880000018>,3)
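
I have not included that earlier script here, but it probably grouped the triples by object and counted them; a rough sketch, reusing an ntriples relation with subject/predicate/object fields like the one in the join script above:

-- hypothetical reconstruction of the earlier counting step
by_object = GROUP ntriples BY object;
count_by_object = FOREACH by_object GENERATE group, COUNT(ntriples) AS count;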

The schema of the data is like this.


count_by_object: {group: chararray,count: long}

x = GROUP count_by_object BY count;
y = FOREACH x GENERATE group, COUNT(count_by_object);

Line 1 shown above groups the tuples by the count. This is what I get.

(1,{(<http://openean.kaufkauf.net/id/businessentities/GLN_7654990000088>,1)})
(3,{(<http://openean.kaufkauf.net/id/businessentities/GLN_0000049021028>,3),(<http://openean.kaufkauf.net/id/businessentities/GLN_0000054110120>,3),(<http://openean.kaufkauf.net/id/businessentities/GLN_0078477000014>,3),(<http://openean.kaufkauf.net/id/businessentities/GLN_0084610000032>,3),(<http://openean.kaufkauf.net/id/businessentities/GLN_0088720000050>,3),(<http://openean.kaufkauf.net/id/businessentities/GLN_0120490000028>,3),(<http://openean.kaufkauf.net/id/businessentities/GLN_0133770000090>,3),(<http://openean.kaufkauf.net/id/businessentities/GLN_0144360000086>,3),(<http://openean.kaufkauf.net/id/businessentities/GLN_0146140000040>,3),(<http://openean.kaufkauf.net/id/businessentities/GLN_0160080000038>,3),(<http://openean.kaufkauf.net/id/businessentities/GLN_0162990000030>,3),(<http://openean.kaufkauf.net/id/businessentities/GLN_0165590000028>,3),(<http://openean.kaufkauf.net/id/businessentities/GLN_0166620000056>,3),
.........

Line 2 of the Pig script gives me this result.

(1,1)
(3,333)

It is an interesting way to learn Pig, which internally spawns Hadoop MapReduce jobs. But the real fun is Amazon Elastic MapReduce's on-demand clusters. If the file is very large, EMR clusters should be used. It is basically Big Data analysis on the cloud.