Pig JOIN
August 27, 2014
This run cost me $1.76 for about one hour of execution. The number of machines is the same as in the previous post.
X = FILTER ntriples BY (subject matches '.*business.*');
y = FOREACH X GENERATE subject AS subject2, predicate AS predicate2, object AS object2 PARALLEL 50;
j = JOIN X BY subject, y BY subject2 PARALLEL 50;
j = DISTINCT j PARALLEL 50;
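For context, here is a rough sketch of how the whole script might look end to end. The LOAD path, the tab delimiter, and the three-field schema are my assumptions for illustration, since the snippet above does not show how ntriples was loaded. The STORE target is inferred from the 'join-results' name that the counting script further down loads.

```pig
-- Hypothetical input path, delimiter, and schema (not shown in the original snippet).
ntriples = LOAD 'input/ntriples' USING PigStorage('\t')
           AS (subject:chararray, predicate:chararray, object:chararray);

-- Keep only the triples whose subject contains "business".
X = FILTER ntriples BY (subject matches '.*business.*');

-- Copy of X with renamed fields, so the relation can be joined with itself.
y = FOREACH X GENERATE subject AS subject2, predicate AS predicate2, object AS object2;

-- Self-join on subject, then drop duplicate rows.
j = JOIN X BY subject, y BY subject2 PARALLEL 50;
j = DISTINCT j PARALLEL 50;

-- Assumed output location, matching the name loaded by the counting script.
STORE j INTO 'join-results';
```

Because this is a self-join on subject, a subject that appears in n triples yields n × n joined rows, so the output can be much larger than the filtered input.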
Counting the records in the result file:
FILE = LOAD 'join-results';
FILE_C = GROUP FILE ALL;
FILE_COUNT = FOREACH FILE_C GENERATE COUNT(FILE);
-- Print the count; without a DUMP or STORE, Pig does not run the job.
DUMP FILE_COUNT;