apache pig - Reuse Pig Groups in nested FOREACH statement

- May 15, 2014

I'm trying to group records together, calculate the average of score1, filter to the lower half of the scores (those below the group average), and then compute the average of score2. I can calculate the summary statistics and rejoin them to the original dataset, but I'd prefer to use the intermediate grouped values.

Example input

id,groupby,score1,score2
1,a,58.8,67.3
2,a,85.2,76.3
3,b,49.1,90.7
4,b,78.3,99.8

Pig script

records = LOAD 'example.csv' USING PigStorage(',') AS (id,groupby,score1,score2);
grouped = GROUP records BY groupby;
avgscore = FOREACH grouped GENERATE group AS groupby, AVG(records.score1) AS avgscore;
joined = JOIN grouped BY group, avgscore BY groupby USING 'replicated';
results = FOREACH joined {
    scores = FOREACH records GENERATE score1,score2;
    low = FILTER scores BY score1 < avgscore.avgscore;
    GENERATE groupby, AVG(low.score2);
};
DUMP results;

Desired output

a    67.3
b    90.7

However, this gives me the following error:

java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (a,72.0), 2nd :(b,63.7)

You are combining two different data structures in line 4. You are joining grouped (which is grouped, with one bag of records per group) with avgscore (which should be flattened).

You should be doing:

joined = JOIN records BY groupby, avgscore BY groupby USING 'replicated';

Edit: a rewrite to avoid confusion (since there are 2 groupbys)

records = LOAD 'example.csv' USING PigStorage(',') AS (id,groupby,score1,score2);
grouped = GROUP records BY groupby;
avgscore = FOREACH grouped GENERATE group AS groupby, AVG(records.score1) AS avgscore;
-- join the flat records (not the grouped bags) with the per-group averages
joined = JOIN records BY groupby, avgscore BY groupby USING 'replicated';
joined_reduced = FOREACH joined GENERATE id, records::groupby AS groupby, avgscore, score1, score2;
-- keep the lower half: rows whose score1 is below their group's average
filter_joined = FILTER joined_reduced BY (score1 < avgscore);
grouped2 = GROUP filter_joined BY groupby;
result = FOREACH grouped2 GENERATE FLATTEN(group), AVG(filter_joined.score2) AS low_avg;
DUMP result;
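A side note on the error message itself: in an expression, relation.field (such as avgscore.avgscore in the question's FILTER) is read by Pig as a scalar, i.e. a single value taken from a relation that must contain exactly one row. Here avgscore has one row per group, hence "Scalar has more than one row". The following is only a sketch of the case where a scalar projection is valid, using an overall average rather than a per-group one. It assumes the same example.csv without a header row; the aliases everything, overall and below, and the declared column types, are illustrative and not part of the original script.

```pig
-- Sketch only: filter against a single overall average via a scalar projection.
-- Assumption: example.csv contains only the four data rows (no header line).
records = LOAD 'example.csv' USING PigStorage(',')
          AS (id:int, groupby:chararray, score1:double, score2:double);

-- GROUP ... ALL puts every record in one group, so the next relation has exactly one row.
everything = GROUP records ALL;
overall = FOREACH everything GENERATE AVG(records.score1) AS avgscore;

-- overall.avgscore is a scalar here because overall is a one-row relation.
below = FILTER records BY score1 < overall.avgscore;
DUMP below;
```

For a per-group threshold, as in the question, the scalar form does not apply, which is why the join on groupby above is used instead.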
