Apache Pig - Reuse Pig groups in nested FOREACH statement
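I'm trying to group records together, calculate the average of score1, filter down to the lower half of the scores (records whose score1 is below the group average), and compute the average of score2 over what remains. I can calculate the summary statistics and rejoin them to the original dataset, but I'd prefer to use the intermediate grouped values.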
Example input

    id,groupby,score1,score2
    1,a,58.8,67.3
    2,a,85.2,76.3
    3,b,49.1,90.7
    4,b,78.3,99.8
Pig script

    records = LOAD 'example.csv' USING PigStorage(',') AS (id,groupby,score1,score2);
    grouped = GROUP records BY groupby;
    avgscore = FOREACH grouped GENERATE group AS groupby, AVG(records.score1) AS avgscore;
    joined = JOIN grouped BY group, avgscore BY groupby USING 'replicated';
    results = FOREACH joined {
        scores = FOREACH records GENERATE score1,score2;
        low = FILTER scores BY score1 < avgscore.avgscore;
        GENERATE groupby, AVG(low.score2);
    };
    DUMP results;
Desired output

    a 67.3
    b 90.7
However, this gives me the following error:

    java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (a,72.0), 2nd :(b,63.7)
You are joining two different data structures in line 4: grouped (which is grouped, so each row holds a bag of records) and avgscore (which is flat).
You should be doing:

    joined = JOIN records BY groupby, avgscore BY groupby USING 'replicated';
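As for the error itself: inside the nested FOREACH, avgscore.avgscore names a relation rather than a field, so Pig treats it as a scalar projection, which only works when that relation has exactly one row. Here avgscore has one row per group, hence the message. To see the shape mismatch, it can help to inspect the schemas. This is a minimal sketch, assuming the relations from the question's script are already defined in the same session; the comments paraphrase the shapes rather than quote Pig's exact output.

```pig
-- Assumes records, grouped and avgscore are defined as in the question.
DESCRIBE grouped;    -- one row per group: the key plus a bag of the original records
DESCRIBE avgscore;   -- one flat row per group: (groupby, avgscore)

-- Joining the flat relations gives one flat row per original record.
joined = JOIN records BY groupby, avgscore BY groupby USING 'replicated';
DESCRIBE joined;     -- fields carry their source alias, e.g. records::groupby, avgscore::avgscore
```

After the join, a name that exists on both sides (groupby here) has to be written with its prefix, such as records::groupby, while unambiguous names like score1 can be used as they are.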
Edit: rewritten to avoid confusion (since there are two groupbys).
    records = LOAD 'example.csv' USING PigStorage(',') AS (id,groupby,score1,score2);
    grouped = GROUP records BY groupby;
    avgscore = FOREACH grouped GENERATE group AS groupby, AVG(records.score1) AS avgscore;
    joined = JOIN records BY groupby, avgscore BY groupby USING 'replicated';
    -- keep a single groupby column and drop the duplicate from the join
    joined_reduced = FOREACH joined GENERATE id, records::groupby AS groupby, avgscore, score1, score2;
    -- keep the lower half: score1 below the group average, which matches the desired output
    filter_joined = FILTER joined_reduced BY (score1 < avgscore);
    grouped2 = GROUP filter_joined BY groupby;
    result = FOREACH grouped2 GENERATE FLATTEN(group), AVG(filter_joined.score2) AS low_avg;
    DUMP result;
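For reference, here is a compact trace of how the example input should flow through this approach, assuming that "score1 below the group average" is the intended filter. The explicit column types and the alias names low and low_by_group are my own assumptions for illustration (the script above leaves the columns untyped), and the sketch also assumes example.csv holds only the four data rows without a header line.

```pig
-- Assumption: types declared explicitly; the original schema is untyped (bytearray).
-- Assumption: example.csv contains only the four data rows, no header line.
records = LOAD 'example.csv' USING PigStorage(',')
          AS (id:int, groupby:chararray, score1:double, score2:double);
grouped  = GROUP records BY groupby;
avgscore = FOREACH grouped GENERATE group AS groupby, AVG(records.score1) AS avgscore;
-- avgscore rows, as reported in the error message: (a,72.0) and (b,63.7)

joined = JOIN records BY groupby, avgscore BY groupby USING 'replicated';
low    = FILTER joined BY score1 < avgscore::avgscore;
-- rows kept: id 1 (58.8 < 72.0) and id 3 (49.1 < 63.7)

low_by_group = GROUP low BY records::groupby;
result = FOREACH low_by_group GENERATE group, AVG(low.score2) AS low_avg;
-- each group has one remaining row, so the averages are 67.3 for a and 90.7 for b
DUMP result;
```

The row order printed by DUMP is not guaranteed, so the two groups may come out in either order.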