python - Optimization tips for reading/parsing a large number of JSON.gz files


I have an interesting problem at hand. As someone who is a beginner when it comes to working with data at moderate scale, I'd love some tips from the veterans here.

I have around 6000 json.gz files totalling around 5GB compressed and 20GB uncompressed. I'm opening each file and reading it line by line using the gzip module; for each line, I use json.loads() to load and parse a complicated JSON structure. I then insert all the lines from each file into a pytable at once before moving on to the next file.

All of this is taking me around 3 hours. Bulk inserting into the pytable didn't speed things up at all. Much of the time goes into getting values out of the parsed JSON line, since the lines have a horrible structure. Some are straightforward, like 'attrname': attrvalue, but some have complicated and time-consuming structures like:

'attrarray':[{'name':abc, 'value':12},{'value':12},{'name':xyz, 'value':12}...]

...where I need to pick the value of the objects in the attr array that have a corresponding name, and ignore the ones that don't. So I need to iterate through the list and inspect each JSON object inside it. (I'd be glad if you could point out a quicker, cleverer way, if one exists.)
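For concreteness, a per-line name-to-value dictionary, roughly as sketched below, is the kind of shortcut I'm wondering about; the sample line and attribute names are just placeholders and I haven't measured whether this would actually be faster:

import json

# A stand-in for one line of my data (placeholder values).
line = '{"attrarray": [{"name": "abc", "value": 12}, {"value": 12}, {"name": "xyz", "value": 12}]}'
parsedline = json.loads(line)

# Build a name -> value mapping once per line, skipping elements without a
# 'name' key, then look attributes up by name instead of branching per element.
lookup = {el['name']: el['value'] for el in parsedline['attrarray'] if 'name' in el}
attrabc = lookup.get('abc')   # None if no element was named 'abc'
attrxyz = lookup.get('xyz')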

So I suppose the actual parsing part doesn't have much scope for speedup. Where I think there might be scope for speedup is the actual reading-the-file part.

So I ran a few tests (I don't have the numbers with me right now), and even after removing the parsing part of the program, just going through the files line by line was itself taking a considerable amount of time.

So I ask: is there any part of this problem that you think I might be doing suboptimally?

for filename in filenamelist:
    f = gzip.open(filename)
    toinsert = []
    for line in f:
        parsedline = json.loads(line)
        attr1 = parsedline['attr1']
        attr2 = parsedline['attr2']
        .
        .
        .
        attr10 = parsedline['attr10']
        arr = parsedline['attrarray']
        for el in arr:
            try:
                if el['name'] == 'abc':
                    attrabc = el['value']
                elif el['name'] == 'xyz':
                    attrxyz = el['value']
                .
                .
                .
            except KeyError:
                pass
        toinsert.append([attr1, attr2, ..., attr10, attrabc, attrxyz, ...])
    table.append(toinsert)

One clear piece of "low-hanging fruit"

If you're going to be accessing the same compressed files over and over (it's not clear from your description whether this is a one-time operation), you should decompress them once rather than decompressing them on the fly each time you read them.

Decompression is a CPU-intensive operation, and Python's gzip module is not that fast compared to C utilities like zcat/gunzip.

Likely the fastest approach is to gunzip all of these files, save the results somewhere, and then read from the uncompressed files in your script.
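As a rough sketch of that one-time pass, reusing the filenamelist from the question (the output naming and the choice of the system gunzip are my assumptions; any decompression route works since this only runs once):

import subprocess

# One-time pass: let the system gunzip write each file's contents, uncompressed,
# to a sibling .json file; later runs read the .json files directly.
for filename in filenamelist:
    outname = filename[:-3]                     # strip the trailing '.gz'
    with open(outname, 'wb') as fout:
        subprocess.check_call(['gunzip', '-c', filename], stdout=fout)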

Other issues

The rest of this is not really an answer, but it's too long for a comment. In order to make this faster, you need to think about a few other questions:

  1. What are you trying to do with all of this data?
  2. Do you really need to load all of it at once?
    • If you can segment the data into smaller pieces, you can reduce the latency of the program, if not the overall time required. For example, you might know that you only need a few specific lines from specific files for whatever analysis you're trying to do... great! Load only those specific lines (see the sketch just after this list).
    • If you do need to access the data in arbitrary and unpredictable ways, you should load it into another system (an RDBMS?) that stores it in a format more amenable to the kinds of analyses you're doing with it.
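To illustrate the segmentation idea from the first sub-point, here is a minimal sketch of filtering while reading; the predicate and the attribute it checks are hypothetical, and filenamelist is the list from the question:

import gzip
import json

def load_matching(filenames, predicate):
    """Yield only the parsed lines that satisfy the given predicate."""
    for filename in filenames:
        with gzip.open(filename) as f:
            for line in f:
                parsedline = json.loads(line)
                if predicate(parsedline):
                    yield parsedline

# Hypothetical example: keep only records whose attr1 equals 'abc'.
wanted = load_matching(filenamelist, lambda rec: rec.get('attr1') == 'abc')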

If that last bullet point is true for you, one option is to load each JSON "document" into a PostgreSQL 9.3 database (its JSON support is awesome and fast) and do your further analyses from there. Once they're loaded, you can extract the meaningful keys from the JSON documents.
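A minimal sketch of that load step, assuming psycopg2 and a single-column table whose names (and the connection string) I've made up here:

import gzip
import psycopg2

conn = psycopg2.connect('dbname=mydb')          # connection string is an assumption
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS raw_docs (doc json)')

# Each line goes in as one json document; PostgreSQL validates it on insert,
# and keys can later be pulled out with operators like doc->>'attrname'.
for filename in filenamelist:
    with gzip.open(filename) as f:
        for line in f:
            cur.execute('INSERT INTO raw_docs (doc) VALUES (%s)', (line,))
conn.commit()

In practice you would batch the inserts (executemany or COPY), but the shape of the load is the same.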

