python - Optimization tips for reading/parsing a large number of JSON.gz files
I have an interesting problem at hand. As someone who's a beginner when it comes to working with data at moderate scale, I'd love some tips from the veterans here.

I have around 6000 JSON.gz files totalling around 5GB compressed and 20GB uncompressed. I open each file and read it line by line using the gzip module, then use json.loads() to load each line and parse the complicated JSON structure. I insert the lines from each file into a PyTable all at once before iterating to the next file.

All of this is taking me around 3 hours. Bulk inserting into the PyTable didn't speed things up at all. Much of the time is spent getting values out of the parsed JSON line, since the structure is horrible. Some of it is straightforward 'attrname':attrvalue, but some of it is complicated and time-consuming, with structures like:
'attrarray':[{'name':abc, 'value':12},{'value':12},{'name':xyz, 'value':12}...]
...where I need to pick up the value of those objects in the attr array that have a corresponding name, and ignore those that don't. So I need to iterate through the list and inspect each JSON object inside it. (I'd be glad if you can point out a quicker, cleverer way, if one exists.)

So I suppose the actual parsing part doesn't have much scope for a speedup. Where I think there might be scope for a speedup is the actual file-reading part.

I ran a few tests (I don't have the numbers with me right now), and even after removing the parsing part of the program, just going through the files line by line was itself taking a considerable amount of time.

So I ask: is there any part of this problem that you think I might be doing suboptimally?
for filename in filenamelist:
    f = gzip.open(filename)
    toinsert = []
    for line in f:
        parsedline = json.loads(line)
        attr1 = parsedline['attr1']
        attr2 = parsedline['attr2']
        .
        .
        .
        attr10 = parsedline['attr10']
        arr = parsedline['attrarray']
        for el in arr:
            try:
                if el['name'] == 'abc':
                    attrabc = el['value']
                elif el['name'] == 'xyz':
                    attrxyz = el['value']
                .
                .
                .
            except KeyError:
                pass
        toinsert.append([attr1, attr2, ..., attr10, attrabc, attrxyz, ...])
    table.append(toinsert)
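In case it makes the question clearer: is something like the following what a cleverer way of handling 'attrarray' would look like? This is just a sketch with the same made-up names as above; it builds a name -> value dict per line and then looks up the attributes I need:

# Sketch only: build a name -> value mapping once per line, then look up
# the attributes instead of scanning the list with an if/elif chain.
# 'abc' and 'xyz' are placeholder names, as in the code above.
arr = parsedline['attrarray']
lookup = {el['name']: el['value'] for el in arr if 'name' in el}
attrabc = lookup.get('abc')   # None if no element has name 'abc'
attrxyz = lookup.get('xyz')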
One clear piece of "low-hanging fruit"

If you're going to be accessing the same compressed files over and over (it's not clear from your description whether this is a one-time operation), you should decompress them once rather than decompressing them on the fly each time you read them.

Decompression is a CPU-intensive operation, and Python's gzip module is not that fast compared to C utilities like zcat/gunzip.

The likely fastest approach is to gunzip all of these files, save the results somewhere, and then read from the uncompressed files in your script.
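A minimal sketch of that one-time decompression pass, assuming you have the disk space for the roughly 20GB of uncompressed output (the directory names here are made up):

# One-time pass: decompress every .json.gz into a plain .json file so that
# later runs can read the uncompressed files directly.
import gzip
import os
import shutil

src_dir = 'compressed'      # hypothetical directory holding the .json.gz files
dst_dir = 'uncompressed'    # hypothetical directory for the decompressed output
if not os.path.isdir(dst_dir):
    os.makedirs(dst_dir)

for name in os.listdir(src_dir):
    if not name.endswith('.json.gz'):
        continue
    src = os.path.join(src_dir, name)
    dst = os.path.join(dst_dir, name[:-3])          # strip the .gz suffix
    with gzip.open(src, 'rb') as fin, open(dst, 'wb') as fout:
        shutil.copyfileobj(fin, fout)               # streams in chunks

A shell loop over gunzip (or zcat redirected to files) would do the same job and is typically faster than going through Python's gzip module.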
Other issues

The rest of this is not really an answer, but it's too long for a comment. In order to make this faster, you need to think about a few other questions:

- What are you trying to do with all of this data?
- Do you really need to load all of it at once?
- If you can segment the data into smaller pieces, you can reduce the latency of the program, if not the overall time required. For example, you might know that you only need a few specific lines from specific files for whatever analysis you're trying to do... great! Only load those specific lines.
- If you do need to access the data in arbitrary and unpredictable ways, then you should load it into another system (an RDBMS?) which stores it in a format more amenable to the kinds of analyses you're doing with it.

If that last bullet point is true, one option is to load each JSON "document" into a PostgreSQL 9.3 database (the JSON support is awesome and fast) and do your further analyses from there. Hopefully you can extract the meaningful keys from the JSON documents as you load them.
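As a rough illustration of that last route, something along these lines would get the raw documents into a 9.3 json column using psycopg2 (the connection string, table name, and file name are all made up):

# Sketch: load each raw JSON line into a PostgreSQL json column, then do the
# key extraction and analysis in SQL instead of in the Python loop.
import gzip
import psycopg2

conn = psycopg2.connect('dbname=mydb')      # hypothetical connection string
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS raw_docs (doc json)')

with gzip.open('somefile.json.gz', 'rb') as f:   # hypothetical file name
    for line in f:
        cur.execute('INSERT INTO raw_docs (doc) VALUES (%s)', [line.decode('utf-8')])

conn.commit()
cur.close()
conn.close()

# Later, pull out only the keys you care about in SQL, e.g.:
#   SELECT doc->>'attr1', doc->>'attr2' FROM raw_docs;

Per-row INSERTs like this will be slow for 20GB of data; batched inserts or COPY would be the natural next step, but the idea is the same.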