Calculating means from CSV with Python's numpy


I have a 10 GB file (too large to fit in RAM) in the following format:

    col1,col2,col3,col4
    1,2,3,4
    34,256,348,
    12,,3,4

So the file has columns and some missing values, and I want to calculate the means of columns 2 and 3. A plain Python solution would look something like this:

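    def means(rng):
        s, e = rng  # half-open range of column indices, e.g. (1, 3)

        with open("data.csv") as fd:
            title = next(fd)
            titles = title.strip().split(",")
            print("means for " + ",".join(titles[s:e]))

            ret = [0] * (e - s)
            for c, l in enumerate(fd):
                vals = l.split(",")[s:e]
                for i, v in enumerate(vals):
                    try:
                        ret[i] += int(v)
                    except ValueError:
                        # missing value: adds nothing, but the row still counts
                        pass

            # c is the index of the last data row, so c + 1 is the row count
            return [float(t) / (c + 1) for t in ret]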

But I suspect there is a faster way to do this with numpy (I am still a novice at it).

pandas is your best friend:

    from pandas import read_csv

    # Load 10000 rows at a time; you can play with this number to get better
    # performance on your machine
    my_data = read_csv("data.csv", chunksize=10000)

    total = 0
    count = 0

    for chunk in my_data:
        # If you want to exclude NAs from the average, remove the next line
        chunk = chunk.fillna(0.0)

        total += chunk.sum(skipna=True)
        count += chunk.count()

    avg = total / count

    col1_avg = avg["col1"]
    # ... etc. ...
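Since the question only asks for columns 2 and 3, the same chunked idea can be narrowed so that pandas parses just those columns. The following is a minimal sketch, not a definitive implementation: the function name `column_means` and its `skip_missing` parameter are hypothetical, and it assumes the selected columns contain only numbers or empty cells.

```python
import pandas as pd


def column_means(path, columns, chunksize=10000, skip_missing=True):
    """Return per-column means of a CSV too large to load at once.

    `columns` is a list of header names. With skip_missing=True, empty
    cells are left out of the average; with False they count as zero.
    """
    total = pd.Series(0.0, index=columns)
    count = pd.Series(0, index=columns)

    # usecols makes the parser keep only the requested columns,
    # and chunksize turns read_csv into an iterator of DataFrames.
    for chunk in pd.read_csv(path, usecols=columns, chunksize=chunksize):
        total += chunk.sum(skipna=True)   # NaN cells add nothing to the sum
        if skip_missing:
            count += chunk.count()        # non-missing cells per column
        else:
            count += len(chunk)           # every row counts, missing or not

    return total / count
```

Called as `column_means("data.csv", ["col2", "col3"])` on the four-line sample from the question, this gives 129.0 for col2 and 118.0 for col3; with `skip_missing=False` the col2 mean becomes 86.0, which matches what the plain Python loop computes.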
