Calculating means from CSV with Python's NumPy
I have a 10 GB file (it can't fit in RAM) in this format:

    col1,col2,col3,col4
    1,2,3,4
    34,256,348,
    12,,3,4

So there are columns and missing values, and I want to calculate the means of columns 2 and 3. In plain Python I would do something like:
    def means(rng):
        s, e = rng  # half-open range of column indices

        with open("data.csv") as fd:
            title = next(fd)
            titles = title.split(',')
            print("means for", ",".join(titles[s:e]))

            ret = [0] * (e - s)
            for c, l in enumerate(fd):
                vals = l.split(",")[s:e]
                for i, v in enumerate(vals):
                    try:
                        ret[i] += int(v)
                    except ValueError:
                        # missing value: adds nothing to the sum
                        pass

        # divides by the total row count, so missing values count as 0
        return [float(s) / (c + 1) for s in ret]
But I suspect there is a faster way to do this with NumPy (I am still a novice at it).
Pandas is your best friend:
    import pandas as pd

    # Load 10000 rows at a time; you can play with this number to get better
    # performance on your machine
    my_data = pd.read_csv("data.csv", chunksize=10000)

    total = 0
    count = 0

    for chunk in my_data:
        # If you want to exclude NAs from the average, remove the next line
        chunk = chunk.fillna(0.0)

        total += chunk.sum(skipna=True)
        count += chunk.count()

    avg = total / count

    col1_avg = avg["col1"]
    # ... etc. ...
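Since the question only asks about columns 2 and 3, the same chunked idea can be narrowed to just those columns. The following is a minimal sketch, not a tuned implementation: the function name `chunked_means` and the `skip_missing` flag are made up for illustration, and it assumes the selected columns are numeric and named as in the sample header.

```python
import pandas as pd


def chunked_means(path, columns, chunksize=10000, skip_missing=True):
    """Per-column means of a large CSV, read one chunk at a time.

    Hypothetical helper: `columns` is a list of header names, and
    `skip_missing` chooses between ignoring missing values and
    counting them as 0.
    """
    total = pd.Series(0.0, index=columns)
    count = pd.Series(0, index=columns)

    # usecols makes the parser keep only the requested columns
    for chunk in pd.read_csv(path, usecols=columns, chunksize=chunksize):
        chunk = chunk[columns]  # keep the requested column order
        if not skip_missing:
            # treat missing values as 0 so they still count as rows
            chunk = chunk.fillna(0.0)
        total += chunk.sum()    # sum() skips NaN by default
        count += chunk.count()  # count() counts non-missing values only

    return total / count
```

On the four-line sample from the question, `chunked_means("data.csv", ["col2", "col3"])` should give 129.0 for col2 and 118.0 for col3. With `skip_missing=False`, col2 becomes 86.0 because the empty cell is counted as a zero.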