The dataset contains 117 variables with 102,294 entries. I tried R, Python, and SAS.
For SAS, reading the data file takes about 4.77 seconds using PROC IMPORT. I would guess a DATA step would be somewhat faster, but not by much (a sketch follows the PROC IMPORT code below).
proc import datafile='F:\......\Features.txt' out=test
    dbms=dlm replace;
    delimiter='09'x;
    getnames=no;
run;
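For reference, here is a minimal DATA step sketch. It assumes all 117 columns are numeric and gives them placeholder names var1-var117; a real INPUT statement would list the actual variable names and types.

data test;
    /* read the tab-delimited file directly; truncover keeps short
       records from spilling onto the next line */
    infile 'F:\......\Features.txt' dlm='09'x dsd truncover;
    /* hypothetical: assumes 117 numeric columns */
    input var1-var117;
run;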
For Python, the time depends on how your program reads the file (Reference).
import fileinput

test = []
for line in fileinput.input("Features.txt"):
    test.append(line)
The elapsed time is about 2.062 seconds.
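The post does not show the timing harness; the numbers were presumably collected with something like this sketch using time.time():

import fileinput
import time

start = time.time()
test = []
for line in fileinput.input("Features.txt"):
    test.append(line)
print(time.time() - start)  # elapsed wall-clock seconds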
If instead we call readline in a loop:

test = []
file = open("Features.txt")
while 1:
    line = file.readline()
    if not line:
        break
    test.append(line)
The elapsed time is about 2.005 seconds, slightly faster than the previous approach.
If using a nested loop that reads the file in chunks with readlines:

test = []
file = open("Features.txt")
while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        test.append(line)
The elapsed time is about 1.925 seconds. Pretty fast, hmm?
If we use the csv module:

import csv

test = []
with open("Features.txt", 'r') as f:
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        test.append(row)
This time it takes about 6.278 seconds. Not bad, considering csv also splits every row into a list of fields rather than keeping raw lines.
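For comparison, the simplest variant reads everything in one call; a sketch, not timed in the original post:

# reads all lines into a list at once; memory-hungry but concise
with open("Features.txt") as f:
    test = f.readlines()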
R, meanwhile, was quite a disappointment: it took a shocking 57.94 seconds, almost a full minute!
t = proc.time()
dataset = read.table('Features.txt')
print(proc.time() - t)
I believe some lower-level functions in R could speed this up; I just never realized read.table was this inefficient. Perhaps that is because it is too generic: it inspects every field to guess the column types. A sketch of two common speedups follows below.
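As a sketch of what "lower-level" might look like, here are two standard tricks, assuming the file is tab-delimited and, for the scan call, all-numeric (neither assumption is confirmed above):

# 1. Give read.table hints so it can skip type guessing
dataset = read.table('Features.txt', sep = '\t', header = FALSE,
                     colClasses = 'numeric',  # assumes all 117 columns numeric
                     nrows = 102294, comment.char = '')

# 2. Drop to scan(), the low-level reader underneath read.table
raw = scan('Features.txt', what = numeric(), sep = '\t')
dataset = matrix(raw, ncol = 117, byrow = TRUE)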