The dataset contains 117 variables and 102,294 entries. I tried R, Python, and SAS.
For SAS, reading the data file takes about 4.77 sec using PROC IMPORT. I guess a DATA step would be faster, but not by much.
proc import datafile='F:\......\Features.txt' out=test
    dbms=dlm replace;
    delimiter='09'x;  /* '09'x is hex for the tab character */
    getnames=no;
run;
For Python, the time depends on how you write the program (Reference).
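The timings below look like raw time.time() deltas; the harness itself is not shown in the original, so this is just my assumed sketch of how each snippet was measured:

import time

start = time.time()
# ... run one of the file-reading snippets below here ...
print(time.time() - start)  # assumed harness; prints elapsed seconds

First, reading line by line with the fileinput module: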
import fileinput

# collect every line of the file into a list
test = []
for line in fileinput.input("Features.txt"):
    test.append(line)
The time is 2.062000036239624 sec.
If we read line by line with readline():

test = []
file = open("Features.txt")
while 1:
    line = file.readline()
    if not line:
        break
    test.append(line)
The time is 2.004999876022339 sec, slightly faster than the previous one.
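As an aside that I did not time, the idiomatic modern way is to iterate over the file object itself inside a with block; Python buffers the reads internally, much like the readlines() chunking tried next:

# iterate the file object directly; reads are buffered internally
with open("Features.txt") as f:
    test = list(f)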
If using a nested loop over chunks from readlines():

test = []
file = open("Features.txt")
while 1:
    # the 100000 argument is a size hint in bytes, so each call
    # returns roughly 100 KB worth of complete lines
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        test.append(line)
The elapsed time is 1.9250001907348633 sec. Pretty fast, hmm?
If we use the csv module:

import csv

test = []
with open("Features.txt", 'r') as f:
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        test.append(row)
This time it is 6.2779998779296875 sec. Not bad, considering csv.reader also splits each line into its 117 fields, which the earlier snippets did not do.
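Most of that extra time is presumably csv's generic quoting and escaping logic; since the file is plain tab-delimited, a bare str.split would do the same field splitting with less machinery (an untimed sketch):

# split each line on tabs directly, bypassing csv's quoting machinery
with open("Features.txt") as f:
    test = [line.rstrip("\n").split("\t") for line in f]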
As for R, much to my disappointment, it took a shocking 57.94 sec. Almost a full minute!
t = proc.time()
dataset = read.table('Features.txt')
print(proc.time() - t)
I believe we could speed this up with some of R's lower-level readers such as scan(), or by giving read.table() hints like colClasses and nrows so it can skip type inference, but I never realized this function was so inefficient. Maybe that is because it is so generic.