9/26/2011

Data import time

I am trying different software packages for importing a large dataset.
The dataset contains 117 variables with 102294 rows. I tried R, Python and SAS.
For SAS, reading the data file takes about 4.77 sec using proc import. I guess using a data step should be a bit faster, but not by much.

proc import datafile = 'F:\......\Features.txt' out= test
dbms = dlm replace;
 delimiter='09'x;
 getnames=no;
run;
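
A data step version would look roughly like this, assuming all 117 columns are numeric (the names var1-var117 are just placeholders, since the actual layout isn't spelled out here):

data test;
    infile 'F:\......\Features.txt' dlm='09'x dsd truncover;
    input var1-var117;   /* 117 numeric columns assumed; adjust if some are character */
run;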

For Python, the time depends on what your program looks like (Reference).

If using fileinput module:
import fileinput
test = []
for line in fileinput.input("Features.txt"):   # iterate over the file line by line
    test.append(line)

The time is 2.062000036239624 seconds.
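
The Python timings can be reproduced by wrapping each snippet with time.time(); the wrapper below is just a sketch of that, not the exact measurement code:

import fileinput
import time

start = time.time()
test = []
for line in fileinput.input("Features.txt"):
    test.append(line)
print(time.time() - start)   # elapsed wall-clock time in seconds
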
If using readline directly:
test = []
file = open("Features.txt")
while 1:
    line = file.readline()   # read one line per iteration
    if not line:
        break
    test.append(line)

The time is 2.004999876022339 seconds, slightly faster than the previous one.
If using readlines with a size hint in a nested loop:
test = []
file = open("Features.txt")
while 1:
    lines = file.readlines(100000)   # read roughly 100000 bytes' worth of lines at a time
    if not lines:
        break
    for line in lines:
        test.append(line)

The elapsed time is 1.9250001907348633 seconds. Pretty fast, huh?
If we use the csv module:
import csv
test = []
with open("Features.txt", 'r') as f:
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        test.append(row)

This time it is 6.2779998779296875 seconds, slower than the plain line readers, but the csv reader also splits each row into fields, so it is doing more work. Not bad.

As for R, it was quite a disappointment: it took a shocking 57.94 sec, almost a full minute!
t = proc.time()
dataset = read.table('Features.txt')
print(proc.time() - t)

I believe we can use some lower-level functions in R (such as scan) to speed up this process, but I never realized this function was so inefficient. Maybe it is because read.table is so generic: by default it has to figure out the class of every column on its own.
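
For example, giving read.table the column classes and row count up front usually helps a lot; the call below is only a sketch, assuming all 117 columns are numeric and the file is tab-delimited:

t = proc.time()
dataset = read.table('Features.txt', sep = '\t',
                     colClasses = rep('numeric', 117),
                     nrows = 102294, comment.char = '')
print(proc.time() - t)

With these hints read.table no longer has to guess each column's type or grow the data frame as it reads.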
