Thursday, 17 January 2008

python read files

Doing it the usual way

The standard idiom consists of an ‘endless’ while loop, in which we repeatedly call the file’s readline method. Here’s an example:

# File: readline-example-1.py

file = open("sample.txt")

while 1:
    line = file.readline()
    if not line:
        break
    pass # do something

This snippet reads the file line by line. If readline reaches the end of the file, it returns an empty string. Otherwise, it returns the line of text, including the trailing newline character.
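The end-of-file behaviour described above can be checked with a small, self-contained sketch. The helper name read_all_lines and the temporary demo file are assumptions made for illustration; they are not part of the original script. Note that a blank line in the file comes back as "\n", which is not empty, so it does not end the loop.

```python
import os
import tempfile

def read_all_lines(path):
    """Collect every line of a file using the readline loop (hypothetical helper)."""
    lines = []
    f = open(path)
    try:
        while True:
            line = f.readline()
            if not line:  # an empty string means end of file
                break
            lines.append(line)
    finally:
        f.close()
    return lines

# Build a small throwaway file: a blank line in the middle, no newline at the end.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    out.write("first\n\nlast")

demo_lines = read_all_lines(demo_path)
os.remove(demo_path)

print(demo_lines)  # ['first\n', '\n', 'last']
```

The last line has no trailing newline only because the demo file does not end with one.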

On my test machine, using a 10 megabyte sample text file, this script reads about 32,000 lines per second.

Using the fileinput module

If you think the while loop is ugly, you can hide the readline call in a wrapper class. The standard fileinput module contains an input function, which returns a wrapper object that does exactly that.

# File: readline-example-2.py

import fileinput

for line in fileinput.input("sample.txt"):
    pass

However, adding more layers of Python code doesn’t exactly help. For the same test setup, performance drops to 13,000 lines per second. That’s nearly two and a half times slower!

Speeding up line reading

To speed things up, we obviously need to make sure we spend as little time in Python code (running under the interpreter) as possible.

One way to do this is to tell the file object to read larger chunks of data. For example, if you have enough memory, you can slurp the entire file into memory, using the readlines method. Or you could even use the read method to read the entire file into a single memory block, and then use string.split to chop it up into individual lines.
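As a rough sketch of the two slurping approaches, the following compares readlines with reading the whole file and splitting it. The function names and the temporary file are assumptions for illustration. It uses the string method splitlines rather than string.split, because splitting "alpha\nbeta\n" on "\n" leaves an extra empty string at the end.

```python
import os
import tempfile

def slurp_with_readlines(path):
    """Return all lines, each keeping its trailing newline."""
    with open(path) as f:
        return f.readlines()

def slurp_with_read(path):
    """Read one big string, then split it into lines without newlines."""
    with open(path) as f:
        return f.read().splitlines()

# Throwaway demo file (hypothetical content).
fd, slurp_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    out.write("alpha\nbeta\n")

kept = slurp_with_readlines(slurp_path)
stripped = slurp_with_read(slurp_path)
os.remove(slurp_path)

print(kept)      # ['alpha\n', 'beta\n']
print(stripped)  # ['alpha', 'beta']
```

Both variants hold the entire file in memory at once, so they only suit files that comfortably fit.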

However, if you’re processing really large files, it would be nice if you could limit the chunk size to something reasonable. For example, if you read a few thousand lines at a time, you probably won’t use up more than 100 kilobytes or so.

The following script uses a nested loop. The outer loop uses readlines to read about 100,000 bytes of text, and the inner loop processes those lines using a simple for-in loop:

# File: readline-example-3.py

file = open("sample.txt")

while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        pass # do something

Can this really be faster? You bet. With the same test data, we can now process 96,900 lines of text per second!

Or to put it another way, this solution is three times as fast as the standard solution, and over seven times faster than the fileinput version.
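To repeat the comparison on your own machine, a timing harness along these lines may help. It is only a sketch: the function names, the generated 20,000-line test file, and the use of time.perf_counter are assumptions, not the setup used for the figures quoted here. Absolute numbers, and possibly the ratios, will differ with the Python version and hardware.

```python
import fileinput
import os
import tempfile
import time

def count_readline(path):
    """Plain readline loop."""
    n = 0
    with open(path) as f:
        while True:
            line = f.readline()
            if not line:
                break
            n += 1
    return n

def count_fileinput(path):
    """Same loop, hidden behind the fileinput wrapper."""
    n = 0
    with fileinput.input(files=[path]) as stream:
        for line in stream:
            n += 1
    return n

def count_readlines_chunks(path, sizehint=100000):
    """Outer loop fetches roughly sizehint bytes of lines at a time."""
    n = 0
    with open(path) as f:
        while True:
            lines = f.readlines(sizehint)
            if not lines:
                break
            for line in lines:
                n += 1
    return n

def lines_per_second(func, path):
    """Return (line count, lines per second) for one run of func."""
    start = time.perf_counter()
    n = func(path)
    elapsed = time.perf_counter() - start
    return n, (n / elapsed if elapsed > 0 else float("inf"))

# Generate a throwaway test file (hypothetical size and content).
fd, bench_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    for i in range(20000):
        out.write("this is line number %d of the sample file\n" % i)

results = {}
for func in (count_readline, count_fileinput, count_readlines_chunks):
    results[func.__name__] = lines_per_second(func, bench_path)
os.remove(bench_path)

for name, (count, rate) in results.items():
    print("%-24s %6d lines, %.0f lines/second" % (name, count, rate))
```

A single run on a small file is noisy, so for anything beyond a quick look, use a larger file and repeat each measurement several times.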

In Python 2.2 and later, you can loop over the file object itself. This works pretty much like readlines(N) under the covers, but looks much better:

# File: readline-example-5.py

file = open("sample.txt")

for line in file:
    pass # do something

In Python 2.1, you have to use the xreadlines iterator factory instead:

# File: readline-example-4.py

file = open("sample.txt")

for line in file.xreadlines():
    pass # do something

Copyright © 2000 Fredrik Lundh
