A Safe Read Function
Fredrik Lundh | June 2007
The file read method takes an optional size argument, which tells Python how much data you want to read from the file. If this argument is used, Python allocates a buffer large enough to hold size bytes of data, reads that much data from the file, and finally adjusts the size of the resulting buffer to match the amount of data actually read from the file.
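As a quick illustration of that last step, here is a minimal sketch that asks for far more data than a small file holds. The temporary file and its five-byte contents are made up for the example; the point is only that read returns what is actually there.

```python
import os
import tempfile

# Create a tiny 5-byte file to read back (hypothetical test data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    path = tmp.name

try:
    with open(path, "rb") as fp:
        data = fp.read(1024)  # ask for much more than the file contains
        rest = fp.read(1024)  # already at end of file
finally:
    os.remove(path)

print(len(data))  # 5
print(rest)       # b''
```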
This is all well and good if you’re using a fixed size, or trust the source, but if you’re getting the size from somewhere else, such as the file itself, your program might misbehave badly if it gets broken data.
For example, the following snippet reads an 8-byte header from a binary file, where the first four bytes are a constant string, and the next four bytes contain the size of the data block that follows.
    import struct

    header = fp.read(8)
    tag, size = struct.unpack("4si", header)
    if tag != b"HEAD":  # binary files yield bytes in Python 3
        raise IOError("invalid header")
    data = fp.read(size)
If the size field contains bogus data (accidentally or on purpose), the read call may attempt to allocate hundreds of megabytes of memory, or gigabytes, even if the file isn’t close to being that large. If you’re lucky, this results in a memory error, but it may also cause excessive swapping, or otherwise affect other processes on the same machine.
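To see how little it takes, here is a small sketch that unpacks two damaged headers. The byte values are invented for the example, and the "<" prefix (little-endian, no padding) is an assumption made here so the result is the same on every machine; the snippet above uses native byte order.

```python
import struct

# Hypothetical corrupt header: valid tag, but the size field is 0x7fffffff.
huge_header = b"HEAD" + b"\xff\xff\xff\x7f"
tag, size = struct.unpack("<4si", huge_header)
print(tag, size)  # b'HEAD' 2147483647

# Another corrupt header: all size bits set, which "i" decodes as -1.
# A plain fp.read(-1) would then read everything up to end of file.
negative_header = b"HEAD" + b"\xff\xff\xff\xff"
_, negative_size = struct.unpack("<4si", negative_header)
print(negative_size)  # -1
```

In the first case a plain fp.read(size) would be asked for roughly 2 GB, no matter how small the file really is.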
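Here’s a simple replacement. This behaves like an ordinary read(size) call, but reads the data in blocks of at most blocksize bytes (1 MB by default), so its memory use grows with the data actually read, not with the requested size.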
    def safe_read(fp, size, blocksize=1024*1024):
        # Read up to size bytes from fp, at most blocksize bytes at a time.
        if size <= 0:
            return b""
        if size <= blocksize:
            return fp.read(size)
        data = []
        while size > 0:
            block = fp.read(min(size, blocksize))
            if not block:
                break  # end of file reached before size bytes were read
            data.append(block)
            size = size - len(block)
        return b"".join(data)
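A short usage sketch, under a few assumptions: the function is repeated (with bytes literals, for Python 3 binary files) so the snippet runs on its own, the in-memory stream stands in for a real file, and the little-endian "<4si" layout and the tiny blocksize are chosen only to keep the example small.

```python
import io
import struct

def safe_read(fp, size, blocksize=1024 * 1024):
    # Same logic as above, repeated so this example is self-contained.
    if size <= 0:
        return b""
    if size <= blocksize:
        return fp.read(size)
    data = []
    while size > 0:
        block = fp.read(min(size, blocksize))
        if not block:
            break
        data.append(block)
        size = size - len(block)
    return b"".join(data)

# Hypothetical broken file: the header claims about 2 GB of data,
# but only 100 bytes actually follow it.
payload = b"x" * 100
stream = io.BytesIO(struct.pack("<4si", b"HEAD", 0x7FFFFFFF) + payload)

tag, size = struct.unpack("<4si", stream.read(8))
data = safe_read(stream, size, blocksize=64)

print(len(data))          # 100
print(len(data) == size)  # False: the file is shorter than the header claims
```

Comparing len(data) with size afterwards is one way for the caller to notice that the file was truncated or the header was bogus.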