Getting Information About a File
This is an old draft from 1997.
The file system itself can reveal some interesting information about a document. For example, it can tell you the size of the document file, and when it was created, modified, or even last read. On some platforms, you can also find out who owns the file in question. To get this information in Python, you can use the stat function in the platform-independent os module:
    import os

    st = os.stat("file.dat")
This function takes the name of a file, and returns a 10-member tuple with the following contents:
    (mode, ino, dev, nlink, uid, gid, size, atime, mtime, ctime)
- mode (ST_MODE)
- The access rights for this file.
FIXME: summary 0xxx S_IRUSR, S_IWUSR, S_IXUSR, etc. os.path.isfile, os.path.isdir, os.path.islink, os.path.ismount
- ino, dev (ST_INO, ST_DEV)
- The ino (I-node) and dev (device) members can be used to determine the physical location of a file. On a UNIX system, the (dev, ino)-tuple uniquely identifies a physical file. On a Windows system, the device number (FIXME: always? usually?) corresponds to the drive letter (0=A:, 1=B:, 2=C:, etc).
- nlink (ST_NLINK)
- On a UNIX system, this is the number of hard links to this file. Under Windows, this member is always 1.
- uid, gid (ST_UID, ST_GID)
- On a UNIX system, these can be used to determine the owner of a given file. Under Windows, these are set to 0.
- size (ST_SIZE)
- The size of the file, in bytes.
FIXME: os.path.getsize()
- atime, mtime, ctime (ST_ATIME, ST_MTIME, ST_CTIME)
- The time when the file was last accessed, last modified, and when the file information was last changed. The times are given in seconds since a reference time (the “epoch”, usually 1970), in the same way as time.time() returns the current time. Under Windows, the time of last access is usually not valid.
Here’s an example that prints the size and time of last modification for a given file:
    import os, time
    from stat import * # ST_SIZE etc

    # file is the name of the file to examine
    try:
        st = os.stat(file)
    except os.error: # os.stat reports failures as os.error
        print "failed to get information about", file
    else:
        print "file size:", st[ST_SIZE]
        print "file modified:", time.asctime(time.localtime(st[ST_MTIME]))
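The mode member deserves a short illustration of its own, since it packs both the file type and the permission bits into a single integer. The following is a minimal sketch, assuming a current Python 3 interpreter rather than the Python of this draft. The helper name describe_mode is made up for the example; S_ISDIR, S_ISREG, S_IMODE, and S_IRUSR come from the standard stat module.

```python
import os
import stat
import tempfile

def describe_mode(path):
    """Return (kind, permission bits, owner-readable flag) for path."""
    mode = os.stat(path).st_mode
    if stat.S_ISDIR(mode):
        kind = "directory"
    elif stat.S_ISREG(mode):
        kind = "regular file"
    else:
        kind = "other"
    # S_IMODE strips the file-type bits and leaves the permission bits
    perms = stat.S_IMODE(mode)
    return kind, perms, bool(perms & stat.S_IRUSR)

# try it on a scratch directory and on a file inside it
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.dat")
    with open(path, "w") as fp:
        fp.write("hello")
    dir_info = describe_mode(tmp)
    file_info = describe_mode(path)
    print(dir_info[0], oct(dir_info[1]))
    print(file_info[0], oct(file_info[1]))
```

The permission bits you get back depend on the platform and on the process umask, so treat the printed octal values as informational only.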
The os module also provides an fstat function, which can be used on an opened file. It takes an integer file handle, not a file object, so you have to use the fileno method on the file object:
    fp = open("file.dat")
    st = os.fstat(fp.fileno())
This function returns the same values as a corresponding call to os.stat.
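If you want to convince yourself that the two calls agree, a small experiment is enough. This is a sketch under the assumption of a current Python 3 interpreter; it works on a scratch file created with the tempfile module so that it does not depend on file.dat being present.

```python
import os
import tempfile

# create a scratch file with known contents
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
    path = tmp.name

try:
    by_name = os.stat(path)
    with open(path, "rb") as fp:
        # fstat wants the integer file handle, not the file object
        by_handle = os.fstat(fp.fileno())
    same_size = by_name.st_size == by_handle.st_size
    print("size:", by_handle.st_size, "same size:", same_size)
finally:
    os.remove(path)
```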
FIXME: describe os.lstat and statcache.stat. briefly describe the time functions.
Owner Information
On a UNIX system, you can use stat to figure out who owns a file. The ST_UID member provides this information, but disguised as an integer value, the user identity. The pwd (password) module contains functions to map this to the user’s name. The following snippet maps the ST_UID field to a login name:
    try:
        import pwd # not available on all platforms
        userinfo = pwd.getpwuid(st[ST_UID])
    except (ImportError, KeyError):
        print "failed to get the owner name for", file
    else:
        print "file owned by:", userinfo[0]
The getpwuid function returns a tuple with user information, or raises a KeyError exception if the user is not known. The tuple contains the following fields: (name, password, uid, gid, info, directory, shell).
- name
- User name. This is the user’s login identity.
- password
- Encrypted password (see below).
- uid
- User identity (an integer).
- gid
- Group identity (an integer).
- info
- User information, like full name, department, phone number, etc.
- directory
- The user’s login directory.
- shell
- The user’s login shell.
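In current Python versions the same fields are also available as attributes (pw_name, pw_passwd, pw_uid, pw_gid, pw_gecos, pw_dir, pw_shell), which reads better than numeric indexes. The sketch below assumes Python 3; the helper name owner_name is invented for the example, and it falls back to the numeric identity when the pwd module or the user entry is missing.

```python
import os
import tempfile

def owner_name(path):
    """Return the login name of the owner of path, or the uid as text."""
    uid = os.stat(path).st_uid
    try:
        import pwd  # not available on all platforms
        return pwd.getpwuid(uid).pw_name
    except (ImportError, KeyError):
        # no pwd module, or a uid without a password database entry
        return str(uid)

with tempfile.NamedTemporaryFile() as tmp:
    name = owner_name(tmp.name)
    print("file owned by:", name)
```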
In addition to getpwuid, this module includes getpwnam, which takes a user name instead of a user identity, and getpwall, which returns a list of user information tuples for all known users of the system. The function names may sound strange, but they are taken from the underlying POSIX C functions.
If you are going to fetch the user name for a large number of files, it may be more efficient to use getpwall to preload a dictionary with the user information:
    import pwd

    _info = {}
    for userinfo in pwd.getpwall():
        _info[userinfo[2]] = userinfo

    def getuserinfo(uid):
        return _info[uid]
Password issues
An interesting detail is that the pwd module returns the user’s password. This of course sounds like a serious security risk, doesn’t it? Well, it’s not as bad as it may seem: UNIX uses a one-way encryption scheme, meaning that you can go from a clear-text password to an encrypted password, but not the other way around (at least not easily).
If you use Python on a UNIX platform, you usually have access to the crypt module. Using that module, you can check a password by encrypting it yourself, and comparing the result to the entry in the password database. Here’s a small function that lets you “simulate” logging in to the machine:
    import pwd, crypt

    def login(user, password):
        "Check if a user would be able to log in using password."
        try:
            pw1 = pwd.getpwnam(user)[1]
            pw2 = crypt.crypt(password, pw1)
            return pw1 == pw2
        except KeyError:
            return 0 # no such user
Note that the crypt function takes the encrypted password as its second argument. This is because the first two characters in the encrypted password are a random value (the “salt”), used to make sure there’s more than one way to store the same password.
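The role of the salt can be illustrated without the crypt module. The sketch below is a stand-in and not the crypt(3) algorithm: it assumes Python 3 and uses SHA-256 from hashlib purely to show why the stored value has to be passed back in when you check a password. The names make_entry and check_entry, and the two-character salt layout, are invented for the example; don’t use it to store real passwords.

```python
import hashlib
import hmac
import secrets
import string

SALT_CHARS = string.ascii_letters + string.digits

def make_entry(password, salt=None):
    """Return salt + digest, mimicking the salt-first layout of a crypt entry."""
    if salt is None:
        salt = "".join(secrets.choice(SALT_CHARS) for _ in range(2))
    digest = hashlib.sha256((salt + password).encode("utf-8")).hexdigest()
    return salt + digest

def check_entry(password, entry):
    """Re-hash password with the salt taken from entry, then compare."""
    salt = entry[:2]
    return hmac.compare_digest(make_entry(password, salt), entry)

entry = make_entry("secret")
print(check_entry("secret", entry))  # the right password matches
print(check_entry("guess", entry))   # a wrong one does not
```

Because the salt is stored in front of the digest, the checker can recover it from the entry itself, which is the same reason crypt is handed the stored password as its second argument.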
FIXME: add information on NIS, shadow password databases, etc.
Caching data
If the number of users is large, but you only need to get the name of a few of them, storing all user names in a dictionary is a waste of time and money. Instead, you can look up the identities using getpwuid, but store the result in a Python dictionary. The next time you use the same identity, it is read from the dictionary instead, saving you a (possibly slow) call to the pwd module.
    import pwd

    _users = {}

    def getuserinfo(uid):
        try:
            return _users[uid]
        except KeyError:
            # not seen before; look it up and remember the result
            _users[uid] = info = pwd.getpwuid(uid)
            return info
A function or method that uses a dictionary to store results in this fashion is called a memo function. Obviously, this only works if the result is always the same for any given argument, and it should only be used when the number of possible argument values is relatively limited (at least for a given instance of your program). For example, using this technique to speed up a mathematical operation such as math.sin is not a very good idea (not only because the number of possible arguments is large; the math operations are already fast enough compared to dictionary lookups). Another important restriction is that it must be possible to use the arguments as dictionary keys.
Interestingly enough, Python makes it easy to turn an arbitrary function into a memo function. The following Memoize class shows one way to do it; it uses the __call__ method to capture calls to the object itself, and looks the argument tuple up in the memo dictionary. The actual function is only called if the arguments haven’t been used before.
    class Memoize:
        def __init__(self, function):
            self.memo = {}
            self.function = function
        def __call__(self, *args):
            try:
                return self.memo[args]
            except KeyError:
                result = self.function(*args)
                self.memo[args] = result
                return result
Given this class, we can turn getpwuid into a memo function with a single line of code:
    import pwd

    getuserinfo = Memoize(pwd.getpwuid)
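To see that the wrapped function really is called only once per argument, you can wrap a function that records its calls. This is a sketch assuming Python 3; slow_lookup is a hypothetical stand-in for an expensive call such as pwd.getpwuid, and the Memoize class is repeated so that the example runs on its own.

```python
class Memoize:
    """Same idea as the Memoize class above, written for Python 3."""
    def __init__(self, function):
        self.memo = {}
        self.function = function
    def __call__(self, *args):
        try:
            return self.memo[args]
        except KeyError:
            result = self.function(*args)
            self.memo[args] = result
            return result

calls = []

def slow_lookup(uid):
    # hypothetical stand-in for an expensive lookup
    calls.append(uid)
    return "user%d" % uid

lookup = Memoize(slow_lookup)
results = [lookup(100), lookup(100), lookup(200), lookup(100)]
print(results)
print("underlying calls:", calls)
```

Four lookups result in only two calls to the underlying function, one for each distinct argument.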
If resources are really scarce (or the data you store is large), you may wish to limit the size of the dictionary. For example, you can make sure it never contains more than 100 entries by adding the following lines to the above example, just before you update the dictionary with a new user:
    if len(self.memo) >= 100:
        del self.memo[some entry]
The only problem here is to decide which entry to remove. The longer it takes to create the data that we want to store in the cache, the more important it becomes that we make the right decision. A simple solution is to remove a random entry every time a new entry is added to the cache. This is not only very easy to implement, it is also more efficient than you might believe. Especially in situations where you don’t know much about how the cache will be used, removing a random item is a good way to prevent worst-case behaviour.
The following class implements this cache scheme, using a dictionary interface rather than the functional interface used by the Memoize class. To use this class for your own cache, create a subclass and implement your own version of the fetch method.
    import random

    class RandomCache:
        def __init__(self, size=None):
            self.size = size
            self.data = {}
        def __getitem__(self, item):
            try:
                return self.data[item]
            except KeyError:
                value = self.fetch(item)
                if self.size and len(self.data) >= self.size:
                    # cache is full; remove a randomly chosen entry
                    del self.data[random.choice(list(self.data.keys()))]
                self.data[item] = value
                return value
        def fetch(self, item):
            raise NotImplementedError
Another way is to keep track of when an entry was last used, and remove the oldest entry every time the cache has become too large. The following class provides the same interface as the RandomCache class, but it removes the least recently used (LRU) entry.
    class LRUCache:
        def __init__(self, size=None):
            self.size = size
            self.data = {}
            self.used = []
        def __getitem__(self, item):
            try:
                value = self.data[item]
                if self.size:
                    # move used item to the end of the list (most recent)
                    self.used.remove(item)
                    self.used.append(item)
            except KeyError:
                value = self.fetch(item)
                if self.size:
                    if len(self.data) >= self.size:
                        # the first item is the least recently used
                        del self.data[self.used.pop(0)]
                    self.used.append(item)
                self.data[item] = value
            return value
        def fetch(self, item):
            raise NotImplementedError
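Here is how a subclass might look in practice. This is a sketch assuming Python 3, with the cache class repeated so the example is self-contained and with a return statement on the miss path so that a freshly fetched value reaches the caller. SquareCache and its fetch method are hypothetical; the multiplication stands in for something expensive.

```python
import random

class RandomCache:
    """Python 3 rendering of the random-replacement cache described above."""
    def __init__(self, size=None):
        self.size = size
        self.data = {}
    def __getitem__(self, item):
        try:
            return self.data[item]
        except KeyError:
            value = self.fetch(item)
            if self.size and len(self.data) >= self.size:
                # evict one randomly chosen key to make room
                del self.data[random.choice(list(self.data))]
            self.data[item] = value
            return value
    def fetch(self, item):
        raise NotImplementedError

class SquareCache(RandomCache):
    # hypothetical subclass; fetch stands in for an expensive computation
    def fetch(self, item):
        return item * item

cache = SquareCache(size=3)
values = [cache[n] for n in range(10)]
print(values[:4], len(cache.data))
```

Which three entries survive differs from run to run, but the cache never grows beyond the requested size, and the most recently fetched item is always present.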
FIXME: using a list isn’t very efficient if size is large. add an example using a priority heap instead (and benchmark!)
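One way around the list bookkeeping, short of a priority heap, is to let an ordered dictionary keep track of the usage order, since moving a key to the end and removing the first key are both cheap operations there. The following is a sketch and not a benchmarked replacement: it assumes Python 3 and collections.OrderedDict, and the names OrderedLRUCache and UpperCache are made up for the example.

```python
from collections import OrderedDict

class OrderedLRUCache:
    """LRU cache sketch; keys are kept in least-to-most recently used order."""
    def __init__(self, size=None):
        self.size = size
        self.data = OrderedDict()
    def __getitem__(self, item):
        try:
            value = self.data[item]
            # a hit makes this the most recently used entry
            self.data.move_to_end(item)
        except KeyError:
            value = self.fetch(item)
            if self.size and len(self.data) >= self.size:
                # drop the least recently used entry (the first one)
                self.data.popitem(last=False)
            self.data[item] = value
        return value
    def fetch(self, item):
        raise NotImplementedError

class UpperCache(OrderedLRUCache):
    # hypothetical subclass used only to exercise the cache
    def fetch(self, item):
        return item.upper()

cache = UpperCache(size=2)
for key in ["a", "b", "a", "c"]:
    cache[key]  # after the second "a", "b" is the oldest entry
print(list(cache.data))
```

Fetching "c" pushes out "b" rather than "a", because "a" was used more recently.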
Getting Information from HTML Documents
Most documents published on the World Wide Web are written in a special document format called Hypertext Markup Language (HTML). This is basically a text format; you can create and edit most HTML documents using conventional tools. But in addition to plain text, an HTML document can also contain special markup elements which describe things like the overall document structure, text styles, and embedded images. Here’s a simple example:
    <TITLE>HTML Overview</TITLE>
    <H1>Overview</H1>
    <P>
    An HTML document is basically an ordinary text file, but in
    addition to the plain text, it can also include certain markup
    directives. These directives include <EM>tags</EM> and
    <EM>entities</EM>.
    <P>
    Tags are used to delimit different sections of the document.
    For example, the &lt;TITLE&gt; tag is used to define the
    document title, and &lt;P&gt; starts a new paragraph.
    <P>
    Entities are used to represent special characters and symbols
    by name (&amp;<EM>name</EM>;) or number (&amp;<EM>number</EM>;),
    rather than writing them as is in the text. Certain characters
    must always be written as entities, unless they are part of
    the markup.
June 2004: The above example is pretty weird, and the whole description represents a somewhat aged view of HTML, don’t you think? /F
Somewhat simplified, the markup consists of tags written in angle brackets (<TITLE>, <P>, etc.), and character entities, consisting of an ampersand, a name, and a semicolon (&lt;, etc.).
Tags are used to divide the document into different parts, and to embed images, tables, and other objects in the text. When used for the former purpose, the tags are usually used in pairs, like <TITLE> and </TITLE> in this example. Other tags used in the example are <H1> for first-level heading, <EM> for emphasized text, and <P> for paragraph breaks. Note that since it wouldn’t make sense to nest paragraphs, there’s no need to explicitly close them using </P> (you can do it if you wish, though).
The other type of markup, entities, is used to embed symbols and special characters in the text. For example, to embed markup characters like <, >, and & in the text, you must write them as entities (&lt;, &gt;, and &amp;).
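If you generate HTML from a program, you don’t have to do this replacement by hand. As a small sketch, assuming a current Python 3 interpreter, the standard html module can convert markup characters to entities and back:

```python
import html

text = "Use <P> & <TITLE> tags"

# replace the markup characters with the corresponding entities
escaped = html.escape(text, quote=False)
print(escaped)

# and convert the entities back to plain characters
restored = html.unescape(escaped)
print(restored == text)
```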
But enough theory. Here’s how this document looks in a web browser:
Figure: The sample document rendered by the Grail3 web browser
If you compare the source document with the text displayed by the browser, you’ll notice some additional things. The tags are not displayed at all, and the entities are converted to characters, as expected. The TITLE section is displayed in the browser’s window title bar, and the H1 and EM sections are rendered using separate fonts. Also note that line breaks don’t correspond to those in the source document. HTML boldly collapses any sequence of whitespace, including newline characters, into a single space. You need to use tags to separate paragraphs and lines.
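The whitespace rule is easy to mimic in Python. The following sketch, assuming Python 3, is only a rough approximation of what a browser does to running text; the function name collapse_whitespace is made up for the example, and unlike a browser it also drops whitespace at the very start and end of the string.

```python
def collapse_whitespace(text):
    """Replace every run of spaces, tabs, and newlines with one space."""
    # split() without arguments splits on any run of whitespace
    return " ".join(text.split())

source = "An HTML   document is\n\tbasically an ordinary\n\n text file."
collapsed = collapse_whitespace(source)
print(collapsed)
```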
For a complete description of HTML, see HTML 3.2 Reference Specification.
Descriptive tags
In addition to all the tags that can be used to control the look of a document, HTML also provides tags which provide information about the document itself. For our purposes, the most interesting tag is the META tag, which in its simplest form provides a key/value pair with information about the document.
<META NAME='key' CONTENT='value'>
At the time of writing, the key names are not standardized, but major search engines like InfoSeek and Alta Vista use the following convention:
- keywords
- The content field contains a number of keywords relevant for this document. The keywords are separated by commas.
- description
- The content field contains a descriptive text (a summary or an abstract). When you search for documents, this text may be displayed along with the document title (taken from the TITLE tag) to help you pick the right document.
Other commonly used fields are author, generator (what program was used to create the file), publisher, and timestamp (when was the document last modified). Here’s a sample document containing a number of META tags:
FIXME
Parsing the document
Now that we know what to look for, it’s time to write some code to extract this information from any given HTML file. As usual, the standard library contains just what we need: the htmllib and formatter modules. The htmllib module defines one class, HTMLParser, which reads the HTML document and calls methods of a formatter object. The parser class has methods corresponding to the various tags that can occur in an HTML document, most of which you can override if you wish to handle some tag in a special way.
FIGURE: parser/formatter/writer structure, including typical operations
The formatter class has methods corresponding to more abstract text operations, like font changes, paragraph marks, flowing text, etc. The formatter module defines two standard formatters, a NullFormatter class which happily ignores everything generated by the parser, and an AbstractFormatter class which converts the text operations to concrete text rendering operations.
Since we’re out to extract information from the HTML document itself, we can use the NullFormatter class, and simply override the parser method used to handle the META tag.
    from htmllib import HTMLParser
    from formatter import NullFormatter
    import string

    class MetaParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self, NullFormatter())
            self.meta_dict = {}
        def do_meta(self, attrs):
            # this method is called for META tags
            name = content = None
            # attrs is a list of 2-tuples
            for k, v in attrs:
                if k == "name":
                    name = string.lower(v)
                elif k == "content":
                    content = v
            if name and content:
                self.meta_dict[name] = content

    def getmeta(file):
        # extract META tags from an HTML document
        p = MetaParser()
        fp = open(file)
        while 1:
            s = fp.read(10000)
            if not s:
                break
            p.feed(s)
        fp.close()
        p.close()
        # the title tag is extracted by the base class
        if p.title:
            p.meta_dict["title"] = p.title
        return p.meta_dict
The parser is designed to parse the document in pieces, so we simply read chunks from the HTML file, and pass them to the feed method. There’s no need to read individual lines, or to make sure the chunks don’t end in the middle of a tag or an entity.
We override the do_meta method, which is called by the parser for each META tag in the document. In this method, the attrs argument contains the parameters used in the META tag. We scan this list to look for name and content parameters, and if both are found, store the content in the meta_dict dictionary. The parser base class automatically extracts the TITLE section, if present, and we’ll add that to the dictionary before returning the dictionary to the caller.
Running this on our sample file produces the following output:
    >>> import htmlmeta
    >>> htmlmeta.getmeta("fredrik.html")
    {'author': 'Fredrik', 'title': 'Python Resources',
     'description': "Fredrik's Python Resource Center",
     'keywords': 'Python, PIL, Tkinter, stuff',
     'generator': 'PyHTML 1.2'}
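The htmllib and formatter modules are not available in Python 3, so the code above won’t run there. As a rough equivalent, and not a port of the original, here is a sketch that assumes Python 3 and uses the html.parser module instead. The base class in that module does not collect the title by itself, so the sketch gathers it by hand. The sample document is made up for the example (it is not the fredrik.html file used above), and getmeta takes the document text rather than a file name so that the example is self-contained.

```python
from html.parser import HTMLParser

class MetaParser(HTMLParser):
    """Collect META name/content pairs and the TITLE text."""
    def __init__(self):
        super().__init__()
        self.meta_dict = {}
        self._in_title = False
        self._title = []
    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive in lower case
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            fields = dict(attrs)
            name, content = fields.get("name"), fields.get("content")
            if name and content:
                self.meta_dict[name.lower()] = content
    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
    def handle_data(self, data):
        # title text may arrive in several pieces
        if self._in_title:
            self._title.append(data)

def getmeta(text, chunksize=16):
    # feed the document in small pieces, as if read from a file
    p = MetaParser()
    for i in range(0, len(text), chunksize):
        p.feed(text[i:i + chunksize])
    p.close()
    title = "".join(p._title).strip()
    if title:
        p.meta_dict["title"] = title
    return p.meta_dict

# made-up sample document
SAMPLE = """<TITLE>Sample Page</TITLE>
<META NAME='keywords' CONTENT='python, html'>
<META NAME='Author' CONTENT='A. N. Other'>
<H1>Sample</H1>
"""

meta = getmeta(SAMPLE)
print(meta)
```

The deliberately small chunk size means that most chunks end in the middle of a tag, which shows that this parser, like the one in htmllib, can be fed the document piece by piece.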
Getting Information from Image Files
For an image file, you might be interested in things like image size, type of image data, and compression ratio. The easiest way to get this information is to use PIL’s identification mechanism. Simply call PIL’s open function, and examine the resulting image object. This operation is usually fast, since PIL only reads as much of the file as is necessary to determine what the file contains, and how to read the image data proper. The actual image is not read or decoded until it is actually needed.
The following example determines format, pixel type, and size of a given image file:
    from PIL import Image

    try:
        im = Image.open(file)
    except IOError:
        print "failed to identify", file
    else:
        print "image format:", im.format
        print "image mode:", im.mode
        print "image size:", im.size
        if im.info.has_key("description"):
            print "image description:", im.info["description"]
You can derive other metrics from this information as well. For example, by comparing the size of the image file with the size of the actual image data, you can get a measure of the compression ratio for an image.
    import os
    from stat import * # ST_SIZE etc

    MODEBITS = {
        # bits per pixel for common PIL image modes
        "1": 1, "P": 8, "L": 8,
        "RGB": 24, "RGBA": 32, "CMYK": 32
    }

    try:
        bits = MODEBITS[im.mode]
        # bytes per row (rounded up to whole bytes) times number of rows
        imagebytes = ((im.size[0] * bits + 7) / 8) * im.size[1]
        filebytes = os.stat(file)[ST_SIZE]
    except (os.error, KeyError):
        print "failed to determine compression ratio for", file
    else:
        # convert to float to avoid integer division
        print "compression:", round(float(imagebytes) / filebytes, 2), "times"
The compression ratio is close to 1 for non-compressed formats like BMP and PPM, typically 2-10 for GIF and other lossless compression formats, and 10-20 for JPEG files. Note that as the expression is written, you won’t end up dividing by zero since an empty file cannot possibly be a valid image file. If you change the expression to get the compressed size in percent of the full size, don’t forget that an image file may contain an empty image (imagebytes=0).
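The arithmetic is easy to check without an image at hand. The following sketch assumes Python 3 and works on plain numbers instead of a PIL image object; the function names are invented for the example, and the file sizes are made-up figures chosen to give round results.

```python
MODEBITS = {
    # bits per pixel for common PIL image modes
    "1": 1, "P": 8, "L": 8,
    "RGB": 24, "RGBA": 32, "CMYK": 32,
}

def raw_image_bytes(mode, size):
    """Uncompressed size: whole bytes per row times number of rows."""
    width, height = size
    return ((width * MODEBITS[mode] + 7) // 8) * height

def compression_ratio(mode, size, filebytes):
    """Raw image bytes divided by file bytes; filebytes must be positive."""
    return raw_image_bytes(mode, size) / filebytes

def compressed_percent(mode, size, filebytes):
    """File size in percent of the raw size, or None for an empty image."""
    imagebytes = raw_image_bytes(mode, size)
    if imagebytes == 0:
        return None  # avoid dividing by zero
    return 100.0 * filebytes / imagebytes

print(raw_image_bytes("RGB", (640, 480)))
print(compression_ratio("RGB", (640, 480), 92160))
print(compressed_percent("RGB", (0, 0), 54))
```

A 640 by 480 RGB image takes 1920 bytes per row, or 921600 bytes in all, so a 92160-byte file corresponds to a ratio of 10. The last call shows the empty-image case mentioned above.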