Sockets: Usenet Support
This is a really old draft from 1997.
Pulling Documents and Images off Usenet
Another source for information and images is the part of the Internet called Usenet, or News. Usenet is a distributed bulletin board, where messages can be read from, and posted to, special news servers. Messages posted to a given news server are propagated to other servers, but as with the Web, you have to connect to a server to be able to read the messages.
The protocol used to fetch messages (“articles”) from a news server is called the Network News Transfer Protocol (NNTP) <RFC977>. Here’s a typical session, in which the client application connects, reads the standard headers for new messages in the newsgroup called comp.lang.python, downloads one of them, and then posts a message to the server (possibly in response to the other message):
    Client: connects
    Server: 200 news.spam.egg PyNNTP 1.0 ready (posting ok)
    Client: GROUP comp.lang.python
    Server: 211 367 13887 14268 comp.lang.python
    Client: XOVER 14211-14268
    Server: 224 data follows
    Server: (sends overview information for articles 14211 to 14268)
    Server: .
    Client: ARTICLE 14220
    Server: 220 14220 <5qj8v5$8dd@news.spam.egg> article
    Server: (sends message)
    Server: .
    Client: POST
    Server: 340 OK
    Client: (sends message)
    Client: .
    Server: 240 Article posted
    Client: QUIT
    Client: disconnects
Note that each command from the client starts with a command keyword, and each reply from the server starts with a status code. Messages and listings are terminated with a line containing a single dot.
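The dot-termination rule can be sketched in a few lines. This is a minimal illustration in present-day Python 3 (an assumption; the listings in this chapter were written for the Python of the time), and read_block is a hypothetical helper name, not part of any library. It also undoes the "dot-stuffing" used by NNTP: a text line that itself begins with a dot is sent with an extra leading dot, so that it cannot be mistaken for the terminator.

```python
def read_block(lines):
    """Collect lines up to (not including) the lone "." terminator.

    `lines` is any iterable of strings with line endings already removed.
    """
    text = []
    for line in lines:
        if line == ".":
            break                # end of the message or listing
        if line.startswith("."):
            line = line[1:]      # undo dot-stuffing
        text.append(line)
    return text

# hypothetical server output: four text lines, the terminator, then
# something that belongs to the next reply
reply = ["first line", "..hidden dot", "", "last line", ".", "ignored"]
print(read_block(reply))
```

Note that an empty line is ordinary content (it separates headers from the body), so only a line consisting of exactly one dot ends the block.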
The server assigns a serial number to each message (in this case, the comp.lang.python newsgroup currently contains 367 messages, numbered from 13887 to 14268), and it’s usually up to the client to keep track of which messages it has already seen.
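As a rough illustration of how a client might use these numbers, the sketch below splits the 211 reply from the session above into its fields and works out which messages are still unread. It is written for Python 3 (an assumption), and parse_group_reply and last_seen are hypothetical names chosen for this example.

```python
def parse_group_reply(line):
    """Split a 211 reply into (count, first, last, name)."""
    fields = line.split()
    if fields[0] != "211":
        raise ValueError("unexpected reply: " + line)
    count, first, last = (int(f) for f in fields[1:4])
    return count, first, last, fields[4]

count, first, last, name = parse_group_reply(
    "211 367 13887 14268 comp.lang.python")
print(count, first, last, name)

# A client that has already seen everything up to 14210 only needs
# the messages after that number.
last_seen = 14210
print("fetch", max(first, last_seen + 1), "to", last)
```

The message count is best treated as an estimate: in this reply it is smaller than the size of the number range, so a client should not assume that every number in the range refers to an available message.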
News Message Format
We’ll implement an NNTP client class in a moment, but before we do that, let’s see what the news messages look like. Here’s a simple example:
    Path: news.myisp.se!newsfeed.internetmci.com!news.spam.egg
    From: user@spam.egg
    Newsgroups: comp.lang.python
    Subject: Re: Where's the bacon?
    Date: 17 Jul 1999 09:25:53 -0400
    Lines: 12
    Sender: user@spam.egg
    Message-ID: <lqsoxd95em.ach@news.spam.egg>
    References: <199907152100.RAA14304@foobar.spam.egg>
    Xref: news.spam.egg comp.lang.python:14304

    Fredrik wrote:
    > Haven't got a clue.

    Maybe someone else knows more. You could check the
    list of contributed software at www.python.org.
    ...
As in HTTP, the message starts with a list of headers, followed by an empty line, and the message body itself. Python’s standard library contains a module designed to represent messages like this. This module is named rfc822, after the Internet specification with the same name (the full name of which is Standard for the Format of ARPA Internet Text Messages, by the way).
RFC822 only specifies the general layout of the message; another specification, RFC1036, defines what headers to use in a news message.
<FIXME: header field summary: From, Date, Newsgroups, Subject, Message-ID, and Path>
The Message class defined in the rfc822 module takes a file handle, extracts the header fields, and leaves the file pointer positioned on the first line in the message, after the empty line. Basically, an instance of the Message class behaves like a dictionary of header fields, but also provides a set of utility functions and members.
The following code snippet reads a message from a file, and dumps the header dictionary to the screen:
    import rfc822

    fp = open("sample.news")
    msg = rfc822.Message(fp)

    # header names are converted to lower case
    for k, v in msg.items():
        print k, "=", v
If applied to the above example, this script prints something like:
    path = news.myisp.se!newsfeed.internetmci.com!news.spam.egg
    newsgroups = comp.lang.python
    from = user@spam.egg
    sender = user@spam.egg
    xref = news.spam.egg comp.lang.python:14304
    date = 17 Jul 1999 09:25:53 -0400
    references = <199907152100.RAA14304@foobar.spam.egg>
    lines = 12
    message-id = <lqsoxd95em.ach@news.spam.egg>
    subject = Re: Where's the bacon?
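The rfc822 module belongs to the Python versions of the time and is not available in Python 3. As a rough present-day equivalent (an assumption about what a current reader would run, not part of the original code), the standard email package parses the same header/body layout. The sample text is an abridged copy of the message shown earlier.

```python
import email

SAMPLE = """\
From: user@spam.egg
Newsgroups: comp.lang.python
Subject: Re: Where's the bacon?
Message-ID: <lqsoxd95em.ach@news.spam.egg>

Fredrik wrote:
> Haven't got a clue.
"""

msg = email.message_from_string(SAMPLE)

# dump the header fields, much like the rfc822 example
for key, value in msg.items():
    print(key.lower(), "=", value)

# everything after the empty line is the message body
body = msg.get_payload()
print(body.splitlines()[0])
```

Header lookups such as msg["subject"] are case-insensitive, so the dictionary-like usage described above carries over.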
Sending Binary Data via News
The RFC822 specification (published in 1982) explicitly specifies that only 7-bit US ASCII characters can be used in news messages (it also applies to mail, something we will discuss later in this chapter). Nevertheless, binary files can be posted anyway, by first encoding them using one of the following methods:
- Use the Unix uuencode utility to encode the data.
- Use the Multipurpose Internet Mail Extension (MIME) encoding standard. The base64 encoding scheme in particular is becoming popular as a slightly more convenient alternative to uuencode.
- [FIXME: Use the yEnc format]
In both uuencode and base64, each group of 3 data bytes is converted to 4 ASCII characters, storing 6 bits of original data in each character. While uuencode stores each 6-bit value as chr(value+32), the base64 encoding uses a character table designed to minimize the risk of errors if the message is converted to other character sets. Python’s standard library supports both formats, via the uu and base64 modules, and a low-level support module called binascii.
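To make the 3-to-4 conversion concrete, here is a small Python 3 sketch (an assumption; the chapter's own listings predate Python 3). It splits three bytes into four 6-bit values by hand and then compares the result with what binascii and base64 produce. The input b"Cat" is just an arbitrary sample.

```python
import base64
import binascii

data = b"Cat"                      # 3 bytes = 24 bits

# split the 24 bits into four 6-bit values
bits = int.from_bytes(data, "big")
values = [(bits >> shift) & 63 for shift in (18, 12, 6, 0)]
print(values)

# uuencode: each value becomes chr(value + 32)
uu_chars = "".join(chr(v + 32) for v in values)
print(uu_chars)

# binascii adds a length character in front and a newline at the end
print(binascii.b2a_uu(data))

# base64 maps the same four values through its own character table
print(base64.b64encode(data))
```

The four characters printed by hand reappear in the b2a_uu output after the leading length character, while base64 encodes the same four values with different characters.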
The uuencode format is line-oriented. The encoded data starts with a begin line, which also contains the Unix file mode (in octal) and the original filename. Then follow the encoded lines (the first character of each line gives the number of bytes encoded on the rest of the line, and is usually an “M” for a full line of 45 binary bytes), and the encoded block ends with a line containing the word end. Here’s an example:
    begin 600 can.jpg
    M_]C_X `02D9)1@`!``$`4P!3``#__@`752U,96%D(%-Y<W1E;7,L($EN8RX`
    M_]L`A `#`@("`@(#`@("`P,#`P0(!00$! 0)!P<%" L*# P+"@L+# X2#PP-
    M$0T+"Q 5$!$3$Q04% P/%A@6%!@2%!03`0,#`P0$! D%!0D3#0L-$Q,3$Q,3
    ... typically a few hundred similar lines ...
    M?E3;Y52UNG1$5E2,`A1QT_7W]SZFL8?"O4N"3C)LBTHEW ?YL<#=SCGMZ=!^
    M50M-*NH_*Y3##&WC'TQT_P#U53BN9JQ7*K19J:ZB0PV3Q*(RZ$ML&,G*GM]?
    =Y#L*S)I9$E9%D8!20,GIS6>'2:5T;Q24I6\@_]FB
    `
    end
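The layout described above can be reproduced with binascii.b2a_uu, which encodes one line (at most 45 bytes) at a time. The following Python 3 sketch is only an illustration under that assumption; uuencode is a hypothetical helper name, and the sample data is an arbitrary run of 100 bytes rather than a real image.

```python
import binascii

def uuencode(data, filename, mode=0o600):
    """Return `data` as a list of uuencoded text lines (no newlines)."""
    lines = ["begin %o %s" % (mode, filename)]
    for i in range(0, len(data), 45):
        chunk = data[i:i + 45]               # at most 45 bytes per line
        lines.append(binascii.b2a_uu(chunk).decode("ascii").rstrip("\n"))
    lines.append("`")                        # zero-length line
    lines.append("end")
    return lines

lines = uuencode(bytes(range(100)), "sample.bin")
print(lines[0])

# first character of each data line: "M" for the two full lines,
# something else for the shorter last line
print([line[0] for line in lines[1:-2]])
```

With 100 bytes of input, the first two data lines are full 45-byte lines and start with "M", and the remaining 10 bytes end up on a shorter line with a different length character.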
The MIME format is a bit different; it uses special message headers to indicate what the message contains, and how it is encoded. If the message header contains a field named MIME-Version, the document is encoded using the MIME specification. We’ll get back to MIME and base64 encoding later in this chapter, when we take a closer look at how to send and receive images and other documents via electronic mail.
Decoding uuencoded messages
To figure out if a message contains uuencoded data, we need to scan the message body for a line starting with begin, followed by a number and a filename. We can then use the binascii module to convert each line to a chunk of binary data, and write it to a file, or, as in the following example, store it in a list. The getuubody function shown below also returns the filename. If the message is not encoded, this function sets the filename to None, and returns the message body as is.
Example: extract uuencoded data (from messageutils.py)
    import binascii, regex, string

    # matches the "begin <mode> <filename>" line
    begin = regex.compile("begin [0-9]+ \(.*\)")

    def getuubody(msg):
        "Given a uuencoded message, extract and decode the message body"
        msg.rewindbody()
        while 1:
            s = msg.fp.readline()
            if not s:
                break
            if begin.match(s) > 0:
                # decode uuencoded message body
                body = []
                file = begin.group(1)
                for s in msg.fp.readlines():
                    if s[:3] == "end":
                        break
                    try:
                        body.append(binascii.a2b_uu(s))
                    except binascii.Error:
                        # workaround for broken encoders: keep the length
                        # character plus the characters it accounts for
                        bytes = (((ord(s[0])-32) & 63) * 4 + 5) / 3
                        body.append(binascii.a2b_uu(s[:bytes]))
                return file, string.join(body, "")
        # no begin line found; return the body as is
        msg.rewindbody()
        return None, msg.fp.read()
Note that some encoders add extra padding characters to lines containing fewer than 45 bytes of binary data. In earlier versions of Python, the binascii module raises an exception if it stumbles upon such a line; the try/except clause above works around this problem by explicitly truncating the line to the appropriate length.
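For readers on Python 3, where the regex module no longer exists, the same approach can be sketched with the re module. This is a loose adaptation, not the code from messageutils.py: extract_uu and decode_uu_line are hypothetical names, the body is passed in as a list of byte strings rather than a Message instance, and the trailing "XXXX" simulates a sloppy encoder.

```python
import binascii
import re

BEGIN = re.compile(r"begin [0-7]+ (.*)")

def decode_uu_line(line):
    """Decode one uuencoded line (bytes), tolerating trailing padding."""
    try:
        return binascii.a2b_uu(line)
    except binascii.Error:
        # keep the length character plus the characters it accounts for
        nchars = (((line[0] - 32) & 63) * 4 + 5) // 3
        return binascii.a2b_uu(line[:nchars])

def extract_uu(body_lines):
    """Return (filename, data), or (None, raw body) if nothing is encoded."""
    body_lines = list(body_lines)
    lines = iter(body_lines)
    for line in lines:
        m = BEGIN.match(line.decode("ascii", "replace"))
        if m:
            break
    else:
        return None, b"".join(body_lines)
    chunks = []
    for line in lines:
        if line.startswith(b"end"):
            break
        chunks.append(decode_uu_line(line))
    return m.group(1).strip(), b"".join(chunks)

good = binascii.b2a_uu(b"spam and eggs")
padded = good.rstrip(b"\n") + b"XXXX\n"      # extra junk after the data
body = [b"Here it is:\n", b"begin 644 menu.txt\n", padded, b"`\n", b"end\n"]
print(extract_uu(body))
```

The length character says how many bytes the line encodes, so the number of characters worth keeping is that count times 4/3, rounded up, plus one for the length character itself.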
[FIXME: explain why uu.py cannot be used: it assumes that the file is already positioned on the begin line, and it doesn’t handle offending encoders well either (this will probably be fixed in binascii in 1.5 final)]
An NNTP Client Library
Creating a client library for the NNTP protocol is a straightforward task. Again, the SimpleClient class takes care of the socket configuration issues, and provides getline and putline primitives.
The code shown here includes only a minimal set of commands: list to get a list of the newsgroups available on the server, group to select which group to read, overview to get an overview of all or some messages in a group, and retrieve to read a given message. The overview method uses an NNTP command called XOVER, which is an extension to the original NNTP protocol. Virtually every modern news server supports this command, though, and some news clients won’t work without it. The retrieve method uses HEAD, BODY, or ARTICLE to read parts or all of a message. The default is ARTICLE, which reads both headers and body in a single call.
Example: File: NNTPClient.py
    from string import *
    import SimpleClient

    ARTICLE, HEAD, BODY = tuple(range(3))

    class NNTPClient(SimpleClient.SimpleClient):

        def __init__(self, host, port = 119):
            # connect
            SimpleClient.SimpleClient.__init__(self, host, port)
            s, self.welcome = self.getstatus()
            if s not in [200, 201, 205]:
                raise IOError, (s, "NNTP connection error", self.welcome)
            self.may_post = (s == 200)
            self.must_login = (s == 205)

        def close(self):
            "Quit."
            try:
                stat = self.command(None, "QUIT")
            except IOError:
                pass
            # self.destroy()

        def command(self, ok, *args):
            # send a command; "ok" lists the acceptable status codes
            self.putline(join(args))
            s, m = self.getstatus()
            if ok and s not in ok:
                raise IOError, (s, args[0]+" command failed", m)
            return m

        def getstatus(self):
            info = self.getline()
            return atoi(info[:3]), info

        def getmessage(self, newline = ""):
            # read lines up to the terminating "." line
            text = []
            while 1:
                s = self.getline()
                if s[:1] == ".":
                    s = s[1:]
                    if not s:
                        break
                text.append(s + newline)
            return text

        def _range(self, lo, hi):
            if hi is None:
                return str(lo)
            return "%s-%s" % (lo, hi)

        #
        # NNTP commands (subset)

        def group(self, group):
            "Select group. Returns number of messages, range, and group name."
            m = split(self.command([211], "GROUP", group))
            self.groupinfo = group, (atoi(m[2]), atoi(m[3]))
            return (atoi(m[1]),              # number of messages (est.)
                    atoi(m[2]), atoi(m[3]),  # message number range
                    m[4])                    # group name

        def list(self):
            "List groups. Returns list of (group, lo, hi, may_post) tuples"
            self.command([215], "LIST")
            data = []
            for s in self.getmessage():
                # each line has the form "group last first flag"
                s = split(s)
                data.append((s[0],                   # group name
                             atoi(s[2]), atoi(s[1]), # message number range
                             s[3] in "yY"))          # may post
            return data

        def overview(self, lo, hi = None):
            "Get message overview (extension)."
            self.command([224], "XOVER", self._range(lo, hi))
            data = []
            for s in self.getmessage():
                s = split(s, "\t")
                data.append((atoi(s[0]),          # message number
                             s[1],                # subject
                             s[2],                # from
                             s[3],                # date
                             s[4],                # message id
                             tuple(split(s[5])),  # references
                             atoi(s[6]),          # byte count
                             atoi(s[7])))         # line count
            return data

        def retrieve(self, msgid, mode = ARTICLE):
            "Get article (mode argument controls which part)"
            if mode == HEAD:
                self.command([221], "HEAD", str(msgid))
            elif mode == BODY:
                self.command([222], "BODY", str(msgid))
            else:
                self.command([220], "ARTICLE", str(msgid))
            return self.getmessage("\n")
Messages are returned as a list of strings, where each string ends with a newline. In this way, messages obtained via retrieve look like messages read from a file using readlines.
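The tuples returned by the overview method are easy to mix up, since the fields are only identified by position. One possible refinement, sketched here in Python 3 as an assumption rather than as part of NNTPClient.py, is to give the fields names. Overview and parse_overview are hypothetical names, and the sample line is made up from the example message shown earlier (the byte count is invented).

```python
from collections import namedtuple

Overview = namedtuple(
    "Overview",
    "number subject sender date message_id references bytes lines")

def parse_overview(line):
    """Turn one tab-separated XOVER line into an Overview tuple."""
    f = line.split("\t")
    return Overview(int(f[0]), f[1], f[2], f[3], f[4],
                    tuple(f[5].split()), int(f[6]), int(f[7]))

# illustrative overview line, not real server output
line = "\t".join([
    "14220", "Re: Where's the bacon?", "user@spam.egg",
    "17 Jul 1999 09:25:53 -0400", "<lqsoxd95em.ach@news.spam.egg>",
    "<199907152100.RAA14304@foobar.spam.egg>", "1234", "12"])

info = parse_overview(line)
print(info.number, info.subject, info.bytes)
```

Because a named tuple still unpacks like a plain tuple, code that relies on the positional layout keeps working.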
An NNTP Robot
The following example uses the NNTPClient module to download messages from a news server. It fetches overview information from the server (including the From and Subject header fields, and size information), passes that information to a user-defined filter function, and downloads messages as indicated by the filter. The messages are stored in files named group-serial.news. [FIXME: redesign NNTPClient so it returns Article instances, and move the processing into that class.]
Example: File: newsrobot.py
    #
    # user configuration

    HOST = "news.spam.egg"
    GROUP = "alt.binaries.pictures.bacon"

    def messagefilter(info):
        serial, subject, _from, date, msgid, ref, bytes, lines = info
        # assume everything larger than 10k is an image, but don't
        # download things larger than 60k
        return 10000 <= bytes <= 60000

    #
    # main program

    import NNTPClient
    import string

    nntp = NNTPClient.NNTPClient(HOST)

    count, lo, hi, name = nntp.group(GROUP)

    # get last message number, if saved
    try:
        fp = open(GROUP + ".last")
        lo = max(lo, string.atoi(fp.readline())+1)
        fp.close()
    except (IOError, ValueError):
        pass # scan whole group

    # loop over new messages
    serial = None
    for info in nntp.overview(lo, hi):
        serial = info[0]
        if messagefilter(info):
            print "fetching", info[2], "(%d bytes)" % info[6]
            message = nntp.retrieve(serial)
            fp = open("%s-%d.news" % (GROUP, serial), "w")
            fp.writelines(message)
            fp.close()

    nntp.close()

    # store last message number, unless there were no new messages
    if serial is not None:
        try:
            fp = open(GROUP + ".last", "w")
            fp.write(str(serial) + "\n")
            fp.close()
        except IOError:
            pass
Note that we store the last message number seen in a file named group.last, to avoid downloading the same messages over and over again. To start all over again, for example if you change the filter, simply remove that file.
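The bookkeeping can also be pulled out into two small helpers, which makes the "remove the file to start over" behaviour easy to try out. This is a Python 3 sketch under that assumption; load_last and save_last are hypothetical names, and the demonstration uses a scratch directory rather than the robot's working directory.

```python
import os
import tempfile

def load_last(path, default):
    """Return the saved message number, or `default` if none is usable."""
    try:
        with open(path) as fp:
            return int(fp.readline())
    except (OSError, ValueError):
        return default

def save_last(path, serial):
    """Remember the number of the last message seen."""
    with open(path, "w") as fp:
        fp.write("%d\n" % serial)

# try it out in a scratch directory
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "comp.lang.python.last")

lo = 13887                                     # first number in the group
print(max(lo, load_last(path, lo - 1) + 1))    # no file yet: start at lo
save_last(path, 14220)
print(max(lo, load_last(path, lo - 1) + 1))    # resume after 14220
os.remove(path)                                # start all over again
print(max(lo, load_last(path, lo - 1) + 1))
```

A missing or corrupt file simply falls back to the default, which mirrors the way the robot scans the whole group when it cannot read group.last.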
[FIXME: instead of storing the raw message to disk, this code should call the getuubody method and store the message body in the “incoming” directory]