I hate working with XML. It's easy to extract data from simple text
files or CSV files, but XML is all nested, and has entities, and lots
of pointy brackets. Regexps just don't cut it; you really need an XML
parser. And for some reason Python is not so great at XML.
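To make the regexp complaint concrete, here's a rough sketch, assuming a current Python 3 and using only xml.dom.minidom from the standard library. The sample markup, the example.com URLs, and the two helper names are made up for illustration. The naive pattern leaves the entity undecoded and never sees the single-quoted attribute; a parser handles both.

```python
import re
from xml.dom import minidom

# Hypothetical sample: nested outlines, an entity, and mixed quoting.
SNIPPET = """<outline text="Top">
  <outline text="Q&amp;A" xmlUrl="http://example.com/feed?a=1&amp;b=2"/>
  <outline text="Other" xmlUrl='http://example.com/other'/>
</outline>"""


def urls_by_regex(text):
    """Naive approach: only matches double quotes, decodes nothing."""
    return re.findall(r'xmlUrl="([^"]*)"', text)


def urls_by_parser(text):
    """Parse the markup and read the attribute off each outline element."""
    doc = minidom.parseString(text)
    return [
        el.getAttribute("xmlUrl")
        for el in doc.getElementsByTagName("outline")
        if el.hasAttribute("xmlUrl")
    ]


if __name__ == "__main__":
    print(urls_by_regex(SNIPPET))
    print(urls_by_parser(SNIPPET))
```

You could keep patching the pattern for quoting and entities, but at that point you're writing a bad parser by hand.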
Python has too many XML choices. There's the stock Python install, which barely does anything. Then there's what you probably should use, PyXML, which has an ugly hack to confusingly install on top of the default Python libraries. But if you follow the advice of Python's most visible XML expert, Uche Ogbuji, you may think there's something wrong with PyXML and install 4Suite instead, which is the same as PyXML only different. Or should you use Amara instead? Then there's ElementTree which is brilliantly fast and simple to use, but limited, or xmltramp, which is even more hacky. On the other extreme there's libxml2, which is fast and powerful but has an awful API.

Mind you, this is all for the basic stuff, like parsing XML. There's lots more Python XML options too. But what's missing is a clear single simple library to use. PyXML seems the most standard, but it seems very slow and it tries to be more DOM-like than Python-like. I hate DOM.

All of this is a long-winded preamble to my attempt to do something simple with XPath in Python. I ended up with three tiny sample programs that extract the 'xmlUrl' attributes from an OPML file. Here they are.

PyXML

    from xml.dom.ext.reader import Sax2
    from xml import xpath

    # Parse the file into a DOM and take the root element
    doc = Sax2.FromXmlFile('foo.opml').documentElement
    # Each XPath match is an attribute node; .value is its text
    for url in xpath.Evaluate('//@xmlUrl', doc):
        print url.value

This is pretty simple, if a bit wordy. While my goal was
simplicity, it's worth noting this is really slow, like 2.5
seconds on a 14k file. The other versions take about 0.5 seconds.

libxml2

    import libxml2

    doc = libxml2.parseFile('foo.opml')
    # xpathEval returns a plain list of matching attribute nodes
    for url in doc.xpathEval('//@xmlUrl'):
        print url.content

Looks simple enough. But this example hides the awfulness of
the libxml2 API. For instance, when you're looking at a tag and you
want a list of all its attributes, you can't just get a list.
You call get_properties(), which
returns only the first attribute, whose .next you then have to
follow to get the second one. This is Python, guys, not C.
We have list as a datatype. The good thing about libxml2 is it's
powerful and fast.

ElementTree

    from elementtree import ElementTree

    tree = ElementTree.parse("foo.opml")
    # The path can only select elements, so read the attribute off each one
    for outline in tree.findall("//outline"):
        print outline.get('xmlUrl')

This is my favourite example, because it feels the simplest and most
Pythonic. But ElementTree's XPath support is
woefully incomplete. About all you can do is select nodes. You can't
select attributes or do anything fancier.

Bottom line? Python is all about "batteries included". But the XML batteries are weak. There are some more powerful options but they've all got drawbacks. Either the APIs are awful, the libraries are slow, or else they lack features. Someone needs to put a clean new XML system into Python. Implement standard SAX and DOM because you have to, but then build a really nice Pythonic API as well and promote that.
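
As a footnote to the ElementTree example: here's a sketch of the same extraction against xml.etree.ElementTree, the copy of ElementTree that ships in the standard library, assuming a current Python 3. The inline OPML sample and the feed_urls helper are hypothetical, made up so the snippet runs without foo.opml. Paths here still select elements rather than attributes, so the sketch filters on the attribute with a predicate and then reads it with .get().

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical OPML sample, inlined so the sketch runs without foo.opml.
SAMPLE_OPML = """<?xml version="1.0"?>
<opml version="1.1">
  <body>
    <outline text="News">
      <outline text="A" xmlUrl="http://example.com/a.xml"/>
      <outline text="B" xmlUrl="http://example.com/b.xml?x=1&amp;y=2"/>
    </outline>
  </body>
</opml>
"""


def feed_urls(source):
    """Return the xmlUrl of every outline element that has one.

    `source` is a filename or a file-like object, as ET.parse accepts.
    """
    tree = ET.parse(source)
    # [@xmlUrl] keeps only outlines carrying the attribute, which skips
    # grouping outlines such as "News" above.
    return [o.get("xmlUrl") for o in tree.iterfind(".//outline[@xmlUrl]")]


if __name__ == "__main__":
    for url in feed_urls(io.StringIO(SAMPLE_OPML)):
        print(url)
```

With a real file, the call would just be feed_urls("foo.opml").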

See also this followup.