The xml.etree.ElementTree.iterparse function

2015-09-10 09:37:42 / Category: python

Reposted from
http://effbot.org/zone/element-iterparse.htm


The ElementTree iterparse Function

The new iterparse interface allows you to track changes to the tree while it is being built. This interface was first added in the cElementTree library, and is also available in ElementTree 1.2.5 and later.

Recent versions of lxml.etree (dead link) also support this API.

Usage

To use iterparse, just call the function and iterate over the object it returns. The result is an iterable that returns a stream of (event, element) tuples.

for event, elem in iterparse(source):
    ... elem is complete; process it ...

for event, elem in iterparse(source, events=("start", "end")):
    if event == "start":
        ... elem was just added to the tree ...
    else:
        ... elem is complete; process it ...

The events option specifies which events you want to see (available events in this release are “start”, “end”, “start-ns”, and “end-ns”, where the “ns” events are used to get detailed namespace information). If the option is omitted, only “end” events are returned.

Note: The tree builder and the event generator are not necessarily synchronized; the latter usually lags behind a bit. This means that when you get a “start” event for an element, the builder may already have filled that element with content. You cannot rely on this, though — a “start” event can only be used to inspect the attributes, not the element content. For more details, see this message.
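
To make the event order concrete, here is a minimal Python 3 sketch that uses the standard library's xml.etree.ElementTree.iterparse rather than cElementTree. The sample document and the helper names trace_events and item_ids are made up for illustration and do not come from the original article.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical sample document, used only for illustration.
SAMPLE = b'<root><item id="1">one</item><item id="2">two</item></root>'


def trace_events(data, events=None):
    """Return the (event, tag) pairs in the order iterparse reports them."""
    source = io.BytesIO(data)
    return [(event, elem.tag) for event, elem in iterparse(source, events=events)]


def item_ids(data):
    """Collect the id attribute of each <item>, using "start" events only."""
    ids = []
    for _event, elem in iterparse(io.BytesIO(data), events=("start",)):
        if elem.tag == "item":
            # Attributes are available at "start"; text and children may not be.
            ids.append(elem.get("id"))
    return ids


if __name__ == "__main__":
    print(trace_events(SAMPLE))                     # events omitted: "end" only
    print(trace_events(SAMPLE, ("start", "end")))   # both event types
    print(item_ids(SAMPLE))
```

The second helper only touches attributes, which matches the note above: a "start" event should be used to inspect attributes, not element content.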

Incremental Parsing

Note that iterparse still builds a tree, just like parse, but you can safely rearrange or remove parts of the tree while parsing. For example, to parse large files, you can get rid of elements as soon as you’ve processed them:

for event, elem in iterparse(source):
    if elem.tag == "record":
        ... process record elements ...
        elem.clear()

The above pattern has one drawback; it does not clear the root element, so you will end up with a single element with lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events, and save a reference to the first element in a variable:

# get an iterable
context = iterparse(source, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element (the first event is the root's "start")
event, root = next(context)

for event, elem in context:
    if event == "end" and elem.tag == "record":
        ... process record elements ...
        root.clear()

(future releases will make it easier to access the root element from within the loop)
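
The root-clearing pattern can be tried on a small in-memory document. The following is a Python 3 sketch built on the standard library's iterparse; the function name count_records, the log/record/v element names, and the counting step (a stand-in for real per-record processing) are assumptions made for this example.

```python
import io
from xml.etree.ElementTree import iterparse


def count_records(source, tag="record"):
    """Process <record> elements one by one while keeping the tree small.

    Returns (number of records seen, children left under the root).
    """
    context = iter(iterparse(source, events=("start", "end")))
    _event, root = next(context)  # the first event is the root's "start"
    count = 0
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1      # stand-in for real per-record processing
            root.clear()    # drop everything collected under the root so far
    return count, len(root)


if __name__ == "__main__":
    # Hypothetical input: 1000 small records inside a <log> root.
    data = b"<log>" + b"<record><v>1</v></record>" * 1000 + b"</log>"
    print(count_records(io.BytesIO(data)))
```

Because root.clear() runs after every record, the root should end up with no children, whereas calling elem.clear() alone would leave one empty record element per record.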

Namespace Events

The namespace events contain information about namespace scopes in the source document. This can be used to keep track of active namespace prefixes, which are otherwise discarded by the parser. Here’s how you can emulate the namespaces attribute in the FancyTreeBuilder class:

events = ("end", "start-ns", "end-ns")

namespaces = []
for event, elem in iterparse(source, events=events):
    if event == "start-ns":
        namespaces.insert(0, elem)
    elif event == "end-ns":
        namespaces.pop(0)
    else:
        ...

The namespaces variable in this example will contain a stack of (prefix, uri) tuples.

(Note how iterparse lets you replace instance variables with local variables. The code is not only easier to write, it is also a lot more efficient.)

For better performance, you can append and remove items at the right end of the list instead, and loop backwards when looking for prefix mappings.
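
As a sketch of that right-end variant, the Python 3 example below appends and pops at the end of the list and scans it backwards to resolve a prefix. It uses the standard library's iterparse and listens for "start" events instead of "end" so that the lookup happens when each element opens. The sample document and the names lookup and prefix_bindings are hypothetical.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical sample: the prefix "a" is rebound inside <child>.
SAMPLE = (
    b'<root xmlns:a="urn:outer"><a:x/>'
    b'<child xmlns:a="urn:inner"><a:y/></child><a:z/></root>'
)


def lookup(namespaces, prefix):
    """Find the innermost URI bound to prefix by scanning the stack backwards."""
    for known_prefix, uri in reversed(namespaces):
        if known_prefix == prefix:
            return uri
    return None


def prefix_bindings(data, prefix):
    """Return (tag, URI bound to prefix) for each element, at its "start" event."""
    namespaces = []  # stack of (prefix, uri) tuples, innermost last
    seen = []
    events = ("start", "start-ns", "end-ns")
    for event, elem in iterparse(io.BytesIO(data), events=events):
        if event == "start-ns":
            namespaces.append(elem)   # elem is a (prefix, uri) tuple here
        elif event == "end-ns":
            namespaces.pop()          # elem is None here
        else:
            seen.append((elem.tag, lookup(namespaces, prefix)))
    return seen


if __name__ == "__main__":
    for tag, uri in prefix_bindings(SAMPLE, "a"):
        print(tag, uri)
```

Inside child the prefix resolves to the inner URI, and it falls back to the outer one once the matching "end-ns" event has popped the inner binding.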

Incremental Decoding

Here’s a rather efficient and almost complete XML-RPC decoder (just add fault handling). This implementation is 3 to 4 times faster than the 170-line version I wrote for Python’s xmlrpclib library…

 
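from cElementTree import iterparse
from cStringIO import StringIO
from base64 import decodestring
import datetime, time

def make_datetime(text):
    return datetime.datetime(
        *time.strptime(text, "%Y%m%dT%H:%M:%S")[:6]
    )

# each unmarshaller turns a completed element into a Python object
unmarshallers = {
    "int": lambda x: int(x.text),
    "i4": lambda x: int(x.text),
    "boolean": lambda x: x.text == "1",
    "string": lambda x: x.text or "",
    "double": lambda x: float(x.text),
    "dateTime.iso8601": lambda x: make_datetime(x.text),
    # an array wraps its values in a single data element
    "array": lambda x: [v.text for v in x[0]],
    # each struct member unpacks into its (name, value) children
    "struct": lambda x: dict((k.text or "", v.text) for k, v in x),
    "base64": lambda x: decodestring(x.text or ""),
    "value": lambda x: x[0].text,
}

def loads(data):
    params = method = None
    for action, elem in iterparse(StringIO(data)):
        unmarshal = unmarshallers.get(elem.tag)
        if unmarshal:
            # replace the element's content with the decoded object
            data = unmarshal(elem)
            elem.clear()
            elem.text = data
        elif elem.tag == "methodName":
            # methodName carries the name of the method being called
            method = elem.text
        elif elem.tag == "params":
            # each param element wraps exactly one value element
            params = tuple(v[0].text for v in elem)
    return params, method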

Note that the code uses the text attribute to temporarily hold unmarshalled Python objects. All standard ElementTree implementations support this, but some alternative implementations may not support non-text attribute values.
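
For readers on Python 3, the following is a trimmed sketch of the same technique using the standard library's xml.etree.ElementTree and io.BytesIO. It leaves out the dateTime.iso8601 and base64 types as well as fault handling, and it assumes the standard XML-RPC layout in which an array wraps its values in a data element and each param wraps one value. The REQUEST document is a made-up sample.

```python
import io
from xml.etree.ElementTree import iterparse

# Each unmarshaller turns a completed element into a Python object.
unmarshallers = {
    "int": lambda x: int(x.text),
    "i4": lambda x: int(x.text),
    "boolean": lambda x: x.text == "1",
    "string": lambda x: x.text or "",
    "double": lambda x: float(x.text),
    "array": lambda x: [v.text for v in x[0]],          # values live under <data>
    "struct": lambda x: {k.text or "": v.text for k, v in x},  # member -> (name, value)
    "value": lambda x: x[0].text,
}


def loads(data):
    """Decode an XML-RPC request given as bytes; returns (params, method)."""
    params = method = None
    for _event, elem in iterparse(io.BytesIO(data)):
        unmarshal = unmarshallers.get(elem.tag)
        if unmarshal:
            value = unmarshal(elem)
            elem.clear()
            elem.text = value   # park the decoded object in the text attribute
        elif elem.tag == "methodName":
            method = elem.text
        elif elem.tag == "params":
            params = tuple(param[0].text for param in elem)
    return params, method


# Hypothetical request, used only for illustration.
REQUEST = b"""<?xml version="1.0"?>
<methodCall>
  <methodName>examples.add</methodName>
  <params>
    <param><value><int>2</int></value></param>
    <param><value><double>1.5</double></value></param>
    <param><value><struct>
      <member><name>ok</name><value><boolean>1</boolean></value></member>
    </struct></value></param>
    <param><value><array><data>
      <value><string>a</string></value>
      <value><i4>7</i4></value>
    </data></array></value></param>
  </params>
</methodCall>"""

if __name__ == "__main__":
    print(loads(REQUEST))
```

Since every element is cleared as soon as it has been decoded, the tree never holds more than the decoded objects and the still-open ancestors.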

The same approach can be used to read Apple’s plist format:

 
try:
    import cElementTree as ET
except ImportError:
    import elementtree.ElementTree as ET

import base64, datetime, re

unmarshallers = {

    # collections
    "array": lambda x: [v.text for v in x],
    # a dict alternates key elements and value elements
    "dict": lambda x:
        dict((x[i].text, x[i+1].text) for i in range(0, len(x), 2)),
    "key": lambda x: x.text or "",

    # simple types
    "string": lambda x: x.text or "",
    "data": lambda x: base64.decodestring(x.text or ""),
    "date": lambda x: datetime.datetime(*map(int, re.findall(r"\d+", x.text))),
    "true": lambda x: True,
    "false": lambda x: False,
    "real": lambda x: float(x.text),
    "integer": lambda x: int(x.text),

}

def load(file):
    parser = ET.iterparse(file)
    for action, elem in parser:
        unmarshal = unmarshallers.get(elem.tag)
        if unmarshal:
            # replace the element's content with the decoded object
            data = unmarshal(elem)
            elem.clear()
            elem.text = data
        elif elem.tag != "plist":
            raise IOError("unknown plist type: %r" % elem.tag)
    # the decoded top-level object sits on the plist element's only child
    return parser.root[0].text
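
A reduced Python 3 sketch of the plist reader, again using the standard library's iterparse, might look as follows. It omits the data and date types, and instead of reading the root from the parser object it picks up the result when the plist element itself completes, which is a variation on the code above rather than the original approach. The SAMPLE document is hypothetical.

```python
import io
from xml.etree.ElementTree import iterparse

unmarshallers = {
    # collections
    "array": lambda x: [v.text for v in x],
    # a dict alternates key elements and value elements
    "dict": lambda x: {x[i].text: x[i + 1].text for i in range(0, len(x), 2)},
    "key": lambda x: x.text or "",
    # simple types
    "string": lambda x: x.text or "",
    "true": lambda x: True,
    "false": lambda x: False,
    "real": lambda x: float(x.text),
    "integer": lambda x: int(x.text),
}


def load(file):
    """Decode a plist from a binary file object into Python objects."""
    result = None
    for _event, elem in iterparse(file):
        unmarshal = unmarshallers.get(elem.tag)
        if unmarshal:
            value = unmarshal(elem)
            elem.clear()
            elem.text = value       # park the decoded object on the element
        elif elem.tag == "plist":
            result = elem[0].text   # the single top-level object
        else:
            raise IOError("unknown plist type: %r" % elem.tag)
    return result


# Hypothetical plist, used only for illustration.
SAMPLE = b"""<plist version="1.0">
<dict>
  <key>name</key><string>demo</string>
  <key>count</key><integer>3</integer>
  <key>ratio</key><real>0.5</real>
  <key>tags</key><array><string>a</string><true/></array>
</dict>
</plist>"""

if __name__ == "__main__":
    print(load(io.BytesIO(SAMPLE)))
```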

To round this off, here’s the obligatory RSS-reader-in-less-than-eight-lines example:

 
from urllib import urlopen
from cElementTree import iterparse

for event, elem in iterparse(urlopen("http://online.effbot.org/rss.xml")):
    if elem.tag == "item":
        print elem.findtext("link"), "-", elem.findtext("title")
        elem.clear() # won't need the children any more
