Parsing XML and HTML with lxml
lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).
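The difference between the two modes can be sketched as follows. This is a minimal illustration, assuming the parser's feed interface (`feed()` and `close()` on an `XMLParser`) as one form of step-by-step parsing; the sample document and the chunk boundaries are hypothetical.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
data = '<root><item>1</item><item>2</item></root>'

# One-step parsing: the whole document is handed over at once.
root_one_step = etree.fromstring(data)

# Step-by-step parsing: feed the parser arbitrary chunks of the document,
# then close it to obtain the root element.
parser = etree.XMLParser()
for chunk in (data[:10], data[10:25], data[25:]):
    parser.feed(chunk)
root_fed = parser.close()

# Both routes should yield the same tree.
print(etree.tostring(root_one_step) == etree.tostring(root_fed))
```

The chunk boundaries do not need to line up with tag boundaries; the parser buffers incomplete input until more data arrives.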
The usual setup procedure:
>>> from lxml import etree
The following examples also use StringIO or BytesIO to show how to parse from files and file-like objects. Both are available in the io module:
from io import StringIO, BytesIO
Parsers are represented by parser objects. There is support for parsing both XML and (broken) HTML. Note that XHTML is best parsed as XML; parsing it with the HTML parser can lead to unexpected results. Here is a simple example for parsing XML from an in-memory string:
>>> xml = '<a xmlns="test"><b xmlns="test"/></a>'
>>> root = etree.fromstring(xml)
>>> etree.tostring(root)
b'<a xmlns="test"><b xmlns="test"/></a>'
To read from a file or file-like object, you can use...
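For the file-like case, one option is `etree.parse()`, which accepts a file-like object and returns an ElementTree rather than an Element. The sketch below is an assumption about how that might look with the same sample document wrapped in a `BytesIO`; it is not necessarily the approach the text goes on to describe.

```python
from io import BytesIO
from lxml import etree

# The same sample document as above, wrapped in a file-like object.
file_like = BytesIO(b'<a xmlns="test"><b xmlns="test"/></a>')

# parse() reads from the file-like object and returns an ElementTree.
tree = etree.parse(file_like)

# The root Element is reached through getroot().
root = tree.getroot()
print(root.tag)
print(len(root))
```

Element tags carry their namespace in `{namespace}localname` form, so the root here is reported under the `test` namespace.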