lxml.html Documentation
Since version 2.0, lxml comes with a dedicated Python package for dealing with HTML: lxml.html. It is based on lxml's HTML parser, but provides a special Element API for HTML elements, as well as a number of utilities for common HTML processing tasks.
Parsing HTML fragments

There are several functions available to parse HTML:
parse(filename_url_or_file): Parses the named file or URL, or, if the object has a .read() method, parses from that.
If you give a URL, or if the object has a .geturl() method (as file-like objects from urllib.urlopen() have), then that URL is used as the base URL. You can also provide an explicit base_url keyword argument.
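As a minimal sketch of the behaviour described above, the following passes a file-like object together with an explicit base_url. The in-memory page and the example.com URL are placeholders chosen for illustration, not values from the lxml documentation.

```python
import io

from lxml import html

# Hypothetical in-memory page standing in for a real file or HTTP response.
page = io.BytesIO(
    b"<html><body><a href='guide.html'>Guide</a></body></html>"
)

# Any object with a .read() method is accepted. The explicit base_url
# records where the document nominally came from (an assumed URL here).
tree = html.parse(page, base_url="http://example.com/docs/")
root = tree.getroot()

print(root.tag)       # tag of the root element
print(root.base_url)  # base URL recorded at parse time
```

Note that parse() returns an element tree rather than an element, so getroot() is needed to reach the root element.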
document_fromstring(string): Parses a document from the given string. This always creates a correct HTML document, which means the parent node is <html>, and there is a body and possibly a head.
fragment_fromstring(string, create_parent=False): Returns an HTML fragment from a stri...
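The difference between the two string parsers can be illustrated with a short sketch. The snippet string is an arbitrary example, and only the default create_parent=False case is shown.

```python
from lxml import html

snippet = "<p>Hello</p>"  # arbitrary example markup

# document_fromstring() wraps the content in a full document,
# so the returned element is <html> and a <body> is present.
doc = html.document_fromstring(snippet)
print(doc.tag)
print(doc.body[0].tag)

# fragment_fromstring() with the default create_parent=False
# returns the single element itself.
frag = html.fragment_fromstring(snippet)
print(frag.tag)
```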