# Beautiful Soup Documentation

Source: https://beautiful-soup-4.readthedocs.io/en/latest/

[Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

These instructions illustrate all major features of Beautiful Soup 4, with examples. I show you what the library is good for, how it works, how to use it, how to make it do what you want, and what to do when it violates your expectations.

This document covers Beautiful Soup version 4.8.1. The examples in this documentation should work the same way in Python 2.7 and Python 3.2.

You might be looking for the documentation for [Beautiful Soup 3](http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html). If so, you should know that Beautiful Soup 3 is no longer being developed and that support for it will be dropped on or after December 31, 2020. If you want to learn about the differences between Beautiful Soup 3 and Beautiful Soup 4, see Porting code to BS4.

This documentation has been translated into other languages by Beautiful Soup users:

* [Chinese](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/)
* [Japanese (external link)](http://kondou.com/BS4/)
* [Korean](https://www.crummy.com/software/BeautifulSoup/bs4/doc.ko/)
* [Brazilian Portuguese](https://www.crummy.com/software/BeautifulSoup/bs4/doc.ptbr/)

## Getting help

If you have questions about Beautiful Soup, or run into problems, [send mail to the discussion group](https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup). If your problem involves parsing an HTML document, be sure to mention what the `diagnose()` function says about that document.

# Quick Start

Here's an HTML document I'll be using as an example throughout this document. It's part of a story from Alice in Wonderland:

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

Running the "three sisters" document through Beautiful Soup gives us a `BeautifulSoup` object, which represents the document as a nested data structure:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')

    print(soup.prettify())
    # <html>
    #  <head>
    #   <title>
    #    The Dormouse's story
    #   </title>
    #  </head>
    #  <body>
    #   <p class="title">
    #    <b>
    #     The Dormouse's story
    #    </b>
    #   </p>
    #   <p class="story">
    #    Once upon a time there were three little sisters; and their names were
    #    <a class="sister" href="http://example.com/elsie" id="link1">
    #     Elsie
    #    </a>
    #    ,
    #    <a class="sister" href="http://example.com/lacie" id="link2">
    #     Lacie
    #    </a>
    #    and
    #    <a class="sister" href="http://example.com/tillie" id="link3">
    #     Tillie
    #    </a>
    #    ; and they lived at the bottom of a well.
    #   </p>
    #   <p class="story">
    #    ...
    #   </p>
    #  </body>
    # </html>

Here are some simple ways to navigate that data structure:

    soup.title
    # <title>The Dormouse's story</title>

    soup.p['class']
    # u'title'

    soup.a
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    soup.find_all('a')
    # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    soup.find(id="link3")
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

One common task is extracting all the URLs found within a page's `<a>` tags:

    for link in soup.find_all('a'):
        print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie

Another common task is extracting all the text from a page:

    print(soup.get_text())
    # The Dormouse's story
    #
    # The Dormouse's story
    #
    # Once upon a time there were three little sisters; and their names were
    # Elsie,
    # Lacie and
    # Tillie;
    # and they lived at the bottom of a well.
    #
    # ...

Does this look like what you need? If so, read on.

# Installing Beautiful Soup

If you're using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:

    $ apt-get install python-bs4     (for Python 2)
    $ apt-get install python3-bs4    (for Python 3)

Beautiful Soup 4 is published through PyPI, so if you can't install it with the system packager, you can install it with `easy_install` or `pip`. The package name is `beautifulsoup4`, and the same package works on Python 2 and Python 3. Make sure you use the right version of `pip` or `easy_install` for your Python version (these may be named `pip3` and `easy_install3` respectively if you're using Python 3).

    $ easy_install beautifulsoup4
    $ pip install beautifulsoup4

(The `BeautifulSoup` package is probably not what you want. That's the previous major release, [Beautiful Soup 3](http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html). Lots of software uses BS3, so it's still available, but if you're writing new code you should install `beautifulsoup4`.)

If you don't have `easy_install` or `pip` installed, you can [download the Beautiful Soup 4 source tarball](http://www.crummy.com/software/BeautifulSoup/download/4.x/) and install it with `setup.py`.

    $ python setup.py install

If all else fails, the license for Beautiful Soup allows you to package the entire library with your application. You can download the tarball, copy its `bs4` directory into your application's codebase, and use Beautiful Soup without installing it at all.

I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it should work with other recent versions.

## Problems after installation

Beautiful Soup is packaged as Python 2 code. When you install it for use with Python 3, it's automatically converted to Python 3 code. If you don't install the package, the code won't be converted. There have also been reports on Windows machines of the wrong version being installed.

If you get the `ImportError` "No module named HTMLParser", your problem is that you're running the Python 2 version of the code under Python 3.

If you get the `ImportError` "No module named html.parser", your problem is that you're running the Python 3 version of the code under Python 2.

In both cases, your best bet is to completely remove the Beautiful Soup installation from your system (including any directory created when you unzipped the tarball) and try the installation again.

If you get the `SyntaxError` "Invalid syntax" on the line `ROOT_TAG_NAME = u'[document]'`, you need to convert the Python 2 code to Python 3. You can do this either by installing the package:

    $ python3 setup.py install

or by manually running Python's `2to3` conversion script on the `bs4` directory:

    $ 2to3-3.2 -w bs4

## Installing a parser

Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers. One is the [lxml parser](http://lxml.de/).
Depending on your setup, you might install lxml with one of these commands:

    $ apt-get install python-lxml
    $ easy_install lxml
    $ pip install lxml

Another alternative is the pure-Python [html5lib parser](http://code.google.com/p/html5lib/), which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:

    $ apt-get install python-html5lib
    $ easy_install html5lib
    $ pip install html5lib

This table summarizes the advantages and disadvantages of each parser library:

Parser | Typical usage | Advantages | Disadvantages
---|---|---|---
Python's html.parser | `BeautifulSoup(markup, "html.parser")` | Batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2) | Not as fast as lxml; less lenient than html5lib
lxml's HTML parser | `BeautifulSoup(markup, "lxml")` | Very fast; lenient | External C dependency
lxml's XML parser | `BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")` | Very fast; the only currently supported XML parser | External C dependency
html5lib | `BeautifulSoup(markup, "html5lib")` | Extremely lenient; parses pages the same way a web browser does; creates valid HTML5 | Very slow; external Python dependency

If you can, I recommend you install and use lxml for speed. If you're using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it's essential that you install lxml or html5lib, because Python's built-in HTML parser is just not very good in older versions.

Note that if a document is invalid, different parsers will generate different Beautiful Soup trees for it. See Differences between parsers for details.

# Making the soup

To parse a document, pass it into the `BeautifulSoup` constructor.
You can pass in a string or an open filehandle:

    from bs4 import BeautifulSoup

    with open("index.html") as fp:
        soup = BeautifulSoup(fp)

    soup = BeautifulSoup("<html>data</html>")

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

    BeautifulSoup("Sacr&eacute; bleu!")
    # <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. (See Parsing XML.)

# Kinds of objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you'll only ever have to deal with about four kinds of objects: `Tag`, `NavigableString`, `BeautifulSoup`, and `Comment`.

## `Tag`

A `Tag` object corresponds to an XML or HTML tag in the original document:

    soup = BeautifulSoup('<b id="boldest">Extremely bold</b>')
    tag = soup.b
    type(tag)
    # <class 'bs4.element.Tag'>

### Attributes

A tag may have any number of attributes. The tag `<b id="boldest">` has an attribute "id" whose value is "boldest". You can access a tag's attributes by treating the tag like a dictionary:

    tag['id']
    # u'boldest'

You can access that dictionary directly as `.attrs`:

    tag.attrs
    # {u'id': 'boldest'}

You can add, remove, and modify a tag's attributes. Again, this is done by treating the tag as a dictionary:

    tag['id'] = 'verybold'
    tag['another-attribute'] = 1
    tag
    # <b id="verybold" another-attribute="1">Extremely bold</b>

    del tag['id']
    del tag['another-attribute']
    tag
    # <b>Extremely bold</b>

    tag['id']
    # KeyError: 'id'
    print(tag.get('id'))
    # None

#### Multi-valued attributes

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is `class` (that is, a tag can have more than one CSS class). Others include `rel`, `rev`, `accept-charset`, `headers`, and `accesskey`. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

    css_soup = BeautifulSoup('<p class="body"></p>')
    css_soup.p['class']
    # ["body"]

    css_soup = BeautifulSoup('<p class="body strikeout"></p>')
    css_soup.p['class']
    # ["body", "strikeout"]

If an attribute looks like it has more than one value, but it's not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

    id_soup = BeautifulSoup('<p id="my id"></p>')
    id_soup.p['id']
    # 'my id'

When you turn a tag back into a string, multiple attribute values are consolidated:

    rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
    rel_soup.a['rel']
    # ['index']
    rel_soup.a['rel'] = ['index', 'contents']
    print(rel_soup.p)
    # <p>Back to the <a rel="index contents">homepage</a></p>

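The list-versus-string behavior described above can be checked with a short, self-contained sketch. It assumes Python 3 and names `html.parser` explicitly so that it does not depend on lxml or html5lib; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# One tag with a multi-valued attribute (class) and a single-valued one (id).
soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")

class_value = soup.p["class"]                 # multi-valued: presented as a list
id_value = soup.p["id"]                       # not multi-valued: left as one string
id_as_list = soup.p.get_attribute_list("id")  # always a list, whatever the attribute

# Assigning a list to a multi-valued attribute; values are joined on output.
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', "html.parser")
rel_soup.a["rel"] = ["index", "contents"]
serialized = str(rel_soup.p)

print(class_value, repr(id_value), id_as_list)
print(serialized)
```

Using `get_attribute_list()` is handy when code must treat every attribute uniformly and cannot know in advance whether a given attribute is multi-valued.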
You can disable this by passing `multi_valued_attributes=None` as a keyword argument into the `BeautifulSoup` constructor:

    no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html', multi_valued_attributes=None)
    no_list_soup.p['class']
    # u'body strikeout'

You can use `get_attribute_list` to get a value that's always a list, whether or not it's a multi-valued attribute:

    id_soup.p.get_attribute_list('id')
    # ["my id"]

If you parse a document as XML, there are no multi-valued attributes:

    xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
    xml_soup.p['class']
    # u'body strikeout'

Again, you can configure this using the `multi_valued_attributes` argument:

    class_is_multi = {'*': 'class'}
    xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
    xml_soup.p['class']
    # [u'body', u'strikeout']

You probably won't need to do this, but if you do, use the defaults as a guide. They implement the rules described in the HTML specification:

    from bs4.builder import builder_registry
    builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES

## `NavigableString`

A string corresponds to a bit of text within a tag. Beautiful Soup uses the `NavigableString` class to contain these bits of text:

    tag.string
    # u'Extremely bold'
    type(tag.string)
    # <class 'bs4.element.NavigableString'>

`NavigableString` supports most of the features described in Navigating the tree and Searching the tree, but not all of them. In particular, since a string can't contain anything (the way a tag may contain a string or another tag), strings don't support the `.contents` or `.string` attributes, or the `find()` method.

If you want to use a `NavigableString` outside of Beautiful Soup, you should call `unicode()` on it (`str()` in Python 3) to turn it into a normal Python Unicode string. If you don't, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you're done using Beautiful Soup. This is a big waste of memory.

## `BeautifulSoup`

The `BeautifulSoup` object represents the parsed document as a whole.
For most purposes, you can treat it as a `Tag` object. This means it supports most of the methods described in Navigating the tree and Searching the tree.

You can also pass a `BeautifulSoup` object into one of the methods defined in Modifying the tree, just as you would a `Tag`. This lets you do things like combine two parsed documents.

# Navigating the tree

Here's the "three sisters" HTML document again:

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')

I'll use this as an example to show you how to move from one part of a document to another.

## Going down

Tags may contain strings and other tags. These elements are the tag's children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag's children.

Note that Beautiful Soup strings don't support any of these attributes, because a string can't have children.

### Navigating using tag names

The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the `<head>` tag, just say `soup.head`:

    soup.head
    # <head><title>The Dormouse's story</title></head>

An HTML parser takes the string of characters that makes up a document and turns it into a series of events: "open an `<html>` tag", "open a `<head>` tag", "open a `<title>` tag", and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.

### `.next_element` and `.previous_element`

The `.next_element` attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as `.next_sibling`, but it's usually drastically different.

Here's the final `<a>` tag in the "three sisters" document. Its `.next_sibling` is a string: the conclusion of the sentence that was interrupted by the start of the `<a>` tag:

    last_a_tag = soup.find("a", id="link3")
    last_a_tag
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

    last_a_tag.next_sibling
    # '; and they lived at the bottom of a well.'

But the `.next_element` of that `<a>` tag, the thing that was parsed immediately after the `<a>` tag, is not the rest of that sentence: it's the word "Tillie":

    last_a_tag.next_element
    # u'Tillie'

That's because in the original markup, the word "Tillie" appeared before that semicolon. The parser encountered an `<a>` tag, then the word "Tillie", then the closing `</a>` tag, then the semicolon and rest of the sentence. The semicolon is on the same level as the `<a>` tag, but the word "Tillie" was encountered first.

The `.previous_element` attribute is the exact opposite of `.next_element`. It points to whatever element was parsed immediately before this one:

    last_a_tag.previous_element
    # u' and\n'
    last_a_tag.previous_element.next_element
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

### `.next_elements` and `.previous_elements`

You should get the idea by now. You can use these iterators to move forward or backward in the document as it was parsed:

    for element in last_a_tag.next_elements:
        print(repr(element))
    # u'Tillie'
    # u';\nand they lived at the bottom of a well.'
    # u'\n\n'
    # <p class="story">...</p>
    # u'...'
    # u'\n'
    # None

# Searching the tree

Beautiful Soup defines a lot of methods for searching the parse tree, but they're all very similar. I'm going to spend a lot of time explaining the two most popular methods: `find()` and `find_all()`. The other methods take almost exactly the same arguments, so I'll just cover them briefly.

Once again, I'll be using the "three sisters" document as an example:

    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_doc, 'html.parser')

By passing in a filter to a method like `find_all()`, you can zoom in on the parts of the document you're interested in.

## Kinds of filters

Before talking in detail about `find_all()` and similar methods, I want to show examples of different filters you can pass into these methods. These filters show up again and again, throughout the search API. You can use them to filter based on a tag's name, on its attributes, on the text of a string, or on some combination of these.

### A string

The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the `<b>` tags in the document:

    soup.find_all('b')
    # [<b>The Dormouse's story</b>]

If you pass in a byte string, Beautiful Soup will assume the string is encoded as UTF-8. You can avoid this by passing in a Unicode string instead.

### A regular expression

If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its `search()` method. This code finds all the tags whose names start with the letter "b"; in this case, the `<body>` tag and the `<b>` tag:

    import re
    for tag in soup.find_all(re.compile("^b")):
        print(tag.name)
    # body
    # b

This code finds all the tags whose names contain the letter "t":

    for tag in soup.find_all(re.compile("t")):
        print(tag.name)
    # html
    # title

### A list

If you pass in a list, Beautiful Soup will allow a string match against any item in that list. This code finds all the `<a>` tags and all the `<b>` tags:

    soup.find_all(["a", "b"])
    # [<b>The Dormouse's story</b>,
    #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

### `True`

The value `True` matches everything it can.
This code finds all the tags in the document, but none of the text strings:

    for tag in soup.find_all(True):
        print(tag.name)
    # html
    # head
    # title
    # body
    # p
    # b
    # p
    # a
    # a
    # a
    # p

### A function

If none of the other matches work for you, define a function that takes an element as its only argument. The function should return `True` if the argument matches, and `False` otherwise.

Here's a function that returns `True` if a tag defines the "class" attribute but doesn't define the "id" attribute:

    def has_class_but_no_id(tag):
        return tag.has_attr('class') and not tag.has_attr('id')

Pass this function into `find_all()` and you'll pick up all the `<p>` tags:

    soup.find_all(has_class_but_no_id)
    # [<p class="title"><b>The Dormouse's story</b></p>,
    #  <p class="story">Once upon a time there were...</p>,
    #  <p class="story">...</p>]

This function only picks up the `<p>` tags. It doesn't pick up the `<a>` tags, because those tags define both "class" and "id". It doesn't pick up tags like `<html>` and `<title>`, because those tags don't define "class".

## `find_all()`

Why does `find_all("p", "title")` find a `<p>` tag with the CSS class "title"? Let's look at the arguments to `find_all()`.
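As a runnable sketch of the function filter discussed above, and of the `find_all("p", "title")` shorthand for matching a CSS class, the following assumes Python 3 and the built-in `html.parser`. The `html_doc` literal is a reconstruction of the "three sisters" document, so treat its exact markup as an assumption.

```python
from bs4 import BeautifulSoup

# Assumed reconstruction of the "three sisters" example document.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")


def has_class_but_no_id(tag):
    """Match tags that define 'class' but do not define 'id'."""
    return tag.has_attr("class") and not tag.has_attr("id")


# A function filter is called once per tag; tags it returns True for are kept.
matched_names = [tag.name for tag in soup.find_all(has_class_but_no_id)]

# A string in the second position is treated as a CSS class to match.
title_paragraphs = soup.find_all("p", "title")

print(matched_names)
print(title_paragraphs)
```

The `<a>` tags are skipped by the function because they carry both attributes, and `<html>`, `<head>`, `<title>`, `<body>`, and `<b>` are skipped because they have no `class` at all.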
### The `name` argument

Pass in a value for `name` and you'll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names don't match.

This is the simplest usage:

    soup.find_all("title")
    # [<title>The Dormouse's story</title>]

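The `name` argument accepts any of the filter kinds covered earlier, not only a plain string. The sketch below, which assumes Python 3, the built-in `html.parser`, and a reconstructed copy of the "three sisters" document, runs the same search with a string, a regular expression, a list, and `True`.

```python
import re

from bs4 import BeautifulSoup

# Assumed reconstruction of the "three sisters" example document.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")


def names(result):
    """Reduce a list of tags to their tag names, in document order."""
    return [tag.name for tag in result]


by_string = names(soup.find_all("title"))          # exact tag name
by_regex = names(soup.find_all(re.compile("^b")))  # names starting with "b"
by_list = names(soup.find_all(["a", "b"]))         # any name in the list
by_true = names(soup.find_all(True))               # every tag, no text strings

print(by_string, by_regex, by_list, by_true, sep="\n")
```

Results always come back in document order, which is why the `<b>` tag appears before the three `<a>` tags in the list-based search.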
    # <p class="story">Once upon a time there were three little sisters; and their names were
    #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    #  and they lived at the bottom of a well.</p>

    a_string.find_parents("p", class_="title")
    # []

One of the three `<a>` tags is the direct parent of the string in question, so our search finds it. One of the three `<p>` tags is an indirect parent of the string, and our search finds that as well. There's a `<p>` tag with the CSS class "title" somewhere in the document, but it's not one of this string's parents, so we can't find it with `find_parents()`.

You may have made the connection between `find_parent()` and `find_parents()`, and the `.parent` and `.parents` attributes mentioned earlier. The connection is very strong. These search methods actually use `.parents` to iterate over all the parents, and check each one against the provided filter to see if it matches.

## `find_next_siblings()` and `find_next_sibling()`

Signature: find_next_siblings(name, attrs, string, limit, **kwargs)

Signature: find_next_sibling(name, attrs, string, **kwargs)

These methods use `.next_siblings` to iterate over the rest of an element's siblings in the tree. The `find_next_siblings()` method returns all the siblings that match, and `find_next_sibling()` only returns the first one:

    first_link = soup.a
    first_link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    first_link.find_next_siblings("a")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

    first_story_paragraph = soup.find("p", "story")
    first_story_paragraph.find_next_sibling("p")
    # <p class="story">...</p>

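To make the relationship with `.next_siblings` concrete, here is a minimal sketch that compares `find_next_siblings("a")` with a hand-written loop over `.next_siblings`. It assumes Python 3, the built-in `html.parser`, and a reconstructed fragment of the "three sisters" document; the manual loop only approximates what the library does internally.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed reconstruction of the story paragraph from the example document.
html_doc = """<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>"""

soup = BeautifulSoup(html_doc, "html.parser")
first_link = soup.a

# The search method: later siblings that are <a> tags.
via_method = first_link.find_next_siblings("a")

# The same idea by hand: walk .next_siblings and keep only <a> tags,
# skipping the text strings that sit between them.
via_iteration = [
    element
    for element in first_link.next_siblings
    if isinstance(element, Tag) and element.name == "a"
]

# The limit argument stops the search after that many matches.
only_one = first_link.find_next_siblings("a", limit=1)

print([a["id"] for a in via_method])
print([a["id"] for a in only_one])
```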
## `find_previous_siblings()` and `find_previous_sibling()`

Signature: find_previous_siblings(name, attrs, string, limit, **kwargs)

Signature: find_previous_sibling(name, attrs, string, **kwargs)

These methods use `.previous_siblings` to iterate over an element's siblings that precede it in the tree. The `find_previous_siblings()` method returns all the siblings that match, and `find_previous_sibling()` only returns the first one:

    last_link = soup.find("a", id="link3")
    last_link
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

    last_link.find_previous_siblings("a")
    # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
    #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

    first_story_paragraph = soup.find("p", "story")
    first_story_paragraph.find_previous_sibling("p")
    # <p class="title"><b>The Dormouse's story</b></p>

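Two details of the backward search are easy to miss: results are ordered nearest-first (the reverse of document order), and the singular method returns `None` when nothing matches. The sketch below illustrates both, assuming Python 3, the built-in `html.parser`, and a reconstructed copy of the example document.

```python
from bs4 import BeautifulSoup

# Assumed reconstruction of the "three sisters" example document.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Walking backwards from the last link: the closest sibling comes first.
last_link = soup.find("a", id="link3")
previous_ids = [a["id"] for a in last_link.find_previous_siblings("a")]

# The first link has no earlier <a> sibling, so the singular form gives None.
no_match = soup.a.find_previous_sibling("a")

# The first "story" paragraph is preceded by the "title" paragraph.
previous_paragraph = soup.find("p", "story").find_previous_sibling("p")

print(previous_ids, no_match, previous_paragraph["class"])
```

Checking for `None` before using the result of `find_previous_sibling()` avoids an `AttributeError` or `TypeError` when the starting element is the first of its kind.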
## `find_all_next()` and `find_next()`

Signature: find_all_next(name, attrs, string, limit, **kwargs)

Signature: find_next(name, attrs, string, **kwargs)

These methods use `.next_elements` to iterate over whatever tags and strings come after an element in the document. The `find_all_next()` method returns all matches, and `find_next()` only returns the first match:

    first_link = soup.a
    first_link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    first_link.find_all_next(string=True)
    # [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
    #  u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']

    first_link.find_next("p")
    # <p class="story">...</p>

In the first example, the string "Elsie" showed up, even though it was contained within the `<a>` tag we started from. In the second example, the last `<p>` tag in the document showed up, even though it's not in the same part of the tree as the `<a>` tag we started from. For these methods, all that matters is that an element match the filter, and show up later in the document than the starting element.

## `find_all_previous()` and `find_previous()`

Signature: find_all_previous(name, attrs, string, limit, **kwargs)

Signature: find_previous(name, attrs, string, **kwargs)

These methods use `.previous_elements` to iterate over the tags and strings that came before an element in the document. The `find_all_previous()` method returns all matches, and `find_previous()` only returns the first match:

    first_link = soup.a
    first_link
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    first_link.find_all_previous("p")
    # [<p class="story">Once upon a time there were three little sisters; ...</p>,
    #  <p class="title"><b>The Dormouse's story</b></p>]

    first_link.find_previous("title")
    # <title>The Dormouse's story</title>

The call to `find_all_previous("p")` also finds the second paragraph, the `<p>` tag that contains the `<a>` tag we started with. This shouldn't be too surprising: we're looking at all the tags that show up earlier in the document than the one we started with. A `<p>` tag that contains an `<a>` tag must have shown up before the `<a>` tag it contains.
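The point about containing tags can be verified directly: the first result of `find_all_previous("p")` is the very object that `.parent` returns. This is a small sketch assuming Python 3, the built-in `html.parser`, and a reconstructed copy of the example document.

```python
from bs4 import BeautifulSoup

# Assumed reconstruction of the "three sisters" example document.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")
first_link = soup.a

# Every <p> whose opening tag was parsed before the first link, nearest first.
paragraphs = first_link.find_all_previous("p")
classes = [p["class"] for p in paragraphs]

# The nearest one is the paragraph that contains the link itself.
contains_start = paragraphs[0] is first_link.parent

# find_previous() stops at the first match while walking backwards.
title = first_link.find_previous("title")

print(classes, contains_start, title.string)
```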
## CSS selectors

As of version 4.7.0, Beautiful Soup supports most CSS4 selectors via the
[SoupSieve](https://facelessuser.github.io/soupsieve/) project. If you
installed Beautiful Soup through `pip`, SoupSieve was installed at the same
time, so you don't have to do anything extra.
`BeautifulSoup` has a `.select()` method which uses SoupSieve to run a CSS
selector against a parsed document and return all the matching elements. `Tag`
has a similar method which runs a CSS selector against the contents of a
single tag.
(Earlier versions of Beautiful Soup also have the `.select()` method, but only
the most commonly-used CSS selectors are supported.)
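As a brief illustration of `.select()` on both the `BeautifulSoup` object and a single `Tag`, the sketch below runs a few common selectors. It assumes Python 3, Beautiful Soup 4.7.0 or later with SoupSieve installed, the built-in `html.parser`, and a reconstructed copy of the example document; the selectors chosen are only examples.

```python
from bs4 import BeautifulSoup

# Assumed reconstruction of the "three sisters" example document.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

sisters = soup.select("a.sister")                  # by tag name and CSS class
by_id = soup.select("#link2")                      # by id
second_link = soup.select("p > a:nth-of-type(2)")  # child combinator + position
first_match = soup.select_one("p.story a")         # first match only, or None

# Tag.select() searches only within that tag's contents.
story = soup.find("p", "story")
within_story = story.select("a[href$='tillie']")   # attribute ends with "tillie"

print(len(sisters), by_id[0].get_text(), second_link[0]["id"])
print(first_match["id"], within_story[0]["id"])
```

`select()` always returns a list, so an empty result is `[]`, while `select_one()` returns `None` when nothing matches.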
The SoupSieve [documentation](https://facelessuser.github.io/soupsieve/) lists
all the currently supported CSS selectors, but here are some of the basics:
You can find tags:

    soup.select("title")
    # [<title>The Dormouse's story</title>]

    del tag['class']
    del tag['id']
    tag
    # the "Extremely bold" tag, now without its class and id attributes

## Modifying `.string`

If you set a tag's `.string` attribute to a new string, the tag's contents are replaced with that string:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup)

    tag = soup.a
    tag.string = "New link text."
    tag
    # <a href="http://example.com/">New link text.</a>

Be careful: if the tag contained other tags, they and all their contents will be destroyed.

## `append()`

You can add to a tag's contents with `Tag.append()`. It works just like calling `.append()` on a Python list:

    soup = BeautifulSoup("<a>Foo</a>")
    soup.a.append("Bar")

    soup.a
    # <a>FooBar</a>
    soup.a.contents
    # [u'Foo', u'Bar']

## `extend()`

Starting in Beautiful Soup 4.7.0, `Tag` also supports a method called `.extend()`, which works just like calling `.extend()` on a Python list:

    soup = BeautifulSoup("<a>Soup</a>")
    soup.a.extend(["'s", " ", "on"])

    soup.a
    # <a>Soup's on</a>
    soup.a.contents
    # [u'Soup', u"'s", u' ', u'on']

## `NavigableString()` and `.new_tag()`

If you need to add a string to a document, no problem: you can pass a Python string in to `append()`, or you can call the `NavigableString` constructor:

    from bs4 import NavigableString

    soup = BeautifulSoup("<b></b>")
    tag = soup.b
    tag.append("Hello")
    new_string = NavigableString(" there")
    tag.append(new_string)
    tag
    # <b>Hello there</b>
    tag.contents
    # [u'Hello', u' there']

If you want to create a comment or some other subclass of `NavigableString`, just call the constructor:

    from bs4 import Comment
    new_comment = Comment("Nice to see you.")
    tag.append(new_comment)
    tag
    # <b>Hello there<!--Nice to see you.--></b>
    tag.contents
    # [u'Hello', u' there', u'Nice to see you.']

(This is a new feature in Beautiful Soup 4.4.0.)

What if you need to create a whole new tag? The best solution is to call the factory method `BeautifulSoup.new_tag()`:

    soup = BeautifulSoup("<b></b>")
    original_tag = soup.b

    new_tag = soup.new_tag("a", href="http://www.example.com")
    original_tag.append(new_tag)
    original_tag
    # <b><a href="http://www.example.com"></a></b>

    new_tag.string = "Link text."
    original_tag
    # <b><a href="http://www.example.com">Link text.</a></b>

Only the first argument, the tag name, is required.
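The pieces above combine naturally when building markup by hand. The following sketch appends a plain string, a `NavigableString`, a `Comment`, and a freshly created tag to the same parent. It assumes Python 3 and the built-in `html.parser`; the variable names are illustrative.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

soup = BeautifulSoup("<b></b>", "html.parser")
tag = soup.b

# Plain Python strings and NavigableString objects are both accepted.
tag.append("Hello")
tag.append(NavigableString(" there"))

# A Comment is a NavigableString subclass that is written out as <!--...-->.
tag.append(Comment("Nice to see you."))

# new_tag() is called on the BeautifulSoup object; keyword arguments
# become attributes of the new tag.
link = soup.new_tag("a", href="http://www.example.com")
link.string = "Link text."
tag.append(link)

result = str(tag)
print(result)
```

Creating the tag through `soup.new_tag()` rather than constructing `Tag` directly keeps the new element associated with the same parser configuration as the rest of the document.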
## `insert()`

`Tag.insert()` is just like `Tag.append()`, except the new element doesn't necessarily go at the end of its parent's `.contents`. It'll be inserted at whatever numeric position you say. It works just like `.insert()` on a Python list:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup)
    tag = soup.a

    tag.insert(1, "but did not endorse ")
    tag
    # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
    tag.contents
    # [u'I linked to ', u'but did not endorse ', <i>example.com</i>]

## `insert_before()` and `insert_after()`

The `insert_before()` method inserts tags or strings immediately before something else in the parse tree:

    soup = BeautifulSoup("<b>stop</b>")
    tag = soup.new_tag("i")
    tag.string = "Don't"
    soup.b.string.insert_before(tag)
    soup.b
    # <b><i>Don't</i>stop</b>

The `insert_after()` method inserts tags or strings immediately following something else in the parse tree:

    div = soup.new_tag('div')
    div.string = 'ever'
    soup.b.i.insert_after(" you ", div)
    soup.b
    # <b><i>Don't</i> you <div>ever</div>stop</b>

## `wrap()`

    soup = BeautifulSoup("<p>I wish I was bold.</p>")
    soup.p.string.wrap(soup.new_tag("b"))
    # <b>I wish I was bold.</b>

    soup.p.wrap(soup.new_tag("div"))
    # <div><p><b>I wish I was bold.</b></p></div>

## `smooth()`

    soup = BeautifulSoup("<p>A one</p>")
    soup.p.append(", a two")

    soup.p.contents
    # [u'A one', u', a two']

    print(soup.p.encode())
    # <p>A one, a two</p>

    print(soup.p.prettify())
    # <p>
    #  A one
    #  , a two
    # </p>

You can call `Tag.smooth()` to clean up the parse tree by consolidating adjacent strings:

    soup.smooth()

    soup.p.contents
    # [u'A one, a two']

    print(soup.p.prettify())
    # <p>
    #  A one, a two
    # </p>

The `smooth()` method is new in Beautiful Soup 4.8.0.

# Output

## Pretty-printing

The `prettify()` method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string:

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
    soup = BeautifulSoup(markup)
    soup.prettify()
    # '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'

    print(soup.prettify())
    # <html>
    #  <head>
    #  </head>
    #  <body>
    #   <a href="http://example.com/">
    #    I linked to
    #    <i>
    #     example.com
    #    </i>
    #   </a>
    #  </body>
    # </html>

You can call `prettify()` on the top-level `BeautifulSoup` object, or on any of its `Tag` objects:

    print(soup.a.prettify())
    # <a href="http://example.com/">
    #  I linked to
    #  <i>
    #   example.com
    #  </i>
    # </a>

## Non-pretty printing

If you just want a string, with no fancy formatting, you can call `unicode()` or `str()` on a `BeautifulSoup` object, or a `Tag` within it:

    str(soup)
    # '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'

    unicode(soup.a)
    # u'<a href="http://example.com/">I linked to <i>example.com</i></a>'

The `str()` function returns a string encoded in UTF-8. See Encodings for other options.

You can also call `encode()` to get a bytestring, and `decode()` to get Unicode.

## Output formatters

If you give Beautiful Soup a document that contains HTML entities like "&amp;ldquo;", they'll be converted to Unicode characters:

    soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
    unicode(soup)
    # u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>'

If you then convert the document to a string, the Unicode characters will be encoded as UTF-8. You won't get the HTML entities back:

    str(soup)
    # '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>'

By default, the only characters that are escaped upon output are bare ampersands and angle brackets. These get turned into "&amp;amp;", "&amp;lt;", and "&amp;gt;", so that Beautiful Soup doesn't inadvertently generate invalid HTML or XML:

    soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>")
    soup.p
    # <p>The law firm of Dewey, Cheatem, &amp; Howe</p>

    soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
    soup.a
    # <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>

You can change this behavior by providing a value for the `formatter` argument to `prettify()`, `encode()`, or `decode()`. Beautiful Soup recognizes five possible values for `formatter`.

The default is `formatter="minimal"`. Strings will only be processed enough to ensure that Beautiful Soup generates valid HTML/XML:

    french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
    soup = BeautifulSoup(french)
    print(soup.prettify(formatter="minimal"))
    # <html>
    #  <body>
    #   <p>
    #    Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
    #   </p>
    #  </body>
    # </html>

If you pass in `formatter="html"`, Beautiful Soup will convert Unicode characters to HTML entities whenever possible:

    print(soup.prettify(formatter="html"))
    # <html>
    #  <body>
    #   <p>
    #    Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
    #   </p>
    #  </body>
    # </html>

If you pass in `formatter="html5"`, it's the same as `formatter="html"`, but Beautiful Soup will omit the closing slash in HTML void tags like "br".
# Il a dit <
# IL A DIT <
tag. This parser also adds an empty `<head>` tag to the document.

Here's the same document parsed with Python's built-in HTML parser:

    BeautifulSoup("<a></p>", "html.parser")
    # <a></a>

Like lxml, this parser ignores the closing `</p>` tag. Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a `<body>` tag. Unlike lxml, it doesn't even bother to add an `<html>` tag.

Since the document "<a></p>" is invalid, none of these techniques is the "correct" way to handle it. The html5lib parser uses techniques that are part of the HTML5 standard, so it has the best claim on being the "correct" way, but all three techniques are legitimate.

Differences between parsers can affect your script. If you're planning on distributing your script to other people, or running it on multiple machines, you should specify a parser in the `BeautifulSoup` constructor. That will reduce the chances that your users parse a document differently from the way you parse it.

# Encodings

Any HTML or XML document is written in a specific encoding like ASCII or UTF-8. But when you load that document into Beautiful Soup, you'll discover it's been converted to Unicode:

    markup = "Sacr\xe9 bleu!
    '''
    soup = BeautifulSoup(markup)
    print(soup.prettify())
    # <html>
    #  <head>
    #   <meta content="text/html; charset=utf-8" http-equiv="Content-type" />
    #  </head>
    #  <body>
    #   <p>
    #    Sacré bleu!
    #   </p>
    #  </body>
    # </html>

Note that the `<meta>` tag has been rewritten to reflect the fact that the document is now in UTF-8.

If you don't want UTF-8, you can pass an encoding into `prettify()`:

    print(soup.prettify("latin-1"))
    # <html>
    #  <head>
    #   <meta content="text/html; charset=latin-1" http-equiv="Content-type" />
    # ...

You can also call `encode()` on the `BeautifulSoup` object, or any element in the soup, just as if it were a Python string:

    soup.p.encode("latin-1")
    # '<p>Sacr\xe9 bleu!</p>'

    soup.p.encode("utf-8")
    # '<p>Sacr\xc3\xa9 bleu!</p>'

Any characters that can't be represented in your chosen encoding will be converted into numeric XML entity references. Here's a document that includes the Unicode character SNOWMAN:

    markup = u"<b>\N{SNOWMAN}</b>"
    snowman_soup = BeautifulSoup(markup)
    tag = snowman_soup.b

The SNOWMAN character can be part of a UTF-8 document (it looks like ☃), but there's no representation for that character in ISO-Latin-1 or ASCII, so it's converted into "&#9731;" for those encodings:

    print(tag.encode("utf-8"))
    # <b>☃</b>

    print(tag.encode("latin-1"))
    # <b>&#9731;</b>

    print(tag.encode("ascii"))
    # <b>&#9731;</b>

## Unicode, Dammit

You can use Unicode, Dammit without using Beautiful Soup. It's useful whenever you have data in an unknown encoding and you just want it to become Unicode:

    from bs4 import UnicodeDammit
    dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!")
    print(dammit.unicode_markup)
    # Sacré bleu!
    dammit.original_encoding
    # 'utf-8'

Unicode, Dammit's guesses will get a lot more accurate if you install the `chardet` or `cchardet` Python libraries. The more data you give Unicode, Dammit, the more accurately it will guess. If you have your own suspicions as to what the encoding might be, you can pass them in as a list:

    dammit = UnicodeDammit(b"Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])
    print(dammit.unicode_markup)
    # Sacré bleu!
    dammit.original_encoding
    # 'latin-1'

Unicode, Dammit has two special features that Beautiful Soup doesn't use.

### Smart quotes

You can use Unicode, Dammit to convert Microsoft smart quotes to HTML or XML entities:

    markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup
    # u'<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>'

    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup
    # u'<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>'

You can also convert Microsoft smart quotes to ASCII quotes:

    UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup
    # u'<p>I just "love" Microsoft Word\'s smart quotes</p>'

Hopefully you'll find this feature useful, but Beautiful Soup doesn't use it. Beautiful Soup prefers the default behavior, which is to convert Microsoft smart quotes to Unicode characters along with everything else:

    UnicodeDammit(markup, ["windows-1252"]).unicode_markup
    # u'<p>I just \u201clove\u201d Microsoft Word\u2019s smart quotes</p>'

### Inconsistent encodings

Sometimes a document is mostly in UTF-8, but contains Windows-1252 characters such as (again) Microsoft smart quotes. This can happen when a website includes data from multiple sources. You can use `UnicodeDammit.detwingle()` to turn such a document into pure UTF-8. Here's a simple example:

    snowmen = (u"\N{SNOWMAN}" * 3)
    quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}")
    doc = snowmen.encode("utf8") + quote.encode("windows_1252")

This document is a mess. The snowmen are in UTF-8 and the quotes are in Windows-1252. You can display the snowmen or the quotes, but not both:

    print(doc)
    # ☃☃☃�I like snowmen!�

    print(doc.decode("windows-1252"))
    # â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”

Decoding the document as UTF-8 raises a `UnicodeDecodeError`, and decoding it as Windows-1252 gives you gibberish. Fortunately, `UnicodeDammit.detwingle()` will convert the string to pure UTF-8, allowing you to decode it to Unicode and display the snowmen and quote marks simultaneously:

    new_doc = UnicodeDammit.detwingle(doc)
    print(new_doc.decode("utf8"))
    # ☃☃☃“I like snowmen!”

`UnicodeDammit.detwingle()` only knows how to handle Windows-1252 embedded in UTF-8 (or vice versa, I suppose), but this is the most common case.

Note that you must know to call `UnicodeDammit.detwingle()` on your data before passing it into `BeautifulSoup` or the `UnicodeDammit` constructor. Beautiful Soup assumes that a document has a single encoding, whatever it might be. If you pass it a document that contains both UTF-8 and Windows-1252, it's likely to think the whole document is Windows-1252, and the document will come out looking like `â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”`.

`UnicodeDammit.detwingle()` is new in Beautiful Soup 4.1.0.

# Line numbers

The `html.parser` and `html5lib` parsers can keep track of where in the original document each `Tag` was found.
You can access this information as `Tag.sourceline` (line number) and `Tag.sourcepos` (position of the start tag within a line):

    markup = "<p\n>Paragraph 1</p>\n    <p>Paragraph 2</p>"
    soup = BeautifulSoup(markup, 'html.parser')
    for tag in soup.find_all('p'):
        print(tag.sourceline, tag.sourcepos, tag.string)
    # (1, 0, u'Paragraph 1')
    # (3, 4, u'Paragraph 2')

Note that the two parsers mean slightly different things by `sourceline` and `sourcepos`. For html.parser, these numbers represent the position of the initial less-than sign. For html5lib, these numbers represent the position of the final greater-than sign:

    soup = BeautifulSoup(markup, 'html5lib')
    for tag in soup.find_all('p'):
        print(tag.sourceline, tag.sourcepos, tag.string)
    # (2, 1, u'Paragraph 1')
    # (3, 7, u'Paragraph 2')

You can shut off this feature by passing `store_line_numbers=False` into the `BeautifulSoup` constructor:

    markup = "<p\n>Paragraph 1</p>\n    <p>Paragraph 2</p>"
    soup = BeautifulSoup(markup, 'html.parser', store_line_numbers=False)
    soup.p.sourceline
    # None

This feature is new in 4.8.1, and the parsers based on lxml don't support it.

# Comparing objects for equality

Beautiful Soup says that two `NavigableString` or `Tag` objects are equal when they represent the same HTML or XML markup. In this example, the two `<b>` tags are treated as equal, even though they live in different parts of the object tree, because they both look like "<b>pizza</b>":

    markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
    soup = BeautifulSoup(markup, 'html.parser')
    first_b, second_b = soup.find_all('b')
    print(first_b == second_b)
    # True

    print(first_b.previous_element == second_b.previous_element)
    # False

If you want to see whether two variables refer to exactly the same object, use `is`:

    print(first_b is second_b)
    # False

# Copying Beautiful Soup objects

You can use `copy.copy()` to create a copy of any `Tag` or `NavigableString`:

    import copy
    p_copy = copy.copy(soup.p)
    print(p_copy)
    # <p>I want <b>pizza</b> and more <b>pizza</b>!</p>
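One practical consequence of the markup-based equality rule described under "Comparing objects for equality" is that anything built on `==`, such as `list.index()`, cannot tell the two `<b>` tags apart. The following is a minimal sketch, assuming Python 3 and the `html.parser` backend, that contrasts `list.index()` on `.contents` with `Tag.index()`, which looks a child up by identity:

```python
from bs4 import BeautifulSoup

markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, "html.parser")
first_b, second_b = soup.find_all("b")

# list.index() compares with ==, so it stops at the first tag that is
# *equal* to second_b, which is first_b.
by_equality = soup.p.contents.index(second_b)

# Tag.index() compares with `is`, so it finds this exact object.
by_identity = soup.p.index(second_b)

print(by_equality, by_identity)
# 1 3
```

When the position of one specific element matters, the identity-based lookup is usually the one you want.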
The copy is considered equal to the original, since it represents the same markup as the original, but it's not the same object:

    print(soup.p == p_copy)
    # True

    print(soup.p is p_copy)
    # False

The only real difference is that the copy is completely detached from the original Beautiful Soup object tree, just as if `extract()` had been called on it:

    print(p_copy.parent)
    # None

This is because two different `Tag` objects can't occupy the same space at the same time.

# Parsing only part of a document

Let's say you want to use Beautiful Soup to look at a document's `<a>` tags. It's a waste of time and memory to parse the entire document and then go over it again looking for `<a>` tags. It would be much faster to ignore everything that wasn't an `<a>` tag in the first place. The `SoupStrainer` class allows you to choose which parts of an incoming document are parsed. You just create a `SoupStrainer` and pass it in to the `BeautifulSoup` constructor as the `parse_only` argument.

(Note that _this feature won't work if you're using the html5lib parser_. If you use html5lib, the whole document will be parsed, no matter what. This is because html5lib constantly rearranges the parse tree as it works, and if some part of the document didn't actually make it into the parse tree, it'll crash. To avoid confusion, in the examples below I'll be forcing Beautiful Soup to use Python's built-in parser.)

## `SoupStrainer`

The `SoupStrainer` class takes the same arguments as a typical method from Searching the tree: `name`, `attrs`, `string`, and `**kwargs`. Here are three `SoupStrainer` objects:

    from bs4 import SoupStrainer

    only_a_tags = SoupStrainer("a")

    only_tags_with_id_link2 = SoupStrainer(id="link2")

    def is_short_string(string):
        return len(string) < 10

    only_short_strings = SoupStrainer(string=is_short_string)

I'm going to bring back the "three sisters" document one more time, and we'll see what the document looks like when it's parsed with these three `SoupStrainer` objects:

    html_doc = """<html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify())
    # <a class="sister" href="http://example.com/elsie" id="link1">
    #  Elsie
    # </a>
    # <a class="sister" href="http://example.com/lacie" id="link2">
    #  Lacie
    # </a>
    # <a class="sister" href="http://example.com/tillie" id="link3">
    #  Tillie
    # </a>

    print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify())
    # <a class="sister" href="http://example.com/lacie" id="link2">
    #  Lacie
    # </a>

    print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify())
    # Elsie
    # ,
    # Lacie
    # and
    # Tillie
    # ...
    #

You can also pass a `SoupStrainer` into any of the methods covered in Searching the tree. This probably isn't terribly useful, but I thought I'd mention it:

    soup = BeautifulSoup(html_doc)
    soup.find_all(only_short_strings)
    # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
    #  u'\n\n', u'...', u'\n']

# Troubleshooting

## `diagnose()`

If you're having trouble understanding what Beautiful Soup does to a document, pass the document into the `diagnose()` function. (New in Beautiful Soup 4.2.0.) Beautiful Soup will print out a report showing you how different parsers handle the document, and tell you if you're missing a parser that Beautiful Soup could be using:

    from bs4.diagnose import diagnose
    with open("bad.html") as fp:
        data = fp.read()
    diagnose(data)

    # Diagnostic running on Beautiful Soup 4.2.0
    # Python version 2.7.3 (default, Aug 1 2012, 05:16:07)
    # I noticed that html5lib is not installed. Installing it may help.
    # Found lxml version 2.3.2.0
    #
    # Trying to parse your data with html.parser
    # Here's what html.parser did with the document:
    # ...

Just looking at the output of `diagnose()` may show you how to solve the problem. Even if not, you can paste the output of `diagnose()` when asking for help.

## Errors when parsing a document

There are two different kinds of parse errors. There are crashes, where you feed a document to Beautiful Soup and it raises an exception, usually an `HTMLParser.HTMLParseError`. And there is unexpected behavior, where a Beautiful Soup parse tree looks a lot different from the document used to create it.
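One way to tell the two kinds of problem apart is to run the same markup through each parser and compare what comes back. The sketch below assumes Python 3; `compare_parsers` is a hypothetical helper written for this illustration, not part of Beautiful Soup. For each parser it records either the number of matching tags found or the exception raised while building the soup (including the error Beautiful Soup raises when that parser is not installed):

```python
from bs4 import BeautifulSoup


def compare_parsers(markup, tag_name, parsers=("html.parser", "lxml", "html5lib")):
    """Hypothetical helper: map each parser name to the number of
    `tag_name` tags it found, or to the exception it raised."""
    report = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(markup, parser)
        except Exception as exc:
            # Either the parser crashed on this markup, or it is not
            # installed and Beautiful Soup refused to use it.
            report[parser] = exc
        else:
            report[parser] = len(soup.find_all(tag_name))
    return report


report = compare_parsers("<p>a</p><p>b</p>", "p")
for parser, outcome in report.items():
    print(parser, outcome)
```

An exception in the report points to a crash or a missing parser; differing counts between parsers point to the "unexpected behavior" case.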
Almost none of these problems turn out to be problems with Beautiful Soup. This is not because Beautiful Soup is an amazingly well-written piece of software. It's because Beautiful Soup doesn't include any parsing code. Instead, it relies on external parsers. If one parser isn't working on a certain document, the best solution is to try a different parser. See Installing a parser for details and a parser comparison.

The most common parse errors are `HTMLParser.HTMLParseError: malformed start tag` and `HTMLParser.HTMLParseError: bad end tag`. These are both generated by Python's built-in HTML parser library, and the solution is to install lxml or html5lib.

The most common type of unexpected behavior is that you can't find a tag that you know is in the document. You saw it going in, but `find_all()` returns `[]` or `find()` returns `None`. This is another common problem with Python's built-in HTML parser, which sometimes skips tags it doesn't understand. Again, the solution is to install lxml or html5lib.

## Version mismatch problems

* `SyntaxError: Invalid syntax` (on the line `ROOT_TAG_NAME = u'[document]'`): Caused by running the Python 2 version of Beautiful Soup under Python 3, without converting the code.
* `ImportError: No module named HTMLParser`: Caused by running the Python 2 version of Beautiful Soup under Python 3.
* `ImportError: No module named html.parser`: Caused by running the Python 3 version of Beautiful Soup under Python 2.
* `ImportError: No module named BeautifulSoup`: Caused by running Beautiful Soup 3 code on a system that doesn't have BS3 installed. Or, by writing Beautiful Soup 4 code without knowing that the package name has changed to `bs4`.
* `ImportError: No module named bs4`: Caused by running Beautiful Soup 4 code on a system that doesn't have BS4 installed.

## Parsing XML

By default, Beautiful Soup parses documents as HTML.
To parse a document as XML, pass in "xml" as the second argument to the `BeautifulSoup` constructor:

    soup = BeautifulSoup(markup, "xml")

You'll need to have lxml installed.

## Other parser problems

* If your script works on one computer but not another, or in one virtual environment but not another, or outside the virtual environment but not inside, it's probably because the two environments have different parser libraries available. For example, you may have developed the script on a computer that has lxml installed, and then tried to run it on a computer that only has html5lib installed. See Differences between parsers for why this matters, and fix the problem by mentioning a specific parser library in the `BeautifulSoup` constructor.
* Because [HTML tags and attributes are case-insensitive](http://www.w3.org/TR/html5/syntax.html#syntax), all three HTML parsers convert tag and attribute names to lowercase. That is, the markup