How to Parse HTML in Python

09/12/2021

In this article, you will learn how to parse HTML in Python.

Parsing HTML

There are several libraries available in Python to parse HTML, and each has its own strengths and weaknesses.

Here are some of the most popular options:

BeautifulSoup

BeautifulSoup is a library that makes it easy to extract information from web pages. It sits on top of an HTML or XML parser (such as html.parser or lxml) and provides Pythonic idioms for navigating, searching, and modifying the parse tree. You can install it using “pip install beautifulsoup4”.

lxml

lxml is a library for parsing HTML and XML documents, built on the C libraries libxml2 and libxslt. It is fast, which makes it well suited to large or complex documents, and it supports XPath queries. You can install it using “pip install lxml”.
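As a rough illustration of parsing with lxml directly, the sketch below loads a small document and reads values from it with an XPath query. It assumes lxml is installed; the sample markup and the variable names are made up for this example.

```python
from lxml import html

# Hypothetical sample markup, used only for illustration.
html_doc = """
<html>
  <head><title>My Page Title</title></head>
  <body>
    <h1>My Heading</h1>
    <p>My paragraph.</p>
  </body>
</html>
"""

# Parse the string into an element tree rooted at <html>.
tree = html.fromstring(html_doc)

# findtext() returns the text of the first matching element.
title = tree.findtext(".//title")

# xpath() returns a list; here, the text nodes of every <p> element.
paragraphs = tree.xpath("//p/text()")

print(title)
print(paragraphs)
```

XPath is handy when you need to select elements by position or attribute in a single expression instead of walking the tree by hand.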

html.parser

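The built-in html.parser module provides basic HTML parsing capabilities for those who don’t want to install any additional libraries. It exposes an event-driven HTMLParser class that you subclass to handle start tags, end tags, and text; it does not build a document tree for you.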
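To show what working with html.parser on its own can look like, here is a minimal sketch that subclasses HTMLParser and records the text found inside each tag. The TextCollector class and its text_by_tag attribute are hypothetical names chosen for this example, and the approach assumes well-nested markup without void tags such as <br>.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text by the name of the innermost open tag (illustrative only)."""

    def __init__(self):
        super().__init__()
        self._stack = []        # names of the tags currently open
        self.text_by_tag = {}   # tag name -> list of text snippets

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Assumes well-nested markup: only pop when the tag matches.
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            self.text_by_tag.setdefault(self._stack[-1], []).append(text)


parser = TextCollector()
parser.feed(
    "<html><head><title>My Page Title</title></head>"
    "<body><h1>My Heading</h1><p>My paragraph.</p></body></html>"
)
parser.close()

print(parser.text_by_tag["title"])
print(parser.text_by_tag["p"])
```

Because you write the handlers yourself, this style takes more code than BeautifulSoup for the same result, but it has no dependencies outside the standard library.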

Here’s an example using BeautifulSoup to parse an HTML document:

from bs4 import BeautifulSoup

html_doc = """
<html>
  <head>
    <title>My Page Title</title>
  </head>
  <body>
    <h1>My Heading</h1>
    <p>My paragraph.</p>
  </body>
</html>
"""

# The second argument selects the underlying parser; 'html.parser' is the built-in one.
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title.string)   # My Page Title
print(soup.body.p.string)  # My paragraph.

When choosing a library, consider the size and complexity of the HTML documents you’ll be working with, as well as whether you can install third-party packages. If you’re just starting out and don’t need advanced features, html.parser is a good choice because it requires no installation. If you’re working with larger or more complex documents, consider BeautifulSoup for its convenient navigation and search API, or lxml for its speed. If you need additional features like XPath, XSLT, or schema validation, choose lxml, since BeautifulSoup does not provide them.