Clean HTML Code with BeautifulSoup: Python Tutorial

Introduction

This Python script uses BeautifulSoup to clean HTML: it removes style and script elements and strips every attribute from the remaining tags. It also reads input that is not UTF-8 by trying cp1251 first and then latin-1. It is useful for web scraping, content processing, or simplifying messy markup.

The Script

Here is the complete script, including the encoding fallback:


from bs4 import BeautifulSoup
import os

# Define file paths dynamically
base_dir = os.path.dirname(__file__)
input_file = os.path.join(base_dir, 'article.html')
output_file = os.path.join(base_dir, 'cleaned_output.html')

def clean_html_table(html_content):
    """Clean HTML by removing styles, scripts, and attributes."""
    # Parse HTML with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove <style> and <script> elements along with their contents
    for tag in soup(['style', 'script']):
        tag.decompose()

    # Clear all attributes (e.g., inline styles) from tags
    for tag in soup.find_all(True):  # True finds all tags
        tag.attrs = {}

    # Return formatted, cleaned HTML
    return soup.prettify()

if __name__ == '__main__':
    # Read HTML file with encoding fallback
    try:
        with open(input_file, 'r', encoding='cp1251') as file:  # Try cp1251 first
            html_content = file.read()
    except UnicodeDecodeError:
        with open(input_file, 'r', encoding='latin-1') as file:  # Fallback to latin-1
            html_content = file.read()

    # Clean the HTML
    cleaned_html = clean_html_table(html_content)

    # Save to a new file
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(cleaned_html)

    print("HTML successfully cleaned and saved to 'cleaned_output.html'")

How It Works

Dynamic File Paths

The script uses os.path to set paths relative to its location:

base_dir = os.path.dirname(__file__)
input_file = os.path.join(base_dir, 'article.html')
output_file = os.path.join(base_dir, 'cleaned_output.html')

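Because the paths are built from the script's own location rather than the current working directory, the script finds article.html no matter where it is launched from.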
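The same idea can be expressed with pathlib. This is only a sketch: the helper name resolve_paths and the example path /srv/site/cleaner.py are hypothetical and not part of the original script, which would pass __file__ instead.

```python
from pathlib import Path

def resolve_paths(script_path, input_name="article.html",
                  output_name="cleaned_output.html"):
    """Build input and output paths next to the given script file.

    resolve_paths is a hypothetical helper for illustration only.
    """
    base_dir = Path(script_path).parent  # directory that contains the script
    return base_dir / input_name, base_dir / output_name

# In the real script you would pass __file__; a fixed path is used here
# so the example also runs in an interactive session.
input_file, output_file = resolve_paths("/srv/site/cleaner.py")
print(input_file)
print(output_file)
```

Taking the script path as an argument keeps the helper usable in places where __file__ is not defined, such as an interactive interpreter.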

HTML Cleaning Function

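The clean_html_table function:

Parses HTML with BeautifulSoup(html_content, 'html.parser').
Removes style and script elements, including their contents, with tag.decompose().
Clears all tag attributes (e.g., style, class) with tag.attrs = {}. This also removes functional attributes such as href and src, so links and images lose their targets.
Returns formatted HTML via soup.prettify().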
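If dropping every attribute is too aggressive for your content, one possible variation is to keep a small allowlist. The sketch below is not part of the original script: the function name clean_html_keep, the KEEP_ATTRS mapping, and the sample markup are all assumptions made for illustration. With an empty allowlist it strips everything, like clean_html_table.

```python
from bs4 import BeautifulSoup

# Hypothetical allowlist: attributes to keep, keyed by tag name.
KEEP_ATTRS = {"a": {"href"}, "img": {"src", "alt"}}

def clean_html_keep(html_content, keep=None):
    """Remove style/script elements and all attributes not in `keep`."""
    keep = keep or {}
    soup = BeautifulSoup(html_content, "html.parser")

    # Same first step as the original function: drop style and script.
    for tag in soup(["style", "script"]):
        tag.decompose()

    # Keep only allowlisted attributes; everything else is discarded.
    for tag in soup.find_all(True):
        allowed = keep.get(tag.name, set())
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in allowed}

    return str(soup)

sample = (
    '<p class="x" style="color:red">Hi '
    '<a href="/a" target="_blank">link</a></p>'
    '<script>alert(1)</script>'
)
print(clean_html_keep(sample, KEEP_ATTRS))  # link keeps its href
print(clean_html_keep(sample))              # every attribute removed
```

The sketch returns str(soup) rather than soup.prettify() to keep the output compact; either works for the cleaning itself.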

Encoding Handling

It attempts to read the input file with cp1251, falling back to latin-1 if needed:

try:
    with open(input_file, 'r', encoding='cp1251') as file:
        html_content = file.read()
except UnicodeDecodeError:
    with open(input_file, 'r', encoding='latin-1') as file:
        html_content = file.read()

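Output is always saved in utf-8 for consistency.

Note that cp1251 maps almost every byte value to a character, so the latin-1 fallback rarely triggers. A UTF-8 file will be read without an error but decoded incorrectly, so if your inputs may be UTF-8, try that encoding first.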
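A more general fallback can be sketched by reading the file as bytes and trying an ordered list of candidate encodings. The helper name decode_with_fallback and the choice to try utf-8 first are assumptions for illustration; the original script tries only cp1251 and latin-1.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1251", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes `raw`.

    decode_with_fallback is a hypothetical helper, not part of the script.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    # Unreachable while latin-1 is in the list, since it accepts any byte.
    raise ValueError("none of the candidate encodings could decode the input")

utf8_bytes = "Привет".encode("utf-8")
cp1251_bytes = "Привет".encode("cp1251")

print(decode_with_fallback(utf8_bytes))    # ('Привет', 'utf-8')
print(decode_with_fallback(cp1251_bytes))  # ('Привет', 'cp1251')
```

To use it with a file, read the raw bytes first (for example with open(input_file, 'rb')) and pass them to the helper. Strict encodings such as utf-8 should come before permissive ones, because a permissive codec never raises and would hide the correct match.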

Setup and Usage

Requirements: Install BeautifulSoup:

pip install beautifulsoup4

Usage:

Create an article.html file with your HTML content in the same directory as the script.
Run the script.
Check cleaned_output.html in that directory for the cleaned result.

The encoding fallback lets the script read files that are not UTF-8, such as cp1251-encoded pages, without manual conversion.

Conclusion

This BeautifulSoup script simplifies HTML cleanup in Python by removing clutter such as styles, scripts, and attributes. It reads cp1251 or latin-1 input and always writes UTF-8. It is a handy tool for developers, content creators, or anyone who needs to parse HTML.

joker
