Clean HTML Code with BeautifulSoup: Python Tutorial

23.10.2024
Introduction

This Python script uses BeautifulSoup to clean HTML: it removes style and script elements and strips all attributes from the remaining tags. Updated for 2025, it also includes an encoding fallback (cp1251, then latin-1) so that files saved in legacy encodings can still be read. It is useful for web scraping, content processing, or simplifying messy markup.


The Script

Here’s the complete script, including the encoding fallback:


from bs4 import BeautifulSoup
import os

# Define file paths dynamically
base_dir = os.path.dirname(__file__)
input_file = os.path.join(base_dir, 'article.html')
output_file = os.path.join(base_dir, 'cleaned_output.html')

def clean_html_table(html_content):
    """Clean HTML by removing styles, scripts, and attributes."""
    # Parse HTML with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove <style> and <script> elements together with their contents
    for tag in soup(['style', 'script']):
        tag.decompose()

    # Clear all attributes (e.g., inline styles) from tags
    for tag in soup.find_all(True):  # True finds all tags
        tag.attrs = {}

    # Return formatted, cleaned HTML
    return soup.prettify()

if __name__ == '__main__':
    # Read HTML file with encoding fallback
    try:
        with open(input_file, 'r', encoding='cp1251') as file:  # Try cp1251 first
            html_content = file.read()
    except UnicodeDecodeError:
        with open(input_file, 'r', encoding='latin-1') as file:  # Fallback to latin-1
            html_content = file.read()

    # Clean the HTML
    cleaned_html = clean_html_table(html_content)

    # Save to a new file
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(cleaned_html)

    print("HTML successfully cleaned and saved to 'cleaned_output.html'")

How It Works

Dynamic File Paths

The script uses os.path to set paths relative to its location:

base_dir = os.path.dirname(__file__)
input_file = os.path.join(base_dir, 'article.html')
output_file = os.path.join(base_dir, 'cleaned_output.html')

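Because the paths are built from the script's own location rather than from the current working directory, the script looks for article.html next to itself no matter where it is launched from, and os.path.join picks the correct path separator on each operating system.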
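
The same idea can be sketched with pathlib, which also makes the base directory absolute. This is only an illustration: script_paths is a hypothetical helper name and is not part of the original script.

```python
from pathlib import Path

def script_paths(script_file, input_name='article.html',
                 output_name='cleaned_output.html'):
    """Return (input_path, output_path) located next to the given script file."""
    # resolve() turns a possibly relative path into an absolute one, so the
    # result no longer depends on the current working directory
    base_dir = Path(script_file).resolve().parent
    return base_dir / input_name, base_dir / output_name
```

Calling script_paths(__file__) at the top of the script would stand in for the three os.path lines shown above.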

HTML Cleaning Function

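The clean_html_table function:

  • Parses HTML with BeautifulSoup(html_content, 'html.parser').
  • Removes every style and script element, including its contents, with tag.decompose().
  • Clears all tag attributes (e.g., style, class) with tag.attrs = {}. This also drops href and src, so links and images lose their targets.
  • Returns formatted HTML via soup.prettify().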
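
If links and images need to survive the cleanup, one possible variation is to keep a small allowlist of attributes instead of clearing everything. The sketch below is an assumption layered on top of the original function: the name clean_html_keep_links, the keep_attrs parameter, and the KEEP_ATTRS set are all hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical allowlist: attributes that carry content rather than styling
KEEP_ATTRS = {'href', 'src', 'alt'}

def clean_html_keep_links(html_content, keep_attrs=KEEP_ATTRS):
    """Like clean_html_table, but keeps the attributes named in keep_attrs."""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Drop <style> and <script> elements together with their contents
    for tag in soup(['style', 'script']):
        tag.decompose()

    # Keep only allowlisted attributes on every remaining tag
    for tag in soup.find_all(True):
        tag.attrs = {name: value for name, value in tag.attrs.items()
                     if name in keep_attrs}

    return soup.prettify()

sample = ('<p class="x" style="color:red">'
          '<a href="/a" class="btn">link</a></p>'
          '<script>alert(1)</script>')
print(clean_html_keep_links(sample))
```

With this sample, the class and style attributes and the script element are removed, while the link keeps its href.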

Encoding Handling

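The script first tries to read the input file as cp1251 (Windows Cyrillic). If that raises UnicodeDecodeError, it falls back to latin-1, which maps every possible byte to a character and therefore never fails: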

try:
    with open(input_file, 'r', encoding='cp1251') as file:
        html_content = file.read()
except UnicodeDecodeError:
    with open(input_file, 'r', encoding='latin-1') as file:
        html_content = file.read()

Output is always saved in utf-8 for consistency.
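
One limitation of this order is that UTF-8 is never tried: a UTF-8 file will usually decode as cp1251 without raising an error and come out as garbled characters. A minimal sketch of a more general reader, assuming UTF-8 should be attempted first, is shown below. The helper name read_html and its encodings parameter are hypothetical, not part of the original script.

```python
def read_html(path, encodings=('utf-8', 'cp1251', 'latin-1')):
    """Return (text, encoding) using the first encoding that decodes cleanly."""
    # Read the raw bytes once, then try each candidate encoding in order
    with open(path, 'rb') as file:
        raw = file.read()

    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue

    # Only reachable if the candidate list omits a catch-all such as latin-1
    raise ValueError(f'Could not decode {path!r} with any of {encodings}')
```

UTF-8 goes first because it is strict: text in a single-byte encoding rarely happens to be valid UTF-8, so a successful decode is a strong hint that the guess is right. Returning the encoding alongside the text makes it easy to log which fallback was actually used.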


Setup and Usage

Requirements: Install BeautifulSoup:

pip install beautifulsoup4

Usage:

  1. Create an article.html file with your HTML content.
  2. Run the script in its directory.
  3. Check cleaned_output.html for the cleaned result.
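
For step 3, the check can also be done programmatically. The following is a rough sketch rather than part of the original script: find_leftovers is a hypothetical helper that reports any style or script elements and any attributes still present in a piece of HTML.

```python
from bs4 import BeautifulSoup

def find_leftovers(html):
    """Return a list of problems: leftover style/script tags or attributes."""
    soup = BeautifulSoup(html, 'html.parser')
    problems = []
    for tag in soup.find_all(True):
        if tag.name in ('style', 'script'):
            problems.append(f'<{tag.name}> element still present')
        if tag.attrs:
            problems.append(
                f'<{tag.name}> still has attributes: {sorted(tag.attrs)}')
    return problems
```

An empty list means the file matches what clean_html_table is expected to produce.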

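Because latin-1 accepts any byte sequence, the read step never fails on an unexpected encoding. Note, however, that a file saved in an encoding other than cp1251 may be decoded to the wrong characters.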


Conclusion

This BeautifulSoup script simplifies HTML cleanup in Python by removing style and script elements and stripping tag attributes, and its encoding fallback lets it read files saved in legacy encodings. It is a handy starting point for developers, content creators, or anyone who needs to process HTML.
