Clean HTML Code with BeautifulSoup: Python Tutorial
Introduction
This Python script uses BeautifulSoup to clean HTML by removing style and script elements and stripping all tag attributes. The 2025 update adds an encoding fallback (cp1251, then latin-1) so the script can read HTML files that are not saved as UTF-8. It is useful for web scraping, content processing, or simplifying messy markup.
The Script
Here’s the complete script, including the encoding handling:
from bs4 import BeautifulSoup
import os

# Define file paths dynamically
base_dir = os.path.dirname(__file__)
input_file = os.path.join(base_dir, 'article.html')
output_file = os.path.join(base_dir, 'cleaned_output.html')

def clean_html_table(html_content):
    """Clean HTML by removing styles, scripts, and attributes."""
    # Parse HTML with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')
    for tag in soup(['style', 'script']):
        tag.decompose()
    # Clear all attributes (e.g., inline styles) from tags
    for tag in soup.find_all(True):  # True finds all tags
        tag.attrs = {}
    # Return formatted, cleaned HTML
    return soup.prettify()

if __name__ == '__main__':
    # Read HTML file with encoding fallback
    try:
        with open(input_file, 'r', encoding='cp1251') as file:  # Try cp1251 first
            html_content = file.read()
    except UnicodeDecodeError:
        with open(input_file, 'r', encoding='latin-1') as file:  # Fallback to latin-1
            html_content = file.read()
    # Clean the HTML
    cleaned_html = clean_html_table(html_content)
    # Save to a new file
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(cleaned_html)
    print("HTML successfully cleaned and saved to 'cleaned_output.html'")
How It Works
Dynamic File Paths
The script uses os.path to set paths relative to its location:
base_dir = os.path.dirname(__file__)
input_file = os.path.join(base_dir, 'article.html')
output_file = os.path.join(base_dir, 'cleaned_output.html')
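Because both paths are built from the script's own directory, the script finds article.html next to itself regardless of the directory it is launched from.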
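The same idea can be sketched with pathlib, which some readers find easier to follow. This is only an illustrative variant: the helper name sibling_paths and the example script path are hypothetical and do not appear in the original script.

```python
from pathlib import Path

def sibling_paths(script_path, input_name="article.html",
                  output_name="cleaned_output.html"):
    """Return (input, output) paths located next to the given script file."""
    # resolve() turns the script path into an absolute path first,
    # so the result does not depend on the current working directory.
    base_dir = Path(script_path).resolve().parent
    return base_dir / input_name, base_dir / output_name

# Hypothetical script location, used only for demonstration.
# Inside a real script you would pass __file__ instead.
input_file, output_file = sibling_paths("/some/project/clean.py")
print(input_file)
print(output_file)
```

Both returned values are Path objects, which open() accepts directly.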
HTML Cleaning Function
The clean_html_table function:
- Parses HTML with BeautifulSoup(html_content, 'html.parser').
- Removes style and script elements, including their contents, with tag.decompose().
- Clears all tag attributes (e.g., style, class) with tag.attrs = {}.
- Returns formatted HTML via soup.prettify().
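Clearing every attribute also drops href and src, so links and images lose their targets. If that is not what you want, one possible variation is to keep a small allowlist of attributes. The sketch below is an assumption layered on top of the tutorial's function: the name clean_html_keep_links, the KEEP_ATTRS set, and the sample markup are all hypothetical. It returns str(soup) rather than soup.prettify() so that the output stays on one line.

```python
from bs4 import BeautifulSoup

# Hypothetical allowlist; not part of the original script.
KEEP_ATTRS = {"href", "src"}

def clean_html_keep_links(html_content, keep=KEEP_ATTRS):
    """Remove style/script elements and every attribute not listed in keep."""
    soup = BeautifulSoup(html_content, "html.parser")
    # Drop <style> and <script> elements together with their contents
    for tag in soup(["style", "script"]):
        tag.decompose()
    # Keep only the allowlisted attributes on each remaining tag
    for tag in soup.find_all(True):
        tag.attrs = {name: value for name, value in tag.attrs.items()
                     if name in keep}
    return str(soup)

# Made-up sample markup for demonstration
sample = (
    '<div class="box" style="color:red">'
    '<a href="https://example.com" class="btn">Link</a>'
    '<script>alert(1)</script>'
    '</div>'
)
cleaned = clean_html_keep_links(sample)
print(cleaned)
```

Passing keep=set() reproduces the original behavior of removing all attributes.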
Encoding Handling
It attempts to read the input file with cp1251, falling back to latin-1 if needed:
try:
    with open(input_file, 'r', encoding='cp1251') as file:
        html_content = file.read()
except UnicodeDecodeError:
    with open(input_file, 'r', encoding='latin-1') as file:
        html_content = file.read()
Output is always saved in utf-8 for consistency.
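The try/except pattern above can be generalized into a loop over candidate encodings. The following is a minimal sketch, not the tutorial's code: the function name read_text_with_fallback is hypothetical, and it tries utf-8 before cp1251 (a different order from the script) on the assumption that strict UTF-8 decoding will reject most cp1251 text and let the loop move on.

```python
import os
import tempfile

def read_text_with_fallback(path, encodings=("utf-8", "cp1251", "latin-1")):
    """Return (text, encoding) using the first encoding that decodes the file."""
    # Read raw bytes once, then try each candidate encoding in order
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Only reachable if latin-1 (which accepts any byte) is not in the list
    raise ValueError(f"none of {encodings} could decode {path}")

# Demonstration with a temporary file saved as cp1251
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "article.html")
    with open(path, "wb") as f:
        f.write("<p>Привет</p>".encode("cp1251"))
    text, used = read_text_with_fallback(path)

print(used, text)
```

Because latin-1 maps every byte to a character, it works as a last resort that never raises, but it may produce wrong characters if the file was actually written in another encoding.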
Setup and Usage
Requirements: Install BeautifulSoup:
pip install beautifulsoup4
Usage:
- Place an article.html file with your HTML content in the same directory as the script.
- Run the script.
- Check cleaned_output.html in that directory for the cleaned result.
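The encoding fallback lets the script open files that are not valid cp1251. Keep in mind that cp1251 accepts almost any byte sequence, so a UTF-8 file will usually be read without an error but with garbled non-ASCII characters; if your input is UTF-8, change the first encoding accordingly.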
Conclusion
This BeautifulSoup script cleans HTML by removing styles, scripts, and tag attributes, and its encoding fallback helps with files saved in legacy encodings. It is a handy starting point for developers, content creators, or anyone who needs to tidy HTML before further processing.
