BeautifulSoup – cleaning styles, scripts and attributes out of HTML
Today’s code is for anyone who wants to quickly strip unnecessary styles, scripts and tag attributes out of HTML. We will do it with a popular Python library, BeautifulSoup. Here is the full code:
from bs4 import BeautifulSoup
def clean_html_table(html_content):
    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')
    # Delete all <style> and <script> tags
    for tag in soup(['style', 'script']):
        tag.decompose()
    # Remove all attributes (e.g. inline styles) from tags
    for tag in soup.find_all(True):  # True matches every tag
        tag.attrs = {}  # Clear all attributes
    # Return the cleaned HTML
    cleaned_html = soup.prettify()
    return cleaned_html
# Example of use
if __name__ == '__main__':
    # Read the HTML file
    with open('input.html', 'r', encoding='utf-8') as file:
        html_content = file.read()
    # Clean the tables and the rest of the HTML
    cleaned_html = clean_html_table(html_content)
    # Save the result to a new file
    with open('cleaned_output.html', 'w', encoding='utf-8') as file:
        file.write(cleaned_html)
    print("The HTML has been successfully cleaned up and saved to 'cleaned_output.html'")
Let’s break it down in more detail:
First we import the main class with “from bs4 import BeautifulSoup”, then define a function, “def clean_html_table(html_content)”, which takes the HTML content as a string argument.
Next, we parse the HTML using BeautifulSoup:
    soup = BeautifulSoup(html_content, 'html.parser')  # i.e. create an object with the desired content
BeautifulSoup lets us search the parsed document by tag name, so we first delete every <style> and <script> tag together with its contents, and then clear the attributes of every remaining tag:
    # Delete all <style> and <script> tags
    for tag in soup(['style', 'script']):
        tag.decompose()
    # Remove all attributes (e.g. inline styles) from tags
    for tag in soup.find_all(True):  # True matches every tag
        tag.attrs = {}  # Clear all attributes
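To see what these two loops do, here is a minimal sketch that runs them on a short inline string instead of a file. The `sample` markup is made up purely for illustration, and `str(soup)` is used instead of `prettify()` so the result stays on one line.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input, used only for illustration
sample = (
    '<div class="box" style="color:red">'
    '<style>p { margin: 0; }</style>'
    '<p id="intro">Hi</p>'
    '<script>alert(1)</script>'
    '</div>'
)

soup = BeautifulSoup(sample, 'html.parser')

# Calling soup(...) is shorthand for soup.find_all(...)
for tag in soup(['style', 'script']):
    tag.decompose()  # removes the tag and everything inside it

for tag in soup.find_all(True):
    tag.attrs = {}  # drops class, style, id and any other attribute

result = str(soup)
print(result)
```

Only the bare `<div>` and `<p>` tags and the text between them should remain.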
Now we simply return the cleaned HTML as a formatted string, ready to be written to a file:
    # Return the cleaned HTML
    cleaned_html = soup.prettify()
    return cleaned_html
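One thing worth knowing about `prettify()` is that it puts every tag and every piece of text on its own line, so it adds whitespace that was not in the source. The short sketch below compares it with `str(soup)`, which serializes the same tree without that reformatting. The input string is an assumption made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line input
soup = BeautifulSoup('<p>Hello <b>world</b></p>', 'html.parser')

pretty = soup.prettify()  # indented, one tag or string per line
compact = str(soup)       # same tree, no added line breaks

print(pretty)
print(compact)
```

If the exact whitespace matters (for example inside inline text), returning `str(soup)` from the function may be the safer option. For reading the result by eye, `prettify()` is more convenient.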
Finally, here is how to use the function in practice (before running it, prepare a file named “input.html” with the original content):
# Example of use
if __name__ == '__main__':
    # Read the HTML file
    with open('input.html', 'r', encoding='utf-8') as file:
        html_content = file.read()
    # Clean the tables and the rest of the HTML
    cleaned_html = clean_html_table(html_content)
    # Save the result to a new file
    with open('cleaned_output.html', 'w', encoding='utf-8') as file:
        file.write(cleaned_html)
    print("The HTML has been successfully cleaned up and saved to 'cleaned_output.html'")
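Keep in mind that clearing `tag.attrs` removes every attribute, including `href` on links, `src` on images and `colspan` on table cells. If some of them should survive, one possible variation is sketched below. It is not part of the original script: the function name `clean_html_keep_attrs` and the `KEEP_ATTRS` set are hypothetical and only illustrate the idea.

```python
from bs4 import BeautifulSoup

# Hypothetical allowlist of attributes to keep; adjust it to your needs
KEEP_ATTRS = {'href', 'src', 'alt', 'colspan', 'rowspan'}


def clean_html_keep_attrs(html_content, keep=KEEP_ATTRS):
    """Variation of clean_html_table that preserves selected attributes."""
    soup = BeautifulSoup(html_content, 'html.parser')
    # Delete all <style> and <script> tags, as in the original function
    for tag in soup(['style', 'script']):
        tag.decompose()
    # Keep only the attributes whose names are in the allowlist
    for tag in soup.find_all(True):
        tag.attrs = {name: value for name, value in tag.attrs.items()
                     if name in keep}
    return str(soup)


cleaned = clean_html_keep_attrs(
    '<a href="/page" class="btn" onclick="go()">Link</a>'
)
print(cleaned)
```

In this sketch the link keeps its `href`, while `class` and `onclick` are dropped.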
Who is this simple Python script for? First of all, for those who write articles or process them in editors, and also for those who pull external HTML from websites when scraping data.