
BeautifulSoup – clean html tags from styles and scripts

23.10.2024

Today’s code is for anyone who wants to quickly strip unnecessary styles, scripts, and tag attributes from HTML. We will do it with a popular Python library, BeautifulSoup. Here is the code itself:

from bs4 import BeautifulSoup

def clean_html_table(html_content):
    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove all <style> and <script> tags together with their contents
    for tag in soup(['style', 'script']):
        tag.decompose()

    # Remove all attributes (e.g. inline styles) from the remaining tags
    for tag in soup.find_all(True):  # True matches every tag
        tag.attrs = {}  # Clear all attributes

    # Return the cleaned HTML
    cleaned_html = soup.prettify()
    return cleaned_html

# Example of use
if __name__ == '__main__':
    # Read the HTML file
    with open('input.html', 'r', encoding='utf-8') as file:
        html_content = file.read()

    # Clean the tables and the rest of the HTML code
    cleaned_html = clean_html_table(html_content)

    # Save the result to a new file
    with open('cleaned_output.html', 'w', encoding='utf-8') as file:
        file.write(cleaned_html)

    print("The HTML has been successfully cleaned up and saved to 'cleaned_output.html'")

Let’s break it down in more detail:

First we import the main class with “from bs4 import BeautifulSoup”, then define a function, “def clean_html_table(html_content)”, which takes the HTML content as a string argument.

Next, we parse the HTML using BeautifulSoup:

    soup = BeautifulSoup(html_content, 'html.parser')  # i.e. create an object from the parsed content

BeautifulSoup lets us search the parsed document by tag name, so we first remove every <style> and <script> element together with its contents, and then clear the attributes of every tag that is left:


    # Remove all <style> and <script> tags together with their contents
    for tag in soup(['style', 'script']):
        tag.decompose()

    # Remove all attributes (e.g. inline styles) from the remaining tags
    for tag in soup.find_all(True):  # True matches every tag
        tag.attrs = {}  # Clear all attributes
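
To see what these two loops do, here is a small self-contained sketch on a made-up snippet. The sample markup and the names “sample” and “result” are only for illustration and are not part of the script above:

```python
from bs4 import BeautifulSoup

# Made-up input: inline styles, a <style> block, a <script> block and event handlers
sample = (
    '<div class="box" style="color:red">'
    '<style>.box { margin: 0 }</style>'
    '<script>alert("hi")</script>'
    '<p id="intro" onclick="track()">Hello</p>'
    '</div>'
)

soup = BeautifulSoup(sample, 'html.parser')

# Step 1: drop <style> and <script> elements, including their contents
for tag in soup(['style', 'script']):
    tag.decompose()

# Step 2: wipe the attributes of every tag that is left
for tag in soup.find_all(True):
    tag.attrs = {}

result = str(soup)
print(result)  # <div><p>Hello</p></div>
```

Only the bare tags and the text survive: the class, style, id and onclick attributes are gone, and so are the two removed elements.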

Now we simply return the finished text so that it can be written to a file later:


    # Return the cleaned HTML
    cleaned_html = soup.prettify()
    return cleaned_html
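
A side note on prettify(): it puts each tag and text node on its own line and indents them, which is convenient for reading but adds whitespace that was not in the source. If the output should stay as close to the original layout as possible, converting the soup with str() is a possible alternative. A minimal sketch for comparison (the sample string and the names “compact” and “pretty” are made up for this example):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <b>world</b></p>', 'html.parser')

# str() serializes the tree without adding any extra whitespace
compact = str(soup)

# prettify() adds newlines and indentation around tags and text
pretty = soup.prettify()

print(compact)
print(pretty)
```

Which one to return depends on the goal: prettify() for a file meant to be read by a person, str() when the extra line breaks could affect how inline text is rendered.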

Finally, here is how the function is used in practice (before running it, prepare a file named “input.html” with the original content):


# Example of use
if __name__ == '__main__':
    # Read the HTML file
    with open('input.html', 'r', encoding='utf-8') as file:
        html_content = file.read()

    # Clean the tables and the rest of the HTML code
    cleaned_html = clean_html_table(html_content)

    # Save the result to a new file
    with open('cleaned_output.html', 'w', encoding='utf-8') as file:
        file.write(cleaned_html)

    print("The HTML has been successfully cleaned up and saved to 'cleaned_output.html'")

Who can use this simple Python script? First of all, those who write articles or process them in editors, as well as those who pull external markup from websites when parsing data.
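
One thing to keep in mind for article content: clearing every attribute also removes href and src, so links and images stop working. If that matters, a possible variant keeps a small allowlist of attributes. This is only a sketch. The function name “clean_html_keep_links” and the set “KEEP_ATTRS” are hypothetical and not part of the script above:

```python
from bs4 import BeautifulSoup

# Hypothetical allowlist: adjust it to whatever the target editor needs
KEEP_ATTRS = {'href', 'src', 'alt'}

def clean_html_keep_links(html_content, keep=KEEP_ATTRS):
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove <style> and <script> elements together with their contents
    for tag in soup(['style', 'script']):
        tag.decompose()

    # Keep only the allowlisted attributes on every remaining tag
    for tag in soup.find_all(True):
        tag.attrs = {name: value for name, value in tag.attrs.items() if name in keep}

    return str(soup)

example = clean_html_keep_links('<a href="/about" class="btn" style="color:red">About</a>')
print(example)  # <a href="/about">About</a>
```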
