BeautifulSoup – clean html tags from styles and scripts

Today's code is for anyone who wants to quickly strip unnecessary styles, scripts and tag attributes out of HTML. We will do it with a widely used Python library, BeautifulSoup. Here is the code itself:

from bs4 import BeautifulSoup


def clean_html_table(html_content):
    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove all <style> and <script> tags together with their contents
    for tag in soup(['style', 'script']):
        tag.decompose()

    # Remove all attributes (e.g. inline styles) from the remaining tags
    for tag in soup.find_all(True):  # True matches every tag
        tag.attrs = {}  # Clear all attributes

    # Return the cleaned HTML
    cleaned_html = soup.prettify()
    return cleaned_html


# Example of use
if __name__ == '__main__':
    # Read the source HTML file
    with open('input.html', 'r', encoding='utf-8') as file:
        html_content = file.read()

    # Strip styles, scripts and attributes
    cleaned_html = clean_html_table(html_content)

    # Save the result to a new file
    with open('cleaned_output.html', 'w', encoding='utf-8') as file:
        file.write(cleaned_html)

    print("The HTML has been successfully cleaned up and saved to 'cleaned_output.html'")

Let’s break it down in more detail:

First we import the main class with “from bs4 import BeautifulSoup”, then define a function, “def clean_html_table(html_content)”, whose html_content parameter receives the HTML to clean as a string.

Next, we parse the HTML using BeautifulSoup:

    soup = BeautifulSoup(html_content, 'html.parser')  # i.e. create an object holding the parsed content

BeautifulSoup can search the parsed tree by tag name, so we first remove every <style> and <script> element and then clear the attributes of every tag that remains.

    # Remove all <style> and <script> tags together with their contents
    for tag in soup(['style', 'script']):
        tag.decompose()
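To see what this step does on its own, here is a minimal sketch run against a small in-memory string. The sample markup and the variable names are made up for illustration; they are not part of the script above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this demonstration
sample = "<div><style>p {color: red;}</style><p>Hello</p><script>alert(1)</script></div>"
soup = BeautifulSoup(sample, "html.parser")

# Calling soup([...]) is shorthand for soup.find_all([...])
for tag in soup(["style", "script"]):
    tag.decompose()  # removes the tag and everything inside it

result = str(soup)
print(result)  # <div><p>Hello</p></div>
```

Note that decompose() destroys the tag for good. If the removed element is still needed afterwards, extract() removes it from the tree and returns it instead.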

    # Remove all attributes (e.g. inline styles) from the remaining tags
    for tag in soup.find_all(True):  # True matches every tag
        tag.attrs = {}  # Clear all attributes
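Clearing every attribute also drops useful ones such as href on links and src on images. If that is not what you want, one possible variation is to keep a small allowlist. The sketch below assumes a hypothetical helper strip_attributes and a hypothetical KEEP_ATTRS set, neither of which is part of the original script.

```python
from bs4 import BeautifulSoup

# Hypothetical allowlist of attributes to preserve; adjust to your needs
KEEP_ATTRS = {"href", "src", "alt"}


def strip_attributes(html_content, keep=KEEP_ATTRS):
    """Remove every attribute whose name is not in `keep`."""
    soup = BeautifulSoup(html_content, "html.parser")
    for tag in soup.find_all(True):
        # Build a new dict containing only the allowed attributes
        tag.attrs = {name: value for name, value in tag.attrs.items() if name in keep}
    return str(soup)


cleaned = strip_attributes('<a href="/docs" class="btn" style="color:red">Docs</a>')
print(cleaned)  # <a href="/docs">Docs</a>
```

Passing an empty set as keep gives the same behaviour as the original tag.attrs = {} line.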

Now let’s just return the finished text for later writing to a file:

    # Return the cleaned HTML
    cleaned_html = soup.prettify()
    return cleaned_html
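prettify() re-indents the markup and puts tags on separate lines, which is convenient for reading but adds whitespace. If a compact result is preferable, converting the soup with str() is an alternative. A small comparison, using a made-up sample string:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this comparison
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

compact = str(soup)       # serialized without extra whitespace
pretty = soup.prettify()  # indented, spread over several lines

print(compact)  # <p>Hello <b>world</b></p>
print(pretty)
```

Which form to return depends on where the cleaned HTML goes next: prettify() suits manual inspection, while str() keeps the text closer to the original layout.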

Finally, here is the code that applies the function in practice (before running it, prepare a file named “input.html” with the original content):

# Example of use
if __name__ == '__main__':
    # Read the source HTML file
    with open('input.html', 'r', encoding='utf-8') as file:
        html_content = file.read()

    # Strip styles, scripts and attributes
    cleaned_html = clean_html_table(html_content)

    # Save the result to a new file
    with open('cleaned_output.html', 'w', encoding='utf-8') as file:
        file.write(cleaned_html)

    print("The HTML has been successfully cleaned up and saved to 'cleaned_output.html'")
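The example above handles a single file. If you have a whole folder of pages, the same function can be applied in a loop. The following is only a sketch: clean_folder and its two directory parameters are hypothetical names introduced here, and clean_html_table is repeated so the block runs on its own.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def clean_html_table(html_content):
    # Same cleaning steps as in the script above
    soup = BeautifulSoup(html_content, "html.parser")
    for tag in soup(["style", "script"]):
        tag.decompose()
    for tag in soup.find_all(True):
        tag.attrs = {}
    return soup.prettify()


def clean_folder(source_dir, target_dir):
    """Clean every .html file in source_dir and write the results to target_dir."""
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(source.glob("*.html")):
        cleaned = clean_html_table(path.read_text(encoding="utf-8"))
        out_path = target / path.name  # keep the original file name
        out_path.write_text(cleaned, encoding="utf-8")
        written.append(out_path)
    return written
```

A call such as clean_folder('pages', 'pages_cleaned') would then return the list of files it wrote; both folder names here are placeholders.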

Who can use this simple Python script? First of all, those who write articles or process them in editors, as well as those who pull external markup from websites when parsing data.


joker
