Python: Stripping Content from HTML Files

Utilizing Python to Strip Valuable Data From HTML

I created the Nano Blog, and then later built Meganano from scratch with just HTML, CSS. Whilst it wasn’t bad, it wasn’t great either. So, I decided to migrate my work over to WordPress. I needed to convert everything I had previously written, back to plain text content. Whilst slowly stripping HTML tags from my content I realized I was wasting a serious amount of time not utilizing Python.

Program Purpose:

The Python program is designed for stripping HTML tags from web pages that were stored as HTML files. The primary goal was to extract valuable content from these HTML files for migration to my new WordPress website.

Program Overview:

The Python program processes HTML text and performs the following steps:

  • Reads HTML: It reads the HTML content from the old website, which is saved to a text file on local system.

  • HTML Tag Removal: The program searches for and removes HTML tags from the HTML content stored in a variable. HTML tags are identified based on their format, typically enclosed in angle brackets, such as <p>, <div>, <a>, etc.

  • Tag Removal Logic: The program uses a simple string processing approach; all HTML tags are stored in a variable and the program scans the file for these tags. It replaces the removed content with a single blank space.

  • Content Extraction: After removing the HTML tags, the program will leave behind the only the valuable content.

  • Storage or Transfer: The valuable content extracted from the HTML files is then to a text file, for easy migration to your new WordPress website.

HTML Code for old Website

As you can see in the image below, there is a lot of HTML code, and I had hundreds of webpages I needed to separate the valuable content from the HTML content.

HTML code for my old website

The Code:

The code is pretty basic and maybe you need to add more tags than what I needed, but I think I covered most the common HTML tags, the special CSS styling within the HTML wasn’t accounted for but still it saved me a huge amount of time.

#Add your html code to a file named strip.txt and let python do the rest
#You will need to set up a location within the f variable to grab and save files

f = open('/Users/PC/Desktop/strip.txt','r') #Add your location here
a = ['<!DOCTYPE html>','<html>','</html>','<head>','</head>','<meta name=','<link rel=','<title>','</title>','<body>','</body>','<p>','</p>','<div>','</div>','<div class="','<span class="','<a href="','">','</a>','<span>','</span>','<h1>','</h1>','<h2>','</h2>','<h3>','</h3>','<h4>','</h4>','<h5>','</h5>','<b>','</b>','<u>','</u>','<i>','</i>','<ol>','</ol>','<ul>','</ul>','<li>','</li>','<table>','</table>','<th>','</th>','<td>','</td>','<tr>','</tr>','<pre>','</pre>','<code>','</code>','<br>','<hr>','<style>','</style>','<strong>','</strong>','    ','     ','         ','             ','                 ']
lst = []
for line in f:
    for word in a:
        if word in line:
            line = line.replace(word,'')
    lst.append(line)
f.close()
f = open('/Users/PC/Desktop/clean.txt','w') #Add your location here
for line in lst:
    f.write(line)

Conclusion

The Python program serves a valuable purpose by enabling the extraction of essential content from HTML files, a critical step in the process of migrating from my old website to a new WordPress site. By manually removing HTML tags and preserving valuable data, the program streamlines the transition, making it easier to preserve the core content while stripping away the complexities of the old website’s structure.

Overall, the program is a practical and resourceful approach to content migration, simplifying the task of transitioning to a new web platform.

That’s All Folks!

Find more of our Python guides here: Python Guides

Recommendation:

Big Book of Small Python Programs: 81 Easy Practice Programs: https://amzn.to/3rGZjCR

Luke Barber

Hello, fellow tech enthusiasts! I'm Luke, a passionate learner and explorer in the vast realms of technology. Welcome to my digital space where I share the insights and adventures gained from my journey into the fascinating worlds of Arduino, Python, Linux, Ethical Hacking, and beyond. Armed with qualifications including CompTIA A+, Sec+, Cisco CCNA, Unix/Linux and Bash Shell Scripting, JavaScript Application Programming, Python Programming and Ethical Hacking, I thrive in the ever-evolving landscape of coding, computers, and networks. As a tech enthusiast, I'm on a mission to simplify the complexities of technology through my blogs, offering a glimpse into the marvels of Arduino, Python, Linux, and Ethical Hacking techniques. Whether you're a fellow coder or a curious mind, I invite you to join me on this journey of continuous learning and discovery.

Leave a Reply

Your email address will not be published. Required fields are marked *

Verified by MonsterInsights