How to convert HTML code to Text with Python (solved)

There are several tools for converting HTML to text (html2text) in Python. My favorite is Beautiful Soup. On Linux systems, you can install the Beautiful Soup module and the html5lib parser with the following two commands:

pip install beautifulsoup4
pip install html5lib

Then, in your source code, you can call Beautiful Soup functions to turn HTML into plain text (removing all the HTML tags and keeping the text between them):

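from bs4 import BeautifulSoup

html_doc = """Some HTML code that you want to convert"""
# Name the parser explicitly; otherwise Beautiful Soup picks one and warns.
soup = BeautifulSoup(html_doc, "html5lib")
# get_text() returns the text of the document with all tags removed.
print(soup.get_text())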

Of course, you need to import the module before you can call its functions. That is the role of the line "from bs4 import BeautifulSoup".
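
In practice, a real page also contains scripts and style sheets, and the words often end up glued together once the tags are gone. Here is a minimal sketch of how I would handle that, assuming the built-in "html.parser" backend is good enough for your pages. The function name html_to_text and the sample string are my own inventions for illustration. Depending on your Beautiful Soup version, get_text() may or may not include script and style content, so the sketch removes those tags explicitly.

```python
from bs4 import BeautifulSoup


def html_to_text(html_doc: str) -> str:
    """Return the visible text of html_doc (hypothetical helper)."""
    soup = BeautifulSoup(html_doc, "html.parser")
    # Drop <script> and <style> elements so their content never reaches the output.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join the remaining text fragments with a space and trim each fragment.
    return soup.get_text(separator=" ", strip=True)


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p>"
    "<script>alert(1)</script></body></html>"
)
print(html_to_text(sample))
```

The separator argument keeps "Title" and "Hello" from running together, and strip=True removes the stray whitespace around each fragment.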


Alternatives to BeautifulSoup to implement HTML2text

stripogram might be an alternative to Beautiful Soup, but I must say that I am fully satisfied with Beautiful Soup.

from stripogram import html2text, html2safehtml

original_html = """Some HTML code that you want to convert"""
# Only allow <b>, <a>, <i>, <br>, and <p> tags
clean_html = html2safehtml(original_html, valid_tags=("b", "a", "i", "br", "p"))
# Don't process <img> tags, just strip them out. Use an indent of 4 spaces
# and a page that's 80 characters wide.
text = html2text(original_html, ignore_tags=("img",), indent_width=4, page_width=80)
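
If you would rather not install anything, the standard library's html.parser module can serve as another alternative. The following is only a rough sketch under my own assumptions: the class name TextExtractor and the function strip_tags are hypothetical, and the parser is far less forgiving of broken markup than Beautiful Soup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style elements.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def strip_tags(html_doc):
    parser = TextExtractor()
    parser.feed(html_doc)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(strip_tags("<p>Hello <b>world</b> &amp; friends</p><script>var x = 1;</script>"))
```

Entities such as &amp; are decoded by the parser itself because convert_charrefs is enabled.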


 
 
 

If you want to improve your coding skills, I advise you to look at “Cracking the Coding Interview: 150 Programming Questions and Solutions”. It was written by Gayle Laakmann McDowell, a former recruiter from Google who also worked at Apple, and I find it really great!

  
