Python 2.7 Tutorial Pt 14

In the previous tutorial I showed you how to grab the following from any site's RSS feed using Python:

  • Title of Articles
  • All the Content from the Original Article
  • Link to the Original Article

This is known as website scraping, and it is a core technique behind many automated web applications. One thing I didn't cover is how to strip the HTML tags from those articles, so I'll do that in this video.

I'll also show you how to delete HTML tags using regular expressions. My original regular expressions tutorial is here: REGEX Tutorial.

As always, a lot of code follows the video. If you have any questions or comments, leave them below. And if you missed my other Python tutorials, they are available here:

All of the Code from the Video

#! /usr/bin/python

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

def cleanHtml(i):
    i = str(i)  # Convert the Beautiful Soup Tag to a string
    bS = BeautifulSoup(i)  # Pass the string to Beautiful Soup to strip out html

    # Find all of the text between paragraph tags and strip out the html
    i = bS.find('p').getText()

    # Strip ampersand codes and WATCH:
    i = re.sub(r'&\w+;', '', i)
    i = re.sub('WATCH:', '', i)
    return i

def cleanHtmlRegex(i):
    i = str(i)
    regexPatClean = re.compile(r'<[^<]*?/?>')
    i = regexPatClean.sub('', i)
    # Strip ampersand codes and WATCH:
    i = re.sub(r'&\w+;', '', i)
    return re.sub('WATCH:', '', i)

# Copy all of the content from the provided web page
webpage = urlopen('').read()

# Grab everything that lies between the title tags using a REGEX
titleString = '<title>(.*)</title>'
patFinderTitle = re.compile(titleString)

# Grab the link to the original article using a REGEX
origArticleLink = '<link rel.*href="(.*)" />'
patFinderLink = re.compile(origArticleLink)

# Store all of the titles and links found in 2 lists
findPatTitle = re.findall(patFinderTitle,webpage)
findPatLink = re.findall(patFinderLink,webpage)

# Create a list of indices that skips the first two entries and covers articles 2 through 15
listIterator = []
listIterator[:] = range(2,16)

# Print out the results to screen
for i in listIterator:
    print findPatTitle[i]  # The title
    print findPatLink[i]  # The link to the original article

    articlePage = urlopen(findPatLink[i]).read()  # Grab all of the content from the original article

    divBegin = articlePage.find('<div>')  # Locate the div provided
    article = articlePage[divBegin:(divBegin + 1000)]  # Copy the first 1000 characters after the div

    # Pass the article to the Beautiful Soup Module
    soup = BeautifulSoup(article)

    # Tell Beautiful Soup to locate all of the p tags and store them in a list
    paragList = soup.findAll('p')

    # Print all of the paragraphs to screen
    for p in paragList:
        # p = cleanHtml(p)
        p = cleanHtmlRegex(p)
        print p

    print "\n"
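If you want to see what cleanHtmlRegex does without fetching a live page, here is a minimal, self-contained sketch. The sample HTML string is made up for illustration; the regex is the same one used above, and the code runs unchanged in Python 2 and 3:

```python
import re

def cleanHtmlRegex(i):
    i = str(i)
    regexPatClean = re.compile(r'<[^<]*?/?>')  # matches any opening or closing tag
    i = regexPatClean.sub('', i)
    # Strip ampersand codes and WATCH:
    i = re.sub(r'&\w+;', '', i)
    return re.sub('WATCH:', '', i)

sample = '<p>WATCH: Hello &amp; welcome to <b>Python</b></p>'
print(cleanHtmlRegex(sample))  # ->  Hello  welcome to Python
```

Note that the stripped entities and the WATCH: label leave extra spaces behind; a final .strip() or a whitespace-collapsing re.sub would tidy that up.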

12 Responses to “Python 2.7 Tutorial Pt 14”

  1. Elliott says:

    Great Tutorials. I have a request for a hybrid tutorial/topic of website scraping, which you've covered a fair bit, combined with the SQLite database storage that you touched on, all within Python. The twist to make it interesting is that I'd like to scrape and store hierarchical data which, from what I've gathered, can be a difficult task using only SQLite database storage (which is my goal for compatibility, if I can develop such a technique; SQLite is so portable). The way I understand it, SQL in general doesn't like dynamically created tables (due to SQL injection risk). That was my first idea on the subject, simply storing each new child of the hierarchical tree in its own dynamically made table as it was discovered via scraping, but that didn't work out (haha). So I am left with working around the flat table model or the "Modified Preorder Tree Traversal" algorithm for dealing with tree-type data, which is a little overwhelming for what I need given the small-ish scale. Unless there is another way I haven't heard of yet? (very possible)

    The example is like this: if I were to scrape the following data, how do I store it in such a way that I can retrieve its order without a ton of recursion, because that isn't Python's strong suit:

    Book 1 (level 1 of Hierarchy)
    Chapter 1 (level 2)
    Content 1 (3)
    Sub-Content 1 (4)
    Sub-Content 2 (4)
    Content 2 (3)
    Chapter 2 (2)
    Content 1 (3)
    Book 2 (1)
    Chapter 1 (2)

    etc, only with more data… that example outline covers the idea well. In practice it will be more like "grab the page header tabs on a website, follow the link, scrape some of the new page's content, follow another link, scrape some more content, and done with that branch; on to the next page header tab to do the same thing /loop and store", but baby steps first. I just need to learn about storing advanced data in SQLite, because having parsed data in offline storage would be very useful for my project.

    Should I make a pre-defined table for each "level" of the tree and then tie them together with foreign keys (more on this subject would be great too)? Are foreign keys the answer to sticking to the flat table model of SQLite? I am thinking that if I just keep track of what "level" of the tree the data "should" be in as I am storing it, I can rebuild the tree at read time rather than trying to store it as such. Thoughts?

    Thanks for all the work you are doing to help your users; I really like your motives to share the wealth of knowledge with everyone. I know I'll be referring everyone I can to this site for the breadth and depth of topics covered. /Subscribed+Liked

    • admin says:

      I think you are really looking for a tutorial on SQL. Here is my old one: SQL Tutorial. I'm going to redo my PHP tutorial in the style that you guys have told me you like. Along the way I'll cover SQL, web scraping, WordPress themes, and plugins. That should help to round out your knowledge so that you can do pretty much anything.
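      For what it's worth, one common way to store tree data in a single SQLite table is the adjacency-list model Elliott is circling around: each row keeps a parent_id foreign key, and you rebuild the tree at read time. A minimal sketch (the node table and its columns are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
conn.execute('''CREATE TABLE node (
                    id INTEGER PRIMARY KEY,
                    parent_id INTEGER REFERENCES node(id),
                    title TEXT)''')

# Book 1 -> Chapter 1 -> Content 1, all in one flat table
conn.execute("INSERT INTO node VALUES (1, NULL, 'Book 1')")
conn.execute("INSERT INTO node VALUES (2, 1, 'Chapter 1')")
conn.execute("INSERT INTO node VALUES (3, 2, 'Content 1')")

# Rebuild one level of the tree at read time: the children of Book 1
rows = conn.execute(
    "SELECT title FROM node WHERE parent_id = ?", (1,)).fetchall()
print([r[0] for r in rows])  # -> ['Chapter 1']
```

Because every level lives in the same table, no dynamically created tables are needed, and the parameterized query avoids the SQL injection worry.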

  2. maleds says:

    Great Tutorial
    but I was wondering if I can scrape multiple pages or sites in a for loop.
    pages = 10
    for (i in pages):

    • admin says:

      You're going to have to scrape the web pages individually unless the same delimiters are used to surround the main content in each article. You can still do this in one program, just handle each site individually.
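      As for the loop itself: you can build each URL with string formatting and iterate over the results, something like this sketch (example.com is a placeholder, and the actual urlopen().read() call is left commented out so the sketch doesn't hit the network):

```python
pages = 10
urls = []
for i in range(1, pages + 1):
    # Substitute a variable into the URL string
    urls.append('http://example.com/page/%d' % i)

for url in urls:
    # webpage = urlopen(url).read()  # scrape each page here
    print(url)
```

This is also the answer to feeding variables into urlopen: build the URL string first, then pass it in.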

  3. maleds says:

    like in bash.
    Pages = $1;
    basically, is there a way to input variables and send them to urlopen(“url”).read(),
    or must i start using some magic.
    Thanks again

  4. SQL training says:

    I am interested in SQL because almost all programmers use it for databases. I am trying to learn it step by step, and I thank Google that I found your site. Thank you for your tutorials.

  5. Jambay says:

    thanks for this tutorial!
    Do you know if it is possible to scrape a website generated with JavaScript?

  6. Cyd says:

    Yay! I got the whole thing working with Python 3.2 and BeautifulSoup 4.0.

    This is a really (really) helpful series…

    • admin says:

      You’re a really nice person 🙂 Most people would have complained that everything didn’t work right because of the updates. I’m very happy to be able to help you in any small way. Thank you

  7. matt winchester says:

    Try to explain how your code is related to binary digits.
    I understand hard-wired things and your code confuses me.
    Tutorials 1-9 I followed with good results; in 10-14 I'm lost, and
    I don't understand why you choose to program with code that
    relies heavily on a library. When and where to choose code
    is what I'm troubled with. Please explain white space better if you
    can. My version is 2.7 and I use the direct input screen.
    Thanks, Matt.

    • Derek Banas says:

      White space is used to define things like when a function starts and ends. Every function must have a starting and ending point. The white space defines both.

      You use outside libraries a lot because it is hard to write efficient code that performs complex tasks. With libraries you just import and use. They are extremely useful.
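      A quick illustrative sketch of the white space point (the function name is made up):

```python
def greet(name):
    # These two indented lines belong to the function body
    message = 'Hello ' + name
    return message

# This line is not indented, so Python knows the function has ended
print(greet('Matt'))  # -> Hello Matt
```

The indentation is what tells Python where greet starts and ends; there are no braces or end keywords.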
