In this video tutorial I show you how to scrape websites. I introduce two new modules: urllib and Beautiful Soup. urllib comes bundled with Python, but you have to install Beautiful Soup yourself.
Beautiful Soup is available from its website. If you are using a Python version earlier than 3.0, get the Beautiful Soup release for Python 2. If you are using Python 3.0 or higher, get the Beautiful Soup release for Python 3.
To install it follow these steps:
This is how you normally install Python modules on any OS, by the way!
What is Website Scraping and is it Legal?
Website scraping is almost always legal as long as you follow these guidelines:
As for what website scraping is: it is the act of extracting information from one or more sites with an automated program. Below I provide a program that was written to scrape the Huffington Post, but the same approach can be used to scrape most any site.
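A minimal, self-contained sketch of that idea: fetch a document and read its raw markup, which is the first step of any scrape. This is written for Python 3, where urlopen lives in urllib.request (the code later in this post uses the Python 2 import), and it uses a data: URL in place of a real site so it runs without a network connection:

```python
import urllib.request

# Fetch a document and read its raw markup -- the first step of any scrape.
# A data: URL stands in for a real website so this runs without a network.
page = urllib.request.urlopen(
    "data:text/html,<title>Hello</title>").read().decode("utf-8")
print(page)  # -> <title>Hello</title>
```

Swap the data: URL for a real http:// address and you are fetching live pages.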
As always, all of the code follows the video. If you have any questions or comments, leave them below. And if you missed my other Python tutorials, they are available here:
All the Code from the Video
import re
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
# Copy all of the content from the provided web page
webpage = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/LatestNews').read()
# Grab everything that lies between the title tags using a REGEX
patFinderTitle = re.compile('<title>(.*)</title>')
# Grab the link to the original article using a REGEX
patFinderLink = re.compile('<link rel.*href="(.*)" />')
# Store all of the titles and links found in 2 lists
findPatTitle = re.findall(patFinderTitle,webpage)
findPatLink = re.findall(patFinderLink,webpage)
# Create a list of indices that covers the first 16 articles, skipping the first couple
listIterator = []
listIterator[:] = range(2,16)
# Print out the results to screen
for i in listIterator:
    print findPatTitle[i] # The title
    print findPatLink[i] # The link to the original article
    articlePage = urlopen(findPatLink[i]).read() # Grab all of the content from the original article
    divBegin = articlePage.find('<div>') # Locate the first div
    article = articlePage[divBegin:(divBegin+1000)] # Copy the first 1000 characters after the div
    # Pass the article to the Beautiful Soup module
    soup = BeautifulSoup(article)
    # Tell Beautiful Soup to locate all of the p tags and store them in a list
    paragList = soup.findAll('p')
    # Print all of the paragraphs to screen
    for paragraph in paragList:
        print paragraph
# Here I retrieve and print to screen the titles and links with just Beautiful Soup
soup2 = BeautifulSoup(webpage)
titleSoup = soup2.findAll('title')
linkSoup = soup2.findAll('link')
for i in listIterator:
    print titleSoup[i]
    print linkSoup[i]
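If you want to try the regex half of the script without hitting the live Huffington Post feed, here is a self-contained rerun of the same idea on a canned snippet of feed-style markup (Python 3 syntax). Note the patterns use non-greedy `(.*?)` here: the greedy `(.*)` used above can swallow everything between the first and last match when several tags share a line:

```python
import re

# A canned stand-in for the feed markup, so no network is needed.
webpage = (
    '<item><title>First headline</title>'
    '<link rel="alternate" href="http://example.com/one" /></item>'
    '<item><title>Second headline</title>'
    '<link rel="alternate" href="http://example.com/two" /></item>'
)

# Same patterns as above, made non-greedy so each match stops at the
# first closing delimiter instead of running to the last one.
patFinderTitle = re.compile(r'<title>(.*?)</title>')
patFinderLink = re.compile(r'<link rel.*?href="(.*?)" />')

titles = patFinderTitle.findall(webpage)
links = patFinderLink.findall(webpage)

# Print each title next to the link it came with
for title, link in zip(titles, links):
    print(title, '->', link)
```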
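And if you just want to see the idea behind the Beautiful-Soup-only version (the `findAll('title')` / `findAll('link')` calls) without installing anything, the standard library's html.parser can mimic it. This is a rough sketch, not Beautiful Soup itself, and the `FeedScanner` class name is my own:

```python
from html.parser import HTMLParser

# Collect <title> text and <link href="..."> values -- roughly what the
# findAll('title') / findAll('link') calls return with Beautiful Soup.
class FeedScanner(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.titles = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
        elif tag == 'link':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data)

scanner = FeedScanner()
scanner.feed('<title>A headline</title>'
             '<link rel="alternate" href="http://example.com/a" />')
print(scanner.titles, scanner.links)
```

Beautiful Soup is far more forgiving of broken HTML than this sketch, which is why it is worth installing for real scraping work.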