Python 2.7 Tutorial Pt 13 Website Scraping

In this video tutorial I show you how to scrape websites. I introduce two new modules: urllib and Beautiful Soup. urllib comes preinstalled with Python, but you have to install Beautiful Soup yourself.

Beautiful Soup is available from its website. If you are using a Python version earlier than 3.0, get the version of Beautiful Soup built for Python 2. If you are using Python 3.0 or higher, get the version built for Python 3.

To install it follow these steps:

  • Download the files
  • Untar or decompress the archive
  • Drop the .py and .pyc files into your lib/site-packages folder
  • You can also install the module by running python setup.py install from inside the Beautiful Soup folder

This is how you normally install all Python modules on every OS by the way!

What is Website Scraping and is it Legal?

Website scraping is almost always legal as long as you do the following:

  • Provide a link back to the original article
  • Publish only a shortened version of the article
  • Make no changes to the article's text

As for what website scraping is: it is the act of extracting information from one or many sites using an automated program. I provide a program that was made to scrape the Huffington Post, but the code can be used to scrape almost any site.
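As a minimal illustration of the idea (using a made-up HTML snippet rather than a live site, so the markup and URLs here are just placeholders), a regular expression can pull titles and links out of fetched markup:

```python
import re

# A made-up chunk of feed markup standing in for a downloaded page
webpage = """
<item><title>First Story</title><link>http://example.com/1</link></item>
<item><title>Second Story</title><link>http://example.com/2</link></item>
"""

# Grab everything between the title tags and the link tags
titles = re.findall(r'<title>(.*)</title>', webpage)
links = re.findall(r'<link>(.*)</link>', webpage)

for title, link in zip(titles, links):
    print(title + " -> " + link)
```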

As always, a lot of code follows the video. If you have any questions or comments, leave them below. And if you missed my other Python tutorials, they are available here:

All the Code from the Video

#! /usr/bin/python

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

# Copy all of the content from the provided web page
webpage = urlopen('').read()

# Grab everything that lies between the title tags using a REGEX
patFinderTitle = re.compile('<title>(.*)</title>')

# Grab the link to the original article using a REGEX
patFinderLink = re.compile('<link rel.*href="(.*)" />')

# Store all of the titles and links found in 2 lists
findPatTitle = re.findall(patFinderTitle, webpage)
findPatLink = re.findall(patFinderLink, webpage)

# Create an iterator covering articles 2 through 15 (skipping the first few entries)
listIterator = []
listIterator[:] = range(2, 16)

# Print out the results to screen
for i in listIterator:
    print findPatTitle[i]  # The title
    print findPatLink[i]   # The link to the original article

    # Grab all of the content from the original article
    articlePage = urlopen(findPatLink[i]).read()

    # Locate the div that holds the article body
    divBegin = articlePage.find('<div>')
    # Copy the first 1000 characters after the div
    article = articlePage[divBegin:(divBegin + 1000)]

    # Pass the article to the Beautiful Soup module
    soup = BeautifulSoup(article)

    # Tell Beautiful Soup to locate all of the p tags and store them in a list
    paragList = soup.findAll('p')

    # Print all of the paragraphs to screen
    for paragraph in paragList:
        print paragraph

    print "\n"

# Here I retrieve and print to screen the titles and links with just Beautiful Soup
soup2 = BeautifulSoup(webpage)

print soup2.findAll('title')
print soup2.findAll('link')

titleSoup = soup2.findAll('title')
linkSoup = soup2.findAll('link')

for i in listIterator:
    print titleSoup[i]
    print linkSoup[i]
    print "\n"
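A note for readers on Python 3 (several commenters below hit this): urlopen moved to urllib.request, and read() returns bytes rather than a str, so str regex patterns raise a TypeError. A hedged sketch of the fix, assuming a UTF-8 page (the byte string below just stands in for urlopen(url).read()):

```python
import re

# What page.read() hands back on Python 3: bytes, not str
raw = b'<title>Example Story</title>'

# Decode to str first; then ordinary str patterns work as before
text = raw.decode('utf-8')
titles = re.findall(r'<title>(.*)</title>', text)
```

The alternative is compiling a bytes pattern (e.g. `rb'<title>(.*)</title>'`), but decoding once up front is usually simpler.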

120 Responses to “Python 2.7 Tutorial Pt 13 Website Scraping”

  1. Matt says:

    Hi Derek,

    Thanks for all the great tutorials! They have really helped me a lot! FYI I had to copy the beautiful soup files directly into the lib folder to get it to import properly. Also, when I copy the code from the site I get an error because the open single quote appears as a non-ASCII character. Example…

    webpage = urlopen(‘http://……)

    The single quote has to be deleted and typed back in for it to work.

    Just thought I’d share what I found since others may be experiencing it too.

    Have a nice weekend

    • admin says:

      Yes the quotes sometimes get a little messed up. Also you have to place the tabs in the right place. I could provide a link to a file? WordPress messes things up sometimes.

  2. Matt says:

    Well… Maybe I am doing some other things incorrectly too. Still trying to get the code to compile without errors. Might want to ignore what I wrote above.

  3. Matt says:

    okay 🙂

    It looks like you can’t copy/paste the code from your website into the module. You have to delete and retype the single and double quotes. Then it will run properly.

    Thanks for all the guidance. I would be hopelessly lost without your tutorials.

  4. Tito says:

    Hi Derek,

    Can you demonstrate a quick example on how to parse the data from the tcpdump output, given the output has already been converted to text format?


    • admin says:

      I don’t have a lot of experience with the tcpdump library, but I find it is normally easiest to parse plain text through the use of regular expressions. That is how most of these parsing libraries work anyway. I’m going to try to expand my regular expression tutorial today. I’ll cover all of the most commonly wanted regular expressions.

      What specifically are you trying to do?

      • Tito says:

As a matter of fact, after watching the regex tutorials, I'll try to parse something on my own using regex with Python. I've been using grep with simple regex to search for what I want, but it is too labor intensive.

        Thanks again,

  5. Anonymous says:

Thanks for your prompt reply. I am not looking for anything specific. I use grep and some simple regex to get what I need for the most part, but that seems to involve too much manual labor and I'm not a programmer/scripter by any means. That's why I have been trying to learn Python by reading easy books and watching your tutorials :).

  6. anonymous says:

    Does the code above work for RSS feeds. I tried using your code to scrape the following RSS feed:

    and I get the following errors:

    Traceback (most recent call last):
    File “C:\Documents and Settings\Administrator\My Documents\harvardext\week2\week2friday\”, line 56, in
    print findPatTitle[i] # The title
    IndexError: list index out of range

  7. John says:

    Hello Admin,

    Your py tutorials are HQ and best on youtube.
    Please keep them coming and lead us all to advanced python coding skills.
    Please don’t stop after 20-30 tutorials!

    Good Luck!

  8. John says:

When I visit the link you posted: , for the BeautifulSoup download compatible with Py 2.7, I see tons of files and don't know what to download.

  9. Graham Perry says:


    I have recently started with Python and initially found it difficult. Your tutorials have brought me a long way. Thanks for sharing and keep up the good work.


  10. aragon says:


    Thanks for putting up all these great videos. Im wanting to scrape a webpage that requires a log-in first, is there an easy way to do this in Python?


    • admin says:

      You’re welcome.

      The short answer to your question is maybe 🙂 Some sites will completely block you from getting through their login screen programmatically through the use of a captcha.

If there is nothing blocking you, you have to figure out all of the requirements to log in: username, password, encoding issues, what cookies are set, etc. I can't think of a way to write code that would work with every site. This is definitely a hack job, but I'll look to see if I can come up with something.

  11. Scott says:

    Great tutorials!

    Would love a tutorial on how to scrape friends and followers on Twitter, any plans for that?


    • admin says:

I'm going to be covering social network programming next. I'll start with Facebook and then Twitter. Explain exactly what you want to do with Twitter and I'll tell you if it is possible.

  12. Reed says:

    Hey Derek,

    Great tutorials! I’m having a problem with the re.findall(patFinderTitle,webpage) portion of the code.

    I get the following error:

    Traceback (most recent call last):
    File “”, line 2, in
    File “/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7 /”,
    line 177, in findall
    return _compile(pattern, flags) .findall(string)
    TypeError: expected string or buffer

    Any idea what the problem may be?

    • admin says:

Thanks, I'm glad you like them 🙂 I have a few questions: Are you searching the Huffington Post or some other site? Beautiful Soup doesn't work with all sites. Have you edited the code in any way?

  13. Reed says:

    I think the problem was I didn’t append read() to the end of urlopen… Sorry about that.

    Now I’m getting an Index Error: “list index out of range”, and I’ve checked several times to make sure that the code I’m using is identical. Could it be a problem with beautiful soup? I’m worried I might not have gotten it on the path, but eclipse seems to import it.

    Thanks for any help.

  14. Reed says:

    And yes, I’m using the same Huffington Post link

  15. Reed says:

    Here is the code I’m using:

    from urllib import urlopen
    from BeautifulSoup import BeautifulSoup
    import re

    webpage = urlopen(‘’).read()

    patFinderTitle = re.compile(‘(.*)’)

    patFinderLink = re.compile(”)

    findPatTitle = re.findall(patFinderTitle,webpage)

    findPatLink = re.findall(patFinderLink,webpage)

    listIterator = []
    listIterator[:] = range(1,100)

    for i in listIterator:
    print findPatTitle[i]
    print findPatLink[i]
    print “\n”

  16. Reed says:

    Sorry Derek, I had a problem in my code. I was using (.*) instead of (.*)

    • admin says:

      I’m glad you fixed it. I figured it was some silly typo. Sorry I couldn’t respond quicker. I’m getting a lot of comments lately

    • Ahmad Masood says:

      Hi Reed
      I am having the same problem can you explain what you meant by
      “I was using (.*) instead of (.*)”
      this line

  17. Nelson says:

Hello Derek, great site you have here. I love Python; it was love at first sight. Your tutorials rock.

    Could you demo how to use an RSS reader in Tkinter?


  18. Nelson says:

Well, I'm trying to create a simple RSS feed reader, scraping a web page and showing it with Python in a visual way, not as text.

  19. Nelson says:

    Ok, no problem.

  20. Che says:

    I have really enjoyed your tutorials! Muchas gracias!

    Do you have any suggestions for scraping websites which don’t particularly want to be scraped? Like wikipedia?

    Again, thanks for all the great tutorials!

    • admin says:

You're very welcome. I just looked at Wikipedia and yikes, what a mess! You can of course grab data from it and probably delete all of the CSS. I don't think you should try to grab what you want using regex. It's probably a better idea to grab all of it and then delete what you don't want using regex. It's doable, but will take some time. I hope that helps.

  21. hemant says:

Hi, I watched your videos and they are really good, but one thing that I didn't get is why you use this "if __name__ == '__main__': main()".
    I've written so many beginner programs without using it. Can you please explain it to me?

    • admin says:

That line allows your Python code to act as either a reusable module or as a standalone program. It simply tells the interpreter to call the main function.
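A minimal sketch of that guard (the file and function names are just for illustration):

```python
# my_module.py (hypothetical): usable both as a script and as an import
def main():
    return "running as a script"

if __name__ == '__main__':
    # True only when this file is executed directly;
    # on "import my_module" this block is skipped.
    print(main())
```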

  22. Anonymous says:

    I got your zip archive and opened your for web scraping. I don’t know why it’s missing certain titles, and full articles. I went through the current huffington post rss feeds and ran your code and for some reason I’m getting the same problems. I looked at each title on the website and in the output checking to make sure that there was the same title,url,article full text. Some articles are missing, some articles have the title and url but no text from the actual article. Is it just my computer, could you verify if this works 100% correctly?

  23. mirror says:

    I got your zip archive and opened your for web scraping. I don’t know why it’s missing certain titles, and full articles. I went through the current huffington post rss feeds and ran your code and for some reason I’m getting the same problems. I looked at each title on the website and in the output checking to make sure that there was the same title,url,article full text. Some articles are missing, some articles have the title and url but no text from the actual article. Is it just my computer, could you verify if this works 100% correctly?

  24. Rich says:

Wonderful tutorial. Web scraping is the reason I have started to teach myself Python. One little problem with the code above (though troubleshooting it was a good learning experience for me): the divBegin line is not fully finished; the complete version would identify the body_entry_text division.
    All in all, a wonderful job Derek, thank you for teaching me about Python!

  25. Rick says:

    First, thanks for all the videos, they’re really great. I have a question about grabbing the titles and links from the huffington RSS feed. From your code:

    titleString = ‘(.*)’
    origArticleLink = ”

I can strip the code to where it only grabs these lines and prints them out, and it works. But how!?? The Huffington RSS feed seems to have changed its tags. I can't figure out how to scrape the new tags, but this old code is working when I don't see why it should!

    For example, here is the article title and link from the RSS feed. I can’t figure out how to scrape it with new code, but the code I list above still works…

    ‘Tabatha Takes Over’: Tabatha’s Appalled By Salon Employees Drinking At Work (VIDEO)

    • admin says:

I checked out the code. Everything seems to still be working because the title and link to the original article are still set up the same way. I then use urlopen to do all of the heavy lifting in regards to grabbing the original articles. I'm not sure what could be going wrong with your code. The trick is to grab the original article link and let urlopen do its job for any other feeds you're pulling from. Does that make sense?

  26. Rick says:

    Oops, looks like the comments section killed the code I pasted in there. But you know what it should be, the simple title tag and the link ref tag

  27. helena says:

    Great tutorial Derek! Very helpful!

    I had one problem when running your code though. I also get an error message like the one below:

    Traceback (most recent call last):
    File “C:/Python27/”, line 18, in
    findPatTitle = re.findall(patFinderTitle,webpage)
    File “C:\Python27\lib\”, line 177, in findall
    return _compile(pattern, flags).findall(string)
    TypeError: expected string or buffer

    any idea on how I can fix this?



  28. helena says:

    nvm. problem fixed. don’t know how, but the program is working now. Thank you again, Derek, for posting this!

  29. ann says:

    Hey Derek,

    Great tutorial. One question though: how do you get rid of the tags?


  30. ann says:

    The tags, and also the link tags.

  31. Cyd says:

    Before even trying BeautifulSoup, I was trying to get the other code working.

    If I print the webpage variable, the whole thing is there in all its glory, but when I print patFinderTitle, I see that it’s an object:

    This explains the error code I’m getting when re.findall tries to run:

    Traceback (most recent call last):
    File “C:/Documents and Settings/Cyd/Desktop/”, line 14, in
    findPatTitle = re.findall(patFinderTitle, webpage)
    File “C:\Python32\lib\”, line 193, in findall
    return _compile(pattern, flags).findall(string)
    TypeError: can’t use a string pattern on a bytes-like object

    It doesn’t explain why it works for you, though. I’ve looked through the Python documentation and can’t find any explanation.

    Can you tell me what I’m missing? (I even tried copying and pasting your code for that section, and get the same error.)


    • Cyd says:

      I guess using the code tag doesn’t work. Anyway, it’s an object:

      • Cyd says:

        Okay, third try:

        Pretend there are opening and closing tags on this:

        _sre.SRE_Pattern object at 0x010B3180

        • Cyd says:

          Sorry again.

          Through mucho searching on the web, I found that I can put a ‘b’ in front of the pattern to make it a bytes pattern and then everything works fine (except I get an ugly b’ in front of each entry).

          • Cyd says:

            Nope, that didn’t work. When it gets to articlePage, I’m told that ‘bytes’ object has no attribute ‘timeout’ (part of urlopen). I guess it still wants a regular string.

            I just don’t understand why the re.compile lines are returning bytes objects instead of strings.

    • admin says:

      Sorry I couldn’t get to you quicker. I’ll look over the code. Because I made this tutorial so long ago I’m guessing beautiful soup must have changed. The code worked in the past as you saw in the video.

      I’ve also heard that many people have been struggling because of recent changes in eclipse and pydev. I’ve seen numerous errors from people that didn’t recently update pydev.

In hindsight I probably should have avoided covering Beautiful Soup because the author seems to be giving up on the project.

      • Cyd says:

        No problem. I appreciate your quick response. I found bs4 (the new beautiful soup) and it’s working ok. I encoded the urlopen statements into utf-8 and that fixed the object problem.

        Then for the soup.findall, I used soup.find_all and stuff started printing out. It got through four articles before hitting an error in an articlePage:

        “UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0x80 in position 132109: invalid start byte”

        But I must say, figuring this stuff out has GOT to make me a better programmer.

        Thanks again.
        On to the next tutorial!

        • admin says:

          That is pretty much how I learned to program. Except I didn’t have the internet. I had to go dig up stuff in real books. Yuck!

          I’ll have to revisit this tutorial and make corrections based off of BS not being backwards compatible.

          Thank you for pointing all of this out 🙂

  32. Adil M. says:

    Amazing videos – thank you so much for sharing your knowledge. I am however having the following issue:

    Shows me the following error while running…

    SyntaxError: Non-ASCII character ‘\x94’ in file C:\Users\amoosa\workspace\PythonTest\PythonTest\ on line 19, but no encoding declared; see for details

    It points to the following sentence:
    patFinderLink = re.compile(“”)

    I am using Windows 7 and Eclipse with PyDev.

    Furthermore, could you point/instruct how to take care of a login page to have credentials put in, so I can start doing the website scraping thereafter?


    • Adil says:

Never mind, I found and corrected the problem. Apparently, the copy/paste function brings the quote characters in differently. When I removed the quotes and re-added them manually, no errors were discovered.

      Just thought I should let you know 🙂

      Also, thank you for the videos. I am a big fan of your site.

      • admin says:

        Yes that is a problem with my old videos. It was done for security reasons. I was going to go back and correct that in every tutorial, but I figured I’d just keep making new tutorials instead.

It could take months to go back and fix all of my past errors 🙂

  33. mike says:

    this site is amazing. is there anyway you could do another tutorial on another website? (non rss / feed)

  34. John says:

    Just found your article on web scraping after seeing Kim Rees (Periscopic) on “Digging into Open Data” at OSCON 2012. Thanks!

  35. Chris Armstrong says:

    Hey Derek,

    Loving the tutorials, but ran into a problem on the code in this tutorial. I’m getting the below error:

    TypeError: expected string or buffer

    Related to this line:
    findPatTitle = re.findall(patFinderTitle, webpage)

I did see that a few others had this error, and I saw your response about updating Eclipse. I just did that, and am still getting this error. I'm using Python 3.2 if that helps. The code in its entirety is:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re

    webpage = urlopen('')

    patFinderTitle = re.compile('(.*)') #the (.*) is a regular exp to grab anything between title tags

    patFinderLink = re.compile('')

    findPatTitle = re.findall(patFinderTitle, webpage)
    findPatLink = re.findall(patFinderLink, webpage)

    listIterator = []
    listIterator[:] = range(2,16)

    for i in listIterator:

    • admin says:

      I think it is a beautiful soup issue. Since I made this tutorial, I no longer use that library. I’ll take a look to see if that is the issue.

  36. David says:


    I’m getting an IndexError: list index out of range. I’ve stripped down the code to its basics and still have the problem. I would be very thankful if you could help me out.

  37. amar says:

would you comment on how you would do the following using BeautifulSoup? I feel like it can be simplified significantly, but I run into errors when I try doing it with soup.

    # def get_urls_of_restaurant():
    # list_urls = []
    # n = 0
    # nn = 0
    # for i in range(4):
    # url = urlopen(‘’ + str(nn) + ‘-Dar_es_Salaam.html’).readlines() #open URL whis lists restaurants
    # while n < len(url):
    # if '"' not in url[n] and '‘ in url[n] and len(url[n]) > 5:
    # list_urls.append(url[n-1].split(‘”‘)[1])
    # n += 1
    # n = 0
    # nn += 30
    # list_urls.reverse()
    # print “Geting urls done! Get %s” %len(list_urls) + ‘ urls.’
    # return list_urls

    • admin says:

      Hi, What version of BeautifulSoup are you using? They changed a bunch of things lately and of course it isn’t backwards compatible. What errors are you getting? Thanks – Derek

  38. Sean says:

    How do I check the URL that I’m going to is not a broken URL?

  39. Franko says:

Hey, great tutorial, I plan to check out more of them. Your directions on installing Beautiful Soup were way better than the instructions on BeautifulSoup's website.

  40. Rahul says:

    Hi derek

I had a question about Python: is there a way to handle exceptions of a program you don't know about (i.e. you want to catch a simple exception in a program someone else has written)? How would you invoke their program in your try/except block, or could they invoke the try/except block in their program?

    Thank you for your time

Edit: I thought about using execfile() but it didn't help.

    • admin says:

You can catch all Python exceptions like this, but it isn't recommended:

      try:
          # do stuff
      except Exception, e:
          print e
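A hedged sketch of turning that into a reusable pattern, where one module defines custom exceptions and a catch-all runner that other programs can import (all names here are made up for illustration):

```python
# errors.py (hypothetical): shared exception definitions and a runner
class ValidationError(Exception):
    """A custom exception another program might raise."""
    pass

def run_checked(func):
    # Call func and report any exception instead of crashing
    try:
        return func()
    except Exception as e:
        return "caught: %s" % e

# Another program imports errors.py and wraps its own functions:
def someone_elses_code():
    raise ValidationError("bad input")
```

Calling `run_checked(someone_elses_code)` returns `"caught: bad input"` instead of raising, while `run_checked` passes through the return value of functions that succeed.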

      • Rahul says:

        Hi Derek

        Thank you for your answer , my prob statement was like

1) prog 1: a debugger (try/except statements for a few errors) (written by me)

        2) prog 2 : reg prog (written by my team ) by person X

person X wants to know if certain exceptions are occurring in prog 2, so he imports prog 1 and tada, he can catch the custom exceptions.

        I just want to know , how do I write prog 1 so it works like that?

  41. AndresDuque says:

Hi Derek, I'm a fan of your videos about scraping in Python, but I have a problem: my Python IDLE doesn't recognize BeautifulSoup. How can I get it to recognize this import?


  42. Takatino says:

    Hey! I just want to ask you a question; Is it possible to do website scraping without the beautiful soup module? I am currently making a program that runs different python scripts, and I don’t want other users who use it to manually download beautiful soup. I also need to know what I should do when the tags are combined together, for example:

    link = http://linkhere … title = titlehere

  43. Hi Derek,

    Is it possible to get BS to do this?

    Go to website homepage (video site)

    Enter the article (normally by clicking thumbnail’s url)

    Grab the title

    Grab the video’s tags

    Grab the videos embed code

    Then go back and move on to the next article

  44. Chris says:

    Great tutorials, thank you (and keep it up)! QUESTION: How to scrape when the data spans multiple pages (with pagination, such as < Last >>)? Thank you!

  45. Josh says:

    Could you post an example of webscraping where webpages are loaded with Javascripts (and extracting information from that site)?

  46. ashok says:

    can you tell me, how to make hotel booking system in tkinter

  47. Elmar says:

    Thank you very much

  48. Hi mates, how is everything, and what you want to say about this post,
    in my view its actually awesome for me.

  49. Adizero says:

Great post about screen scraping. This works pretty well for a single page or website. For a large project, the Scrapy framework is more suitable than the method mentioned in this post.

  50. Nathan says:

I really like your tutorial, it made things easy to understand.

I have one problem I have been trying to figure out. I got a big file with over 200 lines. Here is an example of one of the lines:


    The file name has the date in it, and the title. The date code is: Y/M/D. The ldm is an abbreviation for the title. How can I make a script to get the date code and put it in here Date and the title here Title. I have several HTML files like this, each one with several hundred lines, all just like this.

    Any ideas are GREATLY appreciated!!!

    • Derek Banas says:

      Thank you 🙂 I made a Regex tutorial that seems to help people. It was done using PHP, but the regular expression codes work exactly the same way in Python. Here is the video PHP Regex Tutorial. I explain the thought process behind using regex in complex situations. I hope it helps
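For the filename case above, a regex along these lines might work. This is only a sketch: the pattern and the sample name are guesses at Nathan's format (Y/M/D date plus an abbreviation and a hyphenated title), since his example line did not survive in the comment.

```python
import re

# Hypothetical filename format: "YYYY/MM/DD-abbrev-some-title.html"
pat = re.compile(r'(\d{4})/(\d{2})/(\d{2})-[a-z]+-(.+)\.html')

def parse_name(name):
    # Return (date, title) extracted from the filename, or None on no match
    m = pat.match(name)
    if m is None:
        return None
    year, month, day, title = m.groups()
    date = "%s-%s-%s" % (year, month, day)
    return date, title.replace('-', ' ')
```

For example, `parse_name("2011/08/15-ldm-storm-report.html")` would yield `("2011-08-15", "storm report")`; looping it over the lines of each HTML file collects every date/title pair.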

  51. Sammed Mandape says:

    Hi Derek,

    I am trying to do some web scraping for my research in python. But I am unable to construct the url. After going through your post thought maybe shoot you some of my questions. Any help is greatly appreciated.

    I want to access this website from python and submit my query to convert ids. Then get that response back from the website into my program. While studying about it I learned more of webscraping in python. Is this the way to do it? For example if I give input as
    And choose option

    And hit convert it will return PMCID and NIHMSid. I want these returned values to be used in my program. I guess that will be just parsing the results.

    I went through basic youtube videos for web scraping in python as this
    I also tried doing it by my self using BeautifulSoup but no success. Also, tried firebug in firefox to get the url that I can use to query the website.

    As far as I got, the base url is:
    But I am unable to join it to the query to complete the url.
    When queried I get the and when tried to get the complete query as it is not working.

    So, I am guessing I am missing something here.

    Thank you for your time and help.

  52. Sammed Mandape says:

    Sorry for not mentioning the website. The website I am trying to access is

  53. Inon says:

    Hi Derek!

    I loved your Python tutorials!

    I went over some stuff on my own and have created two well-commented *.py files covering:

    1. Lambda expressions, functional programming tools, and list comprehension
    (all following the 2.7.6 documentation).

    2. Iteration generators and related stuff
    (mostly following the stackoverflow question at

    I’d be proud if you wanted to make a couple of video tutorials out of these! Your video editing and narrating are probably much better than what I could muster, and you already have a crowd.

    Please let me know (by e-mail, if at all possible) if you’d like me to post the *.py files on googledocs or whatever…

Good luck with your awesome project, I'm certainly going to continue using it!



  1. Tweets that mention Python 2.7 Tutorial Pt 13 Website Scraping | New Think Tank -- - [...] This post was mentioned on Twitter by vio7j. vio7j said: Python 2.7 Tutorial Pt 13 Website Scraping | New…
  2. Scraping websites with BeautifulSoup - Python For Beginners - [...] Here is another example I saw on [...]
