In this video tutorial I show you how to scrape websites. I introduce two new modules: urllib and Beautiful Soup. urllib comes preinstalled with Python, but you have to install Beautiful Soup yourself.
Beautiful Soup is available from its website. If you are using a Python version earlier than 3.0, get this version of Beautiful Soup; if you are using Python 3.0 or higher, get this version of Beautiful Soup.
To install it follow these steps:
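1. Download the Beautiful Soup archive and extract it somewhere convenient.
2. Open a terminal (or command prompt on Windows) inside the extracted folder.
3. Run python setup.py install (you may need administrator rights).
The exact archive name depends on the version you grabbed, but the setup.py step is the standard routine.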
This is how you normally install all Python modules on every OS by the way!
What is Website Scraping and is it Legal?
Website scraping is almost always legal as long as you provide the following:
As for what website scraping is: it is the act of pulling information from one or many sites using an automated program. I provide a program that was made to scrape the Huffington Post, but the code can be used to scrape almost any site.
Like always, a lot of code follows the video. If you have any questions or comments, leave them below. And if you missed my other Python tutorials, they are available here:
All the Code from the Video
#! /usr/bin/python
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

# Copy all of the content from the provided web page
webpage = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/LatestNews').read()

# Grab everything that lies between the title tags using a REGEX
patFinderTitle = re.compile('<title>(.*)</title>')

# Grab the link to the original article using a REGEX
patFinderLink = re.compile('<link rel.*href="(.*)" />')

# Store all of the titles and links found in 2 lists
findPatTitle = re.findall(patFinderTitle, webpage)
findPatLink = re.findall(patFinderLink, webpage)

# Create an iterator that will cycle through the first 16 articles and skip a few
listIterator = []
listIterator[:] = range(2, 16)

# Print out the results to screen
for i in listIterator:
    print findPatTitle[i]  # The title
    print findPatLink[i]   # The link to the original article

    # Grab all of the content from the original article
    articlePage = urlopen(findPatLink[i]).read()

    # Locate the body_entry_text div
    divBegin = articlePage.find('<div class="body_entry_text">')

    # Copy the first 1000 characters after the div
    article = articlePage[divBegin:(divBegin + 1000)]

    # Pass the article to the Beautiful Soup module
    soup = BeautifulSoup(article)

    # Tell Beautiful Soup to locate all of the p tags and store them in a list
    paragList = soup.findAll('p')

    # Print all of the paragraphs to screen
    for paragraph in paragList:
        print paragraph
        print "\n"

# Here I retrieve and print to screen the titles and links with just Beautiful Soup
soup2 = BeautifulSoup(webpage)
print soup2.findAll('title')
print soup2.findAll('link')

titleSoup = soup2.findAll('title')
linkSoup = soup2.findAll('link')

for i in listIterator:
    print titleSoup[i]
    print linkSoup[i]
    print "\n"
Hi Derek,
Thanks for all the great tutorials! They have really helped me a lot! FYI I had to copy the beautiful soup files directly into the lib folder to get it to import properly. Also, when I copy the code from the site I get an error because the open single quote appears as a non-ASCII character. Example…
webpage = urlopen(‘http://……)
The single quote has to be deleted and typed back in for it to work.
Just thought I’d share what I found since others may be experiencing it too.
Have a nice weekend
Yes the quotes sometimes get a little messed up. Also you have to place the tabs in the right place. I could provide a link to a file? WordPress messes things up sometimes.
Well… Maybe I am doing some other things incorrectly too. Still trying to get the code to compile without errors. Might want to ignore what I wrote above.
okay 🙂
It looks like you can’t copy/paste the code from your website into the module. You have to delete and retype the single and double quotes. Then it will run properly.
Thanks for all the guidance. I would be hopelessly lost without your tutorials.
I’ll provide links to the actual files later today so you can just download and run them.
Here is all of the Python Tutorial code in one zip archive. That should help you out. Thanks
Thanks Derek! I sent you an email on a Python Programming Job, it was sent from a different email address, but it is from me. Thanks again for all your help!!
Hi Derek,
Can you demonstrate a quick example on how to parse the data from the tcpdump output, given the output has already been converted to text format?
Thanks,
Tito
I don’t have a lot of experience with the tcpdump library, but I find it is normally easiest to parse plain text through the use of regular expressions. That is how most of these parsing libraries work anyway. I’m going to try to expand my regular expression tutorial today. I’ll cover all of the most commonly wanted regular expressions.
What specifically are you trying to do?
As a matter of fact, after watching the regex tutorials, I'll try to parse something on my own using regex with Python. I've been using grep with simple regex to search for what I want, but it is too labor intensive.
Thanks again,
Tito
I'll have the video tutorials for regular expressions very soon. They work the same in almost every language; you just use different methods.
Thanks for your prompt reply. I am not looking for anything specific. I use grep and some simple regex to get what I need for the most part, but that seems to involve too much manual labor, and I'm not a programmer/scripter by any means. That's why I have been trying to learn Python by reading easy books and watching your tutorials :).
I’m putting up a ton of regular expression videos today. I hope they help
Does the code above work for RSS feeds? I tried using your code to scrape the following RSS feed:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=1Xe9IY_Zn1SGVM3uOPAfgR_U9UtRLec761jzrl721l-NGWw14F
and I get the following errors:
Traceback (most recent call last):
  File "C:\Documents and Settings\Administrator\My Documents\harvardext\week2\week2friday\PubMedScraper_edits.py", line 56, in <module>
    print findPatTitle[i] # The title
IndexError: list index out of range
The code works on any page as long as you provide the proper tags that surround the content that you want to get. Here is a more advanced tutorial on using regular expressions that may help you: Complex Regular Expressions. I hope that helps!
Hello Admin,
Your Python tutorials are HQ and the best on YouTube.
Please keep them coming and lead us all to advanced Python coding skills.
Please don’t stop after 20-30 tutorials!
Good Luck!
When I visit the link you posted, http://www.crummy.com/software/BeautifulSoup/download/3.x/, for the BeautifulSoup download compatible with Py 2.7, I see tons of files and don't know what to download.
For almost everyone, the 3.2 series is the best choice. Sorry for not pointing that out.
Derek
I have recently started with Python and initially found it difficult. Your tutorials have brought me a long way. Thanks for sharing and keep up the good work.
Graham
Derek
Thanks for putting up all these great videos. I want to scrape a webpage that requires a log-in first; is there an easy way to do this in Python?
Thanks
You’re welcome.
The short answer to your question is maybe 🙂 Some sites will completely block you from getting through their login screen programmatically through the use of a captcha.
If there is nothing blocking you, you have to figure out all of the requirements to log in: username, password, encoding issues, what cookies are set, etc. I can't think of a way to write code that would work with every site. This is definitely a hack job, but I'll look to see if I can come up with something.
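For what it's worth, here is a rough sketch of that approach in Python 2 with urllib2 and cookielib. The URL and form field names are made up; you would swap in whatever the real login form expects.

import urllib
import urllib2
import cookielib

# Build an opener whose cookie jar keeps the session cookies between requests
cookieJar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookieJar))

# Post the login form (hypothetical URL and field names; check the real form)
loginData = urllib.urlencode({'username': 'me', 'password': 'secret'})
opener.open('http://example.com/login', loginData)

# Requests through the same opener now carry the login cookies
memberPage = opener.open('http://example.com/members-only').read()
print memberPage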
Hey Derek,
I've been looking for your tutorials on FB and Twitter and couldn't find them. Have you done them yet? If yes, can you post the link? If not, when are they going to come out?
Thanks!
Hi Ann, I did 3 tutorials on Facebook: How to Make Facebook Apps, How to Make Facebook Apps Pt 2 and How to integrate Facebook into WordPress.
I plan on making more Facebook apps as soon as I figure out all of the recent changes. As for Twitter, what would you like to see?
Great tutorials!
Would love a tutorial on how to scrape friends and followers on Twitter, any plans for that?
Scott
I'm going to be covering social network programming next. I'll start with Facebook and then Twitter. Explain exactly what you want to do with Twitter and I'll tell you if it is possible.
Hey Derek,
Great tutorials! I’m having a problem with the re.findall(patFinderTitle,webpage) portion of the code.
I get the following error:
Traceback (most recent call last):
  File "", line 2, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
Any idea what the problem may be?
Thanks, I'm glad you like them 🙂 I have a few questions: Are you searching the Huffington Post or some other site? Beautiful Soup doesn't work with all sites. Have you edited the code in any way?
I think the problem was I didn’t append read() to the end of urlopen… Sorry about that.
Now I’m getting an Index Error: “list index out of range”, and I’ve checked several times to make sure that the code I’m using is identical. Could it be a problem with beautiful soup? I’m worried I might not have gotten it on the path, but eclipse seems to import it.
Thanks for any help.
And yes, I’m using the same Huffington Post link
Here is the code I’m using:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

webpage = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/LatestNews').read()
patFinderTitle = re.compile('(.*)')
patFinderLink = re.compile('')
findPatTitle = re.findall(patFinderTitle,webpage)
findPatLink = re.findall(patFinderLink,webpage)
listIterator = []
listIterator[:] = range(1,100)
for i in listIterator:
    print findPatTitle[i]
    print findPatLink[i]
    print "\n"
Sorry Derek, I had a problem in my code. I was using (.*) instead of (.*)
I’m glad you fixed it. I figured it was some silly typo. Sorry I couldn’t respond quicker. I’m getting a lot of comments lately
Hi Reed
I am having the same problem can you explain what you meant by
“I was using (.*) instead of (.*)”
this line
Hello Derek, great site you have here. I love Python; it was love at first sight. Your tutorials rock.
Could you do a demo of how to use an RSS reader in Tkinter?
Thanks
Thank you 🙂 I’m glad you like it. What specifically are you looking to do with RSS feeds?
Well, I'm trying to create a simple RSS reader, scraping the web and showing it with Python in a visual way, not text, to keep track of my favorite sites and stuff.
I’ll do my best. I’m a bit backed up with tutorials at the moment
Ok, no problem.
I have really enjoyed your tutorials! Muchas gracias!
Do you have any suggestions for scraping websites which don’t particularly want to be scraped? Like wikipedia?
Again, thanks for all the great tutorials!
You're very welcome. I just looked at Wikipedia and yikes, what a mess! You can of course grab data from it and probably delete all of the CSS. I don't think you should try to grab what you want using regex; it's probably a better idea to grab all of it and then delete what you don't want using regex. It's doable, but will take some time. I hope that helps.
Hi, I watched your videos and they are really good, but one thing that I didn't get is why you use this: "if __name__ == '__main__': main()".
I've written so many beginner programs without using it. Can you please explain it to me?
That line allows your Python code to act as either a reusable module or a standalone program. It simply tells the interpreter to call the main function when the file is run directly.
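A minimal sketch of the idiom:

def main():
    print 'Doing the real work'

if __name__ == '__main__':
    main()  # runs only when the file is executed directly, not when it is imported

If another script imports this file, the main() call is skipped and the importer can reuse the functions instead.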
I got your zip archive and opened your testcode3.py for web scraping. I don't know why it's missing certain titles and full articles. I went through the current Huffington Post RSS feeds and ran your code, and for some reason I'm getting the same problems. I looked at each title on the website and in the output, checking to make sure that there was the same title, URL, and article full text. Some articles are missing, and some articles have the title and URL but no text from the actual article. Is it just my computer? Could you verify that this works 100% correctly?
I’ll check this out as soon as possible. Everything should work perfectly unless the tags have been changed on the site
Wonderful tutorial. Web scraping is the reason I have started to teach myself Python. A little problem with the code above, though troubleshooting it was a good learning experience for me: the divBegin line is not fully finished; it should identify the body_entry_text division.
All in all , wonderful job Derek, thank you for teaching me about Python!
You’re very welcome. Thank you for pointing that out 🙂
First, thanks for all the videos, they’re really great. I have a question about grabbing the titles and links from the huffington RSS feed. From your code:
titleString = '<title>(.*)</title>'
origArticleLink = '<link rel.*href="(.*)" />'
I can strip the code to where it only grabs these lines and prints them out, and it works. But how!?? The Huffington Post RSS feed seems to have changed its tags. I can't figure out how to scrape the new tags, but this old code is working when I don't see why it should!
For example, here is the article title and link from the RSS feed. I can’t figure out how to scrape it with new code, but the code I list above still works…
‘Tabatha Takes Over’: Tabatha’s Appalled By Salon Employees Drinking At Work (VIDEO)
I checked out the code. Everything seems to still be working because the title and link to the original article are still set up the same way. I then use urlopen to do all of the heavy lifting in regards to grabbing the original articles. I'm not sure what could be going wrong with your code. The trick is to grab the original article link and let urlopen do its job for any other feeds you're pulling from. Does that make sense?
Oops, looks like the comments section killed the code I pasted in there. But you know what it should be, the simple title tag and the link ref tag
Great tutorial Derek! Very helpful!
I had one problem when running your code though. I also get an error message like the one below:
Traceback (most recent call last):
  File "C:/Python27/beautifulSoupDemo.py", line 18, in <module>
    findPatTitle = re.findall(patFinderTitle,webpage)
  File "C:\Python27\lib\re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
any idea on how I can fix this?
Thanks,
Helena
This error seems to get cleared up if you update everything in Eclipse. Just click Help and then Check for Updates in Eclipse.
Never mind, problem fixed. I don't know how, but the program is working now. Thank you again, Derek, for posting this!
Great 🙂
Hey Derek,
Great tutorial. One question though: how do you get rid of the tags?
Ann
Delete anything that qualifies as a tag using regular expressions.
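For example, a quick sketch that strips anything shaped like a tag (blunt, but fine for simple cases):

import re

html = '<title>Some Headline</title>'
print re.sub('<[^>]+>', '', html)  # prints: Some Headline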
The tags, and also the link tags.
Before even trying BeautifulSoup, I was trying to get the other code working.
If I print the webpage variable, the whole thing is there in all its glory, but when I print patFinderTitle, I see that it's an object: <_sre.SRE_Pattern object at 0x010B3180>. This explains the error code I'm getting when re.findall tries to run:
Traceback (most recent call last):
  File "C:/Documents and Settings/Cyd/Desktop/pytut_13.py", line 14, in <module>
    findPatTitle = re.findall(patFinderTitle, webpage)
  File "C:\Python32\lib\re.py", line 193, in findall
    return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
It doesn’t explain why it works for you, though. I’ve looked through the Python documentation and can’t find any explanation.
Can you tell me what I’m missing? (I even tried copying and pasting your code for that section, and get the same error.)
Thanks,
Cyd
Through mucho searching on the web, I found that I can put a 'b' in front of the pattern to make it a bytes pattern, and then everything works fine (except I get an ugly b' in front of each entry).
Nope, that didn’t work. When it gets to articlePage, I’m told that ‘bytes’ object has no attribute ‘timeout’ (part of urlopen). I guess it still wants a regular string.
I just don’t understand why the re.compile lines are returning bytes objects instead of strings.
Sorry I couldn’t get to you quicker. I’ll look over the code. Because I made this tutorial so long ago I’m guessing beautiful soup must have changed. The code worked in the past as you saw in the video.
I’ve also heard that many people have been struggling because of recent changes in eclipse and pydev. I’ve seen numerous errors from people that didn’t recently update pydev.
In hindsight I probably should have avoided covering Beautiful Soup, because the author seems to be giving up on the project.
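For anyone hitting this on Python 3, a minimal sketch of the usual fix: urlopen now returns bytes, so decode before using string regexes (the 'replace' argument is one way to sidestep stray non-UTF-8 bytes like the 0x80 mentioned below).

from urllib.request import urlopen
import re

raw = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/LatestNews').read()
webpage = raw.decode('utf-8', 'replace')  # 'replace' avoids UnicodeDecodeError on odd bytes
titles = re.findall('<title>(.*)</title>', webpage)
print(titles)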
No problem. I appreciate your quick response. I found bs4 (the new Beautiful Soup) and it's working OK. I decoded the output of the urlopen statements as utf-8 and that fixed the object problem.
Then for the soup.findall, I used soup.find_all and stuff started printing out. It got through four articles before hitting an error in an articlePage:
“UnicodeDecodeError: ‘utf8’ codec can’t decode byte 0x80 in position 132109: invalid start byte”
But I must say, figuring this stuff out has GOT to make me a better programmer.
Thanks again.
On to the next tutorial!
That is pretty much how I learned to program. Except I didn’t have the internet. I had to go dig up stuff in real books. Yuck!
I'll have to revisit this tutorial and make corrections based on BS not being backward compatible.
Thank you for pointing all of this out 🙂
Amazing videos – thank you so much for sharing your knowledge. I am however having the following issue:
Shows me the following error while running…
SyntaxError: Non-ASCII character ‘\x94’ in file C:\Users\amoosa\workspace\PythonTest\PythonTest\Web_Scraping1.py on line 19, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
It points to the following sentence:
patFinderLink = re.compile(“”)
I am using Windows 7 and Eclipse with PyDev.
Furthermore, could you point out or instruct how to take care of a login page so credentials can be put in, and I can start doing the website scraping thereafter?
Thanks.
Never mind, I found and corrected the problem. Apparently, the copy/paste function takes the quote characters in differently. When I removed the quotes and re-added them manually, the errors went away.
Just thought I should let you know 🙂
Also, thank you for the videos. I am a big fan of your site.
Yes, that is a problem with my old videos. It was done for security reasons. I was going to go back and correct that in every tutorial, but I figured I'd just keep making new tutorials instead.
It could take months to go back and fix all of my past errors 🙂
This site is amazing. Is there any way you could do another tutorial on another website (non-RSS/feed)?
Thank you. I show how to scrape websites using other languages. I have a really neat tutorial using PHP here: PHP Website Scraping. The code is easy to translate into Python.
Just found your article on web scraping after seeing Kim Rees (Periscopic) on “Digging into Open Data” at OSCON 2012. Thanks!
You’re very welcome 🙂 Here are some videos on web scraping with PHP
Web Design and Programming Pt 7 REGEX
Web Design and Programming Pt 8 Regex
Web Design and Programming Pt 24 Regex on Steroids
I’m also a member of Atlanta PHP Meetup. Check out “AtlantaPHP dot org” or stop by if you’re in town. Again, thanks!
Hey Derek,
Loving the tutorials, but ran into a problem on the code in this tutorial. I’m getting the below error:
TypeError: expected string or buffer
Related to this line:
findPatTitle = re.findall(patFinderTitle, webpage)
I did see that a few others had this error, and I saw your response about updating Eclipse. I just did that, and am still getting this error. I'm using Python 3.2 if that helps. The code in its entirety is:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
webpage = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/LatestNews')
patFinderTitle = re.compile('<title>(.*)</title>') # the (.*) is a regular exp to grab anything between title tags
patFinderLink = re.compile('<link rel.*href="(.*)" />')
findPatTitle = re.findall(patFinderTitle, webpage)
findPatLink = re.findall(patFinderLink, webpage)
listIterator = []
listIterator[:] = range(2,16)
for i in listIterator:
    print(findPatTitle[i])
    print(findPatLink[i])
    print("\n")
I think it is a beautiful soup issue. Since I made this tutorial, I no longer use that library. I’ll take a look to see if that is the issue.
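One thing worth checking, since an earlier commenter hit the same error: the urlopen call in the posted code never calls .read(), so re.findall receives a response object instead of a string, and on Python 3 the bytes also need decoding. A hedged sketch of the fix:

from urllib.request import urlopen

# read() returns the page bytes; decode() turns them into the string re.findall expects
webpage = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/LatestNews').read().decode('utf-8', 'replace')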
Hi,
I’m getting an IndexError: list index out of range. I’ve stripped down the code to its basics and still have the problem. I would be very thankful if you could help me out.
Fixed. Indentation problem…
Great! I’m glad you fixed it 🙂
Hi,
Would you comment on how you would do the following using BeautifulSoup? I feel like it can be simplified significantly, but I run into errors when I try doing it with soup.
# def get_urls_of_restaurant():
#     list_urls = []
#     n = 0
#     nn = 0
#     for i in range(4):
#         url = urlopen('http://www.tripadvisor.com/Restaurants-g293748-oa' + str(nn) + '-Dar_es_Salaam.html').readlines()  # open URL which lists restaurants
#         while n < len(url):
#             if '"' not in url[n] and '' in url[n] and len(url[n]) > 5:
#                 list_urls.append(url[n-1].split('"')[1])
#             n += 1
#         n = 0
#         nn += 30
#     list_urls.reverse()
#     print "Getting urls done! Got %s" % len(list_urls) + ' urls.'
#     return list_urls
Hi, What version of BeautifulSoup are you using? They changed a bunch of things lately and of course it isn’t backwards compatible. What errors are you getting? Thanks – Derek
I am using beautifulsoup4
That is the problem. I made this tutorial using BeautifulSoup 3. Over time I started having trouble with this library as well, and now I just use regular expressions to perform scraping techniques.
How do I check the URL that I’m going to is not a broken URL?
Check out this article: How to check for broken url in Python
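Roughly, here is one way to do it with the same urllib used in the tutorial (a sketch; the URL is just a placeholder):

import urllib

try:
    page = urllib.urlopen('http://example.com/maybe-broken')
    if page.getcode() == 200:
        print 'URL looks fine'
    else:
        print 'Server answered with status', page.getcode()
except IOError:
    print 'Could not reach the URL at all'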
Hey, great tutorial, I plan to check out more of them. Your directions on installing Beautiful Soup were way better than the instructions on Beautiful Soup's website.
Frank
Thank you 🙂 I do my best to make everything easy.
Hi derek
I had a question about Python: is there a way to handle exceptions of a program you don't know about (i.e., you want to catch a simple exception in a program someone else has written)? How would you invoke their program in your try/except block, or could they invoke the try/except block in their program?
Thank you for your time
Edit: I thought about using execfile() but it didn't help.
You can catch all Python exceptions like this, but it isn't recommended:
try:
    result = 1 / 0  # do stuff that might fail
except Exception, e:
    print e
Hi Derek
Thank you for your answer. My problem statement was like this:
1) Prog 1: a debugger (try/except statements for a few errors), written by me
2) Prog 2: a regular prog, written by my team, by person X
Person X wants to know if certain exceptions are occurring in prog 2; he imports prog 1 and, tada, he can catch the custom exceptions.
I just want to know: how do I write prog 1 so it works like that?
Hi Derek, I'm a fan of your videos about scraping in Python, but I have a problem: my Python IDLE doesn't recognize BeautifulSoup. What can I do to get this import recognized?
Regards!
Are you using Python 2.7? Type python -V in the terminal or console to find out
Hey! I just want to ask you a question: is it possible to do website scraping without the Beautiful Soup module? I am currently making a program that runs different Python scripts, and I don't want other users who use it to have to manually download Beautiful Soup. I also need to know what I should do when the tags are combined together, for example:
link = http://linkhere … title = titlehere
Edit: For the example, it should have been:
“link = http://linkhere … title = titlehere”
Sure, you can do that, and I actually prefer to scrape without Beautiful Soup; you just have to use regular expressions. I cover how later in this tutorial. I also show how to scrape complicated stuff in PHP here: Website Scraping with PHP. The regex can easily be used in Python. I hope that helps.
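As a rough sketch, the standard library alone can pull the titles from the same feed used in the tutorial above:

from urllib import urlopen
import re

webpage = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/LatestNews').read()
for title in re.findall('<title>(.*)</title>', webpage):
    print title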
Hi Derek,
Is it possible to get BS to do this?
Go to website homepage (video site)
Enter the article (normally by clicking thumbnail’s url)
Grab the title
Grab the video’s tags
Grab the video's embed code
Then go back and move on to the next article
It normally works but BS can get buggy at times. Try it out and see
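Something along these lines might be a starting point with BeautifulSoup 3; the tag and class names are invented, so adjust them to whatever the real site uses:

from urllib import urlopen
from BeautifulSoup import BeautifulSoup

home = BeautifulSoup(urlopen('http://example.com/').read())
for thumb in home.findAll('a', {'class': 'thumbnail'}):  # hypothetical thumbnail links
    article = BeautifulSoup(urlopen(thumb['href']).read())
    print article.find('title').string                   # the title
    print article.findAll('meta', {'name': 'keywords'})  # the video's tags, if kept here
    print article.find('embed')                          # the embed code, if an embed tag is used
    # the loop then moves on to the next thumbnail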
Great tutorials, thank you (and keep it up)! QUESTION: How to scrape when the data spans multiple pages (with pagination, such as < Last >>)? Thank you!
Thank you 🙂 You’ll have to grab multiple pages depending upon how everything is set up. This would be one of those cases by case problems
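As one sketch of the idea, assuming a hypothetical ?page= parameter, you keep fetching until a page comes back empty:

from urllib import urlopen
import re

page = 1
while True:
    html = urlopen('http://example.com/articles?page=%d' % page).read()
    titles = re.findall('<title>(.*)</title>', html)
    if not titles:
        break  # ran out of pages
    for title in titles:
        print title
    page += 1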
Could you post an example of web scraping where webpages are loaded with JavaScript (and extracting information from such a site)?
I have numerous tutorials on regular expressions on my site. This regex tutorial may help the most. It is a php tutorial, but the regular expression part is the same in python
Can you tell me how to make a hotel booking system in Tkinter?
That is a pretty big project. Can you break it down into just the parts that are confusing to you?
Thank you very much
You’re very welcome 🙂
Hi mates, how is everything? As for what I want to say about this post:
in my view it's actually awesome for me.
Thank you very much 🙂
Great post about screen scraping. This works pretty well for a single page or website. For a large project, the Scrapy framework is more suitable than the method mentioned in this post.
Thank you 🙂 Yes, I was just playing around, and mainly I was teaching what goes on with a screen scraping tool.
I really like your tutorial; it made things easy to understand.
I have one problem I have been trying to figure out. I have a big file with over 200 lines. Here is an example of one of the lines:
DateTitle
The file name has the date in it, and the title. The date code is Y/M/D. The ldm is an abbreviation for the title. How can I make a script to get the date code and put it in the Date spot, and the title in the Title spot? I have several HTML files like this, each one with several hundred lines, all just like this.
Any ideas are GREATLY appreciated!!!
Thank you 🙂 I made a Regex tutorial that seems to help people. It was done using PHP, but the regular expression codes work exactly the same way in Python. Here is the video PHP Regex Tutorial. I explain the thought process behind using regex in complex situations. I hope it helps
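Purely as an illustration, with a made-up sample line (the real file format may differ), a pattern along these lines would pull out both pieces:

import re

# Hypothetical line; the real files use a Y/M/D date code and an abbreviation like 'ldm'
line = '<a href="2012/05/14_ldm.html">Some Title</a>'
match = re.search(r'(\d{4}/\d{1,2}/\d{1,2}).*?>([^<]+)<', line)
if match:
    print 'Date:', match.group(1)
    print 'Title:', match.group(2)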
Hi Derek,
I am trying to do some web scraping for my research in Python, but I am unable to construct the URL. After going through your post, I thought I'd shoot you some of my questions. Any help is greatly appreciated.
I want to access this website from Python, submit my query to convert IDs, and then get the response back from the website into my program. While studying this I learned more about web scraping in Python. Is this the way to do it? For example, if I give input as
21707345
23482678
And choose option
PMID to PMCID (or NIHMSID)
And hit convert, it will return the PMCID and NIHMSID. I want these returned values to be used in my program. I guess that will just be parsing the results.
I went through basic YouTube videos for web scraping in Python, such as this: http://www.youtube.com/watch?v=f2h41uEi0xU
I also tried doing it by myself using BeautifulSoup, but with no success. I also tried Firebug in Firefox to get the URL that I can use to query the website.
As far as I got, the base URL is: http://www.ncbi.nlm.nih.gov/pmc/pmctopmid/
But I am unable to join it to the query to complete the URL.
When queried I get the and when I tried the complete query as http://www.ncbi.nlm.nih.gov/pmc/pmctopmid/pubmed/23482678/ it is not working.
So, I am guessing I am missing something here.
Thank you for your time and help.
The problem you’re having revolves around the difference between scraping sites using the GET versus the POST method. The GET method is easy to use and that is what I use in these tutorials when needed. This website however is using the POST method which can be very complicated and maybe even impossible because you can’t pass information in the URL.
As you noticed, when you pass information there is no change in the URL. This information may help you: http://stackoverflow.com/questions/15423286/using-button-on-a-page-with-python
I hope that helps
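For completeness, a rough sketch of a POST request in Python 2; the form field names here are guesses, so inspect the form with Firebug to see what the page really posts:

import urllib
import urllib2

# Hypothetical field names; the real form may differ
postData = urllib.urlencode({'ids': '21707345, 23482678', 'format': 'pmcid'})
response = urllib2.urlopen('http://www.ncbi.nlm.nih.gov/pmc/pmctopmid/', postData)
print response.read()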
Thanks a lot Derek. I am sure this helps me to learn things.
Great I’m very happy to be able to help 🙂
Sorry for not mentioning the website. The website I am trying to access is http://www.ncbi.nlm.nih.gov/pmc/pmctopmid/
Hi Derek!
I loved your Python tutorials!
I went over some stuff on my own and have created two well-commented *.py files covering:
1. Lambda expressions, functional programming tools, and list comprehension
(all following the 2.7.6 documentation).
2. Iteration generators and related stuff
(mostly following the stackoverflow question at http://stackoverflow.com/questions/231767/the-python-yield-keyword-explained)
I’d be proud if you wanted to make a couple of video tutorials out of these! Your video editing and narrating are probably much better than what I could muster, and you already have a crowd.
Please let me know (by e-mail, if at all possible) if you’d like me to post the *.py files on googledocs or whatever…
Good luck with your awesome project, I'm certainly going to continue using it!
Inon.
Hi Inon, Thank you for all of the nice compliments 🙂 Please feel free to post links to your code and most anything else in the comments here and I’ll gladly share them. Thank you very much for the offer!
Shoot… gotta fix all the indentations… Sorry!