01
Nov
09

Rapid downloading

Now that it’s the end of the semester, I’ve been grabbing the notes for courses that look interesting and using them to decide whether to take those courses next semester. The computer science department often posts course notes on its website (publicly viewable!) throughout the semester, so by the end there is a nice collection of notes to download.

Therein lies the biggest problem: saving and downloading 12 weeks × 3 lectures a week of notes is a real pain, especially if I have to grab 5-6 different courses. So I set about writing a Python script to help me out. Unfortunately, I only discovered Python’s “websucker.py” script when I was midway through writing the code. Meh.

First, I had to parse the HTML and find all the links on a given page, then identify which of these links point to .pdf/.ppt files (what I want to download), then download and save them to a local directory. Sounds simple, but it took me about 3 hours in all to finish the script and get it working as I wanted. Currently it’s in a working state, but nothing too pretty.
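The “identify which links are .pdf/.ppt files” step could be sketched roughly like this. This is not the code from my script: it is written for Python 3 using only the standard library, and the names `WANTED_EXTENSIONS` and `is_wanted` are made up for illustration.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical set of extensions; the post only mentions .pdf and .ppt.
WANTED_EXTENSIONS = {".pdf", ".ppt"}


def is_wanted(href):
    """Return True if the link's path ends in one of the wanted extensions."""
    # Look only at the path, so query strings and fragments are ignored.
    path = urlsplit(href).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in WANTED_EXTENSIONS


# Made-up hrefs for illustration.
hrefs = [
    "2007examWithAnswers.pdf",
    "slides/week1.PPT",
    "pdf-guide.html",
    "notes.pdf?version=2",
]
wanted = [h for h in hrefs if is_wanted(h)]
print(wanted)
```

Checking the extension of the path, rather than searching for the text “pdf” anywhere in the link, avoids picking up a page like pdf-guide.html by accident.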

I used BeautifulSoup as an HTML parser (a somewhat odd name, but I’ve seen worse, e.g. CherryPy). I chanced upon it while googling for the best way to parse and search through HTML trees, and found a useful snippet of code on StackOverflow (the next best thing after Experts Exchange).

I started off by importing the required libraries:

import urlparse, urllib
from BeautifulSoup import BeautifulSoup
import re

Then I opened the website and parsed it:

url  = "http://www.cs.auckland.ac.nz/compsci340s2c/exams/"
source = urllib.urlopen(url).read()

soup = BeautifulSoup(source)
print soup.prettify()

BeautifulSoup(source) takes the page source and parses it into discrete tags that can be referenced. The soup object also supports many different operations, as seen later. soup.prettify() is used to print the HTML of the page, properly indented (hence the prettify moniker)!
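To make “parsing into discrete tags” a bit more concrete without needing BeautifulSoup installed, here is a small sketch using Python 3’s built-in html.parser instead. It is a swapped-in technique, not what my script does, and the `LinkCollector` class and the sample page are invented for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


# A made-up page fragment standing in for the downloaded source.
sample_html = """
<html><body>
  <a href="2007examWithAnswers.pdf">2007 exam</a>
  <a name="top">no href here</a>
  <a href="index.html">Home</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.hrefs)
```

The parser calls handle_starttag once per opening tag, so the anchor without an href is simply skipped.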

links = []
filenames = []
for item in soup.findAll('a'):
    # Keep only anchors whose markup mentions 'pdf'
    if 'pdf' in str(item):
        print item.renderContents()
        link = str(item['href'])
        links.append(link)
        # On this page the hrefs are relative, so each one doubles as the local filename
        filenames.append(link)
        print 'pdf found'

This bit of code searches for all the links and appends the PDF ones to a list. I should have used a dictionary, but that’s for version 2.0. soup.findAll('a') looks for all tags with the label 'a'. I then keep the ones that mention “pdf” and obtain the value of their 'href' attribute with item['href'], thank you BeautifulSoup! The renderContents function returns the content stored within the tags.
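For what it’s worth, the dictionary idea might look something like the sketch below: one mapping from local filename to the URL it comes from, instead of two parallel lists. It is Python 3 with the standard library only, and `build_download_map` is a hypothetical helper, not part of the script above.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def build_download_map(base_url, hrefs):
    """Map each local filename to the absolute URL it should be fetched from."""
    downloads = {}
    for href in hrefs:
        # urljoin resolves relative hrefs against the page URL.
        absolute_url = urljoin(base_url, href)
        # Use the last path component as the filename.
        filename = posixpath.basename(urlsplit(absolute_url).path)
        if filename:
            downloads[filename] = absolute_url
    return downloads


base_url = "http://www.cs.auckland.ac.nz/compsci340s2c/exams/"
hrefs = ["2007examWithAnswers.pdf", "2008examWithAnswers.pdf"]
for filename, link in build_download_map(base_url, hrefs).items():
    print(filename, "->", link)
```

One thing to keep in mind with this design is that two links ending in the same filename would overwrite each other in the dictionary.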

 

Finally, we download all the files:

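for index, item in enumerate(links):
    # Resolve the (possibly relative) href against the page URL
    item = urlparse.urljoin(url, item)
    print item
    print filenames[index]
    # Save under the same name as the original file
    filename = "c:\\cs340\\" + filenames[index]
    print filename
    urllib.urlretrieve(item, filename)
print 'finito'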

Currently, the directory path is hardcoded, but it should really be configurable (version 2.0 again!). The filenames are identical to those of the original files hosted on the webpage.
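A rough idea of how the “softcoded” directory might work in version 2.0 is to take it from the command line. This is only a sketch in Python 3: the `--dest` option and the `parse_args` and `local_path` helpers are assumptions of mine, and nothing here actually downloads anything.

```python
import argparse
import os


def parse_args(argv=None):
    """Read the page URL and the destination directory from the command line."""
    parser = argparse.ArgumentParser(
        description="Download linked notes from a course page."
    )
    parser.add_argument("url", help="page to scan for links")
    parser.add_argument("--dest", default=".", help="directory to save files into")
    return parser.parse_args(argv)


def local_path(dest_dir, filename):
    """Build the path a downloaded file should be saved to."""
    # basename() drops any directory parts, so files always land in dest_dir.
    return os.path.join(dest_dir, os.path.basename(filename))


# An explicit argument list is passed here so the example runs on its own;
# a real script would call parse_args() and read sys.argv instead.
args = parse_args(
    ["http://www.cs.auckland.ac.nz/compsci340s2c/exams/", "--dest", "cs340"]
)
print(local_path(args.dest, "2007examWithAnswers.pdf"))
```

Using os.path.join also means the path separator is no longer tied to Windows.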

After running the script, we get the following output:


http://www.cs.auckland.ac.nz/compsci340s2c/exams/2007examWithAnswers.pdf
2007examWithAnswers.pdf
c:\cs340\2007examWithAnswers.pdf
http://www.cs.auckland.ac.nz/compsci340s2c/exams/2008examWithAnswers.pdf
2008examWithAnswers.pdf
c:\cs340\2008examWithAnswers.pdf
finito

and we’re done! finito! Now, to put a UI on this and build version 2.0. I’ve had a real crap time trying to post code here while preserving the indenting, but using the “pre” and “code” tags seems to work.

