5/09/2011

Get data from a website

The UEFA Champions League final is around the corner, and I found an amazing website that summarizes the performance of each team since 1955. As the first step of data mining, I will just go ahead and grab the data from the website. This is quite a simple procedure, which is covered in STAT 133 at Berkeley; I am posting it here for future reference. I am going to use these data (maybe in combination with some other data) to do some modeling work.

In R, there is a very powerful function, download.file(), for grabbing HTML pages from a website. I use it in combination with regular expressions to extract the address of each individual page from the index page, so that I can then download every individual html file.
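
As a quick illustration of the approach (the line of HTML below is made up for illustration, not taken from the actual index page):

# toy example: pick out a line that links to a data page, then pull the link out of it
line <- '<td><a href="method1/match1956.html">1955/56</a></td>'
grepl('^<td><a href="method', line)      # TRUE: this line links to a data page
sub('.*href="([^"]+)".*', '\\1', line)   # extracts "method1/match1956.html"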

NOTE: As a courtesy to the website maintainer, please download the data ONLY once and do the rest of the work on your local copy.


# TODO: Get data from website
# 
# Author: Roger Everett
###############################################################################

setwd('Your directory')  # change this to your own working directory

############################################################
#Constant Declaration
############################################################
items <- c('match','ccoef','crank','tcoef','trank')    # file-name prefixes of the five kinds of data files (not used in this script)
method <- c('method1','method2','method3','method4')   # the four method sub-directories on the site
base_path <- 'http://www.xs4all.nl/~kassiesa/bert/uefa/data/'


# first download the main page
url <- 'http://www.xs4all.nl/~kassiesa/bert/uefa/data/index.html'
# as a courtesy to the maintainer of the web site, only download the index page
# if we don't already have a local copy, so we don't strain the server
if (!file.exists("index.html")) download.file(url,'index.html',method = 'auto')
indexFile <- file("index.html",'r') # open the saved file for read only
indexContent <- readLines(indexFile)
# extract the individual file urls from the index page:
# keep only the lines that link to a data page
data_lines <- grep("^(<td><a href=\"method.)", indexContent, perl = TRUE, value = TRUE)
# url_name stores the unique identifier of each html file, e.g. method3/ccoef2005.html
# (on this index page the link always sits at characters 14-35 of the matched line)
url_name <- sapply(data_lines, function(x){substr(x,14,35)},
  simplify = TRUE, USE.NAMES = FALSE)
# url_list is the full download path of each html file
url_list <- sapply(url_name, function(x){paste(base_path, x, sep='')},
  simplify = TRUE, USE.NAMES = FALSE)
close(indexFile)

# create directories
# create one sub-directory per method, skipping any that already exist
for (i in 1:length(method)) {
  if (!file.exists(method[i])) dir.create(method[i])
}
# gather the data: download each page into its method sub-directory,
# skipping any file that has already been downloaded
for (i in 1:length(url_list)) {
  if (!file.exists(url_name[i])) download.file(url_list[i], url_name[i], method = 'auto')
}
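
As an aside, here is a slightly more defensive sketch of the link extraction (not what the script above does): pull the href attribute out with a backreference instead of relying on it sitting at characters 14-35, so it keeps working even if the file names change length.

# sketch only: extract the href attribute with a backreference rather than
# fixed character positions; reuses indexContent and base_path from above
hits     <- grep('^<td><a href="method', indexContent, value = TRUE)
url_name <- sub('.*href="([^"]+)".*', '\\1', hits)
url_list <- paste(base_path, url_name, sep = '')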

