8/21/2011

Regular expressions, a deeper visit

I did some deeper exercises with regular expressions in R. What I was trying to do is download some pdfs from the Ohio State human resources website; I was interested in the summary reports on faculty and staff. Before doing anything too fancy, I need to get the URLs first. Here I tried some techniques beyond the ones I used before in http://evertqin.blogspot.com/2011/03/quote-of-stock-price-in-r.html

The regular expression features I used this time are:

  1. . (dot), which matches any single character. Combined with the greedy *, .* matches any character zero or more times, all the way to the end of the string;
  2. Following the .*, I can add "(\\.pdf){1}" (in R, two backslashes are needed so that the . (dot) means a literal dot (@_@)), so the whole pattern looks like "staffcompmkt/.*(\\.pdf){1}". It means: find text that starts with "staffcompmkt/" and runs up to a ".pdf" (with the greedy *, up to the last ".pdf" on the line).
  3. An alternative to * (or +) is the lazy *? (or +?). I was really confused by the difference between these two. It turns out that with * (or +) the regular expression engine tries to match the preceding item as many times as possible before moving on to what follows, whereas the lazy version matches the preceding item as few times as possible, expanding only until the rest of the pattern can match. See the sketch right after this list.
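To see the difference, here is a minimal sketch with a made-up test string:

# perl = TRUE selects Perl-style regular expressions, which support the lazy *?
s <- 'staffcompmkt/a.pdf">link</a> staffcompmkt/b.pdf'
greedy <- regexpr("staffcompmkt/.*\\.pdf", s, perl = TRUE)
lazy   <- regexpr("staffcompmkt/.*?\\.pdf", s, perl = TRUE)
substr(s, greedy, greedy + attr(greedy, "match.length") - 1)
# greedy runs to the last ".pdf", i.e. the whole string above
substr(s, lazy, lazy + attr(lazy, "match.length") - 1)
# lazy stops at the first ".pdf": "staffcompmkt/a.pdf"
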
There is also a very useful R function, mapply. I used sapply a lot previously. sapply is good, but it effectively only iterates over one vector: the function it calls can have multiple arguments, yet only one of them varies from call to call while the others stay the same, and the result comes back as a larger vector or matrix.
With mapply, you can pass several vectors of the same length to a single function. mapply takes one entry from each vector at a time and applies the function to them, until the vectors are exhausted.
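For instance, a toy call with made-up vectors, just to show the element-wise pairing:

# mapply pairs the vectors element by element: f(1,4), f(2,5), f(3,6)
mapply(function(a, b) a + b, 1:3, 4:6)
# [1] 5 7 9
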
As I did in this exercise:
files = mapply( function(x,startposition,length){substr(x,startposition,startposition + length)},
       x = indexContent[which(match != -1)],
       startposition = match[which(match != -1)],
       length = attr(match, 'match.length')[which(match != -1)] - 1)

Because the regexpr function in R returns the starting position of each match as a vector, with the length of each match stored in the "match.length" attribute, I need to extract the relative URL of each individual pdf from the original index file with the substr function. Bear in mind that the start and length are different for each line of the index, so mapply is perfect for this purpose, as shown above.
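A quick made-up illustration of what regexpr hands back:

m <- regexpr("b+", c("abbbc", "xyz"))
m                              # 2 -1, the start positions; -1 means no match
attr(m, "match.length")        # 3 -1, the length of each match
substr("abbbc", m[1], m[1] + attr(m, "match.length")[1] - 1)   # "bbb"
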

Finally, a take-home message: if you want to save a pdf with the download.file function, you need to set mode = 'wb'. Only when written in binary mode can the downloaded pdfs be opened.
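For example, assuming URL holds one of the pdf addresses from the index (as in the source code below):

# Without mode = "wb" the pdf is written as text and comes out corrupted on Windows
download.file(URL, "College of Optometry.pdf", mode = "wb")
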

The source code:
# TODO: Gather data and analysis
# 
# Author: evert
###############################################################################

setwd('Blablalalala')

############################################################
#Constant Declaration
############################################################
rootURL = "http://hr.osu.edu/statistics/"
URL = "http://hr.osu.edu/statistics/staffcompmkt/College%20of%20Optometry.pdf"

indexURL = "http://hr.osu.edu/statistics/staffcompmkt_home.aspx"
if(!file.exists("index.html")) download.file(indexURL,'index.html',method = 'auto')
indexFile = file("index.html")
indexContent = readLines(indexFile)

(match <- regexpr("staffcompmkt/.*(\\.pdf){1}",indexContent, perl = TRUE))
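# match holds the start position of the pdf link on each line; lines with no link get -1,
# and the 'match.length' attribute stores the length of each match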

# mapply applies the function to the first argument, the second, the third...etc
files = mapply( function(x,startposition,length){substr(x,startposition,startposition + length)},
       x = indexContent[which(match != -1)],
       startposition = match[which(match != -1)],
       length = attr(match, 'match.length')[which(match != -1)] - 1)
cat(files)
# Create the subfolder if it does not exist
if (!file.exists("staffcompmkt")) dir.create("staffcompmkt")
for (file in files){
  URL = paste(rootURL,file, sep='')
  destination = paste("staffcompmkt/", gsub("%20", " ", substr(file, 14, nchar(file))), sep = '')
  # This only works if written in binary mode with mode = "wb"
  if(!file.exists(destination))
    download.file(URL, destination, method = 'auto', mode = "wb")
}

cat(files)
