The regular expressions I used this time are:
- . (dot), which matches any single character. Combined into .* (greedy), it matches any character zero or more times, consuming as much of the string as it can;
- Following the .*, I can add "(\\.pdf){1}" (in an R string, two backslashes are needed so that the . (dot) means a literal dot), so the whole pattern looks like "staffcompmkt/.*(\\.pdf){1}". It matches a substring that starts with "staffcompmkt/" and ends with ".pdf".
- An alternative to * (or +) is the lazy quantifier *? (or +?). I was really confused by the difference between the two. It turns out that with * (or +) the regex engine matches the preceding item as many times as possible before moving on to what follows (so .*\\.pdf stretches to the last ".pdf" on a line), while the lazy version matches as few times as possible, expanding only until the following item can match (so it stops at the first ".pdf"). A minimal comparison is sketched after this list.
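Here is a minimal sketch (not from the original script; the line with two links is made up) of how the greedy and lazy versions behave:

line <- 'href="staffcompmkt/A.pdf">A</a> href="staffcompmkt/B.pdf">B</a>'
greedy <- regexpr("staffcompmkt/.*\\.pdf",  line, perl = TRUE)
lazy   <- regexpr("staffcompmkt/.*?\\.pdf", line, perl = TRUE)
# Greedy stretches to the last ".pdf": staffcompmkt/A.pdf">A</a> href="staffcompmkt/B.pdf
substr(line, greedy, greedy + attr(greedy, "match.length") - 1)
# Lazy stops at the first ".pdf": staffcompmkt/A.pdf
substr(line, lazy, lazy + attr(lazy, "match.length") - 1)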
There is also a very useful R function, mapply. I used sapply a lot previously. sapply is good, but it only iterates over one argument: the function it applies can take several arguments, yet only one vector is walked element by element while the other arguments stay the same for every call (sapply then simplifies the results into a vector or matrix).
With mapply, you can pass several vectors of the same length to a single function; mapply then walks through them in parallel, calling the function once per position with the i-th element of each vector. A toy comparison is sketched below.
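A toy comparison (not from the original post; the string and vectors are made up just to show the difference):

starts  <- c(1, 3, 5)
lengths <- c(2, 2, 3)
# sapply walks one vector; the other arguments stay fixed on every call
sapply(starts, function(s) substr("abcdefgh", s, s + 1))                           # "ab" "cd" "ef"
# mapply walks several vectors in parallel, one element of each per call
mapply(function(s, l) substr("abcdefgh", s, s + l - 1), s = starts, l = lengths)   # "ab" "cd" "efg"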
As I did in this practice:

files = mapply(function(x, startposition, length) { substr(x, startposition, startposition + length) },
               x             = indexContent[which(match != -1)],
               startposition = match[which(match != -1)],
               length        = attr(match, 'match.length')[which(match != -1)] - 1)
Because R's regexpr function returns the starting position of each match as an integer vector, with the match lengths stored in its match.length attribute, I need the substr function to extract the relative URL of each individual PDF from the original index file. Bear in mind that the start and length differ for every line of URL, so a mapply call is perfect for this purpose, as shown above. A small illustration of regexpr's return value follows.
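A quick sketch (not from the original script; the two input lines are invented) of what regexpr hands back:

lines <- c('<a href="staffcompmkt/A.pdf">A</a>', '<p>no link here</p>')
m <- regexpr("staffcompmkt/.*\\.pdf", lines, perl = TRUE)
m                         # start positions: 10 for the first line, -1 where nothing matched
attr(m, "match.length")   # match lengths: 18 for the first line, -1 where nothing matched
# Rebuild the matched text with substr: start + length - 1 is the end position
substr(lines[1], m[1], m[1] + attr(m, "match.length")[1] - 1)   # "staffcompmkt/A.pdf"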
Finally, a take-home message: if you want to save a PDF with the download.file function, you need to write it in binary mode with mode = 'wb'. Only when written in binary mode can the downloaded PDFs be opened.
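A minimal sketch of that point, using one of the report URLs from the script below (the destination file name is my own choice):

pdfURL <- "http://hr.osu.edu/statistics/staffcompmkt/College%20of%20Optometry.pdf"
# Without mode = "wb" the PDF is written as text (notably on Windows) and will not open
download.file(pdfURL, destfile = "College of Optometry.pdf", mode = "wb")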
The source code:
# TODO: Gather data and analysis
#
# Author: evert
###############################################################################
setwd('Blablalalala')

############################################################
# Constant Declaration
############################################################
rootURL  = "http://hr.osu.edu/statistics/"
URL      = "http://hr.osu.edu/statistics/staffcompmkt/College%20of%20Optometry.pdf"
indexURL = "http://hr.osu.edu/statistics/staffcompmkt_home.aspx"

# Download the index page once and read it line by line
if (!file.exists("index.html")) download.file(indexURL, 'index.html', method = 'auto')
indexFile = file("index.html")
indexContent = readLines(indexFile)

# Starting positions of the matches; -1 marks lines without a PDF link
(match <- regexpr("staffcompmkt/.*(\\.pdf){1}", indexContent, perl = TRUE))

# mapply applies the function to the first elements, then the second, the third... etc.
files = mapply(function(x, startposition, length) { substr(x, startposition, startposition + length) },
               x             = indexContent[which(match != -1)],
               startposition = match[which(match != -1)],
               length        = attr(match, 'match.length')[which(match != -1)] - 1)
cat(files)

# Create the subfolder if it does not exist
if (!file.exists("staffcompmkt")) dir.create("staffcompmkt")

for (file in files) {
  URL = paste(rootURL, file, sep = '')
  # Destination: drop the leading "staffcompmkt/" (13 characters) and decode %20
  destfile = paste("staffcompmkt/", gsub("%20", " ", substr(file, 14, nchar(file))), sep = '')
  if (!file.exists(destfile))
    # This only works if written in binary mode with mode = "wb"
    download.file(URL, destfile, method = 'auto', mode = "wb")
}
cat(files)