7/20/2011

Classification and Discrimination with R

I did some exercise with R today. I have been trying to play a bit with the minimum expected  cost of misclassification (ECM) method. It is pretty straightforward and self-explanatory. The only tricks are drawing of the linear function that separate the two regions and the drawing of confidence ellipse.

To draw confidence ellipse, we need 'ellipse' package or 'car' package. The former comes in handy because it generates a set of axis in a matrix form which can be manipulated.

The data comes from Johnson &Wicherm, Table 11.2 (The Salmon dataset). From the above figures of marginal qqplot and chisq plot, we can see the normality of the data is not bad. So we can go ahead and apply the minimum ECM method, assuming same prior and equal costs.
The figure shows the scatter plot of two variables, the linear function that separate the two regions and the 95% confidence ellipse for each type of fish.

As stated in the book, since 'Alaskan' Fish has larger variance, the classification imposes more penalty on that type of fish. From the plot, we can see there are more red area below the threshold. 

My next step is to try some other classification methods.



library(ggplot2)
library(sqldf)
library(ellipse)

setwd('your path')

dataset = read.table("T11-2.dat",col.names = c('Membership','Gender','Freshwater','Marine'))
# Draw the marginal qqplot
windows(9,5)
par(mfrow=c(1,2))
qqnorm(dataset$Freshwater)
qqline(dataset$Freshwater)
qqnorm(dataset$Marine)
qqline(dataset$Marine)

# Draw a chisq plot
chisq.plot(dataset[,3:4], ask = F) # The chisq plot looks ok. Bivariate normality holds

# calculate sample statistics
# Alaskan
x1 = dataset[which(dataset$Membership == 1),2:4]
x.bar1 = mean(x1[,-1])
x.bar1 = as.matrix(x.bar1,nrow=2) # convert data.frame to matrix
#Canadian
x2 = dataset[which(dataset$Membership == 2),2:4]
x.bar2 = mean(x2[,-1])
x.bar2 = as.matrix(x.bar2,nrow=2) # convert data.frame to matrix
# calculate sample covariance matrix
S1 = cov(x1[,-1])
S2 = cov(x2[,-1])
n1 = length(which(dataset$Membership == 1))
n2 = length(which(dataset$Membership == 2))
S.pool = (n1-1)/((n1-1)+(n2-1))*S1 + (n2-1)/((n1-1)+(n2-1))*S2

# EDA 
dataset.factor = sqldf("SELECT Gender, Freshwater, Marine, 
              CASE WHEN Membership == 1 THEN 'Alaskan'
                   WHEN Membership == 2 THEN 'Canadian'
              END  membership FROM dataset")


p = ggplot(data = dataset.factor, aes(x = Freshwater, y = Marine, colour = membership))
p = p + geom_point() + labs(colour = '')


# classification
a.hat.transpose = t(x.bar1-x.bar2)%*%solve(S.pool) 
m.hat = 0.5*t(x.bar1-x.bar2)%*%solve(S.pool)%*%(x.bar1+x.bar2)
w = c(-m.hat, a.hat.transpose)

p = p + geom_abline(aes(intercept = m.hat/a.hat.transpose[2], 
  slope = -a.hat.transpose[1]/a.hat.transpose[2], colour = 'Threshold'), size = 1 )
p = p + geom_path(aes(x = ellipse(S1,centre = x.bar1)[,1], y = ellipse(S1, centre = x.bar1)[,2], colour = 'Alaskan'),size = 1)
p = p + geom_path(aes(x = ellipse(S2,centre = x.bar2)[,1], y = ellipse(S2, centre = x.bar2)[,2], colour = 'Canadian'),size = 1)
p = p + opts(title = expression('Classification with 95% confidence ellipse'))
p

No comments: