To draw confidence ellipse, we need 'ellipse' package or 'car' package. The former comes in handy because it generates a set of axis in a matrix form which can be manipulated.
The data comes from Johnson &Wicherm, Table 11.2 (The Salmon dataset). From the above figures of marginal qqplot and chisq plot, we can see the normality of the data is not bad. So we can go ahead and apply the minimum ECM method, assuming same prior and equal costs.
The figure shows the scatter plot of two variables, the linear function that separate the two regions and the 95% confidence ellipse for each type of fish.
As stated in the book, since 'Alaskan' Fish has larger variance, the classification imposes more penalty on that type of fish. From the plot, we can see there are more red area below the threshold.
My next step is to try some other classification methods.
library(ggplot2) library(sqldf) library(ellipse) setwd('your path') dataset = read.table("T11-2.dat",col.names = c('Membership','Gender','Freshwater','Marine')) # Draw the marginal qqplot windows(9,5) par(mfrow=c(1,2)) qqnorm(dataset$Freshwater) qqline(dataset$Freshwater) qqnorm(dataset$Marine) qqline(dataset$Marine) # Draw a chisq plot chisq.plot(dataset[,3:4], ask = F) # The chisq plot looks ok. Bivariate normality holds # calculate sample statistics # Alaskan x1 = dataset[which(dataset$Membership == 1),2:4] x.bar1 = mean(x1[,-1]) x.bar1 = as.matrix(x.bar1,nrow=2) # convert data.frame to matrix #Canadian x2 = dataset[which(dataset$Membership == 2),2:4] x.bar2 = mean(x2[,-1]) x.bar2 = as.matrix(x.bar2,nrow=2) # convert data.frame to matrix # calculate sample covariance matrix S1 = cov(x1[,-1]) S2 = cov(x2[,-1]) n1 = length(which(dataset$Membership == 1)) n2 = length(which(dataset$Membership == 2)) S.pool = (n1-1)/((n1-1)+(n2-1))*S1 + (n2-1)/((n1-1)+(n2-1))*S2 # EDA dataset.factor = sqldf("SELECT Gender, Freshwater, Marine, CASE WHEN Membership == 1 THEN 'Alaskan' WHEN Membership == 2 THEN 'Canadian' END membership FROM dataset") p = ggplot(data = dataset.factor, aes(x = Freshwater, y = Marine, colour = membership)) p = p + geom_point() + labs(colour = '') # classification a.hat.transpose = t(x.bar1-x.bar2)%*%solve(S.pool) m.hat = 0.5*t(x.bar1-x.bar2)%*%solve(S.pool)%*%(x.bar1+x.bar2) w = c(-m.hat, a.hat.transpose) p = p + geom_abline(aes(intercept = m.hat/a.hat.transpose[2], slope = -a.hat.transpose[1]/a.hat.transpose[2], colour = 'Threshold'), size = 1 ) p = p + geom_path(aes(x = ellipse(S1,centre = x.bar1)[,1], y = ellipse(S1, centre = x.bar1)[,2], colour = 'Alaskan'),size = 1) p = p + geom_path(aes(x = ellipse(S2,centre = x.bar2)[,1], y = ellipse(S2, centre = x.bar2)[,2], colour = 'Canadian'),size = 1) p = p + opts(title = expression('Classification with 95% confidence ellipse')) p
No comments:
Post a Comment