R Script for K-Means Cluster Analysis

This document provides a brief overview of the kmeansc.R script which can be used to carry out K-means cluster analysis and related analyses on two-way tables. All procedures included in this script are part of freely available packages for the R statistical computing software and this script is simply meant to make it easier to carry out these analyses. You can download a sample data file along with the script to follow along with this example. Right click and click Save As for both of the files above. Sample output can be downloaded here.

File Format:
This script is designed to use the *.csv (comma separated value) file format. Microsoft Excel as well as the open-source program Calc can be used to produce files in this format from any tabular data. See the help sections for each of these programs for more information. For the purposes of this script, the file should be named "data.csv" in all lower case letters.

Table Format:
Tables should be formatted with each of the samples/observations as rows and each of the variables to be included as columns. The first row of the spreadsheet should be a header that labels each of the columns. The first column should be named "group" all in lower case letters. This column should contain information on how the observations are to be grouped (i.e., by region/site, etc.). The script is written such that this analysis will not work without this column. If you choose not to use groups for your analysis, you should fill this column with the same information (for example NA) for every row. All of the remaining columns should contain numerical data that will be used to define clusters. This analysis will not work if there are missing data in any rows or columns, so samples with missing data should be removed before running the script. A sample table format is shown below:

group

VarA

VarB

VarC

VarD

VarE

El Morro 3.867734733 6.763993603 10.69645667 3.47807071 13.26029891
El Morro 3.834225282 13.31093941 10.82968254 0.65799061 16.42231967
El Morro 6.281354025 5.243439187 10.66008141 1.312441089 16.27595885
Zuni 5.880036865 11.65656376 12.27343013 3.533976308 7.897465544
Zuni 6.721518026 1.881274844 10.13945267 2.477597931 11.31768366
Puerco 6.340362167 24.32552962 10.41703188 4.020033153 14.76344674
Puerco 4.311121233 10.88879727 11.01169512 2.283811961 16.90447911
Upper LC 2.63470996 4.270196967 10.22653629 2.502083543 14.46706922

Requirements for Running the Script:
In order to run this script, you must install the R statistical package (version 2.8). R can be downloaded for free here. Follow the instructions on the R site for installation procedures. In addition to this, this script requires three specific R packages to be installed (cluster, fpc, seriation). In order to install these two packages, simply click on the "packages" drop down menu at the top of the R window and click on "Install package(s)". Choose a CRAN mirror (it is best to choose the location closest to you). Select the "cluster," "fpc," or "seriation" package and click OK. Repeat the process for the other two packages. For further instructions for installing packages, check here.

Starting the Script:
The first step for running the script is to place the script file "kmeansc.R" and the file "data.csv" in the working directory of R. To change the working directory, click on "File" in the R window and select "Change dir", then simply browse to the directory that you would like to use as the working directory. Next, to actually run the script, type the following line into the R command line:

source('kmeansc.R')

Running the Script:
After typing the command above into the command line, the graphics window will open and display a plot representing the within groups sum of squares on the Y axis and the cluster level along the X axis. The command window will then print a suggested clustering level to used based on the pamk function described here. Next, click on the command window and type the clustering level that you would like to use in the window and hit enter. The script will then assign this cluster level and output a file including the original data along with the cluster assignment as "kmeans_out.csv". Next, click on the graphics window to display bar charts of cluster assignments by "group". Click on the graphics window once more to display a Multi-Dimensional scaling plot of the K-means clusters. Finally, click once more to display the dissimilarity plot described here. At this point, all of the output is saved to a pdf file in the working directory called "kmeans_output.pdf".

Script:
# The text below is the text of the script included in the file "kmeansc.R".

library(cluster)
library(fpc)
library(seriation)
data1 = read.table(file='data.csv', sep=',', header=T)
group <- data1$group
data1$group <- NULL
kdata <- na.omit(data1)
kdata <- scale(kdata)
wss <- (nrow(kdata)-1)*sum(apply(kdata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(kdata, centers=i)$withinss)
print('Testing K-Means solutions')
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within Groups Sum of Squares")
suggest <- pamk(kdata)
print('Determining suggested number of clusters based on cluster distortion')
print(suggest$nc)
choose.clust <- function(){readline("What clustering solution would you like to use? ")}
clust.level <- as.integer(choose.clust())

fit <- kmeans(kdata, clust.level)
aggregate(kdata, by=list(fit$cluster), FUN=mean)
clust.out <- fit$cluster
kclust <- as.matrix(cbind(clust.out, kdata))
write.table(kclust, file="kmeans_out.csv", sep=",")
print('Applying K-means cluster level and appending files')

# BAR PLOT

par(ask=TRUE)
print('Creating bar chart of clusters by Group')
b.plot <- table(clust.out, group)
b.plot.mat <- as.matrix(b.plot)
b.plot.per <- prop.table(b.plot.mat, margin=2)*100
barplot(b.plot.per, main="K-means Clusters by Group", ylim=c(0,100), ylab="Percent", beside=TRUE, cex.names=0.5, col=rainbow(clust.level))

# PLOT CONFIDENCE INTERVALS OVER MDS PLOT

par(ask=TRUE)
print('Plotting confidence intervals over first two MDS dimensions')
clusplot(kdata, fit$cluster, shade=F, labels=0, lines=0, color=T)

# DISSIMILARITY PLOT

print('Creating dissimilarity plot')
d <- dist(kdata,method='euclidean')
l <- fit$cluster
res <- dissplot(d,labels=l,options=list(main='Dissimilarity Plot - Standard'))

#OUTPUT

print('Sending all output to PDF file')
pdf(file="kmeans_output.pdf")
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within Groups Sum of Squares")
barplot(b.plot.per, main="K-means Clusters by Group", ylim=c(0,100), ylab="Percent", beside=TRUE, cex.names=0.5, col=rainbow(clust.level))
clusplot(kdata, fit$cluster, shade=F, labels=0, lines=0, color=T)
plot(res, options=list(main='Dissimilarity Plot - Standard'))
dev.off()