R Script for K-Means Cluster Analysis
This document provides a brief overview of the kmeansc.R script which can be used to carry out K-means cluster analysis and related analyses on two-way tables. All procedures included in this script are part of freely available packages for the R statistical computing software and this script is simply meant to make it easier to carry out these analyses. You can download a sample data file along with the script to follow along with this example. Right click and click Save As for both of the files above. Sample output can be downloaded here.
File Format:
This script is designed to use the *.csv (comma separated value) file format.
Microsoft Excel as well as the open-source program Calc can be used to produce
files in this format from any tabular data. See the help sections for each of
these programs for more information. For the purposes of this script, the file
should be named "data.csv" in all lower case letters.
Table Format:
Tables should be formatted with each of the samples/observations as rows and
each of the variables to be included as columns. The first row of the spreadsheet
should be a header that labels each of the columns. The first column should
be named "group" all in lower case letters. This
column should contain information on how the observations are to be grouped
(i.e., by region/site, etc.). The script is written such that this analysis
will not work without this column. If you choose not to use groups for your
analysis, you should fill this column with the same information (for example
NA) for every row. All of the remaining columns should contain numerical data
that will be used to define clusters. This analysis will not work if there are
missing data in any rows or columns, so samples with missing data should be
removed before running the script. A sample table format is shown below:
group |
VarA |
VarB |
VarC |
VarD |
VarE |
| El Morro | 3.867734733 | 6.763993603 | 10.69645667 | 3.47807071 | 13.26029891 |
| El Morro | 3.834225282 | 13.31093941 | 10.82968254 | 0.65799061 | 16.42231967 |
| El Morro | 6.281354025 | 5.243439187 | 10.66008141 | 1.312441089 | 16.27595885 |
| Zuni | 5.880036865 | 11.65656376 | 12.27343013 | 3.533976308 | 7.897465544 |
| Zuni | 6.721518026 | 1.881274844 | 10.13945267 | 2.477597931 | 11.31768366 |
| Puerco | 6.340362167 | 24.32552962 | 10.41703188 | 4.020033153 | 14.76344674 |
| Puerco | 4.311121233 | 10.88879727 | 11.01169512 | 2.283811961 | 16.90447911 |
| Upper LC | 2.63470996 | 4.270196967 | 10.22653629 | 2.502083543 | 14.46706922 |
Requirements for Running the
Script:
In order to run this script, you must install the R statistical package
(version 2.8). R can be downloaded for free here.
Follow the instructions on the R site for installation procedures. In addition
to this, this script requires three specific R packages to be installed (cluster,
fpc, seriation). In order to install these two packages, simply click on the
"packages" drop down menu at the top of the R window and click on
"Install package(s)". Choose a CRAN mirror (it is best to choose the
location closest to you). Select the "cluster," "fpc," or
"seriation" package and click OK. Repeat the process for the other
two packages. For further instructions for installing packages, check here.
Starting the Script:
The first step for running the script is to place the script file "kmeansc.R"
and the file "data.csv" in the working directory of R. To change the
working directory, click on "File" in the R window and select "Change
dir", then simply browse to the directory that you would like to use as
the working directory. Next, to actually run the script, type the following
line into the R command line:
source('kmeansc.R')
Running the Script:
After typing the command above into the command line, the graphics
window will open and display a plot representing the within groups sum of squares
on the Y axis and the cluster level along the X axis. The command window will
then print a suggested clustering level to used based on the pamk function described
here.
Next, click on the command window and type the clustering level that you would
like to use in the window and hit enter. The script will then assign this cluster
level and output a file including the original data along with the cluster assignment
as "kmeans_out.csv". Next, click on the graphics window to display
bar charts of cluster assignments by "group". Click on the graphics
window once more to display a Multi-Dimensional scaling plot of the K-means
clusters. Finally, click once more to display the dissimilarity plot described
here. At
this point, all of the output is saved to a pdf file in the working directory
called "kmeans_output.pdf".
Script:
# The text below is the text of the script included in the file "kmeansc.R".
library(cluster)
library(fpc)
library(seriation)
data1 = read.table(file='data.csv', sep=',', header=T)
group <- data1$group
data1$group <- NULL
kdata <- na.omit(data1)
kdata <- scale(kdata)
wss <- (nrow(kdata)-1)*sum(apply(kdata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(kdata, centers=i)$withinss)
print('Testing K-Means solutions')
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within
Groups Sum of Squares")
suggest <- pamk(kdata)
print('Determining suggested number of clusters based on cluster distortion')
print(suggest$nc)
choose.clust <- function(){readline("What clustering solution would
you like to use? ")}
clust.level <- as.integer(choose.clust())
fit <- kmeans(kdata, clust.level)
aggregate(kdata, by=list(fit$cluster), FUN=mean)
clust.out <- fit$cluster
kclust <- as.matrix(cbind(clust.out, kdata))
write.table(kclust, file="kmeans_out.csv", sep=",")
print('Applying K-means cluster level and appending files')
# BAR PLOT
par(ask=TRUE)
print('Creating bar chart of clusters by Group')
b.plot <- table(clust.out, group)
b.plot.mat <- as.matrix(b.plot)
b.plot.per <- prop.table(b.plot.mat, margin=2)*100
barplot(b.plot.per, main="K-means Clusters by Group", ylim=c(0,100),
ylab="Percent", beside=TRUE, cex.names=0.5, col=rainbow(clust.level))
# PLOT CONFIDENCE INTERVALS OVER MDS PLOT
par(ask=TRUE)
print('Plotting confidence intervals over first two MDS dimensions')
clusplot(kdata, fit$cluster, shade=F, labels=0, lines=0, color=T)
# DISSIMILARITY PLOT
print('Creating dissimilarity
plot')
d <- dist(kdata,method='euclidean')
l <- fit$cluster
res <- dissplot(d,labels=l,options=list(main='Dissimilarity Plot - Standard'))
#OUTPUT
print('Sending all output
to PDF file')
pdf(file="kmeans_output.pdf")
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within
Groups Sum of Squares")
barplot(b.plot.per, main="K-means Clusters by Group", ylim=c(0,100),
ylab="Percent", beside=TRUE, cex.names=0.5, col=rainbow(clust.level))
clusplot(kdata, fit$cluster, shade=F, labels=0, lines=0, color=T)
plot(res, options=list(main='Dissimilarity Plot - Standard'))
dev.off()