R Script for Ward's Method of Hierarchical Cluster Analysis
This document provides a brief overview of the wardc.R script which can be used to carry out Ward's method of hierarchical cluster analysis on two-way tables. All procedures included in this script are part of freely available packages for the R statistical computing software and this script is simply meant to make it easier to carry out these analyses. You can download a sample data file along with the script to follow along with this example. Right click and click Save As for both of the files above. Sample output can be downloaded here.
File Format:
This script is designed to use the *.csv (comma separated value) file format.
Microsoft Excel as well as the open-source program Calc can be used to produce
files in this format from any tabular data. See the help sections for each of
these programs for more information. For the purposes of this script, the file
should be named "data.csv" in all lower case letters.
Table Format:
Tables should be formatted with each of the samples/observations as rows and
each of the variables to be included as columns. The first row of the spreadsheet
should be a header that labels each of the columns. The first column should
be named "group" all in lower case letters. This
column should contain information on how the observations are to be grouped
(i.e., by region/site, etc.). The script is written such that this analysis
will not work without this column. If you choose not to use groups for your
analysis, you should fill this column with the same information (for example
NA) for every row. All of the remaining columns should contain numerical data
that will be used to define clusters. This analysis will not work if there are
missing data in any rows or columns, so samples with missing data should be
removed before running the script. A sample table format is shown below:
group |
VarA |
VarB |
VarC |
VarD |
VarE |
| El Morro | 3.867734733 | 6.763993603 | 10.69645667 | 3.47807071 | 13.26029891 |
| El Morro | 3.834225282 | 13.31093941 | 10.82968254 | 0.65799061 | 16.42231967 |
| El Morro | 6.281354025 | 5.243439187 | 10.66008141 | 1.312441089 | 16.27595885 |
| Zuni | 5.880036865 | 11.65656376 | 12.27343013 | 3.533976308 | 7.897465544 |
| Zuni | 6.721518026 | 1.881274844 | 10.13945267 | 2.477597931 | 11.31768366 |
| Puerco | 6.340362167 | 24.32552962 | 10.41703188 | 4.020033153 | 14.76344674 |
| Puerco | 4.311121233 | 10.88879727 | 11.01169512 | 2.283811961 | 16.90447911 |
| Upper LC | 2.63470996 | 4.270196967 | 10.22653629 | 2.502083543 | 14.46706922 |
Requirements for Running the
Script:
In order to run this script, you must install the R statistical package
(version 2.8). R can be downloaded for free here.
Follow the instructions on the R site for installation procedures. In addition
to this, this script requires a specific R packages to be installed (cluster).
In order to install this package, simply click on the "packages" drop
down menu at the top of the R window and click on "Install package(s)".
Choose a CRAN mirror (it is best to choose the location closest to you). Select
the "cluster" package and click OK. For further instructions for installing
packages, check here.
Starting the Script:
The first step for running the script is to place the script file "wardc.R"
and the file "data.csv" in the working directory of R. To change the
working directory, click on "File" in the R window and select "Change
dir", then simply browse to the direct that you would like to use as the
working directory. Next, to actually run the script, type the following line
into the R command line:
source('wardc.R')
Running the Script:
After typing the command above into the command line, the graphics
window will open and display a tree of the data and cluster levels. The command
window will then print a message requesting that you input the clustering level
that you would like to use. Click on the command window and type the clustering
level that you would like to use in the window and hit enter. The script will
then assign this cluster level and output a file including the original data
along with the cluster assignment as "ward_out.csv". Next, click on
the graphics window and the script will show red rectangles around the cluster
assignments on the data tree. Finally, click on the graphics window once more
to display bar charts of cluster assignments by "group". At this point,
all of the output is saved to a pdf file in the working directory called "ward_output.pdf".
Script:
# The text below is the text of the script included in the file "wardc.R".
library(cluster)
data1 = read.table(file='data.csv', sep=',', header=T)
group <- data1$group
data1$group <- NULL
wdata <- na.omit(data1)
wdata <- scale(wdata)
fit <- hclust(dist(wdata), method='ward')
plot(fit, labels=FALSE)
par(ask=TRUE)
choose.clust <- function(){readline("What clustering solution would
you like to use? ")}
clust.level <- as.integer(choose.clust())
groups <- cutree(fit, clust.level)
rect.hclust(fit, clust.level, border='red')
clust.out <- as.matrix(groups)
colnames(clust.out) <- c("cluster")
wclust <- as.matrix(cbind(clust.out, data1))
write.table(wclust, file="ward_out.csv", sep=",")
print('Applying cluster level and appending files')
par(ask=TRUE)
print('Creating bar chart of clusters by Group')
b.plot <- table(clust.out, group)
b.plot.mat <- as.matrix(b.plot)
b.plot.per <- prop.table(b.plot.mat, margin=2)*100
barplot(b.plot.per, main="Clusters by Group", ylim=c(0,100), ylab="Percent",
beside=TRUE, cex.names=0.5, col=rainbow(clust.level))
print('Sending all output
to PDF file')
pdf(file="ward_output.pdf")
plot(fit, labels=FALSE)
rect.hclust(fit, clust.level, border='red')
barplot(b.plot.per, main="K-means Clusters by Group", ylim=c(0,100),
ylab="Percent", beside=TRUE, cex.names=0.5, col=rainbow(clust.level))
dev.off()