R Script for Ward's Method of Hierarchical Cluster Analysis

This document provides a brief overview of the wardc.R script, which carries out Ward's method of hierarchical cluster analysis on two-way tables. All procedures in the script come from freely available packages for the R statistical computing software; the script is simply meant to make these analyses easier to run. You can download a sample data file along with the script to follow along with this example. To save either file, right-click its link and choose Save As. Sample output is also available for download.

File Format:
This script is designed to use the *.csv (comma-separated values) file format. Microsoft Excel and the open-source program Calc can both produce files in this format from any tabular data. See the help sections of these programs for more information. For the purposes of this script, the file should be named "data.csv" in all lower case letters.
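
If the data are already in R rather than in a spreadsheet, a file in this format can also be written directly. The lines below are a minimal sketch, not part of wardc.R. The data frame name my_table is hypothetical, the values are copied from the sample table shown later in this document, and a temporary folder is used so that no existing file is overwritten.

```r
# Hypothetical example data: a "group" column followed by numeric variables.
my_table <- data.frame(
  group = c("El Morro", "El Morro", "Zuni"),
  VarA  = c(3.867734733, 3.834225282, 5.880036865),
  VarB  = c(6.763993603, 13.31093941, 11.65656376)
)

# Write to a temporary folder here; in practice the file would be saved
# as "data.csv" in the R working directory.
csv_path <- file.path(tempdir(), "data.csv")
write.csv(my_table, file = csv_path, row.names = FALSE)

# Read the file back the same way the script does to confirm the layout.
check <- read.table(file = csv_path, sep = ",", header = TRUE)
print(names(check))
```

Setting row.names = FALSE keeps R from adding an extra unnamed first column, so "group" stays the first column of the file.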

Table Format:
Tables should be formatted with each sample/observation as a row and each variable as a column. The first row of the spreadsheet should be a header that labels each of the columns. The first column should be named "group" in all lower case letters and should contain information on how the observations are to be grouped (e.g., by region or site). The script will not run without this column. If you choose not to use groups for your analysis, fill this column with the same value (for example NA) for every row. All of the remaining columns should contain the numerical data that will be used to define clusters. The analysis will not work if any row or column has missing data, so samples with missing data should be removed before running the script. A sample table format is shown below:

group     VarA         VarB         VarC         VarD         VarE
El Morro  3.867734733  6.763993603  10.69645667  3.47807071   13.26029891
El Morro  3.834225282  13.31093941  10.82968254  0.65799061   16.42231967
El Morro  6.281354025  5.243439187  10.66008141  1.312441089  16.27595885
Zuni      5.880036865  11.65656376  12.27343013  3.533976308  7.897465544
Zuni      6.721518026  1.881274844  10.13945267  2.477597931  11.31768366
Puerco    6.340362167  24.32552962  10.41703188  4.020033153  14.76344674
Puerco    4.311121233  10.88879727  11.01169512  2.283811961  16.90447911
Upper LC  2.63470996   4.270196967  10.22653629  2.502083543  14.46706922
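
The layout rules above can be checked in R before running the script. The following is a sketch only: check_table is a hypothetical helper that is not part of wardc.R, and the two small tables are made up from rows of the sample table to show one passing and one failing case.

```r
# Hypothetical helper (not part of wardc.R): returns a character vector of
# problems found in a table, or an empty vector if the layout looks right.
check_table <- function(tab) {
  problems <- character(0)
  if (!identical(names(tab)[1], "group")) {
    problems <- c(problems, "first column is not named 'group'")
  }
  vars <- tab[, names(tab) != "group", drop = FALSE]
  if (!all(vapply(vars, is.numeric, logical(1)))) {
    problems <- c(problems, "non-numeric data in a variable column")
  }
  if (any(is.na(vars))) {
    problems <- c(problems, "missing values in a variable column")
  }
  problems
}

# A table that follows the rules, and a copy with one missing value.
good <- data.frame(group = c("Zuni", "Zuni", "Puerco"),
                   VarA = c(5.880036865, 6.721518026, 6.340362167),
                   VarB = c(11.65656376, 1.881274844, 24.32552962))
bad <- good
bad$VarB[2] <- NA

print(check_table(good))
print(check_table(bad))
```

The check looks only at the variable columns, so a "group" column filled with NA is not reported as a problem.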

Requirements for Running the Script:
In order to run this script, you must install the R statistical package (version 2.8). R can be downloaded for free here. Follow the instructions on the R site for installation procedures. In addition, this script requires one specific R package (cluster) to be installed. To install this package, click on the "Packages" drop-down menu at the top of the R window and click on "Install package(s)". Choose a CRAN mirror (it is best to choose the location closest to you), then select the "cluster" package and click OK. For further instructions for installing packages, check here.
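
The package can also be installed from the R command line instead of the menu. This is a minimal sketch using the standard install.packages function. It assumes an internet connection, and R may still ask you to choose a CRAN mirror.

```r
# Install the cluster package only if it is not already available.
have_cluster <- "cluster" %in% rownames(installed.packages())
if (!have_cluster) {
  install.packages("cluster")
}

# Load the package, as the script does on its first line.
library(cluster)
```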

Starting the Script:
The first step for running the script is to place the script file "wardc.R" and the file "data.csv" in the working directory of R. To change the working directory, click on "File" in the R window and select "Change dir", then browse to the directory that you would like to use as the working directory. Next, to run the script, type the following line into the R command line:

source('wardc.R')
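
The working directory can also be set from the command line, and it can help to confirm that both files are in place before running the script. The lines below are a sketch under assumptions: work_dir is a hypothetical placeholder (a temporary folder here) that should be replaced with the folder holding wardc.R and data.csv.

```r
# Hypothetical folder; replace with the folder that holds wardc.R and data.csv.
work_dir <- tempdir()
old_dir <- setwd(work_dir)  # setwd() returns the previous working directory

# Check that both files the script needs are present before running it.
needed <- c("wardc.R", "data.csv")
present <- file.exists(needed)
names(present) <- needed
print(present)

if (all(present)) {
  source("wardc.R")
} else {
  message("Missing from ", getwd(), ": ",
          paste(needed[!present], collapse = ", "))
}

setwd(old_dir)  # restore the original working directory
```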

Running the Script:
After you type the command above into the command line, the graphics window will open and display a tree of the data and cluster levels. The command window will then print a message asking for the clustering level that you would like to use. Click on the command window, type the clustering level, and hit enter. The script will then assign this cluster level and write the original data along with the cluster assignments to a file named "ward_out.csv". Next, click on the graphics window and the script will show red rectangles around the cluster assignments on the data tree. Finally, click on the graphics window once more to display bar charts of cluster assignments by "group". At this point, all of the output is saved to a PDF file in the working directory called "ward_output.pdf".
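
The clustering step can be tried without the interactive prompt. The sketch below is not the script itself: it types in the eight rows of the sample table by hand, assumes a clustering level of 3 purely for illustration, and uses the method name "ward.D", which assumes R 3.1.0 or later (older versions use 'ward', as in the script).

```r
# The eight rows of the sample table in this document.
samp <- data.frame(
  group = c("El Morro", "El Morro", "El Morro", "Zuni", "Zuni",
            "Puerco", "Puerco", "Upper LC"),
  VarA = c(3.867734733, 3.834225282, 6.281354025, 5.880036865,
           6.721518026, 6.340362167, 4.311121233, 2.63470996),
  VarB = c(6.763993603, 13.31093941, 5.243439187, 11.65656376,
           1.881274844, 24.32552962, 10.88879727, 4.270196967),
  VarC = c(10.69645667, 10.82968254, 10.66008141, 12.27343013,
           10.13945267, 10.41703188, 11.01169512, 10.22653629),
  VarD = c(3.47807071, 0.65799061, 1.312441089, 3.533976308,
           2.477597931, 4.020033153, 2.283811961, 2.502083543),
  VarE = c(13.26029891, 16.42231967, 16.27595885, 7.897465544,
           11.31768366, 14.76344674, 16.90447911, 14.46706922)
)

# Standardize the variables and build the tree, as the script does.
wdata <- scale(samp[, -1])
fit <- hclust(dist(wdata), method = "ward.D")

# Assumed clustering level; the script asks for this value at the prompt.
k <- 3
clusters <- cutree(fit, k = k)

# Cluster assignment next to the group label for each observation.
out <- cbind(cluster = clusters, samp)
print(out[, c("cluster", "group")])
print(table(clusters))
```

cutree returns one cluster number per row, in the same order as the input table, which is why the assignments can be bound directly onto the original data.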

Script:
# The text below is the text of the script included in the file "wardc.R".

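# Load the cluster package
library(cluster)

# Read the data and separate the "group" column from the variables
data1 <- read.table(file='data.csv', sep=',', header=TRUE)
group <- data1$group
data1$group <- NULL

# Standardize the variables (rows with missing data should already be removed)
wdata <- na.omit(data1)
wdata <- scale(wdata)

# Build and plot the tree; R 3.1.0 and later call this method 'ward.D'
# and accept 'ward' with a message
fit <- hclust(dist(wdata), method='ward')
plot(fit, labels=FALSE)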

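# Ask before drawing each new plot, then prompt for the clustering level
par(ask=TRUE)
choose.clust <- function(){readline("What clustering solution would you like to use? ")}
clust.level <- as.integer(choose.clust())

# Cut the tree at the chosen level and outline the clusters in red
groups <- cutree(fit, clust.level)
rect.hclust(fit, clust.level, border='red')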

# Append the cluster assignments to the data and write the output file
print('Applying cluster level and appending files')
clust.out <- as.matrix(groups)
colnames(clust.out) <- c("cluster")
wclust <- as.matrix(cbind(clust.out, data1))
write.table(wclust, file="ward_out.csv", sep=",")

par(ask=TRUE)
print('Creating bar chart of clusters by Group')
b.plot <- table(clust.out, group)
b.plot.mat <- as.matrix(b.plot)
b.plot.per <- prop.table(b.plot.mat, margin=2)*100
barplot(b.plot.per, main="Clusters by Group", ylim=c(0,100), ylab="Percent", beside=TRUE, cex.names=0.5, col=rainbow(clust.level))

# Save the tree, the cluster rectangles, and the bar chart to a PDF file
print('Sending all output to PDF file')
pdf(file="ward_output.pdf")
plot(fit, labels=FALSE)
rect.hclust(fit, clust.level, border='red')
barplot(b.plot.per, main="Clusters by Group", ylim=c(0,100), ylab="Percent", beside=TRUE, cex.names=0.5, col=rainbow(clust.level))
dev.off()
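
The percentages shown in the bar chart come from prop.table with margin=2, which divides each count by its column (group) total. A small worked sketch follows; the six cluster assignments and the group labels "A" and "B" are made up for illustration.

```r
# Hypothetical cluster assignments and groups for six observations.
cluster <- c(1, 1, 2, 2, 2, 1)
group <- c("A", "A", "A", "B", "B", "B")

# Counts of observations in each cluster (rows) by group (columns).
counts <- table(cluster, group)

# margin = 2 divides by the column totals, so each group sums to 100 percent.
percent <- prop.table(counts, margin = 2) * 100
print(round(percent, 1))
print(colSums(percent))
```

Because each column sums to 100, the bars for a given group show how that group's observations are split among the clusters, regardless of how many observations the group contains.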