CSE 471/598, CBS 598 – Introduction to Artificial Intelligence
Project II – Requirement Specifications
Decision Tree Induction
TA: Surendra Singhi, surendra@asu.edu
Deadline: Wednesday, 11/24/04
In this project you are required to implement a decision tree. Please start now.
Induction of Decision Trees
Decision trees are a type of classifier widely used by the machine learning community. The label 'y' of an example 'X' is obtained by passing the example down from the root to a leaf, testing the appropriate attribute at each node and following the branch corresponding to the attribute's value in the example. A decision tree is learned by recursively replacing leaves with test nodes, starting at the root, until a stopping criterion is met. Please refer to the textbook for details.
Problem
Your program should have the functions described below:
- A function named "read-data" which should take an input file, store it in a structure in some format of your choice, and return the structure (henceforth called dataset). A sample data set is given below.

(read-data "train.arff")
- A function named "number-of-instances" which returns the number of instances when passed the dataset structure (returned by read-data).

(number-of-instances test-data)
- A function named "instance" which, on being passed the dataset structure (returned by read-data) and an instance number (a valid value), returns the instance (in whatever format you think is good).

(instance test-data 10)
- A function named "build-classifier" which should take the dataset (returned by read-data) and an optional impurity function as its parameters. The function should build a decision tree using the impurity function given to it. If no impurity function is provided when the function is called, then build-classifier should use a random method of attribute selection (taking the attributes in some fixed order is not random). The reason your function should take another function as a parameter is flexibility: a different impurity measure can then be substituted easily. This function should return the built classifier (some structure).

(build-classifier train-data #'entropy)
(build-classifier train-data)
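One possible shape for build-classifier is a recursive "grow" loop. The sketch below is only an illustration under assumed helpers: same-label-p, example-label, majority-label, best-attribute, split-on, dataset-examples, and dataset-attributes are hypothetical names (not part of the required interface), as is the (:leaf ...) / (:node ...) tree representation.

```lisp
;; Sketch only. All helper functions and the tree representation used
;; here are assumptions, not part of the required interface.
(defun build-classifier (dataset &optional impurity-fn)
  (labels ((grow (examples attributes)
             (cond
               ;; All examples share one label: make a leaf.
               ((same-label-p examples)
                (list :leaf (example-label (first examples))))
               ;; No attributes left to test: use the majority label.
               ((null attributes)
                (list :leaf (majority-label examples)))
               (t
                ;; Use the impurity function when given; otherwise
                ;; pick the test attribute at random.
                (let ((attr (if impurity-fn
                                (best-attribute examples attributes impurity-fn)
                                (nth (random (length attributes)) attributes))))
                  (list :node attr
                        (loop for (value . subset) in (split-on attr examples)
                              collect (cons value
                                            (grow subset
                                                  (remove attr attributes))))))))))
    (grow (dataset-examples dataset) (dataset-attributes dataset))))
```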
- A function named "entropy" which should take a list containing probability values (any number of them) and return the impurity value.

(entropy '(0.5 0.5))
(entropy '(1))
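A minimal sketch of entropy, treating a zero probability as contributing nothing to the sum (since p log p tends to 0 as p tends to 0):

```lisp
(defun entropy (probs)
  "Return the entropy (in bits) of a list of probability values."
  (reduce #'+
          (mapcar (lambda (p)
                    (if (zerop p)
                        0                       ; 0 * log 0 taken as 0
                        (- (* p (log p 2)))))
                  probs)))

;; (entropy '(0.5 0.5)) => 1.0
```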
- A function named "evaluate-instance" which should take as input a classifier and an instance, and return its classification.

(evaluate-instance (build-classifier train-data) (instance test-data 5))
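Evaluating an instance is just the root-to-leaf walk described earlier. This sketch assumes a hypothetical representation (not mandated by the spec) where a leaf is (:leaf label), an internal node is (:node attribute-index ((value . subtree) ...)), and an instance is a list of attribute values:

```lisp
;; Sketch: tree representation and instance format are assumptions.
(defun evaluate-instance (classifier instance)
  (if (eq (first classifier) :leaf)
      (second classifier)
      (let* ((attr   (second classifier))
             (value  (nth attr instance))
             (branch (assoc value (third classifier) :test #'equal)))
        (evaluate-instance (cdr branch) instance))))
```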
- A function named "classification-error" which should take as input parameters a classifier and a dataset, and return the classification error (a percentage) on it.

(classification-error (build-classifier train-data) test-data)
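This can be sketched as a loop over the dataset that counts disagreements between the predicted and actual labels. Here instance-label (extracting the "class" value from an instance) is a hypothetical helper, not part of the required interface:

```lisp
;; Sketch: instance-label is an assumed accessor.
(defun classification-error (classifier dataset)
  (let ((n (number-of-instances dataset))
        (errors 0))
    (dotimes (i n)
      (let ((inst (instance dataset i)))
        (unless (equal (evaluate-instance classifier inst)
                       (instance-label inst))
          (incf errors))))
    (* 100.0 (/ errors n))))
```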
- A function named "confusion-matrix" which takes the same parameters as classification-error but returns the confusion matrix (a two-dimensional array). A confusion matrix shows the predicted and actual classifications. It is of size L x L, where L is the number of different label values.

(confusion-matrix (build-classifier train-data) test-data)
- A function named "evaluate-classifier" which should take three input parameters (the third one optional): a classifier, a dataset, and an open stream. It should write the classification of the instances, the classification error, and the confusion matrix to the stream. If no stream is given as input, it should write to the default output (the screen).

(evaluate-classifier (build-classifier train-data) test-data file-str)
(evaluate-classifier (build-classifier train-data) test-data)
Entropy Measure
As discussed in class and in the textbook.
Classification error
It is the percentage of examples that are not classified correctly by the classifier.

Classification Error = (1 - accuracy) * 100 (expressed as x%)
(where accuracy = (a + e + i) / N for the L = 3 confusion matrix below; that is, the sum of the diagonal entries divided by the total number of examples N)
Confusion matrix
It is a matrix showing the predicted and actual classifications. A confusion matrix is of size L x L, where L is the number of different label values. The following confusion matrix is for L = 2:

actual \ predicted | negative | positive
Negative           |    a     |    b
Positive           |    c     |    d
A confusion matrix for L = 3 is shown below.

actual \ predicted | Class1 | Class2 | Class3
Class1             |   a    |   b    |   c
Class2             |   d    |   e    |   f
Class3             |   g    |   h    |   i
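The accuracy formula above (diagonal sum over N) can be computed directly from the returned confusion matrix. The function name below is hypothetical; it is only meant to make the formula concrete:

```lisp
(defun error-from-confusion-matrix (matrix)
  "Classification error (as a percentage) from an L x L confusion
matrix: 100 * (1 - diagonal-sum / total)."
  (let ((l (array-dimension matrix 0))
        (correct 0)
        (total 0))
    (dotimes (i l)
      (dotimes (j l)
        (incf total (aref matrix i j))
        (when (= i j)
          (incf correct (aref matrix i j)))))
    (* 100.0 (- 1 (/ correct total)))))

;; (error-from-confusion-matrix #2A((8 2) (1 9))) => 15.0
```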
Datasets
The datasets will have the same format as this file (Parity5.arff). All the attributes will be nominal (not necessarily Boolean), and the class label will also be discrete (not necessarily Boolean, but with no more than 10 values). There will not be any missing values in the data. The attribute named "class" is the class attribute.
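For reference, the general ARFF layout (as used by Weka) looks roughly like the snippet below; the relation name, attribute names, and data rows here are made up for illustration, not the actual contents of Parity5.arff:

```
@relation parity

@attribute bit1 {0,1}
@attribute bit2 {0,1}
@attribute bit3 {0,1}
@attribute class {0,1}

@data
0,1,1,0
1,0,1,0
```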
Project Submission Guidelines
You need to submit the soft copy of the project via the myASU Assignments section.
- The LISP file should be named {last-name}-{first-name}.lisp
- Submit the file through the Assignments section in myASU.
- Please make sure that the file is in plain text format and not in a document (Microsoft Word or some other document) format.
- The first line in the file should be your name. Please comment it out.
- If you want to label portions of your code as the solution to part one, part two, etc., or you want to write the problem statement in the Lisp file (it is not needed, but you can do it if you find it convenient), please make sure that it is all commented out.
- There should be no syntax errors in the program (like forgotten closing brackets, misplaced semicolons, unmatched double quotes, etc.).
- Please comment out whatever code you have written for debugging purposes. Function execution calls like (tsp "C:\\input.txt" 'Bavaria 'DFS) should all be commented out.
Submission and Grading
You need to submit the soft copy of the project via the myASU Assignments section and turn in a hardcopy report in class or at the CSE department office. The report should be brief and contain the following:
- A short problem statement
- Design: data structure design and algorithm design
- Sample input and output (one for a random tree, one for a tree induced using information gain)
- Discussion on any topic related to decision trees. It could be on another impurity measure, creating ensembles of decision trees, etc.
The distribution of grades for the project is:
Soft Copy – 80%
Project Report – 20%