CSE 471/598, CBS 598 – Introduction to Artificial Intelligence
Project II – Requirement Specifications
Decision Tree Induction
TA: Surendra Singhi, surendra@asu.edu
Deadline: Wednesday, 11/24/04
In this project you are required to implement a decision tree. Please start now.
Induction of Decision Trees
Decision trees are a type of classifier widely used by the machine learning community. The label 'y' of an example 'X' is obtained by passing the example down from the root to a leaf, testing the appropriate attribute at each node and following the branch corresponding to the attribute's value in the example. A decision tree is learned by recursively replacing leaves with test nodes, starting at the root, until a stopping criterion is met. Please refer to the textbook for details.
Problem
Your program should have the functions described below:
- A function named "read-data" which should take an input file, store it in a structure in some format of your choice, and return the structure (henceforth called dataset). A sample data set is given below.

(read-data "train.arff")
- A function named "number-of-instances" which returns the number of instances when passed the dataset structure (returned by read-data).

(number-of-instances test-data)
- A function named "instance" which, on being passed the dataset structure (returned by read-data) and an instance number (a valid value), returns the instance (in whatever format you think is good).

(instance test-data 10)
- A function named "build-classifier" which should take the dataset (returned by read-data) and an optional impurity function as its parameters. The function should build a decision tree using the impurity function given to it. If no impurity function is provided when the function is called, then build-classifier should use a random method of attribute selection (taking the attributes in some fixed order is not random). The reason your function should take another function as a parameter is flexibility: a different impurity measure can then be substituted easily. This function should return the built classifier (some structure).

(build-classifier train-data #'entropy)
(build-classifier train-data)
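One possible shape for build-classifier is a recursive "grow" loop. The sketch below is only an illustration under assumed helpers: same-label-p, example-label, majority-label, best-attribute, split-on, dataset-examples, and dataset-attributes are hypothetical names (not part of the required interface), as is the (:leaf ...) / (:node ...) tree representation.

```lisp
;; Sketch only. All helper functions and the tree representation used
;; here are assumptions, not part of the required interface.
(defun build-classifier (dataset &optional impurity-fn)
  (labels ((grow (examples attributes)
             (cond
               ;; All examples share one label: make a leaf.
               ((same-label-p examples)
                (list :leaf (example-label (first examples))))
               ;; No attributes left to test: use the majority label.
               ((null attributes)
                (list :leaf (majority-label examples)))
               (t
                ;; Use the impurity function when given; otherwise
                ;; pick the test attribute at random.
                (let ((attr (if impurity-fn
                                (best-attribute examples attributes impurity-fn)
                                (nth (random (length attributes)) attributes))))
                  (list :node attr
                        (loop for (value . subset) in (split-on attr examples)
                              collect (cons value
                                            (grow subset
                                                  (remove attr attributes))))))))))
    (grow (dataset-examples dataset) (dataset-attributes dataset))))
```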
- A function named "entropy" which should take a list containing probability values (any number of them) and return the impurity value.

(entropy '(0.5 0.5))
(entropy '(1))
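A minimal sketch of entropy, treating a zero probability as contributing nothing to the sum (since p log p tends to 0 as p tends to 0):

```lisp
(defun entropy (probs)
  "Return the entropy (in bits) of a list of probability values."
  (reduce #'+
          (mapcar (lambda (p)
                    (if (zerop p)
                        0                       ; 0 * log 0 taken as 0
                        (- (* p (log p 2)))))
                  probs)))

;; (entropy '(0.5 0.5)) => 1.0
```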
- A function named "evaluate-instance" which should take as input a classifier and an instance, and return its classification.

(evaluate-instance (build-classifier train-data) (instance test-data 5))
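Evaluating an instance is just the root-to-leaf walk described earlier. This sketch assumes a hypothetical representation (not mandated by the spec) where a leaf is (:leaf label), an internal node is (:node attribute-index ((value . subtree) ...)), and an instance is a list of attribute values:

```lisp
;; Sketch: tree representation and instance format are assumptions.
(defun evaluate-instance (classifier instance)
  (if (eq (first classifier) :leaf)
      (second classifier)
      (let* ((attr   (second classifier))
             (value  (nth attr instance))
             (branch (assoc value (third classifier) :test #'equal)))
        (evaluate-instance (cdr branch) instance))))
```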
- A function named "classification-error" which should take as input parameters a classifier and a dataset, and return the classification error (a percentage) on it.

(classification-error (build-classifier train-data) test-data)
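This can be sketched as a loop over the dataset that counts disagreements between the predicted and actual labels. Here instance-label (extracting the "class" value from an instance) is a hypothetical helper, not part of the required interface:

```lisp
;; Sketch: instance-label is an assumed accessor.
(defun classification-error (classifier dataset)
  (let ((n (number-of-instances dataset))
        (errors 0))
    (dotimes (i n)
      (let ((inst (instance dataset i)))
        (unless (equal (evaluate-instance classifier inst)
                       (instance-label inst))
          (incf errors))))
    (* 100.0 (/ errors n))))
```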
- A function named "confusion-matrix" which takes the same parameters as classification-error but returns the confusion matrix (a two-dimensional array). A confusion matrix shows the predicted and actual classifications. It is of size L x L, where L is the number of different label values.

(confusion-matrix (build-classifier train-data) test-data)
- A function named "evaluate-classifier" which should take three input parameters (the third one optional): a classifier, a dataset, and an open stream. It should write the classification of the instances, the classification error, and the confusion matrix to the stream. If no stream is given as input, it should write to the default output (the screen).

(evaluate-classifier (build-classifier train-data) test-data file-str)
(evaluate-classifier (build-classifier train-data) test-data)
Entropy Measure
As discussed in class and in the textbook.
Classification error
It is the percentage of examples that are not classified correctly by the classifier.

Classification Error = (1 - accuracy) * 100 (expressed as x%)
(where accuracy = (a + e + i) / N for the L = 3 confusion matrix below; that is, the sum of the diagonal entries divided by the total number of examples N)
Confusion matrix
It is a matrix showing the predicted and actual classifications. A confusion matrix is of size L x L, where L is the number of different label values. The following confusion matrix is for L = 2:

actual \ predicted | negative | positive
Negative           |    a     |    b
Positive           |    c     |    d
A confusion matrix for L = 3 is shown below.

actual \ predicted | Class1 | Class2 | Class3
Class1             |   a    |   b    |   c
Class2             |   d    |   e    |   f
Class3             |   g    |   h    |   i
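The accuracy formula above (diagonal sum over N) can be computed directly from the returned confusion matrix. The function name below is hypothetical; it is only meant to make the formula concrete:

```lisp
(defun error-from-confusion-matrix (matrix)
  "Classification error (as a percentage) from an L x L confusion
matrix: 100 * (1 - diagonal-sum / total)."
  (let ((l (array-dimension matrix 0))
        (correct 0)
        (total 0))
    (dotimes (i l)
      (dotimes (j l)
        (incf total (aref matrix i j))
        (when (= i j)
          (incf correct (aref matrix i j)))))
    (* 100.0 (- 1 (/ correct total)))))

;; (error-from-confusion-matrix #2A((8 2) (1 9))) => 15.0
```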
Datasets
The datasets will have the same format as this file (Parity5.arff). All the attributes will be nominal (not necessarily Boolean), and the class label will also be discrete (not necessarily Boolean, but with no more than 10 values). There will not be any missing values in the data. The attribute named "class" is the class attribute.
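For reference, the general ARFF layout (as used by Weka) looks roughly like the snippet below; the relation name, attribute names, and data rows here are made up for illustration, not the actual contents of Parity5.arff:

```
@relation parity

@attribute bit1 {0,1}
@attribute bit2 {0,1}
@attribute bit3 {0,1}
@attribute class {0,1}

@data
0,1,1,0
1,0,1,0
```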
Project Submission Guidelines
You need to submit the soft copy of the project via the myASU Assignments section.
- The LISP file should be named {last-name}-{first-name}.lisp
- Submit the file through the Assignments section in myASU.
- Please make sure that the file is in plain text format and not in a document (Microsoft Word or some other document) format.
- The first line in the file should be your name. Please comment it out.
- If you want to label portions of your code as the solution to part one, part two, etc., or you want to write the problem statement in the Lisp file (it is not needed, but you can do it if you find it convenient), please make sure that it is all commented out.
- There should be no syntax errors in the program (like forgotten closing brackets, misplaced semicolons, unmatched double quotes, etc.).
- Please comment out whatever code you have written for debugging purposes. Function execution calls like (tsp "C:\\input.txt" 'Bavaria 'DFS) should all be commented out.
Submission and Grading
You need to submit the soft copy of the project via the myASU Assignments section and turn in a hardcopy report in class or at the CSE department office. The report should be brief and contain the following:
- A short problem statement
- Design: data structure design and algorithm design
- Sample input and output (one for a random tree, one for a tree induced using information gain)
- Discussion on any topic related to decision trees. It could be on another impurity measure, creating ensembles of decision trees, etc.
The distribution of grades for the project is:
Soft Copy – 80%
Project Report – 20%