CSE 471/598 - Introduction to Artificial intelligence

Project II - Requirement Specifications

Decision Tree Induction

Please note the following to avoid any misunderstanding: The highest standards of academic integrity are expected of all students. The failure of any student to meet these standards may result in suspension or expulsion from the university and other sanctions as specified in the academic integrity policies of the individual colleges. Violations of academic integrity include, but are no limited to, cheating, fabrication, tampering, plagiarism, or facilitating such activities. If you are not sure of something or in doubt then please contact the TA or the professor.

Deadline: Monday, April 24, 2006

In this project we implement a decision tree induction algorithm in Lisp.

Induction of Decision Trees

Decision Trees are one type of classifiers widely used by the machine learning community. The label 'Y' of an example 'X' is obtained by passing the example down from the root to the leaf, testing the appropriate attribute at each node and following the branch corresponding to the attribute's value in the example. A decision tree is learned by starting at the root and recursively replacing the leaves following a feature selection algorithm until a certain stopping criterion is met.

Problem

Your program should have the following functions:

A function named "read-data" which should take an input file and store it in a structure in some format (its up to you) and return the structure (henceforth called dataset). A sample input file is given as
(read-data "train.arff")
Function "build-classifier" which should take the dataset structure (returned by read-data) and an impurity function as its parameter. The function should build a decision tree using a given impurity function. The reason why your function should take another function as a parameter is that in case if you want to use a different impurity measure you should have the flexibility of doing that. This function should return the built classifier (some structure).
(build-classifier train-data #'entropy)
A functions named "entropy" should take probability values (any number of them) and return the impurity (purity) value.
(entropy 0.5 0.5 0.5)

(entropy 1)
A function named "evaluate-classifier", it should take as input parameter a classifier and a dataset and return the classification error (percentage) on it. It should print the classification of all the instances in the dataset and also print the confusion matrix. A confusion matrix is a matrix showing the predicted and actual classifications. It is of size L x L, where L is the number of different label values.
(evaluate-classifier (build-classifier train-data) train-data)

(evaluate-classifier (build-classifier train-data) test-data)

Entropy Measure

As discussed in the class and the book.

Classification Error

It is the percentage of examples that are not classified correctly by the classifier.

Classification Error = (1 - accuracy)*100 (as x%), where accuracy = (a + e + i )/N (corresponding to the 3X3 confusion matrix below)

Confusion matrix

It is a matrix showing the predicted and actual classifications. A confusion matrix is of size L x L, where L is the number of different label values. The following confusion matrix is for L=2:

actual \ predicted	negative	positive
Negative	a	b
Positive	c	d

A confusion Matrix of 3 classes is as follows:

	Class I (predicted)	Class II (predicted)	Class III (predicted)
(actual) Class I	a	b	c
(actual) Class II	d	e	f
(actual) Class III	g	h	i

Datasets

The datasets will have the format as this file. (Parity5.arff)

All the attributes will be nominal (not necessarily Boolean) and the class label will also be discrete (not necessarily Boolean). There will not be any missing values in the data. The attribute named "class" is the class attribute.

More information on the file format can be obtained at http://www.cs.waikato.ac.nz/~ml/weka/arff.html. Both the training and test datasets will have the same format. Below is a Lisp function for reading from a stream till a certain delimiter is found. You are encouraged to use your own function, or read the file in a different way.

(defun read-till-delimiter (stream delimiter)
   (with-output-to-string (str)
      (loop for ch = (read-char stream nil stream)
            until (or (eq ch stream) (eql ch delimiter))
            do (princ ch str))))

Project Submission Guidelines

You need to submit the soft copy of the project via myASU Assignments section. Please read carefully the guidelines.

You should submit one LISP file named {last name}-{first-name}.lisp
Submit the file through Assignment section in myASU.
Please make sure that the file is in simple text format and not in document (Microsoft Word or some other document) format.
The first line in the file should be your name. Please comment it, e.g., ;Lei Tang.
If you want to label portions of your code as solution to part one, part two, etc., or you want to write the problem statement in the lisp file (it is not needed, you can do it if you feel it is convenient for you), please make sure that they are all commented.
There should be no syntax errors in the program (like forgotten closing brackets, misplaced semicolon unmatched double quotes, etc.,); the program should compile without errors.
Please comment out whatever code you have written for debugging purpose. The function execution calls like (sudoku "C:\\input.txt" 'DFS) should all be commented.
Make sure your program executes in Allegro Lisp (ideally it should execute in clisp too).
Do not use #t as it is something from Scheme and not part of Ansi Lisp. Allegro Lisp supports it but clisp doesn't.
Please do not change the package name or create functions in different packages.
Please follow the Lisp style guide available at http://www.lisp.org/table/style.htm

Also, you should turn in a hardcopy report either in class or at the CSE department office. The report should be brief and analytical. It should contain discussion on any topic on decision trees, for example, any other impurity measures, any method of pruning the decision tree, etc. It should also contain the difficulties you encounter, interesting observations, etc.

Grading

The distribution of grades for the project is :

Code - 80%

Project Report - 20%

(updated on April 2nd, 2006)