CSE/CBS 572: Project Two

Due 4/17/2008

 

In this project, you will implement the Apriori algorithm to mine association rules from the Congressional Voting Records data (link to access the data). Your program should first generate the frequent itemsets from the dataset, and then generate the association rules from those frequent itemsets.

1.      Data pre-processing. Extract the 34 features listed in Table 6.3 (p. 353 of the textbook). A “?” simply means that the value is neither “yes” nor “no”. For each transaction, omit any feature whose value is “?”. See the class notes for more information.
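
The pre-processing step above can be sketched as follows. This is only an illustration in Python (your submission must still use one of the languages allowed below), and it rests on assumptions that are not stated in this handout: each record is taken to be a comma-separated line with the party first, followed by one vote per attribute encoded as `y`, `n`, or `?`. The three attribute names and the function name `record_to_transaction` are hypothetical; the real data has 16 votes, which together with the two party values presumably gives the 34 binary features.

```python
# Hypothetical attribute names, shortened to three for illustration only.
ATTRIBUTES = ["budget-resolution", "mx-missile", "el-salvador-aid"]


def record_to_transaction(line, attributes=ATTRIBUTES):
    """Turn one comma-separated record into a set of items.

    Assumed layout: party first, then one vote per attribute,
    each vote being 'y', 'n', or '?'.
    """
    fields = [f.strip() for f in line.strip().split(",")]
    party, votes = fields[0], fields[1:]
    if len(votes) != len(attributes):
        raise ValueError(
            "expected %d votes, got %d" % (len(attributes), len(votes))
        )

    items = {party.capitalize()}  # 'republican' becomes 'Republican'
    for name, vote in zip(attributes, votes):
        if vote == "y":
            items.add(name + " = yes")
        elif vote == "n":
            items.add(name + " = no")
        # Any other value, including '?', is omitted from the transaction.
    return frozenset(items)


example = record_to_transaction("republican,n,?,y")
# example contains: Republican, budget-resolution = no, el-salvador-aid = yes
```

Representing each transaction as a set makes the later subset tests ("does this transaction contain this itemset?") a single comparison.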

2.      Your program should generate frequent k-itemsets for size k = {1,2,3,4}. Set minimum support = 0.3. Report (up to) the 5 most frequent k-itemsets for size k = {1,2,3,4} with their support values.
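
As a rough guide to the level-wise search described above, here is a minimal Python sketch of frequent itemset generation. It is an illustration under stated assumptions, not a reference solution: transactions are assumed to be sets of item labels, support is taken as the fraction of transactions containing the itemset, and the names `apriori` and `top_n` are hypothetical. The toy transactions use the letters A, B, C rather than real voting data.

```python
from itertools import combinations


def apriori(transactions, min_support, max_k):
    """Return {k: {itemset: support}} for k = 1..max_k."""
    n = len(transactions)

    def support_of(candidates):
        # Count how many transactions contain each candidate itemset.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: cnt / n for c, cnt in counts.items()
                if cnt / n >= min_support}

    items = {item for t in transactions for item in t}
    frequent = {1: support_of({frozenset([i]) for i in items})}
    for k in range(2, max_k + 1):
        prev = frequent[k - 1]
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        frequent[k] = support_of(candidates)
    return frequent


def top_n(itemsets, n=5):
    """Up to n itemsets, highest support first; ties broken by item names."""
    return sorted(itemsets.items(),
                  key=lambda kv: (-kv[1], sorted(kv[0])))[:n]


toy = [frozenset("ABC"), frozenset("AB"), frozenset("AC"),
       frozenset("BC"), frozenset("ABC")]
frequent = apriori(toy, min_support=0.3, max_k=4)
for itemset, support in top_n(frequent[2]):
    print(", ".join(sorted(itemset)), support)
```

The support count here scans every candidate for every transaction, which is simple rather than fast; for a dataset of a few hundred records that is likely adequate, but a hash tree or similar structure is the usual refinement.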

Your output format must be as shown below. (Note: The itemsets shown in the table are for illustration purposes only and may not represent the truth from the data.)

 

K = 4                                                                                     Support
Rank 1    Republican, Budget resolution = no, MX-missile = no, aid to El Salvador = yes   0.55
Rank 2    Democrat, Budget resolution = yes, MX-missile = yes, aid to El Salvador = no    0.45
Rank 3    ...
Rank 4    ...
Rank 5    ...

K = 3
…

K = 2
…

K = 1
…

3.      Your program should generate association rules from the (up to) top 5 frequent 4-itemsets. Set minimum confidence = 0.9. Report (up to) the top 5 association rules, ranked by confidence.
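
The rule-generation step can be sketched in the same illustrative way. The sketch assumes the usual definition, confidence(X → Y) = support(X ∪ Y) / support(X), and it tries every non-empty proper subset of an itemset as the antecedent X; this handout does not say whether consequents should be restricted (for example to the party item), so treat that choice as an assumption. The function names are hypothetical and the toy transactions are not real voting data.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def rules_from_itemset(itemset, transactions, min_confidence):
    """List (antecedent, consequent, confidence) rules for one itemset."""
    whole = support(itemset, transactions)
    if whole == 0:
        return []  # the itemset never occurs, so no rule can be formed
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), size)):
            # support(antecedent) >= whole > 0, so the division is safe.
            conf = whole / support(antecedent, transactions)
            if conf >= min_confidence:
                rules.append((antecedent, itemset - antecedent, conf))
    return rules


def top_rules(itemsets, transactions, min_confidence, n=5):
    """Up to n rules over all given itemsets, highest confidence first."""
    rules = [r for s in itemsets
             for r in rules_from_itemset(s, transactions, min_confidence)]
    rules.sort(key=lambda r: -r[2])
    return rules[:n]


toy = [frozenset("ABC"), frozenset("ABC"), frozenset("AB"), frozenset("C")]
for lhs, rhs, conf in top_rules([frozenset("ABC")], toy, min_confidence=0.9):
    print(", ".join(sorted(lhs)), "->", ", ".join(sorted(rhs)), conf)
```

Recomputing each support from the transactions keeps the sketch self-contained; in a full program the supports already found during frequent itemset generation could be looked up instead, since every subset of a frequent itemset is itself frequent.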

Your output format should be as shown below. (Note: The association rules shown in the table are for illustration purposes only and may not represent the truth from the data.)

 

Association rules from the (up to) top 5 frequent 4-itemsets                               Confidence
Rank 1    Budget resolution = yes, MX-missile = yes, aid to El Salvador = no → Democrat    0.975
Rank 2    Budget resolution = no, MX-missile = no, aid to El Salvador = yes → Republican   0.910
Rank 3    ...
Rank 4    ...                                                                              ...
Rank 5    ...                                                                              ...

 

The project should be written in C, C++, Java, or Matlab, and it should compile on the general machine (general.asu.edu; log in with your ASURITE user ID and password). Students should hand in an electronic copy of the solution via Blackboard, providing:

  1. The source code for parts 1, 2, and 3.
  2. The collection of transactions after the data pre-processing step.
  3. Tables including (up to) the 5 most frequent k-itemsets for size k = {1,2,3,4} with their support values.
  4. Tables including (up to) top 5 association rules ranked according to the confidence.

Data source: UCI Machine Learning Repository