CSE 591 Data Mining

Instructor: Huan Liu

=============== LEARNING, INNOVATING via PRACTICE =============

COURSE AND OBJECTIVES


This is a graduate-level course, and most topics are ongoing research
work. It is a seminar course, and active classroom participation is
expected. It is designed to encourage students to learn advanced concepts
actively, to think independently about research and development issues,
to relate what we learn to real problems in practice, to brainstorm new
ideas, to solve pressing problems in the various phases of data mining,
and to create and implement systems or components that can be used and
reused for data mining.


FORMAT

To achieve these objectives, the course is organized into four parts:

Part 1. Brief introduction and review

Part 2. Paper presentation, questions to think about, problems to solve

Part 3. Project and presentation

        Projects can be done in groups or individually
        Projects can be one of the following types
           a. Survey
           b. Real-world problem identification, analysis and solution
           c. Research and implementation
           d. Building a data mining environment

Part 4. Quizzes only

SCHEDULE (tentative)

 Introduction and Organization

 Classification: ID3                              Quinlan86
 Classification: NBC & k-NN                       Domingos&Pazzani96
 Classification: Neural Networks                  Lu et al95

 Evaluation: Performance Measures                 statistics book
 Evaluation: Comparison                           Lim et al99

 Preprocessing: Feature Selection                 Dash-Liu97
 Preprocessing: Discretization                    Hussain et al99
 Preprocessing: Sampling and Its Application      Gu et al00

 Clustering:    k-Means and EM                    Meila-Heckerman98
 Clustering:    Hierarchical Clustering           Fisher96
 Clustering:    BIRCH                             Zhang-etal96

 Association:  Rule Mining                        Agrawal-Srikantl94
 Association:  Multi-level Rules                  Han-Fu95
 Association:  Parallel Mining                    Agrawal etal96,Han etal97

 Data Warehousing: Schemas and Data Cubes         Garcia-Molina99,Widom95
 Data Warehousing: Data Marting and Metadata      Chapters6-9,Anahory-Murray

 Semi-Structured Data: XML, etc.                  Deutsch etal99
 Semi-Structured Data: RDF, Meta-Data             Mutsumi00

 Web Mining: Extracting Patterns and Relations    Brin98
 Web Mining: Dynamic Data Mining via Sampling     Brin-Page

 Real-World Challenges: Invited Talk by Mr Mike Gardner, Motorola Labs
 Real-World Challenges: Customer Retention

C1 Proposal Presentation and Discussion
C2 Proposal Presentation and Discussion

C3 Project Presentation and Discussion
C4 Project Presentation and Discussion
C5 Project Presentation and Discussion
C6 Project Presentation and Discussion


Classification
--------------
  Quinlan86
     J.R. Quinlan, ``Induction of Decision Trees'', Machine Learning,
        1:81-106, 1986.

  Domingos&Pazzani96
     P. Domingos and M. Pazzani, ``Beyond Independence: Conditions for
       the Optimality of the Simple Bayesian Classifier'', Proceedings of
       the 13th International Conference on Machine Learning. 1996. 105-112.
     ps.gz

  Lu et al95
     H. Lu, R. Setiono and H. Liu, ``NeuroRule: A Connectionist Approach to
       Data Mining'', International Conference on Very Large Databases
       (VLDB'95), Oct. 1995, Zurich, Switzerland.
     ps

Evaluation
----------
  Witten&Frank00
     Witten, I.H. and Frank, E. ``Data Mining - Practical Machine
     Learning Tools and Techniques with JAVA Implementations'',
     Morgan Kaufmann, 2000.

  Lim et al99
     Lim, T.-S., Loh, W.-Y. and Shih, Y.-S. ``A Comparison of Prediction
       Accuracy, Complexity, and Training Time of Thirty-three Old and
       New Classification Algorithms'', Machine Learning. Forthcoming.
       (Appendix containing complete tables of error rates, ranks, and
       training times; download the data sets in C4.5 (ASCII) format)
     link

Preprocessing
-------------
  Dash-Liu97
     M. Dash and H. Liu, ``Feature Selection for Classification''.
       Intelligent Data Analysis - An International Journal,
       Elsevier, Vol. 1, No. 3, 1997
       ps

  Hussain et al99
     F. Hussain, H. Liu, C.L. Tan, and M. Dash, ``Discretization:
       An Enabling Technique'', Technical Report, TRC6/99,
       School of Computing, National University of Singapore, 1999.
       link

  Gu et al00
     B. Gu, F. Hu, and H. Liu, ``Sampling and Its Application to
       Data Mining: A Survey'', Technical Report, TRA6/00,
       School of Computing, National University of Singapore, 2000.
       link

Clustering
----------

  Meila-Heckerman98
     M. Meila and D. Heckerman, ``An Experimental Comparison of Several
       Clustering and Initialization Methods'', Technical Report,
       MSR-TR-98-06, Microsoft Research, Redmond, WA.
     ps

  Fisher96
     D.H. Fisher, ``Iterative Optimization and Simplification of Hierarchical
       Clusterings'', JAIR, 4, 147-178, 1996.
     ps

  Zhang-etal96
     T. Zhang, R. Ramakrishnan, and M. Livny, ``BIRCH: An Efficient Data
       Clustering Method for Very Large Databases'', SIGMOD'96, 103-114.

Association
-----------
  Agrawal-Srikantl94
     R. Agrawal and R. Srikant, ``Fast Algorithms for Mining Association
       Rules'', Proceedings International Conference on Very Large
       Data Bases, Santiago, Chile, September, 1994, 487-499.
     ps

  Han-Fu95
     J. Han and Y. Fu, ``Discovery of Multiple-level Association Rules from
       Large Databases'', Proceedings International Conference on Very Large
       Data Bases, Zurich, Switzerland, September 1995, 420-431.
     ps

  Agrawal etal96
     R. Agrawal and J.C. Shafer, ``Parallel Mining of Association Rules'',
       IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6,
       December 1996.
     ps

  Han etal97
     E-H. Han, G. Karypis and V. Kumar, ``Scalable Parallel Data Mining for
       Association Rules''. Proceedings ACM SIGMOD International Conference on
       Management of Data, May 1997.
     ps

Real-World Applications
-----------------------
  Ng etal98
     K.S. Ng, H. Liu, and H.B. Kwah, ``A Data Mining Application:
       Customer Retention at the Port of Singapore Authority (PSA)'',
       SIGMOD'98, Industrial Track, June 1-4, 1998. Seattle, Washington, USA

Data Warehousing
----------------

  Garcia-Molina99
     H. Garcia-Molina, W. J. Labio, J. L. Wiener, Y. Zhuge. ``Distributed and
       Parallel Computing Issues in Data Warehousing'' (invited talk). To
       appear in Proceedings of ACM Principles of Distributed Computing
       Conference, 1999.
       link

  Widom95
     J. Widom, ``Research Problems in Data Warehousing'', Proceedings 4th
       International Conference on Information and Knowledge Management,
       Baltimore, Maryland, Nov. 1995, 25-30.
       link

  Anahory-Murray
     Data Warehousing in the Real World - A Practical Guide for
     Building Decision Support Systems

Web Mining
----------

  Deutsch etal99, ``A Query Language for XML'', Proceedings of the
       Eighth International World Wide Web Conference (WWW8), 1999
       ps

  Mutsumi00, RDF Resources
       Lass97,  Introduction to RDF Metadata,
            html, 1997
       Bern98a, What the Semantic Web can represent
            html, 1998
       Bern98b, Why RDF model is different from the XML model
            html, 1998

  Brin98, ``Extracting Patterns and Relations from the World-Wide
       Web'', WebDB Workshop at EDBT, 1998
       ps

  Brin-Page, ``Dynamic Data Mining'',
       ps

  Brin-Page98, ``The Anatomy of a Large-Scale Hypertextual Web Search
       Engine'', WWW7 / Computer Networks 30(1-7): 107-117 (1998)
       htm

  Kosala-Blockeel00, ``Web Mining Research: A Survey'', SIGKDD
       Explorations, Volume 2, Issue 1: 1-15, 2000.
       link

======================================================

PROJECTS

The following topics are suggested for your reference. Students can
choose one of these topics or, after consulting with the instructor,
work on another substantial problem of their own.

Group a. Survey

You should choose this type of project only if you are quite familiar with
the subfield you wish to survey; otherwise, you are advised not to do it.

A survey should follow the style of ACM Computing Surveys. Extensiveness,
comprehensibility, and technical merit are the major considerations. The
survey should be polished enough to serve as the basis for a tutorial
or a journal paper.

Group b. Real-world problem identification, analysis and solution

You are encouraged to study your organization's problems and needs for
data mining. Ng etal98 is a good example of real-world data mining. You
should discuss your detailed plan with the instructor.

Group c. Research Topics

The work must present novel ideas or methods to qualify as an excellent
piece of research. You should discuss the novelty and significance of your
ideas or methods with the instructor.

Group d. Building an environment

You will help install, implement, or maintain some data mining
systems. The end result of an implementation project should be a
self-contained system or component that is thoroughly tested and of
real use. Consult the instructor for more details.


Latest update: August 16, 2000