CSE 591 Data Mining
Instructor: Huan Liu
=============== LEARNING, INNOVATING via PRACTICE =============
COURSE AND OBJECTIVES
The course is at the graduate level, and most topics are ongoing research
work. It is a seminar course, and active classroom participation is
expected. It is designed to encourage students to actively learn advanced
concepts, to think independently about research and development issues,
to proactively relate what we learn to real problems in practice,
to brainstorm new ideas, to intelligently solve pressing
problems in various phases of data mining, and to create and implement
systems or components that can be used and reused for data mining.
FORMAT
To achieve these objectives, the course is organized into four parts:
Part 1. Brief introduction and review
Part 2. Paper presentation, questions to think about, problems to solve
Part 3. Project and presentation
Projects can be done in groups or individually
Projects can be one of the following types:
a. Survey
b. Real-world problem identification, analysis and solution
c. Research and implementation
d. Building a data mining environment
Part 4. Quizzes only
SCHEDULE (tentative)
- Introduction and Organization
- Classification: ID3 Quinlan86
- Classification: NBC & k-NN Domingos&Pazzani96
- Classification: Neural Networks Lu et al95
- Evaluation: Performance Measures statistics book
- Evaluation: Comparison Lim et al99
- Preprocessing: Feature Selection Dash-Liu97
- Preprocessing: Discretization Hussain et al99
- Preprocessing: Sampling and Its Application Gu et al00
- Clustering: k-Means and EM Meila-Heckerman98
- Clustering: Hierarchical Clustering Fisher96
- Clustering: BIRCH Zhang-etal96
- Association: Rule Mining Agrawal-Srikant94
- Association: Multi-level Rules Han-Fu95
- Association: Parallel Mining Agrawal etal96,Han etal97
- Data Warehousing: Schemas and Data Cubes Garcia-Molina99,Widom95
- Data Warehousing: Data Marting and Metadata Chapters6-9,Anahory-Murray
- Semi-Structured Data: XML, etc. Deutsch etal99
- Semi-Structured Data: RDF, Meta-Data Mutsumi00
- Web Mining: Extracting Patterns and Relations Brin98
- Web Mining: Dynamic Data Mining via Sampling Brin-Page
- Real-World Challenges: Invited Talk by Mr Mike Gardner, Motorola Labs
- Real-World Challenges: Customer Retention
C1 Proposal Presentation and Discussion
C2 Proposal Presentation and Discussion
C3 Project Presentation and Discussion
C4 Project Presentation and Discussion
C5 Project Presentation and Discussion
C6 Project Presentation and Discussion
Classification
--------------
Quinlan86
J.R. Quinlan, ``Induction of Decision Trees'', Machine Learning, 1:81-106,
1986.
Domingos&Pazzani96
P. Domingos and M. Pazzani, ``Beyond Independence: Conditions for
the Optimality of the Simple Bayesian Classifier'', Proceedings of
the 13th International Conference on Machine Learning. 1996. 105-112.
ps.gz
Lu et al95
H. Lu, R.Setiono and H. Liu, ``NeuroRule: A Connectionist Approach to
Data Mining'', International Conference on Very Large Databases
(VLDB'95), Oct. 1995, Zurich, Switzerland.
ps
Evaluation
----------
Witten&Frank00
Witten, I.H. and Frank, E. ``Data Mining - Practical Machine
Learning Tools and Techniques with JAVA Implementations'',
Morgan Kaufmann, 2000.
Lim et al99
Lim, T.-S., Loh, W.-Y. and Shih, Y.-S. ``A Comparison of Prediction
Accuracy, Complexity, and Training Time of Thirty-three Old and
New Classification Algorithms'', Machine Learning. Forthcoming.
(Appendix containing complete tables of error rates, ranks, and
training times; download the data sets in C4.5 (ASCII) format)
link
Preprocessing
-------------
Dash-Liu97
M. Dash and H. Liu, ``Feature Selection for Classification''.
Intelligent Data Analysis - An International Journal,
Elsevier, Vol. 1, No. 3, 1997
ps
Hussain et al99
F. Hussain, H. Liu, C.L. Tan, and M. Dash, ``Discretization:
An Enabling Technique'', Technical Report, TRC6/99,
School of Computing, National University of Singapore, 1999.
link
Gu et al00
B. Gu, F. Hu, and H. Liu, ``Sampling and Its Application to
Data Mining: A Survey'', Technical Report, TRA6/00,
School of Computing, National University of Singapore, 2000.
link
Clustering
----------
Meila-Heckerman98
M. Meila and D. Heckerman, ``An Experimental Comparison of Several
Clustering and Initialization Methods'', Technical Report,
MSR-TR-98-06, Microsoft Research, Redmond, WA.
ps
Fisher96
D.H. Fisher, ``Iterative Optimization and Simplification of Hierarchical
Clusterings'', JAIR, 4, 147-178, 1996.
ps
Zhang-etal96
T. Zhang, R. Ramakrishnan, and M. Livny, ``BIRCH: An Efficient Data
Clustering Method for Very Large Databases'', SIGMOD96, 103-114.
Association
-----------
Agrawal-Srikant94
R. Agrawal and R. Srikant, ``Fast Algorithms for Mining Association
Rules'', Proceedings International Conference on Very Large
Data Bases, Santiago, Chile, September, 1994, 487-499.
ps
Han-Fu95
J. Han and Y. Fu. ``Discovery of Multiple-level Association Rules from
Large databases''. Proceedings International Conference on Very Large
Data Bases, Zurich, Switzerland, September 1995, 420-431.
ps
Agrawal etal96
R. Agrawal and J.C. Shafer, ``Parallel Mining of Association Rules'',
IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6,
December 1996.
ps
Han etal97
E-H. Han, G. Karypis and V. Kumar, ``Scalable Parallel Data Mining for
Association Rules''. Proceedings ACM SIGMOD International Conference on
Management of Data, May 1997.
ps
Real-World Applications
-----------------------
Ng etal98
K.S. Ng, H. Liu, and H.B. Kwah, ``A Data Mining Application:
Customer Retention at the Port of Singapore Authority (PSA)'',
SIGMOD'98, Industrial Track, June 1-4, 1998. Seattle, Washington, USA
Data Warehousing
----------------
Garcia-Molina99
H. Garcia-Molina, W. J. Labio, J. L. Wiener, Y. Zhuge. ``Distributed and
Parallel Computing Issues in Data Warehousing'' (invited talk). To
appear in Proceedings of ACM Principles of Distributed Computing
Conference, 1999.
link
Widom95
J.Widom, ``Research Problems in Data Warehousing'', Proceedings 4th
International Conference on Information and Knowledge Management,
Baltimore, Maryland, Nov. 1995, 25-30.
link
Anahory-Murray
Data Warehousing in the Real World - A Practical Guide for
Building Decision Support Systems
Web Mining
----------
Deutsch etal99, ``A Query Language for XML'', Proceedings of the
Eighth International World Wide Web Conference (WWW8), 1999
ps
Mutsumi00, RDF Resources
Lass97, Introduction to RDF Metadata,
html, 1997
Bern98a, What the Semantic Web can represent
html, 1998
Bern98b, Why RDF model is different from the XML model
html, 1998
Brin98, ``Extracting Patterns and Relations from the World-Wide
Web'', WebDB Workshop at EDBT, 1998
ps
Brin-Page, ``Dynamic Data Mining'',
ps
Brin-Page98, ``The Anatomy of a Large-Scale Hypertextual Web Search
Engine'', WWW7 / Computer Networks 30(1-7): 107-117 (1998)
htm
Kosala-Blockeel00, ``Web Mining Research: A Survey'', SIGKDD
Explorations, Volume 2, Issue 1: 1-15, 2000.
link
======================================================
PROJECTS
The topics below are suggested for your reference. Students may work on
one of these topics or, after consulting with the lecturer, on another
substantial problem of their own choosing.
Group a. Survey
You should choose this type of project only if you are quite familiar with
the subfield you wish to survey; otherwise, you are advised not to do it.
A survey should follow the style of ACM Computing Surveys. Extensiveness,
comprehensibility, and technical worthiness are the major considerations.
The survey should be polished enough to lead to tutorial material
or a journal paper.
Group b. Real-world problem identification, analysis and solution
You are encouraged to study your organization's problems and needs for
data mining. Ng etal98 is a good example of real-world data mining. You
should discuss your detailed plan with the lecturer.
Group c. Research Topics
The work should present novel ideas or methods if it is to be claimed as an
excellent piece of research. You should discuss the novelty and
significance of your ideas or methods with the instructor.
Group d. Building an environment
You will help install, implement, or maintain data mining
systems. The end result of an implementation project should be a
self-contained system or component that is thoroughly tested and of
real use. You should discuss the details with the instructor.
Latest update: August 16, 2000