CSE 591 Data Mining
Instructor: Huan Liu
=============== LEARNING, INNOVATING via PRACTICE =============
COURSE AND OBJECTIVES
This is a graduate-level seminar course, and most topics are drawn from
ongoing research. Active class participation is expected. The course is
designed to encourage students to actively learn advanced
concepts, to think independently about research and development issues,
to relate what we learn to real problems in practice,
to brainstorm new ideas, to solve pressing
problems in the various phases of data mining, and to create and implement
systems or components that can be used and reused for data mining.
You are welcome to contribute papers that you find significant and
interesting.
SCHEDULE (tentative)
- Introduction and Organization
- Classification: ID3 Quinlan86
- Classification: NBC & k-NN Domingos&Pazzani96
- Classification: Neural Networks Lu et al95
- Classification: MetaCost Domingos99
- Evaluation: Performance Measures statistics book
- Evaluation: Comparison Lim et al99
- Preprocessing: Feature Selection Dash-Liu97
- Preprocessing: Discretization Hussain et al99
- Preprocessing: Sampling and Its Application Gu et al00
- Clustering: k-Means and EM Meila-Heckerman98
- Clustering: Hierarchical Clustering Fisher96
- Clustering: BIRCH Zhang-etal96
- Clustering: DBSCAN Ester-etal98
- Clustering: Co-Train Blum-Mitchell98
- Association: Rule Mining Agrawal-Srikant94
- Association: Multi-level Rules Han-Fu95
- Association: Mining without Candidate Generation Han-etal00
- Association: Parallel Mining Agrawal etal96, Han etal97
- Data Warehousing: Schemas and Data Cubes Garcia-Molina99, Widom95
- Data Warehousing: Data Marting and Metadata Chapters 6-9, Anahory-Murray
- Semi-Structured Data: XML, etc. Deutsch etal99
- Semi-Structured Data: RDF, Meta-Data Mutsumi00
- Web Mining: Extracting Patterns and Relations Brin98
- Web Mining: Dynamic Data Mining via Sampling Brin-Page
- Real-World Challenges: TBA
- Real-World Challenges: Customer Retention
C1 Project Presentation and Discussion
C2 Project Presentation and Discussion
C3 Project Presentation and Discussion
C4 Project Presentation and Discussion
Classification
--------------
Quinlan86
J.R. Quinlan, ``Induction of Decision Trees'', Machine Learning, 1:81-106,
1986.
Domingos&Pazzani96
P. Domingos and M. Pazzani, ``Beyond Independence: Conditions for
the Optimality of the Simple Bayesian Classifier'', Proceedings of
the 13th International Conference on Machine Learning. 1996. 105-112.
ps.gz
Lu et al95
H. Lu, R. Setiono and H. Liu, ``NeuroRule: A Connectionist Approach to
Data Mining'', International Conference on Very Large Databases
(VLDB'95), Sept. 1995, Zurich, Switzerland.
ps
Domingos99
P. Domingos, ``MetaCost: A General Method for Making Classifiers
Cost-Sensitive''. Proceedings of the Fifth International Conference
on Knowledge Discovery and Data Mining (pp. 155-164), 1999.
San Diego, CA.
ps.gz
Evaluation
----------
Witten&Frank00
Witten, I.H. and Frank, E. ``Data Mining - Practical Machine
Learning Tools and Techniques with JAVA Implementations'',
Morgan Kaufmann, 2000.
Lim et al99
Lim, T.-S., Loh, W.-Y. and Shih, Y.-S. ``A Comparison of Prediction
Accuracy, Complexity, and Training Time of Thirty-three Old and
New Classification Algorithms'', Machine Learning. Forthcoming.
(Appendix containing complete tables of error rates, ranks, and
training times; download the data sets in C4.5 (ASCII) format)
link
Preprocessing
-------------
Dash-Liu97
M. Dash and H. Liu, ``Feature Selection for Classification''.
Intelligent Data Analysis - An International Journal,
Elsevier, Vol. 1, No. 3, 1997
ps
Hussain et al99
F. Hussain, H. Liu, C.L. Tan, and M. Dash, ``Discretization:
An Enabling Technique'', Technical Report, TRC6/99,
School of Computing, National University of Singapore, 1999.
link
Gu et al00
B. Gu, F. Hu, and H. Liu, ``Sampling and Its Application to
Data Mining: A Survey'', Technical Report, TRA6/00,
School of Computing, National University of Singapore, 2000.
link
Liu-etal01
Database Selection
ps
Clustering
----------
Meila-Heckerman98
M. Meila and D. Heckerman, ``An Experimental Comparison of Several
Clustering and Initialization Methods'', Technical Report,
MSR-TR-98-06, Microsoft Research, Redmond, WA.
ps
Fisher96
D.H. Fisher, ``Iterative Optimization and Simplification of Hierarchical
Clusterings'', JAIR, 4, 147-178, 1996.
ps
Zhang-etal96
T. Zhang, R. Ramakrishnan, and M. Livny, ``BIRCH: An Efficient Data
Clustering Method for Very Large Databases'', SIGMOD96, 103-114.
Ester-etal98
DBSCAN
html
Blum-Mitchell98
A. Blum and T. Mitchell, ``Combining Labeled and Unlabeled Data with
Co-Training'', Proceedings of the 11th Annual Conference on
Computational Learning Theory, pages 92--100, 1998.
ps.gz
Association
-----------
Agrawal-Srikant94
R. Agrawal and R. Srikant, ``Fast Algorithms for Mining Association
Rules'', Proceedings International Conference on Very Large
Data Bases, Santiago, Chile, September, 1994, 487-499.
ps
Han-Fu95
J. Han and Y. Fu. ``Discovery of Multiple-level Association Rules from
Large databases''. Proceedings International Conference on Very Large
Data Bases, Zurich, Switzerland, September 1995, 420-431.
ps
Han-etal00
J. Han, J. Pei, and Y. Yin, ``Mining Frequent Patterns without Candidate
Generation'', Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data
(SIGMOD'00).
link
Agrawal etal96
R. Agrawal and J.C. Shafer, ``Parallel Mining of Association Rules'',
IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6,
December 1996.
ps
Han etal97
E-H. Han, G. Karypis and V. Kumar, ``Scalable Parallel Data Mining for
Association Rules''. Proceedings ACM SIGMOD International Conference on
Management of Data, May 1997.
ps
Real-World Applications
-----------------------
Ng etal98
K.S. Ng, H. Liu, and H.B. Kwah, ``A Data Mining Application:
Customer Retention at the Port of Singapore Authority (PSA)'',
SIGMOD'98, Industrial Track, June 1-4, 1998. Seattle, Washington, USA
Data Warehousing
----------------
Garcia-Molina99
H. Garcia-Molina, W. J. Labio, J. L. Wiener, Y. Zhuge. ``Distributed and
Parallel Computing Issues in Data Warehousing'' (invited talk). To
appear in Proceedings of ACM Principles of Distributed Computing
Conference, 1999.
link
Widom95
J. Widom, ``Research Problems in Data Warehousing'', Proceedings 4th
International Conference on Information and Knowledge Management,
Baltimore, Maryland, Nov. 1995, 25-30.
link
Anahory-Murray
Data Warehousing in the Real World - A Practical Guide for
Building Decision Support Systems
Web Data and Mining
-------------------
Deutsch etal99, ``A Query Language for XML'', Proceedings of the
Eighth International World Wide Web Conference (WWW8), 1999
ps
Candan etal01, ``Resource Description Framework: Metadata and Its
Applications'', SIGKDD Explorations, July 2001. 3(1):6-19.
Abstract and Link
Mutsumi00, RDF Resources
Lass97, Introduction to RDF Metadata,
html, 1997
Bern98a, What the Semantic Web can represent
html, 1998
Bern98b, Why RDF model is different from the XML model
html, 1998
Brin98, ``Extracting Patterns and Relations from the World-Wide
Web'', WebDB Workshop at EDBT, 1998
ps
Brin-Page, ``Dynamic Data Mining'',
ps
Brin-Page98, ``The Anatomy of a Large-Scale Hypertextual Web Search
Engine'', WWW7 / Computer Networks 30(1-7): 107-117 (1998)
htm
Kosala-Blockeel00, ``Web Mining Research: A Survey'', SIGKDD
Explorations, Volume 2, Issue 1: 1-15, 2000.
link
PROJECT TOPICS
The following topics are suggested for your reference. Students
can work on one of these topics or, after discussing it with the
instructor, tackle another significant problem of their own choosing.
o Image Mining
o Data Streaming
o Web Mining
o Data Selection (Focusing)
o Machine Learning (Co-training, ...)
o Scalable DM algorithms (Parallel Processing, Data Swarms)
o ...
Other paper links
o http://www.cs.rpi.edu/~zaki/dmcourse/papers.html
o ...
Latest update: Jan 30, 2002