CSE 591 Data Mining

Instructor: Huan Liu

=============== LEARNING, INNOVATING via PRACTICE =============

COURSE AND OBJECTIVES

This is a graduate-level seminar course; most topics come from ongoing research, and active class participation is expected. The course is designed to encourage students to learn advanced concepts actively, to think independently about research and development issues, to relate what we learn to real problems in practice, to brainstorm new ideas, to solve pressing problems in the various phases of data mining, and to create and implement systems or components that can be used and reused for data mining. You are welcome to contribute papers that you find significant and interesting.

SCHEDULE (tentative)

C1  Project Presentation and Discussion
C2  Project Presentation and Discussion
C3  Project Presentation and Discussion
C4  Project Presentation and Discussion

Classification
--------------
Quinlan86
    J.R. Quinlan, ``Induction of Decision Trees'', Machine Learning, 1:81-106, 1986.
Domingos&Pazzani96
    P. Domingos and M. Pazzani, ``Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier'', Proceedings of the 13th International Conference on Machine Learning, 1996, 105-112. [ps.gz]
Lu et al95
    H. Lu, R. Setiono and H. Liu, ``NeuroRule: A Connectionist Approach to Data Mining'', International Conference on Very Large Databases (VLDB'95), Oct. 1995, Zurich, Switzerland. [ps]
Domingos99
    P. Domingos, ``MetaCost: A General Method for Making Classifiers Cost-Sensitive'', Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (pp. 155-164), 1999. San Diego, CA. [ps.gz]

Evaluation
----------
Witten&Frank00
    I.H. Witten and E. Frank, ``Data Mining - Practical Machine Learning Tools and Techniques with JAVA Implementations'', Morgan Kaufmann, 2000.
Lim et al99
    T.-S. Lim, W.-Y. Loh and Y.-S. Shih, ``A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms'', Machine Learning. Forthcoming. (Appendix containing complete tables of error rates, ranks, and training times; download the data sets in C4.5 (ASCII) format) [link]

Preprocessing
-------------
Dash-Liu97
    M. Dash and H. Liu, ``Feature Selection for Classification'', Intelligent Data Analysis - An International Journal, Elsevier, Vol. 1, No. 3, 1997. [ps]
Hussain et al99
    F. Hussain, H. Liu, C.L. Tan, and M. Dash, ``Discretization: An Enabling Technique'', Technical Report, TRC6/99, School of Computing, National University of Singapore, 1999. [link]
Gu et al00
    B. Gu, F. Hu, and H. Liu, ``Sampling and Its Application to Data Mining: A Survey'', Technical Report, TRA6/00, School of Computing, National University of Singapore, 2000. [link]
Liu-etal01
    Database Selection. [ps]

Clustering
----------
Meila-Heckerman98
    M. Meila and D. Heckerman, ``An Experimental Comparison of Several Clustering and Initialization Methods'', Technical Report, MSR-TR-98-06, Microsoft Research, Redmond, WA. [ps]
Fisher96
    D.H. Fisher, ``Iterative Optimization and Simplification of Hierarchical Clusterings'', JAIR, 4, 147-178, 1996. [ps]
Zhang-etal96
    T. Zhang, R. Ramakrishnan, and M. Livny, ``BIRCH: An Efficient Data Clustering Method for Very Large Databases'', SIGMOD96, 103-114.
Ester et al98
    DBSCAN. [html]
Blum-Mitchell98
    A. Blum and T. Mitchell, ``Combining Labeled and Unlabeled Data with Co-Training'', Proceedings of the 11th Annual Conference on Computational Learning Theory, pages 92-100, 1998. [ps.gz]

Association
-----------
Agrawal-Srikant94
    R. Agrawal and R. Srikant, ``Fast Algorithms for Mining Association Rules'', Proceedings International Conference on Very Large Data Bases, Santiago, Chile, September 1994, 487-499. [ps]
Han-Fu95
    J. Han and Y. Fu, ``Discovery of Multiple-level Association Rules from Large Databases'', Proceedings International Conference on Very Large Data Bases, Zurich, Switzerland, September 1995, 420-431. [ps]
Han-etal00
    J. Han, J. Pei, and Y. Yin, ``Mining Frequent Patterns without Candidate Generation'', Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'00). [link]
Agrawal etal96
    R. Agrawal and J.C. Shafer, ``Parallel Mining of Association Rules'', IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996. [ps]
Han etal97
    E-H. Han, G. Karypis and V. Kumar, ``Scalable Parallel Data Mining for Association Rules'', Proceedings ACM SIGMOD International Conference on Management of Data, May 1997. [ps]

Real-World Applications
-----------------------
Ng etal98
    K.S. Ng, H. Liu, and H.B. Kwah, ``A Data Mining Application: Customer Retention at the Port of Singapore Authority (PSA)'', SIGMOD'98, Industrial Track, June 1-4, 1998, Seattle, Washington, USA.

Data Warehousing
----------------
Garcia-Molina99
    H. Garcia-Molina, W.J. Labio, J.L. Wiener, and Y. Zhuge, ``Distributed and Parallel Computing Issues in Data Warehousing'' (invited talk). To appear in Proceedings of ACM Principles of Distributed Computing Conference, 1999. [link]
Widom95
    J. Widom, ``Research Problems in Data Warehousing'', Proceedings 4th International Conference on Information and Knowledge Management, Baltimore, Maryland, Nov. 1995, 25-30. [link]
Anahory-Murray
    Data Warehousing in the Real World - A Practical Guide for Building Decision Support Systems.

Web Data and Mining
-------------------
Deutsch etal99
    ``A Query Language for XML'', Proceedings of the Eighth International World Wide Web Conference (WWW8), 1999. [ps]
Candan etal01
    ``Resource Description Framework: Metadata and Its Applications'', SIGKDD Explorations, July 2001, 3(1):6-19. [Abstract and Link]
Mutsumi00
    RDF Resources.
Lass97
    Introduction to RDF Metadata, 1997. [html]
Bern98a
    What the Semantic Web can represent, 1998. [html]
Bern98b
    Why RDF model is different from the XML, 1998. [html]
Brin98
    ``Extracting Patterns and Relations from the World-Wide Web'', WebDB Workshop at EDBT, 1998. [ps]
Brin-Page
    ``Dynamic Data Mining''. [ps]
Brin-Page98
    ``The Anatomy of a Large-Scale Hypertextual Web Search Engine'', WWW7 / Computer Networks 30(1-7): 107-117 (1998). [htm]
Kosala-Blockeel00
    ``Web Mining Research: A Survey'', SIGKDD Explorations, Volume 2, Issue 1: 1-15, 2000. [link]

PROJECT TOPICS

The following topics are suggested for your reference. Students may work on one of these topics or, after discussing it with the lecturer, on another substantial problem of their own choosing.

o Image Mining
o Data Streaming
o Web Mining
o Data Selection (Focusing)
o Machine Learning (Co-training, ...)
o Scalable DM algorithms (Parallel Processing, Data Swarms)
o ...

Other paper links

o http://www.cs.rpi.edu/~zaki/dmcourse/papers.html
o ...

Latest update: Jan 30, 2002