CSE 591 Data Mining

Instructor: Huan Liu

=============== LEARNING, INNOVATING via PRACTICE =============

COURSE AND OBJECTIVES

This is a graduate-level seminar course. Most topics are ongoing research work, and active classroom participation is expected. The course is designed to encourage students to learn advanced concepts actively, to think independently about research and development issues, to relate what we learn to real problems in practice, to brainstorm new ideas, to solve pressing problems in the various phases of data mining, and to create and implement systems or components that can be used and reused for data mining.

FORMAT

To achieve these objectives, the course is divided into several parts:

Part 1. Brief introduction and review
Part 2. Paper presentation, questions to think about, problems to solve
Part 3. Project and presentation
    Projects can be done in groups or individually.
    Projects can be one of the following types:
    a. Survey
    b. Real-world problem identification, analysis and solution
    c. Research and implementation
    d. Building a data mining environment
Part 4. Quizzes only

SCHEDULE (tentative)

C1  Proposal Presentation and Discussion
C2  Proposal Presentation and Discussion
C3  Project Presentation and Discussion
C4  Project Presentation and Discussion
C5  Project Presentation and Discussion
C6  Project Presentation and Discussion

Classification
--------------
Quinlan86
    J.R. Quinlan, ``Induction of Decision Trees'', Machine Learning, 1:81-106, 1986.
Domingos&Pazzani96
    P. Domingos and M. Pazzani, ``Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier'', Proceedings of the 13th International Conference on Machine Learning, 1996, 105-112.
Lu et al95
    H. Lu, R. Setiono and H. Liu, ``NeuroRule: A Connectionist Approach to Data Mining'', International Conference on Very Large Databases (VLDB'95), Oct. 1995, Zurich, Switzerland.

Evaluation
----------
Witten&Frank00
    I.H. Witten and E. Frank, ``Data Mining - Practical Machine Learning Tools and Techniques with JAVA Implementations'', Morgan Kaufmann, 2000.
Lim et al99
    T.-S. Lim, W.-Y. Loh and Y.-S. Shih, ``A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms'', Machine Learning. Forthcoming. (Appendix containing complete tables of error rates, ranks, and training times; data sets available in C4.5 (ASCII) format.)

Preprocessing
-------------
Dash-Liu97
    M. Dash and H. Liu, ``Feature Selection for Classification'', Intelligent Data Analysis - An International Journal, Elsevier, Vol. 1, No. 3, 1997.
Hussain et al99
    F. Hussain, H. Liu, C.L. Tan, and M. Dash, ``Discretization: An Enabling Technique'', Technical Report, TRC6/99, School of Computing, National University of Singapore, 1999.
Gu et al00
    B. Gu, F. Hu, and H. Liu, ``Sampling and Its Application to Data Mining: A Survey'', Technical Report, TRA6/00, School of Computing, National University of Singapore, 2000.

Clustering
----------
Meila-Heckerman98
    M. Meila and D. Heckerman, ``An Experimental Comparison of Several Clustering and Initialization Methods'', Technical Report, MSR-TR-98-06, Microsoft Research, Redmond, WA.
Fisher96
    D.H. Fisher, ``Iterative Optimization and Simplification of Hierarchical Clusterings'', JAIR, 4, 147-178, 1996.
Zhang et al96
    T. Zhang, R. Ramakrishnan, and M. Livny, ``BIRCH: An Efficient Data Clustering Method for Very Large Databases'', SIGMOD'96, 103-114.

Association
-----------
Agrawal-Srikant94
    R. Agrawal and R. Srikant, ``Fast Algorithms for Mining Association Rules'', Proceedings International Conference on Very Large Data Bases, Santiago, Chile, September 1994, 487-499.
Han-Fu95
    J. Han and Y. Fu, ``Discovery of Multiple-level Association Rules from Large Databases'', Proceedings International Conference on Very Large Data Bases, Zurich, Switzerland, September 1995, 420-431.
Agrawal et al96
    R. Agrawal and J.C. Shafer, ``Parallel Mining of Association Rules'', IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996.
Han et al97
    E-H. Han, G. Karypis and V. Kumar, ``Scalable Parallel Data Mining for Association Rules'', Proceedings ACM SIGMOD International Conference on Management of Data, May 1997.

Real-World Applications
-----------------------
Ng et al98
    K.S. Ng, H. Liu, and H.B. Kwah, ``A Data Mining Application: Customer Retention at the Port of Singapore Authority (PSA)'', SIGMOD'98, Industrial Track, June 1-4, 1998, Seattle, Washington, USA.

Data Warehousing
----------------
Garcia-Molina99
    H. Garcia-Molina, W.J. Labio, J.L. Wiener, and Y. Zhuge, ``Distributed and Parallel Computing Issues in Data Warehousing'' (invited talk). To appear in Proceedings of ACM Principles of Distributed Computing Conference, 1999.
Widom95
    J. Widom, ``Research Problems in Data Warehousing'', Proceedings 4th International Conference on Information and Knowledge Management, Baltimore, Maryland, Nov. 1995, 25-30.
Anahory-Murray
    ``Data Warehousing in the Real World - A Practical Guide for Building Decision Support Systems''

Web Mining
----------
Deutsch et al99
    ``A Query Language for XML'', Proceedings of the Eighth International World Wide Web Conference (WWW8), 1999.
Mutsumi00
    RDF Resources.
Lass97
    Introduction to RDF Metadata, 1997.
Bern98a
    What the Semantic Web can represent, 1998.
Bern98b
    Why RDF model is different from the XML, 1998.
Brin98
    ``Extracting Patterns and Relations from the World-Wide Web'', WebDB Workshop at EDBT, 1998.
Brin-Page
    ``Dynamic Data Mining''.
Brin-Page98
    ``The Anatomy of a Large-Scale Hypertextual Web Search Engine'', WWW7 / Computer Networks 30(1-7): 107-117 (1998).
Kosala-Blockeel00
    ``Web Mining Research: A Survey'', SIGKDD Explorations, Volume 2, Issue 1: 1-15, 2000.

======================================================

PROJECTS

The following topics are suggested for your reference. Students can choose to work on one of them or, after consulting with the lecturer, to attack other big problems that they propose themselves.

Group a. Survey

You should choose this type of project only if you are quite familiar with the subfield you wish to survey; otherwise, you are advised not to do it. A survey should follow the style of ACM Computing Surveys. Extensiveness, comprehensibility, and technical worthiness are the major considerations. The survey should be in good enough shape to lead to tutorial material or a journal paper.

Group b. Real-world problem identification, analysis and solution

You are encouraged to study your organization's problems and needs for data mining. Ng et al98 is a good example of real-world data mining. You should discuss your detailed plan with the lecturer.

Group c. Research Topics

Novel ideas or methods are needed for the work to be claimed as an excellent piece of research. You should discuss the novelty and significance of your ideas or methods with the instructor.

Group d. Building an environment

You will help install, implement, or maintain some data mining systems. The end result of an implementation project should be a self-contained system or component that is thoroughly tested and of real use. You should discuss the details with the instructor.

Latest update: August 16, 2000