CSE 591 Data Mining
Instructor: Huan Liu

=============== LEARNING, INNOVATING via PRACTICE =============

COURSE AND OBJECTIVES
--------------------------------------------------------------------------------
The course is at the graduate level, and most topics are ongoing research work. It is a seminar course, and active class participation is expected. It is designed to encourage students to actively learn advanced concepts, to think independently about research and development issues, to proactively relate what we learn to real problems in practice, to stimulate and brainstorm new ideas, to intelligently solve pressing problems in the various phases of data mining, and to create and implement systems or components that can be used and reused for data mining. You are welcome to contribute papers of significance and interest from your perspective.
--------------------------------------------------------------------------------

SCHEDULE (tentative)
--------------------------------------------------------------------------------

Classification
--------------
Quin86  J.R. Quinlan, ``Induction of Decision Trees'', Machine Learning, 1:81-106, 1986.
Domi-Pazz96  P. Domingos and M. Pazzani, ``Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier'', Proceedings of the 13th International Conference on Machine Learning, 105-112, 1996.
Lu-etal95  H. Lu, R. Setiono and H. Liu, ``NeuroRule: A Connectionist Approach to Data Mining'', International Conference on Very Large Databases (VLDB'95), Oct. 1995, Zurich, Switzerland.
Tipp00  M. Tipping, "The relevance vector machine", In Advances in Neural Information Processing Systems, San Mateo, CA, 2000.
Cort-Vapn95  C. Cortes and V. Vapnik, "Support-Vector Networks", Machine Learning Journal, Volume 20, Number 3: 273-297, 1995.
Burg98  C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery, 2(2): 121-167, 1998.

Evaluation
----------
Witt-Fran00  I.H. Witten and E. Frank, ``Data Mining - Practical Machine Learning Tools and Techniques with JAVA Implementations'', Morgan Kaufmann, 2000.
Lim-etal99  T.-S. Lim, W.-Y. Loh and Y.-S. Shih, ``A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms'', Machine Learning. Forthcoming. (Appendix containing complete tables of error rates, ranks, and training times; download the data sets in C4.5 (ASCII) format)

Preprocessing
-------------
Dash-Liu97  M. Dash and H. Liu, ``Feature Selection for Classification'', Intelligent Data Analysis - An International Journal, Elsevier, Vol. 1, No. 3, 1997.
Liu-etal02  H. Liu, F. Hussain, C.L. Tan, and M. Dash, "Discretization: An Enabling Technique", Journal of Data Mining and Knowledge Discovery 6(4): 393-423, Oct 2002.
Gu-etal00  B. Gu, F. Hu, and H. Liu, ``Sampling and Its Application to Data Mining: A Survey'', Technical Report, TRA6/00, School of Computing, National University of Singapore, 2000.
Liu-etal01  Database Selection

Clustering
----------
Meil-Heck98  M. Meila and D. Heckerman, ``An Experimental Comparison of Several Clustering and Initialization Methods'', Technical Report, MSR-TR-98-06, Microsoft Research, Redmond, WA.
Fish96  D.H. Fisher, ``Iterative Optimization and Simplification of Hierarchical Clusterings'', JAIR, 4, 147-178, 1996.
Zhan-etal96  T. Zhang, R. Ramakrishnan, and M. Livny, ``BIRCH: An Efficient Data Clustering Method for Very Large Databases'', SIGMOD96, 103-114.
Este-etal98  DBSCAN
Hame-Elka03  G. Hamerly and C. Elkan, "Learning the k in k-means", Neural Information Processing Systems, 2003.
Chen-etal02  G. Chen, M. S. H. Ko and M. Q. Zhang, "Evaluation and Comparison of Clustering Algorithms in Analyzing ES Cell gene expression data", Statistica Sinica 12, 2002.
Meil-Heck98  M. Meila and D.
Heckerman, "An Experimental Comparison of Several Clustering and Initialization Methods", In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 386-395, 1998.

Association
-----------
Agra-Srik94  R. Agrawal and R. Srikant, ``Fast Algorithms for Mining Association Rules'', Proceedings International Conference on Very Large Data Bases, Santiago, Chile, September, 1994, 487-499.
Han-etal00  J. Han, J. Pei, and Y. Yin, ``Mining Frequent Patterns without Candidate Generation'', Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'00).
Agra-etal96  R. Agrawal and J.C. Shafer, ``Parallel Mining of Association Rules'', IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, December 1996.
Han-etal97  E-H. Han, G. Karypis and V. Kumar, ``Scalable Parallel Data Mining for Association Rules'', Proceedings ACM SIGMOD International Conference on Management of Data, May 1997.
Cohe-etal00  E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman and C. Yang, ``Finding Interesting Associations without Support Pruning'', in Proceedings of the 16th International Conference on Data Engineering, 28 February - 3 March, 2000, San Diego, California.
Zaia-etal01  O. R. Zaiane, M. El-Hajj and P. Lu, "Fast Parallel Association Rule Mining without Candidacy Generation", In Proceedings of ICDM, 665-668, 2001.

Real-World Applications
-----------------------
Ng-etal98  K.S. Ng, H. Liu, and H.B. Kwah, ``A Data Mining Application: Customer Retention at the Port of Singapore Authority (PSA)'', SIGMOD'98, Industrial Track, June 1-4, 1998, Seattle, Washington, USA.

Co-Training
-----------
Blum-Mitc98  A. Blum and T. Mitchell, ``Combining Labeled and Unlabeled Data with Co-Training'', Proceedings of the 11th Annual Conference on Computational Learning Theory, pages 92-100, 1998.
Niga-etal00  K. Nigam, A. K. McCallum, S. Thrun and T. M.
Mitchell, "Text Classification from Labeled and Unlabeled Documents using EM", Machine Learning Journal, Volume 39, Number 2/3: 103-134, 2000.

Ensemble Methods
----------------
Brei96  Breiman, ``Bagging Predictors'', Machine Learning Journal, Volume 24, Number 2: 123-140, 1996.
Freu-Scap96  Freund and Schapire, ``Experiments with a new boosting algorithm'', Proceedings of the 13th International Conference on Machine Learning, 148-156, 1996.
Brei01  Breiman, ``Random Forests'', Technical Report, Statistics Department, University of California Berkeley, 2001.
Cunn00  P. Cunningham, "Overfitting and Diversity in Classification Ensembles based on Feature Selection", Technical Report, Department of Computer Science, Trinity College, Dublin, 2000.
Baue-Koha99  E. Bauer and R. Kohavi, "An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants", Machine Learning Journal, Volume 36, Number 1-2: 105-139, 1999.
Zhan-Yu03  T. Zhang and B. Yu, "Boosting with early stopping: convergence and consistency", Tech. Report 635, Department of Statistics, UC Berkeley, 2003.

Data Security
-------------
Lee-etal99  W. Lee, S. Stolfo, and K. Mok, ``A Data Mining Framework for Building Intrusion Detection Models'', In Proceedings of the 1999 IEEE Symposium on Security and Privacy, Oakland, CA, May 1999.
Schu-etal01  M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo, ``Data Mining Methods for Detection of New Malicious Executables'', IEEE Symposium on Security and Privacy, Oakland, CA, May 2001.
Agra-Rama00  R. Agrawal and R. Srikant, “Privacy-preserving data mining”, In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 439-450, 2000.
Kant-Vaid03  M. Kantarcioglu and J. Vaidya, “Privacy Preserving Naive Bayes Classifier for Horizontally Partitioned Data”, In IEEE ICDM Workshop on Privacy Preserving Data Mining, Melbourne, Florida, USA, 3-9, 2003.
Oliv-Zaïa03  S. R. M. Oliveira and O. R. Zaïane.
“Protecting Sensitive Knowledge By Data Sanitization”, In Proceedings of the Third IEEE International Conference on Data Mining (ICDM'03), Melbourne, Florida, USA, 613-616, 2003.
Port-etal01  L. Portnoy, E. Eskin and S. Stolfo, "Intrusion detection with unlabeled data using clustering", In ACM Workshop on Data Mining Applied to Security (DMSA), 2001.
Vaid-Clif03  J. Vaidya and C. Clifton, "Privacy-Preserving K-Means Clustering over Vertically Partitioned Data", In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 206-215, 2003.
Clif-etal02  C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin and M. Y. Zhu, "Tools for Privacy Preserving Distributed Data Mining", In SIGKDD Explorations, 4(2): 28-34, 2002.

Data Privacy
------------
Clif-Mark96  Clifton and Marks, ``Security and Privacy Implications of Data Mining'', In Proceedings of the ACM SIGMOD Workshop of Data Mining and Knowledge Discovery, 1996.
Agra-Srik00  R. Agrawal and R. Srikant, "Privacy-Preserving Data Mining", Proc. of the ACM SIGMOD Conference on Management of Data, Dallas, May 2000.
Pink02  B. Pinkas, "Cryptographic Techniques for Privacy-Preserving Data Mining", In SIGKDD Explorations, 4(2): 12-19, 2002.

BioInformatics
--------------
Swan97  D. R. Swanson and N. R. Smalheiser, "An interactive system for finding complementary literatures: a stimulus to scientific discovery", Artificial Intelligence 91(2): 183-203, 1997.
Swan01  D. R. Swanson, N. R. Smalheiser and A. Bookstein, "Information discovery from complementary literatures: categorizing viruses as possible weapons", Journal of the American Society for Information Science and Technology 52(10): 797-812, 2001.
KDD02  KDD Cup 2002 http://www.biostat.wisc.edu/~craven/kddcup/tasks.html
Page-etal02  D. Page, F. Zhan, J. Cussens, M. Waddell, J. Hardin, B. Barlogie and J.
Shaughnessy, Jr., "Comparative Data Mining for Microarrays: A Case Study Based on Multiple Myeloma", Technical Report, CS Department, University of Wisconsin, Nov 2002.
Fure-etal00  T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer and D. Haussler, "Support vector machine classification and validation of cancer tissue samples using microarray expression data", Bioinformatics, Volume 16, Number 10: 906-914, 2000.
Haut-etal03  S. Hautaniemi, H. Edgren, P. Vesanen, M. Wolf, A. K. Järvinen, O. Yli-Harja, J. Astola, O. Kallioniemi and O. Monni, "A novel strategy for microarray quality control using Bayesian networks", Bioinformatics Vol. 19 no. 16: 2031-2038, 2003.
Bicc-etal01  S. Bicciato, M. Pandin, G. Didone and C. Di Bello, "Analysis of an associative memory neural network for pattern identification in gene expression data", BIOKDD, 22-30, 2001.

Streaming Data
--------------
Domi-Hult00  P. Domingos and G. Hulten, "Mining high-speed data streams", In Proceedings of Knowledge Discovery and Data Mining, 71-80, 2000.
Hult-etal01  G. Hulten, L. Spencer and P. Domingos, "Mining Time-Changing Data Streams", Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 97-106, 2001.
Jin-Agra03  R. Jin and G. Agrawal, "Efficient Decision Tree Construction on Streaming Data", Proceedings of the ACM SIGKDD, 2003.
Babc-etal02  B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, "Models and Issues in Data Stream Systems", In Proceedings of PODS, ACM, 2002.
Babc-etal03  B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan, "Maintaining Variance and k-Medians over Data Stream Windows", In Proceedings of PODS, ACM, 2003.
Guha-etal00  S. Guha, N. Mishra, R. Motwani and L. O'Callaghan, "Clustering Data Streams", IEEE Symposium on Foundations of Computer Science, 359-366, 2000.

Additional Papers
-----------------
Chi-etal03  Y. Chi, Y. Yang, and R. R.
Muntz, "Indexing and Mining Free Trees", Proceedings of the Third IEEE International Conference on Data Mining (ICDM), 2003.
Ordo-Omie99  C. Ordonez and E. Omiecinski, "Discovering Association Rules based on Image Content", Proceedings of the IEEE Advances in Digital Libraries Conference, 38-49, 1999.
Zaia-etal00  O. R. Zaiane, J. Han and H. Zhu, "Mining Recurrent Items in Multimedia with Progressive Resolution Refinement", Proceedings of the IEEE International Conference on Data Engineering, 461-470, 2000.
Zaki-Agga03  M. J. Zaki and C. C. Aggarwal, "XRules: An Effective Structural Classifier for XML Data", KDD, 2003.
--------------------------------------------------------------------------------

PROJECT TOPICS

The following list of topics is suggested for your reference. Students may work on one of these topics or, after discussing it with the lecturer, take on another substantial problem of their own choosing.

o
o
o
o
o
o

Other paper links

o http://www.cs.rpi.edu/~zaki/dmcourse/papers.html
o ...

Latest update: Feb 4, 2004