Beyond Feature Selection and Extraction

- An Integrated Framework for High-Dimensional Data of Small Labeled Samples

Description

High-dimensional data is ubiquitous in real-world applications, from text categorization to image processing to Web search. Because labeling is expensive, labeled data is scarce, which motivates machine learning approaches beyond the classic classification and clustering paradigms. Semi-supervised learning is one such approach: it has shown potential for handling data with small labeled samples and for reducing the need for expensive labeled data. However, high-dimensional data with small labeled samples permits too large a hypothesis space with too few constraints (labeled instances). The combination of these two characteristics poses a new research challenge. Drawing on computational and statistical learning theory, we analyze the specific challenges presented by such data, present preliminary studies, and argue for integrating feature selection and extraction in a novel framework that reduces the hypothesis space. We propose to design efficient, novel algorithms and to conduct theoretical and empirical studies of the complex relationships between high-dimensional data and classification performance.

Publications

Journal Articles

Z. Zhao, L. Wang, H. Liu and J. Ye. "On Similarity Preserving Feature Selection", IEEE Transactions on Knowledge and Data Engineering (TKDE), forthcoming.
L. Yuan, Y. Wang, P. Thompson, V. Narayan and J. Ye, Multi-source Feature Learning for Joint Analysis of Incomplete Multiple Heterogeneous Neuroimaging Data, NeuroImage Volume 61, Issue 3, 2 July 2012, Pages 622-632
Z. Zhao and H. Liu. "Multi-Source Feature Selection via Geometry-Dependent Covariance Analysis", JMLR Workshop and Conference Proceedings Volume 4: New challenges for feature selection in data mining and knowledge discovery, 4:36-47, 2008
Z. Zhao and H. Liu. "Searching for Interacting Features in Subset Selection", Intelligent Data Analysis - An International Journal, 13:207-228, 2009.
M. Berens, H. Liu, L. Parsons, L. Yu, and Z. Zhao. "Fostering Biological Relevance in Feature Selection for Microarray Data", Trends and Controversies, [PDF], pp. 71-73, November/December 2005, IEEE Intelligent Systems.
H. Liu and L. Yu. "Toward Integrating Feature Selection Algorithms for Classification and Clustering", IEEE Trans. on Knowledge and Data Engineering, pdf, 17(4), 491-502, 2005.
Jieping Ye, Jianhui Chen, Ravi Janardan, and Sudhir Kumar. Developmental Stage Annotation of Drosophila Gene Expression Pattern Images via an Entire Solution Path for LDA. ACM Transactions on Knowledge Discovery from Data, Special Issue on Bioinformatics, Vol. 2, No. 1, pp. 1-21, 2008. [PDF]

Conferences and Workshops
- J. Tang and H. Liu. Unsupervised Feature Selection for Linked Social Media Data, The ACM SIGKDD International Conference On Knowledge Discovery and Data Mining (SIGKDD 2012). [PDF]
- J. Tang and H. Liu. Feature Selection with Linked Data in Social Media, SIAM International Conference on Data Mining (SDM2012) [PDF]
- L. Yuan, Y. Wang, P. Thompson, V. Narayan and J. Ye, Multi-Source Learning for Joint Analysis of Incomplete Multi-Modality Neuroimaging Data, The ACM SIGKDD International Conference On Knowledge Discovery and Data Mining (SIGKDD 2012)
- L. Yuan, J. Liu and J. Ye, Efficient Methods for Overlapping Group Lasso, Twenty-Fifth Annual Conference on Neural Information Processing Systems (NIPS 2011)
- J. Liu, L. Yuan, and J. Ye, An Efficient Algorithm for a Class of Fused Lasso Problems, The Sixteenth ACM SIGKDD International Conference On Knowledge Discovery and Data Mining (SIGKDD 2010).
- Z. Zhao, L. Wang, and H. Liu. Efficient Spectral Feature Selection with Minimum Redundancy. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI), 2010 . [PDF, Supplementary]
- Z. Zhao, J. Wang, S. Sharma, N. Agarwal, H. Liu, and Y. Chang. An Integrative Approach to Identifying Biologically Relevant Genes. In Proceedings of SIAM International Conference on Data Mining (SDM), 2010. [PDF]
- Z. Zhao, J. Wang, H. Liu, and Y. Chang. Biological relevance detection via network dynamic analysis. In Proceedings of 2nd International Conference on Bioinformatics and Computational Biology (BICoB), 2010. BEST PAPER AWARD [PDF]
- L. Sun, B. Ceran, and J. Ye. A Scalable Two-Stage Approach for a Class of Dimensionality Reduction Techniques. The Sixteenth ACM SIGKDD International Conference On Knowledge Discovery and Data Mining (SIGKDD 2010).
- J. Chen, J. Liu, and J. Ye. Learning Incoherent Sparse and Low-Rank Patterns from Multiple Tasks. The Sixteenth ACM SIGKDD International Conference On Knowledge Discovery and Data Mining (SIGKDD 2010).
- H. Liu, H. Motoda, R. Setiono, and Z. Zhao. Feature Selection: An Ever Evolving Frontier in Data Mining, Journal of Machine Learning Research, Workshop and Conference Proceedings Volume 10, 10:4-13, 2010.[PDF]
- L. Sun, J. Liu, J. Chen, and J. Ye. Efficient Recovery of Jointly Sparse Vectors. The Twenty-Third Annual Conference on Neural Information Processing Systems (NIPS 2009). [PDF]
- J. Liu, S. Ji, and J. Ye. Multi-task Feature Learning via Efficient L2,1-Norm Minimization. The Twenty-fifth Conference on Uncertainty in Artificial Intelligence (UAI 2009).[PDF]
- J. Liu, J. Chen, and J. Ye. Large-Scale Sparse Logistic Regression. The Fifteenth ACM SIGKDD International Conference On Knowledge Discovery and Data Mining (SIGKDD 2009), pp. 547-556.
- L. Sun, S. Ji, and J. Ye. A Least Squares Formulation for a Class of Generalized Eigenvalue Problems in Machine Learning. The Twenty-Sixth International Conference on Machine Learning (ICML 2009). [PDF]
- S. Ji and J. Ye. Linear Dimensionality Reduction for Multi-label Classification. The Twenty-first International Joint Conference on Artificial Intelligence (IJCAI 2009).[PDF]
- Z. Zhao, J. Wang, S. Sharma, N. Agarwal, H. Liu and Y. Chang. "A Knowledge-Oriented Framework for Gene Selection", Poster. Tucson, Arizona, May 18-21. RECOMB'09
- Z. Zhao, L. Sun, S. Yu, H. Liu, J. Ye. "Multiclass Probabilistic Kernel Discriminant Analysis", IJCAI'09 [PDF]

- Z. Zhao, J. Wang, H. Liu, J. Ye, and Y. Chang. "Identifying Biologically Relevant Genes via Multiple Heterogeneous Data Sources", KDD'08: 839-847. [PDF]
- Z. Zhao and H. Liu. "Spectral Feature Selection for Supervised and Unsupervised Learning", International Conference on Machine Learning (ICML-07), June 20-24, 2007, Corvallis, Oregon. [PDF]
- Z. Zhao and H. Liu. "Semi-supervised Feature Selection via Spectral Analysis", SIAM International Conference on Data Mining (SDM-07), April 26-28, 2007, Minneapolis, Minnesota. [PDF]
- Z. Zhao and H. Liu. "Searching for Interacting Features", The 20th International Joint Conference on AI (IJCAI-07), January 6-12, Hyderabad, India. [PDF]. Software available.
- Jieping Ye. Least Squares Linear Discriminant Analysis. The Twenty-Fourth International Conference on Machine Learning (ICML 2007), pp. 1087-1093. Technical Report TR-06-003, Department of Computer Science and Engineering, Arizona State University, March 2006. [PDF]

Books or Chapters

Z. Zhao and H. Liu, "Spectral Feature Selection for Data Mining", December 2011, ISBN 978-1439862094, by Chapman and Hall/CRC
Huan Liu and Hiroshi Motoda, "Feature Selection for Knowledge Discovery and Data Mining", July 1998, ISBN 0-7923-8198-X, by Kluwer Academic Publishers
Huan Liu and Hiroshi Motoda, “Computational Methods of Feature Selection”, editors, 2008, Chapman and Hall/CRC Press.
H. Liu and Z. Zhao. "Manipulating Data and Dimensionality Reduction Methods: Feature Selection", in Encyclopedia of Complexity and Systems Science, Robert Meyers (Ed.), Springer. 2009.
H. Liu. "Feature Selection: An Overview", in Encyclopedia of Machine Learning, Claude Sammut (Ed.), Springer. Forthcoming.
Z. Zhao and H. Liu. "On Interacting Features in Subset Selection", in Encyclopedia of Data Warehousing and Mining, 2nd Edition, Idea Group, Inc. pp 1079 -- 1084, September, 2008.

Technical Reports

Z. Zhao and H. Liu. "Semi-supervised Feature Selection via Spectral Analysis", Technical Report, TR-06-022, Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, 2006.
Y. Ye, L. Yu, and H. Liu. "Sparse Linear Discriminant Analysis", Technical Report, TR-06-010, Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, 2006.

Thesis

Z. Zhao. Spectral Feature Selection for Mining Ultrahigh Dimensional Data [PDF]

Resources

Related Activities

Workshop on Feature Selection in Data Mining (FSDM 10) [link]
The proceedings of FSDM 2010 have been published by JMLR Workshop and Conference Proceedings
Tutorial at SDM10: Mining Sparse Representations: Formulations, Algorithms, and Applications
SIAM Data Mining SDM 2007 Tutorial: Dimensionality Reduction for Data Mining - Techniques, Applications, and Trends
AAAI 2005 Tutorial: Notes on Downsizing Data for High Performance in Learning - Feature Selection Methods, pdf.zip.

Project Members

Huan Liu (PI)
Jieping Ye (Co-PI)
Zheng Zhao (Graduated with a PhD in 2010 and joined SAS Institute)
Salem Alelyani (PhD Student)
Lei Yuan (PhD Student)
Shashvata Sharma (Graduated with a Master's degree in 2009 and joined Microsoft)
Fred Morstatter (PhD Student)
Aneeth Anand (Graduated with a Master's degree in 2011 and joined PayPal)
Jiliang Tang (PhD Student)
Ling Yan (Undergraduate Student)
Qingyun Li (Undergraduate Student)
Shantanu Bala (Undergraduate Student)
Grant Marshall (Undergraduate Student)

Acknowledgments

This project is sponsored by NSF (#0812551), 9/2008 - 8/2012.

Created on Oct 26, 2008.
Contact: Huan Liu via Email: huan.liuATasu.edu.
Webmaster: Jiliang Tang, Email: Jiliang.TangATasu.edu

Last Updated: Tuesday, May 22, 2012