Time: August 13th, 2017
Location: Halifax, Nova Scotia, Canada
Feature selection, as a data preprocessing strategy, is imperative in preparing high-dimensional data for a myriad of data mining and machine learning tasks. By selecting a subset of high-quality features, feature selection can help build simpler and more comprehensible models, improve data mining performance, and prepare clean and understandable data. The proliferation of big data in recent years has presented substantial challenges and opportunities for feature selection research. In this tutorial, we provide a comprehensive overview of recent advances in feature selection research from a data perspective. After introducing some basic concepts, we review state-of-the-art feature selection algorithms and recent feature selection techniques for structured, social, heterogeneous, and streaming data. We also discuss the role of feature selection in the context of deep learning and how feature selection relates to feature engineering. To facilitate and promote research in this community, we present an open-source feature selection repository, scikit-feature, that includes most of the popular feature selection algorithms. We conclude our discussion with some open problems and pressing issues for future research.
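As a rough illustration of what "selecting a subset of features" means in practice, the sketch below ranks features with a Fisher score, one common filter-style criterion, and keeps the top k. It is a minimal NumPy example under stated assumptions: the function names `fisher_score` and `select_top_k` and the synthetic data are hypothetical and written for this illustration only, and the code does not use or reproduce the scikit-feature API.

```python
import numpy as np


def fisher_score(X, y):
    """Return one Fisher score per column of X (higher means more discriminative).

    Hypothetical helper for illustration; not part of scikit-feature.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])  # between-class scatter per feature
    within = np.zeros(X.shape[1])   # within-class scatter per feature
    for label in np.unique(y):
        Xc = X[y == label]
        n_c = Xc.shape[0]
        between += n_c * (Xc.mean(axis=0) - overall_mean) ** 2
        within += n_c * Xc.var(axis=0)
    # Small constant avoids division by zero for constant features.
    return between / (within + 1e-12)


def select_top_k(X, y, k):
    """Return the indices of the k features with the highest Fisher score."""
    scores = fisher_score(X, y)
    return np.argsort(scores)[::-1][:k]


# Synthetic data (an assumption for this sketch): 5 noise features,
# of which only feature 2 is shifted according to the class label.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 5))
X[:, 2] += 3.0 * y

selected = select_top_k(X, y, k=1)
X_reduced = X[:, selected]  # data restricted to the selected features
```

Because the score is computed per feature and independently of any downstream learner, this is a filter method in the sense of the outline below; wrapper and embedded methods would instead involve a learning algorithm in the selection step.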
1. Introduction to Feature Selection
Curse of dimensionality and dimensionality reduction
Do we still need feature selection?
Supervised, semi-supervised and unsupervised methods
Wrapper, filter and embedded methods
2. Traditional Feature Selection
Similarity based methods
Information theoretical based methods
Sparse learning based methods
Statistical based methods
3. Feature Selection with Structured Features
Feature selection with group structured features
Feature selection with tree structured features
Feature selection with graph structured features
4. Feature Selection with Heterogeneous Data
Feature selection with linked data
Feature selection with multi-view and multi-source data
5. Feature Selection with Streaming Data
Feature selection for data streams
Feature selection for feature streams
6. Feature Selection Repository scikit-feature
7. Open Problems
Scalability of feature selection algorithms
Stability of feature selection algorithms
Model selection for feature selection algorithms
Relations between feature selection and feature engineering
Tutorial Slides [pdf]
Technical Survey Paper [pdf]
Jundong Li has been a PhD student in Computer Science and Engineering at Arizona State University since August 2014. He obtained his Master's degree from the Department of Computing Science, University of Alberta, Canada in 2014 and his Bachelor's degree from the College of Computer Science and Technology, Zhejiang University, China in 2012. His research interests are in feature selection, data mining, social media mining and machine learning. He has published innovative works in top conference proceedings and highly ranked journals such as IJCAI, AAAI, ICDM, SDM, CIKM, WSDM, WWW, IEEE Intelligent Systems, IEEE TNNLS and Geoinformatica. He worked as an intern scientist at Yahoo! Research in 2016. He also leads and is the main contributor to the feature selection repository scikit-feature (http://featureselection.asu.edu/), which has been featured in several news articles and blogs such as KDNuggets.
Jiliang Tang is an assistant professor of computer science and engineering at Michigan State University, where he directs the data science and engineering lab (http://dse.cse.msu.edu). He received his Ph.D. in Computer Science from Arizona State University in 2015, and his B.S. and M.S. from Beijing Institute of Technology in 2008 and 2010, respectively. His research interests include social computing, data mining and machine learning. He received the best paper award of SIGKDD2016 and was the runner-up for the SIGKDD Dissertation Award 2015. He was the industry chair of SBP2017, the sponsorship co-chair of SDM2017, and the poster chair of SIGKDD2016, and he serves as a regular journal reviewer and on numerous conference program committees. He co-presented three tutorials at KDD2014, WWW2014, and Recsys2014, and has published innovative works in highly ranked journals and top conference proceedings that have received extensive coverage in the media.
Huan Liu is a professor of Computer Science and Engineering at Arizona State University. He obtained his Ph.D. in Computer Science at the University of Southern California and his B.Eng. in Computer Science and Electrical Engineering at Shanghai JiaoTong University. Before he joined ASU, he worked at Telecom Australia Research Labs and was on the faculty at the National University of Singapore. He was recognized for excellence in teaching and research in Computer Science and Engineering at Arizona State University. His research interests are in data mining, machine learning, social computing, and artificial intelligence, investigating problems that arise in many real-world, data-intensive applications with high-dimensional data of disparate forms such as social media. His well-cited publications include books, book chapters, and encyclopedia entries as well as conference and journal papers. He serves on journal editorial boards and numerous conference program committees, and is a founding organizer of the International Conference Series on Social Computing, Behavioral-Cultural Modeling, and Prediction (http://sbp.asu.edu/). He is an IEEE Fellow.