Beyond Feature Selection and Extraction

- An Integrated Framework for High-Dimensional Data of Small Labeled Samples


High-dimensional data is ubiquitous in real-world applications, from text categorization to image processing to Web search. The shortage of labeled data, resulting from high labeling costs, makes it necessary to explore machine learning approaches beyond the classic classification and clustering paradigms. Semi-supervised learning is one such approach: it has shown potential for handling data with small labeled samples and for reducing the need for expensive labeled data. However, high-dimensional data with small labeled samples permits too large a hypothesis space while supplying too few constraints (labeled instances). The combination of these two data characteristics poses a new research challenge. Drawing on computational and statistical learning theory, we analyze the specific challenges presented by such data, present preliminary studies, and delineate the need to integrate feature selection and extraction in a novel framework that reduces the hypothesis space. We propose to design efficient, novel algorithms and to conduct theoretical and empirical studies to understand the complex relationships between high-dimensional data and classification performance.


Related Activities

Project Members


This project is sponsored by NSF (#0812551), 9/2008 - 8/2012.

Created on Oct 26, 2008.
Contact: Huan Liu via Email:
Webmaster: Jiliang Tang, Email:

Last Updated: Tuesday, May 22, 2012