Abstract: In this paper, we present a fast and scalable Spreading Activation Network (SAN) framework for improving weakly annotated data, which is typically generated by an automated information extraction system from Web documents. Weakly annotated data suffers from two major problems: (i) it might contain incorrect ontological role assignments, and (ii) it might have many missing attributes. The SAN model described here is shown to substantially improve the annotations of Web document collections. Our experimental evaluations with the TAP data set indicate that our model can improve the accuracy of role assignments up to 75% even with 60% and (35%,35%) distortion, and can recover more than half of the 35% missing attributes.
Abstract: In this paper, we describe the semantic partitioner algorithm,
which uses the structural and presentation regularities of Web pages to
automatically transform them into hierarchical content structures. These
content structures enable us to automatically annotate labels in the Web
pages with their semantic roles, thus yielding meta-data and instance information
for the Web pages. Experimental results with the TAP knowledge
base and computer science department Web sites, comprising 16,861
Web pages, indicate that our algorithm is able to gather meta-data accurately
from various types of Web pages. The algorithm achieves this performance
without any domain-specific engineering.
* Srinivas Vadrevu, Fatih Gelgi, Hasan Davulcu: Semantic Partitioning of Web Pages. WISE 2005: 107-118
Abstract: The Web has established itself as the largest public data repository ever available. Even though the vast majority of information on the Web is formatted to be easily readable by the human eye, “meaningful information” is still largely inaccessible to computer applications. In this paper, we present automated algorithms that gather meta-data and instance information by utilizing global regularities on the Web and incorporating contextual information. Our system is distinctive in that it does not require domain-specific engineering. Experimental evaluations were successfully performed on the TAP knowledge base and on the faculty and course home pages of computer science departments, containing 16,861 Web pages.
* Fatih Gelgi, Srinivas Vadrevu, Hasan Davulcu: Improving Web Data Annotations with Spreading Activation. WISE 2005: 95-106
This project focuses on identifying missing and similar labels in ontologies built from Web pages. We start by converting HTML pages into DOM trees via a semantic partitioner. Next, we extract a collection of value-category sets based on the structure of the trees. Using these sets, we then attempt to identify missing and similar labels by running a clustering algorithm over the entire domain. To do this, we calculate a vector of syntactic statistics for the values in each category set. The vectors are then clustered using the K-means algorithm, in the hope that good and bad categories end up in the same clusters, which allows us to identify missing labels. To improve on simple statistical analysis, we also preprocess the data, including scanning for regular expressions and identifying enumerated types. Our results show promise, although clearly more work is needed in this field before this technique can be considered practical.
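
The clustering step described above can be illustrated with a minimal sketch. It is not the project's implementation. The four features (mean length, digit ratio, uppercase ratio, mean token count), the farthest-point initialization, the helper names, and the sample category sets are all assumptions chosen for illustration. The project does not say which syntactic statistics it uses.

```python
from typing import List, Optional, Sequence, Tuple


def syntactic_stats(values: Sequence[str]) -> List[float]:
    """Hypothetical feature vector for the values of one category set."""
    n = len(values) or 1
    total_chars = sum(len(v) for v in values) or 1
    mean_len = sum(len(v) for v in values) / n
    digit_ratio = sum(c.isdigit() for v in values for c in v) / total_chars
    upper_ratio = sum(c.isupper() for v in values for c in v) / total_chars
    mean_tokens = sum(len(v.split()) for v in values) / n
    return [mean_len, digit_ratio, upper_ratio, mean_tokens]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[List[float]], k: int, iterations: int = 20) -> List[int]:
    """Plain K-means with deterministic farthest-point initialization (an assumption)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Pick the point farthest from all centroids chosen so far.
        centroids.append(list(max(
            points, key=lambda p: min(sq_dist(p, c) for c in centroids))))
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Made-up category sets; None marks a set whose label is missing.
category_sets: List[Tuple[Optional[str], List[str]]] = [
    ("Phone", ["480-555-0101", "480-555-0102"]),
    (None, ["602-555-0133", "602-555-0144"]),
    ("Name", ["Alice Smith", "Bob Jones"]),
    ("Instructor", ["Carol White", "Dan Brown"]),
]

vectors = [syntactic_stats(values) for _, values in category_sets]
assignment = kmeans(vectors, k=2)

# For each unlabeled set, collect the labels of sets in the same cluster.
suggestions = {
    i: sorted({lab for (lab, _), a in zip(category_sets, assignment)
               if lab and a == assignment[i]})
    for i, (label, _) in enumerate(category_sets) if label is None
}
print(assignment, suggestions)
```

In this sketch, labeled sets that share a cluster with an unlabeled set supply candidate labels for it, and labeled sets that share a cluster with each other are candidates for similar labels. The features are left unscaled, so mean length dominates the distance. A real pipeline would likely normalize each feature first.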