There is an accompanying research seminar in which we will discuss related topics, extensions, more details, and applications. Please see here. Participation is not mandatory for students enrolled in CSE 591.
The most exhaustive and most recent knowledge in scientific domains such as biology, biochemistry, or archaeology is currently available only in the form of scientific publications. Efforts exist to store valuable facts in databases as well, to make the content more accessible to users. Such databases are maintained by trained personnel to provide high-quality data, but they lag behind the literature by years and cover only a small proportion of the data discussed there. In this course, we will discuss approaches to automated fact extraction from the literature. Main topics include text classification, natural language processing, and text mining. Focusing on the domains of biology and archaeology, we will look into various issues along the path from information retrieval to knowledge integration and question answering: how to select relevant documents, how to summarize documents, how important entities (geographic locations, drugs, genes) can be recognized and identified, how associations between entities can be extracted and verified, and how all this information can be stored so that it can be queried for question answering.
There are some books I recommend for further reading. However, most deal with a specific topic only (covered on one or two days) or become useful only if you plan on working in this field (project, thesis, ...). I will bring all the books to class on the first day, so everybody can have a look at them. Don't buy them right away; most are quite expensive.
Most of the topics have good Wikipedia entries, however, and you can start reading from there.