
CSE 591 - Natural Language Processing with Biomedical and Archaeological Applications

Jörg Hakenberg, Bob Leaman, Luis Tari

Lectures: Mo & We, 3:30-4:45pm, Artisn Crt Brckyrd 240

Office: BYENG 569BE

There is an accompanying research seminar in which we will discuss related topics, extensions, more details, and applications. Please see here. Participation is not mandatory for students enrolled in CSE 591.


Summary

The most exhaustive and most recent knowledge in scientific domains such as biology, biochemistry, or archaeology is currently available only in the form of scientific publications. Efforts exist to store valuable facts in databases as well, to make content more accessible to users. Such databases are maintained by trained personnel to provide high-quality data, but they still lag years behind the literature and cover only a small proportion of the data discussed there. In this course, we will discuss approaches to automated fact extraction from literature. Main topics include text classification, natural language processing, and text mining. Focusing on the domains of biology and archaeology, we will look into various issues along the path from information retrieval to knowledge integration and question answering: how to select relevant documents, how to summarize documents, how important entities (geographic locations, drugs, genes) can be recognized and identified, how associations between entities can be extracted and verified, and how all this information can be stored to facilitate querying for question answering.
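
To make two of the steps named above concrete, here is a minimal sketch of dictionary-based entity recognition followed by sentence-level co-occurrence counting. It is an illustration only, not the method used in the course: the dictionary, the example sentences, and the function names are all hypothetical, and the sentence splitter is deliberately naive.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical toy dictionary mapping surface forms to entity types.
DICTIONARY = {
    "p53": "gene",
    "mdm2": "gene",
    "nutlin": "drug",
}


def find_entities(sentence):
    """Return the sorted list of dictionary terms that occur as tokens in the sentence."""
    tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    return sorted(tokens & set(DICTIONARY))


def cooccurrences(text):
    """Count how often two dictionary entities appear in the same sentence."""
    counts = Counter()
    # Naive sentence splitting: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for pair in combinations(find_entities(sentence), 2):
            counts[pair] += 1
    return counts


# Hypothetical example text, written for this sketch.
text = (
    "MDM2 binds p53 and promotes its degradation. "
    "Nutlin blocks the interaction between MDM2 and p53. "
    "Nutlin was tested in cell lines."
)
pairs = cooccurrences(text)
print(pairs.most_common())
```

A pair that co-occurs often is only a candidate association. Verifying it, and handling synonyms and ambiguous names, is what the later topics (EMN, WSD, IE) address.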


Topics

  • Bioinformatics, archaeological informatics
  • Information retrieval, IR (text classification, text clustering, VSM, bag-of-words, features, SVM)
  • Named entity recognition, NER (dictionary, rules, features for NER, HMM, CRF, BANNER)
  • Entity mention normalization, EMN (normalization, standardization, grounding)
  • Word sense disambiguation, WSD (knowledge sources, one-sense-per-discourse, one-sense-per-collocation, decision lists, supervised/unsupervised/semi-supervised WSD)
  • Natural language processing, NLP (tokenization, period disambiguation, part-of-speech tagging, stemming, lemmatizing, shallow parsing, chunking, full sentence parsing, link grammar)
  • Information extraction, IE (relation mining, linguistic frames, co-occurrence, MEMM, convolution kernels)
  • Active learning; collection-wide analysis; question answering
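
As a rough illustration of the bag-of-words and vector space model (VSM) ideas listed under information retrieval, the following sketch ranks a few documents against a query by cosine similarity over raw term counts. The documents, the query, and the function names are hypothetical. A real system would typically add stop-word removal and a term weighting such as TF-IDF.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into alphanumeric tokens, and count them."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example sentences, not taken from the course material.
docs = {
    "d1": "The p53 gene regulates the cell cycle.",
    "d2": "Mutations in the p53 gene are found in many tumors.",
    "d3": "Pottery shards were found at the excavation site.",
}
query = bag_of_words("p53 gene mutations")

# Rank document ids from most to least similar to the query.
ranking = sorted(
    docs,
    key=lambda d: cosine_similarity(query, bag_of_words(docs[d])),
    reverse=True,
)
print(ranking)
```

Raw counts let frequent function words such as "the" inflate the vector norms, which is one reason weighting schemes and feature selection receive so much attention in text classification.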

Schedule

  1. Intro, tasks, challenges; tasks for biomedical NLP and archaeological NLP - presented by Chitta Baral [Slides]
  2. Information retrieval, document representation, question answering - presented by Luis Tari [Slides]
  3. Natural language processing - tokenization, stemming, lemmatizing, part-of-speech tagging - [Slides]
  4. Natural language processing - period disambiguation, chunking, shallow parsing, full sentence parsing, Link Grammar - [Slides]
  5. Machine learning - introduction, classification, clustering, SVM, k-means - [Slides]
  6. Named entity recognition - dictionary-based methods, rule-based methods; string-similarity measures - presented by Bob Leaman [Slides]
  7. Machine learning - hidden Markov modeling, conditional random fields - presented by Bob Leaman [Slides]
  8. Named entity recognition - features for NER; BANNER - presented by Bob Leaman [Slides]
  9. Named entity recognition - extensions: semi-supervised and active learning - presented by Bob Leaman [Slides]
  10. Word sense disambiguation - task, examples; most frequent sense, one-sense-per-discourse, one-sense-per-collocation, dictionary-based WSD - [Slides]
  11. Word sense disambiguation - knowledge sources; decision lists, supervised/unsupervised/semi-supervised WSD - [Slides]
  12. Entity mention normalization - dictionary-based, concept profiles, differences EMN/WSD - [Slides]
  13. Summary - NLP, ML, IR, NER, WSD, EMN - [Slides]
  14. Mid-term exam (13 Oct 2008)
  15. Machine learning - features, features, features - [Slides]
  16. Machine learning - supervised classification, naive Bayes, kNN, SVM - [Slides]
  17. Machine learning - text classification; document zoning; multi-class learning; class project: naive Bayes and SVM for text classification - [Slides]
  18. Machine learning - unsupervised learning, semi-supervised learning; unsupervised WSD - [Slides]
  19. Searching for related articles - pmra and BM25 - [Slides]
  20. 11/03: Information extraction - co-occurrences, statistical approaches - [Slides]
  21. 11/05: Information extraction - template filling, linguistic frames, patterns - [Slides]
  22. 11/10: Active learning for NER - presented by Bob Leaman [Slides]
  23. 11/12: Information extraction - Direct memory access parsing - [Slides]
  24. 11/17: Information extraction - Convolution kernels, tree kernels, TSVM, SVM with structured output - [Slides]
  25. 11/24: Query languages - parse tree databases and parse tree query languages - presented by Luis Tari [Slides]
  26. 11/26: Information extraction - Convolution kernels continued - [Slides]
  27. 12/01: Summary - [Slides]
  28. 12/03: Final exam

Course material

There are some books I recommend for further reading. However, most deal with a specific topic only (covered on one or two days) or become useful only if you plan on working in this field (project, thesis, ...). I will bring all books to class on the first day, so everybody can have a look at them. Don't buy them right away; most are quite expensive.
Most of the topics also have good Wikipedia entries, and you can start reading from there.

Books:
  • "Speech and Language Processing" (Jurafsky and Martin) 2nd edition
    --- Amazon
    --- Amazon has used or new copies of the 1st edition paperback for $25
  • "Foundations of Statistical Natural Language Processing" (Manning and Schütze)
    --- limited preview at Google Books
    --- Amazon
  • "Text mining for biology and biomedicine" (Ananiadou and McNaught)
    --- no preview at Google Books
  • "Word Sense Disambiguation" (Agirre and Edmonds)
    --- Amazon
Online:
  • Natural language processing (NLP)
    --- Wikipedia - lists concrete problems and tasks, with links to IR, IE, NER, QA, ...
  • Word sense disambiguation (WSD)
    --- Wikipedia
    --- Word Sense Disambiguation: The State of the Art (Ide and Veronis) - excellent overview, though slightly outdated (1998)
    --- Advances in WSD (Mihalcea and Pedersen) - nice online tutorial with more than 200 slides
  • Question answering (QA)
    --- Wikipedia