I will be doing research on information retrieval with Lanbo Zhang, a Ph.D. student of Dr. Yi Zhang. This week I have had to learn some new things so that I can contribute to the research. Lanbo has recommended the following books to me:
I have also had to read the notes for a graduate course in information retrieval and finish its first homework. That homework is like nothing else I have ever done. I have been given a document collection, and I am supposed to build an inverted index of it using Lemur (an information retrieval system created at Carnegie Mellon University and the University of Massachusetts). Then I have to write a retrieval program that returns a ranked list of documents for a set of queries. I have to develop four ranking algorithms: Boolean, tf (term frequency), tf*idf (term frequency * inverse document frequency), and a ranking algorithm of my own.
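To give an idea of what the tf*idf ranking looks like, here is a minimal sketch. It is not the homework solution and it does not use the Lemur API: the InvertedIndex type, the rankTfIdf function and the in-memory map are all made up for illustration, and idf is assumed to be log(N / df), which is only one of several common variants.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory inverted index: term -> (document id -> term frequency).
// Lemur keeps its index on disk behind its own API, which is not shown here.
typedef std::map<std::string, std::map<int, int> > InvertedIndex;

// Order by descending score; break ties by ascending document id.
bool byScoreDesc(const std::pair<int, double>& a, const std::pair<int, double>& b) {
    if (a.second != b.second) return a.second > b.second;
    return a.first < b.first;
}

// Score every document that contains at least one query term with tf*idf
// and return (document id, score) pairs, best first.
std::vector<std::pair<int, double> > rankTfIdf(const InvertedIndex& index,
                                               const std::vector<std::string>& query,
                                               int numDocs) {
    std::map<int, double> scores;
    for (std::size_t i = 0; i < query.size(); ++i) {
        InvertedIndex::const_iterator postings = index.find(query[i]);
        if (postings == index.end()) continue;  // term does not occur in the collection
        // Assumed idf variant: log(N / df), df = number of documents containing the term.
        const double idf =
            std::log(static_cast<double>(numDocs) / postings->second.size());
        for (std::map<int, int>::const_iterator p = postings->second.begin();
             p != postings->second.end(); ++p) {
            scores[p->first] += p->second * idf;  // tf * idf, summed over query terms
        }
    }
    std::vector<std::pair<int, double> > ranked(scores.begin(), scores.end());
    std::sort(ranked.begin(), ranked.end(), byScoreDesc);
    return ranked;
}
```

Dropping the idf factor (using 1.0 instead) turns this into the plain tf ranking, so the two algorithms can share the same loop.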
I have finished the first week's assignment, and now I am doing some dirty work with Lanbo. We have been given a list of queries from a search engine, along with the link to the document that the user clicked (and its rank) after the search engine returned the results. Right now we are crawling all of those pages so that we can index them and get some useful information about why users clicked on those links.
This week has been tough. My mentor has decided to create a machine learning algorithm that learns to assign a proper weight to the different features that can be extracted from a document collection. Those weights are used to produce a more accurate ranking function for the documents in the collection when they are retrieved in response to a query. My task is to write the code for the functions that will eventually extract those features from each document in the collection.
I have been writing the code in C++, and it has been really challenging. First of all, there are about 45 different features. Secondly, I did not know what most of them were, so before coding anything I had to go back to the original paper that describes each feature, read it until I understood the feature, and only then write the algorithm that extracts it. Thirdly, I have had to read through the source code of Lemur so that I can use it to my advantage and save some time later.
But so far, so good. I have been able to implement most of them.
This week I have implemented additional features, and these have been harder. I have had to read three papers about the language model approach to information retrieval in order to understand it well enough to implement it.
I have also begun to run every day. The weather here in Santa Cruz is incredible. It is too bad that I cannot get too used to it, because I am returning to Arizona at the "end" of the summer. The end of the summer in Phoenix is probably November.
We have been running different tests with different datasets this week. This has helped me fix some issues in the code and make it more flexible for the different situations it can encounter.
Oh, I forgot to mention that I have had to port all the code to Linux. If you ever have to do that, please remember this: take any nested template of the form Container<Container<T>>. On Windows this code compiles, but with g++ on Linux it does not, because g++ reads the closing >> as the right-shift operator. Change it to Container<Container<T> > (note the extra space that is needed).
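For example, a nested vector written the portable way might look like the sketch below. IntGrid and makeGrid are names made up for this illustration; they are not from the project code.

```cpp
#include <cstddef>
#include <vector>

// Portable spelling: the space between the two closing angle brackets keeps
// the compiler from treating ">>" as the right-shift operator.
typedef std::vector<std::vector<int> > IntGrid;

// Build a rows x cols grid filled with zeros.
IntGrid makeGrid(std::size_t rows, std::size_t cols) {
    return IntGrid(rows, std::vector<int>(cols, 0));
}
```

Hiding the nested type behind a typedef also means the awkward spacing only has to be written once.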
Last weekend I went to Orlando, FL. Good break after 6 weeks of intense mental activity.
This week I have spent most of my time trying to make PageRank work in Lemur. I have also been testing different parameter files so that the inlinks field gets indexed in Indri. I have learned my lesson: before trying to index, you need to run the HarvestLinks application.
New assignment from Professor Zhang: create an algorithm to cluster the documents of a click-through collection. This is the task: given a number of documents, find a way to cluster them by topic. At the beginning each document belongs to its own set, but by the end the algorithm should have joined some of them into the same set depending on their similarity.
I used a hierarchical approach to solve the problem. The greatest challenge was deciding when to stop, because if you never stop, all the documents end up in one set. I decided to cut at the point where the gap between two successive combination similarities is the largest, because such large gaps arguably indicate “natural” clusterings.
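The idea can be sketched roughly as follows. This is a simplified illustration and not my actual code: it assumes a precomputed, symmetric document-to-document similarity matrix and single-link merging (the similarity of two sets is that of their most similar pair of documents), and the names SimMatrix, agglomerate and clusterByLargestGap are invented for the example.

```cpp
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<double> > SimMatrix;

// Perform up to `steps` single-link merges. Returns one cluster label per
// document and appends the similarity of every merge to mergeSims.
std::vector<int> agglomerate(const SimMatrix& sim, std::size_t steps,
                             std::vector<double>& mergeSims) {
    const std::size_t n = sim.size();
    std::vector<int> label(n);
    for (std::size_t i = 0; i < n; ++i) label[i] = static_cast<int>(i);
    for (std::size_t s = 0; s < steps; ++s) {
        bool found = false;
        std::size_t bi = 0, bj = 0;
        // Most similar pair of documents that are still in different clusters.
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = i + 1; j < n; ++j)
                if (label[i] != label[j] && (!found || sim[i][j] > sim[bi][bj])) {
                    found = true; bi = i; bj = j;
                }
        if (!found) break;  // everything is already in one cluster
        mergeSims.push_back(sim[bi][bj]);
        const int from = label[bj], to = label[bi];
        for (std::size_t k = 0; k < n; ++k)
            if (label[k] == from) label[k] = to;
    }
    return label;
}

// Build the full merge sequence, find the largest drop between two successive
// merge similarities, and keep only the merges made before that drop.
std::vector<int> clusterByLargestGap(const SimMatrix& sim) {
    std::vector<double> sims;
    agglomerate(sim, sim.empty() ? 0 : sim.size() - 1, sims);
    std::vector<double> ignored;
    if (sims.size() < 2) return agglomerate(sim, 0, ignored);  // no gap to measure
    std::size_t cut = 1;
    double bestGap = sims[0] - sims[1];
    for (std::size_t k = 2; k < sims.size(); ++k) {
        const double gap = sims[k - 1] - sims[k];
        if (gap > bestGap) { bestGap = gap; cut = k; }
    }
    return agglomerate(sim, cut, ignored);  // replay only the first `cut` merges
}
```

Documents that end up with the same label belong to the same cluster. The sketch repeats the merging a second time to apply the cut, which is wasteful but keeps the code short; with fewer than three documents there is no gap to compare, so it leaves every document in its own set.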