Back to home page

Under continuous construction... as my algorithm evolves.

On this page I will comment on my own algorithm for information retrieval and how I improve it as time passes.

Description of the project

The project I will be working on is perfecting an algorithm for information retrieval.

This algorithm assumes that an inverted index has already been created for the data from which information is to be retrieved. Right now I am using the vector model to retrieve the information, but I know that a pure vector model algorithm is not the most accurate one, so I guess I will end up hybridizing it.
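
For readers unfamiliar with the term, here is a minimal sketch of what an inverted index can look like. It is not the index this project actually uses: the toy documents, the function name build_inverted_index and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy collection; the ids and texts are made up for illustration.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
}

def build_inverted_index(docs):
    """Map each word to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: lowercase the text and split on whitespace.
        for word in text.lower().split():
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return dict(index)

index = build_inverted_index(docs)
print(index["cat"])
```

With an index like this, the counts for a query word can be read straight from its entry instead of rescanning every document.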

First algorithm

General Steps

  1. Parse the queries and store them in an array.
  2. Give each word of the queries a weight of one.
  3. Make a document-frequency matrix that stores the number of times each word of the query appears in each document.
  4. Normalize the matrix: for each document in the data set, find fmax, the number of occurrences of the most frequent word in that document (excluding stop words). Then, in each row (each document), divide the count of each query word by that document's fmax.
  5. Find the cosine between the query vector and each document vector.
  6. Sort the results in descending order.
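
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual code of the project: the function name rank, the tiny STOP_WORDS set, the whitespace tokenization and the toy documents are all hypothetical, and the sketch counts words directly from the text instead of reading them from an inverted index. As in step 3, the document vectors only have one component per query word.

```python
import math
from collections import Counter

# Assumption: a tiny stop word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "on", "in", "and"}

def rank(query, docs):
    """Return (doc_id, cosine) pairs sorted from best to worst match."""
    # Steps 1-2: parse the query; every distinct query word gets weight one.
    terms = list(dict.fromkeys(query.lower().split()))
    query_vec = [1.0] * len(terms)
    q_norm = math.sqrt(sum(q * q for q in query_vec))

    scores = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        # Step 4: fmax is the count of the most frequent non-stop word.
        fmax = max((c for w, c in counts.items() if w not in STOP_WORDS), default=0)
        # Steps 3-4: query word counts in this document, divided by fmax.
        doc_vec = [counts[t] / fmax if fmax else 0.0 for t in terms]
        # Step 5: cosine between the query vector and the document vector.
        dot = sum(q * d for q, d in zip(query_vec, doc_vec))
        d_norm = math.sqrt(sum(d * d for d in doc_vec))
        norm = q_norm * d_norm
        scores.append((doc_id, dot / norm if norm else 0.0))

    # Step 6: sort in descending order of similarity.
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Hypothetical toy documents.
docs = {
    "d1": "cat cat dog",
    "d2": "dog bird bird bird",
    "d3": "fish",
}
results = rank("cat dog", docs)
for doc_id, score in results:
    print(doc_id, round(score, 4))
```

In this toy run d1 comes first because it contains both query words, d2 comes second with only one of them, and d3 gets a score of zero.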

Results
