
Creating a Search Engine

Tuesday 19 December 2006 8:10:00 pm


The most common approach in commercial search engines is the vector space model. It is the basis of almost all web search engines, although in different variations and combined with other strategies, such as evaluating the popularity of a site based on its link relationships.

Interestingly, the vector space model has no theoretical or scientific foundation. However, other approaches (for example, those based on probability theory) do not generally yield better results when they are used on different document sets. They are also usually much more complex, both to understand and to compute. Some of them need to be trained on the dataset (and sometimes even on each query) to provide good results, and even after training they are sometimes still less accurate than the simple vector space model. By comparison, the vector space model is easy to understand and calculate.

Visualizing the vector space model

The vector space model can be visualized as shown in the following graphic. The graphic shows some (blue) document vectors and one (green) query vector in a coordinate system (black).

Figure: Vector space model

A vector is simply an arrow from the origin (where the axes of the coordinate system intersect) to a specific point in the coordinate system. A vector is represented by one value per axis, for example (2, 3), which means "2 units right and 3 units up".

In the vector space model there is one axis for each term that exists in the whole document set. Of course there are many more than two axes, but this does not matter for understanding the idea of the model.

In a document vector, the value on each term axis represents the importance of the term in the document. A document vector is therefore simply the set of all terms, each with a value representing its importance in the document. Note that all terms that do not occur in the document have a value of "0". The information in a document or query vector can be written as a row in a table:

document \ term       ez  publish  cms  admin  user  and  ...
document 1             3        2    4      0     1   10
document 2             0        0    0      3     5    3
document 3             4        6    2      1     3    4

Query "ez publish"     1        1    0      0     0    0
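
The rows of this table can also be written down as data. The following is a minimal sketch in Python, assuming each vector is stored sparsely as a dictionary that maps a term to its value. The names VOCABULARY, DOCUMENTS, QUERY and to_dense are illustrative only and are not part of eZ Publish.

```python
# One axis per term in the whole document set
# (only the six terms shown in the table are listed here).
VOCABULARY = ["ez", "publish", "cms", "admin", "user", "and"]

# Sparse form: only terms that occur in a document are stored.
DOCUMENTS = {
    "document 1": {"ez": 3, "publish": 2, "cms": 4, "user": 1, "and": 10},
    "document 2": {"admin": 3, "user": 5, "and": 3},
    "document 3": {"ez": 4, "publish": 6, "cms": 2,
                   "admin": 1, "user": 3, "and": 4},
}
QUERY = {"ez": 1, "publish": 1}


def to_dense(weights, vocabulary=VOCABULARY):
    """Return one value per term axis; terms that do not occur get 0."""
    return [weights.get(term, 0) for term in vocabulary]


if __name__ == "__main__":
    for name, weights in DOCUMENTS.items():
        print(name, to_dense(weights))
    print("query", to_dense(QUERY))
```

Storing only the terms that occur keeps each vector small, because a real document set has far more term axes than any single document uses.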

Calculating how similar a document is to a specific query amounts to calculating the similarity of the document vector and the query vector. As a simple visual model, think of the angle between the document vector and the query vector: the smaller the angle, the more similar the document is to the query.
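
The angle idea can be tried out on the example values from the table. This is only a sketch under stated assumptions: the vectors are plain Python dictionaries, the angle is derived from the standard dot product formula, and the function names angle_degrees and rank are hypothetical. It is not how eZ Publish or any particular search engine implements ranking.

```python
import math

# Example values taken from the table; terms that do not occur are omitted
# and therefore count as 0.
DOCUMENTS = {
    "document 1": {"ez": 3, "publish": 2, "cms": 4, "user": 1, "and": 10},
    "document 2": {"admin": 3, "user": 5, "and": 3},
    "document 3": {"ez": 4, "publish": 6, "cms": 2,
                   "admin": 1, "user": 3, "and": 4},
}
QUERY = {"ez": 1, "publish": 1}


def angle_degrees(a, b):
    """Angle between two term vectors given as dicts (term -> value)."""
    dot = sum(value * b.get(term, 0) for term, value in a.items())
    length_a = math.sqrt(sum(value * value for value in a.values()))
    length_b = math.sqrt(sum(value * value for value in b.values()))
    if length_a == 0 or length_b == 0:
        # Assumption: an empty vector is treated as unrelated to everything.
        return 90.0
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (length_a * length_b)))
    return math.degrees(math.acos(cosine))


def rank(query, documents):
    """Return document names ordered from smallest to largest angle."""
    return sorted(documents,
                  key=lambda name: angle_degrees(documents[name], query))


if __name__ == "__main__":
    for name in rank(QUERY, DOCUMENTS):
        print(name, round(angle_degrees(DOCUMENTS[name], QUERY), 1))
```

With these values, document 3 comes first, then document 1, then document 2. Document 2 shares no term with the query, so its angle is 90 degrees.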

Now that we understand what similarity means, we need to know how the importance of a term in a document is calculated.

