Creating a Search Engine

Tuesday 19 December 2006 8:10:00 pm

The importance (or weight) of each term in a document is calculated individually. There is no magic or intelligence involved, only the simple counting of words. The weight is based on two factors: the local weighting factor and the global weighting factor.

Local weighting factor

The local weighting factor is based on each individual document. The idea is to measure how important a term is within the document itself. Terms that occur often should be more important than terms that occur only a few times. This is measured by simply counting how often the term occurs in the document, the so-called term frequency.

Terms occur more often in long texts than in short ones. Because of that, the term frequency can be normalized, that is, set in relation to the length of the document or to the maximum term frequency in the document. Without normalization, longer documents rank higher simply because they contain more words.
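As a minimal sketch of this counting, assuming a naive tokenizer that lowercases the text and splits it on whitespace, the term frequency and one possible normalization (dividing by the maximum term frequency in the document) could look as follows. The function names and the sample text are hypothetical and only serve as illustration.

```python
from collections import Counter


def term_frequencies(text):
    """Count how often each term occurs in a document (raw term frequency)."""
    # Assumption: lowercasing and whitespace splitting is enough for this sketch.
    return Counter(text.lower().split())


def normalized_term_frequencies(text):
    """Divide each count by the highest count found in the same document."""
    counts = term_frequencies(text)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}


# Hypothetical sample document.
doc = "the website is a website about the search engine of the website"
raw = term_frequencies(doc)
normalized = normalized_term_frequencies(doc)
print(raw["website"], normalized["website"], normalized["search"])
```

In this sample, "the" reaches the same maximum frequency as "website", which is exactly the problem with very common terms described next.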

But there is still another problem. Terms like "and", "a", "of" and "the" occur profusely in almost every document. Those terms are not important, but because they occur so often they have a very high term frequency. So while their local weight is high, they neither discriminate between documents nor indicate the content of a document.

For example, if your search query is "a website", the term "a" does not discriminate between the documents, while the term "website" does. Therefore, there is a second weighting factor: the global weighting factor.

Global weighting factor

The global weighting factor is based on the whole dataset as opposed to each individual document. The idea is to measure how discriminating a term is. A term is discriminating if it only occurs in a few documents, and is not discriminating at all if it occurs in every document.

For example, on ez.no terms like "is" or "a" are not discriminating at all. The terms "ez", "systems" and "publish" are a little bit more discriminating. Terms like "cluster" and "rss" are much more discriminating.

Like the local weighting factor, the global weighting factor is also based on simple counting. It is based on the number of documents in which a term occurs, the so-called document frequency. Normally the document frequency is set in relation to the total number of all documents.

If a term occurs in only a few documents, it is a discriminating term. Terms that occur in almost every document are not discriminating, regardless of whether they have a high or low term frequency in a document.
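The document frequency can be sketched in the same counting style. The snippet below assumes the same naive whitespace tokenizer and expresses the document frequency as the fraction of documents that contain a term. The function name and the four sample documents are made up for illustration and are not taken from ez.no.

```python
def document_frequencies(documents):
    """Map each term to the fraction of documents that contain it."""
    counts = {}
    for text in documents:
        # A set makes sure a term is counted only once per document,
        # no matter how often it occurs inside that document.
        for term in set(text.lower().split()):
            counts[term] = counts.get(term, 0) + 1
    total = len(documents)
    return {term: n / total for term, n in counts.items()}


# Hypothetical miniature dataset.
docs = [
    "ez publish is a content management system",
    "a cluster setup for ez publish",
    "rss is a feed format",
    "ez systems is a company",
]
df = document_frequencies(docs)
print(df["a"], df["ez"], df["cluster"])
```

A value close to 1 marks a term that hardly discriminates, while a small value marks a discriminating term.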

Combining term and document frequency

The importance of a term in a document is based on how often it occurs in the document and how discriminating the term is in relation to the whole dataset.

In the simplest case, the weight is calculated by taking the local weighting factor and dividing it by the global weighting factor. Thus the importance of terms that occur in many documents becomes very small, as they are divided by a big number. Conversely, terms that occur in only a few documents have a high weight even if they do not occur very often in those documents.
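The division described above can be sketched as follows. This is only an illustration under stated assumptions: the term frequency is normalized by the maximum term frequency of the document, the document frequency is the fraction of documents containing the term, and tokenization is a plain whitespace split. The function name and sample documents are hypothetical.

```python
from collections import Counter


def term_weights(documents):
    """Return one {term: weight} dict per document, with weight = tf / df."""
    tokenized = [text.lower().split() for text in documents]
    total = len(tokenized)

    # Global factor: in how many documents does each term occur?
    doc_counts = Counter()
    for tokens in tokenized:
        doc_counts.update(set(tokens))

    weights = []
    for tokens in tokenized:
        # Local factor: term frequency, normalized by the document's maximum.
        tf = Counter(tokens)
        max_tf = max(tf.values(), default=1)
        weights.append({
            term: (count / max_tf) / (doc_counts[term] / total)
            for term, count in tf.items()
        })
    return weights


# Hypothetical miniature dataset.
docs = [
    "a cluster is a cluster of a server",
    "a website is a page",
    "a bug is a bug",
]
w = term_weights(docs)
print(w[0]["a"], w[0]["cluster"])
```

In the first sample document "a" occurs more often than "cluster", yet "cluster" receives the higher weight because it occurs in only one of the three documents.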

Vector space on-the-fly

Normally the term frequency and document frequency are pre-calculated when indexing the documents. But this has one disadvantage. Imagine a query like "partner bug". Assume those two terms discriminate the documents on ez.no to a similar degree (that is, in the whole ez.no dataset both terms have a similar document frequency).

But what if you limit your search to the Partner section on ez.no? The proportion of documents that contain the term "partner" is much higher there than in the whole dataset. So "partner" is less discriminating in that context than it is in the whole ez.no dataset. Conversely, the term "bug" is much more discriminating. If you search in the bug system for the same terms, it will be the other way around.

Thus it could be interesting to calculate the global weighting factor - the document frequency - on-the-fly, depending on the context to which you limit your search.
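A rough sketch of such a context-dependent document frequency is shown below. It assumes each document is a dict with a hypothetical "section" field and a "text" field, and that the context is simply a section name. The data is invented to mirror the "partner bug" example and does not reflect real ez.no content.

```python
def contextual_document_frequency(documents, term, section=None):
    """Fraction of documents containing `term`, optionally within one section."""
    # Restrict the dataset to the search context first (None = whole dataset).
    selected = [d for d in documents if section is None or d["section"] == section]
    if not selected:
        return 0.0
    hits = sum(1 for d in selected if term in d["text"].lower().split())
    return hits / len(selected)


# Hypothetical dataset with three sections.
docs = [
    {"section": "partner", "text": "become a partner today"},
    {"section": "partner", "text": "partner program overview"},
    {"section": "partner", "text": "partner bug report"},
    {"section": "bugs", "text": "bug in the cluster setup"},
    {"section": "bugs", "text": "bug in rss feed"},
    {"section": "docs", "text": "rss feed documentation"},
]

for context in (None, "partner", "bugs"):
    print(
        context,
        contextual_document_frequency(docs, "partner", context),
        contextual_document_frequency(docs, "bug", context),
    )
```

Over the whole sample dataset both terms have the same document frequency, but within the partner section "partner" occurs in every document and stops discriminating, while within the bugs section the same happens to "bug".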

Now that we understand how search engines work, the next question is how to write powerful queries. The next section introduces a powerful query language that supports structural and content-related query conditions.
