Creating a Search Engine

Tuesday 19 December 2006 8:10:00 pm

Based on the existing search engine infrastructure in eZ Publish 3, different variants of the concepts described above were implemented. Because the main focus was research and evaluation, the implementation is a naive "proof of concept" rather than production-ready code.

All terms in English texts were stemmed with the PECL stem extension for PHP.

Implementation challenges

Scientific papers in the area of information retrieval often cover only the basics of their concepts. Details for special or more complex cases are sometimes missing. Often the papers do not report parameter values that work well in the formulas used by the algorithms, and sometimes even the formulas themselves are omitted.

Sometimes this is due to the brevity of the articles. But it also appears that people working for commercial search engines want to show great results without disclosing details that would benefit their competitors.

Therefore, there is some trial and error when putting the theory into practice. An evaluation with many users shows which algorithms are good and which are not.

Configuration

In order to get valuable results, we weighted different parts of our content on a class level. For example, we increased the weight of attributes like the title and the abstract / intro in the article content class (relative to the other attributes). The same could be done for different XML tags in an XML attribute, for example headings.
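The attribute weighting described above can be sketched as follows. This is a minimal illustration, not the eZ Publish implementation: the attribute names, the weight values, and the function `weighted_term_frequencies` are all assumptions made for the example, and tokenization is a plain whitespace split without stemming.

```python
from collections import Counter

# Hypothetical per-attribute weights for an "article" content class.
# The values are illustrative assumptions, not the ones used in the article.
ATTRIBUTE_WEIGHTS = {"title": 3.0, "intro": 2.0, "body": 1.0}


def weighted_term_frequencies(attributes, weights=ATTRIBUTE_WEIGHTS, default_weight=1.0):
    """Sum term counts over all attributes, scaling each occurrence by the attribute weight."""
    totals = Counter()
    for name, text in attributes.items():
        weight = weights.get(name, default_weight)
        for term in text.lower().split():
            totals[term] += weight
    return totals


article = {
    "title": "Search engine",
    "intro": "A search engine for content",
    "body": "The engine ranks content objects",
}
frequencies = weighted_term_frequencies(article)
print(frequencies["engine"])  # title 3.0 + intro 2.0 + body 1.0
```

A term that appears in the title therefore contributes more to the statistics than the same term in the body, which is the effect the class-level configuration is meant to achieve.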

Additionally, we wanted to experiment with the index. In an XML document, you can store statistical data for each XML element. But this is probably too fine-grained: users tend to rate very small fragments as less relevant because they contain little information. So instead we combined different XML elements into one index node.

As a short example, the content of an XML field could be visualized as a tree structure, as shown in the following graphic. The ellipses (for example "section") are the XML tags, while the rectangles represent the content. If the elements of each section are included in one index node, the blue boxes show which XML elements are combined. [4]
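Grouping the elements of each section into one index node could look roughly like this. The XML sample, the tag names, and the dotted path scheme used to label nodes are assumptions for illustration; the sketch also assumes the root element is itself a section.

```python
import xml.etree.ElementTree as ET

# Hypothetical content of an XML field; tag names are illustrative.
XML_FIELD = """
<section>
  <header>Introduction</header>
  <paragraph>Search engines rank documents.</paragraph>
  <section>
    <header>Details</header>
    <paragraph>Ranking uses term statistics.</paragraph>
  </section>
</section>
"""


def build_index_nodes(element, node_tag="section", path="1"):
    """Return {node_path: [text, ...]} with one entry per node_tag element.

    The text of every other child element is attached to the nearest
    enclosing node_tag element instead of getting an index node of its own.
    """
    nodes = {path: []}
    child_index = 0
    for child in element:
        if child.tag == node_tag:
            # A nested section starts a new index node.
            child_index += 1
            nodes.update(build_index_nodes(child, node_tag, f"{path}.{child_index}"))
        else:
            text = "".join(child.itertext()).strip()
            if text:
                nodes[path].append(text)
    return nodes


index_nodes = build_index_nodes(ET.fromstring(XML_FIELD))
for node_path, texts in index_nodes.items():
    print(node_path, texts)
```

Here the header and paragraph of each section end up in the same index node, so statistics are kept per section rather than per element.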

Combining index nodes

The configuration of index nodes is flexible: if a retrieved object contains several index nodes, their statistical data is added up. This would also be useful if, for example, you wanted to calculate a ranking value for an article including its sub-pages while also calculating a ranking value for each sub-page separately (as a sub-page might be more relevant for a query than the whole article). In the same way, when calculating the ranking value of a content object, the statistics of related, embedded or linked content objects can be included.
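As a rough sketch of how per-node statistics might be combined, the following adds up term counts of several index nodes while keeping the individual counts available. The only statistic shown is raw term frequency, and the names `node_statistics` and `combine_statistics` are hypothetical; the article does not specify which statistics are stored.

```python
from collections import Counter


def node_statistics(text):
    """Term counts for a single index node (whitespace tokenization, no stemming)."""
    return Counter(text.lower().split())


def combine_statistics(node_stats):
    """Add up the term counts of several index nodes into one Counter."""
    combined = Counter()
    for stats in node_stats:
        combined.update(stats)
    return combined


# Hypothetical sub-pages of one article, each indexed as its own node.
sub_pages = {
    "page-1": node_statistics("ranking of search results"),
    "page-2": node_statistics("search index configuration"),
}
article_stats = combine_statistics(sub_pages.values())
print(article_stats["search"])
```

Because the per-page counters are left untouched, a ranking value can be computed both for each sub-page and for the article as a whole.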

In the current test implementation, the search is configured in a simple way that treats documents as almost plain text with up-weighting only applied to a few important tags and attributes. This configuration is comparable to normal web search. In the last step of the evaluation phase the configuration of the search index will be changed and the existing queries will be run automatically and compared to the user ratings. In this way we can evaluate many different configurations and variants of the algorithms.
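The automated comparison of configurations could be organized along these lines. The metric (mean user rating of the top k results), the data layout, and the stand-in search functions are assumptions chosen to keep the example short; the article does not state which measure is used.

```python
def mean_rating_at_k(ranked_ids, ratings, k=3):
    """Average user rating of the first k results; unrated results count as 0."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return sum(ratings.get(doc_id, 0) for doc_id in top) / len(top)


def evaluate(configurations, queries, user_ratings, k=3):
    """Return {configuration_name: mean score over all stored queries}."""
    results = {}
    for name, search in configurations.items():
        scores = [mean_rating_at_k(search(q), user_ratings[q], k) for q in queries]
        results[name] = sum(scores) / len(scores)
    return results


# Hypothetical data: user ratings per query and two stand-in search functions
# that return document ids in ranked order for a given configuration.
user_ratings = {"ranking": {"a": 5, "b": 1, "c": 3}}
configurations = {
    "plain_text": lambda query: ["b", "c", "a"],
    "title_boost": lambda query: ["a", "c", "b"],
}
scores = evaluate(configurations, ["ranking"], user_ratings, k=2)
best = max(scores, key=scores.get)
print(scores, best)
```

Since the stored queries and ratings stay fixed, any number of index configurations can be scored without asking users to rate results again.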

You are invited to try this new search and evaluate the returned results (see the end of the article for information).

Limitations

At the moment, only a subset of XIRQL is implemented. It is, however, enough to answer the most important question for now: What are good configurations and algorithms for relevance-based ranking of content-related queries?
