Creating a Search Engine

Tuesday 19 December 2006 8:10:00 pm

Most users want a simple form where they just enter a few words. More advanced users want additional options. Developers, in turn, want as much power as possible to build customized and advanced search forms. In all these cases, the system uses a powerful query language in the background: users enter the query conditions on a form, and the back end transforms those conditions into the query language.

Let's think about what our query language should be able to do.

Fine-grained search queries

We talked about the advantages of structured information. Attributes and XML elements have a semantic meaning and are based on a datatype with special comparison operators. We looked at the example of searching for temperature values that could be stored in attributes or (custom) XML tags. Additionally, the temperature values of different measurements could be converted for comparison.

To have this kind of power, you need functionality similar to the following:

limit the search to one or a few content classes
write constraints that are based on attributes and custom XML elements with the same (or even similar) names in different classes
search in all attributes and XML tags of a specific datatype
limit the search to one or a few sub-trees
nested definitions of logical operators like and, or and not
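
As a rough illustration of the list above, the following Python sketch composes class limits, sub-tree limits and attribute constraints with nested and, or and not operators. It is not eZ Publish code: the ContentObject class, the helper names (in_class, in_subtree, attr_contains, all_of, any_of, not_) and the sample data are all hypothetical, chosen only to show how such conditions could nest.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class ContentObject:
    """Hypothetical stand-in for a content object (not the eZ Publish API)."""
    class_name: str                      # content class, e.g. "article"
    path: str                            # position in the content tree
    attributes: Dict[str, str] = field(default_factory=dict)

Condition = Callable[[ContentObject], bool]

def in_class(*names: str) -> Condition:
    # Limit the search to one or a few content classes
    return lambda obj: obj.class_name in names

def in_subtree(*roots: str) -> Condition:
    # Limit the search to one or a few sub-trees (roots should end with "/")
    return lambda obj: any(obj.path.startswith(root) for root in roots)

def attr_contains(name: str, word: str) -> Condition:
    # Constraint on an attribute that may exist in several classes
    return lambda obj: word.lower() in obj.attributes.get(name, "").lower()

def all_of(*conds: Condition) -> Condition:
    return lambda obj: all(c(obj) for c in conds)

def any_of(*conds: Condition) -> Condition:
    return lambda obj: any(c(obj) for c in conds)

def not_(cond: Condition) -> Condition:
    return lambda obj: not cond(obj)

objects = [
    ContentObject("article", "/news/2006/xpath",
                  {"author": "Fred Smith", "title": "XPath basics"}),
    ContentObject("article", "/news/2006/licence",
                  {"author": "Anna Berg", "title": "Licence changes"}),
    ContentObject("folder", "/news/2006", {"title": "News 2006"}),
    ContentObject("article", "/blog/fred",
                  {"author": "Fred Smith", "title": "Draft notes"}),
]

# articles below /news/ AND (author contains "fred" OR title contains
# "licence") AND NOT title contains "draft"
query = all_of(
    in_class("article"),
    in_subtree("/news/"),
    any_of(attr_contains("author", "fred"), attr_contains("title", "licence")),
    not_(attr_contains("title", "draft")),
)

matches = [obj.path for obj in objects if query(obj)]
print(matches)
```

Because every condition is just a function from an object to true or false, the logical operators can be nested to any depth without special cases.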

In a limited way these options are already available with the fetch() and search() operators in eZ Publish. However, this functionality could be enhanced.

eZ Publish and XML

All content objects in eZ Publish are stored as XML (or can easily be mapped to XML). But not only the content objects are XML: you can even imagine the whole content tree of an eZ Publish installation as one big XML tree, based on the content objects' XML representations and the node tree hierarchy.

There is also additional information that web search engines can't use. For example, although this article is split over multiple web pages, we know that it is one unit. You might want to combine the article pages when calculating the ranking value of the whole article. The same may be true if you embed an object in another object or if objects are related.

The power and limits of XPath

In the world of XML, XPath is a general and powerful query language for structured documents, much as SQL is for relational databases. Here are some XPath examples:

//article

retrieves all articles

//article[./author[contains(., 'fred')]]

retrieves all articles where the author element contains "fred"

//article[./author[contains(., 'fred')]]//heading

retrieves all headings of articles where the author element contains "fred". Note that in this example the headings are retrieved, not the articles. First, all articles are searched for the author condition. Next, for those articles that matched the author condition, the headings are retrieved.

//article[./author[contains(., 'fred')]]/heading[contains(., 'licence')]

retrieves all headings which contain "licence" in articles where the author element contains "fred"

These are only a few examples of the power of XPath with a really informal explanation of how they work. More information can be found on Wikipedia or on the W3C site.
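
To make the examples more concrete, here is a small Python sketch that reproduces them on a made-up document. Python's standard xml.etree.ElementTree module supports only a subset of XPath and has no contains() function, so the substring test is done in Python instead; a full XPath engine would evaluate the expressions directly. The sample XML and the text_contains helper are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Invented sample data, only for illustration
SAMPLE = """
<site>
  <article>
    <author>fred smith</author>
    <heading>New licence terms</heading>
    <heading>Release notes</heading>
  </article>
  <article>
    <author>anna berg</author>
    <heading>Open licence FAQ</heading>
  </article>
</site>
"""

root = ET.fromstring(SAMPLE)

def text_contains(element, word):
    # Case-sensitive substring test on the element's text content
    return word in "".join(element.itertext())

# Example 1: all articles
articles = root.findall(".//article")

# Example 2: articles where the author element contains "fred"
fred_articles = [
    a for a in articles
    if any(text_contains(author, "fred") for author in a.findall("./author"))
]

# Example 3: all headings of those articles (headings, not articles)
fred_headings = [h.text for a in fred_articles for h in a.findall(".//heading")]

# Example 4: headings of those articles that contain "licence"
fred_licence_headings = [
    h.text
    for a in fred_articles
    for h in a.findall("./heading")
    if text_contains(h, "licence")
]

print(len(articles), len(fred_articles))
print(fred_headings)
print(fred_licence_headings)
```

The two-step structure mirrors the explanation above: the author condition selects the articles first, and only then are the headings collected from the articles that matched.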

But XPath can only be used to retrieve XML elements that fulfil some conditions - there is no relevance-based ranking of the retrieved XML sub-trees (in the same way that there is no such ranking in SQL). Therefore we need an approach that extends XPath.

XIRQL

XIRQL (pronounced like "circle") enhances XPath by adding some important concepts and functionality for information retrieval. [3] For example, it introduces:

vagueness related to structural and content conditions (for example if you search for an XML element called "list" it could also match tags called "list-item")
a concept of datatypes with vague operators (for example if you want to search locations around "Oslo")
structural conditions based on hyperlinks
a concept for calculating the similarity of combined objects (useful for including the content of child objects and related or embedded objects with reduced weight in the actual object)
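
The last point can be pictured with a short Python sketch. It assumes a simple additive scheme in which each child, related or embedded object contributes its own combined score multiplied by a fixed weight below one. This weighting rule, the Node class and the numbers are illustrative assumptions, not the formula XIRQL actually defines.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """Hypothetical object with a relevance score of its own."""
    name: str
    own_score: float
    children: List["Node"] = field(default_factory=list)  # pages, embeds, relations

def combined_score(node: Node, child_weight: float = 0.5) -> float:
    # The object's own score plus the reduced-weight scores of its children;
    # the weight compounds with every level of nesting.
    return node.own_score + child_weight * sum(
        combined_score(child, child_weight) for child in node.children
    )

image = Node("embedded image", 0.5)
page_1 = Node("page 1", 0.5)
page_2 = Node("page 2", 0.25, [image])
article = Node("article", 0.25, [page_1, page_2])

print(combined_score(page_2))   # 0.25 + 0.5 * 0.5
print(combined_score(article))  # 0.25 + 0.5 * (0.5 + 0.5)
```

With a scheme like this, an article split over several pages is ranked as one unit, while deeply nested content still counts, only less.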

We extended XIRQL with the concepts of filter() and rank() functions:

filter() is responsible for building a context set. From this context set, the document frequency used in the vector space model can be calculated on the fly
rank() contains all constraints that should be used for relevance-based ranking

A XIRQL query could, for example, look like this:

//article[filter(./author[contains( 'fred' )]), rank(./heading[contains( 'licence' )])]

All constraints that are possible in XPath and XIRQL can be defined as filter or ranking conditions.
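
The interplay of filter() and rank() in the query above can be sketched in Python as follows. The sketch assumes a plain term frequency times inverse document frequency weighting, with the document frequency counted only inside the context set that the filter produces. The search function, the weighting formula and the sample documents are assumptions made for illustration; they are not the actual implementation.

```python
import math
from collections import Counter

# Invented sample documents
DOCS = [
    {"author": "fred", "heading": "licence licence changes"},
    {"author": "fred", "heading": "new licence"},
    {"author": "fred", "heading": "release notes"},
    {"author": "anna", "heading": "licence faq"},
]

def tokenize(text):
    return text.lower().split()

def search(docs, filter_fn, rank_field, query_terms):
    # filter(): build the context set
    context = [doc for doc in docs if filter_fn(doc)]
    n = len(context)

    # Document frequency, computed on the fly from the context set only
    df = Counter()
    for doc in context:
        df.update(set(tokenize(doc[rank_field])))

    # rank(): score every document in the context set
    results = []
    for doc in context:
        tf = Counter(tokenize(doc[rank_field]))
        score = sum(
            tf[term] * math.log(1 + n / df[term])
            for term in query_terms
            if df[term] > 0
        )
        results.append((score, doc))

    results.sort(key=lambda pair: pair[0], reverse=True)
    return results

ranked = search(
    DOCS,
    filter_fn=lambda doc: "fred" in doc["author"],   # filter condition
    rank_field="heading",                            # ranking condition
    query_terms=["licence"],
)

for score, doc in ranked:
    print(round(score, 3), doc["heading"])
```

Note that the filter decides which documents are in the result at all, while the ranking condition only orders them: in this sketch an article by "fred" without "licence" in its heading stays in the result with a score of zero.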

Ok, enough theory. Let's have a look at our current status.