Indexing content of files using Solr (ezfind)

Wednesday 01 October 2008 1:36:23 am - 3 replies

Modified on Wednesday 01 October 2008 4:17:49 am by Laurence Bonhomme

Christian Rößler

Wednesday 22 October 2008 11:01:54 am

Laurence,

perhaps a bit late but better late than never :-)

I have had pretty much the same problems. Perhaps this link will help you: http://ez.no/developer/articles/indexing_multiple_binary_file_types

I was able to set up a generic binaryfilehandler that was called by ezfind for every physical file. This binaryfilehandler called several external programs (pdftotxt, doc2txt). The parsed contents of each file were printed to stdout and caught by the binaryfilehandler, then returned to ezflow, which saved them into the ezsolr index via an HTTP request.
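
To illustrate the idea (this is not the actual eZ Publish handler, only a minimal sketch in Python), the dispatch step could look roughly like this. The CONVERTERS table, the extract_text name, and the choice of pdftotext and antiword as tools are assumptions for illustration; substitute whatever converters you really have installed.

```python
import subprocess
import sys
from pathlib import Path

# Hypothetical mapping from file extension to an external converter command.
# Each command must write the extracted plain text to stdout; "{path}" is
# replaced with the file to convert. The tool choices are assumptions.
CONVERTERS = {
    ".pdf": ["pdftotext", "{path}", "-"],
    ".doc": ["antiword", "{path}"],
}


def extract_text(path, converters=CONVERTERS, timeout=60):
    """Run the converter registered for the file's extension and return its stdout.

    Returns an empty string when no converter is registered or the tool fails,
    so that one unreadable file does not abort a whole indexing run.
    """
    template = converters.get(Path(path).suffix.lower())
    if template is None:
        return ""
    command = [part.replace("{path}", str(path)) for part in template]
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout, check=True
        )
    except (OSError, subprocess.SubprocessError):
        # Missing tool, non-zero exit status, or timeout.
        return ""
    return result.stdout
```

The text returned by such a function is what would then be handed on to the search engine for indexing. Swallowing errors and returning an empty string is a design choice of this sketch; you may prefer to log failures instead.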

The tricky part is getting ezfind to use the custom file handler to parse the binary file's content into something that ezflow/ezsolr can work with.

The link above contains a full-featured how-to plus downloads to get it working.
If you need any further help, feel free to reply to this post.

chris.

Hannover, Germany
eZ-Certified http://auth.ez.no/certification/verify/395613

Paul Borgermans

Wednesday 22 October 2008 2:21:15 pm

Something to point out here: the configuration for indexing files like pdf, word, ... depends on the configuration of eZ publish to convert these to plain text. It has nothing to do with the search plugin used (default, Solr/eZ Find, ...).

We'll improve the conversion mechanism options for the next iteration of eZ Publish (4.1). I'm also investigating a few more options to handle more file formats.

You'll learn more about that very soon (< 3 weeks)

Paul

eZ Publish, eZ Find, Solr expert consulting and training
http://twitter.com/paulborgermans

Geoff Bentley

Wednesday 25 February 2009 3:21:12 pm

Check out Paul's ezTika extension ( http://projects.ez.no/eztika ) which draws on the Apache Tika toolkit ( http://lucene.apache.org/tika/ ) - this works seamlessly (so far) with eZ Find.
