
Indexing Multiple Binary File Types

Wednesday 20 September 2006 1:35:00 pm

Parsers

eZ publish ships with the ability to index PDF files and Word documents (assuming you have installed the pstotext and wvware utilities). However, we found that this functionality didn't meet our needs, so we did an extensive search for other parsing tools. Our solution is based on the tools listed below.

  • pdftotext (for parsing PDFs): part of Xpdf, a full-blown PDF reader that also provides numerous PDF and PS utilities.
  • catdoc (for parsing Word documents): a set of parsers and utilities including:
    • catppt (for parsing PowerPoint documents)
    • xls2csv (for parsing Excel documents): by default, this parses XLS files into comma-delimited format, but it also provides options to specify other output formats.

Together, these parsers handle PDFs, Word documents, PowerPoint presentations, and Excel spreadsheets, which covers the most common file formats. The solution is customizable, so you can add other parsers as needed.

Install these parsers in a location where they can be executed by your web server user / group.
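
Before configuring eZ publish, it can help to confirm that each parser is actually reachable on the PATH. The following is a minimal sketch, not part of eZ publish: the check_parsers function name is our own, and the result is only meaningful if you run it as your web server user.

```shell
#!/bin/sh
# Hypothetical helper: report which of the given commands can be found on
# the current user's PATH. Returns 0 only if every command is found.
check_parsers() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# The four parsers used in this article.
check_parsers pdftotext catdoc catppt xls2csv \
    || echo "Install the missing parsers before continuing."
```

Once all four are found, you can try each one by hand on a sample file of your own. For example, "pdftotext sample.pdf -" writes the extracted text to standard output, as does "catdoc sample.doc" (the sample file names are placeholders).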

Configuration

Place the following code in your settings/override/binaryfile.ini.append.php file (in the siteaccess folder of choice):

# Here you can add handlers for new datatypes.
[HandlerSettings]
MetaDataExtractor[text/plain]=plaintext
MetaDataExtractor[application/pdf]=ezbinaryfile
MetaDataExtractor[application/msword]=ezbinaryfile
MetaDataExtractor[application/vnd.ms-excel]=ezbinaryfile
MetaDataExtractor[application/vnd.ms-powerpoint]=ezbinaryfile

# The full path to your log file (used for debugging/testing)
[BinaryFileHandlerSettings]
LogFile=var/log/index.log

Note that this configuration example is for eZ publish version 3.8. If you are using an earlier version of eZ publish (we tried it on 3.6), remove "ez" from the "ezbinaryfile" strings.
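
If you keep one configuration for several installations, the rename for older versions can be scripted. This is only a sketch under the assumption stated above (that older versions expect "binaryfile" instead of "ezbinaryfile"); the to_legacy_handlers function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: rewrite 3.8-style handler names to the older form by
# dropping the "ez" prefix. Reads INI text on stdin, writes it to stdout.
# Only values that are exactly "ezbinaryfile" are changed.
to_legacy_handlers() {
    sed 's/=ezbinaryfile$/=binaryfile/'
}

# Example: convert a single handler line.
echo 'MetaDataExtractor[application/pdf]=ezbinaryfile' | to_legacy_handlers
```

To convert a whole file, redirect it through the function into a new file, and review the result before replacing the original.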

Save this file and clear the cache. Next, create an empty log file at the location specified by the LogFile setting (for example, with the touch command). Make sure that this file is writeable by your web server user / group.
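
The log file setup can be sketched as follows. The create_index_log function name is our own, the relative path assumes you run the script from the eZ publish root directory, and the "www-data" group in the final comment is only an example that varies between systems.

```shell
#!/bin/sh
# Hypothetical helper: create an empty log file (and its directory if
# needed) and make it writeable by its owner and group.
create_index_log() {
    log_file=$1
    mkdir -p "$(dirname "$log_file")" || return 1
    touch "$log_file" || return 1
    chmod 664 "$log_file"
}

# The path matches the LogFile setting in the configuration above.
create_index_log var/log/index.log

# Assumption: the web server runs in the group "www-data"; adjust as needed.
# chgrp www-data var/log/index.log
```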
