Wednesday 20 September 2006 1:35:00 pm
eZ publish ships with the ability to index PDF files and Word documents (assuming you have installed the pstotext and wvware utilities). However, we found that this functionality didn't meet our needs, so we did an extensive search for other parsing tools. Our solution is based on the tools listed below.
These parsers handle PDFs, Word documents, PowerPoint presentations, and Excel spreadsheets. Our solution is customizable, allowing you to add other parsers as needed, but this set of parsers covers the most common file formats.
Install these parsers in a location where they can be executed by your web server user / group.
Place the following code in your settings/override/binaryfile.ini.append.php file (in the siteaccess folder of choice):
# Here you can add handlers for new datatypes.
[HandlerSettings]
MetaDataExtractor[text/plain]=plaintext
MetaDataExtractor[application/pdf]=ezbinaryfile
MetaDataExtractor[application/msword]=ezbinaryfile
MetaDataExtractor[application/vnd.ms-excel]=ezbinaryfile
MetaDataExtractor[application/vnd.ms-powerpoint]=ezbinaryfile

# The full path to your log file (used for debugging/testing)
[BinaryFileHandlerSettings]
LogFile=var/log/index.log
Note that this configuration example is for eZ publish version 3.8. If you are using an earlier version of eZ publish (we tried it on 3.6), remove "ez" from the "ezbinaryfile" strings, so that each handler value reads "binaryfile".
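If you need the pre-3.8 form, the renaming can be done by hand or with a small filter. The following is only a sketch: the function name to_pre38 is made up for this example, and it assumes the handler lines are written exactly as in the configuration above, one setting per line.

```shell
#!/bin/sh
# to_pre38 is a hypothetical helper name, not part of eZ publish.
# It reads INI text on stdin and rewrites MetaDataExtractor values
# from "ezbinaryfile" to "binaryfile"; all other lines pass through unchanged.
to_pre38() {
    sed 's/^\(MetaDataExtractor\[[^]]*]\)=ezbinaryfile$/\1=binaryfile/'
}
```

For example, running to_pre38 < binaryfile.ini.append.php > binaryfile.ini.append.php.new writes a converted copy that you can review before replacing the original.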
Save this file and clear the cache. Next, create an empty log file at the location you specified in the configuration code (for example, with touch). (Make sure that this file is writable by your web server user / group.)
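Creating the log file might look like the sketch below. It assumes you run it from the eZ publish root directory (EZ_ROOT is a variable invented for this example) and that the LogFile path matches the configuration above. The web server account in the commented chown line is also an assumption, so substitute your own.

```shell
#!/bin/sh
# EZ_ROOT is a hypothetical variable for this example; it defaults to the
# current directory, which is assumed to be the eZ publish root.
EZ_ROOT="${EZ_ROOT:-.}"
LOG_FILE="$EZ_ROOT/var/log/index.log"

# Create the directory if needed, then an empty log file.
mkdir -p "$(dirname "$LOG_FILE")"
touch "$LOG_FILE"

# Allow owner and group to write to the log.
chmod 664 "$LOG_FILE"

# Assumed web server user/group; uncomment and adjust for your system.
# chown www-data:www-data "$LOG_FILE"
```

Group write permission is enough only if the web server runs under the file's owner or group, which is why the chown line is included as an option.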