Solr Indexing Error

Tuesday 07 April 2009 5:04:14 am - 3 replies

Sylvain Gogel

Tuesday 07 April 2009 5:12:41 am

I use both of these settings:

[PDFHandlerSettings]
TextExtractionTool=pstotext

and

[PDFHandlerSettings]
TextExtractionTool=mypdftotext

The latter is a shell script that wraps pdftotext from the xpdf tools:

#!/bin/sh
# Write the text of the PDF given as the first argument to stdout, encoded as UTF-8.
/usr/bin/pdftotext -enc "UTF-8" "$1" -

--
http://www.ecedi.fr
Web agency, design/consulting, accessibility
eZPublish, Drupal, Zend, Symfony

Geoff Bentley

Wednesday 08 April 2009 10:15:04 pm

Another option is to use the eZ Tika extension, which can index a wide variety of binary file types, such as MS Word, Excel and other MS Office formats, PDF, and ODF:

* http://projects.ez.no/eztika
* http://lucene.apache.org/tika/

Christian Rößler

Friday 08 May 2009 4:40:23 pm

It seems that a two-byte UTF-8 character is being interpreted as two separate 8-bit characters, which is not correct.
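
To make that symptom concrete, here is a minimal Python sketch. It assumes the wrong decoder is ISO-8859-1 (Latin-1), which is only a guess at what happens in this setup, and the function names are made up for illustration:

```python
# Sketch of the suspected symptom: one two-byte UTF-8 character
# read back as two single-byte characters.
# Assumption: the wrong decoder is ISO-8859-1 (Latin-1).

def misread_as_latin1(text: str) -> str:
    """Encode as UTF-8, then (wrongly) decode each byte as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_latin1_misread(garbled: str) -> str:
    """Undo the misreading by reversing the two steps."""
    return garbled.encode("latin-1").decode("utf-8")


umlaut = "\u00f6"                    # o with diaeresis, as in German text
garbled = misread_as_latin1(umlaut)  # two characters: U+00C3 and U+00B6

print(len(umlaut.encode("utf-8")))   # 2 bytes in UTF-8
print(len(garbled))                  # 2 characters after the misreading
print(repair_latin1_misread(garbled) == umlaut)
```

If the indexed text shows pairs like these where umlauts should be, some step between the extraction tool and Solr is probably decoding UTF-8 bytes with a single-byte charset.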

I think I have the same issue here (German umlauts and the like), and I found a promising page that explains the Tomcat/Solr charset settings:

http://wiki.apache.org/solr/SolrTomcat#head-20147ee4d9dd5ca83ed264898280ab60457847c4

Perhaps this is the issue: XML cannot contain those characters, so the Solr XML parser crashes. Can you try it out? I currently do not have access to an eZ installation.
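
If the cause really is characters that XML does not allow, one way to test the theory is to filter the extracted text before it is sent to Solr. The following Python sketch is only an illustration under that assumption: the function name is hypothetical and this is not how eZ Find handles it. The character ranges are the ones permitted by the XML 1.0 specification.

```python
import re

# Everything outside the XML 1.0 "Char" production is matched and removed:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD and U+10000-U+10FFFF are kept.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Hypothetical helper: drop characters an XML 1.0 parser would reject."""
    return _INVALID_XML_CHARS.sub("", text)


# A form feed (\x0c) is one example of a control character that is
# not allowed in XML 1.0; umlauts themselves are perfectly valid.
sample = "Gr\u00f6\u00dfe\x0cSeite 2\n"
print(repr(strip_invalid_xml_chars(sample)))
```

Note that valid umlauts pass through unchanged, so if the error persists after such a filter, the charset settings are the more likely culprit.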

Cheers,
Christian

Hannover, Germany
eZ-Certified http://auth.ez.no/certification/verify/395613
