Solr Indexing Error - community.ez.no

Tuesday 07 April 2009 5:04:14 am - 3 replies

Sylvain Gogel

Tuesday 07 April 2009 5:12:41 am

I use both of these settings:

[PDFHandlerSettings]
TextExtractionTool=pstotext

and

[PDFHandlerSettings]
TextExtractionTool=mypdftotext

The latter is a shell script based on the xpdf tool pdftotext:

#!/bin/sh
# Extract the text of the PDF passed as $1 and write it to stdout as UTF-8.
/usr/bin/pdftotext -enc "UTF-8" "$1" -
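
One way to sanity-check such a wrapper is to confirm that what it prints really is well-formed UTF-8 before it reaches the indexer. This is only a minimal sketch, assuming iconv is installed; the function name is_valid_utf8 is made up for illustration and is not part of eZ Publish or xpdf.

```shell
#!/bin/sh
# Hypothetical helper (not part of eZ Publish or xpdf):
# succeeds if the data on stdin is well-formed UTF-8, fails otherwise.
# Assumes iconv is available; it exits non-zero on an invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}
```

With the function defined in the same script, something like `./mypdftotext file.pdf | is_valid_utf8 || echo "output is not valid UTF-8"` would flag a PDF whose extracted text is broken (file.pdf is a placeholder name).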

--
http://www.ecedi.fr
Agence Web, Créa/Conseils, Accessibilité
eZPublish, Drupal, Zend, Symfony

Geoff Bentley

Wednesday 08 April 2009 10:15:04 pm

Another option is to use the eZ Tika extension, which can index a wide variety of binary file types such as MS Word, MS Office, PDF, Excel, and ODF:

* http://projects.ez.no/eztika
* http://lucene.apache.org/tika/

Christian Rößler

Friday 08 May 2009 4:40:23 pm

It seems that a two-byte UTF-8 character is being interpreted as two separate 8-bit characters, which is not correct.
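
To illustrate the kind of misreading meant here, the following sketch takes the two UTF-8 bytes of "ü" and decodes them as if each byte were its own ISO-8859-1 character. It assumes iconv is installed; the function name misread_as_latin1 is hypothetical, and this only demonstrates the effect, not where in the Solr or Tomcat setup it actually happens.

```shell
#!/bin/sh
# "ü" is encoded in UTF-8 as the two bytes 0xC3 0xBC.
# Hypothetical helper: treats every input byte as one ISO-8859-1 character
# and re-encodes the result as UTF-8, the way a misconfigured component might.
misread_as_latin1() {
    iconv -f ISO-8859-1 -t UTF-8
}

# One character goes in, two unrelated characters come out.
printf '\303\274\n' | misread_as_latin1
```

The single umlaut comes back as two characters (0xC3 read as "Ã", 0xBC read as "¼"), which is the typical symptom when one stage of the pipeline assumes an 8-bit charset.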

I think I have the same issue here (German umlauts and the like) and found a promising page that explains the Tomcat/Solr charset settings:

http://wiki.apache.org/solr/SolrTomcat#head-20147ee4d9dd5ca83ed264898280ab60457847c4

Perhaps this is the issue. XML cannot contain those characters, so the Solr XML parser crashes. Can you try it out? I currently do not have access to an eZ installation.

cheers,
christian

Hannover, Germany
eZ-Certified http://auth.ez.no/certification/verify/395613
