Tuesday 07 April 2009 5:04:14 am - 3 replies
Sylvain Gogel
Tuesday 07 April 2009 5:12:41 am
I use both
[PDFHandlerSettings]
TextExtractionTool=pstotext
and
[PDFHandlerSettings]
TextExtractionTool=mypdftotext
The latter is a shell script that wraps the xpdf tool pdftotext:
#!/bin/sh
/usr/bin/pdftotext -enc "UTF-8" "$1" -
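A slightly more defensive variant of that wrapper might look like the sketch below. It is only an illustration: the PDFTOTEXT override variable and the extract_text function name are assumptions made here for readability and testing, not eZ Publish settings. The pdftotext options are the same ones used above (-enc selects the output encoding, and the trailing "-" sends the text to standard output).

```shell
#!/bin/sh
# Hypothetical hardened variant of the mypdftotext wrapper.
# PDFTOTEXT is an assumed override for the binary path, not an eZ Publish setting.
PDFTOTEXT="${PDFTOTEXT:-/usr/bin/pdftotext}"

extract_text() {
    # Refuse to run without a readable input file.
    if [ ! -r "$1" ]; then
        echo "mypdftotext: cannot read '$1'" >&2
        return 1
    fi
    # Quoting "$1" keeps file names with spaces intact.
    "$PDFTOTEXT" -enc UTF-8 "$1" -
}

# Run only when called with an argument, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    extract_text "$1"
fi
```

Quoting the file name matters mostly for uploaded files whose temporary path or name contains whitespace; the unquoted form would split such a path into several arguments.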
-- http://www.ecedi.fr Agence Web, Créa/Conseils, Accessibilité eZPublish, Drupal, Zend, Symfony
Geoff Bentley
Wednesday 08 April 2009 10:15:04 pm
Another option is to use the eZ Tika extension, which allows indexing of a wide variety of binary file types such as MS Word, other MS Office formats, PDF, Excel and ODF:
* http://projects.ez.no/eztika
* http://lucene.apache.org/tika/
Christian Rößler
Friday 08 May 2009 4:40:23 pm
It seems that a UTF-8 double-byte character is being interpreted as two 8-bit characters, which is not correct.
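To illustrate that kind of misinterpretation, here is a small sketch using iconv and od. It is not taken from the setup discussed in this thread; the helper names utf8_u_umlaut and misread_as_latin1 are made up for the example, and ISO-8859-1 is only assumed as the wrong charset.

```shell
#!/bin/sh
# "ü" is the two bytes 0xC3 0xBC in UTF-8. Octal escapes are used so the
# script does not depend on the encoding of the script file itself.
utf8_u_umlaut() { printf '\303\274'; }

# Treat the input bytes as ISO-8859-1 and re-encode them as UTF-8.
# Each original byte is then handled as a separate 8-bit character.
misread_as_latin1() { iconv -f ISO-8859-1 -t UTF-8; }

# Correct data: two bytes, c3 bc.
utf8_u_umlaut | od -An -tx1

# After the misreading: four bytes, c3 83 c2 bc, which display as "Ã¼".
utf8_u_umlaut | misread_as_latin1 | od -An -tx1
```

If indexed text shows pairs like "Ã¼" where an umlaut should be, some step in the chain is probably decoding UTF-8 input with an 8-bit charset.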
I think I have the same issue here (German umlauts and the like) and found a promising page that explains the Tomcat/Solr charset settings:
http://wiki.apache.org/solr/SolrTomcat#head-20147ee4d9dd5ca83ed264898280ab60457847c4
Perhaps this is the issue: the XML cannot contain those misinterpreted characters, so the Solr XML parser crashes. Can you try it out? I currently do not have access to an eZ installation.
Cheers, Christian
Hannover, Germany eZ-Certified http://auth.ez.no/certification/verify/395613