eZFind - indexing errors - eZ Publish

Friday 11 September 2009 2:32:02 am - 7 replies

Author	Message
Paul Borgermans	Friday 11 September 2009 2:58:06 am
What do you use for conversion of binary files? It seems you are using pstotext, which is the default setting (but far from the best).

Paul
eZ Publish, eZ Find, Solr expert consulting and training
http://twitter.com/paulborgermans

Fabien Mas	Friday 11 September 2009 5:01:22 am
Hi Paul,

Indeed, I am using pstotext. Which one would you advise me to use? Thanks for your help :)

Fabien

Vincent Tabary	Friday 11 September 2009 5:31:52 am
Hi all,

That would be interesting for me too :) I installed pstotext because eZFind asked for it, but I do not know of any other software for this.

Vinz
http://vincent.tabary.me

Fabien Mas	Friday 11 September 2009 5:40:38 am
I have activated the eztika extension, but I am having some trouble with it as well:

    Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:58)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:80)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:111)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:57)

Paul Borgermans	Friday 11 September 2009 12:32:31 pm
Hello,

eztika is not very robust with Asian character sets, but should be fine with others.

For PDF in general, the best option is the xpdf tools. You need to create a wrapper script for xpdf's pdftotext utility. This is what I use (locally called ezpdftotext):

    #!/bin/sh
    /opt/local/bin/pdfinfo $1 >> /tmp/ezpdftotext.log
    /opt/local/bin/pdftotext -enc "UTF-8" $1 -

The pdfinfo line is only used for logging and can be removed once the configuration is working.

So, all things considered: use eztika for everything except PDF, for which you should use xpdf.

Expect eztika to improve in the future. It is also getting into Solr (and when stable enough, eZ Find will use that instead of the binary file wrappers).

Cheers,
Paul

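For readers adapting the wrapper above, here is a minimal sketch of a slightly more defensive variant. It is not Paul's script: the function form, the `PDFTOTEXT`, `PDFINFO` and `EZPDF_LOG` variables, and the exit code are assumptions made for illustration, and none of them are eZ Find settings. It quotes the file argument so that paths containing spaces survive, and it makes the pdfinfo logging optional.

```shell
#!/bin/sh
# Hypothetical variant of the ezpdftotext wrapper from the post above.
# PDFTOTEXT, PDFINFO and EZPDF_LOG are illustrative override variables only.
PDFTOTEXT="${PDFTOTEXT:-pdftotext}"
PDFINFO="${PDFINFO:-pdfinfo}"
EZPDF_LOG="${EZPDF_LOG:-}"   # leave empty to skip the pdfinfo logging

ezpdftotext() {
    # Refuse to run without a readable input file.
    if [ "$#" -lt 1 ] || [ ! -r "$1" ]; then
        echo "usage: ezpdftotext FILE.pdf" >&2
        return 2
    fi
    # Optional logging, mirroring the pdfinfo line of the original wrapper.
    if [ -n "$EZPDF_LOG" ]; then
        "$PDFINFO" "$1" >> "$EZPDF_LOG" 2>&1
    fi
    # "$1" is quoted so paths with spaces work; "-" writes the text to stdout.
    "$PDFTOTEXT" -enc UTF-8 "$1" -
}
```

Saved as a standalone wrapper file, the script would presumably end with a line such as `ezpdftotext "$@"` so that it receives the path of the binary file to convert.
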
Fabien Mas	Monday 14 September 2009 7:59:32 am
Hi Paul,

I have created my own parser using xpdf, and I no longer get any errors. I log the generated text and it looks fine.

But I have a new problem ;) With the default search engine it works well, but with ezfind activated, not a single word of my file is indexed (even though xpdf works well). When I search for a word, I get no results.

Is there something specific to do for ezfind?

Thanks,
Fabien

Fabien Mas	Thursday 17 September 2009 1:41:30 am
I got it :)

It was the page breaks that caused trouble in the XML generated by Solr, so now I use this command and it works fine:

    pdftotext -enc "UTF-8" -eol unix -nopgbrk $1 -
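
By default pdftotext separates pages with a form feed character (0x0C), and XML 1.0 does not allow that character, which is consistent with the fix above. The following is a rough sketch for checking extracted text for form feeds, or for stripping XML-unsafe control characters when a converter has no equivalent of `-nopgbrk`. The function names `has_form_feed` and `strip_xml_unsafe` are hypothetical helpers, not part of xpdf or eZ Find.

```shell
#!/bin/sh
# Hypothetical helpers; neither name comes from xpdf or eZ Find.

# Succeeds if the text on stdin contains at least one form feed (0x0C).
has_form_feed() {
    grep -q "$(printf '\f')"
}

# Removes the control characters that XML 1.0 forbids:
# everything below 0x20 except tab (0x09), LF (0x0A) and CR (0x0D).
strip_xml_unsafe() {
    tr -d '\000-\010\013\014\016-\037'
}
```

As a usage sketch, `pdftotext -enc UTF-8 -eol unix "$1" - | strip_xml_unsafe` would be one alternative to `-nopgbrk`, assuming the wrapper is free to post-process the text it writes to stdout.
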
