Searching content into a pdf file - eZ Publish


Monday 01 August 2011 7:34:00 am - 5 replies



Author	Message
Peter Keung	Monday 01 August 2011 7:51:40 am This is the first thing that comes to mind: http://projects.ez.no/eztika http://www.mugo.ca Mugo Web, eZ Partner in Vancouver, Canada
Simone Conti	Monday 01 August 2011 8:05:24 am Unfortunately that is not what I'm looking for. I need something that allows me to search inside pdf files. Somebody told me that eZ Publish has this feature built in but that it needs to be enabled. Any suggestions?
Steven E. Bailey	Monday 01 August 2011 9:07:18 am eZPublish does have this feature and you should be seeing your pdfs indexed - with a bunch of caveats. When a pdf is saved (or when you update your search index), the pdf is run through the tool defined by [PDFHandlerSettings] TextExtractionTool=pstotext in your binaryfile.ini. If you don't have this tool on your machine, then your pdfs won't be indexed. If you search for TextExtractionTool or pdftotext in these forums you'll see a couple of other possible tools, for example in this thread: http://share.ez.no/forums/extensions/ez-find/solr-indexing-error If you do have the tool installed and your pdfs still aren't being indexed, then it probably means that your pdfs aren't structurally text: the content is actually an image (or a series of images) saved in the pdf container. In that case you're not going to be able to index them using pdftotext. A good test is to run whatever tool you have on the command line against a file that isn't being indexed and see what actually comes out. If nothing comes out, you'll have to use some other tool, like eztika (I've never used it) or something like tesseract, to extract the text. Certified eZPublish developer http://ez.no/certification/verify/396111 Available for ezpublish troubleshooting, hosting and custom extension development: http://www.leidentech.com
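The command-line test described above can also be scripted. What follows is only a minimal Python sketch and not part of eZ Publish: the function names, the 20-character threshold, and the status strings are made up for illustration. It assumes the tool prints the extracted text to stdout when called as `pdftotext file.pdf -`; other tools such as pstotext may need different arguments.

```python
import shutil
import subprocess


def extract_text(command, timeout=60):
    """Run an extraction command and return whatever it writes to stdout."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        raise RuntimeError(f"extraction failed: {result.stderr.strip()}")
    return result.stdout


def looks_indexable(text, min_chars=20):
    """Heuristic (assumption): enough non-whitespace characters means real text."""
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def check_pdf(pdf_path, tool="pdftotext"):
    """Return 'tool-missing', 'text' or 'no-text' for one pdf file."""
    if shutil.which(tool) is None:
        # The extraction tool is not installed, so nothing can be indexed.
        return "tool-missing"
    # Assumption: "<tool> <file> -" sends the text to stdout (true for pdftotext).
    text = extract_text([tool, pdf_path, "-"])
    # An image-only pdf typically yields empty or near-empty output.
    return "text" if looks_indexable(text) else "no-text"
```

Calling `check_pdf("/path/to/file.pdf")` on a file that is not being indexed would then separate the two failure cases from the thread: the tool is missing, or the tool runs but finds no text, in which case an OCR-style tool is needed.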
Simone Conti	Thursday 04 August 2011 3:26:58 am Now something works. I decided to use eztika as suggested by Peter. I have a question: where does eztika store its data? I hope it doesn't scan all the pdfs on every search... I have a very large number of pdf files!! Thanks
Paul Borgermans	Friday 05 August 2011 10:23:26 am Hi, eztika does not store the data itself; its goal is to extract the plain text for subsequent indexing by the configured search plugin (you should use eZ Find of course :) ). The default search plugin stores the indexing result in the database, while eZ Find uses Solr, which stores its data in Lucene index files on the filesystem. This is done only when the pdf is uploaded or updated. hth Paul eZ Publish, eZ Find, Solr expert consulting and training http://twitter.com/paulborgermans
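The point that text is extracted once at upload or update time, and that searches only consult the stored index, can be illustrated with a toy inverted index. This is a hedged sketch in Python and has nothing to do with how eZ Find, Solr, or the default search plugin are actually implemented; the class `TinyIndex` and its `extractions` counter are hypothetical names used only to make the idea concrete.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy in-memory inverted index: word -> set of document ids."""

    def __init__(self):
        self._postings = defaultdict(set)
        self.extractions = 0  # counts how often text extraction was run

    def index_document(self, doc_id, extract):
        """Call on upload/update; `extract` is a callable returning plain text."""
        # Drop postings from any earlier version of this document.
        for ids in self._postings.values():
            ids.discard(doc_id)
        text = extract()  # the expensive step, done once per upload/update
        self.extractions += 1
        for word in re.findall(r"\w+", text.lower()):
            self._postings[word].add(doc_id)

    def search(self, word):
        """Look the word up in the index; no file is re-read here."""
        return sorted(self._postings.get(word.lower(), set()))
```

With this shape, the number of extraction runs grows with uploads and updates, not with the number of searches, which is the property asked about in the question above.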
