Tuesday 07 November 2006 2:51:11 pm
I've looked into the search engine indexing code and found two things which may cause the problem, both in \kernel\search\plugins\ezsearchengine:
LINE 105: $wordArray = split( " ", $text );
This splits the entire text all at once, which in my experience takes a long, long time; it might be better to do it in parts. The place where it actually breaks is where the words are indexed in 1000-word groups. I think it gets to about 127,000 words before breaking:
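One way the split could be done in parts instead: walk the text in fixed-size slices, back each slice up to the last space so no word is cut in half, and hand each batch of words off as you go. A minimal sketch, nothing here is from the eZ codebase; `tokenizeInChunks()` and the `$indexChunk` callback are hypothetical stand-ins:

```php
<?php
// Hypothetical sketch: tokenize a large text in fixed-size chunks instead
// of one split() over the whole string. $indexChunk is a stand-in callback
// for the real per-batch indexing step.
function tokenizeInChunks( $text, $chunkSize, $indexChunk )
{
    $offset = 0;
    $length = strlen( $text );
    while ( $offset < $length )
    {
        $chunk = substr( $text, $offset, $chunkSize );
        // Back the chunk boundary up to the last space so a word is not
        // cut in half (words longer than $chunkSize would still be cut).
        $lastSpace = strrpos( $chunk, ' ' );
        if ( $offset + $chunkSize < $length && $lastSpace !== false )
            $chunk = substr( $chunk, 0, $lastSpace );
        $words = preg_split( '/\s+/', trim( $chunk ), -1, PREG_SPLIT_NO_EMPTY );
        if ( $words )
            $indexChunk( $words );
        // Advance past what we consumed; never stall on an empty chunk.
        $offset += max( strlen( $chunk ), 1 );
    }
}
```

This keeps memory bounded by the chunk size rather than by the size of the whole document, which is the point when you are feeding in a 12 MB manual.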
LINE 129:
$db =& eZDB::instance();
$db->begin();
for ( $arrayCount = 0; $arrayCount < $wordCount; $arrayCount += 1000 )
{
    $placement = $this->indexWords( $contentObject, array_slice( $indexArray, $arrayCount, 1000 ), $wordIDArray, $placement );
}
$db->commit();
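The batching itself is a simple array_slice() walk. Here is a self-contained illustration of that pattern; `batchWords()` and `$processBatch` are made-up stand-ins, since the real `indexWords()` call needs a content object and the database:

```php
<?php
// Self-contained illustration of the 1000-word batching loop above.
// $processBatch stands in for the real indexWords() call.
function batchWords( array $indexArray, $batchSize, $processBatch )
{
    $wordCount  = count( $indexArray );
    $batchCount = 0;
    for ( $arrayCount = 0; $arrayCount < $wordCount; $arrayCount += $batchSize )
    {
        // Each batch is at most $batchSize words; the last one may be shorter.
        $processBatch( array_slice( $indexArray, $arrayCount, $batchSize ) );
        $batchCount++;
    }
    return $batchCount;
}
```

So a 2500-word array produces three batches of 1000, 1000 and 500 words; every word passes through exactly once.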
I did try moving the $db->begin() and $db->commit() inside the for loop, to get several small transactions instead of one huge one, but it didn't help: on Windows, PHP still crashes after reaching the 127,000th word. I uploaded the complete PHP manual in HTML (about 12 MB) and then ran the indexing through the shell. It's interesting to see that the text gets stripped of its HTML tags:
Line 94:
$text = eZSearchEngine::normalizeText( strip_tags( $metaDataPart['text'] ), true );
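To see what strip_tags() hands to normalizeText(), a quick standalone example (the HTML snippet is made up):

```php
<?php
// What strip_tags() leaves of an HTML fragment: tags go, text stays
// (note that text from adjacent elements can run together).
$html = '<h1>split</h1><p>Splits a <em>string</em> into pieces.</p>';
echo strip_tags( $html );
// → splitSplits a string into pieces.
```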
So you don't need a 'lynx --dump' to index HTML files, as they get stripped of their HTML tags anyway. To index HTML, one can therefore use the plaintext plugin:
settings/override/binaryfile.ini.append.php
[HandlerSettings]
MetaDataExtractor[text/html]=plaintext
Hope this helps a bit. Regards, Felipe
Felipe Jaramillo
eZ Certified Extension Developer
http://www.aplyca.com | Bogotá, Colombia