Import XML Data Topic

Wednesday 07 December 2005 1:09:59 pm - 54 replies

Xavier Dutoit

Wednesday 07 December 2005 1:20:07 pm

Hi,

I changed quite a few things on the ImportXML.
Some could be useful for everyone, like setting the publication date based on an XML field and handling a few more attribute types than you did; others are quite specific (e.g. finding the parent's node based on a value in an XML field).

The big issue on my side was memory: it just can't handle a file bigger than a few hundred records and XML fields (the XML path library seems to be quite sub-optimal, to say the least, and eZ on the other hand...).

I had to reimplement it to run from the shell, and it worked like a charm!

Don't hesitate to contact me by mail if you want some of these features (I probably won't have the time to clean the hacks, but you might find a few things to reuse.)

Thanks for your extension !

X+

http://www.sydesy.com

Olivier Pierret

Wednesday 07 December 2005 2:17:32 pm

Xavier,

I already dropped you an email saying I was waiting for your changes but unfortunately you could not answer. I'll retry...

Olivier

Vytautas Germanavičius

Wednesday 07 December 2005 10:53:57 pm

"Known issues:" still shows "UTF-8 is not supported", while "Changelog" shows "1.4.1 .... - UTF-8 support "

{set-block scope=root variable=cache_ttl}0{/set-block}

Vytautas Germanavičius

Thursday 08 December 2005 12:16:08 am

Sorry, but

$fieldValue = utf8_decode($xPathEngine->wholeText("$path_item/$fieldName"."[1]"));

does not solve the problem. I'm still getting ??? instead of UTF-8 symbols.
I do not understand why utf8_decode is used. As I understand from the documentation (http://lt.php.net/manual/en/function.utf8-decode.php), this function converts a string to ISO-8859-1. After such a conversion, UTF-8 symbols are lost...

The problem is that xml_parser reads the file as ISO-8859-1 and ignores the encoding specified in the XML file.

This problem should be fixed in xpathengine. I put this in XPath.class.php at line 1680:

      $parser = xml_parser_create('UTF-8');

This is the only way I found to get it working with UTF-8...
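For anyone who wants to try the idea outside the extension, here is a minimal sketch, assuming PHP's standard xml extension is available. The helper name readFirstText() and the sample element names are made up for illustration and are not part of ImportXML or xpathengine. It creates the parser with 'UTF-8' as in the fix above and also asks for UTF-8 output explicitly, so the parsed text is never downgraded to ISO-8859-1.

```php
<?php
// Sketch only: readFirstText() is a hypothetical helper, not part of the extension.

function readFirstText(string $xml, string $tag): ?string
{
    // Create the parser with an explicit UTF-8 encoding, as in the fix above.
    $parser = xml_parser_create('UTF-8');
    // Also request UTF-8 for the text handed back to PHP.
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');
    // Keep tag names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    $values = [];
    if (xml_parse_into_struct($parser, $xml, $values) !== 1) {
        return null; // the document could not be parsed
    }

    // Return the text of the first element with the requested name.
    foreach ($values as $node) {
        if ($node['tag'] === $tag && isset($node['value'])) {
            return $node['value'];
        }
    }
    return null;
}

$sample = '<?xml version="1.0" encoding="UTF-8"?>'
        . "<list><item><name>Germanavi\u{010D}ius</name></item></list>";
echo readFirstText($sample, 'name'), "\n";
```

If the echoed name still shows ??? in a terminal that is set to UTF-8, the input file itself is probably not UTF-8.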

 

{set-block scope=root variable=cache_ttl}0{/set-block}

Olivier Pierret

Thursday 08 December 2005 1:57:43 pm

Now I got it vytis !

I actually tested utf8_decode with characters convertible to ISO-8859-1, so it *seemed* to work.
For the moment the situation is as follows:
if the "is UTF8" checkbox is ticked it will use utf8_decode(), otherwise not.
This is broken, so I will remove this checkbox asap. In the meantime, do not tick it.

Best and only (known) way for now to import UTF-8 is yours.

I will change the doc and the code accordingly.

Olivier

Vytautas Germanavičius

Friday 09 December 2005 7:41:08 am

The import function has limitations: I cannot import more than 340 records in one run... :(

{set-block scope=root variable=cache_ttl}0{/set-block}

Vytautas Germanavičius

Tuesday 13 December 2005 5:21:03 am

Is there any way to run the import script from the command line?

{set-block scope=root variable=cache_ttl}0{/set-block}

Olivier Pierret

Wednesday 14 December 2005 12:28:02 pm

Surely
I guess we need to add a file called import-cli.php that would parse the argument and call the

function &importXMLData( $xmldata, $datatype, $remove, $movetotrash) 

in importXMLDatafunctioncollection.

Of course the context should be set appropriately; if not, I think the call:

$class =& eZContentClass::fetchByIdentifier( $identifiantClasse );

will fail, as will many other kernel-related eZ API calls.

I think we should have a look at how runcronjob.php is written.

Another option would be to wait for Xavier's hacks to this extension, because I know he is using a script approach to run it.

Vytautas Germanavičius

Wednesday 14 December 2005 11:02:01 pm

I spent two days trying to write such code, but without success... Finally I found out that the administrator had updated PHP, but without MySQL support...
I moved my site to another server and will test my code there. If it works, I will post it here.

{set-block scope=root variable=cache_ttl}0{/set-block}

Xavier Dutoit

Thursday 15 December 2005 12:36:54 am

Sorry Olivier,

I know I'm late, but I can't find the time to clean up the mess of custom things I've added. I'll try to do that this weekend; the delay is just ridiculous.

Thanks for your patience

X+

http://www.sydesy.com

Vytautas Germanavičius

Thursday 15 December 2005 4:09:13 am

Finally I made it! ;) Now you can import data from the command line. I wrote an additional script that reads the XML file and sets the eZ user object for Olivier's script. I will clean up the debug prints and post it here.

The idea is great, but the import script is very slow: it takes about 51s to import 100 entries.
I need to import ~23 000 entries. At the current speed of the script it will take about 3.5 hours...
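As a quick check of that estimate, here is a small sketch of the arithmetic, assuming the import time grows linearly with the number of entries. The function name estimateImportSeconds() is only for illustration. With the figures above it works out to roughly 3.3 hours, in the same range as the number quoted.

```php
<?php
// Sketch only: linear extrapolation from one measured batch.

function estimateImportSeconds(int $entries, float $secondsPerBatch, int $batchSize): float
{
    // Number of batches times the measured time per batch.
    return $entries / $batchSize * $secondsPerBatch;
}

// 23 000 entries at 51 s per 100 entries.
$seconds = estimateImportSeconds(23000, 51.0, 100);
printf("%.0f s = %.2f h\n", $seconds, $seconds / 3600); // prints "11730 s = 3.26 h"
```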

How fast is your import algorithm, Xavier?

{set-block scope=root variable=cache_ttl}0{/set-block}

Vytautas Germanavičius

Thursday 15 December 2005 5:55:23 am

Here it is:

<?php
//
// Created on: <2005-12-15 14:52:57 vytis>
//
// This file may be distributed and/or modified under the terms of the
// "GNU General Public License" version 2 as published by the Free
// Software Foundation and appearing in the file LICENSE included in
// the packaging of this file.
//
// This file is provided AS IS with NO WARRANTY OF ANY KIND, INCLUDING
// THE WARRANTY OF DESIGN, MERCHANTABILITY AND FITNESS FOR A PARTICULAR
// PURPOSE.
//
// The "GNU General Public License" (GPL) is available at
// http://www.gnu.org/copyleft/gpl.html.
//
// Contact licence@ez.no if any conditions of this licencing isn't clear to
// you.
//

include_once( 'lib/ezutils/classes/ezcli.php' );
include_once( 'kernel/classes/ezscript.php' );

$cli =& eZCLI::instance();
$script =& eZScript::instance( array( 'description' => ( "eZ publish Script Executor\n\n" .
                                                         "Allows execution of simple PHP scripts which uses eZ publish functionality,\n" .
                                                         "when the script is called all necessary initialization is done\n" .
                                                         "\n" .
                                                         "ezexec.php myscript.php" ),
                                      'use-session' => true,
                                      'use-modules' => true,
                                      'use-extensions' => true ) );

$script->startup();

$options = $script->getOptions( "",
                                "[scriptfile]",
                                array() );

// Six arguments are read below (indexes 0 to 5), so require all six.
if ( count( $options['arguments'] ) < 6 )
{
    $script->shutdown( 1, "Usage of import script:\n SiteAccess, \n XML file, \n datatype, \n remove? (0 - no, 1 - yes) , \n move to trash? (0 - no, 1 - yes),\n user's ID");
    die();
}

$script->setUseSiteAccess($options['arguments'][0]);

$options = $script->getOptions( "",
                                "[scriptfile]",
                                array() );
$script->initialize();


include_once ('extension/importXMLData/modules/importXMLData/importXMLDatafunctioncollection.php');
include_once ('kernel/classes/ezcontentclass.php');
 
$xmldata = file_get_contents($options['arguments'][1]);
importXMLDataFunctionCollection::importXMLData($xmldata, $options['arguments'][2], $options['arguments'][3], $options['arguments'][4], $options['arguments'][5]);

$script->shutdown();
?>

This needs some modifications to the import script:
1. I don't log in to the system. Instead, I pass the user's ID as a parameter in importXMLDatafunctioncollection.php:

	function &importXMLData( $xmldata, $datatype, $remove, $movetotrash, $userID)

So, you should delete from importXMLDatafunctioncollection.php:

		  $user =& eZUser::currentUser();
		  // set user ID 
		  $userID =& $user->attribute( 'contentobject_id' );

2. Additionally, I added some debug prints to see the progress of the import.
At the beginning of the function:

 
$cli =& eZCLI::instance();

 

Then:

 
		$cli->output( "Preparing list for import" );
		$paths_item = $xPathEngine->match("//$listTag/$itemTag");
		$cli->output("List size: ".count($paths_item));


Then I changed:

 
	$ii=0;  
	foreach ($paths_item as $path_item) 
	{
		$ii++;
		if( $ii % 100 == 0 ) // plain modulo, no bcmath extension needed
		{
			$cli->output("\n imported $ii of ".count($paths_item) );
		}		
		foreach($fieldNameList as $fieldName) 
		{
 

Good luck.
Next, I'm going to write a shell script to import multiple XML documents. I think this is useful when you need to import several thousand records, because xPathEngine uses too much memory.
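One way to prepare such smaller documents is sketched below, assuming PHP's DOM extension is available and that the items sit directly under the root list element. The helper name splitXmlItems() and the sample tag names are hypothetical; this is not the script used in this thread. Note that the splitter itself loads the whole file once; the point is only to produce small files for the importer.

```php
<?php
// Sketch only: splitXmlItems() is a hypothetical helper, not part of the extension.

function splitXmlItems(string $xml, string $itemTag, int $chunkSize): array
{
    if ($chunkSize < 1) {
        throw new InvalidArgumentException('chunkSize must be at least 1');
    }

    $source = new DOMDocument();
    if (!$source->loadXML($xml)) {
        throw new RuntimeException('Could not parse the XML input');
    }
    $root = $source->documentElement;

    // Copy the matching nodes into a plain array so they can be chunked.
    // getElementsByTagName() matches items at any depth below the root.
    $items = [];
    foreach ($root->getElementsByTagName($itemTag) as $item) {
        $items[] = $item;
    }

    $chunks = [];
    foreach (array_chunk($items, $chunkSize) as $group) {
        // Each chunk becomes a small document with the same root tag name.
        $doc = new DOMDocument('1.0', 'UTF-8');
        $list = $doc->createElement($root->nodeName);
        $doc->appendChild($list);
        foreach ($group as $item) {
            $list->appendChild($doc->importNode($item, true));
        }
        $chunks[] = $doc->saveXML();
    }
    return $chunks;
}

$sample = '<list><item>a</item><item>b</item><item>c</item><item>d</item><item>e</item></list>';
echo count(splitXmlItems($sample, 'item', 2)), " chunks\n";
```

Each returned string could then be written to its own file with file_put_contents() and passed to the command-line import one file at a time.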

{set-block scope=root variable=cache_ttl}0{/set-block}

Xavier Dutoit

Friday 16 December 2005 12:40:57 am

Hi,

Yes, the import is dead slow and yes Xpath swallows all the memory it can find (and more). I modified a few things to release memory into Olivier's script.

I didn't properly benchmark it, but it took very long. In my case it was a one-shot import, so it didn't matter too much.

X+

http://www.sydesy.com

Philip K.

Friday 03 February 2006 6:02:47 am

Hey.

I've tried to change the importer so that it's possible to import utf-8 files, but it doesn't work...

$parser = xml_parser_create('UTF-8');

doesn't work

My problem is that I can't import chars like "ä", "ö", "ü"

Any ideas?? Thanks a lot...

Linux is like a wigwam; no windows, no gates, and apache inside!

Vytautas Germanavičius

Sunday 05 February 2006 11:16:08 pm

It should work.
I had a similar problem when I tested this extension. The problem may be that your data file is not saved in UTF-8.
If you use Windows, open the data file with Notepad and choose "Save as" with a different name; there you can choose the encoding. If UTF-8 is selected by default in the "Save as" dialog, your file is already in UTF-8. If not, save it as UTF-8 and try to import the new file.
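The file can also be checked from PHP. The following is a minimal sketch that relies on the fact that PCRE rejects invalid UTF-8 when the /u modifier is used. The function name looksLikeUtf8() is made up for illustration, and a file containing only plain ASCII passes the check too, since ASCII is valid UTF-8.

```php
<?php
// Sketch only: looksLikeUtf8() is a hypothetical helper.

function looksLikeUtf8(string $bytes): bool
{
    // With /u, preg_match() returns false for a subject that is not valid UTF-8;
    // the empty pattern matches any valid subject and returns 1.
    return preg_match('//u', $bytes) === 1;
}

var_dump(looksLikeUtf8("\xC3\xA4")); // "ä" encoded as UTF-8
var_dump(looksLikeUtf8("\xE4"));     // "ä" encoded as ISO-8859-1
```

To check a data file, pass the result of file_get_contents() for that file to the function.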

{set-block scope=root variable=cache_ttl}0{/set-block}

Philip K.

Monday 06 February 2006 12:02:01 am

Hm, thanks for your reply, but it still doesn't work...

btw: I'm using eZ version 3.6.0

Linux is like a wigwam; no windows, no gates, and apache inside!

Vytautas Germanavičius

Monday 06 February 2006 12:32:19 am

I used it on eZ 3.6.4. What do you see instead of the letters with umlauts?
Maybe UTF-8 is not set in your template?

{set-block scope=root variable=cache_ttl}0{/set-block}

Philip K.

Monday 06 February 2006 1:05:17 am

Hm, ok, I can't write the symbols down here, they are changed into html letters... I made a screenshot:

http://www.philip-kahlen.de/import_error.gif

Linux is like a wigwam; no windows, no gates, and apache inside!

Vytautas Germanavičius

Wednesday 08 February 2006 11:37:17 pm

Do you have this at the top of your templates?

{*?template charset=utf-8?*}

I had several cases where UTF-8 was not displayed because that header was missing.

I put this header in all templates of the xmlimport extension.

But I still think that your data file is in a different encoding. For editing UTF-8 files I recommend Notepad++ from sourceforge.net.

{set-block scope=root variable=cache_ttl}0{/set-block}

Guillaume Kulakowski

Tuesday 14 February 2006 2:58:54 am

Hello,

I have a little question: is this import system compatible with the ezxmltext format?

My blog : http://www.llaumgui.com (not in eZ Publish ;-))
eZC on RHEL : http://blog.famillecollet.com/pages/Config-en
eZC on Fedora : just "yum install php-channel-ezc"
