
Unicode: to use or not to use.


Tuesday 04 July 2006 1:36:06 pm - 16 replies

Modified on Tuesday 04 July 2006 2:01:26 pm by Evgeniy K


Paul Borgermans

Tuesday 04 July 2006 3:07:49 pm

Hi Evgeniy

The performance difference is not that huge (rather small), but the gain with using UTF-8 in other areas (like copy/paste from external programs into ezp attributes) makes it imho mandatory.

Unicode is the default here, though Latin-1 is "native" to us.

hth

--paul

eZ Publish, eZ Find, Solr expert consulting and training
http://twitter.com/paulborgermans

Evgeniy K

Tuesday 04 July 2006 9:52:07 pm

Hi Paul.

Thanks for your answer.
We use eZ Publish to create our site in Russia. Today over 90% of Russian sites use charset=windows-1251.
Yesterday I turned on the eZ debug output and compared "Total runtime" and "Total script time" for a Unicode installation of eZ and a windows-1251 installation. The content was exactly the same.
I refreshed the page a hundred times and saw that the difference is around 30-50%! For example: windows-1251 0.6 sec vs. UTF-8 0.9 sec, or 0.8 sec vs. 1.2 sec. The result was stable.
I'm afraid it may have a large effect on a project with 3,000-5,000 hosts per day.
Or are "Total runtime" and "Total script time" not very important?
And could you please give some examples of "copy/paste from external programs into ezp attributes"?
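
For reference, the relative overhead implied by the two example timings can be worked out with a small sketch. The function name `relative_overhead` is made up for illustration, and the numbers are only the two pairs quoted in this post, not new measurements.

```python
def relative_overhead(baseline_s, candidate_s):
    """Return how much slower candidate_s is than baseline_s, as a fraction."""
    if baseline_s <= 0:
        raise ValueError("baseline must be positive")
    return (candidate_s - baseline_s) / baseline_s


# (windows-1251 seconds, UTF-8 seconds), the two example pairs from the post
samples = [(0.6, 0.9), (0.8, 1.2)]

for base, utf8 in samples:
    overhead = relative_overhead(base, utf8)
    print(f"{base:.1f}s -> {utf8:.1f}s: {overhead:.0%} slower")
```

Both pairs sit at the upper end of the quoted 30-50% range. Whether that matters at 3,000-5,000 hosts per day depends on how many of those requests are served from cache, which this sketch does not model.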

Thanks.

Łukasz Serwatka

Tuesday 04 July 2006 11:26:24 pm

Evgeniy, there might be a configuration issue that causes an unnecessary charset conversion, which is bad for performance.

Make sure that your settings are as follows:

settings/override/i18n.ini.append.php
<?php /*
[CharacterSettings]
Charset=utf-8
*/ ?>
settings/override/template.ini.append.php
<?php /*
[CharsetSettings]
DefaultTemplateCharset=utf-8
*/ ?>

and your database should be in UTF-8 as well.

A lot of memory is usually spent parsing the huge translation files (around 1 MB) when you use languages other than English (German, French, etc.). In that case, raising memory_limit might be necessary.

Personal website -> http://serwatka.net
Blog (about eZ Publish) -> http://serwatka.net/blog

Evgeniy K

Wednesday 05 July 2006 12:35:38 am

Lukasz, thank you.
We will use only one language, Russian. Our MySQL version is 5.0, so the charset of the database is UTF-8, but we can choose the collation (utf8_unicode_ci or cp_1251).
And in the setup wizard it is possible to choose "Enable Unicode Setup".

So, if we do not use Unicode, will the parsing of the huge translation files still take place? Or does it take place only with a Unicode setup?

Thanks.

Nathan Kelly

Wednesday 05 July 2006 12:39:00 am

On the subject of UTF-8, is there an option for UTF-8 encoding in the install process?

I didn't install the current eZ site I'm working on, so I'm unaware of the setup process. I am going to be running the UTF-8 update script soon and I hope it works without any problems.

If there is no option for this in the setup stage, is there any plan to add it in the future?

Cheers!

[EDIT] Looks like I'm not the only one wondering this! We must have been thinking the same thing at the same time?

Pardon me while I burst into flames...

Łukasz Serwatka

Wednesday 05 July 2006 12:44:43 am

The parsing is done only the first time (if the translation cache does not exist). The XML file is compiled to native PHP (the translation cache), which improves performance. As long as you don't clear the cache or change the translation file, eZ publish will use the translation from the cache.

When you use a language other than eng-GB, TextTranslation and TranslationCache are enabled. If you use eng-GB, translation is not necessary since all labels are in English by default, so parsing of the translation file is disabled.


Evgeniy K

Wednesday 05 July 2006 2:01:02 am

Lukasz,
thanks.
I checked the settings you wrote about. The result is still the same. So the interesting fact is that two identical sites show different performance with Unicode and non-Unicode setups, and the difference in "Total runtime" and "Total script time" is not small (up to 30-50%). If you are interested, I can email you the URLs of the two sites with the debug output turned on. Both of them are hosted on a dedicated server, and all the content is exactly the same.
What do you think: can this difference have a strong influence on a project with 3,000-5,000 hosts per day (CPU, memory...)?

Kristof Coomans

Wednesday 05 July 2006 3:59:08 am

@Nathan:

"On the subject of UTF-8, is there an option for UTF-8 encoding in the install process?"

Yes there is, since eZ 3.8. But make sure you set utf8 as the default charset for your database.

independent eZ Publish developer and service provider | http://blog.coomanskristof.be | http://ezpedia.org

Evgeniy K

Wednesday 05 July 2006 4:51:28 am

Nathan, if you run MySQL 5.0.x (or later) you must also set the collation. It may be utf8_general_ci, utf8_unicode_ci, etc. If, for example, I choose a windows-1251 collation, I cannot do a Unicode installation, even if the database charset is UTF-8.

Kristof Coomans

Wednesday 05 July 2006 5:04:59 am

If you don't specify a collation in MySQL, then the default collation of the charset will be used ( http://dev.mysql.com/doc/refman/5.0/en/charset-database.html ), which is utf8_general_ci for utf8.


Nathan Kelly

Wednesday 05 July 2006 5:05:03 am

Good to know for future installs, thanks guys.


Evgeniy K

Friday 14 July 2006 8:20:00 am

Hello.
In this thread Paul Borgermans wrote:

"the gain with using UTF-8 in other areas (like copy/paste from external programs into ezp attributes) makes it imho mandatory."

Could somebody please give some examples? Not links, just a description. It is very important for me to understand in which cases UTF-8 is mandatory.
Thanks.

Evgeniy.

Paul Borgermans

Friday 14 July 2006 8:37:57 am

Hi Evgeniy

It is just my opinion. In our case it is about copy/paste from MS Word into (XML) text fields, which happens a lot. With UTF-8 this goes fine: most (if not all) special characters are retained. Before converting everything to UTF-8 we had latin1, and this caused the loss of some special characters.
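
The kind of loss described here can be reproduced with a minimal sketch. It assumes strict ISO-8859-1 (Python's `latin-1` codec) rather than the windows-1252 superset that some software silently substitutes. The sample string and the helper name `lossy_chars` are illustrative only.

```python
# Characters that Word's autocorrect commonly inserts: curly quotes,
# an en dash, a no-break space, the euro sign and an ellipsis.
word_text = "\u201cquoted\u201d \u2013 price: 5\u00a0\u20ac\u2026"


def lossy_chars(text, charset):
    """Return the characters in text that charset cannot represent."""
    lost = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            lost.append(ch)
    return lost


print("latin-1 cannot store:", lossy_chars(word_text, "latin-1"))
print("utf-8 cannot store:  ", lossy_chars(word_text, "utf-8"))
```

UTF-8 can encode every Unicode character, so the second list is always empty. With strict Latin-1 the curly quotes, dash, euro sign and ellipsis have no code point and would have to be replaced or dropped.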

have a nice weekend

Paul


Paul Borgermans

Friday 14 July 2006 8:40:00 am

And something else: we now use the Lucene search plugin. The Lucene backend stores everything as UTF-8.

OK, that makes 2 arguments/examples ;-)


Evgeniy K

Friday 14 July 2006 8:40:36 am

Thanks a lot, Paul.

Marko Žmak

Sunday 16 July 2006 2:57:32 pm

"I refreshed the page a hundred times and saw that the difference is around 30-50%! For example: windows-1251 0.6 sec vs. UTF-8 0.9 sec, or 0.8 sec vs. 1.2 sec. The result was stable.
I'm afraid it may have a large effect on a project with 3,000-5,000 hosts per day.
Or are "Total runtime" and "Total script time" not very important?"

Evgeniy, the "Total script time" is not actually the real time of your page execution, but rather the time of loading the page, so this timer depends on the speed of your internet connection. Try loading your page on a slow connection and summing up all the bolded values in debug. You'll see you won't get a value much lower than "Total script time".

So it's possible that when you changed the encoding to UTF-8, it was the time to transfer the data over your connection that increased by 30-50%, not the actual execution time of eZ. That would be normal, because in UTF-8 each Cyrillic character takes two bytes (ASCII characters such as markup still take one), while in windows-1251 every character takes only one.
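
How much a page actually grows depends on the mix of markup and Cyrillic text, since only the non-ASCII characters get bigger in UTF-8. A minimal sketch follows; the sample fragment and the helper name `encoded_sizes` are made up for illustration and are not taken from the sites being compared.

```python
def encoded_sizes(text):
    """Return (windows-1251 byte count, UTF-8 byte count) for text."""
    return len(text.encode("cp1251")), len(text.encode("utf-8"))


# A tiny HTML fragment: ASCII markup around a short Russian phrase.
page = '<p class="intro">Привет, мир!</p>'

cp1251_size, utf8_size = encoded_sizes(page)
growth = utf8_size / cp1251_size - 1

print("windows-1251:", cp1251_size, "bytes")
print("UTF-8:       ", utf8_size, "bytes")
print(f"growth:       {growth:.0%}")
```

The UTF-8 version is larger by exactly one byte per Cyrillic letter, so a markup-heavy page grows much less than a text-heavy one. Comparing the real byte sizes of the two generated pages would show whether transfer size alone could explain the timing gap.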

--
Nothing is impossible. Not if you can imagine it!

Hubert Farnsworth
