This is related to my previous post. I sent this to support@ today:
From: David
Subject: my virtual host (is04607.com) is forcing charset=iso-8859-1
To: support@
My virtual host (is04607.com) is configured in such a way that sends this
header to the browsers:
Content-Type: text/html; charset=iso-8859-1
By doing that, browsers cannot render utf8 characters.
Could you please fix it?
I just need apache not to force the character set, or if it does, to use
utf8.
Support replied, telling me that the other virtual hosts were not having the same problem (BTW, these guys rock. Thanks Matt!). They even pointed me to a utf8 blosxom plugin. Then Matt pointed me in the right direction:
From: Matt
Subject: Re: charset=iso-8859-1 .. found it!!
To: David
It's the perl module your blog uses :)
Reading the source code:
$ more /usr/local/lib/perl5/5.6.1/CGI.pm
The B<-charset> parameter can be used to control the character set
sent to the browser. If not provided, defaults to ISO-8859-1. As a
side effect, this sets the charset() method as well.
I did find a plugin for blosxom that forces utf8
http://www.vrtprj.com/misc/output_utf8
I dunno if that helps.
And finally my answer with the solution:
From drio Sun Jun 24 19:48:13 2007
Subject: Re: charset=iso-8859-1 .. found it!!
To: Matt
I read about that plugin you sent me. I tried it out but it failed because
it required the Encode.pm module.
I read the code of the plugin and I think that plugin was not going to fix
the problem. My output, my blog entries, they use utf8 characters already.
That plugin was basically encoding to utf8 your blog entries.
I read the blosxom code and I found this line:
$header = {-type=>$content_type};
That was the one in charge of setting the character set in the http headers.
I changed it to:
$header = {-charset => 'UTF-8'};
and ...... success. My browser now renders the utf8 characters properly.
Thanks for your help!
Some of my friends were working on a site and they were using utf-8 to write their html/js. The main page had a drop-down where you could switch between different languages:
English
Français
Español
Deutsch
日本語
中文
한국어
NOTE: I am assuming that your browser will render these last utf8 characters properly. At the time I was writing this, the http server sending this content to your browser was forcing this character set:
drio@simba:~/wwwroot $ curl -I http://blog.is04607.com
HTTP/1.1 302 Found
Date: Sun, 24 Jun 2007 18:45:10 GMT
Server: Apache/1.3.37 (Unix) mod_perl/1.29 PHP/4.3.11 mod_gzip/1.3.26.1a
Location: http://www.is04607.com/blog/blosxom.cgi
Content-Type: text/html; charset=iso-8859-1
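If you want to pull the declared character set out of a header like that one without eyeballing it, here is a minimal Python sketch (Python is just my choice for illustration, and the helper name declared_charset is made up). It uses the standard library's email.message.Message to parse the Content-Type value:

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper for illustration; it only parses the header
    string and does not talk to any server.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()


# The header my host was sending at the time:
print(declared_charset("text/html; charset=iso-8859-1"))
# What I would like it to send instead:
print(declared_charset("text/html; charset=UTF-8"))
# No charset at all, which leaves the decision to the browser:
print(declared_charset("text/html"))
```

When the function returns None the server is not forcing anything, which is the "at least do not force it" case.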
I have to shoot an email to the sysadmin where I am hosting this so he can force utf8 on my virtual host.
That was exactly the same problem my friends had. Just by telling apache to use utf8 (or at least not to force iso-8859-1), things get fixed.
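To see why a forced iso-8859-1 header breaks a utf8 page, here is a small Python sketch of what a browser effectively does when it trusts that header: it takes the utf8 bytes and decodes them as iso-8859-1. The function name misread_as_latin1 is hypothetical and the labels are the ones from the drop-down above.

```python
def misread_as_latin1(text):
    """Encode text as UTF-8, then decode the bytes as ISO-8859-1.

    This mimics a browser that believes a charset=iso-8859-1 header
    while the page itself was written in UTF-8.
    """
    raw = text.encode("utf-8")
    # ISO-8859-1 maps every byte to a character, so this never raises;
    # it silently produces the wrong characters instead.
    return raw.decode("iso-8859-1")


for label in ["English", "Français", "Español", "日本語"]:
    print(repr(label), "->", repr(misread_as_latin1(label)))
```

Plain ASCII such as "English" survives untouched, which is why the problem only shows up once accented or Asian characters appear on the page.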
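By the way, do you know how many bits utf8 uses to encode a Japanese character like 日? 24 bits, 3 bytes:
drio@simba:~/wwwroot $ cat test5.html
----日-----
drio@simba:~/wwwroot $ hexdump test5.html
0000000 2d2d 2d2d 97e6 2da5 2d2d 2d2d 000a
000000d
Careful when reading that output: by default hexdump prints 16-bit words, and on a little-endian machine the two bytes of each word appear swapped. The real byte sequence is 2d 2d 2d 2d e6 97 a5 2d 2d 2d 2d 2d 0a, 13 (0xd) bytes in total. So 日 is 0xe6 0x97 0xa5; the 2d that shows up next to a5 is just the dash that follows it.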
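A quick way to double-check the byte count is to encode the character directly. This Python sketch (again, only an illustration; hexdump_words is a made-up helper that imitates hexdump's default 16-bit little-endian word output) rebuilds the file contents and shows both the raw bytes and the word view:

```python
ch = "\u65e5"  # the character 日, code point U+65E5
encoded = ch.encode("utf-8")

print(hex(ord(ch)))                          # the Unicode code point
print(" ".join(f"{b:02x}" for b in encoded)) # its UTF-8 bytes
print(len(encoded), "bytes")

# Same contents as test5.html: four dashes, 日, five dashes, newline.
data = ("----" + ch + "-----\n").encode("utf-8")
print(len(data), "bytes in the file")


def hexdump_words(buf):
    """Imitate hexdump's default view: 16-bit words, little-endian.

    Hypothetical helper; an odd-length buffer is padded with a zero
    byte, which is where the trailing 000a word comes from.
    """
    if len(buf) % 2:
        buf += b"\x00"
    return " ".join(
        f"{buf[i + 1]:02x}{buf[i]:02x}" for i in range(0, len(buf), 2)
    )


print(hexdump_words(data))
```

Running hexdump -C on the file would show the bytes one at a time in their real order, which avoids the swapped-pair confusion altogether.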
I found this document highly useful for understanding what unicode is. I think it has already become a classic.