
Sun, 24 Jun 2007

Blosxom and utf8


This is related to my previous post. I sent this to support@ today:

From: David 
Subject: my virtual host (is04607.com) is forcing charset=iso-8859-1                                         
To: support@

My virtual host (is04607.com) is configured in such a way that sends this                                    
header to the browsers:                                                                                      
                                                                                                             
Content-Type: text/html; charset=iso-8859-1                                                                  
                                                                                                             
By doing that, browsers cannot render utf8 characters.                                                       
                                                                                                             
Could you please fix it?                                                                                     
                                                                                                             
I just need apache not to force the character set, or if it does, to use                                     
utf8.             

Support replied, telling me that the other virtual hosts were not having the same problem (BTW, these guys rock. Thanks Matt!). They even pointed me to a utf8 blosxom plugin, and Matt's reply put me on the right track:

From: Matt
Subject: Re: charset=iso-8859-1 .. found it!!                                                                
To: David 

It's the Perl module your blog uses :)
                                                                                                             
Reading the source code:                                                                                     
$ more /usr/local/lib/perl5/5.6.1/CGI.pm                                                                     
                                                                                                             
The B<-charset> parameter can be used to control the character set                                           
sent to the browser.  If not provided, defaults to ISO-8859-1.  As a                                         
side effect, this sets the charset() method as well.                                                         
                                                                                                             
I did find a plugin for blosxom that forces utf8                                                             
http://www.vrtprj.com/misc/output_utf8                                                                       
                                                                                                             
I dunno if that helps.

And finally my answer with the solution:

From drio  Sun Jun 24 19:48:13 2007                                                                          
Subject: Re: charset=iso-8859-1 .. found it!!                                                                
To: Matt
                                                                                                             
I read about that plugin you sent me. I tried it out but it failed because
it required the Encode.pm module.
                                                                                                             
I read the code of the plugin and I think that plugin was not going to fix                                   
the problem. My output, my blog entries, they use utf8 characters already.                                   
That plugin was basically encoding to utf8 your blog entries.                                                
                                                                                                             
I read the blosxom code and I found this line:                                                               
                                                                                                             
$header = {-type=>$content_type};                                                                            
                                                                                                             
That was the one in charge of setting the character set in the http headers.                                 
I changed it to:                                                                                             
                                                                                                             
$header = {-charset => 'UTF-8'};                                                                             
                                                                                                             
and ...... success. My browser now renders the utf8 characters properly.                                     
                                                                                                             
Thanks for your help!                                                                                        
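
Stripped of Blosxom and CGI.pm, the principle behind the fix is that the charset named in the header has to match the encoding of the bytes that follow it. Here is a minimal Python sketch of that idea. It is not Blosxom's code, and `cgi_response` is a name I made up for the illustration:

```python
def cgi_response(body_text, content_type="text/html", charset="UTF-8"):
    """Return raw CGI output: header, blank line, body encoded as `charset`."""
    header = "Content-Type: %s; charset=%s\r\n\r\n" % (content_type, charset)
    # The header is plain ASCII; the body uses the charset the header declares.
    return header.encode("ascii") + body_text.encode(charset)


raw = cgi_response("<p>日本語</p>")
head, _, body = raw.partition(b"\r\n\r\n")
print(head.decode("ascii"))
print(body.decode("utf-8") == "<p>日本語</p>")
```

Because the same `charset` value feeds both the header and the encoding step, the two cannot drift apart, which is what went wrong when the default silently became ISO-8859-1.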


posted at: 18:48 | path: /blosxom | permanent link to this entry

Rendering unicode


Some of my friends were working on a site and they were using utf-8 to write their html/js. The main page had a drop-down where you could switch between different languages:

English
Français
Español
Deutsch
日本語
中文
한국어

NOTE: I am assuming that your browser will render these last utf8 characters properly. At the time I was writing this, the http server that was sending this content to your browser was forcing this character set:

drio@simba:~/wwwroot $ curl -I http://blog.is04607.com
HTTP/1.1 302 Found
Date: Sun, 24 Jun 2007 18:45:10 GMT
Server: Apache/1.3.37 (Unix) mod_perl/1.29 PHP/4.3.11 mod_gzip/1.3.26.1a
Location: http://www.is04607.com/blog/blosxom.cgi
Content-Type: text/html; charset=iso-8859-1

I have to shoot an email to the sysadmin where I am hosting this so he can force utf8 on my virtual host.

That was exactly the same problem my friends had. Just by telling apache to use utf8 (or at least not to force iso-8859-1) things get fixed.
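
To see why the forced header matters, here is a minimal Python sketch (Python is just my choice for the illustration). It reads the charset out of a Content-Type value like the one curl printed and decodes the same utf8 bytes both ways. The helper `declared_charset` is a made-up name, and real browsers apply more rules than this when picking an encoding:

```python
from email.message import Message


def declared_charset(content_type_value):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type_value
    return msg.get_content_charset()


# One drop-down entry, encoded the way the page is stored on disk.
body = "日本語".encode("utf-8")

forced = declared_charset("text/html; charset=iso-8859-1")
wanted = declared_charset("text/html; charset=UTF-8")

# With the forced charset every byte becomes its own (wrong) character.
garbled = body.decode(forced)
# With the charset the bytes were written in, the text comes back intact.
correct = body.decode(wanted)

print(len(garbled), len(correct))
print(correct == "日本語")
```

Three characters turn into nine when the bytes are read as iso-8859-1, and that is the garbage you see in the drop-down.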

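By the way, do you know how many bits utf8 uses to encode this Japanese character? 24 bits, 3 bytes:

drio@simba:~/wwwroot $ cat test5.html
----日-----
drio@simba:~/wwwroot $ hexdump test5.html
0000000 2d2d 2d2d 97e6 2da5 2d2d 2d2d 000a
000000d

hexdump prints 16-bit words here, and on this little-endian machine each pair of bytes appears swapped. Read in file order, the bytes are 2d 2d 2d 2d e6 97 a5 2d ..., so 日 is 0xe6 0x97 0xa5 (0x2d is the dash). The file length of 13 bytes (0xd) agrees: 9 dashes, 3 bytes for 日 and a newline.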
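
A quick way to double-check the byte count is to ask Python directly. This is only a sketch and the variable names are mine. It also rebuilds the 16-bit words that plain hexdump prints, assuming a little-endian machine like the one the dump came from:

```python
import struct

# Same bytes as test5.html: dashes, the character, dashes, newline.
contents = "----日-----\n".encode("utf-8")

encoded = "日".encode("utf-8")
print(len(encoded), encoded.hex())

# Rebuild hexdump's default view: 16-bit little-endian words,
# padding the odd final byte with a zero.
padded = contents + b"\x00" * (len(contents) % 2)
words = struct.unpack("<%dH" % (len(padded) // 2), padded)
print(" ".join("%04x" % w for w in words))
```

The first line of output gives the length and the bytes in file order; the second reproduces the swapped pairs that make the dump easy to misread.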

I found this document highly useful for understanding what unicode is. I think it has already become a classic.

posted at: 13:11 | path: /programming | permanent link to this entry