Encoding problems between Perl and MySql V4.0

Filed under: Perl tips — vincent @ 22:01

MySQL v4.0 doesn’t handle encoding very well. Basically, its considering everything you send it as simple binary data. At least, it doesn’t corrupt it, but no conversion is made. I had to interact, from Perl, with a database which was populated by PHP (v4 also), and which had ISO-8859-1 data in it.

Following good practice, I choose to use Perl DBI infrastructure, to stay reasonably database independent. Here is the statements that I used to connect:

use DBI;
my $data_source="dbi:mysql:db_name:db_host";
my $dbh=DBI->connect($data_source,'user_name','user_passwd');

As you can see, there is no mention of any encoding. With a database understanding correctly encodings, it’s the job of the driver and the database server to communicate, and, if needed, convert encodings of text fields.

The DBI::DBD driver for MySQL under Perl, not knowing the encoding of the tables (since MySQL v4.0 doesn’t know it itself), saved all my data there in UTF-8 (generally a good default choice, in my opinion, but here the existing data was in ISO-8859-1).

My Perl program was consisting mainly of extraction of data from XML files (encoded in ISO-8859-15, seemingly just to have one more encoding in the game), and injection into MySQL.

The XML::DOM functions being well done, Perl is automagically converting this data when reading it into its internal encoding (called by Perl mongers ‘utf8′, and which is nearly the same as the standard ‘UTF-8′ encoding: see this thread). That’s what I like in a good parser: whatever the input encoding is, it gives you the same result. So you won’t get bitten the day when your partner changes its encoding, as long as his XML files are properly marked (and if they’re not, he’s in fault).

During injection, I had to check for existence of some data in the database, then simply delete and insert rows. The only DBI methods used were the next ones

$hash_ref = $dbh->selectall_hashref($sql, $key_field);
$rv  = $dbh->do($sql);
$hash_ref = $dbh->selectrow_hashref($sql);

That’s only three differents methods, but since they were used everywhere, I didn’t want to update my whole code, once the encoding problem surfaced. Therefore, I made a new object, having mostly the same methods as the $dbh from DBI, but converting SQL statements from utf8 to ISO-8859-1, before executing them in the server.

I should note that I chose not to derive from the $dbh object, because I wanted to be sure to catch (with a runtime error) any call not yet overwritten. Here is what it became:

use DBI;
use Encode;

my $data_source="dbi:mysql:db_name:db_host";
my $_dbh=DBI->connect($data_source,'user_name','user_passwd');

package MyDbh;
sub new {
	my $this=shift;
	my $class=ref($this)||$this;
	my $self={};
	my $dbh=shift;
	$self->{'dbh'}=$dbh;
	bless $self,$class;
	return $self;
}
sub selectrow_hashref {
	my ($this,$sql)=@_;
	return $this->{'dbh'}->selectrow_hashref(
		Encode::encode('ISO-8859-1',$sql));
}
sub selectall_hashref {
	my ($this,$sql,$key)=@_;
	return $this->{'dbh'}->selectall_hashref(
		Encode::encode('ISO-8859-1',$sql),$key);
}
sub do {
	my ($this,$sql)=@_;
	return $this->{'dbh'}->do(
		Encode::encode('ISO-8859-1',$sql));
}

# switch back to default package
package main;

my $dbh=new MyDbh($_dbh);

# from here, use $dbh normally:

my $sql=...; # SQL statement made from strings internally encoded in utf8
$hash_ref = $dbh->do($sql);
# and so on

For the record, here are the versions used:

$ mysql --version
mysql  Ver 12.22 Distrib 4.0.18, for suse-linux (i686)
$ perl --version
This is perl, v5.8.3 built for i586-linux-thread-multi
$ perl -MDBI -e 'print "$DBI::VERSION\n";'
1.41
$ perl -M'DBD::mysql' -e 'print "$DBD::mysql::VERSION\n";'
2.9003

No Comments »

No comments yet.

RSS feed for comments on this post.

Leave a comment

Line and paragraph breaks automatic, e-mail address never displayed, HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>

(required)

(required)