NOTE: This is very much a document in progress so please bare that in mind when reading it - I will be updating it.
There is an ongoing discussion on the dbi-dev mailing list about unicode support in DBI. I've been trying to collect information on DBD support for Unicode in the hope it will help show whether there are inconsistencies in DBDs and whether the Perl DBI could help defining Unicode support. Originally, I thought about doing a talk at the London Perl Workshop but I've been rather preoccupied with personal issues and didn't think I'd even make the LPW until early last week. What follows is a very cut down version of a large document I started writing to catalogue Unicode support in DBDs. I've not done some DBDs yet - notably DB2 (because I no longer have it, although castaway on #dbix-class ran some code for me) and Postgres (because I know people are currently working on Unicode support in DBD::Pg). If you have a DBD I've not covered I'd love to hear from you and especially if you can take one of my examples and change it to work for your DBD.
"Perl supports two kinds of strings: Unicode (utf8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding). Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used. Similarly, when fetching from the database character data that isn't iso-8859-1 the driver should convert it into utf8."
A brief mention in data_string_diff utility method.
In DBI::DBD a brief mention of unicode with respect to a new login method that I added (post 1.607) - more later.
Solution: Create a DBI attribute which can be used to specify the encoding the database expects and returns. Some scenarios which were discussed:
o encode the data sent to the database like "this" (which could be nothing)
o decode the data retrieved from the database like "this" (which could be nothing but if not nothing it could be using strict or loose for the UTF-8 and utf8 case)
o don't decode but use SvUTF8_on (a specific case since Perl uses that internally and a number of databases return UTF-8) one that seems to work but I worry about.
o do what the DBD thinks is best - whatever the behaviour is now?
Solution: Assuming all DBDs only do something special when a new attribute is defined (or their existing attribute) this should be ok.
The issue is for DBDs which do not know how the database data returned is encoded and need to be told what that encoding is - a flag is required for that - and vice versa.
Solution: a new attribute as described above.
Solution: a new attribute as described above which may be set in a way saying check the encoding or just set the flag.
Solution: increase dependency on Perl version or if this is unpopular we just cannot support strict encoding checking in older Perls.
Solution: make sure all DBDs are using dbd_st_prepare_sv
Solution: convert DBDs to use login_sv if they need to support Unicode in connect method.
Solution: examine the XS interface and DBI and change to use Perl scalars.
Solution: not worth considering until this is verified.
select * from table where column = :unicode_param_name
For DBD::ODBC it is because the parameter names are held in a char * and is arguably just a bug but it is rather more complicated to fix than you might think. I think other DBDs have a similar problem - they often parse the SQL looking for parameters and this is mostly done in C code.
Solution: Personally, I'm stating DBD::ODBC cannot do this and the effort involved in changing C code to parse UTF-8 SQL is too much to be worth it.
Solution: fix DBDs
There are some pure Perl ones but they don't seem to be used as much. The most popular use a combination of Perl and C (XS). Some of the Unicode problems arise from moving data from Perl to C and vice versa.
There are examples scattered throughout the rest of this DBD-specific discussion which are fairly easily changed to work with any DBD - mostly, it is just an attribute change. There are more examples in the DBD::SQLite section since I did this one first but I ran most of them on all the DBDs I tested. Bare in mind some of this was done on Windows in a terminal so some of the Unicode output cannot be relied on. You can find a more generic test case here in fulltest.pl.
Supports unicode data and collation.
However, you don't get it by default:
which produces:
$VAR1 = [
'Ôé¼'
];
Data selected does not match data inserted
SV = PV(0x2af9280) at 0x28d0558
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x2aab008 "\342\202\254"\0
CUR = 3
LEN = 8
UTF8 flag is Off
You need to enable the sqlite_unicode flag and do it as soon as possible like in the connect method. Once on it tags all data from the database with the UTF8 flag:
which produces:
$VAR1 = [
"\x{20ac}",
"\x{20ac}"
];
Euro in/out successfully
UTF8 flag on char data is On
UTF8 flag on binary data is On
Note: we used Unicode in the SQL successfully.
But what about that blob (binary large object) column; the second column above. It looks the same as the first column but it is supposed to be a blob so why is the UTF8 flag on and is that really what we want? Probably not.
To stop that happening you need to bind the parameter as a blob when you insert/update it.
Before the fix for rt 71311 (which I reported whilst researching this) you cannot bind a column on a select as a blob or it will not be returned.
which produces:
UTF8 flag on char is On
UTF8 flag lob is Off
out euro = in euro
and the out.png does match the in.png.
NOTE: Although you don't seem to be able to have unicode table names you can have Unicode column names but the metadata methods like column_info don't seem to support Unicode. As a result, if you create a table with a column of a€ but then call column_info specifying column a€ you get no columns back - I think this is a limitation in DBI via XS rather than DBD::SQLite:
produces:
All columns in table:$VAR1 = [
[
undef,
'main',
'test1',
'aÔé¼',
undef,
'varchar',
'50',
undef,
undef,
undef,
1,
undef,
undef,
undef,
undef,
undef,
1,
'YES'
]
];
Just the euro column:$VAR1 = [];
UPDATE: Reported unicode table names not decoded in table_info_all call as it appears DBD::SQLite simply fails to decode table names.
DBD::CSV also supports Unicode characters encoded as UTF-8 or in fact any other encoding you like. However, like DBD::SQLite, you need to tell DBD::CSV the encoding using f_encoding attribute e.g.,
Which outputs:
Euro in/out successfully
UTF8 flag on char data is On
UTF8 flag on binary data is On
Wide character in print at csv2.pl line 50.
$VAR1 = [
"\x{20ac}"
];
If you omit the f_encoding the file is opened with no layer so reading from it or writing to it will not provide the correct result and may end up in issuing warnings.
Oops. The character data is ok but when writing the blob to out.png we get a warning as the blob has been UTF-8 decoded into Perl. I could only get the blob to work by removing the f_encoding so it looks like DBD::CSV does not have a specific case for blobs when the encoding is set.
You can happily use Unicode in SQL select statements.
Note, you cannot use Unicode table or column names - all of these must be alphanumeric:
Bad table or column name: 'test1Ôé¼' has chars not alphanumeric or underscore! at I:/strawberry/perl/site/lib/SQL/Statement.pm line 88
DBD::mysql supports Unicode via the UTF-8 encoding but as the DBDs above you need to set an attribute (mysql_enable_utf8 => 1) during connect and you need to create your tables using:
CREATE TABLE test ( a VARCHAR(50) ) CHARACTER SET utf8;
You also need mysql > 4.004.
DBD::mysql does support table_info/column_info but they don't work when a unicode table/column name is specified mostly because I have been unable to create tables/column with unicode table/column names.
DBD::Unify supports the insertion of Unicode data but when it is pulled back out of the database it is UFT-8 encoded but DBD::Unify does not decode it for you.
UPDATE: Merijn has added much better Unicode support to Unify which is currently not released but available in the subversion trunk.
Does support Unicode data but until recently reports the size of a VARCHAR as 32768 and yet that is bytes so if you are inserting unicode data into a Firebird varchar you'll need to divide that by at least 4.
Why does this matter?
I only know from experience running the DBD::ODBC test suite with the Firebird ODBC driver. It reports 32767 as the max size of a varchar but when you attempt to create a column of that size it baulks.
Passing Unicode data back and forth in DBD::Oracle seems to work reliably so long as you ensure to set up your environment properly e.g., set NLS_LANG to AMERICAN_AMERICA.AL32UTF8.
DBD::Oracle attempts to just do the right thing. It should know how the data returned from Oracle is encoded and decode it properly. Whether it is for speed or just because it was done that way, DBD::Oracle mostly only sets the utf8 flag on data returned from Oracle which is UTF-8 encoded with SvUTF8_on (I say mostly as for Oracle error messages the encoding is checked if you have a new enough Perl). There is a RT from David Wheeler where he apparently got non-UTF-8 encoded data back from Oracle and DBD::Oracle set the utf8 flag which led to a later crash. I've never been able to reproduce this but it is the main reason why David would like an option to check the encoding.
Unicode can happily be used in SQL, table names and column names but some metadata methods cannot use Unicode as input parameters. There is also no problem mixing blobs and Unicode data in the same select.
Jonathan Leffler has reported that he has received some Unicode patches recently and applied one of them so far. The attributes added in the supplied patches use ix_ prefix and contain UTF-8 in some part of the remaining string. I don't have Informix and I don't have the patches so I've been unable to test DBD::Informix. Jonathan however said:
"So, DBD::Informix is endeavouring to move forward, knowing that the
underlying database layers (ESQL/C on the client, and the Informix data
server) handle UTF-8 OK, so any problems are in persuading Perl (and perhaps
DBI) to handle it appropriately too."
Ok, I know quite a bit about this DBD as I maintain it.
DBD::ODBC has some interesting problems as ODBC is an API to talk to databases and it is not database specific so you can just as easily use DBD::ODBC to connect to MS SQL Server, Oracle, DB2, Interbase etc (literally dozens - probably easily more than 50 ODBC drivers but I've not counted them).
Microsoft (TO_DO mostly but others involved TO_DO) wrote the ODBC API and passed it to X/Open but have continued developing it since. In ODBC 3.51 they introduced some Unicode support in the form of the so called "Wide" APIs. Basically, it differs from the other DBDs in that there are 2 APIs in ODBC (and you select the one you use at compile time) - the original ANSI API and the newer Wide API. The ODBC APIs are named SQLxxxA and SQLxxxW and when you write your application you use SQLxxx but set a macro which selects the SQLxxxA or SQLxxxW APIs.
This is where the hideous bit comes in. Between the ODBC enabled application and the ODBC driver is the ODBC Driver Manager. When you call SQLDriverConnect you name the DSN or DRIVER and the ODBC Driver manager loads the DLL for that driver and examines it's exports to see if it exports the SQLxxxW functions. If it does, and your application is calling the SQLxxxW APIs there is no problem - data is just passed through as it is. If the application was built without the unicode setting but the driver has the "Wide" API the driver manager maps all the application calls to SQLxxxW but converts the 8 bit data to UCS2 (wasting time). If the application uses the wide API but the driver does not the driver manager calls the SQLXXX functions in the driver (which are effectively SQLxxxA) but drops the the high byte in the UCS encoding! In reality this does not matter that much as at least on Windows the applications and ODBC drivers are aware of the change in ODBC and provide the wide APIs.
In UNIX this difference is larger because the ODBC API was written for Windows and UCS2 encoding. The most popular ODBC Driver Manager is unixODBC and it attempts to do what the MS ODBC Driver Manager does and this works Ok. However, for UNIX, UTF-8 encoding of data is more popular and in same cases storing unicode characters in wchar_t types in C is still used. There are 3 ODBC driver managers used in UNIX: unixODBC (which performs like the MS one), iODBC (which expects unicode data to be placed into wchar_t types in C and a commercial driver manager which attempts to send Unicode data back UTF-8 encoded. DBD::ODBC only supports the first and last of those.
If the ODBC Driver returns data UTF-8 encoded you need to set the odbc_utf8_on flag but I've never actually seen a driver myself which does this (though the person who supplied the patch says an offshoot of Postgres does this) and there are a few commercial drivers as well.
If the driver works like drivers on MS does and you are using unixODBC there is no problem; DBD::ODBC converts all data sent to the database to UCS2 and calls the wide APIs and all data coming back is decoded to UTF-8 for Perl.
If you are using the iODBC driver manager and you build it with Unicode support you are on your own as DBD::ODBC would have to support wchar_t as well and I've not got the time or inclination to support that.
DBD::ODBC Unicode support is:
o Unicode SQL from 1.16_2.
o Unicode in connection strings so for instance you can have Unicode database usernames from 1.16_2 and DBI post 1.607.
o Unicode column and table names BUT you cannot pass Unicode schema/table/column names to some metadata functions as the XS uses char * and not Perl scalars.
o bound columns and parameters may be Unicode from 1.15.
You cannot use Unicode in:
o some metadata calls - see above.
o parameter names although the ? is more usual in ODBC anyway.
o calls to prepare if you set the odbc_execdirect attribute
Demonstration that DBD::ODBC (and perhaps DBI itself) cannot handle Unicode in metadata calls through XS:
outputs:
$VAR1 = [];
$VAR1 = [
'"master"."dbo"."cb"',
'"master"."dbo"."cb1"',
"\"master\".\"dbo\".\"fred\x{100}\"",
];
UPDATE: In DBD::ODBC 1.32_3 I added support for unicode in catalog/schema/table names in the table_info method.
UPDATE: In DBD::ODBC 1.32_4 I changed DBD::ODBC to call SQLGetTypeInfoW in a unicode build which should resolve inconsisitencies between type_info_all and {TYPE} which comes from SQLDesribeColW.
Results are inconclusive right now. I don't have DB2 but Jess Robinson (thanks) ran my test code with DBD::DB2 1.81 against 08.02.0003 DB2. Pretty much all of the unicode tests failed. I really need to get access to someone who has a recent version of DBD::DB2 and DB2 and can work on this to get reliable results.
Recent comments
31 weeks 4 days ago
33 weeks 6 days ago
35 weeks 4 days ago
35 weeks 5 days ago
43 weeks 6 days ago
44 weeks 6 days ago
46 weeks 2 days ago
49 weeks 21 hours ago
1 year 1 week ago
1 year 4 weeks ago