Proposed change in behavior to Unicode support in DBD::ODBC

This all started with a Stack Overflow question, "Automatic character encoding handling in Perl / DBI / DBD::ODBC". It then led to "DBD::ODBC doesn't handle windows-1252 characters" and lastly to a small discussion on dbi-dev, "Decoding data from the database in DBD".

I am proposing a change to how DBD::ODBC supports Unicode data. It is too complicated to explain all the mechanisms of Unicode support in the ODBC API, so I've simplified things here down to the change I propose making.

The ODBC API as handed over to X/Open did not support Unicode or the so-called wide APIs that Microsoft added afterwards. Microsoft added a SQL_WCHAR type which requests that any bound data retrieved is returned as Unicode encoded in UCS-2 (yes, I know this does not fully support all of Unicode). DBD::ODBC was changed years ago so it could be built to use the wide ODBC API (SQLPrepareW etc) and SQL_WCHAR, and generally it has worked OK.

However, how DBD::ODBC binds returned column data has so far been determined by what the ODBC driver reports the column type as. If the ODBC driver says the column is an SQL_CHAR (varchar, longvarchar etc) it is bound as SQL_CHAR, and if the driver says the column is SQL_WCHAR (mostly nvarchar etc) it is bound as SQL_WCHAR. In the latter case DBD::ODBC converts the UCS-2 encoded data returned from the driver and you get Unicode characters in your Perl script. In the former case the data is returned in whatever charset the database, session or column is defined as, and it will be 8-bit data; it is this case I'd like to change.

The problem is that if the column (e.g., in MS SQL Server) is a varchar defined as windows-1252 (for instance), you'll get windows-1252 data back and be forced to use Encode::decode on all character columns to get the data into your Perl script in a usable form. This is inconvenient to say the least, although you can work around it by binding all columns as SQL_WCHAR (which is also inconvenient).
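To illustrate the manual decoding this forces on you today, here is a minimal sketch. It uses a literal byte string as a stand-in for what a driver might hand back from a windows-1252 varchar column; the variable names are made up for this example and no database is involved.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for bytes fetched from a windows-1252 varchar column:
# byte 0x80 is the euro sign in windows-1252, followed by ASCII "100".
my $raw_bytes = "\x80" . "100";

# Until it is decoded, Perl only sees 4 bytes, and byte 0x80 is not a euro sign.
# decode() turns those bytes into Perl Unicode characters.
my $text = decode('cp1252', $raw_bytes);

printf "length=%d first=U+%04X\n", length($text), ord(substr($text, 0, 1));
```

Every character column fetched this way would need the same decode step, and the script has to know which code page the column uses.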

The proposed change will bind all character columns as SQL_WCHAR (in the Unicode build of DBD::ODBC), so the data retrieved in your Perl script will be Unicode and you can encode it before output in any way you like. However, if you already knew your data was coming back as native 8-bit characters in some codepage and you are already decoding it, then when you upgrade to the next DBD::ODBC your data will be incorrect. For this reason I propose adding a connection attribute which returns DBD::ODBC to the previous behavior. I could have made the default the other way around, but a) this way around seems to make more sense and b) I seriously doubt many (if any) people were really decoding data from the database themselves.
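As a rough sketch of what "encode it before output" means once the data arrives as Unicode characters, the snippet below uses a literal string as a stand-in for a value fetched from a column bound as SQL_WCHAR. The variable names are hypothetical and nothing here touches DBD::ODBC itself.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for a value already delivered as Perl Unicode
# characters: a euro sign (U+20AC) followed by "100".
my $text = "\x{20AC}100";

# The same characters can be encoded for whatever the output needs.
my $utf8_bytes   = encode('UTF-8',  $text);   # euro sign becomes 3 bytes
my $cp1252_bytes = encode('cp1252', $text);   # euro sign becomes 1 byte (0x80)

printf "utf8=%d bytes, cp1252=%d bytes\n",
    length($utf8_bytes), length($cp1252_bytes);
```

Code that still runs Encode::decode on such values after the upgrade would be decoding data that is already decoded, which is the breakage the connection attribute is meant to guard against.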

Now is your chance to tell me otherwise. I'd also be happy to receive any other comments on this proposed change.

Comments

++

I'm a fan. Bring on the future.