Problems with Perl JSON::XS, incremental parsing and Unicode strings

Today's Unicode problems are with JSON::XS. We use JSON extensively in the project I am working on, and we have a lot of Unicode strings. JSON::XS really is the best and fastest Perl JSON parser (I've tried them all), and Marc Lehmann, who wrote it, is very knowledgeable and usually quick to respond to any query.

Until recently we decoded each JSON text in one go, but we now need to decode incrementally because we cannot guarantee that all the JSON text will arrive at once. No problem, JSON::XS has an incremental parser :-) Except that we mysteriously get the error "incr_text can not be called when the incremental parser already started parsing". Here is my test code:

use Data::Dumper;
use JSON::XS;
my $data = ["\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}",
            "\x{6c60}\x{306e}\x{30ab}\x{30a8}\x{30eb}"];
# the following works:
#my $data = ["fred",
#            "blog"];
# encode with one JSON::XS object ...
my $j = JSON::XS->new;
my $js = $j->encode($data);
$j = undef;
print "encoded: $js\n";
# ... and decode incrementally with a fresh one
$j = JSON::XS->new;
my $object = $j->incr_parse($js);
die "no object" if !$object;
print Dumper($object);
eval {
    print $j->incr_text;
};
print "Why do we get this error - $@" if $@;

As the comments show, if I change the data so that it contains no Unicode strings, it works fine. According to the JSON::XS POD, it is only legitimate to call incr_text after some JSON has been decoded into an object, and in this case it has.
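For reference, here is a minimal sketch of the chunk-by-chunk pattern we are aiming for. It uses plain ASCII data, so it does not hit the problem described here. The helper name first_object_from_chunks is my own invention for illustration and is not part of JSON::XS; only new, incr_parse and incr_text are real JSON::XS methods.

```perl
use strict;
use warnings;
use JSON::XS;

# Hypothetical helper: feed chunks to the incremental parser one at a
# time and return the first complete object (array or hash reference).
sub first_object_from_chunks {
    my ($json, @chunks) = @_;
    for my $chunk (@chunks) {
        # In scalar context incr_parse appends the chunk and returns an
        # object once one is complete, otherwise undef.
        my $object = $json->incr_parse($chunk);
        return $object if $object;
    }
    return;    # ran out of chunks before the JSON text was complete
}

my $json   = JSON::XS->new;
my $object = first_object_from_chunks($json, '["fred",', '"blog"]');
print join(", ", @$object), "\n";    # fred, blog

# An object has just been returned, so asking for the unparsed
# remainder should be legitimate at this point.
my $rest = $json->incr_text;
```

The helper stops at the first object on purpose: calling incr_text while the parser is part way through the next object is exactly the situation the error message is meant to catch.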

Just out of interest, I tried the same code with the pure-Perl JSON parser JSON::PP and it worked fine, although I believe I might have found another issue with JSON::PP (Possible inconsistencies using incr_text and incr_parsing).

After a lot of debugging I finally discovered that incr_text checks an internal variable which holds how many characters are left unparsed, and for Unicode strings the calculation seems to be wrong. I've mailed Marc with my findings, but for now I have made a small change to JSON::XS's XS.xs code to work around this:

static SV *
decode_json (SV *string, JSON *json, STRLEN *offset_return)
{
  .
  .
  .
      offset = dec.json.flags & F_UTF8
               ? dec.cur - SvPVX (string)
               : utf8_distance (dec.cur, SvPVX (string));

utf8_distance returns the difference between dec.cur and SvPVX (string) in characters, not bytes, but the offset to return needs to be in bytes. For my example the offset returned is 17, as there are 17 characters in total, but they occupy 37 bytes. If I change the above to:

      offset = dec.json.flags & F_UTF8
               ? dec.cur - SvPVX (string)
               : dec.cur - SvPVX (string);

incr_text works again. Looking forward to hearing Marc's comments.
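The 17 versus 37 mismatch is easy to check from Perl itself. The sketch below writes out by hand the JSON text that the test data encodes to and compares its length in characters with its length once encoded as UTF-8. The helper name char_and_byte_lengths is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper: return (characters, UTF-8 bytes) for a Perl
# character string.
sub char_and_byte_lengths {
    my ($text) = @_;
    return (length($text), length(encode_utf8($text)));
}

# The JSON text for the two test strings, with no extra whitespace.
my $js = qq(["\x{53f0}\x{6240}\x{306e}\x{6d41}\x{3057}","\x{6c60}\x{306e}\x{30ab}\x{30a8}\x{30eb}"]);

my ($chars, $bytes) = char_and_byte_lengths($js);
print "characters: $chars, bytes: $bytes\n";    # characters: 17, bytes: 37
```

Each of the ten Japanese characters takes three bytes in UTF-8, and the seven brackets, quotes and comma take one each, which gives 10 * 3 + 7 = 37.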

Comments

JSON::XS 2.24 fix

As I expected, Marc has released a new JSON::XS with a proper fix for this now - see JSON::XS 2.24. Thanks Marc.