Pasky’s Log

Conversion from mixed UTF8 / legacy encoding data to UTF8

September 23rd, 2012 No comments

For about 13 years now, I’m running the Muaddib IRC bot that serves a range of Czech channels. Its features varied historically, but the main one is providing conversational AI services (it learns from people talking to him and replies back based on the learnt stuff). It runs the Megahal Markov chain algorithm, using the Hailo implementation right now.

Sometimes, I need to reset its brain. Most commonly when the server happens to hit a disk full situation, something no Megahal implementation seems to be able to deal with gracefully. :-) (Hailo is SQLite-based.) Thankfully, it’s a simple sed job with all the IRC logs archived. However, Muaddib always had trouble with non-ASCII data, mixing a variety of encodings and liking to produce a gibberish result.

So, historically, people used to talk to Muaddib using ISO-8859-2 and UTF8 encodings and now I had mixed ISO-8859-2/UTF8 lines and I wanted to convert them all to UTF8. Curiously, I have not been able to quickly Google out a solution and had to hack together my own (and, well, dealing with Unicod ein Perl is never something that goes quickly). For the benefit of fellow Google wanderers, here is my take:

perl -MEncode -ple 'BEGIN { binmode STDOUT, ":utf8"; }
  $_ = decode("UTF-8", $_, sub { decode("iso-8859-2", chr(shift)) });'

It relies on the Encode::decode() ability to specify a custom conversion failure handler (and the fact that Latin2 character sequences that are also valid UTF-8 sequences are fairly rare). Note that Encode 2.35 (found in Debian squeeze) is broken and while it documents this feature, it doesn’t work. Encode 2.42_01 in Debian wheezy or latest CPAN version (use perl -MCPAN -e 'install Encode' to upgrade) works fine.

Categories: linux, software Tags: Encode, Hailo, irc, Megahal, Muaddib, perl, unicode, utf8

Perl and UTF8

June 24th, 2012 1 comment

I love Perl and it’s my language of choice for much of the software I write (between shell at one extreme and C at the other). However, there is one thing Perl really sucks at – Unicode and UTF8 encoding support. It is not that the features aren’t there, but that getting it to work is so tricky. It is so much tricks to remember already that I started writing them down:

http://brmlab.cz/user/pasky/perl-utf8

It’s a wiki, anyone is welcome to contribute. :-)

Categories: linux, software Tags: perl, unicode, utf8, wiki

Archive

Conversion from mixed UTF8 / legacy encoding data to UTF8

Perl and UTF8

Recent Comments

Categories

Blogroll

Licence