Archive

Posts Tagged ‘perl’

speedread – A simple terminal-based open source Spritz-alike.

March 2nd, 2014 1 comment

A few hours ago, I have read about Spritz, a new speed-reading app, and I was quite impressed by the idea. I know the underlying concept is not new and I didn’t even try out Spritz itself (they announce their idea but not release the software, huh?), but this was the first time I have heard about it and I really liked it!

So I decided to implement my own terminal version of this idea, acting as a regular command-line filter. Find the new tool speedread at:

https://github.com/pasky/speedread

(Yes, it may not work well for beletry. Or slides. Yes, it may not work well for emails that you want to just skim for keywords. But then there’s the other 80% of text I need to chew through that does not fall in either category. I’ll have to continue trying it out for longer but it might be really useful.)

Meanwhile, I have also learned about OpenSpritz, a web-based implementation of Spritz. Can be a good match for speedread!

Categories: software Tags: , , , , ,

Conversion from mixed UTF8 / legacy encoding data to UTF8

September 23rd, 2012 No comments

For about 13 years now, I’m running the Muaddib IRC bot that serves a range of Czech channels. Its features varied historically, but the main one is providing conversational AI services (it learns from people talking to him and replies back based on the learnt stuff). It runs the Megahal Markov chain algorithm, using the Hailo implementation right now.

Sometimes, I need to reset its brain. Most commonly when the server happens to hit a disk full situation, something no Megahal implementation seems to be able to deal with gracefully. :-) (Hailo is SQLite-based.) Thankfully, it’s a simple sed job with all the IRC logs archived. However, Muaddib always had trouble with non-ASCII data, mixing a variety of encodings and liking to produce a gibberish result.

So, historically, people used to talk to Muaddib using ISO-8859-2 and UTF8 encodings and now I had mixed ISO-8859-2/UTF8 lines and I wanted to convert them all to UTF8. Curiously, I have not been able to quickly Google out a solution and had to hack together my own (and, well, dealing with Unicod ein Perl is never something that goes quickly). For the benefit of fellow Google wanderers, here is my take:

perl -MEncode -ple 'BEGIN { binmode STDOUT, ":utf8"; }
  $_ = decode("UTF-8", $_, sub { decode("iso-8859-2", chr(shift)) });'

It relies on the Encode::decode() ability to specify a custom conversion failure handler (and the fact that Latin2 character sequences that are also valid UTF-8 sequences are fairly rare). Note that Encode 2.35 (found in Debian squeeze) is broken and while it documents this feature, it doesn’t work. Encode 2.42_01 in Debian wheezy or latest CPAN version (use perl -MCPAN -e 'install Encode' to upgrade) works fine.

Perl and UTF8

June 24th, 2012 1 comment

I love Perl and it’s my language of choice for much of the software I write (between shell at one extreme and C at the other). However, there is one thing Perl really sucks at – Unicode and UTF8 encoding support. It is not that the features aren’t there, but that getting it to work is so tricky. It is so much tricks to remember already that I started writing them down:

http://brmlab.cz/user/pasky/perl-utf8

It’s a wiki, anyone is welcome to contribute. :-)

Categories: linux, software Tags: , , ,

Realtime Signal Analysis in Perl

September 24th, 2011 No comments