Posts Tagged ‘perl’

speedread – A simple terminal-based open source Spritz-alike.

March 2nd, 2014 1 comment

A few hours ago, I have read about Spritz, a new speed-reading app, and I was quite impressed by the idea. I know the underlying concept is not new and I didn’t even try out Spritz itself (they announce their idea but not release the software, huh?), but this was the first time I have heard about it and I really liked it!

So I decided to implement my own terminal version of this idea, acting as a regular command-line filter. Find the new tool speedread at:

(Yes, it may not work well for beletry. Or slides. Yes, it may not work well for emails that you want to just skim for keywords. But then there’s the other 80% of text I need to chew through that does not fall in either category. I’ll have to continue trying it out for longer but it might be really useful.)

Meanwhile, I have also learned about OpenSpritz, a web-based implementation of Spritz. Can be a good match for speedread!

Categories: software Tags: , , , , ,

Conversion from mixed UTF8 / legacy encoding data to UTF8

September 23rd, 2012 No comments

For about 13 years now, I’m running the Muaddib IRC bot that serves a range of Czech channels. Its features varied historically, but the main one is providing conversational AI services (it learns from people talking to him and replies back based on the learnt stuff). It runs the Megahal Markov chain algorithm, using the Hailo implementation right now.

Sometimes, I need to reset its brain. Most commonly when the server happens to hit a disk full situation, something no Megahal implementation seems to be able to deal with gracefully. :-) (Hailo is SQLite-based.) Thankfully, it’s a simple sed job with all the IRC logs archived. However, Muaddib always had trouble with non-ASCII data, mixing a variety of encodings and liking to produce a gibberish result.

So, historically, people used to talk to Muaddib using ISO-8859-2 and UTF8 encodings and now I had mixed ISO-8859-2/UTF8 lines and I wanted to convert them all to UTF8. Curiously, I have not been able to quickly Google out a solution and had to hack together my own (and, well, dealing with Unicod ein Perl is never something that goes quickly). For the benefit of fellow Google wanderers, here is my take:

perl -MEncode -ple 'BEGIN { binmode STDOUT, ":utf8"; }
  $_ = decode("UTF-8", $_, sub { decode("iso-8859-2", chr(shift)) });'

It relies on the Encode::decode() ability to specify a custom conversion failure handler (and the fact that Latin2 character sequences that are also valid UTF-8 sequences are fairly rare). Note that Encode 2.35 (found in Debian squeeze) is broken and while it documents this feature, it doesn’t work. Encode 2.42_01 in Debian wheezy or latest CPAN version (use perl -MCPAN -e 'install Encode' to upgrade) works fine.

Perl and UTF8

June 24th, 2012 1 comment

I love Perl and it’s my language of choice for much of the software I write (between shell at one extreme and C at the other). However, there is one thing Perl really sucks at – Unicode and UTF8 encoding support. It is not that the features aren’t there, but that getting it to work is so tricky. It is so much tricks to remember already that I started writing them down:

It’s a wiki, anyone is welcome to contribute. :-)

Categories: linux, software Tags: , , ,

Realtime Signal Analysis in Perl

September 24th, 2011 No comments

About a month ago, we were working on the Fluffy Ball project – a computer input device that can react to fondling and punching. Thanks to a nice idea on the brmlab mailing list, we use a microphone and process the noise coming from the ball’s scratchy stuffing and an embedded jingle. The sounds from the outside are almost entirely dampened by the stuffing and for a human, the noise of fondling and punching is easily distinguishable.

Frequency spectrum, for our purposes, is just an array indexed by frequency, storing the amplitude of each frequency (in some range). A common variation is the power spectrum that describes the power of each frequency, i.e. the amplitude squared. Frequency spectrum is obtained by splitting the input signal to fixed-size samples and performing Discrete-Time Fourier Transform.

It turns out that trivial spectrum-based rules can be used to achieve reasonably high detection accuracy for a computer too (especially when the user is allowed to “train” her input based on feedback); I had big plans to use ANN and all the nifty things I have learned in our AI classes, but it turned out to be simply an overkill. The input signal is transformed to a frequency spectrum (see box) using real discrete FFT.

So, we have the audio signal coming in from a regular mic device and need to process it further. I chose Perl for quick prototyping and I have assumed that I would find some pre-made scaffolding for this ready. But it turns out that noone really published a simple example of even just showing a real-time frequency spectrum. So, here you go! :-)

First, we need some reasonable way to continuously display the spectrum. Most GUI paradigms are event-driven, but events are usually user interaction pieces and while it would be possible to incorporate continuous data-based updates in this model, it feels quite backwards. So we use a trick:

use warnings;
use strict;
init_dsp(); init_fft();
use Tk;
our $mw = MainWindow->new;
$mw->after(1, \&ticks); # after 1ms, give control back
sub ticks {
	while (1) {

This circumvents the event-driven architecture of Tk and instead puts our main loop in control, processing any GUI events when it’s good time. For more complex programs, this is a bad idea and it will lead to poorly maintaineable code, but when writing simple tools, you should not succumb to grand frameworks and let your code overgrow you.

Okay, how to grab audio input signal in Perl? Unfortunately, there are not really any handy modules you could use thoughtlessly. Audio::DSP is a possibility, but using it is clumsy, especially in the current world of ALSA as you have to rely on the imperfect aoss wrapper. A simple alternative is to get the raw byte data through a pipeline from the ALSA arecord tool:

our ($devname, $fmt, $bitrate, $wps, $bps, $bufsize, $dsp);
	$devname = "default"; # or e.g. hw:1,0 for an additional USB soundcard input
	$fmt = 16;            # sample format (bits per sample)
	$bitrate = 16384;     # sample rate (number of samples per second)
	$wps = 8;             # FFT windows per second (rate of FFT updates)
	$bps = ($fmt * $bitrate) / 8; # bytes per second
	$bufsize = $bps / $wps;   # window buffer size in bytes
sub init_dsp {
	open ($dsp, '-|', 'arecord', '-D', $devname, '-t', 'raw',
		'-r', $bitrate, '-f', 'S'.$fmt) or die "arecord: $!";
	use IO::Handle;
sub read_dsp {
	my $w;
	read $dsp, $w, $bufsize or die "read: $!";
	return $w;

read_dsp will return one signal window per call, the window being a binary blob consisting of one two-byte word per sample. We want to magically convert this to a spectrogram.

Audio::Analyze is again the simple way to get a signal spectrum. If you are after analyzing a pure audio signal, you probably want to use it since it can easily filter the signal based on relative human perception of frequencies etc. But for us, it is inconvenient to feed it data through a pipe and we will directly use Math::FFT. It will still handle all the gory math for our case (and we care about the actual noise, not the way people would hear it).

use Math::FFT;
use List::Util qw(sum);
our @freqs;
sub init_fft {
	my $dft_size = $bitrate / $wps;
	for (my $i = 0; $i < $dft_size / 2; $i++) {
		$freqs[$i] = $i / $dft_size * $bitrate;
sub process_signal {
	my ($bytes) = @_;
	# Convert raw bytes to a list of numerical values.
	$fmt == 16 or die "unsupported $fmt bits per sample\n";
	my @samples;
	while (length($bytes) > 0) {
		my $sample = unpack('s<', substr($bytes, 0, 2, ''));
		push(@samples, $sample);
	# Perform RDFT
	my $fft = Math::FFT->new(\@samples);
	my $coeff = $fft->rdft;
	# The output are complex numbers describing the exactly phased
	# sin/cos waves. By taking an abs value of the complex numbers,
	# we just measure the amplitude of a wave for each frequency.
	my @mag;
	$mag[0] = sqrt($coeff->[0]**2);
	for (my $k = 1; $k < @$coeff / 2; $k++) {
		$mag[$k] = sqrt(($coeff->[$k * 2] ** 2)
		                + ($coeff->[$k * 2 + 1] ** 2));
	# Rescale to 0..1. Many fancy strategies are possible, this is
	# extremely silly.
	my $avgmag = sum (@mag) / @mag;
	@mag = map { $_ / $avgmag * 0.3 } @mag;
	return @mag;

Not much to add besides the inline comments. The input of the process_signal function is a raw byte stream, the output is a list of amplitudes; @freqs maps the list indices to the actual Hz frequencies. The normalization to [0,1] interval shown here (pitching the mean at 0.3) is extremely naive, again there are many possible strategies. Also, you certainly want to use a window function etc. in more serious applications.

Now, for the visualization. We have chosen Tk for our GUI (it looks ugly, but it is reasonably easy to use despite its Tcl antics). We will use its Canvas object where we can draw freely, and just plot a line for each frequency:

our $canvas;
sub render_signal {
	# Display parameters, tweak to taste:
	my $rows = 2;
	my $hspace = 20;
	my $height = 150;
	my $vspace = 20;
	my @spectrum = @_;
	my $row_freqn = @spectrum / $rows;
	unless ($canvas) {
		$canvas = $mw->Canvas(
			-width => $row_freqn + $hspace * 2,
			-height => $height * $rows + $vspace * ($rows + 1));
	for my $y (0..($rows-1)) {
		for my $x (0..($row_freqn-1)) {
			my $hb = ($height + $vspace) * ($y + 1);
			my $i = $row_freqn * $y + $x;
			# Draw line:
			my $ampl = $spectrum[$i];
			$ampl <= 1.0 or $ampl = 1.0;
			my $bar = $height * $ampl;
			$canvas->createLine($x + $hspace, $hb,
			                    $x + $hspace, $hb - $bar);
			# Draw label:
			if (!($x % ($row_freqn/4))) {
				$canvas->createLine($x + $hspace, $hb + 0,
				                    $x + $hspace, $hb + 5,
				                    -fill => 'blue');
				$canvas->createText($x + $hspace, $hb + 15,
				                    -fill => 'blue',
				                    -font => 'small',
				                    -text => $freqs[$i]);

This suffices for a naive visualization, you can easily tweak it to do thresholding and whatever else you desire. I have found that on some of my computers, the X protocol is pushed to its limits by repeatedly drawing a large amount of lines, and sometimes the spectrum will start to lag behind the signal; either show wider bars averaging together multiple frequencies, or use something other than a Canvas object – raw pixmap transfer would likely be better than such a large amount of line drawing operations.

For a serious signal analysis work, you will also want a spectrogram – a time-based plot of amplitude of various frequencies.

To get a working script skeleton, simply piece the code snippets together ( See for the real fluffy ball script. It is much uglier, but it maintains sample averages over longer time windows (essential for more complex signal analysis), it has simple sample recording capabilities, and an example of naive threshold-based classifier.

Categories: software Tags: , , , ,

brmd: A Case for POE

May 26th, 2011 No comments

In brmlab, we want to track who is unlocking the space, whether someone is inside, have some good visual indicator that live stream is on air, and so on. In other words, we have an Arduino with some further hardware, and we want to show whatever is reported by the Arduino on IRC and web, and provide some web-based control (open/closed status override) in the opposite direction too.

What to use for a service (we call it brmd) that will bind all these interfaces together? It just needs a lot of boring frontends and simple state maintenance. It turns out that Perl’s POE framework is ideal for this – most of the code for IRC, HTTP and device read/write is already there, so you just grab the modules, slam them together and you have exactly what you need with minimal effort. Right?

It turns out that there are caveats – basically, the idea is correct, aside of getting stuck on a single stupidity of mine, I’d have the whole thing put together in something like half an hour. Unfortunately, the moment you want robustness too, things are getting a lot more complex; to handle the device disappearing, IRC disconnections, not having HTTP socket fds leak away, etc., you suddenly need to either magically know what further modules to hook up or start exeting some manual effort. Still, I like how POE is making it so easy to give a simple state machine many input/output interfaces and when you get used to the idiosyncracies, you can even make it somewhat reliable.

Example POE code

While this task seems to be ideal fit for POE, I’ve found surprisingly few examples of more complex POE component interaction on the web. Therefore, I’m going to lay out at least tersed up version of brmd below to help fellow future googlers. Nothing of this is anything ground-breaking, but it should help a lot to have a template to go by. Our real version is somewhat more verbose and includes some more interfaces: brmdoor.git:brmd/

I assume that you already know what “POE kernel” and “POE session” is. Beware that I’m a POE beginner myself, and I haven’t gone through much effort to clean the code up the way I would if I were to work together with someone else on this. Some things surely aren’t solved optimally and you might even pick up a bad habit or two.

In order to have some neat separation, we will divide brmd to several components where each will take care of a single interface; the main package will only spawn them up and do some most basic coordination. If we were to grow much larger, it would be worth the effort to even set up some kind of message bus (I wish POE would provide that implicitly), here we just directly signal components needing the info.

Read more…

Categories: software Tags: , , ,

Sticker Generator

February 19th, 2011 3 comments

At KAM MFF CUNI, we have about ~60 to ~70 computers scattered over the department, be them at desks of professors, older ones stocked in the basements or near-identical rack servers. Especially when older machines are shuffled around, it is difficult to track them, and the standard tiny inventory stickers keep falling off (and you have to keep looking up the number).

Therefore, we are making stickers to put on all our computers with some details about them. I used that as a pretense for learning some basic Cairo and I really like the paradigm (though I wish text manipulation was be easier). In case anyone would find this ever useful (or just a good scaffolding for their own Cairo Perl script), the script goes here; the default sticker size is to have height of a 5.25″ drive bay and width so that two rows fit nicely on A4.

# (c) Petr Baudis  2011  MIT licence
# Generate PDF with stickers.
# Expects tab-delimited columns on stdin:
# hostname mac invnumber buydate
use strict;
use warnings;
# -- User configuration --
our @paper = (598, 842); # A4, 72dpi
our @margin = (72, 72);
our @sticker = (227, 116);
our $filename = 'nalepky.pdf';
our $fontface = 'Arial';
our $ffontface = 'Courier New';
# large, normal, small
our @fontsize = (26, 15, 10);
our @linespacing = (18, 4, 18);
our $topmargin = 2;
our $contact = '<contact information>';
# -- User configuration end --
use lib 'perl';
use List::Util qw(min);
use List::MoreUtils qw(pairwise);
use Cairo;
our $surface = Cairo::PdfSurface->create ($filename, @paper);
# Effective surface area
our @surfsize = pairwise { $a - $b * 2 } @paper, @margin;
# Grid layout on effective surface
our @grid = pairwise { int($a / $b) } @surfsize, @sticker;
# Grid surface area
our @gridsurfsize = pairwise { $a * $b } @grid, @sticker;
# Start of grid surface so that it is centered on the paper
our @gridsurfstart = pairwise { ($a - $b) / 2 } @paper, @gridsurfsize;
# Produce a context for single sticker, starting at coordinates [0,0]
sub nalepka_cr {
        my ($surface, $cell) = @_;
        my @startM = pairwise { $a * $b } @sticker, @$cell;
        my @start = pairwise { $a + $b } @gridsurfstart, @startM;
        my $cr = Cairo::Context->create($surface);
        $cr->set_source_rgb(0, 0, 0);
# Centered text with top border at $$y. $size is index in font config above.
sub nalepka_text {
        my ($cr, $y, $face, $slant, $weight, $size, $text) = @_;
        $$y += $linespacing[$size] / 2;
        $cr->select_font_face($face, $slant, $weight);
        my $textents = $cr->text_extents($text);
        my $fextents = $cr->font_extents();
        $$y += $fextents->{height};
        $cr->move_to(($sticker[0] - $textents->{width}) / 2, $$y);
        $$y += $linespacing[$size] / 2;
sub nalepka {
        my ($cr, $host, $mac, $inv, $since) = @_;
        $cr->rectangle(0, 0, @sticker);
        my $invs = $since ? "$inv      $since" : $inv;
        my $ypos = $topmargin - $linespacing[0] / 2;
        nalepka_text($cr, \$ypos, $fontface, 'normal', 'bold', 0, $host);
        nalepka_text($cr, \$ypos, $ffontface, 'normal', 'normal', 1, $mac);
        nalepka_text($cr, \$ypos, $fontface, 'normal', 'normal', 1, $invs);
        nalepka_text($cr, \$ypos, $fontface, 'italic', 'normal', 2, $contact);
my @table;
while (<>) {
 my @col = split(/\t/);
 push @table, \@col;
my ($x, $y) = (0, 0);
for my $c (@table) {
        nalepka(nalepka_cr($surface, [$x, $y]), @$c);
        if ($y >= $grid[1]) {
                $y = 0; $x++;
                if ($x >= $grid[0]) {
                        my $cr = Cairo::Context->create($surface);
                        ($x, $y) = (0, 0);
Categories: software Tags: , ,

memtester and Virtual->Physical Address Translation

May 13th, 2010 5 comments server started having trouble with randomly corrupted repositories a while ago; short-time memtests were showing nothing and I was reluctant to take it offline for many days. So I found out the neat memtester tool and fired it up.

Sure enough, in some time, a bitflip error popped up – several times on the same memory offset:

Block Sequential : testing 12FAILURE: 0xc0c0c0c0c0c0c0c != 0xc0c0c0c0c0c0c0e at offset 0x0739610a.

Okay! That might hint on a bad memory cell in a DIMM. But which DIMM? Weelll…

We have to figure out which virtual address does the offset correspond with. Then, we have to figure out which physical address would that be. Finally, we have to guess the appropriate DIMM. We will make a lot of assumptions along the way – generally, the mapping can change all the time, pages may be swapped out, etc. – but memtester keeps the single mmap()ed region mlock()ed all the time, the architecture is regular i7, etc. And we don’t have to be 100% sure about the result.

First, keep the memtester running, do not restart it! Let’s assume its pid is 25773. First, we need to look at its memory maps:

# cat /proc/25773/maps 
00400000-00403000 r-xp 00000000 09:00 19511804                    /usr/bin/memtester
00602000-00603000 rw-p 00002000 09:00 19511804                   /usr/bin/memtester
7fea279c9000-7fea279ca000 rw-p 7fea279c9000 00:00 0 
7fea279ca000-7feb601ca000 rw-p 7fea279ca000 00:00 0 
7feb601ca000-7feb60314000 r-xp 00000000 09:00 28713328    /lib/

We can see pretty much immediately that the 7fea279ca000-7feb601ca000 map is the memory region of the main testing buffer – it’s just huge!

# echo $((0x7fea279ca000-0x7feb601ca000))

Splendid – 5GiB, just as much as we told memtester to check. Now, what is the virtual memory address of the fault? memtester grabs the buffer, splits it in two halves, then fills both halves with identical stuff and then goes through them, comparing. The second address contained the faulty bit set, so the printed offset is within the second buffer; its start is at (0x7feb601ca000-0x7fea279ca000)/2 + 0x7fea279ca000 = 0x7feac3dca000, add up the offset 0x0739610a… but beware! The offset is in ulongs, which means we need to multiple it by 8. and we get 0x7feacb16010a.

The 0x7feacb16010a should be the faulty address!

*** EDIT *** – this turns out to be wrong! There are two reasons:

  1. The offset is actually in multiples of sizeof(unsigned long), which is 8 on 64-bit archs; multiply the number by 8.
  2. There is some other problem – some slight shift. In my experiments, 0x7f1bd4f71a40 would be the real address but the computed one came out as 0x7f1bd4f72238 – not sure what causes that.

Therefore, the best solution is to apt-source memtester and tweak tests.c to print out also the actual pointers of the fault.

*** END EDIT *** – the rest should work as described again.

Ok. How to get the physical address? New Linux kernels have a nifty invention – /proc/.../pagemap – that provides access to per-page mapping information for the whole process virtual space; see Documentation/vm/pagemap.txt for details. Unfortunately, accessing it is not so simple, but I hacked together a simple Perl script:

# (c) Petr Baudis 2010 &lt;;
# Public domain.
# This won't work on 32-bit systems, sorry.
use warnings;
use strict;
use POSIX;
our ($pid, $vaddr);
($pid, $vaddr) = @ARGV;
open my $pm, "/proc/$pid/pagemap" or die "pagemap: $!";
binmode $pm;
my $pagesize = POSIX::sysconf(&amp;POSIX::_SC_PAGESIZE);
my $ofs = int((hex $vaddr) / $pagesize) * 8;
seek $pm, $ofs, 0 or die "seek $vaddr ($pagesize * $ofs): $!";
read $pm, my $b, 8 or die "read $vaddr ($pagesize * $ofs): $!";
my $n = unpack "q", $b;
# Bits 0-54  page frame number (PFN) if present
# Bits 0-4   swap type if swapped
# Bits 5-54  swap offset if swapped
# Bits 55-60 page shift (page size = 1&lt;&lt;page shift)
# Bit  61    reserved for future use
# Bit  62    page swapped
# Bit  63    page present
my $page_present = ! ! ($n &amp; (1 &lt;&lt; 63));
my $page_swapped = ! ! ($n &amp; (1 &lt;&lt; 62));
my $page_size = 1 &lt;&lt; (($n &amp; ((1 &lt;&lt; 61) - 1)) &gt;&gt; 55);
if (!$page_present and !$page_swapped) {
        printf "[%s: %d * %d] %x: not present\n", $vaddr, $pagesize, $ofs, $n;
if (!$page_swapped) {
        my $pfn = ($n &amp; ((1 &lt;&lt; 55) - 1));
        printf "[%s: %d * %d] %x: present %d, size %d, PFN %x\n", $vaddr, $pagesize, $ofs, $n, $page_present, $page_size, $pfn;
} else {
        my $swapofs = (($n &amp; ((1 &lt;&lt; 55) - 1)) &gt;&gt; 5);
        my $swaptype = ($n &amp; ((1 &lt;&lt; 5) - 1));
        printf "[%s: %d * %d] %x: present %d, size %d, swap type %x, swap offset %x\n", $vaddr, $pagesize, $ofs, $n, $page_present, $page_size, $swaptype, $swapofs;

Fire this up, and you should see something like:

# perl ~pasky/ 25773 0x7feacb16010a
Hexadecimal number > 0xffffffff non-portable at /home/pasky/ line 18.
[0x7feacb16010a: 4096 * 274700012288] 860000000002adf7: present 1, size 4096, PFN 2adf7

PFN stands for Page Frame Number. To get physical address on a PC
from this, just multiply it by page size. The physical address should be 0x2adf7000.

So, which DIMM do we have to replace? This is the most problematic stpe, it does not seem that the mapping would be available anywhere.
Let’s look at the physical mappings available in total:

# dmesg | head -n 20
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 2.6.26-2-amd64 (Debian 2.6.26-21lenny4) ( (gcc version 4.1.3 20080704 (prerelease) (Debian 4.1.2-25)) #1 SMP Tue Mar 9 22:29:32 UTC 2010
[    0.000000] Command line: root=/dev/md0 ro quiet
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
[    0.000000]  BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 00000000bf780000 (usable)
[    0.000000]  BIOS-e820: 00000000bf78e000 - 00000000bf790000 type 9
[    0.000000]  BIOS-e820: 00000000bf790000 - 00000000bf79e000 (ACPI data)
[    0.000000]  BIOS-e820: 00000000bf79e000 - 00000000bf7d0000 (ACPI NVS)
[    0.000000]  BIOS-e820: 00000000bf7d0000 - 00000000bf7e0000 (reserved)
[    0.000000]  BIOS-e820: 00000000bf7ec000 - 00000000c0000000 (reserved)
[    0.000000]  BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
[    0.000000]  BIOS-e820: 00000000ffc00000 - 0000000100000000 (reserved)
[    0.000000]  BIOS-e820: 0000000100000000 - 0000000200000000 (usable)
[    0.000000] Entering add_active_range(0, 0, 159) 0 entries of 3200 used

There is 7GB available on the machine and that can be easily found to correspond to the 0000000000100000 - 00000000bf780000 3GiB and 0000000100000000 - 0000000200000000 4GiB ranges. Our physical address is very low in the range. I guess the best we could do is assume that the DIMMs provide mappings in the same order they are in the slot, and replace the first one…

Or, you could use dmidecode, but only if your BIOS is not broken like mine and actually reports the start/stop addresses. :(