Wednesday, December 08, 2010

Sometimes, a period doesn't match any character

Recently I found that in regular expressions in GNU sed, ‘.’ failed to match some characters. At first it was very surprising, but there's a simple explanation in the GNU sed manual:
s/.*// does not clear pattern space
This happens if your input stream includes invalid multibyte sequences. POSIX mandates that such sequences are not matched by ‘.’, so that ‘s/.*//’ will not clear pattern space as you would expect. In fact, there is no way to clear sed's buffers in the middle of the script in most multibyte locales (including UTF-8 locales). For this reason, GNU sed provides a `z' command (for `zap') as an extension.

To work around these problems, which may cause bugs in shell scripts, set the LC_COLLATE and LC_CTYPE environment variables to ‘C’.

The input I was processing used the ISO-8859-1 character set, while Cygwin uses UTF-8 by default. The characters that failed to match formed invalid sequences in UTF-8. Setting ‘LANG=C’ also fixes the problem.
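
As a rough illustration of the workaround, here is a minimal sketch. The sample string and the variable names are made up for the example, and it uses LC_ALL (which overrides LANG, LC_COLLATE and LC_CTYPE) rather than the individual variables mentioned above, so that the result does not depend on what else is set in the environment.

```shell
# Byte 0xE9 is "é" in ISO-8859-1, but on its own it is an
# incomplete (invalid) sequence in UTF-8. Hypothetical sample input.
line=$(printf 'caf\351')

# In the C locale every byte counts as one character,
# so '.' matches 0xE9 and 's/.*//' empties the line.
cleared=$(printf '%s\n' "$line" | LC_ALL=C sed 's/.*//')

# Replacing each character with X shows that all four bytes matched.
masked=$(printf '%s\n' "$line" | LC_ALL=C sed 's/./X/g')

printf 'cleared=[%s] masked=[%s]\n' "$cleared" "$masked"
```

Running the same commands without LC_ALL=C in a UTF-8 locale should, according to the manual text quoted above, leave the 0xE9 byte unmatched.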
