Perl for accents and spaces in html hrefs

The title may look like spam, but it packs in all the relevant keywords for a problem that I suspect is more common than it seems, especially in non-English-speaking countries.
I had created (automatically, not by hand) a set of HTML, PDF and RTF files for the 477 items of a regional catalogue of industrial estate (sorry, no references: it belongs to other people and they have not published it yet). Then I built the index.html page listing and linking to all those files. It was a list of entries like the following one (shown here on several lines, but each entry was actually a single line):

<li>34.001.0000-1:  
<a href="file://E:/inventario/ABARCA DE CAMPOS/fábrica_de_harinas/plantilla.html">
La Primera de Campos</a>.  
<a type="application/pdf" href="file://E:/inventario/ABARCA DE CAMPOS/fábrica_de_harinas/plantilla.pdf">
(vers. PDF)</a>. 
<a type="application/msword" href="file://E:/inventario/ABARCA DE CAMPOS/fábrica_de_harinas/plantilla.rtf">
(vers. RTF)</a>
</li>

Then, for obvious reasons, I changed all the file and directory names using a shell script so that they had neither spaces nor accents (the names had been created by others, specifically university professors). By then, however, they had asked me to rename the PDFs using a different scheme, and I was not in the mood to rewrite the index-creating utility (it should have been no big deal, but I thought a simple Perlism would be faster). By the way, I was using OS X but the data were all on an NTFS drive, and you should try copying, moving or renaming accented directories between NTFS drives and HFS+… ‘Throwing up’ is a kind description of what the Mac does at this job.
So I had to rewrite all the accents in the hrefs and change all the spaces to underscores. First I had to find out what an ‘accented character’ means to OS X when it is read from an NTFS filesystem (not as easy a task as it ought to be), so that I could write it correctly in the script. Then I had to work out how the system had previously modified the accented NTFS paths. It turned out that if you say in the shell:

$ ls /Volumes/NTFS_drive/ | sed -e 's/[^A-Za-z0-9._]/_/g'

It changes

fábrica de harinas

into

fa_brica_de_harinas
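
A plausible explanation, offered here as an assumption rather than something checked on that drive, is that OS X hands file names back in decomposed Unicode form: ‘á’ arrives as a plain ‘a’ followed by a separate combining accent, and the combining accent is the only part sed sees as a forbidden character. A minimal sketch that reproduces the effect with the core Unicode::Normalize module (shell_view is a made-up name, and the file is assumed to be saved as UTF-8):

```perl
use strict;
use warnings;
use utf8;                          # assumption: this file is saved as UTF-8
use Unicode::Normalize qw(NFD);    # core module, provides canonical decomposition

# Hypothetical helper: imitate what `ls | sed` shows for a decomposed name.
sub shell_view {
    my ($name) = @_;
    my $decomposed = NFD($name);   # "á" becomes "a" followed by U+0301
    # Same character set as the sed expression: everything else becomes "_".
    (my $masked = $decomposed) =~ s/[^A-Za-z0-9._]/_/g;
    return $masked;
}

print shell_view('fábrica de harinas'), "\n";   # prints: fa_brica_de_harinas
```

The spaces become underscores as well, since the expression replaces every character outside the listed set.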

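So each accented letter ends up as its base letter followed by an underscore. We define the following substitution hash (for Spanish), which maps each accented letter to its base letter; the underscore is added later, in the substitution itself: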

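use locale;
# Accented letter => unaccented replacement (Spanish only).
# NOTE: keys and input are compared byte by byte, so this works as written
# only if the script and index.html share a single-byte encoding (e.g. Latin-1).
%change = ( 'Á' => "A",
            'É' => "E",
            'Í' => "I",
            'Ó' => "O",
            'Ú' => "U",
            'Ñ' => "N",
            'Ü' => "U",
            'á' => "a",
            'é' => "e",
            'í' => "i",
            'ó' => "o",
            'ú' => "u",
            'ñ' => "n",
            'ü' => "u");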

The ‘replacement’ problem is a bit delicate because we want the accents removed only inside hrefs. Perl’s ability to run code inside a regexp substitution (using the ‘e’ option at the end of the regexp) comes in quite handy here. We define a cleanup subroutine that does the accent substitution and run it on each match inside an href:

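sub cleanup {
    my $t = shift;
    # Replace every character that is not a letter, digit, '_', '.' or '/':
    # accented letters become their unaccented version plus an underscore,
    # anything else (spaces included) becomes a bare underscore.
    # Digits are kept, to match the sed expression above.
    $t =~ s|([^A-Za-z0-9_./])|(defined $change{$1} ? $change{$1} : "") . "_"|ge;
    return $t;
}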

while (my $line = <>) {
    # lines without an href are passed through untouched
    unless ($line =~ /href/) {
        print $line;
        next;
    }

    # real work: notice the 'e' (evaluate) and 'x' (ignore spaces inside the regexp)
    $line =~ s|(href="file://E:/)([^"]+)"|$1 . cleanup($2) . "\""|gex;

    print $line;
}
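
As a quick sanity check, here is a self-contained sketch that applies the same substitution to a single sample href instead of reading a file. It rests on a few assumptions that are not part of the script above: the file is saved as UTF-8 (hence use utf8), Perl is 5.10 or later (for the // operator), digits are kept in paths, and fix_hrefs is a made-up name. The hash is trimmed to the lowercase letters to keep it short.

```perl
use strict;
use warnings;
use utf8;    # assumption: this file is saved as UTF-8

# Trimmed-down substitution hash, for illustration only.
my %change = ('á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o',
              'ú' => 'u', 'ñ' => 'n', 'ü' => 'u');

# Accented letters become base letter + "_"; any other character outside
# the allowed set (spaces included) becomes a bare "_".
sub cleanup {
    my ($path) = @_;
    $path =~ s{([^A-Za-z0-9_./])}{ ($change{$1} // '') . '_' }ge;
    return $path;
}

# Hypothetical wrapper: rewrite only the path part of file://E:/ hrefs.
sub fix_hrefs {
    my ($line) = @_;
    $line =~ s{(href="file://E:/)([^"]+)"}{ $1 . cleanup($2) . '"' }ge;
    return $line;
}

my $sample = '<a href="file://E:/inventario/ABARCA DE CAMPOS/fábrica_de_harinas/plantilla.html">';
print fix_hrefs($sample), "\n";
# prints: <a href="file://E:/inventario/ABARCA_DE_CAMPOS/fa_brica_de_harinas/plantilla.html">
```

Text outside an href is never touched, because cleanup only ever receives the second capture group.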

I did not want to code the above in the shell, mainly because of the accented characters, which I would not know how to handle properly in a shell script on OS X (it is not as easy as it should be).
Hope you like the code, and feel free to comment.
To get the full Perl script you only need to join the two code blocks above and, if you are making it executable, add the customary #!/usr/bin/perl -w line at the top.
