Re: [9fans] simplicity

Group: comp.os.plan9
Author: erik quanstrom
Date: Oct 10, 2007 05:23

> I was thinking of the simplistic scenario, where someone might be
> looking for niño in some file, regardless of what locale they might
> happen to be in. Now I can imagine the nightmare it must be for
> non-English speakers looking for letter combinations irrespective of
> accents.
>
> But, it seems more like a problem with the shorthand than grep, per
> se.

i agree with this. or it's a historical problem with the character set.
clearly if you were designing a universal character set with no compatibility
constraints, the alphabet would have n and ñ together so [a-z] would
match both.

> I could see an argument for [:alpha:] potentially matching n and
> ñ depending on the locale, but [a-z] not matching ñ in any locale. But
> even that, my tendency would be that [:alpha:] match ñ in every
> locale.
>
> But then, does [:alpha:] match ἄγαθος? How ironic that it doesn't match α.

i don't think one can go this route. you can't have a magic environment
variable that changes everything. testing is a nightmare in such a world.
you have to go through every combination of (data cs, locale) to see if
things are working.

a better solution is to use the properties of unicode. ñ is noted in the
table as

00f1;latin small letter n with tilde;ll;0;l;006e 0303;;;;n;latin small letter n tilde;;00d1;;00d1

field 6 has the base codepoint 006e as its first subfield. it would not be hard
to quickly build a table σ mapping a codepoint to its base codepoint.
but it would probably be most useful to also have a mapping ξ from
base codepoints to all of their composed forms.
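
as a rough illustration of the two tables, here is a minimal python sketch that builds σ and ξ from lines in the semicolon-separated format quoted above. the names build_tables and SAMPLE are made up for this example, and two choices are assumptions rather than anything stated in the post: entries tagged like <compat> are skipped, and chains of decompositions are followed down to the last base.

```python
# sample lines in the same (lowercased) format as the entry quoted above
SAMPLE = """\
006e;latin small letter n;ll;0;l;;;;;n;;;004e;;004e
00e4;latin small letter a with diaeresis;ll;0;l;0061 0308;;;;n;latin small letter a diaeresis;;00c4;;00c4
00f1;latin small letter n with tilde;ll;0;l;006e 0303;;;;n;latin small letter n tilde;;00d1;;00d1
""".splitlines()


def build_tables(lines):
    """Return (sigma, xi): codepoint -> base, and base -> set of composed forms."""
    first = {}  # one-step mapping: codepoint -> first subfield of field 6
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 6 or not fields[5]:
            continue  # no decomposition listed for this codepoint
        parts = fields[5].split()
        if parts[0].startswith("<"):
            continue  # assumption: ignore tagged (compatibility) decompositions
        first[int(fields[0], 16)] = int(parts[0], 16)

    sigma, xi = {}, {}
    for cp, base in first.items():
        # assumption: if the base itself decomposes, keep following it
        while base in first:
            base = first[base]
        sigma[cp] = base
        xi.setdefault(base, set()).add(cp)
    return sigma, xi


sigma, xi = build_tables(SAMPLE)
print(hex(sigma[0xF1]))                    # base codepoint of ñ
print(sorted(hex(c) for c in xi[0x6E]))    # composed forms built on n
```

run over the full data file instead of SAMPLE, the same loop would produce the complete tables; codepoints with no decomposition are simply absent from σ.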

suppose, for lack of creativity, we use » to mean all characters whose
base codepoint matches the next item, so »a matches ä, as does »[a-z].
then » applied to a character c can be grepped for by taking ξσ(c), which
results in a character class.
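
the expansion of » into a character class could look roughly like the python sketch below. it is only an illustration: it uses python's unicodedata module as a stand-in for the generated tables, the names base_of, XI and expand are invented here, the scan is limited to the basic multilingual plane, and the base character itself is included in the class (all of these are assumptions).

```python
import re
import string
import unicodedata


def base_of(ch):
    """sigma: follow canonical decompositions down to the first base character."""
    while True:
        d = unicodedata.decomposition(ch)
        if not d or d.startswith("<"):
            return ch  # no decomposition, or a tagged (compatibility) one
        ch = chr(int(d.split()[0], 16))


# xi: base character -> set of characters that decompose to it
# assumption: only the basic multilingual plane is scanned
XI = {}
for cp in range(0x10000):
    ch = chr(cp)
    b = base_of(ch)
    if b != ch:
        XI.setdefault(b, set()).add(ch)


def expand(chars):
    """Return a regex character class covering xi(sigma(c)) for each c in chars."""
    members = set()
    for c in chars:
        b = base_of(c)
        members.add(b)                 # assumption: the base itself also matches
        members |= XI.get(b, set())
    return "[" + "".join(re.escape(m) for m in sorted(members)) + "]"


# »n expands to a class, so "ni»no" finds both spellings
pattern = re.compile("ni" + expand("n") + "o")
print(bool(pattern.search("el ni\u00f1o")))   # True
print(bool(pattern.search("el nino")))        # True

# »[a-z] is the same expansion applied to every member of the range
az = re.compile(expand(string.ascii_lowercase))
```

this only handles precomposed text. input that spells ñ as n followed by the combining tilde 0303 would need a separate rule.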

plan 9 already has some of this in the c library with tolowerrune, etc.
i did some work with this some time ago and wrote some rc scripts to
generate the to*rune tables from the unicode standard data. it would
be easy to adapt them to generate ξ and σ. (the tables would be pretty big.)
>
> What an ugly problem.

it can be made ugly quickly. but i'm not convinced that all approaches
to this problem are bad.

- erik