I know this isn't really on-topic for either of these groups, but I
searched Google for a newsgroup dedicated to Unicode or to encodings
and didn't come up with anything ... hopefully somebody else has
wrestled with this ...
I am parsing data which can contain UTF-8 sequences. The data is
encoded like this:
Sigur%%20R%%C3%%B3s
This represents "Sigur Rós". The second-to-last letter is supposed to
be this:
http://www.fileformat.info/info/unicode/char/00f3/index.htm
in case that doesn't come through correctly on your end.
So that makes sense: 0xC3 0xB3 is the correct UTF-8 encoding for the
weird o, "o with acute" (U+00F3). So far, so good.
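As a quick sanity check on those two bytes, here is a minimal Python
sketch (Python is just my choice for illustration; the variable names
are arbitrary) that encodes U+00F3 and prints the result in hex:

```python
# U+00F3 is LATIN SMALL LETTER O WITH ACUTE.
ch = "\u00f3"

# Encode the single character as UTF-8 and list the bytes in hex.
encoded = ch.encode("utf-8")
hex_bytes = [f"{b:02X}" for b in encoded]

print(hex_bytes)  # ['C3', 'B3']
```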
But then I realized I could have sequences like
%%C3%%B3%%20
which would be "o with acute" followed by a space, and I don't know
how to tell where one character's bytes stop and the next one starts.
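One way to sidestep the question, as a rough sketch only: turn every
escape into a raw byte first, then hand the whole byte string to a
UTF-8 decoder and let it find the character boundaries. This assumes
each escape really is the literal text %% followed by two hex digits,
as in my samples, and that everything else is plain ASCII;
decode_escaped is a name I made up for illustration.

```python
import re

def decode_escaped(text: str) -> str:
    """Replace each %%XX escape with the byte 0xXX, then decode as UTF-8."""
    raw = re.sub(
        rb"%%([0-9A-Fa-f]{2})",                   # assumed escape format
        lambda m: bytes([int(m.group(1), 16)]),   # hex digits -> one byte
        text.encode("ascii"),                     # assumes ASCII outside escapes
    )
    # The decoder works out where each multi-byte sequence ends.
    return raw.decode("utf-8")

print(decode_escaped("Sigur%%20R%%C3%%B3s"))
print(repr(decode_escaped("%%C3%%B3%%20")))
```

With this approach the escapes are never interpreted one character at
a time, so the "when to stop" problem is left to the decoder.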
If I understand UTF-8 encoding right, I can use this logic: