Author: Bernd Paysan Date: Jul 14, 2007 12:56
Since it's time to post RfDs, I want to throw in the updated proposal for
the XCHAR wordset. I hope I have included all comments so far, and I also
included a reference implementation.
Problem:
ASCII is only appropriate for the English language. Most western
languages, however, fit into the Forth framework reasonably well, since a
byte is sufficient to encode the few special characters in each (though
the same encoding cannot always be used; Latin-1 is the most widely
used). For other languages, different character sets have to be used,
several of them variable-width. The most prominent example is
UTF-8. Let's call these extended characters XCHARs. Since ANS Forth
specifies ASCII encoding, only ASCII-compatible encodings may be
used. Fortunately, being ASCII-compatible has so many benefits that
most encodings actually are ASCII-compatible.
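As an illustration of the variable-width point, here is a minimal Python sketch (Python is used only so the example can be run anywhere; the helper name `utf8_width` and the sample characters are assumptions of this sketch, not part of the proposal). It shows how many bytes a few characters occupy in UTF-8, and that the same character is stored differently in Latin-1 and UTF-8.

```python
def utf8_width(ch):
    """Number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

# Sample characters are chosen for this sketch, not taken from the proposal.
for ch in ["A", "ä", "€", "漢"]:
    print(f"U+{ord(ch):04X}: {utf8_width(ch)} byte(s), {ch.encode('utf-8').hex()}")

# The same character has different byte sequences in different encodings.
assert "ä".encode("latin-1") == b"\xe4"       # one byte in Latin-1
assert "ä".encode("utf-8") == b"\xc3\xa4"     # two bytes in UTF-8
```

A program that assumes one byte per character would miscount or split every non-ASCII character in the list above.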
Proposal
Datatypes:
Author: Bruce McFarling Date: Jul 15, 2007 06:06
How hard would it be to extend the reference implementation to UTF-32?
Erratum:
XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ) XCHAR EXT
Stores the XCHAR xc into the buffer starting at address xc_addr1, u1
characters large. xc_addr2 points to the first memory location after
xc, u2 is the remaining size of the buffer. If the XCHAR xc did fit
into the buffer, flag is true, otherwise flag is false, and xc_addr2
u2 equal xc_addr1 u1. XC!+? is -save- +safe+ for buffer overflows, and
therefore preferred over XC!+.
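The behaviour described for XC!+? can be modelled in a few lines. This is a rough Python sketch, assuming UTF-8 as the encoding; the function name `xc_store_checked` and the use of a `bytearray` index in place of a memory address are assumptions made for the illustration, and the sketch is not the reference implementation.

```python
def xc_store_checked(xc, buf, addr, u):
    """Model of XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ), assuming UTF-8.

    buf is a bytearray standing in for memory, addr an index into it,
    and u the number of bytes still free starting at addr.
    """
    encoded = chr(xc).encode("utf-8")
    if len(encoded) > u:
        # Does not fit: leave the buffer untouched, return addr and u unchanged.
        return addr, u, False
    buf[addr:addr + len(encoded)] = encoded
    return addr + len(encoded), u - len(encoded), True


buf = bytearray(4)
addr, u, ok = xc_store_checked(0x20AC, buf, 0, len(buf))   # EURO SIGN, 3 bytes
addr, u, ok2 = xc_store_checked(0xE4, buf, addr, u)        # needs 2 bytes, 1 left
```

The second call reports failure instead of writing past the end of the buffer, which is the property that makes the checked word preferable.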
Author: Anton Ertl Date: Jul 15, 2007 10:24
Bernd Paysan writes:
>xc_addr is the address of an XCHAR in memory. Alignment requirements are
> the same as c_addr. The memory representation of an XCHAR differs
> from the stack location, and depends on the encoding used. An XCHAR
                 ^^^^^^^^
                 representation?
>Common encodings:
...
>Side issues to be considered:
These appear to be subsections that should be put in informative
sections, not the normative "Proposal" section.
Author: Bernd Paysan Date: Jul 15, 2007 12:33
Bruce McFarling wrote:
> How hard would it be to extend the reference implementation to UTF-32?
UTF-32 is not ASCII compatible, unless you have a system where 1 CHAR = 32
bits.
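The compatibility claim can be checked at the byte level. The following Python sketch uses a hypothetical helper, `is_ascii_compatible`, which only tests one necessary condition (ASCII text must encode to the same bytes as in ASCII) and assumes 8-bit bytes as the unit of storage.

```python
def is_ascii_compatible(encoding, text="Forth"):
    """True if ASCII text encodes to exactly its ASCII bytes in `encoding`."""
    return text.encode(encoding) == text.encode("ascii")


# With 8-bit storage units, UTF-32 pads every ASCII character with zero bytes.
assert "A".encode("utf-8") == b"A"
assert "A".encode("utf-32-le") == b"A\x00\x00\x00"

for name in ["utf-8", "latin-1", "utf-16-le", "utf-32-le"]:
    print(name, is_ascii_compatible(name))
```

On a system where one CHAR is 32 bits wide, each UTF-32 code unit would occupy exactly one character cell, which is the exception named above.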
> Erratum:
>
> XC!+? ( xc xc_addr1 u1 -- xc_addr2 u2 flag ) XCHAR EXT
> Stores the XCHAR xc into the buffer starting at address xc_addr1, u1
> characters large. xc_addr2 points to the first memory location after
> xc, u2 is the remaining size of the buffer. If the XCHAR xc did fit
> into the buffer, flag is true, otherwise flag is false, and xc_addr2
> u2 equal xc_addr1 u1. XC!+? is -save- +safe+ for buffer overflows, and
> therefore preferred over XC!+.
Thanks, there was another save/safe error, as well.
Author: Bernd Paysan Date: Jul 15, 2007 13:02
Anton Ertl wrote:
> Bernd Paysan writes:
>>xc_addr is the address of an XCHAR in memory. Alignment requirements are
>> the same as c_addr. The memory representation of an XCHAR differs
>> from the stack location, and depends on the encoding used. An XCHAR
>                 ^^^^^^^^
>                 representation?
Yes.
>>Common encodings:
> ...
>>Side issues to be considered:
>
> These appear to be subsections that should be put in informative
> sections, not the normative "Proposal" section.
Moved them to an appendix.
Author: Alex McDonald Date: Jul 15, 2007 13:33
Bernd Paysan wrote:
[snipped]
Unfortunately, on first analysis, this is one proposal that Win32Forth
will not be adopting any time soon.
Windows is UTF-16, which is not ASCII compliant. Although Windows
provides APIs to translate from locale to locale, there is no method in
Win32Forth to automatically identify which parameters would need
to be translated from XCHARs to UTF-16 and back; the programmer would be
responsible for coding the conversions.
We would need something like the proposal Anton made at EuroForth 2006
( http://dec.bournemouth.ac.uk/forth/euro/ef06/ertl06.pdf, A Portable C
Function Call Interface), with extensions to identify string pointers,
before implementing this.
--
Regards
Alex McDonald
Author: Anton Ertl Date: Jul 16, 2007 02:46
Bernd Paysan writes:
>Anton Ertl wrote:
>> Bernd Paysan writes:
>>>XSTRING+ ( xcaddr1 u1 -- xcaddr2 u2 ) XCHAR
>>>Step forward by one xchar in the buffer defined by xcaddr1 u1. xcaddr2
>>>u2 is the remaining buffer after stepping over the first XCHAR in the
>>>buffer.
>>>
>>>-XSTRING ( xcaddr1 u1 -- xcaddr1 u2 ) XCHAR
>>>Step backward by one xchar in the buffer defined by xcaddr1 u1,
>>>starting at the end of the buffer. xcaddr1 u2 is the remaining buffer
>>>after stepping backward over the last XCHAR in the buffer. Unlike
>>>XCHAR-, -XSTRING can be implemented in encodings that have only a
>>>forward-working string size.
>>
>> The asymmetry in the stack effects of XSTRING+ and -XSTRING is
>> probably hard to remember and may be confusing.
>
>Oops, got it wrong, the description is actually of +XSTRING and XSTRING-.
>The sign is on the side of the string which gets modified, and indicates ...
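For readers who want to see the two stepping operations side by side, here is a hedged Python sketch for UTF-8 only. The function names `plus_xstring` and `xstring_minus` are hypothetical stand-ins for +XSTRING and XSTRING-, an (index, byte count) pair stands in for the address and length on the stack, and the input is assumed to be well-formed UTF-8. The backward step relies on UTF-8 continuation bytes being recognizable on their own, which other encodings may not offer.

```python
def plus_xstring(buf, addr, u):
    """Model of +XSTRING: drop the first xchar, returning (addr2, u2)."""
    if u == 0:
        return addr, u
    step = 1
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    while step < u and (buf[addr + step] & 0xC0) == 0x80:
        step += 1
    return addr + step, u - step


def xstring_minus(buf, addr, u):
    """Model of XSTRING-: drop the last xchar, returning (addr, u2)."""
    if u == 0:
        return addr, u
    u -= 1
    # Walk back over continuation bytes until the lead byte is excluded too.
    while u > 0 and (buf[addr + u] & 0xC0) == 0x80:
        u -= 1
    return addr, u


data = "aä€".encode("utf-8")   # 1 + 2 + 3 bytes
```

In this model the forward step changes both the address and the length, while the backward step changes only the length, matching the two stack effects quoted above.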
Author: Anton Ertl Date: Jul 16, 2007 04:28
Alex McDonald writes:
>Bernd Paysan wrote:
>
>[snipped]
>
>Unfortunately, on first analysis, this is one proposal that Win32Forth
>will not be adopting any time soon.
>
>Windows is UTF-16, which is not ASCII compliant. Although Windows
>provides APIs to translate from locale to locale, there is no method in
>Win32Forth to automatically identify which parameters would need
>to be translated from XCHARs to UTF-16 and back; the programmer would be
>responsible for coding the conversions.
I don't see that you are any worse off with xchars in this situation
than with chars.
Author: Alex McDonald Date: Jul 16, 2007 05:39
> Alex McDonald writes:
>>Bernd Paysan wrote:
>
>>[snipped]
>
>>Unfortunately, on first analysis, this is one proposal that Win32Forth
>>will not be adopting any time soon.
>
>>Windows is UTF-16, which is not ASCII compliant. Although Windows
>>provides APIs to translate from locale to locale, there is no method in
>>Win32Forth to automatically identify which parameters would need
>>to be translated from XCHARs to UTF-16 and back; the programmer would be
>>responsible for coding the conversions.
>
> I don't see that you are any worse off with xchars in this situation
> than with chars.
Author: Bernd Paysan Date: Jul 16, 2007 07:58
Anton Ertl wrote:
>>Windows is UTF-16, which is not ASCII compliant. Although Windows
>>provides APIs to translate from locale to locale, there is no method in
>>Win32Forth to automatically identify which parameters would need
>>to be translated from XCHARs to UTF-16 and back; the programmer would be
>>responsible for coding the conversions.
>
> I don't see that you are any worse off with xchars in this situation
> than with chars.
It's somewhat worse, because Windows has "A" prototypes, which convert the
current code page (can be multibyte) into UTF-16 on the fly. The "W"
prototypes take UTF-16 directly. But there's some light: UTF-8 is one of
the code pages in Windows (number 65001), and you can at least use
MultiByteToWideChar to convert data.
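What such a conversion has to produce can be shown without calling the Windows API at all. The Python sketch below is a portable model only: the function names `utf8_to_utf16le` and `utf16le_to_utf8` are hypothetical, and it uses Python's own codecs rather than MultiByteToWideChar, so it says nothing about that function's parameters or error handling.

```python
def utf8_to_utf16le(data: bytes) -> bytes:
    """Re-encode a UTF-8 byte string as little-endian UTF-16 (no BOM)."""
    return data.decode("utf-8").encode("utf-16-le")


def utf16le_to_utf8(data: bytes) -> bytes:
    """Re-encode little-endian UTF-16 (no BOM) as UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


xchars = "Aä".encode("utf-8")        # 3 bytes: 41 c3 a4
wide = utf8_to_utf16le(xchars)       # 4 bytes: 41 00 e4 00
assert utf16le_to_utf8(wide) == xchars
```

The byte counts differ in both directions, so a Forth wrapper around a "W" call would need a separately sized buffer for the converted string rather than converting in place.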
Actually, it might be possible to change the current code page to UTF-8, but
I didn't see a hint how to do that other than for console I/O (SetConsoleCP
and SetConsoleOutputCP). I must honestly admit that I don't like the way the
online MSDN presents information. The internal search is horrible, and
it's one of the rare sites where even Google is confused.