Re: RfD: Escaped Strings

Group: comp.lang.forth
Author: Stephen Pelc
Date: Jul 14, 2007 08:06

On Fri, 13 Jul 2007 09:03:32 -1000, Elizabeth D Rather wrote:
>How do you feel about Greg Bailey's suggestion of some years back that
>we introduce the data type 'byte' or 'octet' with a small set of
>operators to handle explicitly 8-bit units? That's sort of moving in
>the opposite direction from what you suggest, but seems an equally valid
>approach, I think. Greg's solution leaves everything regarding chars in
>place, while introducing a new opportunity for situations in which you
>need exactly 8 bits (e.g. comms, I/O).

Greg's solution has merit, especially for word/cell addressed
machines. However, the discussion in it indicates that life
isn't that simple unless you use his alternative 2.

Given that nearly all comms and character systems are defined
in bytes, most CPUs are byte-addressed, and many Forth systems
and/or programmers assume char=byte=au, the least effort is to
permit wide characters (xchars) without breaking the assumption
or code.

For those who haven't seen it, Greg's proposal is attached below.

Stephen

====================================
From: Greg Bailey [greg at minerva dot com]
Sent: Tuesday, June 01, 1999 7:41 PM
To: 'ANSForth real mailgroup'
Cc: 'Localisation and Internationalisation'; 'ark-gvb-i'
Subject: Octet String Prospectus

Problem Statement:
------------------

Most standards defining interoperable data structures, such as
those used in networking and cryptography, do so in terms of
sequences of octets. Even in embedded applications, these standards
are increasingly relevant, and indeed supporting them is often a
critical application requirement.

The most commonly encountered computer architectures today address
their memories in units of 8 bit bytes, and Standard Forth
applications have no difficulty in manipulating octet sequences
directly when running on typical systems, with eight bit character
sets, for such machines.

However, such applications are environmentally dependent upon this
common combination in which addresses are in units of bytes or octets,
*and* in which characters are eight bits wide; or upon machines whose
addresses are in units such as 4-bit nibbles which divide 8, and whose
characters are also eight bits wide. On these families of
architectures, portable software may manipulate octet sequences by
treating them as characters.

If, however, either character size or address units are larger
than eight bits, we do not document standard ways of allocating,
manipulating, or performing I/O using sequences of octets.

This proposal provides a mechanism that may be used by standard
programs to manipulate sequences of octets on any standard system
which supports it.

(Actual packaging TBD. Should probably be an extension, but if
so it will depend upon presence of the DOUBLE extension; and it
will include additions to the FILE extension if both are present.)

Discussion of common practice and architectural tradeoffs:
----------------------------------------------------------

Many systems and applications have been written for "cell addressed"
machines with 16 bit and larger address units. Many strategies have
been used for addressing characters, which were generally equivalent
to octets, on such machines. In general the hardware does not
directly support linear addressing of bytes, characters, or octets,
so this type of arithmetically usable address has generally been
simulated in software. The most commonly used strategy has been to
multiply the physical cell address by the number of octets held
within a cell, and add to this product the relative position of the
octet within the cell, in order to form a linear octet address.
Coding strategies for
employing this additional, synthetic address data type depend on the
nature of the underlying CPU. Since there is usually a substantial
performance penalty for using these synthetic addresses, it has been
common practice to use the octet address data type only in conjunction
with octet operators, and to use native cell addresses for all other
purposes.
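
As a rough illustration of the multiply-and-add strategy described
above, here is a minimal sketch. It assumes a 16 bit cell addressed
machine holding two octets per cell; the names OCTETS/CELL,
>OCTET-ADDR and OCTET-ADDR> are hypothetical and do not come from
the proposal or from any particular system.

```forth
\ Sketch only: synthetic linear octet addresses, assuming a
\ cell-addressed machine that holds two octets in each cell.
2 CONSTANT OCTETS/CELL   \ assumption: octets held within one cell

\ Multiply the cell address by the octets per cell, then add the
\ position of the octet within that cell.
: >OCTET-ADDR ( cell-addr pos -- octet-addr )
   SWAP OCTETS/CELL * + ;

\ Split a linear octet address back into cell address and position.
\ UM/MOD is used so that large addresses are treated as unsigned.
: OCTET-ADDR> ( octet-addr -- cell-addr pos )
   0 OCTETS/CELL UM/MOD SWAP ;
```

Every octet access on such a machine pays for this arithmetic,
which is the performance penalty mentioned above.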

Since the dynamic range required of this synthetic data type is
one or more bits larger than for native address units, it follows
that if the machine supports full cell width cell addresses, then
an address capable of identifying any stored character or octet
within the memory must be greater than one cell in width.

A number of practical systems have used cell width octet addresses
with varying degrees of success. For example, a number of the
16-bit minicomputers have been restricted architecturally to 15 bit
cell addressing; in fact, in some cases, the 16th bit has been used to
mark indirect addresses. On such systems, it has been possible to
address all of memory with a 16 bit octet address, with no negative
side effects.

Less successful have been efforts to use 16 bit synthetic octet
addresses on machines that support full 16 bit cell addressing.
One strategy is to limit octet addressing to the low half of
memory. Another is to "float" octet addressing upon each task's
private memory. Yet another subdivides octet addressable space
into a static, common region and another which is "floated". Each
of these strategies has inflicted pain upon programmers who have
had to live with them.

A slightly less obvious form of this pain has been experienced
when maintaining a single source base that runs on both cell and
octet addressed machines. In a typical synthetic addressing scheme
for such 16 bit machines, it is possible to convert a cell address
into the synthetic address of its first octet by simply doubling
the cell address. The advantage of this transformation was that
all the system had to do was specify which operators took octet
addresses as opposed to cell addresses, and expect the programmer
to use the conversion operator when needed. This avoided the need
for special allocation and declaration functions for octet space.
The disadvantage is that, when running on an octet addressed machine,
the conversion operators were no-ops. The consequences of failing to
use a conversion operator, or of using the wrong address type with a
given function, were nil. As a result, a programmer could change such
a program inattentively, test it on an octet addressed machine, and
never discover the bugs thus introduced until the program was later
run on a cell addressed machine. Practical experience has shown that
this error is easy to make, hard to detect, and is a direct
consequence of having an octet address that is of the same size and
the same value as the regular memory address on octet addressed
machines. As a result, it appears that from the perspective of human
factors this is an architecture to be avoided.
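
The hazard can be made concrete with a small sketch. The word
CELL>OCTET and the build switch OCTET-ADDRESSED? are hypothetical
names chosen for illustration; the sketch assumes the TOOLS EXT
words [IF] [ELSE] [THEN] are available.

```forth
\ Sketch only: one source base, two kinds of target machine.
TRUE CONSTANT OCTET-ADDRESSED?   \ assumption: set per target

OCTET-ADDRESSED? [IF]
   \ Octet addressed target: conversion does nothing, so code
   \ that forgets to call it still passes every test here.
   : CELL>OCTET ( a-addr -- octet-addr ) ;
[ELSE]
   \ 16 bit cell addressed target: the first octet of a cell has
   \ twice the cell address, so a forgotten call is a real bug.
   : CELL>OCTET ( cell-addr -- octet-addr ) 2* ;
[THEN]
```

Because both address types are single cells with identical values
on the octet addressed target, nothing in the language or the test
run can distinguish a correct program from one that omits the
conversion.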

Based on this experience, it is proposed that explicit octet
addressing be done using an ordered pair. This practice has actually
been used in a number of systems, and is also the method often
used in hardware and software support for octet sequences on
large cell addressed mainframes.

Synopsis of proposed architecture:
----------------------------------

The ordered pair of an Octet Address consists of a Base Address
and an Octet Index. The Base Address is the standard Address of
the beginning of a memory allocation declared for an Octet Sequence.
All Octet Addresses within that allocation share the same Base
Address, and there is no portable method for transforming an Octet
Address with a given Base Address to use a different Base Address.
The Octet Index is a zero relative, non-negative integer denoting
the position of an octet within the sequence which starts at the
Base Address.

On the stack, the Base Address is on top. Arithmetic on Octet
Addresses is meaningful only when subtracting the address of
one octet from that of another within the same sequence, or
when adding or subtracting a scalar to or from the address of
an octet. This structure and these rules allow the application
to use double operators such as M+ and D- for the valid arithmetic
if those operators are assumed present; otherwise, since such valid
arithmetic never involves carries or borrows between the Index and
Base parts of the Octet Address, it is amenable to simple
operations using standard CORE operators, and similarly for machine
code. For example, the difference between two Octet Addresses that
may be validly compared may be computed

ROT 2DROP - ( in lieu of D- )

and an Octet Address may be decremented using

SWAP 1- SWAP ( in lieu of -1 M+ )

Incrementation is of course done by the dedicated operator below.
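
To make the stack effects of the two phrases above explicit, they
can be wrapped in named words. This is only a sketch: the names
OCTET-DIFF, OCTET-1- and OCTET-OFFSET are hypothetical and are not
part of the proposal, and an Octet Address is taken to be the pair
( index base ) with the Base Address on top, as described above.

```forth
\ Sketch only: Octet Address arithmetic using CORE words.

\ Difference of two Octet Addresses sharing one Base Address.
\ ( i1 b i2 b ) ROT -> ( i1 i2 b b ) 2DROP -> ( i1 i2 ) - -> ( n )
: OCTET-DIFF ( index1 base index2 base -- n )
   ROT 2DROP - ;

\ Step an Octet Address back by one octet.
: OCTET-1- ( index base -- index-1 base )
   SWAP 1- SWAP ;

\ Add a signed scalar n to an Octet Address.
\ ( i b n ) ROT -> ( b n i ) + -> ( b i+n ) SWAP -> ( i+n b )
: OCTET-OFFSET ( index base n -- index+n base )
   ROT + SWAP ;
```

Note that OCTET-DIFF leaves a single cell result, whereas D- would
leave a double whose high cell is zero.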

Finally, this arrangement leads to syntax which is analogous to
that which is commonly used with arrays in Forth. If PACKET has
been declared as an octet sequence, the phrase:

5 PACKET

places on the stack the formal Octet Address of the sixth octet
in that sequence since PACKET simply provides the Base Address
for that sequence. In a loop,

I PACKET

or 4 + DUP PACKET

occurs naturally as it does with arrays, helping out with stack
bloat that would occur if "indexing" were not available and
arithmetic on the double form was the only way to navigate.

I believe, based on considerable experience, that this is the
cleanest way to deal with this issue. In fact, it is precisely
the solution that ATHENA uses for data structures defined as
sequences of *bits*, where it has served well, led to readable
code, and produced no glaring inconsistencies. Based on this,
the minimum set of things we might need is:

OCTETS  ( n1 - n2)               Clone defn from CHARS
OCTET+  ( 8-addr1 - 8-addr2)     Clone defn from CHAR+
8@      ( 8-addr - u)            Clone defn from C@
8!      ( u 8-addr - )           Clone defn from C!
8MOVE   ( 8-addr1 8-addr2 u - )  Clone defn from CMOVE

It is strictly coincidental that "8" looks very much like "B"
at first glance ;-)
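
For readers on a typical system, the five words might look roughly
like the sketch below. It assumes AU = byte = char = octet, takes an
8-addr to be the pair ( index base ) with the Base Address on top,
and relies on CMOVE from the STRING word set. DEMO-BUF and
SUM-OCTETS are hypothetical names used only to exercise the words;
none of this is a reference implementation.

```forth
\ Sketch only: octet words where one character is one octet.
: OCTETS ( n1 -- n2 ) ;           \ octets already are address units
: OCTET+ ( index base -- index+1 base )  SWAP 1+ SWAP ;
: 8@     ( index base -- u )      + C@ ;
: 8!     ( u index base -- )      + C! ;
: 8MOVE  ( index1 base1 index2 base2 u -- )
   >R + >R +                      \ form both plain addresses
   R> R> CMOVE ;                  \ ( addr1 addr2 u )

CREATE DEMO-BUF 16 OCTETS ALLOT   \ hypothetical octet sequence

\ Add up the first n octets, indexing in the array-like style.
: SUM-OCTETS ( n -- sum )
   0 SWAP 0 ?DO  I DEMO-BUF 8@ +  LOOP ;
```

With these definitions, 65 0 DEMO-BUF 8! stores 65 in the first
octet and 0 DEMO-BUF 8@ fetches it back. On a cell addressed
machine the bodies would instead pack and unpack octets within
cells, but the calling code would not change.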

Storage for octet sequences is allocated using the present
conventions for allocating and identifying *aligned* addresses.
For example,

CREATE PACKET 536 OCTETS ALLOT
... , ... ALIGN HERE 64 OCTETS ALLOT ...

For the purpose of complying with standards, the first form is
more likely to be used. The requirement for ALIGNing Base
Addresses facilitates efficient implementations on the universe
of equipment.

Addition of octet sequence support to the FILE extension must be
done in such a way that it is independent of character size, which
may be larger than an octet. However, as written all FILE operators
function in terms of lengths and positions whose units are characters.
Because more than one octet position may map onto the same character
position, dealing with the same file ID in terms of both octets and
characters would be problematic.

Instead, the following is proposed:

OCT ( fam1 - fam2)

Modify the implementation-defined file access method fam1
to additionally select an octet oriented, as opposed to character
or file oriented, access method. When a file ID has been opened
with the OCT access method, all file positions and sizes used in
association with that file are in units of octets instead of
characters. In addition, it is an ambiguous condition to use
READ-FILE, READ-LINE, WRITE-FILE, WRITE-LINE, or INCLUDE-FILE
with such a file ID. INCLUDED is not mentioned in this list
because it does not consume a file ID.

READ-OCTET ( 8-addr u1 fileid - u2 ior) Clone from READ-FILE

Note ambiguous condition if used with a fileid not opened as OCT

WRITE-OCTET ( 8-addr u fileid - ior) Clone from WRITE-FILE

Note ambiguous condition if used with a fileid not opened as OCT

This appears to be the minimum necessary change. READ-FILE and
WRITE-FILE are not overloaded because experience indicates that
having different arguments for the same function depending on a
flag leads to maintenance problems.
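
On a system where AU = byte = char = octet, the three file words
could plausibly be thin wrappers, as in the sketch below. This is an
assumption about one possible implementation, not part of the
proposal: it treats OCT as a no-op because file positions on such a
system are already counted in octets, and it uses 2>R and 2R> from
CORE EXT. An 8-addr is again the pair ( index base ).

```forth
\ Sketch only: octet file access where one character is one octet.
: OCT ( fam1 -- fam2 ) ;          \ nothing to change on this system

\ Collapse the Octet Address to a plain address, then defer to
\ the character-based FILE words.
: READ-OCTET ( index base u1 fileid -- u2 ior )
   2>R + 2R> READ-FILE ;

: WRITE-OCTET ( index base u fileid -- ior )
   2>R + 2R> WRITE-FILE ;
```

A system with wider characters or cell addressing would need real
packing logic in place of the single + used here.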

If written, this proposal will of course have to include a number
of details in sections 2, 3, and 4 as well as 11 and whatever is
assigned for this extension.

ALTERNATIVE STRUCTURE 1:
------------------------

If the TC strongly feels that this is too much solution for the
problem, there is a simpler alternative that is logically self
consistent:

1. An octet is guaranteed to fit inside the storage allocation
for a character.

2. Therefore, omit all of this except the FILE wordset part.

3. In the FILE wordset, include OCT but simply note that in
this access method octets are read from and written to the
device, sizes and positions are in octets, and the data
are read into and written from character storage such that
octets are right justified and zero filled into characters
on READ-FILE, only the low order eight bits of each character
are written by WRITE-FILE, and that READ-LINE, WRITE-LINE,
and INCLUDE-FILE are ambiguous with an OCT file handle.

The disadvantage of this is that while it would allow everyone
with the AU=byte=char=octet dependency to congratulate themselves
as having complied without doing any work, it would not address
the physical storage structures commonly used by hardware and
operating systems for cell addressed equipment, and would be
inefficient on byte addressed machines with large characters.

ALTERNATIVE STRUCTURE 2:
------------------------

It might be more useful to use the initial structure above but to
disambiguate READ-FILE and WRITE-FILE by incorporating the
conventions in item 3 of alternative 1 above. What this would buy
is that an existing AU=byte=octet=char application that had to
be converted in a hurry to use say 16 bit characters could adapt
to such a system by using OCT as file access method with no other
changes (assuming it was coded with CHARS and CHAR+ as needed)
and still operate upon its octet sequence structures with reduced
efficiency. For that matter, it could run on cell addressed
hardware with similarly reduced efficiency. In either case, at
leisure and if necessary the application could be upgraded to
actually use the Octet Addressing functions, but in the meanwhile
there would be a fast and dirty way to solve the problem with
minimal effort.

At present I think that Alternative 2 would be the wisest of these
three. Perhaps the part of Alternative 2 taken from Alternative 1
could be the OCTET extension, and the rest of it could be called
OCTET EXT.

Or, if one felt more strongly about it, OCT could be added to the
base FILE wordset along with the change in behavior of that wordset
per Alternative 1, and the rest of 2 implemented as simply the OCTET
wordset with no OCTET EXT (as yet). For those maintaining typical
systems, that could require as little as adding OCT as a no-op.

Obviously it would be nice to have a first draft that might pass,
so these packaging issues should be more or less resolved first.
In that regard the central question is, to me, how essential and
therefore how non-optional each of these layers should be.

-----------------------------------------------------------------

--
Stephen Pelc, stephenXXX@mpeforth.com
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 (0)23 8063 1441, fax: +44 (0)23 8033 9691
web: http://www.mpeforth.com - free VFX Forth downloads