UTF-8
UTF-8         


Author: Mik
Date: Apr 16, 2008 09:59

I have files with data and text in the Russian Windows encoding (CP1251). My
current locale on Linux is UTF-8. My Fortran program parses the strings in
these files and performs computations. I use a utility named 'recode' to
convert the text to UTF-8. The Windows version of the program works without
errors, but the Linux version can't parse these files, because Cyrillic
characters take two bytes each in UTF-8. What solution is there?

Thanks
Re: UTF-8         


Author: Mik
Date: Apr 16, 2008 10:12

Mik wrote:
> I have files with data and text in the Russian Windows encoding (CP1251). My
> current locale on Linux is UTF-8. My Fortran program parses the strings in
> these files and performs computations. I use a utility named 'recode' to
> convert the text to UTF-8. The Windows version of the program works without
> errors, but the Linux version can't parse these files, because Cyrillic
> characters take two bytes each in UTF-8. What solution is there?
>
> Thanks

The lines look roughly like this:

| абвгд | 1 | 23.45 | 67.89 | опрст |
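
As a rough illustration of why the two versions behave differently, here is a small sketch in Python (used only for brevity; it is not the Fortran program from the post, and the variable names are made up). It assumes the record above and shows that the byte length changes after recoding, while splitting on the ASCII `|` delimiter still works.

```python
# Sample record from the post above.
line = "| абвгд | 1 | 23.45 | 67.89 | опрст |"

cp1251_bytes = line.encode("cp1251")  # one byte per Cyrillic letter
utf8_bytes = line.encode("utf-8")     # two bytes per Cyrillic letter

# Fixed-column parsing breaks because byte offsets shift after recoding.
cp1251_len = len(cp1251_bytes)
utf8_len = len(utf8_bytes)

# '|' is a single ASCII byte in both encodings, so splitting on it
# does not depend on how wide the Cyrillic letters are.
raw_fields = utf8_bytes.split(b"|")[1:-1]
fields = [f.strip().decode("utf-8") for f in raw_fields]
values = [float(fields[2]), float(fields[3])]
```

If the Fortran parser relies on fixed column positions or fixed-length character variables, this shift in byte offsets is one plausible reason the Linux run fails; a delimiter-based parse would be less sensitive to it.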
Re: UTF-8         


Author: Terence
Date: Apr 17, 2008 16:28

The whole problem is the two-byte encoding used for Russian.

I provide software which runs in many left-to-right languages by
providing external modules of message strings, in several languages,
for each internal message in the program.

Here I use ONLY a one-byte symbol and select the appropriate Microsoft
table for the language required. For Russian this would be the Cyrillic
table. For Polish it's the Slavic table and so on. For Greek I use a
complete Greek table, not the 10 or so top-table physics notation set.

So one solution that occurs to me is:-

Write a program to read the data file and detect the leading byte of
the two-byte UTF-8 code (D0h=Cyrillic, for the Cyrillic coding
throughout the data), and convert the second byte to a new byte
corresponding to a 256-byte DOS Microsoft Cyrillic symbol table.

Then use a single-byte Cyrillic table when reading Russian data if
this is possible in Linux or else the nearest distinct Latin
equivalent to make the text understandable (R.N P F...).
It's obviously possible here in the Forum as the Russian comes out
readably.
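
The byte-level conversion described above could be sketched as follows. This is a minimal Python illustration, not the poster's program: the function name `utf8_cyrillic_to_cp1251` is hypothetical, the target table is assumed to be CP1251 (the encoding of the original files) rather than a DOS table, and only ASCII plus the Russian letters are handled. Note that Russian letters in UTF-8 use both D0h and D1h as the leading byte.

```python
def utf8_cyrillic_to_cp1251(data: bytes) -> bytes:
    """Convert UTF-8 bytes to CP1251 (ASCII and Russian letters only)."""
    out = bytearray()
    i = 0
    while i < len(data):
        b0 = data[i]
        if b0 < 0x80:
            # Plain ASCII is identical in both encodings.
            out.append(b0)
            i += 1
        elif (b0 in (0xD0, 0xD1) and i + 1 < len(data)
              and (data[i + 1] & 0xC0) == 0x80):
            # Two-byte sequence: rebuild the Unicode code point.
            cp = ((b0 & 0x1F) << 6) | (data[i + 1] & 0x3F)
            if 0x0410 <= cp <= 0x044F:
                # U+0410..U+044F (the 64 basic letters) map to C0h..FFh.
                out.append(0xC0 + (cp - 0x0410))
            elif cp == 0x0401:
                out.append(0xA8)  # capital IO
            elif cp == 0x0451:
                out.append(0xB8)  # small io
            else:
                out.append(ord("?"))  # not covered by this sketch
            i += 2
        else:
            # Any other byte is outside the scope of this sketch.
            out.append(ord("?"))
            i += 1
    return bytes(out)


sample = "| абвгд | 1 | 23.45 | 67.89 | опрст |".encode("utf-8")
converted = utf8_cyrillic_to_cp1251(sample)
```

After such a pass every character occupies exactly one byte again, so a parser written for the one-byte files should see the same layout as on Windows.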
Re: UTF-8         


Author: Terence
Date: Apr 17, 2008 16:34

I wrote a reply with two solutions. I don't see it.
I was about to comment that the first byte of UTF-8 for Cyrillic is D0h
AND D1h, not just D0h as I stated. The previous message SAYS it got
posted the simple way. This time there's a different screen!
Re: UTF-8         


Author: Gerry Ford
Date: Apr 17, 2008 18:06

"Terence" cantv.net> wrote in message
news:1fd9e9b5-0c7a-45eb-906f-e7c9d2db6bb2@f63g2000hsf.googlegroups.com...
> The whole problem is the two-byte encoding used for Russian.

That's one of the problems. Another is that the wall that used to divide
Berlin shifted east and has kept westerners (at least this westerner) from
communicating with Gospodin Putin's Russia on the internet, in particular
in newsgroups.

I could shed plenty of light on this question, if OP can help me, for
example, use the cyrillic keys on my keyboard.

I could then replicate his data set and put his question in the crossfire.
--
"A belief in a supernatural source of evil is not necessary; men alone
are quite capable of every wickedness."

~~ Joseph Conrad (1857-1924), novelist
Re: UTF-8         


Author: Greg Lindahl
Date: Apr 17, 2008 21:49

In article <1fd9e9b5-0c7a-45eb-906f-e7c9d2db6bb2@f63g2000hsf.googlegroups.com>,
Terence cantv.net> wrote:
> Then use a single-byte Cyrillic table when reading Russian data if
> this is possible in Linux

Well, in Linux you might use ISO 8859-5, which is an actual standard.

> It's obviously possible here in the Forum as the Russian comes out
> readably.

"here" is the Usenet group comp.lang.fortran.

-- greg
Re: UTF-8         


Author: Terence
Date: Apr 18, 2008 00:25

Greg's comment essentially says "use the Russian two-byte table" for
Russian and Ukrainian, etc.

But that does not help citizen Mik; and the suggestion
1) requires him to program two-byte text handling for his
objectives,
2) totally junks the possibility of him using long-standing Text and
Graphic user interfaces based on one-byte palette colour plus one byte
screen data outputs,
3) OR means the user has to write two programs; one for all but a few
common world-wide languages and one for whichever of the others
including Russian he is interested in (Hebrew, Arabic, Hindi, Sanskrit,
Indonesian and Chinese quickly come to mind; Japan at least has Romaji
to use; wakarimas?).
Re: UTF-8         


Author: Greg Lindahl
Date: Apr 18, 2008 13:46

In article <29a72546-4d97-4652-a151-9595f29c1bc1@d1g2000hsg.googlegroups.com>,
Terence cantv.net> wrote:
>Greg's comment essentially says "use the Russian two-byte table" for
>Russian and Ukrainian, etc.

ISO 8859-5 is a single-byte character set.

http://www.kostis.net/charsets/iso8859.5.htm
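
To illustrate the single-byte point, here is a short Python sketch (the helper name `recode_cp1251_to_iso8859_5` is hypothetical; `cp1251` and `iso8859_5` are codec names from Python's standard library). It assumes the input contains only characters that exist in both tables, such as the Russian letters and ASCII.

```python
def recode_cp1251_to_iso8859_5(raw: bytes) -> bytes:
    """Re-encode CP1251 bytes as ISO 8859-5, going through Unicode."""
    return raw.decode("cp1251").encode("iso8859_5")


text = "абвгд опрст"
cp1251_raw = text.encode("cp1251")
iso_raw = recode_cp1251_to_iso8859_5(cp1251_raw)

# Both encodings use exactly one byte per character,
# so the length of the record is unchanged.
same_length = len(iso_raw) == len(cp1251_raw) == len(text)
```

Characters present in CP1251 but absent from ISO 8859-5 would raise `UnicodeEncodeError` here, so real data may need an explicit `errors=` policy.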

I agree that anything bigger than 8 bits is an incredible pain in
Fortran. It's actually pretty annoying in Perl, too, something I'm
learning the hard way.

-- greg
Re: UTF-8         


Author: nospam
Date: Apr 18, 2008 13:59

Greg Lindahl pbm.com> wrote:
> In article <29a72546-4d97-4652-a151-9595f29c1bc1@d1g2000hsg.googlegroups.com>,
> Terence cantv.net> wrote:
>
>>Greg's comment essentially says "use the Russian two-byte table" for
>>Russian and Ukrainian, etc.
>
> ISO 8859-5 is a single-byte character set.

But it is actually an international standard instead of something MS
specific. Being an international standard seems to be enough to give
some people heartburn. :-)

--
Richard Maine | Good judgement comes from experience;
email: last name at domain . net | experience comes from bad judgement.
domain: summertriangle | -- Mark Twain
Re: UTF-8         


Author: Gerry Ford
Date: Apr 20, 2008 03:45

"Terence" cantv.net> wrote in message
news:7a8d172b-88a1-4ec5-b671-599b35514e59@24g2000hsh.googlegroups.com...
> I wrote a reply with two solutions. I don't see it.
> I was about to comment that the first byte of UTF-8 for Cyrillic is D0h
> AND D1h, not just D0h as I stated. The previous message SAYS it got
> posted the simple way. This time there's a different screen!

???????

--
"A belief in a supernatural source of evil is not necessary; men alone
are quite capable of every wickedness."

~~ Joseph Conrad (1857-1924), novelist