On 2007-01-23 13:24:39 -0400, nospam@see.signature (Richard Maine) said:
> David Koch googlemail.com> wrote:
>
>> In terms of execution time with large files what's the fastest?
>
> As with many things in life, it depends.
>
>> Is the order something like:
>>
>> 1. Unformatted, Direct
>> 2. Unformatted, Sequential
>> 3. Formatted, Direct
>> 4. Formatted, Sequential
>
> Well, before you get into which way is the fastest, you have to address
> the question of which way can read the file *AT ALL*. That is often a
> critical determiner. If you can't read the file at all, then speed is
> pretty irrelevant. One might say that it fails pretty quickly, I
> suppose. Sometimes, though, you do have a choice in how the file is
> written.
>
> You can probably find isolated exceptions, but as a general rule:
>
> All forms of formatted are slow. Very slow. My rule of thumb is roughly
> a factor of 10 slower than unformatted, although that can vary.
> Formatting, by definition, involves conversion between internal form and
> text form. That conversion takes time. Formatted I/O is for human
> consumption (printouts, terminals and the like). Also, it is a simple
> way of doing highly portable files readable with almost any programming
> language and easily transported across different machine architectures.
> (There can be differences between architectures, but there will also be
> standard utilities for dealing with those differences).
>
> Sequential vs direct access (vs stream) is more complicated. Sequential
> requires you to read through a file sequentially (thus the name)
> starting at the beginning. Often that's what you want to do anyway. But
> if you have a 1GB file and want to read only a small part near the end,
> it can be a large penalty to have to read all the way through the file
> to find the part you want. Direct access allows jumping directly to any
> record (thus its name) without reading others first. That can be a huge
> advantage in some cases, and make no difference at all in others.
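
To make the direct-access point concrete, here is a minimal sketch (mine,
not Richard's). The file name, unit number and record size are made up
for illustration. It writes 100 fixed-length records and then reads
record 97 without touching the first 96.

```fortran
program direct_demo
  implicit none
  integer, parameter :: n = 4, nrec = 100   ! illustrative sizes only
  real    :: row(n)
  integer :: i, rl

  ! Ask the compiler how long one record is, in whatever units RECL= uses.
  inquire(iolength=rl) row

  open(unit=21, file='direct_demo.dat', access='direct', &
       form='unformatted', recl=rl, status='replace')

  ! Every record has the same length: one row of n reals.
  do i = 1, nrec
     row = real(i)
     write(21, rec=i) row
  end do

  ! Jump straight to record 97; no earlier record is read first.
  read(21, rec=97) row
  print *, 'first element of record 97:', row(1)   ! should show 97.0

  close(21, status='delete')
end program direct_demo
```

Using INQUIRE(IOLENGTH=) avoids guessing whether RECL is counted in
bytes or in words, which differs between compilers.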
>
> Sequential *might* help the system to be smart about read-ahead, helping
> the speed. But that is highly system dependent. On some systems, it will
> make no difference. For example, the sequential access pattern might be
> noticed and taken advantage of even for direct access. This one is hard
> to generalize on.
>
>> The thing is - using unformatted, sequential works nice for files where
>> each "record" has the same length such as matrices. I guess it cannot be
>> used for files, where for instance each line has a different number of
>> entries which is what I have at hand now
>
> Depending how hard you are willing to work, the fixed record length
> limitation of direct access can be worked around. You can read largish
> fixed-size blocks from the file and, in essence, do your own record
> management. That can be a fair amount of work, particularly for
> variable length records and if you require capability to quickly jump to
> arbitrary points in the file. For example, you might need to build some
> kind of internal index. I've done things like that for an application
> that is highly I/O performance sensitive and needs the capability.
> However, you aren't ready for that degree of sophistication (or perhaps
> I should more tactfully say that I'm not up to explaining it at this
> level).
>
>> - so the question is whether to use sequential formatted or unformatted?
>
> If the file exists, use whatever format it is in. If you get the choice,
> and speed is the main criterion, then this one is easy - unformatted
> will be far faster.
>
> Also, stream was mentioned earlier in this thread. Stream unformatted is
> another option. It has some big advantages in some cases, but the
> details of how it is done in f77/f90/f95 compilers vary today. It is
> standardized in f2003, but for today, there are compiler variations. I
> think that getting into its pros and cons is a bit much for this level.
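
For the curious, a minimal sketch of f2003 stream access, assuming a
compiler that supports ACCESS='STREAM' and INQUIRE(POS=). It also shows
one simple form of the "internal index" idea for variable length
records. The file name, unit number and the offset/length arrays are my
own illustrative choices, not anything from the posts above.

```fortran
program stream_demo
  implicit none
  integer, parameter :: nrec = 3          ! illustrative record count
  integer :: offset(nrec), length(nrec)   ! hand-built index
  integer :: i, j
  real, allocatable :: buf(:)

  open(unit=22, file='stream_demo.bin', access='stream', &
       form='unformatted', status='replace')

  ! Write records of different lengths (2, 4 and 6 reals) and remember
  ! where each one starts in the file.
  do i = 1, nrec
     length(i) = 2*i
     inquire(unit=22, pos=offset(i))
     write(22) (real(10*i + j), j = 1, length(i))
  end do

  ! Fetch record 2 directly, using the index instead of reading record 1.
  allocate(buf(length(2)))
  read(22, pos=offset(2)) buf
  print *, buf                            ! values 21, 22, 23, 24

  close(22, status='delete')
end program stream_demo
```

In a real application the index would have to be stored somewhere (in
the file itself or in a side file) so that a later run can find the
records again.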
>
>> Another question - what determines the order in which data are read from
>> files?
>
> Records are read either sequentially in the order they were written
> (yes, you can backspace or rewind, but those are small modifications to
> the overall scheme), or directly in any order that you specify. Within a
> record, your I/O list is processed in order.
>
>> Say, a matrix is read from (binary) file using unformatted direct
>> access. My program seems to read records column-wise - but does that
>> depend on the way the binary file was generated or is it always the
>> case? I used Matlab and happen to know that it writes data column by
>> column.
>
> None of this part has anything to do with formatted vs unformatted. As I
> mentioned above, it is determined by the I/O list. If you have just an
> array name in the I/O list, the array is processed in "array element
> order", which is columnwise. However, you can use an implied DO to
> specify elements in other orders (including rowwise). See any Fortran
> text on implied DO lists. I'll direct you to the particular subject so
> you can easily find it, but I don't consider newsgroup postings to be a
> good substitute for a text.
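
A small sketch of the two orderings, not a substitute for the text
either. The array M and its contents are made up so that each value
shows its own indices: M(I,J) holds 10*I + J.

```fortran
program order_demo
  implicit none
  integer :: m(2,3), i, j

  ! Fill so that each element encodes its own row and column.
  do j = 1, 3
     do i = 1, 2
        m(i,j) = 10*i + j
     end do
  end do

  ! Bare array name: array element order, i.e. column by column.
  print '(6i4)', m
  ! order: 11 21 12 22 13 23

  ! Implied DO with the column index innermost: row by row.
  print '(6i4)', ((m(i,j), j = 1, 3), i = 1, 2)
  ! order: 11 12 13 21 22 23
end program order_demo
```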

Another issue which might be important is the form of the i/o list.
For "REAL :: MAT(1000,1000)" you might have either "READ ( 21 ) MAT"
which reads the whole array in its natural order or you might try
"READ ( 21 ) ( ( MAT(I,J), I = 1, 1000 ), J = 1, 1000 )" (subject to
my blunders) which lists the elements of the array for reading.
The implied DO loops might have overhead associated with each array
element which can add up quickly. Implementation quality can vary
widely so you would need to experiment.
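
One way to run that experiment is sketched below. The scratch file, the
CHK array and the use of SYSTEM_CLOCK are my choices for illustration.
Some compilers may turn the implied DO into a whole array transfer, in
which case the two times will be about the same, and a single run is
only a rough indication in any case.

```fortran
program iolist_demo
  implicit none
  integer, parameter :: n = 1000
  real, allocatable :: mat(:,:), chk(:,:)
  integer :: i, j, t0, t1, t2, rate

  allocate(mat(n,n), chk(n,n))
  call random_number(mat)

  ! One unformatted sequential record holding the whole matrix.
  open(unit=21, form='unformatted', status='scratch')
  write(21) mat

  rewind(21)
  call system_clock(t0, rate)
  read(21) chk                                ! whole array form
  call system_clock(t1)

  rewind(21)
  read(21) ((chk(i,j), i = 1, n), j = 1, n)   ! implied DO form, same order
  call system_clock(t2)

  print *, 'whole array (s):', real(t1 - t0) / real(rate)
  print *, 'implied DO  (s):', real(t2 - t1) / real(rate)
  print *, 'same data read :', all(chk == mat)

  close(21)
end program iolist_demo
```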

Fortran i/o is often said to be slow because it is often formatted.
I have seen Fortrans that could "spin tape" with the best of them
when doing whole array i/o. I have heard many sad tales of other
languages turning into turtles when the application needed lots
of formatted i/o. The details of formatted vs unformatted and whole
array vs element by element are often lost when either complaining or
bragging about i/o speed. Notice that such details were not listed
here.

Formatted i/o tends to be element by element and unformatted i/o is
often for whole arrays. Formatted i/o is certainly slower (my personal
discount factor is larger than Richard's but that is quibbling) and
often has the extra overhead. How much of the rap on formatted i/o is
really due to i/o list issues is an interesting arcane insider's question
that really does not much matter in the long run.

>> A text file - which is accessed using sequential, formatted - seems to be
>> read line by line - is that the same for formatted direct access?
>
> See above, using the translation that a "line" in a text file is a
> "record". The term record is more general, so that's what I've used
> above; in a text file, the records are lines.