David Koch googlemail.com> wrote:
> In terms of execution time with large files what's the fastest?
As with many things in life, it depends.
> Is the order something like:
>
> 1. Unformatted, Direct
> 2. Unformatted, Sequential
> 3. Formatted Direct
> 4. Formatted Sequential
Well, before you get into which way is the fastest, you have to address
the question of which way can read the file *AT ALL*. That is often a
critical determiner. If you can't read the file at all, then speed is
pretty irrelevant. One might say that it fails pretty quickly, I
suppose. Sometimes, though, you do have a choice in how the file is
written.
You can probably find isolated exceptions, but as a general rule:
All forms of formatted are slow. Very slow. My rule of thumb is roughly
a factor of 10 slower than unformatted, although that can vary.
Formatting, by definition, involves conversion between internal form and
text form. That conversion takes time. Formatted I/O is for human
consumption (printouts, terminals and the like). Also, it is a simple
way of doing highly portable files readable with almost any programming
language and easily transported across different machine architectures.
(There can be differences between architectures, but there will also be
standard utilities for dealing with those differences).
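To make the distinction concrete, here is a minimal sketch that writes the same array both ways. The file names, unit numbers, and array size are made up for illustration; the only point is that the formatted WRITE has to convert every value to text, while the unformatted one copies the internal representation.

```fortran
program formatted_vs_unformatted
  implicit none
  integer, parameter :: n = 1000      ! arbitrary size, for illustration
  real :: a(n)
  integer :: i

  a = (/ (real(i), i = 1, n) /)

  ! Formatted sequential: each value is converted to text on output.
  ! 'data.txt' is a hypothetical file name.
  open(10, file='data.txt', form='formatted', status='replace')
  write(10, '(5es15.7)') a
  close(10)

  ! Unformatted sequential: the internal form is written as is,
  ! with no conversion. 'data.bin' is a hypothetical file name.
  open(11, file='data.bin', form='unformatted', status='replace')
  write(11) a
  close(11)
end program formatted_vs_unformatted
```

The text file can be inspected with any editor; the unformatted file generally cannot, and its layout is specific to the compiler and machine that wrote it.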
Sequential vs direct access (vs stream) is more complicated. Sequential
requires you to read through a file sequentially (thus the name)
starting at the beginning. Often that's what you want to do anyway. But
if you have a 1GB file and want to read only a small part near the end,
it can be a large penalty to have to read all the way through the file
to find the part you want. Direct access allows jumping directly to any
record (thus its name) without reading others first. That can be a huge
advantage in some cases, and make no difference at all in others.
Sequential *might* help the system to be smart about read-ahead, helping
the speed. But that is highly system dependent. On some systems, it will
make no difference. For example, the sequential access pattern might be
noticed and taken advantage of even for direct access. This one is hard
to generalize on.
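As a rough sketch of what "jumping directly to any record" looks like in code, assuming fixed-length records of four reals each (the file name, record count, and record contents are invented for the example):

```fortran
program direct_access_demo
  implicit none
  integer, parameter :: n = 4, nrec = 100   ! illustrative sizes
  real :: row(n)
  integer :: i, rl

  ! The units of RECL (bytes or words) are processor dependent,
  ! so ask the compiler for the length of one record instead of guessing.
  inquire(iolength=rl) row

  ! 'records.bin' is a hypothetical file name.
  open(10, file='records.bin', form='unformatted', access='direct', &
       recl=rl, status='replace')

  do i = 1, nrec
     row = real(i)
     write(10, rec=i) row
  end do

  ! Jump straight to record 97 without reading records 1 through 96.
  read(10, rec=97) row
  print *, row(1)
  close(10)
end program direct_access_demo
```

With sequential access the same READ would need 96 preceding reads (or skipped records) to get there.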
> The thing is - using unformatted, sequential works nice for files where
> each "record" has the same length such as matrices. I guess it cannot be
> used for files, where for instance each line has a different number of
> entries which is what I have at hand now
Depending how hard you are willing to work, the fixed record length
limitation of direct access can be worked around. You can read largish
fixed-size blocks from the file and, in essence, do your own record
management. That can be a fair amount of work, particularly for
variable length records and if you require capability to quickly jump to
arbitrary points in the file. For example, you might need to build some
kind of internal index. I've done things like that for an application
that is highly I/O performance sensitive and needs the capability.
However, you aren't ready for that degree of sophistication (or perhaps
I should more tactfully say that I'm not up to explaining it at this
level).
> - so the question is whether to
> use sequential formatted or unformatted?
If the file exists, use whatever format it is in. If you get the choice,
and speed is the main criterion, then this one is easy - unformatted
will be far faster.
Also, stream was mentioned earlier in this thread. Stream unformatted is
another option. It has some big advantages in some cases, but it is
standardized only as of f2003. In today's f77/f90/f95 compilers it is
an extension, and the details of how it is done vary by compiler. I
think that getting into its pros and cons is a bit much for this level.
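For reference only, this is roughly what the f2003 form looks like, assuming a compiler that supports ACCESS='STREAM'. The file name is made up, and the sketch ignores the portability questions mentioned above.

```fortran
program stream_demo
  implicit none
  real :: x
  integer :: i, p7

  ! 'stream.bin' is a hypothetical file name.
  open(10, file='stream.bin', form='unformatted', access='stream', &
       status='replace')

  do i = 1, 10
     ! Remember the file position just before the 7th value is written.
     if (i == 7) inquire(10, pos=p7)
     write(10) real(i)      ! no record structure is imposed on the file
  end do

  ! Reposition to the saved point and read that one value back.
  read(10, pos=p7) x
  print *, x
  close(10)
end program stream_demo
```

Note that there is no RECL here and no fixed record length; positions are given in file storage units instead of record numbers.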
> Another question - what determines the order in which data are read from
> files?
Records are read either sequentially in the order they were written
(yes, you can backspace or rewind, but those are small modifications to
the overall scheme), or directly in any order that you specify. Within a
record, your I/O list is processed in order.
> Say, a matrix is read from (binary) file using unformatted direct
> access. My program seems to read records column-wise - but does that
> depend on the way the binary file was generated or is it always the
> case? I used Matlab and happen to know that it writes data column, by
> column.
None of this part has anything to do with formatted vs unformatted. As I
mentioned above, it is determined by the I/O list. If you have just an
array name in the I/O list, the array is processed in "array element
order", which is columnwise. However, you can use an implied DO to
specify elements in other orders (including rowwise). See any Fortran
text on implied DO lists. I'll direct you to the particular subject so
you can easily find it, but I don't consider newsgroup postings to be a
good substitute for a text.
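Just so you know what to look for, here is a small sketch of the two cases. The array contents are arbitrary, and PRINT is used only to make the order visible; the same I/O lists work in an unformatted READ or WRITE.

```fortran
program io_list_order
  implicit none
  integer :: a(2,3), i, j

  ! Fill a so that a(i,j) = 10*i + j:
  !   row 1:  11 12 13
  !   row 2:  21 22 23
  do j = 1, 3
     do i = 1, 2
        a(i,j) = 10*i + j
     end do
  end do

  ! Just the array name: array element order, i.e. column by column.
  print '(6i4)', a

  ! Implied DO with the column index j varying fastest: row by row.
  print '(6i4)', ((a(i,j), j = 1, 3), i = 1, 2)
end program io_list_order
```

The first PRINT lists the elements as 11 21 12 22 13 23, the second as 11 12 13 21 22 23. The data in the array is the same in both cases; only the I/O list differs.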
> A text file - which is accessed using sequential, formatted seems is
> read line by line - is that the same for formatted direct access?
See above, using the translation that a "line" in a text file is a
"record". The term record is more general, so that's what I've used
above; in a text file, the records are lines.
--
Richard Maine | Good judgement comes from experience;
email: last name at domain . net | experience comes from bad judgement.
domain: summertriangle | -- Mark Twain