>> I'm anxiously looking forward your reply and guidelines.
I too recommend OpenMP if you want really quick turnaround and very
modest speedups. Obviously, the actual speedup is very algorithm- and
hardware-dependent. On my applications, I can parallelize 95% of the
code (~10,000 lines) with OpenMP in a day or two using mostly one or
two commands, but I could never get beyond a speedup in the 2-4 range
on 6-8 single-core processors.
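
For a rough idea of what "one or two commands" can look like, here is a
minimal sketch in C. The function axpy_sum and its loop body are
hypothetical, made up purely for illustration; they are not taken from
my applications. The only OpenMP-specific line is the pragma, and a
compiler invoked without its OpenMP flag (e.g. -fopenmp for GCC) simply
ignores it and runs the loop serially.

```c
#include <stddef.h>

/* Hypothetical kernel (illustration only): y = a*x + y, returning sum(y). */
double axpy_sum(size_t n, double a, const double *x, double *y)
{
    double sum = 0.0;
    long i;

    /* One directive splits the iterations across threads; the reduction
       clause gives each thread a private partial sum and combines the
       partial sums when the loop finishes. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < (long)n; ++i) {
        y[i] = a * x[i] + y[i];
        sum += y[i];
    }
    return sum;
}
```

The iterations must be independent of one another for this to be safe,
which is the main thing to check before adding the directive to a loop.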
My experiences with automatic parallelization by compilers proved even
less fruitful, but it's so easy to try that it's worth an attempt.
If you want large speedup on large numbers of processors, it seems MPI
is the only way to go, but it will require significant time investment
and will touch every part of your code. As a compromise, you might
consider a toolkit such as Bundle-Exchange-Compute
(http://www.cs.sandia.gov/BEC/). It is compatible with MPI (not a
competitor) but much simpler to use and apparently gives speedup
comparable to hand-tuned MPI with far fewer commands.
A very important first step is to profile your code to determine which
parts are taking the most time and to concentrate your efforts there.
For serial code on Linux, you might use the 'gprof' system command.
For both serial and parallel code (of any flavor), you might use the
Tuning and Analysis Utilities (TAU):
http://www.cs.uoregon.edu/research/tau/cca/index.php.
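
As a small sketch of the gprof workflow for serial code, assuming GCC on
Linux: the two functions below are hypothetical stand-ins for a cheap
and an expensive part of a program, and the file name profile_demo.c is
made up. The build and run commands are given in the leading comment.

```c
/* Assumed workflow (GCC on Linux; file name is hypothetical):
 *   gcc -pg profile_demo.c -o profile_demo    compile with profiling hooks
 *   ./profile_demo                            run; writes gmon.out
 *   gprof ./profile_demo gmon.out             print the flat profile
 * A main() that calls both functions with a large n is needed to run it.
 */

/* Cheap O(n) part: sums the integers 0..n-1. */
double cheap_setup(int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; ++i)
        acc += (double)i;
    return acc;
}

/* Expensive O(n^2) part: this should dominate the profile for large n. */
double hot_loop(int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            acc += (double)((i % 7) * (j % 3));
    return acc;
}
```

In the flat profile, the function with the largest share of the run time
is the one to parallelize first.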
Damian