Re: defined operator & assignment speed & memory usage problem
  Home FAQ Contact Sign in
comp.lang.fortran only
 
Advanced search
POPULAR GROUPS

more...

 Up
Re: defined operator & assignment speed & memory usage problem         

Group: comp.lang.fortran · Group Profile
Author: fj
Date: Aug 13, 2008 09:54

On 12 août, 14:49, Arjen Markus wrote:
> On 17 jul, 19:43, Arjen Markus wrote:
>
>
>
>> On 17 jul, 18:30, Damian rouson.net> wrote:
>
>>> Aha! So we finally have some empirical data. I'm going to try this
>>> code out with the IBM XL Fortran compiler because I have been told it
>>> will do interprocedural optimization that might eliminate the
>>> differences between the two approaches.
>
>>> Damian
>
>> That would be interesting indeed.
>
>> By the way, I have thought out the memory pool I mentioned. I hope to
>> get a small program together before my holidays to illustrate it (and
>> measure the performance), but otherwise it will be in a few weeks
>> time. Just remind me of it (after 10 august)
>
>> Regards,
>
>> Arjen
>
> Well, I am back and I have completed the almost but not quite trivial
> program that uses a memory pool I promised. The code is below.
>
> Using g95 on a Windows PC via MinGW (but those are mere details, I
> suppose)
> I get something like this:
>
> 1. Without optimisation (default options only) the memory pool version
> is
> 5 to 10%% slower than the do-loop version
> 2. With optimisation (-O2) the two versions are almost as fast -
> marginal
> differences only
> 3. If I replace the assignment in the function multiply_fields by a
> subroutine
> that does the same thing, but does not retain the pointer
> attributes,
> the memory pool version is actually faster by 30-50%% in the
> unoptimised case. With optimisation there is no significant
> difference.
>
> The drawbacks of the memory pool version in an actual program (that
> would
> require substantive additional work though), are that it would use
> more memory (there is a constant or almost constant overhead) and
> that
> the user would be confronted with the need to manage the memory in
> some
> way - to get rid of local variables and left-over intermediate
> results.
> I think this can be fairly easily automated (up to a point), so the
> disadvantage is not that big.
>
> The good thing though is that you can simply specify array operations
> like:
>
> a = a + dt * diffusion * laplace(a) / dx**2
>
> (dt, dx being scalars) without performance overhead or memory leaks.
>
> Regards,
>
> Arjen
>
> ------ memory pool version ------
>
> ! mempool.f90 --
> ! Implement a memory pool that is then used for
> ! measuring the performance of a derived type with overloaded
> ! operations.
> ! (see: assign_comp.f90)
> ! Various ways to implement "A = B * C" where A, B and C
> ! are derived types
> !
> ! Managing the stack level:
> ! - A higher stack level means the results are more "temporary"
> ! - The stack level is increased by 2:
> ! call accum( sum , 2*x )
> ! 2*x is a temporary variable but the associated memory should
> ! not be reused within the subroutine accum
> !
> module assignment
> implicit none
>
> integer, parameter :: fieldsize = 10000000
>
> private :: stack_level
> private :: block
> private :: block_pool
>
> type block
> integer :: level
> real, dimension(:), pointer :: data
> end type
>
> type(block), dimension(10) :: block_pool
>
> integer, parameter :: UNASSIGNED = -1
> integer, parameter :: PERSISTENT = 0
> integer, parameter :: TEMPORARY = huge(0)
> integer, save :: stack_level = 1
>
> type field
> integer :: pool_id = UNASSIGNED
> end type field
>
> interface assignment(=)
> module procedure assign_field
> module procedure assign_field_scalar
> end interface
>
> interface operator(*)
> module procedure multiply_fields
> end interface
> contains
>
> real function value( obj, idx )
> type(field) :: obj
> integer :: idx
>
> value = block_pool(obj%%pool_id)%%data(idx)
> end function value
>
> subroutine enter_routine
> stack_level = stack_level + 1
> end subroutine enter_routine
>
> subroutine leave_routine
> stack_level = stack_level - 1
> end subroutine leave_routine
>
> subroutine init_pool
> integer :: i
>
> do i = 1,size(block_pool)
> block_pool(i)%%level = TEMPORARY
> allocate( block_pool(i)%%data(1:fieldsize) )
> enddo
>
> end subroutine init_pool
>
> subroutine find_block( obj, pid )
> type(field), intent(in) :: obj
> integer :: pid
>
> integer :: i
>
> if ( obj%%pool_id /= UNASSIGNED ) then
> pid = obj%%pool_id
> ! write(*,*) 'Reusing: ', pid
> else
> pid = -1
> do i = 1,size(block_pool)
> if ( block_pool(i)%%level > stack_level ) then
> pid = i
> ! write(*,*) 'Assigned: ', pid, block_pool(i)%%level
> exit
> endif
> enddo
> endif
>
> if ( pid == -1 ) then
> stop 'Not enough memory blocks!'
> endif
>
> end subroutine find_block
>
> subroutine assign_field( left, right )
> type(field), intent(inout) :: left
> type(field), intent(in) :: right
>
> integer :: pid
>
> if ( block_pool(right%%pool_id)%%level > stack_level ) then
> if ( left%%pool_id /= UNASSIGNED ) then
> block_pool(left%%pool_id)%%level = TEMPORARY
> endif
> pid = right%%pool_id
> left%%pool_id = pid
> block_pool(pid)%%level = stack_level
> else
> call find_block( left, pid )
>
> block_pool(pid)%%level = stack_level
> block_pool(pid)%%data = block_pool(right%%pool_id)%%data
> endif
>
> end subroutine assign_field
>
> subroutine assign_field_scalar( left, right )
> type(field), intent(inout) :: left
> real, intent(in) :: right
>
> integer :: pid
>
> call find_block( left, pid )
>
> left%%pool_id = pid
> block_pool(pid)%%level = stack_level ! TODO: figure out if this
> should be the minimum?
> block_pool(pid)%%data = right
>
> end subroutine assign_field_scalar
>
> type(field) function multiply_fields( op1, op2 )
> type(field), intent(in) :: op1, op2
>
> integer :: pid
>
> call find_block( multiply_fields, pid )
>
> multiply_fields%%pool_id = pid
>
> block_pool(pid)%%level = stack_level + 1
> block_pool(pid)%%data = block_pool(op1%%pool_id)%%data *
> block_pool(op2%%pool_id)%%data
> ! write(*,*) ' Multiply - using ', pid
>
> !! Alternatively:
> !! call multiply_array( block_pool(pid)%%data,
> block_pool(op1%%pool_id)%%data, &
> !! block_pool(op2%%pool_id)%%data )
>
> end function multiply_fields
>
> !
> ! Possible optimisation: the arrays have no pointer attributes anymoer
> !
> subroutine multiply_array( a, b, c )
> real, dimension(:), intent(out) :: a
> real, dimension(:), intent(in) :: b, c
>
> a = b * c
>
> end subroutine multiply_array
> ! Do it the classic way
> !
> subroutine multiply_assign( left, op1, op2 )
> type(field) :: left, op1, op2
> integer :: i
>
> integer :: pid
>
> call find_block( left, pid )
>
> left%%pool_id = pid
>
> block_pool(pid)%%level = stack_level
>
> do i = 1,fieldsize
> block_pool(pid)%%data(i) = block_pool(op1%%pool_id)%%data(i) &
> * block_pool(op2%%pool_id)%%data(i)
> enddo
>
> end subroutine multiply_assign
>
> end module assignment
>
> program test_assign
> use assignment
>
> implicit none
>
> type(field) :: field_a
> type(field) :: field_b
> type(field) :: field_c
>
> real :: time1, time2, cumulative1, cumulative2
> real :: walltime1, walltime2
> integer :: count, count1, count2, countrate
>
> character(len=10) :: dummy
>
> call init_pool
>
> field_b = 1.1
> field_c = 21.4
>
> field_a = field_b * field_c
>
> cumulative1 = 0.0
> cumulative2 = 0.0
> walltime1 = 0.0
> walltime2 = 0.0
>
> call system_clock( count_rate = countrate )
> write(*,*) 'Counts per second: ', countrate
>
> do count = 1,100
> call system_clock( count1 )
> call cpu_time( time1 )
> field_a = field_b * field_c
> ! write(*,*) 'fields: ', field_a%%pool_id, field_b%%pool_id,
> field_c%%pool_id
> call cpu_time( time2 )
> call system_clock( count2 )
>
> cumulative1 = cumulative1 + (time2-time1)
> walltime1 = walltime1 + (count2-count1)
>
> call system_clock( count1 )
> call cpu_time( time1 )
> call multiply_assign( field_a, field_b, field_c )
> call cpu_time( time2 )
> call system_clock( count2 )
>
> cumulative2 = cumulative2 + (time2-time1)
> walltime2 = walltime2 + (count2-count1)
> enddo
>
> write(*,*) value(field_a,1), value(field_b,1), value(field_c,1)
>
> write(*,*) 'Method 1 (operator): ', cumulative1, walltime1/
> countrate
> write(*,*) 'Method 2 (do-loop): ', cumulative2, walltime2/
> countrate
>
> write(*,'(a)') 'Waiting ...'
> read(*,'(a)') dummy
>
> end program

This thread is interesting : I was faced with a similar problem in
trying to overload classical operators. I also observed efficiency
variations, depending on the compiler, when I compared operators with
routines like add_multiply_assigned.

I don't use FFT but simply variables associated to partial
derivatives :

TYPE var
REAL(r8) :: value ! the value of the variable
REAL(r8) :: deriv(ndmax) ! deriv(i)=d(value)/dx(list(i))
INTEGER :: index(ndmax) ! the list of associated state variables
END TYPE

Of course, such simple derided type does not correspond to gigabytes
of memory. But the associated field may become much larger when the
numbers of nodes (or meshes) is high :

TYPE(var),ALLOCATABLE :: T(:) ! temperature field with SIZE(T) >
100000

But as you see, the field is described as an array. Therefore, I don't
have all the troubles met by Damian because I can take advantage of
elemental functions for overloading operators. However a problem
occurs when computing simple expressions. I join two identical test
cases, the first one written in calling subroutines, the second one in
using operators instead. Here are the results with 3 compilers :

g95 -O3 : G95 (GCC 4.0.3 (g95 0.92!) Jul 30 2008)

3.3424919999999996 4.230357

gfortran -O3 : GNU Fortran (GCC) 4.3.0 20080102 (experimental) [trunk
revision 131253]

1.0338430000000001 1.4547790000000000

ifort -O3 : ifort (IFORT) 10.0 20070426

0.972852000000000 0.759885000000000

As you see, the compilers g95 and gfortran show a decrease of
performance when using operators.
On the contrary, operators are more efficient than subroutines with
ifort.

MODULE var ! a little bit too large to be published but it contains
only pure subroutines and elemental functions
...
END MODULE

SUBROUTINE test1(nx)

USE var
IMPLICIT NONE

INTEGER,INTENT(in) :: nx

TYPE(var_type) :: P(nx),T(nx),n(nx),rho(nx),x(nx)
REAL(r8) :: M=18.e-3_r8,V=50._r8,R=8.314_r8
INTEGER :: i

DO i=1,nx
CALL newMainVar(P(i),1.e5_r8,2*i)
CALL newMainVar(T(i),1000._r8,2*i-1)
ENDDO

DO i=1,nx

CALL newVar(x(i),T(i))
CALL inverseVar(x(i))
CALL addVar(x(i),T(i))
CALL multVar(x(i),2._r8)
CALL minusVar(x(i),T(i))
CALL divVar(x(i),2._r8)

CALL newVar(n(i),P(i))
CALL divVar(n(i),T(i))
CALL multVar(n(i),V/R)

CALL newVar(rho(i),n(i))
CALL multVar(rho(i),M/V)

ENDDO

!write(*,*) 'PV/RT',n(1)%%value,n(1)%%deriv(1:n(1)%%nderiv)
!write(*,*) 'MP/RT',rho(1)%%value,rho(1)%%deriv(1:rho(1)%%nderiv)

END SUBROUTINE

SUBROUTINE test2(nx)

USE var
IMPLICIT NONE

INTEGER,INTENT(in) :: nx

TYPE(var_type) :: P(nx),T(nx),n(nx),rho(nx),x(nx)
REAL(r8) :: M=18.e-3_r8,V=50._r8,R=8.314_r8
INTEGER :: i

DO i=1,nx
CALL newMainVar(P(i),1.e5_r8,2*i)
CALL newMainVar(T(i),1000._r8,2*i-1)
ENDDO

x=((1._r8/T+T)*2._r8-T)/2._r8 ! an arbitrary expression
n=(P/T)*(V/R)
rho=n*(M/V)

!write(*,*) 'PV/RT',n(nx)%%value,n(nx)%%deriv(1:n(nx)%%nderiv)
!write(*,*) 'MP/RT',rho(nx)%%value,rho(nx)%%deriv(1:rho(nx)%%nderiv)

END SUBROUTINE

PROGRAM main
IMPLICIT NONE
DOUBLE PRECISION :: t0,t1,t2
CALL cpu_time(t0)
CALL test1(500000)
CALL cpu_time(t1)
CALL test2(500000)
CALL cpu_time(t2)
WRITE(*,*) t1-t0,t2-t1
END PROGRAM
no comments
diggit! del.icio.us! reddit!