At the bottom of this message is an example of what I believe is a bug
in the behavior of __forceinline when used with SSE intrinsics in
Visual Studio 2005. In summary, when __forceinline is used on functions
that contain SSE intrinsics, the functions stop being inlined beyond a
very small call depth (something under 10 levels of functions). If
__forceinline is replaced with __inline, the functions are all inlined.
From what I've read, __forceinline should never cause less inlining
than __inline, but that's the behavior I'm seeing. The problem also
goes away if the SSE intrinsics are replaced with regular C.

I know the functions aren't being inlined for two reasons: the compiler
raises warning C4714, which shows up at warning level 4 (not level 1 as
MSDN seems to suggest), and the generated assembly clearly shows
function calls.
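
Since C4714 only appears at level 4 here, one way to surface it in an ordinary build is to promote it with #pragma warning. This is a minimal sketch, assuming the Microsoft compiler. The add_one function is a hypothetical stand-in, and the non-Microsoft branch exists only so the sketch builds elsewhere.

```cpp
#include <cassert>

#if defined( _MSC_VER )
// Report C4714 ("function marked as __forceinline not inlined") as a
// level 1 warning, so it is reported at /W1 and above.
#pragma warning( 1 : 4714 )
#define FORCEINLINE __forceinline
#else
// Assumption: plain inline stands in on other compilers.
#define FORCEINLINE inline
#endif

// Hypothetical stand-in for any function the build is expected to inline.
FORCEINLINE int add_one ( int x )
{
    return x + 1;
}

// Small self-check so the sketch can be exercised from main().
inline void check_add_one ()
{
    assert( add_one( 41 ) == 42 );
}
```

Using error instead of 1 in the pragma would turn a failed __forceinline into a build error, which may be preferable when the inlining is required.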

In the code below, the template parameter to the inlinetest function
determines how many levels of function calls there are before the call
to vset, which contains the SSE intrinsics. On my system, only very
small values (around 5) are inlined successfully. I can't be more
precise because the exact threshold has changed a few times and I'm not
sure why. Replacing __forceinline with __inline, or changing vset to
not use intrinsics, inlines successfully for much larger values (I've
tested with 100).
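
To make the depth bookkeeping concrete, here is a scalar sketch of the same recursion pattern. It returns the value that reaches the bottom of the chain and adds a switch for comparing the two keywords. The names chain, TESTINLINE and USE_FORCEINLINE are hypothetical. Because it contains no intrinsics, this version is not expected to reproduce the problem; it only shows that a depth of N starting at i delivers i + N - 1 to the last level (99 for a depth of 100 starting at 0).

```cpp
#include <cassert>

// Hypothetical switch: define USE_FORCEINLINE as 0 on the command line
// to compare against __inline.
#ifndef USE_FORCEINLINE
#define USE_FORCEINLINE 1
#endif

#if defined( _MSC_VER ) && USE_FORCEINLINE
#define TESTINLINE __forceinline
#elif defined( _MSC_VER )
#define TESTINLINE __inline
#else
#define TESTINLINE inline   // assumption: fallback for other compilers
#endif

// Recursive case: pass i + 1 down to the next level.
template < int N >
TESTINLINE int chain ( int i )
{
    return chain< N - 1 >( i + 1 );
}

// Base case: return the value that reached the bottom of the chain.
template < >
TESTINLINE int chain< 1 > ( int i )
{
    return i;
}

// chain< N >( i ) returns i + N - 1.
inline void check_chain ()
{
    assert( chain< 100 >( 0 ) == 99 );
    assert( chain< 1 >( 5 ) == 5 );
}
```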

If anyone has any input on this problem, whether it's something I'm
doing wrong or suggestions for how to proceed, I'd appreciate it very
much. Thanks,
-stephen diverdi
-stephen.diverdi@gmail.com

#include <stdio.h>
#include <intrin.h>

#define FORCEINLINE __forceinline

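FORCEINLINE void vset ( float * const _a, float x )
{
#if 1
    // SSE path. The cast assumes _a is 16-byte aligned.
    __m128 &a = *( __m128 * )_a;
    a = _mm_set1_ps( x );
#else
    // Regular C path. With this version the inlining problem goes away.
    for( int i = 0 ; i < 4 ; ++i )
        _a[ i ] = x;
#endif
}
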
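// Recursive case: forward to the next level down, incrementing i.
template < int N >
FORCEINLINE void inlinetest ( float * const a, int i )
{
    return inlinetest< N - 1 >( a, i + 1 );
}

// Base case: end the chain by calling vset.
template < >
FORCEINLINE void inlinetest< 1 > ( float * const a, int i )
{
    vset( a, ( float )i );
}
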
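int main ()
{
    // vset writes through a __m128 pointer, so the buffer must be
    // 16-byte aligned.
    __declspec( align( 16 ) ) float a[ 4 ];
    inlinetest< 100 >( a, 0 );
    fprintf( stderr, "%f\n", a[ 0 ] );
    getc( stdin );
    return 0;
}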
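
Separately from the inlining question, vset in the listing above writes through a __m128 pointer, which assumes its argument is 16-byte aligned. Below is a minimal sketch of a variant that uses an unaligned store and so accepts any float pointer. The names vset_unaligned and HAVE_SSE are hypothetical, and the sketch falls back to a scalar loop where SSE is not detected. Whether this variant changes the inlining behavior in Visual Studio 2005 is untested.

```cpp
#include <cassert>

#if defined( __SSE__ ) || defined( _M_IX86 ) || defined( _M_X64 )
#include <xmmintrin.h>
#define HAVE_SSE 1   // hypothetical macro, local to this sketch
#else
#define HAVE_SSE 0
#endif

// Fill four consecutive floats starting at dst with x.
// dst does not need to be 16-byte aligned.
inline void vset_unaligned ( float * const dst, float x )
{
#if HAVE_SSE
    _mm_storeu_ps( dst, _mm_set1_ps( x ) );
#else
    for( int i = 0 ; i < 4 ; ++i )
        dst[ i ] = x;
#endif
}

inline void check_vset_unaligned ()
{
    float buf[ 5 ] = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
    // buf + 1 is not guaranteed to be 16-byte aligned.
    vset_unaligned( buf + 1, 2.5f );
    assert( buf[ 0 ] == 0.0f );
    for( int i = 1 ; i < 5 ; ++i )
        assert( buf[ i ] == 2.5f );
}
```

The unaligned store can be slower than the aligned one, so aligning the buffer at its declaration is the other option when the caller controls it.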