AppletTalk.com
Java discussions newsgroups
 

Optimal x86-32 Sun Hotspot code generation?

 
AppletTalk.com Forum Index -> JVM, native methods and hardware
Adam Warner
Guest
Posted: Fri Mar 24, 2006 2:12 am    Post subject: Optimal x86-32 Sun Hotspot code generation?



Hi all,

I'm trying to create the fastest way to cast a Java long to an int while
preserving array bounds checking. This is the approach I suspect should be
optimal:

public final static int toIntIndex(long index) {
    int high = (int) (index >>> 32);
    if (high != 0) throw new ArrayIndexOutOfBoundsException();
    return (int) index;
}

If the long index is non-negative and in int range, high will be 0. If the
long index is negative then high will be non-zero. If the long index is
at least 2^31 and below 2^32 then it will pass this test, but the cast
yields a negative int, so it is still caught by Java's int bounds checking.

I believe the unsigned right shift by 32 should permit the test to be
conducted upon the 32 most significant bits of the 64-bit value, that is
no shift should actually be performed on 32-bit platforms.
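
The cases described above can be checked directly. This is a minimal sketch: only `toIntIndex` comes from the post, while the class name `ToIntIndexDemo` and the helper `rejected` are hypothetical names introduced for illustration.

```java
// Hypothetical demo class; only toIntIndex itself comes from the post above.
final class ToIntIndexDemo {

    static int toIntIndex(long index) {
        int high = (int) (index >>> 32);   // upper 32 bits of the long
        if (high != 0) throw new ArrayIndexOutOfBoundsException();
        return (int) index;                // lower 32 bits, reinterpreted as a signed int
    }

    /** Returns true if toIntIndex itself rejects the value. */
    static boolean rejected(long index) {
        try {
            toIntIndex(index);
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // In int range, the 2^31..2^32-1 band, just past it, and negative.
        long[] samples = { 0L, Integer.MAX_VALUE, 1L << 31, (1L << 32) - 1, 1L << 32, -1L };
        for (long s : samples) {
            String outcome = rejected(s) ? "rejected" : "returns " + toIntIndex(s);
            System.out.println(s + " -> " + outcome);
        }
    }
}
```

Values from 2^31 up to 2^32 - 1 are not rejected here; they come back as negative ints, so the rejection is left to the array access that follows.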

I've induced the Sun Mustang b75 HotSpot debug server JIT to compile
toIntIndex and this is the generated assembly:

{method}
- klass: {other class}
- method holder: 'LongIndex'
- constants: 0x0693c688{constant pool}
- access: 0x81000019 public static final
- name: 'toIntIndex'
- signature: '(J)I'
- max stack: 3
- max locals: 3
- size of params: 2
- method size: 22
- vtable index: -2
- code size: 21
- code start: 0xb0e45210
- code end (excl): 0xb0e45225
- method data: 0xb0e47a58
- checked ex length: 0
- linenumber start: 0xb0e45225
- localvar length: 0
#
# int ( long, half )
#
#r063 ESP+20: parm 0: long
#r062 ESP+16: parm 0: long
# -- Old ESP -- Framesize: 16 --
#r061 ESP+12: return address
#r060 ESP+ 8: pad2, in_preserve
#r059 ESP+ 4: pad2, in_preserve
#r058 ESP+ 0: pad2, in_preserve
#
abababab N1: # B1 <- B3 B2 Freq: 6.66667
abababab
000 B1: # B3 B2 <- BLOCK HEAD IS JUNK Freq: 6.66667
000 # stack bang
PUSHL EBP
SUB ESP,8 # Create frame
00e MOV ECX,[ESP + #16]
MOV EBX,[ESP + #20]
016 MOV ECX.lo,ECX.hi
SHR ECX.lo,#32-32
XOR ECX.hi,ECX.hi
01a MOV ECX,ECX.lo
01a TEST ECX,ECX
01c Jne,s B3 P=0.000000 C=4.466667
01c
01e B2: # N1 <- B1 Freq: 4.46666
01e MOV ECX,[ESP + #16]
MOV EBX,[ESP + #20]
026 MOV EAX,ECX.lo
028 ADD ESP,8 # Destroy frame
POPL EBP
TEST PollPage,EAX ! Poll Safepoint

032 RET
032
033 B3: # N1 <- B1 Freq: 1e-06
033 MOV ECX,#-67
038 NOP # Pad for loops and calls
039 NOP # Pad for loops and calls
03a NOP # Pad for loops and calls
03b CALL,static wrapper for: uncommon_trap
# LongIndex::toIntIndex @ bci:10 L0=_ L1=_ L2=_
#
040 INT3 ; ShouldNotReachHere
040

This of course is the non-inlined version of toIntIndex. I don't
understand some of the disassembly syntax (.hi, .lo?) but it at least
appears clear that a redundant "SHR ECX.lo,#32-32" instruction is being
generated. I'd appreciate confirmation my reasoning is correct/this is an
actual inefficiency before filing any report with Sun.

Regards,
Adam

Brendan
Guest
Posted: Fri Mar 24, 2006 3:22 pm    Post subject: Re: Optimal x86-32 Sun Hotspot code generation?



Hi,

Does this thing have an optimizer that you forgot to turn on?

The stack frame is a waste of time, they've inserted padding in code
that should never run, the branch prediction is wrong (forward branches
are assumed to be taken), the register usage and chosen instructions
are a joke, etc.

<some alignment here if you like>
convertSignedLongToUnsignedInt:
cmp dword [esp+8],0
je .withinBounds
MOV ECX,#-67
CALL,static wrapper for: uncommon_trap
INT3 ; ShouldNotReachHere

<some alignment here if you like>
.withinBounds:
mov eax,[esp+4]
TEST PollPage,EAX ! Poll Safepoint ;Don't know what this is meant to do! :-)
ret


Cheers,

Brendan

Chris Uppal
Guest
Posted: Fri Mar 24, 2006 3:58 pm    Post subject: Re: Optimal x86-32 Sun Hotspot code generation?



Adam Warner wrote:

Quote:
I believe the unsigned right shift by 32 should permit the test to be
conducted upon the 32 most significant bits of the 64-bit value, that is
no shift should actually be performed on 32-bit platforms.

I'm somewhat puzzled by this sentence. I may well be misunderstanding you but
it sounds as if you assume that an int is 64-bit on a 64-bit platform or
possibly that a long is 32-bit on a 32-bit platform. That's not the case: ints
are 32-bit, and longs 64-bit, on every platform.
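
The fixed widths can be read back from the standard library constants `Integer.SIZE` and `Long.SIZE`. A small sketch, with the class name `PrimitiveWidths` and method `describe` being hypothetical:

```java
// Hypothetical demo class: reports the bit widths of int and long.
final class PrimitiveWidths {

    /** Builds a short description from the platform-independent SIZE constants. */
    static String describe() {
        return "int=" + Integer.SIZE + " bits, long=" + Long.SIZE + " bits";
    }

    public static void main(String[] args) {
        // The result is the same on 32-bit and 64-bit JVMs.
        System.out.println(describe());
    }
}
```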

-- chris

Grumble
Guest
Posted: Fri Mar 24, 2006 5:12 pm    Post subject: Re: Optimal x86-32 Sun Hotspot code generation?

Adam Warner wrote:
Quote:
I'm trying to create the fastest way to cast a Java long to an int while
preserving array bounds checking. This is the approach I suspect should be
optimal:

public final static int toIntIndex(long index) {
int high=(int) (index>>>32);
if (high!=0) throw new ArrayIndexOutOfBoundsException();
return (int) index;
}

If the long index is non-negative and in int range, high will be 0. If the
long index is negative then high will be non-zero. If the long index is
at least 2^31 and below 2^32 then it will pass this test, but the cast
yields a negative int, so it is still caught by Java's int bounds checking.

I believe the unsigned right shift by 32 should permit the test to be
conducted upon the 32 most significant bits of the 64-bit value, that is
no shift should actually be performed on 32-bit platforms.

For what it's worth, out of curiosity, I wrote a similar function in C.

#include <stdint.h>
void abort(void);
int32_t foo(int64_t index)
{
int32_t high = (uint64_t)index >> 32;
if (high != 0) abort();
return index;
}

for which gcc-3.4.4 -O2 generates the following code.

_foo:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
/*
What for? Stack alignment?
Why won't it go away with -mpreferred-stack-boundary=4 ??
*/
movl 12(%ebp), %edx
movl 8(%ebp), %eax
testl %edx, %edx
jne L4
leave
ret
L4:
call _abort

and gcc-3.4.4 -Os -fomit-frame-pointer generates the following code.

_foo:
cmpl $0, 8(%esp)
movl 4(%esp), %eax
je L2
call _abort
L2:
ret

(I'd switch je to jne and exchange call _abort and ret.)

Skarmander
Guest
Posted: Fri Mar 24, 2006 7:12 pm    Post subject: Re: Optimal x86-32 Sun Hotspot code generation?

Grumble wrote:
Quote:
Adam Warner wrote:
I'm trying to create the fastest way to cast a Java long to an int while
preserving array bounds checking. This is the approach I suspect should be
optimal:

public final static int toIntIndex(long index) {
int high=(int) (index>>>32);
if (high!=0) throw new ArrayIndexOutOfBoundsException();
return (int) index;
}

If the long index is non-negative and in int range, high will be 0. If the
long index is negative then high will be non-zero. If the long index is
at least 2^31 and below 2^32 then it will pass this test, but the cast
yields a negative int, so it is still caught by Java's int bounds checking.

I believe the unsigned right shift by 32 should permit the test to be
conducted upon the 32 most significant bits of the 64-bit value, that is
no shift should actually be performed on 32-bit platforms.

For what it's worth, out of curiosity, I wrote a similar function in C.

#include <stdint.h>
void abort(void);
int32_t foo(int64_t index)
{
int32_t high = (uint64_t)index >> 32;
if (high != 0) abort();
return index;
}

for which gcc-3.4.4 -O2 generates the following code.

_foo:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
/*
What for? Stack alignment?

Yes. In particular, the Pentiums and in particular SSE do not like data
that's not royally aligned.

Quote:
Why won't it go away with -mpreferred-stack-boundary=4 ??

Because -mpreferred-stack-boundary is the base 2 logarithm of the number of
bytes to align to, not the actual number of bytes. In this case, you've
asked for a stack alignment of 16 bytes, which is the default. Try
-mpreferred-stack-boundary=2.
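
The arithmetic is just a power of two. A sketch in Java (the thread's main language), with the class name `StackBoundary` and method `alignmentBytes` being hypothetical:

```java
// Hypothetical helper: converts a -mpreferred-stack-boundary value to bytes.
final class StackBoundary {

    /** The option value n is a base-2 logarithm, so the alignment is 2^n bytes. */
    static int alignmentBytes(int n) {
        return 1 << n;
    }

    public static void main(String[] args) {
        System.out.println("n=4 -> " + alignmentBytes(4) + " bytes");
        System.out.println("n=2 -> " + alignmentBytes(2) + " bytes");
    }
}
```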

S.

Adam Warner
Guest
Posted: Fri Mar 24, 2006 11:12 pm    Post subject: Re: Optimal x86-32 Sun Hotspot code generation?

On Fri, 24 Mar 2006 10:58:48 +0000, Chris Uppal wrote:
Quote:
Adam Warner wrote:

I believe the unsigned right shift by 32 should permit the test to be
conducted upon the 32 most significant bits of the 64-bit value, that
is no shift should actually be performed on 32-bit platforms.

I'm somewhat puzzled by this sentence. I may well be misunderstanding
you but it sounds as if you assume that an int is 64-bit on a 64-bit
platform or possibly that a long is 32-bit on a 32-bit platform. That's
not the case: ints are 32-bit, and longs 64-bit, on every platform.

I had a mental model of the long being transferred in two 32-bit registers
on a 32-bit platform. Let's call the registers H and L and write the long
as HL. In a higher level language to obtain the 32 most significant bits
of the long HL one could unsigned shift the long right by 32 and perhaps
cast the result to 32 bits. But at the lower level I was hoping the
compiler would say "let's just return the value of H".

Whether the value of H could be returned without shifting on a 64-bit
platform could depend upon whether the architecture permits 64-bit
registers to be accessed as independent 32-bit registers (which is why I
made that qualification).
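
The H/L model can be written out at the Java level. A minimal sketch, assuming the hypothetical helper names `high`, `low` and `join` (none of them appear in the thread); it says nothing about which registers a given JIT actually uses.

```java
// Hypothetical demo class: splits a long into its H and L 32-bit words.
final class LongHalves {

    /** H: the 32 most significant bits. */
    static int high(long v) {
        return (int) (v >>> 32);
    }

    /** L: the 32 least significant bits. */
    static int low(long v) {
        return (int) v;
    }

    /** Rebuilds HL; the mask stops a negative L from sign-extending into H. */
    static long join(int h, int l) {
        return ((long) h << 32) | (l & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        long v = 0x1122334455667788L;
        System.out.println("H = " + Integer.toHexString(high(v)));
        System.out.println("L = " + Integer.toHexString(low(v)));
        System.out.println("round trip ok: " + (join(high(v), low(v)) == v));
    }
}
```

After the cast to int, `v >>> 32` and `v >> 32` give the same H, since both leave the original upper word in the low 32 bits of the result.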

Regards,
Adam

Adam Warner
Guest
Posted: Sat Mar 25, 2006 1:12 am    Post subject: Re: Optimal x86-32 Sun Hotspot code generation?

On Fri, 24 Mar 2006 02:22:45 -0800, Brendan wrote:
Quote:
Hi,

Does this thing have an optimizer that you forgot to turn on?

The stack frame is a waste of time, they've inserted padding in code
that should never run, the branch prediction is wrong (forward branches
are assumed to be taken), the register usage and chosen instructions are
a joke, etc.

I now realise it's a Catch 22. The output of the undocumented option
-XX:+PrintOptoAssembly "is not final ASM code but it's very close":
<http://www.javalobby.org/java/forums/m91938827.html>

But this undocumented option is only available in the fastdebug builds. I
remember reading somewhere that Sun does not have legal permission to
distribute the disassembler with their release products. Thus one can only
disassemble code generated by these builds:
<http://blogs.sun.com/roller/page/kto?entry=mustang_jdk_6_0_fastdebug>

"So using a fastdebug build might provide some information you wouldn't
get from running a product build. It is slower, but no where near as slow
as a debug build. The optimization isn't as high as with the product
build, but since the assert checking and debug code exists in these
builds, the code isn't the same anyway."

This explains the redundant stack frame and likely invalidates any
inference one can make about the quality of release build assembly code.
I apologise for not appreciating this earlier.
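
For anyone repeating the experiment, a harness along these lines should make the method hot enough for the JIT to compile it. This is a sketch only: the class name `ToIntIndexWarmup`, the method `run` and the iteration count are assumptions, and the JIT may inline `toIntIndex` into the loop rather than compile it on its own.

```java
// Hypothetical warm-up harness. As described above, -XX:+PrintOptoAssembly
// needs a fastdebug build; on a product build, -XX:+PrintCompilation at least
// reports which methods were compiled.
final class ToIntIndexWarmup {

    static int toIntIndex(long index) {
        int high = (int) (index >>> 32);
        if (high != 0) throw new ArrayIndexOutOfBoundsException();
        return (int) index;
    }

    /** Calls toIntIndex repeatedly and returns a sum so the calls are not dead code. */
    static long run(int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += toIntIndex(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        // One million iterations is an arbitrary choice, not a documented threshold.
        System.out.println(run(1000000));
    }
}
```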

Regards,
Adam

Roedy Green
Guest
Posted: Sat Mar 25, 2006 1:12 am    Post subject: Re: Optimal x86-32 Sun Hotspot code generation?

On Sat, 25 Mar 2006 10:56:40 +1200, Adam Warner <spamtrap (AT) crayne (DOT) org>
wrote, quoted or indirectly quoted someone who said :

Quote:
I had a mental model of the long being transferred in two 32-bit registers
on a 32-bit platform. Let's call the registers H and L and write the long
as HL. In a higher level language to obtain the 32 most significant bits
of the long HL one could unsigned shift the long right by 32 and perhaps
cast the result to 32 bits. But at the lower level I was hoping the
compiler would say "let's just return the value of H".

Yes, at least Jet does just that.
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Adam Warner
Guest
Posted: Sat Mar 25, 2006 4:12 am    Post subject: Re: Optimal x86-32 Sun Hotspot code generation?

On Sat, 25 Mar 2006 00:17:39 +0000, Roedy Green wrote:
Quote:
On Sat, 25 Mar 2006 10:56:40 +1200, Adam Warner <spamtrap (AT) crayne (DOT) org>
wrote, quoted or indirectly quoted someone who said :

I had a mental model of the long being transferred in two 32-bit registers
on a 32-bit platform. Let's call the registers H and L and write the long
as HL. In a higher level language to obtain the 32 most significant bits
of the long HL one could unsigned shift the long right by 32 and perhaps
cast the result to 32 bits. But at the lower level I was hoping the
compiler would say "let's just return the value of H".

Yes, at least Jet does just that.

Thanks Roedy, that's great to know! It looks like I will be able to build
relatively efficient long index bounds checking upon the JVM. By only
checking the H bits are zero the L check remains with the JVM (it's not
duplicated).
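
As a sketch of how this division of labour might look, here is a byte array addressed by long indices. The class name `LongIndexedBytes` and its methods are hypothetical; only `toIntIndex` is taken from earlier in the thread.

```java
// Hypothetical wrapper: the H check is done in toIntIndex, the L check by the JVM.
final class LongIndexedBytes {
    private final byte[] data;

    LongIndexedBytes(int length) {
        data = new byte[length];
    }

    static int toIntIndex(long index) {
        int high = (int) (index >>> 32);
        if (high != 0) throw new ArrayIndexOutOfBoundsException();
        return (int) index;
    }

    byte get(long index) {
        // No explicit range test on the low word: the array access itself
        // rejects negative ints and ints >= data.length.
        return data[toIntIndex(index)];
    }

    void set(long index, byte value) {
        data[toIntIndex(index)] = value;
    }

    public static void main(String[] args) {
        LongIndexedBytes a = new LongIndexedBytes(4);
        a.set(3L, (byte) 7);
        System.out.println(a.get(3L));
        try {
            a.get(1L << 31);   // passes the H check, fails the JVM's own check
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("caught by the array bounds check");
        }
    }
}
```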

Regards,
Adam

Eric Albert
Guest
Posted: Sat Mar 25, 2006 3:03 pm    Post subject: Re: Optimal x86-32 Sun Hotspot code generation?

In article <442434b3$0$11064$e4fe514c (AT) news (DOT) xs4all.nl>,
Skarmander <spamtrap (AT) crayne (DOT) org> wrote:

Quote:
Grumble wrote:
Adam Warner wrote:
I'm trying to create the fastest way to cast a Java long to an int while
preserving array bounds checking. This is the approach I suspect should be
optimal:

public final static int toIntIndex(long index) {
int high=(int) (index>>>32);
if (high!=0) throw new ArrayIndexOutOfBoundsException();
return (int) index;
}

If the long index is non-negative and in int range, high will be 0. If the
long index is negative then high will be non-zero. If the long index is
at least 2^31 and below 2^32 then it will pass this test, but the cast
yields a negative int, so it is still caught by Java's int bounds checking.

I believe the unsigned right shift by 32 should permit the test to be
conducted upon the 32 most significant bits of the 64-bit value, that is
no shift should actually be performed on 32-bit platforms.

For what it's worth, out of curiosity, I wrote a similar function in C.

#include <stdint.h>
void abort(void);
int32_t foo(int64_t index)
{
int32_t high = (uint64_t)index >> 32;
if (high != 0) abort();
return index;
}

for which gcc-3.4.4 -O2 generates the following code.

_foo:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
/*
What for? Stack alignment?

Yes. In particular, the Pentiums and in particular SSE do not like data
that's not royally aligned.

Why won't it go away with -mpreferred-stack-boundary=4 ??

Because -mpreferred-stack-boundary is the base 2 logarithm of the number of
bytes to align to, not the actual number of bytes. In this case, you've
asked for a stack alignment of 16 bytes, which is the default. Try
-mpreferred-stack-boundary=2.

As far as I know, Mac OS X is the only widely used x86 operating system
to use 16-byte stack alignment by default for 32-bit. Everyone else
uses 4-byte alignment. For 64-bit, though, the AMD64 ABI requires
16-byte stack alignment.

-Eric

--
Eric Albert ejalbert (AT) cs (DOT) stanford.edu
http://outofcheese.org/

Skarmander
Guest
Posted: Sat Mar 25, 2006 11:12 pm    Post subject: Re: Optimal x86-32 Sun Hotspot code generation?

Eric Albert wrote:
Quote:
In article <442434b3$0$11064$e4fe514c (AT) news (DOT) xs4all.nl>,
Skarmander <spamtrap (AT) crayne (DOT) org> wrote:

Grumble wrote:
snip
_foo:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
/*
What for? Stack alignment?
Yes. In particular, the Pentiums and in particular SSE do not like data
that's not royally aligned.

Why won't it go away with -mpreferred-stack-boundary=4 ??
Because -mpreferred-stack-boundary is the base 2 logarithm of the number of
bytes to align to, not the actual number of bytes. In this case, you've
asked for a stack alignment of 16 bytes, which is the default. Try
-mpreferred-stack-boundary=2.

As far as I know, Mac OS X is the only widely used x86 operating system
to use 16-byte stack alignment by default for 32-bit. Everyone else
uses 4-byte alignment.

Well, it's true that, say, Windows doesn't *need* 16-byte alignment, but
recent gccs use 16-byte alignment by default for x86-32. This does often
raise eyebrows, but there seems to be some truth to the defense that those
extra bytes are a small price to pay for avoiding the risk of performance
loss when the alignment is necessary (for SSE and friends). The Pentium 3
and 4 allegedly like 16-byte alignment better as well, even without SSE
(I've never tested any of this, mind you).

S.

Eric Albert
Guest
Posted: Sun Mar 26, 2006 10:12 am    Post subject: Re: Optimal x86-32 Sun Hotspot code generation?

In article <4425a0f2$0$11073$e4fe514c (AT) news (DOT) xs4all.nl>,
Skarmander <spamtrap (AT) crayne (DOT) org> wrote:

Quote:
Eric Albert wrote:
In article <442434b3$0$11064$e4fe514c (AT) news (DOT) xs4all.nl>,
Skarmander <spamtrap (AT) crayne (DOT) org> wrote:

Grumble wrote:
snip
_foo:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
/*
What for? Stack alignment?
Yes. In particular, the Pentiums and in particular SSE do not like data
that's not royally aligned.

Why won't it go away with -mpreferred-stack-boundary=4 ??
Because -mpreferred-stack-boundary is the base 2 logarithm of the number of
bytes to align to, not the actual number of bytes. In this case, you've
asked for a stack alignment of 16 bytes, which is the default. Try
-mpreferred-stack-boundary=2.

As far as I know, Mac OS X is the only widely used x86 operating system
to use 16-byte stack alignment by default for 32-bit. Everyone else
uses 4-byte alignment.

Well, it's true that, say, Windows doesn't *need* 16-byte alignment, but
recent gccs use 16-byte alignment by default for x86-32. This does often
raise eyebrows, but there seems to be some truth to the defense that those
extra bytes are a small price to pay for avoiding the risk of performance
loss when the alignment is necessary (for SSE and friends). The Pentium 3
and 4 allegedly like 16-byte alignment better as well, even without SSE
(I've never tested any of this, mind you).

Ah; you're completely right about gcc. I'd missed that it used
-mpreferred-stack-boundary=4 by default when not using -Os. The
difference in Apple's gcc is that -mpreferred-stack-boundary=4 is also
set for -Os, since the system's ABI requires it.

-Eric

--
Eric Albert ejalbert (AT) cs (DOT) stanford.edu
http://outofcheese.org/
All times are GMT.