AppletTalk.com Forum Index AppletTalk.com
Java discussions newsgroups
 

Problem with memory corruption in JVM 1.3.1 SR5

 
Forum: ibm.software.java.linux
Thorkild Stray
Posted: Wed Nov 19, 2003 2:32 am    Post subject: Problem with memory corruption in JVM 1.3.1 SR5

Hi.

I am looking for advice on debugging a problem where the JVM dies
with a segmentation fault. We're running release cxia32131-20030618 (which
I believe is SR5).

We've earlier had problems with constant crashing due to bugs in the
JIT compiler, but thanks to the improved javacore dumps and the Diagnostics
Guide we've found suitable workarounds for those. We still have one
problem, though, which I can't seem to figure out.

The setup:
We've got boxes running Red Hat AS 2.1 with 2 CPUs and plenty of RAM. The
JVM is running with a 1 GB heap, which it normally GCs down to 500-600 MB
of usage when it needs to. On top of the JVM, we're running Oracle
application server (oc4j).

The problem:
The system keeps running for 6-10 hours and then it crashes. The timing
seems rather random, but normally about twice a day.

Test-setup:
We have failed to recreate this error in our test setup. We've tried to
run some of the same traffic, but still, we can't seem to make it die in
the same way. This is also our main problem, since it would be much
simpler to do further detective work if we had a known way of making it
crash.

Work done:
According to the javacore files, the JVM is crashing with a SIGSEGV in
IBMJava2-131-sr5/bin/classic/libjvm.so. After disassembling and
recalculating the crash address according to where in memory the library
is mapped, we see that it crashes in the localMark call. The thread that
leads to the crash is (again, according to the javacore file) the GC
Daemon. At first we thought this might have something to do with the
parallel GC system in the newer versions, but turning this off only meant
that the crash moved from parallellmark to localmark. If we look at the
stack pointer that is written to STDERR when it crashes, it also looks like
the same thing is happening each time. All crashes end with the JVM
giving a stack pointer whose value ends in "c7a78", where the
complete address (for example 0xbe9c7a78) always maps into something that
looks like the following line from the memory map:

2HPMEMMAPLINE b9bbf000-b9bc8000 rwxp 00001000 00:00 0

(which is then followed by this one:)
2HPMEMMAPLINE b9bc8000-b9bc9000 ---p 0000a000 00:00 0

After reading the Diagnostics guide and deciphering the memory map output,
this looks like the memory region of a thread.
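
One way to repeat this lookup without reading the map by eye is to parse the 2HPMEMMAPLINE entries and test an address against each range. The following is a minimal sketch, assuming the entries keep the layout shown above (start-end address range followed by a four-character permission field). The names `Region`, `parse_memmap_line` and `find_region` are made up for illustration and are not part of any IBM tool, and the address in the usage note is a hypothetical value chosen to fall inside the sample region.

```python
import re
from typing import NamedTuple, Optional


class Region(NamedTuple):
    start: int  # first address in the mapping (inclusive)
    end: int    # first address past the mapping (exclusive)
    perms: str  # permission field, e.g. "rwxp" or "---p"


# Matches lines such as:
#   2HPMEMMAPLINE b9bbf000-b9bc8000 rwxp 00001000 00:00 0
_LINE = re.compile(r"^2HPMEMMAPLINE\s+([0-9a-fA-F]+)-([0-9a-fA-F]+)\s+(\S{4})")


def parse_memmap_line(line: str) -> Optional[Region]:
    """Turn one javacore memory-map line into a Region, or None if it does not match."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    return Region(int(match.group(1), 16), int(match.group(2), 16), match.group(3))


def find_region(lines, address: int) -> Optional[Region]:
    """Return the first mapping that contains the given address, if any."""
    for line in lines:
        region = parse_memmap_line(line)
        if region is not None and region.start <= address < region.end:
            return region
    return None
```

For example, `find_region(lines, 0xb9bc7a78)` with the two lines above in `lines` returns the `rwxp` region, while an address outside every listed range returns `None`.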

I have also collected core dumps from the boxes and used jextract/jformat
to view their status. The core dumps are often incomplete (since the
process does receive a segmentation fault, this doesn't seem all that
surprising to me), but some are complete. Those that are complete
complain about finding an object on the heap that has a length of 0, but
I can't seem to find any information about what kind of object that is. My
initial theory is that the reference to the object doesn't contain
anything about what class it belongs to, so I would need the object's
content to be able to decide what class it is. This is, according
to jformat, memory corruption. I don't know what happened to the object,
and I believe this should never happen. By the way, I am running
Java with verbose compiling, and none of the memory addresses are ever
used by the code being JIT-compiled.

My questions are:

1) Does anybody have any good ideas on how to trace this? I have looked at
using the tracing features (outlined in the Diagnostics-guide), but since
this is a production system I haven't found anything suitable which
doesn't slow the system down a lot.

2) Could this have anything to do with JITC? I can't run the production
system without any JIT compiling, but should I go through testing SKIP on
all the classes one by one? We've tried reducing the threshold for
how often code must be used before being JIT-compiled/optimized (and even
tried using FORCE on all code) without any luck at crashing it.

Btw, it is great of IBM to release the Diagnostics Guide. It has helped us
tremendously, and is really a very handy tool.

Any comments? Ideas?

--
Thorkild - allergic to SIGSEGVs.

Neil Masson
Posted: Wed Nov 19, 2003 2:07 pm    Post subject: Re: Problem with memory corruption in JVM 1.3.1 SR5

Thorkild,

I am glad you like the Diagnostics Guide. I suggest that first of all
you upgrade either to JDK 131 SR6 or JDK 141 SR1. Also check to see if
there are any Oracle upgrades.

It looks like there is corruption in the Java heap. This could be
caused by
- an internal error in the JVM code
- a memory overwrite in native code
- shortage of memory

Check for shortage of memory first. On RHAS2.1, you will see
MEMMAPLINE entries going up from 0x40000000 and thread stacks coming
down from 0xc0000000. Is there a gap between the entries coming up
and those coming down? If not, you have run out of address space. The
quickest fix for this would be to reduce the Java heap size to 740MB,
at which point it will fit below the 0x40000000 boundary.
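
The gap check can be automated along these lines. This is a rough sketch, assuming the javacore's 2HPMEMMAPLINE entries use the format quoted earlier in the thread; `mapped_ranges`, `address_gaps` and `largest_gap` are illustrative names, not functions from any IBM tool.

```python
import re

# Captures the start and end address of one javacore memory-map entry.
_RANGE = re.compile(r"^2HPMEMMAPLINE\s+([0-9a-fA-F]+)-([0-9a-fA-F]+)\s")


def mapped_ranges(lines):
    """Return sorted (start, end) address pairs for every 2HPMEMMAPLINE entry."""
    ranges = []
    for line in lines:
        match = _RANGE.match(line.strip())
        if match:
            ranges.append((int(match.group(1), 16), int(match.group(2), 16)))
    return sorted(ranges)


def address_gaps(lines):
    """Return (gap_start, gap_end) for each unmapped hole between mappings."""
    gaps = []
    highest_end = None
    for start, end in mapped_ranges(lines):
        if highest_end is not None and start > highest_end:
            gaps.append((highest_end, start))
        highest_end = end if highest_end is None else max(highest_end, end)
    return gaps


def largest_gap(lines):
    """Return the widest hole, or None if the mappings are contiguous."""
    gaps = address_gaps(lines)
    return max(gaps, key=lambda gap: gap[1] - gap[0]) if gaps else None
```

Feeding it every 2HPMEMMAPLINE entry from a javacore and printing `largest_gap(lines)` in hex shows whether a sizeable hole remains between the mappings growing upward and the thread stacks coming down. If it returns `None` or only a very small range, that would point toward address-space exhaustion.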

Next enable the heap verification option. Set the trace point
-Dibm.dg.trc.print=st_verify_heap
which will check the heap before and after each GC cycle.

Memory overwrites from native code are very hard to find. Does
Oracle application server use native code?

The JIT addresses relate to where JITed methods are located. These
are not Java objects and so will not match anything in the Java heap.
The Garbage Collector is written in C, so you are not running a
JITed method at the point of failure. However it is possible that
the JIT might be wrongly allocating an object. Try setting
JITC_COMPILEOPT=NALL to disable all JIT optimisations.
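
For a test box, the settings suggested above could be gathered in a small launcher. This is only a sketch: the trace property and the JITC_COMPILEOPT value are the ones named in this post, while `-Xmx` as the way to cap the heap at 740MB, the function name `diagnostic_launch`, and the `app.jar` argument in the usage note are assumptions for illustration. A real oc4j installation would more likely set these in its own startup script.

```python
import os


def diagnostic_launch(java_cmd, app_args, heap_mb=740, base_env=None):
    """Build (argv, env) for a JVM run with heap verification on and JIT optimisations off."""
    # Copy the environment so the caller's (or the process's) environment is not modified.
    env = dict(os.environ if base_env is None else base_env)
    # Environment variable from the post: disables all JIT optimisations.
    env["JITC_COMPILEOPT"] = "NALL"
    argv = [
        java_cmd,
        f"-Xmx{heap_mb}m",                    # assumed flag for the reduced heap size
        "-Dibm.dg.trc.print=st_verify_heap",  # trace point from the post
        *app_args,
    ]
    return argv, env
```

The result can be passed to `subprocess.run(argv, env=env)`, for example after `argv, env = diagnostic_launch("java", ["-jar", "app.jar"])`.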

Hope this helps.
Neil

G.S. Link
Posted: Wed Nov 19, 2003 3:00 pm    Post subject: Re: Problem with memory corruption in JVM 1.3.1 SR5

What threading model are you using? What happens if you put a
LD_ASSUME_KERNEL=2.2.5 in your profile? What happens if you try the Sun
142 jdk?

Thorkild Stray
Posted: Wed Nov 19, 2003 3:42 pm    Post subject: Re: Problem with memory corruption in JVM 1.3.1 SR5

On Wed, 19 Nov 2003 14:07:49 +0000, Neil Masson wrote:
Quote:
Thorkild,

I am glad you like the Diagnostics Guide. I suggest that first of all
you upgrade either to JDK 131 SR6 or JDK 141 SR1. Also check to see if
there are any Oracle upgrades.

I hadn't noticed that SR6 had been released and will try it out as
quickly as possible. We've submitted some other JIT bugs to IBM and they
were fixed in an internal release we were allowed to test (not in
production, of course). It'll be interesting to check if those fixes got
into SR6 or if it was too late.

I especially noticed this fix:

jit405-20030909 63969 - c N/A JIT: Memory corruption while propagating
sync info

Quote:
Check for shortage of memory first. On RHAS2.1, you will see MEMMAPLINE
entries going up from 0x40000000 and thread stacks coming down from
0xc0000000. Is there a gap between the entries coming up and those
coming down? If not, you have run out of address space. The quickest
fix for this would be to reduce the Java heap size to 740MB, at which
point it will fit below the 0x40000000 boundary.

Interesting. The memory map mixes up thread stacks in between the
Jar-files and the loaded libraries, it seems. If I blame that on startup,
the jar-files stop at:

2HPMEMMAPLINE 847ff000-84800000 r--s 00000000 00:0a 2073857 /www/application/common_files/WEB-INF/lib/README

(which, incidentally, really shouldn't be loaded. I'll need to fix that :)
Probably somebody playing around with a wildcard too many).

and it goes straight over to:

2HPMEMMAPLINE 84800000-84ac4000 rw-p 00000000 00:00 0

To me, that looks like there aren't any gaps. Doesn't this look
like we're running out of address space? How can we increase how much
address space we have? You suggest reducing the heap size, but since we're
using a lot of memory constantly, I am afraid that'll trigger too many GC
operations.

(the full map is available from
http://heim.ifi.uio.no/~thorkild/hpmemmapline.txt , but I didn't want to
post the whole long thing here)

Quote:
Next enable the heap verification option. Set the trace point
-Dibm.dg.trc.print=st_verify_heap
which will check the heap before and after each GC cycle.

Will this just alert me of problems, or handle them? Do you have any idea
of how much this would impact performance? I'll test it out in our
test setup too.

Quote:
Memory overwrites from native code are very hard to find. Does Oracle
application server use native code?

The whole setup uses pure Java, and no JNI. That was the first thing I
checked, and they had converted from the JNI Oracle libraries to the pure
Java ones last February.

Quote:
The Garbage Collector is written in C, so you are not running a JITed
method at the point of failure. However it is possible that the JIT
might be wrongly allocating an object. Try setting JITC_COMPILEOPT=NALL
to disable all JIT optimisations.

I think we've done this, but I'll make sure and test it again.

Thanks for your insight, Neil. It is very helpful!

--
Thorkild


Thorkild Stray
Posted: Wed Nov 19, 2003 3:50 pm    Post subject: Re: Problem with memory corruption in JVM 1.3.1 SR5

On Wed, 19 Nov 2003 10:00:24 -0500, G.S. Link wrote:

Quote:
What threading model are you using? What happens if you put a
LD_ASSUME_KERNEL=2.2.5 in your profile?

It is running an updated RHAS 2.1 kernel, so that means 2.4.9-e27 (iirc).
Since this is less than 2.4.10, it won't run without LD_ASSUME_KERNEL.

So, it is running the old (2.2.5) threading model.

Quote:
What happens if you try the Sun 142 jdk?

Other kinds of problems[1] :(. We're considering trying that again soon,
though. In addition, the IBM JVM really improves on the performance.

[1] Not the same type of problems, though. Not only technical issues,
either.

--
Thorkild

G.S. Link
Posted: Thu Nov 20, 2003 3:17 pm    Post subject: Re: Problem with memory corruption in JVM 1.3.1 SR5

If this were mine I would probably replace both the kernel and the jre.
The new IBM 141 jre fixes several bugs. The 2.4.20+ kernel is an
improvement and the latest Sun 142 works much better. Things are changing
so fast in this area that it is difficult to keep up so I haven't been
able to test some of this. The point of my questions was to discover
where this problem lies. Is it in your code? Is it in the IBM jre or is
it in Java? If the problem shows up in both the Sun and IBM jre then it
is either in your code, in Java or in your hardware. If it only shows up
in one jre then it is either that jre or hardware. If you are not running
ecc memory this could be hardware. IBM Vast tends to produce complaints
like this. In almost all cases they can be traced to the software
stressing the hardware. Have you eliminated this as a factor?

Thorkild Stray
Posted: Fri Nov 21, 2003 12:11 am    Post subject: Re: Problem with memory corruption in JVM 1.3.1 SR5

On Thu, 20 Nov 2003 10:17:54 -0500, G.S. Link wrote:

Quote:
If this were mine I would probably replace both the kernel and the jre.
The new IBM 141 jre fixes several bugs. The 2.4.20+ kernel is an
improvement and the latest Sun 142 works much better. Things are
changing so fast in this area that it is difficult to keep up so I
haven't been able to test some of this. The point of my questions was
to discover where this problem lies. Is it in your code? Is it in the
IBM jre or is it in Java? If the problem shows up in both the Sun and
IBM jre then it is either in your code, in Java or in your hardware. If
it only shows up in one jre then it is either that jre or hardware. If
you are not running ecc memory this could be hardware. IBM Vast tends
to produce complaints like this. In almost all cases they can be traced
to the software stressing the hardware. Have you eliminated this as a
factor?

Hi,

I realized why you asked those questions, but I think I was a little quick
when answering them. Sorry about that.

Kernel 2.4.9e27 is the newest supported RH AS kernel, and it is quite
different from a stock 2.4.9. Only RH AS machines under support contract
run those kinds of kernels, and I am aware of the improvements in kernels.
I don't see that this should have an impact on this kind of problem,
though. Libc/Threading library on the other hand was a suspect earlier
(when we had JITC-problems), but those have also been cleared in this case
(well, we think we've cleared them).

When it comes to my problems:

- This problem exists on 4 different servers, all equipped with the same
kind of hardware (and ECC RAM, I believe). The problem appears on them all.
The problem also happens at times when the traffic is low, although not
as frequently. This leads me to believe that it is connected with the
amount of traffic, but not dependent on it. A heavily loaded server will GC
more often, so it makes sense. There are no other known problems with
these servers, so I have ruled out hardware problems.

- The same problem does not appear with the Sun JVM, but that one has
earlier had other problems.

As for whether this is a bug in our Java code or not, I
would like to point out that it is a segmentation fault. Normally, since
we're not using JNI methods, something like that should never be able to
happen in a JVM (theoretically) if there is no underlying OS problem (like
running out of RAM or hardware failure). Now, I said theoretically, since
I've seen quite a lot of horrible Java code (that would probably make a
JVM crash out of embarrassment), and I do believe that something in the
code is triggering a bug in the JVM. The problem is that I can't see which
object suddenly has a length of 0 (or what kind of object it is).
What makes it even more annoying is that we can't reproduce it on
our test setup yet, which makes debugging all that much harder.

SR6 has passed testing, so we're pushing that out soon, and I am working
on the things Neil mentions. Hopefully, we'll get it stable soon.

--
Thorkild

G.S. Link
Posted: Fri Nov 21, 2003 2:10 pm    Post subject: Re: Problem with memory corruption in JVM 1.3.1 SR5

The only other question I have is whether you have ever encountered the
situation with this bug where attempting to use a debugger fixes the
problem? Since it doesn't appear in Sun code and since your hardware is
solid that implicates the IBM code. Since Java is forked and partly OO my
experience has been that threading is the first place to look when you get
one of these problems. The object that has a length of zero may simply
not be there. I ran into this the other day in Smalltalk. The instance
was destroyed by one thread before it could be used by another thread. Of
course, putting in a debugger made the problem vanish. Another clue is
the segmentation fault. Missing code almost always ends up as a
catastrophic failure of this type. When you write a jvm for a machine with
x processors and a nonstandard kernel that may be doing strange threading
you may have timing problems and these things are always hard to find. If
you can eliminate timing as a possibility it will make this problem easier
to find. There are simply not that many places in that jvm, outside of
timing, where such a problem could be. Incidentally, when it comes to
kernels, I trust Linus further than Red Hat. Thank you for doing your
homework before coming here! That really helps!

Neil Masson
Posted: Thu Nov 27, 2003 3:04 pm    Post subject: Re: Problem with memory corruption in JVM 1.3.1 SR5

Quote:
I especially noticed this fix:

jit405-20030909 63969 - c N/A JIT: Memory corruption while propagating
sync info

I think that this problem only applies to Power processors.


Quote:
Check for shortage of memory first. On RHAS2.1, you will see MEMMAPLINE
entries going up from 0x40000000 and thread stacks coming down from
0xc0000000. Is there a gap between the entries coming up and those
coming down? If not, you have run out of address space. The quickest
fix for this would be to reduce the Java heap size to 740MB, at which
point it will fit below the 0x40000000 boundary.

Interesting. The memory map mixes up thread stacks in between the
Jar-files and the loaded libraries, it seems. If I blame that on startup,
the jar-files stop at:

2HPMEMMAPLINE 847ff000-84800000 r--s 00000000 00:0a 2073857 /www/application/common_files/WEB-INF/lib/README

(which, incidentally, really shouldn't be loaded. I'll need to fix that :)
Probably somebody playing around with a wildcard too many).

and it goes straight over to:

2HPMEMMAPLINE 84800000-84ac4000 rw-p 00000000 00:00 0

To me, that looks like there aren't any gaps. Doesn't this look
like we're running out of address space? How can we increase how much
address space we have? You suggest reducing the heap size, but since we're
using a lot of memory constantly, I am afraid that'll trigger too many GC
operations.

(the full map is available from
http://heim.ifi.uio.no/~thorkild/hpmemmapline.txt , but I didn't want to
post the whole long thing here)


Thanks for not posting the full javacore - they tend to crash my
viewer. There is a gap between
2HPMEMMAPLINE 89100000-89583000 rw-p 00000000 00:00 0
and
2HPMEMMAPLINE b3dbe000-b3dbf000 ---p 00000000 00:00 0
so it is not an address-space issue.

Quote:
Next enable the heap verification option. Set the trace point
-Dibm.dg.trc.print=st_verify_heap
which will check the heap before and after each GC cycle.

Will this just alert me of problems, or handle them? Do you have any idea
of how much this would impact performance? I'll test it out in our
test setup too.


This just flags problems. It does impact performance by a few percent,
depending on how frequently garbage is collected.


Neil

Thorkild Stray
Posted: Fri Nov 28, 2003 5:46 am    Post subject: Re: Problem with memory corruption in JVM 1.3.1 SR5

On Thu, 27 Nov 2003 15:04:51 +0000, Neil Masson wrote:

Quote:
Thanks for not posting the full javacore - they tend to crash my
viewer. There is a gap between
2HPMEMMAPLINE 89100000-89583000 rw-p 00000000 00:00 0
and
2HPMEMMAPLINE b3dbe000-b3dbf000 ---p 00000000 00:00 0
so it is not an address-space issue.

Oh, I didn't see that gap.

I thought I'd post the happy news (well, for me at least) that the JVM now
looks stable on all production machines after upgrading to SR6. Since SR6
passed testing and was put in production, we've had no segmentation fault
crashes at all. We've had two situations where the JVM has terminated, but
these were more controlled and it gave us a proper reason why it
terminated.

The reason for crashing seems to be connected with a situation where it
simply runs out of memory. My theory is that a very high load combined
with a low-memory situation is what triggered the segmentation fault. What
confused us earlier was that neither the javacore dumps nor the logs from
the application showed that it ran out of memory. I guess there is
something in the application that can make it eat all its memory in a
very short time. The JVM didn't even generate OutOfMemory exceptions in
time. Twice this out-of-memory situation has led to the JVM
terminating, but most times it handles the situation quite nicely and is
able to survive. The resulting javacore files told us that it had plenty
of available heap space, so we were a little confused (although we are
aware that the javacore files can be misleading/wrong).

I can't see any fixes in the Changelog that would explain what was fixed,
but I must admit I am just happy that it is fixed now..:-)

I've now sent this issue back to the application developers, so they can
fix it. Hopefully the Heapdumps will show them the errors of their ways
(combined with HeapRoots).

Thanks for all your help!

(if this posting doesn't lead to Murphy crashing the JVMs, then I think
I'll label that system stable)

--
Thorkild (quite happy now)


G.S. Link
Posted: Tue Dec 02, 2003 12:38 am    Post subject: Re: Problem with memory corruption in JVM 1.3.1 SR5

JVM 131 is used to drive WebSphere Developer Workbench. I find that your
problem appears in two of three machines causing the workbench to crash.
As you found out it doesn't happen often enough to find the problem.

All times are GMT.