AppletTalk.com Java discussions newsgroups
m.colom@barcelo.com Guest
Posted: Fri Dec 19, 2003 7:13 am Post subject: Tests with 1.4.1 SR1 on Redhat Enterprise Server 3.0
Hello
We are testing 1.4.1 on RHES 3.0 (yes, we know it's not yet
certified). So far this is what we've seen (and still see):
We have a cluster of web servers. We've migrated one of them to RHES 3.0
and 1.4.1 (using NPTL) and are testing its stability. First of all, the
good things:
1-Under heavy load the old JVM (1.3.0) could degrade in performance: we
saw very high context-switch rates (about 1,000,000) and 70% of the CPU
used by the system. With the new JVM and NPTL we don't see this problem,
and we get a performance gain of about 3x (under high load).
2-We can raise the number of threads in the system without seeing
performance degradation. With 1.3.0 and no NPTL the system begins
behaving strangely at about 300 threads.
3-With JVM 1.3.0 and no NPTL we sometimes get threads that are stuck and
consume a whole CPU all the time. This doesn't happen with RHES 3.0 and
JVM 1.4.1.
In summary, we feel the new NPTL/1.4.1 configuration gives much more
predictable performance.
Now the bad things:
1-XML compatibility: We had to modify one internal text file
($JAVA_HOME/jre/lib/jaxp.properties) to use Xalan and Xerces for XML
processing; otherwise the application would not work. Neil gave us the
tip to modify the jaxp.properties file, and with that change we could
get our web application running.
Once we were able to start the application, we faced several problems.
Below I describe them, the solutions we've found, and the problems we
still have.
1-First problem: The JVM was stopping after a random, unpredictable
amount of time with a "too many open files" error.
The initial ulimit -n value was 1024. We raised it to 4096 and still had
the same problem. The other servers, with a limit of 1024 and JVM 1.3.0,
were working without problems.
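A descriptor leak like this can be watched from inside the JVM before the limit is reached. The following is only a minimal sketch, not something used in the tests described here: it assumes a Linux system where /proc/self/fd lists the descriptors of the current process, and the class and method names are hypothetical.

```java
import java.io.File;

final class OpenFileCount {

    /**
     * Counts the entries of /proc/self/fd (Linux only, an assumption).
     * Returns -1 when the directory cannot be listed, for example on
     * another operating system. The count includes the descriptor that
     * is briefly opened to list the directory itself.
     */
    static int countOpenFiles() {
        String[] entries = new File("/proc/self/fd").list();
        return entries == null ? -1 : entries.length;
    }

    public static void main(String[] args) {
        int count = countOpenFiles();
        if (count < 0) {
            System.out.println("/proc/self/fd is not available on this system");
        } else {
            System.out.println("Open file descriptors: " + count);
        }
    }
}
```

Logging this value periodically and comparing it with the ulimit -n setting shows whether the count climbs steadily, which would point to a leak rather than a limit that is simply too low.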
We saw that the 1.4.1 JVM had opened the file
$JAVA_HOME/jre/lib/jaxp.properties 4000 times. Reading the file, we saw
that the XML classes use it to get three values:
javax.xml.transform.TransformerFactory=org.apache.xalan.processor.TransformerFactoryImpl
javax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl
javax.xml.parsers.DocumentBuilderFactory=org.apache.xerces.jaxp.DocumentBuilderFactoryImpl
To solve this issue, we set these values as system properties, via -D
flags at JVM startup. This way the XML classes don't have to read the
file every time. The "too many open files" error has disappeared since
this change.
We think this is a workaround, not a solution. There is a bug somewhere.
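The workaround relies on the JAXP factory lookup checking system properties before it falls back to jaxp.properties. A minimal sketch of the same idea in code follows; the three property names and class names are the ones listed above, while the helper class and its methods are hypothetical. It assumes xalan.jar and xerces.jar are on the classpath when the factories are eventually created.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class JaxpFactoryProperties {

    /** The three factory settings quoted from jaxp.properties above. */
    static Map<String, String> factoryProperties() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("javax.xml.transform.TransformerFactory",
                "org.apache.xalan.processor.TransformerFactoryImpl");
        props.put("javax.xml.parsers.SAXParserFactory",
                "org.apache.xerces.jaxp.SAXParserFactoryImpl");
        props.put("javax.xml.parsers.DocumentBuilderFactory",
                "org.apache.xerces.jaxp.DocumentBuilderFactoryImpl");
        return props;
    }

    /** Sets the values programmatically; call before any JAXP factory is created. */
    static void apply() {
        for (Map.Entry<String, String> e : factoryProperties().entrySet()) {
            System.setProperty(e.getKey(), e.getValue());
        }
    }

    /** Renders the same settings as -D flags for the JVM command line. */
    static String asCommandLineFlags() {
        StringBuilder flags = new StringBuilder();
        for (Map.Entry<String, String> e : factoryProperties().entrySet()) {
            if (flags.length() > 0) {
                flags.append(' ');
            }
            flags.append("-D").append(e.getKey()).append('=').append(e.getValue());
        }
        return flags.toString();
    }

    public static void main(String[] args) {
        System.out.println(asCommandLineFlags());
    }
}
```

Passing the flags at startup, as done here, is usually the safer of the two options, because it does not depend on apply() running before the first XML class is used.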
2-Second problem: We were getting "Out of Memory" errors after a random,
unpredictable amount of time.
The JVM then generated a heap dump. We analyzed it with the HeapRoot
tool from IBM and saw a huge number of char and byte arrays that the GC
was not able to free. Because XML libraries use a lot of temporary byte
and char arrays, we suspected a leak in the XML libraries provided in
the $JAVA_HOME/jre/lib/xml.jar file.
We deleted it (yes, I know this is crude) and forced the JVM to use the
same XML libraries we use on the servers with JVM version 1.3.0
(xalan.jar and xerces.jar). The "Out of Memory" problem has disappeared.
3-Third problem:
This is the one we're trying to solve now. The JVM hangs after a
random, unpredictable amount of time. The server is an SMP machine
with 2 CPUs and 2 GB of RAM. The heap parameter is Mx=512m (a 512 MB
maximum heap).
We've only seen these hangs so far with the JIT activated. We've had 4
of these hangs. The first 3 times we were unable to get any additional
information: no errors in the logs (stdout and stderr), kill -QUIT did
not work, and kill -KILL was necessary to kill the JVM and start again
(kill -TERM didn't work).
The fourth time we were able to get a javacore with kill -QUIT. We had
to kill -KILL the JVM again after that. We can provide this javacore for
further examination if someone is interested.
The only additional clue we have is that 2 of the 4 times the JVM hung
we had the -verbosegc flag active. The last lines in stderr were:
<AF[2349]: Allocation Failure. need 32784 bytes, 7143 ms since last AF>
<AF[2349]: managing allocation failure, action=1 (41920/237740296)
(2513184/2644216)>
When it should be something like:
<AF[2348]: Allocation Failure. need 10016 bytes, 41017 ms since last AF>
<AF[2348]: managing allocation failure, action=1 (94128/237740296)
(1547032/2644216)>
<GC(2359): GC cycle started Fri Dec 19 10:32:25 2003
157 ms>
<GC(2359): mark: 143 ms, sweep: 14 ms, compact: 0 ms>
<GC(2359): refs: soft 0 (age >= 6), weak 0, final 556, phantom 0>
<AF[2348]: completed in 158 ms>
Perhaps there is a problem in the GC?
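The difference between the two traces is that the healthy one ends with a "completed in" line for the same AF number, while the hung one stops after "managing allocation failure". A minimal sketch of a scanner that reports allocation failures with no matching completion line follows. It assumes the -verbosegc line format shown above, with each record on a single line; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class VerboseGcScan {

    // Patterns follow the -verbosegc lines quoted above (an assumption
    // about the format; other JVM versions may print different lines).
    private static final Pattern STARTED =
            Pattern.compile("<AF\\[(\\d+)\\]: Allocation Failure\\.");
    private static final Pattern COMPLETED =
            Pattern.compile("<AF\\[(\\d+)\\]: completed in \\d+ ms>");

    /** Returns the AF numbers that were started but never reported as completed. */
    static List<Integer> unfinishedAllocationFailures(List<String> lines) {
        Set<Integer> open = new LinkedHashSet<>();
        for (String line : lines) {
            Matcher started = STARTED.matcher(line);
            if (started.find()) {
                open.add(Integer.valueOf(started.group(1)));
                continue;
            }
            Matcher completed = COMPLETED.matcher(line);
            if (completed.find()) {
                open.remove(Integer.valueOf(completed.group(1)));
            }
        }
        return new ArrayList<>(open);
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "<AF[2348]: Allocation Failure. need 10016 bytes, 41017 ms since last AF>",
                "<AF[2348]: completed in 158 ms>",
                "<AF[2349]: Allocation Failure. need 32784 bytes, 7143 ms since last AF>",
                "<AF[2349]: managing allocation failure, action=1 (41920/237740296) (2513184/2644216)>");
        System.out.println("Unfinished AF cycles: " + unfinishedAllocationFailures(log));
    }
}
```

Run against the captured stderr after a hang, a non-empty result would support the suspicion that the JVM stopped inside a collection cycle.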
So what we're planning to do right now to try to find a stable
configuration is:
1-Run the server for two weeks without the JIT enabled. If we don't see
any other hang, we'll assume there is a JIT problem.
2-Then we'll activate the JIT again with the -verbosegc flag and the MMI
disabled.
3-If this doesn't work, we'll begin deactivating different parts of the
JIT, as described in the "Diagnostics Guide", until we find a stable
configuration.
4-If none of this works, we're afraid we will be forced to run with the
JIT deactivated. In that case we will probably have to upgrade our
servers, because without the JIT there is a serious performance penalty.
Does anyone have additional advice or tips to help us find a stable
configuration?
Best regards
Miquel

Mike Edwards Guest
Posted: Mon Dec 22, 2003 1:35 pm Post subject: Re: Tests with 1.4.1 SR1 on Redhat Enterprise Server 3.0
Miquel,
With IBM Java SDK 1.4.1 SR1 on RHEL 3.0, on Intel 32-bit only,
there is a known problem with the NPTL thread library that causes the
third problem that you describe, i.e. the unpredictable hang on SMP
machines relating to GC.
We have not yet certified the 1.4.1 SR1 release to run on RHEL 3.0
on IA32 because of this problem. A bug report has been made to Red
Hat and we are waiting for a fix.
Interestingly, we do not have this problem with NPTL on other hardware
platforms with RHEL 3.0. The NPTL implementation is different on IA32
than on the other platforms.
Thanks for the information about the problems with Xalan, Xerces.
I am interested to understand why you are not able to use the version of
Xalan and Xerces included in the Java SDK 1.4.1 SR1 - why did you need
to change to use a different version of these?
Yours, Mike.

Miquel Colom Guest
Posted: Mon Dec 22, 2003 3:14 pm Post subject: Re: Tests with 1.4.1 SR1 on Redhat Enterprise Server 3.0
Hello Mr. Edwards
1-When we began having problems, I took a look at bugzilla.redhat.com
and saw a problem reported by IBM. Is this the one you're referring to?
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=108631
If it is, there is already an erratum and we can test it.
2-With respect to the Out of Memory problem, which we tried to solve by
changing the XML libraries, I'm sad to say that it reproduced again
with the XML libraries from JVM 1.3.0, so it seems the 1.4.1 ones (in
the file $JAVA_HOME/jre/lib/xml.jar) were not to blame for this problem.
We can, however, confirm that the "too many open files" error I
described in the earlier message is real, and we can only work around
it by setting the system properties.
3-This leaves us with two problems to solve: the "Out of Memory" error
and the JVM hang. Again I have to say that the same code is working
flawlessly on version 1.3.0.
But since there seems to be a problem with the GC on RHEL 3.0, could
this be the real reason for both the JVM hangs and the "Out of Memory"
errors?
Thank you very much for your help.
Best regards
Miquel Colom

Neil Masson Guest
Posted: Tue Jan 06, 2004 10:39 am Post subject: Re: Tests with 1.4.1 SR1 on Redhat Enterprise Server 3.0
Miquel,
| Quote: | 3- Third Problem:
This is the one we're trying to solve now. The JVM hangs after a
random, unpredictable amount of time. The server is an SMP machine
with 2 CPUs and 2 GB of RAM. Heap parameters are Mx=512m.
We've only seen these hangs so far with the JIT activated. We've had 4
of these hangs. The first 3 times we were unable to get any additional
information. No error in logs (stdout and stderr), kill -QUIT not
working, kill -KILL necessary to kill the JVM and start again (kill
-TERM didn't work).
.....
Perhaps there is a problem in the GC?
|
As a workaround, you might like to try single-threading the
garbage collector, which you can do with the flag
-Xgcthreads1.
Regards,
Neil

G.S. Link Guest
Posted: Tue Jan 06, 2004 4:12 pm Post subject: Re: Tests with 1.4.1 SR1 on Redhat Enterprise Server 3.0
Have you tried the latest XML library versions from Apache?

Miquel Colom Guest
Posted: Wed Jan 07, 2004 9:39 pm Post subject: Re: Tests with 1.4.1 SR1 on Redhat Enterprise Server 3.0
Hello
We've done the following:
1-Applied all the pending errata from rhn.redhat.com. Specifically, there
is an update for glibc related to a problem with NPTL.
2-Added the -Xgcthreads1 and -verbosegc parameters and restarted the
server.
It has been in production service for 12 hours now without problems.
We'll wait a reasonable amount of time (1 week) to see whether the system
has become stable, and we'll let you know as soon as we can.
Thank you very much
Miquel Colom

Miquel Colom Guest
Posted: Mon Jan 12, 2004 8:36 pm Post subject: Re: Tests with 1.4.1 SR1 on Redhat Enterprise Server 3.0
Hello
Update on our issues.
We haven't had any more hard JVM hangs (the kill -KILL ones) with
the -Xgcthreads1 parameter and the latest RHEL errata applied. One of
these (or both) solved the problem. We still haven't tested which one.
Nevertheless, we continue seeing random OutOfMemory errors, even though
we have doubled the heap size. Suddenly, within a few minutes, the heap
grows to its maximum value (going from 300 MB to 800 MB) and the JVM
slowly runs out of memory. The garbage collector is working but doesn't
free the heap. It looks as if the mark phase has lost track of some
parts of the heap. The same code works without problems on IBM 1.3.0,
which is why we think this is a JVM bug.
Unfortunately, we're unable to reproduce the error with a test case.
It's completely random. In fact, most errors happen under low load.
Just to make sure this OutOfMemory is JVM related, we'll try the Sun JVM
in the next few days and let you know. We'll also analyze the heap dumps
we've collected during this week.
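Since the heap jumps within a few minutes and mostly under low load, a timestamped record of heap usage can help correlate the jump with whatever the application was doing at that moment. A minimal sketch using only java.lang.Runtime follows; the class name, output format, and sampling interval are hypothetical, and in a real server the loop would run in a background thread.

```java
final class HeapUsageSampler {

    /** Bytes currently in use: committed heap minus free space inside it. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** One log line with used, committed, and maximum heap in megabytes. */
    static String snapshot() {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        return "used=" + (usedHeapBytes() / mb) + " MB"
                + " total=" + (rt.totalMemory() / mb) + " MB"
                + " max=" + (rt.maxMemory() / mb) + " MB";
    }

    public static void main(String[] args) throws InterruptedException {
        // Three samples, half a second apart, just to show the output shape.
        for (int i = 0; i < 3; i++) {
            System.out.println(System.currentTimeMillis() + " " + snapshot());
            Thread.sleep(500);
        }
    }
}
```

Matching the timestamp of the first large increase against the access log narrows down which request or job to look for in the heap dump.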
Best regards
Miquel Colom

Neil Masson Guest
Posted: Tue Jan 13, 2004 11:09 am Post subject: Re: Tests with 1.4.1 SR1 on Redhat Enterprise Server 3.0
Miquel,
GC works the other way around - it marks live objects not dead ones - so
excessive memory use is usually a problem with Java classes not with the
JVM. The Heapdump analysis tool is pretty good at finding the cause of
problems like these.
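The point about marking live objects can be seen in a small sketch: an object that is still reachable, here through a static list standing in for a hypothetical application cache, is never reclaimed, however often the collector runs. The class, field, and method names are illustrative only.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;

final class ReachabilityDemo {

    // Hypothetical application-level cache that is never cleared.
    static final List<byte[]> CACHE = new ArrayList<>();

    /** Allocates a buffer, keeps it in the cache, and returns a weak handle to it. */
    static WeakReference<byte[]> allocateAndCache(int size) {
        byte[] buffer = new byte[size];
        CACHE.add(buffer);
        return new WeakReference<>(buffer);
    }

    public static void main(String[] args) {
        WeakReference<byte[]> ref = allocateAndCache(1024 * 1024);
        System.gc();
        // The buffer is strongly reachable through CACHE, so the collector
        // must treat it as live and the weak reference is not cleared.
        System.out.println("Still reachable after GC: " + (ref.get() != null));
    }
}
```

In a heap dump, the same situation shows up as a chain of references from a root (a static field, a thread, a session object) down to the large char and byte arrays.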
Neil

Neil Masson Guest
Posted: Fri Jan 16, 2004 11:12 am Post subject: Re: Tests with 1.4.1 SR1 on Redhat Enterprise Server 3.0
Our latest tests show that NPTL is not delivering all the signals it
should. This is most often manifested as a hang in GC on SMP
machines, but can also be seen in other situations.
Hence we still recommend disabling NPTL with the environment
variable
export LD_ASSUME_KERNEL=2.4.19
rather than hiding the symptoms with -Xgcthreads1
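If exporting the variable for the whole login session is not desirable, it can be set for the server process alone. The following is a sketch only: it assumes the server is started from a small Java launcher running on Java 5 or later (ProcessBuilder is not available in 1.4.1), and the command line, including app.jar, is a hypothetical placeholder. The launcher class is illustrative and not part of any IBM tooling.

```java
import java.util.Arrays;
import java.util.List;

final class NoNptlLauncher {

    /** Prepares a child process whose environment carries LD_ASSUME_KERNEL=2.4.19. */
    static ProcessBuilder prepare(List<String> command) {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Same value as in the export line above; it applies only to the child.
        builder.environment().put("LD_ASSUME_KERNEL", "2.4.19");
        // Merge stderr into stdout so -verbosegc output ends up in one stream.
        builder.redirectErrorStream(true);
        return builder;
    }

    public static void main(String[] args) {
        // Hypothetical server command line; the process is prepared, not started.
        ProcessBuilder builder =
                prepare(Arrays.asList("java", "-verbosegc", "-jar", "app.jar"));
        System.out.println("Command: " + builder.command());
        System.out.println("LD_ASSUME_KERNEL="
                + builder.environment().get("LD_ASSUME_KERNEL"));
    }
}
```

Calling start() on the returned builder would launch the server with the variable set, leaving other processes on the machine unaffected.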
Neil

Miquel Colom Guest
Posted: Sat Jan 17, 2004 12:06 am Post subject: Re: Tests with 1.4.1 SR1 on Redhat Enterprise Server 3.0
Hello Neil
Thank you very much for the update.
We've been testing the Sun JVM because we were unable to get good
reliability with the IBM JVM and NPTL. With the Sun JVM and NPTL we
haven't seen any problems so far (no JVM hang, no OutOfMemory error).
The drawback is that it's clearly slower.
Next week we'll revert to the IBM JVM without NPTL to see if it's stable
again.
Is Red Hat working on these NPTL issues? Can I follow them at
bugzilla.redhat.com?
Best regards