Discussion:
Problem with tomcat hanging on Shib 2.4
John Kamminga
2014-08-28 23:49:11 UTC
Permalink
We've migrated our production Shibboleth environment from Solaris 10 to Redhat 6 and are now experiencing problems with the app becoming unresponsive every couple of weeks. A Tomcat restart fixes it, but we'd like to find out what is causing it. Has anyone else experienced issues migrating to or running on Redhat 6?
Or, does anyone see any potential problems with our setup?

Here is our environment setup on a Redhat VM.
Redhat Linux version: 2.6.32-431.20.3.el6.x86_64

Shibboleth Idp 2.4

Tomcat 6.0.24
JAVA_OPTS=" -Xmx1024M -XX:MaxPermSize=512M -server -Djava.library.path=/usr/lib64 -Djavax.net.ssl.trustStore=/jdk/cacerts"

Java -version:
java version "1.7.0_55"
OpenJDK Runtime Environment (rhel-2.4.7.1.el6_5-x86_64 u55-b13)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)

Thanks,
John Kamminga
Web Application Development
Information Technology Department
University of California, Merced
T: 209.228.2965<tel:209.228.2965>
E: jkamminga-DHU18zts72H2fBVCVOL8/***@public.gmane.org<mailto:jkamminga-DHU18zts72H2fBVCVOL8/***@public.gmane.org>
W: it.ucmerced.edu<http://it.ucmerced.edu/>
Cantor, Scott
2014-08-29 00:48:00 UTC
Permalink
Post by John Kamminga
java version "1.7.0_55"
OpenJDK Runtime Environment (rhel-2.4.7.1.el6_5-x86_64 u55-b13)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
You can ignore me, but I wouldn't touch Red Hat's OpenJDK with a 10 foot
pole. If nothing else, I'd sure rule it out as a cause.

-- Scott
Ted Fisher
2014-08-29 13:05:13 UTC
Permalink
We did a nearly identical migration two months ago for our test IdP and, as of the first of this month, for production. We've gone from Solaris to everything exactly the same as yours, with the exception of Java. Due to warnings about OpenJDK we instead use java-1.6.0-sun-1.6.0.81-1jpp.1.el6_5.x86_64, which is available in the Redhat repo rhel-x86_64-server-6-thirdparty-oracle-java.

The production IdPs have been fine. We have had two occurrences with the pair of test IdPs in the eight weeks they've been in place that were similar to what you describe. In both cases a Tomcat restart fixed it and we were never able to determine the cause - nothing logged anywhere, just a non-responsive hang.

Given the similarities of our environments I would be very interested in what you find, and we are willing to share our findings. Unexplained hangs are unnerving.

Ted F. Fisher
Information Technology Services

From: users-bounces-***@public.gmane.org [mailto:users-bounces-***@public.gmane.org] On Behalf Of John Kamminga
Sent: Thursday, August 28, 2014 7:49 PM
To: users-***@public.gmane.org
Subject: Problem with tomcat hanging on Shib 2.4

We've migrated our production Shibboleth environment from Solaris 10 to Redhat 6 and are now experiencing problems with the app becoming unresponsive every couple of weeks. A Tomcat restart fixes it, but we'd like to find out what is causing it. Has anyone else experienced issues migrating to or running on Redhat 6?
Or, does anyone see any potential problems with our setup?

Here is our environment setup on a Redhat VM.
Redhat Linux version: 2.6.32-431.20.3.el6.x86_64

Shibboleth Idp 2.4

Tomcat 6.0.24
JAVA_OPTS=" -Xmx1024M -XX:MaxPermSize=512M -server -Djava.library.path=/usr/lib64 -Djavax.net.ssl.trustStore=/jdk/cacerts"

Java -version:
java version "1.7.0_55"
OpenJDK Runtime Environment (rhel-2.4.7.1.el6_5-x86_64 u55-b13)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)

Thanks,
John Kamminga
Web Application Development
Information Technology Department
University of California, Merced
T: 209.228.2965<tel:209.228.2965>
E: jkamminga-DHU18zts72H2fBVCVOL8/***@public.gmane.org<mailto:jkamminga-DHU18zts72H2fBVCVOL8/***@public.gmane.org>
W: it.ucmerced.edu<http://it.ucmerced.edu/>
C R
2014-08-29 13:10:14 UTC
Permalink
Hi John,

We run IDP on RHEL 6.5 using java-1.7.0-openjdk-devel. Very stable so far.

I have only experienced a tomcat6 "hang" once. The IdP kept working, but
behaved strangely. "Active" logs stayed empty while the messages arrived in
the log files with the date of the day before. Most logins worked as
expected, but a few failed. While debugging, a HUP signal caused the tomcat
process to hang. Tomcat had to be restarted to fix the problem.

Claudio
We’ve migrated our production Shibboleth environment from Solaris 10 to
Redhat 6 and are now experiencing problems with the app becoming
unresponsive every couple of weeks. A Tomcat restart fixes it, but we’d like to
find out what is causing it. Has anyone else experienced issues migrating to
or running on Redhat 6?
Or, does anyone see any potential problems with our setup?
Here is our environment setup on a Redhat VM.
Redhat Linux version: 2.6.32-431.20.3.el6.x86_64
Shibboleth Idp 2.4
Tomcat 6.0.24
JAVA_OPTS=" -Xmx1024M -XX:MaxPermSize=512M -server
-Djava.library.path=/usr/lib64 -Djavax.net.ssl.trustStore=/jdk/cacerts"
java version "1.7.0_55"
OpenJDK Runtime Environment (rhel-2.4.7.1.el6_5-x86_64 u55-b13)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
Thanks,
*John Kamminga*
Web Application Development
Information Technology Department
University of California, Merced
T: 209.228.2965
W: it.ucmerced.edu
Matthew Slowe
2014-08-29 14:29:55 UTC
Permalink
Post by John Kamminga
We've migrated our production Shibboleth environment from Solaris 10 to
Redhat 6 and are now experiencing problems with the app becoming
unresponsive every couple of weeks. A Tomcat restart fixes it, but we'd like
to find out what is causing it. Has anyone else experienced issues
migrating to or running on Redhat 6?
Or, does anyone see any potential problems with our setup?
Here is our environment setup on a Redhat VM.
Redhat Linux version: 2.6.32-431.20.3.el6.x86_64
Shibboleth Idp 2.4
Tomcat 6.0.24
JAVA_OPTS=" -Xmx1024M -XX:MaxPermSize=512M -server
-Djava.library.path=/usr/lib64 -Djavax.net.ssl.trustStore=/jdk/cacerts"
java version "1.7.0_55"
OpenJDK Runtime Environment (rhel-2.4.7.1.el6_5-x86_64 u55-b13)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
First, I'm going to refer to a thread on the JISC-SHIBBOLETH mailing list
from last year on the subject (no sign-in required):

https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1310&L=jisc-shibboleth&F=&S=&P=60

We have three IdPs in a very similar setup to yours (3 RHEL VMs, each
2 CPU / 4 GB, on VMware), running 1.7.0_25 at the time (now _55),
each servicing up to 330,000 authentications per day.

Anywhere from a few days to a week or two after startup, the JVM will go
into a weird state and stop responding to practically anything. It
appears to get stuck doing some massive Garbage Collect which we've not
been able to tune out (which is what that thread is about).

Having sunk days of time into it, we bailed and scheduled rolling
overnight tomcat restarts :-(

Take a look at the GC logs (which you may need to turn on) to see if
you're hitting long GCs (hint, not recommendation):

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC -XX:+PrintGCApplicationStoppedTime
-Xloggc:/var/log/tomcat6/gc.log
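
If it helps, here's a rough sketch of how that can be wired in and checked.
The /etc/sysconfig/tomcat6 path is an assumption (it's where the stock RHEL
tomcat6 package reads JAVA_OPTS from); adjust to wherever you set yours:

# append to JAVA_OPTS in /etc/sysconfig/tomcat6 (path assumed)
JAVA_OPTS="${JAVA_OPTS} -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/tomcat6/gc.log"

# once the log exists, pull out the longest stop-the-world pauses
grep "Total time for which application threads were stopped" /var/log/tomcat6/gc.log \
  | awk '{print $(NF-1)}' | sort -rn | head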

Good luck!
--
Matthew Slowe
Server Infrastructure Team e: m.slowe-***@public.gmane.org
IS, University of Kent t: +44 (0)1227 824265
Canterbury, UK w: www.kent.ac.uk
Cantor, Scott
2014-08-29 14:40:46 UTC
Permalink
Post by Matthew Slowe
First I'm going to refer to a thread on the JISC-SHIBBOLETH mailing list
last year on the subject (no signin required)
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1310&L=jisc-shibboleth&F
=&S=&P=60
We have three IDPs running in a very similar setup to yourselves (3 RHEL
VMs (each are 2cpu, 4G) on VMware) running, then, 1.7.0_25 (now _55)
each servicing up to 330,000 authentications per day.
And that's Oracle's JVM. So that points the finger at Tomcat. Jetty is a
much better choice, so that would be my next suggestion. In fact, I have
never run 2.x under Tomcat, only 1.3. I ran for a couple of years
on Jetty 7.5 and now I'm on 9.1 (for about a month, so far no issues and
classes started this week).

I can't directly compare, as I'm on Red Hat 5, not 6. But we have had no
issues under comparable loads to that, slightly lower.

I'm also not on VMs, and that's another red flag for me. I don't know if
the OP is.

The other thing I'd point to is whether authentication might be involved.
The IdP can appear to hang if that's blocking.

-- Scott
David Mansfield
2014-08-29 14:47:53 UTC
Permalink
Post by Cantor, Scott
Post by Matthew Slowe
First I'm going to refer to a thread on the JISC-SHIBBOLETH mailing list
last year on the subject (no signin required)
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1310&L=jisc-shibboleth&F
=&S=&P=60
We have three IDPs running in a very similar setup to yourselves (3 RHEL
VMs (each are 2cpu, 4G) on VMware) running, then, 1.7.0_25 (now _55)
each servicing up to 330,000 authentications per day.
And that's Oracle's JVM. So that points the finger at Tomcat. Jetty is a
much better choice, so that would be my next suggestion. I actually have
never run 2.x under Tomcat in fact, only 1.3. I ran for a couple of years
on Jetty 7.5 and now I'm on 9.1 (for about a month, so far no issues and
classes started this week).
I can't directly compare, as I'm on Red Hat 5, not 6. But we have had no
issues under comparable loads to that, slightly lower.
I'm also not on VMs, and that's another red flag for me. I don't know if
the OP is.
I have dozens of CentOS 6 machines running in VMs and I've never had a
single issue that was caused by the VM so far. Not to rule it out, but
with java being java, it's more likely something higher in the software
stack.
Post by Cantor, Scott
The other thing I'd point to is whether authentication might be involved.
The IdP can appear to hang if that's blocking.
A 'kill -QUIT' will dump a thread trace to the log (usually) and then
you can determine if there are hanging threads, or a deadlock on a
resource that may be limited and obtained recursively or in an incorrect
order, etc. For me, with tomcat, it has usually been the case that a
combination of the logger locks and the database connection pooling
causes these kinds of deadlocks (when one thread has the potential of
obtaining more than one db connection for example).
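
Roughly like this (a sketch only; the pid file path is an assumption based on
what the stock RHEL 6 tomcat6 init script uses):

# SIGQUIT makes the JVM print a full thread dump to catalina.out
kill -QUIT "$(cat /var/run/tomcat6.pid)"

# or, if a JDK is installed, capture it to a file instead
jstack -l "$(cat /var/run/tomcat6.pid)" > /tmp/tomcat-threads.txt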

David
Post by Cantor, Scott
-- Scott
Cantor, Scott
2014-08-29 15:21:46 UTC
Permalink
Post by David Mansfield
I have dozens of centos6 machines running in VM and I've never had a
single issue that was caused by the VM so far. Not to rule it out, but
with java being java, it's more likely something higher in the software
stack.
This depends enormously on your VM infrastructure. OSU has a history of
absolutely disastrous VM environments that have done nothing but cause
problems. So it really depends on your situation.

-- Scott
Brian Koehmstedt
2014-08-29 15:40:57 UTC
Permalink
I work with John Kamminga, the original poster. I'm out of the office
right now, so the team is looking into it in my absence, but I've taken
a peek while I've been out. I don't have all the details yet, but
I do believe this is a memory problem as Matthew has suggested and
observed at his location. From Matthew's description, it sounds like we
may be hitting the same problem. Even the timeline is right. (He said
every couple of weeks, which is about what we're seeing.)

In a previous "hang" a few weeks ago (not the latest one John is
describing), I noticed an Out of Memory error in the log file. John
should check for this in the latest hang-up logs, but I am definitely
suspecting one of the following:
- A memory leak
- An unexplained GC problem, as Matthew said. (Although the GCs of JVMs
should be so thoroughly tested and rock solid that I doubt it is a JVM
GC bug. A standard memory leak is much more likely.)
- The JVM just flat out running out of memory due to the growing InCommon
metadata file, but it seems like -Xmx1024M should be sufficient even given
the current size of the metadata file. Matthew, I'd be curious to know
what you had your -Xmx parameter set at when you were experiencing the
hang-ups.

I've already begun taking heap dumps and analyzing them with jhat.
Analyzing the heap isn't always straightforward, but there is a
"tremendous" amount of char[], String, HashMapEntry, and various XML
objects in the heap. I put "tremendous" in quotes because I don't yet
know if it's a normal amount or abnormal amount. You can't tell just by
looking at a heap. Most of these objects look related to storing data
from the InCommon metadata file. Since this file is growing quite big,
the data in the heap could be normal, in which case -Xmx1024M is no
longer sufficient?

One thing I was definitely meaning to do when I got back was add
-XX:+HeapDumpOnOutOfMemoryError.
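
For reference, roughly the commands involved (a sketch; the pgrep pattern and
file paths are just examples, not our actual setup):

# find the Tomcat JVM and dump only live (strongly reachable) objects
PID=$(pgrep -f org.apache.catalina.startup.Bootstrap)
jmap -dump:live,format=b,file=/tmp/idp-heap.hprof "$PID"

# browse the dump at http://localhost:7000/
jhat -J-Xmx2g /tmp/idp-heap.hprof
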
Post by Matthew Slowe
Post by John Kamminga
We've migrated our production Shibboleth environment from Solaris 10 to
Redhat 6 and are now experiencing problems with the app becoming
unresponsive every couple of weeks. A Tomcat restart fixes it, but we'd like
to find out what is causing it. Has anyone else experienced issues
migrating to or running on Redhat 6?
Or, does anyone see any potential problems with our setup?
Here is our environment setup on a Redhat VM.
Redhat Linux version: 2.6.32-431.20.3.el6.x86_64
Shibboleth Idp 2.4
Tomcat 6.0.24
JAVA_OPTS=" -Xmx1024M -XX:MaxPermSize=512M -server
-Djava.library.path=/usr/lib64 -Djavax.net.ssl.trustStore=/jdk/cacerts"
java version "1.7.0_55"
OpenJDK Runtime Environment (rhel-2.4.7.1.el6_5-x86_64 u55-b13)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
First I'm going to refer to a thread on the JISC-SHIBBOLETH mailing list
last year on the subject (no signin required)
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1310&L=jisc-shibboleth&F=&S=&P=60
We have three IDPs running in a very similar setup to yourselves (3 RHEL
VMs (each are 2cpu, 4G) on VMware) running, then, 1.7.0_25 (now _55)
each servicing up to 330,000 authentications per day.
Anywhere from a few days to a week or two after startup, the JVM will go
into a weird state and stop responding to practically anything. It
appears to get stuck doing some massive Garbage Collect which we've not
been able to tune out (which is what that thread is about).
Having sunk days of time into it, we bailed and scheduled rolling
overnight tomcat restarts :-(
Take a look at the GC logs (which you may need to turn on) to see if
you're hitting long GCs (hint, not recommendation):
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC -XX:+PrintGCApplicationStoppedTime
-Xloggc:/var/log/tomcat6/gc.log
Good luck!
Cantor, Scott
2014-08-29 17:16:03 UTC
Permalink
Post by Brian Koehmstedt
In a previous "hang" a few weeks ago (not the latest one John is
describing), I noticed an Out of Memory error in the log file. John
should check for this in the latest hang-up logs, but I am definitely
- A memory leak
If you have a leak, it's in some component you've added to the system. A
driver or what not.
Post by Brian Koehmstedt
- The JVM just flat out running out of memory due to the growing InCommon
metadata file, but it seems like -Xmx1024M should be sufficient even given
the current size of the metadata file. Matthew, I'd be curious to know
what you had your -Xmx parameter set at when
you were experiencing the hang-ups.
That isn't the cause. It's possible you're handling too much traffic and
caching attribute resolution results for too long, but that's a choice.
Otherwise, you shouldn't have any trouble with heap.
Post by Brian Koehmstedt
I've already begun taking heap dumps and analyzing them with jhat.
Analyzing the heap isn't always straightforward, but there is a
"tremendous" amount of char[], String, HashMapEntry, and various XML
objects in the heap. I put "tremendous" in quotes because
I don't yet know if it's a normal amount or abnormal amount. You can't
tell just by looking at a heap. Most of these objects look related to
storing data from the InCommon metadata file. Since this file is growing
quite big, the data in the heap could be normal, in which case -Xmx1024M
is no longer sufficient?
It's sufficient. Those objects are dropped after processing the file, they
don't stay in use.

-- Scott
John Kamminga
2014-08-29 18:10:02 UTC
Permalink
Thank you everyone for the responses. (very much!)

Scott, our traffic has increased and I'd like to explore the 'caching attribute resolution results for too long'. Where is this configured? In multiple places? And what is the default that ships with the Shibboleth IdP?

Unfortunately we are still running one node. With the increase in traffic we want to add another node, but we need to prioritize that project against the many other projects :(

Thanks,
John Kamminga
Web Application Development
Information Technology Department
University of California, Merced
T: 209.228.2965
E: jkamminga-DHU18zts72H2fBVCVOL8/***@public.gmane.org
W: it.ucmerced.edu


Cantor, Scott
2014-08-29 18:40:06 UTC
Permalink
Post by John Kamminga
Scott, our traffic has increased and I'd like to explore the 'caching
attribute resolution results for too long'. Where is this configured? In
multiple places? And, what is the default that is shipped with Shibboleth
Idp?
https://wiki.shibboleth.net/confluence/display/SHIB2/ResolverRDBMSDataConnector
https://wiki.shibboleth.net/confluence/display/SHIB2/ResolverLDAPDataConnector

There are no defaults, none of the connectors in the example file matter
much to anybody running a real IdP.

If it's true that permgen space comes out of the same heap maximum, that's
news to me, but I run with
-Xmx768m -XX:MaxPermSize=512m and have never had a problem. I cache
connector results for a few minutes.

If you ever get an OutOfMemory error in the log, you don't need to waste
time looking for other explanations.
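
(Quick check, assuming default log locations for the RHEL tomcat6 package and
an IdP installed under /opt/shibboleth-idp; adjust the paths to your install:)

grep -i OutOfMemoryError /var/log/tomcat6/catalina.out /opt/shibboleth-idp/logs/idp-process.log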

-- Scott
Brian Koehmstedt
2014-08-29 18:58:42 UTC
Permalink
I am almost positive that PermGen does NOT come out of the heap.
Actually, I can confirm this as I am looking at JVM stats now that are
showing me perm is indeed separate.

John, based on the documentation, I believe caching is off by default,
and we have not added any of the caching options, so I believe that
attribute resolver caching is not in play here and thus not at fault.
Post by Cantor, Scott
Post by John Kamminga
Scott, our traffic has increased and I'd like to explore the 'caching
attribute resolution results for too long'. Where is this configured? In
multiple places? And, what is the default that is shipped with Shibboleth
Idp?
https://wiki.shibboleth.net/confluence/display/SHIB2/ResolverRDBMSDataConnector
https://wiki.shibboleth.net/confluence/display/SHIB2/ResolverLDAPDataConnector
There are no defaults, none of the connectors in the example file matter
much to anybody running a real IdP.
If it's true that permgen space comes out of the same heap maximum, that's
news to me, but I run with
-Xmx768m -XX:MaxPermSize=512m and have never had a problem. I cache
connector results for a few minutes.
If you ever get an OutOfMemory error in the log, you don't need to waste
time looking for other explanations.
-- Scott
Cantor, Scott
2014-08-29 19:13:39 UTC
Permalink
Post by Brian Koehmstedt
I am almost positive that PermGen does NOT come out out of heap.
Actually, I can confirm this as I am looking at JVM stats now that are
showing me perm is indeed separate.
I certainly thought so. Anyway, in a nutshell, that means 768M is enough to
handle 200,000+ logins (that's not typical but we've hit it at times) with
some caching, loading both InCommon's metadata and a file of 250+ entities
locally, plus a couple more non-trivial sized files.

So if you're getting heap errors with 1G, something else has been added to
the JVM or Tomcat is a disaster. I am readily able to believe the latter.

-- Scott
Nguyen, Thai
2014-08-29 19:17:31 UTC
Permalink
All right, I was wrong about PermGen coming out of the heap.

Wow, 200,000+

Nguyen, Thai
Post by Cantor, Scott
Post by Brian Koehmstedt
I am almost positive that PermGen does NOT come out out of heap.
Actually, I can confirm this as I am looking at JVM stats now that are
showing me perm is indeed separate.
I certainly thought so. Anyway, nutshell, that means 768M is enough to
handle 200,000+ logins (that's not typical but we've hit it at times) with
some caching, loading both InCommon's metadata and a file of 250+ entities
locally, plus a couple more non-trivial sized files.
So if you're getting heap errors with 1G, something else has been added to
the JVM or Tomcat is a disaster. I am readily able to believe the latter.
-- Scott
Brian Koehmstedt
2014-08-29 19:21:11 UTC
Permalink
Post by Cantor, Scott
Post by Brian Koehmstedt
I am almost positive that PermGen does NOT come out out of heap.
Actually, I can confirm this as I am looking at JVM stats now that are
showing me perm is indeed separate.
I certainly thought so. Anyway, nutshell, that means 768M is enough to
handle 200,000+ logins (that's not typical but we've hit it at times) with
some caching, loading both InCommon's metadata and a file of 250+ entities
locally, plus a couple more non-trivial sized files.
So if you're getting heap errors with 1G, something else has been added to
the JVM or Tomcat is a disaster. I am readily able to believe the latter.
I'm willing to believe there's a RedHat/Tomcat issue here, and I'm
hoping to eventually track it down with GC stats and heap dumps.

I will say there is a small amount of "custom" Java code called as
Script from an attribute-resolver to generate targeted IDs on a
per-service basis. I immediately jumped to this as a possibility, and
I've code-reviewed it and see no evidence of caching. i.e., I was
thinking maybe I was caching the targetedIds in a hash table, but as I
review this code, I indeed do not do any caching of these IDs.
(targetedId is essentially a salted hash of primary identifier and
service id).

But, usually an author reviewing his own code is a bad idea, so
John, you're welcome to look at it too. :)
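
For the curious, the idea is roughly the following. This is only an
illustration of the scheme, not our actual resolver script; the salt,
principal, entityID, separator, and field order are made up for the example:

# illustrative only: targetedId ~ base64(SHA-1(spEntityID + "!" + principal + "!" + salt))
salt='some-secret-salt'                           # example value
principal='jdoe'                                  # example value
sp_entity_id='https://sp.example.org/shibboleth'  # example value
printf '%s!%s!%s' "$sp_entity_id" "$principal" "$salt" \
  | openssl dgst -sha1 -binary | openssl base64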

At any rate, I am going to add these JVM options to log GC and get heap
dumps when OutOfMemoryErrors occur:

JAVA_OPTS="${JAVA_OPTS} -Xmx1024M -XX:MaxPermSize=512M -server -Djava.library.path=/usr/lib64 -Djavax.net.ssl.trustStore=/jdk/cacerts -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath="${LOG_DIR}/java_pid<pid>.hprof" -XX:ErrorFile="${LOG_DIR}/hs_err_pid<pid>.log" -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:${LOG_DIR}/java_gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=7 -XX:GCLogFileSize=1M -XX:+PrintCommandLineFlags -verbose:gc"
Cantor, Scott
2014-08-29 19:25:55 UTC
Permalink
Post by Brian Koehmstedt
I will say there is a small amount of "custom" Java code called as
Script from an attribute-resolver to generate targeted IDs on a
per-service basis. I immediately jumped to this as a possibility, and
I've code-reviewed it and see no evidence of caching. i.e., I was
thinking maybe I was caching the targetedIds in a hash table, but as I
review this code, I do not indeed do any caching of these IDs.
(targetedId is essentially a salted hash of primary identifier and
service id).
Hash-based generation like that is built-in to the IdP, so I'm not sure
why a script would be needed, but FWIW I use a few scripts, maybe 3-4 of
them run on any given login. So the script engine itself isn't leaking.

-- Scott
Tompkins,Charles R
2014-08-29 17:27:27 UTC
Permalink
I'm running Shib 2.4 in Tomcat 6 on RHEL5 in VMware with no problems. I'm
using Oracle Java (available from RH) and the latest Tomcat 6 source, not
the RH version. We rebuild once a week for new SPs and have never
(*knocks on wood*) had a hang.

We also allocate 8GB RAM per VM and a larger footprint for the JVM. We load
InCommon and almost 900 other pieces of metadata at a go.

export JAVA_OPTS=""
export JAVA_OPTS="$JAVA_OPTS -server -d64 -XX:+PrintCommandLineFlags"

# only create a huge JVM if the operation is 'start'
if [[ "$1" == 'start' ]]; then
export JAVA_OPTS="$JAVA_OPTS -XX:+UseParallelOldGC
-XX:MaxGCPauseMillis=5000"
export JAVA_OPTS="$JAVA_OPTS -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:-TraceClassUnloading"

export JAVA_OPTS="$JAVA_OPTS -Xmx6144m -Xms4096m"
export JAVA_OPTS="$JAVA_OPTS -XX:MaxNewSize=512m -XX:NewSize=256m"
export JAVA_OPTS="$JAVA_OPTS -XX:MaxPermSize=2048m -XX:PermSize=512m"

# was "-XX:+CMSPermGenSweepingEnabled"
export JAVA_OPTS="$JAVA_OPTS -XX:+CMSClassUnloadingEnabled"
fi

As also mentioned, the environment needs to be solid. I love my
infrastructure crew! <3 <3

Regards,
-Charles


Nguyen, Thai
2014-08-29 18:23:50 UTC
Permalink
Hi John,

I have two suggestions here:
1. I saw a couple of replies suggesting staying away from OpenJDK. I don’t have anything to say about OpenJDK since I don’t use it. However, I don’t have any problems with Oracle JDK, so I also suggest you use Oracle JDK.
2. Your heap size is only 1G and you already allocate half of it to the permanent generation, which leaves only 512M for the rest (young + old generations). That may be causing the problem. If your OS has more memory I would suggest allocating at least 2G for the heap and keeping the permanent generation at just 128M. Only if you hit a PermGen OutOfMemoryError do you need to increase the permanent generation.
Also make sure that your Tomcat has endorsed the JAR files mentioned in the Shibboleth 2.4 documentation.

Here is my environment:
10366 Bootstrap -Djava.util.logging.config.file=/opt/apache-tomcat-7.0.54/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Xms2048m -Xmx2048m -XX:MaxPermSize=128m -Djava.endorsed.dirs=/opt/apache-tomcat-7.0.54/endorsed -Dcatalina.base=/opt/apache-tomcat-7.0.54 -Dcatalina.home=/opt/apache-tomcat-7.0.54 -Djava.io.tmpdir=/opt/apache-tomcat-7.0.54/temp

java version "1.7.0_60"
Java(TM) SE Runtime Environment (build 1.7.0_60-b19)
Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)

On a busy day our single IdP running on a VM serves (output from loganalysis.py):
5620 unique userids
10455 logins

I have no idea how many users your server handles.

Nguyen, Thai
We’ve migrated our production Shibboleth environment from Solaris 10 to Redhat 6 and are now experiencing problems with the app becoming unresponsive every couple of weeks. A Tomcat restart fixes it, but we’d like to find out what is causing it. Has anyone else experienced issues migrating to or running on Redhat 6?
Or, does anyone see any potential problems with our setup?
Here is our environment setup on a Redhat VM.
Redhat Linux version: 2.6.32-431.20.3.el6.x86_64
Shibboleth Idp 2.4
Tomcat 6.0.24
JAVA_OPTS=" -Xmx1024M -XX:MaxPermSize=512M -server -Djava.library.path=/usr/lib64 -Djavax.net.ssl.trustStore=/jdk/cacerts"
java version "1.7.0_55"
OpenJDK Runtime Environment (rhel-2.4.7.1.el6_5-x86_64 u55-b13)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
Thanks,
John Kamminga
Web Application Development
Information Technology Department
University of California, Merced
T: 209.228.2965
W: it.ucmerced.edu