WebSphere reports hung threads for long-running batch processes
I was looking through the WAS logs when I saw this warning:
WSVR0605W: Thread “WebContainer : 2” (0000005a) has been active for 7305026 milliseconds and may be hung. There is/are 1 thread(s) in total in the server that may be hung.
I dug deeper into the stack trace and found that this particular process is a batch function that may run for 24 hours without returning (a batch cleaning job). Well, WebSphere has a thread monitoring policy that considers a thread hung once it has been active longer than a predefined threshold (600 seconds by default). If the thread finishes its work after that threshold has passed, WAS reports a false alarm (message WSVR0606W) as a kind of apology, and once enough false alarms have occurred (100 by default) it increases the threshold to 1.5 times its previous value.
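To get a feel for how slowly that adjustment moves, here is a minimal sketch of the rule described above. It is a simplified model and not IBM's implementation: the function names are hypothetical, and it assumes the threshold grows by a factor of 1.5 after every full batch of 100 false alarms, starting from the 600-second default.

```python
def threshold_after(false_alarms, initial_s=600.0, alarms_per_step=100, factor=1.5):
    """Model the hang threshold (seconds) after a number of false alarms.

    Assumption: one adjustment per full batch of `alarms_per_step` alarms.
    """
    threshold = initial_s
    for _ in range(false_alarms // alarms_per_step):
        threshold *= factor
    return threshold


def adjustments_needed(target_s, initial_s=600.0, factor=1.5):
    """Count the adjustments needed before the threshold covers `target_s`."""
    threshold = initial_s
    steps = 0
    while threshold < target_s:
        threshold *= factor
        steps += 1
    return steps


if __name__ == "__main__":
    print(threshold_after(99))    # still the default
    print(threshold_after(100))   # one adjustment applied
    # A job of about five hours (18000 seconds):
    print(adjustments_needed(18000))
```

Under this model, a job of about five hours would only stop being flagged after nine adjustments, that is, around 900 false alarms, so a handful of long batch runs would not be expected to move the threshold at all.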
Hmm… I don't see this behavior in the log files, even though the process completed after 5 hours. Moreover, the application reported an error (channel call failure).
A quick way to avoid this warning is to disable thread hang monitoring by setting the com.ibm.websphere.threadmonitor.interval property to zero or less. This may not be a good choice for many applications, as it may hide bigger problems.
The safer way is to calculate the maximum time taken by any process in your application and set com.ibm.websphere.threadmonitor.threshold (in seconds) to a suitable value. In my application that is not really possible: I'm not the full owner of the database, and I found that the billing system sometimes takes up all available access to the database for many days.
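As a rough starting point for picking that value, the sketch below scans log lines for WSVR0605W warnings, takes the longest reported active time, and adds a safety margin. The function name and the 1.5 margin are my own assumptions, not anything WebSphere prescribes. Note also that a WSVR0605W line reports how long the thread had been active when the warning was written, so the figure is a lower bound on the real duration.

```python
import math
import re

# Matches the active time in a WSVR0605W warning, in milliseconds.
ACTIVE_MS = re.compile(r"WSVR0605W: .*? has been active for (\d+) milliseconds")


def suggest_threshold_seconds(log_lines, margin=1.5):
    """Suggest a threshold (seconds) from the longest reported active time.

    `margin` is a hypothetical safety factor; returns None when no
    WSVR0605W line is found.
    """
    durations_ms = []
    for line in log_lines:
        match = ACTIVE_MS.search(line)
        if match:
            durations_ms.append(int(match.group(1)))
    if not durations_ms:
        return None
    return math.ceil(max(durations_ms) / 1000 * margin)


if __name__ == "__main__":
    sample = [
        'WSVR0605W: Thread "WebContainer : 2" (0000005a) has been active '
        "for 7305026 milliseconds and may be hung.",
    ]
    print(suggest_threshold_seconds(sample))
```

This only helps when the job durations are bounded. When an external system can stall the job for days, as in my case, no value derived from past logs will be reliably large enough.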