The necessity of the right tool for checking Coherence cluster configuration

I ran into quite an interesting problem after upgrading Coherence in my Oracle Access Manager cluster.

The issue appeared right after the upgrade: all log files of the OAM cluster were full of messages like:


<2015-08-28 12:37:09.878/1513.280 Oracle Coherence GE 3.7.1.13 (thread=Cluster, member=n/a): Delaying formation of a new cluster; IpMonitor failed to verify the reachability of senior Member(Id=1, Timestamp=2015-08-26 13:38:36.265, Address=xxx.xxx.xxx.xxx:9095, MachineId=3353, Location=site:,machine:oam1,process:12886, Role=WeblogicServer); if this persists it is likely the result of a local or remote firewall rule blocking either ICMP pings, or connections to TCP port 7>

At first glance this is a well-known issue related to some kind of network problem, as described in Doc ID 1530288.1 “IpMonitor Failed To Verify The Reachability Of Senior Member”. The only problem was that I was absolutely sure there was no firewall between the Coherence instances and that all hosts in the cluster were reachable. Of course I ran all the recommended tests: the Java ping test from Doc ID 1936105.1, the multicast test from Doc ID 1936452.1, and the datagram test from Doc ID 1936575.1.

I even wrote a small program to check whether the method InetAddress.isReachable() works (as described in “What Is The Purpose Of IpMonitor In A Coherence Cluster?” (Doc ID 1526745.1)).

It’s quite simple:

import java.net.InetAddress;

public class alive {
    public static void main(String[] args) {
        try {
            // Resolve the remote host and probe it with a 10-second timeout
            InetAddress ia = InetAddress.getByName("oam2");
            boolean b = ia.isReachable(10000);
            if (b) {
                System.out.println("Reachable");
            } else {
                System.out.println("Unreachable");
            }
        } catch (Exception e) {
            System.out.println("Exception: " + e.getMessage());
        }
    }
}

I tried to check the availability of host oam2 from host oam1, and it worked.


[oracle@oam2 ~]$ /opt/oracle/jrockit/bin/java alive
Reachable

I could also see the ICMP packets in tcpdump:


12:32:40.656815 IP (tos 0x0, ttl 64, id 16265, offset 0, flags [DF], proto ICMP (1), length 72)
oam1 > oam2: ICMP echo request, id 25757, seq 1, length 52
12:32:40.656942 IP (tos 0x60, ttl 64, id 49076, offset 0, flags [none], proto ICMP (1), length 72)
oam2 > oam1: ICMP echo reply, id 25757, seq 1, length 52

What was particularly strange was that I couldn’t see any such network packets (neither ICMP nor TCP port 7) from the running oam1 server, even though its out file was full of “IpMonitor failed to verify the reachability of senior Member” error messages.

So OAM_Server did not seem to be sending any IpMonitor packets at all (there was plenty of traffic between the servers on other ports).

After some investigation I found a couple of interesting lines in the strace output:


27348 socket(PF_INET, SOCK_RAW, IPPROTO_ICMP) = -1 EPERM (Operation not permitted)
27348 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 665
27348 bind(665, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.128.20.100")}, 16) = 0
27348 connect(665, {sa_family=AF_INET, sin_port=htons(7), sin_addr=inet_addr("xxx.xxx.xxx.xxx")}, 16

Here you can see that the program tried to send packets to a public address from a private address. At that moment it became clear that the cause was our specific network configuration.

But first, let’s reproduce the issue. My code was wrong:


InetAddress ia = InetAddress.getByName("oam2");
boolean b = ia.isReachable(10000);

And so were all the Coherence tests (such as the Java ping test).
This is not how IpMonitor calls the isReachable method. In Coherence 3.7.1.19 it selects the network interface for sending the ping packet:


InetAddress ia = InetAddress.getByName("oam2");
NetworkInterface netIntfc = NetworkInterface.getByInetAddress(InetAddress.getByName("oam1"));
boolean b = ia.isReachable(netIntfc, 0, 10000);
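
For anyone who wants to repeat this check, the fragment can be turned into a complete program. This is only a minimal sketch, not Coherence’s actual code: the class name AliveViaInterface and the helper reachableVia are made up for illustration, and the host names oam1 and oam2 are simply the ones from this post, used as defaults when no arguments are given.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.NetworkInterface;

// Hypothetical test class: probes a remote host through the interface
// that owns a given local address, using the three-argument isReachable.
class AliveViaInterface {

    static boolean reachableVia(InetAddress target, InetAddress localAddr, int timeoutMs)
            throws IOException {
        // Find the interface that carries the local address
        NetworkInterface netIf = NetworkInterface.getByInetAddress(localAddr);
        if (netIf == null) {
            throw new IllegalArgumentException("No interface owns " + localAddr);
        }
        // ttl = 0 means "use the default TTL"
        return target.isReachable(netIf, 0, timeoutMs);
    }

    public static void main(String[] args) throws Exception {
        // Assumed defaults: the host names used in this post
        String local = args.length > 0 ? args[0] : "oam1";
        String remote = args.length > 1 ? args[1] : "oam2";
        boolean ok = reachableVia(InetAddress.getByName(remote),
                                  InetAddress.getByName(local), 10000);
        System.out.println(ok ? "Reachable" : "Unreachable");
    }
}
```

Run it on the local host under the same user as the server process, so that the same fallback from ICMP to TCP port 7 applies.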

Let’s look at the JDK source code, namely the isReachable implementation behind InetAddress.
If we call isReachable and pass a NetworkInterface parameter (as Coherence does), it takes the first IPv4 address found on the interface and uses it as the source address.


cat ./java/net/Inet4AddressImpl.java
class Inet4AddressImpl implements InetAddressImpl {
...
    private native boolean isReachable0(byte[] addr, int timeout, byte[] ifaddr, int ttl) throws IOException;
...
    public boolean isReachable(InetAddress addr, int timeout, NetworkInterface netif, int ttl) throws IOException {
        byte[] ifaddr = null;
        if (netif != null) {
            /*
             * Let's make sure we use an address of the proper family
             */
            java.util.Enumeration<InetAddress> it = netif.getInetAddresses();
            InetAddress inetaddr = null;
            while (!(inetaddr instanceof Inet4Address) &&
                   it.hasMoreElements())
                inetaddr = it.nextElement();
            if (inetaddr instanceof Inet4Address)
                ifaddr = inetaddr.getAddress();
        }
        return isReachable0(addr.getAddress(), timeout, ifaddr, ttl);
    }

It works on Windows (where an interface can have addresses from only one network), but the Linux kernel has source routing. We have two addresses on one interface: a public one and a private one. The JDK took the first address it found (I checked, the behavior is the same in the latest JDK, 8u60), which in our case was the private address, and tried to send packets to the public network. Such packets are normally dropped by the kernel (see Reverse Path Filtering, http://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.kernel.rpf.html) because they are meaningless. Programs should either not set a source address at all or use the right public address for sending packets.
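
To see which address the JDK would pick on a particular host, something like the following can help. It is a rough sketch under my own naming (SourceAddressCheck and firstIpv4 are hypothetical): the helper repeats the “first IPv4 address in enumeration order” selection from the JDK loop quoted above, and main prints the result for every interface.

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical diagnostic: shows, per interface, the address that the
// "first IPv4 wins" rule would use as the source address.
class SourceAddressCheck {

    // Same selection as the JDK loop: first IPv4 address in enumeration order, or null
    static InetAddress firstIpv4(Enumeration<InetAddress> addrs) {
        while (addrs.hasMoreElements()) {
            InetAddress a = addrs.nextElement();
            if (a instanceof Inet4Address) {
                return a;
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Enumeration<NetworkInterface> nifs = NetworkInterface.getNetworkInterfaces();
        if (nifs == null) {
            System.out.println("No network interfaces found");
            return;
        }
        for (NetworkInterface nif : Collections.list(nifs)) {
            List<InetAddress> all = Collections.list(nif.getInetAddresses());
            InetAddress picked = firstIpv4(Collections.enumeration(all));
            System.out.println(nif.getName() + " addresses=" + all + " picked=" + picked);
        }
    }
}
```

If the picked address on the cluster interface is not in the same network as the senior member’s address, the probe is likely to hit the problem described here.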

As a workaround I added a rule that rewrites the source address of those packets.

I’m not sure whether it’s a bug or a feature, and I think our case is quite unique. But it reminded me once again of a simple truth: if you test something, you should test it the same way it is done in production.

P.S. During the investigation I realized that OAM Coherence is not synonymous with Oracle Coherence (see Doc ID 1579874.1 “Questions about OAM 11g Coherence Configuration”) 🙂 It has its own settings in oam-config.xml, and they differ from the Oracle Coherence settings.

P.P.S. Java could send ICMP requests instead of TCP packets. You can read about it on Stack Overflow or Stack Exchange. We don’t use it.


Upgrade Identity Management

It was spring when we upgraded our IDM system for EBS authentication. A few months have passed and everything seems to be working fine, so I’d like to share some brief notes about the issues we met and the final configuration.

We have a classic three-node identity management configuration, and we’re now running on:
OEL6 with UEK R3 kernel
Oracle Database 12.1.0.2 + PSU3
Java 1.7_75
Weblogic 10.3.6.11
OAM 11.1.2.2 + PSU5
Webgate 11.1.2.2
OID/OVD 11.1.1.7 + PSU3
AccessGate 1.2.3.4

The most difficult question (for me) was where to find the upgrade documentation. Here is what I collected:
Database upgrade guide
Oracle® Fusion Middleware Upgrade Guide for Oracle Identity and Access Management 11g Release 2 (11.1.2.2.0)
Patch Management of an Oracle Identity Management Deployment
OAM Bundle Patch Release History [ID 736372.1]
How to Upgrade OID/ OVD 11.1.1.5 To 11.1.1.7 (IDM PatchSet 6) (Doc ID 1962045.1)
Considerations When Applying Patch Sets to FMW 11g Release 1 Identity Management (Doc ID 1298815.1)
Master Note on Fusion Middleware Proactive Patching – Patch Set Updates (PSUs) and Bundle Patches (BPs) (Doc ID 1494151.1)
Oracle Internet Directory (OID) Version 11.1.1.7 Bundle Patches For Non-Fusion Applications Customers (Doc ID 1614114.1)
Integrating Oracle E-Business Suite Release 12 with Oracle Access Manager 11gR2 (11.1.2) using Oracle E-Business Suite AccessGate [ID 1484024.1]

I cannot say there weren’t any problems at all, but the vast majority of the SRs I opened were about OAM. They were mostly related to misconfigurations in the oam-config.xml file, caused either by the upgrade or by old errors. All of this is specific to our environment and not so interesting.

Everything worked great in the test environment, but on the first day after the upgrade the production system was quite strained:

1. It turned out that the index IDX_JPS_PARENTDN in schema OID_OPSS had been replaced by a composite index: the old non-unique index was extended with the unique key entity_id. As a result the index grew from 300 to 600 MB, and the explain plans changed too. Everything still worked fine for a single client session. But when clients started connecting at the same time, the execution time of search queries increased significantly (because of heavy disk I/O) and “ldapbind” was timing out. I recreated the old index and the majority of the problems went away.

2. Some queries still used the new (composite) index. After some investigation we noticed that the cost of their explain plans was too high. The cause was the new 12c query re-optimization functionality named “SQL Plan Directives”. Igor Usoltsev wrote about it. We simply disabled those SQL plan directives.

3. The size of HTTP GET responses increased significantly. For some browsers (IE11, for example) it exceeded 8K. We terminate SSL/TLS at the proxy, and the proxy (we use nginx) drops large packets by default. Large packets can come from both the client and the server side. For nginx we increased the buffer sizes (there are several settings, one for the client side and others for the server side).

P.S. It is also worth reading this community thread. We haven’t run into Coherence issues yet; our 3.7.1.1 version works fine, but folks report that 3.7.1.19 is more stable.
