The necessity of the right tool for checking Coherence cluster configuration

I ran into quite an interesting problem after upgrading Coherence in my Oracle Access Manager cluster.

The issue appeared right after the upgrade: all log files of the OAM cluster were full of messages like:


<2015-08-28 12:37:09.878/1513.280 Oracle Coherence GE 3.7.1.13 (thread=Cluster, member=n/a): Delaying formation of a new cluster; IpMonitor failed to verify the reachability of senior Member(Id=1, Timestamp=2015-08-26 13:38:36.265, Address=xxx.xxx.xxx.xxx:9095, MachineId=3353, Location=site:,machine:oam1,process:12886, Role=WeblogicServer); if this persists it is likely the result of a local or remote firewall rule blocking either ICMP pings, or connections to TCP port 7>

At first glance, it’s a well-known issue related to some kind of network problem, as described in Doc ID 1530288.1, IpMonitor Failed To Verify The Reachability Of Senior Member. The only problem was that I was absolutely sure there was no firewall between the Coherence instances and that all hosts in the cluster were reachable. Of course, I ran all the recommended tests: the Java ping test from Doc ID 1936105.1, the multicast test from Doc ID 1936452.1, and the datagram test from Doc ID 1936575.1.

I even wrote a small program to check whether the method InetAddress.isReachable() works (as described in What Is The Purpose Of IpMonitor In A Coherence Cluster? (Doc ID 1526745.1)).

It’s quite simple:

import java.net.InetAddress;

public class alive {
    public static void main(String[] args) {
        try {
            // Resolve the peer host and probe it with a 10-second timeout.
            InetAddress ia = InetAddress.getByName("oam2");
            boolean b = ia.isReachable(10000);
            if (b) {
                System.out.println("Reachable");
            } else {
                System.out.println("Unreachable");
            }
        } catch (Exception e) {
            System.out.println("Exception: " + e.getMessage());
        }
    }
}

I tried to check the availability of host oam2 from host oam1, and it worked.


[oracle@oam2 ~]$ /opt/oracle/jrockit/bin/java alive
Reachable

I could also see the ICMP packets in tcpdump:


12:32:40.656815 IP (tos 0x0, ttl 64, id 16265, offset 0, flags [DF], proto ICMP (1), length 72)
oam1 > oam2: ICMP echo request, id 25757, seq 1, length 52
12:32:40.656942 IP (tos 0x60, ttl 64, id 49076, offset 0, flags [none], proto ICMP (1), length 72)
oam2 > oam1: ICMP echo reply, id 25757, seq 1, length 52

What was particularly strange was that I couldn’t see any IpMonitor packets (neither ICMP nor TCP port 7) leaving the running oam1 server, whereas its out file was full of the error message “IpMonitor failed to verify the reachability of senior Member”.

So OAM_Server seemed not to be sending any IpMonitor packets at all, even though there was a lot of traffic between the servers on other ports.

After some investigation I found a couple of interesting lines in the strace output:


27348 socket(PF_INET, SOCK_RAW, IPPROTO_ICMP) = -1 EPERM (Operation not permitted)
27348 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 665
27348 bind(665, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.128.20.100")}, 16) = 0
27348 connect(665, {sa_family=AF_INET, sin_port=htons(7), sin_addr=inet_addr("xxx.xxx.xxx.xxx")}, 16

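Here you can see that the raw ICMP socket was refused (EPERM), so the JVM fell back to a TCP connection to port 7, and that it tried to connect to a public address from a socket bound to a private address. At that moment it became clear that the cause was our specific network configuration.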
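To repeat that bind-then-connect sequence outside of the OAM server, something like the following can be used. It is only a minimal sketch: the class name BoundTcpProbe and the method probe are made up for illustration, and it assumes that binding a plain Socket to the source address before connecting reproduces what the strace shows.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Coherence or the JDK.
final class BoundTcpProbe {

    /** Binds to the given source address first, then connects to target:port. */
    static boolean probe(InetAddress source, InetAddress target, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.bind(new InetSocketAddress(source, 0));   // port 0 = any free local port
            socket.connect(new InetSocketAddress(target, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // "Connection refused" still means the host answered; a timeout means
            // the packets (or the replies) were lost somewhere.
            System.out.println("Probe failed: " + e);
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java BoundTcpProbe <source-address> <target-address>");
            return;
        }
        InetAddress source = InetAddress.getByName(args[0]);
        InetAddress target = InetAddress.getByName(args[1]);
        // Port 7 (echo) is the port seen in the strace output.
        System.out.println(probe(source, target, 7, 10000) ? "Connected" : "Not connected");
    }
}
```

Run it once with the private address and once with the public address as the source; if only the first run times out, the source address is the likely culprit.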

But first, let’s reproduce the issue. My test code was wrong:


InetAddress ia = InetAddress.getByName("oam2");
boolean b = ia.isReachable(10000);

And so were all the Coherence tests (such as the Java ping test).
That is not how IpMonitor calls the isReachable method. In Coherence 3.7.1.19 it selects the network interface to send the ping packet from:


InetAddress ia = InetAddress.getByName("oam2");
NetworkInterface netIntfc = NetworkInterface.getByInetAddress(InetAddress.getByName("oam1"));
boolean b = ia.isReachable(netIntfc, 0, 10000);
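For convenience, here is the same call wrapped into a complete program. Treat it as a sketch of the interface-bound check described above, not as Coherence’s actual code: the class name AliveViaInterface and the method reachableVia are invented for this example, and the two host names come from the command line.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.NetworkInterface;

// Hypothetical test program, not taken from Coherence.
final class AliveViaInterface {

    /** Probes remote through the interface that owns the given local address. */
    static boolean reachableVia(InetAddress local, InetAddress remote, int timeoutMs)
            throws IOException {
        NetworkInterface netIntfc = NetworkInterface.getByInetAddress(local);
        if (netIntfc == null) {
            throw new IllegalArgumentException("No local interface has address " + local);
        }
        // ttl = 0 means "use the default TTL".
        return remote.isReachable(netIntfc, 0, timeoutMs);
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java AliveViaInterface <local-host> <remote-host>");
            return;
        }
        try {
            InetAddress local = InetAddress.getByName(args[0]);
            InetAddress remote = InetAddress.getByName(args[1]);
            System.out.println(reachableVia(local, remote, 10000) ? "Reachable" : "Unreachable");
        } catch (Exception e) {
            System.out.println("Exception: " + e.getMessage());
        }
    }
}
```

On oam1 it would be started as "java AliveViaInterface oam1 oam2".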

Let’s look at the JDK source code, namely the isReachable implementation behind InetAddress.
If we call isReachable with a NetworkInterface parameter (as Coherence does), it takes the first IPv4 address found on that interface and uses it as the source address.


cat ./java/net/Inet4AddressImpl.java
class Inet4AddressImpl implements InetAddressImpl {
...
    private native boolean isReachable0(byte[] addr, int timeout, byte[] ifaddr, int ttl) throws IOException;
...
    public boolean isReachable(InetAddress addr, int timeout, NetworkInterface netif, int ttl) throws IOException {
        byte[] ifaddr = null;
        if (netif != null) {
            /*
             * Let's make sure we use an address of the proper family
             */
            java.util.Enumeration<InetAddress> it = netif.getInetAddresses();
            InetAddress inetaddr = null;
            while (!(inetaddr instanceof Inet4Address) &&
                   it.hasMoreElements())
                inetaddr = it.nextElement();
            if (inetaddr instanceof Inet4Address)
                ifaddr = inetaddr.getAddress();
        }
        return isReachable0(addr.getAddress(), timeout, ifaddr, ttl);
    }

It works on Windows (where an interface can only have addresses from one network), but the Linux kernel takes the source address into account when routing. We have two addresses on one interface: a public one and a private one. The JDK took the first address it found (I checked: the latest JDK, 8u60, behaves the same way), which in our case was the private address, and tried to send packets to the public network from it. Such packets are normally dropped by the kernel (see Reverse Path Filtering, http://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.kernel.rpf.html) because they are meaningless. A program should either not set a source address at all or use the right (public) address for sending packets.

As a workaround, I added a rule that rewrites the source address of those packets.

I’m not sure whether it’s a bug or a feature, and I think our case is quite unique. But it reminded me once again of a simple truth: if you test something, you should test it the same way it runs in production.
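To see in advance which source address would be picked on a given host, the selection loop can be imitated in a few lines. This is a sketch under assumptions: SourceAddressPick and firstIPv4 are names made up here, the loop only mimics the JDK code quoted above, and the result depends on the order in which the JVM enumerates the interface’s addresses.

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical diagnostic, not part of the JDK or Coherence.
final class SourceAddressPick {

    /** Imitates the loop in Inet4AddressImpl.isReachable: the first IPv4 address wins. */
    static InetAddress firstIPv4(Enumeration<InetAddress> addresses) {
        while (addresses.hasMoreElements()) {
            InetAddress candidate = addresses.nextElement();
            if (candidate instanceof Inet4Address) {
                return candidate;
            }
        }
        return null; // no IPv4 address: the JDK code then passes no source address
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java SourceAddressPick <local-address-or-host>");
            return;
        }
        InetAddress local = InetAddress.getByName(args[0]);
        NetworkInterface netIntfc = NetworkInterface.getByInetAddress(local);
        if (netIntfc == null) {
            System.err.println("No local interface has address " + local);
            return;
        }
        List<InetAddress> all = Collections.list(netIntfc.getInetAddresses());
        System.out.println(netIntfc.getName() + " addresses in enumeration order: " + all);
        InetAddress picked = firstIPv4(Collections.enumeration(all));
        System.out.println("First IPv4 address (the likely source): " + picked);
        if (picked != null && !picked.equals(local)) {
            System.out.println("WARNING: this differs from the address you asked about: " + local);
        }
    }
}
```

If the warning appears for the address Coherence listens on, the host is probably affected by the behavior described above.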

P.S. During the investigation I realized that OAM Coherence is not synonymous with Oracle Coherence (see Doc ID 1579874.1, Questions about OAM 11g Coherence Configuration) 🙂 It has its own settings in oam-config.xml, and they differ from the Oracle Coherence settings.

P.P.S. Java can send ICMP echo requests instead of TCP packets. You can read about it on Stack Overflow or Stack Exchange. We don’t use it.
