The necessity of the right tool for checking Coherence cluster configuration

I ran into quite an interesting problem after upgrading Coherence in my Oracle Access Manager cluster.

The issue appeared right after the upgrade: all log files of the OAM cluster were full of messages like:


<2015-08-28 12:37:09.878/1513.280 Oracle Coherence GE 3.7.1.13 (thread=Cluster, member=n/a): Delaying formation of a new cluster; IpMonitor failed to verify the reachability of senior Member(Id=1, Timestamp=2015-08-26 13:38:36.265, Address=xxx.xxx.xxx.xxx:9095, MachineId=3353, Location=site:,machine:oam1,process:12886, Role=WeblogicServer); if this persists it is likely the result of a local or remote firewall rule blocking either ICMP pings, or connections to TCP port 7>

At first glance, it’s a well-known issue related to some kind of network problem, as described in Doc ID 1530288.1 IpMonitor Failed To Verify The Reachability Of Senior Member. The only problem is that I am absolutely sure there is no firewall between the Coherence instances and that all hosts in the cluster are reachable. Of course, I ran all the recommended tests: the Java ping test from Doc ID 1936105.1, the multicast test from Doc ID 1936452.1, and the datagram test from Doc ID 1936575.1.

I even wrote a small program to check whether the method InetAddress.isReachable() works or not (as described in What Is The Purpose Of IpMonitor In A Coherence Cluster? (Doc ID 1526745.1)).

It’s quite simple:

import java.net.InetAddress;

// Naive reachability check: the JDK chooses the source address itself.
public class alive {
    public static void main(String[] args) {
        try {
            InetAddress ia = InetAddress.getByName("oam2");
            // Wait up to 10 seconds for a reply
            boolean b = ia.isReachable(10000);
            if (b) {
                System.out.println("Reachable");
            } else {
                System.out.println("Unreachable");
            }
        } catch (Exception e) {
            System.out.println("Exception: " + e.getMessage());
        }
    }
}

I tried to check the availability of host oam2 from host oam1, and it worked.


[oracle@oam2 ~]$ /opt/oracle/jrockit/bin/java alive
Reachable

I could also see the ICMP packets in tcpdump:


12:32:40.656815 IP (tos 0x0, ttl 64, id 16265, offset 0, flags [DF], proto ICMP (1), length 72)
oam1 > oam2: ICMP echo request, id 25757, seq 1, length 52
12:32:40.656942 IP (tos 0x60, ttl 64, id 49076, offset 0, flags [none], proto ICMP (1), length 72)
oam2 > oam1: ICMP echo reply, id 25757, seq 1, length 52

What was particularly strange was that I couldn’t see any such network packets (neither ICMP nor TCP port 7) from the running oam1 server, whereas the out file was full of the error message “IpMonitor failed to verify the reachability of senior Member”.

So OAM_Server did not seem to send any IpMonitor packets at all (there was plenty of traffic between the servers on other ports).

After some investigation I found a couple of interesting lines in the strace output:


27348 socket(PF_INET, SOCK_RAW, IPPROTO_ICMP) = -1 EPERM (Operation not permitted)
27348 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 665
27348 bind(665, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.128.20.100")}, 16) = 0
27348 connect(665, {sa_family=AF_INET, sin_port=htons(7), sin_addr=inet_addr("xxx.xxx.xxx.xxx")}, 16

Here you can see that the program tried to send network packets to a public address from a private address. At that moment it became clear that the cause was our specific network configuration.

But first, let’s reproduce the issue. My code was wrong:
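The last three strace lines (socket, bind to a local address, connect to port 7) can be imitated from plain Java. The sketch below is only an approximation of that fallback path, not the JDK's actual native code: the class name TcpEchoProbe and the method probe are made up for illustration, the port is a parameter so the probe can be tried against any port, and every ConnectException is treated as "the host answered", which is a simplification.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: a TCP-based reachability probe with an explicit source address.
class TcpEchoProbe {

    /**
     * Binds a socket to the given source address and tries to connect to target:port.
     * Returns true if the connection is established or actively refused.
     */
    static boolean probe(InetAddress source, InetAddress target, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            // Same step as bind(..., sin_port=htons(0), sin_addr=<local address>) in strace
            s.bind(new InetSocketAddress(source, 0));
            // Same step as connect(..., sin_port=htons(7), ...) when port == 7
            s.connect(new InetSocketAddress(target, port), timeoutMs);
            return true;
        } catch (ConnectException e) {
            // Usually "connection refused": the remote host replied, so it is reachable.
            // Assumption: other ConnectException causes are not distinguished here.
            return true;
        } catch (IOException e) {
            // Timeout, no route to host, bind failure and so on
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java TcpEchoProbe <source-address> <target-host>
        InetAddress source = InetAddress.getByName(args[0]);
        InetAddress target = InetAddress.getByName(args[1]);
        System.out.println(probe(source, target, 7, 10000) ? "Reachable" : "Unreachable");
    }
}
```

Running it once with the private address and once with the public address as the source should show whether the choice of source address alone changes the result.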


InetAddress ia = InetAddress.getByName("oam2");
boolean b = ia.isReachable(10000);

So were all the Coherence tests (such as the Java ping test).
That is not how IpMonitor calls the isReachable method. In Coherence 3.7.1.19 it selects a network interface for sending the ping packet:


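InetAddress ia = InetAddress.getByName("oam2");
// Find the local interface that owns the address of oam1
NetworkInterface netIntfc = NetworkInterface.getByInetAddress(InetAddress.getByName("oam1"));
// ttl = 0 means the default TTL; the timeout is 10 seconds
boolean b = ia.isReachable(netIntfc, 0, 10000);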
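A complete, runnable version of this fragment might look like the sketch below. The class name AliveViaInterface and the helper reachableVia are made up for illustration; the host names oam1 and oam2 are only defaults taken from this post and can be overridden on the command line.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.NetworkInterface;

// Hypothetical reproduction of a reachability check bound to a network interface.
class AliveViaInterface {

    /** Probes target through the local interface that owns localAddr. */
    static boolean reachableVia(InetAddress target, InetAddress localAddr, int timeoutMs)
            throws IOException {
        NetworkInterface netIntfc = NetworkInterface.getByInetAddress(localAddr);
        if (netIntfc == null) {
            // getByInetAddress returns null when no local interface has this address
            throw new IllegalArgumentException(
                localAddr.getHostAddress() + " is not assigned to any local interface");
        }
        // ttl = 0 asks for the default TTL
        return target.isReachable(netIntfc, 0, timeoutMs);
    }

    public static void main(String[] args) throws IOException {
        // Usage: java AliveViaInterface [target-host] [local-host]
        String target = args.length > 0 ? args[0] : "oam2";
        String local = args.length > 1 ? args[1] : "oam1";
        boolean ok = reachableVia(
            InetAddress.getByName(target), InetAddress.getByName(local), 10000);
        System.out.println(ok ? "Reachable" : "Unreachable");
    }
}
```

The explicit null check is a design choice of this sketch: a misspelled or non-local host name then fails loudly instead of silently falling back to the naive check.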

Let’s look at the JDK source code, namely the implementation of the isReachable method behind InetAddress.
If we call isReachable and pass a NetworkInterface (as Coherence does), it takes the first IPv4 address found on that interface and uses it as the source address.


cat ./java/net/Inet4AddressImpl.java
class Inet4AddressImpl implements InetAddressImpl {
    ...
    private native boolean isReachable0(byte[] addr, int timeout, byte[] ifaddr, int ttl) throws IOException;
    ...
    public boolean isReachable(InetAddress addr, int timeout, NetworkInterface netif, int ttl) throws IOException {
        byte[] ifaddr = null;
        if (netif != null) {
            /*
             * Let's make sure we use an address of the proper family
             */
            java.util.Enumeration<InetAddress> it = netif.getInetAddresses();
            InetAddress inetaddr = null;
            while (!(inetaddr instanceof Inet4Address) &&
                   it.hasMoreElements())
                inetaddr = it.nextElement();
            if (inetaddr instanceof Inet4Address)
                ifaddr = inetaddr.getAddress();
        }
        return isReachable0(addr.getAddress(), timeout, ifaddr, ttl);
    }

It works on Windows (where an interface can only have addresses from one network), but the Linux kernel does source-based routing. We have two addresses on one interface: a public one and a private one. The JDK took the first address it found (I checked: the latest JDK, 8u60, behaves the same way), which in our case was the private address, and tried to send packets to the public network. The kernel normally drops such packets (see Reverse Path Filtering, http://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.kernel.rpf.html) because they are meaningless. Programs should either not set a source address at all or use the correct public address when sending packets.

As a workaround, I added a rule that rewrites the source address of those packets.

I’m not sure whether it’s a bug or a feature, and I think our case is quite unique. But it reminded me once again of a simple truth: when you test something, test it the same way it runs in production.
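To see which source address the JDK would pick on a given host, the same "first IPv4 address wins" loop can be run over every interface. This is a minimal sketch that mirrors the logic of the listing above; the class name SourceAddressPicker and the method firstIpv4 are made up for illustration.

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Collections;
import java.util.Enumeration;

// Hypothetical diagnostic: shows the first IPv4 address of each interface.
class SourceAddressPicker {

    /** Returns the first IPv4 address in enumeration order, or null if there is none. */
    static InetAddress firstIpv4(Enumeration<InetAddress> addrs) {
        while (addrs.hasMoreElements()) {
            InetAddress a = addrs.nextElement();
            if (a instanceof Inet4Address) {
                return a;
            }
        }
        return null;
    }

    public static void main(String[] args) throws SocketException {
        Enumeration<NetworkInterface> nifs = NetworkInterface.getNetworkInterfaces();
        if (nifs == null) {
            System.out.println("No network interfaces found");
            return;
        }
        for (NetworkInterface nif : Collections.list(nifs)) {
            // On a host with a private and a public address on one interface,
            // this prints whichever of the two the enumeration returns first.
            System.out.println(nif.getName() + " -> " + firstIpv4(nif.getInetAddresses()));
        }
    }
}
```

If the printed address is not on the network the cluster members talk over, an interface-bound isReachable call is likely to use the wrong source address.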

P.S. During the investigation I realized that OAM Coherence is not synonymous with Oracle Coherence (see Doc ID 1579874.1 Questions about OAM 11g Coherence Configuration) 🙂 It has its own settings in oam-config.xml, and they differ from the Oracle Coherence settings.

P.P.S. Java can send ICMP requests instead of TCP packets. You can read about it on Stack Overflow or Stack Exchange. We don’t use it.


Upgrade Identity Management

It was spring when we upgraded our IDM system for EBS authentication. A few months have passed and everything seems to be working fine, so I’d like to share some brief notes about the issues we met and the final configuration.

We have a 3-node configuration, classic for identity management, and we are now running on:
OEL6 with UEK R3 kernel
Oracle Database 12.1.0.2 + PSU3
Java 1.7_75
Weblogic 10.3.6.11
OAM 11.1.2.2 + PSU5
Webgate 11.1.2.2
OID/OVD 11.1.1.7 + PSU3
AccessGate 1.2.3.4

The most difficult question (for me) was where to find the upgrade documentation. Here is what I collected:
Database upgrade guide
Oracle® Fusion Middleware Upgrade Guide for Oracle Identity and Access Management 11g Release 2 (11.1.2.2.0)
Patch Management of an Oracle Identity Management Deployment
OAM Bundle Patch Release History [ID 736372.1]
How to Upgrade OID/ OVD 11.1.1.5 To 11.1.1.7 (IDM PatchSet 6) (Doc ID 1962045.1)
Considerations When Applying Patch Sets to FMW 11g Release 1 Identity Management (Doc ID 1298815.1)
Master Note on Fusion Middleware Proactive Patching – Patch Set Updates (PSUs) and Bundle Patches (BPs) (Doc ID 1494151.1)
Oracle Internet Directory (OID) Version 11.1.1.7 Bundle Patches For Non-Fusion Applications Customers (Doc ID 1614114.1)
Integrating Oracle E-Business Suite Release 12 with Oracle Access Manager 11gR2 (11.1.2) using Oracle E-Business Suite AccessGate [ID 1484024.1]

I cannot say there were no problems at all, but the vast majority of the SRs I opened were about OAM. They were mostly related to misconfigurations in the oam-config.xml file, caused either by the upgrade or by old errors. All of this is specific to our site and not very interesting.

Everything worked great in the test environment, but the first day after the upgrade was quite stressful for the production system:

1. It turned out that the index IDX_JPS_PARENTDN in schema OID_OPSS had been replaced by a composite index: the old non-unique index was extended with the unique key entity_id. As a result the index grew from 300 to 600 MB, and the explain plans changed too. Everything still worked fine for a single client session, but when clients started connecting at the same time, the execution time of search queries increased significantly (because of heavy disk I/O) and the “ldapbind” method timed out. I recreated the old index and most of the problems went away.

2. Some queries still used the new (composite) index. After some investigation we noticed that the cost of their explain plans was too high. The cause was the new 12c query re-optimization functionality called “SQL Plan Directives”. Igor Usoltsev wrote about it. We simply disabled those SQL plan directives.

3. The size of HTTP GET responses increased significantly. For some browsers (IE11, for example) it exceeded 8K. We terminate SSL/TLS on the proxy side, and the proxy (we use nginx) drops large packets by default. Large packets can come from both the client side and the server side. For nginx we increased the buffer sizes (there are several settings, one for the client side and another for the server side).

P.S. It would also be useful to read this community thread. We haven’t faced Coherence issues yet; our 3.7.1.1 version works fine, but people report that 3.7.1.19 is more stable.


The 5th KIT

The fifth run of the Information Technology Course (KIT) has come to an end.

I gave several lectures there on database fundamentals, nothing too specific, just a few monologues about database technologies:

1. Databases: what kinds there are, what relational algebra, SQL, and normal forms are, and why a database management system is needed

2. Databases: transaction atomicity, ways of keeping transaction logs, and principles of building highly available transactional systems (using the Oracle DBMS as an example)

The course seems to have turned out quite interesting 🙂

The full list of lectures can be found on the project page


A small SQL problem

In February 2013 I was lucky enough to serve on the jury of the regional finals of the Oracle olympiad at IT-Planet.

All regional stages are held remotely, for the jury as well. I had the participants' numbers and their solutions at my disposal. My job was to assess the correctness of the solutions and to verify personally that the solutions rejected by the automatic check were indeed wrong. I came away with plenty of impressions, but I would like to talk about one classic problem and its solutions. Or rather, about the mistakes that can be made when solving a seemingly hackneyed problem.

So, one of the tasks was worded as follows:

With a single SELECT statement, list the employees whose salary is greater than the average salary of the company department they are assigned to.

Information about employees whose department is unknown must not be included.

The result must contain 5 (five) columns:
1. Employee identifier
2. Employee's last name
3. Employee's first name
4. Salary set for the employee
5. Identifier of the department the employee is assigned to

Sort the result:
1. By the salary set for the employee (descending)
2. By the employee's last name (ascending)
3. By the employee's first name (ascending)
4. By the employee identifier (ascending)

The solution to the problem is the query:


SELECT employee_id, last_name, first_name, salary, department_id
FROM employees E
WHERE salary > (SELECT AVG(salary) FROM employees X
WHERE E.department_id = x.department_id)
ORDER BY salary DESC, last_name, first_name, employee_id;

Generally speaking, the query can be written in many ways, for example by applying the subquery unnesting transformation or by using analytic functions. But it is interesting to see how many mistakes, and which ones, can be made in solving this well-worn problem. Of course, I had some idea of how the problem could be solved incorrectly, but I did not expect so many different wrong solutions. 🙂
