I’ve been very busy with clients over the past 2 weeks, troubleshooting Clustering problems, Exchange issues, and planning a new trust relationship, on top of normal maintenance and design. As I solve each issue, I’ll be posting what I can about them. This week we were able to solve the odd clustering problem…

We’ve seen some issues over the past approximately 2 months, particularly with MS SQL 2000 clusters (1 Exchange 2003 cluster), where the cluster group fails on one node, and the other node (or nodes) fails to pick up the group, leaving the complete cluster group offline. In each of the cases (on both HP and Dell hardware) the first striking piece of evidence in the logs is that all nodes that fail to bring up the cluster report that the Cluster IP Address resource couldn’t be brought online, because of an IP address conflict on the network

Making this issue particularly fun is that most of the information we used to solve the problem, is a lack of information.  In particular, there is absolutely nothing interesting at all in any nodes’ cluster.log file.You see the disks negotiate from node to node, but nothing that makes the failover look any different than if you had right-clicked the group and chosen “Move Group” from Cluster Administrator.

What starts the problem off is Event ID 1228 from source “ClusNet”, which says that the “ClusNet driver couldn’t communicate with the ClusSvc for 60 seconds, the Cluster service is being terminated.” Most of the time, you might even miss that this event is there, because it causes so many Event Source Tcpip, ID 4199; Source ftdisk, ID 57; and Source ntfs event ID 50 events, that it’s easy to look over 1 little error. Especially when monitoring systems like Microsoft Operations Manager (MOM), or Idera SQLDiagnostics Manager (SQLDiag) or HP Systems Insight Manager (SIM) all report the cluster as having issues 30-60 seconds after the CluNet 1228 event is written (timing which corresponds exactly to the Tcpip 4199 events (IP address conflict) or the ftdisk 57 events (failed to flush transaction data). So, here’s what happens, based on conversations with Microsoft, training with Microsoft and HP, and a LOT of reading. (more…)