I’ve been very busy with clients over the past 2 weeks, troubleshooting clustering problems, Exchange issues, and planning a new trust relationship, on top of normal maintenance and design work. As I solve each issue, I’ll post what I can about it. This week we were able to solve an odd clustering problem…

We’ve seen some issues over the past two months or so, particularly with MS SQL 2000 clusters (plus one Exchange 2003 cluster), where a cluster group fails on one node and the other node (or nodes) fails to pick it up, leaving the entire cluster group offline. In each case (on both HP and Dell hardware), the first striking piece of evidence in the logs is that every node that fails to bring up the group reports that the Cluster IP Address resource couldn’t be brought online because of an IP address conflict on the network.

Making this issue particularly fun is that most of the information we used to solve the problem is actually a lack of information. In particular, there is absolutely nothing interesting at all in any node’s cluster.log file. You see the disks negotiate from node to node, but nothing that makes the failover look any different than if you had right-clicked the group and chosen “Move Group” in Cluster Administrator.

What starts the problem off is Event ID 1228 from source “ClusNet”, which says that the ClusNet driver couldn’t communicate with ClusSvc for 60 seconds, so the Cluster service is being terminated. Most of the time, you might even miss that this event is there, because it triggers so many events from source Tcpip (ID 4199), source ftdisk (ID 57), and source ntfs (ID 50) that it’s easy to overlook 1 little error. Especially when monitoring systems like Microsoft Operations Manager (MOM), Idera SQLDiagnostics Manager (SQLDiag), or HP Systems Insight Manager (SIM) all report the cluster as having issues 30-60 seconds after the ClusNet 1228 event is written, timing which corresponds exactly to the Tcpip 4199 (IP address conflict) and ftdisk 57 (failed to flush transaction data) events. So, here’s what happens, based on conversations with Microsoft, training with Microsoft and HP, and a LOT of reading.
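As a quick aside: if you want to spot that event pattern without scrolling through the Event Viewer by hand, a little script can do the correlation for you. This is just a minimal sketch, assuming you’ve exported the System log to a CSV file; the file name, column names, and timestamp format below are my assumptions about your export, so adjust them to match what your tool actually writes out.

```python
import csv
from datetime import datetime, timedelta

LOG_FILE = "system_events.csv"        # assumption: System event log exported to CSV
TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # assumption: adjust to your export's timestamp format

# The noise that tends to follow a ClusNet 1228 when this failover race happens
FOLLOW_ON = {("tcpip", "4199"), ("ftdisk", "57"), ("ntfs", "50")}

def load_events(path):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # assumption: exported columns are named Time, Source, EventID
            yield (datetime.strptime(row["Time"], TIME_FORMAT),
                   row["Source"], row["EventID"])

events = sorted(load_events(LOG_FILE))

for when, source, event_id in events:
    if source.lower() == "clusnet" and event_id == "1228":
        window_end = when + timedelta(seconds=90)
        related = [e for e in events
                   if when <= e[0] <= window_end
                   and (e[1].lower(), e[2]) in FOLLOW_ON]
        print(f"ClusNet 1228 at {when}: {len(related)} related events "
              f"(Tcpip 4199 / ftdisk 57 / ntfs 50) within 90 seconds")
```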

First, we have to discuss the layout of Microsoft Clustering. There is the well-known “ClusSvc”, the “Cluster” service that you find in the Services MMC console. This is what’s known as a user-mode process (as opposed to kernel-mode), so it competes for the same memory and processor time as Exchange and SQL. There are 2 other “services”, the Cluster Network driver (ClusNet) and the Cluster Disk driver (ClusDisk). If you were ever an NT4 MCSE, you might remember the old Control Panel “Drivers” extension; these driver services would show up in there, if it still existed, along with a lot of other drivers. They haven’t gone away, just the GUI for editing them.

Anyways, because ClusSvc runs in the same processor context as SQL or Exchange, and competes with them for physical memory, it can be starved of resources by SQL, Exchange, or any other user-mode process running on your server. To keep track of whether this has happened, Microsoft wrote the ClusNet driver not only to keep track of the private and public networks and the group IP address resources currently owned by the node, but also to check for a heartbeat (default of 60 seconds) back from ClusSvc. The best explanation I’ve found is here on Microsoft’s site.
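To make that relationship concrete, here’s a toy model of the heartbeat in Python. It’s purely illustrative: one thread stands in for the user-mode Cluster service, another stands in for the kernel-mode ClusNet watchdog. Nothing here resembles the real clusnet.sys code; the timeout is shrunk so the demo finishes quickly.

```python
import threading
import time

HANG_TIMEOUT = 6     # demo value; the real ClusSvc heartbeat window defaults to 60 seconds
last_heartbeat = time.time()
lock = threading.Lock()

def clussvc_simulator():
    """Stands in for the user-mode Cluster service: it heartbeats regularly,
    until (for this demo) it gets 'starved' and stops checking in."""
    global last_heartbeat
    for _ in range(3):               # heartbeat a few times...
        with lock:
            last_heartbeat = time.time()
        time.sleep(2)
    time.sleep(60)                   # ...then simulate starvation by SQL/Exchange

def clusnet_watchdog():
    """Stands in for the kernel-mode ClusNet driver: it keeps running even when
    user mode is starved, and reacts when the heartbeat goes stale."""
    while True:
        time.sleep(1)
        with lock:
            stale = time.time() - last_heartbeat
        if stale > HANG_TIMEOUT:
            # Default recovery action: terminate (and let the OS restart) ClusSvc,
            # which is exactly what the ClusNet 1228 event is telling you happened.
            print("Heartbeat missed for %d seconds: terminating ClusSvc (event 1228)"
                  % HANG_TIMEOUT)
            return

threading.Thread(target=clussvc_simulator, daemon=True).start()
clusnet_watchdog()
```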

So, the ClusNet event 1228 means that ClusSvc failed to heartbeat within the 60-second timeout window, so ClusNet killed the Cluster service process. Generally the OS is set to auto-restart the Cluster service if it terminates, so the node will recover… but what about the cluster as a whole? Some interesting things happen when the Cluster service is forcibly terminated.

  1. Because it’s terminated, ClusSvc can’t tell ClusNet to drop registration of the group IP address resources owned by the node, so ClusNet keeps the IP addresses advertised (ARP’d) on the network.
  2. ClusSvc also can’t tell ClusDisk to flush NTFS logs to disk in preparation for the disks being taken over by another node.
  3. Because the Cluster service isn’t running, the node can’t heartbeat to the other nodes in the cluster.
  4. Therefore, the other node(s) reach a quorum, decide that the first node is unavailable, and begin the process of a failover.
  5. First, the disks are negotiated on the SCSI/FibreChannel bus and eventually forcibly taken from the “offline” node (causing the ftdisk 57 and ntfs 50 errors).
  6. Simultaneously (unless your IP address resources depend on your disk resources), the preferred failover node (or the next node in the round-robin process) attempts to advertise that the cluster group’s IP address now lives at its MAC address.
  7. And things get ugly.

Because the original node still hasn’t restarted its Cluster service (service recovery normally restarts it after 60 seconds / 1 minute, unless you’ve changed your cluster), the ClusNet driver on the original node is still ARPing the group IP address. This causes the Tcpip 4199 errors on the node that’s attempting to bring up the group. Those errors cause the group to fail to come up on the 2nd node, so the cluster group moves to the next node in the round-robin (or preferred owners) list, until all nodes have failed to bring up the group. In my experience, 4 nodes can all fail to bring up the group in under 60 seconds, so the cluster group stays offline and failed, even after the Cluster service restarts on the original node. Also, the disks have been moved to every node in the cluster, so you can probably gather where the disk errors come from.
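To see why every surviving node can burn through its attempt before the original node lets go of the IP, you can sketch the timing. This is a back-of-the-envelope model with made-up per-node retry times, not anything measured from a real cluster:

```python
# Back-of-the-envelope model of the failover race.
# All numbers are illustrative assumptions, not measurements.

SERVICE_RESTART_DELAY = 60      # seconds before the dead node's ClusSvc restarts
                                # and ClusNet finally releases the group IPs
SECONDS_PER_ATTEMPT = 12        # rough time for a node to try the IP resource,
                                # hit the Tcpip 4199 conflict, and give up
NODES = ["NODE2", "NODE3", "NODE4"]   # the surviving nodes trying to take the group

elapsed = 0
for node in NODES:
    elapsed += SECONDS_PER_ATTEMPT
    ip_still_claimed = elapsed < SERVICE_RESTART_DELAY
    outcome = "FAILS (IP conflict)" if ip_still_claimed else "succeeds"
    print(f"t={elapsed:>3}s  {node} tries to online the group IP: {outcome}")

if elapsed < SERVICE_RESTART_DELAY:
    print(f"All {len(NODES)} nodes failed before t={SERVICE_RESTART_DELAY}s, "
          "so the group is left offline/failed even after ClusSvc restarts.")
```

If you want to confirm what that restart delay actually is on your nodes rather than assuming a minute, `sc qfailure clussvc` will show you the recovery actions configured for the Cluster service.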

So how can you fix or avoid this problem? I am taking the advice written between the lines of the previously mentioned Microsoft article. The default action on a heartbeat failure is to restart the Cluster service, which causes the issues described above. However, there are 2 other options available: Do Nothing (with or without logging an event), and Bugcheck.
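If memory serves, the knob the article describes surfaces as a cluster common property (it lives in the cluster hive, which is why I keep calling it a registry key), and I believe it’s named HangRecoveryAction, but verify the exact name and value meanings against the article before you touch anything. Here’s a read-only sketch that just lists the cluster common properties with the built-in cluster.exe tool and looks for it; the property name is my assumption:

```python
import subprocess

# Read the cluster common properties with the built-in cluster.exe tool and
# look for the hang-recovery setting the Microsoft article describes.
# Assumption: the property is named "HangRecoveryAction" -- double-check the
# exact name (and the meaning of each value) against the article before changing it.
PROPERTY_NAME = "HangRecoveryAction"

result = subprocess.run(["cluster", "/prop"],
                        capture_output=True, text=True, check=False)

for line in result.stdout.splitlines():
    if PROPERTY_NAME.lower() in line.lower():
        print(line.strip())
        break
else:
    print(f"{PROPERTY_NAME} not listed; check the property name against the article.")

# Setting it to 3 (bugcheck on heartbeat failure) would look something like
#   cluster /prop HangRecoveryAction=3
# but make that change deliberately, and only after reading the article.
```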

If you set the mentioned registry key to “3”, a heartbeat failure will instead cause a bugcheck of the server, dumping out a kernel memory dump (or a minidump or full dump, depending on your settings, but the default is a kernel dump) of exactly what was going on when the heartbeat failed. Then, because the failed node is now blue-screened while it writes out the memory dump, it will no longer fight the disk arbitration OR advertise the cluster group IP addresses that it had owned. The next node in the list will pick up the group, restart the resources, and your cluster is once again highly available.
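Before relying on that dump, it’s worth confirming what each node is actually configured to write. Here’s a small read-only sketch of the standard CrashControl settings; the numeric meanings (1 = complete, 2 = kernel, 3 = small/minidump) are the usual Windows values, but confirm them on your own build:

```python
import winreg

# Read the crash dump configuration so you know what a bugcheck will give you.
DUMP_TYPES = {0: "None", 1: "Complete memory dump",
              2: "Kernel memory dump", 3: "Small memory dump (minidump)"}

key_path = r"SYSTEM\CurrentControlSet\Control\CrashControl"
with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
    dump_type, _ = winreg.QueryValueEx(key, "CrashDumpEnabled")
    try:
        dump_file, _ = winreg.QueryValueEx(key, "DumpFile")
    except FileNotFoundError:
        dump_file = r"%SystemRoot%\MEMORY.DMP"  # usual default location

print(f"CrashDumpEnabled = {dump_type} ({DUMP_TYPES.get(dump_type, 'unknown')})")
print(f"Dump will be written to: {dump_file}")
```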

And you now have more troubleshooting information than you would if you just let the service restart.

Now, we had some discussions with clients about this setting, and they were concerned that bugchecking a server could cause data corruption, especially on a busy SQL server. This is true. However, restarting the Cluster service causes a forced unload of running cluster group processes anyways, so a bugcheck or a Cluster service restart exposes you to the exact same data corruption issues, but a bugcheck gives the technician looking at the problem more data to know how to avoid the problem in the future. Also, if no action is taken, you could leave the cluster in a resource starvation situation, in which case the technician has to manually kill processes, if he can even get into the server to kill them. The resource starvation may even force someone to physically reset power to the server node. And while that’s happening, the cluster may not fail over to the working nodes, causing an actual business outage. And isn’t avoiding that why you’re building clusters in the first place?