Technology


We’ve been having some server uptime/stability issues, and aren’t getting alerts from HP Systems Insight Manager (HP SIM) that the services are down (because they’re not down; they’re just not answering on HTTP). So I took a copy of “responder.pl” and folded it into something I wrote for totalnetsolutions.net. What came out is actually pretty nice, easily configurable, and, so far this week, very stable.

We have this running every 3 minutes from 3 systems: 1 Windows 2003, 1 Fedora Core 8, and 1 Kubuntu Gutsy Gibbon. It requires Net::SMTP, Config::IniFiles, LWP::UserAgent, and HTTP::Request. Config::IniFiles is the only one I’ve needed to download and install on any of those 3 systems. But I do have LWP::Simple on all systems, so I’m not sure whether you’ll need to install the last 2. This is my first published code other than 3-line bash scripts, so be kind in the comments.

Feel free to take and use / improve / update this – I’d just appreciate if you’d let me know so I can update this version here.  The parseIni() function checks that all “URL”s are in http://www.google.com format or http://64.233.167.99 format (it checks for http:// followed by text followed by what appears to be a valid TLD format, or it checks for http:// followed by an IP address).  I have yet to add in the regex to look for a valid full URI, because I didn’t need that yet.
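For readers who want to see the shape of that check without opening the script, here is a minimal sketch in Python (not the Perl from chk-web.pl). The patterns and the function name `is_valid_url` are my own assumptions for illustration; the actual regexes in parseIni() are not reproduced here.

```python
import re

# Hypothetical patterns: "http://" + dotted host name ending in something
# that looks like a TLD, or "http://" + a dotted-quad IP address.
# No path, port, or query string is accepted, matching the limitation above.
HOST_URL = re.compile(r"http://([A-Za-z0-9-]+\.)+[A-Za-z]{2,}")
IP_URL = re.compile(r"http://(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})")


def is_valid_url(url):
    """Return True if url looks like http://host.tld or http://a.b.c.d."""
    if HOST_URL.fullmatch(url):
        return True
    match = IP_URL.fullmatch(url)
    if match:
        # Each octet must be in the 0-255 range.
        return all(0 <= int(octet) <= 255 for octet in match.groups())
    return False
```

Anything with a path after the host (a full URI) is rejected by this sketch, which mirrors the gap described above.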

This is upgraded over responder.pl in that:

  1. It will send to any number of SMTP recipients (comma-separated)
  2. It will silence its alerting if *all* checked addresses are down.  If the monitoring system gets unplugged from the network, it won’t attempt to send hundreds of alerts upon regaining access.  Or if you’re testing from a DSL line, you won’t get alerts because the DSL line went down, but the actual target was up.  The next version will have this as an option in the INI file.
  3. It uses standard INI file formatting, rather than a parsed text file.
  4. It runs out of the box (so to speak) on Windows (ActivePerl) or Linux (Fedora and Ubuntu both tested).
  5. It has better inline documentation.

The major problem is that a minimum of 2 URLs are needed in the INI file for the full logic to work.  You can get around this for small networks by adding in the DNS domain for one, and the IP address for the other. 
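The silencing rule, and why it needs at least 2 URLs, is easier to see in code. This is a rough Python sketch of the idea only, not the Perl in chk-web.pl; the function name `sites_to_alert` and the dictionary of results are assumptions made for the example.

```python
def sites_to_alert(results):
    """Decide which URLs to alert on.

    results maps each checked URL to True (answered) or False (no answer).
    If every URL failed, assume the monitoring box itself lost its
    connection and stay silent.
    """
    down = [url for url, answered in results.items() if not answered]
    if len(down) == len(results):
        # Everything is down (or nothing was checked): send no alerts.
        return []
    return down
```

With a single URL in the list, "that URL is down" and "all URLs are down" are the same condition, so no alert would ever be sent. Listing the same server once by DNS name and once by IP address gives the check a second data point.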

Thanks, and please share any concerns or problems.

chk-web.pl

As I mentioned in my last clustering post, there are some Exchange problems we’ve been working on over the past few weeks.  One of the simpler problems has a complex answer, so I thought I’d explain a bit.

As any good Exchange administrator knows, Exchange stores its data (for a store) in 2 files, the EDB file, and the STM file.  However, there’s not a really great explanation of the differences between the two files – the best I’ve found so far is at MessagingTalk.org, but they only explain that the STM is MIME formatted, and the EDB is MAPI content. Why, though, and how does it affect the end users? This is what we’ll explore. (more…)

If you are setting up a cross-forest trust with selective authentication (which requires a Windows Server 2003 Native mode level forest and domain), don’t forget to grant the “Allowed to Authenticate” right to the users from the trusted domain on the servers they’ll need access to in your domain. The error messages you’ll get back (replicated here in my test VM domains) aren’t very helpful.

System Error 317 has occurred. The system cannot find message text for message number 0x*** in the message file for ***.

System Error 317

Further information about adding the “Allowed to Authenticate” right to the trusted users is available at Microsoft TechNet. If you have the opportunity to raise your forest and domain functional levels to take advantage of this, I highly recommend it. But I recommend also (even more strongly) documenting precisely what you set.

I’ve been very busy with clients over the past 2 weeks, troubleshooting Clustering problems, Exchange issues, and planning a new trust relationship, on top of normal maintenance and design. As I solve each issue, I’ll be posting what I can about them. This week we were able to solve the odd clustering problem…

We’ve seen some issues over the past approximately 2 months, particularly with MS SQL 2000 clusters (and 1 Exchange 2003 cluster), where the cluster group fails on one node, and the other node (or nodes) fails to pick up the group, leaving the complete cluster group offline. In each of the cases (on both HP and Dell hardware) the first striking piece of evidence in the logs is that all nodes that fail to bring up the cluster report that the Cluster IP Address resource couldn’t be brought online because of an IP address conflict on the network.

Making this issue particularly fun is that most of the information we used to solve the problem is a lack of information. In particular, there is absolutely nothing interesting at all in any node’s cluster.log file. You see the disks negotiate from node to node, but nothing that makes the failover look any different than if you had right-clicked the group and chosen “Move Group” from Cluster Administrator.

What starts the problem off is Event ID 1228 from source “ClusNet”, which says that the “ClusNet driver couldn’t communicate with the ClusSvc for 60 seconds, the Cluster service is being terminated.” Most of the time, you might even miss that this event is there, because it triggers so many Source Tcpip, ID 4199; Source ftdisk, ID 57; and Source ntfs, ID 50 events that it’s easy to overlook 1 little error. That is especially true when monitoring systems like Microsoft Operations Manager (MOM), Idera SQLDiagnostics Manager (SQLDiag), or HP Systems Insight Manager (SIM) all report the cluster as having issues 30-60 seconds after the ClusNet 1228 event is written, timing which corresponds exactly to the Tcpip 4199 events (IP address conflict) or the ftdisk 57 events (failed to flush transaction data). So, here’s what happens, based on conversations with Microsoft, training with Microsoft and HP, and a LOT of reading. (more…)
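If you have the event log exported somewhere you can script against, one way to keep the 1228 event from getting buried is to look for it first and then collect the noise that follows it. The Python sketch below is only an illustration of that idea: the tuple layout `(timestamp, source, event_id)`, the function name `find_root_events`, and the 60-second window are all assumptions, and it does not read a real Windows event log.

```python
from datetime import datetime, timedelta

# Follow-on events named above, keyed by (lowercased source, event ID).
FOLLOW_ON = {("tcpip", 4199), ("ftdisk", 57), ("ntfs", 50)}


def find_root_events(events, window_seconds=60):
    """Pair each ClusNet 1228 event with the follow-on events after it.

    events is a list of (timestamp, source, event_id) tuples.
    Returns a list of (trigger_event, [follow_on_events]).
    """
    ordered = sorted(events)
    found = []
    for ts, source, event_id in ordered:
        if (source.lower(), event_id) != ("clusnet", 1228):
            continue
        limit = ts + timedelta(seconds=window_seconds)
        followers = [
            e for e in ordered
            if ts < e[0] <= limit and (e[1].lower(), e[2]) in FOLLOW_ON
        ]
        found.append(((ts, source, event_id), followers))
    return found
```

Sorting by the trigger rather than by volume puts the single ClusNet event at the top of the report instead of somewhere under dozens of Tcpip and ftdisk entries.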

Now that I have the system back online, I thought I’d post a quick “where we are” update for any regular readers:

  1. We have restored from most recent backup, but are missing a single post, “PHP, mail(), Apache, and SELinux (FC7)”, which even google.com’s cache didn’t catch in full. I apologize to the readers who were using the instructions in that post whom we met through their comments.
  2. We haven’t yet restored the “comments” table. I haven’t yet decided if we will.
  3. I have fixed the problem of storing backups for the company in 3 different locations, based on system type. Now we only have 2 – onsite and offsite.
  4. The extremely popular How to Change a DC IP address article was restored first. (That page drives over half of our traffic.)

We did a standard forensics review of what happened, and it appears as though a perfect storm of issues hit us – a weekend outage, a hardware failure, and failure to keep publicly exposed software fully up-to-date. The saying often goes, “The cobbler’s kids are the ones without shoes” or something similar to that, and here we failed to follow our own advice, preferring to keep our customers’ systems running smoothly. I know I’ll be spending a few extra hours a week the rest of this year reviewing our internal systems for best practices.

In any case, things are fixed and running great again.
