October 2007


I have been working with a client and Microsoft on a very difficult issue with their Exchange 2003 system. A few months ago, a particular store started exhibiting Event ID 623 errors from source ESE – the Extensible (or Exchange) Storage Engine. Since this error was coming up on a server that was already in the process of being decommissioned, the suggestion to “move the users to a new store” was an easy one to take.

But the problem came back 22 days later on one of the 2 stores the users were moved to, so we knew something else must be up. I’ll cut to the chase: Microsoft is now very sure of what is happening, just not who is causing it or why.

What’s frustrating about this is that all the tools that can be used to look deeper into this problem aren’t available to me as a technician outside of Microsoft. All I’ve been able to do for my client is set up triggers to capture “Exchange store.exe dumps” – essentially a process freeze followed by a private memory dump to disk. The good thing is that neither the end users nor the Windows 2003 Cluster service notice. Also, our Microsoft support team has been great at sharing information with us.
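For anyone who needs to capture something similar: a hang-mode dump can be taken with ADPlus from the Debugging Tools for Windows. This is only a sketch of the general technique, not necessarily the exact trigger we have in place, and the output path is just an example:

    REM Briefly freeze store.exe, write its memory out to a dump file, then let it resume.
    REM Requires the Debugging Tools for Windows; run from that install directory.
    cscript adplus.vbs -hang -pn store.exe -o D:\Dumps

Hang mode attaches noninvasively, takes its snapshot, and detaches, which is why users (and the cluster service) see nothing more than a momentary pause.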

But the problem remains: there is nothing at all that I can do to fix it. I can’t run the debugging tools that Microsoft has available – I can run a debugger against the process, but not to the same level of detail, due to a lack of published information – despite a very deep understanding (for an outside consultant who just reads voraciously) of how ESE manages the EDB, STM, and LOG files. This inability to better serve my customers frustrates me to no end, whether Microsoft’s technicians are fantastic or not (there have been other times…).

So, while I wait for them to get back to me on yet another dump that has been generated, looking for a very elusive fSearch() operation against one of my client’s many Exchange 2003 stores, I sit on my hands in anticipation, wishing to be able to do more.

Previously I mentioned some issues I had been having on Kubuntu Feisty Fawn with disk utilization seemingly caused by unflushed disk buffers. I alluded to believing that my “laptop-mode.conf” parameters were at fault.

With my recent upgrade of that same laptop to Kubuntu Gutsy Gibbon, I kept the laptop-mode.conf file a bit closer to the maintainer’s version. There are some changes to the “dirty-writeback-centiseconds” and “dirty-background-ratio” values from what I posted, and my issue seems to have gone away. I’ve been able to go back to running my Windows 2003 SBS server with a Centrify DirectControl lab environment and a RHEL 4 Oracle 10g server attached at the same time.
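As far as I can tell, those laptop-mode settings map to the kernel’s vm.dirty_writeback_centisecs and vm.dirty_background_ratio knobs, which laptop-mode adjusts when you switch to battery. A minimal sketch for inspecting and experimenting with them directly – the values shown are the usual kernel defaults, not the ones in my attached config:

    # Show the current values
    sysctl vm.dirty_writeback_centisecs vm.dirty_background_ratio
    # Flush dirty pages every 5 seconds; start background writeback at 10% dirty memory
    sudo sysctl -w vm.dirty_writeback_centisecs=500
    sudo sysctl -w vm.dirty_background_ratio=10

The tradeoff: long writeback intervals let the disk spin down and save battery, but they also mean a large burst of I/O when the flush finally happens – which looks a lot like the stalls I was seeing with several VMs writing at once.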

The configuration files that work MUCH better are attached here:

laptop-mode.conf

cpufreqd.conf

I upgraded my Dell D620 from Feisty to Gutsy this weekend, which included an upgrade to kernel 2.6.22. Every time there’s a kernel upgrade, VMware Workstation needs to be reconfigured with “vmware-config.pl”. This isn’t normally an issue, but today it was. Thanks to Chris Hope at Electric Toolbox I was able to fix the problem quickly and easily.

For completeness, the error I was getting was the same one he documents:
/tmp/vmware-config1/vmnet-only/userif.c:630: error: ‘const struct sk_buff’ has no member named ‘h’
when trying to build the VMNet module – VMMon built and inserted perfectly. I downloaded 6.0.1, installed it, and I’m back in the game.
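For reference, the normal post-kernel-upgrade routine is just a rebuild of the VMware modules against the new kernel headers – a sketch of it below, with the header package name following the Ubuntu convention. It was the 2.6.22 changes to struct sk_buff that made this fail for VMNet until 6.0.1:

    # Install the headers for the newly booted kernel, then rebuild the VMware modules
    sudo apt-get install linux-headers-$(uname -r)
    sudo vmware-config.pl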

I saw this post from Jeff Jones over at Microsoft today. He mentions that Red Hat Enterprise Linux 4 recently patched its 1000th vulnerability, and provides a quote from Truth Happens (direct link to post), which is a Red Hat blog. I suggest you at least read Jeff’s post, since he quotes the relevant point of the Truth Happens article.

I read both of these blogs, and I’m frankly disgusted by the way both sides are treating the data. I understand that statistics are often more useful for what they hide than for what they show. In this case, the 2 competing ideas seem to be: “we fix more bugs, which means we’re working harder to protect you” vs. “we fix fewer bugs because we have fewer bugs, so we’re working harder to protect you.” I think both of these arguments are invalid, so I hope both sides see this and pay attention.

  1. Jeff Jones: Jeff does a very interesting quarterly (or so) patch report – which OSes have had the most patches applied in a given time frame (past quarter, past year, etc.). I get a lot out of these reports, and he does very good trending. Find them on his blog and read them.

    To that end, he does a very good job of selling Microsoft as a security company. By purely counting “number of patches submitted,” Microsoft will automatically look better, simply because “Windows (XP and 2003 combined)” has fewer features than “Red Hat Enterprise Linux,” “SUSE Enterprise Linux,” or “Ubuntu Desktop Edition.” Jeff makes the point that Microsoft has released patches for only 649 security vulnerabilities across all Microsoft products in 7 years, but…

    What Windows does have that the GNU/Linux variants don’t is the .NET Framework, which is a HUGE project – yet when it’s updated, you get a single update, so it counts as “1” in Jeff’s analysis. Also, Microsoft doesn’t have conflicting software product lines – it has the Office team, which has swallowed the “Works” team – while there are at least 3 “office” suites in any GNU/Linux distro (OOo, KOffice for KDE, and the suite including AbiWord for GNOME).

    Then we can discuss kernels – when there is a driver update for a 3rd-party component (an Intel i810/845/945 motherboard chipset, for example), it’s a module in the kernel, which requires an updated kernel package from the GNU/Linux distributors; but when there’s a driver update for 3rd-party hardware on Windows, Microsoft doesn’t even have to count it, since it’s “3rd party.” And on the subject of kernels, I don’t recall ever seeing an actual “kernel” update for Windows that wasn’t included in a service pack or a box on a shelf.

  2. Truth Happens writers: Selling “look how many bugs we fix” to a corporation is a pretty crappy way of doing business, in my opinion. That I can put an appointment in my calendar for 3pm on the 2nd Tuesday of each month to review patches, test them that afternoon, and start rolling them out to QA the next morning is a fantastic way to work. When Red Hat comes out with an update, it’s at a random time, and I have to review each one individually against what I may have installed on my systems.

    Now, this isn’t a dig against any GNU/Linux distribution out there – free (Ubuntu) or enterprise (Novell / Red Hat). They are forced into this disclosure/fix model by the fact that these packages are not maintained solely by the companies pushing the fixes. In fact, in these cases, the patches have to be done on a per-report basis because of how most open-source software vulnerabilities are reported.

    This is a great time to ask: why is OOo included in a server distro? There *has* to be some GPL or package-management reason behind it, but I’d be really interested to know.

So here we see 2 points of view: MS’s (Jeff Jones’) “we’re great because we don’t have a lot of patches, which means we’re more secure,” and RH’s (Truth Happens’) “we’re great because we’ve patched all of the bugs that have been found, no matter how small.” In truth, I think the real point is that these are 2 completely different companies with huge differences in their offerings in the “operating system” category. Having representatives of both companies post what amount to “nyah nyah, we’re better than you are” blogs keeps the entire discourse on security at a childish level that helps nobody.

So, to both Jeff and the writers of “Truth Happens”: please, out of respect for your readers, look deeper into the numbers and provide some insight, don’t just knock your competition.

First, refer back to my first post on Domain Controller IP/subnet changes. The nice thing about changing IP addresses on DCs in a larger environment is that it’s actually easier. I have to keep this one quick for now, but will expand based on comments, which you all seem pretty good at leaving (and thank you!). Please, PLEASE refer back to the first post – this one is only an expansion of it.

  1. Same as before: why are you changing IPs? In larger environments, I do this because of a physical move of just one site. If the networking team doesn’t have the new subnet up and routing, don’t start!
  2. Make sure the new site (if required) is set up in AD. If I’m moving DCs from one physical location to another, I will build a new site, rather than re-using the old one, because the new site often has better connectivity, so the site link costs are changing.
  3. Add the new IP to the DC you’re moving (DC01 in this example). Same as before: don’t remove the old one, just add the new.
  4. On DC01, do the following to verify registration worked:
    ipconfig /registerdns
    Wait a few minutes.
    nslookup
    server DC01
    set type=A
    DC01.foobar.local
    foobar.local
    server DC02
    DC01.foobar.local
    foobar.local

    The answers from DC01 and DC02 should be the same, though possibly in a different order. The important thing is that both the new IP address and the old IP address show up for both queries on both servers (see the sample output after this list).
  5. Shut down DC01, pack it up and move it. (Or just plug it into the new network.)
  6. Boot up, verify that DC01 has network connectivity, and that other systems can see that it has the new IP.
  7. If you haven’t already, make the new IP primary (change the order in Network settings), and make sure the DNS and WINS servers are correct and reachable (remember that Windows 2003 DNS should point to itself).
  8. Once you’ve verified that AD is replicating across sites properly (up to 15 minutes in my experience), remove the old IP, run ipconfig /registerdns, and reboot.
  9. When it comes back up, re-verify that AD is still replicating, and you should be set.
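To make the step 4 check concrete, here is roughly what a healthy answer looks like when asking DC02 about DC01 – all names and addresses here are made up for illustration, with 10.1.0.10 as DC01’s old address and 10.2.0.10 as its new one:

    > server DC02
    Default Server:  dc02.foobar.local
    Address:  10.1.0.11

    > DC01.foobar.local
    Server:  dc02.foobar.local
    Address:  10.1.0.11

    Name:    dc01.foobar.local
    Addresses:  10.2.0.10, 10.1.0.10

If the new address is missing from either server’s answer, wait for replication (or re-check that the dynamic DNS registration succeeded) before you move the box.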

I would point out that when making a change this big to your environment, reviewing your AD replication, DNS forwarding, and WINS topology is a good idea.
