I upgraded the TNS lab this past week from Windows 2008 to Windows 2008 R2, including replacing the 4 Domain Controllers (rather than upgrading). It gave me a chance to review the procedure for moving a Certificate Server to a new system, which I hadn’t done since 2005. For those who haven’t tried, the procedure for moving a Certificate Server is reasonably well documented at the Microsoft Support site here: http://support.microsoft.com/kb/555012. The part of this that’s especially tricky, especially in our lab, is the renaming of the DC.

In our lab we have an empty forest root, as per the old (Windows 2000-era) Microsoft recommendations, to match several large customer environments. Because it’s a lab, and no clients connect to it, we only have a single DC. I snapshotted it as a backup, and went through the procedure to rename a domain controller, also well documented by Microsoft, this time at TechNet.

For review, the procedure we planned to run was:
netdom computername dc04 /add:dc01.lwtest.corp
netdom computername dc04 /makeprimary:dc01.lwtest.corp
shutdown -r -t 0
netdom computername dc01 /enum
netdom computername dc01 /verify
netdom computername dc01 /rem:dc04.lwtest.corp

I’m still not sure what caused it, but in this case, this command failed:
netdom computername dc04 /makeprimary:dc01.tns.lab
At this point, I couldn’t make the old name primary again (I would get an “Access Denied” error), so I rebooted to see which name had taken. And that’s where things went bad.

When the DC came up, we were getting this error: Netlogon EventID 5602. Source: NETLOGON, EventID: 5602, Data: “An internal error occurred while accessing the computer’s local or network security database.”

Because the DC rename hadn’t completed successfully, the computer couldn’t actually log into itself to load AD. Very bad for the root of the forest. I wasn’t able to find anything helpful in my searches, so thought I’d let you know the fix:

Name it back to the old name and try again:
Reboot into Safe Mode.
netdom computername localhost /makeprimary:dc04.lwtest.corp
shutdown -r -t 0

Boot normally
netdom computername localhost /makeprimary:dc04.lwtest.corp
netdom computername dc01 /enum
netdom computername dc01 /verify
shutdown -r -t 0

After *that* reboot, make sure, with the verify command, that the old name took, and that you can log in, and just try the rename again.

I couldn’t get the “rename back” to take untill after the attempt in safe mode. Strange, but it’s working great now! Hopefully this will help someone.

I had a Bourne Shell (sh) script I needed to capture the exit status of, but it was being run through “tee” to capture a log file, so “$?” always returned the exit status of “tee”, not the script. In a nutshell, it went something like this:
#!/bin/sh
DO_LOG=$1
LOGNAME="`hostname`.out"
if [ "$DO_LOG" -eq "1" ]; then
# Logging is turned on, so relaunch ourself with logging disabled, and tee the output to the logfile
sh $0 0 | tee $LOGNAME
exit $?
fi
#... Do lots of things in the script
exit $ERRORCODE

Now, the important thing here is that the script sets very specific error codes (we have 16 defined) based on different error states, so that a tool like HP Opsware can give us different reports based on the exit status. When run with “0″ for no logging, this works great, but it requires the controlling tool to capture logs, and not all do (especially cheap “for” loops in a shell script.)

But when run with logging enabled, all of the fancy error code handling (45 lines of subroutines’ worth) gets lost, because “$!” is equal to the status code of the “tee” command. Bash scripters out there will say “but what about $PIPESTATUS ?” If we could use bash, the code would be:
#!/bin/sh
DO_LOG=$1
LOGNAME="`hostname`.out"
if [ "$DO_LOG" -eq "1" ]; then
# Logging is turned on, so relaunch ourself with logging disabled, and tee the output to the logfile
sh $0 0 | tee $LOGNAME
exit ${PIPESTATUS[0]}
fi
#... Do lots of things in the script
exit $ERRORCODE

(Note the single line change in the conditional exit.)

But, I don’t have the luxury of bash (thanks AIX and FreeBSD and Solaris 8), so we needed to get fancy…
#!/bin/sh
DO_LOG=$1
LOGNAME="`hostname`.out"
if [ "$DO_LOG" -eq "1" ]; then
# Logging is turned on, so relaunch ourself with logging disabled, and tee the output to the logfile
cp /dev/null $LOGNAME
tail -f $LOGNAME &
TAILPID=$!
sh $0 0 >> $LOGNAME 2>&1
RETURNCODE=$?
kill TAILPID
exit $RETURNCODE
fi
#... Do lots of things in the script
exit $ERRORCODE

In this last example, we’re creating the empty logfile by copying /dev/null to the logname, then starting a backgrounded “tail” command on the empty file. Because we haven’t disconnected STDOUT in the backgrounding, we will still get the screen output we desire from “tail”. The script now only writes *its* output, with redirected STDOUT and STDERR, to the log file, which is already being tailed to the actual screen. At the end of the script, we capture the true exit code, clean up the tail ugliness, and exit with the desired status code.

This does have a serious downside that if the script encounters and error and exits, the “tail” is left running indefinitely on Linux and Solaris, since the kernel there will simply scavenge the process to be owned by init. So, if you take this method, be very careful to capture all errors you may possibly encounter. Or, just use a better scripting tool. :)

I have recently pushed the main ESX host for TNS to 70% overcommit on RAM, since upgrading to 4.1. Interestingly (expectedly), the performance now is the same as it was on 3.5 with 2 fewer VMs and only 50% overcommit. But, it’s still pretty poor in the “Lab” performance pool, even after changing that pool from “low” to “normal” shares. So we finally ordered new memory, doubling the server to 16gb. It goes in Sunday night, so we’ll see how things perform next week when Rob’s on site with customers.

I recently upgraded the totalnetsolutions.net internal network from ESX 3.5 to ESXi 4.1. The ESX Host upgrade itself is simple, and not worth mentioning. When complete, however, you have an option to upgrade the Guest OS Virtual Hardware from v4 to v7. Support for USB devices, thin-provisioned disks, and supposed speed improvements come with the upgrade.

The process should always be:

1. Upgrade VMware Tools to the latest available version. This pre-stages the drivers for the newest hardware, even though it’s not “installed” yet.
2. Reboot the guest and make sure it boots and runs properly after all upgrades (host and guest) have been completed.
3. Back up the entire guest VM, including the VMX and VMDK files.
4. Upgrade the virtual hardware through vSphere
5. Boot the VM and verify all settings are working properly.

I started the upgrades in the Unix lab. The Red Hat Enterprise Linux (4 and 5) and Ubuntu (10) systems went without a hitch. VMware Tools automatic upgrade went properly, systems rebooted fine, and after upgrading the virtual hardware, I didn’t have to change a thing in the guests. The Solaris 10 x86 guest, had some issues, however. I believe a rescan was all that was required to fix it, but we were planning on rebuilding the box anyways, so used the issues as the final “nail in the coffin” to the old hardware.

On the Windows side, we have 2 pools in our ESX environment: one for test machines, and one running our production environment. We have Domain Controllers (and separate forests) in both environments, but all file and Exchange operations only live in production.

The Windows 2003 DC / Exchange 2003 server came up fine, although it lost its network configuration (adapter MAC changed), so that had to be reset, but is a simple fix.

All Windows 2008 DCs in the test lab, including the RODC, came up fine, but with the same “lost network configuration” hiccup. These systems all have the NTDS data and logs on the C: drive.

The Windows 2008 Server Core DC / File server, however, was a different story. Upon reboot, the server kept giving a BSOD and rebooting, so I couldn’t read the error. As this system is the primary (200GB) file server, primary DNS server (including conditional forwarding to the test lab), and the DC that handles the most load (DNS weight on the Windows 2003 is slightly lower), fixing the Blue Screen was of major importance. This is how it’s been fixed:

1. Safe Mode and “Last known Config” didn’t work, so hit F8 on the boot process to choose “Do not restart on system failure”. This allows you to read the BSOD message. In our case, it was simply “File Not Found”. Which means, no minidump, and you might be sunk.
2. On a whim, since it is a DC, I tried to boot into Directory Services Restore Mode, hoping the “not found” file was AD related… and was right.
3. This leads us down the path of this support article.
4. Immediately upon booting, I ran: ntdsutil files integrity which gave this error:
Could not initialize the Jet engine: Jet Error -566.
Failed to open DIT for AD DS/LDS instance NTDS. Error -2147418113
5. Searching shows there’s not much useful here, but we know it’s a failure to read the DIT. This could be security, or horrid corruption.
6. I quit ntdsutil to try to check the files on the E: drive, where they lived, only to find there was no E: drive. With no MMC, it’s diskpart to the rescue.
7. diskpart
DISKPART> list disk
Disk ### Status Size Free Dyn Gpt
-------- ---------- ------- ------- --- ---
Disk 0 Online 24 GB 0 B
Disk 1 Offline 100 GB 0 B
Disk 2 Offline 100 GB 0 B

8. I ran:
select disk 1
online
select disk 2
online
exit

9. Now I can read the E: drive, so try ntdsutil files integrity again… and get the same error message. Checking the disk, everything looked fine. In Linux, I’d check permissions with a quick “touch filename”, but notepad needed to be used here, only to discover the entire disk was marked read-only. Back to diskpart!
diskpart
select disk 1
attributes disk clear readonly
select disk 2
attributes disk clear readonly

10. Now ntdsutil runs properly, reboot into normal mode, and the system is fixed!

I haven’t seen posts of other people having disks get marked offline and unreadable on their VMs after an upgrade, but this only happened on the Windows 2008 system, and it’s non-system disks.

I’ve been fighting K9Mail for weeks now, trying to get it to sync with MailStreet who hosts “exchange.ms”) hosted Exchange. If you’ve already followed the instructions at the K9Mail Wiki with no success, read on.

Thanks to the k9mail wiki on debugging connection issues and the fact that I already had the Android SDK installed, I was able to solve the 2 related errors I was getting. I would either get an “HTTP 404 not found” or an “HTTP 501 Not Implemented” depending on the settings I chose. With no additional settings other than suggested in the Wiki, I’d get a “501 not implemented”. If I tried to set a mailbox path, or a WebDAV path, I’d get the HTTP 404 Not Found.

In the debugging log, I saw that the system was calling “http://mail.$domain.exchange.ms/”$webDAVpath/Inbox – if I set it to a full URL, the full URL was getting appended. When I attempted to hit those same paths in a full browser, I’d always get an HTTP 404. So, digging in my history in Firefox, I found the following (cleaned) path:

http://mail.$domain.exchange.ms/exchange/$emailaddress/

In this case $emailaddress was my Exchange mail address with the “@” stripped out. Appending “Inbox” to the end of this path resulted in a valid load of my OWA inbox.

Plugging then: /exchange/$emailaddress/ into the WebDAV box in K9Mail, and my email immediately loaded up.

Now I have Android syncing my calendars and contacts, and k9mail is handling my massive inbox!

« Previous PageNext Page »