VMware


Since upgrading to 4.1, I have pushed the main ESX host for TNS to 70% RAM overcommit. Interestingly (though not unexpectedly), performance is now the same as it was on 3.5 with two fewer VMs and only 50% overcommit. It's still pretty poor in the "Lab" resource pool, though, even after changing that pool's shares from "low" to "normal". So we finally ordered new memory, doubling the server to 16 GB. It goes in Sunday night, so we'll see how things perform next week when Rob's on site with customers.

I recently upgraded the totalnetsolutions.net internal network from ESX 3.5 to ESXi 4.1. The ESX host upgrade itself is simple and not worth detailing. Once it's complete, however, you have the option to upgrade each guest's virtual hardware from v4 to v7. The upgrade brings support for USB devices, thin-provisioned disks, and supposed speed improvements.

The process should always be:

1. Upgrade VMware Tools to the latest available version. This pre-stages the drivers for the newest hardware, even though it’s not “installed” yet.
2. Reboot the guest and make sure it boots and runs properly after all upgrades (host and guest) have been completed.
3. Back up the entire guest VM, including the VMX and VMDK files (a backup sketch follows this list).
4. Upgrade the virtual hardware through vSphere.
5. Boot the VM and verify all settings are working properly.
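
For step 3, here's a minimal backup sketch from the ESX service console, assuming the VM is powered off; the datastore and VM names (datastore1, dc1) are placeholders for your own:

cd /vmfs/volumes/datastore1
mkdir dc1-backup
cp dc1/dc1.vmx dc1-backup/                        # copy the VM configuration file
vmkfstools -i dc1/dc1.vmdk dc1-backup/dc1.vmdk    # clone the virtual disk (descriptor and data together)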

I started the upgrades in the Unix lab. The Red Hat Enterprise Linux (4 and 5) and Ubuntu (10) systems went without a hitch: the VMware Tools automatic upgrade ran properly, the systems rebooted fine, and after upgrading the virtual hardware I didn't have to change a thing in the guests. The Solaris 10 x86 guest had some issues, however. I believe a rescan was all that was required to fix it, but we were planning on rebuilding that box anyway, so we used the issues as the final nail in the coffin for the old hardware.
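
For reference, a hedged guess at the kind of rescan that likely would have fixed the Solaris 10 x86 guest, assuming the problem was stale device links after the virtual hardware change:

# rebuild the device tree and clean up dangling /dev links
devfsadm -Cv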

On the Windows side, we have two pools in our ESX environment: one for test machines, and one running our production environment. We have domain controllers (and separate forests) in both environments, but all file and Exchange services live only in production.

The Windows 2003 DC / Exchange 2003 server came up fine, although it lost its network configuration (the virtual adapter's MAC address changed), so the IP settings had to be re-entered; a simple fix.
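
If you hit the same thing, re-entering the settings from the command line is quick. A minimal sketch, assuming a static address and that the re-detected NIC came up as "Local Area Connection 2" (the interface name and addresses here are purely illustrative):

netsh interface ip set address name="Local Area Connection 2" source=static addr=192.168.1.10 mask=255.255.255.0 gateway=192.168.1.1 gwmetric=1
netsh interface ip set dns name="Local Area Connection 2" source=static addr=192.168.1.5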

All Windows 2008 DCs in the test lab, including the RODC, came up fine, but with the same “lost network configuration” hiccup. These systems all have the NTDS data and logs on the C: drive.

The Windows 2008 Server Core DC / file server, however, was a different story. Upon reboot, the server kept hitting a BSOD and restarting, so I couldn't read the error. Since this system is the primary (200 GB) file server, the primary DNS server (including conditional forwarding to the test lab), and the DC that handles the most load (the DNS weight on the Windows 2003 box is slightly lower), fixing the blue screen was a top priority. Here's how it was fixed:

1. Safe Mode and "Last Known Good Configuration" didn't work, so I hit F8 during boot and chose "Disable automatic restart on system failure". That lets you read the BSOD message; in our case it was simply "File Not Found", which means no minidump, and you might be sunk.
2. On a whim, since it's a DC, I tried booting into Directory Services Restore Mode, hoping the "not found" file was AD-related… and I was right.
3. This leads us down the path of this support article.
4. Immediately upon booting, I ran ntdsutil files integrity, which gave this error:
Could not initialize the Jet engine: Jet Error -566.
Failed to open DIT for AD DS/LDS instance NTDS. Error -2147418113
5. Searching turned up little that was useful, but we know it's a failure to read the DIT. That could be a permissions problem, or serious corruption.
6. I quit ntdsutil to check the files on the E: drive, where they lived, only to find there was no E: drive. With no MMC on Server Core, it's diskpart to the rescue.
7. diskpart
DISKPART> list disk
  Disk ###  Status      Size     Free     Dyn  Gpt
  --------  ----------  -------  -------  ---  ---
  Disk 0    Online        24 GB      0 B
  Disk 1    Offline      100 GB      0 B
  Disk 2    Offline      100 GB      0 B

8. I ran:
select disk 1
online
select disk 2
online
exit

9. Now I can read the E: drive, so I tried ntdsutil files integrity again… and got the same error message. Checking the disk, everything looked fine. In Linux I'd check permissions with a quick "touch filename"; here notepad had to do the job, and it revealed that the entire disk was marked read-only. Back to diskpart! (A consolidated script for both fixes follows this list.)
diskpart
select disk 1
attributes disk clear readonly
select disk 2
attributes disk clear readonly

10. Now ntdsutil runs cleanly; reboot into normal mode, and the system is fixed!
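
As promised above, both fixes can be rolled into a single diskpart script; a minimal sketch matching the interactive commands used above, assuming the disk numbers line up with your "list disk" output (save as fixdisks.txt, then run diskpart /s fixdisks.txt):

rem bring both data disks online and clear the read-only attribute
select disk 1
online
attributes disk clear readonly
select disk 2
online
attributes disk clear readonly
rem display the attributes of the last selected disk to confirm
attributes disk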

I haven’t seen posts from other people having disks marked offline and unreadable in their VMs after an upgrade, but it only happened here on the Windows 2008 Server Core system, and only on its non-system disks.

Because not enough information about this turns up in easy-to-find searches, here's a simple reminder: SCSI device IDs can and will change.

A few months ago I hot-added a new disk to an ssh bastion host (a VM on ESX). As these things tend to go, I eventually took a maintenance window and updated firmware/BIOS/OS on the ESX host. When the bastion VM came back online, however, I was presented with an odd error:

[root@bastion ~]: fsck /dev/sdc1
e2fsck 1.39 (29-May-2006)
fsck.ext3: Device or resource busy while trying to open /dev/sdc1
Filesystem mounted or opened exclusively by another program?
[root@oracle1 ~]# cat /proc/mounts
rootfs / rootfs rw 0 0
/dev/root / ext3 rw,data=ordered 0 0
/dev /dev tmpfs rw 0 0
/proc /proc proc rw 0 0
none /selinux selinuxfs rw 0 0
devpts /dev/pts devpts rw 0 0
tmpfs /dev/shm tmpfs rw 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
sunrpc /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
[root@oracle1 ~]# cat /etc/fstab
/dev/main/root / ext3 defaults 1 1
/dev/sdc1 /home ext3 defaults 1 2
/dev/main/var /var ext3 defaults 1 2
/dev/main/tmp /tmp ext3 defaults 1 2
LABEL=/boot /boot ext3 defaults 1 2
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
/dev/main/swap swap swap defaults 0 0
# Beginning of the block added by the VMware software
.host:/ /mnt/hgfs vmhgfs defaults,ttl=5 0 0
# End of the block added by the VMware software

So everything in the fstab is how I left it: /dev/sdc1 is the new disk I added, and it's the one that's failing to mount. I thought to check the disk for corruption, and instead found the real problem:

[root@oracle1 ~]# fdisk -l
Disk /dev/sda: 42.9 GB, 42949672960 bytes
255 heads, 63 sectors/track, 5221 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          13      104391   83  Linux
/dev/sda2              14        5221    41833260   8e  Linux LVM

Disk /dev/sdb: 42.9 GB, 42949672960 bytes
255 heads, 63 sectors/track, 5221 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1        5221    41937651   83  Linux

Disk /dev/sdc: 32.2 GB, 32212254720 bytes
255 heads, 63 sectors/track, 3916 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1   *           1        3917    31457279+  8e  Linux LVM

The fdisk output shows what happened: after the host maintenance the disks enumerated in a different order, so the ext3 data disk that used to be /dev/sdc is now /dev/sdb, and /dev/sdc is now one of the LVM member disks (which is why fsck reported it as busy). So, a simple fix: change "/dev/sdc1" to "/dev/sdb1" in /etc/fstab (or, better, switch to a label such as LABEL=home), and boot back up.

It probably won't happen on this server again, but it's something to be aware of, on both VM guests and physical servers. This is why so many newer Linux distributions use UUID= or LABEL= in fstab instead of the raw device path for SCSI disks.
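
To make the entry order-proof, something like the following works; a minimal sketch, assuming the data disk currently shows up as /dev/sdb1 and you want to label it "home" (the device name and label are illustrative):

# show the filesystem's UUID and any existing label
blkid /dev/sdb1
# assign a label so fstab no longer depends on SCSI ordering
e2label /dev/sdb1 home
# the fstab line then becomes:
# LABEL=home  /home  ext3  defaults  1 2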

I just finished my upgrade from Kubuntu 8.04 to 8.10 this past week (since I had downtime from work, I could afford to break things for a few days).  The upgrade went great, and I’ll write about it shortly, once I get used to the newness.

Anyway, Workstation 6.5 has been giving me problems.  Because of the newness of KDE 4, I initially thought it was a KDE problem, but it turns out it's something between Workstation 6.5 and Ubuntu 8.10.  I ran the "adapt --dist-upgrade-devel" command from the Ubuntu wiki to upgrade, and upon reboot I couldn't use "ctrl-alt-ins" or "ctrl-alt-del" to log into my Windows VM, my "Windows/Start" key wouldn't respond, and my arrow keys wouldn't work.  Incredibly, when I hit the "down" arrow, the Windows Start menu would pop up!

The fix is easy: edit /etc/vmware/config and add the line shown below:

sudo vim /etc/vmware/config
:$
o    (vi commands: ":$" jumps to the last line, "o" opens a new line below it)
xkeymap.nokeycodeMap = true

You have to restart your VMs for this change to take effect. Thanks to Duncan Epping for this fix (he posted it in the forums, where I found it).

Just as a quick note: Windows Server 2008 RC0 seems to have the same setup issue as Windows Vista, or at least the x64 RC0 does. I spent most of the evening last night editing settings, rebooting, plugging in the product key, and reading “This computer’s hardware may not support booting to this disk. Ensure that the disk’s controller is enabled in the BIOS.” The problem is detailed for Vista at http://support.microsoft.com/kb/925481. Windows Server 2008 RC1 fixes this issue.

I was having the problem on a virtual machine in VMware ESX 3.5, so it was easy to disconnect a disk to get past the error, but downloading and installing the updated RC seemed like the better fix for the first DC in a new test lab.

What apparently is going on is that if you have two hard drives that have never been partitioned or initialized, the Setup program gets confused. You can temporarily remove one of the disks, format them from another boot medium (BartPE, anyone?), or just not use Win2k8 RC0. According to the support note, the only fix for Vista is to format the drives. I bet removing one of them would let Vista install too, but I haven't tested it.
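
If you'd rather not pull a disk or hunt for other boot media, a hedged sketch of the "format it first" route: Setup's command prompt (Shift+F10 on the install screens) includes diskpart, so the second, blank disk can be partitioned in place. The disk number below is illustrative; confirm it with list disk before running clean, since clean wipes the selected disk.

diskpart
list disk
select disk 1
clean
create partition primary
format fs=ntfs quick
exit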
