Sunday, May 22, 2011

Pitfalls to Virtualizing Exchange 2010

Exchange performs best when it can interact with the physical components of a server directly. If you disagree with this statement, that's usually a symptom exhibited right after a VMware conference - hopefully it will go away.

Before I go into the pitfalls, let me say I love virtualization - most of my Exchange 2010 deployments are virtualized. Virtualization is fantastic for small deployments of Exchange 2010, as it provides many advantages, including better energy efficiency and less hardware through server consolidation. What we Exchange people are arguing is that there are times when you do not want to virtualize an Exchange server.

One statement VMware makes is "Virtualize enterprise apps, including Oracle, Exchange, SQL Server, Sharepoint and SAP, and deliver the highest SLAs and top performance." Please see: http://www.vmware.com/virtualization/. How is virtualizing your infrastructure going to achieve top performance? Top performance can only be achieved when software can interact directly with physical hardware, bypassing the hypervisor completely.

VMware has publicly stated that Exchange 2010 DAG replication is supported - please see:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1037959

This is officially NOT supported by Microsoft. Just because VMware had it working in their lab with "simulated load generators" doesn't mean it will behave correctly in a production environment - I am speaking here from experience. The Microsoft Exchange product team has clearly stated that both Hyper-V live migration and vMotion are not supported. You can see the link: http://technet.microsoft.com/en-us/library/aa996719.aspx. The link states that you're not supported if you use vMotion/SRM (etc.) at the same time as the DAG. That's the support statement. Period.

I have seen problems directly related to virtualizing clustered Exchange 2010 mailbox servers. I have seen virtualized Exchange servers where the Windows cluster had kicked the server out of the cluster. When you try to re-add the node to the cluster, the node joins and is then evicted again. This can be resolved by evicting the node via ConfigurationOnly and removing it from Windows Clustering. The problem? The Witness! It appears the Witness was locked and the new (old) node could not access it. As it turns out, the client had vMotioned the server to another host, and that's when all the problems began.

I have just got back to Australia from MSIDC (Microsoft Indian Development Centre). During my time at MSIDC I had an interesting internal debate on this subject with a Microsoft virtualization MVP named Susantha Silva. Virtualization MVPs usually think everything should be virtualized; I'm sure they would virtualize the operating system on their mobile phone if they could! Present during this debate was Sheesh Dubey, Hyper-V Program Manager for Microsoft. Sheesh Dubey closed the debate by saying "If a product team states their application should not be virtualized in any particular circumstance, you should always follow the advice of the product team as they have reasons behind any statement they make."

A pitfall when virtualizing mailbox servers that are members of a cluster is taking snapshots while transaction log replication is running. Go on, take a snapshot of a server performing log replication. Have fun fixing it!

Another pitfall of virtualization is low-balling the RAM because "it's a VM and we can add it whenever". Exchange 2010 was architected to use slow disk; as a result it needs more memory to provide ample disk cache. Make sure you use the Exchange storage calculator and calculate your RAM correctly.

The next problem is a concept called Dynamic Memory. This is when a virtual machine can automatically extend the amount of memory allocated to it and return memory to the hypervisor when it's not required, through a technique called ballooning. SQL Server now supports Dynamic Memory as per http://support.microsoft.com/kb/956893, but the Exchange 2010 Mailbox role does not (all other Exchange 2010 roles do). This is because the Mailbox role in Exchange Server does not change its memory allocations on the fly. I encourage you to do a Google/Bing search for "dynamic memory exchange".

Another pitfall is not doing the I/O numbers because "it's that magic virtual storage behind VMware". If you want the absolute best disk I/O, you need to pass the physical storage directly through to the virtual machine. If you insist on using a VHD or VMDK file, make sure it's fixed size! Another problem with shared storage is that it's very easy to overcommit a RAID array of disks on a LUN to too many servers. I have seen so many virtual Exchange servers where the disk queue lengths exceed 100 - very bad! Especially if you're using the recommended Exchange 2010 Tier 2 storage model, please make sure nothing else hits the disks!

Tossing a whole DAG on one virtual host. Yes, I have seen it done, unfortunately. We have mentioned above the reasons why we cannot live migrate or vMotion servers that are members of a DAG cluster. So what is the point? Two mailbox servers in a DAG cluster on a single host perform worse than a single standalone mailbox server would.

People, one virtual CPU does not represent one physical CPU. A virtual CPU represents 1 core of a physical CPU.

RDM/LUN limits per VMware cluster (FC path limit of 1024 / 4 paths = 256 LUNs per cluster)
RDM LUN limits per Exchange VM (60 LUNs max – 4x SCSI controllers, 15 LUNs per controller)
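
The arithmetic behind those two limits can be sketched in a few lines. This is only an illustration using the figures quoted above; the constant names, function names and the example layout are made up for the sketch and are not part of any VMware tooling.

```python
# Limits as quoted in the post; all names here are illustrative.
FC_PATHS_PER_CLUSTER = 1024   # FC path limit per VMware cluster
PATHS_PER_LUN = 4             # paths to each LUN
SCSI_CONTROLLERS_PER_VM = 4   # SCSI controllers per VM
LUNS_PER_CONTROLLER = 15      # LUNs per controller


def max_luns_per_cluster(paths=FC_PATHS_PER_CLUSTER, paths_per_lun=PATHS_PER_LUN):
    """LUNs a cluster can address before the path limit is exhausted."""
    return paths // paths_per_lun


def max_luns_per_vm(controllers=SCSI_CONTROLLERS_PER_VM,
                    per_controller=LUNS_PER_CONTROLLER):
    """RDM LUNs a single Exchange VM can attach."""
    return controllers * per_controller


def check_rdm_plan(luns_per_vm, vm_count):
    """Return the limits a hypothetical RDM layout would break."""
    problems = []
    if luns_per_vm > max_luns_per_vm():
        problems.append("per-VM limit exceeded")
    if luns_per_vm * vm_count > max_luns_per_cluster():
        problems.append("per-cluster limit exceeded")
    return problems


if __name__ == "__main__":
    print(max_luns_per_cluster(), max_luns_per_vm())
    # Hypothetical layout: 8 mailbox VMs with 40 RDM LUNs each.
    print(check_rdm_plan(luns_per_vm=40, vm_count=8))
```

A layout can stay under the per-VM limit and still run the whole cluster out of paths, which is why both checks are worth doing before you commit to one LUN per database.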

When Exchange 2010 is virtualized I usually see the DB/logs on the same disk spindles that are used by the VMs for the operating system. Remember back to the days when all servers were physical. Were the DB and logs ever on the same disks as the operating system? No, they weren't.

When virtualizing your client access servers, Microsoft NLB does not support unicast mode if Notify Switches is turned on. This means you need to configure NLB to use multicast mode and send cluster communication over the server's primary network interface card.

Virtual sprawl. I see admins just add more and more virtual machines until the virtualization platform almost collapses and dies. They seem to have such faith in it that they don't even monitor it; they also think that in some mystic way it will compensate for not sizing RAM, disk and CPU correctly. I think this comes down to the knowledge and maturity of the admin.

I have a personal rule I follow: if more than 2000 users require access to a mailbox server and the users are heavy power users, I ensure the Exchange server is set up as a physical server. If there are under 2000 users, they require only lightweight access to their email and they do not require DAG clustering with live migration/vMotion, I set the environment up as virtual. This assumes that the disks and servers are not already overcommitted. On the majority of networks, Exchange, followed by SQL, requires more resources than any other server. Because of this it does not always make sense to perform server consolidation through virtualization, as your Exchange server may already be using the majority of the resources on the physical host. Please note: CPUs are getting faster, and this blog post is getting older. Pay close attention to the CINT Rates value of the CPU, use the sizing spreadsheet created by Ross Smith and size your CPUs appropriately.
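
Written down as code, that rule of thumb looks roughly like the sketch below. The function name, parameter names and the "case by case" fallback are my own labels for illustration; the rule only speaks to the two clear-cut cases, and it is no substitute for running the sizing spreadsheet.

```python
def suggest_platform(users, heavy_users, needs_dag_live_migration,
                     host_overcommitted):
    """Apply the personal rule of thumb to one mailbox server.

    Returns "physical", "virtual", or "case by case" when the rule
    does not give a clear answer.
    """
    # More than 2000 heavy power users: go physical.
    if users > 2000 and heavy_users:
        return "physical"
    # Under 2000 light users, no DAG with live migration/vMotion, and
    # hosts and disks that are not already overcommitted: go virtual.
    if (users < 2000 and not heavy_users
            and not needs_dag_live_migration
            and not host_overcommitted):
        return "virtual"
    # Everything else needs proper sizing work, not a rule of thumb.
    return "case by case"


if __name__ == "__main__":
    print(suggest_platform(3500, True, False, False))
    print(suggest_platform(800, False, False, False))
    print(suggest_platform(800, False, True, False))
```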

Take advantage of cheap local storage. The whole idea of Exchange 2010 is to have multiple copies of your data in different geographical locations on extremely cheap disk/hardware. It is designed to scale out so that you CAN have servers fail. A properly designed enterprise Exchange 2010 environment should be able to cope with 49% of the servers failing without taking down the environment. If you have a deep understanding of how Exchange 2010 clustering and native data protection technology work, you will understand where we Exchange guys are coming from in terms of virtualization.
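
One way to read the 49% figure is as majority quorum: the cluster underneath a DAG stays up as long as more than half of the votes are still reachable. The sketch below assumes the usual arrangement where a DAG with an even number of members gets one extra vote from a file share witness and a DAG with an odd number of members does not; the function names are illustrative only.

```python
def quorum_votes(members):
    """Return (total_votes, votes_needed) for a DAG with `members` servers.

    Assumption: an even member count adds one file share witness vote,
    an odd member count relies on node majority alone.
    """
    total = members + 1 if members % 2 == 0 else members
    needed = total // 2 + 1   # strict majority of all votes
    return total, needed


def tolerable_failures(members):
    """Votes that can be lost before the cluster loses quorum."""
    total, needed = quorum_votes(members)
    return total - needed


if __name__ == "__main__":
    for n in (3, 4, 5, 16):
        total, needed = quorum_votes(n)
        print(n, "members:", total, "votes,", needed, "needed,",
              tolerable_failures(n), "can fail")
```

Note that the witness counts as one of the votes, so losing the witness uses up one of the tolerable failures.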

If you're an SME with up to 1000 users, I highly recommend looking at the HP e5000. It's a DAG-in-a-box solution with 2 physical blade servers and locally attached storage. It comes bundled with the Exchange licensing and is set up with a plug-in-and-go approach. It offers a fast, lean, mean, highly available email system for companies that require high levels of email uptime. There is loads of information about the e5000 on the internet.

4 comments:

  1. Hi there! Was reading this post with interest and have a quick question!

    In your observations, have you experienced similar happenings on an Exchange 2007 environment?

    My Exchange is in a VMware environment and experiences similar events on occasion (especially weird as vMotion was being run against the cluster's passive node).

    Hope to hear back!

  2. Hi mate,

    Exchange 2007 and 2010 are fully supported in a virtual environment. What is not supported is performing live migration or vMotion on clustered mailbox servers that use log shipping. I assume you are running CCR in a virtual environment?

    No, I have not seen problems in 2007, as I have never vMotioned a live cluster. The only supported way of moving a mailbox server that takes part in log shipping is to power the server down, move it to another host, then power it up. Disable vMotion for your clustered mailbox servers.
    I assume you are referring to performing vmotion in an emviro

  3. Hi Clint,

    What about this announcement from Microsoft:

    http://blogs.technet.com/b/exchange/archive/2011/05/16/announcing-enhanced-hardware-virtualization-support-for-exchange-2010.aspx

    It seems to indicate that MS does now, at least vaguely, support HA/DRS/vMotion with Exchange 2010 SP1 and later DAGs.

    What are your thoughts on this?

    Thanks!

  4. Hi,

    What would you advise for a large deployment?
    Say 15,000 to 20,000 users?

    Have you seen any organization of that size virtualize all of their Exchange servers, including the mailbox servers?
