Recently, I wrapped up testing for a reference architecture scaling out a XenDesktop 7.5 environment here in the labs at X-IO’s headquarters. The project took much longer than I expected, but I learned quite a bit along the way, and we had quite a few head scratching moments as well. All in all, it was a great experience and I look forward to more testing.
One thing I tested was simulated boot storms of 500 and 1000 VMs. These were not slow, controlled boot-up scenarios, but red-alert situations where you have to get all your VMs up and running.
I know you are probably thinking, “I’ll never have to deal with a situation like that!” However, I’ve always believed that a good server administrator has to have a decent level of paranoia. My level of paranoia, I’m sad to say, is probably slightly higher than the normal administrator’s. Somewhere in the back of my mind, I seem to have a threaded process that is always working through worst case scenarios. On planes, when the flight attendant tells you to look around for your nearest exit, I don’t because I figured it out when I sat down. The back of my car is filled with supplies to live in my car should I get stranded, despite the fact that I have roadside assistance and a button to call for emergency help. On the off chance that zombies do happen, let’s just say you’ll want to find my house if you want to live.
The IO Blender
So, why worry about boot storms? Boot storms are probably the extreme example of a stress situation for your hosts, and more importantly, your storage, in a VDI solution. Windows desktops are programmed to handle I/O as though they were installed on a single hardware device, with sequential access to the disk through the processor. In a VDI solution, you have multiple devices in contention for that I/O access, sharing processors, and this gives you what we here at X-IO call the ‘blender effect.’ Think about blending up a mix in a blender. You drop in your ice cream, your milk, some cookie bits and it fills the carafe about ¼ full. (Ok, 1/3 full if you are fixing milk shakes my way.) When you press the button to blend, the action of the blades sends the mix flying up the sides of the carafe, giving the impression of there being so much more content. Once done, it settles down the sides and eventually to the bottom.
This ‘blender effect’ is very much like a boot storm. The contention created when hundreds or thousands of VMs suddenly demand access to storage causes a similar burst in IOPS activity. In my testing, I found that in certain cases, boot storms created a demand for storage transactions that was 20 times higher than normal demand during typical user activity.
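To put that 20x figure to work when sizing, here is a minimal Python sketch. The multiplier comes from my testing above; the function name and the 5 steady-state IOPS per desktop are purely illustrative assumptions, so substitute numbers measured in your own environment.

```python
def boot_storm_peak_iops(vm_count, steady_iops_per_vm, storm_multiplier=20):
    """Estimate storage demand during a boot storm.

    storm_multiplier defaults to 20, the worst-case ratio observed in
    the testing described in this post. steady_iops_per_vm is something
    you must measure yourself; no value for it comes from the post.
    """
    if vm_count < 0 or steady_iops_per_vm < 0 or storm_multiplier < 1:
        raise ValueError("inputs must be non-negative, multiplier >= 1")
    steady_total = vm_count * steady_iops_per_vm
    return steady_total * storm_multiplier


# Hypothetical example: 1000 desktops at an assumed 5 IOPS each.
steady = 1000 * 5
peak = boot_storm_peak_iops(1000, 5)
print(f"steady-state demand: {steady} IOPS")
print(f"boot storm demand:   {peak} IOPS")
```

If your storage is sized only for the first number, the second one is what arrives when everything powers on at once.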
“I don’t need to size for boot storms… wait, what just happened?!?!”
As a good, somewhat paranoid, administrator, you should always be planning for the impact of worst case scenarios, and when it comes to routine operations, a boot storm is definitely one for which you should prepare. Ideally, you shouldn’t encounter a boot storm. As a paranoid administrator, you probably also plan, design, and implement around the fact that you do NOT want to get paged at 2 a.m. on a regular basis, so you design your environment to NOT have boot storms occur. Whether you are using Desktop Groups with precisely defined boot and shut down windows, or you’ve added Citrix’ Power and Capacity Management tool to give you a savvy level of control over your environment, the boot storm is not part of your daily routine.
It is, however, part of your extreme emergency case routine for which you need to be prepared. Examples of when a boot storm will become necessary:
- Power outage to your data center, either planned or unplanned
- Hypervisor host outage
- Storage outage (not that that should happen with ISE, but I have to include it to be fair!)
- Your backup admin while you were on vacation
- Another admin with access to the environment, and just enough knowledge to be dangerous
In the first three cases, you may say, ‘Why wouldn’t I allow normal power cycle procedures to occur?’ Easy – normal power cycling procedures, particularly if you are allowing your Desktop Controllers to power cycle your VMs, are throttled to allow a graceful power-up of the environment. In stress situations, such as a power outage or unexpected host/cluster outage, your business is at risk. For some companies, this means actual cash on the line, either in the form of lost business revenue, or in the form of SLA penalties, particularly if they are hosting applications. All your VP knows is that you have the ability to have your VDIs online – RIGHT NOW – but you refuse to boot them up all at once because of a blender.
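To see why a throttled power-up is a hard sell during an outage, here is a rough Python sketch of how long a batched boot takes. This is not the Desktop Controller’s actual throttling logic; the function name, the batch size of 50, and the 2 minutes per batch are all hypothetical values chosen only to show the arithmetic.

```python
import math


def throttled_boot_minutes(vm_count, batch_size, minutes_per_batch):
    """Total minutes to power on vm_count VMs in fixed-size batches.

    Assumes (hypothetically) that each batch takes the same time and
    that batches never overlap. Real throttling is more nuanced.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = math.ceil(vm_count / batch_size)
    return batches * minutes_per_batch


# Hypothetical policy: 50 VMs per batch, 2 minutes per batch.
minutes = throttled_boot_minutes(1000, 50, 2)
print(f"1000 VMs, throttled: about {minutes} minutes until the last desktop is up")
```

Whatever your real numbers are, that total is the figure your VP will compare against “RIGHT NOW.”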
You’ve set yourself up for a Resume Generating Event (RGE). Paranoid admins should plan to avoid RGEs at all costs.
This is why you should include boot storms and their impact in planning your VDI environment. Ideally, they will never occur. Failing to plan for them will pretty much guarantee they will occur and make your life, and career, miserable. I’d like to help you avoid that.
XenDesktop Boot Storms with PVS or MCS
While you are planning your VDI environment, plan for the boot storm and how that will impact your IOPS. Your storage needs to be able to handle the boot storm effectively for it to be adequate for your VDI needs. Let’s look at some details from my testing.
I tested for three different scenarios:
- Machine Creation Services managed VMs (MCS)
- Provisioning Services managed VMs, using cache to hard drive (PVS CHD)
- Provisioning Services managed VMs, using cache to RAM, overwrite to hard drive (PVS C2RAM/HD)
Because MCS copies master images to the datastores in use, all VMs in a common catalog sharing a common datastore use the same image for their OS. This creates a scenario where excessive read IOPS activity occurs, much more than the total IOPS activity for either PVS scenario.
PVS, however, doesn’t get a pass. The need to cache information to the attached hard disk in both situations causes an increase in write IOPS. Both had peak bursts that greatly exceeded normal operational activity, but they were ¼ the peak size seen in MCS.
Let’s start with MCS and take a look at the results of highlighting all the VMs in the console and hitting the power button:
The bad news first – average IOPS activity consistently remained in the 70K range during a large part of the boot operation. Peaks, as I mentioned, got well above 80K and close to 90K on one peak. It’s clear the IOPS demand on storage was unrelenting for the duration of the boot cycle. The good news is that all 1000 VMs were up and registered in the console within 10 minutes of hitting the power switch. The ISE 740 hybrid storage array that was used for this test was able to handle the demand without a problem. Quite impressive!
Now, for PVS C2RAM/HD:
Since PVS streams its image from the PVS store, the focus for IOPS is on write, as opposed to the heavy focus on read for MCS. There are peak bursts up to 20K, but there isn’t the sustained high level of IOPS activity like there is in MCS. All VMs were powered up and registered within 6 minutes. PVS CHD showed similar activity to the PVS C2RAM/HD.
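For a rough per-desktop view of those results, the short Python sketch below divides the reported totals by the VM count. The IOPS figures and boot times are the ones quoted above; treating the PVS run as 1000 VMs is my assumption for the comparison, and the helper names are just for illustration.

```python
def per_vm_iops(total_iops, vm_count):
    """Average IOPS each VM contributes to the total load."""
    return total_iops / vm_count


def boot_rate_per_minute(vm_count, minutes):
    """Average number of VMs registered per minute."""
    return vm_count / minutes


VM_COUNT = 1000  # stated for the MCS run; assumed here for the PVS run

mcs_sustained = per_vm_iops(70_000, VM_COUNT)   # MCS sustained average
pvs_peak = per_vm_iops(20_000, VM_COUNT)        # PVS C2RAM/HD peak burst
mcs_rate = boot_rate_per_minute(VM_COUNT, 10)   # MCS: all up in 10 minutes
pvs_rate = boot_rate_per_minute(VM_COUNT, 6)    # PVS: all up in 6 minutes

print(f"MCS sustained: {mcs_sustained:.0f} IOPS per VM, {mcs_rate:.0f} VMs/min")
print(f"PVS peak:      {pvs_peak:.0f} IOPS per VM, {pvs_rate:.0f} VMs/min")
```

Per-VM numbers like these are handy for a first-pass estimate at other desktop counts, though boot storm load may not scale linearly, so treat any extrapolation with suspicion until you test it.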
There are some things you can do to help performance during your stress situations, such as a boot storm. One is to use storage that has hybrid technology, utilizing both HDD and Flash technology. X-IO’s ISE 700 Series Hybrid Storage Arrays certainly fit that bill. Combine them with our patented, intelligent Continuous Adaptive Data Placement (CADP) engine, and you can be assured that the stress points for your boot storms, such as that master image in MCS, get moved to flash during the high demand for read IOPS activity to provide the best performance.
For PVS, we have a better option to help with boot storms. Our latest ISE Manager software adds media affinity. Create a LUN that will be the host for your PVS servers and their associated drives that host the image stores. Pin the LUN to flash storage, and you can be assured of high performance from them while they stream the stored images to their target devices. Because PVS is constantly being used, pinning them to flash makes sense.* Combine this on your infrastructure with CADP watching your VMs and managing their promotion to flash as needed, when needed, and you can rest assured that you’ve built the best, adaptive environment for your VDI needs.
Have you got other concerns about scaling your Citrix environment, or about whether you’ve got the right storage for your needs? Reach out to us at X-IO, and we’ll be happy to talk to you more about this and other stress points for setting up a Citrix XenDesktop VDI.
*Media affinity only applies at the datastore level. At this time, a single file, such as the MCS master image, cannot be pinned and remain on the same datastore with unpinned data, such as your VMs. Media affinity is best for PVS image stores.