With our focus on Virtual Desktop Solutions, I have been talking with lots of our customers and partners about their experiences, and I’m shocked at the number of failures caused by storage. I have been designing solution architectures for over 15 years, and these designs should always start with sizing. So why do so many fail to deliver?
From the storage problems I have heard about, I can generally break them into three areas:
- Bad sizing – Deploy, Recompose, Login/Logout, Boot operations should be taken into account, in addition to Steady State requirements. The challenge is how to find out what all of these different operations require. You simply have to test them all to know. In this post I will talk about the workloads that we characterized from the recent testing we have completed. While every environment is different, I hope that you can use my results to get a better starting point when designing your solutions.
- Bad storage – Storage can cause problems in lots of different ways: performance that drops as capacity is consumed, reliability/recovery operations that severely degrade performance (to the point of desktop failure), the use of NL-SAS or SATA technologies, and sizing the solution based on “Marketing Numbers” from synthetic workloads. Add “black magic” tuning techniques, and storage becomes a frustrating (and expensive) part of the solution for the vAdministrator. Storage should just blend into the background of the hypervisor like the CPU and RAM in the blade servers.
- Bad Pilot – Pilots are a great place to test a lot of the things that will be encountered when the solution goes into production. When planning tests, don’t just do the things that are common in your environment; stress testing beyond normal levels can yield valuable insights into where the bottlenecks are. Testing at the 1,500 desktop level gave me a good look at the system requirements of the deploy/recompose/refresh operations.
Flash is very important in the context of VDI, but don’t think for a second that using SSDs means that you don’t need to size.
The only way to observe these different workloads was to build the total solution in the lab and perform the operations that Virtual Desktop environments do every day, until we reached a bottleneck we could not scale around. Simply getting a single server and configuring a synthetic load generator (Iometer, fio, SQLIO, etc.) can give you a quick idea of what a system will do with a single static workload. Again, the problem with Virtual Desktop is that a common VDI workload looks something like this:
I wouldn’t even begin to guess at the scripting that would be required to make Iometer generate a workload with this high variation in read/write mix and a huge swing in IOPS (e.g. 5000 – 15000 IOPS) like the above boot storm.
So we’ve established that the only real way to test a virtual desktop solution is to do the actual things that the environment will do. At X-IO, we made the decision to test the entire virtual desktop solution, not just stress the storage with a synthetic tool. In creating this configuration, the vast majority of the work went into getting all of the pieces working together:
- Configure the Cisco UCS blades, Boot from SAN, 10Gb Ethernet, and the Cisco 9148 (we just completed certification for ISE direct attach into the 6248 Fabric Interconnect, which eliminates the Fibre Channel networking entirely)
- Install ESXi on all of the blades, which booted from the ISE
- Create all of the infrastructure services (Active Directory/DNS/DHCP, vSphere, View Composer, View Connection Server, vCOPS, Login VSI Server, Performance Collection, etc)
- Create the base images for the targets (custom install scripts for Office 2013)
- Create LUNs from the ISE for the desktop pools (ISE Manager Suite: 10 LUNs to 10 datastores in one interface, in minutes)
- Create desktop pools (Floating Pools, View composer linked clones)
- Configure Login VSI to run the actual tests (scripted creation of 2,000 AD accounts, Group Policies for network access, security permissions for Remote Desktop, 80 Launcher VMs, RDP and PCoIP, etc.)
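The account-creation step in the list above lends itself to scripting. The post does not show the script that was actually used, so the following is only a minimal sketch of one way to generate the 2,000 test account names as a CSV that a directory import tool could consume. The `vsi-user` prefix, the `lab.example.com` domain, and the column names are all hypothetical placeholders, not values from the test environment.

```python
import csv
import io

# Hypothetical column names; adjust to whatever your import tool expects.
FIELDS = ["sam_account_name", "user_principal_name"]


def build_test_accounts(count, prefix="vsi-user", domain="lab.example.com"):
    """Return a list of account records with zero-padded, sortable names."""
    width = len(str(count))  # e.g. 4 digits for 2,000 accounts
    accounts = []
    for i in range(1, count + 1):
        name = f"{prefix}{i:0{width}d}"
        accounts.append(
            {"sam_account_name": name, "user_principal_name": f"{name}@{domain}"}
        )
    return accounts


def accounts_to_csv(accounts):
    """Serialize the account records to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(accounts)
    return buf.getvalue()


if __name__ == "__main__":
    csv_text = accounts_to_csv(build_test_accounts(2000))
    # Show the header and the first few rows only.
    print("\n".join(csv_text.splitlines()[:4]))
```

Zero-padding the counter keeps the accounts in order when sorted by name, which makes it easier to assign contiguous ranges of users to groups or desktop pools.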
Now we have to make the desktops do actual work (and be able to quantify their experience). Login VSI for login and steady-state testing, and View Administrator for deploy operations and boots, proved to be invaluable tools for validating the solution on various platforms. Several of our enterprise customers use Login VSI in their labs, and that’s how we were introduced to it. Testing tools that can measure the actual desktop experience (while the desktops are doing real work) are an extremely valuable investment. In the end, the quality of the desktop will be the ONLY thing that really matters.
Key Takeaways from the Reference Architecture Testing
By creating an actual environment (not a single server Iometer config), we were able to observe the behavior of the whole system when doing lots of different things. Some of the main takeaways from the testing were:
- During a Deploy/Recompose operation, the vast majority of the time and resources were spent on the Sysprep process. This stressed the CPU and RAM of the UCS blade servers tremendously, and was extremely write heavy (>80%). I recommend that you time how long this takes on an otherwise idle system (as well as under host load), as this should help in determining how big you want your desktop pools to be. This is a great way to stress the server CPU and RAM, in addition to storage.
- Normal operations were the least demanding of storage performance, but the most sensitive to latency anywhere in the system. In our testing, the main limit was the CPU and RAM of the Cisco UCS blades. To put that in perspective, it took 2x HUGE Cisco UCS chassis filled with blade servers to get to 1,500 users. I know what those things cost, and a single ISE 740 would be a fraction of the total.
- When we booted all of the desktops (1,500), there were bursts of Reads and bursts of Writes. It wasn’t a constant mix of the two. I did this between every test run, so I got a good look at what this process entailed. This was the most stressful operation for the entire environment (CPU/RAM/Storage), and is where I saw more than 50,000 IOPS from the ISE 740. This test can also help in determining how big to make your desktop pools.
- Storage performance required for the login phase of the test was about 2x what the steady-state workload required. Login VSI enabled us to adjust the time that all of the users took to log in, so we were able to observe login storms of varying intensity.
- After I decided on an ISE LUN configuration, I never really touched it again. We were using View Composer Linked Clones to create the desktop images, so once I presented the LUNs to the clusters, there wasn’t anything else to do on the ISE. We changed desktop pool configurations no fewer than 10 times, but never had to change the storage LUN configuration (after initial testing).
The only way I would have found all of the above info was to configure the entire environment, just as our customers do in their environments. Using an “entire system test” methodology allowed us to get information that is immensely more valuable than anything a synthetic tool focused only on storage could provide. Investing in Login VSI was well worth the money, and I encourage everyone to look at tools like this when evaluating an environment.
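As a rough illustration of how the takeaways above might feed a first-pass sizing estimate, here is a minimal sketch. It uses only two figures from the testing: login needing about 2x the steady-state storage performance, and a boot storm of more than 50,000 IOPS across 1,500 desktops. Everything else is an assumption: the steady-state IOPS per desktop (10 here) and the 20% headroom are hypothetical placeholders, and scaling the boot figure linearly with desktop count is a simplification. Measure your own environment before trusting any of these numbers.

```python
# Observed in the testing described above: more than 50,000 IOPS while
# booting 1,500 desktops. Treated here as a per-desktop floor (assumption:
# boot load scales linearly with desktop count).
BOOT_IOPS_PER_DESKTOP = 50_000 / 1_500


def estimate_phase_iops(desktops, steady_iops_per_desktop,
                        login_multiplier=2.0,
                        boot_iops_per_desktop=BOOT_IOPS_PER_DESKTOP):
    """Return estimated storage IOPS for each operating phase.

    steady_iops_per_desktop is NOT from the article; supply your own
    measured value. login_multiplier reflects the "about 2x" observation.
    """
    steady = desktops * steady_iops_per_desktop
    return {
        "steady_state": steady,
        "login": steady * login_multiplier,
        "boot": desktops * boot_iops_per_desktop,
    }


def sizing_target(phases, headroom=0.2):
    """Pick the most demanding phase and add headroom (hypothetical 20%)."""
    peak = max(phases, key=phases.get)
    return peak, phases[peak] * (1.0 + headroom)


if __name__ == "__main__":
    phases = estimate_phase_iops(desktops=1_500, steady_iops_per_desktop=10.0)
    for name, iops in phases.items():
        print(f"{name:>12}: {iops:,.0f} IOPS")
    peak, target = sizing_target(phases)
    print(f"size for '{peak}' with headroom: {target:,.0f} IOPS")
```

The point of the sketch is the shape of the calculation, not the values: size for the most demanding phase rather than for steady state, and rerun the estimate whenever a measured number replaces a placeholder.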
Making SAN Storage Easy to Manage
The amount of time I spent actually working on the “storage” was a small fraction of the total time it took to set up the test environment, and this gave us a wealth of information about management. The ISE management integration with VMware made setting this up, and making changes, simple. The Datastore/LUN configuration was changed several times during the initial testing phase, with various sizes and RAID types. ISE Manager let me perform all of my storage functions, across all clusters and ISEs, from a single interface.
I can create/delete/modify storage (from LUN → Datastore) all in one place, and this ensures that best practices are followed for configuration (saves a TON of time). I have done this manually in previous testing, and it was so painful I swore I’d never do it again. ISE Manager let me treat the storage configuration like any other variable I was testing. I encourage everyone to make integration with your hypervisor one of the “Tier 1” criteria for any new solution.
The Cisco UCS made setting this up very easy compared with most other configurations I’ve built. We didn’t have to run miles of cables in the lab. I didn’t have to stand under an air vent that blows 55 degree air at 20 mph (at least it feels like it), and feed DVDs to servers. I even did the initial ESXi installs from 35,000 ft on the way to Denver (and this was Boot from SAN). The Cisco UCS pretty much took away all of the reasons to even be in the lab. That’s a good thing, as I ran all of the testing in Colorado Springs from where I live in Florida. Combining the UCS with the ISE integration for VMware made it possible to test lots of different LUN and Desktop Pool configurations in a timely manner.
We created the X-Pod architecture with the idea of simplifying the vAdministrator’s life, and the goal of this testing was to prove it. Download the X-Pod for VDI Reference Architecture now.