Entries in VMware (6)

Monday
Sep 19, 2011

vSphere Research Project - Security Update

Several readers (ok, more than several) pointed out some security concerns about releasing esxtop data with hostnames and VM names in it. Click through for a simple Perl script that will strip out all the names from your data; they are replaced with server1 ... serverN. It is not pretty, but it gets the job done. Run the script with the name of the CSV file from esxtop. This was a giant oversight on my part, and I should have put this up with the original request.
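The original script is Perl and sits behind the read-more link, but the idea is simple enough to sketch. Here is a rough Python equivalent; it assumes esxtop's batch-mode, perfmon-style "\\host\Object(id:instance)\Counter" header layout, and the ".anon" output filename is my own convention:

```python
#!/usr/bin/env python3
r"""Minimal sketch of an esxtop CSV anonymizer (a Python stand-in for the
Perl script referenced above). Assumes perfmon-style header fields of the
form \\host\Object(id:instance)\Counter; adjust the patterns if yours differ."""
import re
import sys

def anonymize(in_path, out_path):
    with open(in_path, newline="") as f:
        text = f.read()

    # Candidate names: the host in "\\host\..." prefixes and the VM/world
    # names inside "(id:name)" instance groups.
    names = set(re.findall(r"\\\\([^\\]+)\\", text))
    names |= set(re.findall(r"\(\d+:([^)]+)\)", text))

    # Replace each distinct name with server1 ... serverN, longest first so a
    # short name does not clobber part of a longer one.
    for i, name in enumerate(sorted(names, key=len, reverse=True), start=1):
        text = text.replace(name, f"server{i}")

    with open(out_path, "w", newline="") as f:
        f.write(text)

if __name__ == "__main__":
    anonymize(sys.argv[1], sys.argv[1] + ".anon")
```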

This post refers to a paid vSphere research survey here.


Wednesday
Aug 31, 2011

vSphere Research Project

I did not realize how long it has been since I last posted. Time has been flying by. I took a position with a startup about 6 months ago. My new company is still in stealth mode, so there is not much I can say about the details yet. We are solving some hard IT problems and I think the end result is going to be very exciting.

One of the projects I am working on requires me to understand what "typical" VMware environments look like. After lots of digging, it turns out there is not much published information about the systems running in these environments. So, I am looking to all of you to help me gather data. I have posted a survey online, and anyone who submits esxtop output from a vSphere 4.1 (or later) system will get a $10 Starbucks card.

Click here to take the survey, and feel free to forward it along.

I am looking forward to coming out of stealth and getting to share all of the exciting things we have been working on. Watch here or follow me @JesseStLaurent for updates as we get closer to launch.

Friday
Mar 26, 2010

Block alignment is critical

Block alignment is an important topic that is often overlooked in storage. I read a blog entry by Robin Harris a couple of months back about the importance of block alignment with the new 4KB-sector drives. I was curious to test the theory on one of the new 4KB drives, but I did not have one on hand. That got me thinking about Solid State Disk (SSD) devices. If filesystem misalignment hurts traditional spinning disk performance, how would it impact SSD performance? In short, it is ugly.

Here is a chart showing the difference between aligned and misaligned random read operations to a Sun F20 card (I guess it is officially an Oracle F20 card now).

[Chart: Oracle F20 - Aligned vs. Misaligned]

With only a couple of threads, the flash module can deliver about 50% more random 4KB read operations when aligned. As the thread count increases, the module delivers over 9x the number of operations if properly aligned. It is worth noting that the card delivers those aligned reads at less than 1ms, while the misaligned operations average over 7ms of latency. 9x the operations at 85% less latency makes this an issue worth paying attention to. (My test was done on Solaris, and here is an article about how to solve the block alignment issue for Solaris x64 volumes.)

I have seen a significant increase in block alignment issues with clients recently. Some arrays and some operating systems make it easier to align filesystems than others, but a new variable has crept in over the last few years. VMware on block devices means that VMFS adds another layer of abstraction to the process. Now it is important to be sure the virtual machine filesystems are aligned in addition to the root operating system/hypervisor filesystem.

Server virtualization has been the catalyst for many IT organizations to centralize more of their storage. Unfortunately, centralized storage does not come at the same $/GB as the mirrored drives in the server; it is much more expensive. Block misalignment can make the new storage even more expensive by making it less efficient. Misaligned filesystems make the array cache far less efficient, and when that misaligned data is read from or written to disk, the drives are forced to do additional operations that would not be required for an aligned operation. It can quickly turn a fast storage array into a very average system. Most of the storage manufacturers can provide you with a best practices doc to help you avoid these issues. Ask them for a whitepaper about block alignment issues with virtual machines.
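The alignment check itself is simple arithmetic, and it is worth running at every layer (array LUN, VMFS, and guest filesystem). Here is a minimal Python sketch, not from the original article, using the common 512-byte sector size and a 4KB back-end block:

```python
# A start offset is aligned when it is an exact multiple of the back-end
# block size, so every 4KB guest I/O maps to one back-end block instead of
# straddling two. Block sizes here are illustrative.
SECTOR_BYTES = 512  # legacy sector size used by partition tables

def is_aligned(start_sector: int, block_bytes: int = 4096) -> bool:
    return (start_sector * SECTOR_BYTES) % block_bytes == 0

# Classic examples: old MBR layouts start partitions at sector 63, while
# modern tools start them at sector 2048 (a 1MiB boundary).
for start in (63, 2048):
    status = "aligned" if is_aligned(start) else "MISALIGNED"
    print(f"start sector {start}: {status} to 4KB")
```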


Monday
Dec 28, 2009

VMware boot storm on NetApp - Part 2

I have received a few questions relating to my previous post about NetApp VMware boot storm results and want to answer them here. I have also had a chance to look through the performance data gathered during the tests and have a few interesting data points to share. I also wanted to mention that I now have a pair of second-generation Performance Accelerator Modules (PAM II) in hand and will be publishing updated VMware boot storm results with the larger-capacity cards.

What type of disk were the virtual machines stored on?

  • The virtual machines were stored on a SATA RAID-DP aggregate.
What was the rate of data reduction through deduplication?
  • The VMDK files were all fully provisioned at the time of creation. Each operating system type was placed on a different NFS datastore. This resulted in 50 virtual machines on each of 4 shares. The deduplication reduced the physical footprint of the data by 97%.
Here are a few interesting stats gathered during the testing. These numbers are not exact, due to the somewhat imprecise nature of starting and stopping statit in synchronization with the start and end of each test.
  • The CPU utilization moved inversely with the boot time: the shorter the boot time, the higher the CPU utilization. This is not surprising, as during the faster boots the CPUs were not waiting around for disk drives to respond. More data was served from cache, so the CPUs could stay busier.
  • The total number of NFS operations required for each test was 2.8 million.
  • The total GB read by the VMware physical servers from the NetApp was roughly 49GB.
  • The total GB read from disk trended down between cold and warm cache boots. This is what I expected, and I would be somewhat concerned if it were not true.
  • The total GB read from disk trended down with the addition of each PAM. Again, I would be somewhat concerned if this were not the case.
  • The total GB read from disk took a significant drop when the data was deduplicated. This helps to prove out the theory that NetApp is no longer going to disk for every read of a different logical block that points to the same physical block.
How much disk load was eliminated by the combination of dedup and PAM?
  • The cold boots with no dedup and no PAM read about 67GB of data from disk. The cold boot with dedup and no PAM dropped that down to around 16GB. Adding 2 PAM (or 32GB of extended dedup-aware cache) dropped the amount of data read from disk to less than 4GB.
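To put those figures in perspective, here is a quick back-of-the-envelope calculation of my own, using only the GB numbers quoted above:

```python
# Reduction in cold-boot disk reads, using the figures quoted above
# (my arithmetic, not additional measurements).
baseline_gb = 67          # cold boot, no dedup, no PAM
dedup_only_gb = 16        # cold boot, dedup, no PAM
dedup_plus_pam_gb = 4     # "less than 4GB", treated as an upper bound

for label, gb in [("dedup only", dedup_only_gb),
                  ("dedup + 2 PAM", dedup_plus_pam_gb)]:
    reduction = 100 * (baseline_gb - gb) / baseline_gb
    print(f"{label}: ~{gb}GB from disk, roughly {reduction:.0f}% less than baseline")
# -> roughly 76% less with dedup alone, and at least 94% less with dedup + 2 PAM
```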


Sunday
Nov 1, 2009

VMware boot storm on NetApp

UPDATE: I have posted an update to this article here: More boot storm details

Measuring the benefit of cache deduplication with a real-world workload can be very difficult unless you try it in production. I have written about the theory in the past, and I did a lab test here with highly duplicate synthetic data. The results were revealing about how the NetApp deduplication technology impacts both read cache and disk. Based on our findings, we decided to run another test. This time the plan was to test NetApp deduplication with a VMware guest boot storm. We also added the NetApp Performance Accelerator Module (PAM) to the testing.

The test infrastructure consists of 4 dual-socket Intel Nehalem servers with 48GB of RAM each. Each server is connected to a 10GbE switch, and a FAS3170 is connected to the same 10GbE switch. There are 200 virtual machines: 50 Microsoft Windows 2003, 50 Microsoft Vista, 50 Microsoft Windows 2008, and 50 Linux. Each operating system type is installed in a separate NetApp FlexVol for a total of 4 volumes. This was not done to maximize the deduplication results; instead, we did it to allow the VMware systems to use 4 different NFS datastores. Each physical server mounts all 4 NFS datastores, and the guests were split evenly across the 4 physical servers.

The test consisted of booting all 200 guests simultaneously. This test was run multiple times with the FAS3170 cache warm and cold, with deduplication and without, and with PAM and without. Here is a table summarizing the boot timing results, measured as the amount of time between starting the boot and the 200th system acquiring an IP address:

                    Cold Cache (MM:SS)   Warm Cache (MM:SS)   % Improvement
Pre-Deduplication
  0 PAM                  15:09                13:42                9.6%
  1 PAM                  14:29                12:34               13.2%
  2 PAM                  14:05                 8:43               38.1%
Post-Deduplication
  0 PAM                   8:37                 7:58                7.5%
  1 PAM                   7:19                 5:12               29.0%
  2 PAM                   7:02                 4:27               37.0%
Let's take a look at the Pre-Deduplication results first. The warm 0 PAM boot performance improved by roughly 9.6% over the cold cache test. I suspect the small improvement is because the cache has been blown out by the time the cold cache boot completes. This is the behavior I would expect when the working set is substantially larger than the cache size. The 1 PAM warm boot results are 13.2% faster than the cold boot, suggesting that the working set is still larger than the cache footprint. With 2 PAM cards, the warm boot is 38.1% faster than the cold boot; it appears that a significant portion of the working set now fits into cache, enabling a significantly faster warm cache boot.

The Post-Deduplication results show a significant improvement in cold boot time over the Pre-Deduplication results. This is no surprise: once the data is deduplicated, the NetApp will fulfill a read request for any duplicate data block already in the cache from the copy in DRAM and save a disk access. (This article contains a full explanation of how the cache copy mechanism works.) As I have written previously, reducing the physical footprint of data is only one benefit of a good deduplication implementation. Clearly, it can provide a significant performance improvement as well.

As one would expect, the Post-Deduplication warm boots also show a significant performance improvement over the cold boots. The deduplicated working set appears to be larger than the 16GB PAM card, as adding a second 16GB card further improved the warm boot performance. It is certainly possible that additional PAM capacity would further improve the results.

It is worth noting that NetApp has released a larger 512GB PAM II card since we started doing this testing. The PAM I used in these tests is a 16GB DRAM-based card, and the PAM II is a 512GB flash-based card. In theory, a DRAM-based card should have lower latency for access. Since the cards are not directly accessed by a host protocol, it is not clear if the difference will be measurable at the host. Even if the card is theoretically slower, I can only assume the 32x size increase will more than make up for that with an improved hit rate.

Thanks to Rick Ross and Joe Gries in the Corporate Technologies Infrastructure Services Group who did all the hard work in the lab to put these results together.
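For anyone who wants to reproduce the percentages, here is a quick Python sketch of my own (not part of the original write-up) that converts the MM:SS boot times and computes the warm-versus-cold improvement:

```python
# Convert the MM:SS boot times from the table and compute the improvement,
# matching the percentages shown above.
def to_seconds(mmss: str) -> int:
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)

def improvement(cold: str, warm: str) -> float:
    cold_s, warm_s = to_seconds(cold), to_seconds(warm)
    return 100 * (cold_s - warm_s) / cold_s

# Example: the pre-deduplication 2 PAM row.
print(f"{improvement('14:05', '8:43'):.1f}%")   # -> 38.1%
```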


Thursday
Jun 11, 2009

Deduplication - Sometimes it's about performance

In a previous post I discussed the topic of deduplication for capacity optimization. Removing redundant data blocks on disk is the first, and most obvious, phase of deduplication in the marketplace. It helps to drive down the most obvious cost: the cost per GB of disk capacity. This market has grown quickly over the last few years. Both startups and established storage vendors have products that compete in the space, most commonly marketed as virtual tape library (VTL) or disk-to-disk backup solutions. Does that mean that deduplication is a point solution for highly sequential workloads? No. There is another, somewhat less obvious, benefit of deduplication.

What storage administrator does not ask for more cache in the storage array? If I can afford 8GB, I want 16GB. If the system supports 16GB, I want 32GB. Whether it is for financial or technical reasons, cache is always limited. What about deduplicating the data in cache? When the workload is streaming sequential backup data from disk, this may not be very helpful. However, in a primary storage system with a more varied workload, this becomes very interesting. The cost per GB of cache (DRAM) is several orders of magnitude higher than the cost of hard drives. If the goal is to reduce storage capital expenses by making the storage array more efficient, then let's focus on the most expensive component: the cache. If the physical data footprint on disk is reduced, then it is logical that the array cache should also benefit from that space savings. If the same deduplicated physical block is accessed multiple times through different logical blocks, then it should result in one read from disk and many reads from cache.

Storing VMware, Hyper-V, or Xen virtual machine images creates a tremendous amount of duplicate data. It is not uncommon to see storage arrays that are storing tens or even hundreds of these virtual images. If 10 or 20 or all of those virtual servers need to boot at the same time, it places an extreme workload on the storage array. If all of these requests have to hit disk, the disk drives will be overwhelmed. If each duplicate logical block requires its own space in cache, then the cache will be blown out. If each duplicate block is read off disk once and stored in cache once, then the clients will boot quickly and the cache will be maintained. Preserving the current contents of the cache ensures the performance of the rest of the applications in the environment is not impacted.

This virtual machine example does make a couple of assumptions. The most significant is that the storage controllers can keep up with the workload. Deduplication on disk allows for fewer disk drives. Deduplication in cache expands the logical cache size and makes the disk drives more efficient. Neither of these does anything to reduce the performance demand on the CPU, fibre channel, network, or bus infrastructure on the controllers. In fact, they likely place more demand on these controller resources. If the controller is spending less time waiting for disk drives, it needs the horsepower to deliver higher IOPS from cache with less latency.

This same deduplication theory applies to flash technology as well. Whether the flash devices are being used to expand cache or provide another storage tier, they should be deduplicated. Flash devices are more expensive than disk drives, so let's get the capacity utilization up as high as possible.
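To make that mechanism concrete, here is a toy Python sketch of a dedup-aware read cache. It is an illustration of the idea only, not any vendor's implementation, and every name in it is made up:

```python
# Toy model of a dedup-aware read cache: many logical blocks map to one
# fingerprinted physical block, so the first read warms the cache for all of
# its logical twins. Purely illustrative.
class DedupReadCache:
    def __init__(self, read_from_disk):
        self.read_from_disk = read_from_disk   # callable: fingerprint -> data
        self.cache = {}                        # fingerprint -> data
        self.disk_reads = 0
        self.cache_hits = 0

    def read(self, logical_block, block_map):
        # block_map: logical block address -> fingerprint of its physical block
        fingerprint = block_map[logical_block]
        if fingerprint in self.cache:
            self.cache_hits += 1
        else:
            self.cache[fingerprint] = self.read_from_disk(fingerprint)
            self.disk_reads += 1
        return self.cache[fingerprint]

# 200 guests booting the same OS image: different logical blocks, one shared
# physical block behind them all.
block_map = {("vm%d" % i, 0): "fp-boot-block" for i in range(200)}
cache = DedupReadCache(read_from_disk=lambda fp: b"data-for-" + fp.encode())
for lba in block_map:
    cache.read(lba, block_map)
print(cache.disk_reads, cache.cache_hits)   # -> 1 199
```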
As the mainstream storage vendors start to bring deduplication to market for primary storage, it will be interesting to see how they deal with these challenges. Deduplication for primary storage presents an entirely different set of requirements than backup to disk.
