VMworld 2013

I am writing this as I fly back to Boston from VMworld in San Francisco, and let me start by saying that writing a cheerleader post about a trade show is not my style. With that background, I would like to say… "VMworld 2013 was incredible." The SimpliVity team on the ground in San Francisco was fantastic. There were early mornings, late nights, and I think I saw a few Advil consumed along the way to help with the inevitable foot pain that comes from working the show floor. Thanks to the SimpliVity marketing team for the extra carpet padding this year and for the extra 100+ hp in this year's Audi R8.



vSphere Research Project - Security Update

Several readers (ok, more than several) pointed out some security concerns about releasing esxtop data with hostnames and VM names in it. Click through for a simple Perl script that will strip all the names from your data. They are replaced with server1 ... serverN. It is not pretty, but it gets the job done. Run the script with the name of the CSV file from esxtop. This was a giant oversight on my part, and I should have put this up with the original request.
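The script itself is behind the link, but the idea is simple enough to sketch. Here is a minimal Python equivalent (my own illustration, not the Perl script from the post) that rewrites PerfMon-style `\\host\object\counter` paths; it assumes machine names appear in that form and does not catch VM names embedded inside instance groups:

```python
import re

def anonymize_esxtop(csv_text):
    """Replace each distinct machine name found in PerfMon-style
    counter paths (\\host\object\counter) with server1..serverN.
    Note: VM names inside instance groups are NOT handled here."""
    names = {}

    def repl(match):
        name = match.group(1)
        if name not in names:
            names[name] = "server%d" % (len(names) + 1)
        return "\\\\" + names[name] + "\\"

    # Match '\\<name>\' where <name> contains no backslash, comma, or quote.
    return re.sub(r"\\\\([^\\,\"]+)\\", repl, csv_text)
```

The same host always maps to the same serverN, so counters from one machine stay correlated after scrubbing.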

This post refers to a paid vSphere research survey here



vSphere Research Project

I did not realize how long it has been since I last posted. Time has been flying by. I took a position with a startup about 6 months ago. My new company is still in stealth mode, so there is not much I can say about the details yet. We are solving some hard IT problems and I think the end result is going to be very exciting.

One of the projects I am working on requires me to understand what "typical" VMware environments look like. After lots of digging, it turns out there is not much published information about the systems running in these environments. So, I am looking to all of you to help me gather data. I have posted a survey online and anyone who submits esxtop output from a vSphere 4.1 (or later) system will get a $10 Starbucks card.

Click here to take the survey and feel free to forward it along.

I am looking forward to coming out of stealth and getting to share all of the exciting things we have been working on. Watch here or follow me @JesseStLaurent for updates as we get closer to launch.


Commercial Storage at "cloud scale"

The major storage manufacturers are all chasing the cloud storage market. The private cloud storage market makes a lot of sense to me. Clients adopting private cloud methodologies have additional, often more advanced, storage requirements. This will frequently require a redesign of the storage architecture and may dictate changing storage platforms to meet the new requirements. The public cloud storage market outlook is much less clear to me.

If public cloud services are as successful as the analysts, media, and vendors are suggesting they will be, then cloud providers will become massive storage buyers at a scale that dwarfs today's corporate consumers. Whether the public cloud storage is part of an overall architecture that includes compute and capacity or a pure storage solution, the issue is the same. This is not about 1 or 2PB. The large cloud providers could easily be orders of magnitude larger than that.

Huge storage consumers are exactly what the storage manufacturers are looking for, right? Let me suggest something that may sound counterintuitive: enormous success for cloud providers would be terrible news for today's mainstream storage manufacturers.



Jumbo Frames for NFS & iSCSI VMware Datastores

We have been working on a comparison between VMware datastores running on NFS, iSCSI, and FC. (Stay tuned. We will publish those results shortly.) Along the way we were reminded of the performance boost that jumbo frames can provide. These tests were run using the same 'boot storm' test harness on the server side we have used before (details can be found at the end of this post). The question is, "How much faster will ESX be with jumbo frames enabled?" Let's jump right to the answer...

        No Jumbo Frames (M:SS)   Jumbo Frames Enabled (M:SS)   % Improvement
NFS     5:10                     4:25                          14.5%
iSCSI   4:12                     3:48                          9.5%
In this test, enabling jumbo frames improves iSCSI performance by nearly 10% and NFS performance by almost 15%. A 10-15% performance improvement is a significant win when there is no cost required to achieve it. The simple change of enabling jumbo frames on the ESX servers, the network switch, and the storage array made the existing infrastructure faster. Much like block alignment, there is no downside.

Infrastructure details: The test infrastructure consists of 4 dual-socket Intel Nehalem-EP servers with 48GB of RAM each. Each server is connected to a 10GbE switch. A FAS3170 is connected to the same 10GbE switch with a single 10GbE link. There are 200 virtual machines: 50 Microsoft Windows 2003, 50 Microsoft Vista, 50 Microsoft Windows 2008, and 50 Linux. Each operating system type is installed in a separate NetApp FlexVol for a total of 4 volumes. The guests were separated into multiple datastores to allow the VMware ESX systems to use 4 different NFS mount points on each ESX system. Each physical server mounts all 4 NFS datastores, and the guests were split evenly across the 4 physical servers. The timing listed in the table is the time from the start of the 200 systems booting until the time the last system acquired an IP address.
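For reference, the improvement column is just the percentage reduction in boot time between the two runs. A quick sanity-check helper (a hypothetical snippet, not part of the test harness):

```python
def boot_seconds(t):
    """Parse an M:SS boot time string into seconds."""
    m, s = t.split(":")
    return int(m) * 60 + int(s)

def pct_improvement(before, after):
    """Percent reduction in boot time between two runs."""
    return 100.0 * (boot_seconds(before) - boot_seconds(after)) / boot_seconds(before)
```

For example, the NFS row works out to `pct_improvement("5:10", "4:25")`, which is 14.5%.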



Oracle/Sun F20 Flash Card - How fast is it?

I received several questions about the performance of the Oracle/Sun F20 flash card I used in my previous post about block alignment, so I put together a quick overview of the card's performance capabilities. The following results are from testing the card in a dual-socket 2.93GHz Nehalem (X5570) system running Solaris x64. This is similar to the server platform Oracle uses in the Exadata 2 platform. The F20 card is a SAS controller with 4 x 24GB flash modules attached to it. You can find more info on the flash modules on Adam Leventhal's blog and the official Oracle product page has the F20 details.

All of my tests used 100% random 4KB blocks. I focused on random operations because in most cases it is not cost effective to use SSD for sequential operations. These tests were run with a variety of thread counts to give an idea of how the card scales with multiple threads. The first test compared the performance of a single 24GB flash module to the performance of all 4 modules.
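For anyone who wants to run a similar thread-scaling experiment, here is a rough Python sketch of the methodology: aligned 4KB random reads issued from a configurable number of threads. This is my own illustration, not the tool used for these results, and against a regular file it will mostly measure the OS page cache rather than the device:

```python
import os
import random
import time
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096  # 4KB random operations, as in the tests above

def worker(path, size, n_ops):
    """Issue n_ops aligned 4KB random reads against the target."""
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(n_ops):
            # Pick a random block-aligned offset within the file.
            offset = random.randrange(size // BLOCK) * BLOCK
            data = os.pread(fd, BLOCK, offset)
            assert len(data) == BLOCK
    finally:
        os.close(fd)
    return n_ops

def run(path, threads, ops_per_thread):
    """Return (total ops completed, ops/sec) for a given thread count."""
    size = os.path.getsize(path)
    start = time.time()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        total = sum(pool.map(lambda _: worker(path, size, ops_per_thread),
                             range(threads)))
    return total, total / (time.time() - start)
```

Sweeping `threads` from 1 upward and plotting ops/sec is exactly the shape of the curves discussed below.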

[Chart: 4KB Random Operations, 1 module vs. 4 modules]

At lower thread counts, the 4 module test is roughly 4x the operations per second of the single module test. As the thread count rises, the single module test tops out at 35,411 ops and 4 modules can deliver 97,850 ops, or 2.76x the single module result. It would be great if the card were able to drive the 4 modules at full speed, but 97K+ ops is not too shabby. What is more impressive to me is that those 97K+ ops are delivered at roughly 1ms of latency.

The next round of testing included three different workloads. The three phases were 100% read, 80% read, and 100% write, and they were run against all 4 flash modules. Again, all tests used 4KB random operations. Here are the operations per second results.


And a throughput version in MB/s for anyone who is interested.


Flash and solid state disk (SSD) technologies are developing at an incredibly fast pace. They are a great answer, but I think we are still figuring out what the question is. At some point down the line, they may replace spinning disk drives, but I do not think that is going to happen in the short term. There are some applications that can leverage this low latency capacity inside the servers today, but this is not the majority of applications.

Where flash and SSD make more sense to me is as a large cache. NetApp and Sun are using flash this way today in their storage array product lines. DRAM is very expensive, but flash can provide a very large and very low latency cache. I expect we will see more vendors adopting this "flash for cache" approach moving forward. The economics just make sense. Disks are too slow and DRAM is too expensive.

It would also be great to see operating systems that were intelligent enough to use technologies like the F20 card and the Fusion-io card as extended filesystem read cache. Solaris can do it for ZFS filesystems using the L2ARC. As far as I know, no filesystems on the other major operating systems have this feature. What about using it as a client-side NFS cache? At one point, Solaris offered CacheFS for NFS caching, but I do not believe it is still being actively developed. While CacheFS had its challenges, I believe the idea was a very good one. It costs a lot more to buy a storage array capable of delivering 97K ops than it does to put more cache into the server.



Block alignment is critical

Block alignment is an important topic that is often overlooked in storage. I read a blog entry by Robin Harris a couple months back about the importance of block alignment with the new 4KB drives. I was curious to test the theory on one of the new 4KB drives, but I did not have one on hand. That got me thinking about Solid State Disk (SSD) devices. If filesystem misalignment hurts traditional spinning disk performance, how would it impact SSD performance? In short, it is ugly. Here is a chart showing the difference between aligned and misaligned random read operations to a Sun F20 card. I guess it is officially an Oracle F20 card now.

[Chart: Oracle F20, Aligned vs. Misaligned]

With only a couple threads, the flash module can deliver about 50% more random 4KB read operations when aligned. As the thread count increases, the module is able to deliver over 9x the number of operations if properly aligned. It is worth noting that the card delivers those aligned reads at less than 1ms while the misaligned operations average over 7ms of latency. 9x the operations at 85% less latency makes this an issue worth paying attention to. (My test was done on Solaris and here is an article about how to solve the block alignment issue for Solaris x64 volumes.)

I have seen a significant increase in block alignment issues with clients recently. Some arrays and some operating systems make it easier to align filesystems than others, but a new variable has crept in over the last few years. VMware on block devices means that VMFS adds another layer of abstraction to the process. Now it is important to be sure the virtual machine filesystems are aligned in addition to the root operating system/hypervisor filesystem. Server virtualization has been the catalyst for many IT organizations to centralize more of their storage. Unfortunately, centralized storage does not come at the same $/GB as the mirrored drives in the server. It is much more expensive.
Block misalignment can make that new storage even more expensive by making it less efficient. Misaligned filesystems make the array cache far less effective, and when misaligned data is read from or written to disk, the drives are forced to do additional operations that would not be required for an aligned operation. It can quickly turn a fast storage array into a very average system. Most of the storage manufacturers can provide you with a best practices doc to help you avoid these issues. Ask them for a whitepaper about block alignment issues with virtual machines.
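A quick way to see why so many systems end up misaligned: the classic DOS partition table starts the first partition at sector 63, which does not land on a 4KB boundary. A tiny illustrative check (a hypothetical helper, not from the article):

```python
SECTOR = 512  # traditional sector size in bytes

def is_aligned(start_sector, block=4096):
    """True when a partition starting at start_sector lands on a
    block-byte boundary (4KB here, matching the test I/O size)."""
    return (start_sector * SECTOR) % block == 0

# Legacy DOS-style partitioning starts the first partition at sector 63
# (byte offset 32256), so every filesystem block behind it straddles
# two 4KB device blocks. Modern tools start at sector 2048 (1MiB),
# which is aligned for 4KB and most array stripe sizes.
```

Every misaligned 4KB read behind a sector-63 partition touches two device blocks instead of one, which is exactly where the extra latency above comes from.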



VMware boot storm on NetApp - Part 2

I have received a few questions relating to my previous post about NetApp VMware boot storm results and want to answer them here. I have also had a chance to look through the performance data gathered during the tests and have a few interesting data points to share. I also wanted to mention that I now have a pair of second generation Performance Accelerator Modules (PAM II) in hand and will be publishing updated VMware boot storm results with the larger capacity cards.

What type of disk were the virtual machines stored on?

  • The virtual machines were stored on a SATA RAID-DP aggregate.
What was the rate of data reduction through deduplication?
  • The VMDK files were all fully provisioned at the time of creation. Each operating system type was placed on a different NFS datastore. This resulted in 50 virtual machines on each of 4 shares. Deduplication reduced the physical footprint of the data by 97%.
A few interesting stats gathered during the testing. These numbers are not exact, due to the somewhat imprecise nature of starting and stopping statit in synchronization with the start and end of each test.
  • The CPU utilization moved inversely with the boot time. The shorter the boot time, the higher the CPU utilization. This is not surprising: during the faster boots, the CPUs were not waiting around for disk drives to respond. More data was served from cache, so the CPUs could stay busier.
  • The total NFS operations required for each test was 2.8 million.
  • The total GB read by the VMware physical servers from the NetApp was roughly 49GB.
  • The total GB read from disk trended down between cold and warm cache boots. This is what I expected and would be somewhat concerned if it was not true.
  • The total GB read from disk trended down with the addition of each PAM. Again, I would be somewhat concerned if this was not the case.
  • The total GB read from disk took a significant drop when the data was deduplicated. This helps to prove out the theory that NetApp is no longer going to disk for every read of a different logical block that points to the same physical block.
How much disk load was eliminated by the combination of dedup and PAM?
  • The cold boots with no dedup and no PAM read about 67GB of data from disk. The cold boot with dedup and no PAM dropped that down to around 16GB. Adding 2 PAM cards (or 32GB of extended dedup-aware cache) dropped the amount of data read from disk to less than 4GB.
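The dedup-aware caching behavior described above, where many logical blocks map to one physical block and share a single cached copy, can be modeled in a few lines. This is a toy model of the concept, not NetApp's implementation:

```python
class DedupAwareCache:
    """Toy model: the cache is keyed by *physical* block, so logical
    blocks that dedup to the same physical block share one cached copy."""

    def __init__(self):
        self.cache = {}      # physical block id -> data
        self.disk_reads = 0  # how many reads actually hit disk

    def read(self, logical, block_map, disk):
        phys = block_map[logical]  # dedup: many logical -> one physical
        if phys not in self.cache:
            self.disk_reads += 1   # only the first reference hits disk
            self.cache[phys] = disk[phys]
        return self.cache[phys]
```

With 200 guests booting largely identical OS images, most logical reads resolve to physical blocks that are already cached, which is why the GB read from disk drops so sharply after deduplication.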



VMware boot storm on NetApp

UPDATE: I have posted an update to this article here: More boot storm details

Measuring the benefit of cache deduplication with a real world workload can be very difficult unless you try it in production. I have written about the theory in the past and I did a lab test here with highly duplicate synthetic data. The results were revealing about how the NetApp deduplication technology impacts both read cache and disk. Based on our findings, we decided to run another test. This time the plan was to test NetApp deduplication with a VMware guest boot storm. We also added the NetApp Performance Accelerator Module (PAM) to the testing.

The test infrastructure consists of 4 dual-socket Intel Nehalem servers with 48GB of RAM each. Each server is connected to a 10GbE switch. A FAS3170 is connected to the same 10GbE switch. There are 200 virtual machines: 50 Microsoft Windows 2003, 50 Microsoft Vista, 50 Microsoft Windows 2008, and 50 Linux. Each operating system type is installed in a separate NetApp FlexVol for a total of 4 volumes. This was not done to maximize the deduplication results. Instead, we did it to allow the VMware systems to use 4 different NFS datastores. Each physical server mounts all 4 NFS datastores and the guests were split evenly across the 4 physical servers.

The test consisted of booting all 200 guests simultaneously. This test was run multiple times with the FAS3170 cache warm and cold, with deduplication and without, and with PAM and without. Here is a table summarizing the boot timing results. The time listed is the amount of time between starting the boot and the 200th system acquiring an IP address.

Pre-Deduplication
        Cold Cache (MM:SS)   Warm Cache (MM:SS)   % Improvement
0 PAM   15:09                13:42                9.6%
1 PAM   14:29                12:34                13.2%
2 PAM   14:05                8:43                 38.1%

Post-Deduplication
        Cold Cache (MM:SS)   Warm Cache (MM:SS)   % Improvement
0 PAM   8:37                 7:58                 7.5%
1 PAM   7:19                 5:12                 29.0%
2 PAM   7:02                 4:27                 37.0%
Let's take a look at the Pre-Deduplication results first. The warm 0 PAM boot performance improved by roughly 9.6% over the cold cache test. I suspect the small improvement is because the cache has been blown out by the time the cold cache boot completes. This is the behavior I would expect when the working set is substantially larger than the cache size. The 1 PAM warm boot results are 13.2% faster than the cold boot, suggesting that the working set is still larger than the cache footprint. With 2 PAM cards, the warm boot is 38.1% faster than the cold boot. It appears that with 2 PAM cards a significant portion of the working set now fits into cache, enabling a significantly faster warm cache boot.

The Post-Deduplication results show a significant improvement in cold boot time over the Pre-Deduplication results. This is no surprise: once the data is deduplicated, the NetApp will fulfill a read request for any duplicate data block already in the cache from the copy in DRAM and save a disk access. (This article contains a full explanation of how the cache copy mechanism works.) As I have written previously, reducing the physical footprint of data is only one benefit of a good deduplication implementation. Clearly, it can provide a significant performance improvement as well. As one would expect, the Post-Deduplication warm boots also show a significant performance improvement over the cold boots. The deduplicated working set appears to be larger than the 16GB PAM card, as adding a second 16GB card further improved the warm boot performance. It is certainly possible that additional PAM capacity would further improve the results.

It is worth noting that NetApp has released a larger 512GB PAM II card since we started doing this testing. The PAM I used in these tests is a 16GB DRAM-based card and the PAM II is a 512GB flash-based card. In theory, a DRAM-based card should have lower latency for access. Since the cards are not directly accessed by a host protocol, it is not clear if the difference will be measurable at the host. Even if the new card is theoretically slower, I can only assume the 32x size increase will more than make up for that with an improved hit rate.

Thanks to Rick Ross and Joe Gries in the Corporate Technologies Infrastructure Services Group who did all the hard work in the lab to put these results together.



ZFS Capacity Usage - Optimizing Compression and Record Size Settings

I have migrated some data to ZFS filesystems recently and the capacity consumed has surprised me a couple times. In general, it has appeared that the data uses more capacity when stored on the ZFS filesystem. This prompted me to do a little investigating. Is ZFS using more capacity? Is it simply a reporting anomaly? Where is that space going? Does ZFS record size have a major impact? Does enabling compression have a significant impact?

In part, the extra space use is a result of ZFS reporting space utilization differently than other filesystems. When a ZFS filesystem is formatted, almost no capacity is used. A df command will show nearly the entire raw capacity. Many other filesystems take a portion of the raw capacity off the top and reserve it for metadata. This reserve will not show up in df. As data is added to the ZFS filesystem, blocks are allocated for both data and metadata, and both will show up as used capacity. In many other filesystems, at least some of the metadata blocks will be taken from the reserve and only the data blocks will show as consumed capacity. For example, in Solaris, the du command will return the capacity used by the data blocks in a file. In ZFS, that du command returns the total space consumed by the file, including metadata and compression.

So the question at hand is: when storing a given set of files, does ZFS use more total space than other file systems? That one is difficult to test, given all the variables. But we can test various ZFS configuration options to determine the best settings for minimizing block use. All of our testing was done on RAID-Z2. In a RAID-Z2 filesystem, each data block will require at least two 512-byte sectors of parity information. With a larger record size, this is not noticeable, but with a small record size it can really add up. Imagine the impact if the filesystem is using a 1KB record size. The parity data could double the capacity consumed!
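The RAID-Z2 parity overhead described above is easy to quantify (a small illustrative helper, not from the original tests):

```python
SECTOR = 512
PARITY_SECTORS = 2  # RAID-Z2 writes at least two parity sectors per block

def parity_overhead(recordsize):
    """Minimum parity bytes as a fraction of data bytes for one full block."""
    return (PARITY_SECTORS * SECTOR) / float(recordsize)

# 1KB records:   1024 / 1024   -> 100% overhead (parity doubles the space)
# 128KB records: 1024 / 131072 -> well under 1%
```

This is the floor, not the full picture: metadata blocks carry their own parity too, but even this minimum shows why tiny record sizes are so expensive on RAID-Z2.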
So, is the solution to use the largest possible block size? Unfortunately, it is never that simple. The last block of any file will be, on average, 50% utilized. With a 128KB block size, each file is going to have an average of 64KB of wasted space. Enabling compression will zero out the unused portion of the block, and that portion of the block will compress extremely well.

To test this out, I created filesystems with block sizes from 1KB to 128KB and using lzjb, gzip-2, gzip-6, and gzip-9 compression. Then I copied a data set to each of these filesystems. The test data set consisted of 179,559 PDF files totaling approximately 111GB uncompressed. Nearly all of the files are larger than the largest 128KB block size. The results would be very different if the data set consisted of thousands of very small files. The intent of this test is to simulate the file sizes that might exist in a home directory environment. The goal is to examine the "wasted" capacity, not the impact of compression on the overall data set. So with all of that said, let's take a look at the data:

[Chart: ZFS Block Size & Compression Comparison]

The additional capacity consumed by the metadata is most obvious in the 1KB block size results. The PDF files are not very compressible, so a large portion of the data reduction between no compression and lzjb is likely due to saving an average of 50% of the last block of each file. For 128KB blocks, there will be 179,559 files that waste on average 64KB of space each. If it averages exactly 50% (unlikely) and that capacity compressed down to take no space (not quite true), it would save roughly 10.95GB of capacity. Interestingly, that is in the region of what is saved between 128KB with no compression and lzjb.

UPDATE: I analyzed the file sizes for this specific data set and there is ~12.3GB of space wasted in the last blocks of the 128KB no compression test. That means we are not averaging 50% utilization in the last block, but it is reasonably close.

These results would vary dramatically if the data was highly compressible or if there were many small files. Also, performance was completely ignored in these tests. The better the compression rate on the chart above, the more CPU was required. The goal here was to talk a bit about ZFS and the effect of compression on space utilization. Watch for a detailed discussion on selecting the correct block size for your application in a future post. You can find more information about ZFS in the ZFS FAQ.

Here are a couple more charts that show the makeup of the data set that was used for these tests.

[Chart: ZFS - PDF File Distribution by Capacity]

[Chart: ZFS - PDF File Distribution by Count]
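The "last block half empty" arithmetic above can be sanity-checked in a couple of lines (a hypothetical helper, not from the original tests):

```python
def expected_slack_gib(n_files, recordsize):
    """Expected wasted space across n_files if the last block of each
    file averages 50% utilization, in GiB."""
    return n_files * (recordsize / 2.0) / 2**30

# 179,559 files at a 128KB recordsize -> ~10.96 GiB of expected slack,
# close to the ~12.3GB actually measured for this data set.
waste = expected_slack_gib(179559, 128 * 1024)
```

The measured 12.3GB being a bit above the 50% estimate just means the last blocks of these PDFs average slightly less than half full.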
