Entries in Data Reduction (5)

Monday
Dec 28, 2009

VMware boot storm on NetApp - Part 2

I have received a few questions relating to my previous post about NetApp VMware boot storm results and want to answer them here. I have also had a chance to look through the performance data gathered during the tests and have a few interesting data points to share. I also wanted to mention that I now have a pair of second generation Performance Accelerator Modules (PAM 2) in hand and will be publishing updated VMware boot storm results with the larger capacity cards.

What type of disk were the virtual machines stored on?

  • The virtual machines were stored on a SATA RAID-DP aggregate.
What was the rate of data reduction through deduplication?
  • The VMDK files were all fully provisioned at the time of creation. Each operating system type was placed on a different NFS datastore. This resulted in 50 virtual machines on each of 4 shares. The deduplication reduced the physical footprint of the data by 97%.
Here are a few interesting stats gathered during the testing. These numbers are not exact, due to the somewhat imprecise nature of starting and stopping statit in synchronization with the start and end of each test.
  • The CPU utilization moved inversely with the boot time. The shorter the boot time, the higher the CPU utilization. This is not surprising: during the faster boots, the CPUs were not waiting around for disk drives to respond. More data was served from cache, so the CPUs could stay more heavily utilized.
  • The total NFS operations required for each test was 2.8 million.
  • The total GB read by the VMware physical servers from the NetApp was roughly 49GB.
  • The total GB read from disk trended down between cold and warm cache boots. This is what I expected, and I would be somewhat concerned if it were not true.
  • The total GB read from disk trended down with the addition of each PAM. Again, I would be somewhat concerned if this was not the case.
  • The total GB read from disk took a significant drop when the data was deduplicated. This helps to prove out the theory that NetApp is no longer going to disk for every read of a different logical block that points to the same physical block.
How much disk load was eliminated by the combination of dedup and PAM?
  • The cold boots with no dedup and no PAM read about 67GB of data from disk. The cold boot with dedup and no PAM dropped that down to around 16GB. Adding 2 PAM (or 32GB of extended dedup-aware cache) dropped the amount of data read from disk to less than 4GB. (See the sketch below for a quick summary of these reductions.)
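As a rough sanity check, here is a minimal Python sketch of the percentage reductions implied by the numbers above. The 4GB figure for the dedup-plus-PAM case is an assumption, since that result was only reported as less than 4GB:

# Approximate GB read from disk during the cold boot tests (from the post).
baseline_gb = 67      # no dedup, no PAM
dedup_gb = 16         # dedup, no PAM
dedup_pam_gb = 4      # dedup + 2 PAM; reported only as "less than 4GB",
                      # so this is an upper-bound assumption

def reduction(before, after):
    """Percentage reduction in data read from disk."""
    return 100.0 * (before - after) / before

print(f"dedup alone:   {reduction(baseline_gb, dedup_gb):.0f}% fewer GB read from disk")
print(f"dedup + 2 PAM: {reduction(baseline_gb, dedup_pam_gb):.0f}% fewer GB read from disk")
# Prints roughly 76% for dedup alone and at least 94% for dedup plus 2 PAM.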


Friday
Oct 2, 2009

ZFS Capacity Usage - Optimizing Compression and Record Size Settings

I have migrated some data to ZFS filesystems recently, and the capacity consumed has surprised me a couple of times. In general, it has appeared that the data uses more capacity when stored on the ZFS filesystem. This prompted me to do a little investigating. Is ZFS using more capacity? Is it simply a reporting anomaly? Where is that space going? Does ZFS record size have a major impact? Does enabling compression have a significant impact?

In part, the extra space use is a result of ZFS reporting space utilization differently than other filesystems. When a ZFS filesystem is formatted, almost no capacity is used; a df command will show nearly the entire raw capacity. Many other filesystems take a portion of the raw capacity off the top and reserve it for metadata, and this reserve will not show up in df. As data is added to the ZFS filesystem, blocks are allocated for both data and metadata, and both will show up as used capacity. In many other filesystems, at least some of the metadata blocks will be taken from the reserve and only the data blocks will show as consumed capacity. For example, in Solaris, the du command will return the capacity used by the data blocks in a file. In ZFS, du returns the total space consumed by the file, including metadata and after compression.

So the question at hand is: when storing a given set of files, does ZFS use more total space than other filesystems? That one is difficult to test, given all the variables. But we can test various ZFS configuration options to determine the best settings for minimizing block use. All of our testing was done on RAID-Z2. In a RAID-Z2 filesystem, each data block will require at least 2 512-byte sectors of parity information. With a larger record size, this is not noticeable, but with a small record size it can really add up. Imagine the impact if the filesystem is using a 1KB record size: the parity data could double the capacity consumed! So, is the solution to use the largest possible block size? Unfortunately, it is never that simple. The last block of any file will be, on average, 50% utilized. With a 128KB block size, each file is going to have an average of 64KB of wasted space. Enabling compression will zero out the unused portion of the block, and that portion of the block will compress extremely well.

To test this out, I created filesystems with block sizes from 1KB to 128KB, with no compression and with lzjb, gzip-2, gzip-6, and gzip-9 compression. Then I copied a data set to each of these filesystems. The test data set consisted of 179,559 PDF files totaling approximately 111GB uncompressed. Nearly all of the files are larger than the largest 128KB block size. The results would be very different if the data set consisted of thousands of very small files. The intent of this test is to simulate the file sizes that might exist in a home directory environment. The goal is to examine the "wasted" capacity, not the impact of compression on the overall data set.

So with all of that said, let's take a look at the data:

[Chart: ZFS Block Size & Compression Comparison]

The additional capacity consumed by the metadata is most obvious in the 1KB block size results. The PDF files are not very compressible, so a large portion of the data reduction between no compression and lzjb is likely due to saving an average of 50% of the last block of each file. For 128KB blocks, there will be 179,559 files that waste on average 64KB of space each.
If it averages exactly 50% (unlikely), and that capacity compressed down to take no space (not quite true), it would save roughly 10.95GB of capacity. Interestingly, that is in the region of what is saved between 128KB with no compression and lzjb. (A quick sketch of this arithmetic appears after the charts below.) UPDATE: I analyzed the file sizes for this specific data set and there is ~12.3GB of space wasted in the last blocks of the 128KB no compression test. That means we are not averaging 50% utilization in the last block, but it is reasonably close.

These results would vary dramatically if the data was highly compressible or if there were many small files. Also, performance was completely ignored in these tests: the better the compression rate on the chart above, the more CPU was required. The goal here was to talk a bit about ZFS and the effect of compression on space utilization. Watch for a detailed discussion on selecting the correct block size for your application in a future post. You can find more information about ZFS in the ZFS FAQ.

Here are a couple more charts that show the makeup of the data set that was used for these tests.

[Chart: ZFS - PDF File Distribution by Capacity]

[Chart: ZFS - PDF File Distribution by Count]
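For reference, here is a minimal Python sketch of the last-block arithmetic discussed above. It assumes every file is larger than the record size, so each file wastes exactly one partially filled final record, utilized 50% on average:

# Rough model of the "wasted" last-block capacity for the 128KB case.
file_count = 179559          # PDF files in the test data set
record_size_kb = 128         # largest ZFS recordsize tested

avg_waste_kb = record_size_kb / 2                       # ~64KB per file on average
total_waste_gb = file_count * avg_waste_kb / (1024 * 1024)

print(f"Estimated last-block waste: {total_waste_gb:.2f} GB")
# Prints roughly 10.96 GB, close to the ~10.95GB estimate above and in the
# same neighborhood as the ~12.3GB measured for this specific data set.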


Monday
Jul 20, 2009

Deduplication - The NetApp Approach

After writing a couple of articles (here and here) about deduplication and how I think it should be implemented, I figured I would try it on a NetApp system I have in the lab. The goal of the testing here is to compare storage performance of a data set before and after deduplication. Sometimes capacity is the only factor, but sometimes performance matters.

The test is random 4KB reads against a 100GB file. The 100GB file represents significantly more data than the test system can fit into its 16GB read cache. I am using 4KB because that is the natural block size for NetApp. To maximize the observability of the results in this deduplication test, the 100GB file is completely full of duplicate data. For those who are interested, the data was created by doing a dd from /dev/zero. It does not get any more redundant than that. I am not suggesting this is representative of a real world deduplication scenario. It is simply the easiest way to observe the effect deduplication has on other aspects of the system.

This is the output from sysstat -x during the first test. The data is being transferred over NFS, and the client system has caching disabled, so all reads are going to the storage device. (The command output below is truncated to the right, but the important data is all there.)

Random 4KB reads from a 100GB file – pre-deduplication:

 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
 19%  6572     0     0    6579  1423 27901  23104     11     0     0     7   16%   0%  -  100%      0     7     0     0     0     0
 19%  6542     0     0    6549  1367 27812  23265    726     0     0     7   17%   5%  T  100%      0     7     0     0     0     0
 19%  6550     0     0    6559  1305 27839  23146     11     0     0     7   15%   0%  -  100%      0     9     0     0     0     0
 19%  6569     0     0    6576  1362 27856  23247    442     0     0     7   16%   4%  T  100%      0     7     0     0     0     0
 19%  6484     0     0    6491  1357 27527  22870      6     0     0     7   16%   0%  -  100%      0     7     0     0     0     0
 19%  6500     0     0    6509  1300 27635  23102    442     0     0     7   17%   9%  T  100%      0     9     0     0     0     0
The system is delivering an average of 6536 NFS operations per second. The cache hit rate hovers around 16-17%. As you can see, the working set does not fit in primary cache. This makes sense: the 3170 has 16GB of primary cache and we are randomly reading from a 100GB file, so ideally we would expect a 16% cache hit rate (16GB cache / 100GB working set), and we are very close. The disks are running at 100% utilization and are clearly the bottleneck in this scenario. The spindles are delivering as many operations as they are capable of.

So what happens if we deduplicate this data? First, we need to activate deduplication, a_sis in NetApp vocabulary, on the test volume and deduplicate the test data. (Before deduplication became the official buzzword, NetApp referred to their technology as Advanced Single Instance Storage.)
fas3170-a> sis on /vol/test_vol
SIS for "/vol/test_vol" is enabled.
Already existing data could be processed by running "sis start -s /vol/test_vol".
fas3170-a> sis start -s /vol/test_vol
The file system will be scanned to process existing data in /vol/test_vol.
This operation may initialize related existing metafiles.
Are you sure you want to proceed (y/n)? y
The SIS operation for "/vol/test_vol" is started.
fas3170-a> sis status
Path                           State      Status     Progress
/vol/test_vol                  Enabled    Initializing Initializing for 00:00:04
fas3170-a> df -s
Filesystem                used      saved       %saved
/vol/test_vol/         2277560  279778352          99%
fas3170-a>
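As a quick cross-check, the %saved column can be reproduced from the used and saved values shown above (a minimal sketch; the ratio is unit-independent, so it does not matter what units df -s reports):

# Recompute the %saved figure from the df -s output above.
used = 2277560
saved = 279778352

pct_saved = 100.0 * saved / (used + saved)
print(f"%saved = {pct_saved:.1f}%")   # roughly 99.2%, which df -s rounds to 99%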
There are a few other files on the test volume that contain random data, but the physical volume size has been reduced by over 99%. This means our 100GB file is now less than 1GB in size on disk. So, let's do some reads from the same file and see what has changed.

Random 4KB reads from a 100GB file – post-deduplication:

 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
 93% 96766     0     0   96773 17674 409570    466     11     0     0    35s  53%   0%  -    6%      0     7     0     0     0     0
 93% 97949     0     0   97958 17821 413990    578    764     0     0    35s  53%   8%  T    7%      0     9     0     0     0     0
 93% 99199     0     0   99206 18071 419544    280      6     0     0    34s  53%   0%  -    4%      0     7     0     0     0     0
 93% 98587     0     0   98594 17941 416948    565    445     0     0    36s  53%   6%  T    6%      0     7     0     0     0     0
 93% 98063     0     0   98072 17924 414712    398     11     0     0    35s  53%   0%  -    5%      0     9     0     0     0     0
 93% 96568     0     0   96575 17590 408539    755    502     0     0    35s  53%   8%  T    7%      0     7     0     0     0     0
There has been a noticeable increase in NFS operations. The system has gone from delivering 6536 NFS ops to delivering 96,850 NFS ops. That is nearly a fifteen-fold increase in delivered operations. The CPU utilization has gone up roughly 4.9x. The disk reads have dropped to almost 0 and the system is serving out over 400MB/s. This is a clear indication that the operations are being serviced from cache instead of from disk. It is also worth noting that the average latency, as measured from the host, has dropped by over 80%. The improvement in latency is not surprising given that the requests are no longer being serviced from disk. (A quick sketch of this arithmetic appears at the end of this post.)

The cache age has dropped down to 35 seconds. Cache age is the average age of the blocks that are being evicted from cache to make space for new blocks. The test had been running for over an hour when this data was captured, so this is not due to the load ramping. This suggests that even though we are accessing a small number of disk blocks, the system is evicting blocks from cache. I suspect this is because the system is not truly deduplicating cache. Instead, it appears that each logical file block is taking up space in cache even though they refer to the same physical disk block. One potential explanation for this is that NetApp is eliminating the disk read by reading the duplicate block from cache instead of disk. I am not sure how to validate this through the available system stats, but I believe it explains the behavior. It explains why the NFS ops have gone up, the disk ops have gone down, and the cache age has gone down to 35 seconds. While it would be preferable to store only a single copy of the logical block in cache, this is better than reading all of the blocks from disk.

The cache hit percentage is a bit of a puzzle here. It is stable at 53%, and I am not sure how to explain that. The system is delivering more than 53% of the read operations from cache; the very small number of disk reads shows that. Maybe someone from NetApp will chime in and give us some details on how that number is derived.

This testing was done on Data ONTAP 7.3.1 (or more specifically 7.3.1.1L1P1). I tried to replicate the results on versions of Data ONTAP prior to 7.3.1 without success. In older versions, the performance of the deduplicated volume is very similar to the original volume. It appears that reads for logically different blocks that point to the same physical block go to disk prior to 7.3.1.

Check back shortly as I am currently working on a deduplication performance test for VMware guests. It is a simple test to show the storage performance impact of booting many guests simultaneously. The plan is to use a handful of fast servers to boot a couple hundred guests. The boot time will be compared across a volume that has not been deduplicated and one that has. I am also working on one additional test that may provide a bit of performance acceleration.

These results are from a NetApp FAS 3170. Please be careful trying to map these results to your filer, as I am using very old disk drives and that skews the numbers a bit. The slow drives make the performance of the full, non-deduplicated data set slower than it would be with new disk drives.
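For reference, here is a minimal Python sketch of the ratios quoted above, using the average figures from the two test runs (approximate, since only six sysstat samples per run are reproduced here):

# Approximate before/after figures taken from the sysstat output above.
nfs_ops_before = 6536        # average NFS ops/s, pre-deduplication
nfs_ops_after = 96850        # average NFS ops/s, post-deduplication
cpu_before = 19              # % CPU, pre-deduplication
cpu_after = 93               # % CPU, post-deduplication

print(f"NFS ops increase: {nfs_ops_after / nfs_ops_before:.1f}x")   # ~14.8x
print(f"CPU increase:     {cpu_after / cpu_before:.1f}x")           # ~4.9x

# Post-dedup network-out sits around 410,000 kB/s (over 400MB/s) while disk
# reads are down in the hundreds of kB/s, so essentially all reads are now
# being served from cache rather than from the spindles.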


Thursday
Jun 11, 2009

Deduplication - Sometimes it's about performance

In a previous post I discussed the topic of deduplication for capacity optimization. Removing redundant data blocks on disk is the first, and most obvious, phase of deduplication in the marketplace. It helps to drive down the most obvious cost: the cost per GB of disk capacity. This market has grown quickly over the last few years. Both startups and established storage vendors have products that compete in the space. They are most commonly marketed as virtual tape library (VTL) or disk-to-disk backup solutions. Does that mean that deduplication is a point solution for highly sequential workloads? No. There is another, somewhat less obvious, benefit of deduplication.

What storage administrator does not ask for more cache in the storage array? If I can afford 8GB, I want 16GB. If the system supports 16GB, I want 32GB. Whether it is for financial or technical reasons, cache is always limited. What about deduplicating the data in cache? When the workload is streaming sequential backup data from disk, this may not be very helpful. However, in a primary storage system with a more varied workload, this becomes very interesting. The cost per GB of cache (DRAM) is several orders of magnitude higher than the cost of hard drives. If the goal is to reduce storage capital expenses by making the storage array more efficient, then let's focus on the most expensive component: the cache. If the physical data footprint on disk is reduced, then it is logical that the array cache should also benefit from that space savings. If the same deduplicated physical block is accessed multiple times through different logical blocks, then it should result in one read from disk and many reads from cache. (A simple sketch of this idea appears below.)

Storing VMware, Hyper-V, or Xen virtual machine images creates a tremendous amount of duplicate data. It is not uncommon to see storage arrays that are storing tens or even hundreds of these virtual images. If 10 or 20 or all of those virtual servers need to boot at the same time, it places an extreme workload on the storage array. If all of these requests have to hit disk, the disk drives will be overwhelmed. If each duplicate logical block requires its own space in cache, then the cache will be blown out. If each duplicate block is read off disk once and stored in cache once, then the clients will boot quickly and the cache will be maintained. Preserving the current contents of the cache will ensure the performance of the rest of the applications in the environment is not impacted.

This virtual machine example does make a couple of assumptions. The most significant is that the storage controllers can keep up with the workload. Deduplication on disk allows for fewer disk drives. Deduplication in cache expands the logical cache size and makes the disk drives more efficient. Neither of these does anything to reduce the performance demand on the CPU, fibre channel, network, or bus infrastructure on the controllers. In fact, they likely place more demand on these controller resources. If the controller is spending less time waiting for disk drives, it needs the horsepower to deliver higher IOPS from cache with less latency.

This same deduplication theory applies to flash technology as well. Whether the flash devices are being used to expand cache or provide another storage tier, they should be deduplicated. Flash devices are more expensive than disk drives, so let's get the capacity utilization up as high as possible.
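To make the idea concrete, here is a small, purely illustrative Python sketch of a dedup-aware read cache. It is not any vendor's implementation; it simply shows how keying the cache on physical blocks lets duplicate logical blocks share one cache entry and one disk read:

# Illustrative only: logical blocks map to physical blocks, and the cache is
# keyed by physical block, so duplicates share one entry and one disk read.
class DedupAwareCache:
    def __init__(self, read_from_disk):
        self.logical_to_physical = {}   # logical block id -> physical block id
        self.cache = {}                 # physical block id -> block data
        self.read_from_disk = read_from_disk
        self.disk_reads = 0

    def map_block(self, logical_id, physical_id):
        """Record that a logical block points at a (possibly shared) physical block."""
        self.logical_to_physical[logical_id] = physical_id

    def read(self, logical_id):
        physical_id = self.logical_to_physical[logical_id]
        if physical_id not in self.cache:
            self.cache[physical_id] = self.read_from_disk(physical_id)
            self.disk_reads += 1
        return self.cache[physical_id]


# Example: 100 logical blocks, all deduplicated to a single physical block.
cache = DedupAwareCache(read_from_disk=lambda pid: b"\x00" * 4096)
for logical in range(100):
    cache.map_block(logical, physical_id=0)
for logical in range(100):
    cache.read(logical)

print(cache.disk_reads)   # 1 -- one disk read serves all duplicate logical blocks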
As the mainstream storage vendors start to bring deduplication to market for primary storage, it will be interesting to see how they deal with these challenges. Deduplication for primary storage presents an entirely different set of requirements than backup to disk.


Friday
Apr 10, 2009

Deduplication - It's not just about capacity

There is no debating that deduplication is one of the hottest topics in IT. The question is whether the hype has started to become bigger than the technology. Today, there are two primary use cases driving deduplication in the marketplace. The first is backup to disk and the second is virtual guest operating systems (VMware, Hyper-V, and Xen guests). (I will talk a bit about the disk-to-disk scenario in this article and the virtual guest topic in the next one.) These are both logical markets to adopt deduplication because they suffer from a common challenge: they both create a tremendous amount of redundant data on the disk array. The goal in both cases is to pack more data onto a disk drive and reduce the cost per GB. This is the first and most obvious use case for deduplication.

Disk drive capacity is growing exponentially, but disk performance is increasing at a much slower rate. In many cases, when helping customers size for their workload, performance drives the spindle count, not capacity. It is easy to meet the capacity needs with large drives, but will they meet the performance requirement? That is the problem. It is no longer sufficient to size a storage device based solely on capacity requirements. This is a general challenge that must be taken into account when sizing a storage array. (A simple sizing sketch appears at the end of this article.)

So how is the growing disparity between size and performance affected by deduplication? Deduplication can make the performance issue worse by reducing the number of spindles even further. If the bottleneck in the storage device is the spindles, then using deduplication to pack more data onto those spindles is only going to exacerbate the situation.

Let's take a closer look at sizing storage for a backup to disk workload. Delivering on the highly sequential read and write requirements of disk-to-disk backups is much easier than serving a more random workload. Disk drives do a great job with sequential reads and writes. This makes backup to disk all about sizing for capacity. When deduplication is added into the mix, the disk drives should still meet the performance requirement as long as the deduplication technology being used does not turn sequential IO into random IO. This is why it is important to understand how a specific deduplication implementation works.

The reality is that nearly every other IT workload is more random than backup to disk. If deduplication was used to pack more data onto the same number of spindles for a highly random workload, the spindles would likely not meet the performance requirements. Does that mean deduplication is a point solution for highly sequential workloads? I do not believe so. I am working on an entry covering the potential performance benefits of deduplication in a more random environment.
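To illustrate the capacity-versus-performance sizing tension, here is a minimal Python sketch with hypothetical numbers. The drive capacity, per-spindle IOPS, and workload figures are assumptions for illustration only, not sizing guidance:

import math

# Hypothetical workload and drive characteristics -- illustration only.
required_capacity_tb = 20
required_iops = 8000

drive_capacity_tb = 1.0      # e.g. a large SATA drive
drive_iops = 80              # rough random-IOPS figure for a 7.2K RPM spindle

spindles_for_capacity = math.ceil(required_capacity_tb / drive_capacity_tb)
spindles_for_iops = math.ceil(required_iops / drive_iops)

print(f"Spindles needed for capacity:    {spindles_for_capacity}")   # 20
print(f"Spindles needed for performance: {spindles_for_iops}")       # 100

# Performance, not capacity, dictates the spindle count in this example.
# Deduplication shrinks the capacity requirement, but it does nothing to
# reduce the IOPS requirement on a performance-bound configuration.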
