Entries in Performance (7)

Friday
Mar 26, 2010

Block alignment is critical

Block alignment is an important topic that is often overlooked in storage. I read a blog entry by Robin Harris a couple of months back about the importance of block alignment with the new 4KB sector drives. I was curious to test the theory on one of the new 4KB drives, but I did not have one on hand. That got me thinking about Solid State Disk (SSD) devices. If filesystem misalignment hurts traditional spinning disk performance, how would it impact SSD performance? In short, it is ugly.

Here is a chart showing the difference between aligned and misaligned random read operations to a Sun F20 card. I guess it is officially an Oracle F20 card.

Oracle F20 - Aligned vs. Misaligned

With only a couple of threads, the flash module can deliver about 50% more random 4KB read operations when aligned. As the thread count increases, the module is able to deliver over 9x the number of operations if properly aligned. It is worth noting that the card is delivering those aligned reads at less than 1ms while the misaligned operations average over 7ms of latency. 9x the operations at 85% less latency makes this an issue worth paying attention to. (My test was done on Solaris, and here is an article about how to solve the block alignment issue for Solaris x64 volumes.)

I have seen a significant increase in block alignment issues with clients recently. Some arrays and some operating systems make it easier to align filesystems than others, but a new variable has crept in over the last few years. VMware on block devices means that VMFS adds another layer of abstraction to the process. Now it is important to be sure the virtual machine filesystems are aligned in addition to the root operating system/hypervisor filesystem.

Server virtualization has been the catalyst for many IT organizations to centralize more of their storage. Unfortunately, centralized storage does not come at the same $/GB as the mirrored drives in the server. It is much more expensive. Block misalignment can make the new storage even more expensive by making it less efficient. If the filesystems are misaligned, the array cache becomes far less efficient, and when that misaligned data is read from or written to disk, the drives are forced to do additional operations that would not be required for an aligned operation. It can quickly turn a fast storage array into a very average system. Most of the storage manufacturers can provide you with a best practices doc to help you avoid these issues. Ask them for a whitepaper about block alignment issues with virtual machines.
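To make the mechanics concrete, here is a rough Python sketch of why a misaligned partition turns one logical 4KB read into two device operations. The 63-sector starting offset is the classic legacy MBR partition layout; the numbers are illustrative assumptions on my part, not measurements from the F20.

# Sketch: a misaligned partition makes one logical 4KB read touch two
# native device blocks. Offsets below are illustrative assumptions.
DEVICE_BLOCK = 4096              # native block/page size of the device
partition_offset = 63 * 512      # classic MBR start at sector 63 (not 4KB aligned)

def device_blocks_touched(logical_offset, io_size=4096):
    # Count how many native device blocks one logical I/O spans.
    start = partition_offset + logical_offset
    end = start + io_size - 1
    return end // DEVICE_BLOCK - start // DEVICE_BLOCK + 1

print(device_blocks_touched(0))  # 2 device blocks for a single 4KB read

partition_offset = 1024 * 1024   # the same read on a 1MB-aligned partition
print(device_blocks_touched(0))  # 1 device block

Every misaligned read or write pays that extra operation and drags unrequested blocks into cache, which is where the latency and throughput gap in the chart comes from.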

Click to read more ...

Sunday
Nov 1, 2009

VMware boot storm on NetApp

UPDATE: I have posted an update to this article here: More boot storm details

Measuring the benefit of cache deduplication with a real world workload can be very difficult unless you try it in production. I have written about the theory in the past and I did a lab test here with highly duplicate synthetic data. The results were revealing about how the NetApp deduplication technology impacts both read cache and disk. Based on our findings, we decided to run another test. This time the plan was to test NetApp deduplication with a VMware guest boot storm. We also added the NetApp Performance Accelerator Module (PAM) to the testing.

The test infrastructure consists of 4 dual socket Intel Nehalem servers with 48GB of RAM each. Each server is connected to a 10GbE switch, and a FAS3170 is connected to the same switch. There are 200 virtual machines: 50 Microsoft Windows 2003, 50 Microsoft Vista, 50 Microsoft Windows 2008, and 50 Linux. Each operating system type is installed in a separate NetApp FlexVol for a total of 4 volumes. This was not done to maximize the deduplication results; instead, we did it to allow the VMware systems to use 4 different NFS datastores. Each physical server mounts all 4 NFS datastores, and the guests were split evenly across the 4 physical servers.

The test consisted of booting all 200 guests simultaneously. It was run multiple times with the FAS3170 cache warm and cold, with deduplication and without, and with PAM and without. The table below summarizes the boot timing results, measured as the time between starting the boot and the 200th system acquiring an IP address.

                      Cold Cache (MM:SS)   Warm Cache (MM:SS)   % Improvement
Pre-Deduplication
   0 PAM                    15:09                13:42               9.6%
   1 PAM                    14:29                12:34              13.2%
   2 PAM                    14:05                 8:43              38.1%
Post-Deduplication
   0 PAM                     8:37                 7:58               7.5%
   1 PAM                     7:19                 5:12              29.0%
   2 PAM                     7:02                 4:27              37.0%
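For anyone who wants to double-check the improvement column, here is a minimal sketch of the arithmetic using the MM:SS times from the table:

# Sketch: warm-vs-cold improvement computed from the boot times above.
def seconds(mmss):
    m, s = mmss.split(":")
    return int(m) * 60 + int(s)

def improvement(cold, warm):
    return (seconds(cold) - seconds(warm)) / seconds(cold) * 100

print(round(improvement("15:09", "13:42"), 1))  # 9.6  (pre-dedup, 0 PAM)
print(round(improvement("14:29", "12:34"), 1))  # 13.2 (pre-dedup, 1 PAM)
print(round(improvement("8:37", "7:58"), 1))    # 7.5  (post-dedup, 0 PAM)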
Let's take a look at the Pre-Deduplication results first. The warm 0 PAM boot performance improved by roughly 9.6% over the cold cache test. I suspect the small improvement is because the cache has been blown out by the time the cold cache boot completes. This is the behavior I would expect when the working set is substantially larger than the cache size. The 1 PAM warm boot results are 13.2% faster than the cold boot, suggesting that the working set is still larger than the cache footprint. With 2 PAM cards, the warm boot is 38.1% faster than the cold boot; it appears that a significant portion of the working set is now fitting into cache, enabling a significantly faster warm cache boot.

The Post-Deduplication results show a significant improvement in cold boot time over the Pre-Deduplication results. This is no surprise: once the data is deduplicated, the NetApp will fulfill a read request for any duplicate data block from the copy already in DRAM and save a disk access. (This article contains a full explanation of how the cache copy mechanism works.) As I have written previously, reducing the physical footprint of data is only one benefit of a good deduplication implementation. Clearly, it can provide a significant performance improvement as well.

As one would expect, the Post-Deduplication warm boots also show a significant performance improvement over the cold boots. The deduplicated working set appears to be larger than the 16GB PAM card, as adding a second 16GB card further improved the warm boot performance. It is certainly possible that additional PAM capacity would further improve the results.

It is worth noting that NetApp has released a larger 512GB PAM II card since we started doing this testing. The PAM I used in these tests is a 16GB DRAM based card and the PAM II is a 512GB flash based card. In theory, a DRAM based card should have lower latency for access. Since the cards are not directly accessed by a host protocol, it is not clear if the difference will be measurable at the host. Even if the card is theoretically slower, I can only assume the 32x size increase will more than make up for that with an improved hit rate.

Thanks to Rick Ross and Joe Gries in the Corporate Technologies Infrastructure Services Group who did all the hard work in the lab to put these results together.

Click to read more ...

Monday
Jul 20, 2009

Deduplication - The NetApp Approach

After writing a couple of articles (here and here) about deduplication and how I think it should be implemented, I figured I would try it on a NetApp system I have in the lab. The goal of the testing here is to compare storage performance of a data set before and after deduplication. Sometimes capacity is the only factor, but sometimes performance matters.

The test is random 4KB reads against a 100GB file. The 100GB file represents significantly more data than the test system can fit into its 16GB read cache. I am using 4KB because that is the natural block size for NetApp. To maximize the observability of the results in this deduplication test, the 100GB file is completely full of duplicate data. For those who are interested, the data was created by doing a dd from /dev/zero. It does not get any more redundant than that. I am not suggesting this is representative of a real world deduplication scenario. It is simply the easiest way to observe the effect deduplication has on other aspects of the system.

This is the output from sysstat -x during the first test. The data is being transferred over NFS and the client system has caching disabled, so all reads are going to the storage device. (The command output below is truncated to the right, but the important data is all there.)

Random 4KB reads from a 100GB file – pre-deduplication:

 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
 19%  6572     0     0    6579  1423 27901  23104     11     0     0     7   16%   0%  -  100%      0     7     0     0     0     0
 19%  6542     0     0    6549  1367 27812  23265    726     0     0     7   17%   5%  T  100%      0     7     0     0     0     0
 19%  6550     0     0    6559  1305 27839  23146     11     0     0     7   15%   0%  -  100%      0     9     0     0     0     0
 19%  6569     0     0    6576  1362 27856  23247    442     0     0     7   16%   4%  T  100%      0     7     0     0     0     0
 19%  6484     0     0    6491  1357 27527  22870      6     0     0     7   16%   0%  -  100%      0     7     0     0     0     0
 19%  6500     0     0    6509  1300 27635  23102    442     0     0     7   17%   9%  T  100%      0     9     0     0     0     0
The system is delivering an average of 6536 NFS operations per second, and the cache hit rate hovers around 16-17%. As you can see, the working set does not fit in primary cache. This makes sense: the 3170 has 16GB of primary cache and we are randomly reading from a 100GB file, so ideally we would expect about a 16% cache hit rate (16GB cache / 100GB working set), and we are very close. The disks are running at 100% utilization and are clearly the bottleneck in this scenario. The spindles are delivering as many operations as they are capable of.

So what happens if we deduplicate this data? First, we need to activate deduplication, a_sis in NetApp vocabulary, on the test volume and deduplicate the test data. (Before deduplication became the official buzz word, NetApp referred to their technology as Advanced Single Instance Storage.)
fas3170-a> sis on /vol/test_vol
SIS for "/vol/test_vol" is enabled.
Already existing data could be processed by running "sis start -s /vol/test_vol".
fas3170-a> sis start -s /vol/test_vol
The file system will be scanned to process existing data in /vol/test_vol.
This operation may initialize related existing metafiles.
Are you sure you want to proceed (y/n)? y
The SIS operation for "/vol/test_vol" is started.
fas3170-a> sis status
Path                           State      Status     Progress
/vol/test_vol                  Enabled    Initializing Initializing for 00:00:04
fas3170-a> df -s
Filesystem                used      saved       %saved
/vol/test_vol/         2277560  279778352          99%
fas3170-a>
There are a few other files on the test volume that contain random data, but the physical space used by the volume has been reduced by over 99%. This means our 100GB file now takes up less than 1GB on disk. So, let’s do some reads from the same file and see what has changed.

Random 4KB reads from a 100GB file – post-deduplication:

 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
 93% 96766     0     0   96773 17674 409570    466     11     0     0    35s  53%   0%  -    6%      0     7     0     0     0     0
 93% 97949     0     0   97958 17821 413990    578    764     0     0    35s  53%   8%  T    7%      0     9     0     0     0     0
 93% 99199     0     0   99206 18071 419544    280      6     0     0    34s  53%   0%  -    4%      0     7     0     0     0     0
 93% 98587     0     0   98594 17941 416948    565    445     0     0    36s  53%   6%  T    6%      0     7     0     0     0     0
 93% 98063     0     0   98072 17924 414712    398     11     0     0    35s  53%   0%  -    5%      0     9     0     0     0     0
 93% 96568     0     0   96575 17590 408539    755    502     0     0    35s  53%   8%  T    7%      0     7     0     0     0     0
There has been a noticeable increase in NFS operations. The system has gone from delivering 6536 NFS ops to delivering 96,850 NFS ops. That is nearly a fifteen-fold increase in delivered operations. The CPU utilization has gone up roughly 4.9x. The disk reads have dropped to almost zero and the system is serving out over 400MB/s. This is a clear indication that the operations are being serviced from cache instead of from disk. It is also worth noting that the average latency, as measured from the host, has dropped by over 80%. The improvement in latency is not surprising given that the requests are no longer being serviced from disk.

The cache age has dropped down to 35 seconds. Cache age is the average age of the blocks that are being evicted from cache to make space for new blocks. The test had been running for over an hour when this data was captured, so this is not due to the load ramping. This suggests that even though we are accessing a small number of disk blocks, the system is still evicting blocks from cache. I suspect this is because the system is not truly deduplicating cache. Instead, it appears that each logical file block is taking up space in cache even though they refer to the same physical disk block. One potential explanation is that NetApp is eliminating the disk read by reading the duplicate block from cache instead of from disk. I am not sure how to validate this through the available system stats, but I believe it explains the behavior. It explains why the NFS ops have gone up, the disk ops have gone down, and the cache age has gone down to 35 seconds. While it would be preferable to store only a single copy of the logical block in cache, this is better than reading all of the blocks from disk.

The cache hit percentage is a bit of a puzzle here. It is stable at 53% and I am not sure how to explain that. The system is delivering more than 53% of the read operations from cache; the very small number of disk reads shows that. Maybe someone from NetApp will chime in and give us some details on how that number is derived.

This testing was done on Data ONTAP 7.3.1 (or more specifically 7.3.1.1L1P1). I tried to replicate the results on versions of Data ONTAP prior to 7.3.1 without success. In older versions, the performance of the deduplicated volume is very similar to the original volume. It appears that reads for logically different blocks that point to the same physical block go to disk prior to 7.3.1.

Check back shortly as I am currently working on a deduplication performance test for VMware guests. It is a simple test to show the storage performance impact of booting many guests simultaneously. The plan is to use a handful of fast servers to boot a couple hundred guests. The boot time will be compared across a volume that is not deduplicated and one that has been deduplicated. I am also working on one additional test that may provide a bit of performance acceleration.

These results are from a NetApp FAS 3170. Please be careful trying to map these results to your filer as I am using very old disk drives and that skews the numbers a bit. The slow drives make the performance of the full (non-deduplicated) data set slower than it would be with new disk drives.
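For reference, here is a quick sketch of the ratios quoted above, using the before-and-after figures cited in this article (6536 and 96,850 NFS ops/sec, 19% and 93% CPU) and the 16GB cache / 100GB working set numbers from the test setup:

# Sketch: the ratios cited in this article, computed from the quoted figures.
pre_ops, post_ops = 6536, 96850      # average NFS ops/sec, before and after dedup
pre_cpu, post_cpu = 0.19, 0.93       # CPU utilization from sysstat
cache_gb, working_set_gb = 16, 100   # primary cache vs. test file size

print(round(post_ops / pre_ops, 1))              # ~14.8x more NFS operations
print(round(post_cpu / pre_cpu, 1))              # ~4.9x more CPU
print(round(cache_gb / working_set_gb * 100))    # ~16% expected pre-dedup hit rate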

Click to read more ...

Friday
Apr 10, 2009

Deduplication - It's not just about capacity

There is no debating that deduplication is one of the hottest topics in IT. The question is whether the hype has started to become bigger than the technology. Today, there are two primary use cases driving deduplication in the marketplace. The first is backup to disk and the second is virtual guest operating systems (VMware, Hyper-V, and Xen guests). (I will talk a bit about the disk to disk scenario in this article and the virtual guest topic in the next one.) These are both logical markets to adopt deduplication because they suffer from a common challenge: they both create a tremendous amount of redundant data on the disk array. The goal in both cases is to pack more data onto a disk drive and reduce the cost per GB. This is the first and most obvious use case for deduplication.

Disk drive capacity is growing exponentially, but disk performance is increasing at a much slower rate. In many cases, when helping customers size for their workload, performance drives the spindle count and not capacity. It is easy to meet the capacity needs with large drives, but will they meet the performance requirement? That is the problem. It is no longer sufficient to size a storage device based solely on capacity requirements; this is a general challenge that must be taken into account when sizing any storage array.

So how is the growing disparity between capacity and performance affected by deduplication? Deduplication can make the performance issue worse by reducing the number of spindles even further. If the bottleneck in the storage device is the spindles, then using deduplication to pack more data onto those spindles is only going to exacerbate the situation.

Let's take a closer look at sizing storage for a backup to disk workload. Delivering on the highly sequential read and write requirements of disk to disk backups is much easier than serving a more random workload. Disk drives do a great job with sequential reads and writes, which makes backup to disk all about sizing for capacity. When deduplication is added into the mix, the disk drives should still meet the performance requirement as long as the deduplication technology being used does not turn sequential IO into random IO. This is why it is important to understand how a specific deduplication implementation works.

The reality is that nearly every other IT workload is more random than backup to disk. If deduplication were used to pack more data onto the same number of spindles for a highly random workload, the spindles would likely not meet the performance requirements. Does that mean deduplication is a point solution for highly sequential workloads? I do not believe so. I am working on an entry covering the potential performance benefits of deduplication in a more random environment.
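To illustrate the sizing point, here is a hedged sketch; the drive capacity and per-spindle IOPS figures are round illustrative assumptions, not numbers from any particular array or workload.

import math

# Sketch: sizing spindle count by capacity vs. by performance.
# All inputs are illustrative assumptions.
capacity_tb_needed = 20      # how much data we need to store
workload_iops = 6000         # random IOPS the workload generates
drive_tb = 1.0               # usable capacity per drive (assumed)
drive_iops = 75              # random IOPS per spindle (assumed)

by_capacity = math.ceil(capacity_tb_needed / drive_tb)   # 20 drives
by_performance = math.ceil(workload_iops / drive_iops)   # 80 drives

# Performance, not capacity, sets the spindle count here. Deduplicating the
# data onto fewer spindles does nothing to reduce the IOPS those spindles
# must deliver.
print(max(by_capacity, by_performance))                  # 80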

Click to read more ...

Friday
Feb 27, 2009

Do I need more cache in my NetApp?

How many times have you wondered whether you could improve the performance of your storage array by adding additional cache? Will more cache improve the performance of my storage array? This is what the vendors so often tell us, but they have no objective information to explain why it is going to help. Depending on the workload, increasing the cache may have little or no effect on performance.

There are two ways to know whether your environment will benefit from additional cache. The first is to understand every nuance of your application. Most storage managers I speak with classify this as impractical at best and impossible at worst. Even if you have an application with a very well understood workload, most storage devices are not hosting a single application. Instead, they are hosting many different applications, and it is even more complex to understand how this combined workload will be affected by adding cache. The second way to measure cache benefit is to put the cache in and see what happens. This is the most common approach I see in the field. When performance becomes unacceptable, the options of adding additional disk and/or cache are weighed and a purchase is made. (I will save the topic of adding spindles to increase performance for a future post.) Both of these options force a purchase to be made with no guarantee it will solve the problem.

NetApp has introduced a tool to provide a third option: Predictive Cache Statistics. It provides the objective data needed to rationalize a hardware purchase. Predictive Cache Statistics (PCS) is available on systems running Data ONTAP 7.3 or later with at least 2GB of memory. When it is enabled, PCS reports what the cache hit ratio would be if the system had 2x (ec0), 4x (ec1), and 8x (ec2) the current cache footprint. (ec0, ec1, and ec2 are the names of the extended caches when the stats are presented by the NetApp system.)

Now, let's drill down into exactly how predictive cache statistics work. In most conditions there is no significant impact to system performance. I monitored the change in latency on my test system with PCS enabled and disabled and there was not a measurable difference. The storage controller was running at about 25% CPU utilization at the time with a 40% cache hit rate. NetApp warns in their docs that performance can be affected when the storage controller is at 80% CPU utilization or higher. That is understandable given the amount of information the array has to track in order to provide the cache statistics. It simply means some thought needs to be put into when it is enabled and how long it is run in production.

Here are the steps required to gather the information:

1) Enable Predictive Cache Statistics (PCS)

options flexscale.enable pcs
2) Allow the workload to run until the virtual caches have had time to warm up. In a system with a large amount of cache, this can take hours or even days. Monitor array performance while the workload runs; if latency increases to unacceptable levels, you can disable PCS.
options flexscale.enable off
3) The NetApp perfstat tool can be used to capture and analyze the data that is gathered. I prefer instant gratification, so for this example I will use the real-time stats command.
stats show -p flexscale-pcs
The way the results are reported can be a little confusing the first time you look at them. The ec0, ec1, and ec2 'virtual caches' are relative to the base cache in the system being tested (2x, 4x, and 8x). If the test system has 16GB of primary cache, ec0 will represent 32GB of 'virtual cache' (2x 16GB). ec1 brings the 'virtual cache' to a total of 4x base cache, an additional 32GB beyond ec0. ec2 brings the total to 8x base cache, an additional 64GB beyond ec0 + ec1. The statistics on each line represent the values for that specific cache segment. Hopefully that explanation clears up more confusion than it introduces.

Here are a couple of examples. This testing was completed on a NetApp FAS3170, which has 16GB of cache standard. So, in these examples, ec0 is 32GB, ec1 is 32GB, and ec2 is 64GB.
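To make the sizing rule concrete, here is a small sketch; the only input is the base cache size, and 16GB matches the FAS3170 used in these examples:

# Sketch: segment sizes reported by PCS, following the 2x/4x/8x rule above.
def pcs_segment_sizes(base_cache_gb):
    ec0 = 2 * base_cache_gb                # extends the cache to 2x base
    ec1 = 4 * base_cache_gb - ec0          # the increment needed to reach 4x base
    ec2 = 8 * base_cache_gb - (ec0 + ec1)  # the increment needed to reach 8x base
    return ec0, ec1, ec2

print(pcs_segment_sizes(16))   # (32, 32, 64) GB, matching the block counts below
                               # (8388608 and 16777216 4KB blocks)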

Example 1: 8GB working set, 4KB IO, and 100% random reads

fas3170-a> sysstat -x 5
 CPU   NFS  CIFS  HTTP   Total    Net  kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in    out   read  write  read write   age   hit time  ty util                 in   out    in   out
 39% 39137     0     0   39137  7102 165539    206    370     0     0   >60  100%   3%  T    2%      0     0     0     0     0     0
 39% 39882     0     0   39882  7236 168677    136      6     0     0   >60  100%   0%  -    1%      0     0     0     0     0     0
 39% 39098     0     0   39098  7094 165338    186    285     0     0   >60  100%   3%  T    2%      0     0     0     0     0     0

fas3170-a> stats show -p flexscale-pcs
Instance    Blocks Usage   Hit  Miss Hit Evict Invalidate Insert
                       %    /s    /s   %    /s         /s     /s
     ec0   8388608     0     0     0   0     0          0      0
     ec1   8388608     0     0     0   0     0          0      0
     ec2  16777216     0     0     0   0     0          0      0
---
     ec0   8388608     0     0     0   0     0          0      0
     ec1   8388608     0     0     0   0     0          0      0
     ec2  16777216     0     0     0   0     0          0      0
---
     ec0   8388608     0     0     0   0     0          0      0
     ec1   8388608     0     0     0   0     0          0      0
     ec2  16777216     0     0     0   0     0          0      0
The sysstat shows a cache hit rate of 100%. This is exactly what we would expect for an 8GB dataset on a system with 16GB of cache. The stats command shows that PCS is currently reporting no activity. Again, this is exactly what we should expect with a working set that fits completely in main cache.

Example 2: 30GB working set, 4KB IO, and 100% random reads

fas3170-a> sysstat -x 5
 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
 27% 11607     0     0   11607  2173 49352  27850      6     0     0     3   41%   0%  -   99%      0     0     0     0     0     0
 27% 11642     0     0   11642  2180 49518  28097    279     0     0     3   41%  21%  T   99%      0     0     0     0     0     0
 26% 11413     0     0   11413  2138 48511  27773     11     0     0     3   41%   0%  -   99%      0     0     0     0     0     0

fas3170-a> stats show -p flexscale-pcs
Instance    Blocks Usage   Hit  Miss Hit Evict Invalidate Insert
                       %    /s    /s   %    /s         /s     /s
     ec0   8388608     1    38  8560   0     0          0  14811
     ec1   8388608     0     0  8560   0     0          0      0
     ec2  16777216     0     0  8560   0     0          0      0
---
     ec0   8388608     1    65  6985   0     0          0      0
     ec1   8388608     0     0  6985   0     0          0      0
     ec2  16777216     0     0  6985   0     0          0      0
---
     ec0   8388608     1   100  6922   1     0          0  11899
     ec1   8388608     0     0  6922   0     0          0      0
     ec2  16777216     0     0  6922   0     0          0      0
This data was gathered after the 30GB workload had been running for a few minutes, but just after I enabled predictive cache statistics. The PCS data shows that there are very few hits, but there are a significant number of inserts. This is what we should expect when PCS is first enabled. The sysstat output shows a cache hit rate of 41%.
fas3170-a> sysstat -x 5
 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
 27% 11238     0     0   11238  2105 47784  27862    286     0     0     4   40%  18%  T   99%      0     0     0     0     0     0
 26% 11371     0     0   11371  2130 48349  27934     11     0     0     4   40%   0%  -   99%      0     0     0     0     0     0
 27% 11184     0     0   11184  2096 47554  27938    275     0     0     4   40%  33%  T   99%      0     0     0     0     0     0

fas3170-a> stats show -p flexscale-pcs
Instance    Blocks Usage   Hit  Miss Hit Evict Invalidate Insert
                       %    /s    /s   %    /s         /s     /s
     ec0   8388608    87  6536   456  93   933          0    934
     ec1   8388608     6   453     3  99     0        934    933
     ec2  16777216     0     0     3   0     0          0      0
---
     ec0   8388608    87  6512   435  93     0          0      0
     ec1   8388608     6   435     0 100     0          0      0
     ec2  16777216     0     0     0   0     0          0      0
---
     ec0   8388608    87  6472   450  93   963          0    964
     ec1   8388608     6   445     5  98     0        964    963
     ec2  16777216     0     0     5   0     0          0      0
Now that the ec0 virtual cache has warmed up, the potential value of additional cache becomes more apparent. The hit rate has gone up to 93% and ec0 is servicing over 6500 operations per second. With 32GB of additional cache, 6500+ disk reads per second would be eliminated and the latency would be dramatically reduced. These cache hits are virtual, so currently those 'hits' are still causing disk reads. Clearly, the additional cache would provide a major performance boost, but unfortunately, it is impossible to determine exactly how it will affect overall system performance. The current bottleneck, reads from disk, would be alleviated, but that simply means we will find the next one.

Additional cache can be added to most NetApp systems in the form of a Performance Accelerator Module (PAM). The PAM is a PCI Express card with 16GB of DRAM on it, and it plugs directly into one of the PCI Express slots in the filer. I suspect there is a slight increase in latency when accessing data in the PAM compared to the main system cache, although the increase is likely so small that it will not be noticed on the client side, as it is a very small portion of the total transaction time from the client perspective. Unfortunately, I do not have first hand performance data that I can share, as I have not been able to get access to a PAM for complete lab testing.

It is important to note that a system with 16GB of primary cache and 32GB of PAM cache is not the same as a system with 48GB of primary cache. The PAM cache is populated as items are evicted from primary cache, and if there is a hit in the PAM, that block is copied back into primary cache. This type of cache is commonly referred to as a victim cache or an L2 cache. If the goal is to serve a working set without ever going to disk, then that working set needs to fit into the extended cache, not the primary cache plus extended cache.

Predictive Cache Statistics is a great feature. It gives us the power to answer a question we could only guess at in the past. However, like most end users, I always want more. There are a couple of things that I would love to see in the future. First, the PAM cards are 16GB in size. It would be great if the extended cache segments reported by PCS could be in 16GB increments. That would make it even easier to determine the value of each card I add, and it would also remove all the confusion around how big ec0, ec1, and ec2 are. The ability to reset the PCS counters back to zero would also be helpful; when testing different workloads, this would allow the stats to be associated with each individual workload.

It is worth noting that this was not a performance test and the data above should be treated as such. Nothing was done to either the client or the filer to optimize NFS performance. In an attempt to prevent these numbers from being used to judge system performance, I am intentionally omitting the details of how the disk was configured.
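Since the victim cache behavior described above trips people up, here is a deliberately simplified sketch of the concept in Python. It is plain LRU with none of the WAFL-specific policies, so treat it as an illustration of a victim/L2 cache in general, not a description of how Data ONTAP actually manages the PAM.

from collections import OrderedDict

# Sketch: a two-level cache where blocks evicted from primary land in the
# extended (PAM-style) cache, and an extended-cache hit copies the block
# back into primary. Plain LRU on both levels; illustrative only.
class VictimCache:
    def __init__(self, primary_blocks, extended_blocks):
        self.l1 = OrderedDict()        # primary cache (controller memory)
        self.l2 = OrderedDict()        # extended (victim / L2) cache
        self.l1_size = primary_blocks
        self.l2_size = extended_blocks

    def read(self, block):
        if block in self.l1:           # primary hit
            self.l1.move_to_end(block)
            return "l1_hit"
        hit = block in self.l2
        if hit:
            self.l2.move_to_end(block) # copied back to primary, stays in L2
        self._insert_l1(block)
        return "l2_hit" if hit else "miss"   # a miss means a disk read

    def _insert_l1(self, block):
        self.l1[block] = True
        if len(self.l1) > self.l1_size:
            victim, _ = self.l1.popitem(last=False)
            self.l2[victim] = True     # evicted primary blocks populate L2
            self.l2.move_to_end(victim)
            if len(self.l2) > self.l2_size:
                self.l2.popitem(last=False)

# Because primary blocks end up duplicated in the extended cache, the working
# set that can avoid disk is bounded by the extended cache size rather than
# by primary plus extended, which is the point made above.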

Click to read more ...

Tuesday
Jan 27, 2009

WAN optimization for array replication

As the need for disaster recovery continues to move downmarket from the enterprise to medium and small businesses, the number of IT shops replicating their data to an offsite location is increasing. Array based replication was once a feature reserved for the big budgets of the Fortune 1000. Today, it is available on most midrange storage devices (and even some of the entry level products). This increase in replication deployments has created a new challenge for IT. The most common replication solutions move the data over the IP network, and that data puts a significant load on the network infrastructure. The LAN is almost always up to the task, but the WAN is often not able to handle this new burden. While the prices of network infrastructure have come down over the years, big pipes are still an expensive monthly outlay. So, how do we get that data offsite without driving up those WAN costs?

WAN optimization technology provides a potential solution. Not every workload or protocol can benefit from today's WAN optimization technology, but replication is one that usually gets a big boost. I gathered some data from a client who is using NetApp SnapMirror to replicate to a remote datacenter and deployed WAN optimization to avoid a major WAN upgrade. The NetApp filer is serving iSCSI, Fibre Channel, and CIFS. The clients are primarily Windows, running Exchange and MS SQL along with some home grown applications, and all of their data is stored on the NetApp storage.

The chart below shows the impact the WAN optimization device had. For the purposes of this discussion, think of the device as having one unoptimized LAN port and an optimized WAN port. The LAN traffic is represented by the red and the WAN by the blue. With no optimization, the traffic would be the same on both sides. The chart shows a dramatic reduction in the amount of data being pushed over the WAN.

Network Throughput

This data was gathered over a 2 week period. The total data reduction over the WAN was 83% across the period shown in the chart, with a peak of 93% for one window. Again, this is not what every environment will see, so test before you deploy. In this case, the system paid for itself in less than 12 months with the savings in WAN costs. That is the kind of ROI that works for almost anyone. I am intentionally not addressing which WAN optimization technology was used in this solution. Last time we tested these devices in our lab, we brought in a half dozen and they all had their pros and cons. That is another topic for another post.
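To put the 83% number in perspective, here is a rough sketch of what it does to link sizing. The nightly change rate and replication window below are made-up illustrative values, not the client's actual figures.

# Sketch: WAN bandwidth needed for a replication window, with and without
# the observed 83% data reduction. Inputs are illustrative assumptions.
nightly_change_gb = 500      # data SnapMirror must move per night (assumed)
window_hours = 8             # replication window (assumed)
reduction = 0.83             # average WAN data reduction observed above

def required_mbps(gb, hours):
    return gb * 8 * 1000 / (hours * 3600)   # GB to megabits, spread over the window

print(round(required_mbps(nightly_change_gb, window_hours)))                    # ~139 Mbps unoptimized
print(round(required_mbps(nightly_change_gb * (1 - reduction), window_hours)))  # ~24 Mbps optimized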

Click to read more ...

Monday
Dec 1, 2008

HSM without the headaches

Hierarchical Storage Management (HSM), Information Lifecycle Management (ILM), and Data Lifecycle Management (DLM). Everyone wants to manage their data intelligently to reduce their spending on storage infrastructure. The storage vendors and the trade rags would like to convince us that there are magic tools to solve this challenge. The truth is there is no magic tool to manage unstructured data. (I am not talking about the archiving tools that integrate with applications here, only about unstructured data.) I have tried many tools over the years and they are simply not cost effective. Don't panic though; in most cases, the solution is far simpler and far less expensive than HSM.

File services is a huge consumer of storage capacity. For the purposes of this conversation, let's consider file services as NFS or CIFS storage, whether it is served by integrated appliances or by servers leveraging back end storage devices. In most environments I visit, the file serving infrastructure is using tier 1 disk drives (Fibre Channel, SCSI, or SAS). These disk drives are populated with data that is mostly idle, and the storage managers want to get that idle data onto a less expensive disk tier. The most common request is to transparently move the idle data to SATA based devices.

Let's walk through the scenarios for an environment with 20TB of unstructured data. To make the example a little simpler, I am going to ignore both RAID capacity overhead and drive right-sizing. I am going to use 300GB FC drives for tier 1 and 1TB SATA drives for tier 2, and I am going to assume that 10% of the data gets 90% of the IO. (While every environment is different, this is in line with what I see in file sharing environments.) Check out this interesting paper about a recent file server analysis: "Measurement And Analysis Of Large-Scale Network File System Workloads" by Andrew W. Leung and Ethan L. Miller from UC Santa Cruz and Shankar Pasupathy and Garth Goodson of NetApp.

Storing 20TB of data on tier 1 drives takes 69 drives (20TB * 1024GB/TB / 300GB/drive = 68.26 drives). If 10% of that data is considered active, then a tiered environment would require 7 of the 300GB drives for the tier 1 data and 18 of the 1TB drives for the tier 2 data.

There are two apparent solutions to this problem. The first is 100% tier 1 disk (option A) and the second is 10% tier 1 and 90% tier 2 disk (option B). Using all tier 1 disk will deliver the required performance and capacity. The downside is that the disk is expensive and takes a tremendous amount of power and cooling. This is the expensive solution storage managers are trying to escape from. The second option is to use a mix of tier 1 and tier 2 disk. This has the potential to make the disks significantly less expensive. The challenge here is the requirement for a magic HSM tool. These tools are so expensive that they often cost more than is saved by using tier 2 disks. Additionally, they are very complex to deploy and manage.

There is a third option that is often not considered: use 100% tier 2 disk. Is that practical? Yes, in most environments the unstructured data will perform just fine on tier 2 disks. Let's go back to the 10% tier 1 example for a minute. In that example the small number of tier 1 disks are being asked to shoulder 90% of the IO workload while the tier 2 drives sit nearly idle. When we use 100% tier 2 disk, we are able to put all of the spindles to work.
Why pay the high price for tier 1 disk to concentrate the workload and leave 70%+ of the drives underutilized? Put those tier 2 disks to work. Disk IOPS are the most common performance limiter I see, so I am always looking for ways to spread out the workload. Modern disk drives not only run fine with a mix of active and idle data, they actually need to host some idle data; if a drive were filled to capacity with active data, it would most likely be unable to handle the workload.

Disclaimer: This is not true for every environment. Some environments drive too much IO to leverage tier 2 disk effectively, and for those environments I suggest using 100% tier 1 disk. Yes, I will admit that in some extreme cases the HSM solutions make sense. Far more often, the cost effective approach is to stick with either 100% tier 1 or 100% tier 2 disk.
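For reference, here is a quick sketch of the drive-count arithmetic from the 20TB example above (same simplifications: no RAID overhead, no drive right-sizing):

import math

# Sketch: drive counts for the 20TB example (RAID overhead and right-sizing
# ignored, as in the article).
data_gb = 20 * 1024
tier1_drive_gb, tier2_drive_gb = 300, 1024
active_fraction = 0.10                 # ~10% of the data gets ~90% of the IO

all_tier1 = math.ceil(data_gb / tier1_drive_gb)
tiered_t1 = math.ceil(data_gb * active_fraction / tier1_drive_gb)
tiered_t2 = math.ceil(data_gb * (1 - active_fraction) / tier2_drive_gb)

print(all_tier1)               # 69 drives if everything stays on tier 1
print(tiered_t1, tiered_t2)    # 7 tier 1 + 18 tier 2 drives if tiered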

Click to read more ...