Entries in NetApp (7)

Thursday
Apr 15, 2010

Oracle/Sun F20 Flash Card - How fast is it?

I received several questions about the performance of the Oracle/Sun F20 flash card I used in my previous post about block alignment, so I put together a quick overview of the card's performance capabilities. The following results are from testing the card in a dual-socket 2.93GHz Nehalem (X5570) system running Solaris x64, which is similar to the server platform Oracle uses in the Exadata V2 platform. The F20 card is a SAS controller with 4 x 24GB flash modules attached to it. You can find more info on the flash modules on Adam Leventhal's blog, and the official Oracle product page has the F20 details.

All of my tests used 100% random 4KB blocks. I focused on random operations because in most cases it is not cost-effective to use SSD for sequential operations. These tests were run with a variety of thread counts to give an idea of how the card scales with multiple threads. The first test compared the performance of a single 24GB flash module to the performance of all 4 modules.
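The load generator I used is not shown here, but for anyone who wants to run a similar experiment, a roughly comparable 4KB random-read workload could be expressed with something like fio. The device path is hypothetical, and --numjobs is what you would vary to sweep thread counts:

# Sketch only: not necessarily the tool or device used for the published numbers.
# Point --filename at a single flash module or at a stripe across all four.
# For the mixed workload further down, something like --rw=randrw --rwmixread=80
# would approximate the 80% read phase.
fio --name=f20-4k-randread \
    --filename=/dev/rdsk/c1t0d0s0 \
    --rw=randread --bs=4k --direct=1 --ioengine=psync \
    --numjobs=32 --time_based --runtime=300 --group_reporting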

[Chart: 4KB Random Operations]

At lower thread counts, the 4 module test delivers roughly 4x the operations per second of the single module test. As the thread count rises, the single module test tops out at 35,411 ops while 4 modules deliver 97,850 ops, or 2.76x the single module result. It would be great if the card were able to drive all 4 modules at full speed, but 97K+ ops is not too shabby. What is more impressive to me is that those 97K+ ops are delivered at roughly 1ms of latency.

The next round of testing included three different workloads: 100% read, 80% read, and 100% write, all run against all 4 flash modules. Again, all tests used 4KB random operations. Here are the operations per second results.

[Chart: 4KB random operations per second]

And a throughput version in MB/s for anyone who is interested.

[Chart: 4KB random throughput (MB/s)]
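For a rough sanity check on how the peak read numbers above translate into throughput at a 4KB block size:

echo $(( 97850 * 4 / 1024 ))   # all 4 modules: roughly 382 MB/s
echo $(( 35411 * 4 / 1024 ))   # single module: roughly 138 MB/s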

Flash and solid state disk (SSD) technologies are developing at an incredibly fast pace. They are a great answer, but I think we are still figuring out what the question is. At some point down the line, they may replace spinning disk drives, but I do not think that is going to happen in the short term. There are some applications that can leverage this low latency capacity inside the servers today, but this is not the majority of applications.

Where flash and SSD make more sense to me is as a large cache. NetApp and Sun are using flash this way today in their storage array product lines. DRAM is very expensive, but flash can provide a very large and very low latency cache. I expect we will see more vendors adopting this "flash for cache" approach moving forward. The economics just make sense. Disks are too slow and DRAM is too expensive.

It would also be great to see operating systems that were intelligent enough to use technologies like the F20 card and the Fusion-io card as extended filesystem read cache. Solaris can do it for ZFS filesystems using the L2ARC. As far as I know, no filesystems on the other major operating systems have this feature. What about using it as a client-side NFS cache? At one point, Solaris offered CacheFS for NFS caching, but I do not believe it is still being actively developed. While CacheFS had its challenges, I believe the idea was a very good one. It costs a lot more to buy a storage array capable of delivering 97K ops than it does to put more cache into the server.
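To make the ZFS/L2ARC point concrete, attaching a flash device to an existing pool as a read cache is a one-liner. The pool and device names here are made up for illustration:

# Hypothetical pool and device names; adds the flash device as L2ARC (read cache)
zpool add tank cache c3t0d0
zpool status tank   # the device now shows up under a 'cache' heading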


Monday
Dec 28, 2009

VMware boot storm on NetApp - Part 2

I have received a few questions relating to my previous post about NetApp VMware boot storm results and want to answer them here. I have also had a chance to look through the performance data gathered during the tests and have a few interesting data points to share. I also wanted to mention that I now have a pair of second generation Performance Accelerator Modules (PAM II) in hand and will be publishing updated VMware boot storm results with the larger capacity cards.

What type of disk were the virtual machines stored on?

  • The virtual machines were stored on a SATA RAID-DP aggregate.
What was the rate of data reduction through deduplication?
  • The VMDK files were all fully provisioned at the time of creation. Each operating system type was placed on a different NFS datastore. This resulted in 50 virtual machines on each of 4 shares. The deduplication reduced the physical footprint of the data by 97%.
Here are a few interesting stats gathered during the testing. These numbers are not exact, due to the somewhat imprecise nature of starting and stopping statit in synchronization with the start and end of each test.
  • The CPU utilization moved inversely with the boot time. The shorter the boot time, the higher the CPU utilization. This is not surprising: during the faster boots, the CPUs were not waiting around for disk drives to respond. More data was served from cache, so the CPUs could stay busy.
  • The total NFS operations required for each test was 2.8 million.
  • The total GB read by the VMware physical servers from the NetApp was roughly 49GB.
  • The total GB read from disk trended down between cold and warm cache boots. This is what I expected, and I would be somewhat concerned if it were not true.
  • The total GB read from disk trended down with the addition of each PAM. Again, I would be somewhat concerned if this was not the case.
  • The total GB read from disk took a significant drop when the data was deduplicated. This helps to prove out the theory that NetApp is no longer going to disk for every read of a different logical block that points to the same physical block.
How much disk load was eliminated by the combination of dedup and PAM?
  • The cold boots with no dedup and no PAM read about 67GB of data from disk. The cold boot with dedup and no PAM dropped that down to around 16GB. Adding 2 PAM (or 32GB of extended dedup-aware cache) dropped the amount of data read from disk to less than 4GB (see the quick arithmetic below).
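Expressed as a fraction of the original 67GB cold-boot read load, that works out to roughly:

# back-of-the-envelope reduction in data read from disk during a cold boot
echo $(( (67 - 16) * 100 / 67 ))   # dedup alone: about 76% less read from disk
echo $(( (67 - 4) * 100 / 67 ))    # dedup plus 2 PAM: about 94% less read from disk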


Sunday
Nov 01, 2009

VMware boot storm on NetApp

UPDATE: I have posted an update to this article here: More boot storm details

Measuring the benefit of cache deduplication with a real world workload can be very difficult unless you try it in production. I have written about the theory in the past, and I did a lab test here with highly duplicate synthetic data. The results were revealing about how the NetApp deduplication technology impacts both read cache and disk. Based on our findings, we decided to run another test. This time the plan was to test NetApp deduplication with a VMware guest boot storm. We also added the NetApp Performance Accelerator Module (PAM) to the testing.

The test infrastructure consists of 4 dual-socket Intel Nehalem servers with 48GB of RAM each. Each server is connected to a 10GbE switch. A FAS3170 is connected to the same 10GbE switch. There are 200 virtual machines: 50 Microsoft Windows 2003, 50 Microsoft Vista, 50 Microsoft Windows 2008, and 50 Linux. Each operating system type is installed in a separate NetApp FlexVol for a total of 4 volumes. This was not done to maximize the deduplication results. Instead, we did it to allow the VMware systems to use 4 different NFS datastores. Each physical server mounts all 4 NFS datastores, and the guests were split evenly across the 4 physical servers.

The test consisted of booting all 200 guests simultaneously. This test was run multiple times with the FAS3170 cache warm and cold, with deduplication and without, and with PAM and without. Here is a table summarizing the boot timing results; each time is measured from starting the boot to the 200th system acquiring an IP address:

Cold Cache (MM:SS) Warm Cache (MM:SS) % Improvement
Pre-Deduplication
0 PAM 15:09 13:42 9.6%
1 PAM 14:29 12:34 13.2%
2 PAM 14:05 8:43 38.1%
Post-Deduplication
0 PAM 8:37 7:58 7.5%
1 PAM 7:19 5:12 29.0%
2 PAM 7:02 4:27 37.0%
Let's take a look at the Pre-Deduplication results first. The warm 0 PAM boot performance improved by roughly 9.6% over the cold cache test. I suspect the small improvement is because the cache has been blown out by the time the cold cache boot completes. This is the behavior I would expect when the working set is substantially larger than the cache size. The 1 PAM warm boot results are 13.2% faster than the cold boot, suggesting that the working set is still larger than the cache footprint. With 2 PAM cards, the warm boot is 38.1% faster than the cold boot. With 2 PAM cards, it appears that a significant portion of the working set now fits into cache, enabling a significantly faster warm cache boot.

The Post-Deduplication results show a significant improvement in cold boot time over the Pre-Deduplication results. This is no surprise, since once the data is deduplicated, the NetApp will fulfill a read request for any duplicate data block already in the cache from the copy in DRAM and save a disk access. (This article contains a full explanation of how the cache copy mechanism works.) As I have written previously, reducing the physical footprint of data is only one benefit of a good deduplication implementation. Clearly, it can provide a significant performance improvement as well. As one would expect, the Post-Deduplication warm boots also show a significant performance improvement over the cold boots. The deduplicated working set appears to be larger than the 16GB PAM card, as adding a second 16GB card further improved the warm boot performance. It is certainly possible that additional PAM capacity would further improve the results.

It is worth noting that NetApp has released a larger 512GB PAM II card since we started doing this testing. The PAM I used in these tests is a 16GB DRAM-based card and the PAM II is a 512GB flash-based card. In theory, a DRAM-based card should have lower latency for access. Since the cards are not directly accessed by a host protocol, it is not clear if the difference will be measurable at the host. Even if the card is theoretically slower, I can only assume the 32x size increase will more than make up for that with an improved hit rate.

Thanks to Rick Ross and Joe Gries in the Corporate Technologies Infrastructure Services Group who did all the hard work in the lab to put these results together.
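If you want to check the improvement percentages in the table, they are simply (cold - warm) / cold with the times converted to seconds. For example, the 2 PAM pre-deduplication row:

# 14:05 cold = 845 seconds, 8:43 warm = 523 seconds
echo "scale=1; (845 - 523) * 100 / 845" | bc   # prints 38.1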


Monday
Jul 20, 2009

Deduplication - The NetApp Approach

After writing a couple of articles (here and here) about deduplication and how I think it should be implemented, I figured I would try it on a NetApp system I have in the lab. The goal of the testing here is to compare storage performance of a data set before and after deduplication. Sometimes capacity is the only factor, but sometimes performance matters.

The test is random 4KB reads against a 100GB file. The 100GB file represents significantly more data than the test system can fit into its 16GB read cache. I am using 4KB because that is the natural block size for NetApp. To maximize the observability of the results in this deduplication test, the 100GB file is completely full of duplicate data. For those who are interested, the data was created by doing a dd from /dev/zero. It does not get any more redundant than that. I am not suggesting this is representative of a real world deduplication scenario. It is simply the easiest way to observe the effect deduplication has on other aspects of the system.

This is the output from sysstat -x during the first test. The data is being transferred over NFS, and the client system has caching disabled, so all reads are going to the storage device. (The command output below is truncated to the right, but the important data is all there.)
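Before we get to that output, for anyone who wants to recreate the data set, the all-zero file was made with a plain dd over the NFS mount; the exact path below is illustrative:

# 102400 x 1MB blocks of zeros = a 100GB file of maximally duplicate data
dd if=/dev/zero of=/mnt/test_vol/test_100g bs=1048576 count=102400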

Random 4KB reads from a 100GB file – pre-deduplication:

 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
 19%  6572     0     0    6579  1423 27901  23104     11     0     0     7   16%   0%  -  100%      0     7     0     0     0     0
 19%  6542     0     0    6549  1367 27812  23265    726     0     0     7   17%   5%  T  100%      0     7     0     0     0     0
 19%  6550     0     0    6559  1305 27839  23146     11     0     0     7   15%   0%  -  100%      0     9     0     0     0     0
 19%  6569     0     0    6576  1362 27856  23247    442     0     0     7   16%   4%  T  100%      0     7     0     0     0     0
 19%  6484     0     0    6491  1357 27527  22870      6     0     0     7   16%   0%  -  100%      0     7     0     0     0     0
 19%  6500     0     0    6509  1300 27635  23102    442     0     0     7   17%   9%  T  100%      0     9     0     0     0     0
The system is delivering an average of 6536 NFS operations per second. The cache hit rate hovers around 16-17%. As you can see, the working set does not fit in primary cache. This makes sense: the 3170 has 16GB of primary cache and we are randomly reading from a 100GB file, so we would expect roughly a 16% cache hit rate (16GB cache / 100GB working set), and we are very close. The disks are running at 100% utilization and are clearly the bottleneck in this scenario. The spindles are delivering as many operations as they are capable of. So what happens if we deduplicate this data? First, we need to activate deduplication, a_sis in NetApp vocabulary, on the test volume and deduplicate the test data. (Before deduplication became the official buzzword, NetApp referred to their technology as Advanced Single Instance Storage.)
fas3170-a> sis on /vol/test_vol
SIS for "/vol/test_vol" is enabled.
Already existing data could be processed by running "sis start -s /vol/test_vol".
fas3170-a> sis start -s /vol/test_vol
The file system will be scanned to process existing data in /vol/test_vol.
This operation may initialize related existing metafiles.
Are you sure you want to proceed (y/n)? y
The SIS operation for "/vol/test_vol" is started.
fas3170-a> sis status
Path                           State      Status     Progress
/vol/test_vol                  Enabled    Initializing Initializing for 00:00:04
fas3170-a> df -s
Filesystem                used      saved       %saved
/vol/test_vol/         2277560  279778352          99%
fas3170-a>
There are a few other files on the test volume that contain random data, but the physical volume size has been reduced by over 99%. This means our 100GB file now occupies less than 1GB on disk. So, let's do some reads from the same file and see what has changed.

Random 4KB reads from a 100GB file – post-deduplication:

 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
 93% 96766     0     0   96773 17674 409570    466     11     0     0    35s  53%   0%  -    6%      0     7     0     0     0     0
 93% 97949     0     0   97958 17821 413990    578    764     0     0    35s  53%   8%  T    7%      0     9     0     0     0     0
 93% 99199     0     0   99206 18071 419544    280      6     0     0    34s  53%   0%  -    4%      0     7     0     0     0     0
 93% 98587     0     0   98594 17941 416948    565    445     0     0    36s  53%   6%  T    6%      0     7     0     0     0     0
 93% 98063     0     0   98072 17924 414712    398     11     0     0    35s  53%   0%  -    5%      0     9     0     0     0     0
 93% 96568     0     0   96575 17590 408539    755    502     0     0    35s  53%   8%  T    7%      0     7     0     0     0     0
There has been a noticeable increase in NFS operations. The system has gone from delivering 6536 NFS ops to delivering 96,850 NFS ops. That is nearly a fifteen-fold increase in delivered operations. The CPU utilization has gone up roughly 4.9x. The disk reads have dropped to almost 0 and the system is serving out over 400MB/s. This is a clear indication that the operations are being serviced from cache instead of from disk. It is also worth noting that the average latency, as measured from the host, has dropped by over 80%. The improvement in latency is not surprising given that the requests are no longer being serviced from disk.

The cache age has dropped down to 35 seconds. Cache age is the average age of the blocks that are being evicted from cache to make space for new blocks. The test had been running for over an hour when this data was captured, so this is not due to the load ramping. This suggests that even though we are accessing a small number of disk blocks, the system is still evicting blocks from cache. I suspect this is because the system is not truly deduplicating cache. Instead, it appears that each logical file block takes up space in cache even though many of them refer to the same physical disk block. One potential explanation is that NetApp eliminates the disk read by reading the duplicate block from cache instead of from disk. I am not sure how to validate this through the available system stats, but I believe it explains the behavior: it explains why the NFS ops have gone up, the disk ops have gone down, and the cache age has dropped to 35 seconds. While it would be preferable to store only a single copy of the logical block in cache, this is better than reading all of the blocks from disk.

The cache hit percentage is a bit of a puzzle here. It is stable at 53% and I am not sure how to explain that. The system is delivering more than 53% of the read operations from cache; the very small number of disk reads shows that. Maybe someone from NetApp will chime in and give us some details on how that number is derived.

This testing was done on Data ONTAP 7.3.1 (or more specifically 7.3.1.1L1P1). I tried to replicate the results on versions of Data ONTAP prior to 7.3.1 without success. In older versions, the performance of the deduplicated volume is very similar to the original volume. It appears that reads for logically different blocks that point to the same physical block go to disk prior to 7.3.1.

Check back shortly, as I am currently working on a deduplication performance test for VMware guests. It is a simple test to show the storage performance impact of booting many guests simultaneously. The plan is to use a handful of fast servers to boot a couple hundred guests. The boot time will be compared across a volume that is not deduplicated and one that has been deduplicated. I am also working on one additional test that may provide a bit of performance acceleration.

These results are from a NetApp FAS3170. Please be careful trying to map these results to your filer, as I am using very old disk drives and that skews the numbers a bit. The slow drives make the performance of the non-deduplicated data set slower than it would be with new disk drives.


Friday
Feb 27, 2009

Do I need more cache in my NetApp?

How many times have you wondered whether you could improve the performance of your storage array by adding additional cache? More cache is what the vendors so often recommend, but they offer no objective information to explain why it is going to help. Depending on the workload, increasing the cache may have little or no effect on performance.

There are two ways to know whether your environment will benefit from additional cache. The first is to understand every nuance of your application. Most storage managers I speak with classify this as impractical at best and impossible at worst. Even if you have an application with a very well understood workload, most storage devices are not hosting a single application. Instead, they are hosting many different applications. It is even more complex to understand how this combined workload will be affected by adding cache. The second way to measure cache benefit is to put the cache in and see what happens. This is the most common approach I see in the field. When performance becomes unacceptable, the options of adding additional disk and/or cache are weighed and a purchase is made. (I will save the topic of adding spindles to increase performance for a future post.) Both of these options force a purchase to be made with no guarantee it will solve the problem.

NetApp has introduced a tool to provide a third option: Predictive Cache Statistics. It provides the objective data needed to rationalize a hardware purchase. Predictive Cache Statistics (PCS) is available on systems running Data ONTAP 7.3+ with at least 2GB of memory. When it is enabled, PCS reports what the cache hit ratio would be if the system had 2x (ec0), 4x (ec1), and 8x (ec2) the current cache footprint. (ec0, ec1, and ec2 are the names of the extended caches when the stats are presented by the NetApp system.) Now, let's drill down into exactly how Predictive Cache Statistics works...

In most conditions there is no significant impact to system performance. I monitored the change in latency on my test system with PCS enabled and disabled and there was not a measurable difference. The storage controller was running at about 25% CPU utilization at the time with a 40% cache hit rate. NetApp warns in their docs that performance can be affected when the storage controller is at 80% CPU utilization or higher. It is understandable given the amount of information the array has to track in order to provide the cache statistics. This simply means some thought needs to be put into when it is enabled and how long it is run in production.

Here are the steps required to gather the information:

1) Enable Predictive Cache Statistics (PCS)

options flexscale.enable pcs
2) It is important to allow the workload to run until the virtual caches have time to warm up. In a system with a large amount of cache, this can be hours or even days.  Monitor array performance while the storage workload runs. If latency increases to unacceptable levels, you can disable PCS.
options flexscale.enable off
3) The NetApp perfstat tool can be used to capture and analyze the data that is gathered. I prefer instant gratification, so for this example I will use the real-time stats command.
stats show -p flexscale-pcs
The way the results are reported can be a little confusing the first time you look at it. The ec0, ec1, and ec2 'virtual caches' are relative to the base cache in the system being tested (2x, 4x, and 8x). If the test system has 16GB of primary cache, ec0 will represent 32GB of 'virtual cache' (2x 16GB). ec1 brings the 'virtual cache' to a total of 4x base cache or an additional 32GB beyond ec0. ec2 brings the total to 8x base cache or an additional 64GB beyond ec0 + ec1. The statistics on each line represent the values for that specific cache segment. Hopefully that explanation clears up more confusion than it introduces. Here are a couple examples. This testing was completed on a NetApp FAS3170. The 3170 platform has 16GB of cache standard. So, in these examples, ec0 is 32GB, ec1 is 32GB, and ec2 is 64GB.
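One more detail that helps when reading the output below: the Blocks column appears to be counted in 4KB WAFL blocks (my interpretation, not an official statement of the units), which lines up with the 2x/4x/8x sizing described above:

echo $(( 8388608 * 4 / 1024 / 1024 ))    # ec0: 32 (GB)
echo $(( 8388608 * 4 / 1024 / 1024 ))    # ec1: another 32 (GB)
echo $(( 16777216 * 4 / 1024 / 1024 ))   # ec2: another 64 (GB)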

Example 1: 8GB working set, 4KB IO, and 100% random reads

fas3170-a> sysstat -x 5
 CPU   NFS  CIFS  HTTP   Total    Net  kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in    out   read  write  read write   age   hit time  ty util                 in   out    in   out
 39% 39137     0     0   39137  7102 165539    206    370     0     0   >60  100%   3%  T    2%      0     0     0     0     0     0
 39% 39882     0     0   39882  7236 168677    136      6     0     0   >60  100%   0%  -    1%      0     0     0     0     0     0
 39% 39098     0     0   39098  7094 165338    186    285     0     0   >60  100%   3%  T    2%      0     0     0     0     0     0

fas3170-a> stats show -p flexscale-pcs
Instance    Blocks Usage   Hit  Miss Hit Evict Invalidate Insert
                       %    /s    /s   %    /s         /s     /s
     ec0   8388608     0     0     0   0     0          0      0
     ec1   8388608     0     0     0   0     0          0      0
     ec2  16777216     0     0     0   0     0          0      0
---
     ec0   8388608     0     0     0   0     0          0      0
     ec1   8388608     0     0     0   0     0          0      0
     ec2  16777216     0     0     0   0     0          0      0
---
     ec0   8388608     0     0     0   0     0          0      0
     ec1   8388608     0     0     0   0     0          0      0
     ec2  16777216     0     0     0   0     0          0      0
The sysstat shows a cache hit rate of 100%. This is exactly what we would expect for an 8GB dataset on a system with 16GB of cache. The stats command shows that PCS is currently reporting no activity. Again, this is exactly what we should expect with a working set that fits completely in main cache.

Example 2: 30GB working set, 4KB IO, and 100% random reads

fas3170-a> sysstat -x 5
 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
 27% 11607     0     0   11607  2173 49352  27850      6     0     0     3   41%   0%  -   99%      0     0     0     0     0     0
 27% 11642     0     0   11642  2180 49518  28097    279     0     0     3   41%  21%  T   99%      0     0     0     0     0     0
 26% 11413     0     0   11413  2138 48511  27773     11     0     0     3   41%   0%  -   99%      0     0     0     0     0     0

fas3170-a> stats show -p flexscale-pcs
Instance    Blocks Usage   Hit  Miss Hit Evict Invalidate Insert
                       %    /s    /s   %    /s         /s     /s
     ec0   8388608     1    38  8560   0     0          0  14811
     ec1   8388608     0     0  8560   0     0          0      0
     ec2  16777216     0     0  8560   0     0          0      0
---
     ec0   8388608     1    65  6985   0     0          0      0
     ec1   8388608     0     0  6985   0     0          0      0
     ec2  16777216     0     0  6985   0     0          0      0
---
     ec0   8388608     1   100  6922   1     0          0  11899
     ec1   8388608     0     0  6922   0     0          0      0
     ec2  16777216     0     0  6922   0     0          0      0
This data was gathered after the 30GB workload had been running for a few minutes, but just after I enabled predictive cache statistics. The PCS data shows that there are very few hits, but there are a significant number of inserts. This is what we should expect when PCS is first enabled. The sysstat output shows a cache hit rate of 41%.
fas3170-a> sysstat -x 5
 CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
 27% 11238     0     0   11238  2105 47784  27862    286     0     0     4   40%  18%  T   99%      0     0     0     0     0     0
 26% 11371     0     0   11371  2130 48349  27934     11     0     0     4   40%   0%  -   99%      0     0     0     0     0     0
 27% 11184     0     0   11184  2096 47554  27938    275     0     0     4   40%  33%  T   99%      0     0     0     0     0     0

fas3170-a> stats show -p flexscale-pcs
Instance    Blocks Usage   Hit  Miss Hit Evict Invalidate Insert
                       %    /s    /s   %    /s         /s     /s
     ec0   8388608    87  6536   456  93   933          0    934
     ec1   8388608     6   453     3  99     0        934    933
     ec2  16777216     0     0     3   0     0          0      0
---
     ec0   8388608    87  6512   435  93     0          0      0
     ec1   8388608     6   435     0 100     0          0      0
     ec2  16777216     0     0     0   0     0          0      0
---
     ec0   8388608    87  6472   450  93   963          0    964
     ec1   8388608     6   445     5  98     0        964    963
     ec2  16777216     0     0     5   0     0          0      0
Now that the ec0 virtual cache has warmed up, the potential value of additional cache becomes more apparent. The hit rate has gone up to 93% and it is servicing over 6500 operations per second. With 32GB of additional cache, 6500+ disk reads per second would be avoided and the latency would be dramatically reduced. These cache hits are virtual, so currently those 'hits' are still causing disk reads. Clearly, the additional cache would provide a major performance boost, but unfortunately, it is impossible to determine exactly how it will affect overall system performance. The current bottleneck, reads from disk, would be alleviated, but that simply means we will find the next one.

Additional cache can be added to most NetApp systems in the form of a Performance Accelerator Module (PAM). The PAM is a PCI Express card with 16GB of DRAM on it. It plugs directly into one of the PCI Express slots in the filer. I suspect there is a slight increase in latency when accessing data in the PAM compared to the main system cache, although this increase is likely so small that it will not be noticed on the client side, as it is a very small portion of the total transaction time from the client perspective. Unfortunately, I do not have first hand performance data that I can share, as I have not been able to get access to a PAM for complete lab testing.

It is important to note that a system with 16GB of primary cache and 32GB of PAM cache is not the same as a system with 48GB of primary cache. The PAM cache is populated as items are evicted from primary cache. If there is a hit in the PAM, that block is copied back into primary cache. This type of cache is commonly referred to as a victim cache or an L2 cache. If the goal is to serve a working set without ever going to disk, then that working set needs to fit into the extended cache, not the primary cache plus extended cache.

Predictive Cache Statistics is a great feature. It gives us the power to answer a question we could only guess at in the past. However, like most end users, I always want more. There are a couple of things that I would love to see in the future. First, the PAM cards are 16GB in size. It would be great if the extended cache segments reported by PCS could be in 16GB increments. That would make it even easier to determine the value of each card I add. It would also remove all the confusion around how big ec0, ec1, and ec2 are. The ability to reset the PCS counters back to zero would also be helpful. When testing different workloads, this would allow the stats to be associated with each individual workload.

It is worth noting that this was not a performance test and the data above should be treated as such. Nothing was done to either the client or the filer to optimize NFS performance. In an attempt to prevent these numbers from being used to judge system performance, I am intentionally omitting the details of how the disk was configured.


Tuesday
Jan 27, 2009

WAN optimization for array replication

As the need for disaster recovery continues to move downmarket from the enterprise to medium and small businesses, the number of IT shops replicating their data to an offsite location is increasing. Array-based replication was once a feature reserved for the big budgets of the Fortune 1000. Today, it is available on most midrange storage devices (and even some of the entry level products). This increase in replication deployments has created a new challenge for IT. The most common replication solutions move the data over the IP network, and that data puts a significant load on the network infrastructure. The LAN is almost always up to the task, but the WAN is often not able to handle this new burden. While the prices of network infrastructure have come down over the years, big pipes are still an expensive monthly outlay. So, how do we get that data offsite without driving up those WAN costs?

WAN optimization technology provides a potential solution. Not every workload or protocol can benefit from today's WAN optimization technology, but replication is one that usually gets a big boost. I gathered some data from a client who is using NetApp SnapMirror to replicate to a remote datacenter and who deployed WAN optimization to prevent a major WAN upgrade. The NetApp filer is serving iSCSI, Fibre Channel, and CIFS. The clients are primarily Windows and they run Exchange and MS SQL along with some home grown applications. All of their data is stored on the NetApp storage.

The chart below shows the impact the WAN optimization device had. For the purposes of this discussion, think of the device as having one unoptimized LAN port and an optimized WAN port. The LAN traffic is represented by the red and the WAN by the blue. With no optimization, the traffic would be the same on both sides. The chart shows a dramatic reduction in the amount of data being pushed over the WAN.

[Chart: Network Throughput]

This data was gathered over a 2 week period. The total data reduction over the WAN was 83% for the period shown in the chart, with a peak of 93% for one window. Again, this is not what every environment will see, so test before you deploy. In this case, the system paid for itself in less than 12 months with the savings in WAN costs. That is the kind of ROI that works for almost anyone. I am intentionally not addressing which WAN optimization technology was used in this solution. The last time we tested these devices in our lab, we brought in a half dozen and they all had their pros and cons. That is another topic for another post.


Monday
Jan 05, 2009

Benchmarking and 'real FC'

Sometimes I think the only people who read technology blogs are people who write other technology blogs. I have no way to figure out if this is true or not, but it is an interesting topic to ponder. Do IT end users actually read technology blogs? If they are reading, they do not seem to comment very frequently. Much more often the comments come from other bloggers or competing vendors. That said, I am going to talk about an issue that some of the storage bloggers seem to be caught up in at the moment: the issue of 'emulated FC' vs 'real FC.'

Let me start off by sharing a few recent posts from other blogs. Chuck Hollis at EMC writes about the EMC/Dell relationship and takes the opportunity to compare EMC to NetApp. In this case, he is comparing the EMC NX4 to the NetApp FAS2020. The comment in the post that certainly aggravated NetApp is that EMC does "real deal FC that isn't emulated." The obvious implications are that EMC FC is not emulated, NetApp FC is emulated, and FC emulation is bad. (This is not a new debate between EMC and NetApp. Look back through the blogs at both companies and you will find plenty of back and forth on the topic.) Kostadis Roussos at NetApp has a post explaining why he, not surprisingly, completely disagrees with Chuck. Stephen Foskett, a storage consultant, posts what I think is an excellent overview of the issues. He cuts through the marketing spin and asks the right questions. His coverage of the topic is so complete that I almost decided not to write about it, and I will try not to retrace all the issues he covered. I will hit a couple of his high-level points in case you have not had a chance to read his post (I highly recommend it though; it is very good). In summary:

  • All enterprise storage arrays “emulate” Fibre Channel drives to one extent or another
  • NetApp is emulating Fibre Channel drives
  • All modern storage arrays emulate SCSI drives
  • Using the wrong tool for the job will always lead to trouble
  • Which is more important to you, integration, performance, or features?
So, why am I writing about it? I am writing about it because Chuck posted a very good blog entry about benchmarking a few days later that, to me, contradicts the importance he gave to 'real FC' on 12/9. I have never met Chuck or Stephen, but they both seem to be very technically adept from their postings. Without trying to put words in his mouth (text on his blog?), the overall theme of Chuck's post is to make sure you use meaningful tests if you want meaningful results from a storage product benchmark. He is absolutely correct. I could not agree more. How many times have we seen benchmarks performed that were completely irrelevant to the workload the array would see in production?

My question is: if performance testing with real-world applications produces acceptable results, then who cares what is 'real' and what is 'emulated'? The average driver does not worry about how the computer in her car is controlling the variable valve timing. She worries about whether it reliably gets her to work on time. VMware is selling plenty of virtualization technology that presents devices that are not 'real.' I know it is not storage, but why is that any different?

Less and less is 'real' in storage these days. It is impossible to continue to drive innovation in storage array technology if we are bound by the old ideas of how we configure and manage our storage. With the introduction of technologies that leverage thin provisioning, dependent pointer-based copies, compression, and deduplication, we need to rethink concepts as fundamental as RAID groups, block placement, and LUN configuration. Or, in my opinion, we need to stop thinking about those things. Controlling the location of the bits is not what matters. Features and performance are what matter. Results in the real world matter. Look at the systems available and decide what blend of features fits your organization and workload best.

Full disclosure: My company provides storage consulting on all of the platforms discussed above. We sell NetApp, Sun, and HDS products. We do not sell EMC products.
