Entries in Solid State Disk (3)

Thursday
Apr 15, 2010

Oracle/Sun F20 Flash Card - How fast is it?

I received several questions about the performance of the Oracle/Sun F20 flash card I used in my previous post about block alignment, so I put together a quick overview of the card's performance capabilities. The following results are from testing the card in a dual-socket 2.93GHz Nehalem (X5570) system running Solaris x64, similar to the server platform Oracle uses in Exadata V2. The F20 card is a SAS controller with 4 x 24GB flash modules attached to it. You can find more information on the flash modules on Adam Leventhal's blog, and the official Oracle product page has the F20 details. All of my tests used 100% random 4KB blocks. I focused on random operations because in most cases it is not cost effective to use SSD for sequential operations. The tests were run with a range of thread counts to give an idea of how the card scales with multiple threads. The first test compared the performance of a single 24GB flash module to the performance of all 4 modules.
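To make the methodology concrete, here is a minimal sketch of the kind of multi-threaded 4KB random-read test described above. It is only an illustration, not the actual test harness: the device path, thread count, and run time are placeholders, and a real test would use a purpose-built load generator such as vdbench or fio with direct I/O to bypass the filesystem cache.

    # Minimal sketch of a multi-threaded 4KB random-read test.
    # DEVICE is a hypothetical raw device path; adjust everything to taste.
    import os
    import random
    import threading
    import time

    DEVICE = "/dev/rdsk/c1t0d0s0"   # placeholder raw device path
    BLOCK_SIZE = 4096               # 4KB random operations
    RUN_SECONDS = 30
    THREADS = 16
    DEVICE_SIZE = 24 * 1024**3      # assume one 24GB flash module

    def worker(results, idx):
        fd = os.open(DEVICE, os.O_RDONLY)
        ops = 0
        blocks = DEVICE_SIZE // BLOCK_SIZE
        deadline = time.time() + RUN_SECONDS
        try:
            while time.time() < deadline:
                # Aligned offset: always a multiple of the 4KB block size.
                offset = random.randrange(blocks) * BLOCK_SIZE
                os.pread(fd, BLOCK_SIZE, offset)
                ops += 1
        finally:
            os.close(fd)
        results[idx] = ops

    def main():
        results = [0] * THREADS
        threads = [threading.Thread(target=worker, args=(results, i))
                   for i in range(THREADS)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        total = sum(results)
        print("%d ops in %ds = %.0f ops/sec" % (total, RUN_SECONDS,
                                                total / RUN_SECONDS))

    if __name__ == "__main__":
        main()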

[Chart: 4KB random operations per second, single flash module vs. all 4 modules]

At lower thread counts, the 4-module test delivers roughly 4x the operations per second of the single-module test. As the thread count rises, the single module tops out at 35,411 ops/sec while the 4 modules deliver 97,850 ops/sec, or 2.76x the single-module result. It would be great if the card were able to drive all 4 modules at full speed, but 97K+ ops/sec is not too shabby. What is more impressive to me is that those 97K+ ops are delivered at roughly 1ms of latency. The next round of testing covered three workloads, all run against all 4 flash modules with 4KB random operations: 100% read, 80% read, and 100% write. Here are the operations per second results.

[Chart: 4KB random operations per second]

And here is a throughput version in MB/s for anyone who is interested.

[Chart: 4KB random throughput in MB/s]

Flash and solid state disk (SSD) technologies are developing at an incredibly fast pace. They are a great answer, but I think we are still figuring out what the question is. At some point down the line they may replace spinning disk drives, but I do not think that is going to happen in the short term. Some applications can take advantage of this low-latency capacity inside the server today, but they are still the minority.

Where flash and SSD make more sense to me is as a large cache. NetApp and Sun are using flash this way today in their storage array product lines. DRAM is very expensive, but flash can provide a very large and very low latency cache. I expect we will see more vendors adopting this "flash for cache" approach moving forward. The economics just make sense. Disks are too slow and DRAM is too expensive.

It would also be great to see operating systems that are intelligent enough to use technologies like the F20 card and the Fusion-io card as extended filesystem read cache. Solaris can do it for ZFS filesystems using the L2ARC. As far as I know, no filesystems in the other major operating systems have this feature. What about using it as a client-side NFS cache? At one point Solaris offered CacheFS for NFS caching, but I do not believe it is still being actively developed. While CacheFS had its challenges, I believe the idea was a very good one. It costs a lot more to buy a storage array capable of delivering 97K ops than it does to put more cache into the server.


Friday
Mar 26, 2010

Block alignment is critical

Block alignment is an important topic that is often overlooked in storage. I read a blog entry by Robin Harris a couple of months back about the importance of block alignment with the new 4KB-sector drives. I was curious to test the theory on one of the new 4KB drives, but I did not have one on hand. That got me thinking about solid state disk (SSD) devices: if filesystem misalignment hurts traditional spinning disk performance, how would it impact SSD performance? In short, it is ugly. Here is a chart showing the difference between aligned and misaligned random read operations to a Sun F20 card. I guess it is officially an Oracle F20 card.

[Chart: Oracle F20 - Aligned vs. Misaligned 4KB random reads]

With only a couple of threads, the flash module can deliver about 50% more random 4KB read operations when aligned. As the thread count increases, the module is able to deliver over 9x the number of operations if properly aligned. It is worth noting that the card delivers those aligned reads at less than 1ms while the misaligned operations average over 7ms of latency. 9x the operations at 85% less latency makes this an issue worth paying attention to. (My test was done on Solaris, and here is an article about how to solve the block alignment issue for Solaris x64 volumes.)

I have seen a significant increase in block alignment issues with clients recently. Some arrays and some operating systems make it easier to align filesystems than others, but a new variable has crept in over the last few years. VMware on block devices means that VMFS adds another layer of abstraction to the process. Now it is important to be sure the virtual machine filesystems are aligned in addition to the root operating system/hypervisor filesystem.

Server virtualization has been the catalyst for many IT organizations to centralize more of their storage. Unfortunately, centralized storage does not come at the same $/GB as the mirrored drives in the server; it is much more expensive. Block misalignment can make the new storage even more expensive by making it less efficient. Misaligned filesystems make the array cache far less efficient, and when that misaligned data is read from or written to disk, the drives are forced to do additional operations that would not be required for an aligned operation. It can quickly turn a fast storage array into a very average system. Most of the storage manufacturers can provide you with a best practices doc to help you avoid these issues. Ask them for a whitepaper about block alignment issues with virtual machines.
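The arithmetic behind the penalty is simple enough to sketch. Assuming 4KB flash pages and 4KB logical reads (the sizes used in the test above), a read that starts on a page boundary touches exactly one page, while a read that starts at an offset like 512 bytes straddles two. The 512-byte offset below is just an illustrative example of a misaligned partition start, not a measurement from my test.

    # Rough sketch: count how many underlying 4KB pages a logical 4KB read
    # touches at a given starting offset. A misaligned read straddles two
    # pages, doubling the work the device must do for every operation.
    PAGE_SIZE = 4096      # underlying flash page / 4KB sector size
    IO_SIZE = 4096        # logical I/O size used in the tests

    def pages_touched(offset, io_size=IO_SIZE, page_size=PAGE_SIZE):
        first_page = offset // page_size
        last_page = (offset + io_size - 1) // page_size
        return last_page - first_page + 1

    # Aligned reads: one page per 4KB operation.
    print(pages_touched(0))                  # 1
    print(pages_touched(16 * PAGE_SIZE))     # 1

    # Misaligned by 512 bytes (e.g. a legacy partition offset):
    # every 4KB read now spans two pages.
    print(pages_touched(512))                     # 2
    print(pages_touched(512 + 16 * PAGE_SIZE))    # 2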


Thursday
Jun 11, 2009

Deduplication - Sometimes it's about performance

In a previous post I discussed the topic of deduplication for capacity optimization. Removing redundant data blocks on disk is the first, and most obvious, phase of deduplication in the marketplace. It helps to drive down the most visible cost: the cost per GB of disk capacity. This market has grown quickly over the last few years, and both startups and established storage vendors have products that compete in the space. They are most commonly marketed as virtual tape library (VTL) or disk-to-disk backup solutions. Does that mean that deduplication is a point solution for highly sequential workloads? No. There is another, somewhat less obvious benefit of deduplication.

What storage administrator does not ask for more cache in the storage array? If I can afford 8GB, I want 16GB. If the system supports 16GB, I want 32GB. Whether it is for financial or technical reasons, cache is always limited. So what about deduplicating the data in cache? When the workload is streaming sequential backup data from disk, this may not be very helpful. However, in a primary storage system with a more varied workload, it becomes very interesting. The cost per GB of cache (DRAM) is several orders of magnitude higher than the cost of hard drives. If the goal is to reduce storage capital expenses by making the storage array more efficient, then let's focus on the most expensive component: the cache. If the physical data footprint on disk is reduced, then it is logical that the array cache should also benefit from that space savings. If the same deduplicated physical block is accessed multiple times through different logical blocks, then it should result in one read from disk and many reads from cache.

Storing VMware, Hyper-V, or Xen virtual machine images creates a tremendous amount of duplicate data. It is not uncommon to see storage arrays that are storing tens or even hundreds of these virtual images. If 10 or 20 or all of those virtual servers need to boot at the same time, it places an extreme workload on the storage array. If all of those requests have to hit disk, the disk drives will be overwhelmed. If each duplicate logical block requires its own space in cache, the cache will be blown out. But if each duplicate block is read off disk once and stored in cache once, the clients will boot quickly and the cache will be preserved. Preserving the current contents of the cache ensures the performance of the rest of the applications in the environment is not impacted.

This virtual machine example does make a couple of assumptions, the most significant being that the storage controllers can keep up with the workload. Deduplication on disk allows for fewer disk drives. Deduplication in cache expands the logical cache size and makes the disk drives more efficient. Neither of these does anything to reduce the performance demand on the CPU, fibre channel, network, or bus infrastructure in the controllers; in fact, they likely place more demand on those controller resources. If the controller is spending less time waiting for disk drives, it needs the horsepower to deliver higher IOPS from cache with less latency.

The same deduplication theory applies to flash technology as well. Whether the flash devices are being used to expand cache or to provide another storage tier, they should be deduplicated. Flash devices are more expensive than disk drives, so let's get the capacity utilization as high as possible.
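As a rough illustration of the idea (not any vendor's implementation), here is a sketch of a read cache keyed by content fingerprint instead of logical block address, so duplicate blocks across many VM images consume one cache slot and one disk read. The fingerprints, block counts, and 90% duplication rate are made-up numbers for the example.

    # Sketch of a deduplication-aware read cache. dedup_map stands in for the
    # array's deduplication metadata (logical block address -> fingerprint of
    # the physical block that backs it); backend_read fetches a physical block
    # by fingerprint. Duplicate logical blocks share a single cache entry.
    from collections import OrderedDict

    class DedupReadCache:
        def __init__(self, dedup_map, backend_read, capacity_blocks):
            self.dedup_map = dedup_map            # lba -> content fingerprint
            self.backend_read = backend_read      # fingerprint -> block data (disk)
            self.capacity = capacity_blocks
            self.cache = OrderedDict()            # fingerprint -> block data (LRU)
            self.disk_reads = 0
            self.cache_hits = 0

        def read(self, lba):
            fp = self.dedup_map[lba]
            if fp in self.cache:
                self.cache_hits += 1
                self.cache.move_to_end(fp)        # refresh LRU position
                return self.cache[fp]
            data = self.backend_read(fp)          # one disk read per unique block
            self.disk_reads += 1
            self.cache[fp] = data
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)    # evict least recently used entry
            return data

    # 20 "VM images" of 100 blocks each: 90 blocks are identical across images,
    # 10 are unique per image. A boot storm reads every block of every image.
    dedup_map = {}
    for vm in range(20):
        for blk in range(100):
            lba = vm * 100 + blk
            dedup_map[lba] = "shared-%d" % blk if blk < 90 else "unique-%d" % lba

    cache = DedupReadCache(dedup_map,
                           backend_read=lambda fp: fp.encode().ljust(4096, b"\0"),
                           capacity_blocks=512)
    for lba in sorted(dedup_map):
        cache.read(lba)
    print("disk reads:", cache.disk_reads)   # 290 unique blocks hit disk
    print("cache hits:", cache.cache_hits)   # the other 1710 reads come from cache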
As the mainstream storage vendors start to bring deduplication to market for primary storage, it will be interesting to see how they deal with these challenges. Deduplication for primary storage presents an entirely different set of requirements than backup to disk.
