Date: Tue, 2 Apr 96 12:21:04 EST
To: sun-managers@ra.mcs.anl.gov
From: Margarita Suarez
Subject: SUMMARY: SPARCstorage Array Survey

I received 23 responses to my survey, and a couple other opinions (even though they didn't answer the survey questions). I'm not a social scientist, so my survey wasn't exactly designed for easy summarizing. I'll give you an idea of the overall feeling for the SSA's here (which was what I sought when I sent out the survey). Also, there are a few important problems (the NVRAM bug was already posted here) which SSA managers should know about.

Please visit http://www.columbia.edu/~marg/misc/ssa/ if you want to see the original texts of the responses, as well as my summary, other important information, and more detailed discussion regarding all of the following issues.

Hardware

Many disks have died, but those losing the most disks still feel the deaths are within the MTBF. Some people had problems with flaky controllers similar to ours, but it seems as though once you get a good batch of hardware things work pretty well.

NVRAM

If you are running RAID5, TURN FAST_WRITES OFF until you get new firmware from Sun on April 15. Many, many respondents were running with NVRAM enabled apparently without problem, but the only way to ensure that data corruption will not happen is to turn it off.

RAID5

There is a general consensus that the VxVM RAID5 implementation may have a bug causing data corruption. Those running VxVM RAID0+1 had not lost data.

Hot Spares

In short, the failure of VxVM to automatically employ hot spares is now a well-known bug. It will be fixed in the next release of VxVM, due out in the summer.

Performance

Half of the people were satisfied with the performance (speed) of the SSA. A quarter were very satisfied with the performance. Another quarter couldn't really say (hadn't tested, or SSA's were too new).

Overall Satisfaction

More than half said they were satisfied, would keep their arrays, and might buy more.
Nine said they weren't satisfied and probably wouldn't buy more.

My Personal Opinion

We are continuing to evaluate the most current technologies for NFS file service and Web service, as well as for grander projects like our Digital Library Project (hierarchical storage management, archiving, etc.). We were doing these evaluations even before we had trouble with our arrays, but we've been researching more vigorously lately :-)

In the meantime, we will have to keep these arrays and make them work the best we can. One result of conducting this survey is that I've convinced myself that striping and mirroring should not present the data corruption problems that we saw when we were running RAID5 (I hope I'm right!). They really are nifty machines (physically), and using vxmake and configuration makefiles, they're pretty easy to configure and maintain.

This is all I'll summarize here. Again, please visit my web page or drop me a line if you want more details. I'll be updating it periodically as I learn more about what was going wrong with our arrays.

It occurred to me that a mailing list for SSA managers might be useful for the discussion of not only reliability issues but also configuration, performance tuning, etc. For example, we have a "clever" idea of how to perform nearly online backups using dual-ported hosts and arrays, extra "slush" disks, and a third backup host. Please let me know if you are interested in being on (or running) such a list.

By the way, my survey "caused quite a stir" within Sun (according to my sales rep). Apparently Scott McNealy has read it! :-) Too bad for us they weren't so interested six months and six corrupted filesystems ago. At any rate, they're interested now. I hope they'll be able to fix these bugs because the SSA really is a cool product.

Thanks to everyone who responded!

Margarita Suarez
Columbia University
Academic Information Systems
UNIX Systems Group
marg@columbia.edu