To our dismay, we lost some data again in June (only two months after we had converted to RAID 0+1). In other words, the filesystem corruption problem was not limited to RAID5.
We were able to reproduce the problem on a test array. Sun had been analyzing crash dumps and spelunking around in the corrupt filesystem for a couple of weeks when we needed to use the array for a production machine. Quite by accident -- and just before putting 14 GB of production data on the array -- I realized that two disk media records (logical disks) were referencing the same disk access record (physical disk), and volume manager was getting confused.
I was certain that I had not configured it this way (and it should be impossible to do so). I then found out that two physical disks had the same disk ID. We don't know how the labels came to be the same (volume manager's fault or ours). But since volume manager never told us about the duplicate disk IDs, this dangerous condition lingered for who knows how long before we detected it. The failure of volume manager to report the duplicate disk IDs has been logged as Sun RFE/BugID 1262894 (SunSolve customers only).
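Since volume manager did not warn us, a check of this kind has to be done by hand. The following is only a minimal sketch of such a check, not anything volume manager provides: it assumes you have already collected (device, disk ID) pairs by some means, and the function name, device names, and disk ID strings are all hypothetical placeholders.

```python
from collections import defaultdict


def find_duplicate_disk_ids(records):
    """Return {disk_id: [devices]} for every disk ID claimed by more than one device.

    `records` is an iterable of (device, disk_id) pairs. How those pairs are
    gathered is left open; this sketch only does the comparison.
    """
    devices_by_id = defaultdict(list)
    for device, disk_id in records:
        devices_by_id[disk_id].append(device)
    # Keep only the IDs that appear on two or more physical devices.
    return {
        disk_id: sorted(devices)
        for disk_id, devices in devices_by_id.items()
        if len(devices) > 1
    }


if __name__ == "__main__":
    # Hypothetical sample data: two devices carry the same ID.
    sample = [
        ("disk-a", "id-0001"),
        ("disk-b", "id-0002"),
        ("disk-c", "id-0001"),
    ]
    for disk_id, devices in find_duplicate_disk_ids(sample).items():
        print(f"WARNING: disk ID {disk_id} appears on {', '.join(devices)}")
```

Run periodically, something like this would have flagged the condition as soon as the second label appeared, rather than leaving it to be found by accident.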
Though we're not positive that the duplicate disk IDs caused all of our corruption problems, they might explain how we experienced similar corruption under both RAID5 and RAID 0+1. We're crossing our fingers.
Last updated: Thu Aug 22 18:23:05 1996