Hello, I am experiencing SCSI failures and panics. I could not figure out the real source of the problem. So I would appreciate any help, thank you... Background: The server is a PIII 800 512M machine with Intel L440GX+ mainboard. It has "Adaptec(2)AIC-7896/7 Ultra2" SCSI controller (ahc) and Intel EtherExpress PRO/100 ethernet adapter (fxp) onboard. It contains 5 x 18GB Quantum AtlasV SCSI drives used in RAID5 configuration (raidframe) The server is used as the file (samba) web and e-mail server. It has OpenBSD 2.7 installed upgraded from cvs to late 2.7-current (early 2.8) just before October 28. [Yes, I know there are updates to ahc after that date (and they are about timeouts) I was planning to wait until the semester break to upgrade but ... please read on :-) ] The History: Being an idealist, I installed the server using raidframe and softupdates (at around september). There were no problems during initial testing. However, once I took the server on line, we started experiencing lock-ups. The system would freeze when large amounts of files are transferred. Disabling softupdates lowered the frequency (and performance) but we did have an occasional lock-up. I suspected UVM and I read somewhere (cant remember where) that using two SCSI disks for swap causes UVM swapping code to lock-up (why!). 512M RAM takes a long time to fill before swapping I guess. I removed one swap partition and I re-arranged the kernel configuration (increased some buffers, enabled memory profiling etc.) and encountered no lock-ups since (softupdates still disabled). This was early december. But now I am receiving panics! 8-| The Problem: It started with SCSI timeout errors from the kernel: and the system would freeze for a short time before the error. (The complete kernel logs are attached ) > Timedout SCB handled by another timeout I remember seeing those _very rarely_ earlier but one day, there was something like: > Timedout SCB handled by another timeout > Timedout SCB handled by another timeout > sda CCB 0x11 timedout in Command phase and perhaps a few other messages? and PANIC! I am ashamed to say that I panicked too and restarted the system before writing down more debug information. I do remember that the panic was in a raidframe function :-(. All drives came back online to my relief nothing was lost (I started taking backups :-) ) A few days later, I received another panic. This time I managed to write down the kernel messages and stack trace: > Timedout SCB handled by another timeout > Timedout SCB handled by another timeout > sd0 (ahc1:0:0): SCB 0x11 - timedout in Command phase, SEQADDR = 0x165 > sd0 (ahc1:0:0): BDR message in message buffer > ahc1:A:0: unknown scsi bus phase b6. Attempting to continue > ahc1:A:0: Missed busfree, Curphase = 0xb6 > sd0(ahc1:0:0): SCB 0x11 - timedout while idle, SEQADDR = 0xC > panic Disconnected List Inconsistency. SCB index = 255, yet numscbs=30. > Stopped at - Debugger+0x4: movl %ebp, %esp > _Debugger( ...) at _Debugger+0x4 > _panic(e010cdbc,ff,1e,e019e8a3,ff) at _panic+0x81 > _ahc_init(e0dd8e00,ffffffff,41,ffffffff,ff) at _ahc_init+0x30ea > _ahc_init(e0dd8e00,ffffffff,41,ffffffff,ff) at _ahc_init+0x2e60 > _ahc_init(e0dd8e00,41,1,e0231154,7fffffff) at _ahc_init+0x251a > _softclock(e0dd6500,e0f8aa00,e0f8aa00,ea1187ba,e0dd54e0) at _softclock+0x8e > _hardclock(f9e1ad70,40000000,0,1,e0f8aa0d) at _hardclock+0x41c > _clockintr(f1ead70) at _clockintr+0x12 > _Xrecurse0() ar _XRecurse+0x69 > --- interrupt --- > _idle(0,0,0,e01a32f7,e0f8aa00) at _idle+0x20 > bpend+sleep(e0397f5c,118,e01a3045,65,f9e1af80) at bpend+sleep > _sys_select(e0f8a400,f9e1af88,f9e1af80,50a7,0) at _sys_select+0x357 > _syscall(number 93) > 0x41b3e2b: The system restarted without problems, and I have managed to update the parity on raid. Here is my a question related to raidframe: What if I _really_ lose a disk AND receive a panic just after that without a chance to shutdown? I think I would lose all the data. The raid set would be in a dirty state and missing a drive! Any comments? Anyway, I installed an IDE drive (wd0), and added it to the raid set as a spare. Than, I manually failed sd0 to wd0 (since that drive seemed to have problems) and replaced the drives, with sd0 as the spare and wd0 part of the raid set. Note that I did NOT remove the sd0 drive (my root partition is still there) Just as I was preparing to contact the dealer that sold me the SCSI drives, and ask for a replacement I got this: > Timedout SCB handled by another timeout > Timedout SCB handled by another timeout > Timedout SCB handled by another timeout > Timedout SCB handled by another timeout > sd1(ahc1:2:0): SCB 0x1b - timed out in Command phase, SEQADDR == 0x165 > sd1(ahc1:2:0): BDR message in message buffer > sd1(ahc1:2:0): no longer in timeout, status = 0 > sd1(ahc1:2:0): Unexpected busfree in Message-out phase SEQADDR == 0x165 > sd1(ahc1:2:0): parity error detected in Data-in phase. SEQADDR(0x164) SCSIRATE(0x93) > 17 x sd1(ahc1:2:0): parity error detected in Data-in phase. SEQADDR(0x166) SCSIRATE(0x93) > ahc1:A:2: unknown scsi bus phase 0. Attempting to continue > ahc1:A:2: Target did not send an IDENTIFY message. LASTPHASE = 0x0, SAVED_TCL == 0x20 > sd1(ahc1:2:0): Sense Error Code 0 > raid0: IO Error. Marking /dev/sd1d as failed. > raid0: node (Rop) returned fail, rolling backward > ahc1: Issued Channel A Bus Reset. 6 SCBs aborted > ahc1: target 2 using 16bit transfers > ahc1: target 2 synchronous at 40.0MHz, offset = 0x1f > ahc1: target 2 using 16bit transfers > ahc1: target 2 synchronous at 40.0MHz, offset = 0x1f > sd1(ahc1:2:0): Check Condition on opcode a > SENSE KEY: Aborted Command > ASC/ASCQ: Overlapped Commands Attempted > ahc1: target 12 using 16bit transfers > ahc1: target 12 synchronous at 40.0MHz, offset = 0x1f > ahc1: target 4 using 16bit transfers > ahc1: target 4 synchronous at 40.0MHz, offset = 0x1f > ahc1: target 8 using 16bit transfers > ahc1: target 8 synchronous at 40.0MHz, offset = 0x1f > ahc1: target 8 using 16bit transfers > ahc1: target 8 synchronous at 40.0MHz, offset = 0x1f > raid0: DAG failure: w addr 0x2b87d80 (45645184) nblk 0x20 (32) buf 0xf43e1000 > ahc1: target 0 using 16bit transfers > ahc1: target 0 synchronous at 40.0MHz, offset = 0x1f > ahc1: target 0 using 16bit transfers > ahc1: target 0 synchronous at 40.0MHz, offset = 0x1f Apparently ahc failed, the raidframe did what it is supposed to do, and failed the disk. So I now have a raid set running on parity! I shutdown the system, and ran the Adaptec BIOS Utility to scan sd0 and sd1 (media test). The utility did not report any problems on any of the disks. (I do not know if it is supposed to) but I had no choice (having ran out of spare disks) but to restart the system, and reconstruct the failed sd1 back onto itself. (it was Jan 5th) I did one more thing and set the Adaptec host adapter SCSI ID to 15 (it was 7). a few hours ago, I encountered another parity error. But it also panicked the system: > sd1(ahc1:2:0): SCB 0x10 - timed out in Command phase, SEQADDR == 0x165 > sd1(ahc1:2:0): BDR message in message buffer > sd1(ahc1:2:0): no longer in timeout, status = 0 > sd1(ahc1:2:0): Unexpected busfree in Message-out phase SEQADDR == 0x165 > 18x sd1(ahc1:2:0): parity error detected in Data-in phase. SEQADDR(0x165) SCSIRATE(0x93) > ahc1:A:2: unknown scsi bus phase 0. Attempting to continue > ahc1:A:2: Target did not send an IDENTIFY message. LASTPHASE = 0x0, SAVED_TCL == 0x20 > raid0: IO Error. Marking /dev/sd1d as failed. > raid0: node (Rop) returned fail, rolling backward > ahc1: Issued Channel A Bus Reset. 12 SCBs aborted > ahc1: target 4 using 16bit transfers > ahc1: target 4 synchronous at 40.0MHz, offset = 0x1f > Interrupted for status of 0??? > raid0: DAG failure: w addr 0x2b87d80 (45645184) nblk 0x20 (32) buf 0xf4631000 > ahc1: target 2 using 16bit transfers > ahc1: target 2 synchronous at 40.0MHz, offset = 0x1f > Interrupted for status of 0??? > ahc1: target 8 using 16bit transfers > ahc1: target 8 synchronous at 40.0MHz, offset = 0x1f > Interrupted for status of 0??? > ahc1: target 12 using 16bit transfers > ahc1: target 12 synchronous at 40.0MHz, offset = 0x1f > Interrupted for status of 0??? > Interrupted for status of 0??? > Interrupted for status of 0??? > ahc1:A:4: ahc_intr - referenced scb not valid during seqint 0x71 scb (22) > ahc1: WARNING no command for scb 22 (cmdcmplt) QOUTPOS=6 > panic: biodone already > stopped at _Debugger+0x4: movl %ebp, %esp > _Debugger(...) at _Debugger+0x4 > _panic(e01bb16e,0,f9fdbe30,e02a58bb,e132bb00) at _panic+0x81 > _biodone(e132bb00,1,e0f03000,0,0) at _biodone+0x1f > _scsi_done(e0f03000,1,f9fdbea0,e0109b91,e0dd8e00,e0f03000,0,e019ecb2) at _scsi_done+0xf0 > _ahc_intr(e0dd8e00,e0dda200,64,e01068e5) at _ahc_intr+0x37f6 > _ahc_init(e0dd8e00,2,41,0,ff) at _ahc_init+0x2baa > _ahc_init(e0dd8e00,e0dd4a00,0,e10ef600,71) at _ahc_init+0x1bed > _ahc_intr(e0dd8e00,71,aa10,f9fdbfa8) at _ahc_intr+0xb64 > _ahc_intr(e0dd8e00) at _ahc_intr+0x127 > _Xrecurse11() at _Xrecurse11+0x6d > --- interrupt --- > 0x400c8f4a: [phew! hope typed it right] The activity light of sd1 was (and remained) on during the debugging session. I again restarted the system, and initialized the parity. sd1 is still being used, and I am expecting another incident anytime now (so I might be able to provide additional traces). unfortunately, I cannot get any dumps, since the disks are not working (if there is any other way to obtain additional information I can help anyway I can). Conclusion: Well, if you have read thus far, thank you for your patience :-) As I said I have no idea what is wrong. I think, due to raid, the disks are overworked and starting to go bad? well, they are _sold_ as server disks but who knows. I have to fight my dealer over this. The fact that the disks 'seem' to work stand alone would not help me in this respect. But I think, the ahc driver should handle the failing of disks more gracefully. If a disk fails, I would expect raidframe to handle this. And if it is something unrecoverable ... I know such cases are very hard to test and debug (well, here is a chance) and I am willing to offer any help that I can. As I said before, I have not upgraded to 2.8 yet. And my kernel configuration is far from GENERIC (due to my previous problems with lock-ups). However, I am planning an upgrade in a few days and I can try any configurations/patches you may offer, provided that the server is not taken off-line for very long. (as it is actively used as the department server) Can --=< Can Erkin Acar (canacar_(_at_)_bigfoot_(_dot_)_com) >=--
The following section of this message contains a file attachment
prepared for transmission using the Internet MIME message format.
If you are using Pegasus Mail, or any another MIME-compliant system,
you should be able to save it or view it from within your mailer.
If you cannot, please ask your system administrator for assistance.
---- File information -----------
File: dmesg.log
Date: 9 Jan 2001, 19:17
Size: 21748 bytes.
Type: Unknown
Attachment:
dmesg.log
Description: Binary data