[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

ahc SCSI (and raidframe) Problems



Hello,

I am experiencing SCSI failures and panics. I could not
figure out the real source of the problem. So I would
appreciate any help,
thank you...


Background:

The server is a PIII 800 512M machine with
Intel L440GX+ mainboard. It has "Adaptec(2)AIC-7896/7
Ultra2" SCSI controller (ahc) and Intel EtherExpress
PRO/100 ethernet adapter (fxp) onboard.

It contains 5 x 18GB Quantum AtlasV SCSI drives used
in RAID5 configuration (raidframe)

The server is used as the file (samba) web and e-mail server.

It has OpenBSD 2.7 installed upgraded from cvs to
late 2.7-current (early 2.8) just before October 28.

[Yes, I know there are updates to ahc after that date
 (and they are about timeouts) I was planning to wait
 until the semester break to upgrade but ...
 please read on :-) ]


The History:

Being an idealist, I installed the server using raidframe
and softupdates (at around september). There were no
problems during initial testing. However, once I took the
server on line, we started experiencing lock-ups. The system
would freeze when large amounts of files are transferred.
Disabling softupdates lowered the frequency (and performance)
but we did have an occasional lock-up.

I suspected UVM and I read somewhere (cant remember where)
that using two SCSI disks for swap causes UVM swapping code
to lock-up (why!). 512M RAM takes a long time to fill before
swapping I guess.

I removed one swap partition and I re-arranged the kernel
configuration (increased some buffers, enabled memory
profiling etc.) and encountered no lock-ups since
(softupdates still disabled). This was early december.
But now I am receiving panics! 8-| 


The Problem:

It started with SCSI timeout errors from the kernel:
and the system would freeze for a short time before the error.
(The complete kernel logs are attached )

> Timedout SCB handled by another timeout

I remember seeing those _very rarely_ earlier but 
one day, there was something like:

> Timedout SCB handled by another timeout
> Timedout SCB handled by another timeout
> sda CCB 0x11 timedout in Command phase

and perhaps a few other messages?
and PANIC!

I am ashamed to say that I panicked too and restarted the system
before writing down more debug information. I do remember that
the panic was in a raidframe function :-(.

All drives came back online to my relief nothing was lost
(I started taking backups :-) )

A few days later, I received another panic. This time I managed to
write down the kernel messages and stack trace:

> Timedout SCB handled by another timeout
> Timedout SCB handled by another timeout
> sd0 (ahc1:0:0): SCB 0x11 - timedout in Command phase, SEQADDR = 0x165
> sd0 (ahc1:0:0): BDR message in message buffer
> ahc1:A:0: unknown scsi bus phase b6. Attempting to continue
> ahc1:A:0: Missed busfree, Curphase = 0xb6
> sd0(ahc1:0:0): SCB 0x11 - timedout while idle, SEQADDR = 0xC
> panic Disconnected List Inconsistency. SCB index = 255, yet numscbs=30.
> Stopped at - Debugger+0x4: movl %ebp, %esp
> _Debugger( ...) at _Debugger+0x4
> _panic(e010cdbc,ff,1e,e019e8a3,ff) at _panic+0x81
> _ahc_init(e0dd8e00,ffffffff,41,ffffffff,ff) at _ahc_init+0x30ea
> _ahc_init(e0dd8e00,ffffffff,41,ffffffff,ff) at _ahc_init+0x2e60
> _ahc_init(e0dd8e00,41,1,e0231154,7fffffff) at _ahc_init+0x251a
>  _softclock(e0dd6500,e0f8aa00,e0f8aa00,ea1187ba,e0dd54e0) at _softclock+0x8e
> _hardclock(f9e1ad70,40000000,0,1,e0f8aa0d) at _hardclock+0x41c
> _clockintr(f1ead70) at _clockintr+0x12
> _Xrecurse0() ar _XRecurse+0x69
> --- interrupt ---
> _idle(0,0,0,e01a32f7,e0f8aa00) at _idle+0x20
> bpend+sleep(e0397f5c,118,e01a3045,65,f9e1af80) at bpend+sleep
> _sys_select(e0f8a400,f9e1af88,f9e1af80,50a7,0) at _sys_select+0x357
> _syscall(number 93)
> 0x41b3e2b:

The system restarted without problems, and I have managed to
update the parity on raid.

Here is my a question related to raidframe:

 What if I _really_ lose a disk AND receive a panic
 just after that without a chance to shutdown?
 I think I would lose all the data. The raid set would
 be in a dirty state and missing a drive!

 Any comments?

Anyway, I installed an IDE drive (wd0), and added it to
the raid set as a spare. Than, I manually failed sd0 to
wd0 (since that drive seemed to have problems) and
replaced the drives, with sd0 as the spare and wd0
part of the raid set.

Note that I did NOT remove the sd0 drive
(my root partition is still there)

Just as I was preparing to contact the dealer that sold me
the SCSI drives, and ask for a replacement I got this:

> Timedout SCB handled by another timeout
> Timedout SCB handled by another timeout
> Timedout SCB handled by another timeout
> Timedout SCB handled by another timeout
> sd1(ahc1:2:0): SCB 0x1b - timed out in Command phase, SEQADDR == 0x165
> sd1(ahc1:2:0): BDR message in message buffer
> sd1(ahc1:2:0): no longer in timeout, status = 0
> sd1(ahc1:2:0): Unexpected busfree in Message-out phase SEQADDR == 0x165
> sd1(ahc1:2:0): parity error detected in Data-in phase. SEQADDR(0x164) SCSIRATE(0x93)
> 17 x sd1(ahc1:2:0): parity error detected in Data-in phase. SEQADDR(0x166) SCSIRATE(0x93)
> ahc1:A:2: unknown scsi bus phase 0.  Attempting to continue
> ahc1:A:2: Target did not send an IDENTIFY message. LASTPHASE = 0x0, SAVED_TCL == 0x20
> sd1(ahc1:2:0): Sense Error Code 0
> raid0: IO Error.  Marking /dev/sd1d as failed.
> raid0: node (Rop) returned fail, rolling backward
> ahc1: Issued Channel A Bus Reset. 6 SCBs aborted
> ahc1: target 2 using 16bit transfers
> ahc1: target 2 synchronous at 40.0MHz, offset = 0x1f
> ahc1: target 2 using 16bit transfers
> ahc1: target 2 synchronous at 40.0MHz, offset = 0x1f
> sd1(ahc1:2:0): Check Condition on opcode a
>     SENSE KEY: Aborted Command
>      ASC/ASCQ: Overlapped Commands Attempted
> ahc1: target 12 using 16bit transfers
> ahc1: target 12 synchronous at 40.0MHz, offset = 0x1f
> ahc1: target 4 using 16bit transfers
> ahc1: target 4 synchronous at 40.0MHz, offset = 0x1f
> ahc1: target 8 using 16bit transfers
> ahc1: target 8 synchronous at 40.0MHz, offset = 0x1f
> ahc1: target 8 using 16bit transfers
> ahc1: target 8 synchronous at 40.0MHz, offset = 0x1f
> raid0: DAG failure: w addr 0x2b87d80 (45645184) nblk 0x20 (32) buf 0xf43e1000
> ahc1: target 0 using 16bit transfers
> ahc1: target 0 synchronous at 40.0MHz, offset = 0x1f
> ahc1: target 0 using 16bit transfers
> ahc1: target 0 synchronous at 40.0MHz, offset = 0x1f

Apparently ahc failed, the raidframe did what it is
supposed to do, and failed the disk. So I now have
a raid set running on parity!

I shutdown the system, and ran the Adaptec BIOS Utility
to scan sd0 and sd1 (media test). The utility did not
report any problems on any of the disks. (I do not know
if it is supposed to)

but I had no choice (having ran out of spare disks)
but to restart the system, and reconstruct the failed
sd1 back onto itself. (it was Jan 5th)

I did one more thing and set the
Adaptec host adapter SCSI ID to 15 (it was 7).

a few hours ago, I encountered another parity
error. But it also panicked the system:

> sd1(ahc1:2:0): SCB 0x10 - timed out in Command phase, SEQADDR == 0x165
> sd1(ahc1:2:0): BDR message in message buffer
> sd1(ahc1:2:0): no longer in timeout, status = 0
> sd1(ahc1:2:0): Unexpected busfree in Message-out phase SEQADDR == 0x165
> 18x sd1(ahc1:2:0): parity error detected in Data-in phase. SEQADDR(0x165) SCSIRATE(0x93)
> ahc1:A:2: unknown scsi bus phase 0.  Attempting to continue
> ahc1:A:2: Target did not send an IDENTIFY message. LASTPHASE = 0x0, SAVED_TCL == 0x20
> raid0: IO Error.  Marking /dev/sd1d as failed.
> raid0: node (Rop) returned fail, rolling backward
> ahc1: Issued Channel A Bus Reset. 12 SCBs aborted
> ahc1: target 4 using 16bit transfers
> ahc1: target 4 synchronous at 40.0MHz, offset = 0x1f
> Interrupted for status of 0???
> raid0: DAG failure: w addr 0x2b87d80 (45645184) nblk 0x20 (32) buf 0xf4631000
> ahc1: target 2 using 16bit transfers
> ahc1: target 2 synchronous at 40.0MHz, offset = 0x1f
> Interrupted for status of 0???
> ahc1: target 8 using 16bit transfers
> ahc1: target 8 synchronous at 40.0MHz, offset = 0x1f
> Interrupted for status of 0???
> ahc1: target 12 using 16bit transfers
> ahc1: target 12 synchronous at 40.0MHz, offset = 0x1f
> Interrupted for status of 0???
> Interrupted for status of 0???
> Interrupted for status of 0???
> ahc1:A:4: ahc_intr - referenced scb not valid during seqint 0x71 scb (22)
> ahc1: WARNING no command for scb 22 (cmdcmplt) QOUTPOS=6
> panic: biodone already
> stopped at _Debugger+0x4: movl %ebp, %esp
> _Debugger(...) at _Debugger+0x4
> _panic(e01bb16e,0,f9fdbe30,e02a58bb,e132bb00) at _panic+0x81
> _biodone(e132bb00,1,e0f03000,0,0) at _biodone+0x1f
> _scsi_done(e0f03000,1,f9fdbea0,e0109b91,e0dd8e00,e0f03000,0,e019ecb2) at _scsi_done+0xf0
> _ahc_intr(e0dd8e00,e0dda200,64,e01068e5) at _ahc_intr+0x37f6
> _ahc_init(e0dd8e00,2,41,0,ff) at _ahc_init+0x2baa
> _ahc_init(e0dd8e00,e0dd4a00,0,e10ef600,71) at _ahc_init+0x1bed
> _ahc_intr(e0dd8e00,71,aa10,f9fdbfa8) at _ahc_intr+0xb64
> _ahc_intr(e0dd8e00) at _ahc_intr+0x127
> _Xrecurse11() at _Xrecurse11+0x6d
> --- interrupt ---
> 0x400c8f4a:

[phew! hope typed it right]
The activity light of sd1 was (and remained) on
during the debugging session.

I again restarted the system, and initialized the parity.
sd1 is still being used, and I am expecting another incident
anytime now (so I might be able to provide additional traces).
unfortunately, I cannot get any dumps, since the disks are not
working (if there is any other way to obtain additional
information I can help anyway I can).


Conclusion:

Well, if you have read thus far, thank you for your patience :-)
As I said I have no idea what is wrong.

I think, due to raid, the disks are overworked and starting
to go bad? well, they are _sold_ as server disks but who knows.
I have to fight my dealer over this. The fact that the disks 
'seem' to work stand alone would not help me in this respect.

But I think, the ahc driver should handle the failing of disks
more gracefully. If a disk fails, I would expect raidframe to
handle this. And if it is something unrecoverable ...
I know such cases are very hard to test and debug
(well, here is a chance) and I am willing to offer any help
that I can.

As I said before, I have not upgraded to 2.8 yet. And my kernel
configuration is far from GENERIC (due to my previous problems
with lock-ups). However, I am planning an upgrade in a few days
and I can try any configurations/patches you may offer, provided
that the server is not taken off-line for very long.
(as it is actively used as the department server)


Can














--=< Can Erkin Acar (canacar_(_at_)_bigfoot_(_dot_)_com) >=--
The following section of this message contains a file attachment
prepared for transmission using the Internet MIME message format.
If you are using Pegasus Mail, or any another MIME-compliant system,
you should be able to save it or view it from within your mailer.
If you cannot, please ask your system administrator for assistance.

   ---- File information -----------
     File:  dmesg.log
     Date:  9 Jan 2001, 19:17
     Size:  21748 bytes.
     Type:  Unknown

Attachment: dmesg.log
Description: Binary data


Visit your host, monkey.org