"le0: dropping chained buffer", OpenBSD 3.1 on Sparc 5 (sun4m)
- To: tech@openbsd.org
- From: Rich Kulawiec <rsk@gsp.org>
- Date: Sat, 16 Aug 2003 15:46:57 -0400
I have two 170 MHz, 256M Sparc 5's running 3.1 on a small network
connected via T-1 to the Internet. They have identical configurations.
(They were built on the same day from the same checklist.) They run
a handful of services: SSH, DNS, SMTP and POP. They've been running
smoothly without a reboot for about 4 months. The load average seldom
exceeds .5, as they really aren't asked to do very much.
Yesterday afternoon around 13:00, both of them simultaneously became
unresponsive to all network traffic -- TCP, UDP, ARP, ICMP, everything.
Some minutes later (I estimate 10, but could easily be off) the problem
went away; but it soon came back. Each time, there was at least one
entry of the form "le0: dropping chained buffer" and a number of
syslog entries indicating that the message was repeated.
I rebooted one machine at around 14:50; while that temporarily cleared
the problem, it did not appear to have any lasting effect, as that
system showed the same symptom again within 20 minutes.
However, neither system has shown any sign of this problem since about
16:00 yesterday. I have no explanation for that.
Sniffing network traffic with another system didn't show much beyond normal
traffic and various systems around the 'net infected with the M$ RPC DCOM
worm and looking for more.
This error is apparently rare: a search of the last several years' worth
of archives of the OpenBSD, NetBSD, and FreeBSD mailing lists turned up
only a few mentions of it, and all of them were of the form "What the
hell is this?". A Google search was similarly fruitless.
So I went to the source. I have traced this to the following bit of
code in am7990.c (in sys/dev/ic):
	} else if ((rmd.rmd1_bits & (LE_R1_STP | LE_R1_ENP)) !=
	    (LE_R1_STP | LE_R1_ENP)) {
		printf("%s: dropping chained buffer\n",
		    sc->sc_dev.dv_xname);
		ifp->if_ierrors++;
This is inside am7990_rint(), which handles data receive interrupts.
It's after code which looks for framing errors and crc errors, so I
think those can be ruled out.
I managed to find the manufacturer's spec sheet for the LANCE chip (which
is what this is about) on AMD's web site, thanks to a URL given in am7990reg.h.
(It's 17881.pdf, if you want to find it on AMD's site.)
Between reading the spec sheet and the driver source (wow, it's been a
LONG time since I've done this) my impression is that the reason the
interface shut down was that ifp->if_ierrors became large enough to merit
taking it offline for a bit, then resetting it and trying again. In other
words, I think that was a symptom, not the cause.
Trying to find the cause brings me back to that bit of code above and
what conditions can trigger it. In am7990reg.h, we find:
	#define LE_R1_STP 0x02 /* start of packet */
	#define LE_R1_ENP 0x01 /* end of packet */
so I believe the test above is checking to see if the corresponding
bits (0x03) in rmd.rmd1_bits are both set.
This seems to match up with page 28 of the 7990 ("LANCE") data sheet, which says:
	STP	START OF PACKET indicates that this is the first buffer
		used by the C-LANCE for this packet. It is used for
		data chaining buffers.
	ENP	END OF PACKET indicates that this is the last buffer used
		by the C-LANCE for this packet. It is used for data chaining
		buffers. If both STP and ENP are set, the packet fits in one
		buffer and there is no data chaining.
It's that last sentence that has me confused. I think the piece of code above
is testing for exactly that condition, so I would expect that condition to
be true if data was not being chained (across multiple buffers).
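One way to check that reading is to lift the mask test out of the driver
and look at it in isolation. The following is only a sketch: it reuses the
two bit values quoted from am7990reg.h and the expression from
am7990_rint(), but the function name would_log_chained() is made up for
illustration and is not part of the driver.

```c
/* Bit values as quoted from am7990reg.h. */
#define LE_R1_STP 0x02 /* start of packet */
#define LE_R1_ENP 0x01 /* end of packet */

/*
 * Hypothetical helper, not in the driver: returns 1 when the
 * "else if" test from am7990_rint() would be true for the given
 * descriptor bits, and 0 otherwise.
 */
static int
would_log_chained(unsigned char rmd1_bits)
{
	/* Same expression as the driver: mask both bits, compare with !=. */
	return (rmd1_bits & (LE_R1_STP | LE_R1_ENP)) !=
	    (LE_R1_STP | LE_R1_ENP);
}
```

Because the comparison is "!=", the helper returns 0 only when both STP and
ENP are set (the data sheet's "fits in one buffer" case) and returns 1 for
the other three combinations. Read that way, the branch that prints
"dropping chained buffer" is taken when the descriptor is NOT a complete
single-buffer packet, which is the opposite sense of the data sheet
sentence quoted above.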
(Aside: the LANCE data sheet goes on about this for a while, and diagrams
can be found on page 32.)
So the best I can come up with at the moment is that something has gone
wrong while receiving a packet, and it's gone wrong at a pretty low level,
i.e., this doesn't seem to have anything to do with higher network layers.
And it seems to have something to do with the driver's method of storing
the packet -- i.e., it doesn't look like a malformed packet on the wire.
I think I'll stop here, because I think one explanation for my confusion
is that I've misread something or made another kind of mistake. If I have,
I'd appreciate it if someone could point it out. But whether I have or
haven't, any guidance on what might be causing this (and of course, how
I can fix it) would be most welcome.
Thanks,
---Rsk