RE: [Xen-devel] Broadcom BCM5709 (bnx2) on Dell PowerEdge R610, Issues

cc'ing Guru...
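(For anyone who wants to try the changeset Guru points at below: it can
be pulled out of the xen-unstable tree with hg and test-applied against
a 3.4.3 checkout. An untested sketch; the patch file name and the
assumption that the 3.4.3 tree sits alongside the clone are mine:

    hg clone http://xenbits.xen.org/hg/xen-unstable.hg
    hg -R xen-unstable.hg export 1087f9a03ab6 > c6-irq-fix.patch
    cd xen-3.4.3                                 # your 3.4.3 source tree
    patch -p1 --dry-run < ../c6-irq-fix.patch    # check it applies first
    patch -p1 < ../c6-irq-fix.patch              # then apply for real

A clean dry run is of course no guarantee that it builds or behaves on
3.4.x.)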
> -----Original Message-----
> From: Joshua West [mailto:jwest@xxxxxxxxxxxx]
> Sent: Friday, March 18, 2011 1:25 PM
> To: xen-devel@xxxxxxxxxxxxxxxxxxx
> Subject: Re: [Xen-devel] Broadcom BCM5709 (bnx2) on Dell PowerEdge
> R610, Issues
>
> Hi Guru,
>
> Awesome, thanks for the tip.
>
> I'll test out disabling C-states in the BIOS, as I don't believe Xen
> 3.4.x lets you set max_cstate as an argument to xen.gz in grub.conf.
>
> The patch in the changeset you mention applies to Xen 3.4.3 code. Do
> you have any experience with that patch working with Xen 3.4.x? And
> if so, do you think it will end up as part of Xen 3.4.4 (if that ever
> gets tagged/released)? Assuming disabling C-states in the BIOS
> alleviates my problem, I'll probably give that patch a whirl with
> C-states enabled and see whether the issue comes back. Just wondering
> if anybody else has used that patch with Xen 3.4.3 and found success.
>
> Thanks.
>
> On 03/18/11 11:53, Guru Anbalagane wrote:
> > This is likely related to Xen losing interrupts when certain CPUs go
> > into the C6 state. The patch below addresses an issue around this:
> >
> > http://xenbits.xen.org/hg/xen-unstable.hg/rev/1087f9a03ab6
> >
> > An easy workaround would be to turn off C-states in the BIOS, or to
> > limit the C-state in Xen.
> >
> > Hope this helps.
> > Thanks
> > Guru
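(On "limit the C-state in Xen": where the hypervisor command line
supports it, the cap goes on the xen.gz line in grub.conf. A sketch
only; max_cstate is documented for later Xen, and as Joshua notes above
3.4.x may not accept it. Kernel/initrd names and root device here are
assumed:

    title Xen 3.4.3
        root (hd0,0)
        kernel /boot/xen.gz max_cstate=1
        module /boot/vmlinuz-2.6.18.8-xen ro root=/dev/sda2
        module /boot/initrd-2.6.18.8-xen.img

Later Xen also exposes a runtime control, "xenpm set-max-cstate <n>".
Failing either, the BIOS toggle is the reliable route.)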
> >> Message: 5
> >> Date: Fri, 18 Mar 2011 11:39:07 -0400
> >> From: Joshua West <jwest@xxxxxxxxxxxx>
> >> Subject: [Xen-devel] Broadcom BCM5709 (bnx2) on Dell PowerEdge R610
> >> Issues
> >> To: xen-devel@xxxxxxxxxxxxxxxxxxx
> >> Message-ID: <4D837C9B.6030107@xxxxxxxxxxxx>
> >> Content-Type: text/plain; charset="iso-8859-1"
> >>
> >> Hey folks,
> >>
> >> Unfortunately, ever since we went live with Xen on Dell PowerEdge
> >> R610s, we've been having some odd and aggravating issues. The NICs
> >> tend to drop out under heavy traffic after 1-7 days of uptime
> >> (random, difficult to reproduce). But before I get into the
> >> specifics of the issue, here's some information about our setup:
> >>
> >>     * Dell PowerEdge R610s w/ 4 onboard Broadcom BCM5709 1-GbE
> >>       NICs.
> >>     * RHEL 5.6.
> >>     * Xen 3.4.3 (from xen.org; our own compile).
> >>     * Kernel 2.6.18.8
> >>       (http://xenbits.xensource.com/linux-2.6.18-xen.hg),
> >>       checkout 1073.
> >>     * bnx2 driver 2.0.18c from Broadcom's netxtreme2-6.0.53
> >>       package.
> >>         * The bnx2 that ships with 2.6.18.8 doesn't support the
> >>           BCM5709.
> >>         * Had to use the driver package from broadcom.com in order
> >>           to get networking.
> >>     * NIC bonding in pairs (eth0 + eth1, etc.), with options
> >>       "mode=4 lacp_rate=fast miimon=100 use_carrier=1".
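(A sketch of the RHEL 5-style bonding configuration described in that
last bullet; the device names, addresses, and the choice of
modprobe.conf over BONDING_OPTS are my assumptions, not taken from
Joshua's report:

    # /etc/modprobe.conf
    alias bond0 bonding
    options bond0 mode=4 lacp_rate=fast miimon=100 use_carrier=1

    # /etc/sysconfig/network-scripts/ifcfg-eth0  (eth1 is analogous)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    BOOTPROTO=none
    ONBOOT=yes

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    IPADDR=10.0.0.2          # address assumed
    NETMASK=255.255.255.0
    BOOTPROTO=none
    ONBOOT=yes

mode=4 is 802.3ad/LACP, which matches the "No 802.3ad response" bonding
warning in the logs further down.)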
> >>
> >> What occurs is that suddenly one of the NICs in the bond stops
> >> responding; it gets stuck on transmit, from what I understand.
> >> Kernel logs show the following, which includes extra debug
> >> information, as the developers from Broadcom (Michael Chan and
> >> Benjamin Li) were assisting in troubleshooting and gave me a
> >> version of bnx2 2.0.18c that prints out extra debug information
> >> upon NIC crash:
> >>
> >> Mar 18 01:40:26 xen-san-gb1 kernel: NETDEV WATCHDOG: eth0: transmit timed out
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: <--- start FTQ dump on eth0 --->
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RV2P_PFTQ_CTL 10000
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RV2P_TFTQ_CTL 20000
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RV2P_MFTQ_CTL 4000
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_TBDR_FTQ_CTL 4002
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_TDMA_FTQ_CTL 10002
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_TXP_FTQ_CTL 10002
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_TPAT_FTQ_CTL 10000
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RXP_CFTQ_CTL 8000
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RXP_FTQ_CTL 100000
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_COM_COMXQ_FTQ_CTL 10000
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_COM_COMTQ_FTQ_CTL 20000
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_COM_COMQ_FTQ_CTL 10000
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_CP_CPQ_FTQ_CTL 4000
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: TXP mode b84c state 80001000 evt_mask 500 pc 8001284 pc 8001284 instr 1440fffc
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: TPAT mode b84c state 80001000 evt_mask 500 pc 8000a50 pc 8000a4c instr 38420001
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: RXP mode b84c state 80001000 evt_mask 500 pc 8004ad0 pc 8004adc instr 14e0005d
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: COM mode b8cc state 80008000 evt_mask 500 pc 8000a98 pc 8000a8c instr 8821
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: CP mode b8cc state 80000000 evt_mask 500 pc 8000c7c pc 8000928 instr 8ce800e8
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: <--- end FTQ dump on eth0 --->
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 DEBUG: intr_sem[0]
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 DEBUG: intr_sem[0] PCI_CMD[00100406]
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000088]
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 DEBUG: EMAC_TX_STATUS[00000008] EMAC_RX_STATUS[00000000]
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 RPM_MGMT_PKT_CTRL[40000088]
> >> Mar 18 01:40:27 xen-san-gb1 kernel: bnx2: eth0 DEBUG: MCP_STATE_P0[0003610e] MCP_STATE_P1[0003610e]
> >> Mar 18 01:40:27 xen-san-gb1 kernel: bnx2: eth0 DEBUG: HC_STATS_INTERRUPT_STATUS[01fe0001]
> >> Mar 18 01:40:27 xen-san-gb1 kernel: Ring state for ring 0 napi state 12
> >> Mar 18 01:40:27 xen-san-gb1 kernel: netdev state 7
> >> Mar 18 01:40:27 xen-san-gb1 kernel: hw status idx 3267 last status idx 307c irq jiffies 100759890
> >> Mar 18 01:40:27 xen-san-gb1 kernel: hw tx cons a669 hw rx cons 103c
> >> Mar 18 01:40:27 xen-san-gb1 kernel: sw tx cons a57c a57c prod a669
> >> Mar 18 01:40:27 xen-san-gb1 kernel: sw rx cons f3c prod 103c
> >> Mar 18 01:40:27 xen-san-gb1 kernel: Current jiffies 1008f4741 HZ fa tx 1008f41e2 poll 100759890
> >> Mar 18 01:40:27 xen-san-gb1 kernel: tx stop jiffies 1008f41e2 tx start jiffies 0
> >> Mar 18 01:40:27 xen-san-gb1 kernel: irq_event c68c36 napi_event c68c37
> >> Mar 18 01:40:27 xen-san-gb1 kernel: Ring state for ring 0 napi state 12
> >> Mar 18 01:40:27 xen-san-gb1 kernel: netdev state 77
> >> Mar 18 01:40:27 xen-san-gb1 kernel: hw status idx 3267 last status idx 307c irq jiffies 100759890
> >> Mar 18 01:40:27 xen-san-gb1 kernel: hw tx cons a669 hw rx cons 103c
> >> Mar 18 01:40:27 xen-san-gb1 kernel: sw tx cons a57c a57c prod a669
> >> Mar 18 01:40:27 xen-san-gb1 kernel: sw rx cons f3c prod 103c
> >> Mar 18 01:40:27 xen-san-gb1 kernel: Current jiffies 1008f4741 HZ fa tx 1008f41e2 poll 100759890
> >> Mar 18 01:40:27 xen-san-gb1 kernel: tx stop jiffies 1008f41e2 tx start jiffies 0
> >> Mar 18 01:40:27 xen-san-gb1 kernel: irq_event c68c36 napi_event c68c37
> >> Mar 18 01:40:27 xen-san-gb1 kernel: bnx2: eth0 NIC Copper Link is Down
> >> Mar 18 01:40:27 xen-san-gb1 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
> >>
> >> This was then followed rather quickly by a failure of the second
> >> NIC (eth1) in the bond:
> >>
> >> Mar 18 01:42:26 xen-san-gb1 kernel: NETDEV WATCHDOG: eth1: transmit timed out
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: <--- start FTQ dump on eth1 --->
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RV2P_PFTQ_CTL 10000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RV2P_TFTQ_CTL 20000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RV2P_MFTQ_CTL 4000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_TBDR_FTQ_CTL 4002
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_TDMA_FTQ_CTL 10000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_TXP_FTQ_CTL 10002
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_TPAT_FTQ_CTL 10000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RXP_CFTQ_CTL 8000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RXP_FTQ_CTL 100000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_COM_COMXQ_FTQ_CTL 10000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_COM_COMTQ_FTQ_CTL 20000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_COM_COMQ_FTQ_CTL 10000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_CP_CPQ_FTQ_CTL 4000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: TXP mode b84c state 80005000 evt_mask 500 pc 8001294 pc 8001284 instr 38640001
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: TPAT mode b84c state 80001000 evt_mask 500 pc 8000a58 pc 8000a5c instr 8f820014
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: RXP mode b84c state 80001000 evt_mask 500 pc 8004ad0 pc 8004adc instr 14e0005d
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: COM mode b8cc state 80000000 evt_mask 500 pc 8000a9c pc 8000a94 instr 3c028000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: CP mode b8cc state 80008000 evt_mask 500 pc 8000c58 pc 8000c6c instr 27bdffe8
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: <--- end FTQ dump on eth1 --->
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 DEBUG: intr_sem[0]
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 DEBUG: intr_sem[0] PCI_CMD[00100406]
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000088]
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 DEBUG: EMAC_TX_STATUS[00000008] EMAC_RX_STATUS[00000000]
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 RPM_MGMT_PKT_CTRL[40000088]
> >> Mar 18 01:42:27 xen-san-gb1 kernel: bnx2: eth1 DEBUG: MCP_STATE_P0[0003610e] MCP_STATE_P1[0003610e]
> >> Mar 18 01:42:27 xen-san-gb1 kernel: bnx2: eth1 DEBUG: HC_STATS_INTERRUPT_STATUS[01fe0001]
> >> Mar 18 01:42:27 xen-san-gb1 kernel: Ring state for ring 0 napi state 12
> >> Mar 18 01:42:27 xen-san-gb1 kernel: netdev state 7
> >> Mar 18 01:42:27 xen-san-gb1 kernel: hw status idx 2bb0 last status idx 29c4 irq jiffies 100759898
> >> Mar 18 01:42:27 xen-san-gb1 kernel: hw tx cons e421 hw rx cons a8ce
> >> Mar 18 01:42:27 xen-san-gb1 kernel: sw tx cons e334 e334 prod e421
> >> Mar 18 01:42:27 xen-san-gb1 kernel: sw rx cons a7ce prod a8ce
> >> Mar 18 01:42:27 xen-san-gb1 kernel: Current jiffies 1008fbc71 HZ fa tx 1008fb744 poll 100759898
> >> Mar 18 01:42:27 xen-san-gb1 kernel: tx stop jiffies 1008fb744 tx start jiffies 100239dfd
> >> Mar 18 01:42:27 xen-san-gb1 kernel: irq_event ab2e13 napi_event ab2e14
> >> Mar 18 01:42:27 xen-san-gb1 kernel: Ring state for ring 0 napi state 12
> >> Mar 18 01:42:27 xen-san-gb1 kernel: netdev state 77
> >> Mar 18 01:42:27 xen-san-gb1 kernel: hw status idx 2bb0 last status idx 29c4 irq jiffies 100759898
> >> Mar 18 01:42:27 xen-san-gb1 kernel: hw tx cons e421 hw rx cons a8ce
> >> Mar 18 01:42:27 xen-san-gb1 kernel: sw tx cons e334 e334 prod e421
> >> Mar 18 01:42:27 xen-san-gb1 kernel: sw rx cons a7ce prod a8ce
> >> Mar 18 01:42:27 xen-san-gb1 kernel: Current jiffies 1008fbc72 HZ fa tx 1008fb744 poll 100759898
> >> Mar 18 01:42:27 xen-san-gb1 kernel: tx stop jiffies 1008fb744 tx start jiffies 100239dfd
> >> Mar 18 01:42:27 xen-san-gb1 kernel: irq_event ab2e13 napi_event ab2e14
> >> Mar 18 01:42:27 xen-san-gb1 kernel: bnx2: eth1 NIC Copper Link is Down
> >> Mar 18 01:42:27 xen-san-gb1 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
> >> Mar 18 01:42:27 xen-san-gb1 kernel: bonding: bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
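(A quick reading of the eth0 dump above, assuming the jiffies counters
are hex, that "HZ fa" means HZ = 250, and that "irq jiffies" records
the time of the last interrupt:

    0x1008f4741 - 0x100759890 = 0x19aeb1 = 1,683,121 jiffies
    1,683,121 / 250            ~ 6,732 s ~ 112 minutes

By contrast the TX queue stopped only 0x55f = 1,375 jiffies, about
5.5 s, before the dump. So the watchdog fired promptly once traffic
stalled, but the NIC appears not to have interrupted for nearly two
hours, which matches Michael Chan's comment quoted below.)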
> >>
> >> On to more technical details...
> >>
> >> The kernel we were running (2.6.18.8 from xenbits) was originally
> >> compiled without support for MSI/MSI-X, so we were experiencing
> >> these problems with plain standard IRQs. Michael Chan @ Broadcom
> >> (the bnx2 author you see if you modinfo the module) has told me
> >> via email:
> >>
> >>     * "The logs show that we haven't had an interrupt for a very
> >>       long time. It's not clear how that interrupt was lost."
> >>     * "So far the logs don't show any inconsistent state in the
> >>       hardware or software. It is possible that the Xen kernel is
> >>       missing an interrupt and not delivering to the driver.
> >>       Normally, in INTA mode, the IRQ is level triggered and
> >>       should remain asserted until it is seen by the driver and
> >>       de-asserted by the driver."
> >>
> >> But, just in case, I compiled 2.6.18.8 with support for MSI/MSI-X
> >> and was able to confirm (via dmesg and lspci -vv) that the NICs
> >> began to use MSI for interrupts. Unfortunately, the NIC crash
> >> happened anyway (the kernel logs above are in fact from a run
> >> with MSI enabled).
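(The MSI confirmation Joshua describes can be reproduced with standard
tools; illustrative commands, with the PCI address assumed:

    # "MSI: Enable+" on the capability line means MSI is active
    lspci -vv -s 01:00.0 | grep -i msi

    # the interrupt-type column for the NIC's IRQ should read
    # PCI-MSI rather than IO-APIC
    grep eth /proc/interrupts
)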
> >>
> >> Here's what's really bugging me. We have another Dell PowerEdge
> >> R610, running Xen along with the bnx2 drivers from Broadcom,
> >> that's been online for ~220 days without a failure. The only
> >> difference is that the system is not making use of bonding. It
> >> has just one NIC connected to the network, with no VLANs trunked
> >> down, etc.
> >>
> >> It looks like I'm not alone out there, as there's a Red Hat
> >> bugzilla report for this issue:
> >>
> >> https://bugzilla.redhat.com/show_bug.cgi?id=520888
> >>
> >> ^^ The above is marked Status: CLOSED DUPLICATE of bug 511368
> >> (https://bugzilla.redhat.com/show_bug.cgi?id=511368), but it
> >> looks like I don't have access to view 511368. Grrr.
> >>
> >> Anyways...
> >>
> >> 1) Has anybody else experienced this issue?
> >> 2) Any developers care to comment on possible causes of this
> >>    problem?
> >> 3) Anybody know of a solution?
> >> 4) What can I do to troubleshoot further and get developers the
> >>    necessary information?
> >>
> >> Lastly...
> >>
> >> 5) Is anybody running Intel NICs within Dell PowerEdge R610s,
> >>    using bonding + Xen 3.4.3 + 2.6.18.8, and can safely report
> >>    success? I may switch to Intel...
> >>
> >> Thanks!
>
> --
> Joshua West
> Senior Systems Engineer
> Brandeis University
> http://www.brandeis.edu

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel