Re: Possible bug? DOM-U network stopped working after fatal error reported in DOM0
On Mon, Jan 10, 2022 at 10:54 PM Roger Pau Monné <roger.pau@xxxxxxxxxx> wrote:
>
> On Sat, Jan 08, 2022 at 01:14:26AM +0800, G.R. wrote:
> > On Wed, Jan 5, 2022 at 10:33 PM Roger Pau Monné <roger.pau@xxxxxxxxxx> wrote:
> > >
> > > On Wed, Jan 05, 2022 at 12:05:39AM +0800, G.R. wrote:
> > > > > > > > But seems like this patch is not stable enough yet and has its own
> > > > > > > > issue -- memory is not properly released?
> > > > > > >
> > > > > > > I know. I've been working on improving it this morning and I'm
> > > > > > > attaching an updated version below.
> > > > > > >
> > > > > > Good news.
> > > > > > With this new patch, the NAS domU can serve iSCSI disk without OOM
> > > > > > panic, at least for a little while.
> > > > > > I'm going to keep it up and running for a while to see if it's
> > > > > > stable over time.
> > > > >
> > > > > Thanks again for all the testing. Do you see any difference
> > > > > performance wise?
> > > > I'm still on a *debug* kernel build to capture any potential panic --
> > > > none so far -- no performance testing yet.
> > > > Since I'm a home user with a relatively lightweight workload, so far I
> > > > didn't observe any difference in daily usage.
> > > >
> > > > I did some quick iperf3 testing just now.
> > >
> > > Thanks for doing this.
> > >
> > > > 1. between nas domU <=> Linux dom0 running on an old i7-3770 based box.
> > > > The peak is roughly 12 Gbits/s when domU is the server.
> > > > But I do see regression down to ~8.5 Gbits/s when I repeat the test in
> > > > a short burst.
> > > > The regression can recover when I leave the system idle for a while.
> > > >
> > > > When dom0 is the iperf3 server, the transfer rate is much lower, down
> > > > all the way to 1.x Gbits/s.
> > > > Sometimes, I can see the following kernel log repeats during the
> > > > testing, likely contributing to the slowdown.
> > > > interrupt storm detected on "irq2328:"; throttling interrupt source
> > >
> > > I assume the message is in the domU, not the dom0?
> > Yes, in the TrueNAS domU.
> > BTW, I rebooted back to the stock kernel and the message is no longer
> > observed.
> >
> > With the stock kernel, the transfer rate from dom0 to nas domU can be
> > as high as 30Gbps.
> > The variation is still observed, sometimes down to ~19Gbps. There is
> > no retransmission in this direction.
> >
> > For the reverse direction, the observed low transfer rate still exists.
> > It's still within the range of 1.x Gbps, but should still be better
> > than the previous test.
> > The huge number of re-transmission is still observed.
> > The same behavior can be observed on a stock FreeBSD 12.2 image, so
> > this is not specific to TrueNAS.
>
> So that's domU sending the data, and dom0 receiving it.

Correct.

> >
> > According to the packet capture, the re-transmission appears to be
> > caused by packet reorder.
> > Here is one example incident:
> > 1. dom0 sees a sequence jump in the incoming stream and begins to send out
> > SACKs
> > 2. When SACK shows up at domU, it begins to re-transmit lost frames
> > (the re-transmit looks weird since it show up as a mixed stream of
> > 1448 bytes and 12 bytes packets, instead of always 1448 bytes)
> > 3. Suddenly the packets that are believed to have lost show up, dom0
> > accept them as if they are re-transmission
>
> Hm, so there seems to be some kind of issue with ordering I would say.

Agree.

>
> > 4. The actual re-transmission finally shows up in dom0...
> > Should we expect packet reorder on a direct virtual link? Sounds fishy to
> > me.
> > Any chance we can get this re-transmission fixed?
>
> Does this still happen with all the extra features disabled? (-rxcsum
> -txcsum -lro -tso)

No obvious impact I would say.
After disabling all extra features:
xn0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether 00:18:3c:51:6e:4c
        inet 192.168.1.9 netmask 0xffffff00 broadcast 192.168.1.255
        media: Ethernet manual
        status: active
        nd6 options=9<PERFORMNUD,IFDISABLED>

The iperf3 result:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.04 GBytes  1.75 Gbits/sec  12674             sender
[  5]   0.00-10.14  sec  2.04 GBytes  1.73 Gbits/sec                  receiver

BTW, those extra features have huge impact on the dom0 => domU direction.
It goes all the way down from ~30 / 18 Gbps to 3.5 / 1.8 Gbps
(variation range) without those.
But there is no retransmission at all in both configs for this direction.
I wonder why such a huge difference since the nic is purely virtual
without any HW acceleration?

Any further suggestions on this retransmission issue?

>
> > So looks like at least the imbalance between two directions are not
> > related to your patch.
> > Likely the debug build is a bigger contributor to the perf difference
> > in both directions.
> >
> > I also tried your patch on a release build, and didn't observe any
> > major difference in iperf3 numbers.
> > Roughly match the 30Gbps and 1.xGbps number on the stock release kernel.
>
> Thanks a lot, will try to get this upstream then.
>
> Roger.
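
For anyone who wants to check a capture for the pattern described in steps
1-4 above, here is a minimal sketch (not part of the original exchange, and
not the tool used for the analysis). It assumes the capture has already been
reduced to a list of (seq, length) pairs in the order dom0 received them; the
function name classify_arrivals and that input format are hypothetical. It
ignores sequence number wraparound and partial-segment retransmits such as
the 12-byte ones mentioned above.

```python
from typing import Iterable, List, Tuple


def classify_arrivals(
    segments: Iterable[Tuple[int, int]],
) -> Tuple[List[int], List[int]]:
    """Split suspicious arrivals into late originals and duplicates.

    segments: (seq, length) pairs in arrival order at the receiver.
    Returns (late, duplicate), each a list of starting sequence numbers:
      late      - first time this seq is seen, but a higher seq was
                  already received (step 3: "lost" packets showing up)
      duplicate - this exact seq was already received once
                  (step 4: the actual retransmission arriving)
    """
    seen = set()   # starting sequence numbers received so far
    highest = -1   # highest starting sequence number received so far
    late: List[int] = []
    duplicate: List[int] = []
    for seq, _length in segments:
        if seq in seen:
            duplicate.append(seq)
        elif seq < highest:
            late.append(seq)
        seen.add(seq)
        highest = max(highest, seq)
    return late, duplicate


# Hypothetical arrival order: segment 2896 is overtaken by two later
# segments, then shows up, then its retransmission arrives as well.
arrivals = [
    (0, 1448),
    (1448, 1448),
    (4344, 1448),
    (5792, 1448),
    (2896, 1448),
    (2896, 1448),
]
late, duplicate = classify_arrivals(arrivals)
```

A seq that lands in both lists matches the incident described: the original
arrives late and the retransmission follows it. A seq that only appears as
late could equally be a retransmission of a segment that was really lost, so
timing from the capture is still needed to tell the two apart.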