WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
   
 
 

xen-devel

RE: [Xen-devel] arp during live migration

To: <xen-devel@xxxxxxxxxxxxxxxxxxx>
Subject: RE: [Xen-devel] arp during live migration
From: "Graham, Simon" <Simon.Graham@xxxxxxxxxxx>
Date: Tue, 6 Mar 2007 17:59:31 -0500
Delivery-date: Tue, 06 Mar 2007 14:58:40 -0800
Envelope-to: www-data@xxxxxxxxxxxxxxxxxx
In-reply-to: 342BAC0A5467384983B586A6B0B3767104DC3FC3@xxxxxxxxxxxxxxxxxxxxx
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
References: 45E88EEA.4020707@xxxxxxxxxxxxx<342BAC0A5467384983B586A6B0B3767104DC3DAF@xxxxxxxxxxxxxxxxxxxxx><1172938895.14470.25.ca mel@xxxxxxxxxxxxxxxxxxxxx> 45EC39E2.3020100@xxxxxxxxxxxxx 342BAC0A5467384983B586A6B0B3767104DC3FC3@xxxxxxxxxxxxxxxxxxxxx
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
Thread-index: AcdfPJTYp3d71aDxRxK0gccT+4xaGgAQl6OgADB089A=
Thread-topic: [Xen-devel] arp during live migration
> >  > In my case, I NEVER see the gratuitous ARP being sent (confirmed using
> >  > tcpdump on peth0 in Dom0) and the return value from dev_queue_xmit is
> >  > sometimes 0 and sometimes 2 (that's PLUS 2 -- congestion notification
> >  > [NET_XMIT_CN]).
> >
> > I am seeing the same error, indeed it looks like it is NET_XMIT_CN. I
> > also see 100% percent loss, the ARP never makes it to the wire in any of
> > my tests.
> >
> 

I guess no one else is seeing this problem? 

Anyway -- after a fair amount of stumbling around I think I know what
the problem is (but I don't have a solution). For a while I thought it
was an SMP bug in the netfront/netback interaction but, although there
is some dodgy code there, netfront does seem to always send the
gratuitous ARP and the backend always picks it up.

The real problem seems to be in the bridge in Dom0: the VIF port on the
bridge is always in the disabled state when the ARP is sent, so the ARP
simply gets dropped.

Why is this? Well, the bridge doesn't enable the port until the VIF is
both up AND has link (netif_carrier_on() has been called) -- this latter
call is not made until netfront connects to netback.
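
To make the condition above concrete, here is a minimal sketch in C of a
port that only forwards once it is both up and has carrier. It is a toy
model, not the kernel's bridge code: the struct, the enum and both
function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these types and names are illustrative assumptions,
 * not the data structures used by the Linux bridge. */
enum port_state { PORT_DISABLED, PORT_FORWARDING };

struct vif_port {
    bool admin_up;            /* interface has been brought up */
    bool carrier;             /* link present, i.e. carrier has been set on */
    enum port_state state;    /* what the bridge currently believes */
    unsigned long forwarded;  /* frames passed on to the bridge */
    unsigned long dropped;    /* frames discarded at the port */
};

/* Recompute the port state: enabled only when up AND carrier is present. */
void port_update_state(struct vif_port *p)
{
    assert(p != NULL);
    p->state = (p->admin_up && p->carrier) ? PORT_FORWARDING : PORT_DISABLED;
}

/* Offer one frame to the port; returns true if it was forwarded. */
bool port_receive(struct vif_port *p)
{
    assert(p != NULL);
    if (p->state != PORT_FORWARDING) {
        p->dropped++;         /* e.g. a gratuitous ARP sent too early */
        return false;
    }
    p->forwarded++;
    return true;
}
```

In this model a frame offered while carrier is still off is counted as
dropped, which matches the behaviour described above for the gratuitous
ARP.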

What's more, the carrier change is not passed to the bridge code until
the next time the linkwatch worker runs, which could be up to 1s after
netif_carrier_on() is called... at least, that's how it looks to me...
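
The timing can be sketched as below, assuming the deferred worker runs at
most once per interval (1s here, taken from the estimate above). Both
helper names are hypothetical and the scheduling policy is an assumption
made for illustration, not a description of the actual kernel code.

```c
#include <assert.h>

/* Time (ms) at which a carrier change raised at event_ms reaches the
 * bridge, assuming the worker last ran at last_run_ms and may run at
 * most once every interval_ms. Hypothetical helper. */
long carrier_event_delivery_ms(long last_run_ms, long event_ms,
                               long interval_ms)
{
    assert(interval_ms > 0);
    assert(event_ms >= last_run_ms);
    long next_allowed = last_run_ms + interval_ms;
    /* Delivered immediately if the interval has already elapsed,
     * otherwise deferred until the next allowed run. */
    return event_ms > next_allowed ? event_ms : next_allowed;
}

/* A frame sent at send_ms is dropped if the bridge has not yet been
 * told about the link change. Returns 1 if dropped, 0 otherwise. */
int frame_is_dropped(long send_ms, long delivery_ms)
{
    return send_ms < delivery_ms;
}
```

For example, with the worker last run at 0 ms and carrier coming up at
10 ms, the bridge is told at 1000 ms, so an ARP sent at 11 ms is dropped
under these assumptions.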

All of this leads to a ~1s delay in setting up the network path and,
because the gratuitous ARP is dropped, the network blackout can be MUCH
longer than that. If you are trying to get a sub-second blackout on
migration this is a big problem!

It seems to me that the right thing to do here is to have the link up on
the VIF before the domain resumes on the target, but I'm guessing that
this would cause netback to have conniptions...

All suggestions welcome...
Simon

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel