WARNING - OLD ARCHIVES

This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/

xen-devel

Re: [Xen-devel] Error restoring DomU when using GPLPV

To: Keir Fraser <keir.fraser@xxxxxxxxxxxxx>
Subject: Re: [Xen-devel] Error restoring DomU when using GPLPV
From: Mukesh Rathor <mukesh.rathor@xxxxxxxxxx>
Date: Mon, 14 Sep 2009 19:25:04 -0700
Cc: Joshua West <jwest@xxxxxxxxxxxx>, Dan Magenheimer <dan.magenheimer@xxxxxxxxxx>, James Harper <james.harper@xxxxxxxxxxxxxxxx>, "Kurt C. Hackel" <kurt.hackel@xxxxxxxxxx>, "annie.li@xxxxxxxxxx" <annie.li@xxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>, "wayne.gong@xxxxxxxxxx" <wayne.gong@xxxxxxxxxx>
In-reply-to: <C6C7C941.13DA6%keir.fraser@xxxxxxxxxxxxx>
Organization: Oracle Corp
References: <C6C7C941.13DA6%keir.fraser@xxxxxxxxxxxxx>
Reply-to: mukesh.rathor@xxxxxxxxxx

OK, I've been looking at this and figured out what's going on. Annie's problem
lies in not remapping the grant frames after migration. Hence the leak:
tot_pages goes up on every migration until migration eventually fails. What I
found is that on Linux, the remapping is where the frames created by restore
(for the Xen-heap pfns) get freed back to the domain heap. So that's a fix to
be made on the Windows PV driver side.

Now back to the original problem. As you already know, because libxc is not
skipping the Xen-heap pages, tot_pages in struct domain{} temporarily goes up
by (shared-info frame + grant-table frames) until the guest remaps these pages.
Hence, migration fails if
      (max_pages - tot_pages) < (shared-info frame + grant-table frames).
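
To make the arithmetic concrete, here is a minimal sketch of that headroom
check in C. It is purely illustrative: the function name and the way the frame
counts are obtained are made up for this example, not actual Xen code.

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Illustrative only: restore temporarily needs extra frames for the
     * shared-info page and the grant-table frames on top of what the guest
     * already owns (tot_pages). If that no longer fits under max_pages,
     * the restore fails.
     */
    static bool migration_would_fail(uint64_t max_pages, uint64_t tot_pages,
                                     uint64_t shinfo_frames, uint64_t gnttab_frames)
    {
        uint64_t headroom = max_pages - tot_pages;        /* pages left under the cap */
        uint64_t extra    = shinfo_frames + gnttab_frames;

        return headroom < extra;
    }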

Occasionally, I see tot_pages nearly the same as max_pages, and I don't
know all the ways that may happen or what causes it
(by default, I see tot_pages short by 21).

Anyway, of the two solutions:

1. Always balloon down shinfo+gnttab frames: this needs to be done just
   once during load, right? I'm not sure how it would work, though, if memory
   gets ballooned up subsequently. I suppose the driver would have to intercept
   every increase in reservation and balloon down each time? (See the sketch
   after this list.)

   Also, ballooning down during the suspend call would probably be too late, right?

2. libxc fix: I wonder how much work this would be. The good thing here is
   that it would take care of both Linux and PV HVM guests, avoiding driver
   updates across many versions, and hence it is appealing to us. Can we somehow
   mark the frames as special so they get skipped? Looking at the big
   xc_domain_save function, I'm not sure how pfn_type gets set in the HVM case.
   Maybe, before the outer loop, it could ask the hypervisor for the full list of
   Xen-heap pages, but then what if a new page gets added to the list in between...
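
For option 1, a rough sketch of the balloon-down step is below. It is written
in Linux-style C against the Xen public headers; the actual GPLPV (Windows)
driver would use its own hypercall plumbing and its own page-allocation calls,
so treat the include, the HYPERVISOR_memory_op() wrapper, and the allocation
step as placeholders.

    #include <xen/memory.h>   /* struct xen_memory_reservation, XENMEM_decrease_reservation */

    /*
     * Sketch only: hand back as many ordinary RAM pages as the guest has
     * Xen-heap frames mapped (shared info + grant table), so tot_pages keeps
     * enough headroom under max_pages across save/restore.
     *
     * 'gfns' must already hold the frame numbers of nr_frames guest RAM pages
     * that the driver has allocated and taken out of normal use; that
     * allocation step is OS-specific and elided here.
     */
    static int balloon_down_xen_heap_slack(xen_pfn_t *gfns, unsigned int nr_frames)
    {
        struct xen_memory_reservation reservation = {
            .extent_order = 0,          /* single 4K pages */
            .domid        = DOMID_SELF,
        };
        int rc;

        set_xen_guest_handle(reservation.extent_start, gfns);
        reservation.nr_extents = nr_frames;

        /* Returns the number of extents actually released. */
        rc = HYPERVISOR_memory_op(XENMEM_decrease_reservation, &reservation);
        return (rc == (int)nr_frames) ? 0 : -1;
    }

Whether the driver would then have to repeat this after every subsequent
balloon-up is exactly the open question in option 1 above.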


Also, unfortunately, the failure case is not always handled properly.
If migration fails after suspend, there is no way to get the guest
back. A couple of times out of the several dozen migrations I did, I even saw
the guest disappear completely from both the source and the target on failure.


thanks,
Mukesh



Keir Fraser wrote:
Not all those pages are special. Frames fc0xx will be ACPI tables, resident
in ordinary guest memory pages, for example. Only the Xen-heap pages are
special and need to be (1) skipped; or (2) unmapped by the HVMPV drivers on
suspend; or (3) accounted for by HVMPV drivers by unmapping and freeing an
equal number of domain-heap pages. (1) is 'nicest' but actually a bit of a
pain to implement; (2) won't work well for live migration, where the pages
wouldn't get unmapped by the drivers until the last round of page copying;
and (3) was apparently tried by Annie but didn't work? I'm curious why (3)
didn't work - I can't explain that.

 -- Keir

On 05/09/2009 00:02, "Dan Magenheimer" <dan.magenheimer@xxxxxxxxxx> wrote:

On further debugging, it appears that the
p2m_size may be OK, but there's something about
those 24 "magic" gpfns that isn't quite right.

-----Original Message-----
From: Dan Magenheimer
Sent: Friday, September 04, 2009 3:29 PM
To: Wayne Gong; Annie Li; Keir Fraser
Cc: Joshua West; James Harper; xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: RE: [Xen-devel] Error restoring DomU when using GPLPV


I think I've tracked down the cause of this problem
in the hypervisor, but am unsure how to best fix it.

In tools/libxc/xc_domain_save.c, the static variable p2m_size
is said to be "number of pfns this guest has (i.e. number of
entries in the P2M)".  But apparently p2m_size is getting
set to a very large number (0x100000) regardless of the
maximum pseudophysical memory for the HVM guest.  As a result,
some "magic" pages in the 0xf0000-0xfefff range are getting
placed in the save file.  But since they are not "real"
pages, the restore process runs beyond the maximum number
of physical pages allowed for the domain and fails.
(The gpfns of the last 24 pages saved are f2020, fc000-fc012,
feffb, feffc, feffd, feffe.)

p2m_size is set in "save" with a call to a memory_op hypercall
(XENMEM_maximum_gpfn), which for an HVM domain returns
d->arch.p2m->max_mapped_pfn.  I suspect that the meaning
of max_mapped_pfn changed at some point to more match
its name, but this changed the semantics of the hypercall
as used by xc_domain_restore, resulting in this curious
problem.
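
To make the mismatch concrete, here is a tiny illustration. The guest RAM size
is invented; the 0x100000 figure is the p2m_size quoted above.

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative numbers only. */
        unsigned long max_mapped_pfn  = 0xfffff;             /* highest gpfn the domain has mapped */
        unsigned long p2m_size        = max_mapped_pfn + 1;  /* 0x100000: what XENMEM_maximum_gpfn implies */
        unsigned long guest_ram_pages = 0x40000;             /* e.g. a 1 GiB HVM guest */

        /* Save walks gpfns up to p2m_size, so it also emits the "magic"
         * frames near the top of the range; on restore those push the
         * domain past its allowed page count and the restore fails.   */
        printf("p2m_size = 0x%lx, guest RAM = 0x%lx pages\n",
               p2m_size, guest_ram_pages);
        return 0;
    }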

Any thoughts on how to fix this?

-----Original Message-----
From: Annie Li
Sent: Tuesday, September 01, 2009 10:27 PM
To: Keir Fraser
Cc: Joshua West; James Harper; xen-devel@xxxxxxxxxxxxxxxxxxx
Subject: Re: [Xen-devel] Error restoring DomU when using GPLPV



It seems this problem is connected with the grant table, not the shared-info page.
I changed some grant-table code in the winpv driver (not using the
balloon-down shinfo+gnttab method), and save/restore/migration now work
properly on Xen 3.4.

What I changed is that the winpv driver now uses the XENMEM_add_to_physmap
hypercall to map only the grant-table frames the devices require, instead of
mapping the entire 32-page grant table during initialization. It seems those
extra grant-table mappings were causing this problem.
I'm wondering whether those extra grant-table mappings are the root
cause of the migration problem, or whether it just works by luck, as with
Linux PV HVM?
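
For reference, mapping a single grant-table frame on demand with
XENMEM_add_to_physmap looks roughly like the sketch below. It is written in
Linux-style C against the Xen public headers; the winpv driver's actual code
and hypercall wrapper will differ.

    #include <xen/memory.h>   /* struct xen_add_to_physmap, XENMAPSPACE_grant_table */

    /*
     * Sketch: ask Xen to place grant-table frame 'idx' at guest pfn 'gpfn'.
     * Mapping frames one at a time, as devices actually need them, is the
     * change described above (instead of mapping all 32 frames up front).
     */
    static int map_grant_frame(unsigned long idx, xen_pfn_t gpfn)
    {
        struct xen_add_to_physmap xatp = {
            .domid = DOMID_SELF,
            .space = XENMAPSPACE_grant_table,
            .idx   = idx,
            .gpfn  = gpfn,
        };

        /* HYPERVISOR_memory_op() is the Linux-style wrapper; 0 on success. */
        return HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp);
    }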

Thanks
Annie.




_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel