feature: use libvirt directly for spawning instances
ClosedPublic

Authored by roshi on Oct 8 2015, 3:32 PM.

Diff Detail

Repository
rTCLOUD testcloud
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
roshi retitled this revision from to feature: use libvirt directly for spawning instances.
roshi updated this object.
roshi edited the test plan for this revision. (Show Details)
roshi added a reviewer: tflink.
roshi updated this revision to Diff 1610.Oct 9 2015, 9:18 PM
  • fix: remove virsh calls and replace with libvirt calls

overall, I like the approach. I noted a concern about MAC addresses and I still think that generating the xml from a real template instead of adding xml elements would suit this code better.

I'm excited to see the virsh and virt-install calls go away :)

testcloud/config.py
94

Shouldn't this file be part of the review?

testcloud/instance.py
290–307

This seems like a silly way to generate the xml files needed to define a vm. I'd think that using jinja (or some other form of templating if you object to another dep) to generate the xml would be a) less fragile and b) more flexible. With the current template and code, I have no say over how much memory the VM has, which network its adaptor is attached to etc.

testcloud/util.py
67

shouldn't you be checking to see if the pseudo-random mac address already exists in the libvirt network before using it?

roshi updated this revision to Diff 1618.Oct 14 2015, 6:29 PM
  • fix: use jinja instead of xml for setting up the domain template
roshi added inline comments.Oct 14 2015, 8:06 PM
testcloud/config.py
94

Included now.

testcloud/instance.py
290–308

Fixed in recent commit.

testcloud/util.py
67

To be honest, I don't think there's much chance of collision. Not only would we have to hit the collision, but the domains would also have to be on and connected to the same network. We could check, but then for each instance we'd have to check every other instance libvirt knows about: dump all xml for all domains, search through it and then look for matches. I just don't think it's worth the effort to be that safe - if someone hits it during general use, I'd be surprised.

Note, I still need to better generalize the template a bit - make sure everything there is workable for any libvirt host.

Looks good to me after a quick look through. haven't tested it, though

testcloud/instance.py
299

is this comment still valid?

testcloud/util.py
67

These kinda sound like famous last words to me but that's something which can be added later. do we know what the failure mode is if there's a duplicate mac? my fear is that it would be a silent "network isn't working" kinda error that would be terrible to debug

lbrabec accepted this revision.Oct 20 2015, 8:40 AM

Is this log info from cli.py still relevant?

INFO:Don't worry about these 'QEMU Driver' errors. libvirt is whiny and has no method to shut it up...

Otherwise, py.test passed, creating and running instances from command line worked, disposable client worked, LGTM.

conf/domain-template.jinja
12 ↗(On Diff #1618)

Specific CPU is not probably good idea, I've got error:

libvirt.libvirtError: unsupported configuration: guest and host CPU are not compatible: Host CPU does not provide required features: tbm, fma4, xop, misalignsse, sse4a, svm

When I changed it to kvm64, it worked well.

This revision is now accepted and ready to land.Oct 20 2015, 8:40 AM
roshi added a comment.Oct 20 2015, 3:54 PM

Awesome, I'll make the changes to docs and the template default and push it out.

conf/domain-template.jinja
12 ↗(On Diff #1618)

Yeah, Tim thought that might be an issue. I'd just copied the xml from a domain testcloud had created in the past - I'll test kvm64 here and if it works then I'll make it the default.

testcloud/instance.py
299

Probably not. I'll take it out/re word it.

testcloud/util.py
67

From running the numbers through some Birthday paradox math, it looks like we have ~10% chance of collision if there are 1800 guests running on a single host connected to a single network. I'd be really really surprised if this ever happens.

roshi updated this revision to Diff 1649.Nov 2 2015, 8:23 PM

feature: use jinja instead of munging xml directly

roshi added a comment.Nov 2 2015, 8:26 PM

So, it seems I had already updated to this revision and phab doesn't check for duplicates? I obviously did something wrong - not sure what it is though. Sorry for the spam.

roshi updated this revision to Diff 1660.Nov 3 2015, 7:22 PM
  • fix: use generic CPU in template and cleaned up comments
tflink accepted this revision.Nov 3 2015, 10:17 PM

The docs for this are a bit rough and there are still some weird cases in the code but it does work. The only potential showstopper I see is the machine type that I noted above

conf/domain-template.jinja
8 ↗(On Diff #1660)

the machine argument changes with different versions - my f21 box doesn't have pc-i440fx-2.3 as a machine type, just pc-i440fx-2.1 and older. However, the newest "version" of that machine type appears to be aliased to pc, so changing that part to machine='pc' should be more flexible and not require tweaking for every supported fedora version

testcloud/instance.py
416

The problem with this is that if there's a failure before the domain was actually created, this will TB and the only way to attempt re-creation is to remove /var/lib/testcloud/instances/<name> before trying again.

This error handling could be improved

roshi updated this revision to Diff 1666.Nov 4 2015, 9:25 PM
  • fix: make domain xml more generic
  • fix: no longer crash when destroying unregistered domain
tflink added a comment.Nov 4 2015, 9:27 PM

looks good to me

This revision was automatically updated to reflect the committed changes.