Improve VM cleanup handling on ctrl+c
ClosedPublic

Authored by kparal on Jun 29 2016, 6:43 AM.

Details

Summary

libtaskotron fails to clean up if an error happens while it is trying to stop an instance.

Test Plan

Just run a common --libvirt runtask and interrupt the process immediately or several seconds later, then check whether a zombie image is left behind.

Diff Detail

Repository
rLTRN libtaskotron
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
lnie retitled this revision from to Modify the code to improve libtaskotron's ability of handling issues. Jun 29 2016, 6:43 AM
lnie updated this object.
lnie edited the test plan for this revision. (Show Details)
lnie added reviewers: jskladan, lbrabec, kparal, mkrizek.
lnie added a subscriber: tflink.
lnie updated this object. Jun 29 2016, 6:54 AM
jskladan requested changes to this revision. Jun 29 2016, 7:45 AM
jskladan added inline comments.
libtaskotron/ext/disposable/vm.py
201–203

Wouldn't it be better to make testcloud behave properly, instead of adding workarounds downstream?

This revision now requires changes to proceed. Jun 29 2016, 7:45 AM
lnie added inline comments. Jun 30 2016, 4:47 AM
libtaskotron/ext/disposable/vm.py
201–203

Ah, I didn't see this comment until just now. Would you like to give me some hints? I bet a genius like you has one :-) Actually, I thought about making a useful change directly in testcloud before I submitted this diff, but failed to find a better solution.

lbrabec added inline comments. Jun 30 2016, 12:20 PM
libtaskotron/ext/disposable/vm.py
201–203

The method self._stop_instance() itself has the testcloud code in a try-except; it catches TestcloudInstanceError and raises TaskotronRemoteError. The change in this diff is not needed at all.

lnie added inline comments. Jul 1 2016, 7:19 AM
libtaskotron/ext/disposable/vm.py
201–203

Yeah, it does have a try-except, but that can only catch TestcloudInstanceError and cannot do anything when other errors, such as libvirtError and KeyboardInterrupt, happen. I did think about changing TestcloudInstanceError to Exception, but I chose this diff for two reasons: other contributors may later add special methods to TestcloudInstanceError and TaskotronRemoteError; and this diff makes sure _destroy_instance is called even if exceptions happen while the program is handling the exception raised in tc_instance.stop (a little rap-py).

So, @lbrabec and I have looked into this, and I can reproduce this issue (of having zombie VMs) only if I press Ctrl+c repeatedly. If I press Ctrl+c only once, the whole execution is stopped and the VM is cleaned up. This is how it looks (notice the ^C mark) when waiting for VM bootup:

[testcloud.instance:instance.py:392] 2016-07-12 11:11:16 DEBUG   Polling domain for active network interface
^C[libtaskotron:minion.py:231] 2016-07-12 11:11:20 WARNING Task execution was interrupted by an error, doing emergency cleanup.
[libtaskotron:minion.py:239] 2016-07-12 11:11:20 INFO    Shutting down the minion...
[libtaskotron:vm.py:173] 2016-07-12 11:11:20 DEBUG   Stopping instance taskotron-20160712_111113_985688
[testcloud.instance:instance.py:417] 2016-07-12 11:11:20 DEBUG   stopping instance taskotron-20160712_111113_985688.
[libtaskotron:vm.py:185] 2016-07-12 11:11:22 DEBUG   destroying instance taskotron-20160712_111113_985688
[testcloud.instance:instance.py:429] 2016-07-12 11:11:22 DEBUG   removing instance taskotron-20160712_111113_985688 from libvirt.
[testcloud.instance:instance.py:446] 2016-07-12 11:11:22 DEBUG   Unregistering domain from libvirt.
[testcloud.instance:instance.py:448] 2016-07-12 11:11:22 DEBUG   removing instance /var/lib/testcloud/instances/taskotron-20160712_111113_985688 from disk
[libtaskotron:logger.py:88] 2016-07-12 11:11:22 CRITICAL Traceback (most recent call last):
  File "/home/kparal/devel/taskotron/env_taskotron/bin/runtask", line 9, in <module>
    load_entry_point('libtaskotron==0.4.4', 'console_scripts', 'runtask')()
  File "/home/kparal/devel/taskotron/libtaskotron/libtaskotron/main.py", line 163, in main
    overlord.start()
  File "/home/kparal/devel/taskotron/libtaskotron/libtaskotron/overlord.py", line 92, in start
    runner.execute()
  File "/home/kparal/devel/taskotron/libtaskotron/libtaskotron/minion.py", line 201, in execute
    task_vm.prepare(**env)
  File "/home/kparal/devel/taskotron/libtaskotron/libtaskotron/ext/disposable/vm.py", line 140, in prepare
    self._prepare_instance(tc_image)
  File "/home/kparal/devel/taskotron/libtaskotron/libtaskotron/ext/disposable/vm.py", line 103, in _prepare_instance
    tc_instance.start()
  File "/home/kparal/devel/testcloud/testcloud/instance.py", line 408, in start
    time.sleep(poll_tick)
KeyboardInterrupt

And this is how it looks during task execution:

[libtaskotron:remote_exec.py:116] 2016-07-12 11:13:12 DEBUG   Running command on remote host: dnf --cacheonly repolist || dnf makecache
$ dnf --cacheonly repolist || dnf makecache
^C[libtaskotron:minion.py:231] 2016-07-12 11:13:13 WARNING Task execution was interrupted by an error, doing emergency cleanup.
[libtaskotron:minion.py:239] 2016-07-12 11:13:13 INFO    Shutting down the minion...
[libtaskotron:vm.py:173] 2016-07-12 11:13:13 DEBUG   Stopping instance taskotron-20160712_111257_430451
[testcloud.instance:instance.py:417] 2016-07-12 11:13:13 DEBUG   stopping instance taskotron-20160712_111257_430451.
[libtaskotron:vm.py:185] 2016-07-12 11:13:15 DEBUG   destroying instance taskotron-20160712_111257_430451
[testcloud.instance:instance.py:429] 2016-07-12 11:13:15 DEBUG   removing instance taskotron-20160712_111257_430451 from libvirt.
[testcloud.instance:instance.py:446] 2016-07-12 11:13:15 DEBUG   Unregistering domain from libvirt.
[testcloud.instance:instance.py:448] 2016-07-12 11:13:15 DEBUG   removing instance /var/lib/testcloud/instances/taskotron-20160712_111257_430451 from disk
[libtaskotron:logger.py:88] 2016-07-12 11:13:15 CRITICAL Traceback (most recent call last):
  File "/home/kparal/devel/taskotron/env_taskotron/bin/runtask", line 9, in <module>
    load_entry_point('libtaskotron==0.4.4', 'console_scripts', 'runtask')()
  File "/home/kparal/devel/taskotron/libtaskotron/libtaskotron/main.py", line 163, in main
    overlord.start()
  File "/home/kparal/devel/taskotron/libtaskotron/libtaskotron/overlord.py", line 92, in start
    runner.execute()
  File "/home/kparal/devel/taskotron/libtaskotron/libtaskotron/minion.py", line 214, in execute
    self._prepare_task()
  File "/home/kparal/devel/taskotron/libtaskotron/libtaskotron/minion.py", line 66, in _prepare_task
    self.ssh.cmd('dnf --cacheonly repolist || dnf makecache')
  File "/home/kparal/devel/taskotron/libtaskotron/libtaskotron/remote_exec.py", line 140, in cmd
    time.sleep(0.1)
KeyboardInterrupt

Notice that this line gets printed out:

Task execution was interrupted by an error, doing emergency cleanup.

However, if I press Ctrl+c repeatedly, I interrupt not only the task execution but also the cleanup process, and the VM is left running. I'm not sure we can do anything about it, because the user might want to interrupt the cleanup phase (if it takes too long, for example).

What we can do is make the cleanup faster. There is currently a hardcoded 2-second delay between stopping and undefining the VM, and according to my testing, it's completely unnecessary (even when trying the libvirt API directly, the methods can be called immediately and there's no problem with it). Maybe there used to be some problem with libvirt, but it's not there anymore. Reducing the cleanup phase by 2 seconds will help a lot to prevent the user from impatiently pressing Ctrl+c multiple times.

I'll take this revision and update it with a diff implementing this.

@lnie, is there something I missed? Can you reproduce any problems with zombie VMs when pressing Ctrl+c just once?

kparal commandeered this revision. Jul 12 2016, 12:24 PM
kparal edited reviewers, added: lnie; removed: kparal.
kparal updated this revision to Diff 2369. Jul 12 2016, 12:25 PM
vm: don't wait 2 seconds when destroying VMs

We tested libvirt API and found no issues in very fast sequential
execution of destroy/undefine calls. There might have been an issue in
the past, but currently there seems to be no reason to wait during VM
cleanup. By reducing wait time, the user experience is better when
trying to interrupt a task using ctrl+c.

Also fix a few crashes when logging messages (incorrect syntax).

To expand on this even further, the libvirt API not only seems to have no problems with fast calls of destroy; undefine, but it even allows you to undefine a VM before destroying it (the VM is still visible, and once it is stopped, it disappears from the system). So there should definitely be no race condition.

There could be some problems introduced by testcloud, because it seems to impose an unnecessary constraint that a VM must be stopped before removing it (which libvirt API does not require, as described above). So if there are any problems with removing the sleep delay, I'd look into testcloud first.
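The ordering described above can be sketched roughly as follows. The function only assumes an object that behaves like a libvirt-python virDomain (undefine(), isActive(), destroy()); remove_domain and FakeDomain are hypothetical names introduced for this illustration so that the sketch runs without a libvirt connection.

```python
def remove_domain(dom):
    """Undefine a libvirt-style domain first, then stop it if it is running."""
    # Per the observation above, undefining a running domain leaves it
    # visible until it is stopped.
    dom.undefine()
    if dom.isActive():
        # destroy() is libvirt's term for a forced power-off.
        dom.destroy()


class FakeDomain(object):
    """Hypothetical in-memory stand-in for a virDomain, for illustration only."""

    def __init__(self):
        self.defined = True
        self.active = True
        self.calls = []

    def undefine(self):
        self.calls.append('undefine')
        self.defined = False

    def isActive(self):
        return 1 if self.active else 0

    def destroy(self):
        self.calls.append('destroy')
        self.active = False


dom = FakeDomain()
remove_domain(dom)
print(dom.calls)
```

Against a real hypervisor, dom would typically come from something like libvirt.open('qemu:///system').lookupByName(name); no sleep is placed between the two calls.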

@tflink, you wrote the sleep code in 82d0e32a. Do you know about any reason to keep it?

kparal retitled this revision from Modify the code to improve libtaskotron's ability of handling issues to Improve VM cleanup handling on ctrl+c. Jul 12 2016, 12:39 PM
In D908#17486, @kparal wrote:

To expand on this even further, the libvirt API not only seems to have no problems with fast calls of destroy; undefine, but it even allows you to undefine a VM before destroying it (the VM is still visible, and once it is stopped, it disappears from the system). So there should definitely be no race condition.

There could be some problems introduced by testcloud, because it seems to impose an unnecessary constraint that a VM must be stopped before removing it (which libvirt API does not require, as described above). So if there are any problems with removing the sleep delay, I'd look into testcloud first.

I don't think that it's unreasonable to have the VM stopped before it's removed. How many use cases are there where it's advantageous to allow deletion before stopping?

That being said, there should be a ticket for testcloud to allow "force delete" of instances (stop if not already stopped, then delete) but that work has yet to be done.
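As a rough illustration of the "force delete" idea (stop if not already stopped, then delete), a helper could look like the sketch below. Everything here is hypothetical: FakeTestcloudInstance only mimics the stop-before-remove constraint discussed in this thread, and force_delete is not an existing testcloud function.

```python
class FakeTestcloudInstance(object):
    """Hypothetical instance that refuses removal while it is running."""

    def __init__(self, running):
        self.running = running
        self.removed = False

    def stop(self):
        self.running = False

    def remove(self):
        if self.running:
            raise RuntimeError('instance must be stopped before removal')
        self.removed = True


def force_delete(instance):
    """Stop the instance if it is still running, then remove it."""
    if instance.running:
        instance.stop()
    instance.remove()
```

The caller no longer needs to know whether the instance is running, while the underlying constraint (no removal of a running instance) stays in place.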

@tflink, you wrote the sleep code in 82d0e32a. Do you know about any reason to keep it?

As I recall, that was added when testcloud was still doing subprocess calls to virsh. With the current testcloud code, I don't think it's needed anymore.

lnie added a comment. Jul 13 2016, 4:31 AM

@lnie, is there something I missed? Can you reproduce any problems with zombie VMs when pressing Ctrl+c just once?

Bingo, I'm afraid there is :) "Polling domain for active network interface" shows that the program is in testcloud's start step ("Running command on remote host: dnf --cacheonly repolist || dnf makecache" shows that the created VM has already started successfully). The bug is 100% reproducible if you press ^C early enough, or intentionally raise an exception in the test code, before the program reaches testcloud's spawn_vm step, which means the exception occurs before libvirt defines a VM.
This diff is meant to take care not only of ^C, but also of all exceptions that happen before a VM is defined, such as exceptions raised in _create_local_disk, _generate_seed_image, or even in spawn_vm and write_domain_xml :)
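One way to picture the failure mode described here is a preparation routine that creates files step by step and is interrupted partway. The sketch below shows a generic pattern for removing the partially populated directory when any step fails; prepare_instance_dir and the step callables are hypothetical and do not correspond to actual testcloud or libtaskotron functions.

```python
import os
import shutil


def prepare_instance_dir(base_dir, name, build_steps):
    """Create an instance directory and run the build steps inside it.

    If any step raises (KeyboardInterrupt included), the partially
    populated directory is removed before the exception propagates.
    """
    instance_dir = os.path.join(base_dir, name)
    os.makedirs(instance_dir)
    try:
        for step in build_steps:
            step(instance_dir)
    except BaseException:
        # BaseException is used on purpose so that ctrl+c is covered too.
        shutil.rmtree(instance_dir, ignore_errors=True)
        raise
    return instance_dir
```

Catching BaseException and re-raising keeps the original error visible to the caller while making sure no half-built instance directory is left on disk.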

lnie added a comment. Jul 13 2016, 4:43 AM
In D908#17486, @kparal wrote:

To expand on this even further, the libvirt API not only seems to have no problems with fast calls of destroy; undefine, but it even allows you to undefine a VM before destroying it (the VM is still visible, and once it is stopped, it disappears from the system). So there should definitely be no race condition.
There could be some problems introduced by testcloud, because it seems to impose an unnecessary constraint that a VM must be stopped before removing it (which libvirt API does not require, as described above). So if there are any problems with removing the sleep delay, I'd look into testcloud first.

As for the destroy-before-stop thing, perhaps because of my OCD, I always feel that it will harm my dear host machine, but if you are not in favor of the constraint, I can remove it in testcloud :)

lnie added a comment. Jul 13 2016, 6:47 AM

However, if press Ctrl+c repeatedly, not only I interrupt the task execution, I also interrupt the cleanup process, and the VM is left running. I'm not sure we can do anything about it, because the user might want to interrupt the cleanup phase (if it takes too long, for example).

Considering that, we can add something like the following to teardown() to let users get what they want :)
custom_no_destroy = False

tc_instance = self._check_existing_instance(should_exist=True)
try:
    self._stop_instance(tc_instance)
except KeyboardInterrupt:
    # the user interrupted the stop phase, so skip destroying the instance
    custom_no_destroy = True
except Exception as e:
    raise exc.TaskotronRemoteError(e)
finally:
    # destroy the instance unless the user explicitly interrupted the stop
    if not custom_no_destroy:
        self._destroy_instance(tc_instance)
In D908#17511, @tflink wrote:
In D908#17486, @kparal wrote:

To expand on this even further, the libvirt API not only seems to have no problems with fast calls of destroy; undefine, but it even allows you to undefine a VM before destroying it (the VM is still visible, and once it is stopped, it disappears from the system). So there should definitely be no race condition.

There could be some problems introduced by testcloud, because it seems to impose an unnecessary constraint that a VM must be stopped before removing it (which libvirt API does not require, as described above). So if there are any problems with removing the sleep delay, I'd look into testcloud first.

I don't think that it's unreasonable to have the VM stopped before it's removed. How many use cases are there where it's advantageous to allow deletion before stopping?

That being said, there should be a ticket for testcloud to allow "force delete" of instances (stop if not already stopped, then delete) but that work has yet to be done.

Do you mean this https://github.com/Rorosha/testcloud/commit/1ac7df6cd897421262309879d965dfb0190979d7 ?

So, after talking to Lili about this, it turns out that when she was talking about a "zombie VM", she meant leftover files in /var/lib/testcloud/instances. Yes, some files are generated and not deleted if you hit ctrl+c at the right time. However, that should be solved inside testcloud, not libtaskotron. I recommended that Lili file a ticket about it. Also, there's a very narrow window during which you can hit ctrl+c and end up with the instance registered in libvirt but not cleaned up. I know where the problem is and will come up with a new diff for that.

In this diff, we managed to get rid of the 2-second wait during cleanup, which is a good thing. As @tflink said, it has been present since the time testcloud was calling libvirt using cmdline and is probably not needed anymore. Let's push this, and deal with other issues in different tickets.

mkrizek accepted this revision. Jul 14 2016, 8:53 AM

LGTM

lbrabec accepted this revision. Jul 14 2016, 9:00 AM

LGTM

This revision was automatically updated to reflect the committed changes.