libtaskotron fails to clean up if errors happen when it tries to stop an instance
Details
Diff Detail
- Repository
- rLTRN libtaskotron
- Lint
Automatic diff as part of commit; lint not applicable.
- Unit
Automatic diff as part of commit; unit tests not applicable.
| libtaskotron/ext/disposable/vm.py | ||
|---|---|---|
| 201–203 | Wouldn't it be better to make testcloud behave properly, instead of adding workarounds downstream of it? | |
| libtaskotron/ext/disposable/vm.py | ||
|---|---|---|
| 201–203 | Ah, I didn't see this comment until just now. Could you give me some hints? I bet you have a good one :-) Actually, I thought about making a useful change directly in testcloud before I submitted this diff, but failed to find a better solution. | |
| libtaskotron/ext/disposable/vm.py | ||
|---|---|---|
| 201–203 | The method self._stop_instance() already wraps the testcloud call in try-except: it catches TestcloudInstanceError and raises TaskotronRemoteError. The change in this diff is not needed at all. | |
| libtaskotron/ext/disposable/vm.py | ||
|---|---|---|
| 201–203 | Yes, it does have a try-except, but that only catches TestcloudInstanceError and cannot do anything when other errors, such as libvirtError or KeyboardInterrupt, happen. I did think about changing TestcloudInstanceError to Exception, but I chose this diff for two reasons: other contributors may add special methods to TestcloudInstanceError and TaskotronRemoteError, and this diff makes sure _destroy_instance is called even if an exception happens while the program is handling the exception raised in tc_instance.stop (a little rap-like, I know). | |
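The point made in the inline comment above can be illustrated with a minimal, self-contained sketch. The exception classes and `FakeInstance` below are stand-ins written for this example (they are not the real testcloud or libtaskotron classes), and `teardown_narrow` / `teardown_with_finally` are hypothetical names. The sketch only shows that a narrow `except` clause lets a `KeyboardInterrupt` skip the destroy step, while a `finally` clause does not.

```python
class TestcloudInstanceError(Exception):
    """Stand-in for testcloud's exception of the same name."""


class TaskotronRemoteError(Exception):
    """Stand-in for libtaskotron's exception of the same name."""


class FakeInstance(object):
    """Hypothetical instance whose stop() fails with a configurable error."""

    def __init__(self, stop_error=None):
        self.stop_error = stop_error
        self.destroyed = False

    def stop(self):
        if self.stop_error is not None:
            raise self.stop_error

    def destroy(self):
        self.destroyed = True


def teardown_narrow(instance):
    """Only TestcloudInstanceError is handled; any other error skips destroy()."""
    try:
        instance.stop()
    except TestcloudInstanceError as e:
        raise TaskotronRemoteError(e)
    instance.destroy()


def teardown_with_finally(instance):
    """destroy() runs no matter which exception stop() raises."""
    try:
        instance.stop()
    except TestcloudInstanceError as e:
        raise TaskotronRemoteError(e)
    finally:
        instance.destroy()
```

With `FakeInstance(stop_error=KeyboardInterrupt())`, `teardown_narrow` propagates the interrupt and leaves `destroyed` set to `False`, whereas `teardown_with_finally` propagates the same interrupt but sets `destroyed` to `True` first.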
So, @lbrabec and I have looked into this, and I can reproduce this issue (of having zombie VMs) if and only if I press Ctrl+c repeatedly. If I press Ctrl+c only once, the whole execution is stopped and the VM is cleaned up. This is how it looks (notice the ^C mark) when waiting for VM bootup:
[testcloud.instance:instance.py:392] 2016-07-12 11:11:16 DEBUG Polling domain for active network interface
^C[libtaskotron:minion.py:231] 2016-07-12 11:11:20 WARNING Task execution was interrupted by an error, doing emergency cleanup.
[libtaskotron:minion.py:239] 2016-07-12 11:11:20 INFO Shutting down the minion...
[libtaskotron:vm.py:173] 2016-07-12 11:11:20 DEBUG Stopping instance taskotron-20160712_111113_985688
[testcloud.instance:instance.py:417] 2016-07-12 11:11:20 DEBUG stopping instance taskotron-20160712_111113_985688.
[libtaskotron:vm.py:185] 2016-07-12 11:11:22 DEBUG destroying instance taskotron-20160712_111113_985688
[testcloud.instance:instance.py:429] 2016-07-12 11:11:22 DEBUG removing instance taskotron-20160712_111113_985688 from libvirt.
[testcloud.instance:instance.py:446] 2016-07-12 11:11:22 DEBUG Unregistering domain from libvirt.
[testcloud.instance:instance.py:448] 2016-07-12 11:11:22 DEBUG removing instance /var/lib/testcloud/instances/taskotron-20160712_111113_985688 from disk
[libtaskotron:logger.py:88] 2016-07-12 11:11:22 CRITICAL Traceback (most recent call last):
File "/home/kparal/devel/taskotron/env_taskotron/bin/runtask", line 9, in <module>
load_entry_point('libtaskotron==0.4.4', 'console_scripts', 'runtask')()
File "/home/kparal/devel/taskotron/libtaskotron/libtaskotron/main.py", line 163, in main
overlord.start()
File "/home/kparal/devel/taskotron/libtaskotron/libtaskotron/overlord.py", line 92, in start
runner.execute()
File "/home/kparal/devel/taskotron/libtaskotron/libtaskotron/minion.py", line 201, in execute
task_vm.prepare(**env)
File "/home/kparal/devel/taskotron/libtaskotron/libtaskotron/ext/disposable/vm.py", line 140, in prepare
self._prepare_instance(tc_image)
File "/home/kparal/devel/taskotron/libtaskotron/libtaskotron/ext/disposable/vm.py", line 103, in _prepare_instance
tc_instance.start()
File "/home/kparal/devel/testcloud/testcloud/instance.py", line 408, in start
time.sleep(poll_tick)
KeyboardInterrupt
And this is how it looks during task execution:
[libtaskotron:remote_exec.py:116] 2016-07-12 11:13:12 DEBUG Running command on remote host: dnf --cacheonly repolist || dnf makecache
$ dnf --cacheonly repolist || dnf makecache
^C[libtaskotron:minion.py:231] 2016-07-12 11:13:13 WARNING Task execution was interrupted by an error, doing emergency cleanup.
[libtaskotron:minion.py:239] 2016-07-12 11:13:13 INFO Shutting down the minion...
[libtaskotron:vm.py:173] 2016-07-12 11:13:13 DEBUG Stopping instance taskotron-20160712_111257_430451
[testcloud.instance:instance.py:417] 2016-07-12 11:13:13 DEBUG stopping instance taskotron-20160712_111257_430451.
[libtaskotron:vm.py:185] 2016-07-12 11:13:15 DEBUG destroying instance taskotron-20160712_111257_430451
[testcloud.instance:instance.py:429] 2016-07-12 11:13:15 DEBUG removing instance taskotron-20160712_111257_430451 from libvirt.
[testcloud.instance:instance.py:446] 2016-07-12 11:13:15 DEBUG Unregistering domain from libvirt.
[testcloud.instance:instance.py:448] 2016-07-12 11:13:15 DEBUG removing instance /var/lib/testcloud/instances/taskotron-20160712_111257_430451 from disk
[libtaskotron:logger.py:88] 2016-07-12 11:13:15 CRITICAL Traceback (most recent call last):
File "/home/kparal/devel/taskotron/env_taskotron/bin/runtask", line 9, in <module>
load_entry_point('libtaskotron==0.4.4', 'console_scripts', 'runtask')()
File "/home/kparal/devel/taskotron/libtaskotron/libtaskotron/main.py", line 163, in main
overlord.start()
File "/home/kparal/devel/taskotron/libtaskotron/libtaskotron/overlord.py", line 92, in start
runner.execute()
File "/home/kparal/devel/taskotron/libtaskotron/libtaskotron/minion.py", line 214, in execute
self._prepare_task()
File "/home/kparal/devel/taskotron/libtaskotron/libtaskotron/minion.py", line 66, in _prepare_task
self.ssh.cmd('dnf --cacheonly repolist || dnf makecache')
File "/home/kparal/devel/taskotron/libtaskotron/libtaskotron/remote_exec.py", line 140, in cmd
time.sleep(0.1)
KeyboardInterrupt
Notice that this line gets printed out:
Task execution was interrupted by an error, doing emergency cleanup.
However, if I press Ctrl+c repeatedly, I interrupt not only the task execution but also the cleanup process, and the VM is left running. I'm not sure we can do anything about it, because the user might want to interrupt the cleanup phase (if it takes too long, for example).
What we can do is make the cleanup faster. There is currently a hardcoded 2 second delay between stopping and undefining the VM, and according to my testing, it's completely unnecessary (even when calling the libvirt API directly, the methods can be called immediately one after another without any problem). Maybe there used to be some problem with libvirt, but it's not there anymore. Shortening the cleanup phase by 2 seconds will go a long way towards keeping the user from impatiently pressing Ctrl+c multiple times.
I'll take this revision and update it with a diff implementing this.
@lnie, is there something I missed? Can you reproduce any problems with zombie VMs when pressing Ctrl+c just once?
vm: don't wait 2 seconds when destroying VMs
We tested the libvirt API and found no issues in very fast sequential execution of destroy/undefine calls. There might have been an issue in the past, but currently there seems to be no reason to wait during VM cleanup. By reducing wait time, the user experience is better when trying to interrupt a task using ctrl+c.
Also fix a few crashes when logging messages (incorrect syntax).
To expand on this even further, the libvirt API not only seems to have no problems with a destroy call immediately followed by undefine, it even allows you to undefine a VM before destroying it (the VM is still visible, and once it is stopped, it disappears from the system). So there should definitely be no race condition.
There could be some problems introduced by testcloud, because it seems to impose an unnecessary constraint that a VM must be stopped before removing it (which libvirt API does not require, as described above). So if there are any problems with removing the sleep delay, I'd look into testcloud first.
@tflink, you wrote the sleep code in 82d0e32a. Do you know about any reason to keep it?
I don't think that it's unreasonable to have the VM stopped before it's removed. How many use cases are there where it's advantageous to allow deletion before stopping?
That being said, there should be a ticket for testcloud to allow "force delete" of instances (stop if not already stopped, then delete) but that work has yet to be done.
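A "force delete" as described above (stop if not already stopped, then delete) could look roughly like the sketch below. It is not testcloud code: `force_remove` and `FakeDomain` are hypothetical names for this example. The function only assumes an object that behaves like a libvirt-python `virDomain`, that is, one offering `isActive()`, `destroy()` and `undefine()`. With libvirt-python, such an object would typically come from `conn.lookupByName(name)`.

```python
def force_remove(domain):
    """Stop the domain if it is running, then remove its definition.

    `domain` is assumed to behave like a libvirt virDomain. There is no
    sleep between the two calls.
    """
    if domain.isActive():
        domain.destroy()   # hard power-off of the running guest
    domain.undefine()      # drop the persistent definition


class FakeDomain(object):
    """Hypothetical stand-in so the sketch runs without libvirt installed."""

    def __init__(self, active):
        self.active = active
        self.calls = []

    def isActive(self):
        return 1 if self.active else 0

    def destroy(self):
        self.calls.append("destroy")
        self.active = False

    def undefine(self):
        self.calls.append("undefine")
```

For a running `FakeDomain`, the recorded calls are `destroy` followed by `undefine`; for a stopped one, only `undefine` is called.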
> @tflink, you wrote the sleep code in 82d0e32a. Do you know about any reason to keep it?
As I recall, that was added when testcloud was still doing subprocess calls to virsh. With the current testcloud code, I don't think it's needed anymore.
> @lnie, is there something I missed? Can you reproduce any problems with zombie VMs when pressing Ctrl+c just once?
Bingo, I'm afraid there is :) "Polling domain for active network interface" shows that the program is in testcloud's start step ("Running command on remote host: dnf --cacheonly repolist || dnf makecache" shows that the created VM has already started successfully). The bug is 100% reproducible if you press Ctrl+c at the right moment, or intentionally raise an exception in the test code, before the program reaches testcloud's spawn_vm step, which means the exception occurs before libvirt defines a VM.
This diff is meant to handle not only Ctrl+c, but also every exception that happens before a VM is defined, such as exceptions raised in _create_local_disk, _generate_seed_image, or even in spawn_vm and write_domain_xml :)
As for the destroy-before-stop thing: perhaps because of my OCD, I always feel that it will harm my beloved host machine, but if you are not in favor of the constraint, I can remove it in testcloud :)
> However, if press Ctrl+c repeatedly, not only I interrupt the task execution, I also interrupt the cleanup process, and the VM is left running. I'm not sure we can do anything about it, because the user might want to interrupt the cleanup phase (if it takes too long, for example).
Considering that, we could add something like the following to teardown() to give users what they want :)
custom_no_destroy = False
tc_instance = self._check_existing_instance(should_exist=True)
try:
    self._stop_instance(tc_instance)
except KeyboardInterrupt:
    # the user interrupted the stop phase, so skip destroying the instance
    custom_no_destroy = True
except Exception as e:
    raise exc.TaskotronRemoteError(e)
finally:
    if not custom_no_destroy:
        self._destroy_instance(tc_instance)
So, after talking to Lili about this, it turns out that when she was talking about a "zombie VM", she meant leftover files in /var/lib/testcloud/instances. Yes, some files are generated and not deleted if you hit ctrl+c at the right time. However, that should be solved inside testcloud, not libtaskotron. I recommended that Lili file a ticket about it. Also, there's a very narrow window during which you can still hit ctrl+c and end up with the instance registered in libvirt but not cleaned up. I know where the problem is and will come up with a new diff for that.
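One way the leftover-files case could be approached is sketched below. This is not testcloud's implementation: `instance_dir` is a hypothetical helper, and the directory layout is an assumption for illustration only. The idea is to remove a partially prepared instance directory whenever setup is interrupted, including by Ctrl+c, which is why the sketch catches `BaseException` rather than `Exception`.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def instance_dir(base_dir, name):
    """Create an instance directory and remove it again if setup fails.

    BaseException is caught so that KeyboardInterrupt (Ctrl+c) also triggers
    the cleanup; the exception is always re-raised afterwards.
    """
    path = os.path.join(base_dir, name)
    os.makedirs(path)
    try:
        yield path
    except BaseException:
        shutil.rmtree(path, ignore_errors=True)
        raise


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    try:
        with instance_dir(base, "taskotron-example") as path:
            # simulate Ctrl+c while images are still being prepared
            raise KeyboardInterrupt()
    except KeyboardInterrupt:
        pass
    # the partially prepared directory is gone again
    print(os.path.exists(os.path.join(base, "taskotron-example")))
    shutil.rmtree(base)
```

If the body of the `with` block finishes normally, the directory is kept, so a successfully prepared instance is unaffected.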
In this diff, we managed to get rid of the 2 second wait during cleanup, which is a good thing. As @tflink said, it dates back to the time when testcloud was calling libvirt through the command line, and it is probably not needed anymore. Let's push this, and deal with the other issues in separate tickets.