clean up the space for users when testcloud is interrupted by an error
Abandoned · Public

Authored by lnie on Jun 29 2016, 7:48 AM.

Details

Reviewers
kparal
jskladan
Group Reviewers
testcloud
Summary

clean up the space for users when testcloud is interrupted by an error

Test Plan

Feed an error to testcloud, for example by stopping the process with "Ctrl-C", and check whether the space is cleaned up.

Diff Detail

Repository
rTCLOUD testcloud
Branch
angel
Lint
Lint OK
Unit
Unit Tests OK
Build Status
Buildable 689
Build 689: arc lint + arc unit
lnie retitled this revision from to clean up the space for users when testcloud is interrupted by an error.Jun 29 2016, 7:48 AM
lnie updated this object.
lnie edited the test plan for this revision.
lnie added a reviewer: testcloud.
jskladan requested changes to this revision.Jun 29 2016, 7:50 AM
jskladan added a reviewer: jskladan.
jskladan added a subscriber: jskladan.
jskladan added inline comments.
testcloud/image.py
184–211 ↗(On Diff #2318)

These changes do not seem to be a part of the diff as described. Please make sure only the relevant changes are in each Differential request.

This revision now requires changes to proceed.Jun 29 2016, 7:50 AM

Based on how "messy" your diffs are, I suggest you take some time with Git, and use a different branch for each diff. If you need a "diff on top of a diff", then make a new branch on top of what's supposed to be the diff's base, and arc diff against that.

lnie updated this revision to Diff 2319.Jun 29 2016, 7:59 AM

do the clean up work

lnie added a comment.Jun 29 2016, 8:10 AM

Based on how "messy" your diffs are, I suggest you take some time with Git, and use a different branch for each diff. If you need a "diff on top of a diff", then make a new branch on top of what's supposed to be the diff's base, and arc diff against that.

Shocking commenting speed! :) I saw the problem right after I submitted the code and rushed to fix it. I thought I was quick enough with the update, but I was still slower than you. Sigh...
Yeah, I dragged in image.py again. I'm going to work on my git skills now.

jskladan added inline comments.Jul 7 2016, 1:48 PM
testcloud/cli.py
311

Is this wise? I don't know whether it's intended or not, but this means that on ^C, the machine will be auto-destroyed, as you noted in the DR - that might not be suitable - for example when trying to debug a problem, there might be a need to keep the machine in that state. Other than that the code looks good, I'd like to see some rationale.

lnie added inline comments.Jul 11 2016, 4:44 AM
testcloud/cli.py
311

From my narrow point of view, users only press '^C' when they intend to stop the process or to test how the code reacts to '^C'. Or do you mean we should add a 'no-destroy' option to testcloud, so that users can investigate the surviving VM, if they like, when they see an error? I vote for adding a 'no-destroy' option :)

kparal requested changes to this revision.Aug 9 2016, 3:19 PM
kparal added a reviewer: kparal.
kparal added a subscriber: kparal.

Overall, I'm having trouble identifying the use case when we want to do this auto-destroy. I think it only makes sense for testcloud instance create. But even then, if the error occurred before the instance is created, there's no instance to tear down. If the error occurred after the instance was created, well - which kinds of error are we talking about here? Please give me some real-world example. I can only imagine Ctrl+C, and for that particular use case, I'd rather keep the instance running (and print a helpful message explaining that). When you create an instance, it takes 10-15 seconds before the VM boots up and testcloud prints out its IP. If you don't want to wait, because you don't need that IP (you have other ways of accessing the machine), you can simply hit Ctrl+C instead of waiting. That doesn't mean you want the instance to be destroyed. I don't know if this was supposed to cover Ctrl+C signal or some other errors.
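The Ctrl+C behavior preferred here (keep the instance running and print a helpful message) could look roughly like the sketch below. It is only an illustration: wait_for_ip, its parameters and the get_ip callback are hypothetical names, not part of the testcloud code.

```python
import time


def wait_for_ip(get_ip, timeout=30.0, poll_interval=1.0, sleep=time.sleep):
    """Poll get_ip() until it returns an address, the timeout expires,
    or the user presses Ctrl+C.

    Returns (ip, reason), where reason is 'ok', 'timeout' or 'interrupted'.
    Nothing is stopped or removed in any of the three cases.
    """
    waited = 0.0
    try:
        while waited < timeout:
            ip = get_ip()
            if ip:
                return ip, "ok"
            sleep(poll_interval)
            waited += poll_interval
    except KeyboardInterrupt:
        # The user does not want to wait; that is not a request to destroy.
        print("Interrupted while waiting for the IP address; "
              "the instance has been left running.")
        return None, "interrupted"
    print("Timed out waiting for the IP address; "
          "the instance has been left running.")
    return None, "timeout"
```

The caller can decide what to do with the 'interrupted' and 'timeout' results; the sketch itself never touches the instance.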

testcloud/cli.py
303

This should be a comment, not a string.

305–307

I don't understand this.

312

It should not be wrapped as TestcloudCliError, because that is defined like this:

class TestcloudCliError(TestcloudException):
    """Exception for errors having to do with Testcloud CLI processing"""

and this might not be related to CLI. It could be wrapped in TestcloudException, but only in case it's not a subclass of TestcloudException already (in that case don't wrap it again, just raise it directly).

By the way, it seems that DomainNotFoundError is a subclass of BaseException instead of TestcloudException by mistake, so that should be fixed as well (either as part of this diff or separately).
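The wrapping rule and the subclass fix described above can be sketched as follows. This is only an illustration: the class bodies are stand-ins for the real ones in testcloud, the base class of TestcloudException is assumed to be Exception here, and run_guarded is a hypothetical helper, not an existing testcloud function.

```python
class TestcloudException(Exception):
    """Stand-in for the common testcloud base exception (base class assumed)."""


class TestcloudCliError(TestcloudException):
    """Exception for errors having to do with Testcloud CLI processing"""


class DomainNotFoundError(TestcloudException):
    """Derives from TestcloudException instead of BaseException."""


def run_guarded(func, *args):
    """Hypothetical helper: call func and normalize the errors it raises."""
    try:
        return func(*args)
    except TestcloudException:
        # Already one of ours: do not wrap it again, raise it directly.
        raise
    except Exception as exc:
        # Anything else is wrapped in the generic base class, not in
        # TestcloudCliError, because it may be unrelated to the CLI.
        raise TestcloudException("unexpected error: {!r}".format(exc))
```

Note that KeyboardInterrupt derives from BaseException, so this sketch does not wrap it.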

314–321

Currently, all of this is applied even for testcloud image subcommand, not just testcloud instance subcommand. Which of course doesn't make sense, if I want to list images, there's no instance to stop or destroy.

Furthermore, if any error occurs, why would we want to destroy the instance? Let's say you call instance stop and it fails with some error. This code would remove that instance, which is something you didn't want. All your data gone.

I can imagine this could be useful for taskotron instance create subcommand, but I don't see the use case for anything else.

319

The same concern as above.

This revision now requires changes to proceed.Aug 9 2016, 3:19 PM
lnie added a comment.Aug 10 2016, 7:56 AM

Overall, I'm having trouble identifying the use case when we want to do this auto-destroy. I think it only makes sense for testcloud instance create. But even then, if the error occurred before the instance is created, there's no instance to tear down.

Obviously I'm more or less copying what libtaskotron does when it sees an error. The difference, of course, is that libtaskotron sees many more errors, of all kinds. This diff mainly concerns the instance create subcommand; its ultimate purpose is to do some cleanup work. If an error happens before an instance is created, it just cleans up the leftover files (maybe we can add a no-destroy option to testcloud in case the leftover files are useful). If it happens after, it stops the instance and then removes it. Besides that, it also improves instance remove: if an error happens when users try to remove an instance, this diff handles it and makes sure the instance is removed successfully.

If the error occurred after the instance was created, well - which kinds of error are we talking about here? Please give me some real-world example.

All kinds of errors; it's hard to come up with one right away. How about when users try to test a new testcloud image but it turns out that the image doesn't work well, i.e. it fails to start in 30 seconds? If that doesn't count, then how about we consider testcloud contributors as users? We are likely to run into errors when we modify the code. We can't (or at least I can't) be sure that our draft code will work well, and the log printed to the host terminal is enough for us to figure out the problem once we look into the specific code.

I can only imagine Ctrl+C, and for that particular use case, I'd rather keep the instance running (and print a helpful message explaining that). When you create an instance, it takes 10-15 seconds before the VM boots up and testcloud prints out its IP. If you don't want to wait, because you don't need that IP (you have other ways of accessing the machine), you can simply hit Ctrl+C instead of waiting. That doesn't mean you want the instance to be destroyed.

For me the waiting time is not that long, seemingly less than 10 seconds. I can hardly think of any method other than opening the IP file manually with vi in another terminal, and by the time we are back with the IP, testcloud has perhaps printed it for us already :)

testcloud/cli.py
305–307

This is supposed to handle the case where users try to use an already existing VM name (for example, "test"), which won't happen in libtaskotron because it wisely uses a timestamp. Without this check, the diff would remove the existing instance by mistake when that happens.

314–321

Currently, all of this is applied even for testcloud image subcommand, not just testcloud instance subcommand. Which of course doesn't make sense, if I want to list images, there's no instance to stop or destroy. Furthermore, if any error occurs, why would we want to destroy the instance? Let's say you call instance stop and it fails with some error. This code would remove that instance, which is something you didn't want. All your data gone.

My fault, I didn't think this through. I ignored the image subcommand, probably because I have never used it, but it is useful. We can modify the main() code to handle this:

def main():

    parser = get_argparser()
    args = parser.parse_args()
    print(args)
    _configure_logging()
    # "connection" will definitely be in the args when we run instance subcommand
    if 'connection' in args:
        execute_instance_func(args)
    else:
        args.func(args)

As for the stop case, considering that we already have D933, maybe there is very little chance for us to see an error in the stop step :)

I can imagine this could be useful for taskotron instance create subcommand, but I don't see the use case for anything else.

It seems that testcloud errors are most likely to happen in the instance create subcommand, so that is this diff's main concern. Besides, this diff also improves instance remove.

This diff mainly concerns the instance create subcommand; its ultimate purpose is to do some cleanup work. If an error happens before an instance is created,

I see. In that case, the whole try except code should have been inside _create_instance(), not inside main. It is a very bad idea to look at args.name, args.connection or any other argument and try to guess which subcommand we're running. That might work today, but will not work in the future once we add new subcommands. The cleanup code, if anywhere, must be inside (or wrap) _create_instance() only.
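The structure suggested above can be sketched like this: main() only dispatches through args.func, and the cleanup wraps the create handler alone. Everything besides argparse is a stand-in: _do_create, _cleanup_failed_create, the cleaned_up list and the --fail flag are hypothetical and exist only to make the sketch observable.

```python
import argparse

cleaned_up = []  # records cleanup calls so the sketch can be checked


def _do_create(args):
    """Hypothetical creation step; fails on demand to exercise the cleanup."""
    if args.fail:
        raise RuntimeError("could not create " + args.name)
    return "created " + args.name


def _cleanup_failed_create(args):
    """Hypothetical cleanup of whatever a failed create left behind."""
    cleaned_up.append(args.name)


def _create_instance(args):
    # The try/except lives here, so no other subcommand can trigger it.
    try:
        return _do_create(args)
    except Exception:
        _cleanup_failed_create(args)
        raise


def _list_image(args):
    return "listing images"


def get_argparser():
    parser = argparse.ArgumentParser(prog="testcloud")
    sub = parser.add_subparsers(dest="command")
    instance = sub.add_parser("instance").add_subparsers(dest="subcommand")
    create = instance.add_parser("create")
    create.add_argument("name")
    create.add_argument("--fail", action="store_true")  # demo-only flag
    create.set_defaults(func=_create_instance)
    image = sub.add_parser("image").add_subparsers(dest="subcommand")
    image.add_parser("list").set_defaults(func=_list_image)
    return parser


def main(argv=None):
    args = get_argparser().parse_args(argv)
    # No guessing from args.name or args.connection: just dispatch.
    return args.func(args)
```

With this layout, adding a new subcommand later does not change which code path runs the cleanup.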

If the error occurred after the instance was created, well - which kinds of error are we talking about here? Please give me some real-world example.

All kinds of errors; it's hard to come up with one right away,

This is not the right approach. You don't know what you're trying to fix and I don't know what you're trying to fix.

How about when users try to test a new testcloud image but it turns out that the image doesn't work well, i.e. it fails to start in 30 seconds?

So if the instance doesn't boot in 30 seconds, we destroy it? That doesn't sound like a good idea. There can be multiple reasons for a slow boot and the user might not care that testcloud timed out. That doesn't mean he or she wants the instance to be killed and removed.

I can hardly think of any method other than opening the IP file manually with vi in another terminal

Using serial console is one option. Opening the VM in virt-manager and interacting with it graphically is a second option. Also, what about VMs that have network disabled on boot? You can't know what the user is creating, all of those are valid use cases.


Overall, I'm not convinced all of this is a good idea, sorry. There could be some use cases where automatic VM removal might be a good idea, but it would have to be in some very specific circumstances, not "on any error". For example let's say we created the disk image, but libvirt returns an error when we try to define the instance. In that case it might make sense to clean up and remove that big disk image. (That's what I was talking about when I said you need to have a real-world example of the possible error). But I wouldn't do that just because the VM is slow to boot, or the user pressed Ctrl+C while waiting for the IP address. However, covering just these cases would probably require a lot of changes to the current code, they wouldn't be easy (definitely not a good fit for beginner programmers), and probably is not worth the effort at this moment.
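As a rough illustration of the one narrow case mentioned above (the disk image was created, then defining the instance fails), a cleanup could be limited like this. The names are hypothetical: DefineError stands in for whatever error libvirt would raise, define_domain is a callback supplied by the caller, and the one-byte file is a placeholder for the real disk image. This is not how testcloud is actually structured.

```python
import os


class DefineError(Exception):
    """Stand-in for the error raised when defining the instance fails."""


def create_instance(image_path, define_domain):
    """Create a disk image, then define the instance on top of it.

    The image is removed only if this call created it and the define
    step failed. Ctrl+C and slow boots are deliberately not handled.
    """
    created_here = not os.path.exists(image_path)
    if created_here:
        with open(image_path, "wb") as disk:
            disk.write(b"\0")  # placeholder for the real (large) image
    try:
        define_domain(image_path)
    except DefineError:
        if created_here:
            os.remove(image_path)
        raise
    return image_path
```

Because only DefineError is caught, a KeyboardInterrupt or a timeout while waiting for the IP address leaves both the image and the instance alone.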

lnie added a comment.Aug 11 2016, 3:35 AM
In D909#18231, @kparal wrote:

I see. In that case, the whole try except code should have been inside _create_instance(), not inside main. It is a very bad idea to look at args.name, args.connection or any other argument and try to guess which subcommand we're running. That might work today, but will not work in the future once we add new subcommands. The cleanup code, if anywhere, must be inside (or wrap) _create_instance() only.

That really makes sense to me; the code will look much more elegant. We won't be able to improve instance remove then, but that's trivial and we can ignore it :)

How about when users try to test a new testcloud image but it turns out that the image doesn't work well, i.e. it fails to start in 30 seconds?

So if the instance doesn't boot in 30 seconds, we destroy it? That doesn't sound like a good idea. There can be multiple reasons for a slow boot and the user might not care that testcloud timed out. That doesn't mean he or she wants the instance to be killed and removed.

When I said "failed to boot in 30 seconds", I meant that the image itself has a problem: not slow to boot, but unable to boot. I thought that failing to boot in 30 seconds means exactly that the image is unusable, i.e. it won't boot no matter how long we wait. For example, when we try to create an instance from a raw image without D898. Now I know I was wrong, thanks for letting me know ^^, though I have never run into a situation where a usable image failed to boot in 30 seconds.

I can hardly think of any method other than opening the IP file manually with vi in another terminal

Using serial console is one option. Opening the VM in virt-manager and interacting with it graphically is a second option. Also, what about VMs that have network disabled on boot? You can't know what the user is creating, all of those are valid use cases.

Just to learn more: how can we use the serial console with a testcloud instance? Doesn't the console have some display problems, so we'd better use ssh?
How can we interact with it graphically? Isn't the screen black or blank? Will we be able to have the network disabled on boot with the current testcloud code?

Overall, I'm not convinced all of this is a good idea, sorry.

Try to think of it as doing a favor for contributors ^^. As you said, we will probably add new subcommands in the future, and even if not, we still need to modify the testcloud code from time to time.

There could be some use cases where automatic VM removal might be a good idea, but it would have to be in some very specific circumstances, not "on any error". For example let's say we created the disk image, but libvirt returns an error when we try to define the instance. In that case it might make sense to clean up and remove that big disk image.

Right, there is a chance that we will see that libvirt error, so there might be a need for us to add the cleanup function :)

(That's what I was talking about when I said you need to have a real-world example of the possible error).

When you said "real-world example", I thought you wanted me to produce a real error right away, preferably with an error log attached. Now I know what you mean :)

In D909#18276, @lnie wrote:

When I said "failed to boot in 30 seconds", I meant that the image itself has a problem: not slow to boot, but unable to boot. I thought that failing to boot in 30 seconds means exactly that the image is unusable, i.e. it won't boot no matter how long we wait.

I admit it's not highly probable, but we can't know what the user is running. The important part is that it can happen.

Just to learn more: how can we use the serial console with a testcloud instance?

Just create a VM with serial console enabled by default. Then start it with testcloud and you can connect to it even before network is up.

Doesn't the console have some display problems, so we'd better use ssh?

ssh is usually better yes. Serial console has different use cases (e.g. doesn't need networking).

How can we interact with it graphically? Isn't the screen black or blank?

That depends on your image. If you create a Workstation VM and set it properly, it will boot into graphics mode, of course.

Will we be able to have the network disabled on boot with the current testcloud code?

Current testcloud code, no. It will time out while waiting for the IP. But that doesn't mean the VM is unusable, if you want to access it in a different manner (e.g. the serial console). In the future, we'll probably need to address this somehow. Requiring the network to be up and working after boot is something that might not be true for all images.

For that very reason, I don't think we should destroy and remove an instance just because we didn't detect the IP. We should not assume what the user wants. If the VM is really broken, the user can always stop it/remove it using the command line. It should be his/her choice.

lnie added a comment.Aug 12 2016, 3:12 AM

When I said "failed to boot in 30 seconds", I meant that the image itself has a problem: not slow to boot, but unable to boot. I thought that failing to boot in 30 seconds means exactly that the image is unusable, i.e. it won't boot no matter how long we wait.

I admit it's not highly probable, but we can't know what the user is running. The important part is that it can happen.

I'd like to suggest you pay more attention to the concluding sentence here, which is "Now I know I was wrong, thanks for letting me know ^^, though I have never run into a situation where a usable image failed to boot in 30 seconds." I thought the 30-seconds thing meant exactly that the new image is unusable; that's the reason why I *took* it as a use case :)

Doesn't the console have some display problems, so we'd better use ssh?

ssh is usually better yes. Serial console has different use cases (e.g. doesn't need networking).

I know, I just wanted to say that I think we'd better use ssh when we have the choice.

For that very reason, I don't think we should destroy and remove an instance just because we didn't detect the IP.

Agree with you.

We should not assume what the user wants. If the VM is really broken, the user can always stop it/remove it using the command line. It should be his/her choice.

This diff just intends to do users a favor when they see errors, if removing the VM is their choice. If they want to keep their VMs, we can add a --no-destroy option. Again, I'm just copying what libtaskotron does when it sees errors :)

In D909#18338, @lnie wrote:

This diff just intends to do users a favor when they see errors, if removing the VM is their choice. If they want to keep their VMs, we can add a --no-destroy option. Again, I'm just copying what libtaskotron does when it sees errors :)

For testcloud, I prefer the opposite logic - don't destroy anything by default, and let the user stop and remove the instance manually if it didn't work for them. Please close this revision, thank you.

lnie abandoned this revision.Aug 12 2016, 8:41 AM
In D909#18344, @kparal wrote:
In D909#18338, @lnie wrote:

This diff just intends to do users a favor when they see errors, if removing the VM is their choice. If they want to keep their VMs, we can add a --no-destroy option. Again, I'm just copying what libtaskotron does when it sees errors :)

For testcloud, I prefer the opposite logic - don't destroy anything by default, and let the user stop and remove the instance manually if it didn't work for them. Please close this revision, thank you.

Sure, will do.