Support for Ansible Tasks
ClosedPublic

Authored by mkrizek on May 15 2017, 4:49 PM.

Details

Summary

I'd like to create a new side branch for work on ansible tasks.

This patch removes libtaskotron formula format for writing tasks and
introduces support for running tasks in form of an ansible playbook.

The following parts are removed:

  • task formula handling code
  • remote execution (Paramiko Wrapper)
  • concept of overlord and minion
  • the exitcode directive (hasn't been really used)

When running locally, user is asked for sudo password since tasks are ran
as root.

https://fedoraproject.org/wiki/Changes/InvokingTestsAnsible

Test Plan

$ git clone https://pagure.io/task-rpmlint-ansible
$ runtask -d -i gzip -t koji_build -a x86_64 task-rpmlint-ansible/run_tests.yml
$ runtask -d -i gzip -t koji_build -a x86_64 --ssh $IP_ADDRESS --ssh-privkey conf/id_taskotron task-rpmlint-ansible/run_tests.yml
$ runtask -d -i gzip -t koji_build -a x86_64 --libvirt --ssh-privkey conf/id_taskotron task-rpmlint-ansible/run_tests.yml

Diff Detail

Repository
rLTRN libtaskotron
Branch
feature/ansiblize (branched from develop)
Lint
Lint OK
Unit
Unit Tests OK
Build Status
Buildable 1156
Build 1156: arc lint + arc unit
mkrizek created this revision.May 15 2017, 4:49 PM

Looks good as for a WIP to me. My only real concern here is the broad deletion of all the minion-related code - this might be because I'm not familiar with the design details, but WRT the disposable "minion" - what is the plan for choosing the "right" image for the task (aka using F24 to test F24 packages) that we had in the "previoius" code? Are we dropping the feature alltogether, or is there another process being planned to do that? If so, what is it?

libtaskotron/executor.py
59

Is dropping the environment "discovery" based on input data (aka item + type) intentional? Are we just going to be using the config defaults from now on, thus dropping the "test F24 package on F24" functionality that we had? I know the patch notes say "you need to have the right image in config", but this seems more like a temp devel solution, than the expected outcome to me. Or do you have an other method to do this in mind, which was just not implemented yet?

Looks good as for a WIP to me. My only real concern here is the broad deletion of all the minion-related code - this might be because I'm not familiar with the design details, but WRT the disposable "minion" - what is the plan for choosing the "right" image for the task (aka using F24 to test F24 packages) that we had in the "previoius" code? Are we dropping the feature alltogether, or is there another process being planned to do that? If so, what is it?

I forgot to mention this, sorry. The plan, at least short-term, is to somewhat re-use the existing code (although you're right that it's deleted in the patch). Speaking of images, according to @tflink, releng are supposed to be building the images for us \o/

Thanks for pointing that out.

kparal added a subscriber: kparal.May 16 2017, 11:32 AM

So, first things first - I think there should be almost no deletions in this code. Our old system (formula and minion-based) will have to co-exist with the new system (ansible based) for some time, until we convert all our existing tasks to the new one, and iron out all issues. We need to keep compatibility for some time, this is probably a too big jump to switch everything over at the same time. I'd say keep all the existing code in place, just add new code to make ansible-based workflow work. Once both systems are deployed and working and once we covert all tasks to the new system, then we can drop the old code and refactor all the modules. This approach will also help in having the ansible feature branch much simpler, easier to compare and easier to rebase against develop.

One exception, the exitcode directive is truly not being used, please post a separate patch and we can drop it immediately also in develop.

Regarding ansible workflow - I'm a bit confused, so please bear with me :-) The spec says only local execution is supported, so our runner has to be running on the very same machine as the test itself (i.e. on the minion, in our terms). The code below implements our usual 3 modes - local, remote, and disposable. That's fine, it's no problem to go above the spec, to help development, etc. But since the local mode is the mandatory one (and the one being used in production), I'm not clear why we're removing so much of our existing "remote exec" and "minion creation" code. Because if we want to execute the task locally, we still need to:
a) create the minion VM. That's true until we have openstack or some other magical solution that provisions VMs for us.
b) SSH into the machine (paramiko). We can't run the task (i.e. execute ansible) from the VM host, that's against the spec. (Of course we can suggest that to be changed).
c) checkout the task (from distgit, or other location for generic tasks)
d) copy the results back to VM host (again paramiko and remote_exec methods), or upload them directly to buildbot/jenkins/etc

So, my idea was that we simply execute ansible instead of parsing formula and running directives directly, but everything else stays the same (minion spawning, files being copied in and out, etc). And once we have openstack, infra-provided images, jenkins instead of buildbot, etc, we replace those parts - but those parts are not tightly related to the ansible-based execution.

Am I talking nonsense here? Please correct me. Thanks.

conf/taskotron.yaml.example
112–113 ↗(On Diff #3032)

So where do we check out the distgit tests/ folder (or generic tasks) now?

libtaskotron/executor.py
142–148

IIUIC, you're running the task on a remote machine, just naming it localhost, right? In that case I don't think it adheres to the proposal anyway. The proposal doesn't allow any other execution than local.

So what we need to have is the ability to execute ansible on a local machine. But that doesn't mean taskotron can't have extra functionality, e.g. for easy development. In production, we'll execute everything locally, but in development, there's nothing stopping us from also supporting remote ansible execution.

As a consequence, there's no need to put remote machines until a fake "localhost" name into inventory. We don't need to hide the fact. It just must not be the default approach in production.

Am I making sense? Am I confused instead?

173

This should not work (does it?). It should be

'-u', 'root',
kparal added inline comments.May 16 2017, 1:26 PM
libtaskotron/executor.py
182

Please use long options in scripts, they make it more readable without having to search in a man page.

In D1195#22176, @kparal wrote:

So, first things first - I think there should be almost no deletions in this code. Our old system (formula and minion-based) will have to co-exist with the new system (ansible based) for some time, until we convert all our existing tasks to the new one, and iron out all issues. We need to keep compatibility for some time, this is probably a too big jump to switch everything over at the same time. I'd say keep all the existing code in place, just add new code to make ansible-based workflow work. Once both systems are deployed and working and once we covert all tasks to the new system, then we can drop the old code and refactor all the modules. This approach will also help in having the ansible feature branch much simpler, easier to compare and easier to rebase against develop.

I don't think it's worth the effort to keep both formats supported. The remote related code needs to die in a fire. There really are not that many tasks using the current formula that we don't maintain. Let's not keep the code more complex than it has to be. Also, the remote related code needs to die in a fire. My idea was, that until we port the tasks to the new format, we keep running the current production and have the new runner in dev.

Did I mention that the remote related code needs to die in a fire?

One exception, the exitcode directive is truly not being used, please post a separate patch and we can drop it immediately also in develop.

Regarding ansible workflow - I'm a bit confused, so please bear with me :-) The spec says only local execution is supported, so our runner has to be running on the very same machine as the test itself (i.e. on the minion, in our terms). The code below implements our usual 3 modes - local, remote, and disposable. That's fine, it's no problem to go above the spec, to help development, etc. But since the local mode is the mandatory one (and the one being used in production), I'm not clear why we're removing so much of our existing "remote exec" and "minion creation" code. Because if we want to execute the task locally, we still need to:
a) create the minion VM. That's true until we have openstack or some other magical solution that provisions VMs for us.
b) SSH into the machine (paramiko). We can't run the task (i.e. execute ansible) from the VM host, that's against the spec. (Of course we can suggest that to be changed).
c) checkout the task (from distgit, or other location for generic tasks)
d) copy the results back to VM host (again paramiko and remote_exec methods), or upload them directly to buildbot/jenkins/etc

So, my idea was that we simply execute ansible instead of parsing formula and running directives directly, but everything else stays the same (minion spawning, files being copied in and out, etc). And once we have openstack, infra-provided images, jenkins instead of buildbot, etc, we replace those parts - but those parts are not tightly related to the ansible-based execution.

Am I talking nonsense here? Please correct me. Thanks.

That's correct. Though I am still hoping the 'local' requirement will go away and we don't have to keep the "inner runner". So currently, yes it does not respect the proposal in this aspect. I am hoping this will get figured out soon.

conf/taskotron.yaml.example
112–113 ↗(On Diff #3032)

Ansible handles that.

libtaskotron/executor.py
173

Thanks.

182

Yep, will do, thanks.

mkrizek updated this revision to Diff 3034.May 16 2017, 4:00 PM

Address issues in the review. Fix tests and lint errors.

In D1195#22176, @kparal wrote:

So, first things first - I think there should be almost no deletions in this code. Our old system (formula and minion-based) will have to co-exist with the new system (ansible based) for some time, until we convert all our existing tasks to the new one, and iron out all issues. We need to keep compatibility for some time, this is probably a too big jump to switch everything over at the same time. I'd say keep all the existing code in place, just add new code to make ansible-based workflow work. Once both systems are deployed and working and once we covert all tasks to the new system, then we can drop the old code and refactor all the modules. This approach will also help in having the ansible feature branch much simpler, easier to compare and easier to rebase against develop.

I don't think it's worth the effort to keep both formats supported. The remote related code needs to die in a fire. There really are not that many tasks using the current formula that we don't maintain. Let's not keep the code more complex than it has to be. Also, the remote related code needs to die in a fire. My idea was, that until we port the tasks to the new format, we keep running the current production and have the new runner in dev.

I'm of the same mind - we don't have that many tasks using the "old" formulas in the wild that we don't already maintain. I think it'll be far less effort to just do the porting work that it'd be to support both ansible and formulas at the same time.

Did I mention that the remote related code needs to die in a fire?

I'm not sad to see it go but I've seen far fuglier code :)

tflink added a subscriber: merlinm.May 17 2017, 5:07 AM
In D1195#22176, @kparal wrote:

Regarding ansible workflow - I'm a bit confused, so please bear with me :-)

I think that confusion abounds in this area, to be honest :)

The spec says only local execution is supported, so our runner has to be running on the very same machine as the test itself (i.e. on the minion, in our terms). The code below implements our usual 3 modes - local, remote, and disposable. That's fine, it's no problem to go above the spec, to help development, etc. But since the local mode is the mandatory one (and the one being used in production), I'm not clear why we're removing so much of our existing "remote exec" and "minion creation" code. Because if we want to execute the task locally, we still need to:
a) create the minion VM. That's true until we have openstack or some other magical solution that provisions VMs for us.
b) SSH into the machine (paramiko). We can't run the task (i.e. execute ansible) from the VM host, that's against the spec. (Of course we can suggest that to be changed).
c) checkout the task (from distgit, or other location for generic tasks)
d) copy the results back to VM host (again paramiko and remote_exec methods), or upload them directly to buildbot/jenkins/etc

So, my idea was that we simply execute ansible instead of parsing formula and running directives directly, but everything else stays the same (minion spawning, files being copied in and out, etc). And once we have openstack, infra-provided images, jenkins instead of buildbot, etc, we replace those parts - but those parts are not tightly related to the ansible-based execution.

Am I talking nonsense here? Please correct me. Thanks.

This makes sense to me and tracks what I've understood. I still don't understand why the spec was written that way but in its current form, the idea is to spawn a VM and let the test do any spawning work that it needs to do. I'm not sure this is the best way to do multi-host testing and I know that there's a usecase @merlinm is working on that doesn't cleanly fit into what's currently in the wiki. Are there other usecases that won't fit into that system well?

Are there other cases that

In D1195#22199, @tflink wrote:

I don't think it's worth the effort to keep both formats supported. The remote related code needs to die in a fire. There really are not that many tasks using the current formula that we don't maintain. Let's not keep the code more complex than it has to be. Also, the remote related code needs to die in a fire. My idea was, that until we port the tasks to the new format, we keep running the current production and have the new runner in dev.

I'm of the same mind - we don't have that many tasks using the "old" formulas in the wild that we don't already maintain. I think it'll be far less effort to just do the porting work that it'd be to support both ansible and formulas at the same time.

So, we'll aim for deploying taskotron-ansible onto dev server and also push all tasks converted to ansible format to dev branch, at the exactly same time, correct? If so, how long do we suppose to be in this state, until we update production? Because if we suppose a week or two, that's probably fine. It if should be a month, or two, or three, it's going to be a problem for task developers who use dev branch to test new code (as they rightly should) before pushing to master. (Well, I know of one such person active at the moment, python-versions' maintainer). Of course we can tell them that they need to push directly to production during this phase.

kparal added inline comments.May 17 2017, 11:05 AM
libtaskotron.spec
36

This can be removed.

In D1195#22202, @kparal wrote:
In D1195#22199, @tflink wrote:

I don't think it's worth the effort to keep both formats supported. The remote related code needs to die in a fire. There really are not that many tasks using the current formula that we don't maintain. Let's not keep the code more complex than it has to be. Also, the remote related code needs to die in a fire. My idea was, that until we port the tasks to the new format, we keep running the current production and have the new runner in dev.

I'm of the same mind - we don't have that many tasks using the "old" formulas in the wild that we don't already maintain. I think it'll be far less effort to just do the porting work that it'd be to support both ansible and formulas at the same time.

So, we'll aim for deploying taskotron-ansible onto dev server and also push all tasks converted to ansible format to dev branch, at the exactly same time, correct? If so, how long do we suppose to be in this state, until we update production? Because if we suppose a week or two, that's probably fine. It if should be a month, or two, or three, it's going to be a problem for task developers who use dev branch to test new code (as they rightly should) before pushing to master. (Well, I know of one such person active at the moment, python-versions' maintainer). Of course we can tell them that they need to push directly to production during this phase.

That sounds about right.

Given how fast everything is moving right now, I don't think we can afford to wait a month while porting everything. That will need to be done as quickly as we can manage it.

This is probably going to be a good time to move our tasks from bitbucket to pagure as well.

mkrizek updated this revision to Diff 3050.May 22 2017, 10:58 AM
  • inner runner :(
tflink accepted this revision.May 23 2017, 4:44 AM

I'm not a huge fan of keeping a playbook in the root of the repo like that but I'm not coming up with any brilliant alternatives at the moment.

Otherwise, it looks good to me as a WIP.

libtaskotron.spec
172

Please add a ref to this Diff

This revision is now accepted and ready to land.May 23 2017, 4:44 AM
jskladan accepted this revision.May 23 2017, 8:56 AM
  • inner runner :(

Yo dawg, we heard you like playbooks. So we put a playbook, in yo playbook, so you can ansible while ansibling already!

Looks good to me. I expect we'll hit plenty of things we did not think of, along the way, but this is IMO perfectly good starting point.

mkrizek updated this revision to Diff 3053.May 30 2017, 12:27 PM
  • polishing
mkrizek edited the summary of this revision. (Show Details)May 30 2017, 1:01 PM
mkrizek edited the test plan for this revision. (Show Details)

@mkrizek, instead of creating a new repo for each converted task (https://pagure.io/task-rpmlint-ansible), would it be possible to instead create a new branch (like feature/ansible or ansible) in the existing repo? I feel like we have too many repos already :-)

In D1195#22336, @kparal wrote:

@mkrizek, instead of creating a new repo for each converted task (https://pagure.io/task-rpmlint-ansible), would it be possible to instead create a new branch (like feature/ansible or ansible) in the existing repo? I feel like we have too many repos already :-)

Yes, that's the plan. This was just quick and adhoc solution for testing. ;) That's why I didn't put it into any of our namespaces. Thanks for pointing it out.

tflink added a comment.Jun 7 2017, 2:45 AM

Other than the missing ref to this diff in the specfile changelog, it looks good enough to me to move forward with putting it into a branch.

kparal added a comment.EditedJun 13 2017, 11:05 AM

There's no point further waiting here. Please merge with latest develop and push to a feature branch, thanks. It will also help @roshi push D1196.

kparal added a subscriber: roshi.Jun 13 2017, 11:10 AM