turn 'all' into 'nightlies' and wait for the composes
Abandoned · Public

Authored by adamwill on Aug 19 2015, 11:45 PM.

Details

Summary

This changes 'all' into 'nightlies': it doesn't run on the
'current' compose any more. In practice we run 'current' every
hour and 'all' (now 'nightlies') every day, so there's really
not much point doing 'current' in 'all', and this cleans it up
a bit.

Then the fun starts! Using some very new fedfind functionality,
'nightlies' can now wait for composes to complete before firing
jobs. By default it will wait 8 hours; the -w parameter lets
you specify a different wait time (in minutes).

It does the waiting and job scheduling in parallel, using
multiprocessing.Pool (I actually did it with threading first,
but this is a lot simpler and just as fast). So as soon as
either compose appears it will download the ISOs and schedule
the jobs; it doesn't wait for one then wait for the other.
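
The pattern is roughly this shape - a simplified, self-contained sketch
rather than the actual diff, with a sleep standing in for the fedfind
polling:

    import time
    from multiprocessing import Pool

    def _run_jobs(release):
        # hypothetical worker: wait for this release's compose to appear,
        # then download its ISOs and schedule its jobs
        time.sleep(2)  # stand-in for the fedfind polling
        return "jobs for {0}".format(release)

    if __name__ == '__main__':
        pool = Pool(2)
        # each release gets its own worker, so whichever compose shows up
        # first has its jobs scheduled first; the huge timeout on get()
        # keeps Ctrl-C working, a bare get() swallows KeyboardInterrupt
        jobs = pool.map_async(_run_jobs, ['Rawhide', 'Branched']).get(99999)
        print(jobs)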

Upshot: we can schedule this to run every night an hour or two
before the earliest time the composes are likely to complete, and
it should fire off the jobs as each compose completes. If we want
to get aggressive, we could probably tweak things further so we run
the boot.iso tests as soon as that image appears and run the lives
later, but that would require some refactoring of jobs_from_fedfind().

Test Plan

Run the new 'nightlies' and check it behaves as
advertised. Bask in its glory! Then check I didn't bust
anything else, especially run_current.

Diff Detail

Repository
rOPENQA fedora_openqa
Branch
nightlies-wait
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 115
Build 115: arc lint + arc unit
adamwill retitled this revision to "turn 'all' into 'nightlies' and wait for the composes". Aug 19 2015, 11:45 PM
adamwill updated this object.
adamwill edited the test plan for this revision. (Show Details)
adamwill added reviewers: jskladan, garretraziel.
adamwill updated this revision to Diff 1360. Aug 20 2015, 7:19 AM

Use %s instead of .format when logging.
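
The difference, in short (illustrative snippet, not lines from the diff):

    import logging

    release = 'Rawhide'
    compose = '20150820'

    # .format() builds the string even if the record is never emitted
    logging.info("running nightlies for {0} {1}".format(release, compose))

    # %s-style arguments let logging do the substitution lazily, only
    # when the record is actually logged
    logging.info("running nightlies for %s %s", release, compose)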

garretraziel requested changes to this revision. Edited Aug 20 2015, 11:43 AM

It seems to work on the current event, but when I ran it on yesterday's nightly, it ended with:

INFO:root:running nightlies for 20150819
Traceback (most recent call last):
  File "openqa_trigger.py", line 456, in <module>
    args.func(args, client, wiki)
  File "openqa_trigger.py", line 346, in run_nightlies
    jobs = pool.map_async(_run_jobs, workers).get(99999)
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
AssertionError: daemonic processes are not allowed to have children

but other than this and the inline comments, LGTM.

tools/openqa_trigger/openqa_trigger.py
53–57

Can't you just use a lambda? Or by "inline function" do you mean a lambda?

283–318

This looks good to me, but since you have only two threads and two tasks, couldn't you use multiprocessing.Process like here? I think that it would be a little bit more transparent.

OTOH, if there is any problem with using Process, I'm OK with Pool.

307

Couldn't you just use .get() without an argument?

This revision now requires changes to proceed. Aug 20 2015, 11:43 AM

I have some "weird feeling" about using Processes in the code, but overall the idea is sound, and the code looks good. @garretraziel will comment in more detail.

tools/openqa_trigger/openqa_trigger.py
36

Feels like the message attribute should be a mandatory one.

Thanks for the comments. I will look at all the issues.

There is of course another choice, which I thought of after learning multiprocessing on the fly: we leverage the inherent parallel processing capabilities of the underlying infrastructure.

...that is to say, we just run two trigger commands from cron, one for Branched and one for Rawhide. Then we don't need the cleverness, I guess. Let me know if you see any problems with that (to get the compose report done I guess we'd want to use a little bash script which kicked off both trigger commands, waited for them both to finish, then ran the compose report).

adamwill added inline comments. Aug 20 2015, 3:12 PM
tools/openqa_trigger/openqa_trigger.py
36

Sure, "Waiting for something" isn't great user experience ;)

53–57

No, by 'inline' I meant the thing where you just stuff a function inside another function (because yo dawg, I heard you like to...func?) That keeps it in the 'logical' place in the code and makes it super-clear what's going on, but map() throws a fit if you try it.
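
Roughly what I mean - a contrived sketch, not the real trigger code:

    from multiprocessing import Pool

    def run_nightlies():
        # the 'inline' version: the worker lives right where it's used...
        def _run_jobs(release):
            return "jobs for " + release
        pool = Pool(2)
        # ...but Pool.map() has to pickle the callable to ship it to the
        # worker processes, and a nested function can't be pickled, so
        # this raises a PicklingError
        return pool.map(_run_jobs, ['Rawhide', 'Branched'])

    if __name__ == '__main__':
        run_nightlies()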

I didn't think of using a lambda, for some reason. Would probably work, I'll try it.

283–318

Two things: 1) at first I was gonna include the 'current' run so we'd have had three jobs, and also 2) I didn't see that example :P I started out using threading, which worked but seemed kinda over-engineered, then found some SO post or something which recommended using Pool for simple cases, so I went straight to Pool. I find the Pool design pretty simple personally, but I'll try it with Process and see if it's better. I suspect that (like both the other cases) it'll look a bit less simple after I put in exception handling, but I'll see.

307

IIRC if you do that, it stops handling KeyboardInterrupts. I'll check it again, though.

adamwill added a comment. Edited Aug 20 2015, 3:32 PM

Oh, fun...I suspect the traceback might happen because I put multiprocessing into fedfind too, so when jobs_from_fedfind() runs its fedfind query, it tries to open another Pool, and apparently you can't have a pool in your pool (aww). I'm assuming you tested with a very new fedfind?

I could hack around this by having fedfind not go parallel if it only has one Koji search to do, but then if we start testing ARM or Cloud images the problem will just show up again.

If this is the problem I'm not sure there's an awesome solution, but I'll poke at it. My initial threading attempt started out using non-daemonic threads, but the problem then is that they stick around after the main thread dies so it becomes rather difficult to handle keyboard interrupts (you have to write the code so the main process/thread can signal the children to shut down somehow, and it wound up being a mess).

edit: that definitely looks like it is the problem. I hacked up a simplified version of the flow in a single script to make it more visible and pokeable. I'll try and come up with some kind of solution. Right now I'm seeing if Celery does it better.
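
The simplified script is basically this shape (not the exact thing, but
it falls over the same way):

    from multiprocessing import Pool

    def koji_query(query):
        # stand-in for one of fedfind's Koji searches
        return query.upper()

    def _run_jobs(release):
        # each trigger worker opens its own Pool, as fedfind does for its
        # Koji searches - but Pool workers are daemonic processes, and
        # daemonic processes may not have children, hence the
        # "daemonic processes are not allowed to have children" error
        inner = Pool(2)   # <- this already blows up in the worker
        results = inner.map(koji_query, [release + '-boot', release + '-live'])
        inner.close()
        return results

    if __name__ == '__main__':
        pool = Pool(2)
        print(pool.map_async(_run_jobs, ['rawhide', 'branched']).get(99999))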

adamwill added inline comments. Aug 20 2015, 7:46 PM
tools/openqa_trigger/openqa_trigger.py
53–57

Turns out using a lambda is also a NOPE; multiprocessing throws its toys out of the pram if you try, because it can't pickle it.

283–318

OK, so I looked at this briefly, and unless I'm misunderstanding something, the problem with just using Process directly is that it's not trivial to get the result of a Process: start()ing a process doesn't hand you back whatever its target callable returns. Try modifying the simplest example to make f(name) return something, then try to get the return value.

https://stackoverflow.com/questions/8329974/can-i-get-a-return-value-from-multiprocessing-process says you can handle this with a Queue. By the time I add in code for two Processes and change _run_jobs to stick its results in the queue, it feels like I'm just reinventing the Pool stuff...
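
For reference, the Queue dance would look something like this
(hypothetical sketch):

    from multiprocessing import Process, Queue

    def _run_jobs(release, queue):
        # instead of returning the job list, the worker has to push it
        # onto a shared queue for the parent to collect
        queue.put((release, "jobs for " + release))

    if __name__ == '__main__':
        queue = Queue()
        procs = [Process(target=_run_jobs, args=(rel, queue))
                 for rel in ('Rawhide', 'Branched')]
        for proc in procs:
            proc.start()
        # drain the queue before join()ing, then wait for both workers
        results = [queue.get() for _ in procs]
        for proc in procs:
            proc.join()
        print(results)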

Let me know if I missed something, though.

307

Yup, verified, you need the timeout. I think with the timeout, multiprocessing has to keep the main process active to check the timeout; without it, the main process just goes to sleep until the workers are done. Or, you know, something like that.

adamwill updated this revision to Diff 1364. Aug 20 2015, 8:12 PM

Use multiprocessing.dummy instead of multiprocessing

I love magic! Just as I was gently weeping into a pile of burning
parallelism modules, I stumbled across a stray mention of
multiprocessing.dummy. This basically does exactly the same thing as
multiprocessing but uses threads instead of processes.

Somehow - and don't ask me how - the same 'nested pools' case that
causes multiprocessing to lose its lunch entirely works just fine
with multiprocessing.dummy. I tested multiple times, with the real
openqa_trigger.py and with my dummy script: with multiprocessing it
crashes every time, with dummy it works just fine. Magic!
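
The change itself is basically an import swap - the dummy Pool has the
same API, but its workers are threads, and threads (unlike daemonic
processes) are allowed to start child processes:

    # before: from multiprocessing import Pool
    from multiprocessing.dummy import Pool   # same interface, thread-backed

    pool = Pool(2)
    # exactly the same calls as before; no pickling, no daemonic children
    jobs = pool.map_async(len, ['Rawhide', 'Branched']).get(99999)
    pool.close()
    print(jobs)   # [7, 8]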

I'm also looking at switching fedfind to use the xmlrpclib
'multicall' interface to batch Koji requests instead of using
multiprocessing, but this seems like a reasonable change anyway.

garretraziel requested changes to this revision. Aug 21 2015, 8:41 AM

The code still doesn't work for me, though; when I try to run it on yesterday's nightlies, I get:

INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): fedoraproject.org
/home/jsedlak/python_virtualenv/lib/python2.7/site-packages/wikitcms/wiki.py:71: DeprecationWarning: page.edit() was deprecated in mwclient 0.7.0 and will be removed in 0.8.0, please use page.text() instead.
  for match in valpatt.finditer(page.edit()):
INFO:root:running nightlies for 20150820
INFO:root:running universal tests for x86_64 with generic boot image for x86_64
INFO:root:downloading https://kojipkgs.fedoraproject.org/mash/rawhide-20150820/rawhide/x86_64/os/images/boot.iso (generic boot image for x86_64) to /home/jsedlak/Projects/openqa_instance/data/factory/iso/Rawhide_20150820_generic_x86_64_boot.iso
INFO:root:running universal tests for x86_64 with generic boot image for x86_64
INFO:root:downloading https://kojipkgs.fedoraproject.org/mash/branched-20150820/23/x86_64/os/images/boot.iso (generic boot image for x86_64) to /home/jsedlak/Projects/openqa_instance/data/factory/iso/23_Branched_20150820_generic_x86_64_boot.iso
Traceback (most recent call last):
  File "openqa_trigger.py", line 456, in <module>
    args.func(args, client, wiki)
  File "openqa_trigger.py", line 346, in run_nightlies
    jobs = pool.map_async(_run_jobs, workers).get(99999)
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
pycurl.error: cannot invoke setopt() - perform() is currently running
tools/openqa_trigger/openqa_trigger.py
12

I still think that using a Pool inside a Pool is a bad idea - it seems to me like running an event loop inside an event loop. But I think I will just put up with it, because the code in the trigger looks reasonable, and the second Pool runs inside fedfind, which is something I'm not reviewing :-).

We should monitor how it behaves (whether it's spawning a lot of zombies, for example), and if there is a problem we can use something complex (like Celery - coincidentally, I was reading about it a few weeks ago) or something easy and straightforward (like running two single-process jobs from cron).

53–57

Well, darn.

This revision now requires changes to proceed. Aug 21 2015, 8:41 AM

OK, looking at this, I get the impression that we are going down paths we really don't want to go down - it smells an awful lot like a race condition.

adamwill added a comment. Edited Aug 21 2015, 3:26 PM

D'oh - I got lazy and tested on my desktop, which doesn't get as far as actually downloading the images because it has no /var/lib/openqa; it dies right when it goes to download them. I figured it'd be fine from then on :/

Funnily enough, the yum folks were testing urlgrabber with threads a mere 11 years ago: http://lists.baseurl.org/pipermail/yum-devel/2004-March/000057.html

Then Pulp ran into the same problem:

https://www.redhat.com/archives/pulp-list/2011-April/msg00034.html

I think we can do what Pulp did and use a lock for download_image() so only one thread at a time gets to download stuff. But I don't want to go too much farther down this road before we just give up on the whole model in favour of something dumber (definitely not smarter, unless you want to write it).
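
Hypothetically that'd be something like a module-level lock that
download_image() grabs before touching pycurl (the names and placement
here are made up for illustration):

    import threading
    import time

    # hypothetical module-level lock shared by all trigger threads
    DOWNLOAD_LOCK = threading.Lock()

    def download_image(url, target):
        # only one thread at a time gets to drive the download; the real
        # version would hand off to pycurl inside the lock
        with DOWNLOAD_LOCK:
            time.sleep(1)  # stand-in for the actual transfer
            return target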

Hell, we could even just have it sit there and check each nightly alternately - try Rawhide, wait two minutes, try Branched, wait two minutes, rinse, repeat. It's not as sexy but it'd get the job done.

Oh, since it seems like you didn't catch it, I'll mention it again - there's no more pools-in-pools. I took the multiprocessing use out of fedfind for the same general reason we're re-thinking it here: too much complexity for too little gain. fedfind now uses xmlrpclib multicalls to batch requests instead.
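
For the record, the multicall pattern looks roughly like this - a
self-contained Python 2 sketch with a throwaway local server standing
in for the Koji hub and an illustrative method name:

    import threading
    import xmlrpclib
    from SimpleXMLRPCServer import SimpleXMLRPCServer

    # throwaway local XML-RPC server standing in for the Koji hub
    server = SimpleXMLRPCServer(('localhost', 8999), logRequests=False)
    server.register_multicall_functions()
    server.register_function(lambda task_id: {'id': task_id}, 'getTaskInfo')
    threading.Thread(target=server.handle_request).start()  # serve one request

    proxy = xmlrpclib.ServerProxy('http://localhost:8999')
    multicall = xmlrpclib.MultiCall(proxy)
    # queue up several calls; nothing goes over the wire yet
    multicall.getTaskInfo(1001)
    multicall.getTaskInfo(1002)
    # a single round trip runs them all and yields the results in order
    for result in multicall():
        print(result)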

adamwill updated this revision to Diff 1370. Aug 21 2015, 9:04 PM

Dump multiprocessing, do a dumb round-robin wait instead

So, yeah, it gets pretty icky to set a global lock and dumbly
grab it in download_image() even if we're not multi at all, or
pass a lock instance through a zillion functions, or whatever we
could do to deal with the pycurl thread problem. So here's a
dumber way to do it: we just do a sort of 'round-robin' wait,
waiting for one release then the other. As soon as we see one
or the other we schedule its jobs and stop waiting for it. We
do job reporting after both have been found, or we've found one
but timed out on the other. You can still skip the waiting by
passing -w 0.
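
The loop is roughly this shape - a simplified, self-contained sketch
with stand-ins for the fedfind check and the job scheduling, not the
actual diff:

    import time

    def compose_exists(release):
        # stand-in for the fedfind "has this nightly appeared yet?" check
        return release == 'Rawhide'

    def schedule_jobs(release):
        # stand-in for downloading the ISOs and firing the openQA jobs
        print("scheduling jobs for " + release)

    def wait_for_composes(releases, waitmins=480):
        """Poll each release in turn; schedule its jobs as soon as its
        compose shows up, give up on any that never appears."""
        waiting = list(releases)
        deadline = time.time() + waitmins * 60
        while True:
            for release in list(waiting):
                if compose_exists(release):
                    schedule_jobs(release)
                    waiting.remove(release)
            if not waiting or time.time() >= deadline:
                break
            time.sleep(120)  # try one, wait a bit, try the other, repeat
        # job reporting happens here, once both are found or we timed out
        return [rel for rel in releases if rel not in waiting]

    if __name__ == '__main__':
        # -w 0 still checks each release once, it just doesn't wait around
        print(wait_for_composes(['Rawhide', 'Branched'], waitmins=0))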

I'm also planning an alternative diff which will be even dumber
and dump 'all/nightlies' entirely, and you just have to run two
'compose' commands, one for Rawhide, one for Branched. We can
look at both side-by-side and pick whichever we prefer.

In D516#9840, @adamwill wrote:

Oh, since it seems like you didn't catch it, I'll mention it again - there's no more pools-in-pools. I took the multiprocessing use out of fedfind for the same general reason we're re-thinking it here: too much complexity for too little gain. fedfind now uses xmlrpclib multicalls to batch requests instead.

Yup, sorry, I thought you were only planning to do it; this is fine then.

adamwill abandoned this revision. Aug 26 2015, 6:34 AM

We went with D525 instead.