handle errors while waiting for compose smartly
ClosedPublic

Authored by adamwill on Aug 25 2015, 3:44 PM.

Details

Summary

So the first attempt to use the waiting stuff in production
failed because, at some point, koji_done got a socket.error
from the server. Not sure if that was a Koji outage or some
kind of rate control, but even if it's rate control and we need
to tweak the wait interval, this seems advisable: we shouldn't
die the first time we hit any kind of error while waiting, we
should retry a few times first (with increasingly long delays
between the retry attempts). I know bare except clauses are BAD,
but I think it's OK here as we can't really cover every possible
exception which might get raised in any module during a 'go hit
a server and do a bunch of stuff' operation, and if the error
keeps happening we *are* going to raise it eventually.

Test Plan

Check I didn't break the 'normal' case, and try causing
an error to appear somehow (e.g. disconnect from the network or
hack up the 'client' instantiation in fedfind to use the wrong URL)
and see if the error handling works as intended.

Diff Detail

Repository
rOPENQA fedora_openqa
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
adamwill retitled this revision from to handle errors while waiting for compose smartly.Aug 25 2015, 3:44 PM
adamwill updated this object.
adamwill edited the test plan for this revision. (Show Details)
adamwill added reviewers: jskladan, garretraziel.
garretraziel added inline comments.Aug 26 2015, 7:54 AM
tools/openqa_trigger/openqa_trigger.py
260

And what's the error that Koji really returns? Isn't there any base error we could catch? I'm still not convinced by catch 'em all approach. I'll wait what @jskladan will say.

jskladan requested changes to this revision.Aug 26 2015, 11:18 AM
jskladan added inline comments.
tools/openqa_trigger/openqa_trigger.py
260

First up - the Right™ Way would be handling the koji errors in the library (fedfind?), and raising FF-specific exception that would be caught here.
On top of that - I don't really like abusing properties for creating read-only attributes - that's what methods are for.

But both these points are moot, IMHO to some extent, as the retry logic should be (design-vise) inside the fedfind(?) library, not the code handling the library - especially since it is directly tied to the koji_done/pungi_done value. Make these two into full-blown methods, add the optional retry/wait params to the call, and just use it here directly as

try:
    if ff_release.koji_done(max_retries, timeout) and ff_release.pungi_done(max_retries, timeout):
       logging.info("Compose complete! Scheduling jobs.")
except FedfindCustomError as err:
     logging.debug("Caught exception while waiting! %s", err)
     raise
This revision now requires changes to proceed.Aug 26 2015, 11:18 AM

It wasn't a Koji error at all - it was in some lower level networking thing, a socket.error, https://docs.python.org/2/library/socket.html . When a process is just sitting there trying network hits for hours all kinds of unexpected crap could happen - network goes down for a minute, for instance, and you might get this kind of low-level connection error, a name resolution error, who knows what else.

Since it wasn't a Koji error I'm not really so convinced that handling it in fedfind is the 'Right Way'. It clearly makes sense for the one-time, 'hit up Koji and see if the compose is done' property to be in fedfind. I'm not convinced the 'retry it every two minutes and handle timeouts if it fails' method belongs there. "Is this compose done?" is something that's entirely reasonable as a property in fedfind; 'wait for this compose to show up' isn't such an obvious thing to have there.

BTW, koji_done isn't a property to make it read-only, it's a property instead of an attribute because calculating it takes a long time. AIUI this is one of the classic reasons to use a property.

adamwill updated this revision to Diff 1399.Aug 26 2015, 9:52 PM

FINE i hate you

fine, you're all fired. the waiting is in fedfind now. And
because it's there, it can use all the bare exceptions it
likes and you don't get a vote!

https://www.happyassassin.net/cgit/fedfind/commit/?id=7054d3f1d33bec146be1262ae2023305d59c63bd

It's in fedfind git master now, I'll cut a release soon.
You'll have to update to git master to test this.

This revision is now accepted and ready to land.Aug 27 2015, 8:45 AM

it should probably actually catch WaitError and exit with a nice message when that happens, I'll do that while merging. Tomorrow, if I do it now I'll do something silly.

This revision was automatically updated to reflect the committed changes.