Report job 'priority' to ResultsDB (T902)
Abandoned · Public

Authored by adamwill on Feb 21 2017, 2:10 AM.

Details

Reviewers
jsedlak
jskladan
Summary

This implements our planned short (medium...?)-term hack of
having test systems forward test priority information to RDB,
in the absence of any other way for consumers (e.g. releng) to
know which tests they really have to care about for any given
workflow.

The concept here is that we specify the relevant workflow in
the key name ('priority.fedora.compose') and the values are
expected to be integers, with *lower* values meaning 'more
important' and *higher* values meaning 'less important'. For
this workflow, 1 means 'critical system functionality' (i.e.
the most important kind of test), 2 means 'Alpha priority',
3 means 'Beta priority', 4 means 'Final priority', and anything
5 or higher is Optional.

Signed-off-by: Adam Williamson <awilliam@redhat.com>

Test Plan

Report some jobs to a test RDB instance and check
they get correct priority.fedora.compose values.
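
As a rough sketch of the kind of reporting and checking described here (not taken from the actual diff; the endpoint path, testcase name, and field layout are assumptions loosely based on the ResultsDB 2.0 JSON API):

```
import requests

RESULTSDB_URL = "http://localhost:5001/api/v2.0"  # hypothetical test instance

# lower value == more important, per the scheme in the summary
PRIORITY = {"critical": 1, "alpha": 2, "beta": 3, "final": 4, "optional": 5}

def report_result(testcase, outcome, compose_id, priority):
    """Report a result carrying the workflow priority as an extradata key."""
    payload = {
        "testcase": {"name": testcase},
        "outcome": outcome,
        "data": {
            "productmd.compose.id": compose_id,
            "type": "compose",
            # the stopgap this review is about: priority for the compose workflow
            "priority.fedora.compose": priority,
        },
    }
    return requests.post(RESULTSDB_URL + "/results", json=payload, timeout=30)

report_result("compose.base_startstop", "PASSED",
              "Fedora-Rawhide-20170221.n.0", PRIORITY["critical"])
```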

Diff Detail

Repository
rOPENQA fedora_openqa
Branch
rdb-priority (branched from master)
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 1060
Build 1060: arc lint + arc unit
adamwill retitled this revision to "Report job 'priority' to ResultsDB (T902)". Feb 21 2017, 2:10 AM
adamwill updated this object.
adamwill edited the test plan for this revision.
adamwill added reviewers: jskladan, jsedlak.

Well, I know I told you this already, but just to reiterate - I do not believe that this is the right place and/or way to store "what is important for this one group" information. The reason is that this is a _policy_, which is by definition _prone to change_. What if there is a decision to change the priority? You cannot retrospectively change the values in RDB...
The way to do this is _in a separate system_ - I know you don't like it, and call this a "short term hack", but we all know well that short term hacks have a tendency to stay (and even worse, in this case, also to set a standard).
These kinds of issues should be handled by an external system that consumes the raw resultsdb data and has the "testplans" (because, honestly, this is nothing more than a poor man's testplan implementation) configured in it.
If that is what it takes, I'm even willing to put together such a tool, one that lets you configure which kinds of results, based on metadata (testcase, extradata), are to be taken for, say, "Alpha priority of the fedora compose workflow". Thinking about it, we even have a base for such a system in the Dashboards.

On an unrelated note - why store a "random number" when there is a clear mapping to human-readable values? Magic numbers were bad when I was taught programming in high school, and they are still bad.

Hey, if a system existed I'd be lining up to use it. There just *isn't* one, and I want to be able to put together the 'see if Rawhide is releasable today' workflow without blocking on that for however long it takes to exist.

The use of numbers was really just for convenience of user implementation, I guess - you can just do 'give me all tests with priority < 5' or whatever. If we use strings, then you have to say "give me all tests with priority 'critical' or 'alpha' or 'beta'" if you're doing Beta, say. We could set up some kind of constant implementation, I guess; I just always feel silly trying to implement 'constants' in Python... It also took into consideration RH's tier 1 / tier 2 / tier 3 system.
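
(Just to illustrate - a minimal sketch of what 'constants' could look like here, assuming Python's stdlib enum; the class and member names are made up:)

```
from enum import IntEnum

class ComposePriority(IntEnum):
    """Lower value means more important, as in the scheme above."""
    CRITICAL = 1
    ALPHA = 2
    BETA = 3
    FINAL = 4
    OPTIONAL = 5

# numeric comparisons still work, so 'priority < 5'-style queries are unchanged
beta_or_higher = [p for p in ComposePriority if p <= ComposePriority.BETA]
```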

I entirely agree with the reasons why this is a poor choice for a permanent implementation, but I don't think it's likely to be one of those things that we're still using in a decade. It's not tightly coupled to anything else, after all - it's just a dumb value in ResultsDB. There's no work to be done to 'migrate off' it, you just flip over to using the proper system whenever it actually appears. The value does no-one any harm if it's just sitting there with nothing using it.

oh, forgot to mention: I also think that implementing a 'proper system' for this is likely to be quite a lot of work and involve answering some difficult questions, and I think doing a rushed implementation of the 'proper system' is *more* likely to cause long-term pain than just using a dumb stopgap measure for now. The stopgap gives us a bit of breathing room to design the 'proper system' well.

> Hey, if a system existed I'd be lining up to use it. There just *isn't* one, and I want to be able to put together the 'see if Rawhide is releasable today' workflow without blocking on that for however long it takes to exist.

Nice one, here. You had no problems putting together random scripts in the past... ;)

> The use of numbers was really just for convenience of user implementation, I guess - you can just do 'give me all tests with priority < 5' or whatever.

Well, but you can't really...

> I entirely agree with the reasons why this is a poor choice for a permanent implementation, but I don't think it's likely to be one of those things that we're still using in a decade. It's not tightly coupled to anything else, after all - it's just a dumb value in ResultsDB. There's no work to be done to 'migrate off' it, you just flip over to using the proper system whenever it actually appears. The value does no-one any harm if it's just sitting there with nothing using it.

Still, you are setting a bad example, and even though you call this a "short term hack", it is not one, as we both know. Since you are doing it, everybody will start doing it, and hey, ResultsDB is a testplan management system now too.

In the end, do whatever - it is "your" code that reports the stuff and I can't stop you, but not acknowledging this may well lead to some nasty corners where ResultsDB is once again trying to "do it all". There was a very good reason to remove execution tracking from it, and I don't see a good reason to put a testplan management workflow in it either.

> oh, forgot to mention: I also think that implementing a 'proper system' for this is likely to be quite a lot of work and involve answering some difficult questions, and I think doing a rushed implementation of the 'proper system' is *more* likely to cause long-term pain than just using a dumb stopgap measure for now. The stopgap gives us a bit of breathing room to design the 'proper system' well.

Well, implementing a "hacky system" (in fact really "an initial implementation of the 'proper' one", rather than "hacky") requires answering none of those questions, and keeps the responsibilities clearly decoupled.
But as I said initially - who am I to stop you...

> There's no work to be done to 'migrate off' it, you just flip over to using the proper system whenever it actually appears. The value does no-one any harm if it's just sitting there with nothing using it.

The thing is, this is a faulty argument - as long as _nobody uses the dumb value_ there is no work to be done. But, by definition, the people consuming the data are reliant on it. Tools will be written that rely on it, and "migrating off of it" really won't be just removing it from this piece of code. I get that you might not see it at the moment, but since you are de facto setting the standards, it might be good to stop for a moment and consider whether making the "poor choice" because it is easy now won't have unpleasant consequences in a year.

Well, my point is this: if you do an 'initial implementation' of the proper system, you're instantly defining some kind of interface to it, and some expectations for how it works. And that's for what we're calling the "proper system", not for some short-term stopgap which is clearly labelled as such and which everyone understands as such. The implementation is usually going to *imply* things about how even situations you don't explicitly cover at first are handled, or at least limit your options for covering them (without breaking the interface expectations) to some extent. (Have you read fedfind.release.get_release()? :P)

For instance, Ralph was telling me the other day he's struggling with the RH case where someone in a releng role trying to implement some kind of 'is this thing OK?' workflow might plausibly have *thousands* of possible test cases to pick from, whose maintainers can't necessarily be relied upon to accurately define their importance; how do we make it possible to handle that sensibly? Our 'proper system' is going to have to answer questions like that, or at least consider them and explicitly choose *not* to answer them.

I mean, we can just stand up some temporary little web service which simply stores a list of test case names and priority values and reports the one when you give it the other. And call it v0.1 of whatever we're calling the new thing if you like. It just seems like a waste of effort...

> There's no work to be done to 'migrate off' it, you just flip over to using the proper system whenever it actually appears. The value does no-one any harm if it's just sitting there with nothing using it.

> The thing is, this is a faulty argument - as long as _nobody uses the dumb value_ there is no work to be done. But, by definition, the people consuming the data are reliant on it. Tools will be written that rely on it, and "migrating off of it" really won't be just removing it from this piece of code. I get that you might not see it at the moment, but since you are de facto setting the standards, it might be good to stop for a moment and consider whether making the "poor choice" because it is easy now won't have unpleasant consequences in a year.

I really don't think it will, no. It's not going to be a difficult operation to take code which says 'consider all tests with a priority.fedora.compose value of < 3' and make it say 'query ShinyNewSystem for all tests we care about for this workflow' instead. I really don't see that it's possibly going to be any harder than just writing it from scratch to query ShinyNewSystem. This isn't going to be some 10,000 line thing that only qualified neurosurgeons are allowed to work on; the whole point of the way I've been trying to set it up is to allow the script that ultimately says 'do we release this or not?' to be *as short and dumb as possible*.
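
(To make that concrete, a minimal sketch of what the stopgap consumer-side check might look like - assuming ResultsDB returns extradata values as lists, and that this one function would be the only thing later swapped for a ShinyNewSystem query:)

```
def gating_results(results, threshold=3):
    """Keep results whose reported priority.fedora.compose beats the threshold.

    'results' is assumed to be a list of result dicts from a ResultsDB query;
    a missing or unparseable priority is treated as optional.
    """
    kept = []
    for res in results:
        values = res.get("data", {}).get("priority.fedora.compose", [])
        try:
            priority = int(values[0])
        except (IndexError, ValueError):
            priority = 99  # no priority reported: treat as optional
        if priority < threshold:
            kept.append(res)
    return kept
```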

The real system is also going to have to deal with the problem of 'how do we synchronize the test names between the actual test systems and this system in a sensible way?' And now that I think about it, it involves the 'test scenario' problem too, because test X on arch Y is not necessarily as important as test X on arch Z. Or test X on arch Y on image A vs. test X on arch Y on image B...

kparal added a subscriber: kparal. Feb 21 2017, 7:47 PM

It seems that somebody is wrong on the Internet and that needs to be resolved with haste! :-)

To me as an observer, it seems that you both agree that it's a bad idea, but Adam considers it good enough for now, with an easy switch to a better system in the future, while Josef is worried that this is not going away soon but will instead stay with us for years, and that switching to a different system might not be that easy once many tools consume it.

Personally, I'd rather find a different solution here. It's true that temporary solutions last for years, but my biggest gripe is how inflexible this solution is. You can't dynamically adjust the configuration, and whenever you want to make changes, you need to go ask the particular task owner to change it (so far it's just openqa, but if we add it to autocloud, depcheck, abicheck, etc., those all have different maintainers). And I might be missing something, but a replacement seems quite simple:

Let's create a project on pagure, called e.g. fedora-gating-policy or whatever (bikeshedding time). Let's populate it with yaml/ini/csv files which define the blocking testcases for Alpha/Beta/Final/daily rawhide compose/etc. I assume it would be a list of testcase names (compose.openqa.basic_boot, compose.autocloud.systemd, etc), plus some additional field requirements if needed (arch=x86_64, image.type=dvd, etc). If we wanted to verify compose $composeid, we would query such testcases with item=$composeid (or perhaps the compose.id extradata, I don't know the current convention).

It's a set of simple structured text files, easy to read, easy to find in a single place. Everybody can send pull requests, and trusted people can be added as committers across teams. It can include some helper functions if that makes sense, or not. But any consumer tool (like releng's are-we-good-to-push-compose.py) can import it or just include it and make decisions based on it. It's easy to update, and you can dynamically adjust the requirements and re-evaluate if you decide something wasn't correct.

We don't really need a service or anything, I believe. A simple git repo with a bunch of text-based requirements definitions should serve quite fine as a temporary solution. Wdyt?
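
(A rough sketch of what such a file and its consumer might look like - the milestone keys, testcase names, and field names below are all illustrative, not taken from any existing repo, and it assumes results are plain dicts with 'testcase', 'outcome' and 'data' keys:)

```
import yaml  # PyYAML

# Hypothetical contents of a policy file in the proposed repo
POLICY_YAML = """
beta:
  - testcase: compose.openqa.basic_boot
    arch: x86_64
  - testcase: compose.autocloud.systemd
"""

def blocking_failures(results, milestone, policy_text=POLICY_YAML):
    """Return failed results that the policy marks as blocking for a milestone."""
    rules = yaml.safe_load(policy_text).get(milestone, [])
    failures = []
    for rule in rules:
        # any keys beyond 'testcase' are field requirements (arch etc.)
        reqs = {k: v for k, v in rule.items() if k != "testcase"}
        for res in results:
            if res["testcase"] != rule["testcase"]:
                continue
            if all(res.get("data", {}).get(k) == v for k, v in reqs.items()):
                if res["outcome"] not in ("PASSED", "INFO"):
                    failures.append(res)
    return failures
```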

adamwill added a comment (edited). Feb 21 2017, 8:19 PM

I could live with that. For that approach, though, I don't know if I'd even bother creating the repo at this point; we may as well just have the YAML or JSON or whatever file as part of the 'is it releasable?' script for now, since that will really be the *only* actual user at this point. If we decide we need some other user, but Policy Engine or whatever the real thing is called doesn't exist yet, we can split it out into a separate project at that point...

Oh, but I'd still prefer to implement what I talked about in https://lists.fedoraproject.org/archives/list/resultsdb-users@lists.fedoraproject.org/thread/DJSIOLA3YEH6HNSJEXVPLIZT5TW4CDED/ (in terms of having a convention for defining the 'test scenario' in resultsdb extradata) than exactly what you suggested with "additional field requirements", which is essentially defining the test scenario 'downstream' in the gating-policy file. The scenarios should be recorded in resultsdb, then the gating policy can simply be a list of test scenarios.
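
(Roughly, the idea is that each result carries enough extradata to identify its scenario, so the policy file reduces to a list of scenario strings per milestone - the key name and value format here are assumptions for illustration, not the convention from that thread:)

```
# A result's extradata spelling out the full scenario it covers
result_data = {
    "item": "Fedora-Rawhide-20170221.n.0",
    "type": "compose",
    "productmd.image.arch": "x86_64",
    "scenario": "fedora.Server-dvd-iso.x86_64.install_default",  # hypothetical key
}

# The gating policy is then just a set of scenarios per milestone
BETA_SCENARIOS = {"fedora.Server-dvd-iso.x86_64.install_default"}

def covered(result_data, wanted=BETA_SCENARIOS):
    """True if this result covers a scenario the policy cares about."""
    return result_data.get("scenario") in wanted
```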

jskladan added a comment (edited). Feb 22 2017, 12:29 PM

https://pagure.io/fedora-qa/lolz_and_roffle - parses the template file from openQA for priorities, and shows whether the compose identified by a compose ID is a valid alpha/beta/final, plus a list of failures in each category. It does what you want, and does not need to be updated when tests are added to openQA.
The only tricky part is handling different priorities for different composes/images on the same testcases (e.g. i386 failures are all basically marked as optional, and so is anything Atomic now). Also, some of your testcases (the upgrades, as far as I checked) do not report arch to resultsdb, but only have 64bit in the testcase name - seems like a bug, but it is not really relevant to the current state of things.
Example output: https://paste.fedoraproject.org/paste/ar8ur1Pq5iDT7lmoso~zOF5M1UNdIGYhyRLivL9gydE=
Said tool took about two hours to write, and the worst part was figuring out how to parse the Perl file and how to handle the different priorities you have for the same testcases (so, the actual testplan-management-y stuff). Yay for the mighty resultsdb, I guess...

Thanks!

"Also, some of your testcases do not report arch (the upgrades as far as I checked) to resultsdb, but only have 64bit in testcase name - seems like a bug"

Well, not exactly. Note the name of the property you're checking: productmd.image.arch - it's the arch of the image under test, not technically the arch of the test itself. The upgrade tests are not tests of any particular image, they're tests of the compose, so obviously they don't have such a property. (In the conventions, FedoraImageResult inherits from FedoraComposeResult, so the properties of image results are a superset of the properties of compose results, more or less). We could report the arch of the test itself directly in the openQA reporter, I guess... I don't immediately see a way to do it via the conventions, for a test that we only know is a test of the 'compose' as a whole.
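
(A rough illustration of the distinction, using made-up values - an image result carries productmd.image.* keys, while a compose-level result such as an upgrade test does not:)

```
# Result for a test of a specific image: arch describes the image under test
image_result_data = {
    "productmd.compose.id": "Fedora-Rawhide-20170221.n.0",
    "productmd.image.arch": "x86_64",
    "productmd.image.type": "dvd",
}

# Result for a compose-level test (e.g. an upgrade test): no image, so no
# productmd.image.* keys, even though the test itself ran on some arch
compose_result_data = {
    "productmd.compose.id": "Fedora-Rawhide-20170221.n.0",
}
```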

"Example output of said tool, which took about two hours to write, and the worst part was figuring out how to handle the different priorities you have for the same testcases (so, the actual testplan-management-y stuff)."

This is exactly what the 'test scenarios' stuff I wrote is meant to cover, so I'd want to do it in line with how I wrote that: have the test scenario reported to ResultsDB, and query on scenarios. This is also important for de-duplication, which your script doesn't do: if we have a fail and then a pass for the same test (i.e. it was re-run and passed on the second attempt), your script will still count it as a failure.
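
(De-duplication is cheap once each result carries a scenario and a submit time - a minimal sketch, assuming plain result dicts with those keys:)

```
def deduplicate(results):
    """Keep only the newest result per (testcase, scenario), so a passing
    re-run overrides an earlier failure."""
    latest = {}
    for res in sorted(results, key=lambda r: r["submit_time"]):
        latest[(res["testcase"], res["scenario"])] = res
    return list(latest.values())
```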

Bonus points for itertools.product usage =)

Thanks for the script, I'll probably take it and rewrite it a bit. I think I'd rather store the priorities in YAML or something as kparal suggested (rather than infer them on the fly on every run) - since we're saying that having openQA 'know' the priorities doesn't make sense, it seems odd to write a script that goes ahead and does that. And I want to include Autocloud results as well, which this doesn't do.

adamwill abandoned this revision. Feb 22 2017, 7:29 PM

So I'm abandoning this as @jskladan doesn't like it. We'll take a different tack.