add config support
ClosedPublic

Authored by kparal on Mar 27 2014, 4:06 PM.

Details

Summary

Features:

  1. Internal class with (hopefully) reasonable defaults. Stored in config_defaults.py
  2. Support for config profiles - development (default), production and testing. The testing profile is automatically set up when running the test suite and doesn't touch the disk. You can change the profile either in the config file or using the envvar TASKOTRON_PROFILE (which takes precedence).
  3. External config file in YAML syntax. Everything is commented out by default, and if you uncomment an option, it overrides the default value (regardless of the chosen profile). The config file documentation reflects the default _production_ values; the default development values are to be seen in config_defaults.py. The config file is searched for in /etc/taskotron and the local checkout dir. All option values are checked for type correctness on parsing.
  4. Config instance is not created on import, only on the get_config() function call. This allows importing the module without any disk access and simplifies testing.
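The lazy-initialization idea in point 4 can be sketched roughly as follows; the class layout and option names here are illustrative assumptions, not the actual libtaskotron code:

```python
import os

class Config(object):
    # development profile defaults (illustrative values only)
    profile = 'development'
    reporting_enabled = False
    tmpdir = '/var/tmp/taskotron'

class ProductionConfig(Config):
    profile = 'production'
    reporting_enabled = True

_config = None  # module-level cache, created lazily

def get_config():
    # Nothing touches the disk or the environment at import time; the
    # profile (TASKOTRON_PROFILE wins over the built-in default) is
    # resolved only when get_config() is first called.
    global _config
    if _config is None:
        profile = os.environ.get('TASKOTRON_PROFILE', 'development')
        _config = ProductionConfig() if profile == 'production' else Config()
    return _config
```

Every later get_config() call returns the same instance - the module-level _config discussed further down in this review.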

Bring on your critique!

Test Plan

Lots of new unit and functional tests.

Diff Detail

Lint
Lint Skipped
Unit
Unit Tests Skipped

Some further notes:

  1. I have included some config options that are not used at the moment (like tmpdir, cachedir, logfile, etc). I wanted to have them, first as a demonstration, and second because I think they will be needed in the near future anyway. But I can remove them if requested.
  2. yumrepoinfo.py will be part of a separate patch, if somebody's waiting on it. I didn't want to make this patch any larger.

A few small questions but overall, looks pretty good to me. My biggest concern is about the global _config and if we really want to be using it.

conf/taskotron.yaml
36

I wonder if we shouldn't have reporting disabled by default since that'll be the config that people are using to poke at taskotron

libtaskotron/config.py
48

I don't really like the use of globals - what are we gaining by making _config a global variable? The only thing I can think of is if we wanted to enable config changes outside of the config file and wanted to make them persistent. Are we using that and even if we are, is that something we really want to be doing?

162

I wonder if this would be better as safe_load. Granted, we aren't loading it from an untrusted source but I don't think we need anything more than what safe_load gives us
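For illustration, the difference the suggestion is about (the document contents below are made up):

```python
import yaml

doc = """
reporting_enabled: true
tmpdir: /var/tmp/taskotron
"""

# safe_load constructs only plain Python types (dict, list, str, int, ...);
# yaml.load with the default Loader can instantiate arbitrary Python
# objects from tags like !!python/object, which nothing here needs.
config = yaml.safe_load(doc)
```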

libtaskotron/config_defaults.py
48

This should probably be /var/tmp - we've run into issues with running out of tmpfs space before

testing/functest_config.py
19

Why use stdlib's tempfile instead of the tmpfile in pytest, which doesn't have the cleanup problem? pytest keeps tempfiles from only a couple of runs (4 IIRC) and deletes any older ones.

127

Using pytest's builtin tmpdir feature would avoid the need for this finally block

kparal added inline comments.Mar 28 2014, 8:52 AM
conf/taskotron.yaml
36

Actually, reporting_enabled is False by default, because by default we use the development profile. However, the option documentation is written for the production profile (where reporting_enabled is True), because I didn't want to force production people to read the source code (developers, OTOH, can easily read the source code). I tried to explain this in the general section.

But you're right, this is and will be confusing. Would it be a better way to document both defaults in cases where they differ? Something like:

## [default: True for production profile, False for development profile]
#reporting_enabled: True
libtaskotron/config.py
48

I'm not sure I understand this. If we don't store the Config instance somewhere, we will need to create a fresh new instance for every get_config() call. That means parsing all the files and spewing all possible warnings over and over again. That is a) slow, b) allows you to reconfigure the program in the middle of its execution (by adjusting the config file), which I'm not totally comfortable with, and c) going to happen really often (some modules might do it several times in different places, e.g. every time we create a temporary file, we'll be accessing config.tmpdir). Or did you have some other solution in mind?

You're right that a singleton Config instance allows reconfiguration on the fly. Is that a bad thing? We might not use it at all, but I can imagine some situations where it's useful. For example if user-provided tmpdir or logfile isn't writable, we can change it to some fallback value, and issue a warning just once. If we couldn't adjust a shared config instance, we would have to do the fallback and emit the warning for every attempt to use that file/directory, duplicating the warnings countless times.

162

Oh, thanks, I didn't know about that method.

libtaskotron/config_defaults.py
48

Depends on what we want to use it for, but in general, yes, I think you're right.

testing/functest_config.py
19

That's neat, I didn't know about pytest's tmpfile. The problem is, I don't know how to use it inside setup_method(). I already had this problem with monkeypatch; I had to create separate methods (set_up_conf_file_loading(), disable_create_dirs(), unset_profile()) and call them manually from most of the methods. Which is quite inconvenient, especially since I want to have the temporary file for almost all test_ methods. Do you know how to use it inside setup_method()? I failed to google it.

By the way, according to tempfile.NamedTemporaryFile() documentation, the file also gets automatically removed on close, so there's probably no disadvantage in using it instead.

127

Yeah, here it really helps, because tempfile doesn't offer automatically removed dirs. Thanks.

kparal updated this revision.Mar 28 2014, 9:03 AM
  • use yaml.safe_load() instead of yaml.load()
  • move tmpdir to /var/tmp
  • make use of pytest's tmpdir
  • state defaults for both profiles in the config file
kparal updated this revision.Mar 28 2014, 9:05 AM
  • remove unintended code duplication
kparal updated this revision.Mar 28 2014, 9:07 AM
  • ugh, once more

But you're right, this is and will be confusing. Would it be a better way to document both defaults in cases where they differ?

Yeah, I like that better. It's easy to follow

Do you know how to use it inside setup_method()?

Off the top of my head, no. That might be a good question to post on the TIP (testing in python) list, though.

I'm not sure I understand this. If we don't store the Config instance somewhere, we will need to create a fresh new instance for every get_config() call. That means parsing all the files and spewing all possible warnings over and over again. That is a) slow, b) allows you to reconfigure the program in the middle of its execution (by adjusting the config file), which I'm not totally comfortable with, and c) going to happen really often (some modules might do it several times in different places, e.g. every time we create a temporary file, we'll be accessing config.tmpdir). Or did you have some other solution in mind?

I think that part of it is the instinctive "ew" I have when I see the global keyword because all of the alternatives I can think of aren't all that different in practice. The best other solution I can think of would be to have a @property method in the base runner class that returns the config data and runs the initialization if it hasn't already been run. I think I'd prefer that solution in case we ever wanted to have multiple runners instantiated. While I'm not sure we'd ever use it, it's possible that we might want to do that for gui checks or other tasks where multiple machines and sub-tasks are involved.

You're right that a singleton Config instance allows reconfiguration on the fly. Is that a bad thing? We might not use it at all, but I can imagine some situations where it's useful. For example if user-provided tmpdir or logfile isn't writable, we can change it to some fallback value, and issue a warning just once. If we couldn't adjust a shared config instance, we would have to do the fallback and emit the warning for every attempt to use that file/directory, duplicating the warnings countless times.

I don't have a problem with changing config on the fly, I was just trying to think through reasons to have _config be global.

Off the top of my head, no. That might be a good question to post on the TIP (testing in python) list, though.

I've found out that fixtures can solve this. I have adjusted the patch.
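For the record, this is roughly what the fixture-based setup looks like; the names are illustrative, not the actual test code from the patch:

```python
import pytest

class TestConfig(object):
    @pytest.fixture(autouse=True)
    def setup_config(self, tmpdir, monkeypatch):
        # Runs before every test method in the class, replacing
        # setup_method(); unlike setup_method(), a fixture can itself
        # request other fixtures such as tmpdir and monkeypatch.
        self.conf_file = tmpdir.join('taskotron.yaml')
        monkeypatch.delenv('TASKOTRON_PROFILE', raising=False)

    def test_conf_file_write(self):
        self.conf_file.write('reporting_enabled: false\n')
        assert self.conf_file.check()
```

pytest manages the tmpdir lifetime itself, so no finally block or manual cleanup is needed.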

I think that part of it is the instinctive "ew" I have when I see the global keyword because all of the alternatives I can think of aren't all that different in practice. The best other solution I can think of would be to have a @property method in the base runner class that returns the config data and runs the initialization if it hasn't already been run. I think I'd prefer that solution in case we ever wanted to have multiple runners instantiated. While I'm not sure we'd ever use it, it's possible that we might want to do that for gui checks or other tasks where multiple machines and sub-tasks are involved.

You mean multiple Runner instances in a single process, right? Because if we run multiple Runners in parallel in separate python processes, they don't share modules (nor their global variables). I just checked :-)

If you really mean a single process, then I think we should introduce a Context class, which will hold this and similar references. The runner will then need to pass the Context instance to all classes it initializes (which need to access the Context). But I think it complicates the code quite a bit and I'm not really sure we should do it right now. We're not fully sure what lies ahead and maybe we won't need it at all. It can easily be added in the future when needed; it's just reference passing. What do you think?
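A minimal sketch of the Context idea, with all names hypothetical:

```python
class Context(object):
    # a plain container for shared references; no behaviour of its own
    def __init__(self, config):
        self.config = config

class KojiClient(object):
    def __init__(self, context):
        # read what we need from the context the caller passes in,
        # instead of reaching for a module-level singleton
        self.koji_url = context.config['koji_url']

context = Context({'koji_url': 'https://koji.fedoraproject.org'})
client = KojiClient(context)
```

The runner would create one Context and hand the same instance to everything it initializes.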

kparal updated this revision.Mar 31 2014, 1:09 PM
  • use fixtures to simplify test initialization
ralph added a comment.Mar 31 2014, 1:47 PM

FWIW, I agree that keeping around a global module-level config object is a bad idea. Stuff like that has bitten me a few times.

If you really mean a single process, then I think we should introduce a Context class, which will hold this and similar references. The runner will then need to pass the Context instance to all classes it initializes (which need to access the Context). But I think it complicates the code quite a bit and I'm not really sure we should do it right now. We're not fully sure what lies ahead and maybe we won't need it at all. It can easily be added in the future when needed; it's just reference passing. What do you think?

Web frameworks like pyramid do what you're talking about with a Context class. Pyramid passes the request object to almost every callable during the creation of a response; it is not otherwise globally available. It's nice in that it makes it easy to reason about when looking at a small section of code and it makes it easy to test individual pieces: no need to mock out some global request-thing that the code you're testing is then going to go and mysteriously access.

Web frameworks like flask do what Tim talked about by hanging a @property off of a base object. This is minimally invasive, but is less explicit. Maybe harder to recognize what's going on at first.

I guess I'd vote for the "@property of the BaseRunner" approach to make things easy.

Web frameworks like flask do what Tim talked about by hanging a @property off of a base object. This is minimally invasive, but is less explicit. Maybe harder to recognize what's going on at first.

I'm not clear how to do that. In order to access Runner.config from e.g. koji_utils.py, I need to have a Runner instance. Where do I get it? And if I receive it in a class constructor for every interested class (e.g. koji_utils.KojiClient), why is it better to pass a Runner instance (an object designed to do many important things) than a Context instance (a very simple container for the most important instance references, no redundant methods and attributes)?

Can you give me a snippet of code? Thanks.

ralph added a comment.Mar 31 2014, 4:06 PM

I'm not clear how to do that. In order to access Runner.config from e.g. koji_utils.py, I need to have a Runner instance. Where do I get it? And if I receive it in a class constructor for every interested class (e.g. koji_utils.KojiClient), why is it better to pass a Runner instance (an object designed to do many important things) than a Context instance (a very simple container for the most important instance references, no redundant methods and attributes)?

Ah, you're right.

Can you give me a snippet of code? Thanks.

I'm not sure how to submit a patch against an existing differential. Is it possible? Either way, here's a take on passing the config object into directives and the koji client:

  1. http://threebean.org/patches/0001-Do-away-with-the-module-level-singleton.patch
  2. http://threebean.org/patches/0002-Have-the-runner-pass-a-config-object-to-its-directiv.patch

I'm not sure how to submit a patch against an existing differential. Is it possible?

Either way, here's a take on passing the config object into directives and the koji client:

The only thing that I'm not crazy about with that specific implementation is the lack of boundaries between the runner and the directives - it uses a python object instead of passing the data in using a simple dict-ish structure.

That brings up 2 questions in my mind:

  • Is it important to simplify the interfaces between the runner and directives?
  • Do we want to allow runner-level config changes to be allowed from directives?

As much as I'd like to simplify interfaces between the runner and the directives, I'm not sure it's all that required at the moment. All directives (even the ones that handle non-python code) will be written in Python so cross-language compatibility isn't much of an issue until we get to the point where we're actually interfacing with non-python code within those directives. Since that's not right now, I don't think that's a needed complexity for the moment.

As far as allowing runner-level config changes inside the directives go, I'm not sure that's a great idea. Mutable vs. immutable config values are one thing but allowing a directive to change the config values used by all other directives seems like an invitation to strange and complex bugs with little/no other benefit that couldn't be dealt with in other ways.

I'm wondering if adding something like a to_dict() or to_bunch() method to the config class and using that to pass in a dict representation of the current config values would be a good way forward - that would effectively simplify the config interface between the runner and the directives and disallow config changes in the directives. If it turns out that we really need to allow config changes inside directives, we could shift to storing config as a mutable dict (in the case of to_dict()) or just start passing the object reference in (in the case of to_bunch()) without significant code changes inside the existing directives.

I discussed this with @jskladan in length and these are major issues we identified:

  1. We need to have the libraries usable even when not executed through Runner. In order to be able to run checks like depcheck or upgradepath standalone, they need to be able to import e.g. koji_utils.py and things must work out of the box. So even if we add a config= parameter into KojiClient.__init__(), we still need to be able to access a default config instance if the value is empty:
class KojiClient(object):
    def __init__(self, config=None):
        self.config = config
        if self.config is None:
            self.config = libtaskotron.config.get_config()

@jskladan is not fond of having this boilerplate in each of our classes.
Because we don't want to initialize the config over and over again for every imported module/class, there still needs to be a config._config singleton instance that will be used in these cases (execution outside of Runner). Which somewhat defeats the whole purpose (of having config as a parameter). Do you see a better way to deal with it?

  2. If we want to initialize a config-dependent class/module from a non-config-dependent class/module, there is a problem (even if we run it through Runner). Let's say I want to use KojiClient from my_utils.py (not an ideal example, but it will do for illustration purposes). If I don't need config in my_utils.py, I don't have it there. How do I initialize KojiClient?

The only solution I see is to pass config through the whole chain. If any single class needs config during init, all potential callers need to receive it and pass it over, recursively. This potentially ends up with having config parameter just about everywhere :-) In class constructors and in all module functions.

  3. The use case of having multiple Runner objects (most probably running them in threads) is not likely to happen in the near future. We would rather keep the code simpler and complicate it when needed, than complicate it right now because it might come in handy. @jskladan says: Done is better than perfect.

What do you think about it? Is it worth it to try to deal with it right now, when we don't know our concrete use cases yet?

Other comments:

Do we want to allow runner-level config changes to be allowed from directives?

I can't imagine any use cases for it at the moment, but if we find some and we think it's a bad idea (it probably is), we can just reject any patch that tries to do it. Of course if there is a strong desire, we can find better ways, e.g. deepcopy the Config (or convert it into a dict) and pass it to the directives - in this case the aforementioned issues don't seem to be present, because directives are used only with Runner and they share a common interface, so __init__() adjustments are simple.
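A rough sketch of the deepcopy/to_dict() variant; the option names are made up for illustration:

```python
import copy

class Config(object):
    # illustrative options
    reporting_enabled = False
    koji_url = 'https://koji.fedoraproject.org'

    def to_dict(self):
        # deep-copy the public attributes into a plain dict, so a
        # directive mutating the snapshot cannot affect the live Config
        return {name: copy.deepcopy(getattr(self, name))
                for name in dir(self)
                if not name.startswith('_') and name != 'to_dict'}

cfg = Config()
snapshot = cfg.to_dict()
snapshot['reporting_enabled'] = True  # changes only the snapshot
```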

I'm not sure how to submit a patch against an existing differential. Is it possible?

I think the only way is to attach a file (just drag n drop on Phab home page) and link it here (there's some keyword for it), or create a new Differential revision and link it here (and close it when the discussion is over). I think both options are completely OK.

ralph added a comment.Apr 1 2014, 1:58 PM

The only thing that I'm not crazy about with that specific implementation is the lack of boundaries between the runner and the directives - it uses a python object instead of passing the data in using a simple dict-ish structure.

A dict object sounds good to me. That's how fedmsg internally models its config -- just a dict.


So, this looks like an anti-pattern to me:

class KojiClient(object):
    def __init__(self, config=None):
        self.config = config
        if self.config is None:
            self.config = libtaskotron.config.get_config()

If we're going to pass in config, we should just pass in config and make it mandatory (or at least give it a nice empty default). If we're going to load the config from a global singleton in the constructor, we should just do that. Then again, if it is going to be held as a singleton in libtaskotron.config, why bother hanging it on self? (that seems like a bad idea to me.)

We could make the KojiClient responsible for its own defaults, which might simplify things. Say, first we downgrade the config object from being a bona-fide object, to just a dict. Then, we could do something like this:

import copy

class KojiClient(object):
    defaults = {
        'koji_url': 'https://koji.fedoraproject.org',
        # etc...
    }
    def __init__(self, config=None):
        # Set up our config initially with the defaults
        self.config = copy.copy(self.defaults)
        # Then overwrite with anything that may have been passed in.
        self.config.update(config or {})

The above example could be reworked to apply to the Directives, too.

Then again, KojiClient is a kind of 'leaf-node' in the object dependency chain (a runner needs directives, directives need a koji client), so we might give it a more explicit initializer:

class KojiClient(object):
    def __init__(self, url='https://koji.fedoraproject.org'):
        self.url = url
kparal added a comment.Apr 2 2014, 1:16 PM

Then again, if it is going to be held as a singleton in libtaskotron.config, why bother hanging it on self? (that seems like a bad idea to me.)

That was just a convenience measure. If you want to use it in several places, it's easier to write self.config.option than libtaskotron.config.get_config().option. But there's certainly nothing wrong with the second one either.

We could make the KojiClient responsible for its own defaults, which might simplify things.

I agree that this approach of placing the default values directly into relevant objects makes a lot of sense and makes those objects more independent, at a certain cost. But it doesn't apply everywhere, e.g. config.reporting_enabled = False - that's a shared option for all reporting directives and it needs to be in a central place, libtaskotron.config making the most sense, I think. So, in certain cases we won't avoid this problem by placing the defaults directly into modules/classes.

class KojiClient(object):
    def __init__(self, url='https://koji.fedoraproject.org'):
        self.url = url

I think the major problem still stands. If the user changed the config file (changed the koji_url value), but he runs the check standalone (outside of Runner), how do you plan to deliver the changed URL to the KojiClient object?

ralph added a comment.Apr 2 2014, 1:59 PM

We could make the KojiClient responsible for its own defaults, which might simplify things.

I agree that this approach of placing the default values directly into relevant objects makes a lot of sense and makes those objects more independent, at a certain cost. But it doesn't apply everywhere, e.g. config.reporting_enabled = False - that's a shared option for all reporting directives and it needs to be in a central place, libtaskotron.config making the most sense, I think. So, in certain cases we won't avoid this problem by placing the defaults directly into modules/classes.

Idea: a central place for that config value could be the BaseDirective.

class KojiClient(object):
    def __init__(self, url='https://koji.fedoraproject.org'):
        self.url = url

I think the major problem still stands. If the user changed the config file (changed the koji_url value), but he runs the check standalone (outside of Runner), how do you plan to deliver the changed URL to the KojiClient object?

Here my knowledge of libtaskotron gets fuzzy. How do you run a task standalone? Do you mean just by the user writing their own python code that employs KojiClient? Or is there some other standard way of running tasks without the Runner? If so, it seems "too magical" for KojiClient to be able to go out and find its own proper config files and values when it is called by some other code. That calling code should be responsible for setting up KojiClient with what it needs.

In the end, these are my opinions.. not science. I feel strongly about them, but at some point we're just painting the bike shed different colors.

Here my knowledge of libtaskotron gets fuzzy. How do you run a task standalone? Do you mean just by the user writing their own python code that employs KojiClient? Or is there some other standard way of running tasks without the Runner?

In AutoQA, we mandated that the check is run through our autotest+autoqa framework. It was not possible to run the script standalone (some imports wouldn't work, you wouldn't have correct variables passed, etc). This made AutoQA check development much harder than it could have been. For example, it was quite difficult to run the check through a debugger.

Because of this, one of our goals for Taskotron was to allow the developers to easily run the check standalone, without running it through the whole framework. So, my initial attempt at upgradepath reimplementation looks like this (not runnable at the moment, still waiting for some patches, including this one):
https://bitbucket.org/fedoraqa/task-upgradepath/src/282a315c52363eefac466b717ca5017efc7b8c4a/upgradepath.py?at=feature/T79-upgradepath

And the idea is to have it runnable standalone:
$ ./upgradepath.py f20-updates
(in this case, just the script is executed and its output is visible on stdout)
and through taskotron runner:
$ runtask --tag f20-updates task-upgradepath/upgradepath.yaml
(in this case, the yaml directives are executed, therefore some files can be pre-downloaded etc, and TAP output is visible on stdout)

In the code, I import those libtaskotron modules that I need to work with. My assumption is that we should provide the developers with exactly the same environment, no matter whether they run the check standalone or through our runner. Most developers will probably be satisfied with the provided config defaults, but some of them might need to change something.

Let's say your check needs to be authenticated with FAS to retrieve some restricted information. If we parse the config files always, the developer can use fas_username and fas_password in taskotron.conf. If we parse the config files only when run through Runner, the developer needs to implement his own config file parsing and use it when he runs the script standalone.

Or, let's say your check downloads really large files (.iso files) and operates on them. You might want to change the default download/cache directory to something else because of your partition layout. You can change cachedir in taskotron.conf, but if we don't parse it when running the script standalone, you also need to implement a --cachedir=dir option and run your script with it in this case.

Yet another use case that comes to my mind is when I want to run the check standalone and use some fake Bodhi interface to report results to, for testing/debugging purposes (we implemented the mock_fedorainfra tool for this purpose in the past). Again, it's easier to use the existing config file than to implement your own config files or cmdline options.

So I'm not saying we really have to parse config files in standalone mode, but the opposite approach seems to be a bit inconvenient for check developers. But I'm very open to all ideas on how to make this work in a convenient, yet "proper" way. We might even want to reconsider our stance on supporting standalone execution, if people find it reasonable.

If so, it seems "too magical" for KojiClient to be able to go out and find its own proper config files, values when it is called by some other code.

I agree that it's a bit unusual for a library to parse a configuration file. OTOH, if I replace the word "library" with "framework", it's no longer unusual. I don't really know what the best practices are here, I simply tried to find an approach that seems to work best in our case.

In the end, these are my opinions.. not science. I feel strongly about them, but at some point we're just painting the bike shed different colors.

Thanks very much for providing them here. Since this conversation seems to have a bit larger scope than I expected initially, and some people's work is blocked on this review request (they need to have config support available), does anybody mind if I commit this exactly as it is, and we continue this conversation in a separate ticket/in a mailing list? Once we find the best way, I'll create a patch to adjust the existing code.

ralph accepted this revision.Apr 3 2014, 1:22 PM

Thanks for the enormously helpful response about standalone running. That clarifies a lot.

> If so, it seems "too magical" for KojiClient to be able to go out and find its own proper config files, values when it is called by some other code.

I agree that it's a bit unusual for a library to parse a configuration file. OTOH, if I replace the word "library" with "framework", it's no longer unusual. I don't really know what the best practices are here, I simply tried to find an approach that seems to work best in our case.

Yeah, I agree. I think of the distinction between "framework" and "library" as being: "a library is code that your code calls whereas a framework is code that calls your code." Working from there: taskotron taken as a whole is a framework (or at least the BaseRunner). When I look at the KojiClient (or stuff in any *utils.py file) I get the itch to call it a library, to be used by other code and/or the framework. If the framework (BaseRunner) calls into the library (KojiClient) which then calls back into the framework (taskotron.config), then we have a problem with 'boundaries', like Tim was talking about earlier in the conversation.

Since this conversation seems to have a bit larger scope than I expected initially, and some people's work is blocked on this review request (they need to have config support available), does anybody mind if I commit this exactly as it is, and we continue this conversation in a separate ticket/in a mailing list?

That sounds good (sorry for blocking this for so long -- I didn't realize how much other stuff depended on it at the moment). Your explanation of the needs of standalone tasks pretty much solves it for me. No need for further discussion unless there are particular things we need to hammer out. Accepted!

tflink accepted this revision.Apr 3 2014, 2:32 PM

Because of this, one of our goals for Taskotron was to allow the developers to easily run the check standalone, without running it through the whole framework. So, my initial attempt at upgradepath reimplementation looks like this (not runnable at the moment, still waiting for some patches, including this one):

https://bitbucket.org/fedoraqa/task-upgradepath/src/282a315c52363eefac466b717ca5017efc7b8c4a/upgradepath.py?at=feature/T79-upgradepath

While I'm not necessarily against the idea of running tasks completely standalone, the primary goal was to enable running things locally without needing to install/configure the entire system from trigger to reporting - just libtaskotron.

Let's say your check needs to be authenticated with FAS to retrieve some restricted information. If we parse the config files always, the developer can use fas_username and fas_password in taskotron.conf. If we parse the config files only when run through Runner, the developer needs to implement his own config file parsing and use it when he runs the script standalone.

One option might be to provide re-usable config parsing code so that a task author can have separate entry points - one for standalone operation that parses and sets up config values and the other, used by the runner, which takes in processed config information.
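Sketched out, that two-entry-point split might look like this; names and values are illustrative, and the dict literal stands in for the shared config-parsing helper:

```python
def run(koji_url, arch='x86_64'):
    # core check logic: takes already-processed config values, so the
    # runner can call it directly with its own parsed configuration
    return 'checking %s builds at %s' % (arch, koji_url)

def main():
    # standalone entry point: do the config parsing here, then hand
    # plain values to run()
    config = {'koji_url': 'https://koji.fedoraproject.org'}
    return run(config['koji_url'])
```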

Thanks very much for providing them here.

Seconded. Thank you for participating in the reviews and providing another point of view to the discussion.

Since this conversation seems to have a bit larger scope than I expected initially, and some people's work is blocked on this review request (they need to have config support available), does anybody mind if I commit this exactly as it is, and we continue this conversation in a separate ticket/in a mailing list? Once we find the best way, I'll create a patch to adjust the existing code.

Yeah, this would work. I agree that we need to get _something_ working and we can continue the discussion to refine things.

kparal closed this revision.Apr 3 2014, 2:57 PM

Closed by commit rLTRN2e3f891462ad (authored by @kparal).