Add the option of "cached" downloads to the file/koji utils.
The file_utils download() method now takes a can_cache argument that
makes it download the file into a common 'cache' directory (set in
the configuration file). The 'cached' download needs to be enabled in the
configuration and then explicitly requested in the download() call,
so we can control which kinds of files we cache (RPMs for now) and
whether to do the caching at all (dev vs. production).
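As a rough illustration of the behaviour described above, here is a minimal sketch, not the actual libtaskotron code. The function name `download_sketch`, the `fetch` callable, and the `cachedir` parameter are all hypothetical; the real download() takes a can_cache argument and reads the cache directory from the configuration.

```python
import os


def download_sketch(url, dirname, fetch, cachedir=None):
    """Fetch `url` into `dirname`; with `cachedir`, fetch once and symlink.

    `fetch(url, path)` is a hypothetical callable that writes the remote
    file to `path`. Passing cachedir=None stands in for "caching disabled".
    """
    filename = os.path.basename(url)
    dest = os.path.join(dirname, filename)
    os.makedirs(dirname, exist_ok=True)

    if cachedir is None:
        # Caching off: always download straight into the target directory.
        fetch(url, dest)
        return dest

    os.makedirs(cachedir, exist_ok=True)
    cached = os.path.abspath(os.path.join(cachedir, filename))
    if not os.path.exists(cached):
        # First request for this file: download it into the shared cache.
        fetch(url, cached)

    # Replace whatever is at dest with a symlink to the cached copy.
    if os.path.lexists(dest):
        os.remove(dest)
    os.symlink(cached, dest)
    return dest
```

With caching on, repeated requests for the same file cost one download plus one symlink per target directory.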
Details
- Reviewers
  tflink, jskladan
- Maniphest Tasks
  T225: koji_utils: download to a common directory, don't download if not needed
- Commits
  rLTRN4cc01387f51d: implement caching of RPM downloads
  rLTRN6b42d633a51b: config_defaults: fix incorrectly named reporting_enabled key
  rLTRNcb237606de4c: libtaskotron.spec: add /var/tmp/tasktron
Unit tests pass; tried out on several rpmlint runs.
Diff Detail
- Repository
  rLTRN libtaskotron
- Lint
  Automatic diff as part of commit; lint not applicable.
- Unit
  Automatic diff as part of commit; unit tests not applicable.
It looks pretty good to me, but is this meant to be used in production or mostly as a way of reducing downloads on dev instances? If it's meant for production, I'd like to see some method for cache size management before this gets released.
| libtaskotron/config.py | ||
|---|---|---|
| 235 | probably the smallest nit I could find but circular is spelled wrong | |
| libtaskotron/config_defaults.py | ||
| 74 | wouldn't /var/cache/taskotron be better than /var/tmp? | |
This is meant for DEV, I honestly do not think that we really need (or even want) to do the caching (at least this naive way of caching) anywhere else.
I was thinking about requiring Development profile, but after discussing it with kparal, we decided that a simple on/off switch in configuration will do just fine.
Do you think it would be worth having on production?
| libtaskotron/config.py | ||
|---|---|---|
| 235 | thanks! | |
No, I'd rather not have it in production. I can't really think of an advantage to enabling it there, to be honest.
After looking at this again, I'm still not a huge fan of how the file_utils import is handled in conf, but it's not a huge issue. What do you think about keeping the CONFIG object out of file_utils and leaving the config handling to anything that calls download()? That would change the can_cache arg to do_cache or cache, and the decision of whether or not to cache a file would be made by the caller.
That feels a bit cleaner and more straightforward to me but I don't have my heart set on the change. Thoughts?
Cool.
To be honest, I'm not the biggest fan either, and since the caching is used just in one place anyways, I think it is not even a PITA to do it that way. I'll send a patch shortly.
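The caller-side decision discussed here could look roughly like the sketch below. It assumes a config object exposing `download_cache_enabled` and `cachedir` (the key names come from the example config discussed in this review); the helper `pick_cachedir` and the cache path are hypothetical.

```python
from types import SimpleNamespace


def pick_cachedir(config):
    """Return the cache directory if caching is enabled, else None.

    Hypothetical helper: the caller consults the config, so the download
    code itself never has to import or inspect the CONFIG object.
    """
    if getattr(config, "download_cache_enabled", False):
        return config.cachedir
    return None


# Stand-in for the real config object; the path is only an example.
config = SimpleNamespace(download_cache_enabled=True,
                         cachedir="/var/tmp/taskotron/cache")
cachedir = pick_cachedir(config)
# The caller would then pass the decision down, for example:
# download(url, dirname, cachedir=cachedir)   # hypothetical signature
```

This keeps the download helper free of configuration concerns, which is the point of the suggestion.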
I might end up strangled for saying this, but the download() method starts to be a bit more complicated with all that caching, and I think it would appreciate a few functional tests, at least to ensure that symlinks are created properly, in correct directories, and files are removed when they are supposed to be removed.
| conf/taskotron.yaml.example | ||
|---|---|---|
| 100–101 | Would it make sense to move cachedir into PATHS section and download_cache_enabled either leave here or move it into RESOURCES section? | |
| 106–107 | Perhaps we should make it clear that at the moment we only intend to and support caching of Koji RPM downloads, nothing else. | |
| 108 | I would love to have this enabled by default in development and disabled in production. We now have a tmpfiles.d config file for cleaning up directories, so old cached downloads can be easily purged. Even if the person does not install that config file, caching still uses less disk space, because each file is downloaded only once and the rest are symlinks. With caching disabled, it is downloaded every time and left lying around in /var/tmp. Are there any reasons against split defaults for development/production, similar to what we have for reporting_enabled or log_file_enabled? | |
| libtaskotron/config_defaults.py | ||
| 74 | This should also be added to config.py:_create_dirs() so that it's created on libtaskotron start and libtaskotron methods can assume it exists. Also, shouldn't this be in the spec file? I see that /var/tmp/taskotron is also missing from the spec file; I guess that's a mistake. | |
| libtaskotron/file_utils.py | ||
| 62–81 | I have looked into urlgrabber documentation and it seems it can do this for us, and even better. Look for reget in http://urlgrabber.baseurl.org/help/urlgrabber.grabber.html . If we use reget = 'check_timestamp', it's actually safer than what we do in _samefile() and it can even continue a previously interrupted download, which is great. This would allow us to get rid of a lot of our custom code (which I don't have too much faith in, anyway). | |
| 85–93 | The description seems to not be true anymore, AFAICS. And I can't find any method actually using this. <evil plan>Hmm, what about just getting rid of this?</evil plan> If we remove it, I see the desired behavior like this:
So basically this means using reget always, and only manually checking for the third case and running os.remove() if it's present. The resulting logic seems to be pretty simple (simpler than what we have now). | |
| 118–128 | This will all disappear once this patch is rebased to latest libtaskotron. Just a note. | |
| 169 | Recently we did some changes in koji_utils.py and in makedirs() to stop wrapping OSError into our own errors, because they are likely not related. I think the same approach would be fine here, just a simple raise would work (the same applies to line 161). | |
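The split defaults proposed in the comment on line 108 could be sketched as below. The class names, the `PROFILES` mapping, and `defaults_for` are hypothetical and only illustrate per-profile overrides of `download_cache_enabled`; they are not the actual config_defaults.py contents.

```python
class Config:
    """Defaults shared by every profile (hypothetical subset)."""
    download_cache_enabled = False


class ProductionConfig(Config):
    """Production keeps the shared default: no caching."""


class DevelopmentConfig(Config):
    """Development turns caching on by default."""
    download_cache_enabled = True


# Hypothetical lookup from profile name to its defaults class.
PROFILES = {
    "production": ProductionConfig,
    "development": DevelopmentConfig,
}


def defaults_for(profile):
    """Return a defaults object for the named profile."""
    return PROFILES[profile]()
```

A value in the user's configuration file would still override whichever default the profile supplies.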
In a new turn of events (after I have finished my new version of the patch), I've found out that reget='check_timestamp' in urlgrabber is documented, but not implemented, and reget='simple' works on partial files but not on fully downloaded files (it just re-downloads them again). *headdesk*
I humbly returned to the original version of using urllib for checking file size.
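A size comparison of this kind might look like the following sketch. It uses Python 3's `urllib.request` rather than the urllib module the patch was written against, and the function names `remote_size` and `needs_download` are hypothetical, not the ones in file_utils.

```python
import os
import urllib.request


def remote_size(url):
    """Return the size the server reports for `url`, or None if unknown."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        length = response.headers.get("Content-Length")
    return int(length) if length is not None else None


def needs_download(path, url, get_remote_size=remote_size):
    """Decide whether `url` must be (re)downloaded to `path`.

    A missing local file, an unknown remote size, or a size mismatch
    all mean the file should be fetched again.
    """
    if not os.path.exists(path):
        return True
    expected = get_remote_size(url)
    if expected is None:
        return True
    return os.path.getsize(path) != expected
```

Comparing sizes only catches truncated or obviously different files; it cannot tell apart two files of identical length.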
I talked to the maintainers and python-urlgrabber is basically a dead project now.
There's no point in trying to push reget="simple" fixes into it. So I just
addressed the rest of the concerns I raised. I also fixed a few unrelated issues,
which I split into separate commits (which Phab unfortunately doesn't show).
Two changes may be controversial: a) caching is now enabled by default in the
development profile, and b) I'm raising an exception and exiting if one
of the core directories doesn't exist. That sounds harsh, but it should not affect
anyone with the libtaskotron rpm installed (which is expected to be the majority of
our user base), just people with a git checkout only. I added information to the
readme about how to set up these directories. In defence of this approach: we
have already set the precedent with /var/lib/taskotron/artifacts - if that directory
doesn't exist, taskotron crashes hard. I just extended the check to cover all
important directories, and made sure we crash as soon as possible, instead of in the
middle of the execution. I also tried to provide a human-readable message,
instead of just a generic IOError message.
I haven't managed to provide unit tests yet, sorry, they'll be provided next week.
But I wanted to post the changes asap, so that you can have a look at them.
Thanks.
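The fail-early directory check described in this comment could be sketched as follows. The exception class, the function name, and the wording of the message are hypothetical; only the idea (check all core directories at start and report the missing ones clearly) comes from the comment.

```python
import os


class MissingDirectoryError(Exception):
    """Hypothetical error raised when a required directory is absent."""


def check_core_dirs(dirs):
    """Raise a readable error at startup if any core directory is missing."""
    missing = [d for d in dirs if not os.path.isdir(d)]
    if missing:
        raise MissingDirectoryError(
            "Required directories do not exist: %s. Create them manually "
            "or install the RPM package." % ", ".join(missing)
        )
```

Running the check once at startup turns a mid-run IOError into an immediate, explicit failure.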
- implement caching of RPM downloads
- config_defaults: fix incorrectly named reporting_enabled key
- libtaskotron.spec: add /var/tmp/tasktron
Here are the new unit tests. I'll leave it here for one more day, if somebody wants to also have a look at it; if not, I'll push it tomorrow.
- file_utils: add functional tests