Download to a common directory (cached downloads)
Closed, Public

Authored by kparal on Dec 2 2014, 12:30 PM.

Details

Summary

Adds the possibility of "cached" downloads to the file/koji utils.
The file_utils download() method now takes a can_cache argument that
makes it download the file into a common 'cache' directory (set in
the configuration file). The 'cached' download needs to be enabled in
the configuration and then explicitly asked for in the download() call,
so we can control which kinds of files we cache (rpms for now) and
whether to do the caching at all (dev vs production).
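
A minimal sketch of the two-level switch described above, not the actual patch. The CONFIG dictionary, the cache path and the target_dir() helper are hypothetical stand-ins for illustration; only the can_cache argument and the idea of a configuration switch come from the summary.

```python
# Hypothetical stand-in for the parsed configuration file (assumed keys and path).
CONFIG = {
    "download_cache_enabled": True,
    "cachedir": "/var/tmp/taskotron/cache",
}


def target_dir(dirname, can_cache=False, config=CONFIG):
    """Return the directory a download should be written to.

    The cache directory is used only when the caller allows caching for
    this file AND caching is switched on in the configuration.
    """
    if can_cache and config["download_cache_enabled"]:
        return config["cachedir"]
    return dirname
```

With this shape, a call that does not pass can_cache=True never touches the cache, and flipping the configuration switch off disables caching everywhere at once.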

Test Plan

Unit tests pass; tried out on several rpmlint runs.

Diff Detail

Repository
rLTRN libtaskotron
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
jskladan retitled this revision to Download to a common directory (cached downloads). Dec 2 2014, 12:30 PM
jskladan updated this object.
jskladan edited the test plan for this revision.
jskladan added reviewers: kparal, tflink.
tflink added a comment. Dec 2 2014, 6:12 PM

It looks pretty good to me, but is this meant to be used in production or mostly as a method for reducing downloads on dev instances? If it's meant for production, I'd like to see some method for cache size management before this gets released.

libtaskotron/config.py
235

probably the smallest nit I could find but circular is spelled wrong

libtaskotron/config_defaults.py
74

wouldn't /var/cache/taskotron be better than /var/tmp?

This is meant for DEV, I honestly do not think that we really need (or even want) to do the caching (at least this naive way of caching) anywhere else.
I was thinking about requiring Development profile, but after discussing it with kparal, we decided that a simple on/off switch in configuration will do just fine.

Do you think it would be worth having on production?

libtaskotron/config.py
235

thanks!

jskladan updated this revision to Diff 742. Dec 3 2014, 10:28 AM
  • Changed according to review
tflink added a comment. Dec 9 2014, 9:54 PM
In D266#6, @jskladan wrote:

This is meant for DEV, I honestly do not think that we really need (or even want) to do the caching (at least this naive way of caching) anywhere else.
I was thinking about requiring Development profile, but after discussing it with kparal, we decided that a simple on/off switch in configuration will do just fine.

Do you think it would be worth having on production?

No, I'd rather not have it in production. I can't really think of an advantage to enabling it there, to be honest.

After looking at this again, I'm still not a huge fan of how the file_utils import is handled in conf, but it's not a huge issue. What do you think about keeping the CONFIG object out of file_utils and leaving the config handling to anything that calls download()? That would change the can_cache arg to do_cache or cache, and the "decision" of whether or not to cache a file would be made by the caller.

That feels a bit cleaner and more straightforward to me but I don't have my heart set on the change. Thoughts?
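
To make the suggestion concrete, here is a rough sketch of what "the caller decides" could look like. All names here (download_target, get_rpm, the config keys) are made up for illustration and no network access is performed; the point is only that the download helper never reads the configuration itself.

```python
import os


def download_target(url, dirname, cachedir=None):
    """Return the path the file would be fetched to.

    Hypothetical helper: it knows nothing about the configuration, the
    caller passes a cache directory or None.
    """
    filename = os.path.basename(url)
    return os.path.join(cachedir or dirname, filename)


def get_rpm(url, dirname, config):
    """Caller-side decision: only this function looks at the configuration."""
    cachedir = config["cachedir"] if config["download_cache_enabled"] else None
    return download_target(url, dirname, cachedir=cachedir)
```

This keeps file_utils free of any import from the config module, which is the part of the current patch that feels awkward.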

jskladan added a comment. Edited Feb 27 2015, 9:25 AM
In D266#4731, @tflink wrote:

No, I'd rather not have it in production. I can't really think of an advantage to enabling it there, to be honest.

Cool.

After looking at this again, I'm still not a huge fan of how the file_utils import is handled in conf, but it's not a huge issue. What do you think about keeping the CONFIG object out of file_utils and leaving the config handling to anything that calls download()? That would change the can_cache arg to do_cache or cache, and the "decision" of whether or not to cache a file would be made by the caller.
That feels a bit cleaner and more straightforward to me but I don't have my heart set on the change. Thoughts?

To be honest, I'm not the biggest fan either, and since the caching is used just in one place anyways, I think it is not even a PITA to do it that way. I'll send a patch shortly.

jskladan updated this revision to Diff 824. Feb 27 2015, 10:09 AM
  • Fixed according to review
kparal requested changes to this revision. Mar 12 2015, 12:00 PM

I might end up strangled for saying this, but the download() method starts to be a bit more complicated with all that caching, and I think it would appreciate a few functional tests, at least to ensure that symlinks are created properly, in correct directories, and files are removed when they are supposed to be removed.

conf/taskotron.yaml.example
100–101

Would it make sense to move cachedir into the PATHS section, and either leave download_cache_enabled here or move it into the RESOURCES section?

106–107

Perhaps we should make it clear that at the moment we only intend to support caching of Koji RPM downloads, nothing else.

108

I would love to have this enabled by default in development, and disabled in production. We now have a tmpfiles.d config file for cleaning up directories, so old cached downloads can be easily purged. Even if the person does not install that config file, it still eats less disk space when caching is enabled, because the file is downloaded only once and the rest are symlinks. With caching disabled, it is downloaded every time and left lying around in /var/tmp. Are there any reasons against split defaults for development/production, similar to what we have for reporting_enabled or log_file_enabled?
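
For illustration, split defaults could look roughly like the sketch below. The class names, the defaults_for() helper and the concrete values of reporting_enabled and log_file_enabled are assumptions made for this example, not a copy of config_defaults.py.

```python
class Config(object):
    """Development defaults (hypothetical layout)."""
    profile = "development"
    download_cache_enabled = True    # cache RPM downloads while developing
    reporting_enabled = False        # assumed value, for illustration only
    log_file_enabled = False         # assumed value, for illustration only


class ProductionConfig(Config):
    """Production overrides on top of the development defaults."""
    profile = "production"
    download_cache_enabled = False   # always download a fresh copy
    reporting_enabled = True
    log_file_enabled = True


def defaults_for(profile):
    """Pick the defaults object for the given profile name."""
    return ProductionConfig() if profile == "production" else Config()
```

A value set explicitly in the configuration file would still override whichever default the profile selects.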

libtaskotron/config_defaults.py
74

This should also be added into config.py:_create_dirs() so that it's created on libtaskotron start and libtaskotron methods can assume it exists.

Also, shouldn't this be in the spec file? I see that /var/tmp/taskotron is also missing in the spec file, I guess that's a mistake.

libtaskotron/file_utils.py
62–81

I have looked into urlgrabber documentation and it seems it can do this for us, and even better. Look for reget in http://urlgrabber.baseurl.org/help/urlgrabber.grabber.html . If we use reget = 'check_timestamp', it's actually safer than what we do in _samefile() and it can even continue a previously interrupted download, which is great. This would allow us to get rid of a lot of our custom code (which I don't have too much faith in, anyway).

85–93

The description seems to not be true anymore, AFAICS. And I can't find any method actually using this. <evil plan>Hmm, what about just getting rid of this?</evil plan>

If we remove it, I see the desired behavior like this:

  • If cachedir is None and the file already exists in dirname, just run urlgrabber with reget on it, it will check the timestamp and size and either immediately return it, or download the latest version/missing bits.
  • If cachedir is not None and file already exists in the cache, do the same as above.
  • If cachedir is not None and file already exists in dirname, replace it with a symlink into the cache.

So basically this means using reget always, and only manually checking for the third case and running os.remove() if it's present. The resulting logic seems to be pretty simple (simpler than what we have now).
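
The three cases above can be sketched as follows. This is only an outline under stated assumptions: fetch(url, path) is a hypothetical callable that is expected to skip or resume the transfer when path is already up to date (the role reget would play), place_download() is a made-up name, and symlink support (as on Linux) is assumed.

```python
import os


def place_download(url, dirname, fetch, cachedir=None):
    """Fetch url into dirname, or into cachedir with a symlink from dirname."""
    filename = os.path.basename(url)
    dest = os.path.join(dirname, filename)

    if cachedir is None:
        # Case 1: no caching; fetch() validates or refreshes the file in place.
        fetch(url, dest)
        return dest

    cached = os.path.join(cachedir, filename)
    # Case 2: validate or refresh the copy in the cache.
    fetch(url, cached)

    # Case 3: anything already sitting at dest (regular file or old link)
    # is removed and replaced with a symlink into the cache.
    if os.path.lexists(dest):
        os.remove(dest)
    os.symlink(cached, dest)
    return dest
```

Passing fetch in as an argument also makes the symlink handling easy to exercise in tests without touching the network.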

118–128

This will all disappear once this patch is rebased to latest libtaskotron. Just a note.

169

Recently we made some changes in koji_utils.py and in makedirs() to stop wrapping OSError into our own errors, because they are likely not related. I think the same approach would be fine here; just a simple raise would work (the same applies to line 161).
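
As a rough illustration of the "simple raise" approach (a sketch, not the actual makedirs() from the repository): the expected "already exists" case is handled, and everything else propagates as the original OSError.

```python
import errno
import os


def makedirs(fullpath):
    """Create fullpath if missing; let unexpected OSError propagate unchanged."""
    try:
        os.makedirs(fullpath)
    except OSError as e:
        if e.errno == errno.EEXIST and os.path.isdir(fullpath):
            return  # directory is already there, nothing to do
        raise  # bare raise keeps the original exception and traceback
```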

This revision now requires changes to proceed. Mar 12 2015, 12:00 PM
kparal commandeered this revision. Mar 12 2015, 2:12 PM
kparal edited reviewers, added: jskladan; removed: kparal.

I complained, I'll fix.

In a new turn of events (after I have finished my new version of the patch), I've found out that reget='check_timestamp' in urlgrabber is documented but not implemented, and reget='simple' works on partial files but not on fully downloaded files (it just re-downloads them). *headdesk*

kparal updated this revision to Diff 841. Mar 13 2015, 4:20 PM

I humbly returned to the original version, which uses urllib to check the file size.
I talked to the maintainers and python-urlgrabber is basically a dead project now,
so there's no point in trying to push reget="simple" fixes into it. I just
implemented the rest of the concerns I raised. I also fixed a few unrelated issues
and split them into separate commits (which Phab unfortunately doesn't show).
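
For readers who don't want to open the diff, the size check boils down to something like the sketch below. It is not the code from the patch: same_length() is a made-up name and the sketch uses Python 3's urllib.request, whereas the patch itself uses the urllib of its time.

```python
import os
import urllib.request


def same_length(filename, url):
    """Return True if the local file size equals the remote Content-Length."""
    if not os.path.isfile(filename):
        return False
    with urllib.request.urlopen(url) as response:
        remote = response.headers.get("Content-Length")
    if remote is None:
        return False  # server did not report a size; treat the file as different
    return int(remote) == os.path.getsize(filename)
```

A size match is a weaker guarantee than a timestamp or checksum comparison, so it only tells us the file is probably complete, not that it is identical.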

The controversial changes are that a) caching is now enabled by default in the
development profile, and b) I'm raising an exception and exiting if one
of the core directories doesn't exist. That sounds harsh, but it should not affect
anyone with the libtaskotron rpm installed (which is expected to be the majority of
our user base), just people with a git checkout only. I added information to the
readme about how to set up these directories. In defence of this approach: we
have already set the precedent with /var/lib/taskotron/artifacts - if that directory
doesn't exist, taskotron crashes hard. I just extended the check to cover all
important directories, and made sure we crash as soon as possible, instead of in the
middle of the execution. I also tried to provide a human-readable message,
instead of just a generic IOError message.
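
The idea of the early check, as a sketch only. The exception class, the check_dirs() name and the message wording are made up for this comment; the real names are in the diff.

```python
import os


class TaskotronConfigError(Exception):
    """Hypothetical error type for a broken installation."""


def check_dirs(paths):
    """Fail early, with a readable message, if a required directory is missing."""
    missing = [p for p in paths if not os.path.isdir(p)]
    if missing:
        raise TaskotronConfigError(
            "Required directories do not exist: %s. "
            "Create them (see the readme) or install the libtaskotron rpm."
            % ", ".join(missing))
```

Running this once at startup means a missing directory is reported before any task starts, rather than as an IOError halfway through a run.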

I haven't managed to provide unit tests yet, sorry, they'll be provided next week.
But I wanted to post the changes asap, so that you can have a look at them.
Thanks.

  • implement caching of RPM downloads
  • config_defaults: fix incorrectly named reporting_enabled key
  • libtaskotron.spec: add /var/tmp/taskotron
This revision is now accepted and ready to land. Mar 17 2015, 2:21 PM
kparal updated this revision to Diff 849. Mar 18 2015, 3:06 PM

Here are the new unit tests. I'll leave it here for one more day, if somebody wants to also have a look at it; if not, I'll push it tomorrow.

  • file_utils: add functional tests
tflink accepted this revision. Mar 19 2015, 6:19 AM
This revision was automatically updated to reflect the committed changes.