
Finding update URL in create_file is very slow with many files #148

Open
benlindsay opened this issue Jul 12, 2018 · 7 comments · May be fixed by #149

Comments

@benlindsay
Collaborator

I'm finding that by far the biggest bottleneck in uploading a smallish file (<50 MB) to a project with many files is finding the upload URL, i.e. this line. I haven't dug too deeply into this, but would it be feasible to restructure the search for the correct URL in some other way, like a dictionary lookup, or are we fundamentally limited by the osf.io API here?
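Roughly, the slow pattern looks like this (a simplified sketch, not the actual osfclient code; `store.files` and `File.path` are the existing osfclient attributes, the helper itself is made up for illustration):

```python
def find_existing_file(store, remote_path):
    """Linear scan over every file in the storage to find a match."""
    for file_ in store.files:                       # walks paginated API results
        if file_.path.lstrip('/') == remote_path.lstrip('/'):
            return file_                            # reuse its update/upload URL
    return None                                     # not found: create a new file
```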

@benlindsay
Collaborator Author

In my case, on an upload of a tiny one line test file, it had to loop through 918 file objects at a rate of 5-10 per second before finding the right one.

@benlindsay
Collaborator Author

So it looks like getting the URLs depends on scrolling through paginated results from the API, right? Would it be feasible to add something like a --preload flag that does all this scrolling once at the beginning and stores all the file info in a dictionary? This would only be useful for uploading a new directory to a project that already has a bunch of files, but that fits my current use case, so it would help at least one person :)

@benlindsay
Collaborator Author

Predefining a dictionary of existing file objects and a set of existing folder names lets me skip the biggest time killers, and speeds things up in my case from about 30 seconds per file to a couple of seconds per file. I'll clean it up and make a pull request sometime in the next few days.
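Roughly, the pre-computation looks like this (a sketch of what I have in mind, not the final code; `storage.files` and `File.path` are the existing osfclient attributes, `build_file_cache` is just an illustrative name):

```python
import posixpath

def build_file_cache(storage):
    """Walk the storage once and index it for O(1) lookups.

    Returns (files_by_path, folder_paths):
      files_by_path -- dict mapping normalized remote path -> File object
      folder_paths  -- set of every folder path seen along the way
    """
    files_by_path = {}
    folder_paths = set()
    for file_ in storage.files:                     # a single paginated walk
        path = file_.path.lstrip('/')
        files_by_path[path] = file_
        parent = posixpath.dirname(path)            # record all parent folders
        while parent:
            folder_paths.add(parent)
            parent = posixpath.dirname(parent)
    return files_by_path, folder_paths
```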

@betatim
Member

betatim commented Jul 13, 2018

I'm trying to remember what exactly motivated this design. I think one problem is that it seems impossible to predict the URL at which a file will end up (that short 34xfz34 string), so we can't just compute it, check for it, and move on.

Your suggestion is to cache the result of

```python
def files(self):
    """Iterate over all files in this storage.

    Recursively lists all files in all subfolders.
    """
    return self._iter_children(self._files_url, 'file', File,
                               self._files_key)
```

Is that right? If yes, we need to be careful about how to decide that this cache is no longer valid. The files could have been updated or deleted by someone else while we are running a long-running command, or osfclient itself could have changed things.

I think the latter case could be handled by augmenting all operations that add/delete/change files to also update the view of the project that is in the cache.
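Something along these lines, perhaps (a very rough sketch; the wrapper class and its methods are hypothetical, not existing osfclient API):

```python
class CachedStorage:
    """Hypothetical wrapper that keeps an in-memory file index in sync
    with the changes this process itself makes."""

    def __init__(self, storage):
        self._storage = storage
        self.refresh()

    def refresh(self):
        # one full paginated walk; afterwards lookups are O(1)
        self._files_by_path = {f.path.lstrip('/'): f
                               for f in self._storage.files}

    def lookup(self, path):
        return self._files_by_path.get(path.lstrip('/'))

    def create_file(self, path, *args, **kwargs):
        self._storage.create_file(path, *args, **kwargs)
        # drop the stale entry so a later lookup does not hand back an
        # out-of-date File object for the path we just changed
        self._files_by_path.pop(path.lstrip('/'), None)
```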

@benlindsay
Collaborator Author

benlindsay commented Jul 13, 2018

@betatim Maybe "store in memory" is a better description for what I'm proposing than "cache". I'm not proposing saving metadata to a local file to store long term (although I'd be open to the idea). I'm proposing the following: at the beginning of a recursive upload, loop over all the files and store those file objects in a dict and store the existing directories in a set. Then for every file we want to upload, we can check if its directory exists in the set to determine if the directory needs to be created, and we can grab the file object from the dict if it exists.
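Roughly like this (sketch only, reusing the `files_by_path` dict and `folder_paths` set from the earlier sketch; `File.update(fp)` is assumed here, and folder creation is left as a stub):

```python
import os
import posixpath

def upload_directory(storage, local_root, files_by_path, folder_paths):
    """Sketch of a recursive upload loop driven by the pre-built index."""
    for dirpath, _, filenames in os.walk(local_root):
        for name in filenames:
            local_path = os.path.join(dirpath, name)
            remote_path = os.path.relpath(local_path, local_root).replace(os.sep, '/')
            remote_dir = posixpath.dirname(remote_path)

            # only create the remote folder if the one-time walk did not see it
            if remote_dir and remote_dir not in folder_paths:
                pass                                    # create the folder here...
                folder_paths.add(remote_dir)            # ...and remember it

            existing = files_by_path.get(remote_path)   # O(1) instead of a scan
            with open(local_path, 'rb') as fp:
                if existing is not None:
                    existing.update(fp)                 # assumed update method
                else:
                    storage.create_file(remote_path, fp)
```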

With this, there's still the possibility that some other process could change the files on OSF while the recursive upload is running, but to me that seems like a minor risk. Maybe to mitigate that, there could be another command-line option (--preload? --cache?) that opts into this behavior, with a warning telling the user to leave their OSF project alone while it's running and noting the potential (but unlikely) inconsistencies.

Check out my preliminary pull request above and let me know what you think.

@betatim
Member

betatim commented Jul 13, 2018

In memory == cache.

OK, that is what I was thinking as well.

Having pondered it a bit, I think I agree with you that the chances of a conflict are small. We could add some basic sanity checks, so that as soon as something looks fishy the code refreshes the cache.

@benlindsay
Collaborator Author

That sounds good. Then maybe I'll move the caching step to a member method of the Storage class, like update_cache(), so we can call it at any time. I'm not sure how we'd define "feeling fishy" though. Would it just be that every time we call a file object's update function, if there's any exception we update the cache and retry the file update?
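Something like this, maybe (rough sketch; update_cache() and the file_cache attribute are the hypothetical pieces being discussed, and File.update(fp) is assumed):

```python
def update_with_retry(storage, cached_file, fp):
    """Try the cached File object first; if anything goes wrong, refresh
    the cache once and retry before giving up."""
    try:
        cached_file.update(fp)                       # assumed File.update(fp)
    except Exception:
        storage.update_cache()                       # hypothetical refresh step
        fresh = storage.file_cache.get(cached_file.path.lstrip('/'))
        if fresh is None:
            raise                                    # file really is gone
        fp.seek(0)                                   # rewind before re-sending
        fresh.update(fp)
```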
