
Finding update URL in create_file is very slow with many files #148

Open
benlindsay opened this issue Jul 12, 2018 · 7 comments · May be fixed by #149

Comments

@benlindsay
Collaborator

I'm finding that by far the biggest bottleneck in uploading a smallish file (<50 MB) to a project with many files is finding the upload URL, i.e. this line. I haven't dug too deeply into this, but would it be feasible to restructure the search for the correct URL in some other way, like a dictionary lookup, or are we fundamentally limited by the osf.io API here?
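Roughly, the slow pattern looks like this (a simplified sketch, not the actual osfclient code; `store.files` and `File.path` are the existing osfclient attributes, the helper itself is made up for illustration):

```python
def find_existing_file(store, remote_path):
    """Linear scan over every file in the storage to find a match."""
    for file_ in store.files:                       # walks paginated API results
        if file_.path.lstrip('/') == remote_path.lstrip('/'):
            return file_                            # reuse its update/upload URL
    return None                                     # not found: create a new file
```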

@benlindsay
Collaborator Author

In my case, on an upload of a tiny one line test file, it had to loop through 918 file objects at a rate of 5-10 per second before finding the right one.

@benlindsay
Collaborator Author

So it looks like getting the URLs depends on scrolling through paginated results from the API, right? Would it be feasible to add something like a --preload flag that does all this scrolling once at the beginning and stores all the file info in a dictionary? This would only be useful for uploading a new directory to a project that already has a bunch of files, but that fits my current use case, so it would help at least one person :)

@benlindsay
Collaborator Author

Predefining a dictionary of existing file objects and a set of existing folder names lets me skip the biggest time killers, and speeds things up in my case from about 30 seconds per file to a couple of seconds per file. I'll clean it up and make a pull request sometime in the next few days.
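Roughly, the pre-computation looks like this (a sketch of what I have in mind, not the final code; `storage.files` and `File.path` are the existing osfclient attributes, `build_file_cache` is just an illustrative name):

```python
import posixpath

def build_file_cache(storage):
    """Walk the storage once and index it for O(1) lookups.

    Returns (files_by_path, folder_paths):
      files_by_path -- dict mapping normalized remote path -> File object
      folder_paths  -- set of every folder path seen along the way
    """
    files_by_path = {}
    folder_paths = set()
    for file_ in storage.files:                     # a single paginated walk
        path = file_.path.lstrip('/')
        files_by_path[path] = file_
        parent = posixpath.dirname(path)            # record all parent folders
        while parent:
            folder_paths.add(parent)
            parent = posixpath.dirname(parent)
    return files_by_path, folder_paths
```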

@betatim
Member

betatim commented Jul 13, 2018

I'm trying to remember what exactly motivated this design. I think one problem is that it seems impossible to predict the URL at which a file will end up (that short 34xfz34 string), so we can't just compute it, check for it, and move on.

Your suggestion is to cache the result of

```python
def files(self):
    """Iterate over all files in this storage.

    Recursively lists all files in all subfolders.
    """
    return self._iter_children(self._files_url, 'file', File,
                               self._files_key)
```

Is that right? If yes, we need to be careful about how to decide that this cache is no longer valid. The files could have been updated or deleted by someone else while we are running a long-running command, or osfclient itself could have changed things.

I think the latter case could be handled by augmenting all operations that add/delete/change files to also update the view of the project that is in the cache.
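Something along these lines, perhaps (a very rough sketch; the wrapper class and its methods are hypothetical, not existing osfclient API):

```python
class CachedStorage:
    """Hypothetical wrapper that keeps an in-memory file index in sync
    with the changes this process itself makes."""

    def __init__(self, storage):
        self._storage = storage
        self.refresh()

    def refresh(self):
        # one full paginated walk; afterwards lookups are O(1)
        self._files_by_path = {f.path.lstrip('/'): f
                               for f in self._storage.files}

    def lookup(self, path):
        return self._files_by_path.get(path.lstrip('/'))

    def create_file(self, path, *args, **kwargs):
        self._storage.create_file(path, *args, **kwargs)
        # drop the stale entry so a later lookup does not hand back an
        # out-of-date File object for the path we just changed
        self._files_by_path.pop(path.lstrip('/'), None)
```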

@benlindsay
Collaborator Author

benlindsay commented Jul 13, 2018

@betatim Maybe "store in memory" is a better description for what I'm proposing than "cache". I'm not proposing saving metadata to a local file to store long term (although I'd be open to the idea). I'm proposing the following: at the beginning of a recursive upload, loop over all the files and store those file objects in a dict and store the existing directories in a set. Then for every file we want to upload, we can check if its directory exists in the set to determine if the directory needs to be created, and we can grab the file object from the dict if it exists.
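Roughly like this (sketch only, reusing the `files_by_path` dict and `folder_paths` set from the earlier sketch; `File.update(fp)` is assumed here, and folder creation is left as a stub):

```python
import os
import posixpath

def upload_directory(storage, local_root, files_by_path, folder_paths):
    """Sketch of a recursive upload loop driven by the pre-built index."""
    for dirpath, _, filenames in os.walk(local_root):
        for name in filenames:
            local_path = os.path.join(dirpath, name)
            remote_path = os.path.relpath(local_path, local_root).replace(os.sep, '/')
            remote_dir = posixpath.dirname(remote_path)

            # only create the remote folder if the one-time walk did not see it
            if remote_dir and remote_dir not in folder_paths:
                pass                                    # create the folder here...
                folder_paths.add(remote_dir)            # ...and remember it

            existing = files_by_path.get(remote_path)   # O(1) instead of a scan
            with open(local_path, 'rb') as fp:
                if existing is not None:
                    existing.update(fp)                 # assumed update method
                else:
                    storage.create_file(remote_path, fp)
```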

With this, there's still the possibility that some other process could change the files on OSF while the recursive upload is running, but to me that seems like a minor risk. Maybe to mitigate that, there could be another command-line option (--preload? --cache?) that opts into this behavior, with a warning telling the user to leave their OSF project alone while it's running and noting the potential (but unlikely) inconsistencies.

Check out my preliminary pull request above and let me know what you think.

@betatim
Member

betatim commented Jul 13, 2018

In memory == cache.

OK, that is what I was thinking as well.

Having pondered it a bit, I think I agree with you that the chances of a conflict are small. We could add some basic sanity checks, so that as soon as something looks fishy the code refreshes the cache.

@benlindsay
Collaborator Author

That sounds good. Then maybe I'll move the caching step to a member method of the Storage class, like update_cache(), so we can call it at any time. I'm not sure how we'd define "feeling fishy" though. Would it just be that every time we call a file object's update function, if there's any exception we update the cache and retry the file update?
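Something like this, maybe (rough sketch; update_cache() and the file_cache attribute are the hypothetical pieces being discussed, and File.update(fp) is assumed):

```python
def update_with_retry(storage, cached_file, fp):
    """Try the cached File object first; if anything goes wrong, refresh
    the cache once and retry before giving up."""
    try:
        cached_file.update(fp)                       # assumed File.update(fp)
    except Exception:
        storage.update_cache()                       # hypothetical refresh step
        fresh = storage.file_cache.get(cached_file.path.lstrip('/'))
        if fresh is None:
            raise                                    # file really is gone
        fp.seek(0)                                   # rewind before re-sending
        fresh.update(fp)
```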
