Finding update URL in create_file is very slow with many files #148
In my case, on an upload of a tiny one-line test file, it had to loop through 918 file objects at a rate of 5-10 per second before finding the right one.
So it looks like getting the URLs depends on paging through paginated results from the API, right? Would it be feasible to do something like add a
Predefining a dictionary of existing file objects and a set of existing folder names lets me skip the biggest time sinks in my case, speeding things up from about 30 seconds per file to a couple of seconds per file. I'll clean it up and make a pull request sometime in the next few days.
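The approach described above can be sketched roughly as follows. This is a minimal illustration, not osfclient's actual code: `File`, `build_index`, and the path layout are all made up for the example; the point is that one pass over the file listing turns every subsequent lookup into an O(1) dictionary access.

```python
# Hypothetical sketch: index existing remote files once, so each upload
# becomes a dict lookup instead of a scan over paginated API results.

class File:
    """Stand-in for a remote file object with a `path` attribute."""
    def __init__(self, path):
        self.path = path

def build_index(files):
    """Map each remote path to its file object and collect folder names."""
    file_index = {f.path: f for f in files}
    folder_names = {p.rsplit("/", 1)[0] for p in file_index if "/" in p}
    return file_index, folder_names

files = [File("/data/a.txt"), File("/data/b.txt"), File("/notes/c.md")]
index, folders = build_index(files)

# O(1) membership/lookup checks instead of paging through every file:
assert index["/data/a.txt"] is files[0]
assert folders == {"/data", "/notes"}
```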
Trying to remember what exactly motivated this design. I think one problem is that it seems impossible to predict the URL at which a file will end up. Your suggestion is to cache the result of the loop in osfclient/osfclient/models/storage.py (lines 43 to 49 at 03be300). The worry with any cache is that it can go stale: files could be changed on OSF outside of osfclient, or osfclient itself could have changed things.
I think the latter case could be handled by augmenting all operations that add/delete/change files to also update the view of the project that is in the cache.
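That write-through idea could look something like the sketch below. The class and method names (`CachedStorage`, `create_file`, `remove_file`, `lookup`) are illustrative only, not osfclient's real API; the point is that every mutating operation updates the in-memory index as a side effect, so the cached view never drifts from what osfclient itself has done.

```python
# Sketch of keeping the cache consistent with osfclient's own mutations.
# All names are hypothetical; real API calls are elided as comments.

class CachedStorage:
    def __init__(self):
        self._index = {}  # remote path -> file metadata

    def create_file(self, path, metadata):
        # ... perform the real upload API call here ...
        self._index[path] = metadata  # write-through: keep cache in sync

    def remove_file(self, path):
        # ... perform the real delete API call here ...
        self._index.pop(path, None)   # forget the deleted entry

    def lookup(self, path):
        """O(1) lookup; returns None if the path is unknown."""
        return self._index.get(path)
```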
@betatim Maybe "store in memory" is a better description for what I'm proposing than "cache". I'm not proposing saving metadata to a local file to store long term (although I'd be open to the idea). I'm proposing the following: at the beginning of a recursive upload, loop over all the files and store those file objects in a With this, there's still the possibility that some other process could change the files on OSF while the recursive upload process is running, but to me that seems like a minor risk. Maybe to mitigate that, there could be another command-line option ( Check out my preliminary pull request above and let me know what you think. |
Ok, that is what I was thinking as well. Having pondered it a bit, I agree with you that the chances are small. We could maybe add some basic (in)sanity checks, so that as soon as something looks fishy the code refreshes the cache.
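One possible shape for that "refresh when something looks fishy" check is sketched below. This is an assumption about how it might work, not a proposed final design: if the code observes a path on the server that the cache doesn't know about, it rebuilds the index before continuing. `Index`, `fetch_all`, and `ensure_fresh` are invented names.

```python
# Hypothetical staleness check: rebuild the cached path set whenever an
# observed remote path is missing from it.

class Index:
    def __init__(self, fetch_all):
        self._fetch_all = fetch_all          # callable listing current paths
        self._paths = set(fetch_all())

    def ensure_fresh(self, observed_paths):
        # A path seen on the server but absent from the cache means
        # someone changed things behind our back -> refresh everything.
        if not set(observed_paths) <= self._paths:
            self._paths = set(self._fetch_all())

    def __contains__(self, path):
        return path in self._paths

remote = ["/a.txt"]
idx = Index(lambda: list(remote))
assert "/a.txt" in idx

remote.append("/new.txt")          # another process adds a file on OSF
idx.ensure_fresh(["/new.txt"])     # fishy: unknown path -> cache refreshed
assert "/new.txt" in idx
```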
That sounds good; then maybe I'll move the caching step to a member method in the Storage class like
I'm finding that by far the biggest bottleneck in uploading a smallish file (<50 MB) to a project with many files is finding the upload URL, i.e. this line. I haven't dug too deeply into this, but would it be feasible to restructure the search for the correct URL in some other way, like a dictionary lookup, or are we somehow fundamentally limited by the osf.io API?
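A toy illustration of the bottleneck being described: a linear scan touches every file object (and in the real client, each page of results is a separate HTTP request), while a prebuilt dictionary answers the same question in constant time. The 918-file count mirrors the number reported later in the thread; everything else is made up for the example.

```python
# Linear scan (current behavior, roughly): O(N) comparisons per upload.
def find_linear(files, target):
    for f in files:              # each page here is an HTTP request in reality
        if f["path"] == target:
            return f
    return None

files = [{"path": f"/f{i}.txt"} for i in range(918)]

# Dictionary built once up front: O(1) per subsequent lookup.
lookup = {f["path"]: f for f in files}

# Both find the same object; only the cost per query differs.
assert find_linear(files, "/f917.txt") is lookup["/f917.txt"]
```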