
[WIP] Store metadata of known files and folders in memory to check against in recursive upload #149

Open: wants to merge 13 commits into master
Conversation

benlindsay (Collaborator):

Fixes #148

Not sure if this is the right way to do it, but to test this branch out, I needed the fix from #147, so this builds on that even though it isn't merged yet. This is my preliminary attempt to speed up recursive uploads when lots of files are present, although it doesn't address @betatim's concern of

we need to be careful with how to decide that this cache is now no longer valid. The files could have been updated/deleted by someone else while we are running a long running command or osfclient itself could have changed things.

from #148.

@benlindsay benlindsay changed the title [WIP] Cache known files and folders to check against in recursive upload [WIP] Store metadata of known files and folders in memory to check against in recursive upload Jul 13, 2018
@@ -67,34 +71,52 @@ def create_file(self, path, fp, update=False):
directories = directory.split(os.path.sep)
# navigate to the right parent object for our file
parent = self
if os.path.dirname(path) not in self.known_folder_set:
Member:

indentation error?

Collaborator (Author):

Ack, you're right, I missed this one. I recently switched from Vim to mostly VSCode, and I learned the hard way that there's a "diffEditor.ignoreTrimWhitespace" setting that defaults to true (a terrible default, in my opinion), which messed things up when I did line-by-line staging.
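To illustrate what the hunk above is doing, here is a minimal sketch; the helper name is hypothetical, and only known_folder_set and the os.path.dirname call come from the diff:

```python
import os

def parent_folder_is_cached(path, known_folder_set):
    """Return True when the file's parent folder is already in the
    in-memory cache, so create_file can skip walking the remote
    folder hierarchy to find the right parent object."""
    return os.path.dirname(path) in known_folder_set

# Once a recursive upload has cached 'data' as a known folder,
# subsequent files under it can skip the navigation step:
parent_folder_is_cached('data/file.txt', {'data'})   # True
parent_folder_is_cached('other/file.txt', {'data'})  # False
```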

@@ -35,6 +37,8 @@ def _update_attributes(self, storage):
self._new_folder_url = self._get_attribute(storage,
'links', 'new_folder')
self._new_file_url = self._get_attribute(storage, 'links', 'upload')
self.known_file_dict = dict()
Member:

I'd call these self.known_files and self.known_folders. To avoid having both a dict and a set, could you store them in a single self.known_paths = dict(), using None as the value for folders?

Performance-wise, testing for membership in a set and in a dict should be about the same.
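A minimal sketch of the single-dict idea (names hypothetical apart from known_paths): folders are stored with None as their value and files with their metadata, so one membership test covers both.

```python
# Unified cache: folders map to None, files map to their metadata
# (here just a dict with an upload URL, for illustration).
known_paths = {}

def remember_folder(path):
    known_paths[path] = None

def remember_file(path, metadata):
    known_paths[path] = metadata

def is_known_folder(path):
    # Present with a None value means "cached folder".
    return path in known_paths and known_paths[path] is None

def is_known_file(path):
    # get() returns None for both missing paths and folders.
    return known_paths.get(path) is not None

remember_folder('data')
remember_file('data/results.csv', {'upload_url': 'https://example.invalid/upload'})
```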

Collaborator (Author):

Yeah that does seem a little cleaner. Will do.

@betatim (Member) commented Jul 13, 2018:

Looks like a good approach. Some nitpicks as inline comments.

@benlindsay (Collaborator, Author) commented Jul 16, 2018:

@betatim My latest commit implements your suggestion of using a single dict for files and folders, wraps the full cache operation in a Storage class method, and adds a switch (--cache); without it, no caching or cache checking is done. I still haven't done anything to

add some basic (in)sanity checks maybe or something where as soon as something feels fishy to the code it refreshes the cache.

as you mentioned in #148, since I'm still not sure what that would look like. I still need to add tests as well. Any advice you have would be appreciated.
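For readers following along, a hypothetical sketch of the shape described above (none of these names are osfclient's actual API): the cache build lives in a single Storage method and only runs when the --cache flag is passed.

```python
class Storage:
    """Illustrative stand-in for osfclient's Storage class; the real
    class talks to the OSF API."""

    def __init__(self):
        # None signals "cache not built", so uploads fall back to the
        # slow per-file folder navigation.
        self.known_paths = None

    def _iter_remote_paths(self):
        # Placeholder for one API walk over existing remote folders
        # (value None) and files (value = metadata).
        yield 'data', None
        yield 'data/results.csv', {'upload_url': 'https://example.invalid'}

    def build_path_cache(self):
        """Populate known_paths in a single pass so later create_file
        calls can test membership instead of re-listing folders."""
        self.known_paths = dict(self._iter_remote_paths())

def recursive_upload(storage, use_cache=False):
    if use_cache:  # corresponds to the --cache CLI switch
        storage.build_path_cache()
    # ... per-file uploads consult storage.known_paths when it is set ...
```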

@codecov-io commented Aug 17, 2018:

Codecov Report

Merging #149 into master will increase coverage by 0.29%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #149      +/-   ##
==========================================
+ Coverage   92.17%   92.47%   +0.29%     
==========================================
  Files          13       13              
  Lines         588      611      +23     
==========================================
+ Hits          542      565      +23     
  Misses         46       46
Impacted Files                 Coverage Δ
osfclient/cli.py               97.59% <100%> (+0.04%) ⬆️
osfclient/models/storage.py    96.38% <100%> (+1.07%) ⬆️
osfclient/__main__.py          95.08% <100%> (+0.08%) ⬆️

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 44b9a87...188778d.

@benlindsay (Collaborator, Author):

I've got what I think is a reasonable set of tests for this. Now I just want to do some timing tests to make sure this is practically doing what I want.

@benlindsay (Collaborator, Author) commented Aug 21, 2018:

I did some quick tests on uploading a single folder with 10 empty files. Uploading the directory to a blank project on osf.io with the command

osf upload -r empty-files/ empty-files

took about 45 seconds for me. Deleting those files on osf.io and uploading with

osf upload -r -c empty-files/ empty-files

reduced the time to about 28 seconds.

Next I tried doing recursive force overwrites of those files. The command

osf upload -r -f empty-files/ empty-files

took about 80 seconds, while

osf upload -r -c -f empty-files/ empty-files

took about 40 seconds.

Also, -rU took about 55 seconds and -rcU about 34 seconds.

My only question for @betatim and/or @ctb and/or anyone else interested in weighing in is whether we should make the cache option True by default for recursive uploads. Are there some tests you'd want to see before making that call?

@benlindsay (Collaborator, Author):

Also, can I propose bumping the version and pushing to PyPI after this is merged?

@sappelhoff:
Is there any news on this PR? I'm trying to upload data using osf upload -r and it's very slow, as described in the previous posts.

@dschneiderch:
@sappelhoff you can install @benlindsay's version with
pip install git+https://github.com/benlindsay/osfclient.git@issue-148

Successfully merging this pull request may close these issues.

Finding update URL in create_file is very slow with many files
5 participants