Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancement Request: support wget -i type multifile download #194

Open
cxplay opened this issue Mar 26, 2024 · 7 comments
Open

Enhancement Request: support wget -i type multifile download #194

cxplay opened this issue Mar 26, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@cxplay
Copy link

cxplay commented Mar 26, 2024

It is currently not possible to specify the same output path for two URLs.

@tempusfrangit
Copy link
Member

tempusfrangit commented Mar 26, 2024

@cxplay What is the use case where different sources should map to the exact same file on disk?

What should the behavior be if:

example.com/mymodel-weights.safetensor is targeted to /src/weights.safetensor

and

example.com/my-other-model-weights.safetensor is targeted to /src/weights.safetensor

Should we take the latest one in the list? Should the first one take priority? Since PGET is focused on performance downloading, I'd prefer to not download both files just to overwrite. Additionally there would likely be an error as we do not use O_TRUNC unless the --force flag is provided.

I'm open to a change, but I want to make sure we're being clear about the intent and what the expected behavior should be. At the very least a warning that the local filesystem path is duplicated and clearly defining which file will be downloaded is important. However, it also could be argued that telling the downloader to download two distinct files to the same target is incorrect.

The one caveat is the the null consumer (-o null) which doesn't do the deduplication check and also just funnels all the data to io.Discard so nothing is written. (Null consumer is interesting for testing/cache warming/etc, but not useful in real-world inference environments as far as I've seen)

@cxplay
Copy link
Author

cxplay commented Mar 27, 2024

What I mean is that I used the following manifest.

https://example.com/res.bin ./temp
https://example.com/res-2.bin ./temp

But PGet refuses to read it, telling me the destination is the same.

@tempusfrangit
Copy link
Member

Yes, Pget works in absolute paths so the correct format would be

https://example.com/res.bin ./temp/res.bin
https://example.com/res-2.bin ./temp/res-2.bin

There is a bug that pget doesn't create the directory if it doesn't exist in the 0.6.2 release but in the 0.7.x (upcoming release) it will create the directory if needed.

@cxplay
Copy link
Author

cxplay commented Mar 27, 2024

Thanks for the explanation, I understand. But does it mean that I have to specify the filename of the output file, otherwise PGet will still refuse to read it? My main goal is to quickly batch download files using something like wget -i url.txt like in wget.

@tempusfrangit
Copy link
Member

tempusfrangit commented Mar 27, 2024

Ok! I think I understand now and I think we can add a flag to support this type of behavior. Let me run down what I read from your request:

wget -i like behavior.

This means that all elements in the manifest (multifile mode) would use the target as a directory:

https://example.com/my-file.txt        /src
https://example.com/my-file2.txt       /src

This would result in:

/src/my-file.txt
/src/my-file2.txt

Assumptions

  • PGET creates the target directory if it doesn't exist
  • PGET will deconflict the filenames (same as today), e.g. specifying https://example.com/blah.txt and https://example.org/blah.txt would still be an error case
  • If the target does exist and is not a directory, this is an error case
  • Maintain --force behaviour that utilizes O_TRUNC

Further Questions

  • As we have work to support un-tar in multifile mode coming, should the untar just land all the files in the target directory? Should untar be exempted from working with the wget -i type behavior?
  • Should the new behavior maintain the potential multiple target directories (e.g. each URL has a target) or should it take a simple "target directory" type switch and all files go to the same location?

@cxplay Let me know if this all makes sense and if you can answer the items in "further questions" I'd appreciate it. I think this is a small change (relative to some others) and should be easy to get added to pget.

@tempusfrangit tempusfrangit added the enhancement New feature or request label Mar 27, 2024
@tempusfrangit tempusfrangit changed the title multifile allows the same output path Enhancement Request: support wget -i type multifile download Mar 27, 2024
@cxplay
Copy link
Author

cxplay commented Mar 27, 2024

  • Yes, I think we should observe the global flag setting on the command line first, and then read it in the manifest, assuming that we are compatible with the following format in the future:
    https://example.com/my-file.txt /src
    https://example.com/my-file2.txt /src
    https://example.com/my-file3.tar /src un-tar
    https://example.com/my-file4.tar /src
    
    If the un-tar flag is not set, file3 will still be un-tar to the target. But file4 will not, because un-tar is not set.
  • As above, we should support setting the target (folder) explicitly on the command line, and if it is not set, then try to read it from the manifest.

@tempusfrangit
Copy link
Member

@cxplay I wanted to circle up and let you know this hasn't fallen off the radar and as we implement the multifile form for tar extraction we'll hit this. 0.8.1 needed to land for some memory utilization reasons before we took on new work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants