RFC: add support for determining the size of arrays in bytes #789

Open
keewis opened this issue Apr 17, 2024 · 3 comments
Labels
API extension Adds new functions or objects to the API. Needs Discussion Needs further discussion. RFC Request for comments. Feature requests and proposed changes.

Comments

keewis commented Apr 17, 2024

In trying to adapt xarray to numpy>=2 (and thus switching testing code from numpy.array_api to array-api-strict), I noticed that the array API does not require the nbytes property on arrays, nor the itemsize property on dtypes.

Thus, the only way we could find to determine the size of an array was to write a function that dispatches to finfo / iinfo (returning a hard-coded 1 byte for booleans), and then multiply the result by arr.size. This feels like more work than should be necessary, so I wonder if you would be open to extending the array API with arr.nbytes or arr.dtype.itemsize (or both)?
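The workaround described above might be sketched roughly as follows. The helper names `itemsize_of` and `nbytes_of` are hypothetical, NumPy stands in for an arbitrary array-API namespace, and the 1-byte boolean is an assumption the standard does not guarantee:

```python
import numpy as xp  # stand-in for any array-API-compatible namespace


def itemsize_of(xp, dtype):
    """Bytes per element, via finfo/iinfo dispatch (hypothetical helper)."""
    try:
        return xp.iinfo(dtype).bits // 8  # integer dtypes
    except ValueError:
        pass
    try:
        return xp.finfo(dtype).bits // 8  # floating-point dtypes
    except ValueError:
        return 1  # assumption: booleans occupy 1 byte per element


def nbytes_of(xp, arr):
    """Total size of ``arr`` in bytes: element count times element size."""
    return arr.size * itemsize_of(xp, arr.dtype)
```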

rgommers (Member) commented

Hi @keewis, I think it's partially a case of "no one asked for this", but perhaps partially on purpose too. For the example you give: there is no requirement that an array implements the bool dtype with 1 byte per element. It's conceivable that 1 bit is used (and in fact Arrow only has 1-bit bools). And IIRC .itemsize is inconsistent across libraries: it can be in bytes or bits (not 100% sure of this).

so I wonder if you would be open to extending the array API with arr.nbytes or arr.dtype.itemsize (or both)?

It seems reasonable - an array attribute probably more so than a dtype attribute, since dtypes are opaque objects (we know nothing aside from the name).

Can I ask what you are doing with the calculated size? Do you have internal logic for creating chunks based on array size or something like that?

keewis (Author) commented Apr 17, 2024

Can I ask what you are doing with the calculated size?

Mainly for user information. Among other things, knowing the size of the variables in a newly opened dataset can help decide whether to involve chunked arrays (like dask or cubed) at all, or whether the whole array fits into memory and thus eager computation would be faster. For that reason, xarray has started to print the size of each array and the total size of all data variables in its reprs.

See pydata/xarray#8690 for some discussion of this (though maybe that is just evidence that nbytes is used a lot?).

kgryte (Contributor) commented Apr 17, 2024

Another possibility would be adding a functional API for resolving the number of bytes, with some consideration for lazy arrays and arrays having non-deterministic shapes.
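One way such a functional API might treat lazy arrays is sketched below. Everything here is illustrative, not part of any spec: the name `nbytes`, the convention of returning None for unresolvable sizes, and the use of NumPy's `dtype.itemsize` (which the standard does not currently require) are all assumptions:

```python
import math

import numpy as np


def nbytes(x):
    """Illustrative functional API: size of ``x`` in bytes, or None
    when the size cannot be resolved (e.g. a lazy array that reports
    an unknown dimension as ``None`` in its shape)."""
    if any(dim is None for dim in x.shape):
        return None  # non-deterministic shape: size is unknown here
    # NumPy's dtype.itemsize is used for illustration only; the array
    # API standard does not currently require this attribute.
    return math.prod(x.shape) * x.dtype.itemsize
```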

@kgryte kgryte changed the title determining the size of arrays in bytes RFC: add support for determining the size of arrays in bytes Apr 18, 2024
@kgryte kgryte added RFC Request for comments. Feature requests and proposed changes. Needs Discussion Needs further discussion. API extension Adds new functions or objects to the API. labels Apr 18, 2024