RFC: add quantile? #795

Open
mdhaber opened this issue Apr 21, 2024 · 5 comments
Labels
Needs Discussion Needs further discussion. RFC Request for comments. Feature requests and proposed changes.

Comments

@mdhaber

mdhaber commented Apr 21, 2024

I'm working on adding array API support in scipy.stats (scipy/scipy#20544), and one of the things I'll need is a quantile function. If there is some support for this idea, I'll convert this issue into a proper proposal.

Looks like there is already wide support:

Previous discussions (not much):

There are many conventions for calculating quantiles. Only a few methods would be required by the standard, and if choice of a default is too contentious, perhaps the array-API can consider method to be a required keyword argument, and libraries would be welcome to keep their own default.

@kgryte kgryte added RFC Request for comments. Feature requests and proposed changes. Needs Discussion Needs further discussion. labels Apr 21, 2024
@rgommers
Member

Thanks for the proposal @mdhaber. I don't quite have an opinion yet - I think it in part depends on the situation with methods (see below).

Also - how much do you actually need this? I only count two instances of it being used in SciPy, one of which is a test case. I just did a grep, so I may be missing some dynamic usage perhaps. The one instance is pretty simple, no keyword usage:

https://github.com/scipy/scipy/blob/d55cb95282b5cfab381e373d3c127630c7e915a0/scipy/stats/tests/test_fit.py#L1002

Only a few methods would be required by the standard,

Which ones? Could we get away with only a default method, and hence no method keyword?

and if choice of a default is too contentious, perhaps the array-API can consider method to be a required keyword argument, and libraries would be welcome to keep their own default.

@kgryte
Contributor

kgryte commented Apr 24, 2024

Related is our previous discussion on median, which we left out due to implementation difficulty in distributed contexts. That said, Dask does implement median, so it may be time to revisit this.

@mdhaber
Author

mdhaber commented Apr 24, 2024

Also - how much do you actually need this?

Personally, I don't need it very badly. You're right that quantile/percentile only appear in a few stats functions (stats.bootstrap, stats.iqr, and some distribution fitting code), and it is easy to compute quantiles using existing array-API functions (namely sort). However, when only a few percentiles are needed, it is more efficient to use a partition approach. One could propose adding partition, since it is more general, but I would only use it for quantile.
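
To make the sort-versus-partition trade-off concrete, here is a minimal sketch in NumPy. The helper names `quantile_via_sort` and `quantile_via_partition` are hypothetical, both handle only 1-D input and a scalar `q`, and both assume the linear interpolation method; this is an illustration, not how any library implements it.

```python
import numpy as np

def _bracket(n, q):
    # Fractional position of the q-th quantile in the sorted data
    # (linear method), plus the two neighbouring integer indices.
    pos = q * (n - 1)
    lo = int(np.floor(pos))
    hi = min(lo + 1, n - 1)
    return lo, hi, pos - lo

def quantile_via_sort(x, q):
    # Full sort: O(n log n), but only uses a function already in the standard.
    x = np.sort(x)
    lo, hi, frac = _bracket(x.shape[0], q)
    return (1 - frac) * x[lo] + frac * x[hi]

def quantile_via_partition(x, q):
    # Partial ordering: only the elements at indices lo and hi are
    # guaranteed to be in their sorted positions, which is all we need.
    lo, hi, frac = _bracket(x.shape[0], q)
    part = np.partition(x, sorted({lo, hi}))
    return (1 - frac) * part[lo] + frac * part[hi]

x = np.array([7.0, 1.0, 5.0, 3.0, 9.0])
print(quantile_via_sort(x, 0.3), quantile_via_partition(x, 0.3))
```

Both helpers should agree with `np.quantile(x, q)` under its default method; the partition version simply avoids ordering elements that are never read.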

Which ones? Could we get away with only a default method, and hence no method keyword?
The standard reference is Sample Quantiles in Statistical Packages.

The current variation in sample quantile definitions causes confusion, and so there is a need to standardize the definition of sample quantile across packages and within packages... and we propose that $Q_8$ is the best choice

$Q_8$ corresponds with NumPy's median_unbiased. However, $Q_7$ (linear) is the default in most libraries. It would be unfortunate to have to choose between the two. $Q_5$ (hazen) also satisfies many desirable properties. So I suppose I would suggest having these three in that order of priority.
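
As a small illustration of how the three candidate methods differ, the sketch below calls `np.quantile` with each method name on the same data. It assumes NumPy 1.22 or later, where the `method` keyword is available; the sample array is made up for the example.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])

# Method names as used by NumPy: Q8 = "median_unbiased",
# Q7 = "linear" (NumPy's default), Q5 = "hazen".
results = {
    method: float(np.quantile(x, 0.25, method=method))
    for method in ("median_unbiased", "linear", "hazen")
}

for method, value in results.items():
    print(f"{method:16s} {value:.4f}")
```

Even on five points the three definitions give three different answers for the same `q`, which is why the choice of default matters.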

One wrench I'd like to throw into my own proposal:
The current behavior of np.quantile(a, q, axis) (and other implementations that follow NumPy) is for the quantiles specified by q, a 1D array-like, to be computed for all axis-slices of the data a. This makes it impossible to efficiently compute a different quantile for each axis-slice of a (which is needed, for example, in a vectorized implementation of the BCA-bootstrap).
I would propose following normal broadcasting rules between a and q (possibly with the constraint that q must be 0D or have length 1 along axis), which would meet both needs. A hybrid possibility, which would just be an extension of the existing behavior, would be to allow q to be an N-D array that is broadcastable with a except along axis. In this case, for each pair of corresponding axis-slices a_i and q_i, it would compute all quantiles of a_i specified by q_i.
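
The difference between the current behavior and the broadcasting proposal can be sketched as follows. `rowwise_quantile` is a hypothetical helper, restricted to a 2-D `a`, one `q` per row, and the linear method; it is meant only to show the shape semantics, not a proposed implementation.

```python
import numpy as np

def rowwise_quantile(a, q):
    # a has shape (m, n); q has shape (m,), one quantile level per row.
    a = np.sort(a, axis=-1)
    n = a.shape[-1]
    pos = np.asarray(q)[:, np.newaxis] * (n - 1)   # shape (m, 1)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    frac = pos - lo
    x_lo = np.take_along_axis(a, lo, axis=-1)
    x_hi = np.take_along_axis(a, hi, axis=-1)
    return ((1 - frac) * x_lo + frac * x_hi)[:, 0]

rng = np.random.default_rng(0)
a = rng.random((3, 6))
q = np.array([0.1, 0.5, 0.9])

# Current NumPy behavior: every q is applied to every row,
# so the result has shape (len(q), m).
full = np.quantile(a, q, axis=-1)

# Per-row quantiles are the diagonal of that result; the other
# entries are computed and then thrown away.
per_row = rowwise_quantile(a, q)
print(full.shape, per_row.shape)
```

With `m` slices, the current semantics compute `m * m` quantiles to obtain the `m` that are wanted, which is the inefficiency described above.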

@kgryte
Contributor

kgryte commented May 2, 2024

I did a bit of digging, with the disclaimer that my digging is incomplete, but I'll try to summarize my initial findings below.

Overview

I took a peek at which APIs were implemented across array libraries, spot-checked whether/how they were implemented, and did a search for which APIs were used in SciPy and sklearn.

median

Usage

  • SciPy: low usage; more in tests than in implementations
  • scikit-learn: a reasonable amount of usage

Implementations

  • NumPy implements in terms of partition
  • CuPy implements as an element-wise kernel
  • PyTorch returns the lower of two medians for an even number of elements, rather than the average (NumPy returns the avg to avoid breaking astropy); to return an average, must call quantile.
  • Dask supports median
  • TensorFlow does not appear to have a dedicated median API, instead deferring to percentile
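
The two median conventions mentioned in the list above can be illustrated with NumPy alone. This sketch does not call PyTorch; it uses `np.quantile(..., method="lower")` (NumPy 1.22 or later) as a stand-in for the "lower of the two middle values" convention.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # even number of elements

# NumPy's median averages the two middle values (2.0 and 3.0).
averaged = np.median(x)

# The "lower" method picks the smaller of the two middle values instead.
lower = np.quantile(x, 0.5, method="lower")

print(averaged, lower)
```

For arrays with an odd number of elements the two conventions coincide, so the discrepancy only shows up on even-length input.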

quantile

Usage

Implementations

  • NumPy implements in terms of partition
  • CuPy relies on sort
    • only supports a subset of methods: lower, higher, midpoint, nearest, and linear
  • PyTorch supports the same methods as CuPy
  • Dask supports in DataFrames, but not arrays
  • TensorFlow implements in terms of percentile
    • supports the same methods as CuPy
    • API differs in that you specify the number of quantiles you want

percentile

Usage

Implementations

  • NumPy implements in terms of quantile
  • PyTorch does not support
  • CuPy implements in terms of quantile
    • supports only linear, lower, higher, nearest, and midpoint methods
  • Dask supports "approximate" percentiles
    • supports same methods as CuPy
  • TensorFlow implements in terms of sort
    • supports same methods as CuPy
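
Since percentile and quantile differ only in the scale of `q`, a library that provides one can express the other with a rescaling. A minimal check in NumPy, with made-up data:

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
p = np.array([10.0, 50.0, 90.0])        # percentiles, in [0, 100]

via_percentile = np.percentile(x, p)
via_quantile = np.quantile(x, p / 100)  # same levels, in [0, 1]

print(via_percentile, via_quantile)
```

This suggests a standard would only need to specify one of the two functions.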

partition

Implementations

@mdhaber
Author

mdhaber commented May 9, 2024

I reviewed scipy.stats more thoroughly for uses of median, quantile, and/or percentile. The first three functions would benefit from having quantile in the array API; for the rest, median would suffice.

There are also uses of partition in stats.trim functions (trim_mean, trimboth, trim1).
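
To show why partition is a natural fit for trimming, here is a simplified 1-D sketch of a trimmed mean. `trimmed_mean_1d` is a hypothetical helper written for this illustration; it is not the scipy.stats implementation and ignores axis handling and edge cases that the real functions cover.

```python
import numpy as np

def trimmed_mean_1d(x, proportiontocut):
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    k = int(proportiontocut * n)   # elements to drop from each end
    if 2 * k >= n:
        raise ValueError("proportiontocut is too large for this sample")
    # After partitioning at indices k and n-k-1, the k smallest values sit
    # before index k and the k largest after index n-k-1. The middle block
    # is unordered, which is fine because we only need its mean.
    part = np.partition(x, sorted({k, n - k - 1}))
    return part[k:n - k].mean()

x = [10, 1, 2, 3, 4, 100, 5, 6, 7, 8]
print(trimmed_mean_1d(x, 0.1))
```

A full sort would give the same result; partition just avoids ordering the values that are kept or discarded as a block.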

Most of these functions have something else that would make array API conversion challenging at the moment, but I don't think any have non-starters for array API support. I've omitted uses I don't think will get array API support any time soon (e.g. levy_stable). There are also uses in other sub-packages (e.g. scipy.ndimage.median_filter), but I'm less familiar with their prospects for array API support.
