
Interchange protocol use case: getting certain columns as numpy array #66

Open
jorisvandenbossche opened this issue Sep 13, 2021 · 8 comments


@jorisvandenbossche
Member

I think it's useful to think through concrete use cases on how the interchange protocol could be used, to see if it covers those use cases / the desired APIs are available.
One example use case could be matplotlib's plot("x", "y", data=obj), where matplotlib already supports getting the x and y columns of any "indexable" object. Currently they require obj["x"] to give the desired data, so in theory this support could be extended to any object that supports the dataframe interchange protocol. But at the same time, matplotlib currently also needs (AFAIK) that data as numpy arrays, because the low-level plotting code is implemented that way.
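For reference, a minimal runnable example of that pattern (obj here is a pandas DataFrame purely for illustration; any indexable object works today):

import matplotlib.pyplot as plt
import pandas as pd

obj = pd.DataFrame({"x": [1, 2, 3], "y": [1, 4, 9]})
plt.plot("x", "y", data=obj)  # matplotlib looks up obj["x"] and obj["y"]
plt.show()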

With the current API, matplotlib could do something like:

df = obj.__dataframe__()
x_values = some_utility_func(df.get_column_by_name("x").get_buffers())

where some_utility_func can convert the dict of Buffer objects to a numpy array (once numpy supports dlpack, converting the Buffer objects to numpy will become easy, but the function will then still need to handle potentially multiple buffers returned from get_buffers()).

That doesn't seem ideal: 1) writing some_utility_func to do the conversion to numpy is non-trivial to implement for all the different cases, and 2) should a consuming library really have to go down to the Buffer objects?

This isn't a pure interchange from one dataframe library to another, so we could also say that this use case is out-of-scope at the moment. But on the other hand, it seems a typical use case example, and could in theory already be supported right now (it only needs the "dataframe api" to get a column, which is one of the few things we already provide).

(Disclaimer: I am not a matplotlib developer, and I don't know whether they have, for example, efforts to add support for generic array-likes. It's nonetheless a typical example use case, I think.)

@kkraus14
Collaborator

I think asking for a column / DataFrame in a single chunk is something reasonable (whether or not it's part of the standard or interchange protocol). If we had the ability to get the column as a single chunk then the utility function becomes really straightforward or just becomes something like a np.asarray call.

@jorisvandenbossche
Member Author

If we had the ability to get the column as a single chunk

We already have this ability with the currently documented API, I think (the methods get_column()/get_column_by_name() and num_chunks()/get_chunks() on the column should be sufficient to get a single-chunk column).
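For example, a consumer could already do something like this with the documented method names (a sketch; converting and concatenating the chunks is still left to the consumer):

col = obj.__dataframe__().get_column_by_name("x")
if col.num_chunks() > 1:
    # each element yielded by get_chunks() is itself a single-chunk Column
    chunks = list(col.get_chunks())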

the utility function becomes really straightforward or just becomes something like a np.asarray call.

An np.asarray call only works if we add something like __array__ or __array_interface__ to Column, which we currently don't specify (cf. #48).

In case you meant calling it on the individual Buffer, that in itself will become trivial once numpy supports dlpack, yes.
You still need to handle the different buffers and dtypes etc. A quick attempt at a version with only limited functionality:

def column_to_numpy_array(col):
    assert col.num_chunks() == 1  # for now only deal with single-chunk columns
    kind, _, format_str, _ = col.dtype
    if kind not in (0, 1, 2, 22):  # INT, UINT, FLOAT, DATETIME
        raise TypeError("only numeric and datetime dtypes are supported")
    if col.describe_null[0] not in (0, 1):
        raise NotImplementedError("Null values represented as masks or "
                                  "sentinel values not handled yet")
    buffer, dtype = col.get_buffers()["data"]
    arr = buffer_to_ndarray(buffer, dtype)  # this can become `np.asarray` or `np.from_dlpack` in the future
    if kind == 22:  # datetime
        unit = format_str.split(":")[-1]
        arr = arr.view(f"datetime64[{unit}]")
    return arr

where buffer_to_ndarray(_buffer, _dtype) -> np.ndarray is currently something like the helper sketched below, but in the future can become a single numpy call once numpy supports DLPack.
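A minimal sketch of what such a helper could look like for plain numeric buffers (assuming a Buffer exposing ptr and bufsize, and the dtype kind codes used above):

import ctypes
import numpy as np

def buffer_to_ndarray(_buffer, _dtype) -> np.ndarray:
    kind, bit_width, _, _ = _dtype
    # datetime (22) buffers hold integers; the caller reinterprets the unit
    np_dtype = {0: "int", 1: "uint", 2: "float", 22: "int"}[kind] + str(bit_width)
    # wrap the memory described by the buffer in an ndarray without copying
    ctypes_type = np.ctypeslib.as_ctypes_type(np_dtype)
    data_pointer = ctypes.cast(_buffer.ptr, ctypes.POINTER(ctypes_type))
    return np.ctypeslib.as_array(
        data_pointer, shape=(_buffer.bufsize // (bit_width // 8),)
    )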

That's certainly relatively straightforward code, but also dealing with a lot of details of the protocol, and IMO not something many end users should have to implement themselves.

@kkraus14
Collaborator

We already have this ability with the currently documented API, I think (the methods get_column()/get_column_by_name() and num_chunks()/get_chunks() on the column should be sufficient to get a single-chunk column).

I meant more along the lines of: given a column with multiple chunks, requesting that the column combine its chunks into a single chunk, so that it has a contiguous buffer under the hood.
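Until the protocol offers that, a consumer could stitch the chunks together itself, e.g. building on the column_to_numpy_array sketch above (at the cost of one copy per chunk):

import numpy as np

def column_to_contiguous_array(col):
    if col.num_chunks() == 1:
        return column_to_numpy_array(col)
    # a producer-side "combine chunks" API could avoid this extra copy
    return np.concatenate(
        [column_to_numpy_array(chunk) for chunk in col.get_chunks()]
    )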

@rgommers
Member

This is a nice example, thanks @jorisvandenbossche. I feel like df.get_column_by_name("x").get_buffers() is taking a wrong turn though - an end user library should indeed not need to deal with buffers directly.

xref the plot() docs, see under "Plotting labelled data".

I think this would work:

df_obj = obj.__dataframe__().get_columns_by_name([x, y])
df = pd.from_dataframe(df_obj)
xvals = df[x].values
yvals = df[y].values

Currently they require obj["x"] to give the desired data

That's not the actual requirement today I think - it's that np.asarray(obj[x]) returns the data as a numpy array. Which is a fairly specific requirement - but even so, it can be made to work just fine, on the condition that users of the data=obj syntax have Pandas installed. That is not an unreasonable optional dependency to require, because other dataframe libraries may not be able to guarantee that that np.asarray call succeeds.
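For illustration, the data= handling boils down to roughly the following (a simplified sketch, not matplotlib's actual code):

import numpy as np

def resolve(data, key):
    value = data[key]         # requires indexing by column name
    return np.asarray(value)  # this conversion is the actual requirement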

@rgommers
Member

I am not a matplotlib developer, and I don't know whether they have, for example, efforts to add support for generic array-likes. It's nonetheless a typical example use case, I think.

I'm not sure about Matplotlib, but I do know that Napari would like this and has tried to improve compatibility with PyTorch and other libraries.

@jorisvandenbossche
Member Author

on the condition that users of the data=obj syntax have Pandas installed. That is not an unreasonable optional dependency to require, because other dataframe libraries may not be able to guarantee that that np.asarray call succeeds.

IMO that's the big downside of your code snippet. As a pandas maintainer I of course don't mind that people need pandas :), but shouldn't it be one of the goals of this effort that a library can support dataframe-like objects without requiring pandas?
Also, regarding the "guarantee that the np.asarray call succeeds": that's basically something you can do based on the buffers in the interchange protocol (#66 (comment)), if the original dataframe library doesn't support it directly. But then we get back to the point that, ideally, consumers of the protocol shouldn't have to go down to the buffer level.

@rgommers
Member

but shouldn't it be one of the goals of this effort that a library can support dataframe-like objects without requiring pandas?

Well, I see it as: the protocol supports turning one kind of dataframe into another kind, so as a downstream library if you support one specific library, you get all the other ones for free.

Really what Matplotlib wants here is: turn a single column into a numpy.ndarray. But if we support that, it should either be generic (like a potentially non-zero-copy way to use DLPack and/or the buffer protocol on a column), or we should support other array libraries too. Otherwise it's pretty ad-hoc imho.
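For illustration, if __dlpack__ did live at the column level (which the current design deliberately avoids), the consumer side would reduce to something like this, once numpy gains DLPack support:

import numpy as np

# hypothetical: assumes Column implements __dlpack__, which the
# current protocol does not specify
x = np.from_dlpack(obj.__dataframe__().get_column_by_name("x"))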

but shouldn't it be one of the goals of this effort that a library can support dataframe-like objects without requiring pandas?

Second thought: that is step two in the Consortium efforts - you need the generic public API, not just the interchange protocol. That's also what's said at https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html#progression-timeline.

@rgommers
Member

rgommers commented Oct 14, 2021

We discussed this in a call, and the sentiment was that it would be very nice to have this Matplotlib use case work, and not have it wait for another API that is still to be designed.

For a column one can get from the dataframe interchange protocol, it would be very useful if that could be turned into an array (any kind of array which the consuming library - Matplotlib in this case - wants). Options to achieve that include:

  • inside the protocol, to get an array object from a column (but we decided against that previously, when for example considering whether __dlpack__ should live on the column or the buffer level, and likewise for __array_interface__ et al.)
  • inside each array library, which could provide a from_column function to create its own kind of array
  • in each consumer library (so Matplotlib would implement a Column -> numpy.ndarray path)
  • in a separate utility library that is designed to be vendored or depended upon by consumer libraries

The separate utility library likely makes the most sense. Benefits are: this code then only has to be written once, it keeps things outside of the protocol/standard, and it can be made available fairly quickly (no need to wait for multiple array libraries to implement something and then do a release).

To make the code independent of any array or dataframe library, it may have to look something like:

def array_from_column(
    df: DataFrame,
    column_name: str,
    xp: Any,  # object/namespace implementing the array API
) -> Any:  # an array object of the kind that `xp` produces
    """
    Produce an array from a column, if possible.

    Will raise a ValueError in case the column contains missing data or has
    a dtype that is not supported by the array API standard.
    """

It's likely also practical to have a separate column_to_numpy function, given that Matplotlib (a) wants a numpy.ndarray rather than the numpy.array_api array object, and (b) needs things to work with 2-year-old numpy releases. If this lives in a separate utility library and is in no way directly incorporated in the standard, the objections to incorporating numpy-specific things should not apply here.
