Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

from_dataframe: only interchange supported columns? #288

Open
MarcoGorelli opened this issue Oct 20, 2023 · 9 comments
Open

from_dataframe: only interchange supported columns? #288

MarcoGorelli opened this issue Oct 20, 2023 · 9 comments

Comments

@MarcoGorelli
Copy link
Contributor

I just wanted to raise this here mwaskom/seaborn#3533

Try to plot a polars dataframe with a list column raises, even if that column wouldn't have been plotted

Arguably this the issue is that seaborn is trying to convert all columns, rather than only the ones it needs

But still, wondering if there's a solution here - like, that from_dataframe only converts columns if they're of a supported data type, rather than raising. Maybe we could add an option for this in the pandas from_dataframe method, and note that it's technically not part of the protocol

@kkraus14
Copy link
Collaborator

In my opinion, that seems to be working as expected given we don't yet support data types like LIST.

Selecting the relevant columns to interchange seems like something that should either be done using the library specific API or using the API standard we're working on?

@MarcoGorelli
Copy link
Contributor Author

selecting columns can be done with __dataframe__, no need for the dataframe api for this

in theory it is working as expected, it's just a pity to see a library introduce a regression when they move from to_pandas() to using the interchange protocol

@kkraus14
Copy link
Collaborator

selecting columns can be done with __dataframe__, no need for the dataframe api for this

You can't select columns in the actual __dataframe__ call, but on the object returned from calling __dataframe__. I don't think it's well defined behavior to call __dataframe__ against a DataFrame with unsupported column dtypes?

@jorisvandenbossche
Copy link
Member

I would also say that this is the responsibility of the user of the interchange protocol (i.e. in this case seaborn) to first select the columns it needs before converting (which is also useful for other reasons, eg because the conversion can be costly).

I think silently dropping columns in the conversion can also be confusing / lead to corner cases. Assume someone passes a polars dataframe to seaborn and specifies x="my_column" in the seaborn plotting method. If accidentally this column is not one supported by the interchange protocol, and it would be dropped in the conversion, then seaborn would raise a confusing error about "my_column" not being a column name, instead of a type error about the dtype not being supported.

@jorisvandenbossche
Copy link
Member

You can't select columns in the actual __dataframe__ call, but on the object returned from calling __dataframe__. I don't think it's well defined behavior to call __dataframe__ against a DataFrame with unsupported column dtypes?

Maybe we didn't really define it (and it would be good to do so), but my assumption is that __dataframe__ itself doesn't yet do any conversion, and so it is safe to call this with unsupported column types. In any case that's how all the implementations I am aware of work, I think.

@kkraus14
Copy link
Collaborator

Maybe we didn't really define it (and it would be good to do so), but my assumption is that __dataframe__ itself doesn't yet do any conversion, and so it is safe to call this with unsupported column types. In any case that's how all the implementations I am aware of work, I think.

In that case, say you have a library specific dataframe with an unsupported dtype like LIST in a column. When would it throw?

  1. mydf.__dataframe__(...)
  2. mydf.__dataframe__(...).get_column(...)
  3. mydf.__dataframe__(...).get_column(...).dtype

@rgommers
Copy link
Member

I'd suggest that both get_column(...) and get_column(...).dtype have to throw (in the same place, namely inside get_column).

Maybe we didn't really define it (and it would be good to do so), but my assumption is that __dataframe__ itself doesn't yet do any conversion, and so it is safe to call this with unsupported column types.

I think we indeed didn't define anything beyond "nested data types are unsupported". And I suspect Joris is right that it'd be better to allow this - failing later will make it easier to write more generic code here.

@MarcoGorelli
Copy link
Contributor Author

__dataframe__ is already being used here under the assumption that it doesn't fail so please let's not make that raise:

https://github.com/plotly/plotly.py/blob/57e4d1d33c67c5cc715bec5c3c240dd6f4c3b10d/packages/python/plotly/plotly/express/_core.py#L1428-L1431

                columns = list(necessary_columns)
                df = pd.api.interchange.from_dataframe(
                    df.__dataframe__().select_columns_by_name(columns)
                )

@kkraus14
Copy link
Collaborator

I think throwing on unsupported dtype columns in the get_column(...) call is reasonable behavior and we should probably document this as a supported case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants