from_dataframe: only interchange supported columns? #288

MarcoGorelli · 2023-10-20T18:35:21Z

I just wanted to raise this here mwaskom/seaborn#3533

Try to plot a polars dataframe with a list column raises, even if that column wouldn't have been plotted

Arguably this the issue is that seaborn is trying to convert all columns, rather than only the ones it needs

But still, wondering if there's a solution here - like, that from_dataframe only converts columns if they're of a supported data type, rather than raising. Maybe we could add an option for this in the pandas from_dataframe method, and note that it's technically not part of the protocol

kkraus14 · 2023-10-20T18:51:18Z

In my opinion, that seems to be working as expected given we don't yet support data types like LIST.

Selecting the relevant columns to interchange seems like something that should either be done using the library specific API or using the API standard we're working on?

MarcoGorelli · 2023-10-20T18:56:17Z

selecting columns can be done with __dataframe__, no need for the dataframe api for this

in theory it is working as expected, it's just a pity to see a library introduce a regression when they move from to_pandas() to using the interchange protocol

kkraus14 · 2023-10-20T19:10:01Z

selecting columns can be done with __dataframe__, no need for the dataframe api for this

You can't select columns in the actual __dataframe__ call, but on the object returned from calling __dataframe__. I don't think it's well defined behavior to call __dataframe__ against a DataFrame with unsupported column dtypes?

jorisvandenbossche · 2023-10-20T19:11:58Z

I would also say that this is the responsibility of the user of the interchange protocol (i.e. in this case seaborn) to first select the columns it needs before converting (which is also useful for other reasons, eg because the conversion can be costly).

I think silently dropping columns in the conversion can also be confusing / lead to corner cases. Assume someone passes a polars dataframe to seaborn and specifies x="my_column" in the seaborn plotting method. If accidentally this column is not one supported by the interchange protocol, and it would be dropped in the conversion, then seaborn would raise a confusing error about "my_column" not being a column name, instead of a type error about the dtype not being supported.

jorisvandenbossche · 2023-10-20T19:15:53Z

You can't select columns in the actual __dataframe__ call, but on the object returned from calling __dataframe__. I don't think it's well defined behavior to call __dataframe__ against a DataFrame with unsupported column dtypes?

Maybe we didn't really define it (and it would be good to do so), but my assumption is that __dataframe__ itself doesn't yet do any conversion, and so it is safe to call this with unsupported column types. In any case that's how all the implementations I am aware of work, I think.

kkraus14 · 2023-10-20T19:25:44Z

Maybe we didn't really define it (and it would be good to do so), but my assumption is that __dataframe__ itself doesn't yet do any conversion, and so it is safe to call this with unsupported column types. In any case that's how all the implementations I am aware of work, I think.

In that case, say you have a library specific dataframe with an unsupported dtype like LIST in a column. When would it throw?

mydf.__dataframe__(...)
mydf.__dataframe__(...).get_column(...)
mydf.__dataframe__(...).get_column(...).dtype

rgommers · 2023-10-26T16:49:50Z

I'd suggest that both get_column(...) and get_column(...).dtype have to throw (in the same place, namely inside get_column).

Maybe we didn't really define it (and it would be good to do so), but my assumption is that __dataframe__ itself doesn't yet do any conversion, and so it is safe to call this with unsupported column types.

I think we indeed didn't define anything beyond "nested data types are unsupported". And I suspect Joris is right that it'd be better to allow this - failing later will make it easier to write more generic code here.

MarcoGorelli · 2023-10-26T16:52:27Z

__dataframe__ is already being used here under the assumption that it doesn't fail so please let's not make that raise:

https://github.com/plotly/plotly.py/blob/57e4d1d33c67c5cc715bec5c3c240dd6f4c3b10d/packages/python/plotly/plotly/express/_core.py#L1428-L1431

                columns = list(necessary_columns)
                df = pd.api.interchange.from_dataframe(
                    df.__dataframe__().select_columns_by_name(columns)
                )

kkraus14 · 2023-10-27T00:44:37Z

I think throwing on unsupported dtype columns in the get_column(...) call is reasonable behavior and we should probably document this as a supported case.

MarcoGorelli added the interchange-protocol label Oct 20, 2023

MarcoGorelli mentioned this issue Oct 21, 2023

fall back to to_pandas if interchange protocol fails mwaskom/seaborn#3534

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

from_dataframe: only interchange supported columns? #288

from_dataframe: only interchange supported columns? #288

MarcoGorelli commented Oct 20, 2023

kkraus14 commented Oct 20, 2023

MarcoGorelli commented Oct 20, 2023

kkraus14 commented Oct 20, 2023

jorisvandenbossche commented Oct 20, 2023

jorisvandenbossche commented Oct 20, 2023

kkraus14 commented Oct 20, 2023

rgommers commented Oct 26, 2023

MarcoGorelli commented Oct 26, 2023

kkraus14 commented Oct 27, 2023

from_dataframe: only interchange supported columns? #288

from_dataframe: only interchange supported columns? #288

Comments

MarcoGorelli commented Oct 20, 2023

kkraus14 commented Oct 20, 2023

MarcoGorelli commented Oct 20, 2023

kkraus14 commented Oct 20, 2023

jorisvandenbossche commented Oct 20, 2023

jorisvandenbossche commented Oct 20, 2023

kkraus14 commented Oct 20, 2023

rgommers commented Oct 26, 2023

MarcoGorelli commented Oct 26, 2023

kkraus14 commented Oct 27, 2023