Support of bit vs byte-packed boolean #227

AlenkaF · 2023-08-15T04:58:54Z

This topic came up on a PyArrow issue by Polars developers working on their native Dataframe Protocol Implementation. To note, in the PyArrow implementation of the protocol we decided to cast bit-packed boolean values to uint8 when producing the interchange object and we cast uint8 to bit-packed boolean when consuming an interchange object.

As this topic came up again and pandas has added support for bitmask conversion in pandas-dev/pandas#52824 it would make sense to try to support bit-packed boolean dtypes in pyarrow implementation also (without converting to uint8), but I haven't found any information in the specification of the protocol about bit vs byte-packed boolean values.

Are both, bit and byte-packed booleans, supported by the Dataframe Interchange Protocol?

cc @stinodego @jorisvandenbossche

The text was updated successfully, but these errors were encountered:

rgommers · 2023-08-15T05:34:10Z

Hi @AlenkaF, thanks for bringing this up. It should be supported by the design. From https://data-apis.org/dataframe-protocol/latest/design_requirements.html:

_Must allow the consumer to inspect the representation for missing values that the producer uses for each column or data type. Sentinel values, bit masks, and boolean masks must be supported. _

For the describe_null method, there should be no difference:

    def describe_null(self) -> Tuple[ColumnNullType, Any]:
        """
        ...
        Value : if kind is "sentinel value", the actual value. If kind is a bit
        mask or a byte mask, the value (0 or 1) indicating a missing value. None
        otherwise.

The get_buffers method will return a validity buffer:

    def get_buffers(self) -> ColumnBuffers:
        """
        ...
            - "validity": a two-element tuple whose first element is a buffer
                          containing mask values indicating missing data and
                          whose second element is the mask value buffer's
                          associated dtype. None if the null representation is
                          not a bit or byte mask.

That buffer has a bufsize property, which is the buffer size in bytes. So I think to actually figure out if it's a bit mask, you want to check if Column.size (which is in elements) equals get_buffers()['validity'].bufsize - one bit per element for a bit mask.

Better answer: now that I've written all the above, there's an easier answer. The first return value of describe_null is an enum, which can have values:

    NON_NULLABLE = 0
    USE_NAN = 1
    USE_SENTINEL = 2
    USE_BITMASK = 3
    USE_BYTEMASK = 4

so that's the answer right there. The bufsize thing could be a useful sanity check perhaps.

kkraus14 · 2023-08-15T06:22:17Z

@rgommers I believe the question was moreso about the column data type being a boolean with a memory layout of a single bit per value (Arrow's boolean type layout), not how nulls are represented.

@AlenkaF I believe we should be able to represent bit sized boolean types already using the BOOL type enum value (https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol.py#L55) along with a type tuple (https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol.py#L61) that specifies a bit-width of 1, i.e.

bitpacked_bool_type = (DtypeKind.BOOL, 1, 'b', '=')

I'm not sure if it's currently possible to correctly represent a byte-width boolean due to the use of Arrow C Data Interface format strings which I don't believe support that?

See https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol.py#L245-L269 for more details

jorisvandenbossche · 2023-08-15T06:28:50Z

While the dtype tuple specification indeed allows you to specify any bitwidth together with a DtypeKind.BOOL type, we also include the Arrow type format strings (https://arrow.apache.org/docs/dev/format/CDataInterface.html#data-type-description-format-strings), and I am not fully sure how this should be interpreted then?

But that's already an issue right now: pandas uses bools of 8 bits, but then specifies "b" for the arrow format string, which represents a bool of 1 bit, contradicting each other.
Is the Arrow format string meant to specify the "logical" type only then? Or do we consider the pandas implementation to be incorrect?

EDIT: I see Keith edited his post above to raise the exact same question ;)

jorisvandenbossche · 2023-08-15T06:32:52Z

In Arrow's context, you can say that a byte-width boolean array can be represented as an extension type using uint8 as the "storage type". The Arrow C Data Interface then indicates that the format string describes this storage type. In that logic one could say to do:

byte_bool_type = (DtypeKind.BOOL, 8, 'C', '=')

But not sure that wasn't necessarily our intention on how to use the Arrow string format

AlenkaF · 2023-08-16T04:57:17Z

Thanks for all your quick and detailed replies!
I am happy to see the discussion moving in a good direction.

I agree with Joris on the last comment. Maybe this was not the intention on how to use the Arrow string format, but I feel it describes the use case well and am not sure there are any other better options.

stinodego · 2023-08-16T06:58:53Z

When implementing this for Polars, I had some trouble wrapping my head around the offset on Columns and bufsize on Buffers, specifically when it comes to bitmasks.

You can figure out how to read the bitmask buffer from the Column offset and length. But doesn't it make more sense for offset to be a property of Buffer and the bufsize to be in bits?

rgommers · 2023-08-18T13:54:34Z

But not sure that wasn't necessarily our intention on how to use the Arrow string format

Indeed, I think we chose the Arrow format strings because they were more comprehensive that the alternative (NumPy-style format strings). But there's a gap here. I think we wanted to go with (DtypeKind.BOOL, 1, 'b', '='), but missed the inconsistency here (or forgot about it again). I don't really like (DtypeKind.BOOL, 8, 'C', '='), because it's illogical (it's not actually an unsigned integer).

Ideally we'd get Arrow to add a new type string for a byte-sized logical boolean type. Otherwise we could use 'b' and document the inconsistency? Or use an emptry string perhaps, to indicate more clearly that there's no matching type string?

AlenkaF · 2023-10-02T12:45:31Z

I would be in favour of using empty string for byte-packed booleans. What do others think?

…aframe (#37975) ### Rationale for this change Bit-packed booleans are currently not supported in the `from_dataframe` of the Dataframe Interchange Protocol. Note: We currently represent booleans in the pyarrow implementation as `uint8` which will also need to be changed in a follow-up PR (see data-apis/dataframe-api#227). ### What changes are included in this PR? This PR adds the support for bit-packed booleans when consuming a dataframe interchange object. ### Are these changes tested? Only locally, currently! * Closes: #37145 Lead-authored-by: AlenkaF <frim.alenka@gmail.com> Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com> Signed-off-by: AlenkaF <frim.alenka@gmail.com>

…om_dataframe (apache#37975) ### Rationale for this change Bit-packed booleans are currently not supported in the `from_dataframe` of the Dataframe Interchange Protocol. Note: We currently represent booleans in the pyarrow implementation as `uint8` which will also need to be changed in a follow-up PR (see data-apis/dataframe-api#227). ### What changes are included in this PR? This PR adds the support for bit-packed booleans when consuming a dataframe interchange object. ### Are these changes tested? Only locally, currently! * Closes: apache#37145 Lead-authored-by: AlenkaF <frim.alenka@gmail.com> Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com> Signed-off-by: AlenkaF <frim.alenka@gmail.com>

rgommers added the interchange-protocol label Aug 15, 2023

This was referenced Oct 2, 2023

GH-37145: [Python] support boolean columns with bitsize 1 in from_dataframe apache/arrow#37975

Merged

[Python][Interchange protocol] Export boolean columns as bit-packed values apache/arrow#37991

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support of bit vs byte-packed boolean #227

Support of bit vs byte-packed boolean #227

AlenkaF commented Aug 15, 2023

rgommers commented Aug 15, 2023

kkraus14 commented Aug 15, 2023 •

edited

jorisvandenbossche commented Aug 15, 2023 •

edited

jorisvandenbossche commented Aug 15, 2023

AlenkaF commented Aug 16, 2023

stinodego commented Aug 16, 2023 •

edited

rgommers commented Aug 18, 2023

AlenkaF commented Oct 2, 2023

Support of bit vs byte-packed boolean #227

Support of bit vs byte-packed boolean #227

Comments

AlenkaF commented Aug 15, 2023

rgommers commented Aug 15, 2023

kkraus14 commented Aug 15, 2023 • edited

jorisvandenbossche commented Aug 15, 2023 • edited

jorisvandenbossche commented Aug 15, 2023

AlenkaF commented Aug 16, 2023

stinodego commented Aug 16, 2023 • edited

rgommers commented Aug 18, 2023

AlenkaF commented Oct 2, 2023

kkraus14 commented Aug 15, 2023 •

edited

jorisvandenbossche commented Aug 15, 2023 •

edited

stinodego commented Aug 16, 2023 •

edited