Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support of bit vs byte-packed boolean #227

Open
AlenkaF opened this issue Aug 15, 2023 · 8 comments
Open

Support of bit vs byte-packed boolean #227

AlenkaF opened this issue Aug 15, 2023 · 8 comments

Comments

@AlenkaF
Copy link

AlenkaF commented Aug 15, 2023

This topic came up on a PyArrow issue by Polars developers working on their native Dataframe Protocol Implementation. To note, in the PyArrow implementation of the protocol we decided to cast bit-packed boolean values to uint8 when producing the interchange object and we cast uint8 to bit-packed boolean when consuming an interchange object.

As this topic came up again and pandas has added support for bitmask conversion in pandas-dev/pandas#52824 it would make sense to try to support bit-packed boolean dtypes in pyarrow implementation also (without converting to uint8), but I haven't found any information in the specification of the protocol about bit vs byte-packed boolean values.

Are both, bit and byte-packed booleans, supported by the Dataframe Interchange Protocol?

cc @stinodego @jorisvandenbossche

@rgommers
Copy link
Member

Hi @AlenkaF, thanks for bringing this up. It should be supported by the design. From https://data-apis.org/dataframe-protocol/latest/design_requirements.html:

_Must allow the consumer to inspect the representation for missing values that the producer uses for each column or data type. Sentinel values, bit masks, and boolean masks must be supported. _

For the describe_null method, there should be no difference:

    def describe_null(self) -> Tuple[ColumnNullType, Any]:
        """
        ...
        Value : if kind is "sentinel value", the actual value. If kind is a bit
        mask or a byte mask, the value (0 or 1) indicating a missing value. None
        otherwise.

The get_buffers method will return a validity buffer:

    def get_buffers(self) -> ColumnBuffers:
        """
        ...
            - "validity": a two-element tuple whose first element is a buffer
                          containing mask values indicating missing data and
                          whose second element is the mask value buffer's
                          associated dtype. None if the null representation is
                          not a bit or byte mask.

That buffer has a bufsize property, which is the buffer size in bytes. So I think to actually figure out if it's a bit mask, you want to check if Column.size (which is in elements) equals get_buffers()['validity'].bufsize - one bit per element for a bit mask.

Better answer: now that I've written all the above, there's an easier answer. The first return value of describe_null is an enum, which can have values:

    NON_NULLABLE = 0
    USE_NAN = 1
    USE_SENTINEL = 2
    USE_BITMASK = 3
    USE_BYTEMASK = 4

so that's the answer right there. The bufsize thing could be a useful sanity check perhaps.

@kkraus14
Copy link
Collaborator

kkraus14 commented Aug 15, 2023

@rgommers I believe the question was moreso about the column data type being a boolean with a memory layout of a single bit per value (Arrow's boolean type layout), not how nulls are represented.

@AlenkaF I believe we should be able to represent bit sized boolean types already using the BOOL type enum value (https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol.py#L55) along with a type tuple (https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol.py#L61) that specifies a bit-width of 1, i.e.

bitpacked_bool_type = (DtypeKind.BOOL, 1, 'b', '=')

I'm not sure if it's currently possible to correctly represent a byte-width boolean due to the use of Arrow C Data Interface format strings which I don't believe support that?

See https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol.py#L245-L269 for more details

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 15, 2023

While the dtype tuple specification indeed allows you to specify any bitwidth together with a DtypeKind.BOOL type, we also include the Arrow type format strings (https://arrow.apache.org/docs/dev/format/CDataInterface.html#data-type-description-format-strings), and I am not fully sure how this should be interpreted then?

But that's already an issue right now: pandas uses bools of 8 bits, but then specifies "b" for the arrow format string, which represents a bool of 1 bit, contradicting each other.
Is the Arrow format string meant to specify the "logical" type only then? Or do we consider the pandas implementation to be incorrect?

EDIT: I see Keith edited his post above to raise the exact same question ;)

@jorisvandenbossche
Copy link
Member

In Arrow's context, you can say that a byte-width boolean array can be represented as an extension type using uint8 as the "storage type". The Arrow C Data Interface then indicates that the format string describes this storage type. In that logic one could say to do:

byte_bool_type = (DtypeKind.BOOL, 8, 'C', '=')

But not sure that wasn't necessarily our intention on how to use the Arrow string format

@AlenkaF
Copy link
Author

AlenkaF commented Aug 16, 2023

Thanks for all your quick and detailed replies!
I am happy to see the discussion moving in a good direction.

I agree with Joris on the last comment. Maybe this was not the intention on how to use the Arrow string format, but I feel it describes the use case well and am not sure there are any other better options.

@stinodego
Copy link
Contributor

stinodego commented Aug 16, 2023

When implementing this for Polars, I had some trouble wrapping my head around the offset on Columns and bufsize on Buffers, specifically when it comes to bitmasks.

You can figure out how to read the bitmask buffer from the Column offset and length. But doesn't it make more sense for offset to be a property of Buffer and the bufsize to be in bits?

@rgommers
Copy link
Member

But not sure that wasn't necessarily our intention on how to use the Arrow string format

Indeed, I think we chose the Arrow format strings because they were more comprehensive that the alternative (NumPy-style format strings). But there's a gap here. I think we wanted to go with (DtypeKind.BOOL, 1, 'b', '='), but missed the inconsistency here (or forgot about it again). I don't really like (DtypeKind.BOOL, 8, 'C', '='), because it's illogical (it's not actually an unsigned integer).

Ideally we'd get Arrow to add a new type string for a byte-sized logical boolean type. Otherwise we could use 'b' and document the inconsistency? Or use an emptry string perhaps, to indicate more clearly that there's no matching type string?

@AlenkaF
Copy link
Author

AlenkaF commented Oct 2, 2023

I would be in favour of using empty string for byte-packed booleans. What do others think?

AlenkaF added a commit to apache/arrow that referenced this issue Oct 5, 2023
…aframe (#37975)

### Rationale for this change

Bit-packed booleans are currently not supported in the `from_dataframe` of the Dataframe Interchange Protocol.

Note: We currently represent booleans in the pyarrow implementation as `uint8` which will also need to be changed in a follow-up PR (see data-apis/dataframe-api#227). 

### What changes are included in this PR?

This PR adds the support for bit-packed booleans when consuming a dataframe interchange object.

### Are these changes tested?

Only locally, currently!
* Closes: #37145

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
…om_dataframe (apache#37975)

### Rationale for this change

Bit-packed booleans are currently not supported in the `from_dataframe` of the Dataframe Interchange Protocol.

Note: We currently represent booleans in the pyarrow implementation as `uint8` which will also need to be changed in a follow-up PR (see data-apis/dataframe-api#227). 

### What changes are included in this PR?

This PR adds the support for bit-packed booleans when consuming a dataframe interchange object.

### Are these changes tested?

Only locally, currently!
* Closes: apache#37145

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…om_dataframe (apache#37975)

### Rationale for this change

Bit-packed booleans are currently not supported in the `from_dataframe` of the Dataframe Interchange Protocol.

Note: We currently represent booleans in the pyarrow implementation as `uint8` which will also need to be changed in a follow-up PR (see data-apis/dataframe-api#227). 

### What changes are included in this PR?

This PR adds the support for bit-packed booleans when consuming a dataframe interchange object.

### Are these changes tested?

Only locally, currently!
* Closes: apache#37145

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…om_dataframe (apache#37975)

### Rationale for this change

Bit-packed booleans are currently not supported in the `from_dataframe` of the Dataframe Interchange Protocol.

Note: We currently represent booleans in the pyarrow implementation as `uint8` which will also need to be changed in a follow-up PR (see data-apis/dataframe-api#227). 

### What changes are included in this PR?

This PR adds the support for bit-packed booleans when consuming a dataframe interchange object.

### Are these changes tested?

Only locally, currently!
* Closes: apache#37145

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants