Could NaNs not be counted as null #126

AlenkaF · 2023-03-30T14:27:45Z

In pyarrow we differentiate between missing (null) values, which we define with a bitmask, and NaN float values.

From the dataframe interchange protocol specification, we understand that NaN can be used to indicate missing values, but that this does not need to be the case (NaN can also be a valid value):

dataframe-api/protocol/dataframe_protocol.py

Lines 195 to 213 in 4f7c1e0

    @property
    def describe_null(self) -> Tuple[int, Any]:
        """
        Return the missing value (or "null") representation the column dtype
        uses, as a tuple ``(kind, value)``.

        Kind:
            - 0 : non-nullable
            - 1 : NaN/NaT
            - 2 : sentinel value
            - 3 : bit mask
            - 4 : byte mask

        Value : if kind is "sentinel value", the actual value. If kind is a bit
        mask or a byte mask, the value (0 or 1) indicating a missing value. None
        otherwise.
        """
        pass

There will be discrepancy between pyarrow and pandas, for example, where NaN will be turned into a missing value. But we do not think it would be correct for pyarrow to change the null_count property, as the information about the difference would be lost for the libraries that would benefit from it. Also, the bitmask information and the information in null_count would need to be made equal.
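To make the distinction concrete, here is a minimal sketch in plain Python (not pyarrow code) of a float column whose nulls live in a validity bitmap. The helper `is_valid`, the LSB-first bit order, and the sample data are assumptions chosen for illustration. The null count is derived from the bitmap alone, so a NaN stored in a valid slot does not contribute to it.

```python
import math

def is_valid(bitmap: bytes, i: int) -> bool:
    """Return True if bit i of an LSB-first validity bitmap is set (slot holds a value)."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)

# Hypothetical float column with four slots: 1.0, NaN, <null>, 4.0
values = [1.0, math.nan, 0.0, 4.0]   # the payload under a null slot is arbitrary
validity = bytes([0b00001011])       # bits 0, 1 and 3 set, so slot 2 is null

# Nulls come from the bitmap only; NaNs are counted separately among valid slots.
null_count = sum(not is_valid(validity, i) for i in range(len(values)))
nan_count = sum(
    is_valid(validity, i) and math.isnan(v) for i, v in enumerate(values)
)
```

Here `null_count` and `nan_count` are tracked independently, which is the information that would be lost if NaNs were folded into the null count.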

Is there a way a library could keep the behaviour of not treating NaNs as nulls?

(Connected issue in the arrow repo apache/arrow#34774)


kkraus14 · 2023-03-31T20:48:00Z

My understanding here is that if PyArrow was exporting the dataframe protocol, it would use option 3, indicating that a bit mask is used for null values, which means that NaN values should be treated as valid values.
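As a rough sketch of how a consumer might act on this, the function below dispatches on the `(kind, value)` tuple from `describe_null`. The function name `count_missing`, the constant names, and the assumption that the mask has already been unpacked to one 0/1 flag per element are all illustrative and not taken from the protocol code.

```python
import math
from typing import Any, Optional, Sequence, Tuple

# Kind codes as listed in the describe_null docstring (constant names are illustrative).
NON_NULLABLE, USE_NAN, USE_SENTINEL, USE_BITMASK, USE_BYTEMASK = range(5)

def count_missing(
    values: Sequence[Any],
    describe_null: Tuple[int, Any],
    mask: Optional[Sequence[int]] = None,
) -> int:
    """Count missing entries according to the column's declared null representation."""
    kind, null_value = describe_null
    if kind == NON_NULLABLE:
        return 0
    if kind == USE_NAN:
        # Only here does NaN mean "missing".
        return sum(isinstance(v, float) and math.isnan(v) for v in values)
    if kind == USE_SENTINEL:
        return sum(v == null_value for v in values)
    if kind in (USE_BITMASK, USE_BYTEMASK):
        if mask is None:
            raise ValueError("a mask is required for mask-based null kinds")
        # Assumption: mask holds one 0/1 flag per element; NaNs in values are ignored.
        return sum(flag == null_value for flag in mask)
    raise ValueError(f"unknown null kind: {kind}")

values = [1.0, math.nan, math.nan, 4.0]
all_valid = [1, 1, 1, 1]

as_bitmask = count_missing(values, (USE_BITMASK, 0), all_valid)
as_nan = count_missing(values, (USE_NAN, None))
```

The same data yields no missing values when the column declares a bit mask (kind 3) and two when it declares NaN as its null representation (kind 1).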

rgommers · 2023-03-31T22:03:18Z

Thanks for opening this issue @AlenkaF. I agree with your request and with @kkraus14, that was the intent and that is exactly why we spent so much time on allowing different ways to encode nulls and have describe_null.

We discussed this yesterday, and it seems that something got lost in translation in the protocol test suite, and in a discussion on a Vaex PR. @honno took the action to investigate.

There will be discrepancy between pyarrow and pandas,

Indeed - @jorisvandenbossche indicated that this is expected; roundtripping with Pandas will lose the nan/NA distinction, but that is what it is due to a pandas design choice, and does not mean nan and NA aren't separately treated by the protocol.
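The loss on such a roundtrip can be illustrated with a small simulation in plain Python. This is not pandas' actual conversion code; the two helper functions are hypothetical stand-ins for "export to a NaN-as-missing representation" and "import it back".

```python
import math
from typing import List, Tuple

def to_nan_as_missing(values: List[float], valid: List[bool]) -> List[float]:
    """Collapse a (values, validity) pair into a single list where NaN marks missing."""
    return [v if ok else math.nan for v, ok in zip(values, valid)]

def from_nan_as_missing(values: List[float]) -> Tuple[List[float], List[bool]]:
    """Rebuild a validity list; every NaN is now indistinguishable from a null."""
    return values, [not math.isnan(v) for v in values]

values = [1.0, math.nan, 0.0]
valid = [True, True, False]          # slot 1 is a valid NaN, slot 2 is null

_, valid_after = from_nan_as_missing(to_nan_as_missing(values, valid))
```

After the roundtrip, `valid_after` marks both slot 1 and slot 2 as missing, even though only slot 2 was null in the original column.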

AlenkaF · 2023-04-02T16:59:47Z

Great to hear 👍
Thank you so much for the quick response and for sharing your thoughts!

I will close the open issue on the Arrow side (apache/arrow#34774) and am looking forward to the changes to the protocol spec and test suite.

rgommers added the interchange-protocol label Mar 31, 2023

AlenkaF mentioned this issue Apr 2, 2023

[Python] Interchange pa.Table's Column.null_count doesn't count NaNs apache/arrow#34774

Closed
