Roundtrip tests with missing values #15

Open
jorisvandenbossche opened this issue Jan 10, 2023 · 3 comments

@jorisvandenbossche
Member

Currently the roundtrip tests (test_from_dataframe_roundtrip) only use simple data generation logic, which doesn't include any missing values, as far as I can see:

def mock_columns(
    nominal_dtype: NominalDtype, size: int
) -> st.SearchStrategy[MockColumn]:
    dtype = nominal_dtype.value
    elements = None
    if nominal_dtype == NominalDtype.CATEGORY:
        dtype = np.int8
        elements = st.integers(0, 15)
    elif nominal_dtype == NominalDtype.UTF8:
        # nps.arrays(dtype="U8") doesn't skip surrogates by default
        elements = utf8_strings()
    x_strat = nps.arrays(dtype=dtype, shape=size, elements=elements)
    return x_strat.map(lambda x: MockColumn(x, nominal_dtype))

@honno
Member

honno commented Jan 10, 2023

Yeah, this would be neat. It's quite possible but a bit tricky to implement: we'd need to figure out which protocol adopters allow missing values in which scenarios (i.e. dtypes) to build an initial "missing values mask" (not necessarily an actual mask, just whatever we use on our end to track where missing values exist for a given test example), and then translate that mask for every protocol adopter in wrappers.py.

@jorisvandenbossche
Member Author

Another option could be to not create numpy arrays (like in the snippet above), but plain Python lists of scalar values for a column, and then let each library convert that to its native representation in wrappers.py (mock_to_toplevel).
For example, we could then say that missing values are represented as None (e.g. ["a", "b", None] for strings). That avoids needing a sentinel or mask approach to make it work with numpy arrays.
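
To make the list-based idea concrete, here is a minimal sketch of how a wrapper might translate a plain Python list containing None into a mask-based or sentinel-based form. It uses only the standard library; the helper names to_values_and_mask and to_nan_sentinel are hypothetical and do not exist in wrappers.py, and a real adopter would use its own native constructors instead.

```python
import math
from typing import List, Optional, Sequence, Tuple


def to_values_and_mask(
    column: Sequence[Optional[str]], fill: str = ""
) -> Tuple[List[str], List[bool]]:
    """Split a column into filled values plus a mask (True marks a missing entry).

    Hypothetical helper for illustration only.
    """
    mask = [value is None for value in column]
    values = [fill if value is None else value for value in column]
    return values, mask


def to_nan_sentinel(column: Sequence[Optional[float]]) -> List[float]:
    """Replace None with NaN, a sentinel that only makes sense for float columns.

    Hypothetical helper for illustration only.
    """
    return [math.nan if value is None else float(value) for value in column]


# The string example from the comment above, converted to values plus mask.
strings = ["a", "b", None]
values, mask = to_values_and_mask(strings)

# A float column, converted to a NaN-sentinel representation.
floats = to_nan_sentinel([1.5, None, 3.0])
```

The point of the sketch is that None stays the single library-neutral marker in the generated data, and the choice between mask and sentinel is deferred to each adopter's conversion step.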

@honno
Member

honno commented Jan 10, 2023

> That avoids needing a sentinel or mask approach to make it work with numpy arrays.

Ah, I wasn't concerned about masking the original numpy arrays, as I assume it's easier to create an initial test example (an adopter's native column object) and then create a new test example from it with "randomly" placed missing values (using a sentinel, a mask, or both, where appropriate).
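
As a rough illustration of that second approach, the sketch below takes an already generated column and derives a new one with missing values placed according to a boolean mask. The names random_mask and inject_missing are hypothetical, and plain lists stand in for an adopter's native column object. In an actual Hypothesis test the mask would more likely be drawn from a strategy such as st.lists(st.booleans(), min_size=size, max_size=size) rather than from the random module used here.

```python
import random
from typing import List, Optional, Sequence


def random_mask(size: int, rng: random.Random, p: float = 0.3) -> List[bool]:
    """Return a mask of the given size; True marks a position to blank out."""
    return [rng.random() < p for _ in range(size)]


def inject_missing(
    column: Sequence[int], missing_mask: Sequence[bool]
) -> List[Optional[int]]:
    """Return a copy of column with None wherever missing_mask is True."""
    if len(column) != len(missing_mask):
        raise ValueError("column and missing_mask must have the same length")
    return [None if missing else value for value, missing in zip(column, missing_mask)]


# Seeded generator so the example is reproducible from run to run.
rng = random.Random(0)
column = [1, 2, 3, 4, 5]
mask = random_mask(len(column), rng)
with_missing = inject_missing(column, mask)
```

Keeping the mask as a separate value means the test can later check that the roundtripped column reports missing entries at exactly the masked positions.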

> Another option could be to not create numpy arrays (like in the snippet above), but plain python lists of scalar values for a column,

Just to note, I was initially going to generate plain Python lists but found that all the adopters consumed numpy arrays, so I thought I could utilise Hypothesis' nps.arrays() tool to very easily generate complex test examples for all the dtypes.
