
potentially relevant usage patterns / targets for a developer-focused API #71

rgommers opened this issue Dec 2, 2021 · 18 comments



rgommers commented Dec 2, 2021

In other issues we find some detailed analyses of how the pandas API is used today, e.g. gh-3 (on Kaggle notebooks) and in https://github.com/data-apis/python-record-api/tree/master/data/api (for a set of well-known packages). That data is either not relevant for a developer-focused API though, or is so detailed that it's hard to get a good feel for what's important. So I thought it'd be useful to revisit the topic. I used https://libraries.io/pypi/pandas and looked at some of the top repos that declare a dependency on pandas.

Top 10 listed:

[image: screenshot of the top 10 dependent repositories listed on libraries.io]

Seaborn

Perhaps the most interesting pandas usage. Pandas is a hard dependency, used a fair amount and for more than just data access. However, the usage all still seems fairly standard and common, so seaborn may be a reasonable target to make work with multiple libraries. It uses a lot of isinstance checks (on pd.DataFrame, pd.Series).

Folium

Just a single non-test usage, in pd.py:

def validate_location(location):  # noqa: C901
    "...J
    if isinstance(location, np.ndarray) \
            or (pd is not None and isinstance(location, pd.DataFrame)):
        location = np.squeeze(location).tolist()


def if_pandas_df_convert_to_numpy(obj):
    """Return a Numpy array from a Pandas dataframe.
    Iterating over a DataFrame has weird side effects, such as the first
    row being the column names. Converting to Numpy is more safe.
    """
    if pd is not None and isinstance(obj, pd.DataFrame):
        return obj.values
    else:
        return obj

PyJanitor

Interesting/unusual common pattern, which extends pd.DataFrame through pandas_flavor with either accessors or methods. E.g. from janitor/biology.py (https://github.com/pyjanitor-devs/pyjanitor/blob/a6832d47d2cc86b0aef101bfbdf03404bba01f3e/janitor/biology.py):

import pandas as pd
import pandas_flavor as pf

@pf.register_dataframe_method
def join_fasta(
    df: pd.DataFrame, filename: str, id_col: str, column_name: str
) -> pd.DataFrame:
    """
    Convenience method to join in a FASTA file as a column.
    """
    ...
    return df

Statsmodels

A huge amount of usage, covering a large API surface in a messy way - not easy to work with or draw conclusions from.

NetworkX

Mostly just conversions to support pandas dataframes as input/output values. E.g., from convert.py and convert_matrix.py:

def to_networkx_graph(data, create_using=None, multigraph_input=False):
    """Make a NetworkX graph from a known data structure."""
    # Pandas DataFrame
    try:
        import pandas as pd

        if isinstance(data, pd.DataFrame):
            if data.shape[0] == data.shape[1]:
                try:
                    return nx.from_pandas_adjacency(data, create_using=create_using)
                except Exception as err:
                    msg = "Input is not a correct Pandas DataFrame adjacency matrix."
                    raise nx.NetworkXError(msg) from err
            else:
                try:
                    return nx.from_pandas_edgelist(
                        data, edge_attr=True, create_using=create_using
                    )
                except Exception as err:
                    msg = "Input is not a correct Pandas DataFrame edge-list."
                    raise nx.NetworkXError(msg) from err
    except ImportError:
        warnings.warn("pandas not found, skipping conversion test.", ImportWarning)


def from_pandas_adjacency(df, create_using=None):
    try:
        df = df[df.index]
    except Exception as err:
        missing = list(set(df.index).difference(set(df.columns)))
        msg = f"{missing} not in columns"
        raise nx.NetworkXError("Columns must match Indices.", msg) from err

    A = df.values
    G = from_numpy_array(A, create_using=create_using)

    nx.relabel.relabel_nodes(G, dict(enumerate(df.columns)), copy=False)
    return G

And using the .drop method in group.py:

def prominent_group(
    G, k, weight=None, C=None, endpoints=False, normalized=True, greedy=False
):
    import pandas as pd
    ...
    betweenness = pd.DataFrame.from_dict(PB)
    if C is not None:
        for node in C:
            # remove from the betweenness all the nodes not part of the group
            betweenness.drop(index=node, inplace=True)
            betweenness.drop(columns=node, inplace=True)
    CL = [node for _, node in sorted(zip(np.diag(betweenness), nodes), reverse=True)]

Perspective

A multi-language (streaming) viz and analytics library. The Python version uses pandas in core/pd.py. It uses a small but nontrivial amount of the API, including MultiIndex, CategoricalDtype, and time series functionality.
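
Illustrative only (this is not Perspective's actual code) - the kind of pandas surface mentioned above looks like this:

import pandas as pd

# Hierarchical row/column labels
idx = pd.MultiIndex.from_product([["a", "b"], [1, 2]])
# Categorical dtype with an explicit set of categories
cat = pd.Series(["x", "y"], dtype=pd.CategoricalDtype(["x", "y", "z"]))
# Time series functionality via a datetime index
ts = pd.date_range("2021-01-01", periods=4, freq="D")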

Scikit-learn

TODO: the usage of Pandas in scikit-learn is very much in flux, and more support for "dataframe in, dataframe out" is being added. So it did not seem to make much sense to just look at the code, rather it makes sense to have a chat with the people doing the work there.

Matplotlib

Added because it comes up a lot. Matplotlib uses just a "dictionary of array-likes" approach, with no direct dependence on pandas. So it already works with other dataframe libraries today, as long as their columns can convert to a numpy array.
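
A minimal sketch of that approach, assuming nothing beyond matplotlib's public data keyword (the plain dict below stands in for any dataframe whose columns convert to numpy arrays):

import matplotlib.pyplot as plt

# Any mapping of column name -> array-like works; a dataframe can be
# passed the same way, since matplotlib only does data["x"]-style lookups
# and converts the result to a numpy array.
data = {"x": [1, 2, 3], "y": [4, 1, 9]}

fig, ax = plt.subplots()
ax.plot("x", "y", data=data)  # string arguments are looked up in `data`
plt.show()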


rgommers commented Dec 2, 2021

Other libraries that were suggested as candidates to look into: Xarray, cuDF (utilities), PyJanitor (cleaning functionality, not the pandas_flavor domain-specific parts), https://github.com/sfu-db/dataprep

rgommers (Member Author) commented:

PyJanitor (non pandas_flavor code)

Not repeating DataFrame, Series, and .columns; those are used a lot.

  • utils.py: .iloc, RangeIndex, MultiIndex, .empty, Index
  • functions/add_columns.py: .copy, .add_column
  • functions/case_when.py: .assign, .mask, .index, Index, .nlevels, .ndim, .size, __len__
  • functions/clean_names.py: .rename, .__dict__
  • functions/coalesce.py: .filter, .bfill, .ffill, .assign
  • functions/complete.py: .copy, .merge, .groupby, .apply, .droplevel, .loc, Index, MultiIndex
  • functions/conditional_join.py: .loc, .index, .empty, .copy, RangeIndex, MultiIndex, index, append, .to_numpy, .dtypes, .items, .join
  • functions/convert_date.py: to_datetime, .astype, .apply
  • functions/count_cumulative_unique.py: .drop_duplicates, .assign, .cumsum, .index, .reindex, .ffill, .astype
  • functions/currency_column_to_numeric.py: to_numeric, .loc, .assign, .apply

There's a ton more - it uses a fairly large part of the pandas API surface. Even in utils, a lot of the code is in functions that then get tacked onto pd.DataFrame with @pandas_flavor.register_dataframe_method. It does not seem like a great target for initial support via a developer-focused API. Detailed usage data is available at https://github.com/data-apis/python-record-api/blob/master/data/api/pyjanitor.json

Xarray

Detailed usage data is also available at https://github.com/data-apis/python-record-api/blob/master/data/api/xarray.json; that data and a cursory search through the Xarray code base for "import pandas" show that it uses an even larger API surface. A decent amount of that usage is in tests, which isn't actually relevant. This is one of the downsides of the automated analysis tooling: if one traces pandas API usage by running the Xarray test suite, it's hard to tell whether a given usage comes from the test files or from the code under test. Pandas is still used in a lot of places though:

Note that Index is the most commonly used, followed by Series and DataFrame; the listing below leaves them out of the results for some files.

  • testing.py: Index
  • conventions.py: MultiIndex, isnull, .any, __not__
  • convert.py: isnull
  • coding/times.py: Timestamp, to_timedelta, to_datetime, __version__, notnull, isnull, DatetimeIndex
  • coding/frequencies.py: Series, DatetimeIndex, TimedeltaIndex, infer_freq
  • coding/cftimeindex.py: Index, TimedeltaIndex
  • coding/variables.py: isnull
  • core/common.py: Index, Grouper
  • core/nputils.py: isnull
  • core/merge.py: Series, DataFrame, Panel, Index
  • core/dataarray.py: Series, DataFrame, MultiIndex, Timedelta, isnull
  • core/concat.py: unique
  • core/resample_cftime.py: Series, .duplicated
  • core/pdcompat.py: Panel
  • core/accessor_dt.py: .dt
  • core/duck_array_ops.py: Timedelta, to_timedelta, .astype
  • core/utils.py: .factorize, MultiIndex, isnull
  • core/variable.py: Timestamp, MultiIndex.names, MultiIndex.set_names
  • core/indexing.py: MultiIndex + methods: .nlevels/.get_loc/.get_loc_level, CategoricalIndex, PeriodIndex, NaT, Timestamp
  • core/indexes.py: MultiIndex + method from_arrays, CategoricalIndex + method remove_unused_categories
  • core/dataset.py: MultiIndex, Categorical + .codes/.categories
  • core/groupby.py: factorize, DateOffset + .loffset, DatetimeIndex, cut, MultiIndex
  • core/alignment.py: Index + .union, .intersection
  • core/missing.py: isnull, MultiIndex, Timedelta, DatetimeIndex
  • core/coordinates.py: MultiIndex.from_product
  • core/formatting.py: isnull, Timestamp, Timedelta, .astype
  • plot/dataset_plot.py: Interval
  • plot/plot.py: notnull

There is a ton of isinstance usage (e.g. with the various index objects), because Xarray supports both its own container/index classes and pandas ones. Usage seems to be quite different from typical/idiomatic Pandas usage, because Xarray has pretty specific needs.

dataprep

https://github.com/sfu-db/dataprep doesn't seem suitable for analysis - it contains 212 files with pandas imports, a lot of them quite niche (example: a separate file for Albanian VAT number cleaning/validation).


mwaskom commented Dec 28, 2021

Hey, very cool initiative — it would be great to be more agnostic to dataframe libraries.

I wanted to flag that seaborn is in the midst of a very extensive internal refactor, which means that the survey of pandas usage in the library is likely to be out of date after future releases.

But there's an upside: it's a perfect time to be revisiting how the pandas API is used in seaborn and to proactively think about working with a more general dataframe interface. I could see the ongoing work evolving in parallel with this project (hopefully in a way that's mutually beneficial).

Let me know if I can be helpful here!

thomasjpfan (Contributor) commented:

Scikit-learn mostly treats a DataFrame as a "2D ndarray with column names". Only the OrdinalEncoder and OneHotEncoder treat the dataframe as "a collection of 1D arrays".

When scikit-learn's models start returning DataFrames, it will depend on the fact that there is a zero-copy round-trip from numpy: pandas-dev/pandas#27211. In detail:

  1. The first model does its computation with ndarrays, and the result is converted to a DataFrame when returned.
  2. The DataFrame is passed into a second model, which internally converts it back into an ndarray for computation.

Scikit-learn requires that 2d ndarray -> DataFrame -> 2d ndarray not make any copies, so that no additional memory is allocated.
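
A minimal sketch of that round-trip, to make the requirement concrete. Whether the final check prints True depends on pandas internals (a single uniform-dtype block) and, as discussed in pandas-dev/pandas#27211, is not a hard guarantee:

import numpy as np
import pandas as pd

X = np.arange(12, dtype=np.float64).reshape(3, 4)  # model output: 2d ndarray
df = pd.DataFrame(X, copy=False)                   # wrap as a DataFrame
X2 = df.to_numpy()                                 # back to an ndarray

# Zero-copy means no new memory was allocated along the way:
print(np.shares_memory(X, X2))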


rgommers commented Apr 8, 2022

Interesting, thanks for sharing @thomasjpfan.

> When scikit-learn's models start returning DataFrames, it will depend on the fact that there is a zero-copy round-trip from numpy: pandas-dev/pandas#27211.

The answers from the Pandas devs there are along the lines of what I'd expect: this isn't necessarily guaranteed in the future. That's more of a "labeled array" use case, which is Xarray-like. Did anything change after that 2019 discussion @thomasjpfan, or is it more a "fingers crossed that Pandas doesn't change this"?

> Scikit-learn requires that 2d ndarray -> DataFrame -> 2d ndarray not make any copies, so that no additional memory is allocated.

I think pragmatically there is likely to always be a way for Pandas to do this; scikit-learn is probably important enough that it could even get its own method for this if needed. Conceptually it's not a nice fit for a standardized dataframe behavior though; it only works for a subset of supported dtypes, and it requires a constructor that accepts 2-D arrays to begin with.

thomasjpfan (Contributor) commented:

> is it more a "fingers crossed that Pandas doesn't change this"?

It's fingers crossed. I've seen a proposal for a 2D extension array, but I think there is a lot more momentum for 1d extension arrays & a columnar store.

I want to add: there are certain models, such as StandardScaler, that could treat the dataframe as a "1d collection of arrays" but are not implemented that way yet. Other models such as PCA will always need to concat the 1d arrays into a 2d array to work.


MarcoGorelli commented Feb 22, 2023

> PyJanitor (cleaning functionality, not the pandas_flavor domain-specific parts)

Looks like they only really use rename here, which could easily be standardised:

https://github.com/pyjanitor-devs/pyjanitor/blob/7ad98e3564f86534094e4eb425d85ff9a25a3679/janitor/functions/clean_names.py#L84-L106

The trickier part is this decorator, which also uses pandas_flavor:

https://github.com/pyjanitor-devs/pyjanitor/blob/7ad98e3564f86534094e4eb425d85ff9a25a3679/janitor/functions/clean_names.py#L11-L12

pyjanitor adds an extra clean_names method to the pandas DataFrame. How would they make use of the Standard?

  • Would they add such a method to all DataFrame objects that have some implementation of the standard?
  • Would the Standard need to require some decorator that can be used to register custom methods?
  • Would it actually be possible for pyjanitor to then register clean_names as a method for all libraries, without having to list them all explicitly? Asking because I don't know - although it strikes me as unlikely.

rgommers (Member Author) commented:

It looks to me like there are two separate things in PyJanitor:

  1. Functionality implemented through code that calls pandas APIs (dataframe methods and attributes mostly, not just rename)
  2. An unusual way of exposing its own PyJanitor API, namely injecting methods into the dataframe of another library, rather than providing standalone functions.

(2) looks motivated purely by UX (I could well be wrong here, not being an active user): dataframe users tend to like methods over functions. It seems unhealthy to me, because one library monkeypatching another is a big no-no in library design. Any df.new_meth(...) could have been new_func(df, ...) instead, I think.

It's actually an interesting question whether (2) should be allowed through a registration mechanism, or whether it should be discouraged. I'd lean towards the latter, but then again I'm coming from a domain where a functional programming style is preferred over an object-oriented one. If dataframe library authors prefer the former, then a well-defined extension mechanism seems useful - even for PyJanitor + Pandas only.


MarcoGorelli commented Feb 27, 2023

OK true, their methods do work as functions too:

In [1]: import pandas as pd

In [2]: from janitor.functions.clean_names import clean_names

In [3]: df = pd.DataFrame({'A ': [1, 2, 3]})

In [4]: df
Out[4]:
   A
0   1
1   2
2   3

In [5]: clean_names(df)
Out[5]:
   a_
0   1
1   2
2   3

So, perhaps that's the part which the standard can target. It might be worthwhile to try taking a handful of functions from them, say:

  • clean_names
  • drop_constant_columns
  • min_max_scale

Then try implementing the Standard for each DataFrame library, seeing if it's sufficient, and whether this would let pyjanitor "just work" on all of them if it were rewritten to use the standard API.
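
For illustration, a hedged sketch of what a library-agnostic clean_names could look like; column_names and rename here are hypothetical standard methods, not an existing spec:

import re

def clean_names(df):
    # `df` is assumed to implement a hypothetical standard API exposing
    # `column_names` (a list of strings) and `rename` (taking a mapping).
    mapping = {name: re.sub(r"\W+", "_", name.lower()) for name in df.column_names}
    return df.rename(mapping)

On the example above, "A " would map to "a_", matching pyjanitor's output.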

jorisvandenbossche (Member) commented:

> If dataframe library authors prefer the former, then a well-defined extension mechanism seems useful - even for PyJanitor + Pandas only.

FWIW, for pandas itself this already exists (https://pandas.pydata.org/docs/dev/development/extending.html#registering-custom-accessors), and this is also what pyjanitor / pandas_flavor use under the hood (pandas_flavor adds some convenience layer on top of it).

Whether this would also be useful for a DataFrame standard is of course a different question. I think if our goal is to provide a developer-oriented standard API, this is much less needed.
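
For reference, a minimal sketch of that pandas mechanism (register_dataframe_accessor is pandas' public API; the accessor name and method below are made up for illustration):

import pandas as pd

@pd.api.extensions.register_dataframe_accessor("cleaner")
class CleanerAccessor:
    def __init__(self, df: pd.DataFrame):
        self._df = df

    def clean_names(self) -> pd.DataFrame:
        # lower-case the column names and replace spaces with underscores
        return self._df.rename(columns=lambda c: c.lower().replace(" ", "_"))

Usage: pd.DataFrame({"A ": [1, 2]}).cleaner.clean_names()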

MarcoGorelli (Contributor) commented:

Other tools which have been mentioned as potential targets:

  • featuretools
  • pandera

MarcoGorelli (Contributor) commented:

This one would be a good candidate, not least because they already support both pandas and polars: https://github.com/Kanaries/pygwalker

MarcoGorelli (Contributor) commented:

Well this is encouraging:

> Now, all pandas-specific logic is isolated to specific modules, where support for additional non-pandas-compliant schema specifications and their associated backends can be implemented either as 1st-party-maintained libraries (see issues for supporting unionai-oss/pandera#1064 and unionai-oss/pandera#1105) or 3rd party libraries.

https://github.com/unionai-oss/pandera/releases/tag/v0.14.0


MarcoGorelli commented Mar 31, 2023

Altair has added support for polars by using the interchange protocol: https://github.com/altair-viz/altair

pyarrow is required as a dependency for this to work, though. With the standard, they could potentially support polars (and many others) without requiring extra dependencies? One to look into.
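
For context, the mechanism in question looks roughly like this (pd.api.interchange.from_dataframe exists in pandas >= 1.5; the polars input is illustrative and needs pyarrow installed):

import pandas as pd
import polars as pl

pl_df = pl.DataFrame({"a": [1, 2, 3]})

# Any object implementing __dataframe__ can be consumed this way,
# without the consumer depending on polars directly.
pd_df = pd.api.interchange.from_dataframe(pl_df)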

EDIT: I don't think altair is a good candidate, see #133

MarcoGorelli (Contributor) commented:

Dropping Dask for now, as they've said this wouldn't solve an actual pain point of theirs.


Anyway, https://github.com/feature-engine/feature_engine looks like a good candidate, and exactly the kind of library where this might be useful!

MarcoGorelli (Contributor) commented:

Here's a really good one

https://github.com/Nixtla/statsforecast/blob/c732a6101ce0c9daec886928e0f68371772fcccc/statsforecast/core.py#L540-L633

they literally have

if isinstance(self.dataframe, pl.DataFrame):
    ...  # polars-specific logic
elif isinstance(self.dataframe, pd.DataFrame):
    ...  # pandas-specific logic
else:
    raise

So yeah, really solid candidate here


MarcoGorelli commented Jun 28, 2023

Another one, where they've already said that their objective is to support multiple dataframe backends: https://github.com/skrub-data/skrub

others:

  • scikit-lego
  • tsfresh
  • pandas-ta

cosmicBboy commented:

hi all! pandera author here 👋, just wanted to drop a note here to say we're going to start investing resources in pandera-polars support: unionai-oss/pandera#1064.

Not sure how far along this project is but would love to get some tips on how to design the polars validation backend as described in this mini-roadmap: unionai-oss/pandera#1064 (comment).

Was planning on forging ahead with polars-specific implementations for various things that pandera does during the validation pipeline (see anywhere there's a check_obj variable in the pandas backend as an example). If there's anything we should keep in mind as we build it out please add comments to that issue ^^, we'd really appreciate it!
