Tracking issue: dataframe protocol implementation #46

Open
5 of 8 tasks
rgommers opened this issue Jun 25, 2021 · 0 comments

The bulk of the dataframe interchange protocol was done in gh-38. A number of TODOs remain, however, and more will likely come up once we have multiple implementations and can actually convert one type of dataframe into another. This is the tracking issue for those TODOs and open questions:

  • Categorical dtypes: we should allow having null as a category; it should not have a specified meaning, it's just another category that should (e.g.) roundtrip correctly. See conversation in 8 Apr meeting.
  • Categorical dtypes: should they be a dtype in themselves, or should they be a part of the dtype tuple? Currently dtype is (kind, bitwidth, format_str, endianness), with categorical being a value of the kind enum. Would it be sensible to add a 5th element to the dtype, itself another dtype 4-tuple, thereby allowing nesting?
  • Add a metadata attribute that can be used to store library-specific things. For example, Vaex should be able to store expressions for its virtual columns there. See "PR: Add metadata attribute to DataFrame and Column" (#43).
  • Add a flag to throw an exception if the export cannot be zero-copy. For example, in pandas a copy can be needed because of the block manager, where rows are contiguous and columns are not; add a test for that. See "PR: Add allow_copy flag to interchange protocol" (#44).
  • Add a string dtype, with variable-length strings implemented with the same scheme as Arrow uses (an offsets buffer and a data buffer; see the comment on "Add a prototype of the dataframe interchange protocol", #38). See PR "Add variable-length string support" (#45).
  • Signature of the from_dataframe protocol? See "Signature for a standard from_dataframe constructor function" (#42) and the meeting of 20 May.
  • What can be reused between implementations in different libraries, and can/should we have a reference implementation? --> question needs answering somewhere.
  • What is the ownership for buffers, who owns the memory? This should be clearly spelled out in the docs. An owner attribute is perhaps needed. See meeting minutes 4 March, "How to consume a single buffer & connection to array interchange" (#39), and comments on this PR.
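
For readers unfamiliar with the Arrow scheme mentioned in the string dtype item, here is a minimal pure-Python sketch of the offsets-plus-data layout. It is an illustration only, not the implementation from the linked PR: the function names `encode_strings` and `decode_strings` are hypothetical, and the validity information is kept as a plain list of booleans here, whereas Arrow packs it into a bitmap.

```python
from typing import List, Optional, Tuple


def encode_strings(
    values: List[Optional[str]],
) -> Tuple[bytes, List[int], List[bool]]:
    """Pack strings into (data, offsets, validity).

    data     : all UTF-8 bytes concatenated, with no separators
    offsets  : len(values) + 1 integers; string i occupies
               data[offsets[i]:offsets[i + 1]]
    validity : True where the value is present, False where it is null
               (simplified; Arrow stores this as a bitmap)
    """
    data = bytearray()
    offsets = [0]
    validity = []
    for value in values:
        if value is not None:
            data.extend(value.encode("utf-8"))
        validity.append(value is not None)
        # A null contributes no bytes, so its start and end offsets are equal.
        offsets.append(len(data))
    return bytes(data), offsets, validity


def decode_strings(
    data: bytes, offsets: List[int], validity: List[bool]
) -> List[Optional[str]]:
    """Inverse of encode_strings: slice the data buffer using the offsets."""
    result: List[Optional[str]] = []
    for i, is_valid in enumerate(validity):
        if is_valid:
            result.append(data[offsets[i]:offsets[i + 1]].decode("utf-8"))
        else:
            result.append(None)
    return result


data, offsets, validity = encode_strings(["a", None, "héllo", ""])
roundtrip = decode_strings(data, offsets, validity)
```

Note that the offsets count bytes, not characters, so a non-ASCII string such as "héllo" spans six bytes. An empty string and a null both have equal start and end offsets, which is why a separate validity structure is needed to tell them apart.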