
Realtime sync #7

Open · rgbkrk opened this issue Dec 20, 2023 · 6 comments

@rgbkrk (Member) commented Dec 20, 2023

If we go the CRDT route, I expect we'll be using yrs.

As much as we can, we should try to adhere to Jupyter's ydoc data structures: https://github.com/jupyter-server/jupyter_ydoc

However, if there are areas where we want to influence the direction, we can diverge as long as we document why. Perhaps we'll motivate new protocols, like we (nteract) have done in the past. We may also find that our standard and protocol end up quite different, as long as we can convert our document to a proper Jupyter notebook at any time.
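
To make that concrete, here's a minimal sketch in TypeScript using yjs (the JS counterpart of yrs) of a notebook shaped roughly like jupyter_ydoc's YNotebook; treat the key names and cell layout as assumptions about the shape rather than a verified copy of the schema:

```typescript
// A minimal sketch in yjs of a notebook shaped roughly like jupyter_ydoc's YNotebook.
// The key names ("cells", "meta") and the cell layout are assumptions, not a verified schema.
import * as Y from "yjs";

const doc = new Y.Doc();

// Top-level shared types: an array of cells plus notebook-level metadata.
const cells = doc.getArray<Y.Map<unknown>>("cells");
const meta = doc.getMap<unknown>("meta");
meta.set("nbformat", 4);
meta.set("nbformat_minor", 5);

// Each cell is a Y.Map whose "source" is a Y.Text, so edits merge at the character level.
function makeCodeCell(id: string, source: string): Y.Map<unknown> {
  const cell = new Y.Map<unknown>();
  cell.set("id", id);
  cell.set("cell_type", "code");
  cell.set("source", new Y.Text(source));
  cell.set("metadata", new Y.Map());
  cell.set("outputs", new Y.Array());
  return cell;
}

cells.push([makeCodeCell("cell-1", "print('hello')")]);

// Updates travel as binary payloads; yrs speaks the same encoding, so a Rust
// backend could apply this update directly.
const update = Y.encodeStateAsUpdate(doc);
```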

Update: we'd like to adapt a version of CoCalc's algorithm for use in this project. See #7 (comment)

@kafonek (Contributor) commented Dec 20, 2023

From reading through jupyter_ydoc and pycrdt, it looks like there's not a pure Rust implementation of the YNotebook.

Separately, as a sanity check: I haven't seen any overlap between modeling the Notebook for CRDT updates and modeling the Notebook in order to serialize it to JSON and save it to disk. Is it still true that saving to disk is done by the frontend building its in-memory Notebook model, serializing it to JSON, and posting it to the backend server to write to disk?

@rgbkrk (Member, Author) commented Dec 21, 2023

I view saving the document to disk as something that happens entirely separately, as a side effect requested by the user or as part of autosaving.

As I just discussed with @kafonek on Zoom, we're going to go with a simple update model that does a full replace of a cell's source or of other keys on the notebook. Cells will be addressed by ID (as they are today in nteract). We can bring in CRDTs later. For now we'll optimize a bit for bootstrapping the app, knowing that the protocols may change as we learn yrs.
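
To sketch what that might look like, here are some hypothetical message shapes for the full-replace model; the names and fields are placeholders, not a settled protocol:

```typescript
// Hypothetical message shapes for the full-replace update model described above.
// Names and fields are placeholders, not a settled protocol.
type CellId = string;

interface ReplaceCellSource {
  type: "replace_cell_source";
  cellId: CellId;
  source: string; // the complete new source, not a diff
}

interface ReplaceNotebookKey {
  type: "replace_notebook_key";
  key: "metadata" | "nbformat" | "nbformat_minor";
  value: unknown; // the complete new value for that key
}

type NotebookUpdate = ReplaceCellSource | ReplaceNotebookKey;

// Last-write-wins per cell: applying an update simply overwrites the target.
function applyCellUpdate(sources: Map<CellId, string>, update: NotebookUpdate): void {
  if (update.type === "replace_cell_source") {
    sources.set(update.cellId, update.source);
  }
  // Notebook-level keys would be handled analogously on a notebook model.
}
```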

Here's a rough sketch of possible communication patterns between the Tauri Window (notebook frontend), the Tauri Core Backend, and the Notebook Backend.

[Diagram: rough sketch of communication between the Tauri Window (notebook frontend), the Tauri Core Backend, and the Notebook Backend]
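
As a rough illustration of the Tauri Window to Core Backend leg, here's a hypothetical sketch using Tauri v1's JS API; the command and event names are invented for illustration:

```typescript
// Hypothetical sketch of the Tauri Window <-> Core Backend leg, assuming Tauri v1's JS API.
// The "replace_cell_source" command and "cell_source_replaced" event are made-up names.
import { invoke } from "@tauri-apps/api/tauri";
import { listen, type UnlistenFn } from "@tauri-apps/api/event";

// Push a full-replace cell update from the frontend to the core backend.
async function replaceCellSource(cellId: string, source: string): Promise<void> {
  await invoke("replace_cell_source", { cellId, source });
}

// Subscribe to updates broadcast by the core, e.g. edits made in another window.
async function subscribeToCellUpdates(): Promise<UnlistenFn> {
  return listen<{ cellId: string; source: string }>("cell_source_replaced", (event) => {
    console.log("cell updated elsewhere:", event.payload.cellId);
  });
}
```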

@rgbkrk rgbkrk mentioned this issue Dec 21, 2023
@rgbkrk rgbkrk changed the title CRDT Notes Realtime sync Jan 12, 2024
@rgbkrk (Member, Author) commented Jan 12, 2024

After talking it over with @williamstein, we'd like to work on incorporating the realtime sync algorithm from CoCalc into a new package.

  • Document the approach, algorithms, and operational learning lessons
  • Create implementations in multiple languages, starting with one agreed-upon common language (TypeScript or Python) and hopefully creating a Rust binding we can use in WebAssembly
  • Demonstrate the utility of our approach to Jupyter and related interested parties to help propel interactive computing forward
  • Release with a liberal license compatible with Jupyter

That will be dependent upon acquiring some grant funding.

@williamstein commented
One additional point would be to document the assumptions you're willing to work under. E.g., this paper https://arxiv.org/pdf/1410.2803.pdf about an RTC algorithm has these assumptions: "Consider a distributed system with nodes containing local memory, with no shared memory between them. Any node can send messages to any other node. The network is asynchronous; there is no global clock, no bound on the time a message takes to arrive, and no bounds are set on relative processing speeds. The network is unreliable: messages can be lost, duplicated or reordered (but are not corrupted)." For your use case, what are the assumptions? With CoCalc:

  • there is a central server
  • nodes only communicate back and forth with the central server; nothing is peer-to-peer
  • there is a global clock, with resolution of 1s (we store a delta from the client system clock compared to the central server clock, and sync periodically)
  • the network is reliable; if the connection isn't working (perhaps measured via a heartbeat) then the client terminates and will reconnect
  • any changes from before the last snapshot must be rebased and sent again
  • there are reasonable bounds on processor speed and memory

The "idea" in CoCalc is that we make at least the above assumptions, which makes the RTC problem easy to solve, at least compared to the difficult problems frequently considered in the literature. The RTC algorithm we use is then hopefully really boring, simple and easy to understand and implement.

Full sync support for ipywidgets is an interesting, difficult edge case that I put a lot of work into but haven't really pushed through to completion (e.g., support for third-party widgets is not 100%). Collaboration could help with moving this forward.

@rgbkrk (Member, Author) commented Jan 12, 2024

> For your use case, what are the assumptions? With CoCalc:
>
> there is a central server

👍

> nodes only communicate back and forth with the central server; nothing is peer-to-peer

Definite agreement here. Peer-to-peer makes it messy.

> there is a global clock, with resolution of 1s (we store a delta from the client system clock compared to the central server clock, and sync periodically)

I can be sold on this.

> the network is reliable; if the connection isn't working (perhaps measured via a heartbeat) then the client terminates and will reconnect

👍

> any changes from before the last snapshot must be rebased and sent again

It's also OK to drop them, but I think rebasing is the right way forward.

> there are reasonable bounds on processor speed and memory

👍

Since this issue is in the next-gen-desktop app repo (my fault), I should clarify why I'm interested in having a consistent protocol even on a local machine. The Tauri frontend has to communicate with the backend over some kind of protocol and I'd rather keep that consistent. A side benefit is that you can have multiple windows open of the same notebook or even other services able to work with the notebook "server side".

@williamstein commented
> rebasing is the right way forward.

With diff-match-patch it is usually pretty easy to do. One option is to take all the old diffs and apply them to the new version of the document, then make one new diff comparing the new version to the result.
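
Assuming the diff-match-patch npm package, that rebase step might look roughly like this sketch (the function and variable names are illustrative):

```typescript
// Hypothetical sketch of the rebase step, assuming the diff-match-patch npm package.
import DiffMatchPatch from "diff-match-patch";

const dmp = new DiffMatchPatch();

// Rebase edits that were made against oldVersion so they apply to newVersion.
function rebasePendingEdits(
  oldVersion: string,
  editedOldVersion: string,
  newVersion: string
): string {
  // Patches describing the client's pending edits, made against the old version.
  const pendingPatches = dmp.patch_make(oldVersion, editedOldVersion);

  // Apply those old patches (fuzzily) to the new version of the document.
  const [rebased] = dmp.patch_apply(pendingPatches, newVersion);

  // What actually gets sent is one new diff comparing newVersion to this result,
  // e.g. dmp.patch_make(newVersion, rebased).
  return rebased;
}
```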

> A side benefit is that you can have multiple windows open of the same notebook or even other services able to work with the notebook "server side".

This is indeed very nice.

> That will be dependent upon acquiring some grant funding.

I really hope such funding happens!
