Stream Sync deployment TODO items #4139

Open
9 of 13 tasks
JackyWYX opened this issue Apr 20, 2022 · 4 comments

Comments

@JackyWYX
Contributor

JackyWYX commented Apr 20, 2022

1. Overview

This document tracks the steps for deploying the stream sync protocol and the progress of each step.

2. Items and progress

  • Code review and feature lock up.
  • Fix all test cases (Unit & Localnet)
  • Deployment to testnet for recursive test.
    • Deploy to all nodes with server on & client off
    • Turn on client on beacon chain
    • Enable client on all shards.
  • Bug fixes according to testnet results
    • CPU and memory usage increases over time - Resolved by [stream] Enable downloader #4049, which adds a cooldown mechanism to the discovery protocol.
    • Node does not have enough streams in the testnet environment - Resolved by lowering minStream in the config file
    • Nodes fall out of sync daily - Investigating
      • Cannot find the issue by looking at logs. No extra logs are emitted when the node gets stuck
      • Tried removing all constraints on the server side - did not help
      • Disabled the cooldown and blacklist mechanisms - nodes still fall out of sync daily
      • After the fix in [downloader] fix explorer node get stuck when doing short range sync #4150, the problem is partially resolved. The issue on validator nodes still needs to be investigated.
  • Deploy to mainnet with probe client node
    • Enable downloader server on all internal nodes
    • Enable client downloader for 1 node for each shard (probe node)
  • Deploy to all mainnet nodes with server and client on
    • Enable stream server & client in all internal nodes of a shard chain
    • Enable stream server & client in all internal nodes
  • Code params update and code release to public.
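
The rollout order in the list above can be written down as data, which makes it easy to check which stage comes next. This is only a minimal sketch: the `Stage` type, the scope strings, and `next_stage` are hypothetical names for illustration and are not part of the Harmony codebase or its config.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Stage:
    network: str  # "testnet" or "mainnet"
    server: str   # scope of nodes running the stream sync server
    client: str   # scope of nodes running the stream sync client


# Hypothetical encoding of the rollout order from the checklist.
STAGES: Tuple[Stage, ...] = (
    Stage("testnet", server="all nodes", client="none"),
    Stage("testnet", server="all nodes", client="beacon chain"),
    Stage("testnet", server="all nodes", client="all shards"),
    Stage("mainnet", server="all internal nodes", client="one probe node per shard"),
    Stage("mainnet", server="all internal nodes", client="all internal nodes of one shard chain"),
    Stage("mainnet", server="all internal nodes", client="all internal nodes"),
)


def next_stage(completed: int) -> Optional[Stage]:
    """Return the stage that follows `completed` finished stages, or None when done."""
    if completed < 0:
        raise ValueError("completed must be non-negative")
    return STAGES[completed] if completed < len(STAGES) else None
```

In every stage the server scope covers at least the client scope, which matches the checklist's approach of turning servers on before any client depends on them.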
@JackyWYX JackyWYX added the enhancement New feature or request label Apr 20, 2022
@JackyWYX JackyWYX self-assigned this Apr 20, 2022
@JackyWYX
Contributor Author

#4150

Found one bug that may be contributing to the testnet downloader sync issue. After the fix, the explorer node can finish the sync process without much delay. Further impact needs to be observed.

@JackyWYX
Contributor Author

After deploying the fix #4150 to the internal shard 3 explorer node, the node has not gone out of sync for the last 48 hours, which means the fix did resolve the issue on the testnet explorer node.

According to the testnet results, there are still some stuck-sync issues that require a node restart. The stalls on validator nodes do not happen regularly, so I added more logs to the downloader sync process to better observe the root cause of testnet validators falling out of sync: #4153. It would be best to deploy this change to the testnet nodes so that we get a better understanding when a validator node gets stuck again.
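
As an illustration of the kind of observation that helps here, the following is a minimal sketch of a stall detector that logs a warning when the block height stops advancing. It is not the logging added in #4153; `StallDetector`, `stall_after`, and the injected `clock` are hypothetical names, and the height would have to come from whatever the node exposes.

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger("sync_watch")


class StallDetector:
    """Flags a node as stalled when its block height has not advanced for a while."""

    def __init__(self, stall_after: float, clock: Callable[[], float] = time.monotonic):
        self.stall_after = stall_after  # seconds without progress before warning
        self.clock = clock              # injectable so the logic can be tested
        self.last_height: Optional[int] = None
        self.last_progress = clock()

    def observe(self, height: int) -> bool:
        """Record the current height and return True if the node looks stalled."""
        now = self.clock()
        if self.last_height is None or height > self.last_height:
            # Progress was made: remember the new height and reset the timer.
            self.last_height = height
            self.last_progress = now
            return False
        idle = now - self.last_progress
        stalled = idle >= self.stall_after
        if stalled:
            logger.warning("no sync progress for %.0fs at height %d", idle, height)
        return stalled
```

Calling `observe` periodically with the node's current height gives a timestamped record of when a stall began, which can then be lined up with the downloader logs.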

@JackyWYX
Contributor Author

JackyWYX commented May 6, 2022

The code has been running smoothly on testnet (no more out-of-sync issues) for the past few days. That is sufficient for us to move to the next stage: deploying the code to mainnet with probe nodes.

The plan is as follows:

  1. Turn on the downloader server at shard 3.
  2. Turn on the downloader server on 5 nodes at the beacon chain.
  3. Spin up probe nodes, 1 at shard 0 and 1 at shard 3, and enable the downloader on them.
  4. Observe the performance of the probe nodes and server nodes, and fix any bugs discovered in the process.

@sophoah

@sophoah
Contributor

sophoah commented May 8, 2022

@PkayJava, after discussion with @JackyWYX:

  1. Build 4 new explorer nodes in S1 with DNS sync on, downloader server on, and stream downloader off
  2. Enable downloader server (with stream downloader off) on the 4 existing S1 explorer nodes
  3. Enable downloader server on the 8 existing S0 explorer nodes
  4. Build 1 probe node with stream downloader on in shard 1, and use an existing shard 0 node as the second probe. A probe is just another node used to test the sync
  5. Observe the performance of the probe nodes and server nodes, and report any bugs discovered in the process.

All nodes should be in watchdog/grafana for easier monitoring. Make sure the S0 nodes chosen are all in different data centers (DCs).
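
The "different DC" requirement can be checked mechanically when picking nodes. Below is a minimal sketch, assuming an inventory that maps node names to data center names; the function name, the inventory format, and the example names in the test data are all hypothetical.

```python
from typing import Dict, List


def pick_distinct_dc(nodes: Dict[str, str], count: int) -> List[str]:
    """Pick `count` node names from `nodes` (name -> data center), one per data center.

    Nodes are considered in name order so the result is deterministic.
    Raises ValueError if there are fewer than `count` distinct data centers.
    """
    chosen: List[str] = []
    seen_dcs = set()
    for name, dc in sorted(nodes.items()):
        if len(chosen) == count:
            break
        if dc not in seen_dcs:
            seen_dcs.add(dc)
            chosen.append(name)
    if len(chosen) < count:
        raise ValueError(
            f"only {len(chosen)} distinct data centers available, need {count}"
        )
    return chosen
```

Failing loudly when there are not enough distinct data centers is deliberate: it is better to notice that during planning than to end up with several server nodes sharing one DC.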
