Flaky tests for sharding (publish) #3845

Closed
wants to merge 9 commits

Conversation


@ThetaSinner ThetaSinner commented May 9, 2024

Summary

Putting this up for review to see if my test logic is reasonable. The tests are definitely flaky, but they fail at the final step, which I think means that:

  • The tests are correctly assigning arcs
  • Ops are sometimes published to agents whose arcs don't contain the basis of the op being published

Note that this isn't actually about gossip, as the branch name suggests; I started with publish, which is slightly simpler. These tests should extend nicely to gossip if we can get them working reliably.

TODO:

  • CHANGELOGs updated with appropriate info

@ThetaSinner ThetaSinner linked an issue May 9, 2024 that may be closed by this pull request
@ThetaSinner ThetaSinner changed the title Flaky tests for sharded gossip Flaky tests for sharding May 9, 2024
@ThetaSinner ThetaSinner changed the title Flaky tests for sharding Flaky tests for sharding (publish) May 9, 2024
}
}

impl ArqStrat {
    /// Standard arq strat
-   pub fn standard(local_storage: LocalStorageConfig) -> Self {
+   pub fn standard(local_storage: LocalStorageConfig, min_coverage: f64) -> Self {
Contributor Author

It wasn't actually possible to configure this with the tuning param. I haven't ended up needing this change on this PR yet, but I do think the change needs making.
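For reference, a minimal sketch of how the widened constructor might be called (the module path here is an assumption; the 2.0 coverage value mirrors the test code later in this PR):

use kitsune_p2p_dht::arq::{ArqStrat, LocalStorageConfig};

// Hypothetical usage: the caller now supplies the coverage target explicitly
// instead of relying on a tuning param that couldn't actually set this value.
fn strat_for_test() -> ArqStrat {
    ArqStrat::standard(LocalStorageConfig::default(), 2.0)
}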

@@ -38,6 +38,7 @@ impl ShardedGossipLocal {

// Choose a remote agent to gossip with.
let remote_agent = self
// TODO want to set a test up so that we know who this should be
Contributor Author

I haven't done this yet, so I'm going to leave this TODO here for myself.

self.agent_arqs.get(agent).cloned().unwrap_or_else(|| {
    let dim = SpaceDimension::standard();
    match self.config.tuning_params.arc_clamping() {
        Some(ArqClamping::Empty) => Arq::new_empty(dim, agent.get_loc()),
-       Some(ArqClamping::Full) | None => Arq::new_full_max(dim, &strat, agent.get_loc()),
+       Some(ArqClamping::Full) | None => {
+           let strat = self.config.tuning_params.to_arq_strat();
Contributor Author

The strat didn't need to be computed on one code path, and wasn't needed at all if there was already an agent arc. Just a minor optimisation.

let mut pub_key_bytes = info.ed25519_pub_key.0.to_vec();
let loc = dht_location(&pub_key_bytes[0..32].try_into().unwrap());
pub_key_bytes.extend(&loc);
Arc::new(KitsuneAgent::new(pub_key_bytes))
Contributor Author

This was an omission from the test host implementation: agents didn't have a location. It wasn't needed by the existing tests because they all use a full arc.

Comment on lines +161 to +186
for i in 0..5 {
if i == sender_idx || i == should_recv_idx {
continue;
}

// We've filtered out the sender and the receiver, who are expected to have the data.
// Now we check that the agent at the current index does not have the basis that the op was
// published to in its arc. That would make the test wrong, not Kitsune, so fail here!
let should_this_agent_hold_the_op = should_agent_hold_op_at_basis(&agents[i].0, agents[i].2.clone(), basis.clone());

// If this assertion fails, it means that the agent at index `i` has the basis in its arc which is not intended by the test setup.
assert!(!should_this_agent_hold_the_op, "Agent {i} should receive the data, this is a setup issue with the test");

// Now make the important assertion that the agent at index `i` did not receive the data! If it's not in the agents arc
// (which we just asserted above) then it should not have been received.
let store_lock = agents[i].0.op_store();
let store = store_lock.read();
assert!(
store.is_empty(),
"Agent {} should not have received any data but has {} ops",
i,
store.len()
);
}
Contributor Author

It's this bit I'd like to get a second pair of eyes on. I think that if the first check ('basis not in agent arc') passes, then, assuming Kitsune is correct, the op storage assertion shouldn't fail. However, it is flaky. That means that sometimes the op basis isn't in the agent's arc but the agent gets the op anyway.

This could mean that the test host implementation has bugs. It could mean that I've not configured Kitsune right and arcs are changing. It could just be that I've written code in this loop that doesn't make sense.

Working out what is wrong is probably going to be difficult, so I'd like to at least check that my assumptions are reasonable!
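To make that assumption explicit, here is a rough sketch of the invariant the pre-check is taken to express. This is not the PR's `should_agent_hold_op_at_basis` implementation, just the property it is assumed to capture: an agent should hold the op iff the basis location falls inside the arc the test assigned to it, treated as a wrapping interval over the u32 location space.

// Sketch only: an assumption about what the pre-check expresses, not the real helper.
// The arc is modelled as a half-open wrapping interval [start, start + len).
fn arc_covers_basis(arc_start: u32, arc_len: u32, basis_loc: u32) -> bool {
    // wrapping_sub handles arcs that wrap past the end of the location space;
    // a genuinely full arc (len == 2^32) would need special-casing.
    basis_loc.wrapping_sub(arc_start) < arc_len
}

If that invariant holds for the arcs the test assigns, then a correct publish should never deliver the op to an agent for which it returns false.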

Contributor

My sense at the moment is that this could be pointing to a behavior in Holochain that wasn't explicitly intended. However, we wrote a lot of code assuming we'd someday have garbage collection for things when our arcs shrink, for example, so we haven't been too careful about avoiding receiving ops when we shouldn't.

Perhaps this falls into a class of optimizations that we should wait until after launch to address?

Contributor Author

Possibly. I've tried to disable arc resizing, so the arcs shouldn't be moving. I can add assertions to check that they aren't, and see if that stops the test before the 'who got what' assertions.
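A sketch of that guard: snapshot every agent's arc before publishing, re-read them once delivery has settled, and fail early if anything moved, so arc resizing can be ruled out before the 'who got what' assertions run. How the harness exposes each agent's declared arc is an assumption here, so it is passed in as a closure.

// Sketch only: `current_arcs` is a hypothetical closure standing in for however
// the harness reads each agent's declared arc (e.g. from its agent info).
fn assert_arcs_unchanged<A: PartialEq + std::fmt::Debug>(
    arcs_before: &[A],
    current_arcs: impl Fn() -> Vec<A>,
) {
    assert_eq!(
        arcs_before,
        &current_arcs()[..],
        "an agent's arc changed during the test"
    );
}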

#[cfg(feature = "tx5")]
#[ignore = "This test is flaky, possibly because what it is testing is flaky"]
#[tokio::test(flavor = "multi_thread")]
async fn publish_to_basis_from_outside() {
Contributor Author

This test is similar enough to the one above that it can be ignored (at least while this PR is in draft).

let store = store_lock.read();
assert!(
store.is_empty(),
"Agent {} should not have received any data but has {} ops",
Contributor Author

These are the assertions the test fails on. Which agent it fails at varies between runs.

Comment on lines +88 to +93
let arc = Arq::from_start_and_half_len_approximate(
dim,
&ArqStrat::standard(LocalStorageConfig::default(), 2.0),
agent.get_loc(),
len.as_() / 2 + 1,
);
Contributor

(cc @maackle) Do you know how the quantization will affect this? If it grows the arc a bit, that's probably not a huge problem, but you could potentially get a double overlap: with a node at the end of the range and with the node two after that at the beginning of the range.

Contributor Author

I agree; it can and does change size, and the function name makes it clear that that's expected.

That is what I was assuming was wrong with my testing over the last two days. I think that to make this test really solid, we'd need to account for that and make sure we pick an op that will really land in the overlap of two agents. However, I seem to be able to get ops on every run that fall into the arcs of the two agents I'm selecting, and I've not yet seen the test fail because we don't have enough overlap. What I'm seeing is agents who shouldn't be getting the op... getting the op. That's what the extra assertions I added were there to check for.

let sender_location = &agents[sender_idx].2 .0[32..];

let mut kitsune_basis = KitsuneBasis::new(vec![0; 36]);
kitsune_basis.0[32..].copy_from_slice(&sender_location);
Contributor

Someday we may add validation that the location is a correct hash of the hash... but I suppose we can address this somehow at that point (perhaps by having a test flag to disable that validation).

Comment on lines +81 to +86
// Distance to the end of the segment, plus the length of the next segment. Guaranteed to
// overlap with the next agent and not the one after that.
// Because of arc quantisation, the layout won't be perfect, but we can expect overlap at
// the start of the agent's arc, with the previous agent.
let len =
DhtLocation::new(base_len * (i + 1)) - agent.get_loc() + DhtLocation::new(base_len);
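For reference, a toy walk-through of the formula in the quoted snippet, using a made-up base_len of 100 instead of a fraction of the u32 location space:

// If agent 2 sits at location 230 inside its segment [200, 300), then
//   len = base_len * (2 + 1) - 230 + base_len = 300 - 230 + 100 = 170
// so its arc covers [230, 400): all of agent 3's segment [300, 400) and none
// of agent 4's segment [400, 500), which is the single intended overlap.
fn toy_len_example() {
    let base_len: u32 = 100;
    let (i, agent_loc): (u32, u32) = (2, 230);
    let len = base_len * (i + 1) - agent_loc + base_len;
    assert_eq!(len, 170);
}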
Contributor

It probably isn't the cause of your flakiness because of the extreme unlikelihood of randomly hitting it, but I'm a little worried about an off-by-one error, especially on the wrap around the beginning, if the section 5 node is at the extreme beginning of its range and the section zero node is at the extreme end.

Contributor Author

Totally agree, this isn't accurate enough. It might make sense to grow the arcs a bit more and then have the assertions dynamically check that the agents whose arcs actually hold the op are getting it, rather than hard-coding. But for now, I'm checking that agents whose arcs (I think?) don't include the test basis aren't getting the ops... but they are. Very confused :)
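A rough sketch of that dynamic variant (not the PR's code; how arc coverage is determined per agent is left abstract here): compute which agents' arcs cover the published basis, then assert that exactly those agents ended up with a non-empty op store.

// Sketch only: `holds_op[i]` is whether agent i's op store is non-empty after
// publishing, and `arc_covers[i]` is whether agent i's arc covers the basis
// (however the harness determines that). The expectation is that they match.
fn assert_holders_match_arcs(holds_op: &[bool], arc_covers: &[bool]) {
    for (i, (held, covered)) in holds_op.iter().zip(arc_covers).enumerate() {
        assert_eq!(
            held, covered,
            "Agent {i} op-store state does not match its arc coverage"
        );
    }
}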

@ThetaSinner ThetaSinner force-pushed the 3436-tests-for-sharded-gossip branch from 6548f7d to b283190 on May 10, 2024 08:56
@ThetaSinner ThetaSinner requested a review from maackle May 10, 2024 15:08
@ThetaSinner ThetaSinner force-pushed the 3436-tests-for-sharded-gossip branch from b283190 to 2f42dcd on May 20, 2024 14:00
@ThetaSinner (Contributor Author)

Closing in favor of a new PR where the tests work.
