Flaky tests for sharding (publish) #3845

Closed
wants to merge 9 commits

Conversation


@ThetaSinner ThetaSinner commented May 9, 2024

Summary

Putting this up for review to see if my test logic is reasonable. The tests are definitely flaky, but they fail at the final step, which I think means that:

  • The tests are correctly assigning arcs
  • Ops are sometimes published to agents whose arcs don't contain the basis of the op being published

Note that this isn't actually about gossip, as the branch name suggests; I started with publish, which is slightly simpler. These tests should extend nicely to gossip if we can get them working reliably.

TODO:

  • CHANGELOGs updated with appropriate info

@ThetaSinner ThetaSinner linked an issue May 9, 2024 that may be closed by this pull request
@ThetaSinner ThetaSinner changed the title Flaky tests for sharded gossip Flaky tests for sharding May 9, 2024
@ThetaSinner ThetaSinner changed the title Flaky tests for sharding Flaky tests for sharding (publish) May 9, 2024
}
}

impl ArqStrat {
    /// Standard arq strat
-   pub fn standard(local_storage: LocalStorageConfig) -> Self {
+   pub fn standard(local_storage: LocalStorageConfig, min_coverage: f64) -> Self {
Contributor Author

It wasn't actually possible to configure this with the tuning param. I haven't ended up needing this change on this PR yet, but I do think the change needs making.
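For reference, a minimal sketch of how the widened constructor might be called (the module path here is an assumption; the 2.0 coverage value mirrors the test code later in this PR):

use kitsune_p2p_dht::arq::{ArqStrat, LocalStorageConfig};

// Hypothetical usage: the caller now supplies the coverage target explicitly
// instead of relying on a tuning param that couldn't actually set this value.
fn strat_for_test() -> ArqStrat {
    ArqStrat::standard(LocalStorageConfig::default(), 2.0)
}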

@@ -38,6 +38,7 @@ impl ShardedGossipLocal {

// Choose a remote agent to gossip with.
let remote_agent = self
// TODO want to set a test up so that we know who this should be
Contributor Author

I haven't done this yet, so I'm going to leave this TODO here for myself.

self.agent_arqs.get(agent).cloned().unwrap_or_else(|| {
    let dim = SpaceDimension::standard();
    match self.config.tuning_params.arc_clamping() {
        Some(ArqClamping::Empty) => Arq::new_empty(dim, agent.get_loc()),
-       Some(ArqClamping::Full) | None => Arq::new_full_max(dim, &strat, agent.get_loc()),
+       Some(ArqClamping::Full) | None => {
+           let strat = self.config.tuning_params.to_arq_strat();
Contributor Author

The strat didn't need to be computed on one code path, and wasn't needed at all if there was already an agent arc. Just a minor optimisation.

let mut pub_key_bytes = info.ed25519_pub_key.0.to_vec();
let loc = dht_location(&pub_key_bytes[0..32].try_into().unwrap());
pub_key_bytes.extend(&loc);
Arc::new(KitsuneAgent::new(pub_key_bytes))
Contributor Author

This was an omission from the test host implementation: agents didn't have a location. It wasn't needed by the existing tests because they all use a full arc.

Comment on lines +161 to +186
for i in 0..5 {
if i == sender_idx || i == should_recv_idx {
continue;
}

// We've filtered out the sender and the receiver, who are expected to have the data.
// Now we check that the agent at the current index does not have the basis that the op was
// published to in its arc. That would make the test wrong, not Kitsune, so fail here!
let should_this_agent_hold_the_op = should_agent_hold_op_at_basis(&agents[i].0, agents[i].2.clone(), basis.clone());

// If this assertion fails, it means that the agent at index `i` has the basis in its arc which is not intended by the test setup.
assert!(!should_this_agent_hold_the_op, "Agent {i} should receive the data, this is a setup issue with the test");

// Now make the important assertion that the agent at index `i` did not receive the data! If it's not in the agents arc
// (which we just asserted above) then it should not have been received.
let store_lock = agents[i].0.op_store();
let store = store_lock.read();
assert!(
store.is_empty(),
"Agent {} should not have received any data but has {} ops",
i,
store.len()
);
}
Contributor Author

It's this bit I'd like to get a second pair of eyes on. I think that if the first check ('basis not in agent arc') passes, then, assuming Kitsune is correct, the op storage assertion shouldn't fail. However, it is flaky. That means that sometimes the op basis isn't in the agent's arc but the agent gets the op anyway.

This could mean that the test host implementation has bugs. It could mean that I've not configured Kitsune right and arcs are changing. It could just be that I've written code in this loop that doesn't make sense.

Working out what is wrong is probably going to be difficult, so I'd like to at least check that my assumptions are reasonable!
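To make that assumption explicit, here is a rough sketch of the invariant the pre-check is taken to express. This is not the PR's `should_agent_hold_op_at_basis` implementation, just the property it is assumed to capture: an agent should hold the op iff the basis location falls inside the arc the test assigned to it, treated as a wrapping interval over the u32 location space.

// Sketch only: an assumption about what the pre-check expresses, not the real helper.
// The arc is modelled as a half-open wrapping interval [start, start + len).
fn arc_covers_basis(arc_start: u32, arc_len: u32, basis_loc: u32) -> bool {
    // wrapping_sub handles arcs that wrap past the end of the location space;
    // a genuinely full arc (len == 2^32) would need special-casing.
    basis_loc.wrapping_sub(arc_start) < arc_len
}

If that invariant holds for the arcs the test assigns, then a correct publish should never deliver the op to an agent for which it returns false.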

Contributor

My sense at the moment is that this could be pointing to a behavior in Holochain that wasn't explicitly intended. However, we wrote a lot of code assuming we'd someday have garbage collection for things when our arcs shrink, for example, so we haven't been too careful about avoiding receiving ops when we shouldn't.

Perhaps this falls into a class of optimizations that we should wait until after launch to address?

Contributor Author

Possibly. I've tried to disable arc resizing, so the arcs shouldn't be moving. I can add assertions to check that they aren't, and see if that stops the test before the 'who got what' assertions.
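A sketch of that guard: snapshot every agent's arc before publishing, re-read them once delivery has settled, and fail early if anything moved, so arc resizing can be ruled out before the 'who got what' assertions run. How the harness exposes each agent's declared arc is an assumption here, so it is passed in as a closure.

// Sketch only: `current_arcs` is a hypothetical closure standing in for however
// the harness reads each agent's declared arc (e.g. from its agent info).
fn assert_arcs_unchanged<A: PartialEq + std::fmt::Debug>(
    arcs_before: &[A],
    current_arcs: impl Fn() -> Vec<A>,
) {
    assert_eq!(
        arcs_before,
        &current_arcs()[..],
        "an agent's arc changed during the test"
    );
}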

#[cfg(feature = "tx5")]
#[ignore = "This test is flaky, possibly because what it is testing is flaky"]
#[tokio::test(flavor = "multi_thread")]
async fn publish_to_basis_from_outside() {
Contributor Author

This test is similar enough to the one above that it can be ignored (at least while this PR is in draft).

let store = store_lock.read();
assert!(
store.is_empty(),
"Agent {} should not have received any data but has {} ops",
Contributor Author

These are the assertions the test fails on. Which agent it fails at varies between runs.

Comment on lines +88 to +93
let arc = Arq::from_start_and_half_len_approximate(
dim,
&ArqStrat::standard(LocalStorageConfig::default(), 2.0),
agent.get_loc(),
len.as_() / 2 + 1,
);
Contributor

(cc @maackle) Do you know how the quantization will affect this? If it grows the arc a bit, that's probably not a huge problem, but you could potentially get a double overlap: with a node at the end of the range and with the node two after that at the beginning of the range.

Contributor Author

I agree; it can and does change size, and the function name makes it clear that that's expected.

That is what I was assuming was wrong with my testing over the last two days. I think that to make this test really solid, we'd need to account for that and make sure we pick an op that will really land in the overlap of two agents. However, I seem to be able to get ops on every run that fall into the arcs of the two agents I'm selecting, and I've not yet seen the test fail because we don't have enough overlap. What I'm seeing is agents who shouldn't be getting the op... getting the op. That's what the extra assertions I added were there to check for.

let sender_location = &agents[sender_idx].2 .0[32..];

let mut kitsune_basis = KitsuneBasis::new(vec![0; 36]);
kitsune_basis.0[32..].copy_from_slice(&sender_location);
Contributor

Someday we may add validation that the location is a correct hash of the hash... but I suppose we can address this somehow at that point (perhaps by having a test flag to disable that validation).

Comment on lines +81 to +86
// Distance to the end of the segment, plus the length of the next segment. Guaranteed to
// overlap with the next agent and not the one after that.
// Because of arc quantisation, the layout won't be perfect, but we can expect overlap at
// the start of the agent's arc, with the previous agent.
let len =
DhtLocation::new(base_len * (i + 1)) - agent.get_loc() + DhtLocation::new(base_len);
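For reference, a toy walk-through of the formula in the quoted snippet, using a made-up base_len of 100 instead of a fraction of the u32 location space:

// If agent 2 sits at location 230 inside its segment [200, 300), then
//   len = base_len * (2 + 1) - 230 + base_len = 300 - 230 + 100 = 170
// so its arc covers [230, 400): all of agent 3's segment [300, 400) and none
// of agent 4's segment [400, 500), which is the single intended overlap.
fn toy_len_example() {
    let base_len: u32 = 100;
    let (i, agent_loc): (u32, u32) = (2, 230);
    let len = base_len * (i + 1) - agent_loc + base_len;
    assert_eq!(len, 170);
}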
Contributor

It probably isn't the cause of your flakiness because of the extreme unlikelihood of randomly hitting it, but I'm a little worried about an off-by-one error, especially on the wrap around the beginning, if the section 5 node is at the extreme beginning of its range and the section zero node is at the extreme end.

Contributor Author

Totally agree, this isn't accurate enough. It might make sense to grow the arcs a bit more and then have the assertions dynamically check that the agents whose arcs actually hold the op are getting it, rather than hard-coding. But for now, I'm checking that agents whose arcs (I think?) don't include the test basis aren't getting the ops... but they are. Very confused :)
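A rough sketch of that dynamic variant (not the PR's code; how arc coverage is determined per agent is left abstract here): compute which agents' arcs cover the published basis, then assert that exactly those agents ended up with a non-empty op store.

// Sketch only: `holds_op[i]` is whether agent i's op store is non-empty after
// publishing, and `arc_covers[i]` is whether agent i's arc covers the basis
// (however the harness determines that). The expectation is that they match.
fn assert_holders_match_arcs(holds_op: &[bool], arc_covers: &[bool]) {
    for (i, (held, covered)) in holds_op.iter().zip(arc_covers).enumerate() {
        assert_eq!(
            held, covered,
            "Agent {i} op-store state does not match its arc coverage"
        );
    }
}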

@ThetaSinner ThetaSinner force-pushed the 3436-tests-for-sharded-gossip branch from 6548f7d to b283190 on May 10, 2024 08:56
@ThetaSinner ThetaSinner requested a review from maackle May 10, 2024 15:08
@ThetaSinner ThetaSinner force-pushed the 3436-tests-for-sharded-gossip branch from b283190 to 2f42dcd on May 20, 2024 14:00
@ThetaSinner (Contributor Author)

Closing in favor of a new PR where the tests work.
