Flaky tests for sharding (publish) #3845
Conversation
```diff
 impl ArqStrat {
     /// Standard arq strat
-    pub fn standard(local_storage: LocalStorageConfig) -> Self {
+    pub fn standard(local_storage: LocalStorageConfig, min_coverage: f64) -> Self {
```
It wasn't actually possible to configure this with the tuning param. I haven't ended up needing this change yet on this PR, but I do think the change needs making.
```diff
@@ -38,6 +38,7 @@ impl ShardedGossipLocal {

         // Choose a remote agent to gossip with.
         let remote_agent = self
+        // TODO want to set a test up so that we know who this should be
```
Haven't done this yet so I'm going to leave this TODO here for myself.
```diff
 self.agent_arqs.get(agent).cloned().unwrap_or_else(|| {
     let dim = SpaceDimension::standard();
     match self.config.tuning_params.arc_clamping() {
         Some(ArqClamping::Empty) => Arq::new_empty(dim, agent.get_loc()),
-        Some(ArqClamping::Full) | None => Arq::new_full_max(dim, &strat, agent.get_loc()),
+        Some(ArqClamping::Full) | None => {
+            let strat = self.config.tuning_params.to_arq_strat();
```
This didn't need to be computed on one code path, and wasn't needed at all if there was already an agent arc. Just a minor optimisation.
```rust
let mut pub_key_bytes = info.ed25519_pub_key.0.to_vec();
let loc = dht_location(&pub_key_bytes[0..32].try_into().unwrap());
pub_key_bytes.extend(&loc);
Arc::new(KitsuneAgent::new(pub_key_bytes))
```
This was an omission from the test host implementation: agents didn't have a location. It wasn't needed by the existing tests because they all use a full arc.
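The shape of that fix can be sketched as below. This is a stand-in only: `fold_location` here is a placeholder for Kitsune's real `dht_location` hash, which I'm not reproducing; the point is just the 32 + 4 byte key layout.

```rust
// Hedged sketch of the key layout fix: a 36-byte agent key is 32 public key
// bytes followed by 4 location bytes derived from them.
// `fold_location` is a PLACEHOLDER for the real `dht_location` hash.
fn fold_location(pub_key: &[u8; 32]) -> [u8; 4] {
    // XOR-fold the key down to 4 bytes (not the real algorithm).
    let mut loc = [0u8; 4];
    for (i, b) in pub_key.iter().enumerate() {
        loc[i % 4] ^= b;
    }
    loc
}

fn agent_key_with_location(pub_key: &[u8; 32]) -> Vec<u8> {
    let mut bytes = pub_key.to_vec();
    bytes.extend_from_slice(&fold_location(pub_key));
    bytes
}

fn main() {
    // The resulting key is always 36 bytes: 32 key bytes + 4 location bytes.
    let key = agent_key_with_location(&[7u8; 32]);
    assert_eq!(key.len(), 36);
}
```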
```rust
for i in 0..5 {
    if i == sender_idx || i == should_recv_idx {
        continue;
    }

    // We've filtered out the sender and the receiver, who are expected to have the data.
    // Now we check that the agent at the current index does not have the basis that the op was
    // published to in its arc. That would make the test wrong, not Kitsune, so fail here!
    let should_this_agent_hold_the_op =
        should_agent_hold_op_at_basis(&agents[i].0, agents[i].2.clone(), basis.clone());

    // If this assertion fails, it means that the agent at index `i` has the basis in its arc,
    // which is not intended by the test setup.
    assert!(
        !should_this_agent_hold_the_op,
        "Agent {i} should receive the data, this is a setup issue with the test"
    );

    // Now make the important assertion that the agent at index `i` did not receive the data!
    // If it's not in the agent's arc (which we just asserted above) then it should not have
    // been received.
    let store_lock = agents[i].0.op_store();
    let store = store_lock.read();
    assert!(
        store.is_empty(),
        "Agent {} should not have received any data but has {} ops",
        i,
        store.len()
    );
}
```
It's this bit I'd like a second pair of eyes on. I think that if the first check for 'basis in agent arc' passes, then, if Kitsune is correct, the op storage assertion shouldn't fail. However, it is flaky: sometimes the op basis isn't in the agent's arc but it gets the op anyway.
This could mean that the test host implementation has bugs. It could mean that I've not configured Kitsune right and arcs are changing. It could just be that I've written code in this loop that doesn't make sense.
Working out what is wrong is probably going to be difficult, so I'd like to at least check that my assumptions are reasonable!
My sense at the moment is that this could be pointing to a behavior in holochain that wasn't explicitly intended. However, we wrote a lot of code assuming we'd someday have garbage collection for things when our arcs shrink, for example, so haven't been too careful about getting ops when we shouldn't.
Perhaps this falls into a class of optimizations that we should wait until after launch to address?
Possibly. I've tried to disable arc resizing, so the arcs shouldn't be moving. I can add assertions to check that they aren't, and see if that stops the test before the 'who got what' assertions.
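The arc-stability assertion could be sketched generically like this. Names are hypothetical; in the real test the before/after snapshots would be read from the Kitsune host, whereas this stand-in just makes the pattern concrete.

```rust
// Generic sketch of the proposed "arcs aren't moving" check (hypothetical
// helper, not existing test code): snapshot each agent's arc as a
// (start, len) pair before publishing, then assert nothing changed after.
fn assert_arcs_unchanged(before: &[(u32, u64)], after: &[(u32, u64)]) {
    assert_eq!(before.len(), after.len(), "agent count changed");
    for (i, (b, a)) in before.iter().zip(after).enumerate() {
        assert_eq!(b, a, "agent {i}'s arc moved from {b:?} to {a:?}");
    }
}

fn main() {
    // Two agents whose arcs stayed put between the snapshots.
    let before = vec![(0u32, 100u64), (100, 100)];
    let after = before.clone();
    assert_arcs_unchanged(&before, &after);
}
```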
```rust
#[cfg(feature = "tx5")]
#[ignore = "This test is flaky, possibly because what it is testing is flaky"]
#[tokio::test(flavor = "multi_thread")]
async fn publish_to_basis_from_outside() {
```
This test is similar enough to the one above that it can be ignored (at least while this PR is in draft)
```rust
let store = store_lock.read();
assert!(
    store.is_empty(),
    "Agent {} should not have received any data but has {} ops",
```
These are the assertions the test fails on; which agent it fails at changes between runs.
```rust
let arc = Arq::from_start_and_half_len_approximate(
    dim,
    &ArqStrat::standard(LocalStorageConfig::default(), 2.0),
    agent.get_loc(),
    len.as_() / 2 + 1,
);
```
(cc @maackle) Do you know how the quantization will affect this? If it grows it a bit, that's probably not a huge problem, though you could potentially get a double overlap: with the node at the end of the range, and with the node two after that at the beginning of the range.
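To reason about that double-overlap case concretely, here's a minimal sketch with my own helpers (not the Kitsune API) of overlap between half-open arcs on the wrapping u32 location space: the example layout of six evenly spaced agents is hypothetical, and it shows how growing one arc past two full segments reaches the agent after next.

```rust
// Arcs as (start, len) half-open intervals on the wrapping u32 location
// space. These helpers are illustrative only, not the Kitsune implementation.
fn contains(start: u32, len: u64, loc: u32) -> bool {
    // Offset of `loc` from `start`, wrapping around the top of the space.
    (loc.wrapping_sub(start) as u64) < len
}

fn arcs_overlap(a: (u32, u64), b: (u32, u64)) -> bool {
    // Two wrapping intervals intersect iff either start lies in the other.
    a.1 > 0 && b.1 > 0 && (contains(a.0, a.1, b.0) || contains(b.0, b.1, a.0))
}

fn main() {
    // Hypothetical layout: six agents spaced one segment apart.
    let seg = (u32::MAX / 6) as u64;
    // An arc of 1.5 segments overlaps the next agent's arc...
    assert!(arcs_overlap((0, seg * 3 / 2), (seg as u32, seg)));
    // ...but an arc grown past 2 segments also reaches the agent after that:
    // the double overlap.
    assert!(arcs_overlap((0, seg * 2 + 1), ((u32::MAX / 6) * 2, seg)));
}
```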
I agree, it can and does change size, and the function name makes it clear that that's expected.
That is what I was assuming was wrong with my testing over the last two days. To make this test really solid, we'd need to account for that and make sure we pick an op that will really land in the overlap of two agents. However, I seem to be able to get ops on every run that fall into the arcs of the two agents I'm selecting, and I've not yet seen the test fail because we don't have enough overlap. What I'm seeing is agents who shouldn't be getting the op... getting the op. That's what the extra assertions I added were meant to check for.
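One way to harden the setup along those lines would be to derive the op location from the actual overlap rather than assuming it. A sketch with toy helpers over the wrapping u32 space (my own names, not the Kitsune API):

```rust
// Toy helpers over the wrapping u32 location space (not the Kitsune API):
// find a location that falls in the overlap of two arcs, so the test could
// pick an op basis guaranteed to be held by both selected agents.
fn contains(start: u32, len: u64, loc: u32) -> bool {
    (loc.wrapping_sub(start) as u64) < len
}

// Any non-empty intersection of two wrapping intervals begins at one of the
// two interval starts, so those are the only candidates we need to check.
fn loc_in_overlap(a: (u32, u64), b: (u32, u64)) -> Option<u32> {
    [a.0, b.0]
        .into_iter()
        .find(|&cand| contains(a.0, a.1, cand) && contains(b.0, b.1, cand))
}

fn main() {
    // Overlapping arcs: location 5 is held by both.
    assert_eq!(loc_in_overlap((0, 10), (5, 10)), Some(5));
    // Disjoint arcs: no shared location exists.
    assert_eq!(loc_in_overlap((0, 10), (20, 10)), None);
}
```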
```rust
let sender_location = &agents[sender_idx].2 .0[32..];

let mut kitsune_basis = KitsuneBasis::new(vec![0; 36]);
kitsune_basis.0[32..].copy_from_slice(&sender_location);
```
Someday we may add validation that the location is a correct hash of the hash... but I suppose we can address this somehow at that point (perhaps by having a test flag to disable that validation).
```rust
// Distance to the end of the segment, plus the length of the next segment. Guaranteed to
// overlap with the next agent and not the one after that.
// Because of arc quantisation, the layout won't be perfect, but we can expect overlap at
// the start of the agent's arc, with the previous agent.
let len =
    DhtLocation::new(base_len * (i + 1)) - agent.get_loc() + DhtLocation::new(base_len);
```
It probably isn't the cause of your flakiness, given the extreme unlikelihood of randomly hitting it, but I'm a little worried about an off-by-one error, especially in the wrap-around at the beginning of the space: if the section 5 node is at the extreme start of its range and the section zero node is at the extreme end.
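The boundary case can be made concrete: with a half-open arc, a location exactly `len` past the start is excluded, while an inclusive convention would include it, and a disagreement on that convention at a segment boundary where the space wraps is exactly this off-by-one scenario. Toy helper, not Kitsune code:

```rust
// Toy containment check over the wrapping u32 space (not the Kitsune code),
// to pin down the off-by-one worry: the half-open convention excludes the
// location exactly `len` past the start; an inclusive convention would not.
fn arc_holds(start: u32, len: u64, loc: u32) -> bool {
    (loc.wrapping_sub(start) as u64) < len // half-open: [start, start + len)
}

fn main() {
    // An arc that wraps past the top of the u32 space.
    let start = u32::MAX - 10;
    assert!(arc_holds(start, 20, 5)); // 16 past start: inside
    assert!(!arc_holds(start, 20, 9)); // exactly 20 past start: excluded
}
```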
Totally agree, this isn't accurate enough. It might make sense to grow the arcs a bit more and then have the assertions dynamically check that the agents whose arcs actually hold the op are getting it, rather than hard-coding. But for now, I'm checking that agents whose arcs (I think?) don't include the test basis aren't getting the ops... but they are. Very confused :)
Force-pushed from 6548f7d to b283190
Force-pushed from b283190 to 2f42dcd
Closing in favor of a new PR where the tests work
Summary
Putting this up for review to see if my test logic is reasonable. The tests are definitely flaky, but they fail at the final step, which I think means that:
Note that this isn't actually about gossip, as the branch name suggests; I started with publish, which is slightly simpler. These tests should extend nicely to gossip if we can get them working reliably.
TODO: