Low-level cluster linearization code #30126

sipa · 2024-05-16T20:29:15Z

Depends on ~~#30160~~ and ~~#30161~~. #28676 builds on this PR.

This introduces low-level optimized cluster linearization code, including tests and some benchmarks. It is currently not hooked up to anything.

Roughly the commits are organized as:

- Introduce some data types and fuzzing infrastructure
- Introduce unoptimized versions of (including tests)
  - Best Ancestor set finding
  - Search-based candidate set finding
  - Linearization algorithm
- Benchmarks
- Optimizations to search and linearizations, step by step.
- Add merging and postprocessing

Ultimately, what this PR adds is functions Linearize, PostLinearize, and MergeLinearizations which operate on instances of DepGraph (instances of which represent pre-processed transaction clusters) to produce and/or improve linearizations for that cluster.

Along the way two new data structures are introduced (util/bitset.h and util/ringbuffer.h), which could be useful more broadly. They have their own commits, which include tests.

To provide assurance, the code heavily relies on fuzz tests. A novel approach is used here, where the fuzz input is parsed using the serialization.h framework rather than FuzzedDataProvider, with a custom serializer/deserializer for DepGraph objects. By including serialization, it's possible to ascertain that the format can represent every relevant cluster, as well as potentially permitting the construction of ad-hoc fuzz inputs from clusters (not included in this PR, but used during development).

The Linearize(depgraph, iteration_limit, rng_seed, old_linearization) function is an implementation of the (single) LIMO algorithm, with the $S$ in every iteration found as the best out of (a) the best remaining ancestor set and (b) randomized computationally-bounded search. It incrementally builds up a linearization by finding good topologically-valid subsets to move to the front, in such a way that the resulting linearization has a diagram that is at least as good as the old_linearization passed in (if any).

Despite using both best ancestor set and search, this is not Double LIMO, as no intersections between these are involved; just the best of the two.
The iteration_limit and rng_seed only control the (b) randomized search. Even with 0 iterations, the result will be as good as the old linearization, and the included sets at every point will have a feerate at least as high as the best remaining ancestor set at that point.
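
As a rough illustration of the zero-iteration behaviour described above, the sketch below builds a linearization purely by repeatedly moving the highest-feerate remaining ancestor set to the front. It is a toy under stated assumptions: `ToyTx`, `TxSet`, `FeerateHigher` and `AncestorLinearize` are hypothetical names rather than the PR's `DepGraph` interface, and neither the randomized search nor the old_linearization handling is modelled.

```cpp
#include <algorithm>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy types for illustration; not the PR's DepGraph.
constexpr std::size_t MAX_TX = 64;
using TxSet = std::bitset<MAX_TX>;

struct ToyTx {
    int64_t fee;
    int64_t size;
    TxSet ancestors; // all ancestors, including the transaction itself
};

// True if fee_a/size_a > fee_b/size_b (sizes positive), compared without division.
bool FeerateHigher(int64_t fee_a, int64_t size_a, int64_t fee_b, int64_t size_b)
{
    return fee_a * size_b > fee_b * size_a;
}

std::vector<std::size_t> AncestorLinearize(const std::vector<ToyTx>& txs)
{
    assert(txs.size() <= MAX_TX);
    TxSet todo;
    for (std::size_t i = 0; i < txs.size(); ++i) todo.set(i);

    std::vector<std::size_t> linearization;
    while (todo.any()) {
        // Find the remaining transaction whose remaining ancestor set has the
        // highest feerate.
        TxSet best_set;
        int64_t best_fee = 0, best_size = 0;
        for (std::size_t i = 0; i < txs.size(); ++i) {
            if (!todo[i]) continue;
            const TxSet anc = txs[i].ancestors & todo;
            int64_t fee = 0, size = 0;
            for (std::size_t j = 0; j < txs.size(); ++j) {
                if (anc[j]) { fee += txs[j].fee; size += txs[j].size; }
            }
            if (best_set.none() || FeerateHigher(fee, size, best_fee, best_size)) {
                best_set = anc; best_fee = fee; best_size = size;
            }
        }
        // Emit the chosen set in a topologically valid order: an ancestor always
        // has strictly fewer ancestors than any of its descendants.
        std::vector<std::size_t> chosen;
        for (std::size_t i = 0; i < txs.size(); ++i) {
            if (best_set[i]) chosen.push_back(i);
        }
        std::stable_sort(chosen.begin(), chosen.end(), [&](std::size_t a, std::size_t b) {
            return txs[a].ancestors.count() < txs[b].ancestors.count();
        });
        linearization.insert(linearization.end(), chosen.begin(), chosen.end());
        todo &= ~best_set;
    }
    return linearization;
}
```

For a cluster where transaction 2 (fee 20) spends low-fee transaction 1 and transaction 0 is independent, the parent is pulled forward together with its child, ahead of the unrelated transaction.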

The search algorithm used in the (b) step above largely follows Section 2 of How to linearize your cluster, though with a few changes:

- Connected component analysis is performed inside the search algorithm (creating initial work items per component for each candidate), rather than once at a higher level. This duplicates some work but is significantly simpler in implementation.
- No ancestor-set based presplitting inside the search is performed; instead, the best value is initialized with the best topologically valid set known to the LIMO algorithm before search starts: the better one out of the highest-feerate remaining ancestor set, and the highest-feerate prefix of remaining transactions in old_linearization.
- Work items are represented using an included set inc and an undefined set und, rather than included and excluded.
- Potential sets pot are not computed for work items with empty inc.

At a high level, the only missing optimization from that post is bottleneck analysis; my thinking is that it only really helps with clusters that are already relatively cheap to linearize (doing so would need to be done at a higher level, not inside the search algorithm).
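
The inc/und work-item representation mentioned above can be sketched as follows. This is a deliberately unoptimized toy, not the PR's SearchCandidateFinder: `ToyGraph`, `WorkItem` and `SearchBestSet` are hypothetical names, the split transaction is simply the first undecided one, and there is no randomization, no potential-set bound, and no per-component splitting.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

constexpr std::size_t MAX_TX = 64;
using TxSet = std::bitset<MAX_TX>;

// Hypothetical toy cluster; anc[i] and desc[i] both include i itself.
struct ToyGraph {
    std::vector<int64_t> fee;
    std::vector<int64_t> size;
    std::vector<TxSet> anc;
    std::vector<TxSet> desc;
};

struct WorkItem {
    TxSet inc; // transactions definitely included (always ancestor-closed)
    TxSet und; // transactions not yet decided (never empty for queued items)
};

// Returns the highest-feerate topologically valid set found within the
// iteration limit (empty if no iteration was performed).
TxSet SearchBestSet(const ToyGraph& g, unsigned iteration_limit)
{
    const std::size_t n = g.fee.size();
    assert(n <= MAX_TX);
    TxSet all;
    for (std::size_t i = 0; i < n; ++i) all.set(i);

    TxSet best;
    int64_t best_fee = 0, best_size = 0;
    std::deque<WorkItem> queue;
    if (all.any()) queue.push_back(WorkItem{TxSet{}, all});

    while (!queue.empty() && iteration_limit > 0) {
        --iteration_limit;
        const WorkItem item = queue.front();
        queue.pop_front();

        // Split on the first undecided transaction.
        std::size_t t = 0;
        while (!item.und[t]) ++t;

        // Branch 1: include t together with all of its ancestors.
        const TxSet inc = item.inc | g.anc[t];
        int64_t fee = 0, size = 0;
        for (std::size_t j = 0; j < n; ++j) {
            if (inc[j]) { fee += g.fee[j]; size += g.size[j]; }
        }
        // Compare feerates by cross-multiplication to avoid division.
        if (best.none() || fee * best_size > best_fee * size) {
            best = inc; best_fee = fee; best_size = size;
        }
        const TxSet und_in = item.und & ~inc;
        if (und_in.any()) queue.push_back(WorkItem{inc, und_in});

        // Branch 2: exclude t, which rules out all of its descendants too.
        const TxSet und_out = item.und & ~g.desc[t];
        if (und_out.any()) queue.push_back(WorkItem{item.inc, und_out});
    }
    return best;
}
```

Because excluding a transaction also removes its descendants from und, every transaction left in und has all of its ancestors in inc or und, so inc stays topologically valid without tracking an explicit excluded set.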

The PostLinearize(depgraph, linearization) function performs an in-place improvement of linearization, using two iterations of the Linearization post-processing algorithm: the first running from back to front, the second from front to back.

The MergeLinearizations(depgraph, linearization1, linearization2) function computes a new linearization for the provided cluster, given two existing linearizations for that cluster, which is at least as good as both inputs. The algorithm is described at a high level in merging incomparable linearizations.

DrahtBot · 2024-05-16T20:29:17Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

No conflicts as of last run.

DrahtBot · 2024-05-16T22:28:59Z

🚧 At least one of the CI tasks failed. Make sure to run all tests locally, according to the
documentation.

Possibly this is due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.

Leave a comment here, if you need help tracking down a confusing failure.

Debug: https://github.com/bitcoin/bitcoin/runs/25072594213

sipa · 2024-05-20T18:53:52Z

Benchmarks on my Ryzen 5950X system:

| ns/op | op/s | err% | total | benchmark |
|---:|---:|---:|---:|:---|
| 2,373.94 | 421,240.11 | 0.1% | 1.10 | `LinearizeNoIters16TxWorstCase` |
| 7,530.22 | 132,798.26 | 0.0% | 1.07 | `LinearizeNoIters32TxWorstCase` |
| 16,585.34 | 60,294.20 | 0.1% | 1.10 | `LinearizeNoIters48TxWorstCase` |
| 28,591.70 | 34,975.18 | 0.1% | 1.10 | `LinearizeNoIters64TxWorstCase` |
| 53,918.56 | 18,546.49 | 0.0% | 1.10 | `LinearizeNoIters75TxWorstCase` |
| 93,589.21 | 10,684.99 | 0.1% | 1.10 | `LinearizeNoIters99TxWorstCase` |

| ns/iters | iters/s | err% | total | benchmark |
|---:|---:|---:|---:|:---|
| 45.36 | 22,045,550.98 | 0.5% | 1.10 | `LinearizePerIter16TxWorstCase` |
| 35.57 | 28,111,376.58 | 0.1% | 1.10 | `LinearizePerIter32TxWorstCase` |
| 33.04 | 30,262,951.89 | 0.0% | 1.10 | `LinearizePerIter48TxWorstCase` |
| 33.21 | 30,107,745.17 | 0.1% | 1.10 | `LinearizePerIter64TxWorstCase` |
| 75.98 | 13,161,530.63 | 0.4% | 1.07 | `LinearizePerIter75TxWorstCase` |
| 76.62 | 13,051,066.77 | 0.5% | 1.08 | `LinearizePerIter99TxWorstCase` |

| ns/op | op/s | err% | total | benchmark |
|---:|---:|---:|---:|:---|
| 332.97 | 3,003,274.74 | 0.0% | 1.10 | `PostLinearize16TxWorstCase` |
| 1,121.92 | 891,330.77 | 0.0% | 1.10 | `PostLinearize32TxWorstCase` |
| 3,358.33 | 297,767.01 | 0.3% | 1.13 | `PostLinearize48TxWorstCase` |
| 5,826.72 | 171,623.05 | 0.5% | 1.11 | `PostLinearize64TxWorstCase` |
| 7,453.31 | 134,168.55 | 0.1% | 1.07 | `PostLinearize75TxWorstCase` |
| 12,476.44 | 80,151.09 | 0.1% | 1.10 | `PostLinearize99TxWorstCase` |

This means that for a 64-transaction cluster, it should be possible to linearize (28.59 µs) with 100 candidate search iterations (3.32 µs) plus postlinearize (5.83 µs), within a total of 37.74 µs, on my system.
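
The estimate above is the sum of a fixed no-iterations cost, a per-iteration search cost, and the post-processing cost. A minimal sketch of that arithmetic, with the hypothetical helper name `EstimateLinearizeMicros` and the 64-transaction worst-case figures from the benchmarks plugged in by the caller:

```cpp
#include <cassert>

// Hypothetical helper: combines benchmark figures (in nanoseconds) into an
// estimated total in microseconds.
double EstimateLinearizeMicros(double no_iters_ns, double per_iter_ns,
                               unsigned iterations, double postlinearize_ns)
{
    assert(no_iters_ns >= 0 && per_iter_ns >= 0 && postlinearize_ns >= 0);
    const double total_ns = no_iters_ns + per_iter_ns * iterations + postlinearize_ns;
    return total_ns / 1000.0;
}
```

Calling `EstimateLinearizeMicros(28591.70, 33.21, 100, 5826.72)` gives roughly 37.74 µs; the same helper can be used to see how the budget scales with a different iteration count, assuming the per-iteration cost stays constant.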

sipa · 2024-05-23T15:18:30Z

I've dropped the dependency on #29625, and switched to using FastRandomContext instead; there is a measurable slowdown from using the (ChaCha20-based) FastRandomContext over the (xoroshiro128++-based) InsecureRandomContext introduced there, but it's no more than 1-2%. I can switch back to that approach if #29625 were to make it in.

DrahtBot · 2024-05-23T18:33:11Z

Guix builds (on x86_64) [untrusted test-only build, possibly unsafe, not for production use]

| File | commit `83ae1ba` (master) | commit `e5cbc23` (master and this pull) |
|:---|:---|:---|
| SHA256SUMS.part | `24fd016e03e8c7da...` | `15fae3483445e33b...` |
| *-aarch64-linux-gnu-debug.tar.gz | `94942cf7dedf3604...` | `23eeccf77ee5799d...` |
| *-aarch64-linux-gnu.tar.gz | `4b30ca93b6788f48...` | `ed8e5024d960f53e...` |
| *-arm-linux-gnueabihf-debug.tar.gz | `a0f57c45e5f02bb1...` | `f22f89c1eba49dda...` |
| *-arm-linux-gnueabihf.tar.gz | `9f0376baaf54b988...` | `17da8a968635c492...` |
| *-arm64-apple-darwin-unsigned.tar.gz | `9b952b32db70d099...` | `16d805ab4bcf8d54...` |
| *-arm64-apple-darwin-unsigned.zip | `d49361bbbc5529fc...` | `e225d79a24b058a5...` |
| *-arm64-apple-darwin.tar.gz | `34e9cf4b79cbc190...` | `29b28e6d57761201...` |
| *-powerpc64-linux-gnu-debug.tar.gz | `5f322a7b213e244e...` | `cb5f37b036b5c52c...` |
| *-powerpc64-linux-gnu.tar.gz | `bb57b46482c5b1e6...` | `57adf954458a27d5...` |
| *-riscv64-linux-gnu-debug.tar.gz | `d1a3a405c5b45fff...` | `237eb467f8547d22...` |
| *-riscv64-linux-gnu.tar.gz | `68d7e6671e2dba30...` | `29d9f1e9052e96d3...` |
| *-x86_64-apple-darwin-unsigned.tar.gz | `6fb22000e8c14c40...` | `67e5bd5b86483c8a...` |
| *-x86_64-apple-darwin-unsigned.zip | `1c5f2a216e87cbf5...` | `abdbca97fafc146f...` |
| *-x86_64-apple-darwin.tar.gz | `66f17a574163ecaf...` | `f002830b4b8da330...` |
| *-x86_64-linux-gnu-debug.tar.gz | `a5044f956a824228...` | `0791685e39e80672...` |
| *-x86_64-linux-gnu.tar.gz | `23af1dc6cb921b37...` | `ff9625165c3f19c2...` |
| *.tar.gz | `caac4a182deb1e04...` | `ba8abeef4165dafb...` |
| guix_build.log | `c7cc0190f7085f04...` | `100da60c2f0e6686...` |
| guix_build.log.diff | | `7c460aa3b1aafc32...` |

DrahtBot · 2024-06-11T05:11:19Z

🚧 At least one of the CI tasks failed. Make sure to run all tests locally, according to the
documentation.

Possibly this is due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.

Leave a comment here, if you need help tracking down a confusing failure.

Debug: https://github.com/bitcoin/bitcoin/runs/26052313359

47f705b tests: add fuzz tests for BitSet (Pieter Wuille)
59a6df6 util: add BitSet (Pieter Wuille)

Pull request description:

Extracted from #30126.

This introduces the `BitSet` data structure, inspired by `std::bitset`, but with a few features that cannot be implemented on top without efficiency loss:
* Finding the first set bit (`First`)
* Finding the last set bit (`Last`)
* Iterating over all set bits (`begin` and `end`).

And a few other operators/member functions that help readability for #30126:
* `operator-` for set subtraction
* `Overlaps()` for testing whether intersection is non-empty
* `IsSupersetOf()` for testing (non-strict) supersetness
* `IsSubsetOf()` for testing (non-strict) subsetness
* `Fill()` to construct a set with all numbers from 0 to n-1, inclusive
* `Singleton()` to construct a set with one specific element.

Everything is tested through a simulation-based fuzz test that compares the behavior with normal `std::bitset` equivalent operations.

ACKs for top commit:
instagibbs: ACK 47f705b
achow101: ACK 47f705b
cbergqvist: re-ACK 47f705b
theStack: Code-review ACK 47f705b

Tree-SHA512: e451bf4b801f193239ee434b6b614f5a2ac7bb49c70af5aba24c2ac0c54acbef4672556800e4ac799ae835632bdba716209c5ca8c37433a6883dab4eb7cd67c1
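
To illustrate why the BitSet operations listed above are cheap on a word-based representation, here is a minimal single-word sketch. It is an assumption-laden toy, not the code in util/bitset.h: `ToyBitSet64`, `CountrZero` and `Elements` are hypothetical names, only the member names taken from the description (`First`, `Last`, `Overlaps`, `IsSubsetOf`, `operator-`) mirror it, and their real signatures may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical single-word set over {0, ..., 63}; not the PR's util/bitset.h.
class ToyBitSet64
{
    uint64_t m_val = 0;

    // Portable count of trailing zero bits (v must be non-zero). Real code would
    // typically map this to a single instruction, e.g. via C++20 std::countr_zero.
    static unsigned CountrZero(uint64_t v)
    {
        unsigned n = 0;
        while ((v & 1) == 0) { v >>= 1; ++n; }
        return n;
    }

public:
    void Set(unsigned pos) { assert(pos < 64); m_val |= uint64_t{1} << pos; }
    bool None() const { return m_val == 0; }

    // Lowest and highest element; both require a non-empty set.
    unsigned First() const { assert(m_val != 0); return CountrZero(m_val); }
    unsigned Last() const
    {
        assert(m_val != 0);
        unsigned pos = 63;
        while (((m_val >> pos) & 1) == 0) --pos;
        return pos;
    }

    bool Overlaps(const ToyBitSet64& o) const { return (m_val & o.m_val) != 0; }
    bool IsSubsetOf(const ToyBitSet64& o) const { return (m_val & ~o.m_val) == 0; }

    // Set subtraction: elements of *this that are not in o.
    ToyBitSet64 operator-(const ToyBitSet64& o) const
    {
        ToyBitSet64 ret;
        ret.m_val = m_val & ~o.m_val;
        return ret;
    }

    // Visit set bits in increasing order; v &= v - 1 clears the lowest set bit.
    std::vector<unsigned> Elements() const
    {
        std::vector<unsigned> ret;
        for (uint64_t v = m_val; v != 0; v &= v - 1) ret.push_back(CountrZero(v));
        return ret;
    }
};
```

With `std::bitset`, finding the first or last set bit generally means testing positions one at a time through the public interface, which is the efficiency loss the description refers to.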

…ypes This primarily adds the DepGraph class, which encapsulates precomputed ancestor/descendant information for a given transaction cluster, with a number of utility features (inspectors for set feerates, computing reduced parents/children, adding transactions, adding dependencies), which will become needed in future commits.

This introduces a bespoke fuzzing-focused serialization format for DepGraphs, tests that this format can represent any graph and roundtrips, and then uses it to test the correctness of DepGraph itself. This forms the basis for future fuzz tests that need to work with interesting graphs.

This is a class that encapsulates precomputed ancestor set feerates, and presents an interface for getting the best remaining ancestor set.

Similar to AncestorCandidateFinder, this encapsulates the state needed for finding good candidate sets using a search algorithm.

This adds a first version of the overall linearization interface, which given a DepGraph constructs a good linearization, by incrementally including good candidate sets (found using AncestorCandidateFinder and SearchCandidateFinder).

Add benchmarks for known bad graphs for the purpose of search (as an upper bound on work per search iteration) and ancestor sorting (as an upper bound on linearization work with no search iterations).

Add utility functions to DepGraph for finding connected components.

Before this commit, the worst case for linearization involves clusters which break apart into several smaller components after the first candidate is included in the output linearization. Address this by never considering work items that span multiple components of what remains of the cluster.
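
A minimal sketch of finding one connected component among the transactions that remain, assuming each transaction's full set of ancestors and descendants is already available. `FindComponent` and its `related` input are hypothetical names for illustration and do not reflect the actual DepGraph utility functions.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t MAX_TX = 64;
using TxSet = std::bitset<MAX_TX>;

// related[i] holds every ancestor and descendant of transaction i, including i
// itself (hypothetical precomputed input). Returns the connected component of
// `start` within the subgraph induced by `todo`.
TxSet FindComponent(const std::vector<TxSet>& related, const TxSet& todo, std::size_t start)
{
    assert(start < related.size() && todo[start]);
    TxSet component;
    TxSet frontier;
    frontier.set(start);
    while (frontier.any()) {
        component |= frontier;
        // Gather everything related to the newly added transactions.
        TxSet next;
        for (std::size_t i = 0; i < related.size(); ++i) {
            if (frontier[i]) next |= related[i];
        }
        // Keep only remaining transactions that were not reached before.
        frontier = next & todo & ~component;
    }
    return component;
}
```

For a parent with two otherwise unrelated children, all three form one component while the parent remains; once the parent has been linearized and removed from `todo`, each child is its own component, which is the situation the commit above addresses.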

Switch to BFS exploration of the search tree in SearchCandidateFinder instead of DFS exploration. This appears to behave better for real world clusters. As BFS has the downside of needing far larger search queues, switch back to DFS temporarily when the queue grows too large.

To make search non-deterministic, change the BFS logic from always picking the first queue item to randomly picking the first or second queue item.

This implements the LIMO algorithm for linearizing by improving an existing linearization. See https://delvingbitcoin.org/t/limo-combining-the-best-parts-of-linearization-search-and-merging for details.

This is a requirement for a future commit, which will rely on quickly iterating over transaction sets in decreasing individual feerate order.

…ion) In each work item, keep track of a conservative overestimate of the best possible feerate that can be reached from it, and then use these to avoid exploring hopeless work items.

Keep track of which transactions in the graph have an individual feerate that is better than the best included set so far. Others do not need to be added to the pot set, as they cannot possibly help beating best.

Automatically add topologically-valid subsets of the potential set pot to inc. It can be proven that these must be part of the best reachable topologically-valid set from that work item. This is a crucial optimization that (apparently) reduces the maximum number of iterations from ~2^(N-1) to ~sqrt(2^N).

Empirically, this approach seems to be more efficient in common real-life clusters, and does not change the worst case.

…ion) Cache the potential set inside work items, and use it to skip part of the computation of split-off work items from it.

DrahtBot added the CI failed label May 16, 2024

sipa force-pushed the 202405_clusterlin branch 5 times, most recently from 079f02d to 07f68f9 Compare May 17, 2024 13:38

DrahtBot removed the CI failed label May 17, 2024

laanwj added the Mempool label May 17, 2024

sipa force-pushed the 202405_clusterlin branch 2 times, most recently from 079f02d to b4bb178 Compare May 20, 2024 01:35

DrahtBot added the CI failed label May 20, 2024

DrahtBot mentioned this pull request May 20, 2024

scripted-diff: Use LogInfo/LogDebug over LogPrintf/LogPrint #29641

Draft

DrahtBot removed the CI failed label May 20, 2024

sipa force-pushed the 202405_clusterlin branch from b4bb178 to 88fe1e3 Compare May 20, 2024 18:49

sipa force-pushed the 202405_clusterlin branch from 88fe1e3 to c9558d5 Compare May 20, 2024 21:33

sipa added the DrahtBot Guix build requested label May 21, 2024

glozow mentioned this pull request May 23, 2024

Package Relay Project Tracking #27463

Open

57 tasks

theuni reviewed May 23, 2024

src/util/bitset.h (outdated)

sipa force-pushed the 202405_clusterlin branch from c9558d5 to 19fb843 Compare May 23, 2024 15:15

sipa force-pushed the 202405_clusterlin branch from 19fb843 to c05a487 Compare May 23, 2024 15:28

sipa mentioned this pull request May 23, 2024

util: add BitSet #30160

Merged

sipa removed the DrahtBot Guix build requested label May 23, 2024

sipa mentioned this pull request May 23, 2024

util: add VecDeque #30161

Merged

sipa force-pushed the 202405_clusterlin branch from c05a487 to 03bc4a5 Compare May 23, 2024 18:23

DrahtBot mentioned this pull request May 24, 2024

Several randomness improvements #29625

Open

DrahtBot added the CI failed label Jun 11, 2024

sipa force-pushed the 202405_clusterlin branch from a7234f9 to d164c62 Compare June 11, 2024 11:23

DrahtBot removed the CI failed label Jun 11, 2024

sipa force-pushed the 202405_clusterlin branch from d164c62 to 2ffb661 Compare June 11, 2024 19:51

DrahtBot added the Needs rebase label Jun 11, 2024

sipa force-pushed the 202405_clusterlin branch from 2ffb661 to a6a60fd Compare June 11, 2024 23:14

DrahtBot removed the Needs rebase label Jun 12, 2024

sipa force-pushed the 202405_clusterlin branch from a6a60fd to de90ac6 Compare June 12, 2024 21:14

sipa added 19 commits June 12, 2024 17:41

clusterlin: add AncestorCandidateFinder class

061cf87

This is a class that encapsulates precomputed ancestor set feerates, and presents an interface for getting the best remaining ancestor set.

clusterlin: add SearchCandidateFinder class

48775c0

Similar to AncestorCandidateFinder, this encapsulates the state needed for finding good candidate sets using a search algorithm.

clusterlin: add Linearize function

0634edc

This adds a first version of the overall linearization interface, which given a DepGraph constructs a good linearization, by incrementally including good candidate sets (found using AncestorCandidateFinder and SearchCandidateFinder).

bench: Candidate finding and linearization benchmarks

e9a9c8a

Add benchmarks for known bad graphs for the purpose of search (as an upper bound on work per search iteration) and ancestor sorting (as an upper bound on linearization work with no search iterations).

clusterlin: add algorithms for connectedness/connected components

8964fcd

Add utility functions to DepGraph for finding connected components.

clusterlin: randomize the SearchCandidateFinder search order

7c1f1e9

To make search non-deterministic, change the BFS logic from always picking the first queue item to randomly picking the first or second queue item.

clusterlin: permit passing in existing linearization to Linearize

4ea7126

This implements the LIMO algorithm for linearizing by improving an existing linearization. See https://delvingbitcoin.org/t/limo-combining-the-best-parts-of-linearization-search-and-merging for details.

clusterlin: use feerate-sorted depgraph in SearchCandidateFinder

36754f2

This is a requirement for a future commit, which will rely on quickly iterating over transaction sets in decreasing individual feerate order.

clusterlin: track upper bound potential set for work items (optimizat…

4aba88e

…ion) In each work item, keep track of a conservative overestimate of the best possible feerate that can be reached from it, and then use these to avoid exploring hopeless work items.

clusterlin: reduce computation of unnecessary pot sets (optimization)

7991102

Keep track of which transactions in the graph have an individual feerate that is better than the best included set so far. Others do not need to be added to the pot set, as they cannot possibly help beating best.

clusterlin: improve heuristic to decide split transaction (optimization)

33de72e

Empirically, this approach seems to be more efficient in common real-life clusters, and does not change the worst case.

clusterlin: avoid recomputing potential set on every split (optimizat…

5433025

…ion) Cache the potential set inside work items, and use it to skip part of the computation of split-off work items from it.

clusterlin: add PostLinearize + benchmarks + fuzz tests

2533fef

clusterlin: add MergeLinearizations function + fuzz test + benchmark

e0f82aa

sipa force-pushed the 202405_clusterlin branch from de90ac6 to e0f82aa Compare June 12, 2024 21:47
