follow-up on #13150: improve the closure-computation algorithm again #13156

gasche · 2024-05-09T09:40:38Z

The Flambda backend contains a transitive-closure computation algorithm that misbehaves on certain Menhir-generated programs. #13150 introduced an improved algorithm that fixes the case known to misbehave, but is asymptotically disappointing in other cases -- see the discussions there for the details. The cases where #13150 performs worse than it should have not been found in the wild yet, but were exhibited by the simple relations in the benchmark script proposed by @fweimer-rh. I was nerd-sniped and decided to tweak the algorithm to perform better.

The PR improves the asymptotic behavior of the closure-computation by ensuring that closure-computation work is shared between nodes instead of being duplicated.

Consider for example the relation

a -> b
b -> c
c -> a

The previous code would first compute

a -> [b, c, a]

by traversing the graph, then

b -> [c, a, b]

by traversing the graph again, then

c -> [a, b, c]

again. The new code traverses the graph until it finds a cyclic node (starting from [a] this would be [a]), then compute

a -> [a, b, c]

using the same algorithm as before, and then immediately computes the result for c and b from the result of a, without any additional traversal.

Performance results using the benchmark script of Florian Weimer ( https://gitlab.com/gasche-snippets/transitive-closure ).

Implementations:

old is the code in 5.2
new is flambda: Improve transitive closure in invariant_params_in_recursion #13150
again is the algorithm in this commit

Tests:

isolated is a diagonal relation (1 -> 1, 2 -> 2, 3 -> 3, ...)
cycle is (1 -> 2, 2 -> 3, 3 -> 4, ..., n -> 1)
complete is (1 -> [1, 2, ..., n], 2 -> [1, 2, ..., n], ...)

isolated	Mean [ms]	Relative
`old`	162.1 ± 6.3	1.66 ± 0.08
`new`	97.7 ± 2.9	1.00
`again`	203.1 ± 4.1	2.08 ± 0.07

cycle	Mean [s]	Relative
`old`	1.952 ± 0.054	596.91 ± 154.38
`new`	0.034 ± 0.002	10.29 ± 2.71
`again`	0.003 ± 0.001	1.00

complete	Mean [ms]	Relative
`old`	59.3 ± 5.3	8.46 ± 1.38
`new`	218.5 ± 15.8	31.15 ± 4.83
`again`	7.0 ± 1.0	1.00

The new code has a larger constant factor than the #13150 algorithm for isolated graphs, which shows as a 2x slowdown. This makes it slightly slower than the "old" code in 5.2 in this trivial case.

It performs much better than both trunk and 5.2 in the other cases.

This change improves the asymptotic behavior of the closure-computation by ensuring that closure-computation work is shared between nodes instead of being duplicated. Consider for example the relation a -> b b -> c c -> a The previous code would first compute a -> [b, c, a] by traversing the graph, then b -> [c, a, b] by traversing the graph again, then c -> [a, b, c] again. The new code traverses the graph until it finds a cyclic node (starting from [a] this would be [a]), then compute a -> [a, b, c] using the same algorithm as before, and then immediately computes the result for c and c from the result of a, without any additional traversal. Performance results using the benchmark script of Florian Weimer ( https://gitlab.com/gasche-snippets/transitive-closure ). Implementations: - `old` is the code in 5.2 - `new` is ocaml#13150 - `again` is the algorithm in this commit Tests: - `isolated` is a diagonal relation (1 -> 1, 2 -> 2, 3 -> 3, ...) - `cycle` is (1 -> 2, 2 -> 3, 3 -> 4, ..., n -> 1) - `complete` is (1 -> [1, 2, ..., n], 2 -> [1, 2, ..., n], ...) | isolated | Mean [ms] | Relative | |:---------|------------:|------------:| | `old` | 162.1 ± 6.3 | 1.66 ± 0.08 | | `new` | 97.7 ± 2.9 | 1.00 | | `again` | 203.1 ± 4.1 | 2.08 ± 0.07 | | cycle | Mean [s] | Relative | |:--------|--------------:|----------------:| | `old` | 1.952 ± 0.054 | 596.91 ± 154.38 | | `new` | 0.034 ± 0.002 | 10.29 ± 2.71 | | `again` | 0.003 ± 0.001 | 1.00 | | complete | Mean [ms] | Relative | |:---------|-------------:|-------------:| | `old` | 59.3 ± 5.3 | 8.46 ± 1.38 | | `new` | 218.5 ± 15.8 | 31.15 ± 4.83 | | `again` | 7.0 ± 1.0 | 1.00 | The new code has a larger constant factor than the ocaml#13150 algorithm for isolated graphs, which shows as a 2x slowdown. This makes it slightly slower than the "old" code in 5.2 in this trivial case. It performs much better than both trunk and 5.2 in the other cases.

gasche · 2024-05-09T09:49:11Z

One may reasonably wonder whether it is worth having an algorithm that is faster in the "hard" cases, if it has worse constant factors in the "easy" case (2x slowdown). Let me explain why I think it is worth it.

The instances in human-written code are both "easy" and very small, so the cost of this closure computation is neglectible compared to the rest of the compilation, and the constant factor does not matter.
The issue with this function is thus not its computation time in the easy case, but the lack of robustness that arise from bad asymptotic behavior in harder cases. Performing asymptotically well is more important than being fast.
The current algorithm in trunk is inefficient in situations that very plausibly occur in generated code: think of N mutually-recursive functions that all share an invariant first parameter: let rec f env ... = ... and g env ... = ... and .... In this case the current algorithm will perform N times the work of the algorithm in this PR.

gasche · 2024-05-15T10:01:16Z

@fweimer-rh I wonder if you would be willing to give a look at the code and give your impression on the complexity/performance tradeoff.

fweimer-rh · 2024-05-18T22:35:05Z

@fweimer-rh I wonder if you would be willing to give a look at the code and give your impression on the complexity/performance tradeoff.

@gasche I extracted the previously problematic relation graph from the Coccinelle/Menhir build, and your implementation is indeed much faster than what's in trunk today (one second versus 5.7 seconds). I'll try an array-based implementation and will report if it is faster.

lthls · 2024-05-19T10:14:17Z

I had some discussions with @chambart earlier this week about this issue (the original #13150, not this new algorithm), and we suspect that for menhir-generated parsers we should simply not compute the transitive closure at all (we expect that very few parameters will be invariant or unused, and for the invariant parameters it's unlikely we will have any opportunity to actually make use of that).
However, we didn't find a good way to detect this case so we didn't make any concrete proposals.
Some solutions we thought about:

a dedicated flag passed to the compiler to warn about big recursive nests (impractical because dune doesn't make it easy to pass per-file flags, although since it knows about menhir it might be able to do it automatically for generated parsers)
a cut-off in the compiler that only computes transitive closures for "small" nests (say, less than 20 functions)

I don't have any code for these suggestions at the moment, but if someone is interested in testing them I could put it on my todo list.

rwmjones · 2024-05-19T10:42:45Z

A note that per-file flags are a big pain for downstream packagers. You have to remember that Dune is not obvious for people who don't work on OCaml daily. I use OCaml a lot but I have no idea how to add a per-file flag via Dune. Someone coming to an OCaml project as an outside downstream packager just thinks Dune is weird. Not to mention there are many other build systems used by OCaml projects, and a lot of half-baked hand-written Makefiles, which makes everything harder.

lthls · 2024-05-19T11:51:31Z

The per-file flags could be made to work independently of the build system with the introduction of unit headers (ocaml/RFCs#26). For example menhir would add at the start of the generated file an attribute disabling this optimisation, and the compiler would read this attribute and change its settings accordingly.
Otherwise I agree that it's not realistic (as far as I know it isn't even possible in dune, you would have to move the files to independent libraries).

gasche · 2024-05-19T12:32:56Z

I hadn't thought of this argument for unit headers: "easier than figuring out how to pass per-file flags with Dune. Genius!

@lthls hopefully this discussion motivated you to review this PR :-)

(The implementation uses persistent maps and sets just like the original code -- I thought that it would offer a fairer comparison. I am happy to start using Hashtbl as suggested by @fweimer-rh to see if it improves constant factors, now that we know that there exists a real program that is affected by those constant factors.)

gasche · 2024-05-19T12:50:44Z

(I tried using hashtables for the state map and visited set, and the performance gains are very small on the synthetic benchmarks -- 30% faster for "isolated" relations, and the same speed for all others. It is not worth the small extra complexity of using different data structures to represent the same data.)

gasche mentioned this pull request May 9, 2024

flambda: Improve transitive closure in invariant_params_in_recursion #13150

Merged

gasche added the enhancement label May 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

follow-up on #13150: improve the closure-computation algorithm again #13156

follow-up on #13150: improve the closure-computation algorithm again #13156

gasche commented May 9, 2024

gasche commented May 9, 2024

gasche commented May 15, 2024

fweimer-rh commented May 18, 2024

lthls commented May 19, 2024

rwmjones commented May 19, 2024 •

edited

lthls commented May 19, 2024

gasche commented May 19, 2024

gasche commented May 19, 2024

follow-up on #13150: improve the closure-computation algorithm again #13156

Are you sure you want to change the base?

follow-up on #13150: improve the closure-computation algorithm again #13156

Conversation

gasche commented May 9, 2024

gasche commented May 9, 2024

gasche commented May 15, 2024

fweimer-rh commented May 18, 2024

lthls commented May 19, 2024

rwmjones commented May 19, 2024 • edited

lthls commented May 19, 2024

gasche commented May 19, 2024

gasche commented May 19, 2024

rwmjones commented May 19, 2024 •

edited