Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

follow-up on #13150: improve the closure-computation algorithm again #13156

Open
wants to merge 1 commit into
base: trunk
Choose a base branch
from

Conversation

gasche
Copy link
Member

@gasche gasche commented May 9, 2024

The Flambda backend contains a transitive-closure computation algorithm that misbehaves on certain Menhir-generated programs. #13150 introduced an improved algorithm that fixes the case known to misbehave, but is asymptotically disappointing in other cases -- see the discussions there for the details. The cases where #13150 performs worse than it should have not been found in the wild yet, but were exhibited by the simple relations in the benchmark script proposed by @fweimer-rh. I was nerd-sniped and decided to tweak the algorithm to perform better.

The PR improves the asymptotic behavior of the closure-computation by ensuring that closure-computation work is shared between nodes instead of being duplicated.

Consider for example the relation

a -> b
b -> c
c -> a

The previous code would first compute

a -> [b, c, a]

by traversing the graph, then

b -> [c, a, b]

by traversing the graph again, then

c -> [a, b, c]

again. The new code traverses the graph until it finds a cyclic node (starting from [a] this would be [a]), then compute

a -> [a, b, c]

using the same algorithm as before, and then immediately computes the result for c and b from the result of a, without any additional traversal.

Performance results using the benchmark script of Florian Weimer ( https://gitlab.com/gasche-snippets/transitive-closure ).

Implementations:

Tests:

  • isolated is a diagonal relation (1 -> 1, 2 -> 2, 3 -> 3, ...)
  • cycle is (1 -> 2, 2 -> 3, 3 -> 4, ..., n -> 1)
  • complete is (1 -> [1, 2, ..., n], 2 -> [1, 2, ..., n], ...)
isolated Mean [ms] Relative
old 162.1 ± 6.3 1.66 ± 0.08
new 97.7 ± 2.9 1.00
again 203.1 ± 4.1 2.08 ± 0.07
cycle Mean [s] Relative
old 1.952 ± 0.054 596.91 ± 154.38
new 0.034 ± 0.002 10.29 ± 2.71
again 0.003 ± 0.001 1.00
complete Mean [ms] Relative
old 59.3 ± 5.3 8.46 ± 1.38
new 218.5 ± 15.8 31.15 ± 4.83
again 7.0 ± 1.0 1.00

The new code has a larger constant factor than the #13150 algorithm for isolated graphs, which shows as a 2x slowdown. This makes it slightly slower than the "old" code in 5.2 in this trivial case.

It performs much better than both trunk and 5.2 in the other cases.

This change improves the asymptotic behavior of the
closure-computation by ensuring that closure-computation work is
shared between nodes instead of being duplicated.

Consider for example the relation

  a -> b
  b -> c
  c -> a

The previous code would first compute

  a -> [b, c, a]

by traversing the graph, then

  b -> [c, a, b]

by traversing the graph again, then

  c -> [a, b, c]

again. The new code traverses the graph until it finds a cyclic
node (starting from [a] this would be [a]), then compute

  a -> [a, b, c]

using the same algorithm as before, and then immediately computes the
result for c and c from the result of a, without any additional
traversal.

Performance results using the benchmark script of Florian Weimer (
https://gitlab.com/gasche-snippets/transitive-closure ).

Implementations:
- `old` is the code in 5.2
- `new` is ocaml#13150
- `again` is the algorithm in this commit

Tests:
- `isolated` is a diagonal relation (1 -> 1, 2 -> 2, 3 -> 3, ...)
- `cycle` is (1 -> 2, 2 -> 3, 3 -> 4, ..., n -> 1)
- `complete` is (1 -> [1, 2, ..., n], 2 -> [1, 2, ..., n], ...)

| isolated |   Mean [ms] |    Relative |
|:---------|------------:|------------:|
| `old`    | 162.1 ± 6.3 | 1.66 ± 0.08 |
| `new`    |  97.7 ± 2.9 |        1.00 |
| `again`  | 203.1 ± 4.1 | 2.08 ± 0.07 |

| cycle   |      Mean [s] |        Relative |
|:--------|--------------:|----------------:|
| `old`   | 1.952 ± 0.054 | 596.91 ± 154.38 |
| `new`   | 0.034 ± 0.002 |    10.29 ± 2.71 |
| `again` | 0.003 ± 0.001 |            1.00 |

| complete |    Mean [ms] |     Relative |
|:---------|-------------:|-------------:|
| `old`    |   59.3 ± 5.3 |  8.46 ± 1.38 |
| `new`    | 218.5 ± 15.8 | 31.15 ± 4.83 |
| `again`  |    7.0 ± 1.0 |         1.00 |

The new code has a larger constant factor than the ocaml#13150 algorithm
for isolated graphs, which shows as a 2x slowdown. This makes it
slightly slower than the "old" code in 5.2 in this trivial case.

It performs much better than both trunk and 5.2 in the other cases.
@gasche
Copy link
Member Author

gasche commented May 9, 2024

One may reasonably wonder whether it is worth having an algorithm that is faster in the "hard" cases, if it has worse constant factors in the "easy" case (2x slowdown). Let me explain why I think it is worth it.

  1. The instances in human-written code are both "easy" and very small, so the cost of this closure computation is neglectible compared to the rest of the compilation, and the constant factor does not matter.

  2. The issue with this function is thus not its computation time in the easy case, but the lack of robustness that arise from bad asymptotic behavior in harder cases. Performing asymptotically well is more important than being fast.

  3. The current algorithm in trunk is inefficient in situations that very plausibly occur in generated code: think of N mutually-recursive functions that all share an invariant first parameter: let rec f env ... = ... and g env ... = ... and .... In this case the current algorithm will perform N times the work of the algorithm in this PR.

@gasche
Copy link
Member Author

gasche commented May 15, 2024

@fweimer-rh I wonder if you would be willing to give a look at the code and give your impression on the complexity/performance tradeoff.

@fweimer-rh
Copy link
Contributor

@fweimer-rh I wonder if you would be willing to give a look at the code and give your impression on the complexity/performance tradeoff.

@gasche I extracted the previously problematic relation graph from the Coccinelle/Menhir build, and your implementation is indeed much faster than what's in trunk today (one second versus 5.7 seconds). I'll try an array-based implementation and will report if it is faster.

@lthls
Copy link
Contributor

lthls commented May 19, 2024

I had some discussions with @chambart earlier this week about this issue (the original #13150, not this new algorithm), and we suspect that for menhir-generated parsers we should simply not compute the transitive closure at all (we expect that very few parameters will be invariant or unused, and for the invariant parameters it's unlikely we will have any opportunity to actually make use of that).
However, we didn't find a good way to detect this case so we didn't make any concrete proposals.
Some solutions we thought about:

  • a dedicated flag passed to the compiler to warn about big recursive nests (impractical because dune doesn't make it easy to pass per-file flags, although since it knows about menhir it might be able to do it automatically for generated parsers)
  • a cut-off in the compiler that only computes transitive closures for "small" nests (say, less than 20 functions)

I don't have any code for these suggestions at the moment, but if someone is interested in testing them I could put it on my todo list.

@rwmjones
Copy link
Contributor

rwmjones commented May 19, 2024

A note that per-file flags are a big pain for downstream packagers. You have to remember that Dune is not obvious for people who don't work on OCaml daily. I use OCaml a lot but I have no idea how to add a per-file flag via Dune. Someone coming to an OCaml project as an outside downstream packager just thinks Dune is weird. Not to mention there are many other build systems used by OCaml projects, and a lot of half-baked hand-written Makefiles, which makes everything harder.

@lthls
Copy link
Contributor

lthls commented May 19, 2024

The per-file flags could be made to work independently of the build system with the introduction of unit headers (ocaml/RFCs#26). For example menhir would add at the start of the generated file an attribute disabling this optimisation, and the compiler would read this attribute and change its settings accordingly.
Otherwise I agree that it's not realistic (as far as I know it isn't even possible in dune, you would have to move the files to independent libraries).

@gasche
Copy link
Member Author

gasche commented May 19, 2024

I hadn't thought of this argument for unit headers: "easier than figuring out how to pass per-file flags with Dune. Genius!

@lthls hopefully this discussion motivated you to review this PR :-)

(The implementation uses persistent maps and sets just like the original code -- I thought that it would offer a fairer comparison. I am happy to start using Hashtbl as suggested by @fweimer-rh to see if it improves constant factors, now that we know that there exists a real program that is affected by those constant factors.)

@gasche
Copy link
Member Author

gasche commented May 19, 2024

(I tried using hashtables for the state map and visited set, and the performance gains are very small on the synthetic benchmarks -- 30% faster for "isolated" relations, and the same speed for all others. It is not worth the small extra complexity of using different data structures to represent the same data.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants