
Benchmark CI #1633

Open · 3 of 10 tasks
pitag-ha opened this issue Jun 26, 2023 · 1 comment

pitag-ha (Member) commented Jun 26, 2023

The latency of Merlin queries depends on many different factors, such as:

  • The buffer it's run on as a whole; in particular, its size and typing complexity.
  • The location inside the buffer it's run on.
  • The dependency graph of the buffer.
  • Whether a PPX is applied, and which one.
  • Merlin's cache state at the moment the query is run.
  • Which Merlin query is run.

So for meaningful benchmark results, we need to run Merlin on a wide variety of input samples. We've written merl-an to generate such an input sample set in a random but deterministic way. It provides a merl-an benchmark command, which persists the telemetry part of the Merlin response in the format expected by current-bench.
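For reference, here's a minimal sketch of the kind of payload current-bench consumes: a list of named results, each carrying a list of metrics. The field names reflect my understanding of the current-bench schema, and the query name and timing value are purely illustrative, not actual merl-an output.

```ocaml
(* Hedged sketch: build a current-bench-style result with Yojson.
   "case-analysis" and the 1.8 ms value are made-up examples. *)
let benchmark_json : Yojson.Safe.t =
  `Assoc
    [ ( "results",
        `List
          [ `Assoc
              [ ("name", `String "case-analysis");
                ( "metrics",
                  `List
                    [ `Assoc
                        [ ("name", `String "latency");
                          ("value", `Float 1.8);
                          ("units", `String "ms");
                        ];
                    ] );
              ];
          ] );
    ]

let () = print_endline (Yojson.Safe.to_string benchmark_json)
```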

The next steps to get a Merlin benchmark CI up and running are:

  • Finish the PoC for a current-bench CI on Merlin using merl-an. We were blocked on this by a current-bench issue for a while. Done: see the PoC graphs.
  • Improve the separation into different benchmarks (in merl-an): with the current merl-an output, current-bench will create a separate graph for each file being benchmarked. That doesn't scale. Instead, there should be one graph per cache workflow and per query, or similar.
  • Improve the Docker set-up: the whole benchmark set-up, such as installing merl-an and fetching the code base on which we run Merlin, should be done inside the container.
  • Filter out spikes (in merl-an): non-reproducible latency spikes (i.e. timings that exceed the expected timing by more than a factor of 10) mess up the scale of the current-bench graphs. See the sketch after this list.
  • Add a cold-cache workflow to the benchmarks: the reason the numbers look so good at the moment is that both the cmi-cache and the typer cache are fully warmed for all queries. It would also be interesting to have benchmarks for when the caches are cold.
  • Improve the output UX: when some samples call for attention, we'll want to know which location and query they correspond to.
  • Lock the versions of the dependencies of the project on which we run Merlin: currently, we use Irmin as the code base for the benchmarks and install its dependencies via opam without locking their versions. If a dependency splits or merges modules, or a module grows, the cmi- and cmt-files change, which adds Merlin-independent noise to the benchmarks. To avoid that, we could vendor a fixed version of each dependency.
  • Find a more representative input base: for now, we only use Irmin as the code base to run the benchmarks on.
  • Decide when to run the benchmarks: our CI will be very resource-heavy. current-bench supports running the benchmarks only "on demand" (i.e. when tagging the PR with a certain flag).
  • Possibly: it might also be interesting to track the number of latency spikes (the sketch below also counts them).
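
As a sketch for the spike filtering and spike counting mentioned above (a hypothetical approach, not merl-an's actual implementation): the factor-10 threshold comes from the item above, while treating the median of a group of timings as the "expected timing" is an assumption on my side.

```ocaml
(* Hypothetical spike filter: drop timings that exceed the median of their
   group by more than a factor of 10, and report how many were dropped. *)
let median timings =
  let sorted = List.sort compare timings in
  List.nth sorted (List.length sorted / 2)

let split_spikes timings =
  let m = median timings in
  List.partition (fun t -> t <= 10. *. m) timings

let () =
  let normal, spikes = split_spikes [ 1.2; 1.5; 0.9; 1.3; 42.0 ] in
  Printf.printf "kept %d timings, filtered %d spikes\n"
    (List.length normal) (List.length spikes)
```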
pitag-ha (Member Author) commented:

@3Rafal, is there anything you'd add?
