Adapt prefix thresholds #4273
base: release-v1.6.0
Conversation
/benchmark search_wiki

Here are your search_wiki benchmarks diff 👊
I like what you did there!
Great thinking @ManyTheFish! Could you add, very visibly in the PR description, a table indicating the benefits of this PR for index size and indexing time, and the tradeoff for search time?
/benchmark search_songs

Here are your search_songs benchmarks diff 👊
After discussing with Many, removing this PR from v1.6.0 since the benchmark results are not satisfying.
This PR changes the prefix database creation thresholds to reduce their impact on the indexing time.
Explanations
There are two thresholds that we can change in the prefix database:
- `threshold`
- `max_prefix_length`
Prior values and drawbacks
On Meilisearch v1.5, we have 100 as the `threshold` and 4 as the `max_prefix_length`, meaning that we compute and store every prefix that has at least 100 words prefixed by it and fewer than 5 characters.

By gathering some statistics on the `movies`, `Wikipedia`, and `e-commerce` datasets, I found out that 1-character prefixes make up only 10% of the total count of stored prefixes on a small dataset (10MB), and less than 1% on a bigger dataset (2GB). This is a shame, because the 1-character prefixes are the most important ones regarding search-time gain.
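To picture how the two knobs interact, here is a minimal, hypothetical sketch of the selection rule (not Meilisearch's actual implementation): a prefix is stored only if at least `threshold` words start with it and it is no longer than `max_prefix_length` characters. The word list and the toy threshold of 3 below are made up for illustration:

```rust
use std::collections::HashMap;

/// Hypothetical sketch: keep a prefix only if at least `threshold` words
/// start with it and it has at most `max_prefix_length` characters.
fn compute_prefixes(words: &[&str], threshold: usize, max_prefix_length: usize) -> Vec<String> {
    // Count, for every candidate prefix, how many words it prefixes.
    let mut counts: HashMap<String, usize> = HashMap::new();
    for word in words {
        for len in 1..=max_prefix_length.min(word.chars().count()) {
            let prefix: String = word.chars().take(len).collect();
            *counts.entry(prefix).or_insert(0) += 1;
        }
    }
    // Keep only the prefixes that reach the threshold.
    let mut prefixes: Vec<String> = counts
        .into_iter()
        .filter(|(_, count)| *count >= threshold)
        .map(|(prefix, _)| prefix)
        .collect();
    prefixes.sort();
    prefixes
}

fn main() {
    // Toy word list with a low threshold to make the effect visible.
    let words = ["roll", "rolling", "rolls", "rock", "film"];
    let prefixes = compute_prefixes(&words, 3, 3);
    // "r", "ro", and "rol" each prefix at least 3 words; "film" is alone,
    // so none of its prefixes reach the threshold.
    println!("{prefixes:?}"); // prints ["r", "ro", "rol"]
}
```

Raising `threshold` or lowering `max_prefix_length` both shrink the output set, which is exactly the lever this PR pulls.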
Conversely, the prefixes containing 3 or 4 characters represent more than 30% of the prefixes on a small dataset, which is acceptable, but more than 80% of the prefixes on big datasets.

In this configuration, on big datasets, we allocate between 80 and 100 times more resources (indexing time, space, etc.) to the least important prefixes than to the most important ones, which seems suboptimal to me.
New values
This PR changes the `threshold` from 100 to 500 and the `max_prefix_length` from 4 to 3 in order to rebalance the distribution.
This suggested change is conservative. We could consider raising the thresholds even more, but I'm afraid that would have a significant impact on search time, and since we are already in the prerelease phase, we will not be able to run many more benchmarks.
Benchmarks results
Indexing Time
In terms of indexing time, there is a small gain from modifying the prefix thresholds, but it is below expectations. Even though this change reduces the number of computed prefixes, the computation time doesn't change that much. The total indexing-time gain is between 5 and 10%.
Below are some indexing graphs of the e-commerce and the Wikipedia dataset showing the time spent to index the documents for each document addition:
Database size
As with the indexing time, the total database-size gain is between 5 and 10%.
Below are some indexing graphs of the e-commerce and the Wikipedia dataset showing the size of the database after each document addition:
Search time
Some search queries lost performance:

One search time is multiplied by 2, the other by 8. This may be because `roll` and `film` are no longer computed as prefixes, so Meilisearch must compute them at search time.

Related
- Google sheet with some prefix stats: https://docs.google.com/spreadsheets/d/1EOL4Bmg_2RW2WGt6DN5Ec7V2hsbIc42eC9iOjy5vzN8/edit#gid=558148218
- Google sheet with benchmarks: https://docs.google.com/spreadsheets/d/1PT7tS-UW3mdUqZUau1lsfvHokgTuHLWW8O_2awzZM3Y/edit#gid=0
@ManyTheFish
Is it worth merging this PR?

We may try other thresholds: for example, keep or raise the `threshold` but go back to a `max_prefix_length` of `4`, to maybe re-compute the `roll` and `film` prefixes and regain the time lost in the search benchmarks.