Skip to content
This repository has been archived by the owner on Aug 28, 2021. It is now read-only.

Chunking Lists

Erik Arvidsson edited this page Aug 12, 2016 · 2 revisions

Chunking lists

Warning: This is out of date

Goals

  • Chunk large lists into smaller lists that can be shared between/after minor "mutations".
  • Adding/removing an element should only affect 1 chunk (2 if adjacent chunk is affected)

Implementation strategy

Given a list of refs (primitives are treated the same. We just care about the refs here):

[sha1-0, sha1-1, ..., sha1-n]

we run a rolling hash on the Refs and split when that hash matches a given pattern, probably using 111111 (64 - 1) which will lead to an average chunk size of 64 elements.

We then create a compound list, similarly to how have a compoundBlob:

[ref0, len0, ref1, len1, ... refk, lenk]

where ref0 points to a list of the elements in the first chunk.

The representation is also a list so we split it again until we end up with a list that isn't chunked.

Interface

Same as types.List today with the given big-oh notation

Len() uint64  // O(1)
Empty() bool  // O(1)
Get(idx uint64) Value  // O(log64(n))
Slice(start uint64, end uint64) List // O(log64(n))
Set(idx uint64, v Value) List // O(log64(n))
Append(v ...Value) List  // O(log64(n) + k), k number of appended values
Insert(idx uint64, v ...Value) List
Remove(start uint64, end uint64) List

Len, Empty

The length is stored in the data structure for compound lists.

Get

At every level we have about 64 elements. We do a binary search over those to find the right sub list. We might need to continue the search until we found a leaf. O(log2(64) * log64(n)) => O(log64(n))

Slice

Finding the right chunks is O(log64(n)). I believe that we only need to rechunk the start chunk and the adjacent chunk. We can just reuse the rest of the chunks until the end chunk. Proof?

Set

Finding the right chunk is O(log64(n)). Changing the value in there requires that we rechunk that chunk, possibly reading the next chunk and then propagating this up to the parent compound list.

It is clear that we will need to update the chunk after the current chun at every level but I believe that we only need to update 2 chunks at every level. I need to find a proof or a convincing argument for this.

Append

Find the last chunk and append to that. Rechunk last chunk and new content

Insert, Remove

Find correct chunk. Mutate that and rechunk from there and the adjacent one. Do the same with the parents recursivey. Proof is missing here too.