
Adds split_by_episodes to LeRobotDataset #158

Merged: 2 commits, May 20, 2024

Conversation

radekosmulski (Contributor)

What this does

This adds a split_by_episodes method to LeRobotDataset. For instance, the ACT datasets have only a single split ("train"), and it would be great to retain some portion of the data for calculating metrics during training.

But in general, this functionality can also be useful in many other ways. For instance, one thing I have been curious about is how performance scales with the size of the dataset: ACT trains on just 50 episodes, but what if we had only 10? 20? What about scripted ACT data vs. data recorded from a participant?

I find this functionality very useful for experimentation, so I went ahead and added it.

I am not entirely sure if there is interest in this functionality -- if yes, I am happy to make any changes that might be required in order to merge this.

One thing that I feel would be very valuable to add here is specifying a seed for the split. I can add this to this PR no problem (or in a follow-up one); again, I am not sure if there is interest in this from the maintainers, this is just to test the waters.

I drew inspiration from how train_test_split is implemented in datasets (some of the logic is verbatim from there), but the curveball here was dealing with the mapping of examples to episodes and the episode_data_index, so I had to add some functionality around this.

This is an example of how this functionality might be used:
[screenshot of example usage; see the snippet under "How to checkout & try?" below]

Very new to the lib, not sure how this fits into the greater whole etc, etc, but excited to contribute if I can 🙂

How it was tested

I ran a couple of scenarios, caught and ironed out some obvious issues.

How to checkout & try? (for the reviewer)

Call split_by_episodes on an instance of a LeRobotDataset, try various parameter combinations (ints, floats), verify that the properties set on the returned datasets look okay.

Try other datasets.

```python
sim_transfer_cube_scripted_ds = LeRobotDataset("lerobot/aloha_sim_transfer_cube_scripted")
train, test = sim_transfer_cube_scripted_ds.split_by_episodes()  # try passing various combinations of arguments
test.episode_data_index
```
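
For reference, a quick sanity check on the returned splits might look like this (a sketch; it assumes episode_data_index is a dict of "from"/"to" frame-index tensors with one entry per episode, which is how it is discussed later in this thread):

```python
# Hypothetical sanity check: every episode in the test split should have a
# "from"/"to" entry, and the two splits should partition the episodes.
assert len(test.episode_data_index["from"]) == test.num_episodes
assert train.num_episodes + test.num_episodes == sim_transfer_cube_scripted_ds.num_episodes
```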

Cadene (Collaborator) commented May 9, 2024

@radekosmulski Super useful! We wanted to support an evaluation set. We didn't include it in the alpha release because the loss computed on an evaluation set usually doesn't correlate well with success rate in the real world. However, we want to support it in the longer term because it's an interesting metric.

There might be an easier way to achieve this though. And we might already support it. See: https://huggingface.co/docs/datasets/v2.19.0/loading#slice-splits

[screenshot of the slice-splits section of the datasets docs]
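
Presumably the screenshot showed the slice syntax from the linked docs; a reconstruction of that usage (the dataset name is carried over from this PR, not from the screenshot):

```python
from datasets import load_dataset

# datasets slice-split syntax: first 90% of "train" for training,
# the remaining 10% held out for validation.
train_ds = load_dataset("lerobot/aloha_sim_transfer_cube_scripted", split="train[:90%]")
val_ds = load_dataset("lerobot/aloha_sim_transfer_cube_scripted", split="train[90%:]")
```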

Comment on lines 153 to 304
```python
    raise ValueError(
        f"test_size={test_size} should be either positive and smaller "
        f"than the number of samples {self.num_episodes} or a float in the (0, 1) range"
    )

if (
    isinstance(train_size, int)
    and (train_size >= self.num_episodes or train_size <= 0)
    or isinstance(train_size, float)
    and (train_size <= 0 or train_size >= 1)
):
    raise ValueError(
        f"train_size={train_size} should be either positive and smaller "
        f"than the number of samples {self.num_episodes} or a float in the (0, 1) range"
    )

if train_size is not None and not isinstance(train_size, (int, float)):
    raise ValueError(f"Invalid value for train_size: {train_size} of type {type(train_size)}")
if test_size is not None and not isinstance(test_size, (int, float)):
    raise ValueError(f"Invalid value for test_size: {test_size} of type {type(test_size)}")

if isinstance(train_size, float) and isinstance(test_size, float) and train_size + test_size > 1:
    raise ValueError(
        f"The sum of test_size and train_size = {train_size + test_size}, should be in the (0, 1)"
        " range. Reduce test_size and/or train_size."
    )

if isinstance(test_size, float):
    n_test = ceil(test_size * self.num_episodes)
elif isinstance(test_size, int):
    n_test = float(test_size)

if isinstance(train_size, float):
    n_train = floor(train_size * self.num_episodes)
elif isinstance(train_size, int):
    n_train = float(train_size)

if train_size is None:
    n_train = self.num_episodes - n_test
elif test_size is None:
    n_test = self.num_episodes - n_train

if n_train + n_test > self.num_episodes:
    raise ValueError(
        f"The sum of train_size and test_size = {n_train + n_test}, "
        "should be smaller than the number of "
        f"samples {self.num_episodes}. Reduce test_size and/or "
        "train_size."
    )

n_train, n_test = int(n_train), int(n_test)

if n_train == 0:
    raise ValueError(
        f"With self.num_episodes={self.num_episodes}, test_size={test_size} and train_size={train_size}, the "
        "resulting train set will be empty. Adjust any of the "
        "aforementioned parameters."
    )

if not shuffle:
    train_episode_indices = np.arange(n_train)
    test_episode_indices = np.arange(n_train, n_train + n_test)
else:
    permutation = np.random.permutation(self.num_episodes)
    test_episode_indices = permutation[:n_test]
    train_episode_indices = permutation[n_test : (n_test + n_train)]

train_indices = [
    idx
    for idx, episode_idx in enumerate(self.hf_dataset["episode_index"])
    if episode_idx.item() in train_episode_indices
]
test_indices = [
    idx
    for idx, episode_idx in enumerate(self.hf_dataset["episode_index"])
    if episode_idx.item() in test_episode_indices
]

train_split = LeRobotDataset.from_preloaded(
    repo_id=self.repo_id,
    version=self.version,
    root=self.root,
    split=self.split,
    transform=self.transform,
    delta_timestamps=self.delta_timestamps,
    hf_dataset=self.hf_dataset.select(indices=train_indices),
    stats=self.stats,
    info=self.info,
    videos_dir=self.videos_dir,
)
train_split.create_episode_data_index()

test_split = LeRobotDataset.from_preloaded(
    repo_id=self.repo_id,
    version=self.version,
    root=self.root,
    split=self.split,
    transform=self.transform,
    delta_timestamps=self.delta_timestamps,
    hf_dataset=self.hf_dataset.select(indices=test_indices),
    stats=self.stats,
    info=self.info,
    videos_dir=self.videos_dir,
)
test_split.create_episode_data_index()

return train_split, test_split
```

Cadene (Collaborator):
I think it's possible to find the frame id of a certain episode in `episode_data_index` and use:
https://huggingface.co/docs/datasets/v2.19.0/loading#slice-splits
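
A rough sketch of the idea, assuming dataset is an already-loaded LeRobotDataset and episode_data_index is a dict of "from"/"to" frame-index tensors (holding out the last ten episodes is an arbitrary choice for illustration):

```python
from datasets import load_dataset

# Hypothetical: hold out the last 10 episodes. Look up the frame index where
# the first held-out episode starts, then express both splits as slices.
first_test_episode = dataset.num_episodes - 10
boundary = dataset.episode_data_index["from"][first_test_episode].item()

train_ds = load_dataset(dataset.repo_id, split=f"train[:{boundary}]")
test_ds = load_dataset(dataset.repo_id, split=f"train[{boundary}:]")
```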

radekosmulski (Contributor, Author):

Hey @Cadene! I am not entirely sure I follow your plan with regards to looking up the frame id. More pointers would be appreciated.

But since I wrote the long comment I thought it might be easier if I just implemented in code what I had in mind.

So we have this now 🙂
[screenshot of the updated code]

Super simple, requires less code for saving and loading the dataset, and lets the user do whatever they'd like.

radekosmulski (Contributor, Author):

Mhmm, it seems there are a couple of other places in the code that rely on the episode index containing tensors...

Plus, I need to take a closer look at load_previous_and_future_frames.

radekosmulski (Contributor, Author):

Fixed things in a bunch of places across the repo, ran the tests, and fixed what I could.

Unfortunately, tests on main are failing as well. I am not sure what the failures are due to; I suspect some might be due to my env, but I'm not sure (I recreated a conda env and followed the instructions from the README, running pip install ".[aloha, pusht, xarm]").

radekosmulski (Contributor, Author) commented May 9, 2024

Hey @Cadene! Thank you very much for looking into this and for your comments, I appreciate it! 🙂

I didn't know the slice functionality existed in datasets; this is great to know! Also, I am 100% on board with simplification, not duplicating functionality, having less code, and keeping things decoupled! It's great that there might be a simpler way of achieving the above.

I started looking at the slice functionality, and here are a couple of thoughts (I might not be seeing the whole picture here, though, so some of this might be off the mark; apologies).

It is absolutely fantastic that we can do dataset surgery as follows; it gives a lot of flexibility to the user:

```python
import datasets

ds1 = datasets.load_dataset("lerobot/aloha_sim_transfer_cube_scripted", split="train[:10%]")
ds2 = datasets.load_dataset("lerobot/aloha_sim_transfer_cube_scripted", split="train[:15%]")
ds3 = datasets.load_dataset("lerobot/aloha_sim_transfer_cube_scripted", split="train[-25%:]")
```

When we load a partial dataset, the episode_data_index goes out of sync:
[screenshot showing the stale episode_data_index]

It gets loaded from metadata. One could argue that other things (like stats, fps, and the video flag) are properties valid across the whole dataset or arbitrary subsets of it, but that is not the case for the episode_data_index. (Yes, one could also argue that using whole-dataset stats when splitting into train and validation sets introduces leakage, but that is rarely relevant in practice and probably not relevant to how people will want to use this functionality.)

So the only piece of metadata that is dependent on what subset gets loaded is episode_data_index.

My reasoning:

  • storing episode_data_index is an extra step that needs to be taken when writing / loading the dataset
  • it adds conceptual ("huh, where does this come from?") complexity and code complexity (it needs to be saved and then loaded)

Would it not be better to calculate this when creating a LeRobotDataset? It would be ideal if the LeRobotDataset could work straight off a vanilla hf_dataset, but I understand that given that we want to store info like fps, the video flag, etc, this might not be possible. But at least we can limit the amount of new stuff that gets introduced, to streamline this.

If we move to calculating the episode_data_index when the data is loaded, we would maintain backwards compatibility (the new version would just not use metadata/episode_data_index.safetensors), and the LeRobotDataset would work with a slice passed to it in the split!

How the episode_data_index gets created would be explicit (we can make it a cached calculated property or set it at loading, whichever would be simpler).

I think it might also make creating and saving a LeRobotDataset simpler if the functionality for creating the episode_data_index would be centralized to a single spot in the codebase and dynamic.

Anyhow, let me know your thoughts, please 🙂 Happy to start chipping away on the above or otherwise implement conditional recalculation of the episode_data_index when a slice is loaded (or some other alternative).

@radekosmulski radekosmulski marked this pull request as draft May 10, 2024 08:20
@radekosmulski radekosmulski marked this pull request as ready for review May 10, 2024 09:04
@radekosmulski radekosmulski requested a review from Cadene May 10, 2024 09:20
@aliberts added the 🗃️ Dataset (Something dataset-related) and ✨ Enhancement (New feature or request) labels May 12, 2024
Cadene (Collaborator) commented May 12, 2024

@radekosmulski thanks for all your work!

My take is that when split != "train" is provided, we should recalculate episode_data_index, and maybe cache the result somewhere if it takes too much time to calculate. I also agree that we don't care about stats being computed on the full training set. It's really a minor detail, since our "true testing set" is running the robot in the real world, which is always out of distribution anyway.

To update episode_data_index you would need to get all episode_index and index values from hf_dataset. For all our datasets (because they are not super big) we can do this:

hf_dataset["episode_index"]
hf_dataset["index"]

Note: if the datasets were really, really big, we would have to use select_columns to avoid a RAM error. See this example:

hf_dataset["image"][0] # loads all your images in RAM and access the first one
hf_dataset[0]["image"] # loads a single item (with all its columns) in RAM
hf_dataset.select_columns("image")[0]["image"] # loads a single image in RAM
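
A minimal sketch of the recalculation being described, assuming episode_data_index is a dict of "from"/"to" tensors and that the episode_index column (returned as tensors under hf_transform_to_torch) comes in contiguous runs:

```python
import torch

def calculate_episode_data_index_sketch(hf_dataset):
    # Walk the flat episode_index column once, recording where each episode's
    # frames begin ("from") and end ("to", exclusive) in the loaded subset.
    episode_indices = [idx.item() for idx in hf_dataset["episode_index"]]
    if not episode_indices:
        return {"from": torch.tensor([]), "to": torch.tensor([])}
    froms, tos = [], []
    current = None
    for i, ep in enumerate(episode_indices):
        if ep != current:
            if current is not None:
                tos.append(i)
            froms.append(i)
            current = ep
    tos.append(len(episode_indices))
    return {"from": torch.tensor(froms), "to": torch.tensor(tos)}
```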

I think it would be great to merge this feature into main!

I am not sure we should merge the logic in train.py to evaluate on a validation set yet. I don't think it is critical, so I would like to keep our code as simple as possible. If you show that evaluating on a validation set correlates with success rate, then we should merge it to main. In the meantime, we could add a short "how to load/evaluate on a validation set" example inside the examples directory.

What do you think?

radekosmulski (Contributor, Author):

Hey @Cadene! Thank you for looking into this and for your guidance! 🙂 Your proposal sounds great!

I attempted mapping from the loaded split subset to the episode_data_index of the full dataset, but there seems to be not enough information in the downloaded chunk: the frame_index gets recalculated on the returned subset and is not the frame_index from the original dataset.

[screenshot of the recalculated frame_index]

As such, I implemented this as a recalculation if split != 'train'.

The example sounds great -- I will work on it but might have it ready later in the week, if that would be okay. It seems like it might require a bit more work, as I need to familiarize myself with other aspects of the library to do it right.

Happy to make any other changes that might be necessary, let me know, please! Also, we can merge this and I can open another PR for the example, or wait for the example to be ready, either one is okay on my end.

Screenshot of how it works now:
[screenshot of the new split behavior]
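
The screenshot itself is not preserved; based on the surrounding discussion, the end result presumably looks something like this (the import path and the exact split strings are assumptions):

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Loading a slice triggers the episode_data_index recalculation,
# since split != "train".
train_ds = LeRobotDataset("lerobot/aloha_sim_transfer_cube_scripted", split="train[:90%]")
val_ds = LeRobotDataset("lerobot/aloha_sim_transfer_cube_scripted", split="train[90%:]")
```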

Cadene (Collaborator) commented May 15, 2024

No worries! Thanks for this great contribution. Don't hesitate to message me. I can jump on a call if you have any trouble.

radekosmulski (Contributor, Author):

Hey @Cadene -- I think it all should be done and ready for your review, including the example 🙂

Cadene (Collaborator) left a comment:

Thanks again! One small iteration and we can merge.

Any chance we could add unit tests for calculate_episode_data_index and reset_episode_index?

Could we update test_examples.py to add your example?
https://github.com/huggingface/lerobot/blob/main/tests/test_examples.py#L55-L72
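
As a sketch, a unit test for calculate_episode_data_index might look like this (the import path and the exact fixture values are assumptions, chosen to mirror the fixtures discussed below):

```python
from datasets import Dataset

from lerobot.common.datasets.utils import calculate_episode_data_index, hf_transform_to_torch


def test_calculate_episode_data_index():
    # Three episodes of lengths 2, 1, and 3 in a flat six-frame dataset.
    hf_dataset = Dataset.from_dict(
        {
            "timestamp": [0.0, 0.1, 0.0, 0.0, 0.1, 0.2],
            "index": list(range(6)),
            "episode_index": [0, 0, 1, 2, 2, 2],
        }
    )
    hf_dataset.set_transform(hf_transform_to_torch)
    episode_data_index = calculate_episode_data_index(hf_dataset)
    assert episode_data_index["from"].tolist() == [0, 2, 3]
    assert episode_data_index["to"].tolist() == [2, 3, 6]
```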

(Outdated, resolved review threads on: examples/4_slice_dataset_and_calculate_loss.py, lerobot/common/datasets/utils.py)
radekosmulski (Contributor, Author):

Thank you very much for the review, @Cadene! 🙂 I do appreciate it.

Along the way, I slightly refactored test_examples.py and test_datasets.py. I like the changes, as they remove a good bit of code repetition, DRYing things up. On the other hand, an argument could be made that the tests were easier to follow when data creation was done inline. I feel we gained some code clarity with those changes, but I'm happy to revert them if that would be preferred!

Also, added unit tests for calculate_episode_data_index and reset_episode_index as requested.

Please let me know if any additional changes might be needed.

Cadene (Collaborator) left a comment:

Almost there, thanks for taking the time! It's really helpful.
I am holding off on approving right now to let you do the final iteration and merge ^^'

(Resolved thread on tests/test_utils.py)
Comment on lines 30 to 53
```python
import pytest
import torch
from datasets import Dataset

from lerobot.common.datasets.utils import hf_transform_to_torch


@pytest.fixture
def hf_dataset():
    dataset = Dataset.from_dict(
        {
            "timestamp": [t * 0.1 for t in range(1, 6)],
            "index": list(range(5)),
            "episode_index": [0] * 5,
        },
    )
    dataset.set_transform(hf_transform_to_torch)
    return dataset


@pytest.fixture
def hf_dataset_3_episodes():
    dataset = Dataset.from_dict(
        {
            "timestamp": [torch.tensor(t * 0.1) for t in range(6)],
            "index": [torch.tensor(idx) for idx in range(6)],
            "episode_index": [torch.tensor(0)] * 2 + [torch.tensor(1)] + [torch.tensor(2)] * 3,
        },
    )
    dataset.set_transform(hf_transform_to_torch)
    return dataset
```
Cadene (Collaborator) commented May 17, 2024:

As you guessed, I think it could be better to revert the refactor with the two fixtures. I think it interrupts the flow of the test.

For hf_dataset_3_episodes, I find it quite weird to have to add torch.tensor. Do you understand why? Would it be worth adding a comment explaining why? I can help you investigate.

I thought we were supposed to get torch tensors only when doing:

```python
hf_dataset = hf_dataset.with_format("torch")
```

which was not compatible with hf_dataset.set_transform(hf_transform_to_torch).

Maybe a version update of hugging face dataset changed this behavior?

radekosmulski (Contributor, Author):

Ah okay, I wrote this code before I figured out that hf_dataset.set_transform was used; it is a relic of the past that I should have removed! I fixed hf_dataset but missed fixing hf_dataset_3_episodes.

Reverted to the version with inline data creation.

pleasure working on this with you 🫡 🙂

@radekosmulski radekosmulski requested a review from Cadene May 17, 2024 01:36
Cadene (Collaborator) left a comment:

Good to go! Thanks ;) Really helpful.

If you are curious, we are tracking our progress on this project page: https://github.com/orgs/huggingface/projects/46/views/1
(I am not sure you know)

Feel free to reach out to us on Discord if you want to add an item to the list, or if you want to work on an item.

Thanks!

Comment on lines +121 to +128
```python
# Capture the output of the script
output_buffer = io.StringIO()
sys.stdout = output_buffer
exec(file_contents, {})
printed_output = output_buffer.getvalue()
# Restore stdout to its original state
sys.stdout = sys.__stdout__
assert "Average loss on validation set" in printed_output
```
Cadene (Collaborator):

wow interesting ;)

radekosmulski (Contributor, Author):

🙌

Thanks for the heads up on the project board -- great to know! And thank you for approving these changes!

I don't think it is letting me merge so you might have to do the honors and push the button 🙂

[screenshot of the disabled merge button]

radekosmulski (Contributor, Author):

Or maybe some setting on the repo needs to be toggled to allow people to do merges? Happy to test drive this with you if you'd like to look for a resolution.

@radekosmulski radekosmulski requested a review from Cadene May 17, 2024 13:24
Cadene (Collaborator) commented May 17, 2024

@radekosmulski

```bash
pre-commit install
pre-commit run --all-files
```

@radekosmulski force-pushed the add_split_by_episode branch 2 times, most recently from 5dfe063 to afcba90, May 17, 2024 22:01
…face#158)

* LeRobotDataset now recalculates episode_data_index when loading dataset subset
* add an example for calculating validation loss and showcasing the new
  functionality
* add unit tests
* refactor test_examples.py for readability
radekosmulski (Contributor, Author):

@Cadene pre-commit ran and formatting fixed 🙂

I also squashed the commits, but I'm not sure if that was necessary?

@Cadene Cadene merged commit 9b62c25 into huggingface:main May 20, 2024
HalvardBariller pushed a commit to HalvardBariller/lerobot that referenced this pull request May 21, 2024
Labels: 🗃️ Dataset (Something dataset-related), ✨ Enhancement (New feature or request)
Projects: Status: Done
3 participants