
Adds split_by_episodes to LeRobotDataset #158

Merged: 2 commits, May 20, 2024

Conversation

radekosmulski (Contributor)

What this does

This adds a split_by_episodes method to LeRobotDataset. For instance, the ACT datasets have only a single split ("train"), and it would be great to retain some portion of the data for calculating metrics during training.

But in general, this functionality can also be useful in many other ways. For instance, one thing I have been curious about is how performance scales with the size of the dataset: ACT trains on just 50 episodes, but what if we had only 10? 20? What about scripted ACT data vs. data recorded from a participant?

I find this functionality very useful for experimentation, so I went ahead and added it.

I am not entirely sure if there is interest in this functionality -- if yes, I am happy to make any changes that might be required in order to merge this.

One thing that I feel would be very valuable to add here is specifying a seed for the split. I can add this to this PR no problem (or in a follow-up one); again, I am not sure if there is interest in this from the maintainers, this is just to test the waters.

I drew inspiration from how train_test_split is implemented in datasets (some of the logic is verbatim from there), but the curveball here was dealing with the mapping of examples to episodes and the episode_data_index, so I had to add some functionality around this.

This is an example of how this functionality might be used:
[screenshot of example usage; see the snippet under "How to checkout & try?" below]

Very new to the lib, not sure how this fits into the greater whole etc, etc, but excited to contribute if I can 🙂

How it was tested

I ran a couple of scenarios, caught and ironed out some obvious issues.

How to checkout & try? (for the reviewer)

Call split_by_episodes on an instance of a LeRobotDataset, try various parameter combinations (ints, floats), verify that the properties set on the returned datasets look okay.

Try other datasets.

```python
sim_transfer_cube_scripted_ds = LeRobotDataset("lerobot/aloha_sim_transfer_cube_scripted")
train, test = sim_transfer_cube_scripted_ds.split_by_episodes()  # try passing various combinations of arguments
test.episode_data_index
```
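
For reference, a quick sanity check on the returned splits might look like this (a sketch; it assumes episode_data_index is a dict of "from"/"to" frame-index tensors with one entry per episode, which is how it is discussed later in this thread):

```python
# Hypothetical sanity check: every episode in the test split should have a
# "from"/"to" entry, and the two splits should partition the episodes.
assert len(test.episode_data_index["from"]) == test.num_episodes
assert train.num_episodes + test.num_episodes == sim_transfer_cube_scripted_ds.num_episodes
```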

Cadene (Collaborator) commented May 9, 2024

@radekosmulski Super useful! We wanted to support an evaluation set. We didn't include it in the alpha release because the loss computed on an evaluation set usually doesn't correlate well with success rate in the real world. However, we want to support it in the longer term because it's an interesting metric.

There might be an easier way to achieve this though. And we might already support it. See: https://huggingface.co/docs/datasets/v2.19.0/loading#slice-splits

[screenshot of the slice-splits section of the datasets docs]
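
Presumably the screenshot showed the slice syntax from the linked docs; a reconstruction of that usage (the dataset name is carried over from this PR, not from the screenshot):

```python
from datasets import load_dataset

# datasets slice-split syntax: first 90% of "train" for training,
# the remaining 10% held out for validation.
train_ds = load_dataset("lerobot/aloha_sim_transfer_cube_scripted", split="train[:90%]")
val_ds = load_dataset("lerobot/aloha_sim_transfer_cube_scripted", split="train[90%:]")
```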

Comment on lines 153 to 304
```python
    raise ValueError(
        f"test_size={test_size} should be either positive and smaller "
        f"than the number of samples {self.num_episodes} or a float in the (0, 1) range"
    )

if (
    isinstance(train_size, int)
    and (train_size >= self.num_episodes or train_size <= 0)
    or isinstance(train_size, float)
    and (train_size <= 0 or train_size >= 1)
):
    raise ValueError(
        f"train_size={train_size} should be either positive and smaller "
        f"than the number of samples {self.num_episodes} or a float in the (0, 1) range"
    )

if train_size is not None and not isinstance(train_size, (int, float)):
    raise ValueError(f"Invalid value for train_size: {train_size} of type {type(train_size)}")
if test_size is not None and not isinstance(test_size, (int, float)):
    raise ValueError(f"Invalid value for test_size: {test_size} of type {type(test_size)}")

if isinstance(train_size, float) and isinstance(test_size, float) and train_size + test_size > 1:
    raise ValueError(
        f"The sum of test_size and train_size = {train_size + test_size}, should be in the (0, 1)"
        " range. Reduce test_size and/or train_size."
    )

if isinstance(test_size, float):
    n_test = ceil(test_size * self.num_episodes)
elif isinstance(test_size, int):
    n_test = float(test_size)

if isinstance(train_size, float):
    n_train = floor(train_size * self.num_episodes)
elif isinstance(train_size, int):
    n_train = float(train_size)

if train_size is None:
    n_train = self.num_episodes - n_test
elif test_size is None:
    n_test = self.num_episodes - n_train

if n_train + n_test > self.num_episodes:
    raise ValueError(
        f"The sum of train_size and test_size = {n_train + n_test}, "
        "should be smaller than the number of "
        f"samples {self.num_episodes}. Reduce test_size and/or "
        "train_size."
    )

n_train, n_test = int(n_train), int(n_test)

if n_train == 0:
    raise ValueError(
        f"With self.num_episodes={self.num_episodes}, test_size={test_size} and train_size={train_size}, the "
        "resulting train set will be empty. Adjust any of the "
        "aforementioned parameters."
    )

if not shuffle:
    train_episode_indices = np.arange(n_train)
    test_episode_indices = np.arange(n_train, n_train + n_test)
else:
    permutation = np.random.permutation(self.num_episodes)
    test_episode_indices = permutation[:n_test]
    train_episode_indices = permutation[n_test : (n_test + n_train)]

train_indices = [
    idx
    for idx, episode_idx in enumerate(self.hf_dataset["episode_index"])
    if episode_idx.item() in train_episode_indices
]
test_indices = [
    idx
    for idx, episode_idx in enumerate(self.hf_dataset["episode_index"])
    if episode_idx.item() in test_episode_indices
]

train_split = LeRobotDataset.from_preloaded(
    repo_id=self.repo_id,
    version=self.version,
    root=self.root,
    split=self.split,
    transform=self.transform,
    delta_timestamps=self.delta_timestamps,
    hf_dataset=self.hf_dataset.select(indices=train_indices),
    stats=self.stats,
    info=self.info,
    videos_dir=self.videos_dir,
)
train_split.create_episode_data_index()

test_split = LeRobotDataset.from_preloaded(
    repo_id=self.repo_id,
    version=self.version,
    root=self.root,
    split=self.split,
    transform=self.transform,
    delta_timestamps=self.delta_timestamps,
    hf_dataset=self.hf_dataset.select(indices=test_indices),
    stats=self.stats,
    info=self.info,
    videos_dir=self.videos_dir,
)
test_split.create_episode_data_index()

return train_split, test_split
```

Cadene (Collaborator):
I think it's possible to find the frame id of a certain episode in `episode_data_index` and use:
https://huggingface.co/docs/datasets/v2.19.0/loading#slice-splits
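
A rough sketch of the idea, assuming dataset is an already-loaded LeRobotDataset and episode_data_index is a dict of "from"/"to" frame-index tensors (holding out the last ten episodes is an arbitrary choice for illustration):

```python
from datasets import load_dataset

# Hypothetical: hold out the last 10 episodes. Look up the frame index where
# the first held-out episode starts, then express both splits as slices.
first_test_episode = dataset.num_episodes - 10
boundary = dataset.episode_data_index["from"][first_test_episode].item()

train_ds = load_dataset(dataset.repo_id, split=f"train[:{boundary}]")
test_ds = load_dataset(dataset.repo_id, split=f"train[{boundary}:]")
```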

radekosmulski (Contributor, Author):

Hey @Cadene! I am not entirely sure I follow your plan with regards to looking up the frame id. More pointers would be appreciated.

But since I wrote the long comment I thought it might be easier if I just implemented in code what I had in mind.

So we have this now 🙂
[screenshot of the updated code]

Super simple, requires less code for saving and loading the dataset, and lets the user do whatever they'd like.

radekosmulski (Contributor, Author):

Mhmm, it seems there are a couple of other places in the code that rely on the episode index containing tensors...

Plus, I need to take a closer look at load_previous_and_future_frames.

radekosmulski (Contributor, Author):

Fixed things in a bunch of places across the repo, ran the tests, and fixed what I could.

Unfortunately, tests on main are failing as well. I am not sure what the failures are due to; I suspect some might be due to my env, but I'm not sure (I recreated a conda env and followed the instructions from the README, running pip install ".[aloha, pusht, xarm]").

radekosmulski (Contributor, Author) commented May 9, 2024

Hey @Cadene! Thank you very much for looking into this and for your comments, I appreciate it! 🙂

I didn't know the slice functionality existed in datasets; this is great to know! Also, I am 100% on board with simplification, not duplicating functionality, having less code, and keeping things decoupled! It's great that there might be a simpler way of achieving the above.

I started looking at the slice functionality, and here are a couple of thoughts (I might not be seeing the whole picture here, though, so some of this might be off the mark; apologies).

It is absolutely fantastic that we can do dataset surgery as follows; it gives a lot of flexibility to the user:

```python
import datasets

ds1 = datasets.load_dataset("lerobot/aloha_sim_transfer_cube_scripted", split="train[:10%]")
ds2 = datasets.load_dataset("lerobot/aloha_sim_transfer_cube_scripted", split="train[:15%]")
ds3 = datasets.load_dataset("lerobot/aloha_sim_transfer_cube_scripted", split="train[-25%:]")
```

When we load a partial dataset, the episode_data_index goes out of sync:
[screenshot showing the stale episode_data_index]

It gets loaded from metadata. One could argue that other things (like stats, fps, and the video flag) are properties valid across the whole dataset or arbitrary subsets of it, but that is not the case for the episode_data_index. (Yes, one could also argue that using whole-dataset stats when splitting into train and validation sets introduces leakage, but that is rarely relevant in practice and probably not relevant to how people will want to use this functionality.)

So the only piece of metadata that is dependent on what subset gets loaded is episode_data_index.

My reasoning:

  • storing episode_data_index is an extra step that needs to be taken when writing / loading the dataset
  • it adds conceptual ("huh, where does this come from?") complexity and code complexity (it needs to be saved and then loaded)

Would it not be better to calculate this when creating a LeRobotDataset? It would be ideal if the LeRobotDataset could work straight off a vanilla hf_dataset, but I understand that given that we want to store info like fps, the video flag, etc, this might not be possible. But at least we can limit the amount of new stuff that gets introduced, to streamline this.

If we move to calculating the episode_data_index when the data is loaded, we would maintain backwards compatibility (the new version would just not use metadata/episode_data_index.safetensors), and the LeRobotDataset would work with a slice passed to it in the split!

How the episode_data_index gets created would be explicit (we can make it a cached calculated property or set it at loading, whichever would be simpler).

I think it might also make creating and saving a LeRobotDataset simpler if the functionality for creating the episode_data_index would be centralized to a single spot in the codebase and dynamic.

Anyhow, let me know your thoughts, please 🙂 Happy to start chipping away on the above or otherwise implement conditional recalculation of the episode_data_index when a slice is loaded (or some other alternative).

@radekosmulski radekosmulski marked this pull request as draft May 10, 2024 08:20
@radekosmulski radekosmulski marked this pull request as ready for review May 10, 2024 09:04
@radekosmulski radekosmulski requested a review from Cadene May 10, 2024 09:20
@aliberts added the 🗃️ Dataset (Something dataset-related) and ✨ Enhancement (New feature or request) labels May 12, 2024
Cadene (Collaborator) commented May 12, 2024

@radekosmulski thanks for all your work!

My take is that when split != "train" is provided, we should recalculate episode_data_index, and maybe cache the result somewhere if it takes too much time to calculate. I also agree that we don't care about stats being computed on the full training set. It's really a minor detail, since our "true testing set" is running the robot in the real world, which is always out of distribution anyway.

To update episode_data_index you would need to get all episode_index and index values from hf_dataset. For all our datasets (because they are not super big) we can do this:

hf_dataset["episode_index"]
hf_dataset["index"]

Note: if the datasets were really, really big, we would have to use select_columns to avoid a RAM error. See this example:

hf_dataset["image"][0] # loads all your images in RAM and access the first one
hf_dataset[0]["image"] # loads a single item (with all its columns) in RAM
hf_dataset.select_columns("image")[0]["image"] # loads a single image in RAM
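
A minimal sketch of the recalculation being described, assuming episode_data_index is a dict of "from"/"to" tensors and that the episode_index column (returned as tensors under hf_transform_to_torch) comes in contiguous runs:

```python
import torch

def calculate_episode_data_index_sketch(hf_dataset):
    # Walk the flat episode_index column once, recording where each episode's
    # frames begin ("from") and end ("to", exclusive) in the loaded subset.
    episode_indices = [idx.item() for idx in hf_dataset["episode_index"]]
    if not episode_indices:
        return {"from": torch.tensor([]), "to": torch.tensor([])}
    froms, tos = [], []
    current = None
    for i, ep in enumerate(episode_indices):
        if ep != current:
            if current is not None:
                tos.append(i)
            froms.append(i)
            current = ep
    tos.append(len(episode_indices))
    return {"from": torch.tensor(froms), "to": torch.tensor(tos)}
```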

I think it would be great to merge this feature into main!

I am not sure we should merge the logic in train.py to evaluate on a validation set yet. I don't think it is critical, so I would like to keep our code as simple as possible. If you show that evaluating on a validation set correlates with success rate, then we should merge it to main. In the meantime, we could add a short "how to load/evaluate on a validation set" example inside the examples directory.

What do you think?

radekosmulski (Contributor, Author):

Hey @Cadene! Thank you for looking into this and for your guidance! 🙂 Your proposal sounds great!

I attempted mapping from the loaded split subset to the episode_data_index of the full dataset, but there seems to be not enough information in the downloaded chunk: the frame_index gets recalculated on the returned subset and is not the frame_index from the original dataset.

[screenshot of the recalculated frame_index]

As such, I implemented this as a recalculation if split != 'train'.

The example sounds great -- I will work on it but might have it ready later in the week, if that would be okay. It seems like it might require a bit more work, as I need to familiarize myself with other aspects of the library to do it right.

Happy to make any other changes that might be necessary, let me know, please! Also, we can merge this and I can open another PR for the example, or wait for the example to be ready, either one is okay on my end.

Screenshot of how it works now:
[screenshot of the new split behavior]
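
The screenshot itself is not preserved; based on the surrounding discussion, the end result presumably looks something like this (the import path and the exact split strings are assumptions):

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Loading a slice triggers the episode_data_index recalculation,
# since split != "train".
train_ds = LeRobotDataset("lerobot/aloha_sim_transfer_cube_scripted", split="train[:90%]")
val_ds = LeRobotDataset("lerobot/aloha_sim_transfer_cube_scripted", split="train[90%:]")
```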

Cadene (Collaborator) commented May 15, 2024

No worries! Thanks for this great contribution. Don't hesitate to message me. I can jump on a call if you have any trouble.

radekosmulski (Contributor, Author):

Hey @Cadene -- I think it all should be done and ready for your review, including the example 🙂

Cadene (Collaborator) left a comment:

Thanks again! One small iteration and we can merge.

Any chance we could add unit tests for calculate_episode_data_index and reset_episode_index?

Could we update test_examples.py to add your example?
https://github.com/huggingface/lerobot/blob/main/tests/test_examples.py#L55-L72
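
As a sketch, a unit test for calculate_episode_data_index might look like this (the import path and the exact fixture values are assumptions, chosen to mirror the fixtures discussed below):

```python
from datasets import Dataset

from lerobot.common.datasets.utils import calculate_episode_data_index, hf_transform_to_torch


def test_calculate_episode_data_index():
    # Three episodes of lengths 2, 1, and 3 in a flat six-frame dataset.
    hf_dataset = Dataset.from_dict(
        {
            "timestamp": [0.0, 0.1, 0.0, 0.0, 0.1, 0.2],
            "index": list(range(6)),
            "episode_index": [0, 0, 1, 2, 2, 2],
        }
    )
    hf_dataset.set_transform(hf_transform_to_torch)
    episode_data_index = calculate_episode_data_index(hf_dataset)
    assert episode_data_index["from"].tolist() == [0, 2, 3]
    assert episode_data_index["to"].tolist() == [2, 3, 6]
```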

(Outdated, resolved review threads on: examples/4_slice_dataset_and_calculate_loss.py, lerobot/common/datasets/utils.py)
radekosmulski (Contributor, Author):

Thank you very much for the review, @Cadene! 🙂 I do appreciate it.

Along the way, I slightly refactored test_examples.py and test_datasets.py. I like the changes, as they remove a good bit of code repetition, DRYing things up. On the other hand, an argument could be made that the tests were easier to follow when data creation was done inline. I feel we gained some code clarity with those changes, but I'm happy to revert them if that would be preferred!

Also, added unit tests for calculate_episode_data_index and reset_episode_index as requested.

Please let me know if any additional changes might be needed.

Cadene (Collaborator) left a comment:

Almost there, thanks for taking the time! It's really helpful.
I am holding off on approving right now to let you do the final iteration and merge ^^'

(Resolved thread on tests/test_utils.py)
Comment on lines 30 to 53
```python
import pytest
import torch
from datasets import Dataset

from lerobot.common.datasets.utils import hf_transform_to_torch


@pytest.fixture
def hf_dataset():
    dataset = Dataset.from_dict(
        {
            "timestamp": [t * 0.1 for t in range(1, 6)],
            "index": list(range(5)),
            "episode_index": [0] * 5,
        },
    )
    dataset.set_transform(hf_transform_to_torch)
    return dataset


@pytest.fixture
def hf_dataset_3_episodes():
    dataset = Dataset.from_dict(
        {
            "timestamp": [torch.tensor(t * 0.1) for t in range(6)],
            "index": [torch.tensor(idx) for idx in range(6)],
            "episode_index": [torch.tensor(0)] * 2 + [torch.tensor(1)] + [torch.tensor(2)] * 3,
        },
    )
    dataset.set_transform(hf_transform_to_torch)
    return dataset
```
Cadene (Collaborator) commented May 17, 2024:

As you guessed, I think it could be better to revert the refactor with the two fixtures. I think it interrupts the flow of the test.

For hf_dataset_3_episodes, I find it quite weird to have to add torch.tensor. Do you understand why? Would it be worth adding a comment explaining why? I can help you investigate.

I thought we were supposed to get torch tensors only when doing:

```python
hf_dataset = hf_dataset.with_format("torch")
```

which was not compatible with hf_dataset.set_transform(hf_transform_to_torch).

Maybe a version update of hugging face dataset changed this behavior?

radekosmulski (Contributor, Author):

Ah okay, I wrote this code before I figured out that hf_dataset.set_transform was used; it is a relic of the past that I should have removed! I fixed hf_dataset but missed fixing hf_dataset_3_episodes.

Reverted to the version with inline data creation.

pleasure working on this with you 🫡 🙂

@radekosmulski radekosmulski requested a review from Cadene May 17, 2024 01:36
Cadene (Collaborator) left a comment:

Good to go! Thanks ;) Really helpful.

If you are curious, we are tracking our progress on this project page: https://github.com/orgs/huggingface/projects/46/views/1
(I am not sure you know)

Feel free to reach out to us on Discord if you want to add an item to the list, or if you want to work on an item.

Thanks!

Comment on lines +121 to +128
```python
# Capture the output of the script
output_buffer = io.StringIO()
sys.stdout = output_buffer
exec(file_contents, {})
printed_output = output_buffer.getvalue()
# Restore stdout to its original state
sys.stdout = sys.__stdout__
assert "Average loss on validation set" in printed_output
```
Cadene (Collaborator):

wow interesting ;)

radekosmulski (Contributor, Author):

🙌

Thanks for the heads up on the project board -- great to know! And thank you for approving these changes!

I don't think it is letting me merge so you might have to do the honors and push the button 🙂

[screenshot of the disabled merge button]

radekosmulski (Contributor, Author):

Or maybe some setting on the repo needs to be toggled to allow people to do merges? Happy to test drive this with you if you'd like to look for a resolution.

@radekosmulski radekosmulski requested a review from Cadene May 17, 2024 13:24
Cadene (Collaborator) commented May 17, 2024

@radekosmulski

```bash
pre-commit install
pre-commit run --all-files
```

@radekosmulski force-pushed the add_split_by_episode branch 2 times, most recently from 5dfe063 to afcba90, May 17, 2024 22:01
…face#158)

* LeRobotDataset now recalculates episode_data_index when loading dataset subset
* add an example for calculating validation loss and showcasing the new
  functionality
* add unit tests
* refactor test_examples.py for readability
radekosmulski (Contributor, Author):

@Cadene pre-commit ran and formatting fixed 🙂

I also squashed the commits, but I'm not sure if that was necessary?

@Cadene Cadene merged commit 9b62c25 into huggingface:main May 20, 2024
HalvardBariller pushed a commit to HalvardBariller/lerobot that referenced this pull request May 21, 2024
Labels: 🗃️ Dataset (Something dataset-related), ✨ Enhancement (New feature or request)
Projects: Status: Done
3 participants