Adds split_by_episodes to LeRobotDataset #158
Conversation
@radekosmulski Super useful! We wanted to support an evaluation set. We didn't include it in the alpha release because the loss computed on an evaluation set usually doesn't correlate well with success rate in the real world. However, we want to support it in the longer term because it's an interesting metric. There might be an easier way to achieve this though, and we might already support it. See: https://huggingface.co/docs/datasets/v2.19.0/loading#slice-splits
if (
    isinstance(test_size, int)
    and (test_size >= self.num_episodes or test_size <= 0)
    or isinstance(test_size, float)
    and (test_size <= 0 or test_size >= 1)
):
    raise ValueError(
        f"test_size={test_size} should be either positive and smaller "
        f"than the number of samples {self.num_episodes} or a float in the (0, 1) range"
    )

if (
    isinstance(train_size, int)
    and (train_size >= self.num_episodes or train_size <= 0)
    or isinstance(train_size, float)
    and (train_size <= 0 or train_size >= 1)
):
    raise ValueError(
        f"train_size={train_size} should be either positive and smaller "
        f"than the number of samples {self.num_episodes} or a float in the (0, 1) range"
    )

if train_size is not None and not isinstance(train_size, (int, float)):
    raise ValueError(f"Invalid value for train_size: {train_size} of type {type(train_size)}")
if test_size is not None and not isinstance(test_size, (int, float)):
    raise ValueError(f"Invalid value for test_size: {test_size} of type {type(test_size)}")

if isinstance(train_size, float) and isinstance(test_size, float) and train_size + test_size > 1:
    raise ValueError(
        f"The sum of test_size and train_size = {train_size + test_size}, should be in the (0, 1)"
        " range. Reduce test_size and/or train_size."
    )

if isinstance(test_size, float):
    n_test = ceil(test_size * self.num_episodes)
elif isinstance(test_size, int):
    n_test = float(test_size)

if isinstance(train_size, float):
    n_train = floor(train_size * self.num_episodes)
elif isinstance(train_size, int):
    n_train = float(train_size)

if train_size is None:
    n_train = self.num_episodes - n_test
elif test_size is None:
    n_test = self.num_episodes - n_train

if n_train + n_test > self.num_episodes:
    raise ValueError(
        f"The sum of train_size and test_size = {n_train + n_test}, "
        "should be smaller than the number of "
        f"samples {self.num_episodes}. Reduce test_size and/or "
        "train_size."
    )

n_train, n_test = int(n_train), int(n_test)

if n_train == 0:
    raise ValueError(
        f"With self.num_episodes={self.num_episodes}, test_size={test_size} and train_size={train_size}, the "
        "resulting train set will be empty. Adjust any of the "
        "aforementioned parameters."
    )

if not shuffle:
    train_episode_indices = np.arange(n_train)
    test_episode_indices = np.arange(n_train, n_train + n_test)
else:
    permutation = np.random.permutation(self.num_episodes)
    test_episode_indices = permutation[:n_test]
    train_episode_indices = permutation[n_test : (n_test + n_train)]

train_indices = [
    idx
    for idx, episode_idx in enumerate(self.hf_dataset["episode_index"])
    if episode_idx.item() in train_episode_indices
]
test_indices = [
    idx
    for idx, episode_idx in enumerate(self.hf_dataset["episode_index"])
    if episode_idx.item() in test_episode_indices
]

train_split = LeRobotDataset.from_preloaded(
    repo_id=self.repo_id,
    version=self.version,
    root=self.root,
    split=self.split,
    transform=self.transform,
    delta_timestamps=self.delta_timestamps,
    hf_dataset=self.hf_dataset.select(indices=train_indices),
    stats=self.stats,
    info=self.info,
    videos_dir=self.videos_dir,
)
train_split.create_episode_data_index()

test_split = LeRobotDataset.from_preloaded(
    repo_id=self.repo_id,
    version=self.version,
    root=self.root,
    split=self.split,
    transform=self.transform,
    delta_timestamps=self.delta_timestamps,
    hf_dataset=self.hf_dataset.select(indices=test_indices),
    stats=self.stats,
    info=self.info,
    videos_dir=self.videos_dir,
)
test_split.create_episode_data_index()

return train_split, test_split
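The int/float size-resolution logic in the diff can be exercised in isolation. A minimal self-contained sketch (`resolve_split_sizes` is a hypothetical name, not part of the PR) that mirrors how `n_train` and `n_test` are derived:

```python
from math import ceil, floor

def resolve_split_sizes(num_episodes, train_size=None, test_size=None):
    # Hypothetical helper mirroring the PR's logic: floats are fractions of
    # the episode count, ints are absolute episode counts, and a missing
    # size is inferred from its counterpart.
    n_train = n_test = None
    if isinstance(test_size, float):
        n_test = ceil(test_size * num_episodes)
    elif isinstance(test_size, int):
        n_test = test_size
    if isinstance(train_size, float):
        n_train = floor(train_size * num_episodes)
    elif isinstance(train_size, int):
        n_train = train_size
    if train_size is None:
        n_train = num_episodes - n_test
    elif test_size is None:
        n_test = num_episodes - n_train
    return n_train, n_test

resolve_split_sizes(50, test_size=0.1)  # (45, 5): test rounds up, train takes the rest
```

Note the deliberate asymmetry copied from the diff (and from sklearn-style splitters): a fractional test size rounds up while a fractional train size rounds down, so the two never claim the same episode.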
I think it's possible to find the frame id of a certain episode in `episode_data_index` and use:
https://huggingface.co/docs/datasets/v2.19.0/loading#slice-splits
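For context, the slice-splits API linked above selects a contiguous percentage of a split at load time. A hedged sketch of the idea (the commented calls use a placeholder repo id; the helper below only illustrates the semantics and is not part of `datasets` — its boundary rounding may differ slightly from the library's):

```python
# Real API (requires network access, so shown as comments):
#   train = load_dataset("some/repo", split="train[:90%]")
#   val   = load_dataset("some/repo", split="train[90%:]")

def percent_slice(num_rows: int, start_pct: int, stop_pct: int) -> range:
    # Rough local illustration of which row indices "train[start%:stop%]" selects.
    return range(round(num_rows * start_pct / 100), round(num_rows * stop_pct / 100))

percent_slice(10, 0, 90)  # range(0, 9): the first nine of ten rows
```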
Hey @Cadene! I am not entirely sure I follow your plan with regards to looking up the frame id. More pointers would be appreciated.
But since I wrote the long comment I thought it might be easier if I just implemented in code what I had in mind.
Super simple, requires less code for saving and loading the dataset, and lets the user do whatever they'd like.
Mhmm, it seems there are a couple of other places in the code that rely on the episode index containing tensors...
Plus I need to take a closer look at load_previous_and_future_frames.
Fixed things in a bunch of places across the repo, ran tests, fixed what I could.
Unfortunately, tests on main are failing as well. I am not sure what the failures are due to; I suspect some might be due to my env, but I'm not sure (I recreated a conda env and followed the instructions from the readme, doing pip install ".[aloha, pusht, xarm]").
hey @Cadene! thank you very much for looking into this and your comments, appreciate it! 🙂 I didn't know of the slice functionality existing in datasets, this is great to know! Also, I am 100% onboard with simplification/not duplicating functionality/there being less code/things not being coupled together, etc! Great there might be a simpler way of achieving the above.

I started looking at the slice functionality and here are a couple of thoughts (I might not be seeing the whole picture here though, so some of this might be off the mark, apologies). It is absolutely fantastic that we can do dataset surgery as follows, gives a lot of flexibility to the user:

When we load a partial dataset the It gets loaded from So the only piece of metadata that is dependent on what subset gets loaded is My reasoning:

Would it not be better to calculate this when creating a If we move to calculating the How the I think it might also make creating and saving a

Anyhow, let me know your thoughts, please 🙂 Happy to start chipping away on the above or otherwise implement conditional recalculation of the
@radekosmulski thanks for all your work! My take is that when To update

Note: if the datasets were really really big we would have to use

hf_dataset["image"][0]  # loads all your images in RAM and accesses the first one
hf_dataset[0]["image"]  # loads a single item (with all its columns) in RAM
hf_dataset.select_columns("image")[0]["image"]  # loads a single image in RAM

I think it would be great to merge this feature into I am not sure we should merge the logic in What do you think?
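The distinction above is about access order against a columnar store. A toy counter (purely illustrative, not the `datasets` API) makes the cost of each pattern concrete:

```python
class ColumnarTable:
    # Toy stand-in for an Arrow-backed dataset; it counts decoded cells
    # to make the cost of each access pattern visible.
    def __init__(self, columns):
        self.columns = columns
        self.decoded = 0

    def _decode(self, value):
        self.decoded += 1  # stands in for image decoding / deserialization
        return value

    def __getitem__(self, key):
        if isinstance(key, str):
            # table["image"] materializes the whole column before you index it
            return [self._decode(v) for v in self.columns[key]]
        # table[0] decodes one cell per column (a single row)
        return {name: self._decode(col[key]) for name, col in self.columns.items()}

    def select_columns(self, name):
        return ColumnarTable({name: self.columns[name]})
```

With 100 rows and two columns, `table["image"][0]` decodes 100 cells, `table[0]["image"]` decodes two (one per column), and `table.select_columns("image")[0]["image"]` decodes exactly one.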
Hey @Cadene! Thank you for looking into this and for your guidance! 🙂 Your proposal sounds great!

I attempted mapping from the loaded split subset to the As such, I implemented this as a recalculation if

The example sounds great -- I will work on it but might have it ready later in the week if that would be okay. It seems like it might require a little bit more work as I need to familiarize myself with other aspects of the library to do it right.

Happy to make any other changes that might be necessary, let me know, please! Also, we can merge this and I can open another PR for the example, or wait for the example to be ready; either one is okay on my end.
No worries! Thanks for this great contribution. Don't hesitate to message me. I can jump on a call if you have any trouble.
Hey @Cadene -- I think it all should be done and ready for your review, including the example 🙂
Thanks again! One small iteration and we can merge.
Any chance we could add unit tests for calculate_episode_data_index and reset_episode_index?
Could we update test_examples.py to add your example?
https://github.com/huggingface/lerobot/blob/main/tests/test_examples.py#L55-L72
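For reference, calculate_episode_data_index builds a per-episode frame-range index. A self-contained sketch of the idea (plain lists here, where the real implementation works with tensors; the helper name below is illustrative, not the library's):

```python
def episode_ranges(episode_index):
    # For each episode, record the index of its first frame ("from") and
    # one past its last frame ("to"), assuming frames are grouped by episode.
    index = {"from": [], "to": []}
    prev = None
    for frame_idx, ep_idx in enumerate(episode_index):
        if ep_idx != prev:
            if prev is not None:
                index["to"].append(frame_idx)
            index["from"].append(frame_idx)
            prev = ep_idx
    if index["from"]:
        index["to"].append(len(episode_index))
    return index

episode_ranges([0, 0, 1, 2, 2, 2])  # {"from": [0, 2, 3], "to": [2, 3, 6]}
```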
Thank you very much for the review, @Cadene! 🙂 do appreciate it

Along the way, I slightly refactored Also, added unit tests for

Please let me know if any additional changes might be needed
Almost there, thanks for taking the time! It's really helpful.
I am hesitating to approve right now, to let you do the final iteration and merge ^^'
tests/conftest.py
@pytest.fixture
def hf_dataset():
    dataset = Dataset.from_dict(
        {
            "timestamp": [t * 0.1 for t in range(1, 6)],
            "index": range(5),
            "episode_index": [0] * 5,
        },
    )
    dataset.set_transform(hf_transform_to_torch)
    return dataset


@pytest.fixture
def hf_dataset_3_episodes():
    dataset = Dataset.from_dict(
        {
            "timestamp": [torch.tensor(t * 0.1) for t in range(6)],
            "index": [torch.tensor(idx) for idx in range(6)],
            "episode_index": [torch.tensor(0)] * 2 + [torch.tensor(1)] + [torch.tensor(2)] * 3,
        },
    )
    dataset.set_transform(hf_transform_to_torch)
    return dataset
As you guessed, I think it could be better to revert the refactor with the two fixtures. I think it interrupts the flow of the test.
For hf_dataset_3_episodes, I find it quite weird to have to add torch.tensor. Did you understand why? Would it be worth adding a comment explaining why? I can help you investigate.
I thought we were supposed to get torch tensors only when doing hf_dataset = hf_dataset.with_format("torch"), which was not compatible with hf_dataset.set_transform(hf_transform_to_torch).
Maybe a version update of the Hugging Face datasets library changed this behavior?
Ah okay, so I wrote this code before I figured out that hf_dataset.set_transform was used; it is a relic of the past that I should have removed! I fixed hf_dataset but missed fixing hf_dataset_3_episodes.
Reverted to the version with inline data creation.
Pleasure working on this with you 🫡 🙂
Good to go! Thanks ;) Really helpful.
If you are curious, we are tracking our progress on this project page: https://github.com/orgs/huggingface/projects/46/views/1
(I am not sure you know.)
Feel free to reach out to us on Discord if you want to add an item to the list, or if you want to work on an item.
Thanks!
# Capture the output of the script
output_buffer = io.StringIO()
sys.stdout = output_buffer
exec(file_contents, {})
printed_output = output_buffer.getvalue()
# Restore stdout to its original state
sys.stdout = sys.__stdout__
assert "Average loss on validation set" in printed_output
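As an aside, the same capture-and-restore can be expressed with contextlib.redirect_stdout, which puts sys.stdout back even if the executed script raises. A sketch (run_and_capture is an invented name, not what the test file uses):

```python
import contextlib
import io

def run_and_capture(source: str) -> str:
    # Execute a script's source in a fresh namespace and return what it printed.
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue()
```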
wow interesting ;)
or maybe some setting on the repo needs to be toggled to allow people to do merges? happy to test drive this with you if you'd like to look for a resolution
…face#158)
* LeRobotDataset now recalculates episode_data_index when loading dataset subset
* add an example for calculating validation loss and showcasing the new functionality
* add unit tests
* refactor test_examples.py for readability
@Cadene pre-commit ran and formatting fixed 🙂 I also squashed the commits, but not sure if that is necessary?
What this does
For instance, the ACT datasets have only a single split ("train") and it would be great to retain some portion of the data for calculating metrics during training.
But in general, this functionality can also be useful in many other ways (for instance, one thing I have been curious about is how performance increases with the size of the dataset -- ACT trains on just 50 episodes, but what if we had only 10? 20? What about ACT scripted vs recorded from a participant?).
In general, I find this functionality to be very useful for experimentation and so I went ahead and added it.
I am not entirely sure if there is interest in this functionality -- if yes, I'm happy to make any changes that might be required in order to merge this.
One thing that I feel would be very valuable to add here is specifying a seed for the split. I can add this to this PR no problem (or a follow-up one); just again -- not sure if there might be interest in this functionality from the maintainers, this is just to test the waters.
I drew inspiration from how train_test_split is implemented in datasets (some of the logic is verbatim from there), but the curveball here was dealing with mapping examples to episodes and the episode_data_index, so I had to add some functionality around this. This is an example of how this functionality might be used:
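The code example that originally followed did not survive the page capture. As a stand-in, here is a toy sketch of the intended call shape; EpisodeSet is invented for illustration (the real class is LeRobotDataset, whose split_by_episodes has the richer validation shown in the diff above), and the seed parameter is the addition the description floats, not something the PR implements:

```python
import random

class EpisodeSet:
    # Toy stand-in for LeRobotDataset, just enough to demonstrate the API shape.
    def __init__(self, episodes):
        self.episodes = list(episodes)

    def split_by_episodes(self, test_size, shuffle=False, seed=None):
        n_test = round(len(self.episodes) * test_size)
        order = list(range(len(self.episodes)))
        if shuffle:
            random.Random(seed).shuffle(order)
        n_train = len(order) - n_test
        train_ids, test_ids = order[:n_train], order[n_train:]
        return (
            EpisodeSet(self.episodes[i] for i in sorted(train_ids)),
            EpisodeSet(self.episodes[i] for i in sorted(test_ids)),
        )

dataset = EpisodeSet(range(50))
train_ds, val_ds = dataset.split_by_episodes(test_size=0.1, shuffle=True, seed=0)
# train_ds keeps 45 episodes, val_ds keeps 5, with no episode in both
```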
Very new to the lib, not sure how this fits into the greater whole etc, etc, but excited to contribute if I can 🙂
How it was tested
I ran a couple of scenarios, caught and ironed out some obvious issues.
How to checkout & try? (for the reviewer)
Call split_by_episodes on an instance of a LeRobotDataset, try various parameter combinations (ints, floats), and verify that the properties set on the returned datasets look okay. Try other datasets.