
Bugfix: WandbCallback uploads initial model checkpoint #30897

Conversation

mgerstgrasser (Contributor)

What does this PR do?

It seems that #30135 inadvertently makes WandbCallback always upload the initial model checkpoint to wandb, with no way to disable this and irrespective of model size. I think this is clearly not intended behaviour.

This also breaks training with FSDP, as apparently calling model.save_pretrained() messes things up for later model.forward() calls.

Fixes #30896, #30895
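
To make the report concrete, here is a minimal sketch of a training setup that would hit this code path. The model, dataset, and output directory are arbitrary placeholders for illustration; this is not taken from the issue itself.

```python
# Minimal setup that, per this report, triggers an upload of the full initial
# checkpoint to wandb as soon as training starts, even though WANDB_LOG_MODEL
# is unset. Model and dataset choices are placeholders.
import os

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

os.environ.pop("WANDB_LOG_MODEL", None)  # model logging was never requested

name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

dataset = load_dataset("imdb", split="train[:64]").map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", report_to="wandb", max_steps=1),
    train_dataset=dataset,
)
# On affected versions, WandbCallback.setup() saves the model to a temp dir and
# logs it as a wandb artifact before the first training step runs.
trainer.train()
```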

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@pacman100 @muellerzr @amyeroberts

@mgerstgrasser (Contributor, Author)

@pacman100 @muellerzr @amyeroberts
I think this might be high priority, actually. It seems that right now, just starting a training script with wandb logging will immediately upload a full model checkpoint to wandb. After just a few quick tests I'm at over 100GB in my wandb account, and getting a warning message.

@amyeroberts (Collaborator) left a comment

Hi @mgerstgrasser, thanks for opening a PR to address this!

Integrations such as WANDB aren't actively maintained by the transformers team, but instead by their original contributors / organization. With that in mind, this seems a reasonable change, but let's double-check with @parambharat why this was originally added.

Regarding FSDP, deferring to @muellerzr @pacman100 on expected behaviour.

```diff
@@ -812,7 +812,6 @@ def setup(self, args, state, model, **kwargs):
                         "initial_model": True,
                     },
                 )
-                model.save_pretrained(temp_dir)
```
Collaborator

Could you update the comment above?

Contributor (Author)

@amyeroberts Just to check, is this directed at me or at @parambharat ?

Collaborator

You :)

Contributor (Author)

Oh. You mean this one?
# log the initial model and architecture to an artifact
Now that I read this again, I'm not completely sure anymore - maybe uploading the model checkpoint is intentional after all? If so, it should still respect the WANDB_LOG_MODEL setting; in that case an extra if clause might be a better solution than removing this altogether.
I'm really not sure what the intent is, though - to me, uploading the later (trained) LoRA adapters makes much more sense than uploading the initial model, since the model you start from is usually already stored permanently elsewhere anyway...
For now I've updated the comment.
@parambharat - could you weigh in on what the intent is here?
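
For illustration, the "extra if clause" idea might look roughly like this. This is a sketch only; the helper name and structure are assumptions, not the actual integration code, though the WANDB_LOG_MODEL values mirror the settings discussed in this thread.

```python
import os
import tempfile
from pathlib import Path

import wandb


def maybe_log_initial_model(model, run):
    """Sketch: upload the initial checkpoint only if the user opted in via
    WANDB_LOG_MODEL. Helper name and structure are illustrative."""
    log_model = os.getenv("WANDB_LOG_MODEL", "false").lower()
    if log_model not in ("checkpoint", "end"):
        return  # no model uploads requested; skip the initial checkpoint

    artifact = wandb.Artifact(
        name=f"model-{run.id}", type="model", metadata={"initial_model": True}
    )
    with tempfile.TemporaryDirectory() as temp_dir:
        model.save_pretrained(temp_dir)      # write the checkpoint locally
        for f in Path(temp_dir).glob("*"):   # copy each file into the artifact
            if f.is_file():
                with artifact.new_file(f.name, mode="wb") as fa:
                    fa.write(f.read_bytes())
    run.log_artifact(artifact)
```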

Collaborator

I think the logging is intentional, but I agree that it might not be a good idea

@mgerstgrasser (Contributor, Author)

@amyeroberts As per discussion in the linked issue, I've added a check to keep this but make it optional. I still am unsure what the intended (or best) behavior here is. For now I've made it so that the initial model is logged if WANDB_LOG_MODEL is set to either checkpoint or end. I'm unsure about the end case though - uploading the initial model will double the bandwidth and storage impact compared to not doing so in this case, and I can see how that may be undesirable. That said, enabling the upload in the end case is closer to the original PR, so that seems like the "safer" option for now. (In the checkpoint case, uploading seems pretty safe - you'd already be doing so many times, so one more is a small impact, relatively speaking.)
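
For reference, these are the WANDB_LOG_MODEL values being discussed, as I understand the integration's settings (set before the Trainer is created):

```python
import os

# Upload every checkpoint the Trainer saves as a wandb artifact:
os.environ["WANDB_LOG_MODEL"] = "checkpoint"
# Or upload only the final model when training ends:
# os.environ["WANDB_LOG_MODEL"] = "end"
# Default behaviour - do not upload model weights at all:
# os.environ["WANDB_LOG_MODEL"] = "false"
```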

@amyeroberts (Collaborator)

> @amyeroberts As per discussion in the linked issue, I've added a check to keep this but make it optional. I still am unsure what the intended (or best) behavior here is. For now I've made it so that the initial model is logged if WANDB_LOG_MODEL is set to either checkpoint or end. I'm unsure about the end case though - uploading the initial model will double the bandwidth and storage impact compared to not doing so in this case, and I can see how that may be undesirable. That said, enabling the upload in the end case is closer to the original PR, so that seems like the "safer" option for now. (In the checkpoint case, uploading seems pretty safe - you'd already be doing so many times, so one more is a small impact, relatively speaking.)

Calling in @muellerzr for his opinion, as he has far more experience with the Trainer and what the canonical way to control this might be

@mgerstgrasser (Contributor, Author)

mgerstgrasser commented May 22, 2024

> Calling in @muellerzr for his opinion, as he has far more experience with the Trainer and what the canonical way to control this might be

Okay, if we are looking for a proper long-term solution, then two more thoughts:

  1. The current fix will still break FSDP if model upload is enabled. Maybe this won't affect too many people, since perhaps models large enough to require FSDP are also too large to be uploaded anyway, but still. Maybe someone more familiar with the internals like @muellerzr can see what the issue is. I suspect it's either that (a) there's a different way to save an FSDP model than just calling model.save_pretrained(), or (b) it's something like "you need to call model.forward() before calling model.save_pretrained() so that FSDP can do all its setup correctly". If it's (b), perhaps that fix should happen elsewhere in the trainer, rather than here, or even in accelerate?
  2. I'm not sure how all this works with PEFT/LoRA models. Does save_pretrained() save the entire model, or just the adapter? If the latter, I could see how uploading the base model once at the start of training could make sense, if only the adapters are uploaded later on.

Also, I see that in later on_save() and on_train_end() calls, the callback creates a fake Trainer object and then uses save_model() from there, instead of calling model.save_pretrained(). Does that make a difference for either of the above?
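
For reference, the pattern being described is roughly the following, paraphrased from memory of the callback rather than quoted verbatim; the point is that Trainer.save_model() routes through the Trainer's distributed-aware saving logic rather than a bare model.save_pretrained().

```python
import tempfile

from transformers import Trainer


def save_via_fake_trainer(args, model, tokenizer):
    # A throwaway Trainer is built only so its save_model() can be reused;
    # save_model() knows about the FSDP/DeepSpeed/TPU saving paths.
    fake_trainer = Trainer(args=args, model=model, tokenizer=tokenizer)
    with tempfile.TemporaryDirectory() as temp_dir:
        fake_trainer.save_model(temp_dir)
        # ...the files under temp_dir are then added to a wandb artifact.
```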

@SilSever

SilSever commented May 23, 2024

Thank you @mgerstgrasser!
I had the same issue and your fix worked for me!

@muellerzr (Contributor)

muellerzr commented May 23, 2024

> The current fix will still break FSDP if model upload is enabled. Maybe this won't affect too many people, since perhaps models large enough to require FSDP are also too large to be uploaded anyway, but still. Maybe someone more familiar with the internals like @muellerzr can see what the issue is. I suspect it's either that (a) there's a different way to save an FSDP model than just calling model.save_pretrained(), or (b) it's something like "you need to call model.forward() before calling model.save_pretrained() so that FSDP can do all its setup correctly". If it's (b), perhaps that fix should happen elsewhere in the trainer, rather than here, or even in accelerate?

Yes. IMO we should not be uploading models by default, especially with FSDP: generally if you're using FSDP we're already doing our own checkpointing locally, because otherwise we could be transferring and uploading hundreds of gigabytes of weights, which would slow things down a ton.

With FSDP you end up in one of two scenarios, depending on whether the user has a sharded state dict or not; PyTorch is generally moving towards the former nowadays, so each shard gets its own saved file.
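
To illustrate the two saving modes with plain PyTorch FSDP APIs, here is a generic sketch of gathering a full state dict on rank 0; this is not what the callback currently does, and the function and variable names are illustrative.

```python
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)


def gather_full_state_dict(wrapped_model):
    # FULL_STATE_DICT gathers all shards onto rank 0 (offloaded to CPU here);
    # the alternative, SHARDED_STATE_DICT, leaves each rank with its own shard file.
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(wrapped_model, StateDictType.FULL_STATE_DICT, cfg):
        state_dict = wrapped_model.state_dict()
    return state_dict  # non-empty only on rank 0 when rank0_only=True


# Rank 0 could then write a regular checkpoint, e.g.
# if torch.distributed.get_rank() == 0:
#     unwrapped_model.save_pretrained("out", state_dict=state_dict)
```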

IMO: we don't upload FSDP models, and we can either add that later, or only do so if a user specifically requests it.

> I'm not sure how all this works with PEFT/LoRA models. Does save_pretrained() save the entire model, or just the adapter? If the latter, I could see how uploading the base model once at the start of training could make sense, if only the adapters are uploaded later on.

Just the adapter. And again, if we have a large model that we're doing PEFT on, we're in a similar situation if we upload the base weights.

Personally, my take is that this shouldn't be a thing in general, unless the user has to explicitly enable it.
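
To illustrate the adapter-only point, a small sketch; the model name and LoRA settings are arbitrary placeholders.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
peft_model = get_peft_model(base, lora)

# Writes only adapter_config.json and the (small) adapter weights file;
# the base model weights are not included.
peft_model.save_pretrained("adapter-only")
```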

@amyeroberts (Collaborator)

@muellerzr @mgerstgrasser OK, shall we just completely remove uploading the model then? And then we can add this back in if it gets requested as a feature?

@mgerstgrasser (Contributor, Author)

> With FSDP you end up in one of two scenarios, depending on whether the user has a sharded state dict or not; PyTorch is generally moving towards the former nowadays, so each shard gets its own saved file.

Ah, no, I meant that apparently calling model.save_pretrained() here breaks FSDP training altogether, i.e. leads to a crash. See #30895

> IMO: we don't upload FSDP models, and we can either add that later, or only do so if a user specifically requests it.

But if we don't do this for FSDP anyway, then the above doesn't matter. :)

> OK, shall we just completely remove uploading the model then? And then we can add this back in if it gets requested as a feature?

Fine with me, of course. Alternatively I could add an additional configuration option specifically for uploading the initial model, but I agree that it's unlikely to be something that many people would use, given the tradeoffs.
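
Purely as a sketch of what such an opt-in might have looked like (this was not implemented - the thread settled on removing the upload instead - and the env var name is made up here for illustration):

```python
import os

# Hypothetical, never-implemented opt-in flag for uploading the untouched
# starting checkpoint; the name WANDB_LOG_INITIAL_MODEL is invented.
if os.getenv("WANDB_LOG_INITIAL_MODEL", "false").lower() == "true":
    pass  # save the initial checkpoint and log it as a wandb artifact
```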

@muellerzr (Contributor)

Yes, I agree let's remove it then bring it back if folks ask :)

@mgerstgrasser (Contributor, Author)

OK, I've reverted back to the version that just removes initial model upload altogether. @amyeroberts

@amyeroberts (Collaborator) left a comment

Thanks for fixing and iterating with us to find a solution @mgerstgrasser!

@amyeroberts (Collaborator)

@mgerstgrasser For the quality checks - could you rebase to include the recent pushes from main? That should resolve them.

@mgerstgrasser force-pushed the fix_wandb_always_uploading_initial_model branch from c41f120 to 66818c5 on May 23, 2024, 18:53
@muellerzr (Contributor) left a comment

Thanks so much!!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@amyeroberts merged commit 6657fb5 into huggingface:main on May 23, 2024
21 checks passed
itazap pushed a commit that referenced this pull request May 24, 2024
* fix wandb always uploading initial model

* Update comment.

* Optionally log initial model

* Revert "Optionally log initial model"

This reverts commit 9602cc1.
Successfully merging this pull request may close these issues.

WandbCallback always (!) uploads entire model checkpoint to wandb