
Training large models using train.py #221

Open
haotianteng opened this issue May 4, 2024 · 0 comments
I tried to retrain the model on the PDBBind dataset. I ran the train.py script directly without any parameters, and it finished the whole training process very quickly (<30 minutes). I got the following report:

Epoch 399: Val inference rmsds_lt2 0.000 rmsds_lt5 0.000 min_rmsds_lt2 0.000 min_rmsds_lt5 0.000
Best Validation Loss 0.5782018661499023 on Epoch 387
Best inference metric 0.0 on Epoch 399
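For context on what those numbers mean, metrics named like rmsds_lt2 / rmsds_lt5 are conventionally the fraction of validation complexes whose ligand RMSD falls below the given threshold in Angstroms, so 0.000 would mean no pose landed under either cutoff. A minimal sketch of that convention (the function name and sample values are illustrative, not the project's actual code):

```python
# Sketch: "fraction below threshold" metrics in the style of rmsds_lt2 / rmsds_lt5.
# Assumes `rmsds` is a plain list of per-complex ligand RMSDs in Angstroms.
def fraction_below(rmsds, threshold):
    return sum(r < threshold for r in rmsds) / len(rmsds)

rmsds = [1.2, 3.4, 0.8, 6.1]  # hypothetical validation RMSDs
print(fraction_below(rmsds, 2.0))  # 0.5
print(fraction_below(rmsds, 5.0))  # 0.75
```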

So my first question is about the training result: does this indicate a successful run? Why are RMSD metrics of 0 reported? I also don't think the training loss decreased much. For example, this is the training loss at the beginning of training:
Epoch 24: Training loss 0.8713 tr 0.2144 rot 1.2553 tor 1.1707 sc 0.0000 lr 0.0010
And this is the training loss in the middle of training:
Training loss 0.9436 tr 0.5029 rot 1.4023 tor 0.9541 sc 0.0000 lr 0.0010
So as you can see, the training loss didn't change much. Is that normal behavior?
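To compare losses across the whole run rather than two samples, the per-epoch total loss can be pulled out of the log with a small parser. This is a sketch; the regex is an assumption based on the two log lines quoted above (the middle-of-training line is given a hypothetical epoch number here, since the original omits it):

```python
import re

# Sketch: extract (epoch, total training loss) pairs from log lines of the
# form shown above; the pattern assumes "Epoch N: Training loss X ..." lines.
pattern = re.compile(r"Epoch (\d+): Training loss ([\d.]+)")

log = """Epoch 24: Training loss 0.8713 tr 0.2144 rot 1.2553 tor 1.1707 sc 0.0000 lr 0.0010
Epoch 200: Training loss 0.9436 tr 0.5029 rot 1.4023 tor 0.9541 sc 0.0000 lr 0.0010"""

losses = {int(epoch): float(loss) for epoch, loss in pattern.findall(log)}
print(losses)  # {24: 0.8713, 200: 0.9436}
```

Plotting that dictionary over all epochs makes it easy to see whether the loss plateaued early or never moved at all.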

Also, I found that training with train.py under the default settings only produces a model of about 3GB per GPU (across 2 GPUs). What settings should I change to train the large model instead of this small one? Do you have a model configuration file for the large model?
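One way to check which variant was actually trained is to compare parameter counts. In PyTorch this is usually `sum(p.numel() for p in model.parameters())`; the same arithmetic can be sketched from layer shapes alone (the layer sizes below are purely illustrative, not the repo's architecture):

```python
# Sketch: estimate parameter count and fp32 memory for a stack of linear
# layers from their (in_features, out_features) shapes. Hypothetical sizes.
layers = [(256, 1024), (1024, 1024), (1024, 256)]

n_params = sum(i * o + o for i, o in layers)  # weights + biases
size_mb = n_params * 4 / 1e6                  # 4 bytes per fp32 parameter
print(n_params, f"{size_mb:.1f} MB")  # 1575168 6.3 MB
```

Comparing this number between your checkpoint and the released large model would confirm whether the default config builds the small variant.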
