GluonTS model 'hangs' (on second template?) #215

Open
emobs opened this issue Dec 19, 2023 · 2 comments

@emobs

emobs commented Dec 19, 2023

This model template, grabbed from the current model file when the script hangs, is UnivariateMotif:
{"model_number": 0, "model_name": "UnivariateMotif", "model_param_dict": {"window": 10, "point_method": "weighted_mean", "distance_metric": "sqeuclidean", "k": 5, "max_windows": 10000}, "model_transform_dict": {"fillna": "rolling_mean", "transformations": {"0": "bkfilter", "1": "AlignLastDiff", "2": "AlignLastValue"}, "transformation_params": {"0": {}, "1": {"rows": 90, "displacement_rows": 4, "quantile": 0.9, "decay_span": 90}, "2": {"rows": 1, "lag": 1, "method": "additive", "strength": 0.7, "first_value_only": false}}}}

However, I'm quite sure it's GluonTS that's causing the stalling. I ran this training 3 times in a row, and it hung at epoch 1/40, 2/40, and 26/40 respectively, for some reason. No error message is thrown, and I can't interrupt the model either to make it move on to the next one (model_interrupt=True). I'm sure it's not a memory or CPU resource issue, because I monitored and verified that all 3 times as well. It just stops somewhere during the 40-epoch run; that's all I know.
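If it helps, this is roughly how I could try to reproduce the hang with just that template in isolation, outside the full search (a rough sketch, assuming I'm reading AutoTS's model_forecast helper correctly; df and forecast_length are placeholders for my actual training data and horizon):

# Rough sketch: re-run only the suspect template outside the full search,
# to see whether UnivariateMotif alone reproduces the hang.
# df is a placeholder for my wide-format training DataFrame (datetime index).
from autots import model_forecast

prediction = model_forecast(
    model_name="UnivariateMotif",
    model_param_dict={
        "window": 10,
        "point_method": "weighted_mean",
        "distance_metric": "sqeuclidean",
        "k": 5,
        "max_windows": 10000,
    },
    model_transform_dict={
        "fillna": "rolling_mean",
        "transformations": {"0": "bkfilter", "1": "AlignLastDiff", "2": "AlignLastValue"},
        "transformation_params": {
            "0": {},
            "1": {"rows": 90, "displacement_rows": 4, "quantile": 0.9, "decay_span": 90},
            "2": {"rows": 1, "lag": 1, "method": "additive", "strength": 0.7, "first_value_only": False},
        },
    },
    df_train=df,
    forecast_length=90,  # placeholder for my actual forecast_length
    frequency="infer",
    prediction_interval=0.9,
    verbose=2,
)
print(prediction.forecast.head())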

Here's the console output from the latest run:

100%|█████████████████████| 50/50 [00:18<00:00,  2.64it/s, epoch=149/150, avg_epoch_loss=-9.68]
100%|█████████████████████| 50/50 [00:18<00:00,  2.68it/s, epoch=150/150, avg_epoch_loss=-9.65]
100%|██████████████████████████| 50/50 [00:18<00:00,  2.75it/s, epoch=1/40, avg_epoch_loss=4.8]
100%|█████████████████████████| 50/50 [00:18<00:00,  2.77it/s, epoch=2/40, avg_epoch_loss=2.65]
0%|                                                                   | 0/50 [00:00<?, ?it/s]

and from the root log file:

INFO:gluonts.trainer:Epoch[149] Learning rate is 4.8828125e-07
INFO:gluonts.trainer:Epoch[149] Elapsed time 18.687 seconds
INFO:gluonts.trainer:Epoch[149] Evaluation metric 'epoch_loss'=-9.647407
INFO:root:Computing averaged parameters.
INFO:root:Loading averaged parameters.
INFO:gluonts.trainer:End model training
WARNING:gluonts.time_feature.seasonality:Multiple 5 does not divide base seasonality 1. Falling back to seasonality 1.
INFO:gluonts.mx.model.wavenet._estimator:Using dilation depth 4 and receptive field length 16
INFO:gluonts.trainer:Start model training
INFO:gluonts.trainer:Epoch[0] Learning rate is 0.001
INFO:gluonts.trainer:Number of parameters in WaveNetTraining: 74749
INFO:gluonts.trainer:Epoch[0] Elapsed time 18.185 seconds
INFO:gluonts.trainer:Epoch[0] Evaluation metric 'epoch_loss'=4.803844
INFO:gluonts.trainer:Epoch[1] Learning rate is 0.001
INFO:gluonts.trainer:Epoch[1] Elapsed time 18.023 seconds
INFO:gluonts.trainer:Epoch[1] Evaluation metric 'epoch_loss'=2.650010
INFO:gluonts.trainer:Epoch[2] Learning rate is 0.001

Sorry, this is all the information I have; I hope it's of use in pinpointing the issue. I'm running 0.6.5, by the way, on CPU only.

@winedarksea
Owner

Do you know if GluonTS is using the PyTorch or MXNet backend? It looks like that is MXNet, with the WaveNet model.
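If it's unclear which one is installed, a quick standard-library check (nothing AutoTS-specific, just seeing what's importable in that environment) would be something like:

# Check which GluonTS backends are importable in the environment.
# Standard library only; if mxnet is present, the mxnet-based estimators
# (like the WaveNet run in the log above) are the likely default.
import importlib.util

for pkg in ("mxnet", "torch", "gluonts"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'installed' if found else 'not found'}")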

It would be really weird for current_model_file not to be correct as it is written before the model starts to train.

It's a hard issue to debug without more clarity, but I'll see if I can find something.

@emobs
Author

emobs commented Dec 19, 2023

PyTorch is installed in the environment as well, but I didn't explicitly set GluonTS to use it, so I suppose it's running on the default MXNet backend.

If the current model file is written before the model starts to train, then I would say it's the UnivariateMotif model template causing the stalling; however, that doesn't seem to match the log output, does it?

Sorry I don't have more information to share. If you need me to run it again to get more debugging information, please let me know (and how to retrieve the extra debug info :) ).
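In the meantime, a workaround I can try on my side (a sketch, assuming I have the AutoTS arguments and the model_lists preset import right) is to drop GluonTS from the model list so the search keeps moving:

# Workaround sketch: exclude GluonTS so a stuck mxnet training run
# can't stall the whole template search.
# forecast_length and df are placeholders for my actual setup.
from autots import AutoTS
from autots.models.model_list import model_lists  # assumed location of the preset lists

model_list = [m for m in model_lists["default"] if m != "GluonTS"]

model = AutoTS(
    forecast_length=90,    # placeholder
    frequency="infer",
    model_list=model_list,
    model_interrupt=True,  # already set; lets Ctrl+C skip a single stuck model
    verbose=1,
)
model = model.fit(df)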
