GluonTS model 'hangs' (on second template?) #215

Open
emobs opened this issue Dec 19, 2023 · 2 comments

@emobs

emobs commented Dec 19, 2023

This model template, grabbed from the current model file when the script hangs, is UnivariateMotif:
{"model_number": 0, "model_name": "UnivariateMotif", "model_param_dict": {"window": 10, "point_method": "weighted_mean", "distance_metric": "sqeuclidean", "k": 5, "max_windows": 10000}, "model_transform_dict": {"fillna": "rolling_mean", "transformations": {"0": "bkfilter", "1": "AlignLastDiff", "2": "AlignLastValue"}, "transformation_params": {"0": {}, "1": {"rows": 90, "displacement_rows": 4, "quantile": 0.9, "decay_span": 90}, "2": {"rows": 1, "lag": 1, "method": "additive", "strength": 0.7, "first_value_only": false}}}}

However, I'm quite sure it's GluonTS that's causing the stalling. I ran this training 3 times in a row, and it hung at epoch 1/40, 2/40, and 26/40 respectively, for some reason. No error message is thrown, and I can't interrupt the model either to make it move on to the next one (model_interrupt=True). I'm sure it's not a memory or CPU resource issue, because I monitored and verified that all 3 times as well. It just stops somewhere during the 40-epoch run; that's all I know.
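If it helps, this is roughly how I could try to reproduce the hang with just that template in isolation, outside the full search (a rough sketch, assuming I'm reading AutoTS's model_forecast helper correctly; df and forecast_length are placeholders for my actual training data and horizon):

# Rough sketch: re-run only the suspect template outside the full search,
# to see whether UnivariateMotif alone reproduces the hang.
# df is a placeholder for my wide-format training DataFrame (datetime index).
from autots import model_forecast

prediction = model_forecast(
    model_name="UnivariateMotif",
    model_param_dict={
        "window": 10,
        "point_method": "weighted_mean",
        "distance_metric": "sqeuclidean",
        "k": 5,
        "max_windows": 10000,
    },
    model_transform_dict={
        "fillna": "rolling_mean",
        "transformations": {"0": "bkfilter", "1": "AlignLastDiff", "2": "AlignLastValue"},
        "transformation_params": {
            "0": {},
            "1": {"rows": 90, "displacement_rows": 4, "quantile": 0.9, "decay_span": 90},
            "2": {"rows": 1, "lag": 1, "method": "additive", "strength": 0.7, "first_value_only": False},
        },
    },
    df_train=df,
    forecast_length=90,  # placeholder for my actual forecast_length
    frequency="infer",
    prediction_interval=0.9,
    verbose=2,
)
print(prediction.forecast.head())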

Here's the console output from the latest run:

100%|█████████████████████| 50/50 [00:18<00:00,  2.64it/s, epoch=149/150, avg_epoch_loss=-9.68]
100%|█████████████████████| 50/50 [00:18<00:00,  2.68it/s, epoch=150/150, avg_epoch_loss=-9.65]
100%|██████████████████████████| 50/50 [00:18<00:00,  2.75it/s, epoch=1/40, avg_epoch_loss=4.8]
100%|█████████████████████████| 50/50 [00:18<00:00,  2.77it/s, epoch=2/40, avg_epoch_loss=2.65]
0%|                                                                   | 0/50 [00:00<?, ?it/s]

and from the root log file:

INFO:gluonts.trainer:Epoch[149] Learning rate is 4.8828125e-07
INFO:gluonts.trainer:Epoch[149] Elapsed time 18.687 seconds
INFO:gluonts.trainer:Epoch[149] Evaluation metric 'epoch_loss'=-9.647407
INFO:root:Computing averaged parameters.
INFO:root:Loading averaged parameters.
INFO:gluonts.trainer:End model training
WARNING:gluonts.time_feature.seasonality:Multiple 5 does not divide base seasonality 1. Falling back to seasonality 1.
INFO:gluonts.mx.model.wavenet._estimator:Using dilation depth 4 and receptive field length 16
INFO:gluonts.trainer:Start model training
INFO:gluonts.trainer:Epoch[0] Learning rate is 0.001
INFO:gluonts.trainer:Number of parameters in WaveNetTraining: 74749
INFO:gluonts.trainer:Epoch[0] Elapsed time 18.185 seconds
INFO:gluonts.trainer:Epoch[0] Evaluation metric 'epoch_loss'=4.803844
INFO:gluonts.trainer:Epoch[1] Learning rate is 0.001
INFO:gluonts.trainer:Epoch[1] Elapsed time 18.023 seconds
INFO:gluonts.trainer:Epoch[1] Evaluation metric 'epoch_loss'=2.650010
INFO:gluonts.trainer:Epoch[2] Learning rate is 0.001

Sorry, this is all the information I have; I hope it's of use in pinpointing the issue. I'm running 0.6.5, by the way, on CPU only.

@winedarksea
Owner

Do you know if GluonTS is using the PyTorch or MXNet backend? It looks like that is MXNet, with the WaveNet model.
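If it's unclear which one is installed, a quick standard-library check (nothing AutoTS-specific, just seeing what's importable in that environment) would be something like:

# Check which GluonTS backends are importable in the environment.
# Standard library only; if mxnet is present, the mxnet-based estimators
# (like the WaveNet run in the log above) are the likely default.
import importlib.util

for pkg in ("mxnet", "torch", "gluonts"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'installed' if found else 'not found'}")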

It would be really weird for current_model_file not to be correct as it is written before the model starts to train.

It's a hard issue to debug without more clarity, but I'll see if I can find something.

@emobs
Author

emobs commented Dec 19, 2023

PyTorch is installed in the environment as well, but I didn't explicitly set GluonTS to use it, so I suppose it's running on the default MXNet backend.

If the current model file is written before the model starts to train, then I would say it's the UnivariateMotif model template causing the stalling; however, that doesn't seem to match the log output, does it?

Sorry I don't have more information to share. If you need me to run it again to get more debugging information, please let me know (and how to retrieve the extra debug info :) ).
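In the meantime, a workaround I can try on my side (a sketch, assuming I have the AutoTS arguments and the model_lists preset import right) is to drop GluonTS from the model list so the search keeps moving:

# Workaround sketch: exclude GluonTS so a stuck mxnet training run
# can't stall the whole template search.
# forecast_length and df are placeholders for my actual setup.
from autots import AutoTS
from autots.models.model_list import model_lists  # assumed location of the preset lists

model_list = [m for m in model_lists["default"] if m != "GluonTS"]

model = AutoTS(
    forecast_length=90,    # placeholder
    frequency="infer",
    model_list=model_list,
    model_interrupt=True,  # already set; lets Ctrl+C skip a single stuck model
    verbose=1,
)
model = model.fit(df)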
