Error during finetune, and the Traceback appears to be truncated, so the failing thread cannot be located #301

Open
wrtppp opened this issue Apr 18, 2024 · 3 comments

wrtppp commented Apr 18, 2024

(torch) ppop@DESKTOP-NMJBJQC:~/Chinese-CLIP$ sudo bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ~/Chinese-CLIP/datapath
Loading vision model config from cn_clip/clip/model_configs/ViT-L-14.json
Loading text model config from cn_clip/clip/model_configs/RoBERTa-wwm-ext-base-chinese.json
2024-04-18,22:23:46 | INFO | Rank 0 | train LMDB file contains 35000 images and 105000 pairs.
2024-04-18,22:23:46 | INFO | Rank 0 | val LMDB file contains 7500 images and 22500 pairs.
2024-04-18,22:23:46 | INFO | Rank 0 | Params:
2024-04-18,22:23:46 | INFO | Rank 0 | accum_freq: 1
2024-04-18,22:23:46 | INFO | Rank 0 | aggregate: True
2024-04-18,22:23:46 | INFO | Rank 0 | batch_size: 128
2024-04-18,22:23:46 | INFO | Rank 0 | bert_weight_path: None
2024-04-18,22:23:46 | INFO | Rank 0 | beta1: 0.9
2024-04-18,22:23:46 | INFO | Rank 0 | beta2: 0.98
2024-04-18,22:23:46 | INFO | Rank 0 | checkpoint_path: /home/ppop/Chinese-CLIP/datapath/experiments/muge_finetune_vit-H-14_roberta-base_bs128_1gpu/checkpoints
2024-04-18,22:23:46 | INFO | Rank 0 | clip_weight_path: None
2024-04-18,22:23:46 | INFO | Rank 0 | context_length: 52
2024-04-18,22:23:46 | INFO | Rank 0 | debug: False
2024-04-18,22:23:46 | INFO | Rank 0 | device: cuda:0
2024-04-18,22:23:46 | INFO | Rank 0 | distllation: False
2024-04-18,22:23:46 | INFO | Rank 0 | eps: 1e-06
2024-04-18,22:23:46 | INFO | Rank 0 | freeze_vision: False
2024-04-18,22:23:46 | INFO | Rank 0 | gather_with_grad: False
2024-04-18,22:23:46 | INFO | Rank 0 | grad_checkpointing: False
2024-04-18,22:23:46 | INFO | Rank 0 | kd_loss_weight: 0.5
2024-04-18,22:23:46 | INFO | Rank 0 | local_device_rank: 0
2024-04-18,22:23:46 | INFO | Rank 0 | log_interval: 1
2024-04-18,22:23:46 | INFO | Rank 0 | log_level: 20
2024-04-18,22:23:46 | INFO | Rank 0 | log_path: /home/ppop/Chinese-CLIP/datapath/experiments/muge_finetune_vit-H-14_roberta-base_bs128_1gpu/out_2024-04-18-14-23-43.log
2024-04-18,22:23:46 | INFO | Rank 0 | logs: /home/ppop/Chinese-CLIP/datapath/experiments/
2024-04-18,22:23:46 | INFO | Rank 0 | lr: 5e-05
2024-04-18,22:23:46 | INFO | Rank 0 | mask_ratio: 0
2024-04-18,22:23:46 | INFO | Rank 0 | max_epochs: 3
2024-04-18,22:23:46 | INFO | Rank 0 | max_steps: 2463
2024-04-18,22:23:46 | INFO | Rank 0 | name: muge_finetune_vit-H-14_roberta-base_bs128_1gpu
2024-04-18,22:23:46 | INFO | Rank 0 | num_workers: 4
2024-04-18,22:23:46 | INFO | Rank 0 | precision: amp
2024-04-18,22:23:46 | INFO | Rank 0 | rank: 0
2024-04-18,22:23:46 | INFO | Rank 0 | report_training_batch_acc: True
2024-04-18,22:23:46 | INFO | Rank 0 | reset_data_offset: False
2024-04-18,22:23:46 | INFO | Rank 0 | reset_optimizer: False
2024-04-18,22:23:46 | INFO | Rank 0 | resume: /home/ppop/Chinese-CLIP/datapath/pretrained_weights/clip_cn_vit-l-14.pt
2024-04-18,22:23:46 | INFO | Rank 0 | save_epoch_frequency: 1
2024-04-18,22:23:46 | INFO | Rank 0 | save_step_frequency: 999999
2024-04-18,22:23:46 | INFO | Rank 0 | seed: 123
2024-04-18,22:23:46 | INFO | Rank 0 | skip_aggregate: False
2024-04-18,22:23:46 | INFO | Rank 0 | skip_scheduler: False
2024-04-18,22:23:46 | INFO | Rank 0 | teacher_model_name: None
2024-04-18,22:23:46 | INFO | Rank 0 | text_model: RoBERTa-wwm-ext-base-chinese
2024-04-18,22:23:46 | INFO | Rank 0 | train_data: /home/ppop/Chinese-CLIP/datapath/datasets/yyut/lmdb/train
2024-04-18,22:23:46 | INFO | Rank 0 | use_augment: False
2024-04-18,22:23:46 | INFO | Rank 0 | use_bn_sync: False
2024-04-18,22:23:46 | INFO | Rank 0 | use_flash_attention: False
2024-04-18,22:23:46 | INFO | Rank 0 | val_data: /home/ppop/Chinese-CLIP/datapath/datasets/yyut/lmdb/valid
2024-04-18,22:23:46 | INFO | Rank 0 | valid_batch_size: 128
2024-04-18,22:23:46 | INFO | Rank 0 | valid_epoch_interval: 1
2024-04-18,22:23:46 | INFO | Rank 0 | valid_num_workers: 1
2024-04-18,22:23:46 | INFO | Rank 0 | valid_step_interval: 150
2024-04-18,22:23:46 | INFO | Rank 0 | vision_model: ViT-L-14
2024-04-18,22:23:46 | INFO | Rank 0 | warmup: 100
2024-04-18,22:23:46 | INFO | Rank 0 | wd: 0.001
2024-04-18,22:23:46 | INFO | Rank 0 | world_size: 1
2024-04-18,22:23:46 | INFO | Rank 0 | Use GPU: 0 for training
2024-04-18,22:23:46 | INFO | Rank 0 | => begin to load checkpoint '/home/ppop/Chinese-CLIP/datapath/pretrained_weights/clip_cn_vit-l-14.pt'
2024-04-18,22:23:47 | INFO | Rank 0 | train LMDB file contains 35000 images and 105000 pairs.
2024-04-18,22:23:47 | INFO | Rank 0 | val LMDB file contains 7500 images and 22500 pairs.
Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/ppop/miniconda3/envs/torch/lib/python3.8/threading.py", line 932, in _bootstrap_inner

@JayChou404

Did you ever get this solved?

@meisa233

I'm hitting the same problem.

@meisa233

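I'm hitting the same problem.

I found that this also happens when the run exits normally; apart from that, everything else seems fine.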
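
For anyone who needs the full traceback from the failing thread: the log above was produced under Python 3.8, which added `threading.excepthook`. The following is a minimal sketch, not something taken from Chinese-CLIP itself. It assumes you can edit the training entry point and call the helper before any threads start. The helper name `install_thread_excepthook` and the log file name are hypothetical. It only records the complete traceback to a file so it survives a cut-off console; it does not address whatever raises the exception.

```python
import logging
import threading
import traceback


def install_thread_excepthook(log_path):
    """Write the full traceback of any uncaught thread exception to log_path.

    Hypothetical helper for illustration; requires Python 3.8+.
    """
    logger = logging.getLogger("thread_errors")
    handler = logging.FileHandler(log_path)
    logger.addHandler(handler)
    logger.setLevel(logging.ERROR)

    def hook(args):
        # args carries exc_type, exc_value, exc_traceback and thread.
        text = "".join(
            traceback.format_exception(
                args.exc_type, args.exc_value, args.exc_traceback
            )
        )
        name = args.thread.name if args.thread is not None else "<unknown>"
        logger.error("Uncaught exception in thread %s:\n%s", name, text)

    threading.excepthook = hook
    return hook
```

Calling `install_thread_excepthook("thread_errors.log")` once at startup should be enough; the next time a thread such as `Thread-1` dies, the file should contain the thread name and the whole traceback.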
