
This problem is really tormenting me and I can't find a solution. Could an expert please take a look? #291

Open
iWangTing opened this issue Apr 11, 2024 · 14 comments

@iWangTing

Running the sh script always fails with an unrecognized argument, main.py: error: unrecognized arguments: --accum-freq=1, even though the script is identical to the example.

```
usage: main.py [-h] --train-data TRAIN_DATA [--val-data VAL_DATA] [--num-workers NUM_WORKERS] [--logs LOGS] [--name NAME] [--log-interval LOG_INTERVAL] [--report-training-batch-acc] [--batch-size BATCH_SIZE]
               [--valid-batch-size VALID_BATCH_SIZE] [--max-steps MAX_STEPS] [--max-epochs MAX_EPOCHS] [--valid-step-interval VALID_STEP_INTERVAL] [--valid-epoch-interval VALID_EPOCH_INTERVAL] [--context-length CONTEXT_LENGTH]
               [--lr LR] [--beta1 BETA1] [--beta2 BETA2] [--eps EPS] [--wd WD] [--warmup WARMUP] [--use-bn-sync] [--use-augment] [--skip-scheduler] [--save-epoch-frequency SAVE_EPOCH_FREQUENCY]
               [--save-step-frequency SAVE_STEP_FREQUENCY] [--resume RESUME] [--reset-optimizer] [--reset-data-offset] [--precision {amp,fp16,fp32}] [--vision-model {ViT-B-32,ViT-B-16,ViT-L-14,ViT-L-14-336,ViT-H-14,RN50}]
               [--mask-ratio MASK_RATIO] [--clip-weight-path CLIP_WEIGHT_PATH] [--freeze-vision] [--text-model {RoBERTa-wwm-ext-base-chinese,RoBERTa-wwm-ext-large-chinese,RBT3-chinese}] [--bert-weight-path BERT_WEIGHT_PATH]
               [--grad-checkpointing] [--local_rank LOCAL_RANK] [--skip-aggregate] [--debug] [--seed SEED]
main.py: error: unrecognized arguments: --accum-freq=1
[2024-04-11 23:52:11,183] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 2) local_rank: 0 (pid: 5808) of binary: /home/amax/.conda/envs/lxl/bin/python3
Traceback (most recent call last):
  File "/home/amax/.conda/envs/lxl/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/amax/.conda/envs/lxl/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/run.py", line 816, in <module>
    main()
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-11_23:52:11
  host      : amax
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 5808)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
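For context: argparse rejects only flags that were never registered with the parser, so this error usually means the main.py being launched comes from a checkout whose params.py predates the --accum-freq option. A quick check, as a minimal sketch assuming the repository path from the log:

```bash
cd /sdb1/lxl2/Chinese-CLIP-master
# If this prints nothing, the checked-out training code does not define the
# flag, which is exactly what argparse's "unrecognized arguments" indicates.
grep -n "accum-freq" cn_clip/training/params.py
```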
@ChesonHuang

Could you try the following command?

First cd sdb1/lxl2/Chinese-CLIP-master/

```
python cn_clip/training/main.py
--train-data=${train_data}
--val-data=${val_data}
--resume=${resume}
${reset_data_offset}
${reset_optimizer}
--logs=${output_base_dir}
--name=${name}
--save-step-frequency=${save_step_frequency}
--save-epoch-frequency=${save_epoch_frequency}
--log-interval=${log_interval}
${report_training_batch_acc}
--context-length=${context_length}
--warmup=${warmup}
--batch-size=${batch_size}
--valid-batch-size=${valid_batch_size}
--valid-step-interval=${valid_step_interval}
--valid-epoch-interval=${valid_epoch_interval}
--lr=${lr}
--accum_freq=${accum_freq}
--wd=${wd}
--max-epochs=${max_epochs}
--vision-model=${vision_model}
${use_augment}
--text-model=${text_model}
--grad-checkpointing
```
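Note that if the block above is pasted into a terminal or a script as-is, the shell runs each line as a separate command, and every line after the first fails with "command not found". To run it as a single command, each line except the last needs a trailing backslash; a minimal sketch, assuming the ${...} variables are defined earlier in the script:

```bash
python cn_clip/training/main.py \
    --train-data=${train_data} \
    --val-data=${val_data} \
    --logs=${output_base_dir} \
    --name=${name} \
    --lr=${lr} \
    --batch-size=${batch_size} \
    --max-epochs=${max_epochs} \
    --vision-model=${vision_model} \
    --text-model=${text_model}
    # plus the remaining flags from the list above, continued the same way
```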

You can check the cn_clip/training/params.py file and search for accum-freq to see whether this argument exists.

If you want to use distributed training, you can also check the processes with ps -ef | grep main.

@iWangTing
Author

> Could you try the following command?
>
> First cd sdb1/lxl2/Chinese-CLIP-master/, then run the python cn_clip/training/main.py command above.

```
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ python cn_clip/training/main.py
usage: main.py [-h] --train-data TRAIN_DATA [--val-data VAL_DATA] [--num-workers NUM_WORKERS] [--logs LOGS] [--name NAME] [--log-interval LOG_INTERVAL] [--report-training-batch-acc] [--batch-size BATCH_SIZE]
               [--valid-batch-size VALID_BATCH_SIZE] [--max-steps MAX_STEPS] [--max-epochs MAX_EPOCHS] [--valid-step-interval VALID_STEP_INTERVAL] [--valid-epoch-interval VALID_EPOCH_INTERVAL] [--context-length CONTEXT_LENGTH]
               [--lr LR] [--beta1 BETA1] [--beta2 BETA2] [--eps EPS] [--wd WD] [--warmup WARMUP] [--use-bn-sync] [--use-augment] [--skip-scheduler] [--save-epoch-frequency SAVE_EPOCH_FREQUENCY]
               [--save-step-frequency SAVE_STEP_FREQUENCY] [--resume RESUME] [--reset-optimizer] [--reset-data-offset] [--precision {amp,fp16,fp32}] [--vision-model {ViT-B-32,ViT-B-16,ViT-L-14,ViT-L-14-336,ViT-H-14,RN50}]
               [--mask-ratio MASK_RATIO] [--clip-weight-path CLIP_WEIGHT_PATH] [--freeze-vision] [--text-model {RoBERTa-wwm-ext-base-chinese,RoBERTa-wwm-ext-large-chinese,RBT3-chinese}] [--bert-weight-path BERT_WEIGHT_PATH]
               [--grad-checkpointing] [--local_rank LOCAL_RANK] [--skip-aggregate] [--debug] [--seed SEED]
main.py: error: the following arguments are required: --train-data
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --train-data=${train_data}
--train-data=: command not found
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --val-data=${val_data}
--val-data=: command not found
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --resume=${resume}
--resume=: command not found
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ ${reset_data_offset}
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ ${reset_optimizer}
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --logs=${output_base_dir}
--logs=: command not found
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --name=${name}
--name=: command not found
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --save-step-frequency=${save_step_frequency}
--save-step-frequency=: command not found
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --save-epoch-frequency=${save_epoch_frequency}
--save-epoch-frequency=: command not found
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --log-interval=${log_interval}
--log-interval=: command not found
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ ${report_training_batch_acc}
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --context-length=${context_length}
--context-length=: command not found
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --warmup=${warmup}
--warmup=: command not found
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --batch-size=${batch_size}
--batch-size=: command not found
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --valid-batch-size=${valid_batch_size}
--valid-batch-size=: command not found
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --valid-step-interval=${valid_step_interval}
--valid-step-interval=: command not found
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --valid-epoch-interval=${valid_epoch_interval}
--valid-epoch-interval=: command not found
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --lr=${lr}
--lr=: command not found
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --accum_freq=${accum_freq}
--accum_freq=: command not found
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --wd=${wd}
--wd=: command not found
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --max-epochs=${max_epochs}
--max-epochs=: command not found
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --vision-model=${vision_model}
--vision-model=: command not found
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ ${use_augment}
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --text-model=${text_model}
--text-model=: command not found
(lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --grad-checkpointing
--grad-checkpointing: command not found
```
Hello, the results of running it are above. Also, params.py does contain the accum-freq argument.

```
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ ps -ef | grep main
amax 7490 4067 0 12:41 pts/0 00:00:00 grep --color=auto main
```

@ChesonHuang

ChesonHuang commented Apr 12, 2024

> (lxl) amax@amax:/sdb1/lxl2/Chinese-CLIP-master$ --train-data=${train_data}
> --train-data=: command not found
> [...]
> Hello, the results of running it are above. Also, params.py does contain the accum-freq argument.


Use this command to replace the original torchrun command in your sh script, rather than running it directly in the terminal like that. For example, remove the part highlighted in green in the script below:
[image: clip]
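The screenshot ("clip") is not preserved in this text copy. As an assumption about what it showed: the example finetune scripts launch main.py through a distributed launcher line, and the suggestion is to strip that launcher prefix while keeping the flags, roughly:

```bash
# before: the torchrun launch line in the script (sketch; ${GPUS_PER_NODE} is a placeholder)
# torchrun --nproc_per_node=${GPUS_PER_NODE} cn_clip/training/main.py --train-data=${train_data} ...

# after: plain single-process invocation with the same training flags
python cn_clip/training/main.py --train-data=${train_data} # ...remaining flags unchanged
```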

@iWangTing
Author

> Use this command to replace the original torchrun command in your sh script, rather than running it directly in the terminal like that. For example, remove the part highlighted in green in the script: [image: clip]

```
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ bash /home/amax/sdb1/lxl2/Chinese-CLIP-master/run_scripts/B_finetune_vit-b-16_rbt-base.sh
Traceback (most recent call last):
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 16, in <module>
    from cn_clip.clip import load
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/clip/__init__.py", line 4, in <module>
    from .model import convert_state_dict
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/clip/model.py", line 16, in <module>
    FlashMHA = importlib.import_module('flash_attn.flash_attention').FlashMHA
  File "/home/amax/.conda/envs/lxl/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/flash_attn/flash_attention.py", line 7, in <module>
    from flash_attn.flash_attn_interface import flash_attn_unpadded_qkvpacked_func
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/flash_attn/flash_attn_interface.py", line 5, in <module>
    import flash_attn_cuda
ImportError: /home/amax/.conda/envs/lxl/lib/python3.9/site-packages/flash_attn_cuda.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE
```
This is the result after first running cd as you said and then replacing the command line in the script.

@ChesonHuang

> `/flash_attn_cuda.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalI`

Your linux-gnu .so has a dependency problem. Please refer to the similar fix in https://github.com/open-mmlab/mmdetection3d/issues/1152.
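An undefined torch symbol inside a compiled extension usually means the flash-attn binary was built against a different torch version than the one installed, an ABI mismatch rather than a missing file. A sketch of how to compare versions and rebuild against the current torch, assuming a compatible torch is available:

```bash
# compare the installed torch with what flash-attn was built for
python -c "import torch; print(torch.__version__, torch.version.cuda)"
pip show flash-attn
# rebuilding from source against the current torch often clears the symbol error
pip uninstall -y flash-attn
pip install flash-attn --no-build-isolation
```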

@iWangTing
Author

> Your linux-gnu .so has a dependency problem. Please refer to the similar fix in https://github.com/open-mmlab/mmdetection3d/issues/1152.

I tried the fix from issue 1152, but it still doesn't work. That issue seems to be about mmcv, whereas mine is about flash-attn.
I also looked through flash-attn's own issues for a fix, still without luck. It seems flash-attn only supports torch 1.12 and above, while mine is 1.10, and I don't even need flash-attn. How can I disable or ignore the flash-attn related code?
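Since flash-attn is not needed here, one option is to make the import optional. This is a hypothetical guard around the line named in the traceback (cn_clip/clip/model.py, line 16), not the repository's official fix:

```python
import importlib

# Fall back gracefully when flash_attn is absent or was built against an
# incompatible torch (both cases surface as ImportError).
try:
    FlashMHA = importlib.import_module('flash_attn.flash_attention').FlashMHA
except ImportError:
    FlashMHA = None  # code paths that rely on FlashMHA must then stay disabled
```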

@ChesonHuang

> [...] I don't even need flash-attn. How can I disable or ignore the flash-attn related code?

```
pip uninstall flash_attn
```
[image]

@iWangTing
Author

> pip uninstall flash_attn [image]

```
(lxl) amax@amax:~/sdb1/lxl2/Chinese-CLIP-master$ bash /home/amax/sdb1/lxl2/Chinese-CLIP-master/run_scripts/B_finetune_vit-b-16_rbt-base.sh
usage: main.py [-h] --train-data TRAIN_DATA [--val-data VAL_DATA] [--num-workers NUM_WORKERS] [--valid-num-workers VALID_NUM_WORKERS] [--logs LOGS] [--name NAME] [--log-interval LOG_INTERVAL] [--report-training-batch-acc]
               [--batch-size BATCH_SIZE] [--valid-batch-size VALID_BATCH_SIZE] [--max-steps MAX_STEPS] [--max-epochs MAX_EPOCHS] [--valid-step-interval VALID_STEP_INTERVAL] [--valid-epoch-interval VALID_EPOCH_INTERVAL]
               [--context-length CONTEXT_LENGTH] [--lr LR] [--beta1 BETA1] [--beta2 BETA2] [--eps EPS] [--wd WD] [--warmup WARMUP] [--use-bn-sync] [--use-augment] [--skip-scheduler] [--save-epoch-frequency SAVE_EPOCH_FREQUENCY]
               [--save-step-frequency SAVE_STEP_FREQUENCY] [--resume RESUME] [--reset-optimizer] [--reset-data-offset] [--precision {amp,fp16,fp32}] [--vision-model {ViT-B-32,ViT-B-16,ViT-L-14,ViT-L-14-336,ViT-H-14,RN50}]
               [--mask-ratio MASK_RATIO] [--clip-weight-path CLIP_WEIGHT_PATH] [--freeze-vision] [--text-model {RoBERTa-wwm-ext-base-chinese,RoBERTa-wwm-ext-large-chinese,RBT3-chinese}] [--bert-weight-path BERT_WEIGHT_PATH]
               [--grad-checkpointing] [--use-flash-attention] [--gather-with-grad] [--skip-aggregate] [--debug] [--seed SEED] [--distllation] [--teacher-model-name TEACHER_MODEL_NAME] [--kd_loss_weight KD_LOSS_WEIGHT]
               [--accum-freq ACCUM_FREQ]
main.py: error: unrecognized arguments: --accum_freq=1
```

Uh, after running the "first cd sdb1/lxl2/Chinese-CLIP-master/..." steps you described, the error above appeared. We're honestly back where we started.

@ChesonHuang

> accum_freq

In the shell script, change --accum_freq=xxx to --accum-freq=xxx.
[image]
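The usage text in the previous comment now lists --accum-freq ACCUM_FREQ, so the parser accepts only the hyphenated spelling; argparse stores it as args.accum_freq internally, which makes this mix-up easy. The one-line script change, as a sketch:

```bash
# before: rejected by argparse
--accum_freq=${accum_freq} \
# after: matches the option registered in params.py
--accum-freq=${accum_freq} \
```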

@iWangTing
Author

> In the shell script, change --accum_freq=xxx to --accum-freq=xxx. [image]

```
Traceback (most recent call last):
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 346, in <module>
    main()
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 51, in main
    args.local_device_rank = int(os.environ['LOCAL_RANK'])
  File "/home/amax/.conda/envs/lxl/lib/python3.9/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'LOCAL_RANK'
```

A new parameter problem has come up... Could you please take another look?

@ChesonHuang

> KeyError: 'LOCAL_RANK'
>
> A new parameter problem has come up... Could you please take another look?

Fix 1: in the shell script, add:
[image]

Fix 2: in main.py:
[image]
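The two screenshots are not preserved in this copy; the following is a plausible reconstruction of each fix, an assumption rather than the exact images:

```bash
# Fix 1 (shell script, sketch): define the rank before a single-process launch
export LOCAL_RANK=0
```

```python
# Fix 2 (main.py, sketch): default to rank 0 when no launcher set the variable
args.local_device_rank = int(os.environ.get('LOCAL_RANK', 0))
```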

@iWangTing
Author

> Fix 1: in the shell script, add: [image]
> Fix 2: in main.py: [image]

```
Traceback (most recent call last):
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 346, in <module>
    main()
  File "/home/amax/sdb1/lxl2/Chinese-CLIP-master/cn_clip/training/main.py", line 55, in main
    dist.init_process_group(backend="nccl")
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 224, in _env_rendezvous_handler
    world_size = int(_get_env_or_raise("WORLD_SIZE"))
  File "/home/amax/.conda/envs/lxl/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 203, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set
```
The problems just keep coming...

@ChesonHuang

> ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set
>
> The problems just keep coming...
The error says the environment variable is not set. You can configure environment variables like this: adding export WORLD_SIZE=xx is enough.
[image]
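For reference, torch.distributed's env:// rendezvous reads several variables that torchrun would normally set, not only WORLD_SIZE, so a run launched without torchrun typically needs all of them; a sketch with placeholder values for one process on one machine:

```bash
export WORLD_SIZE=1
export RANK=0
export LOCAL_RANK=0
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=29500
```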

@iWangTing
Author

> The error says the environment variable is not set. You can configure environment variables like this: adding export WORLD_SIZE=xx is enough. [image]

The main problems are now basically solved and I can start training. Thank you for your patient guidance over these past days; my gratitude is beyond words~ [fist bump]
