use last instead of first bacalhau execution #913

Open
wants to merge 1 commit into
base: main

thetechnocrat-dev
Contributor

What type of PR is this?

  • 🐛 Bug Fix

Description

I noticed the following error in the logs, and jobs were not processing:

Error listing files in directory: ls: invalid path "": invalid ipfs pathunexpected error monitoring running jobs: ls: invalid path "": invalid ipfs path

I realized this was because some jobs had multiple Bacalhau executions, and the first was always a bid rejected for lack of capacity. In these cases we should always look at the most recent execution.

Example

State:
  CreateTime: "2024-03-08T15:46:41.301528202Z"
  Executions:
  - ComputeReference: e-46b17748-0ac3-4181-a252-1fe5e78bdc38
    CreateTime: "2024-03-08T15:46:41.308035083Z"
    DesiredState: 2
    JobID: 24df3836-3ad8-4402-91fa-b779595b7528
    NodeId: QmPui6hPRoktGhteDRUSNzrYceEc3R52nZqp82nbd4Kjiy
    PublishedResults: {}
    State: AskForBidRejected
    Status: 'this node does not have capacity to run the job ({CPU: 0.400000, Memory:
      2.5 GB, Disk: 1.7 TB, GPU: 0} requested but only {%!s(float64=3) %!s(uint64=12000000000)
      %!s(uint64=323702) %!s(uint64=1) []} is available). bid rejected'
    UpdateTime: "2024-03-08T15:46:41.430322833Z"
    Version: 3
  - ComputeReference: e-5cbdb18f-6390-49c6-8756-b7132759d9ba
    CreateTime: "2024-03-08T15:46:41.435599426Z"
    DesiredState: 2
    JobID: 24df3836-3ad8-4402-91fa-b779595b7528
    NodeId: QmVakTbjsKHKho6svUTw5Q5yqbojrhbrAAvcJyCscxyLwa
    PublishedResults:
      CID: QmUCD2RKAd8Q8CR8hmixFExGkshcFa6briXqveNeDw44Zu
      StorageSource: ipfs
    RunOutput:
      exitCode: 0
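
The selection rule described above can be sketched in Go as follows. This is a minimal illustration, not the diff in this PR: the `Execution` and `PublishedResults` types and the `latestExecution` and `exampleExecutions` helpers are hypothetical stand-ins that mirror only the fields visible in the example state, and the sketch assumes "most recent" means the latest CreateTime.

```go
package main

import (
	"fmt"
	"time"
)

// PublishedResults and Execution are hypothetical types that mirror only the
// fields shown in the example state; they are not Bacalhau's own model types.
type PublishedResults struct {
	CID           string
	StorageSource string
}

type Execution struct {
	ComputeReference string
	CreateTime       time.Time
	State            string
	PublishedResults PublishedResults
}

// latestExecution returns the execution with the most recent CreateTime.
// The boolean is false when there are no executions at all.
func latestExecution(execs []Execution) (Execution, bool) {
	if len(execs) == 0 {
		return Execution{}, false
	}
	latest := execs[0]
	for _, e := range execs[1:] {
		if e.CreateTime.After(latest.CreateTime) {
			latest = e
		}
	}
	return latest, true
}

// exampleExecutions rebuilds the two executions from the example state.
// The second execution's State is cut off in the example, so it is left empty.
func exampleExecutions() []Execution {
	mustParse := func(s string) time.Time {
		t, err := time.Parse(time.RFC3339Nano, s)
		if err != nil {
			panic(err)
		}
		return t
	}
	return []Execution{
		{
			ComputeReference: "e-46b17748-0ac3-4181-a252-1fe5e78bdc38",
			CreateTime:       mustParse("2024-03-08T15:46:41.308035083Z"),
			State:            "AskForBidRejected",
		},
		{
			ComputeReference: "e-5cbdb18f-6390-49c6-8756-b7132759d9ba",
			CreateTime:       mustParse("2024-03-08T15:46:41.435599426Z"),
			PublishedResults: PublishedResults{
				CID:           "QmUCD2RKAd8Q8CR8hmixFExGkshcFa6briXqveNeDw44Zu",
				StorageSource: "ipfs",
			},
		},
	}
}

func main() {
	e, ok := latestExecution(exampleExecutions())
	fmt.Println(ok, e.ComputeReference, e.PublishedResults.CID)
}
```

Taking the first execution here would yield an empty CID, which matches the `invalid path ""` error; taking the latest one yields the published CID.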

@acashmoney
Contributor

acashmoney commented Mar 11, 2024

Queued up 100 labsay jobs. All 100 Bacalhau jobs succeeded, but only 94/100 initially showed as succeeded on the app frontend. The other 6 stayed in the Running state indefinitely.

Received the following error, similar to the one @thetechnocrat-dev reported, for 2 of the 6 stalled jobs.

Error listing files in directory: ls: invalid path "": invalid ipfs pathunexpected error monitoring running jobs: ls: invalid path "": invalid ipfs path

Marking the 2 jobs as Failed allowed the other 4 jobs to process successfully, resulting in a final 98/100 success rate.
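
One possible reading of this result is that a job with an empty results path aborts the whole monitoring pass, so the jobs behind it never get checked. The sketch below shows the alternative of handling each job on its own. It is an assumption-laden illustration, not the app's code: the `Job` type, `checkResult`, and `processJobs` are hypothetical names, and "Completed" is assigned purely on a non-empty CID to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
)

// Job is a hypothetical record used only for this illustration.
type Job struct {
	ID        string
	ResultCID string
	Status    string
}

var errEmptyCID = errors.New("empty result CID")

// checkResult stands in for listing a job's output directory.
// It rejects an empty CID up front instead of passing "" to an IPFS ls call.
func checkResult(j Job) error {
	if j.ResultCID == "" {
		return errEmptyCID
	}
	return nil
}

// processJobs checks every job independently. A job whose result cannot be
// listed is marked Failed, and the loop continues with the remaining jobs.
func processJobs(jobs []Job) []Job {
	out := make([]Job, len(jobs))
	for i, j := range jobs {
		if err := checkResult(j); err != nil {
			j.Status = "Failed"
		} else {
			j.Status = "Completed"
		}
		out[i] = j
	}
	return out
}

func main() {
	jobs := []Job{
		{ID: "job-a", ResultCID: "", Status: "Running"},
		{ID: "job-b", ResultCID: "QmWfw7axWtYSUk4XWBvFDvAG3fbcneFyYRWDavthkLWMz7", Status: "Running"},
	}
	for _, j := range processJobs(jobs) {
		fmt.Println(j.ID, j.Status)
	}
}
```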

The 6 stalled jobs seem to have coincided with a scale-up from 1 CPU node to 3, and unexpected behavior of the jobs' NodeIds seems to contribute. Below is one of the 2 problematic jobs, which stalled despite a successful Bacalhau run:

 bacalhau describe 8e60e076-bc7a-4548-b4d5-93e943a171d7
Job:
  ...
State:
  CreateTime: "2024-03-11T21:55:57.362211708Z"
  Executions:
  - ComputeReference: e-be6e206b-9371-4fc0-833b-80183920a382
    CreateTime: "2024-03-11T21:55:57.368546836Z"
    DesiredState: 2
    JobID: 8e60e076-bc7a-4548-b4d5-93e943a171d7
    NodeId: QmQe4oJUqqCLfK2kbgT8omeufYcB837ryRTHcDpdtsDFrj
    PublishedResults:
      CID: QmWfw7axWtYSUk4XWBvFDvAG3fbcneFyYRWDavthkLWMz7
      StorageSource: ipfs
    RunOutput:
      exitCode: 0
      runnerError: ""
      stderr: ""
      stderrtruncated: false
      stdout: "Job Inputs: {'file_example': '/inputs/file_example/result.txt', 'number_example':
        54, 'speedup': True, 'string_example': '3hello world'}\n\n                                        @\n
        \                                @@@@@@@@@@@@@@@\n                               @@@@@@@@@@@@@@@@@@@\n
        \                             @@@@@@@@@@@@@@@@@@@@@\n             @@@@@@@@@@
        \     @@@@@@@@@@@@@@@@@@@@@@@      @@@@@@@@@@\n           @@@@@@@@@@@@      @@@@@@@@@@@@@@@@@@@@@@@
        \     @@@@@@@@@@@@\n         @@@@@@@@@@@@@@      @@@@@@@@@@@@@@@@@@@@@@@      @@@@@@@@@@@@@@\n
        \       *@@@@@@@@@@@@@      @@@@@@@@@@@@@@@@@@@@@@         @@@@@@@@@@@@@\n
        \        @@@@@@@@@@        @@@@@@@@@@@@@@@@@@@@@%            &@@@@@@@@@@\n
        \          @@@@           @@@@@@@@@@@@@@@@@@&                     @@@@\n                        @@@@@@@@\n
        \                  @@@@@@@@@\n      @@@@@@@@@@@@@@@@@@@@        ,@@@@@@@@@@@
        \                @@@@@@@@@@@@\n   @@@@@@@@@@@@@@@@@@@@@@       @@@@@@@@@@@@@@@@@
        \          @@@@@@@@@@@@@@@@@@\n  @@@@@@@@@@@@@@@@@@@@@@      @@@@@@@@@@@@@@@@@@@@@
        \      @@@@@@@@@@@@@@@@@@@@@\n @@@@@@@@@@@@@@@@@@@@@@@     @@@@@@@@@@@@@@@@@@@@@@@
        \     @@@@@@@@@@@@@@@@@@@@@@\n@@@@@@@@@@@@@@@@@@@@@@@@     @@@@@@@@@@@@@@@@@@@@@@@
        \    @@@@@@@@@@@@@@@@@@@@@@@\n @@@@@@@@@@@@@@@@@@@@@@      @@@@@@@@@@@@@@@@@@@@@@@
        \    @@@@@@@@@@@@@@@@@@@@@@@\n  @@@@@@@@@@@@@@@@@@@@@       @@@@@@@@@@@@@@@@@@@@@
        \     @@@@@@@@@@@@@@@@@@@@@@\n   @@@@@@@@@@@@@@@@@@           @@@@@@@@@@@@@@@@@
        \      @@@@@@@@@@@@@@@@@@@@@@\n      @@@@@@@@@@@@                 @@@@@@@@@@@
        \        @@@@@@@@@@@@@@@@@@@@\n                                                     @@@@@@@@@\n
        \                                                @@@@@@@@\n           @@@@
        \                    &@@@@@@@@@@@@@@@@@@           @@@@\n         @@@@@@@@@@
        \            @@@@@@@@@@@@@@@@@@@@@        &@@@@@@@@@@\n        *@@@@@@@@@@@@@
        \       @@@@@@@@@@@@@@@@@@@@@@@      @@@@@@@@@@@@@\n         @@@@@@@@@@@@@@
        \     @@@@@@@@@@@@@@@@@@@@@@@      @@@@@@@@@@@@@@\n           @@@@@@@@@@@@
        \     @@@@@@@@@@@@@@@@@@@@@@@      @@@@@@@@@@@@\n             @@@@@@@@@@      "
      stdouttruncated: true
    State: Completed
    Status: . execution completed
    UpdateTime: "2024-03-11T21:56:10.569228513Z"
    Version: 6
  JobID: 8e60e076-bc7a-4548-b4d5-93e943a171d7
  State: Completed
  TimeoutAt: "2024-03-14T21:55:57.362211708Z"
  UpdateTime: "2024-03-11T21:56:10.576574822Z"
  Version: 3

The bacalhau describe output shows a completed job with published results that can be inspected successfully on IPFS, but it reports NodeId: QmQe4oJUqqCLfK2kbgT8omeufYcB837ryRTHcDpdtsDFrj. This NodeId does not appear to be valid in the compute cluster:

bacalhau node describe QmQe4oJUqqCLfK2kbgT8omeufYcB837ryRTHcDpdtsDFrj
could not get node QmQe4oJUqqCLfK2kbgT8omeufYcB837ryRTHcDpdtsDFrj: Unexpected response code: 500 ({
  "error": "nodeInfo not found for nodeID: QmQe4oJUqqCLfK2kbgT8omeufYcB837ryRTHcDpdtsDFrj",
  "message": "Internal Server Error"
})
bacalhau node list
 ID        TYPE       LABELS                                              CPU     MEMORY      DISK         GPU
 QmPUc2aE  Requester  Architecture=amd64 Operating-System=linux
                      git-lfs=False owner=labdao
 QmQnWc21  Compute    Architecture=amd64 Operating-System=linux           3.2 /   12.3 GB /   768.6 GB /   0 /
                      git-lfs=False instance-id=i-00b2a63d65e16f212       3.2     12.3 GB     768.6 GB     0
                      instance-type=m5.xlarge node-type=cpu owner=labdao
 QmSMhNDD  Compute    Architecture=amd64 Operating-System=linux           3.2 /   12.1 GB /   769.2 GB /   0 /
                      git-lfs=False instance-id=i-0effc9f20d6d54602       3.2     12.1 GB     769.2 GB     0
                      instance-type=m5.xlarge node-type=cpu owner=labdao
 QmXYPp65  Compute    Architecture=amd64 Operating-System=linux           3.2 /   12.3 GB /   769.4 GB /   0 /
                      git-lfs=False instance-id=i-0fb5abee9bdd5d7fb       3.2     12.3 GB     769.4 GB     0
                      instance-type=m5.xlarge node-type=cpu owner=labdao
 QmcwRQbD  Compute    Architecture=amd64 Operating-System=linux           3.2 /   12.3 GB /   769.3 GB /   0 /
                      git-lfs=False instance-id=i-03031c023c7ec89c8       3.2     12.3 GB     769.3 GB     0
                      instance-type=m5.xlarge node-type=cpu owner=labdao
 QmdoAGf9  Compute    Architecture=amd64 Operating-System=linux           3.2 /   12.1 GB /   768.7 GB /   0 /
                      git-lfs=False instance-id=i-0b2fe586a323789b0       3.2     12.1 GB     768.7 GB     0
                      instance-type=m5.xlarge node-type=cpu owner=labdao
 QmeGfoaw  Compute    Architecture=amd64 Operating-System=linux           3.2 /   12.3 GB /   771.2 GB /   0 /
                      git-lfs=False instance-id=i-0cf245a18f4ffa88b       3.2     12.3 GB     771.2 GB     0
                      instance-type=m5.xlarge node-type=cpu owner=labdao

This suggests that when autoscaling up, we sometimes hit a problem where NodeId values change, which stalls the queue. Anecdotally, similar behavior occurred during previous scale-ups.
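
The comparison between the execution's NodeId and the node list can be written as a small check. This is only a sketch: `nodeKnown` and `knownShortIDs` are hypothetical names, and it assumes the shortened IDs printed by `bacalhau node list` are prefixes of the full NodeId.

```go
package main

import (
	"fmt"
	"strings"
)

// knownShortIDs holds the shortened Compute node IDs from the node list above.
var knownShortIDs = []string{
	"QmQnWc21", "QmSMhNDD", "QmXYPp65",
	"QmcwRQbD", "QmdoAGf9", "QmeGfoaw",
}

// nodeKnown reports whether nodeID starts with any of the shortened IDs.
// Assumption: the list output truncates full IDs to a leading prefix.
func nodeKnown(nodeID string, shortIDs []string) bool {
	for _, s := range shortIDs {
		if strings.HasPrefix(nodeID, s) {
			return true
		}
	}
	return false
}

func main() {
	// NodeId reported by the completed execution in the describe output.
	id := "QmQe4oJUqqCLfK2kbgT8omeufYcB837ryRTHcDpdtsDFrj"
	fmt.Println(nodeKnown(id, knownShortIDs)) // prints false
}
```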
