
[Support Ticket]: Boost stops accepting deals, performing retrievals, deals are stuck in 'Announcing' state, 'Piece Doctor' page does not load. #1846

Open
Shekelme opened this issue Dec 14, 2023 · 21 comments
Labels: support (Support Ticket)

@Shekelme

Boost component

  • boost daemon - storage providers
  • boost client
  • boost UI
  • boost data-transfer
  • boost index-provider
  • booster-http
  • booster-bitswap
  • LID Database - Yugabyte/LevelDB
  • boostd-data
  • Other

Boost Version

boost version 2.1.1+mainnet+git.e9d18ac
boostd version 2.1.1+mainnet+git.e9d18ac
boostd-data version 1.5.1+mainnet+git.e9d18ac
booster-bitswap version 2.1.1+mainnet+git.e9d18ac

Describe the problem

After some time of more or less successful operation (it can be several hours or a day or two), something goes wrong: boostd stops accepting deals and performing retrievals, deals get stuck in the 'Announcing' state, and the 'Piece Doctor' page does not load.

Yugabyte setup details

YugabyteDB v2.18.2.1 (replication factor 3, 3 nodes)
YugabyteDB nodes are running on Lenovo P620 workstations. HW specs of 3 hosts:

Host   | OS                 | YB NVME                               | FS  | CPU                                        | RAM                  | Role
len4   | Ubuntu 22.04.3 LTS | WD Black SN750 2TB WDS200T3X0C-00SJG0 | XFS | AMD Ryzen Threadripper PRO 3955WX 16-Cores | 512GB DDR4-3200 ECC  | PR2/C2 worker
len91  | Ubuntu 22.04.3 LTS | WD Black SN750 2TB WDS200T3X0C-00SJG0 | XFS | AMD Ryzen Threadripper PRO 3945WX 12-Cores | 256GB DDR4-3200 ECC  | lotus-miner
len101 | Ubuntu 22.04.3 LTS | WD Black SN750 2TB WDS200T3X0C-00SJG0 | XFS | AMD Ryzen Threadripper PRO 3995WX 64-Cores | 1024GB DDR4-3200 ECC | RU/PC2 worker
In addition to the `lotus` software (on all servers) and `boost` (on len91), `subspace` nodes are also installed on the two workers, but separate drives are allocated to Yugabyte, and CPU utilization stays below half. Sealing is carried out intermittently, from time to time.

How does your boostd-data talk to Yugabyte

I use boostd-data service:

[Unit]
Description=boostd-data
After=network-online.target
After=yb-tserver.service
Requires=network-online.target
Wants=yb-tserver.service

[Service]
ExecStart=/usr/local/bin/boostd-data run yugabyte --hosts 192.168.11.91,192.168.11.14,192.168.11.101 --connect-string="postgresql://yugabyte:yugabyte@192.168.11.91:5433?sslmode=disable" --addr 0.0.0.0:8044
Restart=always
RestartSec=5
TimeoutStartSec=20
TimeoutStopSec=20
User=adm
Group=adm
Type=simple


[Install]
WantedBy=multi-user.target

I've tried using another option as well, ?sslmode=disable&load_balance=true, with the same result.
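
For reference, with that option only the connect string in the ExecStart line above changes, e.g.:

ExecStart=/usr/local/bin/boostd-data run yugabyte --hosts 192.168.11.91,192.168.11.14,192.168.11.101 --connect-string="postgresql://yugabyte:yugabyte@192.168.11.91:5433?sslmode=disable&load_balance=true" --addr 0.0.0.0:8044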

Boostd logs and boostd-data logs

boostd-data.log
The boostd-data log is attached; the Boost upgrade to v2.1.1 was carried out on 11.12.2023.
Boostd log: please download from here: https://disk.yandex.ru/d/ugDMZaikacDfVQ

Recent error, warning and INFO logs from Yugabyte

Attached.
yb-master-tserver-logs.zip

Logging Information

Logs have been attached as files

Repro Steps

  1. Install lotus-miner v1.25.1 + boost v2.1.1 + YugabyteDB v2.18.2.1 (replication factor 3, 3 nodes)
  2. Start all processes
  3. Wait for the described symptoms to appear.
    ...
Shekelme added the support (Support Ticket) label on Dec 14, 2023
@Shekelme (Author)

boost goroutines attached
boost-15-12-2023.zip

@Shekelme (Author)

At least three miners are affected by this problem: two large ones (YugabyteDB; on the second miner there are no errors or timeouts in the current boostd-data log) where I can open the "Local Index Directory" page but not "Piece Doctor". On the third, small miner (LevelDB) I can't even open the "Local Index Directory", and there are no errors or timeouts in its current boostd-data log.

@ischasny (Collaborator)

Do deals get "unstuck" from Announcing after boostd restart?

@Shekelme (Author)

Shekelme commented Dec 19, 2023

I don't remember trying that; however, when restarting the boostd-data service, the Local Index Directory page does not start displaying again until the entire set of Boost software is restarted.
I checked that just now.

@ischasny (Collaborator)

> I don't remember trying that; however, when restarting the boostd-data service, the Local Index Directory page does not start displaying again until the entire set of Boost software is restarted.
> I checked that just now.

Do deals get unstuck from the announcing state then?

@Shekelme (Author)

There is no way to verify this yet. Let's wait for some deals...

@ischasny (Collaborator)

What I mean is, as per your description -

> deals are stuck in 'Announcing' state, 'Piece Doctor' page does not load

I wonder if those get unstuck after restart?

@Shekelme (Author)

Yes, after stopping and starting the Boost processes (we do not touch the YugabyteDB services), the deals get unstuck and become 'Proving' after a while.

@Shekelme (Author)

After yesterday's restart, the problem recurred. No new deals have been received yet. I would be glad for advice on which logs to look at.
For which log subsystems should I raise 'set-level' to a more detailed level?

@ischasny (Collaborator)

CC @LexLuthr

@ischasny (Collaborator)

If no new deals came in - what did get stuck in announcing then?

@ischasny (Collaborator)

@Shekelme I suggest looking into the debug logs for piecedirectory, specifically the log lines that start with add index: .... and podsi: .... Both contain a breakdown of time / concurrency settings that should help narrow down the problem further.
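
A minimal sketch of one way to surface those lines - assuming boostd honours the standard ipfs go-log GOLOG_LOG_LEVEL environment variable and runs under a systemd unit named boostd (both assumptions; adjust to your setup):

# raise the subsystem level before starting the daemon (alternative to `boostd log set-level`)
GOLOG_LOG_LEVEL="error,piecedirectory=debug" boostd run
# then filter for the timing / concurrency breakdowns mentioned above
journalctl -u boostd --since "10 minutes ago" | grep -E "add index:|podsi:"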

@Shekelme (Author)

> If no new deals came in - what did get stuck in announcing then?

As you may have noticed, the title of this issue covers more than just that problem. So, with the exception of deals stuck in that state (simply because there have been no new deals), the rest of the symptoms are present at the moment.
(screenshots attached)
If possible, someone could send a deal to f0187709 while it is exposed to this problem and see what happens.

@LexLuthr (Collaborator)

@Shekelme I want to disable the piece doctor on your system and check how it behaves after that. Can you please do the following?

  1. cd boost and confirm that you are running v2.1.1
  2. Create a new test branch git checkout -b test/pdoff
  3. Edit node/builder.go, find the line Override(StartPieceDoctorKey, modules.NewPieceDoctor), and comment it out. Then try to rebuild the binary.
  4. If the binary builds without error, deploy it and check the LID pages and other things (see the shell sketch below).
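
A sketch of these steps in shell form - assuming the checkout lives in ~/boost and that Boost's usual make build target is used to rebuild boostd (adjust the path and build command to your setup):

cd ~/boost
git describe --tags            # should report v2.1.1
git checkout -b test/pdoff
# in node/builder.go, comment out the line:
#   Override(StartPieceDoctorKey, modules.NewPieceDoctor),
make build                     # rebuild, then deploy the resulting boostd binary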

@Shekelme (Author)

Done.

admfc@lenovo91:~/boost$ git status
HEAD detached at v2.1.1
nothing to commit, working tree clean
-----compilation-----
boostd version 2.1.1+mainnet+git.e9d18ac.dirty

(screenshot attached)

@LexLuthr (Collaborator)

@Shekelme Judging by the screenshot, I am assuming it worked. If that is not the case, please let me know. If it is working as you expected, then please observe deals, indexing, retrievals, etc. for the next week and let us know.

@Shekelme (Author)

It worked after restarting the processes, but this morning the problem occurred again.

@Shekelme (Author)

> @Shekelme I suggest looking into the debug logs for piecedirectory, specifically the log lines that start with add index: .... and podsi: .... Both contain a breakdown of time / concurrency settings that should help narrow down the problem further.

BTW, nothing happens when I run boostd log set-level piecedirectory=debug

Now I have set the value to *=debug. I am attaching the log entries for the last 10 minutes, as well as a goroutines dump.
boost-21-12-2023.dump.zip
boost_go.zip

Also, I don't know if this is related or not, but today I found this entry in the log:

2023/12/21 09:10:14 graphql: panic occurred: runtime error: invalid memory address or nil pointer dereference
goroutine 179363120 [running]:
github.com/graph-gophers/graphql-go/log.(*DefaultLogger).LogPanic(0xc01b8ab388?, {0x74fe9e8?, 0xc03af20d80}, {0x4f695a0, 0xae63a80})
	/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/log/log.go:21 +0x65
github.com/graph-gophers/graphql-go/internal/exec.execFieldSelection.func2.1()
	/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/internal/exec/exec.go:187 +0x89
panic({0x4f695a0, 0xae63a80})
	/snap/go/current/src/runtime/panic.go:884 +0x213
github.com/filecoin-project/boost/gql.(*resolver).IpniDistanceFromLatestAd(0xc00d09e240, {0x74fe9e8, 0xc03af20e10}, {{0xc039b6a280?, 0x4?}, {0xc039b6a2c0?, 0x2?}})
	/home/admfc/boost/gql/resolver_ipni.go:221 +0x1ba
reflect.Value.call({0x5439aa0?, 0xc00d09e240?, 0x81f0d6?}, {0x54796d5, 0x4}, {0xc03af20e40, 0x2, 0x87b30e?})
	/snap/go/current/src/reflect/value.go:586 +0xb07
reflect.Value.Call({0x5439aa0?, 0xc00d09e240?, 0xc012267706?}, {0xc03af20e40?, 0x545b320?, 0x0?})
	/snap/go/current/src/reflect/value.go:370 +0xbc
github.com/graph-gophers/graphql-go/internal/exec.execFieldSelection.func2(0x0?, {0x74fe9e8?, 0xc03af20d80?}, 0xc015d072d8, 0xc01146ccc0, 0xc01b8abec0, {0x74fe9e8?, 0xc03af20e10})
	/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/internal/exec/exec.go:211 +0x4c6
github.com/graph-gophers/graphql-go/internal/exec.execFieldSelection({0x74fe9e8, 0xc03af20d80}, 0xc031e18400, 0x90?, 0xc01146ccc0, 0x1?, 0x1)
	/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/internal/exec/exec.go:231 +0x1ce
github.com/graph-gophers/graphql-go/internal/exec.(*Request).execSelections.func1(0xc01146ccc0)
	/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/internal/exec/exec.go:81 +0x1b8
created by github.com/graph-gophers/graphql-go/internal/exec.(*Request).execSelections
	/home/admfc/go/pkg/mod/github.com/graph-gophers/graphql-go@v1.3.0/internal/exec/exec.go:77 +0x1d0

context: context.Background.WithValue(type *http.contextKey, val <not Stringer>).WithValue(type *http.contextKey, val 192.168.11.91:8081).WithCancel.WithCancel.WithValue(type opentracing.contextKey, val <not Stringer>)

@RobQuistNL (Contributor)

These two patches might help speed things up:
#1855
#1853

@Shekelme (Author)

Shekelme commented Dec 22, 2023

You can only speed up something that at least works somehow, which is not what I'm seeing on my miners. ((
I will try downgrading Node.js to 16.x

@Shekelme (Author)

Shekelme commented Dec 26, 2023

No, it looks like the problem is neither with the Node.js version nor with YugabyteDB: I tried downgrading Node.js on the miner with LevelDB, and after about 1.5 days the problem appeared there again.
I need your ideas, gentlemen.
