glfsheal coredump occasionally #4239

Open
GeorgeLjz opened this issue Oct 16, 2023 · 0 comments · May be fixed by #4240

Comments

@GeorgeLjz

Description of problem:
glustershd failed because glfsheal crashed with a core dump (SIGSEGV).

The exact command to reproduce the issue:
The daemon process glustershd failed on its own; there is no exact command to reproduce this issue.

The full output of the command that failed: N/A

Expected results:

Mandatory info:
- The output of the gluster volume info command:
Volume Name: log
Type: Replicate
Volume ID: 786a290a-28a7-4f4d-8930-450319b79c5c
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 169.254.0.20:/mnt/bricks/log/brick
Brick2: 169.254.0.28:/mnt/bricks/log/brick
Options Reconfigured:
nfs.disable: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
cluster.server-quorum-type: none
cluster.consistent-metadata: no
server.allow-insecure: on
network.ping-timeout: 42
cluster.favorite-child-policy: mtime
cluster.heal-timeout: 60
storage.health-check-interval: 0
performance.client-io-threads: off
diagnostics.brick-log-level: INFO
cluster.server-quorum-ratio: 51

- The output of the gluster volume status command:
Status of volume: log
Gluster process TCP Port RDMA Port Online Pid

Brick 169.254.0.20:/mnt/bricks/log/brick 53954 0 Y 1701
Brick 169.254.0.28:/mnt/bricks/log/brick 53955 0 Y 2489
Self-heal Daemon on localhost N/A N/A N N/A
Self-heal Daemon on 169.254.0.24 N/A N/A Y 1283
Self-heal Daemon on 169.254.0.28 N/A N/A N N/A

Task Status of Volume log

There are no active volume tasks

- The output of the gluster volume heal command:
Brick 169.254.0.20:/mnt/bricks/log/brick
/tmpdir1\test/sn.log
/ - Is in split-brain
/tmpdir1\test - Is in split-brain
Status: Connected
Number of entries: 3

Brick 169.254.0.28:/mnt/bricks/log/brick
/tmp9_hard2.log
/tmpdir4\test
/tmpdir4\test/1/2/3/4/5/6/7/8
/tmpdir4\test/1/2/3/4
/tmpdir4\test/1
/tmpdir4\test/1/2/3/4/5/6
/tmpdir1\test/tmp8.log
/tmp4 complex_test-0 &%~$=' ;\ .txt
/tmpdir1\test/tmp2.log
/tmpdir4\test/1/2/3/4/5
/tmpdir4\test/1/2/3
/tmpdir4\test/1/2/3/4/5/6/7
/tmpdir4\test/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17
/tmpdir4\test/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20
/tmpdir2\test
/tmpdir4\test/1/2/3/4/5/6/7/8/9/10/11/12/13/14
/tmpdir1\test - Is in split-brain
/tmpdir4\test/1/2/3/4/5/6/7/8/9/10/11/12/13
/tmpdir4\test/1/2/3/4/5/6/7/8/9/10/11
/tmpdir4\test/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15
/tmp3.log
/tmpdir4\test/1/2/3/4/5/6/7/8/9/10
/master/fsaudit/auth.log
/master/fsaudit/alarms
/tmpdir4\test/1/2/3/4/5/6/7/8/9
/tmpdir1\test/mgn.log
/master/syslog
/tmpdir4\test/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18
/tmpdir4\test/1/2/3/4/5/6/7/8/9/10/11/12
/tmpdir4\test/1/2
/ - Is in split-brain
/tmpdir1\test/tmp6.log
/tmpdir4\test/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19/20/tmp_deep.log
/tmpdir1\test/tmp7.log
/tmpdir4\test/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19
/tmp5.log
/tmp9_soft2.log
/tmpdir4\test/1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16
Status: Connected
Number of entries: 38

- Provide logs present at the following locations of client and server nodes:
/var/log/glusterfs/
Final graph:
+------------------------------------------------------------------------------+
1: volume services-client-0
2: type protocol/client
3: option opversion 70000
4: option clnt-lk-version 1
5: option volfile-checksum 0
6: option volfile-key shd/config
7: option client-version 7.0
8: option process-name glustershd
9: option process-uuid CTX_ID:1181f87e-c3c6-46d6-83fa-fcb5783a2a67-GRAPH_ID:5-PID:9852-HOST:SN-0-PC_NAME:services-client-0-RECON_NO:-0
10: option fops-version 1298437
11: option ping-timeout 42
12: option remote-host 169.254.0.20
13: option remote-subvolume /mnt/bricks/services/brick
14: option transport-type socket
15: option transport.address-family inet
16: option username b24f982a-5276-466f-b6a5-42a88e20a2a7
17: option password 72cc7d66-e66f-491e-8901-8629ae0960f0
18: option transport.socket.ssl-enabled off
19: option transport.tcp-user-timeout 9
20: option transport.socket.keepalive-time 20
21: option transport.socket.keepalive-interval 10
22: option transport.socket.keepalive-count 3
23: end-volume
24:
25: volume services-client-1
26: type protocol/client
27: option ping-timeout 42
[2023-09-26 06:50:29.273048] I [rpc-clnt.c:1969:rpc_clnt_reconfig] 5-services-client-1: changing port to 53957 (from 0)
[2023-09-26 06:50:29.273194] I [socket.c:864:__socket_shutdown] 5-services-client-1: intentional socket shutdown(18)
28: option remote-host 169.254.0.28
29: option remote-subvolume /mnt/bricks/services/brick
30: option transport-type socket
31: option transport.address-family inet
32: option username b24f982a-5276-466f-b6a5-42a88e20a2a7
33: option password 72cc7d66-e66f-491e-8901-8629ae0960f0
34: option transport.socket.ssl-enabled off
35: option transport.tcp-user-timeout 9
36: option transport.socket.keepalive-time 20
37: option transport.socket.keepalive-interval 10
38: option transport.socket.keepalive-count 3
39: end-volume
40:
41: volume services-replicate-0
42: type cluster/replicate
43: option node-uuid 88fd68ac-47d0-4cda-87af-4ffc54e4d8e5
44: option afr-pending-xattr services-client-0,services-client-1
45: option background-self-heal-count 0
46: option metadata-self-heal on
47: option data-self-heal on
48: option entry-self-heal on
49: option self-heal-daemon enable
50: option heal-timeout 60
51: option consistent-metadata no
52: option favorite-child-policy mtime
53: option use-compound-fops off
54: option iam-self-heal-daemon yes
55: subvolumes services-client-0 services-client-1
56: end-volume
57:
58: volume services
59: type debug/io-stats
60: option log-level INFO
61: option threads 16
62: subvolumes services-replicate-0
63: end-volume
64:
+------------------------------------------------------------------------------+
[2023-09-26 06:50:29.274719] I [MSGID: 100041] [glusterfsd-mgmt.c:1108:glusterfs_handle_svc_attach] 0-glusterfs: received attach request for volfile-id=shd/mstate
[2023-09-26 06:50:29.274776] I [MSGID: 100040] [glusterfsd-mgmt.c:105:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2023-09-26 06:50:29.274811] I [MSGID: 100041] [glusterfsd-mgmt.c:1108:glusterfs_handle_svc_attach] 0-glusterfs: received attach request for volfile-id=shd/services
[2023-09-26 06:50:29.274828] I [MSGID: 100040] [glusterfsd-mgmt.c:105:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2023-09-26 06:50:29.274850] I [MSGID: 108026] [afr-self-heald.c:424:afr_shd_index_heal] 4-mstate-replicate-0: got entry: b4286ea7-6cd8-4931-ac7b-b5a979dca17b from mstate-client-0
[2023-09-26 06:50:29.274895] I [MSGID: 108026] [afr-self-heald.c:424:afr_shd_index_heal] 3-log-replicate-0: got entry: 24972e03-d45b-4900-a147-2903780b5302 from log-client-0
[2023-09-26 06:50:29.274945] I [MSGID: 100040] [glusterfsd-mgmt.c:105:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2023-09-26 06:50:29.276312] I [MSGID: 100040] [glusterfsd-mgmt.c:105:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2023-09-26 06:50:29.276402] I [MSGID: 108026] [afr-self-heald.c:333:afr_shd_selfheal] 4-mstate-replicate-0: entry: path /tmpdir1/sn.log, gfid: b4286ea7-6cd8-4931-ac7b-b5a979dca17b
[2023-09-26 06:50:29.276514] I [MSGID: 108026] [afr-self-heald.c:333:afr_shd_selfheal] 3-log-replicate-0: entry: path gfid:24972e03-d45b-4900-a147-2903780b5302, gfid: 24972e03-d45b-4900-a147-2903780b5302
[2023-09-26 06:50:29.277013] I [MSGID: 100040] [glusterfsd-mgmt.c:105:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2023-09-26 06:50:29.277265] I [MSGID: 114057] [client-handshake.c:1373:select_server_supported_programs] 5-services-client-1: Using Program GlusterFS 4.x v1, Num (1298437), Version (400)
[2023-09-26 06:50:29.277832] I [MSGID: 114046] [client-handshake.c:1104:client_setvolume_cbk] 5-services-client-1: Connected to services-client-1, attached to remote volume '/mnt/bricks/services/brick'.
[2023-09-26 06:50:29.281028] I [MSGID: 108026] [afr-self-heald.c:424:afr_shd_index_heal] 3-log-replicate-0: got entry: a91e162b-e352-451e-8b40-2d15982e4748 from log-client-0
[2023-09-26 06:50:29.281074] I [MSGID: 108026] [afr-self-heal-data.c:327:afr_selfheal_data_do] 4-mstate-replicate-0: performing data selfheal on b4286ea7-6cd8-4931-ac7b-b5a979dca17b
[2023-09-26 06:50:29.284335] I [MSGID: 108026] [afr-self-heald.c:333:afr_shd_selfheal] 3-log-replicate-0: entry: path gfid:a91e162b-e352-451e-8b40-2d15982e4748, gfid: a91e162b-e352-451e-8b40-2d15982e4748
[2023-09-26 06:50:29.289615] I [MSGID: 108026] [afr-self-heald.c:424:afr_shd_index_heal] 3-log-replicate-0: got entry: 00000000-0000-0000-0000-000000000001 from log-client-0
[2023-09-26 06:50:29.289942] I [MSGID: 108026] [afr-self-heald.c:333:afr_shd_selfheal] 3-log-replicate-0: entry: path /, gfid: 00000000-0000-0000-0000-000000000001
[2023-09-26 06:50:29.292540] I [MSGID: 108026] [afr-self-heal-entry.c:905:afr_selfheal_entry_do] 3-log-replicate-0: performing entry selfheal on 00000000-0000-0000-0000-000000000001
[2023-09-26 06:50:29.298955] I [MSGID: 108026] [afr-self-heal-common.c:1748:afr_log_selfheal] 4-mstate-replicate-0: Completed data selfheal on b4286ea7-6cd8-4931-ac7b-b5a979dca17b. sources=[0] sinks=1
[2023-09-26 06:50:29.300452] I [MSGID: 108026] [afr-self-heal-metadata.c:51:__afr_selfheal_metadata_do] 4-mstate-replicate-0: performing metadata selfheal on b4286ea7-6cd8-4931-ac7b-b5a979dca17b
[2023-09-26 06:50:29.307964] I [MSGID: 108026] [afr-self-heal-common.c:1748:afr_log_selfheal] 4-mstate-replicate-0: Completed metadata selfheal on b4286ea7-6cd8-4931-ac7b-b5a979dca17b. sources=[0] sinks=1
[2023-09-26 06:50:29.308050] I [MSGID: 108026] [afr-self-heald.c:424:afr_shd_index_heal] 4-mstate-replicate-0: got entry: 00000000-0000-0000-0000-000000000001 from mstate-client-0
[2023-09-26 06:50:29.308807] I [MSGID: 108026] [afr-self-heald.c:333:afr_shd_selfheal] 4-mstate-replicate-0: entry: path /, gfid: 00000000-0000-0000-0000-000000000001
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2023-09-26 06:50:29
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 7.0
/lib64/libglusterfs.so.0(+0x2c254)[0x7f0912169254]
/lib64/libglusterfs.so.0(gf_print_trace+0x34a)[0x7f0912173f2a]
/lib64/libc.so.6(+0x3db70)[0x7f0911f18b70]
/lib64/libc.so.6(+0xd0f1f)[0x7f0911fabf1f]
/lib64/libc.so.6(__strftime_l+0x2d)[0x7f0911fae6ed]
/usr/lib64/glusterfs/7.0/xlator/cluster/replicate.so(+0x43fda)[0x7f090cafffda]
/usr/lib64/glusterfs/7.0/xlator/cluster/replicate.so(+0x45968)[0x7f090cb01968]
/usr/lib64/glusterfs/7.0/xlator/cluster/replicate.so(+0x52ded)[0x7f090cb0eded]
/usr/lib64/glusterfs/7.0/xlator/cluster/replicate.so(+0x53723)[0x7f090cb0f723]
/usr/lib64/glusterfs/7.0/xlator/cluster/replicate.so(+0x53a70)[0x7f090cb0fa70]
/usr/lib64/glusterfs/7.0/xlator/cluster/replicate.so(+0x4b705)[0x7f090cb07705]
/usr/lib64/glusterfs/7.0/xlator/cluster/replicate.so(+0x4b89c)[0x7f090cb0789c]
/usr/lib64/glusterfs/7.0/xlator/cluster/replicate.so(+0x549d7)[0x7f090cb109d7]
/usr/lib64/glusterfs/7.0/xlator/cluster/replicate.so(+0x54c7b)[0x7f090cb10c7b]
/lib64/libglusterfs.so.0(+0x932b1)[0x7f09121d02b1]
/lib64/libglusterfs.so.0(+0x69cbc)[0x7f09121a6cbc]
/lib64/libc.so.6(+0x54e50)[0x7f0911f2fe50]

- Is there any crash? Provide the backtrace and coredump:
YES. The coredump file is attached, and the backtrace is listed below:
backtrace:
Stack trace of thread 40893:
#0 0x00007f1fb9e4ef1f __strftime_internal (libc.so.6 + 0xd0f1f)
#1 0x00007f1fb9e516ed __strftime_l (libc.so.6 + 0xd36ed)
#2 0x00007f1fb44c2fda afr_mark_split_brain_source_sinks_by_policy (replicate.so + 0x43fda)
#3 0x00007f1fb44c4968 afr_mark_split_brain_source_sinks (replicate.so + 0x45968)
#4 0x00007f1fb44d1ded __afr_selfheal_metadata_finalize_source (replicate.so + 0x52ded)
#5 0x00007f1fb44d2723 __afr_selfheal_metadata_prepare (replicate.so + 0x53723)
#6 0x00007f1fb44eb1ea afr_selfheal_locked_metadata_inspect (replicate.so + 0x6c1ea)
#7 0x00007f1fb44ebc06 afr_selfheal_locked_inspect (replicate.so + 0x6cc06)
#8 0x00007f1fb44ebd7e afr_get_heal_info (replicate.so + 0x6cd7e)
#9 0x00007f1fb449e6ff afr_getxattr (replicate.so + 0x1f6ff)
#10 0x00007f1fba17b5ba syncop_getxattr (libglusterfs.so.0 + 0x725ba)
#11 0x000055606791d786 glfsh_process_entries (glfsheal + 0x5786)
#12 0x000055606791e772 glfsh_crawl_directory.isra.0 (glfsheal + 0x6772)
#13 0x000055606791ea97 glfsh_print_pending_heals_type (glfsheal + 0x6a97)
#14 0x000055606791ecff glfsh_print_pending_heals (glfsheal + 0x6cff)
#15 0x000055606791ee79 glfsh_gather_heal_info (glfsheal + 0x6e79)
#16 0x000055606791c372 main (glfsheal + 0x4372)
#17 0x00007f1fb9da5b4a __libc_start_call_main (libc.so.6 + 0x27b4a)
#18 0x00007f1fb9da5c0b __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x27c0b)
#19 0x000055606791c3b5 _start (glfsheal + 0x43b5)

            Stack trace of thread 40894:
            #0  0x00007f1fb9e54293 clock_nanosleep@GLIBC_2.2.5 (libc.so.6 + 0xd6293)
            #1  0x00007f1fb9e58d37 __nanosleep (libc.so.6 + 0xdad37)
            #2  0x00007f1fb9e58c63 sleep (libc.so.6 + 0xdac63)
            #3  0x00007f1fba15f63b pool_sweeper (libglusterfs.so.0 + 0x5663b)
            #4  0x00007f1fb9e0a886 start_thread (libc.so.6 + 0x8c886)
            #5  0x00007f1fb9e906e0 __clone3 (libc.so.6 + 0x1126e0)

            Stack trace of thread 40895:
            #0  0x00007f1fb9e07189 __futex_abstimed_wait_common (libc.so.6 + 0x89189)
            #1  0x00007f1fb9e09e62 pthread_cond_timedwait@@GLIBC_2.3.2 (libc.so.6 + 0x8be62)
            #2  0x00007f1fba1759c8 syncenv_task (libglusterfs.so.0 + 0x6c9c8)
            #3  0x00007f1fba176845 syncenv_processor (libglusterfs.so.0 + 0x6d845)
            #4  0x00007f1fb9e0a886 start_thread (libc.so.6 + 0x8c886)
            #5  0x00007f1fb9e906e0 __clone3 (libc.so.6 + 0x1126e0)

            Stack trace of thread 40896:
            #0  0x00007f1fb9e07189 __futex_abstimed_wait_common (libc.so.6 + 0x89189)
            #1  0x00007f1fb9e09e62 pthread_cond_timedwait@@GLIBC_2.3.2 (libc.so.6 + 0x8be62)
            #2  0x00007f1fba1759c8 syncenv_task (libglusterfs.so.0 + 0x6c9c8)
            #3  0x00007f1fba176845 syncenv_processor (libglusterfs.so.0 + 0x6d845)
            #4  0x00007f1fb9e0a886 start_thread (libc.so.6 + 0x8c886)
            #5  0x00007f1fb9e906e0 __clone3 (libc.so.6 + 0x1126e0)

            Stack trace of thread 40898:
            #0  0x00007f1fb9e07189 __futex_abstimed_wait_common (libc.so.6 + 0x89189)
            #1  0x00007f1fb9e09e62 pthread_cond_timedwait@@GLIBC_2.3.2 (libc.so.6 + 0x8be62)

Additional info:

- The operating system / glusterfs version: 7.0.1
After checking the latest source code, the issue should also exist in the latest version.

Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration

core.glfsheal.0.d3696c8cc7b54594aa8d2c0e0a347230.51734.1695707401000000.zip

GeorgeLjz added a commit to GeorgeLjz/glusterfs that referenced this issue Oct 16, 2023
glfsheal encounters a SIGSEGV in __strftime_internal, called from afr_mark_split_brain_source_sinks_by_policy.

Root cause: a mis-compare between an int and an unsigned int.
Solution: convert it into a comparison between two ints.

Fixes: gluster#4239
Change-Id: If6a356db60298da39a48c7979abdfbac03521aa7
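A minimal, hypothetical illustration of that class of bug (not the actual afr code): when a negative signed timestamp is compared against an unsigned one, the signed value is converted to unsigned and compares as a huge number, so a replica with a bogus ctime can "win" the favorite-child-policy comparison.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch of a signed/unsigned mis-compare, not the real afr
 * code: a negative ctime reported by one replica is promoted to unsigned
 * in the comparison and ends up looking like the "newest" timestamp. */
int main(void)
{
    int64_t  bad_ctime  = -1;          /* bogus ctime from one brick      */
    uint64_t good_ctime = 1695707401;  /* sane ctime from the other brick */

    /* Mixed-signedness compare: the usual arithmetic conversions turn
     * bad_ctime into UINT64_MAX, so the bogus value wrongly wins. */
    if (bad_ctime > good_ctime)
        printf("mixed compare: bogus ctime wins (%llu)\n",
               (unsigned long long)(uint64_t)bad_ctime);

    /* Comparing both values as signed picks the valid replica instead. */
    if (bad_ctime < (int64_t)good_ctime)
        printf("signed compare: valid ctime wins\n");

    return 0;
}
```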
GeorgeLjz added a commit to GeorgeLjz/glusterfs that referenced this issue Oct 16, 2023
glfsheal encounters a SIGSEGV in __strftime_internal, called from afr_mark_split_brain_source_sinks_by_policy.

Root cause: ctime is negative.
Solution: set ctime to 0 when it is negative, before calling strftime.

Fixes: gluster#4239
Change-Id: If6a356db60298da39a48c7979abdfbac03521aa7
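A minimal sketch of the guard this second version of the patch describes (function and variable names here are illustrative, not from the glusterfs tree): clamp a negative ctime to 0 and check the localtime_r() result before handing the struct tm to strftime(), so a bad timestamp cannot reach the strftime() call shown in the backtrace.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper, not the actual patch: format a replica's ctime
 * defensively. A negative value is clamped to 0 (the epoch), and strftime()
 * is only called if localtime_r() succeeded. */
static void format_ctime_safe(time_t ctime_val, char *buf, size_t len)
{
    struct tm tm_buf;

    if (ctime_val < 0)      /* the fix described above: negative -> 0 */
        ctime_val = 0;

    if (localtime_r(&ctime_val, &tm_buf) == NULL) {
        snprintf(buf, len, "unknown");
        return;
    }

    strftime(buf, len, "%Y-%m-%d %H:%M:%S", &tm_buf);
}

int main(void)
{
    char buf[64];

    format_ctime_safe((time_t)-1, buf, sizeof(buf));
    printf("%s\n", buf);    /* prints the epoch date instead of crashing */

    return 0;
}
```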