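I am running simulated physical packet transmission on two servers, node1 and node2, and there is a high probability that the run either hangs or crashes. I initially suspected a PFC deadlock, but ib_send_bw sends packets normally between the two nodes.
With gdb I can see that the process is stuck in judge_exit_flag, which causes CPU usage to spike.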
ring of node 0, id: 0 dimension: local total nodes in ring: 2 index in ring: 0 offset: 1total nodes in ring: 2
ring of node 0, id: 0 dimension: local total nodes in ring: 4 index in ring: 0 offset: 2total nodes in ring: 4
ring of node 0, id: 0 dimension: local total nodes in ring: 2 index in ring: 0 offset: 1total nodes in ring: 2
ring of node 0, id: 0 dimension: local total nodes in ring: 4 index in ring: 0 offset: 2total nodes in ring: 4
ring of node 0, id: 0 dimension: local total nodes in ring: 2 index in ring: 0 offset: 1total nodes in ring: 2
ring of node 0, id: 0 dimension: local total nodes in ring: 4 index in ring: 0 offset: 2total nodes in ring: 4
ring of node 0, id: 0 dimension: local total nodes in ring: 2 index in ring: 0 offset: 1total nodes in ring: 2
ring of node 0, id: 0 dimension: local total nodes in ring: 4 index in ring: 0 offset: 2total nodes in ring: 4
pp_commize:0
id: embedding_layer , depen: -1 , wg_comp_time: 1
id: embedding_layer , depen: -1 , wg_comp_time: 1
id: embedding_layer , depen: -1 , wg_comp_time: 1
id: embedding_layer , depen: -1 , wg_comp_time: 1
id: embedding_layer , depen: -1 , wg_comp_time: 1
id: embedding_layer , depen: -1 , wg_comp_time: 1
type: HYBRID_TRANSFORMER_FWD_IN_BCKWD ,num passes: 1 ,lines: 6 compute scale: 1 ,comm scale: 1
chunk size is: 16777216 , size is: 16777216 , layer_num is: 0 , node: 0
info: all-reduce forward pass collective issued for layer: embedding_layer, involved dimensions: 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
chunk size is: 16777216 , size is: 16777216 , layer_num is: 1 , node: 0
info: all-reduce forward pass collective issued for layer: embedding_layer, involved dimensions: 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
chunk size is: 16777216 , size is: 16777216 , layer_num is: 2 , node: 0
[node1:1017216:0:1017233] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
==== backtrace (tid:1017233) ====
0 /lib/libucs.so.0(ucs_handle_error+0x2dc) [0x7f2b993feb5c]
1 /lib/libucs.so.0(+0x28d3f) [0x7f2b993fed3f]
2 /lib/libucs.so.0(+0x28f0a) [0x7f2b993fef0a]
3 /opt/SimAI/bin/SimAI_phynet(+0x11b57a) [0x561fd35f157a]
4 /opt/SimAI/bin/SimAI_phynet(+0x2c11a) [0x561fd350211a]
5 /opt/SimAI/bin/SimAI_phynet(+0x2b8e4) [0x561fd35018e4]
6 /opt/SimAI/bin/SimAI_phynet(+0x2b3e6) [0x561fd35013e6]
7 /opt/SimAI/bin/SimAI_phynet(+0x2a73e) [0x561fd350073e]
8 /opt/SimAI/bin/SimAI_phynet(+0x2adf0) [0x561fd3500df0]
9 /opt/SimAI/bin/SimAI_phynet(+0x39492) [0x561fd350f492]
10 /opt/SimAI/bin/SimAI_phynet(+0x393a8) [0x561fd350f3a8]
11 /opt/SimAI/bin/SimAI_phynet(+0x392b9) [0x561fd350f2b9]
12 /opt/SimAI/bin/SimAI_phynet(+0x3923f) [0x561fd350f23f]
13 /opt/SimAI/bin/SimAI_phynet(+0x39210) [0x561fd350f210]
14 /opt/SimAI/bin/SimAI_phynet(+0x194c44) [0x561fd366ac44]
15 /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f2b9b60c609]
16 /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f2b9b3c7353]
=================================
[node1:1017216] *** Process received signal ***
[node1:1017216] Signal: Segmentation fault (11)
[node1:1017216] Signal code: (-6)
[node1:1017216] Failing at address: 0xf8580
[node1:1017216] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f2b9b618420]
[node1:1017216] [ 1] /opt/SimAI/bin/SimAI_phynet(+0x11b57a)[0x561fd35f157a]
[node1:1017216] [ 2] /opt/SimAI/bin/SimAI_phynet(+0x2c11a)[0x561fd350211a]
[node1:1017216] [ 3] /opt/SimAI/bin/SimAI_phynet(+0x2b8e4)[0x561fd35018e4]
[node1:1017216] [ 4] /opt/SimAI/bin/SimAI_phynet(+0x2b3e6)[0x561fd35013e6]
[node1:1017216] [ 5] /opt/SimAI/bin/SimAI_phynet(+0x2a73e)[0x561fd350073e]
[node1:1017216] [ 6] /opt/SimAI/bin/SimAI_phynet(+0x2adf0)[0x561fd3500df0]
[node1:1017216] [ 7] /opt/SimAI/bin/SimAI_phynet(+0x39492)[0x561fd350f492]
[node1:1017216] [ 8] /opt/SimAI/bin/SimAI_phynet(+0x393a8)[0x561fd350f3a8]
[node1:1017216] [ 9] /opt/SimAI/bin/SimAI_phynet(+0x392b9)[0x561fd350f2b9]
[node1:1017216] [10] /opt/SimAI/bin/SimAI_phynet(+0x3923f)[0x561fd350f23f]
[node1:1017216] [11] /opt/SimAI/bin/SimAI_phynet(+0x39210)[0x561fd350f210]
[node1:1017216] [12] /opt/SimAI/bin/SimAI_phynet(+0x194c44)[0x561fd366ac44]
[node1:1017216] [13] /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f2b9b60c609]
[node1:1017216] [14] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f2b9b3c7353]
[node1:1017216] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 1017216 on node 192.168.22.1 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
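The frames inside SimAI_phynet in the backtrace are unsymbolized, so only module offsets such as +0x11b57a are visible. One way to map them to function names is to pass those offsets to addr2line. The following is a minimal sketch, not part of SimAI: the helper names frame_offsets and addr2line_command are hypothetical, and it assumes the binary was built with symbols (ideally with -g) and that the (+0x...) values are offsets relative to the binary's load base.

```python
import re

# Matches frames in both log formats, for example:
#   "3 /opt/SimAI/bin/SimAI_phynet(+0x11b57a) [0x561fd35f157a]"
#   "[node1:1017216] [ 1] /opt/SimAI/bin/SimAI_phynet(+0x11b57a)[0x561fd35f157a]"
# Frames that name a symbol, such as "libc.so.6(clone+0x43)", are skipped.
FRAME_RE = re.compile(r"(\S+)\(\+(0x[0-9a-fA-F]+)\)\s*\[0x[0-9a-fA-F]+\]")


def frame_offsets(log_text, binary):
    """Return the unique offsets (first-seen order) of frames inside `binary`."""
    offsets = []
    for path, offset in FRAME_RE.findall(log_text):
        if path == binary and offset not in offsets:
            offsets.append(offset)
    return offsets


def addr2line_command(log_text, binary):
    """Build an addr2line argument list (-f: function names, -C: demangle)."""
    return ["addr2line", "-f", "-C", "-e", binary] + frame_offsets(log_text, binary)


sample = """\
 2 /lib/libucs.so.0(+0x28f0a) [0x7f2b993fef0a]
 3 /opt/SimAI/bin/SimAI_phynet(+0x11b57a) [0x561fd35f157a]
 4 /opt/SimAI/bin/SimAI_phynet(+0x2c11a) [0x561fd350211a]
[node1:1017216] [ 1] /opt/SimAI/bin/SimAI_phynet(+0x11b57a)[0x561fd35f157a]
"""

# Print the command line; run it on the node where the binary is installed.
print(" ".join(addr2line_command(sample, "/opt/SimAI/bin/SimAI_phynet")))
```

Frame 3 of the backtrace (+0x11b57a) is the first frame inside SimAI_phynet below the libucs signal handler, so it is the most likely place where the invalid access at address 0x10 happened.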
After several more attempts, I found that the run crashes easily.
Crash log: