- # run two tinygrad matrix examples in a loop
- # amdgpu-6.0.5-1581431.20.04
- # NOT fixed in kernel 6.2.14
- [ 553.016624] gmc_v11_0_process_interrupt: 30 callbacks suppressed
- [ 553.016631] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:9 pasid:32770, for process python3 pid 10001 thread python3 pid 10001)
- [ 553.016790] amdgpu 0000:0b:00.0: amdgpu: in page starting at address 0x00007f0000000000 from client 10
- [ 553.016892] amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00901A30
- [ 553.016974] amdgpu 0000:0b:00.0: amdgpu: Faulty UTCL2 client ID: SDMA0 (0xd)
- [ 553.017051] amdgpu 0000:0b:00.0: amdgpu: MORE_FAULTS: 0x0
- [ 553.017111] amdgpu 0000:0b:00.0: amdgpu: WALKER_ERROR: 0x0
- [ 553.017173] amdgpu 0000:0b:00.0: amdgpu: PERMISSION_FAULTS: 0x3
- [ 553.017238] amdgpu 0000:0b:00.0: amdgpu: MAPPING_ERROR: 0x0
- [ 553.017300] amdgpu 0000:0b:00.0: amdgpu: RW: 0x0
- [ 553.123921] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=2
- [ 553.124153] amdgpu: failed to add hardware queue to MES, doorbell=0x1a16
- [ 553.124195] amdgpu: MES might be in unrecoverable state, issue a GPU reset
- [ 553.124237] amdgpu: Failed to restore queue 2
- [ 553.124266] amdgpu: Failed to restore process queues
- [ 553.124270] amdgpu: Failed to evict queue 3
- [ 553.124297] amdgpu: amdgpu_amdkfd_restore_userptr_worker: Failed to resume KFD
- # alternative crash in kernel 6.2.14
- [ 151.097948] gmc_v11_0_process_interrupt: 30 callbacks suppressed
- [ 151.097953] amdgpu 0000:0b:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24 vmid:8 pasid:32771, for process python3 pid 7525 thread python3 pid 7525)
- [ 151.097993] amdgpu 0000:0b:00.0: amdgpu: in page starting at address 0x00007f0000000000 from client 10
- [ 151.098008] amdgpu 0000:0b:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00801A30
- [ 151.098020] amdgpu 0000:0b:00.0: amdgpu: Faulty UTCL2 client ID: SDMA0 (0xd)
- [ 151.098032] amdgpu 0000:0b:00.0: amdgpu: MORE_FAULTS: 0x0
- [ 151.098042] amdgpu 0000:0b:00.0: amdgpu: WALKER_ERROR: 0x0
- [ 151.098052] amdgpu 0000:0b:00.0: amdgpu: PERMISSION_FAULTS: 0x3
- [ 151.098062] amdgpu 0000:0b:00.0: amdgpu: MAPPING_ERROR: 0x0
- [ 151.098071] amdgpu 0000:0b:00.0: amdgpu: RW: 0x0
- [ 151.209517] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=2
- [ 151.209724] amdgpu: failed to add hardware queue to MES, doorbell=0x1002
- [ 151.209734] amdgpu: MES might be in unrecoverable state, issue a GPU reset
- [ 151.209743] amdgpu: Failed to restore queue 1
- [ 151.209751] amdgpu: Failed to restore process queues
- [ 151.209759] amdgpu: amdgpu_amdkfd_restore_userptr_worker: Failed to resume KFD
- [ 151.209858] amdgpu 0000:0b:00.0: amdgpu: GPU reset begin!
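The decoded fields printed after each GCVM_L2_PROTECTION_FAULT_STATUS line can be reproduced from the raw register value. The sketch below is a minimal decoder, not the driver's own code: the bit positions are an assumption based on the GC 11 register masks in amdgpu, cross-checked only against the two status values in the logs above (0x00901A30 and 0x00801A30). The names `FIELDS` and `decode_fault_status` are hypothetical.

```python
# Assumed bit layout of GCVM_L2_PROTECTION_FAULT_STATUS on GC 11
# as (shift, mask-after-shift). Verified only against the values
# printed in the logs above; treat as a sketch.
FIELDS = {
    "MORE_FAULTS": (0, 0x1),
    "WALKER_ERROR": (1, 0x7),
    "PERMISSION_FAULTS": (4, 0xF),
    "MAPPING_ERROR": (8, 0x1),
    "CID": (9, 0x1FF),   # UTCL2 client ID (0xd is reported as SDMA0 in the log)
    "RW": (18, 0x1),
    "VMID": (20, 0xF),
}


def decode_fault_status(status: int) -> dict:
    """Split a raw fault status value into its named bit fields."""
    return {name: (status >> shift) & mask
            for name, (shift, mask) in FIELDS.items()}


if __name__ == "__main__":
    for raw in (0x00901A30, 0x00801A30):
        fields = decode_fault_status(raw)
        print(f"0x{raw:08X}: " +
              ", ".join(f"{k}=0x{v:x}" for k, v in fields.items()))
```

Under this layout the two status values differ only in the VMID field (9 versus 8), which matches the `vmid:` values in the corresponding page fault lines. Both decode to CID 0xd and PERMISSION_FAULTS 0x3, as the driver prints.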