synchronous abort

System exceptions triggered by memory hardware errors

Posted by icecube on January 26, 2024

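Introduction

On arm64 devices, I have run into system exceptions that were caused by memory hardware problems rather than by bugs in kernel code.

According to a colleague on the boot team, the cause was related to a memory timing parameter that is set when memory is configured during the boot stage.

Consulting the Arm CPU manual, these exceptions can be told apart by the EC (Exception Class) field of the exception syndrome code that the kernel prints.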

EC, bits [31:26]

Exception Class. Indicates the reason for the exception that this register holds information about.
For each EC value, the table references a subsection that gives information about:
• The cause of the exception, for example the configuration required to enable the trap.
• The encoding of the associated ISS.

IFSC, bits [5:0]

Instruction Fault Status Code. For Data Aborts, the same bit position holds the DFSC (Data Fault Status Code).
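The field layout quoted above can be checked by hand, but a small script makes it quicker to read the codes that show up in kernel logs. The following is a minimal sketch, not taken from the kernel source: the helper name `split_esr` is hypothetical, and it assumes the value printed in the log is the 32-bit syndrome value with EC in bits [31:26], IL in bit [25] and the ISS in bits [24:0].

```python
def split_esr(esr: int) -> dict:
    """Split a 32-bit syndrome value into its top-level fields.

    Hypothetical helper for illustration; the bit positions follow the
    EC / ISS / fault status code layout quoted from the manual.
    """
    return {
        "ec": (esr >> 26) & 0x3F,  # Exception Class, bits [31:26]
        "il": (esr >> 25) & 0x1,   # Instruction Length, bit [25]
        "iss": esr & 0x1FFFFFF,    # Instruction Specific Syndrome, bits [24:0]
        "fsc": esr & 0x3F,         # IFSC or DFSC, bits [5:0] (abort classes only)
    }


if __name__ == "__main__":
    # The three codes discussed in this post.
    for code in (0x96000210, 0x86000210, 0x96000217):
        fields = split_esr(code)
        print(f"{code:#010x}: EC={fields['ec']:06b} FSC={fields['fsc']:06b}")
```

The `fsc` entry is only meaningful when EC identifies an Instruction Abort or a Data Abort; for other exception classes bits [5:0] of the ISS have a different meaning.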

0x96000210

0x96000210: a data access abort. It is an external abort and not a page table problem.

[ 6621.614063] Unhandled fault: synchronous external abort (0x96000210) at 0xffffffdfd05b2bf0
[ 6621.713051] Internal error: : 96000210 [#1] SMP
[ 6621.767235] Modules linked in: system(PO) comeon(O)
[ 6621.824562] CPU: 0 PID: 787 Comm: dport_omcd Tainted: P S         O    4.4.65-bex01a #1
[ 6621.920419] Hardware name: E2000Q DEMO DDR4 (DT)
[ 6621.975642] task: ffffffe0e1ae4880 ti: ffffffe0076e0000 task.ti: ffffffe0076e0000
[ 6622.065251] PC is at rb_next+0x0/0x60
[ 6622.109017] LR is at set_next_entity+0x640/0x7b0
[ 6622.164243] pc : [<ffffff804035ef60>] lr : [<ffffff80400e6ea0>] pstate: 600001c5
[ 6622.252805] sp : ffffffe0076ffa10

EC, bits [31:26] EC == 100101

Data Abort taken without a change in Exception level.
Used for MMU faults generated by data accesses, alignment faults other than those
caused by Stack Pointer misalignment, and Synchronous external aborts, including synchronous parity or ECC errors.
Not used for debug related exceptions. This value is valid for all described registers.
DFSC, bits [5:0] 010000

Synchronous external abort, not on translation table walk
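Besides the fault status code, the ISS of a Data Abort carries a few more single-bit fields that help when reading a code such as 0x96000210. The sketch below is illustrative only: the function name `decode_data_abort_iss` is hypothetical, and the bit positions (ISV at 24, FnV at 10, EA at 9, CM at 8, S1PTW at 7, WnR at 6) are the Data Abort ISS layout as I understand it from the Arm manual. The manual qualifies several of these fields in specific cases, so treat the result as a hint rather than a verdict.

```python
def decode_data_abort_iss(esr: int) -> dict:
    """Pull the single-bit Data Abort ISS fields out of a syndrome value.

    Hypothetical helper; only meaningful when EC says "Data Abort".
    """
    return {
        "isv": (esr >> 24) & 1,    # Instruction Syndrome Valid
        "fnv": (esr >> 10) & 1,    # FAR not Valid
        "ea": (esr >> 9) & 1,      # External abort type (implementation defined)
        "cm": (esr >> 8) & 1,      # Cache maintenance
        "s1ptw": (esr >> 7) & 1,   # Stage 2 fault on a stage 1 table walk
        "wnr": (esr >> 6) & 1,     # Write not Read
        "dfsc": esr & 0x3F,        # Data Fault Status Code
    }


if __name__ == "__main__":
    for name, value in decode_data_abort_iss(0x96000210).items():
        print(f"{name:6s} = {value:#x}")
```

For 0x96000210 this reports WnR = 0, which matches the reading of the fault as a data read, and FnV = 0, which is consistent with the kernel printing a fault address in the log line.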

0x86000210

0x86000210: an instruction fetch abort. It is an external abort and not a page table problem.

[17:04:40][  755.236399] Bad mode in Synchronous Abort handler detected, code 0x86000210 -- IABT (current EL)
[17:04:40][  755.236418] Bad mode in Synchronous Abort handler detected, code 0x86000210 -- IABT (current EL)
[17:04:40][  755.236436] Bad mode in Synchronous Abort handler detected, code 0x86000210 -- IABT (current EL)
[17:04:40][  755.236454] par_el1 = 0
[17:04:40][  755.236464] Bad mode in Synchronous Abort handler detected, code 0x86000210 -- IABT (current EL)
[17:04:40][  755.236481] Internal error: Oops - bad mode: 0 [#1] SMP
[17:04:40][  755.236490] par_el1 = 0
[17:04:40][  755.236497] Modules linked in:
[17:04:40][  755.236497] par_el1 = 0
[17:04:40][  755.236509]  system(PO) comeon(O)
[17:04:40][  755.236519] CPU: 0 PID: 2277 Comm: routed Tainted: P S         O    4.4.65-bex01a #1
[17:04:40][  755.236533] Hardware name: E2000Q DEMO DDR4 (DT)
[17:04:40][  755.236542] task: ffffffdfdc58d700 ti: ffffffdfdc720000 task.ti: ffffffdfdc720000
[17:04:40][  755.236561] PC is at vectors+0x200/0x790
[17:04:40][  755.236570] LR is at el0_da+0x18/0x1c

EC, bits [31:26] EC == 100001

Instruction Abort taken without a change in Exception level.
Used for MMU faults generated by instruction accesses and Synchronous external
aborts, including synchronous parity or ECC errors. Not used for debug related exceptions.

  IFSC, bits [5:0] 010000

Synchronous external abort, not on translation table walk
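The two EC values seen so far differ only in whether the abort came from an instruction fetch or a data access. A minimal sketch of that classification follows. The table and the function name `classify_abort` are hypothetical, and the two "from a lower Exception level" encodings (100000 and 100100) do not appear in the logs above; they are included on the assumption that they sit next to the two encodings quoted from the manual.

```python
# EC encodings for the four abort classes (illustrative table).
ABORT_CLASSES = {
    0b100000: ("instruction", "from a lower Exception level"),
    0b100001: ("instruction", "without a change in Exception level"),
    0b100100: ("data", "from a lower Exception level"),
    0b100101: ("data", "without a change in Exception level"),
}


def classify_abort(esr: int):
    """Return (access kind, origin) for abort classes, else None."""
    ec = (esr >> 26) & 0x3F  # Exception Class, bits [31:26]
    return ABORT_CLASSES.get(ec)


if __name__ == "__main__":
    for code in (0x96000210, 0x86000210):
        print(f"{code:#010x}: {classify_abort(code)}")
```

Both codes in this post classify as "without a change in Exception level", which matches the "IABT (current EL)" text in the second log.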

0x96000217

0x96000217: a data access abort. It is an external abort raised during a page table walk.

Unhandled fault: synchronous abort (translation table walk) (0x96000217) at 0xffffff807ee701e8

EC, bits [31:26] EC == 100101

DFSC, bits [5:0] 010111

Synchronous External abort on translation table walk or hardware update of translation table, level 3.
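The difference between 010000 and 010111 is easier to see once the fault status code is split into a fault type (upper four bits) and a lookup level (lower two bits). The sketch below covers only a handful of encodings and is not a complete table; the function name `describe_fsc` is hypothetical, and the encodings other than the two quoted in this post are listed from my reading of the Arm manual.

```python
def describe_fsc(fsc: int) -> str:
    """Describe a 6-bit IFSC/DFSC value (partial, illustrative table)."""
    if fsc == 0b010000:
        return "Synchronous external abort, not on translation table walk"
    group, level = fsc >> 2, fsc & 0b11  # fault type and lookup level
    names = {
        0b0000: "Address size fault",
        0b0001: "Translation fault",
        0b0010: "Access flag fault",
        0b0011: "Permission fault",
        0b0101: "Synchronous external abort on translation table walk",
    }
    if group in names:
        return f"{names[group]}, level {level}"
    return f"not covered by this sketch ({fsc:06b})"


if __name__ == "__main__":
    for code in (0x96000210, 0x86000210, 0x96000217):
        print(f"{code:#010x}: {describe_fsc(code & 0x3F)}")
```

Only the external abort encodings point toward the memory system; translation, access flag and permission faults are ordinary MMU faults and usually indicate a software problem.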

References

Arm Architecture Reference Manual for A-profile architecture