[MCA][X86] llvm-mca very inaccurate for pop instructions

(All numbers below are reciprocal throughput, in cycles per iteration of the snippet, to make it easier to compare the different tools and benchmark results.)

For the following snippet of code:
```asm
popq %rax
popq %rcx
popq %rdx
popq %rbx
popq %r12
```

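llvm-mca predicts a reciprocal throughput of 30 cycles per iteration. The first summary below is for 1000 iterations; the second is the same snippet with 2000 iterations: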
```
Iterations:        1000
Instructions:      5000
Total Cycles:      30003
Total uOps:        10000

Dispatch Width:    6
uOps Per Cycle:    0.33
IPC:               0.17
Block RThroughput: 2.5
```
```
Iterations:        2000
Instructions:      10000
Total Cycles:      60003
Total uOps:        20000
```

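Taking the difference between the two runs to cancel out any startup overhead: (60003 - 30003) / (2000 - 1000) = 30 cycles per iteration. This is far above the `Block RThroughput` of 2.5 that llvm-mca reports in the same summary.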
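The subtraction above can be written as a small helper, which makes it easy to repeat for other iteration counts. This is only a sketch: `steady_state_cycles_per_iteration` is a hypothetical name, and the inputs are the `Iterations` and `Total Cycles` values copied by hand from the two llvm-mca summaries.

```python
def steady_state_cycles_per_iteration(run_a, run_b):
    """Each run is an (iterations, total_cycles) pair from an llvm-mca summary."""
    (iters_a, cycles_a), (iters_b, cycles_b) = run_a, run_b
    # Differencing two runs cancels any fixed startup/drain cycles.
    return (cycles_b - cycles_a) / (iters_b - iters_a)


per_iteration = steady_state_cycles_per_iteration((1000, 30003), (2000, 60003))
per_pop = per_iteration / 5  # the snippet contains five pop instructions
print(per_iteration, per_pop)  # 30.0 6.0
```

In other words, llvm-mca is charging roughly 6 cycles for every `pop` in the snippet.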

`llvm-exegesis` measures a reciprocal throughput of about 6.5-7 cycles per iteration. The benchmark does incur a number of L1D cache misses, since the cache lines it touches have not been loaded yet, so the real figure is likely lower:
```asm
# LLVM-EXEGESIS-DEFREG RSP 20000
# LLVM-EXEGESIS-MEM-DEF test1 131072 7fffffff
# LLVM-EXEGESIS-MEM-MAP test1 131072
popq %rax
popq %rcx
popq %rdx
popq %rbx
popq %r12
```
```
taskset -c 5 /tmp/llvm-exegesis -mode=latency -snippets-file=/tmp/test.s -execution-mode=subprocess -benchmark-process-cpu=6 -repetition-mode=loop -min-instructions=10000 -validation-counter=l1d-cache-load-misses
---
mode:            latency
key:
  instructions:
    - 'POP64r RAX'
    - 'POP64r RCX'
    - 'POP64r RDX'
    - 'POP64r RBX'
    - 'POP64r R12'
  config:          ''
  register_initial_values:
    - 'RSP=0x20000'
cpu_name:        skylake-avx512
llvm_triple:     x86_64-grtev4-linux-gnu
min_instructions: 10000
measurements:
  - { key: latency, value: 1.3718, per_snippet_value: 6.859, validation_counters:
      l1d-cache-load-misses: 3744 }
error:           ''
info:            ''
assembled_snippet: 41554154534989FC4989F548BF0000000000000000488D350000000048C1EE0C48C1E60C4881EE0010000048B80B000000000000000F054C8D05000000004C89E74C01C748C1EF0C48C1E70C4881C70010000048BE00F0FFFFFF7F00004829FE48B80B000000000000000F0548BF00E0FFFFFF7F000048BE001000000000000048BA030000000000000049BA11000000000000004D89E849B9000000000000000048B809000000000000000F0548BF000002000000000048BE000002000000000048BA030000000000000049BA110000000000000049B804E0FFFFFF7F0000458B0049B9000000000000000048B809000000000000000F0548BC00F0FFFFFF7F00005141535057565248BF00E0FFFFFF7F00008B3F48BE032400000000000048BA010000000000000048B810000000000000000F055A5E5F58415B5948BC000002000000000049B8020000000000000058595A5B415C58595A5B415C4983C0FF75EE48BF00E0FFFFFF7F00008B3F48BE012400000000000048BA010000000000000048B810000000000000000F0548BF000000000000000048B83C000000000000000F055B415C415DC3
...
```
```
taskset -c 5 /tmp/llvm-exegesis -mode=latency -snippets-file=/tmp/test.s -execution-mode=subprocess -benchmark-process-cpu=6 -repetition-mode=loop -min-instructions=5000 -validation-counter=l1d-cache-load-misses
---
mode:            latency
key:
  instructions:
    - 'POP64r RAX'
    - 'POP64r RCX'
    - 'POP64r RDX'
    - 'POP64r RBX'
    - 'POP64r R12'
  config:          ''
  register_initial_values:
    - 'RSP=0x20000'
cpu_name:        skylake-avx512
llvm_triple:     x86_64-grtev4-linux-gnu
min_instructions: 5000
measurements:
  - { key: latency, value: 1.4052, per_snippet_value: 7.026, validation_counters:
      l1d-cache-load-misses: 2169 }
error:           ''
info:            ''
assembled_snippet: 41554154534989FC4989F548BF0000000000000000488D350000000048C1EE0C48C1E60C4881EE0010000048B80B000000000000000F054C8D05000000004C89E74C01C748C1EF0C48C1E70C4881C70010000048BE00F0FFFFFF7F00004829FE48B80B000000000000000F0548BF00E0FFFFFF7F000048BE001000000000000048BA030000000000000049BA11000000000000004D89E849B9000000000000000048B809000000000000000F0548BF000002000000000048BE000002000000000048BA030000000000000049BA110000000000000049B804E0FFFFFF7F0000458B0049B9000000000000000048B809000000000000000F0548BC00F0FFFFFF7F00005141535057565248BF00E0FFFFFF7F00008B3F48BE032400000000000048BA010000000000000048B810000000000000000F055A5E5F58415B5948BC000002000000000049B8020000000000000058595A5B415C58595A5B415C4983C0FF75EE48BF00E0FFFFFF7F00008B3F48BE012400000000000048BA010000000000000048B810000000000000000F0548BF000000000000000048B83C000000000000000F055B415C415DC3
...
```

uiCA predicts a reciprocal throughput of 2.5 cycles per iteration, although I'm not (currently) convinced that is completely accurate.
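To put the three tools side by side, the predictions can be expressed as ratios against the `llvm-exegesis` measurement. This is a rough sketch that uses only the values quoted in this issue; the variable names are made up for illustration, and since the measurement itself is inflated by cache misses, the ratios are indicative at best.

```python
# Cycles per iteration of the five-pop snippet, as quoted in this issue.
# Mean of the two llvm-exegesis per_snippet_value results.
measured = (6.859 + 7.026) / 2
predictions = {"llvm-mca": 30.0, "uiCA": 2.5}

ratios = {tool: cycles / measured for tool, cycles in predictions.items()}
for tool, ratio in ratios.items():
    print(f"{tool}: {predictions[tool]} cycles, {ratio:.2f}x the measured value")
```

llvm-mca comes out at a bit more than four times the measured value, while uiCA lands well below it.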

I'm pretty sure MCA sees the dependency on `%rsp` (every `pop` both reads and writes it) and serializes the instructions on it, whereas the hardware seems able to resolve the `%rsp` updates without delaying execution. If we reset `%rsp` every iteration, MCA seems to do much better. Given that this is a real register dependency, I'm not sure this is easy to fix.
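A back-of-the-envelope model illustrates the two behaviours. It is not how llvm-mca is implemented. It assumes a cost of 6 cycles per `pop` when the instructions are chained through `%rsp` (inferred from 30 cycles / 5 pops, not read from the scheduling model), and two load ports with the `%rsp` updates handled outside the execution units, as the stack engine on recent Intel cores does. The function names are hypothetical.

```python
def serialized_on_rsp(num_pops, chain_latency):
    # Every pop waits for the previous pop's %rsp update to complete.
    return num_pops * chain_latency


def limited_by_load_ports(num_pops, load_ports):
    # %rsp updates cost nothing here; only the loads compete for ports.
    return num_pops / load_ports


print(serialized_on_rsp(5, 6))      # 30, the cycles per iteration llvm-mca simulates
print(limited_by_load_ports(5, 2))  # 2.5, the Block RThroughput / uiCA figure
```

The measured 6.5-7 cycles sits between the two bounds, which is consistent with the hardware not serializing on `%rsp` but still paying for the cache misses noted above.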
