[MCA][X86] llvm-mca very inaccurate for pop instructions #152008

@boomanaiden154

(Everything below is in terms of reciprocal throughput, to make it easier to compare results across different tools and benchmarks.)

For the following snippet of code:

popq %rax
popq %rcx
popq %rdx
popq %rbx
popq %r12

llvm-mca predicts a reciprocal throughput of 30 cycles:

Iterations:        1000
Instructions:      5000
Total Cycles:      30003
Total uOps:        10000

Dispatch Width:    6
uOps Per Cycle:    0.33
IPC:               0.17
Block RThroughput: 2.5

Iterations:        2000
Instructions:      10000
Total Cycles:      60003
Total uOps:        20000

(60003-30003)/1000=30 cycles per iteration.
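The subtraction above cancels the fixed startup cost that both llvm-mca runs share, leaving only the steady-state cost per iteration. A minimal Python sketch of that arithmetic, using only the numbers from the two summaries (the function and variable names are made up for illustration):

```python
def steady_state_cycles_per_iteration(run_a, run_b):
    """Each run is an (iterations, total_cycles) pair from an llvm-mca summary.

    Taking the difference between two runs removes the constant
    startup/drain cycles that both totals include.
    """
    (iters_a, cycles_a), (iters_b, cycles_b) = run_a, run_b
    return (cycles_b - cycles_a) / (iters_b - iters_a)


# Values copied from the two llvm-mca summaries above.
mca_1000 = (1000, 30003)
mca_2000 = (2000, 60003)

cycles_per_iter = steady_state_cycles_per_iteration(mca_1000, mca_2000)
block_rthroughput = 2.5  # "Block RThroughput" line from the same summary

print(cycles_per_iter)                      # 30.0
print(cycles_per_iter / block_rthroughput)  # 12.0
```

So the simulated steady state is 12x slower than the Block RThroughput that llvm-mca itself reports for the same block.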

llvm-exegesis measures a reciprocal throughput of about 6.5-7 cycles, but it runs into a lot of cache misses, since none of the cache lines it touches have been loaded yet:

# LLVM-EXEGESIS-DEFREG RSP 20000
# LLVM-EXEGESIS-MEM-DEF test1 131072 7fffffff
# LLVM-EXEGESIS-MEM-MAP test1 131072
popq %rax
popq %rcx
popq %rdx
popq %rbx
popq %r12

taskset -c 5 /tmp/llvm-exegesis -mode=latency -snippets-file=/tmp/test.s -execution-mode=subprocess -benchmark-process-cpu=6 -repetition-mode=loop -min-instructions=10000 -validation-counter=l1d-cache-load-misses
---
mode:            latency
key:
  instructions:
    - 'POP64r RAX'
    - 'POP64r RCX'
    - 'POP64r RDX'
    - 'POP64r RBX'
    - 'POP64r R12'
  config:          ''
  register_initial_values:
    - 'RSP=0x20000'
cpu_name:        skylake-avx512
llvm_triple:     x86_64-grtev4-linux-gnu
min_instructions: 10000
measurements:
  - { key: latency, value: 1.3718, per_snippet_value: 6.859, validation_counters:
      l1d-cache-load-misses: 3744 }
error:           ''
info:            ''
assembled_snippet: 41554154534989FC4989F548BF0000000000000000488D350000000048C1EE0C48C1E60C4881EE0010000048B80B000000000000000F054C8D05000000004C89E74C01C748C1EF0C48C1E70C4881C70010000048BE00F0FFFFFF7F00004829FE48B80B000000000000000F0548BF00E0FFFFFF7F000048BE001000000000000048BA030000000000000049BA11000000000000004D89E849B9000000000000000048B809000000000000000F0548BF000002000000000048BE000002000000000048BA030000000000000049BA110000000000000049B804E0FFFFFF7F0000458B0049B9000000000000000048B809000000000000000F0548BC00F0FFFFFF7F00005141535057565248BF00E0FFFFFF7F00008B3F48BE032400000000000048BA010000000000000048B810000000000000000F055A5E5F58415B5948BC000002000000000049B8020000000000000058595A5B415C58595A5B415C4983C0FF75EE48BF00E0FFFFFF7F00008B3F48BE012400000000000048BA010000000000000048B810000000000000000F0548BF000000000000000048B83C000000000000000F055B415C415DC3
...

taskset -c 5 /tmp/llvm-exegesis -mode=latency -snippets-file=/tmp/test.s -execution-mode=subprocess -benchmark-process-cpu=6 -repetition-mode=loop -min-instructions=5000 -validation-counter=l1d-cache-load-misses
---
mode:            latency
key:
  instructions:
    - 'POP64r RAX'
    - 'POP64r RCX'
    - 'POP64r RDX'
    - 'POP64r RBX'
    - 'POP64r R12'
  config:          ''
  register_initial_values:
    - 'RSP=0x20000'
cpu_name:        skylake-avx512
llvm_triple:     x86_64-grtev4-linux-gnu
min_instructions: 5000
measurements:
  - { key: latency, value: 1.4052, per_snippet_value: 7.026, validation_counters:
      l1d-cache-load-misses: 2169 }
error:           ''
info:            ''
assembled_snippet: 41554154534989FC4989F548BF0000000000000000488D350000000048C1EE0C48C1E60C4881EE0010000048B80B000000000000000F054C8D05000000004C89E74C01C748C1EF0C48C1E70C4881C70010000048BE00F0FFFFFF7F00004829FE48B80B000000000000000F0548BF00E0FFFFFF7F000048BE001000000000000048BA030000000000000049BA11000000000000004D89E849B9000000000000000048B809000000000000000F0548BF000002000000000048BE000002000000000048BA030000000000000049BA110000000000000049B804E0FFFFFF7F0000458B0049B9000000000000000048B809000000000000000F0548BC00F0FFFFFF7F00005141535057565248BF00E0FFFFFF7F00008B3F48BE032400000000000048BA010000000000000048B810000000000000000F055A5E5F58415B5948BC000002000000000049B8020000000000000058595A5B415C58595A5B415C4983C0FF75EE48BF00E0FFFFFF7F00008B3F48BE012400000000000048BA010000000000000048B810000000000000000F0548BF000000000000000048B83C000000000000000F055B415C415DC3
...
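For reference, the `per_snippet_value` in both reports is the per-instruction `value` multiplied by the five instructions in the snippet. A small Python sketch that checks this and compares the measurements with the 30 cycles per iteration computed from the llvm-mca runs (variable names are made up for illustration; all numbers are copied from the output above):

```python
SNIPPET_INSTRUCTIONS = 5  # five popq instructions in the snippet

# (min_instructions, value, per_snippet_value) from the two reports above.
exegesis_runs = [
    (10000, 1.3718, 6.859),
    (5000, 1.4052, 7.026),
]

MCA_CYCLES_PER_ITERATION = 30.0  # from the two llvm-mca runs

ratios = []
for min_instructions, value, per_snippet in exegesis_runs:
    # per_snippet_value should match value * number of instructions.
    recomputed = value * SNIPPET_INSTRUCTIONS
    assert abs(recomputed - per_snippet) < 1e-9

    # How far llvm-mca's prediction is from the measurement.
    ratio = MCA_CYCLES_PER_ITERATION / per_snippet
    ratios.append(ratio)
    print(f"min_instructions={min_instructions}: "
          f"{per_snippet} cycles measured, llvm-mca is {ratio:.2f}x higher")
```

Even with the cache misses inflating the measurement, llvm-mca's prediction is more than four times the measured value.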

uiCA predicts a reciprocal throughput of 2.5 cycles per iteration, although I'm not (currently) convinced that is completely accurate.

I'm pretty sure MCA is seeing the dependency on %rsp and delaying the instructions because of that, although it seems like the hardware is able to figure that out without delaying execution. If we reset %rsp every iteration, MCA seems to do much better. Given this is a real register dependency, I'm not sure this is super easy to fix.
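To make the suspected cause concrete, here is a toy Python sketch of the two bounds involved. It is not how llvm-mca or the hardware works internally. Both parameters are assumptions chosen to match the numbers above: a 6-cycle `%rsp` chain latency per pop is simply 30 cycles divided by 5 pops, and 2 loads per cycle is what a Block RThroughput of 2.5 for 5 pops would imply.

```python
def chain_bound(n_pops, chain_latency):
    """Cycles per iteration if every pop waits for the previous pop's %rsp.

    chain_latency is an assumed per-pop latency on the %rsp dependency chain.
    """
    return n_pops * chain_latency


def port_bound(n_pops, loads_per_cycle):
    """Cycles per iteration if pops are limited only by load throughput.

    loads_per_cycle is an assumed number of loads that can issue per cycle.
    """
    return n_pops / loads_per_cycle


N_POPS = 5

# Assumption: 6 cycles per pop, inferred from 30 cycles / 5 pops.
serialized = chain_bound(N_POPS, chain_latency=6)
# Assumption: 2 loads per cycle, consistent with Block RThroughput 2.5.
throughput_limited = port_bound(N_POPS, loads_per_cycle=2)

print(serialized)          # 30
print(throughput_limited)  # 2.5
```

Under these assumptions, the serialized chain reproduces the 30 cycles seen in the llvm-mca simulation, while the throughput-only bound reproduces the 2.5 cycles reported as Block RThroughput and by uiCA. The measured 6.5-7 cycles sits between the two.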
