(Doing the below in terms of reciprocal throughput to make it easier to compare different tools/benchmark results).
For the following snippet of code:
popq %rax
popq %rcx
popq %rdx
popq %rbx
popq %r12
llvm-mca predicts a reciprocal throughput of 30 cycles:
Iterations: 1000
Instructions: 5000
Total Cycles: 30003
Total uOps: 10000
Dispatch Width: 6
uOps Per Cycle: 0.33
IPC: 0.17
Block RThroughput: 2.5
Iterations: 2000
Instructions: 10000
Total Cycles: 60003
Total uOps: 20000
Taking the difference between the two runs: (60003 - 30003) / (2000 - 1000) = 30 cycles per iteration.
llvm-exegesis measures a reciprocal throughput of about 6.5-7 cycles, but runs into a bunch of cache misses, since none of the cache lines it touches have been loaded yet:
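The subtraction above cancels the fixed startup/drain cycles that show up in both llvm-mca runs. A minimal sketch of that calculation, using only the iteration and cycle counts quoted above (the function name is hypothetical, not part of any LLVM tool):

```python
def steady_state_cycles_per_iteration(run_a, run_b):
    """Each run is an (iterations, total_cycles) pair taken from llvm-mca output.

    Differencing two runs removes the constant overhead common to both,
    leaving the marginal cost of one extra iteration.
    """
    (iters_a, cycles_a), (iters_b, cycles_b) = run_a, run_b
    return (cycles_b - cycles_a) / (iters_b - iters_a)


# Numbers from the two llvm-mca runs quoted above.
mca_rthroughput = steady_state_cycles_per_iteration((1000, 30003), (2000, 60003))
print(mca_rthroughput)  # 30.0
```

Note that this is 12x the "Block RThroughput: 2.5" that llvm-mca itself reports for the block.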
# LLVM-EXEGESIS-DEFREG RSP 20000
# LLVM-EXEGESIS-MEM-DEF test1 131072 7fffffff
# LLVM-EXEGESIS-MEM-MAP test1 131072
popq %rax
popq %rcx
popq %rdx
popq %rbx
popq %r12
taskset -c 5 /tmp/llvm-exegesis -mode=latency -snippets-file=/tmp/test.s -execution-mode=subprocess -benchmark-process-cpu=6 -repetition-mode=loop -min-instructions=10000 -validation-counter=l1d-cache-load-misses
---
mode: latency
key:
instructions:
- 'POP64r RAX'
- 'POP64r RCX'
- 'POP64r RDX'
- 'POP64r RBX'
- 'POP64r R12'
config: ''
register_initial_values:
- 'RSP=0x20000'
cpu_name: skylake-avx512
llvm_triple: x86_64-grtev4-linux-gnu
min_instructions: 10000
measurements:
- { key: latency, value: 1.3718, per_snippet_value: 6.859, validation_counters:
l1d-cache-load-misses: 3744 }
error: ''
info: ''
assembled_snippet: 41554154534989FC4989F548BF0000000000000000488D350000000048C1EE0C48C1E60C4881EE0010000048B80B000000000000000F054C8D05000000004C89E74C01C748C1EF0C48C1E70C4881C70010000048BE00F0FFFFFF7F00004829FE48B80B000000000000000F0548BF00E0FFFFFF7F000048BE001000000000000048BA030000000000000049BA11000000000000004D89E849B9000000000000000048B809000000000000000F0548BF000002000000000048BE000002000000000048BA030000000000000049BA110000000000000049B804E0FFFFFF7F0000458B0049B9000000000000000048B809000000000000000F0548BC00F0FFFFFF7F00005141535057565248BF00E0FFFFFF7F00008B3F48BE032400000000000048BA010000000000000048B810000000000000000F055A5E5F58415B5948BC000002000000000049B8020000000000000058595A5B415C58595A5B415C4983C0FF75EE48BF00E0FFFFFF7F00008B3F48BE012400000000000048BA010000000000000048B810000000000000000F0548BF000000000000000048B83C000000000000000F055B415C415DC3
...
taskset -c 5 /tmp/llvm-exegesis -mode=latency -snippets-file=/tmp/test.s -execution-mode=subprocess -benchmark-process-cpu=6 -repetition-mode=loop -min-instructions=5000 -validation-counter=l1d-cache-load-misses
---
mode: latency
key:
instructions:
- 'POP64r RAX'
- 'POP64r RCX'
- 'POP64r RDX'
- 'POP64r RBX'
- 'POP64r R12'
config: ''
register_initial_values:
- 'RSP=0x20000'
cpu_name: skylake-avx512
llvm_triple: x86_64-grtev4-linux-gnu
min_instructions: 5000
measurements:
- { key: latency, value: 1.4052, per_snippet_value: 7.026, validation_counters:
l1d-cache-load-misses: 2169 }
error: ''
info: ''
assembled_snippet: 41554154534989FC4989F548BF0000000000000000488D350000000048C1EE0C48C1E60C4881EE0010000048B80B000000000000000F054C8D05000000004C89E74C01C748C1EF0C48C1E70C4881C70010000048BE00F0FFFFFF7F00004829FE48B80B000000000000000F0548BF00E0FFFFFF7F000048BE001000000000000048BA030000000000000049BA11000000000000004D89E849B9000000000000000048B809000000000000000F0548BF000002000000000048BE000002000000000048BA030000000000000049BA110000000000000049B804E0FFFFFF7F0000458B0049B9000000000000000048B809000000000000000F0548BC00F0FFFFFF7F00005141535057565248BF00E0FFFFFF7F00008B3F48BE032400000000000048BA010000000000000048B810000000000000000F055A5E5F58415B5948BC000002000000000049B8020000000000000058595A5B415C58595A5B415C4983C0FF75EE48BF00E0FFFFFF7F00008B3F48BE012400000000000048BA010000000000000048B810000000000000000F0548BF000000000000000048B83C000000000000000F055B415C415DC3
...
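To put the two llvm-exegesis results on the same per-iteration footing as the llvm-mca figure: the reported numbers are consistent with `per_snippet_value` being `value` multiplied by the five instructions in the snippet. A small sketch that checks this, assuming that relationship holds (the helper name is hypothetical):

```python
import math

SNIPPET_INSTRUCTIONS = 5  # the five popq instructions in the snippet


def per_snippet(per_instruction_value, n_instructions=SNIPPET_INSTRUCTIONS):
    """Scale a per-instruction measurement to a per-snippet (per-iteration) one."""
    return per_instruction_value * n_instructions


# (value, per_snippet_value) pairs copied from the two benchmark outputs above.
reported = [(1.3718, 6.859), (1.4052, 7.026)]
for value, snippet_value in reported:
    assert math.isclose(per_snippet(value), snippet_value, rel_tol=1e-9)

# Ratio of the llvm-mca prediction (30 cycles) to the first measurement.
mca_over_measured = 30 / reported[0][1]
print(round(mca_over_measured, 2))  # 4.37
```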
uiCA predicts a reciprocal throughput of 2.5 cycles per iteration, although I'm not (currently) convinced that is completely accurate.
I'm pretty sure MCA is seeing the dependency on %rsp and delaying the instructions because of that, although it seems like the hardware is able to figure that out without delaying execution. If we reset %rsp every iteration, MCA seems to do much better. Given this is a real register dependency, I'm not sure this is super easy to fix.