[LICM] Do not reassociate constant offset GEP #151492
Conversation
@llvm/pr-subscribers-backend-powerpc @llvm/pr-subscribers-llvm-transforms

Author: Nikita Popov (nikic)

Changes

LICM tries to reassociate GEPs in order to hoist an invariant GEP. Currently, it also does this in the case where the GEP has a constant offset.

This is usually undesirable. From a back-end perspective, constant GEPs are usually free because they can be folded into addressing modes, so this just increases register pressure. From a middle-end perspective, keeping constant offsets last in the chain makes it easier to analyze the relationship between multiple GEPs on the same base, especially after CSE.

The worst that can happen here is if we start with something like
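```
loop {
  p + 4*x
  p + 4*x + 1
  p + 4*x + 2
  p + 4*x + 3
}
```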
And LICM converts it into:
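```
p.1 = p + 1
p.2 = p + 2
p.3 = p + 3
loop {
  p + 4*x
  p.1 + 4*x
  p.2 + 4*x
  p.3 + 4*x
}
```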
Which is much worse than leaving it for CSE to convert to:
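```
loop {
  p2 = p + 4*x
  p2 + 1
  p2 + 2
  p2 + 3
}
```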
Patch is 48.51 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/151492.diff 9 Files Affected:
diff --git a/llvm/lib/Transforms/Scalar/LICM.cpp b/llvm/lib/Transforms/Scalar/LICM.cpp
index 68094c354cf46..5197e2a7b7d43 100644
--- a/llvm/lib/Transforms/Scalar/LICM.cpp
+++ b/llvm/lib/Transforms/Scalar/LICM.cpp
@@ -2517,6 +2517,12 @@ static bool hoistGEP(Instruction &I, Loop &L, ICFLoopSafetyInfo &SafetyInfo,
if (!L.isLoopInvariant(SrcPtr) || !all_of(GEP->indices(), LoopInvariant))
return false;
+ // Do not try to hoist a constant GEP out of the loop via reassociation.
+ // Constant GEPs can often be folded into addressing modes, and reassociating
+ // them may inhibit CSE of a common base.
+ if (GEP->hasAllConstantIndices())
+ return false;
+
// This can only happen if !AllowSpeculation, otherwise this would already be
// handled.
// FIXME: Should we respect AllowSpeculation in these reassociation folds?
diff --git a/llvm/test/CodeGen/AMDGPU/loop-prefetch-data.ll b/llvm/test/CodeGen/AMDGPU/loop-prefetch-data.ll
index 1e6b77ecea85e..702a69f776de3 100644
--- a/llvm/test/CodeGen/AMDGPU/loop-prefetch-data.ll
+++ b/llvm/test/CodeGen/AMDGPU/loop-prefetch-data.ll
@@ -77,7 +77,7 @@ define amdgpu_kernel void @copy_flat(ptr nocapture %d, ptr nocapture readonly %s
; GFX1250-NEXT: s_add_nc_u64 s[2:3], s[2:3], 16
; GFX1250-NEXT: s_cmp_lg_u32 s6, 0
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: flat_store_b128 v0, v[2:5], s[0:1]
+; GFX1250-NEXT: flat_store_b128 v0, v[2:5], s[0:1] scope:SCOPE_SE
; GFX1250-NEXT: s_wait_xcnt 0x0
; GFX1250-NEXT: s_add_nc_u64 s[0:1], s[0:1], 16
; GFX1250-NEXT: s_cbranch_scc1 .LBB0_2
@@ -400,9 +400,9 @@ define amdgpu_kernel void @copy_flat_divergent(ptr nocapture %d, ptr nocapture r
; GFX12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX12-NEXT: v_lshlrev_b32_e32 v0, 4, v0
; GFX12-NEXT: s_wait_kmcnt 0x0
-; GFX12-NEXT: v_add_co_u32 v2, s1, s6, v0
+; GFX12-NEXT: v_add_co_u32 v2, s1, v0, s6
; GFX12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX12-NEXT: v_add_co_ci_u32_e64 v3, null, s7, 0, s1
+; GFX12-NEXT: v_add_co_ci_u32_e64 v3, null, 0, s7, s1
; GFX12-NEXT: v_add_co_u32 v0, s1, s4, v0
; GFX12-NEXT: v_add_co_u32 v2, vcc_lo, 0xb0, v2
; GFX12-NEXT: s_wait_alu 0xf1ff
@@ -438,9 +438,9 @@ define amdgpu_kernel void @copy_flat_divergent(ptr nocapture %d, ptr nocapture r
; GFX12-SPREFETCH-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX12-SPREFETCH-NEXT: v_lshlrev_b32_e32 v0, 4, v0
; GFX12-SPREFETCH-NEXT: s_wait_kmcnt 0x0
-; GFX12-SPREFETCH-NEXT: v_add_co_u32 v2, s1, s6, v0
+; GFX12-SPREFETCH-NEXT: v_add_co_u32 v2, s1, v0, s6
; GFX12-SPREFETCH-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX12-SPREFETCH-NEXT: v_add_co_ci_u32_e64 v3, null, s7, 0, s1
+; GFX12-SPREFETCH-NEXT: v_add_co_ci_u32_e64 v3, null, 0, s7, s1
; GFX12-SPREFETCH-NEXT: v_add_co_u32 v0, s1, s4, v0
; GFX12-SPREFETCH-NEXT: v_add_co_u32 v2, vcc_lo, 0xb0, v2
; GFX12-SPREFETCH-NEXT: s_wait_alu 0xf1ff
@@ -490,7 +490,7 @@ define amdgpu_kernel void @copy_flat_divergent(ptr nocapture %d, ptr nocapture r
; GFX1250-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1250-NEXT: s_cmp_lg_u32 s0, 0
; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: flat_store_b128 v[0:1], v[4:7]
+; GFX1250-NEXT: flat_store_b128 v[0:1], v[4:7] scope:SCOPE_SE
; GFX1250-NEXT: s_wait_xcnt 0x0
; GFX1250-NEXT: v_add_nc_u64_e32 v[0:1], 16, v[0:1]
; GFX1250-NEXT: s_cbranch_scc1 .LBB4_2
@@ -531,9 +531,9 @@ define amdgpu_kernel void @copy_global_divergent(ptr addrspace(1) nocapture %d,
; GFX12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX12-NEXT: v_lshlrev_b32_e32 v0, 4, v0
; GFX12-NEXT: s_wait_kmcnt 0x0
-; GFX12-NEXT: v_add_co_u32 v2, s1, s6, v0
+; GFX12-NEXT: v_add_co_u32 v2, s1, v0, s6
; GFX12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX12-NEXT: v_add_co_ci_u32_e64 v3, null, s7, 0, s1
+; GFX12-NEXT: v_add_co_ci_u32_e64 v3, null, 0, s7, s1
; GFX12-NEXT: v_add_co_u32 v0, s1, s4, v0
; GFX12-NEXT: v_add_co_u32 v2, vcc_lo, 0xb0, v2
; GFX12-NEXT: s_wait_alu 0xf1ff
@@ -569,9 +569,9 @@ define amdgpu_kernel void @copy_global_divergent(ptr addrspace(1) nocapture %d,
; GFX12-SPREFETCH-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
; GFX12-SPREFETCH-NEXT: v_lshlrev_b32_e32 v0, 4, v0
; GFX12-SPREFETCH-NEXT: s_wait_kmcnt 0x0
-; GFX12-SPREFETCH-NEXT: v_add_co_u32 v2, s1, s6, v0
+; GFX12-SPREFETCH-NEXT: v_add_co_u32 v2, s1, v0, s6
; GFX12-SPREFETCH-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_3)
-; GFX12-SPREFETCH-NEXT: v_add_co_ci_u32_e64 v3, null, s7, 0, s1
+; GFX12-SPREFETCH-NEXT: v_add_co_ci_u32_e64 v3, null, 0, s7, s1
; GFX12-SPREFETCH-NEXT: v_add_co_u32 v0, s1, s4, v0
; GFX12-SPREFETCH-NEXT: v_add_co_u32 v2, vcc_lo, 0xb0, v2
; GFX12-SPREFETCH-NEXT: s_wait_alu 0xf1ff
diff --git a/llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll b/llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll
index be020457ce87d..4c0ab91b7d622 100644
--- a/llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll
+++ b/llvm/test/CodeGen/AMDGPU/memintrinsic-unroll.ll
@@ -6982,7 +6982,7 @@ define void @memmove_p1_p1_sz2048(ptr addrspace(1) align 1 %dst, ptr addrspace(1
; CHECK-NEXT: global_store_dwordx4 v[100:101], v[96:99], off offset:16
; CHECK-NEXT: s_cmp_lg_u64 s[4:5], 0x800
; CHECK-NEXT: s_cbranch_scc1 .LBB6_2
-; CHECK-NEXT: .LBB6_3: ; %Flow9
+; CHECK-NEXT: .LBB6_3: ; %Flow7
; CHECK-NEXT: s_andn2_saveexec_b32 s8, s6
; CHECK-NEXT: s_cbranch_execz .LBB6_6
; CHECK-NEXT: ; %bb.4: ; %memmove_bwd_loop.preheader
@@ -7048,7 +7048,7 @@ define void @memmove_p1_p1_sz2048(ptr addrspace(1) align 1 %dst, ptr addrspace(1
; CHECK-NEXT: global_store_dwordx4 v[100:101], v[96:99], off offset:16
; CHECK-NEXT: s_cmp_eq_u64 s[4:5], s[6:7]
; CHECK-NEXT: s_cbranch_scc0 .LBB6_5
-; CHECK-NEXT: .LBB6_6: ; %Flow10
+; CHECK-NEXT: .LBB6_6: ; %Flow8
; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s8
; CHECK-NEXT: s_setpc_b64 s[30:31]
;
@@ -7689,7 +7689,7 @@ define void @memmove_p1_p1_sz2048(ptr addrspace(1) align 1 %dst, ptr addrspace(1
; ALIGNED-NEXT: global_store_byte v[16:17], v11, off offset:3
; ALIGNED-NEXT: global_store_byte v[16:17], v4, off offset:1
; ALIGNED-NEXT: s_cbranch_scc1 .LBB6_2
-; ALIGNED-NEXT: .LBB6_3: ; %Flow9
+; ALIGNED-NEXT: .LBB6_3: ; %Flow7
; ALIGNED-NEXT: s_andn2_saveexec_b32 s8, s6
; ALIGNED-NEXT: s_cbranch_execz .LBB6_6
; ALIGNED-NEXT: ; %bb.4: ; %memmove_bwd_loop.preheader
@@ -8316,7 +8316,7 @@ define void @memmove_p1_p1_sz2048(ptr addrspace(1) align 1 %dst, ptr addrspace(1
; ALIGNED-NEXT: global_store_byte v[16:17], v11, off offset:3
; ALIGNED-NEXT: global_store_byte v[16:17], v4, off offset:1
; ALIGNED-NEXT: s_cbranch_scc0 .LBB6_5
-; ALIGNED-NEXT: .LBB6_6: ; %Flow10
+; ALIGNED-NEXT: .LBB6_6: ; %Flow8
; ALIGNED-NEXT: s_or_b32 exec_lo, exec_lo, s8
; ALIGNED-NEXT: s_clause 0x7
; ALIGNED-NEXT: buffer_load_dword v47, off, s[0:3], s32
@@ -8369,7 +8369,7 @@ define void @memmove_p1_p1_sz2048(ptr addrspace(1) align 1 %dst, ptr addrspace(1
; UNROLL3-NEXT: global_store_dwordx4 v[0:1], v[2:5], off offset:2032
; UNROLL3-NEXT: ; implicit-def: $vgpr2_vgpr3
; UNROLL3-NEXT: ; implicit-def: $vgpr0_vgpr1
-; UNROLL3-NEXT: .LBB6_4: ; %Flow7
+; UNROLL3-NEXT: .LBB6_4: ; %Flow5
; UNROLL3-NEXT: s_andn2_saveexec_b32 s8, s6
; UNROLL3-NEXT: s_cbranch_execz .LBB6_7
; UNROLL3-NEXT: ; %bb.5: ; %memmove_bwd_residual
@@ -8403,7 +8403,7 @@ define void @memmove_p1_p1_sz2048(ptr addrspace(1) align 1 %dst, ptr addrspace(1
; UNROLL3-NEXT: global_store_dwordx4 v[16:17], v[12:15], off offset:32
; UNROLL3-NEXT: s_cmp_eq_u64 s[4:5], s[6:7]
; UNROLL3-NEXT: s_cbranch_scc0 .LBB6_6
-; UNROLL3-NEXT: .LBB6_7: ; %Flow8
+; UNROLL3-NEXT: .LBB6_7: ; %Flow6
; UNROLL3-NEXT: s_or_b32 exec_lo, exec_lo, s8
; UNROLL3-NEXT: s_setpc_b64 s[30:31]
entry:
diff --git a/llvm/test/CodeGen/AMDGPU/memmove-var-size.ll b/llvm/test/CodeGen/AMDGPU/memmove-var-size.ll
index 272daa9dd0b59..dd5c247f6ef35 100644
--- a/llvm/test/CodeGen/AMDGPU/memmove-var-size.ll
+++ b/llvm/test/CodeGen/AMDGPU/memmove-var-size.ll
@@ -460,10 +460,10 @@ define void @memmove_p0_p4(ptr addrspace(0) align 1 %dst, ptr addrspace(4) align
; CHECK-NEXT: v_cmpx_ge_u64_e64 v[2:3], v[0:1]
; CHECK-NEXT: s_xor_b32 s7, exec_lo, s6
; CHECK-NEXT: s_cbranch_execnz .LBB3_3
-; CHECK-NEXT: ; %bb.1: ; %Flow34
+; CHECK-NEXT: ; %bb.1: ; %Flow36
; CHECK-NEXT: s_andn2_saveexec_b32 s6, s7
; CHECK-NEXT: s_cbranch_execnz .LBB3_10
-; CHECK-NEXT: .LBB3_2: ; %Flow35
+; CHECK-NEXT: .LBB3_2: ; %Flow37
; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s6
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
; CHECK-NEXT: s_setpc_b64 s[30:31]
@@ -494,7 +494,7 @@ define void @memmove_p0_p4(ptr addrspace(0) align 1 %dst, ptr addrspace(4) align
; CHECK-NEXT: v_add_co_ci_u32_e64 v11, null, 0, v11, s6
; CHECK-NEXT: s_andn2_b32 exec_lo, exec_lo, s9
; CHECK-NEXT: s_cbranch_execnz .LBB3_5
-; CHECK-NEXT: .LBB3_6: ; %Flow29
+; CHECK-NEXT: .LBB3_6: ; %Flow31
; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s8
; CHECK-NEXT: s_and_saveexec_b32 s8, s4
; CHECK-NEXT: s_cbranch_execz .LBB3_9
@@ -520,7 +520,7 @@ define void @memmove_p0_p4(ptr addrspace(0) align 1 %dst, ptr addrspace(4) align
; CHECK-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, s6
; CHECK-NEXT: s_andn2_b32 exec_lo, exec_lo, s9
; CHECK-NEXT: s_cbranch_execnz .LBB3_8
-; CHECK-NEXT: .LBB3_9: ; %Flow27
+; CHECK-NEXT: .LBB3_9: ; %Flow29
; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s8
; CHECK-NEXT: ; implicit-def: $vgpr6_vgpr7
; CHECK-NEXT: ; implicit-def: $vgpr2_vgpr3
@@ -556,7 +556,7 @@ define void @memmove_p0_p4(ptr addrspace(0) align 1 %dst, ptr addrspace(4) align
; CHECK-NEXT: v_add_co_ci_u32_e64 v5, null, -1, v5, s5
; CHECK-NEXT: s_andn2_b32 exec_lo, exec_lo, s8
; CHECK-NEXT: s_cbranch_execnz .LBB3_12
-; CHECK-NEXT: .LBB3_13: ; %Flow33
+; CHECK-NEXT: .LBB3_13: ; %Flow35
; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s7
; CHECK-NEXT: s_and_saveexec_b32 s5, vcc_lo
; CHECK-NEXT: s_cbranch_execz .LBB3_16
@@ -584,7 +584,7 @@ define void @memmove_p0_p4(ptr addrspace(0) align 1 %dst, ptr addrspace(4) align
; CHECK-NEXT: flat_store_dwordx4 v[12:13], v[8:11]
; CHECK-NEXT: s_andn2_b32 exec_lo, exec_lo, s7
; CHECK-NEXT: s_cbranch_execnz .LBB3_15
-; CHECK-NEXT: .LBB3_16: ; %Flow31
+; CHECK-NEXT: .LBB3_16: ; %Flow33
; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s5
; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s6
; CHECK-NEXT: s_waitcnt lgkmcnt(0)
@@ -907,10 +907,10 @@ define void @memmove_p1_p1(ptr addrspace(1) align 1 %dst, ptr addrspace(1) align
; CHECK-NEXT: v_cmpx_ge_u64_e64 v[2:3], v[0:1]
; CHECK-NEXT: s_xor_b32 s7, exec_lo, s6
; CHECK-NEXT: s_cbranch_execnz .LBB6_3
-; CHECK-NEXT: ; %bb.1: ; %Flow41
+; CHECK-NEXT: ; %bb.1: ; %Flow39
; CHECK-NEXT: s_andn2_saveexec_b32 s6, s7
; CHECK-NEXT: s_cbranch_execnz .LBB6_10
-; CHECK-NEXT: .LBB6_2: ; %Flow42
+; CHECK-NEXT: .LBB6_2: ; %Flow40
; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s6
; CHECK-NEXT: s_setpc_b64 s[30:31]
; CHECK-NEXT: .LBB6_3: ; %memmove_copy_forward
@@ -940,7 +940,7 @@ define void @memmove_p1_p1(ptr addrspace(1) align 1 %dst, ptr addrspace(1) align
; CHECK-NEXT: v_add_co_ci_u32_e64 v11, null, 0, v11, s6
; CHECK-NEXT: s_andn2_b32 exec_lo, exec_lo, s9
; CHECK-NEXT: s_cbranch_execnz .LBB6_5
-; CHECK-NEXT: .LBB6_6: ; %Flow36
+; CHECK-NEXT: .LBB6_6: ; %Flow34
; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s8
; CHECK-NEXT: s_and_saveexec_b32 s8, s4
; CHECK-NEXT: s_cbranch_execz .LBB6_9
@@ -966,11 +966,11 @@ define void @memmove_p1_p1(ptr addrspace(1) align 1 %dst, ptr addrspace(1) align
; CHECK-NEXT: v_add_co_ci_u32_e64 v1, null, 0, v1, s6
; CHECK-NEXT: s_andn2_b32 exec_lo, exec_lo, s9
; CHECK-NEXT: s_cbranch_execnz .LBB6_8
-; CHECK-NEXT: .LBB6_9: ; %Flow34
+; CHECK-NEXT: .LBB6_9: ; %Flow32
; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s8
; CHECK-NEXT: ; implicit-def: $vgpr6_vgpr7
-; CHECK-NEXT: ; implicit-def: $vgpr2_vgpr3
; CHECK-NEXT: ; implicit-def: $vgpr0_vgpr1
+; CHECK-NEXT: ; implicit-def: $vgpr2_vgpr3
; CHECK-NEXT: ; implicit-def: $vgpr8_vgpr9
; CHECK-NEXT: ; implicit-def: $vgpr4_vgpr5
; CHECK-NEXT: s_andn2_saveexec_b32 s6, s7
@@ -1002,15 +1002,15 @@ define void @memmove_p1_p1(ptr addrspace(1) align 1 %dst, ptr addrspace(1) align
; CHECK-NEXT: v_add_co_ci_u32_e64 v5, null, -1, v5, s5
; CHECK-NEXT: s_andn2_b32 exec_lo, exec_lo, s8
; CHECK-NEXT: s_cbranch_execnz .LBB6_12
-; CHECK-NEXT: .LBB6_13: ; %Flow40
+; CHECK-NEXT: .LBB6_13: ; %Flow38
; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s7
; CHECK-NEXT: s_and_saveexec_b32 s5, vcc_lo
; CHECK-NEXT: s_cbranch_execz .LBB6_16
; CHECK-NEXT: ; %bb.14: ; %memmove_bwd_main_loop.preheader
-; CHECK-NEXT: v_add_co_u32 v2, vcc_lo, v2, -16
-; CHECK-NEXT: v_add_co_ci_u32_e64 v3, null, -1, v3, vcc_lo
; CHECK-NEXT: v_add_co_u32 v0, vcc_lo, v0, -16
; CHECK-NEXT: v_add_co_ci_u32_e64 v1, null, -1, v1, vcc_lo
+; CHECK-NEXT: v_add_co_u32 v2, vcc_lo, v2, -16
+; CHECK-NEXT: v_add_co_ci_u32_e64 v3, null, -1, v3, vcc_lo
; CHECK-NEXT: s_mov_b32 s7, 0
; CHECK-NEXT: .p2align 6
; CHECK-NEXT: .LBB6_15: ; %memmove_bwd_main_loop
@@ -1030,7 +1030,7 @@ define void @memmove_p1_p1(ptr addrspace(1) align 1 %dst, ptr addrspace(1) align
; CHECK-NEXT: global_store_dwordx4 v[12:13], v[8:11], off
; CHECK-NEXT: s_andn2_b32 exec_lo, exec_lo, s7
; CHECK-NEXT: s_cbranch_execnz .LBB6_15
-; CHECK-NEXT: .LBB6_16: ; %Flow38
+; CHECK-NEXT: .LBB6_16: ; %Flow36
; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s5
; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s6
; CHECK-NEXT: s_setpc_b64 s[30:31]
@@ -1181,8 +1181,8 @@ define void @memmove_p1_p4(ptr addrspace(1) align 1 %dst, ptr addrspace(4) align
; CHECK-NEXT: .LBB8_9: ; %Flow31
; CHECK-NEXT: s_or_b32 exec_lo, exec_lo, s8
; CHECK-NEXT: ; implicit-def: $vgpr6_vgpr7
-; CHECK-NEXT: ; implicit-def: $vgpr2_vgpr3
; CHECK-NEXT: ; implicit-def: $vgpr0_vgpr1
+; CHECK-NEXT: ; implicit-def: $vgpr2_vgpr3
; CHECK-NEXT: ; implicit-def: $vgpr8_vgpr9
; CHECK-NEXT: ; implicit-def: $vgpr4_vgpr5
; CHECK-NEXT: s_andn2_saveexec_b32 s6, s7
@@ -1219,10 +1219,10 @@ define void @memmove_p1_p4(ptr addrspace(1) align 1 %dst, ptr addrspace(4) align
; CHECK-NEXT: s_and_saveexec_b32 s5, vcc_lo
; CHECK-NEXT: s_cbranch_execz .LBB8_16
; CHECK-NEXT: ; %bb.14: ; %memmove_bwd_main_loop.preheader
-; CHECK-NEXT: v_add_co_u32 v2, vcc_lo, v2, -16
-; CHECK-NEXT: v_add_co_ci_u32_e64 v3, null, -1, v3, vcc_lo
; CHECK-NEXT: v_add_co_u32 v0, vcc_lo, v0, -16
; CHECK-NEXT: v_add_co_ci_u32_e64 v1, null, -1, v1, vcc_lo
+; CHECK-NEXT: v_add_co_u32 v2, vcc_lo, v2, -16
+; CHECK-NEXT: v_add_co_ci_u32_e64 v3, null, -1, v3, vcc_lo
; CHECK-NEXT: s_mov_b32 s7, 0
; CHECK-NEXT: .p2align 6
; CHECK-NEXT: .LBB8_15: ; %memmove_bwd_main_loop
diff --git a/llvm/test/CodeGen/PowerPC/more-dq-form-prepare.ll b/llvm/test/CodeGen/PowerPC/more-dq-form-prepare.ll
index 9f62477ae01df..af0942e99182d 100644
--- a/llvm/test/CodeGen/PowerPC/more-dq-form-prepare.ll
+++ b/llvm/test/CodeGen/PowerPC/more-dq-form-prepare.ll
@@ -56,155 +56,153 @@ define void @foo(ptr %.m, ptr %.n, ptr %.a, ptr %.x, ptr %.l, ptr %.vy01, ptr %.
; CHECK-NEXT: .cfi_offset v29, -240
; CHECK-NEXT: .cfi_offset v30, -224
; CHECK-NEXT: .cfi_offset v31, -208
+; CHECK-NEXT: std 14, 400(1) # 8-byte Folded Spill
+; CHECK-NEXT: std 15, 408(1) # 8-byte Folded Spill
+; CHECK-NEXT: ld 2, 728(1)
+; CHECK-NEXT: ld 14, 688(1)
+; CHECK-NEXT: ld 11, 704(1)
+; CHECK-NEXT: std 20, 448(1) # 8-byte Folded Spill
+; CHECK-NEXT: std 21, 456(1) # 8-byte Folded Spill
+; CHECK-NEXT: mr 21, 5
+; CHECK-NEXT: lwa 5, 0(7)
+; CHECK-NEXT: ld 7, 720(1)
; CHECK-NEXT: std 22, 464(1) # 8-byte Folded Spill
; CHECK-NEXT: std 23, 472(1) # 8-byte Folded Spill
-; CHECK-NEXT: mr 22, 5
-; CHECK-NEXT: ld 5, 848(1)
+; CHECK-NEXT: mr 22, 6
+; CHECK-NEXT: ld 6, 848(1)
; CHECK-NEXT: addi 3, 3, 1
-; CHECK-NEXT: mr 11, 7
-; CHECK-NEXT: ld 23, 688(1)
-; CHECK-NEXT: ld 7, 728(1)
+; CHECK-NEXT: ld 15, 736(1)
; CHECK-NEXT: std 18, 432(1) # 8-byte Folded Spill
; CHECK-NEXT: std 19, 440(1) # 8-byte Folded Spill
-; CHECK-NEXT: mr 18, 6
-; CHECK-NEXT: li 6, 9
; CHECK-NEXT: ld 19, 768(1)
-; CHECK-NEXT: ld 2, 760(1)
-; CHECK-NEXT: std 26, 496(1) # 8-byte Folded Spill
-; CHECK-NEXT: std 27, 504(1) # 8-byte Folded Spill
-; CHECK-NEXT: cmpldi 3, 9
-; CHECK-NEXT: ld 27, 816(1)
-; CHECK-NEXT: ld 26, 808(1)
-; CHECK-NEXT: std 14, 400(1) # 8-byte Folded Spill
-; CHECK-NEXT: std 15, 408(1) # 8-byte Folded Spill
-; CHECK-NEXT: ld 15, 736(1)
-; CHECK-NEXT: lxv 39, 0(8)
+; CHECK-NEXT: ld 18, 760(1)
; CHECK-NEXT: std 30, 528(1) # 8-byte Folded Spill
; CHECK-NEXT: std 31, 536(1) # 8-byte Folded Spill
-; CHECK-NEXT: ld 30, 704(1)
-; CHECK-NEXT: lxv 38, 0(9)
-; CHECK-NEXT: std 20, 448(1) # 8-byte Folded Spill
-; CHECK-NEXT: std 21, 456(1) # 8-byte Folded Spill
-; CHECK-NEXT: ld 21, 784(1)
+; CHECK-NEXT: ld 12, 696(1)
+; CHECK-NEXT: lxv 0, 0(9)
+; CHECK-NEXT: std 9, 64(1) # 8-byte Folded Spill
+; CHECK-NEXT: std 10, 72(1) # 8-byte Folded Spill
+; CHECK-NEXT: lxv 1, 0(8)
+; CHECK-NEXT: cmpldi 3, 9
+; CHECK-NEXT: ld 30, 824(1)
+; CHECK-NEXT: std 28, 512(1) # 8-byte Folded Spill
+; CHECK-NEXT: std 29, 520(1) # 8-byte Folded Spill
+; CHECK-NEXT: ld 29, 840(1)
+; CHECK-NEXT: ld 28, 832(1)
+; CHECK-NEXT: std 16, 416(1) # 8-byte Folded Spill
+; CHECK-NEXT: std 17, 424(1) # 8-byte Folded Spill
+; CHECK-NEXT: ld 23, 784(1)
; CHECK-NEXT: ld 20, 776(1)
; CHECK-NEXT: std 24, 480(1) # 8-byte Folded Spill
; CHECK-NEXT: std 25, 488(1) # 8-byte Folded Spill
-; CHECK-NEXT: iselgt 3, 3, 6
-; CHECK-NEXT: ld 6, 720(1)
+; CHECK-NEXT: ld 25, 800(1)
; CHECK-NEXT: ld 24, 792(1)
-; CHECK-NEXT: std 10, 72(1) # 8-byte Folded Spill
-; CHECK-NEXT: std 7, 80(1) # 8-byte Folded Spill
+; CHECK-NEXT: std 26, 496(1) # 8-byte Folded Spill
+; CHECK-NEXT: std 27, 504(1) # 8-byte Folded Spill
+; CHECK-NEXT: ld 27, 816(1)
+; CHECK-NEXT: ld 26, 808(1)
+; CHECK-NEXT: stfd 26, 544(1) # 8-byte Folded Spill
+; CHECK-NEXT: stfd 27, 552(1) # 8-byte Folded Spill
+; CHECK-NEXT: ld 17, 752(1)
+; CHECK-NEXT: extswsli 9, 5, 3
+; CHECK-NEXT: lxv 4, 0(14)
+; CHECK-NEXT: std 14, 32(1) # 8-byte Folded Spill
+; CHECK-NEXT: std 12, 40(1) # 8-byte Folded Spill
+; CHECK-NEXT: mulli 0, 5, 40
+; CHECK-NEXT: sldi 14, 5, 5
+; CHECK-NEXT: mulli 31, 5, 24
+; CHECK-NEXT: lxv 38, 0(2)
+; CHECK-NEXT: lxv 2, 0(11)
+; CHECK-NEXT: std 2, 80(1) # 8-byte Folded Spill
+; CHECK-NEXT: std 15, 88(1) # 8-byte Folded Spill
+; CHECK-NEXT: mulli 2, 5, 48
+; CHECK-NEXT: sldi 5, 5, 4
+; CHECK-NEXT: ld 16, 744(1)
+; CHECK-NEXT: lxv 5, 0(10)
+; CHECK-NEXT: std 6, 200(1) # 8-byte Folded Spill
+; CHECK-NEXT: std 29, 192(1) # 8-byte Folded Spill
+; CHECK-NEXT: ld 6, 712(1)
+; CHECK-NEXT: mr 10, 7
+; CHECK-NEXT: add 7, 14, 21
+; CHECK-NEXT: lxv 13, 0(19)
+; CHECK-NEXT: std 8, 48(1) # 8-byte Folded Spill
+; CHECK-NEXT: std 6, 56(1) # 8-byte Folded Spill
+; CHECK-NEXT: mr 8, 11
+; CHECK-NEXT: li 11, 9
+; CHECK-NEXT: iselgt 3, 3, 11
; CHECK-NEXT: addi 3, 3, -2
-; CHECK-NEXT: lxv 6, 0(19)
-; CHECK-NEXT: lxv 11, 0(7)
-; CHECK-NEXT: std 5, 200(1) # 8-byte Folded Spill
-; CHECK-NEXT: std 23, 40(1) # 8-byte Folded Spill
-; CHECK-NEXT: std 6, 48(1) # 8-byte Folded Spill
-; CHECK-NEXT: ld 5, 840(1)
-; CHECK-NEXT: lxv 12, 0(6)
-; CHECK-NEXT: rldicl 12, 3, 61, 3
+; CHECK-NEXT: rldicl 11, 3, 61, 3
+; CHECK-NEXT: lxv 3, 0(12)
+; CHECK-NEXT: lxv 40, 0(6)
+; CHECK-NEXT: std 18, 112(1) # 8-byte Folded Spill
; CHECK-NEXT: std 19, 120(1) # 8-byte Folded Spill
+; CHECK-NEXT: add 19, 21, 5
+; CHECK-NEXT: ld 5, 200(1) # 8-byte Folde...
[truncated]
@@ -77,7 +77,7 @@ define amdgpu_kernel void @copy_flat(ptr nocapture %d, ptr nocapture readonly %s
 ; GFX1250-NEXT: s_add_nc_u64 s[2:3], s[2:3], 16
 ; GFX1250-NEXT: s_cmp_lg_u32 s6, 0
 ; GFX1250-NEXT: s_wait_loadcnt_dscnt 0x0
-; GFX1250-NEXT: flat_store_b128 v0, v[2:5], s[0:1]
+; GFX1250-NEXT: flat_store_b128 v0, v[2:5], s[0:1] scope:SCOPE_SE
Did this really change from this patch?
No, just a test regeneration artifact. Landed separately in e2bd92e.
LICM tries to reassociate GEPs in order to hoist an invariant GEP. Currently, it also does this in the case where the GEP has a constant offset. This is usually undesirable. From a back-end perspective, constant GEPs are usually free because they can be folded into addressing modes, so this just increases register pressure. From a middle-end perspective, keeping constant offsets last in the chain makes it easier to analyze the relationship between multiple GEPs on the same base. The worst that can happen here is if we start with something like

```
loop {
  p + 4*x
  p + 4*x + 1
  p + 4*x + 2
  p + 4*x + 3
}
```

And LICM converts it into:

```
p.1 = p + 1
p.2 = p + 2
p.3 = p + 3
loop {
  p + 4*x
  p.1 + 4*x
  p.2 + 4*x
  p.3 + 4*x
}
```

Which is much worse than leaving it for CSE to convert to:

```
loop {
  p2 = p + 4*x
  p2 + 1
  p2 + 2
  p2 + 3
}
```
53302cc to 817f8e9
Makes sense to me. However, if the GEP chain gets reassociated and idx2 is folded into a constant later, we may still miss some CSE opportunities. See also dtcxzyw/llvm-opt-benchmark#2626 (comment).
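A minimal IR sketch of that concern (the function, value names, and loop shape are invented for illustration; this is not code from the patch or the linked benchmark):

```llvm
define void @sketch(ptr %p, i64 %inv, i64 %n) {
entry:
  br label %loop

loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  ; Loop-variant inner GEP ...
  %row = getelementptr i8, ptr %p, i64 %i
  ; ... with a loop-invariant but non-constant outer index. The new
  ; hasAllConstantIndices() bail-out does not fire here, so LICM can still
  ; reassociate this into gep(gep(%p, %inv), %i) and hoist gep(%p, %inv).
  %elt = getelementptr i8, ptr %row, i64 %inv
  store i8 0, ptr %elt
  %i.next = add i64 %i, 1
  %done = icmp eq i64 %i.next, %n
  br i1 %done, label %exit, label %loop

exit:
  ret void
}
```

If %inv is later folded to a constant (for example after inlining), the hoisted GEP becomes exactly the constant-offset-at-the-front chain this patch tries to avoid, and CSE against other %p-based GEPs becomes harder.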
llvm/lib/Transforms/Scalar/LICM.cpp
// Do not try to hoist a constant GEP out of the loop via reassociation.
// Constant GEPs can often be folded into addressing modes, and reassociating
// them may inhibit CSE of a common base.
if (GEP->hasAllConstantIndices())
Can we move the check to the front? `isLoopInvariant` and `contains` are a bit more expensive.
I'm considering canonicalizing …