
[LV] Increase vectorize-memory-check-threshold to 256 #151712

Open: wants to merge 1 commit into main

igogo-x86
Contributor

We have a benchmark with large loops that benefit from vectorisation; however, they currently require several thousand runtime checks because of the way LoopAccessAnalysis (LAA) is implemented. I would like to improve LAA so that these loops can be vectorised with significantly fewer checks, though still somewhat more than the current limit of 128. Before committing to this work, I need to know whether we can raise the threshold. I checked and found that increasing it to 256 caused no performance or compile-time regressions, including on the benchmarks from https://llvm-compile-time-tracker.com/

@llvmbot
Member

llvmbot commented Aug 1, 2025

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Igor Kirillov (igogo-x86)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/151712.diff

1 file affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+1-1)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 850c4a11edc67..45460003f4a4e 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -203,7 +203,7 @@ static cl::opt<unsigned> TinyTripCountVectorThreshold(
              "are incurred."));
 
 static cl::opt<unsigned> VectorizeMemoryCheckThreshold(
-    "vectorize-memory-check-threshold", cl::init(128), cl::Hidden,
+    "vectorize-memory-check-threshold", cl::init(256), cl::Hidden,
     cl::desc("The maximum allowed number of runtime memory checks"));
 
 // Option prefer-predicate-over-epilogue indicates that an epilogue is undesired,

Contributor

@fhahn fhahn left a comment

Can you share a reproducer that shows the issue?

@igogo-x86
Contributor Author

One of the improvements would address the following problem. In the example below, LLVM currently generates 21 pairwise pointer-disjointness checks to prove that the loop is safe to vectorise:

void test(int *a, int *b, int off_1, int off_2, int off_3, int *x, int *y, int *z) {
    for (int i = 0; i < 100000; ++i) {
        x[i] = a[i + off_1] + b[i + off_1];
        y[i] = a[i + off_2] - b[i + off_2];
        z[i] = a[i + off_3] * b[i + off_3];
    }
}

Below is the complete list of 21 pairwise disjointness (non-aliasing) checks needed for vectorization safety. These ensure that the read and write memory regions accessed do not overlap:

| # | Check description |
|---|---|
| 1 | a + off_1 vs x |
| 2 | a + off_1 vs y |
| 3 | a + off_1 vs z |
| 4 | b + off_1 vs x |
| 5 | b + off_1 vs y |
| 6 | b + off_1 vs z |
| 7 | a + off_2 vs x |
| 8 | a + off_2 vs y |
| 9 | a + off_2 vs z |
| 10 | b + off_2 vs x |
| 11 | b + off_2 vs y |
| 12 | b + off_2 vs z |
| 13 | a + off_3 vs x |
| 14 | a + off_3 vs y |
| 15 | a + off_3 vs z |
| 16 | b + off_3 vs x |
| 17 | b + off_3 vs y |
| 18 | b + off_3 vs z |
| 19 | x vs y |
| 20 | x vs z |
| 21 | y vs z |
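
To make the count concrete, here is a minimal sketch that enumerates the same list. `pairwiseChecks` is a hypothetical helper written for illustration, not anything in LoopAccessAnalysis; it assumes every read is compared with every write and every pair of writes is compared once, while read-vs-read pairs are skipped.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (illustration only, not LLVM code): list every
// pairwise check for a loop body. Each read is paired with each write,
// and each pair of writes is listed once. Read-vs-read pairs are skipped
// because two loads cannot conflict with each other.
std::vector<std::string> pairwiseChecks(const std::vector<std::string> &reads,
                                        const std::vector<std::string> &writes) {
  std::vector<std::string> checks;
  for (const std::string &r : reads)
    for (const std::string &w : writes)
      checks.push_back(r + " vs " + w);
  for (std::size_t i = 0; i < writes.size(); ++i)
    for (std::size_t j = i + 1; j < writes.size(); ++j)
      checks.push_back(writes[i] + " vs " + writes[j]);
  return checks;
}
```

With the six read pointers and three written pointers from the example, this yields 6 × 3 + 3 = 21 entries.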

Instead of checking every pair, we can derive lower/upper bounds on the regions accessed in a and b and reduce the number of checks to 13:

| # | Check description | Comment |
|---|---|---|
| 1 | a + off_1 vs a + off_2 | Determine boundaries of a |
| 2 | prev-range vs a + off_3 | Determine boundaries of a |
| 3 | b + off_1 vs b + off_2 | Determine boundaries of b |
| 4 | prev-range vs b + off_3 | Determine boundaries of b |
| 5 | range of a vs x | |
| 6 | range of a vs y | |
| 7 | range of a vs z | |
| 8 | range of b vs x | |
| 9 | range of b vs y | |
| 10 | range of b vs z | |
| 11 | x vs y | |
| 12 | x vs z | |
| 13 | y vs z | |
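
A rough sketch of what the range-based check could compute at run time. This is only an illustration of the idea, not how LAA emits its checks: `Range`, `mergedRange` and `disjoint` are hypothetical names, addresses are modelled as plain integers, and the element size and trip count are passed in explicitly.

```cpp
#include <algorithm>
#include <cstdint>
#include <initializer_list>

// Half-open range of byte addresses [lo, hi). Hypothetical type.
struct Range {
  std::int64_t lo;
  std::int64_t hi;
};

// Hypothetical helper: bounds of every access base[i + off], 0 <= i < n,
// over all listed offsets, for elements of elemSize bytes. Taking the
// minimum and maximum offset corresponds to the "determine boundaries"
// steps; offs must not be empty.
Range mergedRange(std::int64_t base, std::initializer_list<std::int64_t> offs,
                  std::int64_t n, std::int64_t elemSize) {
  const std::int64_t lo = std::min(offs);
  const std::int64_t hi = std::max(offs);
  return {base + lo * elemSize, base + (hi + n) * elemSize};
}

// Two half-open ranges are disjoint when one ends before the other starts.
bool disjoint(Range a, Range b) { return a.hi <= b.lo || b.hi <= a.lo; }
```

In the example, `mergedRange` would be evaluated once for `a` and once for `b`, and each result compared against `x`, `y` and `z` with `disjoint`, instead of comparing each offset access separately.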

In the benchmark, there are approximately 20–30 groups of objects being read, followed by 6–11 objects being written. These 20–30 groups access different memory locations multiple times, depending on an outer loop variable, which makes the number of required aliasing checks overwhelming.
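
To illustrate how the two schemes scale, the two tables above can be generalised as follows. The symbols are introduced here for illustration only: G is the number of read groups, k the number of distinct offsets at which each group is accessed, and W the number of written objects. The value of k for the benchmark is not stated here, so no benchmark figure is derived.

```latex
N_{\text{pairwise}} = G\,k\,W + \binom{W}{2},
\qquad
N_{\text{merged}} = G\,(k-1) + G\,W + \binom{W}{2}
```

For the small example (G = 2, k = 3, W = 3) this gives 18 + 3 = 21 and 4 + 6 + 3 = 13, matching the tables. The pairwise count grows with the product G·k·W, whereas in the merged count k only contributes the G·(k − 1) boundary steps.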
