Description
Compiler Explorer link: https://rust.godbolt.org/z/3EesGrT98
I tried this code:
```rust
unsafe extern "C" {
    fn write(p: *const u16);
}

#[inline(always)]
fn in_inclusive_range16(u: u16, start: u16, end: u16) -> bool {
    u.wrapping_sub(start) <= (end - start)
}

// Weirdness:
// Look for `cset` under LBB0_3 in aarch64
// and `setne` + `lea` under LBB0_4 in x86_64.
// These seem to correlate with bb2.preheader in LLVM IR.
#[unsafe(no_mangle)]
pub fn bad_loop(pending_slice: &[u16], slice: &[u16], bound: u32, other: u32) {
    // SPDX: Unicode-3.0
    // From icu_normalizer
    let mut code_unit_iter = slice.iter();
    let mut upcoming32;
    let mut trie_value: u32; // Significant even if unused!
    'fast: loop {
        if let Some(&upcoming_code_unit) = code_unit_iter.next() {
            upcoming32 = u32::from(upcoming_code_unit); // may be surrogate
            if upcoming32 < bound {
                continue 'fast;
            }
            'surrogateloop: loop {
                let surrogate_base = upcoming32.wrapping_sub(0xD800);
                if surrogate_base > (0xDFFF - 0xD800) {
                    break 'surrogateloop;
                }
                if surrogate_base <= (0xDBFF - 0xD800) {
                    let iter_backup = code_unit_iter.clone(); // This line is significant even as unused!
                    if let Some(&low) = code_unit_iter.next() {
                        if in_inclusive_range16(low, 0xDC00, 0xDFFF) { // Can't simplify this!
                            upcoming32 = (upcoming32 << 10) + u32::from(low)
                                - (((0xD800u32 << 10) - 0x10000u32) + 0xDC00u32);
                            if upcoming32 < other {
                                continue 'fast;
                            }
                            break 'surrogateloop;
                        } else {
                            // Somehow this can be commented out:
                            // code_unit_iter = iter_backup;
                        }
                    }
                }
                break 'surrogateloop;
            }
            let consumed_so_far =
                &pending_slice[..pending_slice.len() - code_unit_iter.as_slice().len() - 1];
            unsafe { write(consumed_so_far.as_ptr()) };
            break 'fast;
        }
        break;
    }
}
```
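For context, the two arithmetic tricks in the minimized test case can be checked in isolation. This is a sketch with hypothetical helper names (`combine_surrogates` is mine, not from icu_normalizer); it verifies that the `wrapping_sub` range check agrees with an ordinary range test, and that the fused surrogate-pair constant reconstructs the expected scalar values:

```rust
// Range check via wrapping subtraction: for u < start, `u - start` wraps
// around to a large value that fails the `<= end - start` comparison, so
// one comparison covers both bounds.
fn in_inclusive_range16(u: u16, start: u16, end: u16) -> bool {
    u.wrapping_sub(start) <= (end - start)
}

// Surrogate-pair combination as written in the test case: shift the high
// surrogate, add the low surrogate, and subtract a single fused constant
// ((0xD800 << 10) - 0x10000 + 0xDC00) instead of two separate biases.
fn combine_surrogates(high: u16, low: u16) -> u32 {
    (u32::from(high) << 10) + u32::from(low)
        - (((0xD800u32 << 10) - 0x10000u32) + 0xDC00u32)
}

fn main() {
    // 0xD83D 0xDE00 is the UTF-16 encoding of U+1F600.
    assert_eq!(combine_surrogates(0xD83D, 0xDE00), 0x1F600);
    // The lowest surrogate pair maps to U+10000.
    assert_eq!(combine_surrogates(0xD800, 0xDC00), 0x10000);
    // The wrapping check matches the plain range test at the boundaries.
    for u in [0xDBFFu16, 0xDC00, 0xDFFF, 0xE000, 0x0000] {
        assert_eq!(
            in_inclusive_range16(u, 0xDC00, 0xDFFF),
            (0xDC00..=0xDFFF).contains(&u)
        );
    }
    println!("ok");
}
```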
I expected to see this happen:
I expected the very top part of the loop (which advances the iterator and continues if the item is less than `bound`) to start with an `add` with 2 as an immediate, then use `mov` and `cmp`, without `setne` + `lea` trickery on x86_64 and without `cset` making an operand of the `add` conditionally 2 or 0 on aarch64.
Instead, this happened:
If the loop has the shape shown in the minimized test case (which comes from https://github.com/unicode-org/icu4x/blob/932aaa3f6b7afe322c305d124fe5c237f008be6d/components/normalizer/src/lib.rs#L2549 ), the loop advancement and the sentinel-reached check compile to a less usual instruction sequence than in simpler loops.
Notably, unused variables affect the post-optimizer output here. I've remarked in the minimized test case that a couple of unused variables need to be there to reproduce the issue. These variables aren't unused in icu_normalizer before minimization, but it seems like an indication of a deeper problem that unused variables affect the outcome instead of being eliminated as dead code before further transformations.
The part from `'fast: loop {` to the first `continue` is the hottest loop and the most performance-significant part of icu_normalizer. On both x86_64 and aarch64, it is slower than its counterpart loop in ICU4C by more than a factor of 2.
In non-minimized code, the loops look like this:
ICU4X x86_64 (bad)
```
0x000000000b72b1c0 <+768>: mov %r15,%rcx
0x000000000b72b1c3 <+771>: xor %eax,%eax
0x000000000b72b1c5 <+773>: cmp %r13,%r15
0x000000000b72b1c8 <+776>: setne %al
0x000000000b72b1cb <+779>: lea (%r15,%rax,2),%r15
0x000000000b72b1cf <+783>: je 0xb72bd8a <_ZN14icu_normalizer27ComposingNormalizerBorrowed25is_normalized_utf16_up_to17h3a3a573d0919bfd4E+3786>
0x000000000b72b1d5 <+789>: movzwl (%rcx),%r14d
0x000000000b72b1d9 <+793>: cmp %r14d,-0x44(%rbp)
0x000000000b72b1dd <+797>: ja 0xb72b1c0 <_ZN14icu_normalizer27ComposingNormalizerBorrowed25is_normalized_utf16_up_to17h3a3a573d0919bfd4E+768>
```
ICU4C x86_64 (good)
```
0x00000000041aa340 <+208>: add $0x2,%r13
0x00000000041aa344 <+212>: mov %r13,%r14
0x00000000041aa347 <+215>: mov %r14,%r13
0x00000000041aa34a <+218>: cmp %r12,%r14
0x00000000041aa34d <+221>: je 0x41aa72f <_ZNK6icu_7715Normalizer2Impl17composeQuickCheckEPKDsS2_aP25UNormalizationCheckResult+1215>
0x00000000041aa353 <+227>: movzwl 0x0(%r13),%eax
0x00000000041aa358 <+232>: cmp %r9w,%ax
0x00000000041aa35c <+236>: jb 0x41aa340 <_ZNK6icu_7715Normalizer2Impl17composeQuickCheckEPKDsS2_aP25UNormalizationCheckResult+208>
```
ICU4X aarch64 (bad)
```
XUL[0x69616cc] <+548>: mov x8, x28
XUL[0x69616d0] <+552>: cmp x28, x22
XUL[0x69616d4] <+556>: cset w9, ne
XUL[0x69616d8] <+560>: add x28, x28, w9, uxtw #1
XUL[0x69616dc] <+564>: b.eq 0x6961fc4 ; <+2844> at lib.rs
XUL[0x69616e0] <+568>: ldrh w23, [x8]
XUL[0x69616e4] <+572>: cmp w12, w23
XUL[0x69616e8] <+576>: b.hi 0x69616cc ; <+548> [inlined] core::slice::iter::Iter$LT$T$GT$::new::h9583c40a83f784cf at iter.rs:104:13
```
ICU4C aarch64 (good)
```
XUL[0x718d3c] <+224>: add x0, x23, #0x2
XUL[0x718d40] <+228>: mov x23, x0
XUL[0x718d44] <+232>: cmp x0, x22
XUL[0x718d48] <+236>: b.eq 0x7191f8 ; <+1436> at normalizer2impl.cpp:1844:1
XUL[0x718d4c] <+240>: ldrh w10, [x23]
XUL[0x718d50] <+244>: cmp w10, w25
XUL[0x718d54] <+248>: b.lo 0x718d3c ; <+224> at normalizer2impl.cpp:1753:17
```
(Note: the x86_64 code has the additional problem of keeping a value that gets compared against on the stack instead of in a register. Still, the aarch64 slowdown factor is worse on an M3 Pro than the x86_64 slowdown factor is on a Zen 3 Threadripper.)
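For comparison, the expected instruction pattern corresponds to a much simpler loop shape. Below is a hypothetical reduced form (the name `count_below` is mine, not from icu_normalizer) that only advances the iterator and continues while the code unit is below `bound`; it captures just the hot path at the top of `bad_loop`'s `'fast` loop, and no claim is made here about its exact codegen:

```rust
// Hypothetical reduced loop: advance the iterator and continue while the
// current code unit is below `bound`; stop at the first unit at or above it.
pub fn count_below(slice: &[u16], bound: u32) -> usize {
    let mut n = 0;
    let mut iter = slice.iter();
    while let Some(&u) = iter.next() {
        if u32::from(u) < bound {
            n += 1;
            continue;
        }
        break;
    }
    n
}

fn main() {
    // Two code units below 0x80, then 0xD800 stops the loop.
    assert_eq!(count_below(&[0x41, 0x42, 0xD800, 0x43], 0x80), 2);
    assert_eq!(count_below(&[], 0x80), 0);
    println!("ok");
}
```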
Meta
`rustc --version --verbose`:
Both 1.88 and rustc 1.90.0-nightly (498ae9fed 2025-07-28)