Slice iterator advancement can become unidiomatic, which seems like a performance problem #144684

@hsivonen

Description

Compiler Explorer link:
https://rust.godbolt.org/z/3EesGrT98

I tried this code:

unsafe extern "C" {
    fn write(p: *const u16);
}

#[inline(always)]
fn in_inclusive_range16(u: u16, start: u16, end: u16) -> bool {
    u.wrapping_sub(start) <= (end - start)
}

// Weirdness:
// Look for `cset` under LBB0_3 in aarch64
// and `setne` + `lea` under LBB0_4 in x86_64.
// These seem to correlate with bb2.preheader in LLVM IR.
#[unsafe(no_mangle)]
pub fn bad_loop(pending_slice: &[u16], slice: &[u16], bound: u32, other: u32) {
    // SPDX: Unicode-3.0
    // From icu_normalizer
    let mut code_unit_iter = slice.iter();
    let mut upcoming32;
    let mut trie_value: u32; // Significant even if unused!
    'fast: loop {
        if let Some(&upcoming_code_unit) = code_unit_iter.next() {
            upcoming32 = u32::from(upcoming_code_unit); // may be surrogate
            if upcoming32 < bound {
                continue 'fast;
            }
            'surrogateloop: loop {
                let surrogate_base = upcoming32.wrapping_sub(0xD800);
                if surrogate_base > (0xDFFF - 0xD800) {
                    break 'surrogateloop;
                }
                if surrogate_base <= (0xDBFF - 0xD800) {
                    let iter_backup = code_unit_iter.clone(); // This line is significant even as unused!
                    if let Some(&low) = code_unit_iter.next() {
                        if in_inclusive_range16(low, 0xDC00, 0xDFFF) { // Can't simplify this!
                            upcoming32 = (upcoming32 << 10) + u32::from(low)
                                - (((0xD800u32 << 10) - 0x10000u32) + 0xDC00u32);
                            if upcoming32 < other {
                                continue 'fast;
                            }
                            break 'surrogateloop;
                        } else {
                            // Somehow this can be commented out:
                            // code_unit_iter = iter_backup;
                        }
                    }
                }
                break 'surrogateloop;
            }

            let consumed_so_far =
                &pending_slice[..pending_slice.len() - code_unit_iter.as_slice().len() - 1];
            unsafe { write(consumed_so_far.as_ptr()) };
            break 'fast;
        }
        break;
    }
}

I expected to see this happen:

I expected the very top part of the loop (the part that advances the iterator and continues if the item is less than bound) to start with an add with 2 as the immediate, followed by a plain mov and cmp, without the setne + lea trickery on x86_64 or the cset on aarch64 that conditionally makes an operand of the add 2 or 0.

Instead, this happened:

If the loop has the shape of the minimized test case (which comes from https://github.com/unicode-org/icu4x/blob/932aaa3f6b7afe322c305d124fe5c237f008be6d/components/normalizer/src/lib.rs#L2549 ), the iterator advancement and the sentinel-reached check compile to a less usual instruction sequence than they do for simpler loops.

Notably, unused variables affect the post-optimizer output here. I've noted in the minimized test case that a couple of unused variables need to be present. They aren't unused in icu_normalizer before minimization, but it seems like an indication of a deeper problem that unused variables affect the outcome instead of being eliminated as dead code before further transformations.

The part from 'fast: loop { to the first continue is the hottest loop and the most performance-significant part of icu_normalizer. On both x86_64 and aarch64, it is more than 2x slower than its counterpart loop in ICU4C.

In non-minimized code, the loops look like this:

ICU4X x86_64 (bad)

   0x000000000b72b1c0 <+768>:   mov    %r15,%rcx
   0x000000000b72b1c3 <+771>:   xor    %eax,%eax
   0x000000000b72b1c5 <+773>:   cmp    %r13,%r15
   0x000000000b72b1c8 <+776>:   setne  %al
   0x000000000b72b1cb <+779>:   lea    (%r15,%rax,2),%r15
   0x000000000b72b1cf <+783>:   je     0xb72bd8a <_ZN14icu_normalizer27ComposingNormalizerBorrowed25is_normalized_utf16_up_to17h3a3a573d0919bfd4E+3786>
   0x000000000b72b1d5 <+789>:   movzwl (%rcx),%r14d
   0x000000000b72b1d9 <+793>:   cmp    %r14d,-0x44(%rbp)
   0x000000000b72b1dd <+797>:   ja     0xb72b1c0 <_ZN14icu_normalizer27ComposingNormalizerBorrowed25is_normalized_utf16_up_to17h3a3a573d0919bfd4E+768>

ICU4C x86_64 (good)

   0x00000000041aa340 <+208>:   add    $0x2,%r13
   0x00000000041aa344 <+212>:   mov    %r13,%r14
   0x00000000041aa347 <+215>:   mov    %r14,%r13
   0x00000000041aa34a <+218>:   cmp    %r12,%r14
   0x00000000041aa34d <+221>:   je     0x41aa72f <_ZNK6icu_7715Normalizer2Impl17composeQuickCheckEPKDsS2_aP25UNormalizationCheckResult+1215>
   0x00000000041aa353 <+227>:   movzwl 0x0(%r13),%eax
   0x00000000041aa358 <+232>:   cmp    %r9w,%ax
   0x00000000041aa35c <+236>:   jb     0x41aa340 <_ZNK6icu_7715Normalizer2Impl17composeQuickCheckEPKDsS2_aP25UNormalizationCheckResult+208>

ICU4X aarch64 (bad)

XUL[0x69616cc] <+548>:  mov    x8, x28
XUL[0x69616d0] <+552>:  cmp    x28, x22
XUL[0x69616d4] <+556>:  cset   w9, ne
XUL[0x69616d8] <+560>:  add    x28, x28, w9, uxtw #1
XUL[0x69616dc] <+564>:  b.eq   0x6961fc4                 ; <+2844> at lib.rs
XUL[0x69616e0] <+568>:  ldrh   w23, [x8]
XUL[0x69616e4] <+572>:  cmp    w12, w23
XUL[0x69616e8] <+576>:  b.hi   0x69616cc                 ; <+548> [inlined] core::slice::iter::Iter$LT$T$GT$::new::h9583c40a83f784cf at iter.rs:104:13

ICU4C aarch64 (good)

XUL[0x718d3c] <+224>:  add    x0, x23, #0x2
XUL[0x718d40] <+228>:  mov    x23, x0
XUL[0x718d44] <+232>:  cmp    x0, x22
XUL[0x718d48] <+236>:  b.eq   0x7191f8                  ; <+1436> at normalizer2impl.cpp:1844:1
XUL[0x718d4c] <+240>:  ldrh   w10, [x23]
XUL[0x718d50] <+244>:  cmp    w10, w25
XUL[0x718d54] <+248>:  b.lo   0x718d3c                  ; <+224> at normalizer2impl.cpp:1753:17

(Note: the x86_64 code has the additional problem of keeping the value it compares against on the stack instead of in a register. Still, the aarch64 slowdown factor on an M3 Pro is worse than the x86_64 slowdown factor on a Zen 3 Threadripper.)

Meta

rustc --version --verbose:

Reproduces with both 1.88 and rustc 1.90.0-nightly (498ae9fed 2025-07-28)

Metadata


    Labels

    C-optimization: An issue highlighting optimization opportunities or PRs implementing such
    I-slow: Problems and improvements with respect to performance of generated code.
    needs-triage: This issue may need triage. Remove it if it has been sufficiently triaged.
