Slice iterator advancement can become unidiomatic, which seems like a performance problem #144684

@hsivonen

Description

Compiler Explorer link:
https://rust.godbolt.org/z/3EesGrT98

I tried this code:

unsafe extern "C" {
    fn write(p: *const u16);
}

#[inline(always)]
fn in_inclusive_range16(u: u16, start: u16, end: u16) -> bool {
    u.wrapping_sub(start) <= (end - start)
}

// Weirdness:
// Look for `cset` under LBB0_3 in aarch64
// and `setne` + `lea` under LBB0_4 in x86_64.
// These seem to correlate with bb2.preheader in LLVM IR.
#[unsafe(no_mangle)]
pub fn bad_loop(pending_slice: &[u16], slice: &[u16], bound: u32, other: u32) {
    // SPDX: Unicode-3.0
    // From icu_normalizer
    let mut code_unit_iter = slice.iter();
    let mut upcoming32;
    let mut trie_value: u32; // Significant even if unused!
    'fast: loop {
        if let Some(&upcoming_code_unit) = code_unit_iter.next() {
            upcoming32 = u32::from(upcoming_code_unit); // may be surrogate
            if upcoming32 < bound {
                continue 'fast;
            }
            'surrogateloop: loop {
                let surrogate_base = upcoming32.wrapping_sub(0xD800);
                if surrogate_base > (0xDFFF - 0xD800) {
                    break 'surrogateloop;
                }
                if surrogate_base <= (0xDBFF - 0xD800) {
                    let iter_backup = code_unit_iter.clone(); // This line is significant even as unused!
                    if let Some(&low) = code_unit_iter.next() {
                        if in_inclusive_range16(low, 0xDC00, 0xDFFF) { // Can't simplify this!
                            upcoming32 = (upcoming32 << 10) + u32::from(low)
                                - (((0xD800u32 << 10) - 0x10000u32) + 0xDC00u32);
                            if upcoming32 < other {
                                continue 'fast;
                            }
                            break 'surrogateloop;
                        } else {
                            // Somehow this can be commented out:
                            // code_unit_iter = iter_backup;
                        }
                    }
                }
                break 'surrogateloop;
            }

            let consumed_so_far =
                &pending_slice[..pending_slice.len() - code_unit_iter.as_slice().len() - 1];
            unsafe { write(consumed_so_far.as_ptr()) };
            break 'fast;
        }
        break;
    }
}

I expected to see this happen:

I expected the very top part of the loop (the part that advances the iterator and continues if the item is less than bound) to start with an add with 2 as the immediate, followed by a plain mov and cmp, without the setne + lea trickery on x86_64 or the cset on aarch64 that conditionally makes an operand of the add 2 or 0.

Instead, this happened:

If the loop has the shape of the minimized test case (which comes from https://github.com/unicode-org/icu4x/blob/932aaa3f6b7afe322c305d124fe5c237f008be6d/components/normalizer/src/lib.rs#L2549 ), the iterator advancement and the sentinel-reached check compile to a less usual instruction sequence than they do for simpler loops.

Notably, unused variables affect the post-optimizer output here. I've noted in the minimized test case that a couple of unused variables need to be present. They aren't unused in icu_normalizer before minimization, but it seems like an indication of a deeper problem that unused variables affect the outcome instead of being eliminated as dead code before further transformations.

The part from 'fast: loop { to the first continue is the hottest loop and the most performance-significant part of icu_normalizer. On both x86_64 and aarch64, it is more than 2x slower than its counterpart loop in ICU4C.

In non-minimized code, the loops look like this:

ICU4X x86_64 (bad)

   0x000000000b72b1c0 <+768>:   mov    %r15,%rcx
   0x000000000b72b1c3 <+771>:   xor    %eax,%eax
   0x000000000b72b1c5 <+773>:   cmp    %r13,%r15
   0x000000000b72b1c8 <+776>:   setne  %al
   0x000000000b72b1cb <+779>:   lea    (%r15,%rax,2),%r15
   0x000000000b72b1cf <+783>:   je     0xb72bd8a <_ZN14icu_normalizer27ComposingNormalizerBorrowed25is_normalized_utf16_up_to17h3a3a573d0919bfd4E+3786>
   0x000000000b72b1d5 <+789>:   movzwl (%rcx),%r14d
   0x000000000b72b1d9 <+793>:   cmp    %r14d,-0x44(%rbp)
   0x000000000b72b1dd <+797>:   ja     0xb72b1c0 <_ZN14icu_normalizer27ComposingNormalizerBorrowed25is_normalized_utf16_up_to17h3a3a573d0919bfd4E+768>

ICU4C x86_64 (good)

   0x00000000041aa340 <+208>:   add    $0x2,%r13
   0x00000000041aa344 <+212>:   mov    %r13,%r14
   0x00000000041aa347 <+215>:   mov    %r14,%r13
   0x00000000041aa34a <+218>:   cmp    %r12,%r14
   0x00000000041aa34d <+221>:   je     0x41aa72f <_ZNK6icu_7715Normalizer2Impl17composeQuickCheckEPKDsS2_aP25UNormalizationCheckResult+1215>
   0x00000000041aa353 <+227>:   movzwl 0x0(%r13),%eax
   0x00000000041aa358 <+232>:   cmp    %r9w,%ax
   0x00000000041aa35c <+236>:   jb     0x41aa340 <_ZNK6icu_7715Normalizer2Impl17composeQuickCheckEPKDsS2_aP25UNormalizationCheckResult+208>

ICU4X aarch64 (bad)

XUL[0x69616cc] <+548>:  mov    x8, x28
XUL[0x69616d0] <+552>:  cmp    x28, x22
XUL[0x69616d4] <+556>:  cset   w9, ne
XUL[0x69616d8] <+560>:  add    x28, x28, w9, uxtw #1
XUL[0x69616dc] <+564>:  b.eq   0x6961fc4                 ; <+2844> at lib.rs
XUL[0x69616e0] <+568>:  ldrh   w23, [x8]
XUL[0x69616e4] <+572>:  cmp    w12, w23
XUL[0x69616e8] <+576>:  b.hi   0x69616cc                 ; <+548> [inlined] core::slice::iter::Iter$LT$T$GT$::new::h9583c40a83f784cf at iter.rs:104:13

ICU4C aarch64 (good)

XUL[0x718d3c] <+224>:  add    x0, x23, #0x2
XUL[0x718d40] <+228>:  mov    x23, x0
XUL[0x718d44] <+232>:  cmp    x0, x22
XUL[0x718d48] <+236>:  b.eq   0x7191f8                  ; <+1436> at normalizer2impl.cpp:1844:1
XUL[0x718d4c] <+240>:  ldrh   w10, [x23]
XUL[0x718d50] <+244>:  cmp    w10, w25
XUL[0x718d54] <+248>:  b.lo   0x718d3c                  ; <+224> at normalizer2impl.cpp:1753:17

(Note: the x86_64 code has the additional problem of keeping the value it compares against on the stack instead of in a register. Still, the aarch64 slowdown factor on an M3 Pro is worse than the x86_64 slowdown factor on a Zen 3 Threadripper.)

Meta

rustc --version --verbose:

Reproduces with both 1.88 and rustc 1.90.0-nightly (498ae9fed 2025-07-28)

Metadata


    Labels

    C-optimization: An issue highlighting optimization opportunities or PRs implementing such
    I-slow: Problems and improvements with respect to performance of generated code.
    needs-triage: This issue may need triage. Remove it if it has been sufficiently triaged.
