# Profiling with perf

This is a guide for how to profile rustc with perf.

## Initial steps

- Get a clean checkout of rust-lang/master, or whatever it is you want to profile.
- Set the following settings in your `config.toml`:
  - `debuginfo-lines = true`
  - `use-jemalloc = false` -- lets you do memory use profiling with valgrind
  - leave everything else the defaults
- Run `./x.py build` to get a full build
- Make a rustup toolchain (let's call it `rust-prof`) pointing to that result
  - `rustup toolchain link` XXX
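
For reference, these settings live in the `[rust]` section of
`config.toml`; a minimal excerpt might look like this (exact section
layout can vary between rustc versions):

```toml
# config.toml (excerpt) -- the settings suggested above
[rust]
debuginfo-lines = true
use-jemalloc = false
```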

## Gathering a perf profile

perf is an excellent tool on Linux that can be used to gather and
analyze all kinds of information. Mostly it is used to figure out
where a program spends its time. It can also be used for other sorts
of events, though, like cache misses and so forth.

### The basics

The basic `perf` command is this:

```
perf record -F99 --call-graph dwarf XXX
```

The `-F99` tells perf to sample at 99 Hz, which avoids generating too
much data for longer runs. The `--call-graph dwarf` tells perf to get
call-graph information from debuginfo, which is accurate. The `XXX` is
the command you want to profile. So, for example, you might do:

```
perf record -F99 --call-graph dwarf cargo +rust-prof rustc
```

to run `cargo`. But there are some things to be aware of:

- You probably don't want to profile the time spent building
  dependencies. So something like `cargo build; cargo clean -p $C` may
  be helpful (where `$C` is the crate name)
- You probably don't want incremental compilation messing about with your
  profile. So something like `CARGO_INCREMENTAL=0` can be helpful.

### Gathering a perf profile from a `perf.rust-lang.org` test

Often we want to analyze a specific test from `perf.rust-lang.org`. To
do that, the first step is to clone
[the rustc-perf repository][rustc-perf-gh]:

```bash
> git clone https://github.com/rust-lang-nursery/rustc-perf
```

[rustc-perf-gh]: https://github.com/rust-lang-nursery/rustc-perf

This repo contains a bunch of stuff, but the sources for the tests are
found in [the `collector/benchmarks` directory][dir]. So let's go into
the directory of a specific test; we'll use `clap-rs` as an example:

[dir]: https://github.com/rust-lang-nursery/rustc-perf/tree/master/collector/benchmarks

```bash
cd collector/benchmarks/clap-rs
```

Let's say we want to profile the `cargo check` performance. In that
case, I would first run some basic commands to build the dependencies:

```bash
# Setup: first clean out any old results and build the dependencies:
cargo +rust-prof clean
CARGO_INCREMENTAL=0 cargo +rust-prof check
```

Next: we want to record the execution time for *just* the clap-rs
crate, running cargo check. I tend to use `cargo rustc` for this,
since it also allows me to add explicit flags, which we'll do later
on.

```bash
touch src/lib.rs
CARGO_INCREMENTAL=0 perf record -F99 --call-graph dwarf cargo +rust-prof rustc --profile check --lib
```

Note that final command: it's a doozy! It uses the `cargo rustc`
command, which executes rustc with (potentially) additional options;
the `--profile check` and `--lib` options specify that we are doing a
`cargo check` execution, and that this is a library (not a binary).

At this point, we can use `perf` tooling to analyze the results. For example:

```bash
> perf report
```

will open up an interactive TUI program. In simple cases, that can be
helpful. For more detailed examination, the [`perf-focus` tool][pf]
can be helpful; it is covered below.

**A note of caution.** Each of the rustc-perf tests is its own special
 snowflake. In particular, some of them are not libraries, in which
 case you would want to do `touch src/main.rs` and avoid passing
 `--lib`. To be honest, I'm not sure how best to tell which test is
 which.

### Gathering NLL data

If you want to profile an NLL run, you can just pass extra options to
the `cargo rustc` command. The actual perf site just uses
`-Zborrowck=mir`, which we can simulate like so:

```bash
touch src/lib.rs
CARGO_INCREMENTAL=0 perf record -F99 --call-graph dwarf cargo +rust-prof rustc --profile check --lib -- -Zborrowck=mir
```

[pf]: https://github.com/nikomatsakis/perf-focus

## Analyzing a perf profile with `perf focus`

Once you've gathered a perf profile, we want to get some information
about it. For this, I personally use [perf focus][pf]. It's a simple
but useful tool that lets you answer queries like:

- "how much time was spent in function F" (no matter where it was called from)
- "how much time was spent in function F when it was called from G"
- "how much time was spent in function F *excluding* time spent in G"
- "what fns does F call and how much time does it spend in them"

To understand how it works, you have to know just a bit about
perf. Basically, perf works by *sampling* your process on a regular
basis (or whenever some event occurs). For each sample, perf gathers a
backtrace. `perf focus` lets you write a regular expression that tests
which fns appear in that backtrace, and then tells you which
percentage of samples had a backtrace that met the regular
expression. It's probably easiest to explain by walking through how I
would analyze NLL performance.
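
As a rough illustration (not the real tool), the core computation can
be mimicked with a few lines of shell: count what fraction of sampled
backtraces contain a given function name. The sample stacks below are
made up.

```shell
# Toy stand-in for `perf focus '{do_mir_borrowck}'`: each input line is
# one sampled backtrace; count the fraction mentioning the function.
printf 'main;do_mir_borrowck;check\nmain;codegen\nmain;do_mir_borrowck\n' |
  awk '/do_mir_borrowck/ { m++ } { t++ }
       END { printf "%d of %d samples (%d%%)\n", m, t, 100 * m / t }'
```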

## Installing `perf-focus`

You can install perf-focus using `cargo install`:

```
cargo install perf-focus
```

## Example: How much time is spent in MIR borrowck?

Let's say we've gathered the NLL data for a test. We'd like to know
how much time it is spending in the MIR borrow-checker. The "main"
function of the MIR borrowck is called `do_mir_borrowck`, so we can do
this command:

```bash
> perf focus '{do_mir_borrowck}'
Matcher    : {do_mir_borrowck}
Matches    : 228
Not Matches: 542
Percentage : 29%
```

The `'{do_mir_borrowck}'` argument is called the **matcher**. It
specifies the test to be applied on the backtrace. In this case, the
`{X}` indicates that there must be *some* function on the backtrace
that matches the regular expression `X`. Here, that regex is just the
name of the fn we want (in fact, it's a subset of the name; the full
name includes a bunch of other stuff, like the module path). In this
mode, perf-focus just prints out the percentage of samples where
`do_mir_borrowck` was on the stack: in this case, 29%.
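
Since the percentage is just matches over total samples, the numbers
above are easy to reproduce: 228 matching out of 228 + 542 = 770.

```shell
# 228 matching samples out of 228 + 542 = 770 total
awk 'BEGIN { printf "%d%%\n", 100 * 228 / (228 + 542) }'
```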

**A note about c++filt.** To get the data from `perf`, `perf focus`
 currently executes `perf script` (perhaps there is a better
 way...). I've sometimes found that `perf script` outputs C++ mangled
 names. This is annoying. You can tell by running `perf script |
 head` yourself -- if you see names like `5rustc6middle` instead of
 `rustc::middle`, then you have the same problem. You can solve this
 by doing:

```bash
> perf script | c++filt | perf focus --from-stdin ...
```

This will pipe the output from `perf script` through `c++filt` and
should mostly convert those names into a more friendly format. The
`--from-stdin` flag to `perf focus` tells it to get its data from
stdin, rather than executing `perf script`. We should make this more
convenient (at worst, maybe add a `c++filt` option to `perf focus`, or
just always use it -- it's pretty harmless).

## Example: How much time does MIR borrowck spend solving traits?

Perhaps we'd like to know how much time MIR borrowck spends in the
trait checker. We can ask this using a more complex regex:

```bash
> perf focus '{do_mir_borrowck}..{^rustc::traits}'
Matcher    : {do_mir_borrowck},..{^rustc::traits}
Matches    : 12
Not Matches: 1311
Percentage : 0%
```

Here we used the `..` operator to ask "how often do we have
`do_mir_borrowck` on the stack and then, later, some fn whose name
begins with `rustc::traits`?" (basically, code in that module). It
turns out the answer is "almost never" -- only 12 samples fit that
description (if you ever see *no* samples, that often indicates your
query is messed up).

If you're curious, you can find out exactly which samples by using the
`--print-match` option. This will print out the full backtrace for
each sample. The `|` at the front of the line indicates the part that
the regular expression matched.
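
To make the `..` operator concrete, here is a toy version in shell
(with made-up stacks): a backtrace counts as a match only if a
`rustc::traits` frame appears *after* `do_mir_borrowck` in the same
stack. (The real matcher is richer -- e.g. `^` anchors the frame name
-- but the idea is the same.)

```shell
# Toy stand-in for '{do_mir_borrowck}..{rustc::traits}': the second
# pattern must appear later in the backtrace than the first.
printf 'main;do_mir_borrowck;rustc::traits::select\nmain;do_mir_borrowck;liveness\n' |
  awk '/do_mir_borrowck.*rustc::traits/ { m++ } { t++ }
       END { printf "%d of %d samples match\n", m, t }'
```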

## Example: Where does MIR borrowck spend its time?

Often we want to do more "exploratory" queries. Like, we know that
MIR borrowck is 29% of the time, but where does that time get spent?
For that, the `--tree-callees` option is often the best tool. You
usually also want to give `--tree-min-percent` or
`--tree-max-depth`. The result looks like this:

```bash
> perf focus '{do_mir_borrowck}' --tree-callees --tree-min-percent 3
Matcher    : {do_mir_borrowck}
Matches    : 577
Not Matches: 746
Percentage : 43%

Tree
| matched `{do_mir_borrowck}` (43% total, 0% self)
: | rustc_mir::borrow_check::nll::compute_regions (20% total, 0% self)
: : | rustc_mir::borrow_check::nll::type_check::type_check_internal (13% total, 0% self)
: : : | core::ops::function::FnOnce::call_once (5% total, 0% self)
: : : : | rustc_mir::borrow_check::nll::type_check::liveness::generate (5% total, 3% self)
: : : | <rustc_mir::borrow_check::nll::type_check::TypeVerifier<'a, 'b, 'gcx, 'tcx> as rustc::mir::visit::Visitor<'tcx>>::visit_mir (3% total, 0% self)
: | rustc::mir::visit::Visitor::visit_mir (8% total, 6% self)
: | <rustc_mir::borrow_check::MirBorrowckCtxt<'cx, 'gcx, 'tcx> as rustc_mir::dataflow::DataflowResultsConsumer<'cx, 'tcx>>::visit_statement_entry (5% total, 0% self)
: | rustc_mir::dataflow::do_dataflow (3% total, 0% self)
```

What happens with `--tree-callees` is that

- we find each sample matching the regular expression
- we look at the code that occurs *after* the regex match and try to build up a call tree

The `--tree-min-percent 3` option says "only show me things that take
more than 3% of the time". Without this, the tree often gets really
noisy and includes random stuff like the innards of
malloc. `--tree-max-depth` can be useful too; it just limits how many
levels we print.

For each line, we display the percent of time in that function
altogether ("total") and the percent of time spent in **just that
function and not some callee of that function** ("self"). Usually
"total" is the more interesting number, but not always.
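
The distinction can be sketched in shell with made-up sample stacks:
"total" counts samples with the function anywhere on the stack, while
"self" counts only samples where it is the innermost (leaf) frame,
i.e. with no callee below it.

```shell
# For a function F: "total" = F anywhere on the stack;
# "self" = F is the last (innermost) frame of the sample.
printf 'main;F;G\nmain;F\nmain;H\n' |
  awk -F';' '{ t++
               for (i = 1; i <= NF; i++)
                 if ($i == "F") { tot++; if (i == NF) self++; break } }
             END { printf "total %d/%d, self %d/%d\n", tot, t, self, t }'
```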

### Absolute vs relative percentages

By default, all percentages in perf-focus are relative to the **total
program execution**. This is useful to help you keep perspective --
often as we drill down to find hot spots, we can lose sight of the
fact that, in terms of overall program execution, this "hot spot" is
actually not important. It also ensures that percentages between
different queries are easily compared against one another.

That said, sometimes it's useful to get relative percentages, so `perf
focus` offers a `--relative` option. In this case, the percentages are
listed only for samples that match (vs all samples). So for example we
could get our percentages relative to the borrowck itself like so:

```bash
> perf focus '{do_mir_borrowck}' --tree-callees --relative --tree-max-depth 1 --tree-min-percent 5
Matcher    : {do_mir_borrowck}
Matches    : 577
Not Matches: 746
Percentage : 100%

Tree
| matched `{do_mir_borrowck}` (100% total, 0% self)
: | rustc_mir::borrow_check::nll::compute_regions (47% total, 0% self) [...]
: | rustc::mir::visit::Visitor::visit_mir (19% total, 15% self) [...]
: | <rustc_mir::borrow_check::MirBorrowckCtxt<'cx, 'gcx, 'tcx> as rustc_mir::dataflow::DataflowResultsConsumer<'cx, 'tcx>>::visit_statement_entry (13% total, 0% self) [...]
: | rustc_mir::dataflow::do_dataflow (8% total, 1% self) [...]
```

Here you see that `compute_regions` came up as "47% total" -- that
means that 47% of `do_mir_borrowck` is spent in that function. Before,
we saw 20% -- that's because `do_mir_borrowck` itself is only 43% of
the total time (and `.47 * .43 = .20`).
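
That back-of-the-envelope conversion (a callee's relative share times
the parent's absolute share) is easy to check:

```shell
# compute_regions is 47% of do_mir_borrowck, which is 43% of the total:
awk 'BEGIN { printf "%.0f%%\n", 100 * 0.47 * 0.43 }'
```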