
Commit 44c05c7

nikomatsakis authored and mark-i-m committed
add a section about profiling with perf
1 parent 1277f2a commit 44c05c7

3 files changed: +306 -1 lines changed


src/SUMMARY.md

Lines changed: 3 additions & 1 deletion
```diff
@@ -4,13 +4,15 @@
 - [About the compiler team](./compiler-team.md)
 - [How to build the compiler and run what you built](./how-to-build-and-run.md)
 - [Coding conventions](./conventions.md)
+- [Walkthrough: a typical contribution](./walkthrough.md)
 - [The compiler testing framework](./tests/intro.md)
     - [Running tests](./tests/running.md)
     - [Adding new tests](./tests/adding.md)
     - [Using `compiletest` + commands to control test
       execution](./compiletest.md)
 - [Debugging the Compiler](./compiler-debugging.md)
-- [Walkthrough: a typical contribution](./walkthrough.md)
+- [Profiling the compiler](./profiling.md)
+    - [with the linux perf tool](./profiling/with_perf.md)
 - [High-level overview of the compiler source](./high-level-overview.md)
 - [The Rustc Driver](./rustc-driver.md)
 - [Rustdoc](./rustdoc.md)
```

src/profiling.md

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@

# Profiling the compiler

This discussion talks about how to profile the compiler and find out
where it spends its time. If you just want to get a general overview,
it is often a good idea to just add the `-Zself-profile` option to the
rustc command line. This will break down time spent into various
categories. But if you want a more detailed look, you probably want
to break out a custom profiler.
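For example, a minimal way to try this out might look like the sketch below. This is only an illustration: `-Zself-profile` is an unstable flag, so it needs a nightly (or locally built) compiler, its exact output has changed across versions, and `hello.rs` is just a stand-in for whatever you are actually compiling.

```bash
# Sketch: ask rustc itself for a coarse breakdown of where time goes.
# Requires a nightly or locally built toolchain, since -Z flags are unstable.
rustc +nightly -Zself-profile hello.rs
```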

src/profiling/with_perf.md

Lines changed: 294 additions & 0 deletions
@@ -0,0 +1,294 @@
# Profiling with perf

This is a guide for how to profile rustc with perf.

## Initial steps

- Get a clean checkout of rust-lang/master, or whatever it is you want to profile.
- Set the following settings in your `config.toml`:
  - `debuginfo-lines = true`
  - `use-jemalloc = false` -- lets you do memory use profiling with valgrind
  - leave everything else at the defaults
- Run `./x.py build` to get a full build
- Make a rustup toolchain (let's call it `rust-prof`) pointing to that result
  - `rustup toolchain link` XXX (see the sketch below)
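The last two steps might look roughly like the following sketch. The build output directory shown here is an assumption: it depends on your host triple and on which stage you build, so point `rustup toolchain link` at whatever directory your build actually produced.

```bash
# Sketch only -- adjust the path to match your host triple and build stage.
./x.py build
rustup toolchain link rust-prof build/x86_64-unknown-linux-gnu/stage2
```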
## Gathering a perf profile
perf is an excellent tool on linux that can be used to gather and
analyze all kinds of information. Mostly it is used to figure out
where a program spends its time. It can also be used for other sorts
of events, though, like cache misses and so forth.
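As an aside, here is a sketch of what sampling a different kind of event might look like. The event name is only an example; which events are available depends on your hardware and kernel, and `perf list` will show what your machine actually supports.

```bash
# Sketch: see which events are available, then sample on cache misses
# instead of the default time-based samples.
perf list
perf record -e cache-misses --call-graph dwarf cargo +rust-prof check
```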
### The basics
The basic `perf` command is this:

```
perf record -F99 --call-graph dwarf XXX
```

The `-F99` tells perf to sample at 99 Hz, which avoids generating too
much data for longer runs. The `--call-graph dwarf` tells perf to get
call-graph information from debuginfo, which is accurate. The `XXX` is
the command you want to profile. So, for example, you might do:

```
perf record -F99 --call-graph dwarf cargo +rust-prof rustc
```

to run `cargo`. But there are some things to be aware of:

- You probably don't want to profile the time spent building
  dependencies. So something like `cargo build; cargo clean -p $C` may
  be helpful (where `$C` is the crate name)
- You probably don't want incremental compilation messing about with your
  profile. So something like `CARGO_INCREMENTAL=0` can be helpful (a
  combined sketch follows this list).
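Putting those two points together, a full invocation might look something like the sketch below. The crate name `my-crate` is purely a placeholder for whatever crate you are measuring.

```bash
# Sketch: build everything once, throw away only this crate's artifacts,
# then record a profile of rebuilding just that crate.
cargo +rust-prof build
cargo +rust-prof clean -p my-crate
CARGO_INCREMENTAL=0 perf record -F99 --call-graph dwarf cargo +rust-prof build
```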
### Gathering a perf profile from a `perf.rust-lang.org` test
Often we want to analyze a specific test from `perf.rust-lang.org`. To
do that, the first step is to clone
[the rustc-perf repository][rustc-perf-gh]:

```bash
> git clone https://github.com/rust-lang-nursery/rustc-perf
```

[rustc-perf-gh]: https://github.com/rust-lang-nursery/rustc-perf

This repo contains a bunch of stuff, but the sources for the tests are
found in [the `collector/benchmarks` directory][dir]. So let's go into
the directory of a specific test; we'll use `clap-rs` as an example:

[dir]: https://github.com/rust-lang-nursery/rustc-perf/tree/master/collector/benchmarks

```bash
cd collector/benchmarks/clap-rs
```

In this case, let's say we want to profile the `cargo check`
performance. In that case, I would first run some basic commands to
build the dependencies:

```bash
# Setup: first clean out any old results and build the dependencies:
cargo +rust-prof clean
CARGO_INCREMENTAL=0 cargo +rust-prof check
```

Next: we want to record the execution time for *just* the clap-rs
crate, running cargo check. I tend to use `cargo rustc` for this,
since it also allows me to add explicit flags, which we'll do later on.

```bash
touch src/lib.rs
CARGO_INCREMENTAL=0 perf record -F99 --call-graph dwarf cargo rustc --profile check --lib
```

Note that final command: it's a doozy! It uses the `cargo rustc`
command, which executes rustc with (potentially) additional options;
the `--profile check` and `--lib` options specify that we are doing a
`cargo check` execution, and that this is a library (not a binary).

At this point, we can use `perf` tooling to analyze the results. For example:

```bash
> perf report
```

will open up an interactive TUI program. In simple cases, that can be
helpful. For more detailed examination, the [`perf-focus` tool][pf]
can be helpful; it is covered below.
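If you prefer a non-interactive summary, perf can also print the report straight to the terminal; here is a quick sketch (the `--stdio` flag is standard perf, but exactly what you see depends on your profile data):

```bash
# Sketch: print a flat, non-interactive report instead of opening the TUI.
> perf report --stdio | head -50
```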
**A note of caution.** Each of the rustc-perf tests is its own special
snowflake. In particular, some of them are not libraries, in which
case you would want to do `touch src/main.rs` and avoid passing
`--lib`. I'm not sure how best to tell which test is which, to be
honest.
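One rough heuristic, assuming the benchmark follows the standard Cargo layout, is just to look at what lives under `src/` and in `Cargo.toml`:

```bash
# Sketch: a benchmark with src/main.rs (or [[bin]] targets in Cargo.toml)
# is probably a binary; one with only src/lib.rs is probably a library.
ls src/
grep -A2 '\[\[bin\]\]' Cargo.toml
```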
### Gathering NLL data
If you want to profile an NLL run, you can just pass extra options to
the `cargo rustc` command. The actual perf site just uses
`-Zborrowck=mir`, which we can simulate like so:

```bash
touch src/lib.rs
CARGO_INCREMENTAL=0 perf record -F99 --call-graph dwarf cargo rustc --profile check --lib -- -Zborrowck=mir
```

[pf]: https://github.com/nikomatsakis/perf-focus
## Analyzing a perf profile with `perf focus`
Once you've gathered a perf profile, we want to get some information
about it. For this, I personally use [perf focus][pf]. It's a kind of
simple but useful tool that lets you answer queries like:

- "how much time was spent in function F" (no matter where it was called from)
- "how much time was spent in function F when it was called from G"
- "how much time was spent in function F *excluding* time spent in G"
- "what fns does F call and how much time does it spend in them"

To understand how it works, you have to know just a bit about
perf. Basically, perf works by *sampling* your process on a regular
basis (or whenever some event occurs). For each sample, perf gathers a
backtrace. `perf focus` lets you write a regular expression that tests
which fns appear in that backtrace, and then tells you which
percentage of samples had a backtrace that met the regular
expression. It's probably easiest to explain by walking through how I
would analyze NLL performance.
## Installing `perf-focus`
You can install perf-focus using `cargo install`:

```
cargo install perf-focus
```
## Example: How much time is spent in MIR borrowck?
Let's say we've gathered the NLL data for a test. We'd like to know
how much time it is spending in the MIR borrow-checker. The "main"
function of the MIR borrowck is called `do_mir_borrowck`, so we can do
this command:

```bash
> perf focus '{do_mir_borrowck}'
Matcher : {do_mir_borrowck}
Matches : 228
Not Matches: 542
Percentage : 29%
```

The `'{do_mir_borrowck}'` argument is called the **matcher**. It
specifies the test to be applied on the backtrace. In this case, the
`{X}` indicates that there must be *some* function on the backtrace
that meets the regular expression `X`. In this case, that regex is
just the name of the fn we want (in fact, it's a subset of the name;
the full name includes a bunch of other stuff, like the module
path). In this mode, perf-focus just prints out the percentage of
samples where `do_mir_borrowck` was on the stack: in this case, 29%.

**A note about c++filt.** To get the data from `perf`, `perf focus`
currently executes `perf script` (perhaps there is a better
way...). I've sometimes found that `perf script` outputs C++ mangled
names. This is annoying. You can tell by running `perf script |
head` yourself -- if you see names like `5rustc6middle` instead of
`rustc::middle`, then you have the same problem. You can solve this
by doing:

```bash
> perf script | c++filt | perf focus --from-stdin ...
```

This will pipe the output from `perf script` through `c++filt` and
should mostly convert those names into a more friendly format. The
`--from-stdin` flag to `perf focus` tells it to get its data from
stdin, rather than executing `perf script` itself. We should make this
more convenient (at worst, maybe add a `c++filt` option to `perf
focus`, or just always use it -- it's pretty harmless).
## Example: How much time does MIR borrowck spend solving traits?
Perhaps we'd like to know how much time MIR borrowck spends in the
trait checker. We can ask this using a more complex regex:

```bash
> perf focus '{do_mir_borrowck}..{^rustc::traits}'
Matcher : {do_mir_borrowck},..{^rustc::traits}
Matches : 12
Not Matches: 1311
Percentage : 0%
```

Here we used the `..` operator to ask "how often do we have
`do_mir_borrowck` on the stack and then, later, some fn whose name
begins with `rustc::traits`?" (basically, code in that module). It
turns out the answer is "almost never" -- only 12 samples fit that
description (if you ever see *no* samples, that often indicates your
query is messed up).

If you're curious, you can find out exactly which samples matched by
using the `--print-match` option. This will print out the full
backtrace for each sample. The `|` at the front of the line indicates
the part that the regular expression matched.
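For instance, here is a sketch of how you might invoke that (the actual backtraces printed will depend entirely on your own profile data):

```bash
# Sketch: show the full backtrace of each matching sample; frames that the
# matcher matched are marked with a leading `|`.
> perf focus '{do_mir_borrowck}..{^rustc::traits}' --print-match
```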
## Example: Where does MIR borrowck spend its time?
Often we want to do more "explorational" queries. Like, we know that
MIR borrowck is 29% of the time, but where does that time get spent?
For that, the `--tree-callees` option is often the best tool. You
usually also want to give `--tree-min-percent` or
`--tree-max-depth`. The result looks like this:

```bash
> perf focus '{do_mir_borrowck}' --tree-callees --tree-min-percent 3
Matcher : {do_mir_borrowck}
Matches : 577
Not Matches: 746
Percentage : 43%

Tree
| matched `{do_mir_borrowck}` (43% total, 0% self)
: | rustc_mir::borrow_check::nll::compute_regions (20% total, 0% self)
: : | rustc_mir::borrow_check::nll::type_check::type_check_internal (13% total, 0% self)
: : : | core::ops::function::FnOnce::call_once (5% total, 0% self)
: : : : | rustc_mir::borrow_check::nll::type_check::liveness::generate (5% total, 3% self)
: : : | <rustc_mir::borrow_check::nll::type_check::TypeVerifier<'a, 'b, 'gcx, 'tcx> as rustc::mir::visit::Visitor<'tcx>>::visit_mir (3% total, 0% self)
: | rustc::mir::visit::Visitor::visit_mir (8% total, 6% self)
: | <rustc_mir::borrow_check::MirBorrowckCtxt<'cx, 'gcx, 'tcx> as rustc_mir::dataflow::DataflowResultsConsumer<'cx, 'tcx>>::visit_statement_entry (5% total, 0% self)
: | rustc_mir::dataflow::do_dataflow (3% total, 0% self)
```

What happens with `--tree-callees` is that

- we find each sample matching the regular expression
- we look at the code that occurs *after* the regex match and try to build up a call tree

The `--tree-min-percent 3` option says "only show me things that take
more than 3% of the time." Without this, the tree often gets really
noisy and includes random stuff like the innards of
malloc. `--tree-max-depth` can be useful too; it just limits how many
levels we print.

For each line, we display the percent of time in that function
altogether ("total") and the percent of time spent in **just that
function and not some callee of that function** (self). Usually
"total" is the more interesting number, but not always.
### Absolute vs relative percentages
By default, all percentages in perf-focus are relative to the **total
program execution**. This is useful to help you keep perspective --
often as we drill down to find hot spots, we can lose sight of the
fact that, in terms of overall program execution, this "hot spot" is
actually not important. It also ensures that percentages between
different queries are easily compared against one another.

That said, sometimes it's useful to get relative percentages, so `perf
focus` offers a `--relative` option. In this case, the percentages are
listed only for samples that match (vs all samples). So for example we
could get our percentages relative to the borrowck itself
like so:

```bash
> perf focus '{do_mir_borrowck}' --tree-callees --relative --tree-max-depth 1 --tree-min-percent 5
Matcher : {do_mir_borrowck}
Matches : 577
Not Matches: 746
Percentage : 100%

Tree
| matched `{do_mir_borrowck}` (100% total, 0% self)
: | rustc_mir::borrow_check::nll::compute_regions (47% total, 0% self) [...]
: | rustc::mir::visit::Visitor::visit_mir (19% total, 15% self) [...]
: | <rustc_mir::borrow_check::MirBorrowckCtxt<'cx, 'gcx, 'tcx> as rustc_mir::dataflow::DataflowResultsConsumer<'cx, 'tcx>>::visit_statement_entry (13% total, 0% self) [...]
: | rustc_mir::dataflow::do_dataflow (8% total, 1% self) [...]
```

Here you see that `compute_regions` came up as "47% total" -- that
means that 47% of `do_mir_borrowck` is spent in that function. Before,
we saw 20% -- that's because `do_mir_borrowck` itself is only 43% of
the total time (and `.47 * .43 = .20`).
