Performance analysis with flame graph under Linux

LDD-LinuxDeviceDrivers/study/debug/tools/perf/flame_graph

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please credit the source when reprinting. Thank you for your cooperation.

My technical level and knowledge are limited, so if anything here is lacking or needs correcting, corrections are welcome. Suggestions of other good debugging tools to include are welcome too. Thank you in advance.


Software performance analysis often requires checking where CPU time is spent in order to understand where the bottleneck is.

The flame graph is a powerful tool for this kind of performance analysis.

1 Introduction to flame graphs


When many people catch a cold and run a fever, they imitate Shennong tasting herbs: they try antivirals first, then antibiotics, taking whatever medicine is in the house, Chinese or Western, in the hope that a blind cat will eventually stumble on a dead mouse. This is naturally unwise. The right approach is to go to the hospital for a blood test and then take the medicine that matches the diagnosis.

Recall how we usually debug programs: we often rely on subjective assumptions with no data behind them, rather than working out what actually caused the problem.

Tuning program performance likewise calls for the right medicine for the diagnosis. The good news is that Brendan Gregg invented the flame graph.

1.1 Flame graph


Common types of flame graph include On-CPU and Off-CPU, as well as Memory, Hot/Cold, Differential, and others.

For a detailed introduction to flame graphs, see the references at the end of this article. In short: the whole graphic looks like a dancing flame, which is where the name comes from. The tip of each flame is the operation the CPU is executing. The colors are random and carry no special meaning. The vertical axis shows the depth of the call stack, and the horizontal axis shows the time consumed. Because call stacks are sorted alphabetically along the horizontal axis and identical call stacks are merged, the wider a frame is, the more likely it is to be a bottleneck. In summary, look mainly at the relatively wide flames, and pay special attention to those with flat tops (plateaus).

To generate a flame graph you need a suitable tracing tool. On Linux the choice is usually perf or systemtap. perf is the more commonly used of the two, because it is a performance tuning tool built into the Linux kernel and ships with most Linux distributions. Interested readers can refer to Linux Profiling at Netflix, especially its description of how to deal with broken stacks, which is worth reading several times. systemtap is more powerful, but the drawback is that you first have to learn its own scripting language.

Early on, flame graphs were most active in the Nginx community. If you develop or optimize Nginx, I strongly recommend Chun's nginx-systemtap-toolkit. At first glance the name might suggest that the toolkit is specific to Nginx, but in fact many of its tools work for any program written in C/C++:

Program              Function
sample-bt            Samples data for generating an On-CPU flame graph
sample-bt-off-cpu    Samples data for generating an Off-CPU flame graph

1.2 On/Off-CPU flame graphs


So when should you use an On-CPU flame graph, and when an Off-CPU flame graph?

It depends on what the current bottleneck is. If it is the CPU, use an On-CPU flame graph; if it is I/O or locks, use an Off-CPU flame graph. If you are not sure, a load-testing tool can settle it: see whether the tool can drive CPU utilization close to saturation. If it can, use an On-CPU flame graph. If CPU utilization never rises no matter how hard you push, the program is most likely stuck on I/O or locks, and an Off-CPU flame graph is the better fit.

If you still cannot tell, generate both an On-CPU and an Off-CPU flame graph. Normally the two differ considerably. If they look similar, the CPU is usually being preempted by other processes.

When sampling, it is best to keep the program under load with a load-testing tool so that enough samples are collected. If you choose ab, remember to enable the -k option to avoid exhausting the system's available ports. I also recommend trying a more modern load-testing tool such as wrk.

1.3 Flame Graph Visualization Generator


Brendan D. Gregg's Flame Graph project implements a set of scripts for generating flame graphs.

The Flame Graph project is hosted on GitHub:

github.com/brendangreg...

Clone it with git:

    git clone https://github.com/brendangregg/FlameGraph.git

The following steps are required to create a flame graph:

  1. Capture stacks (perf/systemtap/dtrace): use perf, systemtap, dtrace, or a similar tool to capture the stacks of the running program.

  2. Fold stacks (the stackcollapse programs in FlameGraph): the trace tool records the system and program stacks at each sampled moment. These have to be analyzed and merged, with repeated stacks accumulated together, so that they reflect the load and the critical path.

  3. Generate the flame graph (flamegraph.pl): analyze the stack information output by stackcollapse and generate the flame graph.

Different trace tools capture different information, so Flame Graph provides a series of stackcollapse tools.

stackcollapse script            Description
stackcollapse.pl                for DTrace stacks
stackcollapse-perf.pl           for Linux perf_events "perf script" output
stackcollapse-pmc.pl            for FreeBSD pmcstat -G stacks
stackcollapse-stap.pl           for SystemTap stacks
stackcollapse-instruments.pl    for XCode Instruments
stackcollapse-vtune.pl          for Intel VTune profiles
stackcollapse-ljp.awk           for Lightweight Java Profiler
stackcollapse-jstack.pl         for Java jstack(1) output
stackcollapse-gdb.pl            for gdb(1) stacks
stackcollapse-go.pl             for Golang pprof stacks
stackcollapse-vsprof.pl         for Microsoft Visual Studio profiles
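
As a rough illustration of what the folding step does (this is not the code of the stackcollapse scripts), the Python sketch below merges identical stacks and emits one "frame;frame;frame count" line per unique stack. The sample stacks and the name fold_stacks are hypothetical.

```python
from collections import Counter


def fold_stacks(stacks):
    """Collapse call stacks into folded 'a;b;c count' lines.

    Each stack is a list of function names, root first and leaf last.
    """
    counts = Counter(";".join(stack) for stack in stacks)
    # Sort so the output order is stable and easy to compare.
    return [f"{folded} {n}" for folded, n in sorted(counts.items())]


# Hypothetical samples: two identical stacks and one different stack.
samples = [
    ["main", "parse", "read_line"],
    ["main", "parse", "read_line"],
    ["main", "render"],
]
folded_lines = fold_stacks(samples)
for line in folded_lines:
    print(line)
```

Identical stacks end up on a single line with their count, which is what lets the flame graph draw one wide frame instead of many narrow ones.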

2 Generate flame graph with perf


2.1 Perf collects data


Let's start with the perf command (short for performance). It is the performance analysis tool provided with Linux, and it returns the name of the function the CPU is executing along with the call stack.

    sudo perf record -F 99 -p 3887 -g -- sleep 30


perf record collects system events. If no event is specified with -e, it collects cycles (CPU clock cycles) by default. -F 99 means 99 samples per second, -p 3887 is the process ID, that is, which process to analyze, -g records the call stack, and sleep 30 keeps the collection running for 30 seconds.

-F sets the sampling frequency to 99 Hz (99 times per second). If all 99 samples return the same function name, the CPU spent that whole second executing the same function, which may indicate a performance problem.

When the command finishes, the samples are saved in perf.data, and rendered as text they are huge. If a server has 16 CPUs and samples 99 times per second for 30 seconds, you get 47,520 call stacks, which can run to hundreds of thousands or even millions of lines.
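
The figure above is just a product of three numbers. A quick check in Python, treating it as an upper bound that assumes every CPU yields a sample on every tick (the function name max_samples is hypothetical):

```python
def max_samples(cpus, frequency_hz, seconds):
    """Upper bound on the number of call stacks collected."""
    return cpus * frequency_hz * seconds


# 16 CPUs sampled at 99 Hz for 30 seconds.
print(max_samples(16, 99, 30))
```

A single process that keeps only a few CPUs busy will yield far fewer samples than this bound.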

For ease of reading, the perf report command tallies the percentage of samples in which each call stack appears and sorts them from high to low.

    sudo perf report -n --stdio

2.2 Generate flame graph


First, use perf script to parse perf.data and save the result for generating the flame graph:

    # Dump the recorded call stacks as text
    perf script -i perf.data > perf.unfold

Next, use stackcollapse-perf.pl to fold the call stacks in perf.unfold:

    # Fold the call stacks
    ./stackcollapse-perf.pl perf.unfold > perf.folded

Finally, generate the SVG:

    # Generate the flame graph
    ./flamegraph.pl perf.folded > perf.svg

The whole process can be collapsed into a single pipeline:

    perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > process.svg

3 Analyze the flame graph


Finally, you can use the browser to open the flame graph for analysis.

3.1 The meaning of the flame graph


The flame graph is an SVG image generated from stack information, used to show the CPU's call stacks.

The y axis represents the call stack, and each layer is a function. The deeper the call stack, the taller the flame. The top is the function being executed, and below it are its parent functions.

The x axis represents the number of samples. The wider a function is along the x axis, the more often it was sampled, that is, the longer it ran. Note that the x axis does not represent the passage of time: all call stacks are merged and arranged in alphabetical order.

To read a flame graph, look at which top-layer functions are widest. A "flat top" (plateau) means that the function may have a performance problem.
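
One way to look for plateaus without opening the SVG is to total the samples in which each function was the top (leaf) frame. The Python sketch below does this for folded stack lines. It assumes the "frame;frame;frame count" folded format; the input lines and the name top_frame_samples are hypothetical.

```python
from collections import Counter


def top_frame_samples(folded_lines):
    """Count the samples in which each function was the leaf frame."""
    self_counts = Counter()
    for line in folded_lines:
        # The count is the last space-separated field on the line.
        stack, _, count = line.rpartition(" ")
        leaf = stack.split(";")[-1]
        self_counts[leaf] += int(count)
    return self_counts


# Hypothetical folded profile.
folded = ["main;parse;read_line 70", "main;render 20", "main;parse 10"]
counts = top_frame_samples(folded)
total = sum(counts.values())
for name, n in counts.most_common():
    print(f"{name}: {n} samples ({100 * n / total:.0f}%)")
```

The function with the largest share here corresponds to the widest top-layer frame in the flame graph.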

The colors have no special meaning. Because the flame graph shows how busy the CPU is, warm colors are generally chosen.

3.2 Interactivity


The flame graph is an SVG image, so it can interact with the user.

  • Mouse hover

Each layer of the flame is labeled with a function name. Hovering the mouse over a layer shows the full function name, the number of samples, and the percentage of total samples.

  • Click to zoom

Clicking a layer zooms the flame graph horizontally so that the layer takes up the full width and shows its details.

"Reset Zoom" also appears in the upper left corner. Clicking it restores the original view.

  • Search

Pressing Ctrl + F brings up a search box. Enter a keyword or regular expression, and all matching function names are highlighted.

3.3 Limitations


In two cases the flame graph cannot be drawn correctly, and the system's behavior has to be fixed first.

  • Incomplete call stack

When the call stack is too deep, some systems return only part of it (for example, the first 10 layers).

  • Missing function names

Some functions have no name (anonymous functions, for example), so only a memory address is available to represent them.

3.4 The flame graph of the browser


Chrome can generate a flame graph of a page's scripts for CPU analysis.

Open the developer tools and switch to the Performance panel. Then click the "Record" button to start recording. Perform whatever operations you want on the page, then stop recording.

The developer tools then display a timeline, with the flame graph below it.

The browser flame graph differs from the standard flame graph in two ways: it is inverted (the function at the top of the call stack is at the bottom), and the x axis is a time axis, not the number of samples.

4 Red/blue differential flame graphs


Refer to www.brendangregg.com/blog/2014-1...

Thanks to CPU flame graphs, CPU utilization problems are generally easy to locate. Performance regressions are harder: you have to keep switching between flame graphs taken before and after a change, or across different periods and scenarios, to find the problem. It feels like searching for Pluto in the solar system. The method works, but I think there should be a better way.

Hence the red/blue differential flame graph, introduced below.

4.1 Example of a red/blue differential flame graph


The example is an interactive SVG image. Two colors indicate state: red for growth and blue for decline.

The shape and size of each flame in this image are the same as in the CPU flame graph for the second captured profile. (The y axis represents stack depth, the x axis represents the total number of samples, the width of a frame represents the share of the profile in which the function appears, the top layer is the function that is running, and below it is the stack that called it.)

The example shows a workload whose CPU utilization increased after a system upgrade, together with the corresponding CPU flame graph (SVG format).

In a standard flame graph, the colors of frames and towers are usually chosen at random. In a red/blue differential flame graph, the colors show the difference between the two profiles.

In the second profile, deflate_slow() and the functions it calls ran more often than before, so these frames are marked in red. The cause is that ZFS compression was enabled, whereas before the system upgrade it was turned off.

This example is too simple: I could analyze it even without the differential flame graph. But imagine analyzing a small regression, say less than 5%, in more complex code. That problem would not be so easy to handle.

4.2 Introduction to red/blue differential flame graphs


I have been discussing this for several years, and I finally wrote an implementation that I personally think is valuable. It works like this:

  1. Capture the stacks before the change, as the profile1 file.

  2. Capture the stacks after the change, as the profile2 file.

  3. Generate the flame graph from profile2. (So the frame widths are based on profile2.)

  4. Recolor the flame graph using the "2 - 1" difference. If a frame appears more often in profile2 it is colored red, otherwise blue. The strength of the color follows the size of the difference between before and after.
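
The coloring rule in step 4 can be sketched in a few lines of Python. This is an illustration of the idea only, not a claim about the exact formula flamegraph.pl uses: the function name delta_color, the white baseline, and the 210 fade constant are all assumptions.

```python
def delta_color(count_before, count_after, max_delta):
    """Return an (r, g, b) tuple: red for growth, blue for decline.

    max_delta is the largest absolute difference across all frames.
    It scales the strength of the color; no difference gives white.
    """
    delta = count_after - count_before
    if delta == 0 or max_delta == 0:
        return (255, 255, 255)
    # Integer arithmetic keeps the result exact: big deltas fade less.
    capped = min(abs(delta), max_delta)
    fade = 210 * (max_delta - capped) // max_delta
    if delta > 0:
        return (255, fade, fade)  # appears more often in profile2: red
    return (fade, fade, 255)      # appears less often in profile2: blue


print(delta_color(31, 33, 10))  # slight growth: pale red
print(delta_color(10, 0, 10))   # frame vanished: full blue
```

A frame whose count barely changed stays close to white, so the eye is drawn to the frames that changed the most.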

The purpose is to compare the profiles taken before and after the change, which is very useful in functional verification tests or when evaluating the performance impact of a code change. The new flame graph is generated from the profile taken after the change (so the frame widths still show the current CPU consumption). The colors then show why system performance differs.

Only functions with a direct impact are colored (for example, functions that are running on the CPU). The functions they call are not colored again on their account.

4.3 Generate red/blue differential flame graph


The author's GitHub repository, FlameGraph, includes a script, difffolded.pl, for generating red/blue differential flame graphs. To show how the tool works, the steps below use Linux perf_events. Any other profiler/tracer would also work.

  • Capture the profile1 file before the change:

    # Capture data
    perf record -F 99 -a -g -- sleep 30
    # Turn the data into stack information
    perf script > out.stacks1
    # Fold the stacks (run inside the FlameGraph directory, with out.stacks1 one level up)
    ./stackcollapse-perf.pl ../out.stacks1 > out.folded1

  • After some time (or after the program code is changed), capture the profile2 file:

    # Capture data
    perf record -F 99 -a -g -- sleep 30
    # Turn the data into stack information
    perf script > out.stacks2
    # Fold the stacks (run inside the FlameGraph directory, with out.stacks2 one level up)
    ./stackcollapse-perf.pl ../out.stacks2 > out.folded2

Generate the red/blue differential flame graph:

    ./difffolded.pl out.folded1 out.folded2 | ./flamegraph.pl > diff2.svg

difffolded.pl operates only on "folded" profile files. The folding is done by the stackcollapse scripts described earlier. The script outputs three columns: one is the folded call stack, and the other two are the counts from the profiles taken before and after the change.

    func_a;func_b;func_c 31 33
    [...]

In the example above, func_a()->func_b()->func_c() is the call stack. It appeared 31 times in profile1 and 33 times in profile2. When flamegraph.pl processes this three-column data, it automatically generates a red/blue differential flame graph.
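
To make the three-column format concrete, here is a minimal Python sketch that merges two folded profiles into "stack count1 count2" lines. It mirrors the format described above but is not the code of difffolded.pl. The function names and sample data are hypothetical, and reporting a stack missing from one profile as 0 is an assumption of this sketch.

```python
from collections import defaultdict


def parse_folded(lines):
    """Parse 'a;b;c count' lines into a {stack: count} mapping."""
    counts = defaultdict(int)
    for line in lines:
        stack, _, count = line.rpartition(" ")
        counts[stack] += int(count)
    return counts


def diff_folded(before_lines, after_lines):
    """Return 'stack count_before count_after' lines for every stack."""
    before = parse_folded(before_lines)
    after = parse_folded(after_lines)
    stacks = sorted(set(before) | set(after))
    return [f"{s} {before.get(s, 0)} {after.get(s, 0)}" for s in stacks]


# Hypothetical folded profiles taken before and after a change.
profile1 = ["func_a;func_b;func_c 31", "func_a;func_d 5"]
profile2 = ["func_a;func_b;func_c 33", "func_a;func_e 4"]
diff_lines = diff_folded(profile1, profile2)
for line in diff_lines:
    print(line)
```

Stacks that exist in only one profile show up with a zero in the other column, which is how appearing and vanishing code paths become visible in the data.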

Here are some useful options:

Option                    Description
difffolded.pl -n          Normalizes the data in the two profiles so that they can be compared. Without it, the totals of the captured stacks will certainly differ, because the capture time and CPU load differ, and the result looks either mostly red (increased load) or mostly blue (decreased load). The -n option balances the first profile, so you get the full red/blue spectrum.
difffolded.pl -x          Strips hexadecimal addresses. Profilers often fail to convert an address to a symbol, which leaves a hexadecimal address in the stack. If that address differs between the two profiles, the two stacks are treated as different even though they are the same. If you run into this problem, the -x option fixes it.
flamegraph.pl --negate    Inverts the red/blue color scheme. This is used in the next section.
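
The effect of normalization can be sketched in Python as follows. The sketch assumes simple proportional scaling of the first profile with integer truncation; the exact arithmetic inside difffolded.pl may differ. The name normalize_first and the counts are hypothetical.

```python
def normalize_first(before, after):
    """Scale the counts in `before` so its total matches that of `after`."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    if total_before == 0 or total_before == total_after:
        return dict(before)
    return {
        stack: count * total_after // total_before
        for stack, count in before.items()
    }


# Hypothetical profiles: the first capture has twice as many samples.
before = {"a;b": 50, "a;c": 150}
after = {"a;b": 60, "a;c": 40}
scaled = normalize_first(before, after)
for stack in sorted(after):
    print(stack, after[stack] - scaled[stack])
```

Without scaling, the differences would be +10 and -110, so the graph would look almost entirely blue. After scaling they become +35 and -35, which shows how the share of each stack actually changed.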

4.4 Shortcomings


Although the red/blue differential flame graph is useful, it has a problem: if a code path disappears entirely, there is nowhere in the flame graph to color blue. You can see only the current CPU usage, not why it became this way.

One solution is to reverse the order of the comparison and draw the opposite differential flame graph. For example:

This flame graph is based on the profile taken before the change, and the colors show what is going to happen. The part highlighted in blue on the right shows that less CPU time will be spent in CPU idle after the change. (In practice, cpuidle is usually filtered out, using grep -v cpuidle on the command line.)

The code that disappeared is also highlighted in the figure (or rather, not highlighted): compression was not enabled before the change, so it does not appear in the profile taken before the change, and nothing is marked in red.

The corresponding command line is:

    ./difffolded.pl out.folded2 out.folded1 | ./flamegraph.pl --negate > diff1.svg

Used together with the diff2.svg generated earlier, this gives:

Flame graph    Description
diff1.svg      The widths are based on the profile taken before the change, and the colors show what will happen
diff2.svg      The widths are based on the profile taken after the change, and the colors show what has happened

When doing functional verification testing, I would generate both images.

4.5 CPI flame graph


These scripts were first used in the analysis of CPI flame graphs. Instead of comparing the profiles taken before and after a change, a CPI flame graph shows the difference between the CPU's working cycles and its stall cycles, which highlights what the CPU is actually doing.

4.6 Other differential flame graphs


Others have done similar work. Robert Mustacchi made some attempts not long ago. His method resembles the coloring used in code review: only the differences are shown, with red for new (increased) code paths and blue for removed (decreased) code paths. A key difference is that the frame widths reflect only the number of differing samples. It is a very good idea, but it feels a little strange in practice: without the complete profile as background context, the picture is hard to understand.

Cor-Paul Bezemer also created a differential display method, flamegraphdiff. He puts three flame graphs in one figure: a standard flame graph each for before and after the change, plus a differential flame graph below them, whose frame widths are also the number of differing samples. Moving the mouse over a frame in the differential graph highlights the same frame in all three graphs. Because this method includes two standard flame graphs, the context problem is solved.

The three approaches each have their strengths, and they can be combined: the top two graphs in Cor-Paul's method could be my diff1.svg and diff2.svg, and the graph at the bottom could use Robert's method. For consistency, the bottom graph could use my coloring: blue->white->red.

Flame graphs are spreading widely, and many companies now use them. I would not be surprised if there were other implementations of differential flame graphs that I do not know about. (Please tell me in the comments.)

4.7 Summary


If you have a performance regression, the red/blue differential flame graph is the fastest way to find the root cause. The approach captures two ordinary flame graphs, compares them, and color-codes the differences: red means an increase and blue means a decrease. The differential flame graph is based on the current ("after the change") profile, so its shape and size are unchanged. You can therefore spot the differences by color and see why they exist.

Differential flame graphs can be built into a project's daily builds, so that performance regressions are found and fixed in time.

via: www.brendangregg.com/blog/2014-1...

5 Reference


Use linux perf tool to generate java program flame diagram

Use perf to generate Flame Graph (flame graph)

Brendan Gregg's website