perf introduction

简介

与Ftrace相反，Perf的监控范围主要限于单个进程。
因此Perf更适合分析特定程序的行为，比如代码的缓存友好性，或者每个函数中花费的时间。

Perf工具本身是构建在内核的perf_events子系统之上，该子系统实际上实现了跟踪、分析和采样功能。

Perf还可以利用Ftrace的架构来连接到跟踪点，类似于Ftrace和LTTng进行跟踪。

Perf_events内部使用一个环形缓冲区，可以映射到用户空间进行使用。
通过perf_event_open()系统调用，用户空间进程可以获得要测量的指标/计数器的文件描述符。
然后，文件描述符可以通过mmap()映射，并从用户空间进程的内存空间访问。

功能

perf功能
1. 跟踪系统调用(比strace快)
2. 采样不同语言的运行程序
3. 跟踪、计数几乎任何内核事件

perf list

事件类型
• Hardware Events: cpu性能监测计数器counter
• Software Events: 基于内核计数器的底层事件。例如 CPU migration，minor faults，major faults
• Kernel Tracepoint Events: 内核静态定义的插入点，开发人员编码到感兴趣的代码逻辑路径上
• User Statically-Defined Tracing: 用户态程序定义的插入点
• Dynamic Tracing: 动态跟踪。可以根据需要在任何允许的位置插入探针。内核态利用了kprobes机制，用户态利用uprobes
• Timed Profiling：以指定的频率扫描收集信息(perf record -F hz)。主要用于cpu usage profiling。通过创建自定义的时间中断事件。

perf list命令可以罗列出所有perf支持跟踪记录的事件

软件事件都有默认的收集时间间隔。采样的时候只是收集到其中部分事件，而不是每一次都能记录下来。

$ perf list | grep -i "software event"                                                                                  
  alignment-faults                                   [Software event]                                                               
  context-switches OR cs                             [Software event]                                                               
  cpu-clock                                          [Software event]                                                               
  cpu-migrations OR migrations                       [Software event]                                                               
  dummy                                              [Software event]                                                               
  emulation-faults                                   [Software event]                                                               
  major-faults                                       [Software event]                                                               
  minor-faults                                       [Software event]                                                               
  page-faults OR faults                              [Software event]                                                               
  task-clock                                         [Software event]

硬件事件（PMCs)
Perfomance montoring couters 性能监测计数器，统计底层处理器的活动，比如cpu时钟周期
PMCs典型的处理器实现方式是：在同一时间，只能纪录几个或者一少部分事件，尽管处理器可以跟踪上千个不同类型的事件。
这是因为cpu上这些寄存器数量是固定有限的，只能编程为统计计数指定的几个事件。

$ perf list | grep -i "hardware event"                                                                                  
  branch-instructions OR branches                    [Hardware event]                                                               
  branch-misses                                      [Hardware event]                                                               
  cache-misses                                       [Hardware event]                                                               
  cache-references                                   [Hardware event]                                                               
  cpu-cycles OR cycles                               [Hardware event]                                                               
  instructions                                       [Hardware event]                                                               
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]                                                               
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]        

perf top/record

当一个进程cpu使用率很高时，可以使用perf top -p pid

top的统计粒度是进程
perf top 的统计粒度是进程运行时调用到的函数，精确到每个函数cpu使用率，包括用户态和内核态的函数

perf top 是及时输出出来，perf record却是和perf top一样收集信息，但是是把信息保存到perf.data文件，供稍后分析

perf record是以采样的方式检查当前正在执行哪个函数，比如说频率可以是100 times/second
perf record
perf record COMMAND start COMMAND and profile it until it exits
perf record PID profile PID until you press CTRL+C
perf record -a profile all process until you press CTRL+C

perf record -p 8325 sleep 5 profile 8325 for 5 seconds

perf record也可以记录多种不同的事件events，这个时候不再是采样的方式，而是试图记录每一次事件。

system calls
sending network packets
reading from a block device(disk)
context switches/page faults
make any kernel function into an event(kprobes)

-g 指示收集调用栈这样可以追踪到产生该事件的具体的代码位置
perf record -e syscalls:sys_enter_connect -ag

分析perf record data
perf report 交互式显示哪个函数调用次数最多
perf annotate 可以打印出进程执行时的热点汇编指令
perf script 以文本方式打印出所有采样数据，类似于火焰图

有时候perf记录的用户进程栈里夹杂着内核函数，这是正常的。
这意味着要么进程执行了系统调用或者产生了page fault，从而触发了相应的内核函数。

perf stat

要写一个高性能的程序，有许多需要关注的地方

cpu/hardware level events
page faults
TLB misses
cpu cycles
L1 cache hits/misses
instructions per cycle
branch prediction misses

perf stat 会带来很多开销，统计计数每一次进程触发的系统调用事件可能会使系统慢6倍以上

perf stat -e context-switches ls -R /              // count context-switches between the kernel and userspace  
perf stat -e 'syscalls:sys_enter_*' ls -R /        // count system calls  
perf stat -ddd ls      // -ddd d is for detailed

perf trace

strace好用，但是会使程序运行慢很多（10x slower)
perf trace跟踪系统调用开销很小，可以用在客户环境，但是perf trace也有两个缺点

有时候会丢失syscalls事件(也算优点，限制了开销)
不会显示读写的字符串

perf 工作原理

perf 分成两部分

运行在用户态的 perf 程序
响应 perf 命令的内核部分

内核负责进程调度，所以内核知道正在运行哪个进程，以及正在执行哪条指令（PC程序计数器）

当运行 perf record/perf stat/perf top命令时，实现步骤如下：

perf 要求内核收集相应信息
内核根据perf要求采样获取samples/traces/cpu counters
perf显示收集到的数据

有时候perf不能把指令地址转换为函数名称字符串，所以显示的就是地址数值

perf推算出运行的程序的步骤如下

获取到程序指令指针地址
获取到程序调用栈
展开栈找到函数调用关系
使用符号表把各个栈地址翻译成符号名称字符串
如果二进制程序符号表被strip掉了，那perf是没办法给出函数名称字符串的，只能显示成地址数值

具体到代码实现

perf调用perf_event_open系统调用
内核把事件信息写入用户态ringbuffer
perf从ringbuffer读出信息并显示出来（或者存到perf.data)

ringbuffer就是内核分配的固定大小的环形缓冲区，每次写入一个记录record就会占用一部分内存。
因为缓冲区大小固定，所以写入的记录数量是受限的。当缓冲区被写满的时候，就会发生丢失，此时perf会有告警打印。

perf / bcc等工具都会使用到perf_event_open系统调用。
bcc(BCC_PERF_OUTPUT)利用该系统调用写出自己定制的profiling/trace events，按照自己的意愿收集并显示跟踪记录数据。

内核版本
perf和内核版本号深度绑定

安装perf命令时需要匹配对应的内核版本
perf功能特性可能随着内核版本号而改变

注意事项
perf_events 和其它debug工具一样，也需要符号表，用于把内存地址映射成对应的变量名称字符串，方便开发人员。
没有符号表就只能看到十六进制数字地址.也自己编译软件，生成符号表。

使用perf命令需要时刻注意给系统带来的开销。这取决于指定跟踪的事件发生的频率，事件发生越频繁，系统开销越大，perf.data文件越大。

命令示例

-F 指定采用频率
-g 记录栈信息
-e 指定跟踪的事件
-a 跟踪整个系统
-p 指定进程pid

perf top -F 49  
perf top -ns comm,dso        
perf top -e raw_syscalls:sys_enter -ns comm -d 1       //统计系统调用次数，每一秒刷新一次  
stdbuf -oL perf top -e net:net_dev_xmit  -ns comm | strings     //统计进程发包次数  

perf stat COMMAND  
perf sat -ddd command     
perf stat -e cycles,instructions,cache-misses -a  
perf stat -e 'syscalls:sys_enter_*' -p pid  
perf stat -e 'block:*' -a sleep 10  

perf report  
perf report --stdio  
perf scirpt  
perf annotate [--stdio]  

perf trace            // trace syscalls system-wide  
perf trace -p pid     // trace syscalls for PID  

perf record -F 99 COMMAND  
perf record -p PID  
perf record -p PID sleep 10  
perf record -p PID -g -- sleep 10  
perf record -p PID --call-graph dwarf    // using DWARF to unwind stack  

perf record -e sched:sched_process_exec -a     // tracing new processes, until ctrl-c  
perf record -e context-switches -a             // tracing all context-switches, until ctrl-c  
perf record -e context-switches -ag -- sleep 10  
perf record -e page-faults -ag                 // tracing all page faults with stack traces,until ctrl-c  

// add a tracepoint for kernel function tcp_sendmsg()，and trace this newly created probe
perf probe 'tcp_sendmsg' && perf record -e -a probe:tcp_sendmsg     

// add a tracepoint for myfunc() return, and include the reval as a string
perf probe 'myfunc%return +0($retval):string'     

// trace and filter   (state is not TCP_ESTABLISHED(1)
perf probe 'tcp_sendmsg' && perf record -e -a probe:tcp_sendmsg --filter 'size > 0 && skc_state != 1' -a     

// add a tracepoint for do_sys_open() with the filename as a string  
perf probe 'do_sys_open filename:string'           

参考

https://wizardzines.com/zines/perf/
https://www.brendangregg.com/perf.html

perf原理与应用

简介

功能

perf list

perf top/record

perf stat

perf trace

perf 工作原理

命令示例

参考

CATALOG

FEATURED TAGS

FRIENDS