BCC, bpftrace && BPF Performance Tools

A quick look at BCC and bpftrace. The bpftrace approach to bypassing Go TLS has been added separately; see "eBPF Uprobe bypass Go tls: capturing plaintext HTTPS traffic".

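BCC & bpftrace internals

BCC consists of three parts: a C++ frontend API for composing kernel-side BPF programs; a C++ backend driver that compiles BPF programs with Clang/LLVM, loads them into the kernel, attaches them to events, and reads and writes BPF maps; and language frontends for writing BCC tools (Python, C++).

The bpftrace frontend uses lex and yacc for lexical and syntax analysis of the bpftrace language and uses Clang to parse structs. The backend compiles a bpftrace program to LLVM IR, which the LLVM libraries then compile to BPF code.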

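Event sources and how they are implemented

kprobe (int3 breakpoint): how it is used depends on the frontend tracer, which can be perf, SystemTap, or a BPF tracer such as BCC or bpftrace
kretprobe (return-address trampoline)

tracepoint (a nop that is patched into a jump to a trampoline when enabled): lower overhead and more stable than kprobes
USDT (a nop that is patched to int3 when enabled): many applications do not ship with USDT enabled; probes can be added with the Folly C++ library or with the headers and tools from the systemtap-sdt-dev package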

# BCC
attach_kprobe()
attach_kretprobe()
attach_uprobe()
attach_uretprobe()
TRACEPOINT_PROBE()
USDT().enable_probe()
# bpftrace
'kprobe:function { to do }'
'uprobe:/path/to/file:method { to do }'
'tracepoint:sched:sched_process_exec { to do }'
'usdt:/path/to/file:probe { to do }'
'software:event:count'
'hardware:event:count'

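Dynamic USDT

Precompile a shared library whose functions already contain the USDT probes
dlopen() the shared library when it is needed
Call the function

PMCs are programmable hardware counters on the CPU, widely used in performance analysis for things such as cache hit rate, instruction execution efficiency, and stall cycles
They work in two modes: counting, and overflow sampling (an interrupt is raised once the counter passes a configured value)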
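
The two PMC modes can be contrasted with a toy model. This is only a sketch in plain Python, not how perf_events or the hardware works internally; the function names `count_events` and `overflow_sample`, the event stream, and the sampling period are all made up for illustration.

```python
# Toy model of the two PMC modes. All names and numbers are hypothetical.

def count_events(events):
    """Counting mode: only keep a running total of events."""
    return len(events)


def overflow_sample(events, period):
    """Overflow sampling: record one sample every `period` events.

    `events` is a list of labels (think: the function that was running
    when the event fired). Real hardware raises an interrupt on overflow;
    here we just record the label that caused the overflow.
    """
    samples = []
    counter = 0
    for label in events:
        counter += 1
        if counter == period:      # the counter "overflows"
            samples.append(label)  # an interrupt handler would capture state here
            counter = 0            # re-arm the counter
    return samples


events = ["f"] * 6 + ["g"] * 3     # nine events: six in f, three in g
total = count_events(events)
samples = overflow_sample(events, period=3)
print(total, samples)
```

Counting gives a cheap total, while sampling trades precision for attribution: only one event in every `period` is looked at, but each sample says where it happened.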

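Because of interrupt latency and out-of-order execution, an overflow sample may not point at the exact instruction that triggered the event; Intel's solution is PEBS (precise event-based sampling)

BCC and bpftrace first used perf_events for its ring buffers, then for PMC support, and now instrument all events through perf_event_open()

perf has also grown an interface that uses BPF, which makes it a BPF tracer, and it is the only BPF frontend built into the Linux tree

bpftrace additionally has BEGIN and END, which fire when bpftrace starts and exits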

Software

cpu-clock or cpu
task-clock
page-faults or faults
context-switches or cs
cpu-migrations
minor-faults
major-faults
alignment-faults
emulation-faults
dummy
bpf-output

Hardware

cpu-cycles or cycles
instructions
cache-references
cache-misses
branch-instructions or branches
branch-misses
bus-cycles
frontend-stalls
backend-stalls
ref-cycles

bpftool

bpftool lives under tools/bpf in the kernel source tree and has many subcommands

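# The perf subcommand shows which BPF programs are attached via perf_event_open()
bpftool perf
# prog show lists all programs (not only those based on perf_event_open())
bpftool prog show
# xlated dumps the translated BPF instructions of a program
bpftool prog dump xlated id 234
# If BTF information is present, the dump includes source code
bpftool prog dump xlated id 263
# If BTF information is present, linum adds line numbers
bpftool prog dump xlated id 263 linum
# opcodes shows the BPF instruction opcodes
bpftool prog dump xlated id 263 opcodes
# visual emits control-flow information in dot format, which GraphViz can render
bpftool prog dump xlated id 263 visual
# jited shows the x86_64 assembly
bpftool prog dump jited id 263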

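Using bpftrace

Basics

bpftrace has many built-in variables and functions; see https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md for details

Basic variables (@global, $local, and @per[tid])

Built-in variables (username, pid, tid, comm (process name), nsecs (timestamp in nanoseconds), ...)

Associative arrays (@map[key] = value)

Frequency counting (count() or ++)

Scratch variables: $ prefix

BPF map variables: @ prefix (can pass data between actions)

Stack traces (kstack, ustack)

Return value (retval), traced function name (func), probe name (probe), cgroup ID (cgroup)

Arguments (args, arg0, arg1 ... argN)

Arguments (sarg0, sarg1 ... sargN) for programs that pass arguments on the stack (Go before 1.17)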

bpftrace options

bpftrace -e 'probe /filter/ { action }'

bpftrace -l lists all dynamic and static instrumentation points and accepts wildcards

Tracepoints are named subsystem:eventname, for example kmem:kmalloc

sudo bpftrace -l 'kprobe:*'
sudo bpftrace -l 'kretprobe:*'
sudo bpftrace -l 'tracepoint:syscalls:*'
sudo bpftrace -l "usdt:/lib/x86_64-linux-gnu/libc.so.6:*"

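Dynamic instrumentation is divided into kprobe, kretprobe, uprobe, and uretprobe
Listing tracepoints with a wildcard shows that every system call is split into sys_enter_* and sys_exit_*

bpftrace -e runs short bpftrace scripts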

➜  tools git:(master) sudo bpftrace -e 'tracepoint:syscalls:sys_enter_open* { @[probe] = count(); }'
Attaching 5 probes...
^C

@[tracepoint:syscalls:sys_enter_openat]: 33
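
The reason a single one-liner reports "Attaching 5 probes..." is that the wildcard expands to every matching probe name. The sketch below models that expansion with shell-style globbing from Python's standard library. Treating bpftrace's wildcard as a plain glob is a simplifying assumption, and the probe list is a small hand-written sample rather than something read from a live kernel.

```python
from fnmatch import fnmatchcase

# Hand-written sample of probe names (hypothetical subset, not queried
# from a running kernel).
probes = [
    "tracepoint:syscalls:sys_enter_open",
    "tracepoint:syscalls:sys_enter_openat",
    "tracepoint:syscalls:sys_exit_openat",
    "tracepoint:syscalls:sys_enter_read",
]

# The same pattern used in the one-liner above.
pattern = "tracepoint:syscalls:sys_enter_open*"

# Keep only the probes whose full name matches the glob pattern.
matched = [p for p in probes if fnmatchcase(p, pattern)]
print(matched)
```

Each matched name becomes its own probe, which is why `@[probe]` can then break the counts down per tracepoint.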

Tracing arguments

Tracepoints usually carry arguments, and bpftrace exposes them through args, for example

'tracepoint:net:netif_rx' args->len

These arguments can be inspected with -lv (-l lists probes, -v is verbose mode)

sudo bpftrace -lv tracepoint:syscalls:sys_enter_read
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_clone { printf("->clone() by %s PID %d\n", comm, pid); } tracepoint:syscalls:sys_exit_clone { printf("<-clone() return %d, %s PID %d\n", args->ret, comm, pid); }'

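The kprobe arguments arg0, arg1 ... argN are the arguments on function entry, all of type uint64; if one of them is a pointer to a C struct it can be cast to that struct type.

In other words, static instrumentation such as tracepoints uses args, while dynamic instrumentation such as kprobes uses arg0 ...

BPF concurrency control and stack walking

Multiple threads running in parallel may update map data at the same time and lose updates; the BCC and bpftrace frontends use per-CPU array maps so that parallel threads do not update shared data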

➜  linux sudo strace -febpf bpftrace -e 'k:vfs_read { @ = count(); }'
...
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_PERCPU_ARRAY,
...
➜  linux sudo strace -febpf bpftrace -e 'k:vfs_read { @++; }'
...
bpf(BPF_MAP_CREATE, {map_type=BPF_MAP_TYPE_HASH,
...
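
The idea behind a per-CPU map can be sketched in a few lines. This is a plain-Python model, not kernel code: `NUM_CPUS`, `percpu_count`, and `on_event` are names invented for the example, and the sequence of CPUs is made up.

```python
NUM_CPUS = 4  # assumed CPU count for this example

# One slot per CPU. Each CPU only ever touches its own slot, so two CPUs
# never perform a read-modify-write on the same memory location.
percpu_count = [0] * NUM_CPUS


def on_event(cpu):
    """Stand-in for the probe firing on the given CPU."""
    percpu_count[cpu] += 1


# Simulate probe hits scattered across CPUs.
for cpu in [0, 1, 1, 3, 0, 2, 1]:
    on_event(cpu)

# The reader walks every slot and sums them to obtain the final value.
total = sum(percpu_count)
print(percpu_count, total)
```

The cost is moved to the reader, which has to combine the slots, while the hot path in the probe stays free of contention.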

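Because of the verifier, BPF is Turing-complete only in a restricted sense: much like gas in Ethereum, there is a mechanism that rules out infinite loops

The stack size is limited to 512 bytes, i.e. 0x200

As in the xv6 labs, BPF can walk the stack through frame pointers (RBP) to recover call information (gcc omits frame pointers by default, although on x86-64 the performance gain from doing so is small)
Stack walking can also rely on debuginfo, LBR, ORC, and other methods
debuginfo: DWARF format, costly to process; LBR: Intel's hardware-based branch records; ORC: a newer debug information format

Debugging BCC and bpftrace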

BCC

b = BPF(text=prog,debug=flags)

DEBUG_LLVM_IR = 0x1 compiled LLVM IR
DEBUG_BPF = 0x2 loaded BPF bytecode and register state on branches
DEBUG_PREPROCESSOR = 0x4 pre-processor result
DEBUG_SOURCE = 0x8 ASM instructions embedded with source
DEBUG_BPF_REGISTER_STATE = 0x10 register state on all instructions in addition to DEBUG_BPF
DEBUG_BTF = 0x20 print the messages from the libbpf library.
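
Since each flag is a distinct bit, several can be combined with a bitwise OR before being passed as `debug=`. The sketch below only demonstrates the bit arithmetic: the constants are copied from the list above so that it runs without BCC installed, and the `describe` helper is hypothetical, not part of BCC.

```python
# Values copied from the flag list above.
DEBUG_LLVM_IR = 0x1
DEBUG_BPF = 0x2
DEBUG_PREPROCESSOR = 0x4
DEBUG_SOURCE = 0x8
DEBUG_BPF_REGISTER_STATE = 0x10
DEBUG_BTF = 0x20

NAMES = {
    DEBUG_LLVM_IR: "DEBUG_LLVM_IR",
    DEBUG_BPF: "DEBUG_BPF",
    DEBUG_PREPROCESSOR: "DEBUG_PREPROCESSOR",
    DEBUG_SOURCE: "DEBUG_SOURCE",
    DEBUG_BPF_REGISTER_STATE: "DEBUG_BPF_REGISTER_STATE",
    DEBUG_BTF: "DEBUG_BTF",
}


def describe(flags):
    """Hypothetical helper: list the names of the bits set in `flags`."""
    return [name for bit, name in sorted(NAMES.items()) if flags & bit]


# Ask for the LLVM IR and the source-annotated assembly at the same time.
flags = DEBUG_LLVM_IR | DEBUG_SOURCE
print(hex(flags), describe(flags))
```

The resulting value would then be used as in the line above, `BPF(text=prog, debug=flags)`.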

See also BCC's bpflist tool and the kernel's bpftool

bpftrace

-d prints the AST and LLVM IR

-dd is a more verbose debug mode

➜  bpftool sudo bpftrace -d -e 'k:vfs_read { @[pid] = count(); }'

AST after: parser
-------------------
Program
 kprobe:vfs_read
  =
   map: @ :: type[none, ctx: 0]
    builtin: pid :: type[none, ctx: 0]
   call: count :: type[none, ctx: 0]


AST after: Semantic
-------------------
Program
 kprobe:vfs_read
  =
   map: @ :: type[count, ctx: 0]
    builtin: pid :: type[unsigned int64, ctx: 0]
   call: count :: type[count, ctx: 0]


AST after: NodeCounter
-------------------
Program
 kprobe:vfs_read
  =
   map: @ :: type[count, ctx: 0]
    builtin: pid :: type[unsigned int64, ctx: 0]
   call: count :: type[count, ctx: 0]


AST after: ResourceAnalyser
-------------------
Program
 kprobe:vfs_read
  =
   map: @ :: type[count, ctx: 0]
    builtin: pid :: type[unsigned int64, ctx: 0]
   call: count :: type[count, ctx: 0]

; ModuleID = 'bpftrace'
source_filename = "bpftrace"
target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128"
target triple = "bpf-pc-linux"

; Function Attrs: nounwind
declare i64 @llvm.bpf.pseudo(i64 %0, i64 %1) #0

define i64 @"kprobe:vfs_read"(i8* nocapture readnone %0) local_unnamed_addr section "s_kprobe:vfs_read_1" {
entry:
  %"@_val" = alloca i64, align 8
  %"@_key" = alloca i64, align 8
  %get_pid_tgid = tail call i64 inttoptr (i64 14 to i64 ()*)()
  %1 = lshr i64 %get_pid_tgid, 32
  %2 = bitcast i64* %"@_key" to i8*
  call void @llvm.lifetime.start.p0i8(i64 -1, i8* nonnull %2)
  store i64 %1, i64* %"@_key", align 8
  %pseudo = tail call i64 @llvm.bpf.pseudo(i64 1, i64 0)
  %lookup_elem = call i8* inttoptr (i64 1 to i8* (i64, i64*)*)(i64 %pseudo, i64* nonnull %"@_key")
  %map_lookup_cond.not = icmp eq i8* %lookup_elem, null
  br i1 %map_lookup_cond.not, label %lookup_merge, label %lookup_success
  ...
➜  bpftool sudo bpftrace -d -e 'k:vfs_read { @[pid] = count(); }' | wc -l 
88
➜  bpftool sudo bpftrace -dd -e 'k:vfs_read { @[pid] = count(); }' | wc -l
163
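
In the IR above, the `pid` builtin turns into a call to BPF helper number 14 (`get_pid_tgid`) followed by `lshr ... 32`. The helper packs two IDs into one 64-bit value, and the shift keeps the upper half. A minimal sketch of that unpacking follows; `split_pid_tgid` is a name invented here, and the thread ID 923 is a made-up value.

```python
def split_pid_tgid(value):
    """Split the 64-bit value into (pid, tid) as bpftrace names them.

    The upper 32 bits hold the process ID (the kernel's tgid) and the
    lower 32 bits hold the thread ID (the kernel's pid).
    """
    pid = value >> 32          # same operation as the lshr by 32 in the IR
    tid = value & 0xFFFFFFFF   # lower half, which the pid builtin discards
    return pid, tid


# Pack a process ID and a (made-up) thread ID the same way, then unpack.
packed = (919 << 32) | 923
print(split_pid_tgid(packed))
```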

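-v enables verbose mode. According to the book it prints the program ID and the bytecode, which can be used together with bpftool prog

In practice it only printed the program ID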

➜  bpftool sudo bpftrace -v -e 'k:vfs_read { @[pid] = count(); }'        
INFO: node count: 7
Attaching 1 probe...

Program ID: 64

The verifier log: 
processed 30 insns (limit 1000000) max_states_per_insn 0 total_states 2 peak_states 2 mark_read 1

Attaching kprobe:vfs_read
^C

@[919]: 1
@[1300]: 1
...

Security

https://github.com/brendangregg/bpf-perf-tools-book/tree/master/originals/Ch11_Security

bashreadline.bt
capable.bt
elfsnoop.bt
eperm.bt
modsnoop.bt
setuids.bt
shellsnoop.bt
shellsnoop.py
tcpreset.bt
ttysnoop.bt

# Monitor bash input
sudo bpftrace -e 'uretprobe:/bin/bash:readline { printf("%-6d %s\n",pid,str(retval)) }'

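Containers

Namespaces provide isolation and cgroups provide resource control; all containers share the same kernel

The kernel has tracepoints for cgroup events, including cgroup:cgroup_setup_root and cgroup:cgroup_attach_task

The BPF_PROG_TYPE_CGROUP_SKB program type can also be attached to a cgroup's ingress and egress points to process network packets

BPF tracing requires root privileges, which in most container environments means BPF tracing tools can only run on the host, not inside a container

A container uses a combination of namespaces; the details can be read from the kernel's nsproxy struct in linux/nsproxy.h

Traditional tools: systemd-cgtop, kubectl top, docker stats, /sys/fs/cgroup, perf

BPF tools: runqlat, pidnss, blkthrot, overlayfs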

#include <linux/sched.h>
#include <linux/nsproxy.h>
#include <linux/utsname.h>
#include <linux/pid_namespace.h>

BEGIN
{
	printf("Tracing PID namespace switches. Ctrl-C to end\n");
}

kprobe:finish_task_switch
{
	$prev = (struct task_struct *)arg0;
	$curr = (struct task_struct *)curtask;
	$prev_pidns = $prev->nsproxy->pid_ns_for_children->ns.inum;
	$curr_pidns = $curr->nsproxy->pid_ns_for_children->ns.inum;
	if ($prev_pidns != $curr_pidns) {
		@[$prev_pidns, $prev->nsproxy->uts_ns->name.nodename] = count();
	}
}

END
{
	printf("\nVictim PID namespace switch counts [PIDNS, nodename]:\n");
}

Count how many times the cgroup blk controller throttled I/O based on hard limits

#include <linux/cgroup-defs.h>
#include <linux/blk-cgroup.h>

BEGIN
{
	printf("Tracing block I/O throttles by cgroup. Ctrl-C to end\n");
}

kprobe:blk_throtl_bio
{
	@blkg[tid] = arg1;
}

kretprobe:blk_throtl_bio
/@blkg[tid]/
{
	$blkg = (struct blkcg_gq *)@blkg[tid];
	if (retval) {
		@throttled[$blkg->blkcg->css.id] = count();
	} else {
		@notthrottled[$blkg->blkcg->css.id] = count();
	}
	delete(@blkg[tid]);
}

Tracing read and write latency of the overlay filesystem

#include <linux/nsproxy.h>
#include <linux/pid_namespace.h>

kprobe:ovl_read_iter
/((struct task_struct *)curtask)->nsproxy->pid_ns_for_children->ns.inum == $1/
{
	@read_start[tid] = nsecs;
}

kretprobe:ovl_read_iter
/((struct task_struct *)curtask)->nsproxy->pid_ns_for_children->ns.inum == $1/
{
	$duration_us = (nsecs - @read_start[tid]) / 1000;
	@read_latency_us = hist($duration_us);
	delete(@read_start[tid]);
}

kprobe:ovl_write_iter
/((struct task_struct *)curtask)->nsproxy->pid_ns_for_children->ns.inum == $1/
{
	@write_start[tid] = nsecs;
}

kretprobe:ovl_write_iter
/((struct task_struct *)curtask)->nsproxy->pid_ns_for_children->ns.inum == $1/
{
	$duration_us = (nsecs - @write_start[tid]) / 1000;
	@write_latency_us = hist($duration_us);
	delete(@write_start[tid]);
}

interval:ms:1000
{
	time("\n%H:%M:%S --------------------\n");
	print(@write_latency_us);
	print(@read_latency_us);
	clear(@write_latency_us);
	clear(@read_latency_us);
}

END
{
	clear(@write_start);
	clear(@read_start);
}
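
The script above stores latencies with hist(), which groups values into power-of-two buckets. The sketch below shows that kind of bucketing in general terms; it does not claim to reproduce bpftrace's exact bucket layout or output format. `log2_bucket` is a name made up for the example and the latency samples are invented.

```python
from collections import Counter


def log2_bucket(value):
    """Return (low, high) of the power-of-two bucket holding `value`.

    `high` is exclusive, so 4..7 all land in (4, 8).
    """
    if value < 1:
        return (0, 1)
    low = 1 << (value.bit_length() - 1)  # largest power of two <= value
    return (low, low * 2)


latencies_us = [3, 5, 6, 17, 120, 130, 900]  # made-up samples
histogram = Counter(log2_bucket(v) for v in latencies_us)

# Print one row per non-empty bucket, smallest bucket first.
for (low, high), n in sorted(histogram.items()):
    print(f"[{low}, {high}) {'@' * n}")
```

Power-of-two buckets keep the map small no matter how wide the latency range is, at the price of coarser resolution for large values.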

Other

man getaddrinfo
man setitimer