Ftrace公开课火热报名中:Ftrace公开课:站在设计者的角度来理解ftrace(限50人)。课程第一期报名已截止且已开课,第二期报名请咨询客服(小月微信:linuxer2016)。
作者简介
伟林,中年码农,从事过电信、手机、安全、芯片等行业,目前依旧从事Linux方向开发工作,个人爱好Linux相关知识分享,个人微博CSDN pwl999,欢迎大家关注!
Linux trace中,最基础的就是:function tracer和trace event。鉴于他们搭建的良好的框架(ringbuffer、tracefs),各种trace纷纷投奔而来。
tracer发展出了function、function_graph、irqsoff、preemptoff、wakeup等一系列tracer。
而event也发展出tracepoint、kprobe、uprobe等一系列的event。

不论是哪种trace,主要的工作都离不开上图的主要流程:
函数插桩。使用各种插桩方式把自己的trace函数插入到需要跟踪的probe point上;
input trace数据。在trace的probe函数被命中时,会存储数据到ringbuffer当中;这里还包括filter和trigger功能;
ouput trace数据。用户或者程序需要读出trace数据,根据需要输出ram数据或者是方面用户阅读的数据;对数据的解析,遵循谁存储谁提供解析规则的原则;
1、函数插桩
trace event的插桩使用的是tracepoint机制,tracepoint是一种静态的插桩方法。他需要静态的定义桩函数,并且在插桩位置显式的调用。
这种方法的好处是效率高、可靠,并且可以处于函数中的任何位置、方便的访问各种变量;坏处当然是不太灵活。
kernel在重要节点的固定位置,插入了几百个trace event用于跟踪。
1.1、trace point

关于tracepoint的工作的原理,我们举例说明:
1、我们在“kernel/sched/core.c”中通过
创建了一个tracepoint:__tracepoint_sched_switch (其中##name = sched_switch)。
同时创建了几个操作tracepoint的函数:
桩函数trace_##name();
注册回调函数register_trace_##name();
注销回调函数unregister_trace_##name();
2、tracepoint的定义如下:
struct tracepoint {const char *name; /* Tracepoint name */struct static_key key;void (*regfunc)(void);void (*unregfunc)(void);struct tracepoint_func __rcu *funcs;};
各个字段的含义如下:
key。tracepoint是否使能的开关。如果回调函数数组为空则key为disable,如果回调函数数组中有函数指针则key为enable;
*funcs。回调函数数组,tracepoint的作用就是在桩函数被命中时,逐个调用回调函数数组中的函数;
*regfunc/*unregfunc。注册/注销回调函数时的的钩子函数。注意这两个函数只是hook函数,不是用户用来注册/注销回调函数的,注册/注销回调函数的是register_trace_##name()/unregister_trace_##name();
3、我们在探测点插入桩函数:
kernel/sched/core.c:
static void __sched notrace __schedule(bool preempt){trace_sched_switch(preempt, prev, next);}
4、桩函数被命中时的执行流程:
可以看到就是逐条的执行回调函数数组中的函数指针。
trace_sched_switch()↓static inline void trace_##name(proto) \{ \if (static_key_false(&__tracepoint_##name.key)) \__DO_TRACE(&__tracepoint_##name, \TP_PROTO(data_proto), \TP_ARGS(data_args), \TP_CONDITION(cond),,); \if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \rcu_read_lock_sched_notrace(); \rcu_dereference_sched(__tracepoint_##name.funcs);\rcu_read_unlock_sched_notrace(); \} \}↓#define __DO_TRACE(tp, proto, args, cond, prercu, postrcu) \do { \struct tracepoint_func *it_func_ptr; \void *it_func; \void *__data; \\if (!(cond)) \return; \prercu; \rcu_read_lock_sched_notrace(); \(1) 逐个调用(tp)->funcs函数数组中的函数这些函数可以由trace_event注册,也可以由其他人注册*/it_func_ptr = rcu_dereference_sched((tp)->funcs); \if (it_func_ptr) { \do { \it_func = (it_func_ptr)->func; \__data = (it_func_ptr)->data; \((void(*)(proto))(it_func))(args); \} while ((++it_func_ptr)->func); \} \rcu_read_unlock_sched_notrace(); \postrcu; \} while (0)
5、注册回调函数:
可以通过
register_trace_##name()/unregister_trace_##name()函数向回调函数数组中添加/删除函数指针。
static inline int \register_trace_{ \return tracepoint_probe_register(&__tracepoint_(void *)probe, data); \} \static inline int \unregister_trace_{ \return tracepoint_probe_unregister(&__tracepoint_(void *)probe, data); \}
6、trace event对tracepoint的利用:
以上可以看到,tracepoint只是一种静态插桩方法。trace event可以使用,其他机制也可以使用,只是kernel的绝大部分tracepoint都是trace event在使用。
单纯的定义和使用一个trace point,可以参考:Documentation/trace/tracepoints.txt
trace event也必须向tracepoint注册自己的回调函数,这些回调函数的作用就是在函数被命中时往ringbuffer中写入trace信息。
1.2、增加一个新的trace event
在现有的代码中添加探测函数,这是让很多内核开发者非常不爽的一件事,因为这可能降低性能或者让代码看起来非常臃肿。为了解决这些问题,内核最终进化出了一个TRACE_EVENT()来实现trace event的定义,这是非常简洁、智能的一个宏定义。
在理解TRACE_EVENT()宏之前,我们先来了解一下怎么样使用TRACE_EVENT()新增加一个trace event。
新增加trace event,我们必须遵循规定的格式。格式可以参考:Using the TRACE_EVENT() macro (Part 1)和samples/trace_events。
以下以内核中已经存在的event sched_switch为例,说明定义过程。
1、首先需要在include/trace/events/文件夹下添加一个自己event的头文件,需要遵循注释的标准格式:
include/trace/events/sched.h:
// (1)TRACE_SYSTEM需要定义的 和头文件名一样,不然后续使用TRACE_SYSTEM多次include头文件的时候会出错#undef TRACE_SYSTEM#define TRACE_SYSTEM sched// (2)定义一个可以重复include的入口:TRACE_HEADER_MULTI_READ#if !defined(_TRACE_SCHED_H) || defined(TRACE_HEADER_MULTI_READ)#define _TRACE_SCHED_H// (3)包含进tracepoint.h文件#include <linux/tracepoint.h>// (4)TRACE_EVENT包含5个参数:(name, proto, args, struct, assign, print)// 前面两个参数:proto, args,是给定义tracepoint使用的。在linux/tracepoint.h中构造tracepoint桩函数、callback regitser/unregister函数,在trace/define_trace.h中定义tracepoint变量。// 后面三个参数:struct, assign, print,是给trace_event使用的。在trace/trace_events.h,构造trace_event的callback函数,注册到tracepoint。TRACE_EVENT(sched_switch,TP_PROTO(bool preempt,struct task_struct *prev,struct task_struct *next),TP_ARGS(preempt, prev, next),TP_STRUCT__entry(__array( char, prev_comm, TASK_COMM_LEN )__field( pid_t, prev_pid )__field( int, prev_prio )__field( long, prev_state )__array( char, next_comm, TASK_COMM_LEN )__field( pid_t, next_pid )__field( int, next_prio )),TP_fast_assign(memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);__entry->prev_pid = prev->pid;__entry->prev_prio = prev->prio;__entry->prev_state = __trace_sched_switch_state(preempt, prev);memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);__entry->next_pid = next->pid;__entry->next_prio = next->prio;),TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,__entry->prev_state & (TASK_STATE_MAX-1) ?__print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },{ 16, "Z" }, { 32, "X" }, { 64, "x" },{ 128, "K" }, { 256, "W" }, { 512, "P" },{ 1024, "N" }) : "R",__entry->prev_state & TASK_STATE_MAX ? "+" : "",__entry->next_comm, __entry->next_pid, __entry->next_prio));// (5) 为了节约text空间,如果参数都一样只是位置不一样,可以使用DECLARE_EVENT_CLASS() + DEFINE_EVENT()的形式来定义TRACE_EVENT()// 最新的kernel中,使用TRACE_EVENT()和使用DECLARE_EVENT_CLASS() + DEFINE_EVENT()是等价的DECLARE_EVENT_CLASS(sched_process_template,TP_PROTO(struct task_struct *p),TP_ARGS(p),TP_STRUCT__entry(__array( char, comm, TASK_COMM_LEN )__field( pid_t, pid )__field( int, prio )),TP_fast_assign(memcpy(__entry->comm, p->comm, TASK_COMM_LEN);__entry->pid = p->pid;__entry->prio = p->prio;),TP_printk("comm=%s pid=%d prio=%d",__entry->comm, __entry->pid, __entry->prio));/** Tracepoint for freeing a task:*/DEFINE_EVENT(sched_process_template, sched_process_free,TP_PROTO(struct task_struct *p),TP_ARGS(p));/** Tracepoint for a task exiting:*/DEFINE_EVENT(sched_process_template, sched_process_exit,TP_PROTO(struct task_struct *p),TP_ARGS(p));/** Tracepoint for waiting on task to unschedule:*/DEFINE_EVENT(sched_process_template, sched_wait_task,TP_PROTO(struct task_struct *p),TP_ARGS(p));#endif /* _TRACE_SCHED_H */// (6) 包含进define_trace.h文件/* This part must be outside protection */#include <trace/define_trace.h>
2、在探测点位置中调用桩函数,需要遵循注释的标准格式:
kernel/sched/core.c:
// (1) 定义CREATE_TRACE_POINTS,让define_trace.h能够进入,进行tracepoint和trace_event的创建static void __sched notrace __schedule(bool preempt){// (2)trace_point定义的桩函数,在需要插桩的位置进行调用trace_sched_switch(preempt, prev, next);}
1.3、TRACE_EVENT()宏的解析
从上一节的使用中看,TRACE_EVENT()的使用非常简洁,只是定义了一系列的数据,调用也只是一句话。TRACE_EVENT()帮用户隐藏了所有的复杂细节,函数的定义、数据的写入读出、数据的解析。具体实现了以下功能:
创建了一个tracepoint,可以放到kernel代码中;
创建了一个回调函数,可以被上述tracepoint调用
回调函数必须实现以最快的方式将传递给它的数据记录到trace ringbuffer中。
必须创建一个函数能解析从ringbuffer读出的数据,转换成便于用户理解的形式。
TRACE_EVENT()为什么这么强大,用一个宏定义出所有的数据和函数?它的主要魔法是使用了一个cpp多次定义的一个技巧:
enum dog_enums DOGS;char *dog_strings[] = DOGS;char *dog_to_string(enum dog_enums dog){return dog_strings[dog];}
上例中,我们可以看到利用C()这个宏,重复的进行”#undef #define”。第一次我们定义了dog_enums,第二次我们定义了dog_strings[]。内核中也有大量的代码使用了这个技巧,TRACE_EVENT()更是把这个技巧运用到了极致。
我们来分析一下TRACE_EVENT()的扩展过程:
1、在tracepoint.h中,宏TRACE_EVENT()第一次展开,定义了tracepoint需要的一些函数。包括:
tracepoint桩函数;桩函数的作用,就是在被调用时,按优先级调用所有的callback函数去执行;
tracepoint callback的register/unregister函数;
#define TRACE_EVENT(name, proto, args, struct, assign, print) \PARAMS(proto), PARAMS(args))↓#define DECLARE_TRACE(name, proto, args) \PARAMS(proto), PARAMS(args), \cpu_online(raw_smp_processor_id()), \PARAMS(void *__data, proto), \PARAMS(__data, args))↓#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \(1)tracepoint变量“__tracepoint_##name”本身,并不是本次定义的extern struct tracepoint __tracepoint_##name; \// (2)tracepoint桩函数“trace_##name”的定义static inline void trace_##name(proto) \{ \if (static_key_false(&__tracepoint_##name.key)) \__DO_TRACE(&__tracepoint_##name, \TP_PROTO(data_proto), \TP_ARGS(data_args), \TP_CONDITION(cond),,); \if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \rcu_read_lock_sched_notrace(); \rcu_dereference_sched(__tracepoint_##name.funcs);\rcu_read_unlock_sched_notrace(); \} \} \__DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args), \PARAMS(cond), PARAMS(data_proto), PARAMS(data_args)) \// (3)tracepoint的回调函数注册函数"register_trace_##name"出来trace_event,普通用户也可以调用这个函数来注册自己的回调函数static inline int \register_trace_##name(void (*probe)(data_proto), void *data) \{ \return tracepoint_probe_register(&__tracepoint_##name, \(void *)probe, data); \} \static inline int \register_trace_prio_##name(void (*probe)(data_proto), void *data,\int prio) \{ \return tracepoint_probe_register_prio(&__tracepoint_##name, \(void *)probe, data, prio); \} \// (4)tracepoint的回调函数反注册函数"unregister_trace_##name"static inline int \unregister_trace_##name(void (*probe)(data_proto), void *data) \{ \return tracepoint_probe_unregister(&__tracepoint_##name,\(void *)probe, data); \} \static inline void \check_trace_callback_type_##name(void (*cb)(data_proto)) \{ \} \static inline bool \trace_##name##_enabled(void) \{ \return static_key_false(&__tracepoint_##name.key); \}↓(2.1) 桩函数中,逐个调用回调函数进行执行#define __DO_TRACE(tp, proto, args, cond, prercu, postrcu) \do { \struct tracepoint_func *it_func_ptr; \void *it_func; \void *__data; \\if (!(cond)) \return; \prercu; \rcu_read_lock_sched_notrace(); \it_func_ptr = rcu_dereference_sched((tp)->funcs); \if (it_func_ptr) { \do { \it_func = (it_func_ptr)->func; \__data = (it_func_ptr)->data; \((void(*)(proto))(it_func))(args); \} while ((++it_func_ptr)->func); \} \rcu_read_unlock_sched_notrace(); \postrcu; \} while (0)
2、在define_trace.h中,宏TRACE_EVENT()第二次的展开:
这一次的目的是定义tracepoint变量“_tracepoint##name”本身。
// (1) 只有定义了CREATE_TRACE_POINTS,才准许进入// (2) 为了避免后面TRACE_INCLUDE(TRACE_INCLUDE_FILE)多次重复包含,undefine掉CREATE_TRACE_POINTS/* Prevent recursion */// (3)定义TRACE_INCLUDE()宏和TRACE_INCLUDE_FILE()宏// "#include TRACE_INCLUDE(TRACE_INCLUDE_FILE)"的意思就是对用户定义的头文件进行再次包含(例如"include/trace/events/sched.h")// 这样会让其中的TRACE_EVENT()宏再次展开// (4.1)重新定义TRACE_EVENT()DEFINE_TRACE(name)// (4.2)定义TRACE_HEADER_MULTI_READ,让用户定义的头文件能重复进入(例如"include/trace/events/sched.h")/* Let the trace headers be reread */// (4.3)重新包含用户定义的头文件(例如"include/trace/events/sched.h"),让其中的TRACE_EVENT()定义再次展开// (5) 包含trace_events.h// 在trace_events.h中,继续多次重新定义TRACE_EVENT(),重新包含用户定义的头文件(例如"include/trace/events/sched.h")// 目的是实现trace_event对应tracepoint callback函数的定义// (6) 恢复CREATE_TRACE_POINTS的定义/* We may be processing more files */
tracepoint.h:
↓DEFINE_TRACE_FN(name, NULL, NULL);↓static const char __tpstrtab___attribute__((section("__tracepoints_strings"))) =// (4.1.1)定义tracepoint变量“__tracepoint_##name”本身struct tracepoint __tracepoint___attribute__((section("__tracepoints"))) = \{ __tpstrtab_// (4.1.2)将tracepoint变量指针"__tracepoint_ptr_##name"存放到section("__tracepoints_ptrs")static struct tracepoint * const __tracepoint_ptr___attribute__((section("__tracepoints_ptrs"))) = \&__tracepoint_
在随后的trace_events.h中,宏TRACE_EVENT()又进行了多次的展开。
其中的关键是使用了一个技巧,重复多次定义多次展开。这里是这种技巧的一个最简单的demo:
enum dog_enums DOGS;char *dog_strings[] = DOGS;char *dog_to_string(enum dog_enums dog){return dog_strings[dog];}
3、trace_events.h中,宏TRACE_EVENT()又进行了多次的展开,定义了trace_event对应tracepoint callback需要的一系列函数。
宏TRACE_EVENT()第3次的展开:
定义了一个数据类型”struct trace_event_raw_##name”。用来表示trace_event数据在ringbuffer中的原始存储格式,动态长度成员使用32bit的” _data_loc##item” 来占位,其中低16bit表示偏移地址,高16bit表示数据长度。
// (1) 重新定义TRACE_EVENT()DECLARE_EVENT_CLASS(name, \PARAMS(proto), \PARAMS(args), \PARAMS(tstruct), \PARAMS(assign), \PARAMS(print)); \DEFINE_EVENT(name, name, PARAMS(proto), PARAMS(args));↓// 动态长度成员32bit的占位符“__data_loc_##item”// 低16bit是动态长度成员在结构体中的偏移量:__get_dynamic_array() = ((void *)__entry + (__entry->__data_loc_##field & 0xffff))// 高16bit是动态长度成员的长度:__get_dynamic_array_len() = ((__entry->__data_loc_##field >> 16) & 0xffff)// (2) 重新定义DECLARE_EVENT_CLASS()// (2.1)定义类型"struct trace_event_raw_##name"struct trace_event_raw_struct trace_entry ent; \tstruct \char __data[0]; \}; \\// (2.2) 定义“struct trace_event_class ”类型的变量"event_class_##name"// 这里只是定义了一个没有被初始化的变量,后面会被其他赋值变量覆盖// 这里也有一个C语言的技巧:全局变量如果没有初始化赋值,会被后面有初始化赋值的变量定义所覆盖,而不会出现“变量重定义”的错误static struct trace_event_class event_class_// (3) 重新定义DEFINE_EVENT()// (3.1) 定义“struct trace_event_call”类型的变量"event_##name"// 这里只是定义了一个没有被初始化的变量,后面会被其他赋值变量覆盖static struct trace_event_call __used \__attribute__((__aligned__(4))) event_// (4)重新包含用户定义的头文件(例如"include/trace/events/sched.h"),让其中的TRACE_EVENT()定义再次展开
4、宏TRACE_EVENT()第4次的展开:
定义了一个数据类型”struct trace_event_data_offsets_##call”。该结构体只包含动态长度成员,每个占位32bit,表示了动态成员之间的相对位置。
// (1)TRACE_EVENT() = DECLARE_EVENT_CLASS();DEFINE_EVENT();// 只定义动态长度成员,每个长度为32bit;不定义其他常规成员;// 最终得到的是动态成员间的offset// (2) 重新定义DECLARE_EVENT_CLASS()// (2.1) 定义类型“struct trace_event_data_offsets_##call”struct trace_event_data_offsets_##call { \tstruct; \};// (3) 重新定义DEFINE_EVENT()// (4)重新包含用户定义的头文件(例如"include/trace/events/sched.h"),让其中的TRACE_EVENT()定义再次展开
5、宏TRACE_EVENT()第5次的展开:
定义了”trace_raw_output_##call()”函数。作用是打印ringbuffer原始数据到output buffer当中;
定义了“struct trace_event_functions”类型的变量“trace_event_type_funcs_##call ”。把”trace_raw_output_##call()”函数赋值给其中的”.trace”成员;
// (1)TRACE_EVENT() = DECLARE_EVENT_CLASS();DEFINE_EVENT();// 使用"__entry"取代“field”// “__data_loc_##field”的低16bit为动态长度成员的地址偏移((void *)__entry + (__entry->__data_loc_// “__data_loc_##field”的高16bit为动态长度成员的长度((__entry->__data_loc_// str成员为dynamic_array的一种// 获取到bitmask的值,并且打印到output buffer// p为下面函数中定义的临时buffer({ \void *__bitmask = __get_dynamic_array(field); \unsigned int __bitmask_size; \__bitmask_size = __get_dynamic_array_len(field); \trace_print_bitmask_seq(p, __bitmask, __bitmask_size); \})// 根据flag_array[]解析出flag的含义,并且打印到output buffer// p为下面函数中定义的临时buffer({ \static const struct trace_print_flags __flags[] = \{ flag_array, { -1, NULL }}; \trace_print_flags_seq(p, delim, flag, __flags); \})({ \static const struct trace_print_flags symbols[] = \{ symbol_array, { -1, NULL }}; \trace_print_symbols_seq(p, value, symbols); \})({ \static const struct trace_print_flags_u64 symbols[] = \{ symbol_array, { -1, NULL } }; \trace_print_symbols_seq_u64(p, value, symbols); \})__print_symbolic(value, symbol_array)({ \BUILD_BUG_ON(el_size != 1 && el_size != 2 && \el_size != 4 && el_size != 8); \trace_print_array_seq(p, array, count, el_size); \})// (2) 重新定义DECLARE_EVENT_CLASS()// (2.1) 定义output format函数"trace_raw_output_##call()"// 把ringbuffer中的原始数据打印解析到output buffer中static notrace enum print_line_t \trace_raw_output_struct trace_event *trace_event) \{ \// (2.1.1) iter->seq:为output bufferstruct trace_seq *s = &iter->seq; \// (2.1.2) iter->tmp_seq:为临时buffer,供__get_bitmask()、__print_flags()、__print_symbolic()等一系列函数先把值转换成字符串struct trace_seq __maybe_unused *p = &iter->tmp_seq; \// (2.1.3) iter->ent:为ringbuffer原始数据的存放buffer,// 使用上面定义的结构"struct trace_event_raw_##call"指向这片区域,来进行引用struct trace_event_raw_int ret; \\field = (typeof(field))iter->ent; \\// (2.1.4)在output buffer的最前头,打印出trace_event的nameret = trace_raw_output_prep(iter, trace_event); \if (ret != TRACE_TYPE_HANDLED) \return ret; \\// (2.1.5)把ringbuffer中的原始数据根据格式打印到output buffer当中trace_seq_printf(s, print); \\return trace_handle_return(s); \} \// (2.2) 定义“struct trace_event_functions”类型的变量“trace_event_type_funcs_##call ”// 把"trace_raw_output_##call()"函数赋值给其中的".trace"成员static struct trace_event_functions trace_event_type_funcs_.trace = trace_raw_output_};// (3) 重新定义DEFINE_EVENT_PRINT()static notrace enum print_line_t \trace_raw_output_struct trace_event *event) \{ \struct trace_event_raw_struct trace_entry *entry; \struct trace_seq *p = &iter->tmp_seq; \\entry = iter->ent; \\if (entry->type != event_WARN_ON_ONCE(1); \return TRACE_TYPE_UNHANDLED; \} \\field = (typeof(field))entry; \\trace_seq_init(p); \return trace_output_call(iter,} \static struct trace_event_functions trace_event_type_funcs_.trace = trace_raw_output_};// (4)重新包含用户定义的头文件(例如"include/trace/events/sched.h"),让其中的TRACE_EVENT()定义再次展开
6、宏TRACE_EVENT()第6次的展开:
定义了函数”trace_event_define_fields_##call()”,把TP_STRUCT__entry()结构体中各成员,转换成ftrace_event_field链表,供后面trace_event filter功能来使用。在events对应的”format”文件中,可以看到这些field的定义。
// (1)TRACE_EVENT() = DECLARE_EVENT_CLASS();DEFINE_EVENT();// 把一个__field_ext成员转换成ftrace_event_field,并且挂载到trace_event_call->class->fields链表当中// ftrace_event_field包含以下成员:(const char *type, const char *name, int offset, int size, int is_signed, int filter_type)ret = trace_define_field(event_call,offsetof(typeof(field), item), \sizeof(field.item), \is_signed_type(type), filter_type); \if (ret) \return ret;ret = trace_define_field(event_call,offsetof(typeof(field), item), \sizeof(field.item), \0, filter_type); \if (ret) \return ret;// 把一个__field成员转换成ftrace_event_field,并且挂载到trace_event_call->class->fields链表当中// filter type = FILTER_OTHER// 把一个__array成员转换成ftrace_event_field,并且挂载到trace_event_call->class->fields链表当中// filter type = FILTER_OTHER// 如果type = char, filter type = FILTER_STATIC_STRINGdo { \char *type_str =BUILD_BUG_ON(len > MAX_FILTER_STR_VAL); \ret = trace_define_field(event_call, type_str,offsetof(typeof(field), item), \sizeof(field.item), \is_signed_type(type), FILTER_OTHER); \if (ret) \return ret; \} while (0);// 把一个__dynamic_array成员转换成ftrace_event_field,并且挂载到trace_event_call->class->fields链表当中// 其中的数据是动态长度成员的占位符"__data_loc_##item"// filter type = FILTER_OTHER// 如果type = char, filter type = FILTER_DYN_STRINGret = trace_define_field(event_call, "__data_loc "offsetof(typeof(field), __data_loc_sizeof(field.__data_loc_is_signed_type(type), FILTER_OTHER);// (2) 重新定义DECLARE_EVENT_CLASS()// (2.1)创建函数"trace_event_define_fields_##call()"// 把TP_STRUCT__entry()结构体中各成员,转换成ftrace_event_field链表// 供后面trace_event filter功能来使用.// 在events对应的"format"文件中,可以看到这些field的定义。static int notrace __init \trace_event_define_fields_{ \struct trace_event_raw_int ret; \\tstruct; \\return ret; \}// (3) 重新定义DEFINE_EVENT()// (4)重新包含用户定义的头文件(例如"include/trace/events/sched.h"),让其中的TRACE_EVENT()定义再次展开
7、宏TRACE_EVENT()第7次的展开:
定义了函数”trace_event_get_offsets_##call()”。计算了每个动态成员的偏移和长度,存放到“struct trace_event_data_offsets_##call”变量中
// (1)TRACE_EVENT() = DECLARE_EVENT_CLASS();DEFINE_EVENT();// 计算动态长度成员,在entry结构体中的偏移和长度// entry->__data[0]开始,是给动态长度成员存储实际数据的// 把数据的偏移和长度,存储到前面定义的结构体"struct trace_event_data_offsets_##call"中__item_length = (len) * sizeof(type); \__data_offsets->item = __data_size + \offsetof(typeof(*entry), __data); \__data_offsets->item |= __item_length << 16; \__data_size += __item_length;// 用strlen()来计算确定string的长度strlen((src) ? (const char *)(src) : "(null)") + 1)/** __bitmask_size_in_bytes_raw is the number of bytes needed to hold* num_possible_cpus().*/(((nr_bits) + 7) / 8)((__bitmask_size_in_bytes_raw(nr_bits) + \((BITS_PER_LONG / 8) - 1)) / (BITS_PER_LONG / 8))/** __bitmask_size_in_bytes is the number of bytes needed to hold* num_possible_cpus() padded out to the nearest long. This is what* is saved in the buffer, just to be consistent.*/(__bitmask_size_in_longs(nr_bits) * (BITS_PER_LONG / 8))__bitmask_size_in_longs(nr_bits))// (2) 重新定义DECLARE_EVENT_CLASS()// (2.1)创建了函数"trace_event_get_offsets_##call()"// 计算了每个动态成员的偏移和长度,存放到“struct trace_event_data_offsets_##call”变量中static inline notrace int trace_event_get_offsets_struct trace_event_data_offsets_{ \int __data_size = 0; \int __maybe_unused __item_length; \struct trace_event_raw_\tstruct; \\return __data_size; \}// (3) 重新定义DEFINE_EVENT()// (4)重新包含用户定义的头文件(例如"include/trace/events/sched.h"),让其中的TRACE_EVENT()定义再次展开
8、宏TRACE_EVENT()第8次的展开:
定义最核心的函数”trace_event_raw_event_##call()”,这个函数实现trace_event的3个重要功能:1、判断是否需要filter。2、判断是否trigger。3、记录数据到ringbuffer。
// (1)TRACE_EVENT() = DECLARE_EVENT_CLASS();DEFINE_EVENT();// 根据trace_event_get_offsets_##call()计算的动态长度成员的偏移和长度,// 给entry中的动态成员占位符"__entry->__data_loc_##item"赋值__entry->__data_loc_strcpy(__get_str(dst), (src) ? (const char *)(src) : "(null)");memcpy(__get_bitmask(dst), (src), __bitmask_size_in_bytes(nr_bits))// (2) 重新定义DECLARE_EVENT_CLASS()\// (2.1)创建函数"trace_event_raw_event_##call()",这个函数实现trace_event的3个重要功能:// 1、判断是否需要filter// 2、判断是否trigger// 3、记录数据到ringbufferstatic notrace void \trace_event_raw_event_{ \struct trace_event_file *trace_file = __data; \struct trace_event_data_offsets_struct trace_event_buffer fbuffer; \struct trace_event_raw_int __data_size; \\// (2.1.1)在本event被"soft disable"的情况下:// 如果绑定的trigger,没有设置filter,在"soft disable"的情况下,trigger任然可以被执行if (trace_trigger_soft_disabled(trace_file)) \return; \\// (2.1.2)计算每个动态长度成员的偏移和长度,并且返回总的长度// 动态成员的存储位置在最后面"enrey->__data[0]"开始的位置__data_size = trace_event_get_offsets_\// (2.1.3)根据"entery的固定长度+动态数据长度",向ringbuffer中申请空间entry = trace_event_buffer_reserve(&fbuffer, trace_file, \sizeof(*entry) + __data_size); \\if (!entry) \return; \\// (2.1.4)根据trace_event_get_offsets_##call()计算的动态长度成员的偏移和长度,// 给entry中的动态成员占位符"__entry->__data_loc_##item"赋值tstruct \\// (2.1.5)记录数据到ringbuffer{ assign; } \\// (2.1.6)确认ringbuffer提交// 1、判断是否需要filter// 2、判断是否triggertrace_event_buffer_commit(&fbuffer, \sizeof(*entry) + __data_size); \}/** The ftrace_test_probe is compiled out, it is only here as a build time check* to make sure that if the tracepoint handling changes, the ftrace probe will* fail to compile unless it too is updated.*/// (3) 重新定义DEFINE_EVENT()static inline void ftrace_test_probe_{ \check_trace_callback_type_}// (4)重新包含用户定义的头文件(例如"include/trace/events/sched.h"),让其中的TRACE_EVENT()定义再次展开
↓trace_event_buffer_commit()↓static inline voidevent_trigger_unlock_commit(struct trace_event_file *file,struct ring_buffer *buffer,struct ring_buffer_event *event,void *entry, unsigned long irq_flags, int pc,unsigned long len){enum event_trigger_type tt = ETT_NONE;// (2.1.6.1) 判断filter和trigger filter的条件是否成立if (!__event_trigger_test_discard(file, buffer, event, entry, &tt)) {if (len)stm_log(OST_ENTITY_FTRACE_EVENTS, entry, len);// (2.1.6.2) 正式提交commit到ringbuffertrace_buffer_unlock_commit(file->tr, buffer, event, irq_flags, pc);}// (2.1.6.3) 执行post类型的triggerif (tt)event_triggers_post_call(file, tt);}↓static inline bool__event_trigger_test_discard(struct trace_event_file *file,struct ring_buffer *buffer,struct ring_buffer_event *event,void *entry,enum event_trigger_type *tt){unsigned long eflags = file->flags;// (2.1.6.1.1)根据trigger的filter条件,执行triggerif (eflags & EVENT_FILE_FL_TRIGGER_COND)*tt = event_triggers_call(file, entry);// (2.1.6.1.2)如果当前event处在"soft disable"模式,丢弃掉ringbuffer的commit数据提交if (test_bit(EVENT_FILE_FL_SOFT_DISABLED_BIT, &file->flags))ring_buffer_discard_commit(buffer, event);// (2.1.6.1.3)检查event的filter条件是否成立else if (!filter_check_discard(file, entry, buffer, event))return false;return true;}
9、宏TRACE_EVENT()第9次的展开:
定义重要” struct trace_event_class”类型的变量“event_class_##call”
定义重要”struct trace_event_call”类型的变量“event_##call”
把”struct trace_event_call”类型变量的指针”_event##call”存放到”section(“_ftrace_events”)”区域。在trace_events初始化时会逐个遍历section(“_ftrace_events”)中的指针来添加静态的events。
// (1)TRACE_EVENT() = DECLARE_EVENT_CLASS();DEFINE_EVENT();// perf_event的probe函数"perf_trace_##call"的声明static notrace void \perf_trace_.perf_probe = perf_trace_// (2) 重新定义DECLARE_EVENT_CLASS()_TRACE_PERF_PROTO(call, PARAMS(proto)); \static char print_fmt_// (2.1)定义"struct trace_event_class"类型的变量"event_class_##call"static struct trace_event_class __used __refdata event_class_.system = TRACE_SYSTEM_STRING, \// 前面定义的函数,把成员转成ftrace_event_field,并且挂载到trace_event_call->class->fields链表当中.define_fields = trace_event_define_fields_.fields = LIST_HEAD_INIT(event_class_// trace_event_call初始化.raw_init = trace_event_raw_init, \// 前面定义的函数,trace_event的probe函数,记录数据到ringbuffer.probe = trace_event_raw_event_// 注册trace_event/perf_event的probe函数到tracepoint中.reg = trace_event_reg, \// perf_event的probe函数"perf_trace_##call"_TRACE_PERF_INIT(call) \};// (3) 重新定义DEFINE_EVENT()\// (3.1)定义" struct trace_event_call"类型的变量"event_##call"static struct trace_event_call __used event_// 上一步定义的“struct trace_event_class”类型的变量.class = &event_class_{ \// 前面定义tracepoint变量.tp = &__tracepoint_}, \// 前面定义的变量,其中的.trace = trace_raw_output_##call,把ringbuffer数据format output.event.funcs = &trace_event_type_funcs_.print_fmt = print_fmt_.flags = TRACE_EVENT_FL_TRACEPOINT, \}; \// (3.2) 把trace_event_call的指针存放到section("_ftrace_events")区域中// 在trace_events初始化时会逐个遍历section("_ftrace_events")中的指针来添加静态的eventsstatic struct trace_event_call __used \__attribute__((section("_ftrace_events"))) *__event_\static char print_fmt_\static struct trace_event_call __used event_.class = &event_class_{ \.tp = &__tracepoint_}, \.event.funcs = &trace_event_type_funcs_.print_fmt = print_fmt_.flags = TRACE_EVENT_FL_TRACEPOINT, \}; \static struct trace_event_call __used \__attribute__((section("_ftrace_events"))) *__event_// (4)重新包含用户定义的头文件(例如"include/trace/events/sched.h"),让其中的TRACE_EVENT()定义再次展开
10、如果定义了CONFIG_PERF_EVENTS,TRACE_EVENT()会在”include\trace\perf.h”中再一次展开。
define_trace.h:
↓
“include\trace\perf.h”,宏TRACE_EVENT()第10次的展开:
定义了函数perf_trace_##call(),是perf_event的probe函数,和trace_event的功能类似都是把数据记录到buffer当中,不过这里不是trace_event的ringbuffer,而是perf_event自己的buffer。
// (1)TRACE_EVENT() = DECLARE_EVENT_CLASS();DEFINE_EVENT();((void *)__entry + (__entry->__data_loc_((__entry->__data_loc_// (2) 重新定义DECLARE_EVENT_CLASS()// (2.1)定义了函数perf_trace_##call(),是perf_event的probe函数,// 和trace_event的功能类似都是把数据记录到buffer当中,不过这里不是trace_event的ringbuffer,而是perf_event自己的buffer。static notrace void \perf_trace_{ \struct trace_event_call *event_call = __data; \struct trace_event_data_offsets_struct trace_event_raw_struct pt_regs *__regs; \u64 __addr = 0, __count = 1; \struct task_struct *__task = NULL; \struct hlist_head *head; \int __entry_size; \int __data_size; \int rctx; \\// (2.1.1)计算动态数据的长度__data_size = trace_event_get_offsets_\head = this_cpu_ptr(event_call->perf_events); \if (__builtin_constant_p(!__task) && !__task && \hlist_empty(head)) \return; \\// (2.1.2)计算需要存储数据的总长度__entry_size = ALIGN(__data_size + sizeof(*entry) + sizeof(u32),\sizeof(u64)); \__entry_size -= sizeof(u32); \\// (2.1.3)分配perf_event的buffer空间entry = perf_trace_buf_prepare(__entry_size, \event_call->event.type, &__regs, &rctx); \if (!entry) \return; \\perf_fetch_caller_regs(__regs); \\// (2.1.4)根据trace_event_get_offsets_##call()计算的动态长度成员的偏移和长度,// 给entry中的动态成员占位符"__entry->__data_loc_##item"赋值tstruct \\// (2.1.5)存储数据到perf_event的buffer当中{ assign; } \\// (2.1.6)确认数据的提交perf_trace_buf_submit(entry, __entry_size, rctx, __addr, \__count, __regs, head, __task); \}/** This part is compiled out, it is only here as a build time check* to make sure that if the tracepoint handling changes, the* perf probe will fail to compile unless it too is updated.*/// (3) 重新定义DEFINE_EVENT()static inline void perf_test_probe_{ \check_trace_callback_type_}DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))// (4)重新包含用户定义的头文件(例如"include/trace/events/sched.h"),让其中的TRACE_EVENT()定义再次展开
1.4、trace初始化
参考上一节第9次展开的结论,TRACE_EVENT()多次展开定义了各种实现函数以后,最终把本event的所有的函数和数据存放在顶级数据结构trace_event_call中,而所有的event都会把自己的trace_event_call指针存放到section(“_ftrace_events”)中。
9、宏TRACE_EVENT()第9次的展开:
- 定义重要" struct trace_event_class"类型的变量“event_class_##call”- 定义重要"struct trace_event_call"类型的变量“event_##call”- 把"struct trace_event_call"类型变量的指针"__event_##call"存放到"section("_ftrace_events")"区域。在trace_events初始化时会逐个遍历section("_ftrace_events")中的指针来添加静态的events。events")"区域。在trace_events初始化时会逐个遍历section("_ftrace_events")中的指针来添加静态的events。
在trace初始化或者是Instances化时,系统会新建ringbuffer空间,并且建立ringbuffer和trace_event_call之间的关联。我们通过分析trace的初始化过程,来理清他们相互之间的联系。
最关键的是:创建trace_event_file结构,并将trace_array、trace_event_call结合在一起。
start_kernel() -> trace_init():
void __init trace_init(void){if (tracepoint_printk) {tracepoint_print_iter =kmalloc(sizeof(*tracepoint_print_iter), GFP_KERNEL);if (WARN_ON(!tracepoint_print_iter))tracepoint_printk = 0;}/* 1、初始化struct trace_array的global_trace:->trace_buffe,分配主要的trace ringbuffer空间->max_buffer,分配snapshot空间,大小和trace_buffer一样->ops,初始化为global_ops->current_trace,默认tracer为nop_trace最后将global_trace加入到ftrace_trace_arrays链表2、初始化trace_printk()用到的trace_buffer_struct:trace_percpu_buffer、trace_percpu_sirq_buffer、trace_percpu_irq_buffer、trace_percpu_nmi_buffer3、初始化temp_buffer,在主buffer disable的时候给trigger使用*/tracer_alloc_buffers();/* trace event 初始化*/trace_event_init();}|→void __init trace_event_init(void){/* 创建field_cachep、file_cachep对应的slub内存池 */event_trace_memsetup();/* syscall对应初始化 */init_ftrace_syscalls();/* 使能event */event_trace_enable();}||→static __init int event_trace_enable(void){/* (1) get top trace_arrary,这是应该获得的是global_trace */struct trace_array *tr = top_trace_array();struct trace_event_call **iter, *call;int ret;if (!tr)return -ENODEV;/* (2) 从section("_ftrace_events")逐个拿出,使用TRACE_EVENT()宏定义的trace_event_call指针,并进行初始化*/for_each_event(iter, __start_ftrace_events, __stop_ftrace_events) {call = *iter;/* (2.1) 初始化trace_event_call */ret = event_init(call);if (!ret)/* (2.2) 将trace_event_call加入ftrace_events链表 */list_add(&call->list, &ftrace_events);}/** We need the top trace array to have a working set of trace* points at early init, before the debug files and directories* are created. Create the file entries now, and attach them* to the actual file dentries later.*//* (3) 创建trace_event_file结构,并将trace_array、trace_event_call结合在一起:file->event_call = trace_event_call,event相关操作函数file->tr = trace_array,ringbufferlist_add(&file->list, &tr->events),将file加入到tr->events链表中这个时候虽然trace_event_file结构已经创建,但是因为debugfs并没有准备好,所以不会有对应的文件创建,只是给那些early trace使用*/__trace_early_add_events(tr);/* (4) 根据bootcmd enable一些early trace event */early_enable_events(tr, false);/* (5) 如果trace_printk已经使能,开始cmdline cache的记录 */trace_printk_start_comm();/* (6) 注册event的command */register_event_cmds();/* (7) 注册trigger的command */register_trigger_cmds();return 0;}|||→static int event_init(struct trace_event_call *call){int ret = 0;const char *name;name = trace_event_name(call);if (WARN_ON(!name))return -EINVAL;if (call->class->raw_init) {/* (2.1.1) 调用trace_event_class的raw_init函数 */ret = call->class->raw_init(call);if (ret < 0 && ret != -ENOSYS)pr_warn("Could not initialize trace events/%s\n", name);}return ret;}
根据“include/trace/trace_event.h”中event_class_##call的定义,raw_init()最后调用到trace_event_raw_init():
int trace_event_raw_init(struct trace_event_call *call){int id;/* (2.1.1.1) 注册trace_event_call结构中的trace_event成员 */id = register_trace_event(&call->event);if (!id)return -ENODEV;return 0;}|→int register_trace_event(struct trace_event *event){unsigned key;int ret = 0;down_write(&trace_event_sem);if (WARN_ON(!event))goto out;if (WARN_ON(!event->funcs))goto out;INIT_LIST_HEAD(&event->list);/* (2.1.1.1.1) 给event->type分配一个id,这个id非常重要:在存入数据到ringbuffer中的时候,ent->type记录这个数据是谁存储的在读出数据时,根据ent->type找到对应的trace_event和trace_event_call,由具体的trace_event来格式化输出数据到用户可读格式*/if (!event->type) {struct list_head *list = NULL;if (next_event_type > TRACE_EVENT_TYPE_MAX) {event->type = trace_search_list(&list);if (!event->type)goto out;} else {event->type = next_event_type++;list = &ftrace_event_list;}if (WARN_ON(ftrace_find_event(event->type)))goto out;list_add_tail(&event->list, list);} else if (event->type > __TRACE_LAST_TYPE) {printk(KERN_WARNING "Need to add type to trace.h\n");WARN_ON(1);goto out;} else {/* Is this event already used */if (ftrace_find_event(event->type))goto out;}/* (2.1.1.1.2) 初始化event默认的各种输出函数 */if (event->funcs->trace == NULL)event->funcs->trace = trace_nop_print;if (event->funcs->raw == NULL)event->funcs->raw = trace_nop_print;if (event->funcs->hex == NULL)event->funcs->hex = trace_nop_print;if (event->funcs->binary == NULL)event->funcs->binary = trace_nop_print;/* (2.1.1.1.3) 将event安装type的数值加入到hash数组,方便读出数据时快速根据type找到event结构*/key = event->type & (EVENT_HASHSIZE - 1);hlist_add_head(&event->node, &event_hash[key]);ret = event->type;out:up_write(&trace_event_sem);return ret;}
在event_trace_enable()中创建的trace_event_file,直到event_trace_init()中才会真正创建成各种文件:
fs_initcall(event_trace_init);↓static __init int event_trace_init(void){struct trace_array *tr;struct dentry *d_tracer;struct dentry *entry;int ret;tr = top_trace_array();if (!tr)return -ENODEV;/* (1) 创建顶级文件夹:tracing */d_tracer = tracing_init_dentry();if (IS_ERR(d_tracer))return 0;/* (2) 创建"tracing/available_events"文件 */entry = tracefs_create_file("available_events", 0444, d_tracer,tr, &ftrace_avail_fops);if (!entry)pr_warn("Could not create tracefs 'available_events' entry\n");/* (3) 创建generic字段:CPU/cpu/COMM/comm,并加入到ftrace_generic_fields链表*/if (trace_define_generic_fields())pr_warn("tracing: Failed to allocated generic fields");/* (4) 创建common字段:type/flags/preempt_count/pid,并加入到ftrace_common_fields链表*/if (trace_define_common_fields())pr_warn("tracing: Failed to allocate common fields");/* (5) 使用early创建的trace_event_file,继续创建相关文件 */ret = early_event_add_tracer(d_tracer, tr);if (ret)return ret;ret = register_module_notifier(&trace_module_nb);if (ret)pr_warn("Failed to register trace events module notifier\n");return 0;}|→static __init intearly_event_add_tracer(struct dentry *parent, struct trace_array *tr){int ret;mutex_lock(&event_mutex);/* (5.1) 创建相关文件:"set_event"、"set_event_pid"、"events/"、"events/header_page"、"events/header_event"、"events/enable"*/ret = create_event_toplevel_files(parent, tr);if (ret)goto out_unlock;down_write(&trace_event_sem);/* (5.2) 根据early创建的trace_event_file,逐个在"events/"文件夹中创建对应的"subsystem"和"event"文件夹*/__trace_early_add_event_dirs(tr);up_write(&trace_event_sem);out_unlock:mutex_unlock(&event_mutex);return ret;}||→static __init void__trace_early_add_event_dirs(struct trace_array *tr){struct trace_event_file *file;int ret;list_for_each_entry(file, &tr->events, list) {ret = event_create_dir(tr->event_dir, file);if (ret < 0)pr_warn("Could not create directory for event %s\n",trace_event_name(file->event_call));}}
1.5、event enable
在trace初始化以后,tracepoint的桩函数和trace_event_call之间并没有建立起联系。只有在event被enable的时候,call->class->probe才会被注册成tracepoint的回调函数;同理event被disable时,call->class->probe会被从tracepoint的回调函数组中注销。
我们来看看enable的具体过程:
trace_create_file("enable", 0644, file->dir, file,&ftrace_enable_fops);↓static const struct file_operations ftrace_enable_fops = {.open = tracing_open_generic,.read = event_enable_read,.write = event_enable_write,.llseek = default_llseek,};↓static ssize_tevent_enable_write(struct file *filp, const char __user *ubuf, size_t cnt,loff_t *ppos){struct trace_event_file *file;unsigned long val;int ret;ret = kstrtoul_from_user(ubuf, cnt, 10, &val);if (ret)return ret;ret = tracing_update_buffers();if (ret < 0)return ret;switch (val) {case 0:case 1:ret = -ENODEV;mutex_lock(&event_mutex);/* (1) 获得event对应的trace_event_file结构 */file = event_file_data(filp);if (likely(file))/* (2) 配置enable/disable */ret = ftrace_event_enable_disable(file, val);mutex_unlock(&event_mutex);break;default:return -EINVAL;}*ppos += cnt;return ret ? ret : cnt;}|→static int ftrace_event_enable_disable(struct trace_event_file *file,int enable){return __ftrace_event_enable_disable(file, enable, 0);}||→static int __ftrace_event_enable_disable(struct trace_event_file *file,int enable, int soft_disable){struct trace_event_call *call = file->event_call;struct trace_array *tr = file->tr;int ret = 0;int disable;switch (enable) {case 0:/** When soft_disable is set and enable is cleared, the sm_ref* reference counter is decremented. If it reaches 0, we want* to clear the SOFT_DISABLED flag but leave the event in the* state that it was. That is, if the event was enabled and* SOFT_DISABLED isn't set, then do nothing. But if SOFT_DISABLED* is set we do not want the event to be enabled before we* clear the bit.** When soft_disable is not set but the SOFT_MODE flag is,* we do nothing. Do not disable the tracepoint, otherwise* "soft enable"s (clearing the SOFT_DISABLED bit) wont work.*/if (soft_disable) {if (atomic_dec_return(&file->sm_ref) > 0)break;disable = file->flags & EVENT_FILE_FL_SOFT_DISABLED;clear_bit(EVENT_FILE_FL_SOFT_MODE_BIT, &file->flags);} elsedisable = !(file->flags & EVENT_FILE_FL_SOFT_MODE);if (disable && (file->flags & EVENT_FILE_FL_ENABLED)) {clear_bit(EVENT_FILE_FL_ENABLED_BIT, &file->flags);if (file->flags & EVENT_FILE_FL_RECORDED_CMD) {tracing_stop_cmdline_record();clear_bit(EVENT_FILE_FL_RECORDED_CMD_BIT, &file->flags);}/* (2.1) 在disable时,调用call->class->reg()的UNREGISTER动作 */call->class->reg(call, TRACE_REG_UNREGISTER, file);}/* If in SOFT_MODE, just set the SOFT_DISABLE_BIT, else clear it */if (file->flags & EVENT_FILE_FL_SOFT_MODE)set_bit(EVENT_FILE_FL_SOFT_DISABLED_BIT, &file->flags);elseclear_bit(EVENT_FILE_FL_SOFT_DISABLED_BIT, &file->flags);break;case 1:/** When soft_disable is set and enable is set, we want to* register the tracepoint for the event, but leave the event* as is. That means, if the event was already enabled, we do* nothing (but set SOFT_MODE). If the event is disabled, we* set SOFT_DISABLED before enabling the event tracepoint, so* it still seems to be disabled.*/if (!soft_disable)clear_bit(EVENT_FILE_FL_SOFT_DISABLED_BIT, &file->flags);else {if (atomic_inc_return(&file->sm_ref) > 1)break;set_bit(EVENT_FILE_FL_SOFT_MODE_BIT, &file->flags);}if (!(file->flags & EVENT_FILE_FL_ENABLED)) {/* Keep the event disabled, when going to SOFT_MODE. */if (soft_disable)set_bit(EVENT_FILE_FL_SOFT_DISABLED_BIT, &file->flags);if (tr->trace_flags & TRACE_ITER_RECORD_CMD) {tracing_start_cmdline_record();set_bit(EVENT_FILE_FL_RECORDED_CMD_BIT, &file->flags);}/* (2.2) 在enable时,调用call->class->reg()的REGISTER动作 */ret = call->class->reg(call, TRACE_REG_REGISTER, file);if (ret) {tracing_stop_cmdline_record();pr_info("event trace: Could not enable event ""%s\n", trace_event_name(call));break;}set_bit(EVENT_FILE_FL_ENABLED_BIT, &file->flags);/* WAS_ENABLED gets set but never cleared. */call->flags |= TRACE_EVENT_FL_WAS_ENABLED;}break;}return ret;}
根据“include/trace/trace_event.h”中event_class_##call的定义,call->class->reg()最后调用到trace_event_reg():
int trace_event_reg(struct trace_event_call *call,enum trace_reg type, void *data){struct trace_event_file *file = data;WARN_ON(!(call->flags & TRACE_EVENT_FL_TRACEPOINT));switch (type) {/* (2.2.1) 把call->class->probe()注册成call->tp的回调函数 */case TRACE_REG_REGISTER:return tracepoint_probe_register(call->tp,call->class->probe,file)/* (2.1.1) 把call->class->probe()从call->tp的回调函数组中注销 */case TRACE_REG_UNREGISTER:tracepoint_probe_unregister(call->tp,call->class->probe,file);return 0;#ifdef CONFIG_PERF_EVENTScase TRACE_REG_PERF_REGISTER:return tracepoint_probe_register(call->tp,call->class->perf_probe,call);case TRACE_REG_PERF_UNREGISTER:tracepoint_probe_unregister(call->tp,call->class->perf_probe,call);return 0;case TRACE_REG_PERF_OPEN:case TRACE_REG_PERF_CLOSE:case TRACE_REG_PERF_ADD:case TRACE_REG_PERF_DEL:return 0;#endif}return 0;}
1.6、”soft disable”模式
在enable/disable trace event的时候,常规模式enable/disable会注册/注销tracepoint回调函数,除此之外还有一种”soft disable”模式。在这种模式下,tracepoint的回调函数被注册,回调函数在event被命中的时候会被调用;但是因为处于”soft disable”模式,会做以下处理:
只会处理trigger/condition trigger命令;
不会记录trace数据;
同时也不会判断filter;
这种模式就是专门给trigger设计的,应用场景有两处:
如果本event配置了trigger命令,把本event配置成”soft disable”模式。这样不管本event有没有enable,trigger命令都会被触发;
如果配置了”enable_event”/”disable_event” trigger命令,在trigger的时候enable/disable目标event,在trigger注册的时候就把目标event设置成”soft”模式,trigger的时候设置目标event在”soft”模式下enable/disable。这样做的目的应该还是因为这样做开销比较小,因为常规完整模式的enable/disable event路径太长。
判断是否处于”soft disable”模式只看trace_event_file->flags中的EVENT_FILE_FL_SOFT_DISABLED_BIT,EVENT_FILE_FL_SOFT_MODE_BIT起辅助作用。”soft disable”模式和普通模式可以共存。
关键函数是:
static int __ftrace_event_enable_disable(struct trace_event_file *file,int enable, int soft_disable)
| “soft disable” mode | ||
| Mode | Action | Description |
soft disable | enable |
|
| disable |
| |
| normal | enable |
|
| disable |
| |
注意:以上只是”soft disable”模式和”normal”模式单独使用时的场景。如果交错使用还有相互恢复、覆盖的逻辑,大家可以查看__ftrace_event_enable_disable()内部具体代码实现。
2、数据存入
从event enable一节可以看到,系统会把把call->class->probe()注册成tracepoint的回调函数,所以当tracepoint被命中call->class->probe()就会被调用。call->class->probe()负责了把trace数据存入ringbuffer的工作。
根据“include/trace/trace_event.h”中event_class_##call的定义,call->class->probe()最后调用到trace_event_raw_event_##call():
static notrace void \trace_event_raw_event_##call(void *__data, proto) \{ \struct trace_event_file *trace_file = __data; \struct trace_event_data_offsets_##call __maybe_unused __data_offsets;\struct trace_event_buffer fbuffer; \struct trace_event_raw_##call *entry; \int __data_size; \\/* (1) 如果没有condition trigger,执行以下事务:1、执行无条件调用的trigger2、如果处在SOFT_DISABLE模式,直接返回3、查看PID_Filter情况,如果被过滤直接返回*/if (trace_trigger_soft_disabled(trace_file)) \return; \\(2) 计算entry中动态数据的长度 */__data_size = trace_event_get_offsets_##call(&__data_offsets, args); \\/* (3) 从ringbuffer中分配整个entry的空间,并且给entry中的comm字段赋值:= event_call->event.type; // event的id= pc & 0xff; // 抢占计数= (tsk) ? tsk->pid : 0; //当前进程pid= // 中断是否disable、是否在HARDIRQ、是否在SOFTIRQ、need_resched、preempt_need_resched#ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT? TRACE_FLAG_IRQS_OFF : 0) |#elseTRACE_FLAG_IRQS_NOSUPPORT |#endif& HARDIRQ_MASK) ? TRACE_FLAG_HARDIRQ : 0) |& SOFTIRQ_OFFSET) ? TRACE_FLAG_SOFTIRQ : 0) |? TRACE_FLAG_NEED_RESCHED : 0) |? TRACE_FLAG_PREEMPT_RESCHED : 0);*/entry = trace_event_buffer_reserve(&fbuffer, trace_file, \sizeof(*entry) + __data_size); \\if (!entry) \return; \\tstruct \\/* (4) 写入用户的trace数据*/assign; } \\/* (5) 向ringbuffer中提交entry或者根据filter丢弃entrytrigger*/\sizeof(*entry) + __data_size); \}
2.1、数据格式

我们可以看到,我们从ringbuffer从分配一段空间给event使用以后,除了保留给用户保存自己trace信息的空间,已经包含如下公共信息:
| Field | Description |
|---|---|
time_ stamp | entry分配时候的时间戳。 |
| type | trace_event对应的id,在读取数据时, 会根据type字段反向查找 到对应的trace_event数据结构。 |
| flags | = 中断是否disable、是否在HARDIRQ、 是否在SOFTIRQ、 need_resched、preempt_need_resched |
| preempt_count | = pc & 0xff;抢占计数 |
| pid | = (tsk) ? tsk->pid : 0;当前的进程pid |
2.2、trigger command
1、初始化时,注册trigger command:
start_kernel() -> trace_init() -> trace_event_init() -> event_trace_enable() -> register_trigger_cmds()↓__init int register_trigger_cmds(void){register_trigger_traceon_traceoff_cmds();register_trigger_snapshot_cmd();register_trigger_stacktrace_cmd();register_trigger_enable_disable_cmds();return 0;}↓static __init int register_trigger_traceon_traceoff_cmds(void){int ret;ret = register_event_command(&trigger_traceon_cmd);if (WARN_ON(ret < 0))return ret;ret = register_event_command(&trigger_traceoff_cmd);if (WARN_ON(ret < 0))unregister_trigger_traceon_traceoff_cmds();return ret;}↓static __init int register_event_command(struct event_command *cmd){struct event_command *p;int ret = 0;mutex_lock(&trigger_cmd_mutex);/* 注册只是简单的将command加入到trigger_commands链表当中 */list_for_each_entry(p, &trigger_commands, list) {if (strcmp(cmd->name, p->name) == 0) {ret = -EBUSY;goto out_unlock;}}list_add(&cmd->list, &trigger_commands);out_unlock:mutex_unlock(&trigger_cmd_mutex);return ret;}
以trigger_enable_cmd、trigger_disable_cmd为例,对应重要的数据结构如下:
static struct event_command trigger_enable_cmd = {.name = ENABLE_EVENT_STR,.trigger_type = ETT_EVENT_ENABLE,.func = event_enable_trigger_func,.reg = event_enable_register_trigger,.unreg = event_enable_unregister_trigger,.get_trigger_ops = event_enable_get_trigger_ops,.set_filter = set_trigger_filter,};static struct event_command trigger_disable_cmd = {.name = DISABLE_EVENT_STR,.trigger_type = ETT_EVENT_ENABLE,.func = event_enable_trigger_func,.reg = event_enable_register_trigger,.unreg = event_enable_unregister_trigger,.get_trigger_ops = event_enable_get_trigger_ops,.set_filter = set_trigger_filter,};static struct event_trigger_ops *event_enable_get_trigger_ops(char *cmd, char *param){struct event_trigger_ops *ops;bool enable;enable = strcmp(cmd, ENABLE_EVENT_STR) == 0;if (enable)ops = param ? &event_enable_count_trigger_ops :&event_enable_trigger_ops;elseops = param ? &event_disable_count_trigger_ops :&event_disable_trigger_ops;return ops;}static struct event_trigger_ops event_enable_trigger_ops = {.func = event_enable_trigger,.print = event_enable_trigger_print,.init = event_trigger_init,.free = event_enable_trigger_free,};static struct event_trigger_ops event_enable_count_trigger_ops = {.func = event_enable_count_trigger,.print = event_enable_trigger_print,.init = event_trigger_init,.free = event_enable_trigger_free,};static struct event_trigger_ops event_disable_trigger_ops = {.func = event_enable_trigger,.print = event_enable_trigger_print,.init = event_trigger_init,.free = event_enable_trigger_free,};static struct event_trigger_ops event_disable_count_trigger_ops = {.func = event_enable_count_trigger,.print = event_enable_trigger_print,.init = event_trigger_init,.free = event_enable_trigger_free,};
2、通过“tracing/events/xxxsubsystem/xxxevent/trigger”文件来配置trigger:
trace_create_file("trigger", 0644, file->dir, file,&event_trigger_fops);↓const struct file_operations event_trigger_fops = {.open = event_trigger_open,.read = seq_read,.write = event_trigger_write,.llseek = tracing_lseek,.release = event_trigger_release,};↓static ssize_tevent_trigger_write(struct file *filp, const char __user *ubuf,size_t cnt, loff_t *ppos){return event_trigger_regex_write(filp, ubuf, cnt, ppos);}↓static ssize_t event_trigger_regex_write(struct file *file,const char __user *ubuf,size_t cnt, loff_t *ppos){struct trace_event_file *event_file;ssize_t ret;char *buf;if (!cnt)return 0;if (cnt >= PAGE_SIZE)return -EINVAL;buf = (char *)__get_free_page(GFP_TEMPORARY);if (!buf)return -ENOMEM;if (copy_from_user(buf, ubuf, cnt)) {free_page((unsigned long)buf);return -EFAULT;}buf[cnt] = '\0';strim(buf);mutex_lock(&event_mutex);/* event对应的trace_event_file */event_file = event_file_data(file);if (unlikely(!event_file)) {mutex_unlock(&event_mutex);free_page((unsigned long)buf);return -ENODEV;}/* 解析trigger文件配置的规则 */ret = trigger_process_regex(event_file, buf);mutex_unlock(&event_mutex);free_page((unsigned long)buf);if (ret < 0)goto out;*ppos += cnt;ret = cnt;out:return ret;}↓static int trigger_process_regex(struct trace_event_file *file, char *buff){char *command, *next = buff;struct event_command *p;int ret = -EINVAL;command = strsep(&next, ": \t");command = (command[0] != '!') ? command : command + 1;mutex_lock(&trigger_cmd_mutex);/* 根据command name,在trigger_commands链表中查找初始化时注册的event_command */list_for_each_entry(p, &trigger_commands, list) {if (strcmp(p->name, command) == 0) {/* 调用event_command的func()函数来处理配置字符串 */ret = p->func(p, file, buff, command, next);goto out_unlock;}}out_unlock:mutex_unlock(&trigger_cmd_mutex);return ret;}
以”enable_event”命令为例,我们继续分析p->func()的调用:
static struct event_command trigger_enable_cmd = {.name = ENABLE_EVENT_STR,.trigger_type = ETT_EVENT_ENABLE,.func = event_enable_trigger_func,.reg = event_enable_register_trigger,.unreg = event_enable_unregister_trigger,.get_trigger_ops = event_enable_get_trigger_ops,.set_filter = set_trigger_filter,};↓static intevent_enable_trigger_func(struct event_command *cmd_ops,struct trace_event_file *file,char *glob, char *cmd, char *param){struct trace_event_file *event_enable_file;struct enable_trigger_data *enable_data;struct event_trigger_data *trigger_data;struct event_trigger_ops *trigger_ops;struct trace_array *tr = file->tr;const char *system;const char *event;char *trigger;char *number;bool enable;int ret;if (!param)return -EINVAL;/* separate the trigger from the filter (s:e:n [if filter]) *//* (1) 从"<system>:<event>[:count] [if filter]"中分离出"<system>:<event>[:count]" */trigger = strsep(¶m, " \t");if (!trigger)return -EINVAL;/* (1.1) 分离出"<system>" */system = strsep(&trigger, ":");if (!trigger)return -EINVAL;/* (1.2) 分离出"<event>" */event = strsep(&trigger, ":");ret = -EINVAL;/* (1.3) 根据目标system、event的name,找到对应的trace_event_file */event_enable_file = find_event_file(tr, system, event);if (!event_enable_file)goto out;/* (1.4) "enable_event" or "disable_event" ? */enable = strcmp(cmd, ENABLE_EVENT_STR) == 0;/* (2) 根据是否包含"[:count]"字段来给trigger_ops赋值:= event_enable_count_trigger_ops //有count= event_enable_trigger_ops // 无count*/trigger_ops = cmd_ops->get_trigger_ops(cmd, trigger);ret = -ENOMEM;trigger_data = kzalloc(sizeof(*trigger_data), GFP_KERNEL);if (!trigger_data)goto out;enable_data = kzalloc(sizeof(*enable_data), GFP_KERNEL);if (!enable_data) {kfree(trigger_data);goto out;}/* (3) 准备本event的data */trigger_data->count = -1;trigger_data->ops = trigger_ops;trigger_data->cmd_ops = cmd_ops;INIT_LIST_HEAD(&trigger_data->list);RCU_INIT_POINTER(trigger_data->filter, NULL);/* (3.1) 准备目标event的data */enable_data->enable = enable;enable_data->file = event_enable_file;trigger_data->private_data = enable_data;/* (4) 如果命令字符串首字符是'!',说明是unregister trigger */if (glob[0] == '!') {/* (4.1) 注销trigger */cmd_ops->unreg(glob+1, trigger_ops, trigger_data, file);kfree(trigger_data);kfree(enable_data);ret = 0;goto out;}/* (3.2) 如果有"[:count]"字段,赋值给trigger_data->count */if (trigger) {number = strsep(&trigger, ":");ret = -EINVAL;if (!strlen(number))goto out_free;/** We use the callback data field (which is a pointer)* as our counter.*/ret = kstrtoul(number, 0, &trigger_data->count);if (ret)goto out_free;}if (!param) /* if param is non-empty, it's supposed to be a filter */goto out_reg;if (!cmd_ops->set_filter)goto out_reg;/* (5) 如果有“[if filter]”字段,配置filter */ret = cmd_ops->set_filter(param, trigger_data, file);if (ret < 0)goto out_free;out_reg:/* Don't let event modules unload while probe registered */ret = try_module_get(event_enable_file->event_call->mod);if (!ret) {ret = -EBUSY;goto out_free;}/* (6) 首先把目标event设置为“soft”+“disable”模式 */ret = trace_event_enable_disable(event_enable_file, 1, 1);if (ret < 0)goto out_put;/* (7) 注册trigger */ret = cmd_ops->reg(glob, trigger_ops, trigger_data, file);/** The above returns on success the # of functions enabled,* but if it didn't find any functions it returns zero.* Consider no functions a failure too.*/if (!ret) {ret = -ENOENT;goto out_disable;} else if (ret < 0)goto out_disable;/* Just return zero, not the number of enabled functions */ret = 0;out:return ret;out_disable:trace_event_enable_disable(event_enable_file, 0, 1);out_put:module_put(event_enable_file->event_call->mod);out_free:if (cmd_ops->set_filter)cmd_ops->set_filter(NULL, trigger_data, NULL);kfree(trigger_data);kfree(enable_data);goto out;}
其中,设置filter的具体操作如下(trigger_data把condition trigger的filter条件和trigger操作联结在了一起):
static int set_trigger_filter(char *filter_str,struct event_trigger_data *trigger_data,struct trace_event_file *file){struct event_trigger_data *data = trigger_data;struct event_filter *filter = NULL, *tmp;int ret = -EINVAL;char *s;if (!filter_str) /* clear the current filter */goto assign;s = strsep(&filter_str, " \t");if (!strlen(s) || strcmp(s, "if") != 0)goto out;if (!filter_str)goto out;/* The filter is for the 'trigger' event, not the triggered event *//* (5.1) 对本event创建filter */ret = create_event_filter(file->event_call, filter_str, false, &filter);if (ret)goto out;assign:tmp = rcu_access_pointer(data->filter);/* (5.2) 创建好的filter赋值给trigger_data->filter */rcu_assign_pointer(data->filter, filter);if (tmp) {/* Make sure the call is done with the filter */synchronize_sched();free_event_filter(tmp);}kfree(data->filter_str);data->filter_str = NULL;/* (5.3) 使用trigger_data->filter_str保存原有的filter字符串 */if (filter_str) {data->filter_str = kstrdup(filter_str, GFP_KERNEL);if (!data->filter_str) {free_event_filter(rcu_access_pointer(data->filter));data->filter = NULL;ret = -ENOMEM;}}out:return ret;}↓int create_event_filter(struct trace_event_call *call,char *filter_str, bool set_str,struct event_filter **filterp){/* 具体的filter的创建我也没看懂,反正就是根据表达式对field和条件进行一系列判断 */return create_filter(call, filter_str, set_str, filterp);}
把目标event设置为“soft disable”模式的具体操作如下:
ret = trace_event_enable_disable(event_enable_file, 1, 1);↓int trace_event_enable_disable(struct trace_event_file *file,int enable, int soft_disable){return __ftrace_event_enable_disable(file, enable, soft_disable);}
注册trigger的具体操作如下:
static int event_enable_register_trigger(char *glob,struct event_trigger_ops *ops,struct event_trigger_data *data,struct trace_event_file *file){struct enable_trigger_data *enable_data = data->private_data;struct enable_trigger_data *test_enable_data;struct event_trigger_data *test;int ret = 0;/* (7.1) 判断目标event是否已经有trigger已经注册 */list_for_each_entry_rcu(test, &file->triggers, list) {test_enable_data = test->private_data;if (test_enable_data &&(test_enable_data->file == enable_data->file)) {ret = -EEXIST;goto out;}}/* (7.2) init */if (data->ops->init) {ret = data->ops->init(data->ops, data);if (ret < 0)goto out;}/* (7.3) 将event_trigger_data加入到file->triggers链表中 */list_add_rcu(&data->list, &file->triggers);ret++;/* (7.4) 设置本event的EVENT_FILE_FL_TRIGGER_MODE_BIT标志位将本event设置成“soft disable”模式*/if (trace_event_trigger_enable_disable(file, 1) < 0) {list_del_rcu(&data->list);ret--;}/* (7.5) 如果有filter trigger,设置EVENT_FILE_FL_TRIGGER_COND_BIT标志位 */update_cond_flag(file);out:return ret;}↓static int trace_event_trigger_enable_disable(struct trace_event_file *file,int trigger_enable){int ret = 0;if (trigger_enable) {if (atomic_inc_return(&file->tm_ref) > 1)return ret;set_bit(EVENT_FILE_FL_TRIGGER_MODE_BIT, &file->flags);ret = trace_event_enable_disable(file, 1, 1);} else {if (atomic_dec_return(&file->tm_ref) > 0)return ret;clear_bit(EVENT_FILE_FL_TRIGGER_MODE_BIT, &file->flags);ret = trace_event_enable_disable(file, 0, 1);}return ret;}
3、在trace event被命中时,trigger的工作流程:
static notrace void \trace_event_raw_event_{ \struct trace_event_file *trace_file = __data; \struct trace_event_data_offsets_struct trace_event_buffer fbuffer; \struct trace_event_raw_int __data_size; \\/* (1) 如果没有condition trigger,执行以下事务:1、执行无条件调用的trigger2、如果处在SOFT_DISABLE模式,直接返回3、查看PID_Filter情况,如果被过滤直接返回*/if (trace_trigger_soft_disabled(trace_file)) \return; \\/* (2) 计算entry中动态数据的长度 */__data_size = trace_event_get_offsets_\/* (3) 从ringbuffer中分配整个entry的空间,并且给entry中的comm字段赋值:entry->type = event_call->event.type; // event的identry->preempt_count = pc & 0xff; // 抢占计数entry->pid= (tsk) ? tsk->pid : 0; //当前进程pidentry->flags = // 中断是否disable、是否在HARDIRQ、是否在SOFTIRQ、need_resched、preempt_need_resched#ifdef CONFIG_TRACE_IRQFLAGS_SUPPORT(irqs_disabled_flags(flags) ? TRACE_FLAG_IRQS_OFF : 0) |#elseTRACE_FLAG_IRQS_NOSUPPORT |#endif((pc & HARDIRQ_MASK) ? TRACE_FLAG_HARDIRQ : 0) |((pc & SOFTIRQ_OFFSET) ? TRACE_FLAG_SOFTIRQ : 0) |(tif_need_resched() ? TRACE_FLAG_NEED_RESCHED : 0) |(test_preempt_need_resched() ? TRACE_FLAG_PREEMPT_RESCHED : 0);*/entry = trace_event_buffer_reserve(&fbuffer, trace_file, \sizeof(*entry) + __data_size); \\if (!entry) \return; \\tstruct \\/* (4) 写入用户的trace数据给entry中自定义的字段赋值*/{ assign; } \\/* (5) 向ringbuffer中提交entry或者根据filter丢弃entry无论提交还是丢弃,判断执行condition trigger*/trace_event_buffer_commit(&fbuffer, \sizeof(*entry) + __data_size); \}|→static inline booltrace_trigger_soft_disabled(struct trace_event_file *file){unsigned long eflags = file->flags;/* (1.1) 只有没有condition trigger时,做进一步的判断 */if (!(eflags & EVENT_FILE_FL_TRIGGER_COND)) {/* (1.1.1) 执行无条件trigger */if (eflags & EVENT_FILE_FL_TRIGGER_MODE)event_triggers_call(file, NULL);/* (1.1.2) 如果处在“soft disable”模式,执行完无条件trigger就可以返回了*/if (eflags & EVENT_FILE_FL_SOFT_DISABLED)return true;/* (1.1.3) 判断当前pid是否被过滤 */if (eflags & EVENT_FILE_FL_PID_FILTER)return trace_event_ignore_this_pid(file);}return false;}|→void *trace_event_buffer_reserve(struct trace_event_buffer *fbuffer,struct trace_event_file *trace_file,unsigned long len){struct trace_event_call *event_call = trace_file->event_call;/* (3.1) 判断当前pid是否被过滤 */if ((trace_file->flags & EVENT_FILE_FL_PID_FILTER) &&trace_event_ignore_this_pid(trace_file))return NULL;local_save_flags(fbuffer->flags);fbuffer->pc = preempt_count();fbuffer->trace_file = trace_file;fbuffer->event =trace_event_buffer_lock_reserve(&fbuffer->buffer, trace_file,event_call->event.type, len,fbuffer->flags, fbuffer->pc);if (!fbuffer->event)return NULL;fbuffer->entry = ring_buffer_event_data(fbuffer->event);return fbuffer->entry;}|→void trace_event_buffer_commit(struct trace_event_buffer *fbuffer,unsigned long len){if (tracepoint_printk)output_printk(fbuffer);/* (5.1) */event_trigger_unlock_commit(fbuffer->trace_file, fbuffer->buffer,fbuffer->event, fbuffer->entry,fbuffer->flags, fbuffer->pc, len);}||→static inline voidevent_trigger_unlock_commit(struct trace_event_file *file,struct ring_buffer *buffer,struct ring_buffer_event *event,void *entry, unsigned long irq_flags, int pc,unsigned long len){enum event_trigger_type tt = ETT_NONE;/* (5.1.1) 判断condition trigger是否要被触发和filter是否需要丢弃数据 */if (!__event_trigger_test_discard(file, buffer, event, entry, &tt)) {if (len)stm_log(OST_ENTITY_FTRACE_EVENTS, entry, len);/* (5.1.2) 如果数据不丢弃,则正常commit */trace_buffer_unlock_commit(file->tr, buffer, event, irq_flags, pc);}if (tt)event_triggers_post_call(file, tt);}|||→static inline bool__event_trigger_test_discard(struct trace_event_file *file,struct ring_buffer *buffer,struct ring_buffer_event *event,void *entry,enum event_trigger_type *tt){unsigned long eflags = file->flags;/* (5.1.1.1) 判断condition trigger是否要被触发 */if (eflags & EVENT_FILE_FL_TRIGGER_COND)*tt = event_triggers_call(file, entry);/* (5.1.1.2) 如果处在“soft disable”模式,执行完condition trigger就可以返回了*/if (test_bit(EVENT_FILE_FL_SOFT_DISABLED_BIT, &file->flags))ring_buffer_discard_commit(buffer, event);/* (5.1.1.3) 判断filter是否需要丢弃数据 */else if (!filter_check_discard(file, entry, buffer, event))return false;return true;}||||→enum event_trigger_typeevent_triggers_call(struct trace_event_file *file, void *rec){struct event_trigger_data *data;enum event_trigger_type tt = ETT_NONE;struct event_filter *filter;if (list_empty(&file->triggers))return tt;/* (5.1.1.1.1) 逐个判断event的trigger链表中的trigger是否需要执行 */list_for_each_entry_rcu(data, &file->triggers, list) {/* (5.1.1.1.2) 数据为空,执行无条件trigger */if (!rec) {data->ops->func(data);continue;}/* (5.1.1.1.3) 判断condition trigger中的condition是否成立 */filter = rcu_dereference_sched(data->filter);if (filter && !filter_match_preds(filter, rec))continue;if (data->cmd_ops->post_trigger) {tt |= data->cmd_ops->trigger_type;continue;}/* (5.1.1.1.4) condition成立,执行condition trigger */data->ops->func(data);}return tt;}
以”enable_event”命令为例,我们继续分析data->ops->func()的调用:
static voidevent_enable_trigger(struct event_trigger_data *data){struct enable_trigger_data *enable_data = data->private_data;/* 因为目标event已经被设置成“soft”模式,在此模式下:EVENT_FILE_FL_SOFT_DISABLED_BIT = 1,disable模式EVENT_FILE_FL_SOFT_DISABLED_BIT = 0,enable模式*/if (enable_data->enable)clear_bit(EVENT_FILE_FL_SOFT_DISABLED_BIT, &enable_data->file->flags);elseset_bit(EVENT_FILE_FL_SOFT_DISABLED_BIT, &enable_data->file->flags);}static voidevent_enable_count_trigger(struct event_trigger_data *data){struct enable_trigger_data *enable_data = data->private_data;if (!data->count)return;/* Skip if the event is in a state we want to switch to *//* 如果trigger命令中的“[:count]”被设置,计数event的触发次数到达门限后再调用trigger命令*/if (enable_data->enable == !(enable_data->file->flags & EVENT_FILE_FL_SOFT_DISABLED))return;if (data->count != -1)(data->count)--;event_enable_trigger(data);}
2.3、filter
1、通过“tracing/events/xxxsubsystem/xxxevent/filter”文件来配置filter:
trace_create_file("filter", 0644, file->dir, file,&ftrace_event_filter_fops);↓static const struct file_operations ftrace_event_filter_fops = {.open = tracing_open_generic,.read = event_filter_read,.write = event_filter_write,.llseek = default_llseek,};↓static ssize_tevent_filter_write(struct file *filp, const char __user *ubuf, size_t cnt,loff_t *ppos){struct trace_event_file *file;char *buf;int err = -ENODEV;if (cnt >= PAGE_SIZE)return -EINVAL;buf = (char *)__get_free_page(GFP_TEMPORARY);if (!buf)return -ENOMEM;if (copy_from_user(buf, ubuf, cnt)) {free_page((unsigned long) buf);return -EFAULT;}buf[cnt] = '\0';mutex_lock(&event_mutex);file = event_file_data(filp);if (file)err = apply_event_filter(file, buf);mutex_unlock(&event_mutex);free_page((unsigned long) buf);if (err < 0)return err;*ppos += cnt;return cnt;}↓int apply_event_filter(struct trace_event_file *file, char *filter_string){struct trace_event_call *call = file->event_call;struct event_filter *filter;int err;if (!strcmp(strstrip(filter_string), "0")) {filter_disable(file);filter = event_filter(file);if (!filter)return 0;event_clear_filter(file);/* Make sure the filter is not being used */synchronize_sched();__free_filter(filter);return 0;}err = create_filter(call, filter_string, true, &filter);/** Always swap the call filter with the new filter* even if there was an error. If there was an error* in the filter, we disable the filter and show the error* string*/if (filter) {struct event_filter *tmp;tmp = event_filter(file);if (!err)event_set_filtered_flag(file);elsefilter_disable(file);event_set_filter(file, filter);if (tmp) {/* Make sure the call is done with the filter */synchronize_sched();__free_filter(tmp);}}return err;}
2、在trace event被命中时,filter的工作流程:
trace_event_raw_event_↓int filter_check_discard(struct trace_event_file *file, void *rec,struct ring_buffer *buffer,struct ring_buffer_event *event){if (unlikely(file->flags & EVENT_FILE_FL_FILTERED) &&!filter_match_preds(file->filter, rec)) {ring_buffer_discard_commit(buffer, event);return 1;}return 0;}
3、数据取出
trace数据被存入ringbuffer以后,可以通过不同的形式来查看:
可以在trace结束后,通过“trace”文件来读取用户可读的数据。这种使用iterator读,数据不会丢失,可重复读取;
也可以在trace的过程中,使用“trace_pipe”接口同步读取数据,不会阻塞trace数据的写入。这种使用reader_page swap读,读完数据会丢失,只能读一次;
也可以使用“per_cpu/cpu0/trace_pipe_raw”接口,读取raw数据,在用户态工具侧再来解析;
3.1、“trace”文件的读取

上图展示了从trace文件中读取trace数据时的情况,把raw trace数据格式化成用户可读的形式。trace数据分成两部分:一部分是comm信息,一部分是用户自定义的trace信息。这两部分分开格式化
comm信息。这部分有系统根据当前是“Normal format”还是“Latency trace format”,选择使用trace_print_context()或者是trace_print_lat_context()函数来进行格式化。
用户trace信息。这部分根据type字段找到对应的trace_event结构,有其event->funcs->trace()函数来进行格式化。遵循谁存储谁解析的原则,其他人也不清楚raw数据的解析规则。
我们看看具体的代码分析:
trace_create_file("trace", 0644, d_tracer,tr, &tracing_fops);↓static const struct file_operations tracing_fops = {.open = tracing_open,.read = seq_read,.write = tracing_write_stub,.llseek = tracing_lseek,.release = tracing_release,};↓ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos){struct seq_file *m = file->private_data;size_t copied = 0;loff_t pos;size_t n;void *p;int err = 0;mutex_lock(&m->lock);/** seq_file->op->..m_start/m_stop/m_next may do special actions* or optimisations based on the file->f_version, so we want to* pass the file->f_version to those methods.** seq_file->version is just copy of f_version, and seq_file* methods can treat it simply as file version.* It is copied in first and copied out after all operations.* It is convenient to have it as part of structure to avoid the* need of passing another argument to all the seq_file methods.*/m->version = file->f_version;/* Don't assume *ppos is where we left it */if (unlikely(*ppos != m->read_pos)) {while ((err = traverse(m, *ppos)) == -EAGAIN);if (err) {/* With prejudice... */m->read_pos = 0;m->version = 0;m->index = 0;m->count = 0;goto Done;} else {m->read_pos = *ppos;}}/* grab buffer if we didn't have one */if (!m->buf) {m->buf = seq_buf_alloc(m->size = PAGE_SIZE);if (!m->buf)goto Enomem;}/* if not empty - flush it first */if (m->count) {n = min(m->count, size);err = copy_to_user(buf, m->buf + m->from, n);if (err)goto Efault;m->count -= n;m->from += n;size -= n;buf += n;copied += n;if (!m->count) {m->from = 0;m->index++;}if (!size)goto Done;}/* we need at least one record in buffer */pos = m->index;p = m->op->start(m, &pos);while (1) {err = PTR_ERR(p);if (!p || IS_ERR(p))break;err = m->op->show(m, p);if (err < 0)break;if (unlikely(err))m->count = 0;if (unlikely(!m->count)) {p = m->op->next(m, p, &pos);m->index = pos;continue;}if (m->count < m->size)goto Fill;m->op->stop(m, p);kvfree(m->buf);m->count = 0;m->buf = seq_buf_alloc(m->size <<= 1);if (!m->buf)goto Enomem;m->version = 0;pos = m->index;p = m->op->start(m, &pos);}m->op->stop(m, p);m->count = 0;goto Done;Fill:/* they want more? let's try to get some more */while (m->count < size) {size_t offs = m->count;loff_t next = pos;p = m->op->next(m, p, &next);if (!p || IS_ERR(p)) {err = PTR_ERR(p);break;}err = m->op->show(m, p);if (seq_has_overflowed(m) || err) {m->count = offs;if (likely(err <= 0))break;}pos = next;}m->op->stop(m, p);n = min(m->count, size);err = copy_to_user(buf, m->buf, n);if (err)goto Efault;copied += n;m->count -= n;if (m->count)m->from = n;elsepos++;m->index = pos;Done:if (!copied)copied = err;else {*ppos += copied;m->read_pos += copied;}file->f_version = m->version;mutex_unlock(&m->lock);return copied;Enomem:err = -ENOMEM;goto Done;Efault:err = -EFAULT;goto Done;}
读操作实际最后落在tracer_seq_ops的s_next()和s_show()上:
static const struct seq_operations tracer_seq_ops = {.start = s_start,.next = s_next,.stop = s_stop,.show = s_show,};static void *s_next(struct seq_file *m, void *v, loff_t *pos){struct trace_iterator *iter = m->private;int i = (int)*pos;void *ent;WARN_ON_ONCE(iter->leftover);(*pos)++;/* can't go backwards */if (iter->idx > i)return NULL;if (iter->idx < 0)ent = trace_find_next_entry_inc(iter);elseent = iter;/* (1) 从ringbuffer中取出下一个entry */while (ent && iter->idx < i)ent = trace_find_next_entry_inc(iter);iter->pos = *pos;return ent;}|→void *trace_find_next_entry_inc(struct trace_iterator *iter){/* (1.1) 取出下一个entry */iter->ent = __find_next_entry(iter, &iter->cpu,&iter->lost_events, &iter->ts);/* (1.2) 增加iter中的读指针 */if (iter->ent)trace_iterator_increment(iter);return iter->ent ? iter : NULL;}static int s_show(struct seq_file *m, void *v){struct trace_iterator *iter = v;int ret;if (iter->ent == NULL) {if (iter->tr) {seq_printf(m, "# tracer: %s\n", iter->trace->name);seq_puts(m, "#\n");test_ftrace_alive(m);}if (iter->snapshot && trace_empty(iter))print_snapshot_help(m, iter);else if (iter->trace && iter->trace->print_header)iter->trace->print_header(m);elsetrace_default_header(m);} else if (iter->leftover) {/** If we filled the seq_file buffer earlier, we* want to just show it now.*/ret = trace_print_seq(m, &iter->seq);/* ret should this time be zero, but you never know */iter->leftover = ret;} else {/* (2) 把entry格式化成用户可读字符串 */print_trace_line(iter);/* 将iter中的内容拷贝到文件中 */ret = trace_print_seq(m, &iter->seq);/** If we overflow the seq_file buffer, then it will* ask us for this data again at start up.* Use that instead.* ret is 0 if seq_file write succeeded.* -1 otherwise.*/iter->leftover = ret;}return 0;}|→enum print_line_t print_trace_line(struct trace_iterator *iter){struct trace_array *tr = iter->tr;unsigned long trace_flags = tr->trace_flags;enum print_line_t ret;if (iter->lost_events) {trace_seq_printf(&iter->seq, "CPU:%d [LOST %lu EVENTS]\n",iter->cpu, iter->lost_events);if (trace_seq_has_overflowed(&iter->seq))return TRACE_TYPE_PARTIAL_LINE;}if (iter->trace && iter->trace->print_line) {ret = iter->trace->print_line(iter);if (ret != TRACE_TYPE_UNHANDLED)return ret;}/* (2.1) 几种特殊type的格式化方式 */if (iter->ent->type == TRACE_BPUTS &&trace_flags & TRACE_ITER_PRINTK &&trace_flags & TRACE_ITER_PRINTK_MSGONLY)return trace_print_bputs_msg_only(iter);if (iter->ent->type == TRACE_BPRINT &&trace_flags & TRACE_ITER_PRINTK &&trace_flags & TRACE_ITER_PRINTK_MSGONLY)return trace_print_bprintk_msg_only(iter);if (iter->ent->type == TRACE_PRINT &&trace_flags & TRACE_ITER_PRINTK &&trace_flags & TRACE_ITER_PRINTK_MSGONLY)return trace_print_printk_msg_only(iter);/* (2.2) raw数据的打印方式 */if (trace_flags & TRACE_ITER_BIN)return print_bin_fmt(iter);if (trace_flags & TRACE_ITER_HEX)return print_hex_fmt(iter);if (trace_flags & TRACE_ITER_RAW)return print_raw_fmt(iter);/* (2.3) 普通trace信息的打印方式 */return print_trace_fmt(iter);}||→static enum print_line_t print_trace_fmt(struct trace_iterator *iter){struct trace_array *tr = iter->tr;struct trace_seq *s = &iter->seq;unsigned long sym_flags = (tr->trace_flags & TRACE_ITER_SYM_MASK);struct trace_entry *entry;struct trace_event *event;entry = iter->ent;test_cpu_buff_start(iter);event = ftrace_find_event(entry->type);/* (2.3.1) 打印头部comm信息 */if (tr->trace_flags & TRACE_ITER_CONTEXT_INFO) {if (iter->iter_flags & TRACE_FILE_LAT_FMT)/* Latency trace format */trace_print_lat_context(iter);else/* Normal format */trace_print_context(iter);}if (trace_seq_has_overflowed(s))return TRACE_TYPE_PARTIAL_LINE;/* (2.3.2) 打印用户trace信息 */if (event)return event->funcs->trace(iter, sym_flags, event);trace_seq_printf(s, "Unknown type %d\n", entry->type);return trace_handle_return(s);}
以trace event为例,我们继续分析event->funcs->trace()。最后调用到“include/trace/trace_event.h”中的trace_raw_output_##call()函数:
static notrace enum print_line_t \trace_raw_output_##call(struct trace_iterator *iter, int flags, \struct trace_event *trace_event) \{ \// (2.3.2.1) iter->seq:为output bufferstruct trace_seq *s = &iter->seq; \(2.3.2.2) iter->tmp_seq:为临时buffer,供__get_bitmask()、__print_flags()、__print_symbolic()等一系列函数先把值转换成字符串struct trace_seq __maybe_unused *p = &iter->tmp_seq; \(2.3.2.3) iter->ent:为ringbuffer原始数据的存放buffer,使用上面定义的结构"struct trace_event_raw_##call"指向这片区域,来进行引用struct trace_event_raw_##call *field; \int ret; \\field = (typeof(field))iter->ent; \\// (2.3.2.4)在output buffer的最前头,打印出trace_event的nameret = trace_raw_output_prep(iter, trace_event); \if (ret != TRACE_TYPE_HANDLED) \return ret; \\// (2.3.2.5)把ringbuffer中的原始数据根据格式打印到output buffer当中print); \\return trace_handle_return(s); \} \
trace_iterator只是一个中间临时变量,主要是利用iter->seq、iter->tmp_seq把iter->ent指向的ringbuffer数据按照format转换成用户可读的数据。
4、trace_printk()
trace_printk()是利用printk的形式,把调试信息存入到trace ringbuffer中去。从前面的trace event分析中可以看到,在trace框架中存储trace数据需要几大要素:
插桩点。因为trace_printk()是在代码中显式的调用,所以这个不是问题;
trace框架数据结构,操作函数合集。这个的顶级数据结构是trace_event_call,我们通过TRACE_EVENT()宏创建一个trace event的时候会自动创建一个trace_event_call。针对trace_printk(),系统也创建了一些默认trace_event_call供其使用。
使用哪块ringbuffer。trace_printk()默认使用global_trace。
4.1、trace_printk()定义
do { \char _______STR[] = __stringify((__VA_ARGS__)); \/* (1) 如果##__VA_ARGS__不为空,trace_printk()的调用为可变参数 */if (sizeof(_______STR) > 3) \do_trace_printk(fmt,/* (2) 如果##__VA_ARGS__为空,trace_printk()的调用为只有一个参数“fmt” */else \trace_puts(fmt); \} while (0)
1、do_trace_printk()的实现:
do { \static const char *trace_printk_fmt __used \__attribute__((section("__trace_printk_fmt"))) = \__builtin_constant_p(fmt) ? fmt : NULL; \\__trace_printk_check_format(fmt,\/* (1.1) 如果格式字符串“fmt”是一个常量,在编译的时候就能确定将“fmt”存入到section("__trace_printk_fmt"),可以在“printk_formats”文件中查看在保存trace的时候就不需要保存“fmt”到ringbuffer,只需保存“fmt”指针和“##args”到ringbuffer可以节约宝贵的ringbuffer空间在数据读出的时候,再根据“fmt”的format,格式化打印“##args”数据使用TRACE_BPRINT type的trace_event_call来处理数据*/if (__builtin_constant_p(fmt)) \__trace_bprintk(_THIS_IP_, trace_printk_fmt,/* (1.2) 如果格式字符串“fmt”是一个变量,比如引用的是一个字符串变量,在运行的时候才能确认因为“fmt”是一个变量,所以不能事先保存在保存trace的时候就需要根据“fmt”的format格式化打印“##args”数据,保存“fmt”内容和“##args”到ringbuffer比较浪费ringbuffer空间,但是没有办法在数据读出的时候,因为ringbuffer中存储的已经是格式化好的字符串,只需读出即可使用TRACE_PRINT type的trace_event_call来处理数据*/else \__trace_printk(_THIS_IP_, fmt,} while (0)|→__trace_bprintk() -> trace_vbprintk():int trace_vbprintk(unsigned long ip, const char *fmt, va_list args){struct trace_event_call *call = &event_bprint;struct ring_buffer_event *event;struct ring_buffer *buffer;struct trace_array *tr = &global_trace;struct bprint_entry *entry;unsigned long flags;char *tbuffer;int len = 0, size, pc;if (unlikely(tracing_selftest_running || tracing_disabled))return 0;/* Don't pollute graph traces with trace_vprintk internals */pause_graph_tracing();pc = preempt_count();preempt_disable_notrace();/* (1.1.1) 使用trace_pintk()专用的临时空间 */tbuffer = get_trace_buf();if (!tbuffer) {len = 0;goto out;}/* (1.1.2) 仅保存“args”数据到ringbuffer */len = vbin_printf((u32 *)tbuffer, TRACE_BUF_SIZE/sizeof(int), fmt, args);if (len > TRACE_BUF_SIZE/sizeof(int) || len < 0)goto out;local_save_flags(flags);size = sizeof(*entry) + sizeof(u32) * len;buffer = tr->trace_buffer.buffer;event = trace_buffer_lock_reserve(buffer, TRACE_BPRINT, size,flags, pc);if (!event)goto out;/* (1.1.3) 保存ip和section("__trace_printk_fmt")中的“fmt”指针 */entry = ring_buffer_event_data(event);entry->ip = ip;entry->fmt = fmt;memcpy(entry->buf, tbuffer, sizeof(u32) * len);if (!call_filter_check_discard(call, entry, buffer, event)) {__buffer_unlock_commit(buffer, event);ftrace_trace_stack(tr, buffer, flags, 6, pc, NULL);}out:preempt_enable_notrace();unpause_graph_tracing();return len;}|→__trace_printk() -> trace_vprintk() -> trace_array_vprintk() -> __trace_array_vprintk():static int__trace_array_vprintk(struct ring_buffer *buffer,unsigned long ip, const char *fmt, va_list args){struct trace_event_call *call = &event_print;struct ring_buffer_event *event;int len = 0, size, pc;struct print_entry *entry;unsigned long flags;char *tbuffer;if (tracing_disabled || tracing_selftest_running)return 0;/* Don't pollute graph traces with trace_vprintk internals */pause_graph_tracing();pc = preempt_count();preempt_disable_notrace();/* (1.2.1) 使用trace_pintk()专用的临时空间 */tbuffer = get_trace_buf();if (!tbuffer) {len = 0;goto out;}/* (1.2.2) 保存“fmt”+“args”格式化好的数据到ringbuffer */len = vscnprintf(tbuffer, TRACE_BUF_SIZE, fmt, args);local_save_flags(flags);size = sizeof(*entry) + len + 1;event = trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,flags, pc);if (!event)goto out;/* (1.2.3) 保存ip和格式化字符串 */entry = ring_buffer_event_data(event);entry->ip = ip;memcpy(&entry->buf, tbuffer, len + 1);if (!call_filter_check_discard(call, entry, buffer, event)) {stm_log(OST_ENTITY_TRACE_PRINTK, entry->buf, len + 1);__buffer_unlock_commit(buffer, event);ftrace_trace_stack(&global_trace, buffer, flags, 6, pc, NULL);}out:preempt_enable_notrace();unpause_graph_tracing();return len;}
2、trace_puts()的实现:
static const char *trace_printk_fmt __used \__attribute__((section())) = \__builtin_constant_p(str) ? str : NULL; \\/* (2.1) 如果格式字符串“fmt”是一个常量,在编译的时候就能确定将“fmt”存入到section("__trace_printk_fmt"),可以在“printk_formats”文件中查看在保存trace的时候就不需要保存“fmt”到ringbuffer,只需保存“fmt”指针到ringbuffer可以节约宝贵的ringbuffer空间在数据读出的时候,再根据“fmt”的指针,打印出“fmt”内容使用TRACE_BPUTS type的trace_event_call来处理数据*/if (__builtin_constant_p(str)) \__trace_bputs(_THIS_IP_, trace_printk_fmt); \/* (2.2) 如果格式字符串“fmt”是一个变量,比如引用的是一个字符串变量,在运行的时候才能确认因为“fmt”是一个变量,所以不能事先保存在保存trace的时候保存“fmt”内容到ringbuffer比较浪费ringbuffer空间,但是没有办法在数据读出的时候,因为ringbuffer中存储的好的“fmt”字符串,只需读出即可使用TRACE_PRINT type的trace_event_call来处理数据*/else \__trace_puts(_THIS_IP_, str, strlen(str)); \})|→int __trace_bputs(unsigned long ip, const char *str){struct ring_buffer_event *event;struct ring_buffer *buffer;struct bputs_entry *entry;unsigned long irq_flags;int size = sizeof(struct bputs_entry);int pc;if (!(global_trace.trace_flags & TRACE_ITER_PRINTK))return 0;pc = preempt_count();if (unlikely(tracing_selftest_running || tracing_disabled))return 0;local_save_flags(irq_flags);buffer = global_trace.trace_buffer.buffer;event = trace_buffer_lock_reserve(buffer, TRACE_BPUTS, size,irq_flags, pc);if (!event)return 0;entry = ring_buffer_event_data(event);/* (2.1.1) 保存ip和section("__trace_printk_fmt")中的“fmt”指针到ringbuffer */entry->ip = ip;entry->str = str;stm_log(OST_ENTITY_TRACE_PRINTK, entry->str, strlen(entry->str)+1);__buffer_unlock_commit(buffer, event);ftrace_trace_stack(&global_trace, buffer, irq_flags, 4, pc, NULL);return 1;}|→int __trace_puts(unsigned long ip, const char *str, int size){struct ring_buffer_event *event;struct ring_buffer *buffer;struct print_entry *entry;unsigned long irq_flags;int alloc;int pc;if (!(global_trace.trace_flags & TRACE_ITER_PRINTK))return 0;pc = preempt_count();if (unlikely(tracing_selftest_running || tracing_disabled))return 0;alloc = sizeof(*entry) + size + 2; /* possible \n added */local_save_flags(irq_flags);buffer = global_trace.trace_buffer.buffer;event = trace_buffer_lock_reserve(buffer, TRACE_PRINT, alloc,irq_flags, pc);if (!event)return 0;entry = ring_buffer_event_data(event);/* (2.2.1) 保存ip到ringbuffer */entry->ip = ip;/* (2.2.2) 保存“fmt”字符串到ringbuffer */memcpy(&entry->buf, str, size);/* Add a newline if necessary */if (entry->buf[size - 1] != '\n') {entry->buf[size] = '\n';entry->buf[size + 1] = '\0';stm_log(OST_ENTITY_TRACE_PRINTK, entry->buf, size + 2);} else {entry->buf[size] = '\0';stm_log(OST_ENTITY_TRACE_PRINTK, entry->buf, size + 1);}__buffer_unlock_commit(buffer, event);ftrace_trace_stack(&global_trace, buffer, irq_flags, 4, pc, NULL);return size;}
4.2、FTRACE_ENTRY()宏的展开
从上一节可以看到,为了优化对ringbuffer的消耗,不同情况下trace_printk()使用了3种type的trace_event_call来进行处理。分别是:TRACE_BPRINT、TRACE_PRINT、TRACE_BPUTS。
这3种系统默认trace_event_call,是通过FTRACE_ENTRY()宏来定义的。定义了trace_event_call可以直接使用现有的trace框架。具体过程如下:
1、kernel/trace/trace_entries.h:
/** trace_printk entry:*/FTRACE_ENTRY(bprint, bprint_entry,TRACE_BPRINT,F_STRUCT(__field( unsigned long, ip )__field( const char *, fmt )__dynamic_array( u32, buf )),F_printk("%ps: %s",(void *)__entry->ip, __entry->fmt),FILTER_OTHER);FTRACE_ENTRY(print, print_entry,TRACE_PRINT,F_STRUCT(__field( unsigned long, ip )__dynamic_array( char, buf )),F_printk("%ps: %s",(void *)__entry->ip, __entry->buf),FILTER_OTHER);FTRACE_ENTRY(bputs, bputs_entry,TRACE_BPUTS,F_STRUCT(__field( unsigned long, ip )__field( const char *, str )),F_printk("%ps: %s",(void *)__entry->ip, __entry->str),FILTER_OTHER);
2、kernel/trace/trace.h:
FTRACE_ENTRY()宏第1次展开,定义了struct struct_name:
struct struct_name { \struct trace_entry ent; \tstruct \}filter, regfn) \FTRACE_ENTRY(name, struct_name, id, PARAMS(tstruct), PARAMS(print), \filter)
FTRACE_ENTRY()宏第2次展开,定义了声明了外部变量extern struct trace_event_call event_##call:
extern struct trace_event_call \__aligned(4) event_FTRACE_ENTRY(call, struct_name, id, PARAMS(tstruct), PARAMS(print), \filter)
3、kernel/trace/export.c:
FTRACE_ENTRY()宏第3次展开,定义了struct ___ftrace##name:
/** The FTRACE_ENTRY_REG macro allows ftrace entry to define register* function and thus become accesible via perf.*/filter, regfn) \FTRACE_ENTRY(name, struct_name, id, PARAMS(tstruct), PARAMS(print), \filter)/* not needed for this file */struct ____ftrace_tstruct \}; \static void __always_unused ____ftrace_check_{ \struct ____ftrace_\/* force compile-time check on F_printk() */ \printk(print); \}FTRACE_ENTRY(name, struct_name, id, PARAMS(tstruct), PARAMS(print), \filter)
FTRACE_ENTRY()宏第4次展开,定义了ftrace_define_fields_##name():
ret = trace_define_field(event_call,offsetof(typeof(field), item), \sizeof(field.item), \is_signed_type(type), filter_type); \if (ret) \return ret;ret = trace_define_field(event_call,offsetof(typeof(field), \container.item), \sizeof(field.container.item), \is_signed_type(type), filter_type); \if (ret) \return ret;do { \char *type_str =BUILD_BUG_ON(len > MAX_FILTER_STR_VAL); \ret = trace_define_field(event_call, type_str,offsetof(typeof(field), item), \sizeof(field.item), \is_signed_type(type), filter_type); \if (ret) \return ret; \} while (0);BUILD_BUG_ON(len > MAX_FILTER_STR_VAL); \ret = trace_define_field(event_call,offsetof(typeof(field), \container.item), \sizeof(field.container.item), \is_signed_type(type), filter_type); \if (ret) \return ret;ret = trace_define_field(event_call,offsetof(typeof(field), item), \0, is_signed_type(type), filter_type);\if (ret) \return ret;static int __init \ftrace_define_fields_{ \struct struct_name field; \int ret; \int filter_type = filter; \\tstruct; \\return ret; \}
FTRACE_ENTRY()宏第5次展开,定义了struct trace_event_call _used event##call和struct trace_event_class _refdata event_class_ftrace##call:
regfn) \\struct trace_event_class __refdata event_class_ftrace_.system = __stringify(TRACE_SYSTEM), \.define_fields = ftrace_define_fields_.fields = LIST_HEAD_INIT(event_class_ftrace_.reg = regfn, \}; \\struct trace_event_call __used event_.class = &event_class_ftrace_{ \.name =}, \.event.type = etype, \.print_fmt = print, \.flags = TRACE_EVENT_FL_IGNORE_ENABLE, \}; \struct trace_event_call __used \__attribute__((section("_ftrace_events"))) *__event_FTRACE_ENTRY_REG(call, struct_name, etype, \PARAMS(tstruct), PARAMS(print), filter, NULL)bool ftrace_event_is_function(struct trace_event_call *call){return call == &event_function;}
4.3、数据存入
在trace_printk()定义一节中已经详细描述了3种(TRACE_BPRINT、TRACE_PRINT、TRACE_BPUTS)trace_event_call的数据存入过程。这里简单图示一下它们的数据存储结构:

4.4、数据读出
trace_printk()默认样式:
cat tracetracer: functionentries-in-buffer/entries-written: 45/45 #P:8_-----=> irqs-off/ _----=> need-resched| / _---=> hardirq/softirq|| / _--=> preempt-depth||| / delayTASK-PID CPU# |||| TIMESTAMP FUNCTION| | | |||| | |sh-18299 [000] .... 663030.727868: set_trigger_filter: [set_trigger_filter] data = 0xfffffff11a12cd00, filter_str = 0x0
在kernel/trace/trace_ouput.c文件中,注册了系统默认的几种trace_event:
static struct trace_event *events[] __initdata = {&trace_fn_event,&trace_graph_ent_event,&trace_graph_ret_event,&trace_ctx_event,&trace_wake_event,&trace_stack_event,&trace_user_stack_event,&trace_bputs_event,&trace_bprint_event,&trace_print_event,NULL};__init static int init_events(void){struct trace_event *event;int i, ret;for (i = 0; events[i]; i++) {event = events[i];ret = register_trace_event(event);if (!ret) {printk(KERN_WARNING "event %d failed to register\n",event->type);WARN_ON_ONCE(1);}}return 0;}|→/* (1) TRACE_BPUTS对应的trace_event 和 event->funcs->trace()函数 */static struct trace_event_functions trace_bputs_funcs = {.trace = trace_bputs_print,.raw = trace_bputs_raw,};static struct trace_event trace_bputs_event = {.type = TRACE_BPUTS,.funcs = &trace_bputs_funcs,};|→/* (2) TRACE_BPRINT对应的trace_event 和 event->funcs->trace()函数 */static struct trace_event_functions trace_bprint_funcs = {.trace = trace_bprint_print,.raw = trace_bprint_raw,};static struct trace_event trace_bprint_event = {.type = TRACE_BPRINT,.funcs = &trace_bprint_funcs,};|→/* (3) TRACE_PRINT对应的trace_event 和 event->funcs->trace()函数 */static struct trace_event_functions trace_print_funcs = {.trace = trace_print_print,.raw = trace_print_raw,};static struct trace_event trace_print_event = {.type = TRACE_PRINT,.funcs = &trace_print_funcs,};
在数据读出时,会调用到event对应的event->funcs->trace()函数,seq_read() -> s_show() -> print_trace_line() -> print_trace_fmt() -> event->funcs->trace():
1、TRACE_BPUTS,event->funcs->trace()对应trace_bputs_print():
static enum print_line_ttrace_bputs_print(struct trace_iterator *iter, int flags,struct trace_event *event){struct trace_entry *entry = iter->ent;struct trace_seq *s = &iter->seq;struct bputs_entry *field;trace_assign_type(field, entry);seq_print_ip_sym(s, field->ip, flags);trace_seq_puts(s, ": ");trace_seq_puts(s, field->str);return trace_handle_return(s);}
2、TRACE_BPRINT,event->funcs->trace()对应trace_bprint_print():
static enum print_line_ttrace_bprint_print(struct trace_iterator *iter, int flags,struct trace_event *event){struct trace_entry *entry = iter->ent;struct trace_seq *s = &iter->seq;struct bprint_entry *field;trace_assign_type(field, entry);seq_print_ip_sym(s, field->ip, flags);trace_seq_puts(s, ": ");trace_seq_bprintf(s, field->fmt, field->buf);return trace_handle_return(s);}
3、TRACE_PRINT,event->funcs->trace()对应trace_print_print():
static enum print_line_t trace_print_print(struct trace_iterator *iter,int flags, struct trace_event *event){struct print_entry *field;struct trace_seq *s = &iter->seq;trace_assign_type(field, iter->ent);seq_print_ip_sym(s, field->ip, flags);trace_seq_printf(s, ": %s", field->buf);return trace_handle_return(s);}
作者上一篇文章:深入理解EtherCAT
往期精华文章:【精华】Linux阅码场原创精华文章汇总
阅码场付费会员专业交流群
会员招募:各专业群会员费为88元/季度,权益包含群内提问,线下活动8折,全年不定期群技术分享(普通用户直播免费,分享后每次点播价为19元/次),有意加入请私信客服小月(小月微信号:linuxer2016)
专业群介绍:
甄建勇-性能优化与体系结构
本群定位Perf、cache和CPU架构技术交流,覆盖云/网/车/机/芯领域资深用户,由阅码场资深讲师甄建勇主持。
李春良-Xenomai与实时优化
本群定位Xenomai与实时优化技术交流,覆盖云/网/车/机/芯领域资深用户,由阅码场资深讲师李春良和彭伟林共同主持。
周贺贺-Tee和ARM架构
本群定位Tee和ARM架构技术交流,覆盖云/网/车/机/芯领域资深用户,由阅码场资深讲师周贺贺主持。
谢欢-Linux tracers
本群定位Linux tracers技术交流,覆盖云/网/车/机/芯领域资深用户,由阅码场资深讲师谢欢主持。
✦
✦