diff options
Diffstat (limited to 'Documentation/trace')
-rw-r--r-- | Documentation/trace/events-kmem.txt | 107 | ||||
-rw-r--r-- | Documentation/trace/events-power.txt | 90 | ||||
-rw-r--r-- | Documentation/trace/events.txt | 280 | ||||
-rw-r--r-- | Documentation/trace/ftrace-design.txt | 411 | ||||
-rw-r--r-- | Documentation/trace/ftrace.txt | 1848 | ||||
-rw-r--r-- | Documentation/trace/function-graph-fold.vim | 42 | ||||
-rw-r--r-- | Documentation/trace/kprobetrace.txt | 172 | ||||
-rw-r--r-- | Documentation/trace/mmiotrace.txt | 164 | ||||
-rw-r--r-- | Documentation/trace/postprocess/trace-pagealloc-postprocess.pl | 418 | ||||
-rw-r--r-- | Documentation/trace/postprocess/trace-vmscan-postprocess.pl | 714 | ||||
-rw-r--r-- | Documentation/trace/ring-buffer-design.txt | 955 | ||||
-rw-r--r-- | Documentation/trace/tracepoint-analysis.txt | 327 | ||||
-rw-r--r-- | Documentation/trace/tracepoints.txt | 116 | ||||
-rw-r--r-- | Documentation/trace/uprobetracer.txt | 113 |
14 files changed, 0 insertions, 5757 deletions
diff --git a/Documentation/trace/events-kmem.txt b/Documentation/trace/events-kmem.txt deleted file mode 100644 index 19480041006..00000000000 --- a/Documentation/trace/events-kmem.txt +++ /dev/null @@ -1,107 +0,0 @@ - Subsystem Trace Points: kmem - -The kmem tracing system captures events related to object and page allocation -within the kernel. Broadly speaking there are five major subheadings. - - o Slab allocation of small objects of unknown type (kmalloc) - o Slab allocation of small objects of known type - o Page allocation - o Per-CPU Allocator Activity - o External Fragmentation - -This document describes what each of the tracepoints is and why they -might be useful. - -1. Slab allocation of small objects of unknown type -=================================================== -kmalloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s -kmalloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d -kfree call_site=%lx ptr=%p - -Heavy activity for these events may indicate that a specific cache is -justified, particularly if kmalloc slab pages are getting significantly -internal fragmented as a result of the allocation pattern. By correlating -kmalloc with kfree, it may be possible to identify memory leaks and where -the allocation sites were. - - -2. Slab allocation of small objects of known type -================================================= -kmem_cache_alloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s -kmem_cache_alloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d -kmem_cache_free call_site=%lx ptr=%p - -These events are similar in usage to the kmalloc-related events except that -it is likely easier to pin the event down to a specific cache. At the time -of writing, no information is available on what slab is being allocated from, -but the call_site can usually be used to extrapolate that information. - -3. Page allocation -================== -mm_page_alloc page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s -mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d -mm_page_free page=%p pfn=%lu order=%d -mm_page_free_batched page=%p pfn=%lu order=%d cold=%d - -These four events deal with page allocation and freeing. mm_page_alloc is -a simple indicator of page allocator activity. Pages may be allocated from -the per-CPU allocator (high performance) or the buddy allocator. - -If pages are allocated directly from the buddy allocator, the -mm_page_alloc_zone_locked event is triggered. This event is important as high -amounts of activity imply high activity on the zone->lock. Taking this lock -impairs performance by disabling interrupts, dirtying cache lines between -CPUs and serialising many CPUs. - -When a page is freed directly by the caller, the only mm_page_free event -is triggered. Significant amounts of activity here could indicate that the -callers should be batching their activities. - -When pages are freed in batch, the also mm_page_free_batched is triggered. -Broadly speaking, pages are taken off the LRU lock in bulk and -freed in batch with a page list. Significant amounts of activity here could -indicate that the system is under memory pressure and can also indicate -contention on the zone->lru_lock. - -4. Per-CPU Allocator Activity -============================= -mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d -mm_page_pcpu_drain page=%p pfn=%lu order=%d cpu=%d migratetype=%d - -In front of the page allocator is a per-cpu page allocator. It exists only -for order-0 pages, reduces contention on the zone->lock and reduces the -amount of writing on struct page. - -When a per-CPU list is empty or pages of the wrong type are allocated, -the zone->lock will be taken once and the per-CPU list refilled. The event -triggered is mm_page_alloc_zone_locked for each page allocated with the -event indicating whether it is for a percpu_refill or not. - -When the per-CPU list is too full, a number of pages are freed, each one -which triggers a mm_page_pcpu_drain event. - -The individual nature of the events is so that pages can be tracked -between allocation and freeing. A number of drain or refill pages that occur -consecutively imply the zone->lock being taken once. Large amounts of per-CPU -refills and drains could imply an imbalance between CPUs where too much work -is being concentrated in one place. It could also indicate that the per-CPU -lists should be a larger size. Finally, large amounts of refills on one CPU -and drains on another could be a factor in causing large amounts of cache -line bounces due to writes between CPUs and worth investigating if pages -can be allocated and freed on the same CPU through some algorithm change. - -5. External Fragmentation -========================= -mm_page_alloc_extfrag page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d - -External fragmentation affects whether a high-order allocation will be -successful or not. For some types of hardware, this is important although -it is avoided where possible. If the system is using huge pages and needs -to be able to resize the pool over the lifetime of the system, this value -is important. - -Large numbers of this event implies that memory is fragmenting and -high-order allocations will start failing at some time in the future. One -means of reducing the occurrence of this event is to increase the size of -min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where -pageblock_size is usually the size of the default hugepage size. diff --git a/Documentation/trace/events-power.txt b/Documentation/trace/events-power.txt deleted file mode 100644 index cf794af2285..00000000000 --- a/Documentation/trace/events-power.txt +++ /dev/null @@ -1,90 +0,0 @@ - - Subsystem Trace Points: power - -The power tracing system captures events related to power transitions -within the kernel. Broadly speaking there are three major subheadings: - - o Power state switch which reports events related to suspend (S-states), - cpuidle (C-states) and cpufreq (P-states) - o System clock related changes - o Power domains related changes and transitions - -This document describes what each of the tracepoints is and why they -might be useful. - -Cf. include/trace/events/power.h for the events definitions. - -1. Power state switch events -============================ - -1.1 New trace API ------------------ - -A 'cpu' event class gathers the CPU-related events: cpuidle and -cpufreq. - -cpu_idle "state=%lu cpu_id=%lu" -cpu_frequency "state=%lu cpu_id=%lu" - -A suspend event is used to indicate the system going in and out of the -suspend mode: - -machine_suspend "state=%lu" - - -Note: the value of '-1' or '4294967295' for state means an exit from the current state, -i.e. trace_cpu_idle(4, smp_processor_id()) means that the system -enters the idle state 4, while trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id()) -means that the system exits the previous idle state. - -The event which has 'state=4294967295' in the trace is very important to the user -space tools which are using it to detect the end of the current state, and so to -correctly draw the states diagrams and to calculate accurate statistics etc. - -1.2 DEPRECATED trace API ------------------------- - -A new Kconfig option CONFIG_EVENT_POWER_TRACING_DEPRECATED with the default value of -'y' has been created. This allows the legacy trace power API to be used conjointly -with the new trace API. -The Kconfig option, the old trace API (in include/trace/events/power.h) and the -old trace points will disappear in a future release (namely 2.6.41). - -power_start "type=%lu state=%lu cpu_id=%lu" -power_frequency "type=%lu state=%lu cpu_id=%lu" -power_end "cpu_id=%lu" - -The 'type' parameter takes one of those macros: - . POWER_NONE = 0, - . POWER_CSTATE = 1, /* C-State */ - . POWER_PSTATE = 2, /* Frequency change or DVFS */ - -The 'state' parameter is set depending on the type: - . Target C-state for type=POWER_CSTATE, - . Target frequency for type=POWER_PSTATE, - -power_end is used to indicate the exit of a state, corresponding to the latest -power_start event. - -2. Clocks events -================ -The clock events are used for clock enable/disable and for -clock rate change. - -clock_enable "%s state=%lu cpu_id=%lu" -clock_disable "%s state=%lu cpu_id=%lu" -clock_set_rate "%s state=%lu cpu_id=%lu" - -The first parameter gives the clock name (e.g. "gpio1_iclk"). -The second parameter is '1' for enable, '0' for disable, the target -clock rate for set_rate. - -3. Power domains events -======================= -The power domain events are used for power domains transitions - -power_domain_target "%s state=%lu cpu_id=%lu" - -The first parameter gives the power domain name (e.g. "mpu_pwrdm"). -The second parameter is the power domain target state. - diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt deleted file mode 100644 index bb24c2a0e87..00000000000 --- a/Documentation/trace/events.txt +++ /dev/null @@ -1,280 +0,0 @@ - Event Tracing - - Documentation written by Theodore Ts'o - Updated by Li Zefan and Tom Zanussi - -1. Introduction -=============== - -Tracepoints (see Documentation/trace/tracepoints.txt) can be used -without creating custom kernel modules to register probe functions -using the event tracing infrastructure. - -Not all tracepoints can be traced using the event tracing system; -the kernel developer must provide code snippets which define how the -tracing information is saved into the tracing buffer, and how the -tracing information should be printed. - -2. Using Event Tracing -====================== - -2.1 Via the 'set_event' interface ---------------------------------- - -The events which are available for tracing can be found in the file -/sys/kernel/debug/tracing/available_events. - -To enable a particular event, such as 'sched_wakeup', simply echo it -to /sys/kernel/debug/tracing/set_event. For example: - - # echo sched_wakeup >> /sys/kernel/debug/tracing/set_event - -[ Note: '>>' is necessary, otherwise it will firstly disable - all the events. ] - -To disable an event, echo the event name to the set_event file prefixed -with an exclamation point: - - # echo '!sched_wakeup' >> /sys/kernel/debug/tracing/set_event - -To disable all events, echo an empty line to the set_event file: - - # echo > /sys/kernel/debug/tracing/set_event - -To enable all events, echo '*:*' or '*:' to the set_event file: - - # echo *:* > /sys/kernel/debug/tracing/set_event - -The events are organized into subsystems, such as ext4, irq, sched, -etc., and a full event name looks like this: <subsystem>:<event>. The -subsystem name is optional, but it is displayed in the available_events -file. All of the events in a subsystem can be specified via the syntax -"<subsystem>:*"; for example, to enable all irq events, you can use the -command: - - # echo 'irq:*' > /sys/kernel/debug/tracing/set_event - -2.2 Via the 'enable' toggle ---------------------------- - -The events available are also listed in /sys/kernel/debug/tracing/events/ hierarchy -of directories. - -To enable event 'sched_wakeup': - - # echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable - -To disable it: - - # echo 0 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable - -To enable all events in sched subsystem: - - # echo 1 > /sys/kernel/debug/tracing/events/sched/enable - -To enable all events: - - # echo 1 > /sys/kernel/debug/tracing/events/enable - -When reading one of these enable files, there are four results: - - 0 - all events this file affects are disabled - 1 - all events this file affects are enabled - X - there is a mixture of events enabled and disabled - ? - this file does not affect any event - -2.3 Boot option ---------------- - -In order to facilitate early boot debugging, use boot option: - - trace_event=[event-list] - -event-list is a comma separated list of events. See section 2.1 for event -format. - -3. Defining an event-enabled tracepoint -======================================= - -See The example provided in samples/trace_events - -4. Event formats -================ - -Each trace event has a 'format' file associated with it that contains -a description of each field in a logged event. This information can -be used to parse the binary trace stream, and is also the place to -find the field names that can be used in event filters (see section 5). - -It also displays the format string that will be used to print the -event in text mode, along with the event name and ID used for -profiling. - -Every event has a set of 'common' fields associated with it; these are -the fields prefixed with 'common_'. The other fields vary between -events and correspond to the fields defined in the TRACE_EVENT -definition for that event. - -Each field in the format has the form: - - field:field-type field-name; offset:N; size:N; - -where offset is the offset of the field in the trace record and size -is the size of the data item, in bytes. - -For example, here's the information displayed for the 'sched_wakeup' -event: - -# cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/format - -name: sched_wakeup -ID: 60 -format: - field:unsigned short common_type; offset:0; size:2; - field:unsigned char common_flags; offset:2; size:1; - field:unsigned char common_preempt_count; offset:3; size:1; - field:int common_pid; offset:4; size:4; - field:int common_tgid; offset:8; size:4; - - field:char comm[TASK_COMM_LEN]; offset:12; size:16; - field:pid_t pid; offset:28; size:4; - field:int prio; offset:32; size:4; - field:int success; offset:36; size:4; - field:int cpu; offset:40; size:4; - -print fmt: "task %s:%d [%d] success=%d [%03d]", REC->comm, REC->pid, - REC->prio, REC->success, REC->cpu - -This event contains 10 fields, the first 5 common and the remaining 5 -event-specific. All the fields for this event are numeric, except for -'comm' which is a string, a distinction important for event filtering. - -5. Event filtering -================== - -Trace events can be filtered in the kernel by associating boolean -'filter expressions' with them. As soon as an event is logged into -the trace buffer, its fields are checked against the filter expression -associated with that event type. An event with field values that -'match' the filter will appear in the trace output, and an event whose -values don't match will be discarded. An event with no filter -associated with it matches everything, and is the default when no -filter has been set for an event. - -5.1 Expression syntax ---------------------- - -A filter expression consists of one or more 'predicates' that can be -combined using the logical operators '&&' and '||'. A predicate is -simply a clause that compares the value of a field contained within a -logged event with a constant value and returns either 0 or 1 depending -on whether the field value matched (1) or didn't match (0): - - field-name relational-operator value - -Parentheses can be used to provide arbitrary logical groupings and -double-quotes can be used to prevent the shell from interpreting -operators as shell metacharacters. - -The field-names available for use in filters can be found in the -'format' files for trace events (see section 4). - -The relational-operators depend on the type of the field being tested: - -The operators available for numeric fields are: - -==, !=, <, <=, >, >= - -And for string fields they are: - -==, != - -Currently, only exact string matches are supported. - -5.2 Setting filters -------------------- - -A filter for an individual event is set by writing a filter expression -to the 'filter' file for the given event. - -For example: - -# cd /sys/kernel/debug/tracing/events/sched/sched_wakeup -# echo "common_preempt_count > 4" > filter - -A slightly more involved example: - -# cd /sys/kernel/debug/tracing/events/signal/signal_generate -# echo "((sig >= 10 && sig < 15) || sig == 17) && comm != bash" > filter - -If there is an error in the expression, you'll get an 'Invalid -argument' error when setting it, and the erroneous string along with -an error message can be seen by looking at the filter e.g.: - -# cd /sys/kernel/debug/tracing/events/signal/signal_generate -# echo "((sig >= 10 && sig < 15) || dsig == 17) && comm != bash" > filter --bash: echo: write error: Invalid argument -# cat filter -((sig >= 10 && sig < 15) || dsig == 17) && comm != bash -^ -parse_error: Field not found - -Currently the caret ('^') for an error always appears at the beginning of -the filter string; the error message should still be useful though -even without more accurate position info. - -5.3 Clearing filters --------------------- - -To clear the filter for an event, write a '0' to the event's filter -file. - -To clear the filters for all events in a subsystem, write a '0' to the -subsystem's filter file. - -5.3 Subsystem filters ---------------------- - -For convenience, filters for every event in a subsystem can be set or -cleared as a group by writing a filter expression into the filter file -at the root of the subsystem. Note however, that if a filter for any -event within the subsystem lacks a field specified in the subsystem -filter, or if the filter can't be applied for any other reason, the -filter for that event will retain its previous setting. This can -result in an unintended mixture of filters which could lead to -confusing (to the user who might think different filters are in -effect) trace output. Only filters that reference just the common -fields can be guaranteed to propagate successfully to all events. - -Here are a few subsystem filter examples that also illustrate the -above points: - -Clear the filters on all events in the sched subsystem: - -# cd /sys/kernel/debug/tracing/events/sched -# echo 0 > filter -# cat sched_switch/filter -none -# cat sched_wakeup/filter -none - -Set a filter using only common fields for all events in the sched -subsystem (all events end up with the same filter): - -# cd /sys/kernel/debug/tracing/events/sched -# echo common_pid == 0 > filter -# cat sched_switch/filter -common_pid == 0 -# cat sched_wakeup/filter -common_pid == 0 - -Attempt to set a filter using a non-common field for all events in the -sched subsystem (all events but those that have a prev_pid field retain -their old filters): - -# cd /sys/kernel/debug/tracing/events/sched -# echo prev_pid == 0 > filter -# cat sched_switch/filter -prev_pid == 0 -# cat sched_wakeup/filter -common_pid == 0 diff --git a/Documentation/trace/ftrace-design.txt b/Documentation/trace/ftrace-design.txt deleted file mode 100644 index 79fcafc7fd6..00000000000 --- a/Documentation/trace/ftrace-design.txt +++ /dev/null @@ -1,411 +0,0 @@ - function tracer guts - ==================== - By Mike Frysinger - -Introduction ------------- - -Here we will cover the architecture pieces that the common function tracing -code relies on for proper functioning. Things are broken down into increasing -complexity so that you can start simple and at least get basic functionality. - -Note that this focuses on architecture implementation details only. If you -want more explanation of a feature in terms of common code, review the common -ftrace.txt file. - -Ideally, everyone who wishes to retain performance while supporting tracing in -their kernel should make it all the way to dynamic ftrace support. - - -Prerequisites -------------- - -Ftrace relies on these features being implemented: - STACKTRACE_SUPPORT - implement save_stack_trace() - TRACE_IRQFLAGS_SUPPORT - implement include/asm/irqflags.h - - -HAVE_FUNCTION_TRACER --------------------- - -You will need to implement the mcount and the ftrace_stub functions. - -The exact mcount symbol name will depend on your toolchain. Some call it -"mcount", "_mcount", or even "__mcount". You can probably figure it out by -running something like: - $ echo 'main(){}' | gcc -x c -S -o - - -pg | grep mcount - call mcount -We'll make the assumption below that the symbol is "mcount" just to keep things -nice and simple in the examples. - -Keep in mind that the ABI that is in effect inside of the mcount function is -*highly* architecture/toolchain specific. We cannot help you in this regard, -sorry. Dig up some old documentation and/or find someone more familiar than -you to bang ideas off of. Typically, register usage (argument/scratch/etc...) -is a major issue at this point, especially in relation to the location of the -mcount call (before/after function prologue). You might also want to look at -how glibc has implemented the mcount function for your architecture. It might -be (semi-)relevant. - -The mcount function should check the function pointer ftrace_trace_function -to see if it is set to ftrace_stub. If it is, there is nothing for you to do, -so return immediately. If it isn't, then call that function in the same way -the mcount function normally calls __mcount_internal -- the first argument is -the "frompc" while the second argument is the "selfpc" (adjusted to remove the -size of the mcount call that is embedded in the function). - -For example, if the function foo() calls bar(), when the bar() function calls -mcount(), the arguments mcount() will pass to the tracer are: - "frompc" - the address bar() will use to return to foo() - "selfpc" - the address bar() (with mcount() size adjustment) - -Also keep in mind that this mcount function will be called *a lot*, so -optimizing for the default case of no tracer will help the smooth running of -your system when tracing is disabled. So the start of the mcount function is -typically the bare minimum with checking things before returning. That also -means the code flow should usually be kept linear (i.e. no branching in the nop -case). This is of course an optimization and not a hard requirement. - -Here is some pseudo code that should help (these functions should actually be -implemented in assembly): - -void ftrace_stub(void) -{ - return; -} - -void mcount(void) -{ - /* save any bare state needed in order to do initial checking */ - - extern void (*ftrace_trace_function)(unsigned long, unsigned long); - if (ftrace_trace_function != ftrace_stub) - goto do_trace; - - /* restore any bare state */ - - return; - -do_trace: - - /* save all state needed by the ABI (see paragraph above) */ - - unsigned long frompc = ...; - unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE; - ftrace_trace_function(frompc, selfpc); - - /* restore all state needed by the ABI */ -} - -Don't forget to export mcount for modules ! -extern void mcount(void); -EXPORT_SYMBOL(mcount); - - -HAVE_FUNCTION_TRACE_MCOUNT_TEST -------------------------------- - -This is an optional optimization for the normal case when tracing is turned off -in the system. If you do not enable this Kconfig option, the common ftrace -code will take care of doing the checking for you. - -To support this feature, you only need to check the function_trace_stop -variable in the mcount function. If it is non-zero, there is no tracing to be -done at all, so you can return. - -This additional pseudo code would simply be: -void mcount(void) -{ - /* save any bare state needed in order to do initial checking */ - -+ if (function_trace_stop) -+ return; - - extern void (*ftrace_trace_function)(unsigned long, unsigned long); - if (ftrace_trace_function != ftrace_stub) -... - - -HAVE_FUNCTION_GRAPH_TRACER --------------------------- - -Deep breath ... time to do some real work. Here you will need to update the -mcount function to check ftrace graph function pointers, as well as implement -some functions to save (hijack) and restore the return address. - -The mcount function should check the function pointers ftrace_graph_return -(compare to ftrace_stub) and ftrace_graph_entry (compare to -ftrace_graph_entry_stub). If either of those is not set to the relevant stub -function, call the arch-specific function ftrace_graph_caller which in turn -calls the arch-specific function prepare_ftrace_return. Neither of these -function names is strictly required, but you should use them anyway to stay -consistent across the architecture ports -- easier to compare & contrast -things. - -The arguments to prepare_ftrace_return are slightly different than what are -passed to ftrace_trace_function. The second argument "selfpc" is the same, -but the first argument should be a pointer to the "frompc". Typically this is -located on the stack. This allows the function to hijack the return address -temporarily to have it point to the arch-specific function return_to_handler. -That function will simply call the common ftrace_return_to_handler function and -that will return the original return address with which you can return to the -original call site. - -Here is the updated mcount pseudo code: -void mcount(void) -{ -... - if (ftrace_trace_function != ftrace_stub) - goto do_trace; - -+#ifdef CONFIG_FUNCTION_GRAPH_TRACER -+ extern void (*ftrace_graph_return)(...); -+ extern void (*ftrace_graph_entry)(...); -+ if (ftrace_graph_return != ftrace_stub || -+ ftrace_graph_entry != ftrace_graph_entry_stub) -+ ftrace_graph_caller(); -+#endif - - /* restore any bare state */ -... - -Here is the pseudo code for the new ftrace_graph_caller assembly function: -#ifdef CONFIG_FUNCTION_GRAPH_TRACER -void ftrace_graph_caller(void) -{ - /* save all state needed by the ABI */ - - unsigned long *frompc = &...; - unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE; - /* passing frame pointer up is optional -- see below */ - prepare_ftrace_return(frompc, selfpc, frame_pointer); - - /* restore all state needed by the ABI */ -} -#endif - -For information on how to implement prepare_ftrace_return(), simply look at the -x86 version (the frame pointer passing is optional; see the next section for -more information). The only architecture-specific piece in it is the setup of -the fault recovery table (the asm(...) code). The rest should be the same -across architectures. - -Here is the pseudo code for the new return_to_handler assembly function. Note -that the ABI that applies here is different from what applies to the mcount -code. Since you are returning from a function (after the epilogue), you might -be able to skimp on things saved/restored (usually just registers used to pass -return values). - -#ifdef CONFIG_FUNCTION_GRAPH_TRACER -void return_to_handler(void) -{ - /* save all state needed by the ABI (see paragraph above) */ - - void (*original_return_point)(void) = ftrace_return_to_handler(); - - /* restore all state needed by the ABI */ - - /* this is usually either a return or a jump */ - original_return_point(); -} -#endif - - -HAVE_FUNCTION_GRAPH_FP_TEST ---------------------------- - -An arch may pass in a unique value (frame pointer) to both the entering and -exiting of a function. On exit, the value is compared and if it does not -match, then it will panic the kernel. This is largely a sanity check for bad -code generation with gcc. If gcc for your port sanely updates the frame -pointer under different optimization levels, then ignore this option. - -However, adding support for it isn't terribly difficult. In your assembly code -that calls prepare_ftrace_return(), pass the frame pointer as the 3rd argument. -Then in the C version of that function, do what the x86 port does and pass it -along to ftrace_push_return_trace() instead of a stub value of 0. - -Similarly, when you call ftrace_return_to_handler(), pass it the frame pointer. - - -HAVE_FTRACE_NMI_ENTER ---------------------- - -If you can't trace NMI functions, then skip this option. - -<details to be filled> - - -HAVE_SYSCALL_TRACEPOINTS ------------------------- - -You need very few things to get the syscalls tracing in an arch. - -- Support HAVE_ARCH_TRACEHOOK (see arch/Kconfig). -- Have a NR_syscalls variable in <asm/unistd.h> that provides the number - of syscalls supported by the arch. -- Support the TIF_SYSCALL_TRACEPOINT thread flags. -- Put the trace_sys_enter() and trace_sys_exit() tracepoints calls from ptrace - in the ptrace syscalls tracing path. -- If the system call table on this arch is more complicated than a simple array - of addresses of the system calls, implement an arch_syscall_addr to return - the address of a given system call. -- If the symbol names of the system calls do not match the function names on - this arch, define ARCH_HAS_SYSCALL_MATCH_SYM_NAME in asm/ftrace.h and - implement arch_syscall_match_sym_name with the appropriate logic to return - true if the function name corresponds with the symbol name. -- Tag this arch as HAVE_SYSCALL_TRACEPOINTS. - - -HAVE_FTRACE_MCOUNT_RECORD -------------------------- - -See scripts/recordmcount.pl for more info. Just fill in the arch-specific -details for how to locate the addresses of mcount call sites via objdump. -This option doesn't make much sense without also implementing dynamic ftrace. - - -HAVE_DYNAMIC_FTRACE -------------------- - -You will first need HAVE_FTRACE_MCOUNT_RECORD and HAVE_FUNCTION_TRACER, so -scroll your reader back up if you got over eager. - -Once those are out of the way, you will need to implement: - - asm/ftrace.h: - - MCOUNT_ADDR - - ftrace_call_adjust() - - struct dyn_arch_ftrace{} - - asm code: - - mcount() (new stub) - - ftrace_caller() - - ftrace_call() - - ftrace_stub() - - C code: - - ftrace_dyn_arch_init() - - ftrace_make_nop() - - ftrace_make_call() - - ftrace_update_ftrace_func() - -First you will need to fill out some arch details in your asm/ftrace.h. - -Define MCOUNT_ADDR as the address of your mcount symbol similar to: - #define MCOUNT_ADDR ((unsigned long)mcount) -Since no one else will have a decl for that function, you will need to: - extern void mcount(void); - -You will also need the helper function ftrace_call_adjust(). Most people -will be able to stub it out like so: - static inline unsigned long ftrace_call_adjust(unsigned long addr) - { - return addr; - } -<details to be filled> - -Lastly you will need the custom dyn_arch_ftrace structure. If you need -some extra state when runtime patching arbitrary call sites, this is the -place. For now though, create an empty struct: - struct dyn_arch_ftrace { - /* No extra data needed */ - }; - -With the header out of the way, we can fill out the assembly code. While we -did already create a mcount() function earlier, dynamic ftrace only wants a -stub function. This is because the mcount() will only be used during boot -and then all references to it will be patched out never to return. Instead, -the guts of the old mcount() will be used to create a new ftrace_caller() -function. Because the two are hard to merge, it will most likely be a lot -easier to have two separate definitions split up by #ifdefs. Same goes for -the ftrace_stub() as that will now be inlined in ftrace_caller(). - -Before we get confused anymore, let's check out some pseudo code so you can -implement your own stuff in assembly: - -void mcount(void) -{ - return; -} - -void ftrace_caller(void) -{ - /* implement HAVE_FUNCTION_TRACE_MCOUNT_TEST if you desire */ - - /* save all state needed by the ABI (see paragraph above) */ - - unsigned long frompc = ...; - unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE; - -ftrace_call: - ftrace_stub(frompc, selfpc); - - /* restore all state needed by the ABI */ - -ftrace_stub: - return; -} - -This might look a little odd at first, but keep in mind that we will be runtime -patching multiple things. First, only functions that we actually want to trace -will be patched to call ftrace_caller(). Second, since we only have one tracer -active at a time, we will patch the ftrace_caller() function itself to call the -specific tracer in question. That is the point of the ftrace_call label. - -With that in mind, let's move on to the C code that will actually be doing the -runtime patching. You'll need a little knowledge of your arch's opcodes in -order to make it through the next section. - -Every arch has an init callback function. If you need to do something early on -to initialize some state, this is the time to do that. Otherwise, this simple -function below should be sufficient for most people: - -int __init ftrace_dyn_arch_init(void *data) -{ - /* return value is done indirectly via data */ - *(unsigned long *)data = 0; - - return 0; -} - -There are two functions that are used to do runtime patching of arbitrary -functions. The first is used to turn the mcount call site into a nop (which -is what helps us retain runtime performance when not tracing). The second is -used to turn the mcount call site into a call to an arbitrary location (but -typically that is ftracer_caller()). See the general function definition in -linux/ftrace.h for the functions: - ftrace_make_nop() - ftrace_make_call() -The rec->ip value is the address of the mcount call site that was collected -by the scripts/recordmcount.pl during build time. - -The last function is used to do runtime patching of the active tracer. This -will be modifying the assembly code at the location of the ftrace_call symbol -inside of the ftrace_caller() function. So you should have sufficient padding -at that location to support the new function calls you'll be inserting. Some -people will be using a "call" type instruction while others will be using a -"branch" type instruction. Specifically, the function is: - ftrace_update_ftrace_func() - - -HAVE_DYNAMIC_FTRACE + HAVE_FUNCTION_GRAPH_TRACER ------------------------------------------------- - -The function grapher needs a few tweaks in order to work with dynamic ftrace. -Basically, you will need to: - - update: - - ftrace_caller() - - ftrace_graph_call() - - ftrace_graph_caller() - - implement: - - ftrace_enable_ftrace_graph_caller() - - ftrace_disable_ftrace_graph_caller() - -<details to be filled> -Quick notes: - - add a nop stub after the ftrace_call location named ftrace_graph_call; - stub needs to be large enough to support a call to ftrace_graph_caller() - - update ftrace_graph_caller() to work with being called by the new - ftrace_caller() since some semantics may have changed - - ftrace_enable_ftrace_graph_caller() will runtime patch the - ftrace_graph_call location with a call to ftrace_graph_caller() - - ftrace_disable_ftrace_graph_caller() will runtime patch the - ftrace_graph_call location with nops diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt deleted file mode 100644 index 6f51fed45f2..00000000000 --- a/Documentation/trace/ftrace.txt +++ /dev/null @@ -1,1848 +0,0 @@ - ftrace - Function Tracer - ======================== - -Copyright 2008 Red Hat Inc. - Author: Steven Rostedt <srostedt@redhat.com> - License: The GNU Free Documentation License, Version 1.2 - (dual licensed under the GPL v2) -Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton, - John Kacur, and David Teigland. -Written for: 2.6.28-rc2 - -Introduction ------------- - -Ftrace is an internal tracer designed to help out developers and -designers of systems to find what is going on inside the kernel. -It can be used for debugging or analyzing latencies and -performance issues that take place outside of user-space. - -Although ftrace is the function tracer, it also includes an -infrastructure that allows for other types of tracing. Some of -the tracers that are currently in ftrace include a tracer to -trace context switches, the time it takes for a high priority -task to run after it was woken up, the time interrupts are -disabled, and more (ftrace allows for tracer plugins, which -means that the list of tracers can always grow). - - -Implementation Details ----------------------- - -See ftrace-design.txt for details for arch porters and such. - - -The File System ---------------- - -Ftrace uses the debugfs file system to hold the control files as -well as the files to display output. - -When debugfs is configured into the kernel (which selecting any ftrace -option will do) the directory /sys/kernel/debug will be created. To mount -this directory, you can add to your /etc/fstab file: - - debugfs /sys/kernel/debug debugfs defaults 0 0 - -Or you can mount it at run time with: - - mount -t debugfs nodev /sys/kernel/debug - -For quicker access to that directory you may want to make a soft link to -it: - - ln -s /sys/kernel/debug /debug - -Any selected ftrace option will also create a directory called tracing -within the debugfs. The rest of the document will assume that you are in -the ftrace directory (cd /sys/kernel/debug/tracing) and will only concentrate -on the files within that directory and not distract from the content with -the extended "/sys/kernel/debug/tracing" path name. - -That's it! (assuming that you have ftrace configured into your kernel) - -After mounting the debugfs, you can see a directory called -"tracing". This directory contains the control and output files -of ftrace. Here is a list of some of the key files: - - - Note: all time values are in microseconds. - - current_tracer: - - This is used to set or display the current tracer - that is configured. - - available_tracers: - - This holds the different types of tracers that - have been compiled into the kernel. The - tracers listed here can be configured by - echoing their name into current_tracer. - - tracing_on: - - This sets or displays whether writing to the trace - ring buffer is enabled. Echo 0 into this file to disable - the tracer or 1 to enable it. - - trace: - - This file holds the output of the trace in a human - readable format (described below). - - trace_pipe: - - The output is the same as the "trace" file but this - file is meant to be streamed with live tracing. - Reads from this file will block until new data is - retrieved. Unlike the "trace" file, this file is a - consumer. This means reading from this file causes - sequential reads to display more current data. Once - data is read from this file, it is consumed, and - will not be read again with a sequential read. The - "trace" file is static, and if the tracer is not - adding more data,they will display the same - information every time they are read. - - trace_options: - - This file lets the user control the amount of data - that is displayed in one of the above output - files. - - tracing_max_latency: - - Some of the tracers record the max latency. - For example, the time interrupts are disabled. - This time is saved in this file. The max trace - will also be stored, and displayed by "trace". - A new max trace will only be recorded if the - latency is greater than the value in this - file. (in microseconds) - - buffer_size_kb: - - This sets or displays the number of kilobytes each CPU - buffer can hold. The tracer buffers are the same size - for each CPU. The displayed number is the size of the - CPU buffer and not total size of all buffers. The - trace buffers are allocated in pages (blocks of memory - that the kernel uses for allocation, usually 4 KB in size). - If the last page allocated has room for more bytes - than requested, the rest of the page will be used, - making the actual allocation bigger than requested. - ( Note, the size may not be a multiple of the page size - due to buffer management overhead. ) - - This can only be updated when the current_tracer - is set to "nop". - - tracing_cpumask: - - This is a mask that lets the user only trace - on specified CPUS. The format is a hex string - representing the CPUS. - - set_ftrace_filter: - - When dynamic ftrace is configured in (see the - section below "dynamic ftrace"), the code is dynamically - modified (code text rewrite) to disable calling of the - function profiler (mcount). This lets tracing be configured - in with practically no overhead in performance. This also - has a side effect of enabling or disabling specific functions - to be traced. Echoing names of functions into this file - will limit the trace to only those functions. - - This interface also allows for commands to be used. See the - "Filter commands" section for more details. - - set_ftrace_notrace: - - This has an effect opposite to that of - set_ftrace_filter. Any function that is added here will not - be traced. If a function exists in both set_ftrace_filter - and set_ftrace_notrace, the function will _not_ be traced. - - set_ftrace_pid: - - Have the function tracer only trace a single thread. - - set_graph_function: - - Set a "trigger" function where tracing should start - with the function graph tracer (See the section - "dynamic ftrace" for more details). - - available_filter_functions: - - This lists the functions that ftrace - has processed and can trace. These are the function - names that you can pass to "set_ftrace_filter" or - "set_ftrace_notrace". (See the section "dynamic ftrace" - below for more details.) - - -The Tracers ------------ - -Here is the list of current tracers that may be configured. - - "function" - - Function call tracer to trace all kernel functions. - - "function_graph" - - Similar to the function tracer except that the - function tracer probes the functions on their entry - whereas the function graph tracer traces on both entry - and exit of the functions. It then provides the ability - to draw a graph of function calls similar to C code - source. - - "irqsoff" - - Traces the areas that disable interrupts and saves - the trace with the longest max latency. - See tracing_max_latency. When a new max is recorded, - it replaces the old trace. It is best to view this - trace with the latency-format option enabled. - - "preemptoff" - - Similar to irqsoff but traces and records the amount of - time for which preemption is disabled. - - "preemptirqsoff" - - Similar to irqsoff and preemptoff, but traces and - records the largest time for which irqs and/or preemption - is disabled. - - "wakeup" - - Traces and records the max latency that it takes for - the highest priority task to get scheduled after - it has been woken up. - Traces all tasks as an average developer would expect. - - "wakeup_rt" - - Traces and records the max latency that it takes for just - RT tasks (as the current "wakeup" does). This is useful - for those interested in wake up timings of RT tasks. - - "hw-branch-tracer" - - Uses the BTS CPU feature on x86 CPUs to traces all - branches executed. - - "nop" - - This is the "trace nothing" tracer. To remove all - tracers from tracing simply echo "nop" into - current_tracer. - - -Examples of using the tracer ----------------------------- - -Here are typical examples of using the tracers when controlling -them only with the debugfs interface (without using any -user-land utilities). - -Output format: --------------- - -Here is an example of the output format of the file "trace" - - -------- -# tracer: function -# -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | - bash-4251 [01] 10152.583854: path_put <-path_walk - bash-4251 [01] 10152.583855: dput <-path_put - bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput - -------- - -A header is printed with the tracer name that is represented by -the trace. In this case the tracer is "function". Then a header -showing the format. Task name "bash", the task PID "4251", the -CPU that it was running on "01", the timestamp in <secs>.<usecs> -format, the function name that was traced "path_put" and the -parent function that called this function "path_walk". The -timestamp is the time at which the function was entered. - -Latency trace format --------------------- - -When the latency-format option is enabled, the trace file gives -somewhat more information to see why a latency happened. -Here is a typical trace. - -# tracer: irqsoff -# -irqsoff latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 97 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: apic_timer_interrupt - => ended at: do_softirq - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - <idle>-0 0d..1 0us+: trace_hardirqs_off_thunk (apic_timer_interrupt) - <idle>-0 0d.s. 97us : __do_softirq (do_softirq) - <idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq) - - -This shows that the current tracer is "irqsoff" tracing the time -for which interrupts were disabled. It gives the trace version -and the version of the kernel upon which this was executed on -(2.6.26-rc8). Then it displays the max latency in microsecs (97 -us). The number of trace entries displayed and the total number -recorded (both are three: #3/3). The type of preemption that was -used (PREEMPT). VP, KP, SP, and HP are always zero and are -reserved for later use. #P is the number of online CPUS (#P:2). - -The task is the process that was running when the latency -occurred. (swapper pid: 0). - -The start and stop (the functions in which the interrupts were -disabled and enabled respectively) that caused the latencies: - - apic_timer_interrupt is where the interrupts were disabled. - do_softirq is where they were enabled again. - -The next lines after the header are the trace itself. The header -explains which is which. - - cmd: The name of the process in the trace. - - pid: The PID of that process. - - CPU#: The CPU which the process was running on. - - irqs-off: 'd' interrupts are disabled. '.' otherwise. - Note: If the architecture does not support a way to - read the irq flags variable, an 'X' will always - be printed here. - - need-resched: 'N' task need_resched is set, '.' otherwise. - - hardirq/softirq: - 'H' - hard irq occurred inside a softirq. - 'h' - hard irq is running - 's' - soft irq is running - '.' - normal context. - - preempt-depth: The level of preempt_disabled - -The above is mostly meaningful for kernel developers. - - time: When the latency-format option is enabled, the trace file - output includes a timestamp relative to the start of the - trace. This differs from the output when latency-format - is disabled, which includes an absolute timestamp. - - delay: This is just to help catch your eye a bit better. And - needs to be fixed to be only relative to the same CPU. - The marks are determined by the difference between this - current trace and the next trace. - '!' - greater than preempt_mark_thresh (default 100) - '+' - greater than 1 microsecond - ' ' - less than or equal to 1 microsecond. - - The rest is the same as the 'trace' file. - - -trace_options -------------- - -The trace_options file is used to control what gets printed in -the trace output. To see what is available, simply cat the file: - - cat trace_options - print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \ - noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj - -To disable one of the options, echo in the option prepended with -"no". - - echo noprint-parent > trace_options - -To enable an option, leave off the "no". - - echo sym-offset > trace_options - -Here are the available options: - - print-parent - On function traces, display the calling (parent) - function as well as the function being traced. - - print-parent: - bash-4000 [01] 1477.606694: simple_strtoul <-strict_strtoul - - noprint-parent: - bash-4000 [01] 1477.606694: simple_strtoul - - - sym-offset - Display not only the function name, but also the - offset in the function. For example, instead of - seeing just "ktime_get", you will see - "ktime_get+0xb/0x20". - - sym-offset: - bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0 - - sym-addr - this will also display the function address as well - as the function name. - - sym-addr: - bash-4000 [01] 1477.606694: simple_strtoul <c0339346> - - verbose - This deals with the trace file when the - latency-format option is enabled. - - bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \ - (+0.000ms): simple_strtoul (strict_strtoul) - - raw - This will display raw numbers. This option is best for - use with user applications that can translate the raw - numbers better than having it done in the kernel. - - hex - Similar to raw, but the numbers will be in a hexadecimal - format. - - bin - This will print out the formats in raw binary. - - block - TBD (needs update) - - stacktrace - This is one of the options that changes the trace - itself. When a trace is recorded, so is the stack - of functions. This allows for back traces of - trace sites. - - userstacktrace - This option changes the trace. It records a - stacktrace of the current userspace thread. - - sym-userobj - when user stacktrace are enabled, look up which - object the address belongs to, and print a - relative address. This is especially useful when - ASLR is on, otherwise you don't get a chance to - resolve the address to object/file/line after - the app is no longer running - - The lookup is performed when you read - trace,trace_pipe. Example: - - a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0 -x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] - - sched-tree - trace all tasks that are on the runqueue, at - every scheduling event. Will add overhead if - there's a lot of tasks running at once. - - latency-format - This option changes the trace. When - it is enabled, the trace displays - additional information about the - latencies, as described in "Latency - trace format". - - overwrite - This controls what happens when the trace buffer is - full. If "1" (default), the oldest events are - discarded and overwritten. If "0", then the newest - events are discarded. - -ftrace_enabled --------------- - -The following tracers (listed below) give different output -depending on whether or not the sysctl ftrace_enabled is set. To -set ftrace_enabled, one can either use the sysctl function or -set it via the proc file system interface. - - sysctl kernel.ftrace_enabled=1 - - or - - echo 1 > /proc/sys/kernel/ftrace_enabled - -To disable ftrace_enabled simply replace the '1' with '0' in the -above commands. - -When ftrace_enabled is set the tracers will also record the -functions that are within the trace. The descriptions of the -tracers will also show an example with ftrace enabled. - - -irqsoff -------- - -When interrupts are disabled, the CPU can not react to any other -external event (besides NMIs and SMIs). This prevents the timer -interrupt from triggering or the mouse interrupt from letting -the kernel know of a new mouse event. The result is a latency -with the reaction time. - -The irqsoff tracer tracks the time for which interrupts are -disabled. When a new maximum latency is hit, the tracer saves -the trace leading up to that latency point so that every time a -new maximum is reached, the old saved trace is discarded and the -new trace is saved. - -To reset the maximum, echo 0 into tracing_max_latency. Here is -an example: - - # echo irqsoff > current_tracer - # echo latency-format > trace_options - # echo 0 > tracing_max_latency - # echo 1 > tracing_on - # ls -ltr - [...] - # echo 0 > tracing_on - # cat trace -# tracer: irqsoff -# -irqsoff latency trace v1.1.5 on 2.6.26 --------------------------------------------------------------------- - latency: 12 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: bash-3730 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: sys_setpgid - => ended at: sys_setpgid - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - bash-3730 1d... 0us : _write_lock_irq (sys_setpgid) - bash-3730 1d..1 1us+: _write_unlock_irq (sys_setpgid) - bash-3730 1d..2 14us : trace_hardirqs_on (sys_setpgid) - - -Here we see that that we had a latency of 12 microsecs (which is -very good). The _write_lock_irq in sys_setpgid disabled -interrupts. The difference between the 12 and the displayed -timestamp 14us occurred because the clock was incremented -between the time of recording the max latency and the time of -recording the function that had that latency. - -Note the above example had ftrace_enabled not set. If we set the -ftrace_enabled, we get a much larger output: - -# tracer: irqsoff -# -irqsoff latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 50 us, #101/101, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: ls-4339 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: __alloc_pages_internal - => ended at: __alloc_pages_internal - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - ls-4339 0...1 0us+: get_page_from_freelist (__alloc_pages_internal) - ls-4339 0d..1 3us : rmqueue_bulk (get_page_from_freelist) - ls-4339 0d..1 3us : _spin_lock (rmqueue_bulk) - ls-4339 0d..1 4us : add_preempt_count (_spin_lock) - ls-4339 0d..2 4us : __rmqueue (rmqueue_bulk) - ls-4339 0d..2 5us : __rmqueue_smallest (__rmqueue) - ls-4339 0d..2 5us : __mod_zone_page_state (__rmqueue_smallest) - ls-4339 0d..2 6us : __rmqueue (rmqueue_bulk) - ls-4339 0d..2 6us : __rmqueue_smallest (__rmqueue) - ls-4339 0d..2 7us : __mod_zone_page_state (__rmqueue_smallest) - ls-4339 0d..2 7us : __rmqueue (rmqueue_bulk) - ls-4339 0d..2 8us : __rmqueue_smallest (__rmqueue) -[...] - ls-4339 0d..2 46us : __rmqueue_smallest (__rmqueue) - ls-4339 0d..2 47us : __mod_zone_page_state (__rmqueue_smallest) - ls-4339 0d..2 47us : __rmqueue (rmqueue_bulk) - ls-4339 0d..2 48us : __rmqueue_smallest (__rmqueue) - ls-4339 0d..2 48us : __mod_zone_page_state (__rmqueue_smallest) - ls-4339 0d..2 49us : _spin_unlock (rmqueue_bulk) - ls-4339 0d..2 49us : sub_preempt_count (_spin_unlock) - ls-4339 0d..1 50us : get_page_from_freelist (__alloc_pages_internal) - ls-4339 0d..2 51us : trace_hardirqs_on (__alloc_pages_internal) - - - -Here we traced a 50 microsecond latency. But we also see all the -functions that were called during that time. Note that by -enabling function tracing, we incur an added overhead. This -overhead may extend the latency times. But nevertheless, this -trace has provided some very helpful debugging information. - - -preemptoff ----------- - -When preemption is disabled, we may be able to receive -interrupts but the task cannot be preempted and a higher -priority task must wait for preemption to be enabled again -before it can preempt a lower priority task. - -The preemptoff tracer traces the places that disable preemption. -Like the irqsoff tracer, it records the maximum latency for -which preemption was disabled. The control of preemptoff tracer -is much like the irqsoff tracer. - - # echo preemptoff > current_tracer - # echo latency-format > trace_options - # echo 0 > tracing_max_latency - # echo 1 > tracing_on - # ls -ltr - [...] - # echo 0 > tracing_on - # cat trace -# tracer: preemptoff -# -preemptoff latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 29 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: do_IRQ - => ended at: __do_softirq - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - sshd-4261 0d.h. 0us+: irq_enter (do_IRQ) - sshd-4261 0d.s. 29us : _local_bh_enable (__do_softirq) - sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq) - - -This has some more changes. Preemption was disabled when an -interrupt came in (notice the 'h'), and was enabled while doing -a softirq. (notice the 's'). But we also see that interrupts -have been disabled when entering the preempt off section and -leaving it (the 'd'). We do not know if interrupts were enabled -in the mean time. - -# tracer: preemptoff -# -preemptoff latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 63 us, #87/87, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: remove_wait_queue - => ended at: __do_softirq - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - sshd-4261 0d..1 0us : _spin_lock_irqsave (remove_wait_queue) - sshd-4261 0d..1 1us : _spin_unlock_irqrestore (remove_wait_queue) - sshd-4261 0d..1 2us : do_IRQ (common_interrupt) - sshd-4261 0d..1 2us : irq_enter (do_IRQ) - sshd-4261 0d..1 2us : idle_cpu (irq_enter) - sshd-4261 0d..1 3us : add_preempt_count (irq_enter) - sshd-4261 0d.h1 3us : idle_cpu (irq_enter) - sshd-4261 0d.h. 4us : handle_fasteoi_irq (do_IRQ) -[...] - sshd-4261 0d.h. 12us : add_preempt_count (_spin_lock) - sshd-4261 0d.h1 12us : ack_ioapic_quirk_irq (handle_fasteoi_irq) - sshd-4261 0d.h1 13us : move_native_irq (ack_ioapic_quirk_irq) - sshd-4261 0d.h1 13us : _spin_unlock (handle_fasteoi_irq) - sshd-4261 0d.h1 14us : sub_preempt_count (_spin_unlock) - sshd-4261 0d.h1 14us : irq_exit (do_IRQ) - sshd-4261 0d.h1 15us : sub_preempt_count (irq_exit) - sshd-4261 0d..2 15us : do_softirq (irq_exit) - sshd-4261 0d... 15us : __do_softirq (do_softirq) - sshd-4261 0d... 16us : __local_bh_disable (__do_softirq) - sshd-4261 0d... 16us+: add_preempt_count (__local_bh_disable) - sshd-4261 0d.s4 20us : add_preempt_count (__local_bh_disable) - sshd-4261 0d.s4 21us : sub_preempt_count (local_bh_enable) - sshd-4261 0d.s5 21us : sub_preempt_count (local_bh_enable) -[...] - sshd-4261 0d.s6 41us : add_preempt_count (__local_bh_disable) - sshd-4261 0d.s6 42us : sub_preempt_count (local_bh_enable) - sshd-4261 0d.s7 42us : sub_preempt_count (local_bh_enable) - sshd-4261 0d.s5 43us : add_preempt_count (__local_bh_disable) - sshd-4261 0d.s5 43us : sub_preempt_count (local_bh_enable_ip) - sshd-4261 0d.s6 44us : sub_preempt_count (local_bh_enable_ip) - sshd-4261 0d.s5 44us : add_preempt_count (__local_bh_disable) - sshd-4261 0d.s5 45us : sub_preempt_count (local_bh_enable) -[...] - sshd-4261 0d.s. 63us : _local_bh_enable (__do_softirq) - sshd-4261 0d.s1 64us : trace_preempt_on (__do_softirq) - - -The above is an example of the preemptoff trace with -ftrace_enabled set. Here we see that interrupts were disabled -the entire time. The irq_enter code lets us know that we entered -an interrupt 'h'. Before that, the functions being traced still -show that it is not in an interrupt, but we can see from the -functions themselves that this is not the case. - -Notice that __do_softirq when called does not have a -preempt_count. It may seem that we missed a preempt enabling. -What really happened is that the preempt count is held on the -thread's stack and we switched to the softirq stack (4K stacks -in effect). The code does not copy the preempt count, but -because interrupts are disabled, we do not need to worry about -it. Having a tracer like this is good for letting people know -what really happens inside the kernel. - - -preemptirqsoff --------------- - -Knowing the locations that have interrupts disabled or -preemption disabled for the longest times is helpful. But -sometimes we would like to know when either preemption and/or -interrupts are disabled. - -Consider the following code: - - local_irq_disable(); - call_function_with_irqs_off(); - preempt_disable(); - call_function_with_irqs_and_preemption_off(); - local_irq_enable(); - call_function_with_preemption_off(); - preempt_enable(); - -The irqsoff tracer will record the total length of -call_function_with_irqs_off() and -call_function_with_irqs_and_preemption_off(). - -The preemptoff tracer will record the total length of -call_function_with_irqs_and_preemption_off() and -call_function_with_preemption_off(). - -But neither will trace the time that interrupts and/or -preemption is disabled. This total time is the time that we can -not schedule. To record this time, use the preemptirqsoff -tracer. - -Again, using this trace is much like the irqsoff and preemptoff -tracers. - - # echo preemptirqsoff > current_tracer - # echo latency-format > trace_options - # echo 0 > tracing_max_latency - # echo 1 > tracing_on - # ls -ltr - [...] - # echo 0 > tracing_on - # cat trace -# tracer: preemptirqsoff -# -preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 293 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: ls-4860 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: apic_timer_interrupt - => ended at: __do_softirq - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - ls-4860 0d... 0us!: trace_hardirqs_off_thunk (apic_timer_interrupt) - ls-4860 0d.s. 294us : _local_bh_enable (__do_softirq) - ls-4860 0d.s1 294us : trace_preempt_on (__do_softirq) - - - -The trace_hardirqs_off_thunk is called from assembly on x86 when -interrupts are disabled in the assembly code. Without the -function tracing, we do not know if interrupts were enabled -within the preemption points. We do see that it started with -preemption enabled. - -Here is a trace with ftrace_enabled set: - - -# tracer: preemptirqsoff -# -preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 105 us, #183/183, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: write_chan - => ended at: __do_softirq - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - ls-4473 0.N.. 0us : preempt_schedule (write_chan) - ls-4473 0dN.1 1us : _spin_lock (schedule) - ls-4473 0dN.1 2us : add_preempt_count (_spin_lock) - ls-4473 0d..2 2us : put_prev_task_fair (schedule) -[...] - ls-4473 0d..2 13us : set_normalized_timespec (ktime_get_ts) - ls-4473 0d..2 13us : __switch_to (schedule) - sshd-4261 0d..2 14us : finish_task_switch (schedule) - sshd-4261 0d..2 14us : _spin_unlock_irq (finish_task_switch) - sshd-4261 0d..1 15us : add_preempt_count (_spin_lock_irqsave) - sshd-4261 0d..2 16us : _spin_unlock_irqrestore (hrtick_set) - sshd-4261 0d..2 16us : do_IRQ (common_interrupt) - sshd-4261 0d..2 17us : irq_enter (do_IRQ) - sshd-4261 0d..2 17us : idle_cpu (irq_enter) - sshd-4261 0d..2 18us : add_preempt_count (irq_enter) - sshd-4261 0d.h2 18us : idle_cpu (irq_enter) - sshd-4261 0d.h. 18us : handle_fasteoi_irq (do_IRQ) - sshd-4261 0d.h. 19us : _spin_lock (handle_fasteoi_irq) - sshd-4261 0d.h. 19us : add_preempt_count (_spin_lock) - sshd-4261 0d.h1 20us : _spin_unlock (handle_fasteoi_irq) - sshd-4261 0d.h1 20us : sub_preempt_count (_spin_unlock) -[...] - sshd-4261 0d.h1 28us : _spin_unlock (handle_fasteoi_irq) - sshd-4261 0d.h1 29us : sub_preempt_count (_spin_unlock) - sshd-4261 0d.h2 29us : irq_exit (do_IRQ) - sshd-4261 0d.h2 29us : sub_preempt_count (irq_exit) - sshd-4261 0d..3 30us : do_softirq (irq_exit) - sshd-4261 0d... 30us : __do_softirq (do_softirq) - sshd-4261 0d... 31us : __local_bh_disable (__do_softirq) - sshd-4261 0d... 31us+: add_preempt_count (__local_bh_disable) - sshd-4261 0d.s4 34us : add_preempt_count (__local_bh_disable) -[...] - sshd-4261 0d.s3 43us : sub_preempt_count (local_bh_enable_ip) - sshd-4261 0d.s4 44us : sub_preempt_count (local_bh_enable_ip) - sshd-4261 0d.s3 44us : smp_apic_timer_interrupt (apic_timer_interrupt) - sshd-4261 0d.s3 45us : irq_enter (smp_apic_timer_interrupt) - sshd-4261 0d.s3 45us : idle_cpu (irq_enter) - sshd-4261 0d.s3 46us : add_preempt_count (irq_enter) - sshd-4261 0d.H3 46us : idle_cpu (irq_enter) - sshd-4261 0d.H3 47us : hrtimer_interrupt (smp_apic_timer_interrupt) - sshd-4261 0d.H3 47us : ktime_get (hrtimer_interrupt) -[...] - sshd-4261 0d.H3 81us : tick_program_event (hrtimer_interrupt) - sshd-4261 0d.H3 82us : ktime_get (tick_program_event) - sshd-4261 0d.H3 82us : ktime_get_ts (ktime_get) - sshd-4261 0d.H3 83us : getnstimeofday (ktime_get_ts) - sshd-4261 0d.H3 83us : set_normalized_timespec (ktime_get_ts) - sshd-4261 0d.H3 84us : clockevents_program_event (tick_program_event) - sshd-4261 0d.H3 84us : lapic_next_event (clockevents_program_event) - sshd-4261 0d.H3 85us : irq_exit (smp_apic_timer_interrupt) - sshd-4261 0d.H3 85us : sub_preempt_count (irq_exit) - sshd-4261 0d.s4 86us : sub_preempt_count (irq_exit) - sshd-4261 0d.s3 86us : add_preempt_count (__local_bh_disable) -[...] - sshd-4261 0d.s1 98us : sub_preempt_count (net_rx_action) - sshd-4261 0d.s. 99us : add_preempt_count (_spin_lock_irq) - sshd-4261 0d.s1 99us+: _spin_unlock_irq (run_timer_softirq) - sshd-4261 0d.s. 104us : _local_bh_enable (__do_softirq) - sshd-4261 0d.s. 104us : sub_preempt_count (_local_bh_enable) - sshd-4261 0d.s. 105us : _local_bh_enable (__do_softirq) - sshd-4261 0d.s1 105us : trace_preempt_on (__do_softirq) - - -This is a very interesting trace. It started with the preemption -of the ls task. We see that the task had the "need_resched" bit -set via the 'N' in the trace. Interrupts were disabled before -the spin_lock at the beginning of the trace. We see that a -schedule took place to run sshd. When the interrupts were -enabled, we took an interrupt. On return from the interrupt -handler, the softirq ran. We took another interrupt while -running the softirq as we see from the capital 'H'. - - -wakeup ------- - -In a Real-Time environment it is very important to know the -wakeup time it takes for the highest priority task that is woken -up to the time that it executes. This is also known as "schedule -latency". I stress the point that this is about RT tasks. It is -also important to know the scheduling latency of non-RT tasks, -but the average schedule latency is better for non-RT tasks. -Tools like LatencyTop are more appropriate for such -measurements. - -Real-Time environments are interested in the worst case latency. -That is the longest latency it takes for something to happen, -and not the average. We can have a very fast scheduler that may -only have a large latency once in a while, but that would not -work well with Real-Time tasks. The wakeup tracer was designed -to record the worst case wakeups of RT tasks. Non-RT tasks are -not recorded because the tracer only records one worst case and -tracing non-RT tasks that are unpredictable will overwrite the -worst case latency of RT tasks. - -Since this tracer only deals with RT tasks, we will run this -slightly differently than we did with the previous tracers. -Instead of performing an 'ls', we will run 'sleep 1' under -'chrt' which changes the priority of the task. - - # echo wakeup > current_tracer - # echo latency-format > trace_options - # echo 0 > tracing_max_latency - # echo 1 > tracing_on - # chrt -f 5 sleep 1 - # echo 0 > tracing_on - # cat trace -# tracer: wakeup -# -wakeup latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 4 us, #2/2, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: sleep-4901 (uid:0 nice:0 policy:1 rt_prio:5) - ----------------- - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - <idle>-0 1d.h4 0us+: try_to_wake_up (wake_up_process) - <idle>-0 1d..4 4us : schedule (cpu_idle) - - -Running this on an idle system, we see that it only took 4 -microseconds to perform the task switch. Note, since the trace -marker in the schedule is before the actual "switch", we stop -the tracing when the recorded task is about to schedule in. This -may change if we add a new marker at the end of the scheduler. - -Notice that the recorded task is 'sleep' with the PID of 4901 -and it has an rt_prio of 5. This priority is user-space priority -and not the internal kernel priority. The policy is 1 for -SCHED_FIFO and 2 for SCHED_RR. - -Doing the same with chrt -r 5 and ftrace_enabled set. - -# tracer: wakeup -# -wakeup latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 50 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: sleep-4068 (uid:0 nice:0 policy:2 rt_prio:5) - ----------------- - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / -ksoftirq-7 1d.H3 0us : try_to_wake_up (wake_up_process) -ksoftirq-7 1d.H4 1us : sub_preempt_count (marker_probe_cb) -ksoftirq-7 1d.H3 2us : check_preempt_wakeup (try_to_wake_up) -ksoftirq-7 1d.H3 3us : update_curr (check_preempt_wakeup) -ksoftirq-7 1d.H3 4us : calc_delta_mine (update_curr) -ksoftirq-7 1d.H3 5us : __resched_task (check_preempt_wakeup) -ksoftirq-7 1d.H3 6us : task_wake_up_rt (try_to_wake_up) -ksoftirq-7 1d.H3 7us : _spin_unlock_irqrestore (try_to_wake_up) -[...] -ksoftirq-7 1d.H2 17us : irq_exit (smp_apic_timer_interrupt) -ksoftirq-7 1d.H2 18us : sub_preempt_count (irq_exit) -ksoftirq-7 1d.s3 19us : sub_preempt_count (irq_exit) -ksoftirq-7 1..s2 20us : rcu_process_callbacks (__do_softirq) -[...] -ksoftirq-7 1..s2 26us : __rcu_process_callbacks (rcu_process_callbacks) -ksoftirq-7 1d.s2 27us : _local_bh_enable (__do_softirq) -ksoftirq-7 1d.s2 28us : sub_preempt_count (_local_bh_enable) -ksoftirq-7 1.N.3 29us : sub_preempt_count (ksoftirqd) -ksoftirq-7 1.N.2 30us : _cond_resched (ksoftirqd) -ksoftirq-7 1.N.2 31us : __cond_resched (_cond_resched) -ksoftirq-7 1.N.2 32us : add_preempt_count (__cond_resched) -ksoftirq-7 1.N.2 33us : schedule (__cond_resched) -ksoftirq-7 1.N.2 33us : add_preempt_count (schedule) -ksoftirq-7 1.N.3 34us : hrtick_clear (schedule) -ksoftirq-7 1dN.3 35us : _spin_lock (schedule) -ksoftirq-7 1dN.3 36us : add_preempt_count (_spin_lock) -ksoftirq-7 1d..4 37us : put_prev_task_fair (schedule) -ksoftirq-7 1d..4 38us : update_curr (put_prev_task_fair) -[...] -ksoftirq-7 1d..5 47us : _spin_trylock (tracing_record_cmdline) -ksoftirq-7 1d..5 48us : add_preempt_count (_spin_trylock) -ksoftirq-7 1d..6 49us : _spin_unlock (tracing_record_cmdline) -ksoftirq-7 1d..6 49us : sub_preempt_count (_spin_unlock) -ksoftirq-7 1d..4 50us : schedule (__cond_resched) - -The interrupt went off while running ksoftirqd. This task runs -at SCHED_OTHER. Why did not we see the 'N' set early? This may -be a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K -stacks configured, the interrupt and softirq run with their own -stack. Some information is held on the top of the task's stack -(need_resched and preempt_count are both stored there). The -setting of the NEED_RESCHED bit is done directly to the task's -stack, but the reading of the NEED_RESCHED is done by looking at -the current stack, which in this case is the stack for the hard -interrupt. This hides the fact that NEED_RESCHED has been set. -We do not see the 'N' until we switch back to the task's -assigned stack. - -function --------- - -This tracer is the function tracer. Enabling the function tracer -can be done from the debug file system. Make sure the -ftrace_enabled is set; otherwise this tracer is a nop. - - # sysctl kernel.ftrace_enabled=1 - # echo function > current_tracer - # echo 1 > tracing_on - # usleep 1 - # echo 0 > tracing_on - # cat trace -# tracer: function -# -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | - bash-4003 [00] 123.638713: finish_task_switch <-schedule - bash-4003 [00] 123.638714: _spin_unlock_irq <-finish_task_switch - bash-4003 [00] 123.638714: sub_preempt_count <-_spin_unlock_irq - bash-4003 [00] 123.638715: hrtick_set <-schedule - bash-4003 [00] 123.638715: _spin_lock_irqsave <-hrtick_set - bash-4003 [00] 123.638716: add_preempt_count <-_spin_lock_irqsave - bash-4003 [00] 123.638716: _spin_unlock_irqrestore <-hrtick_set - bash-4003 [00] 123.638717: sub_preempt_count <-_spin_unlock_irqrestore - bash-4003 [00] 123.638717: hrtick_clear <-hrtick_set - bash-4003 [00] 123.638718: sub_preempt_count <-schedule - bash-4003 [00] 123.638718: sub_preempt_count <-preempt_schedule - bash-4003 [00] 123.638719: wait_for_completion <-__stop_machine_run - bash-4003 [00] 123.638719: wait_for_common <-wait_for_completion - bash-4003 [00] 123.638720: _spin_lock_irq <-wait_for_common - bash-4003 [00] 123.638720: add_preempt_count <-_spin_lock_irq -[...] - - -Note: function tracer uses ring buffers to store the above -entries. The newest data may overwrite the oldest data. -Sometimes using echo to stop the trace is not sufficient because -the tracing could have overwritten the data that you wanted to -record. For this reason, it is sometimes better to disable -tracing directly from a program. This allows you to stop the -tracing at the point that you hit the part that you are -interested in. To disable the tracing directly from a C program, -something like following code snippet can be used: - -int trace_fd; -[...] -int main(int argc, char *argv[]) { - [...] - trace_fd = open(tracing_file("tracing_on"), O_WRONLY); - [...] - if (condition_hit()) { - write(trace_fd, "0", 1); - } - [...] -} - - -Single thread tracing ---------------------- - -By writing into set_ftrace_pid you can trace a -single thread. For example: - -# cat set_ftrace_pid -no pid -# echo 3111 > set_ftrace_pid -# cat set_ftrace_pid -3111 -# echo function > current_tracer -# cat trace | head - # tracer: function - # - # TASK-PID CPU# TIMESTAMP FUNCTION - # | | | | | - yum-updatesd-3111 [003] 1637.254676: finish_task_switch <-thread_return - yum-updatesd-3111 [003] 1637.254681: hrtimer_cancel <-schedule_hrtimeout_range - yum-updatesd-3111 [003] 1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel - yum-updatesd-3111 [003] 1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel - yum-updatesd-3111 [003] 1637.254685: fget_light <-do_sys_poll - yum-updatesd-3111 [003] 1637.254686: pipe_poll <-do_sys_poll -# echo -1 > set_ftrace_pid -# cat trace |head - # tracer: function - # - # TASK-PID CPU# TIMESTAMP FUNCTION - # | | | | | - ##### CPU 3 buffer started #### - yum-updatesd-3111 [003] 1701.957688: free_poll_entry <-poll_freewait - yum-updatesd-3111 [003] 1701.957689: remove_wait_queue <-free_poll_entry - yum-updatesd-3111 [003] 1701.957691: fput <-free_poll_entry - yum-updatesd-3111 [003] 1701.957692: audit_syscall_exit <-sysret_audit - yum-updatesd-3111 [003] 1701.957693: path_put <-audit_syscall_exit - -If you want to trace a function when executing, you could use -something like this simple program: - -#include <stdio.h> -#include <stdlib.h> -#include <sys/types.h> -#include <sys/stat.h> -#include <fcntl.h> -#include <unistd.h> -#include <string.h> - -#define _STR(x) #x -#define STR(x) _STR(x) -#define MAX_PATH 256 - -const char *find_debugfs(void) -{ - static char debugfs[MAX_PATH+1]; - static int debugfs_found; - char type[100]; - FILE *fp; - - if (debugfs_found) - return debugfs; - - if ((fp = fopen("/proc/mounts","r")) == NULL) { - perror("/proc/mounts"); - return NULL; - } - - while (fscanf(fp, "%*s %" - STR(MAX_PATH) - "s %99s %*s %*d %*d\n", - debugfs, type) == 2) { - if (strcmp(type, "debugfs") == 0) - break; - } - fclose(fp); - - if (strcmp(type, "debugfs") != 0) { - fprintf(stderr, "debugfs not mounted"); - return NULL; - } - - strcat(debugfs, "/tracing/"); - debugfs_found = 1; - - return debugfs; -} - -const char *tracing_file(const char *file_name) -{ - static char trace_file[MAX_PATH+1]; - snprintf(trace_file, MAX_PATH, "%s/%s", find_debugfs(), file_name); - return trace_file; -} - -int main (int argc, char **argv) -{ - if (argc < 1) - exit(-1); - - if (fork() > 0) { - int fd, ffd; - char line[64]; - int s; - - ffd = open(tracing_file("current_tracer"), O_WRONLY); - if (ffd < 0) - exit(-1); - write(ffd, "nop", 3); - - fd = open(tracing_file("set_ftrace_pid"), O_WRONLY); - s = sprintf(line, "%d\n", getpid()); - write(fd, line, s); - - write(ffd, "function", 8); - - close(fd); - close(ffd); - - execvp(argv[1], argv+1); - } - - return 0; -} - - -hw-branch-tracer (x86 only) ---------------------------- - -This tracer uses the x86 last branch tracing hardware feature to -collect a branch trace on all cpus with relatively low overhead. - -The tracer uses a fixed-size circular buffer per cpu and only -traces ring 0 branches. The trace file dumps that buffer in the -following format: - -# tracer: hw-branch-tracer -# -# CPU# TO <- FROM - 0 scheduler_tick+0xb5/0x1bf <- task_tick_idle+0x5/0x6 - 2 run_posix_cpu_timers+0x2b/0x72a <- run_posix_cpu_timers+0x25/0x72a - 0 scheduler_tick+0x139/0x1bf <- scheduler_tick+0xed/0x1bf - 0 scheduler_tick+0x17c/0x1bf <- scheduler_tick+0x148/0x1bf - 2 run_posix_cpu_timers+0x9e/0x72a <- run_posix_cpu_timers+0x5e/0x72a - 0 scheduler_tick+0x1b6/0x1bf <- scheduler_tick+0x1aa/0x1bf - - -The tracer may be used to dump the trace for the oops'ing cpu on -a kernel oops into the system log. To enable this, -ftrace_dump_on_oops must be set. To set ftrace_dump_on_oops, one -can either use the sysctl function or set it via the proc system -interface. - - sysctl kernel.ftrace_dump_on_oops=n - -or - - echo n > /proc/sys/kernel/ftrace_dump_on_oops - -If n = 1, ftrace will dump buffers of all CPUs, if n = 2 ftrace will -only dump the buffer of the CPU that triggered the oops. - -Here's an example of such a dump after a null pointer -dereference in a kernel module: - -[57848.105921] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 -[57848.106019] IP: [<ffffffffa0000006>] open+0x6/0x14 [oops] -[57848.106019] PGD 2354e9067 PUD 2375e7067 PMD 0 -[57848.106019] Oops: 0002 [#1] SMP -[57848.106019] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:20:05.0/local_cpus -[57848.106019] Dumping ftrace buffer: -[57848.106019] --------------------------------- -[...] -[57848.106019] 0 chrdev_open+0xe6/0x165 <- cdev_put+0x23/0x24 -[57848.106019] 0 chrdev_open+0x117/0x165 <- chrdev_open+0xfa/0x165 -[57848.106019] 0 chrdev_open+0x120/0x165 <- chrdev_open+0x11c/0x165 -[57848.106019] 0 chrdev_open+0x134/0x165 <- chrdev_open+0x12b/0x165 -[57848.106019] 0 open+0x0/0x14 [oops] <- chrdev_open+0x144/0x165 -[57848.106019] 0 page_fault+0x0/0x30 <- open+0x6/0x14 [oops] -[57848.106019] 0 error_entry+0x0/0x5b <- page_fault+0x4/0x30 -[57848.106019] 0 error_kernelspace+0x0/0x31 <- error_entry+0x59/0x5b -[57848.106019] 0 error_sti+0x0/0x1 <- error_kernelspace+0x2d/0x31 -[57848.106019] 0 page_fault+0x9/0x30 <- error_sti+0x0/0x1 -[57848.106019] 0 do_page_fault+0x0/0x881 <- page_fault+0x1a/0x30 -[...] -[57848.106019] 0 do_page_fault+0x66b/0x881 <- is_prefetch+0x1ee/0x1f2 -[57848.106019] 0 do_page_fault+0x6e0/0x881 <- do_page_fault+0x67a/0x881 -[57848.106019] 0 oops_begin+0x0/0x96 <- do_page_fault+0x6e0/0x881 -[57848.106019] 0 trace_hw_branch_oops+0x0/0x2d <- oops_begin+0x9/0x96 -[...] -[57848.106019] 0 ds_suspend_bts+0x2a/0xe3 <- ds_suspend_bts+0x1a/0xe3 -[57848.106019] --------------------------------- -[57848.106019] CPU 0 -[57848.106019] Modules linked in: oops -[57848.106019] Pid: 5542, comm: cat Tainted: G W 2.6.28 #23 -[57848.106019] RIP: 0010:[<ffffffffa0000006>] [<ffffffffa0000006>] open+0x6/0x14 [oops] -[57848.106019] RSP: 0018:ffff880235457d48 EFLAGS: 00010246 -[...] - - -function graph tracer ---------------------------- - -This tracer is similar to the function tracer except that it -probes a function on its entry and its exit. This is done by -using a dynamically allocated stack of return addresses in each -task_struct. On function entry the tracer overwrites the return -address of each function traced to set a custom probe. Thus the -original return address is stored on the stack of return address -in the task_struct. - -Probing on both ends of a function leads to special features -such as: - -- measure of a function's time execution -- having a reliable call stack to draw function calls graph - -This tracer is useful in several situations: - -- you want to find the reason of a strange kernel behavior and - need to see what happens in detail on any areas (or specific - ones). - -- you are experiencing weird latencies but it's difficult to - find its origin. - -- you want to find quickly which path is taken by a specific - function - -- you just want to peek inside a working kernel and want to see - what happens there. - -# tracer: function_graph -# -# CPU DURATION FUNCTION CALLS -# | | | | | | | - - 0) | sys_open() { - 0) | do_sys_open() { - 0) | getname() { - 0) | kmem_cache_alloc() { - 0) 1.382 us | __might_sleep(); - 0) 2.478 us | } - 0) | strncpy_from_user() { - 0) | might_fault() { - 0) 1.389 us | __might_sleep(); - 0) 2.553 us | } - 0) 3.807 us | } - 0) 7.876 us | } - 0) | alloc_fd() { - 0) 0.668 us | _spin_lock(); - 0) 0.570 us | expand_files(); - 0) 0.586 us | _spin_unlock(); - - -There are several columns that can be dynamically -enabled/disabled. You can use every combination of options you -want, depending on your needs. - -- The cpu number on which the function executed is default - enabled. It is sometimes better to only trace one cpu (see - tracing_cpu_mask file) or you might sometimes see unordered - function calls while cpu tracing switch. - - hide: echo nofuncgraph-cpu > trace_options - show: echo funcgraph-cpu > trace_options - -- The duration (function's time of execution) is displayed on - the closing bracket line of a function or on the same line - than the current function in case of a leaf one. It is default - enabled. - - hide: echo nofuncgraph-duration > trace_options - show: echo funcgraph-duration > trace_options - -- The overhead field precedes the duration field in case of - reached duration thresholds. - - hide: echo nofuncgraph-overhead > trace_options - show: echo funcgraph-overhead > trace_options - depends on: funcgraph-duration - - ie: - - 0) | up_write() { - 0) 0.646 us | _spin_lock_irqsave(); - 0) 0.684 us | _spin_unlock_irqrestore(); - 0) 3.123 us | } - 0) 0.548 us | fput(); - 0) + 58.628 us | } - - [...] - - 0) | putname() { - 0) | kmem_cache_free() { - 0) 0.518 us | __phys_addr(); - 0) 1.757 us | } - 0) 2.861 us | } - 0) ! 115.305 us | } - 0) ! 116.402 us | } - - + means that the function exceeded 10 usecs. - ! means that the function exceeded 100 usecs. - - -- The task/pid field displays the thread cmdline and pid which - executed the function. It is default disabled. - - hide: echo nofuncgraph-proc > trace_options - show: echo funcgraph-proc > trace_options - - ie: - - # tracer: function_graph - # - # CPU TASK/PID DURATION FUNCTION CALLS - # | | | | | | | | | - 0) sh-4802 | | d_free() { - 0) sh-4802 | | call_rcu() { - 0) sh-4802 | | __call_rcu() { - 0) sh-4802 | 0.616 us | rcu_process_gp_end(); - 0) sh-4802 | 0.586 us | check_for_new_grace_period(); - 0) sh-4802 | 2.899 us | } - 0) sh-4802 | 4.040 us | } - 0) sh-4802 | 5.151 us | } - 0) sh-4802 | + 49.370 us | } - - -- The absolute time field is an absolute timestamp given by the - system clock since it started. A snapshot of this time is - given on each entry/exit of functions - - hide: echo nofuncgraph-abstime > trace_options - show: echo funcgraph-abstime > trace_options - - ie: - - # - # TIME CPU DURATION FUNCTION CALLS - # | | | | | | | | - 360.774522 | 1) 0.541 us | } - 360.774522 | 1) 4.663 us | } - 360.774523 | 1) 0.541 us | __wake_up_bit(); - 360.774524 | 1) 6.796 us | } - 360.774524 | 1) 7.952 us | } - 360.774525 | 1) 9.063 us | } - 360.774525 | 1) 0.615 us | journal_mark_dirty(); - 360.774527 | 1) 0.578 us | __brelse(); - 360.774528 | 1) | reiserfs_prepare_for_journal() { - 360.774528 | 1) | unlock_buffer() { - 360.774529 | 1) | wake_up_bit() { - 360.774529 | 1) | bit_waitqueue() { - 360.774530 | 1) 0.594 us | __phys_addr(); - - -You can put some comments on specific functions by using -trace_printk() For example, if you want to put a comment inside -the __might_sleep() function, you just have to include -<linux/ftrace.h> and call trace_printk() inside __might_sleep() - -trace_printk("I'm a comment!\n") - -will produce: - - 1) | __might_sleep() { - 1) | /* I'm a comment! */ - 1) 1.449 us | } - - -You might find other useful features for this tracer in the -following "dynamic ftrace" section such as tracing only specific -functions or tasks. - -dynamic ftrace --------------- - -If CONFIG_DYNAMIC_FTRACE is set, the system will run with -virtually no overhead when function tracing is disabled. The way -this works is the mcount function call (placed at the start of -every kernel function, produced by the -pg switch in gcc), -starts of pointing to a simple return. (Enabling FTRACE will -include the -pg switch in the compiling of the kernel.) - -At compile time every C file object is run through the -recordmcount.pl script (located in the scripts directory). This -script will process the C object using objdump to find all the -locations in the .text section that call mcount. (Note, only the -.text section is processed, since processing other sections like -.init.text may cause races due to those sections being freed). - -A new section called "__mcount_loc" is created that holds -references to all the mcount call sites in the .text section. -This section is compiled back into the original object. The -final linker will add all these references into a single table. - -On boot up, before SMP is initialized, the dynamic ftrace code -scans this table and updates all the locations into nops. It -also records the locations, which are added to the -available_filter_functions list. Modules are processed as they -are loaded and before they are executed. When a module is -unloaded, it also removes its functions from the ftrace function -list. This is automatic in the module unload code, and the -module author does not need to worry about it. - -When tracing is enabled, kstop_machine is called to prevent -races with the CPUS executing code being modified (which can -cause the CPU to do undesirable things), and the nops are -patched back to calls. But this time, they do not call mcount -(which is just a function stub). They now call into the ftrace -infrastructure. - -One special side-effect to the recording of the functions being -traced is that we can now selectively choose which functions we -wish to trace and which ones we want the mcount calls to remain -as nops. - -Two files are used, one for enabling and one for disabling the -tracing of specified functions. They are: - - set_ftrace_filter - -and - - set_ftrace_notrace - -A list of available functions that you can add to these files is -listed in: - - available_filter_functions - - # cat available_filter_functions -put_prev_task_idle -kmem_cache_create -pick_next_task_rt -get_online_cpus -pick_next_task_fair -mutex_lock -[...] - -If I am only interested in sys_nanosleep and hrtimer_interrupt: - - # echo sys_nanosleep hrtimer_interrupt \ - > set_ftrace_filter - # echo function > current_tracer - # echo 1 > tracing_on - # usleep 1 - # echo 0 > tracing_on - # cat trace -# tracer: ftrace -# -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | - usleep-4134 [00] 1317.070017: hrtimer_interrupt <-smp_apic_timer_interrupt - usleep-4134 [00] 1317.070111: sys_nanosleep <-syscall_call - <idle>-0 [00] 1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt - -To see which functions are being traced, you can cat the file: - - # cat set_ftrace_filter -hrtimer_interrupt -sys_nanosleep - - -Perhaps this is not enough. The filters also allow simple wild -cards. Only the following are currently available - - <match>* - will match functions that begin with <match> - *<match> - will match functions that end with <match> - *<match>* - will match functions that have <match> in it - -These are the only wild cards which are supported. - - <match>*<match> will not work. - -Note: It is better to use quotes to enclose the wild cards, - otherwise the shell may expand the parameters into names - of files in the local directory. - - # echo 'hrtimer_*' > set_ftrace_filter - -Produces: - -# tracer: ftrace -# -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | - bash-4003 [00] 1480.611794: hrtimer_init <-copy_process - bash-4003 [00] 1480.611941: hrtimer_start <-hrtick_set - bash-4003 [00] 1480.611956: hrtimer_cancel <-hrtick_clear - bash-4003 [00] 1480.611956: hrtimer_try_to_cancel <-hrtimer_cancel - <idle>-0 [00] 1480.612019: hrtimer_get_next_event <-get_next_timer_interrupt - <idle>-0 [00] 1480.612025: hrtimer_get_next_event <-get_next_timer_interrupt - <idle>-0 [00] 1480.612032: hrtimer_get_next_event <-get_next_timer_interrupt - <idle>-0 [00] 1480.612037: hrtimer_get_next_event <-get_next_timer_interrupt - <idle>-0 [00] 1480.612382: hrtimer_get_next_event <-get_next_timer_interrupt - - -Notice that we lost the sys_nanosleep. - - # cat set_ftrace_filter -hrtimer_run_queues -hrtimer_run_pending -hrtimer_init -hrtimer_cancel -hrtimer_try_to_cancel -hrtimer_forward -hrtimer_start -hrtimer_reprogram -hrtimer_force_reprogram -hrtimer_get_next_event -hrtimer_interrupt -hrtimer_nanosleep -hrtimer_wakeup -hrtimer_get_remaining -hrtimer_get_res -hrtimer_init_sleeper - - -This is because the '>' and '>>' act just like they do in bash. -To rewrite the filters, use '>' -To append to the filters, use '>>' - -To clear out a filter so that all functions will be recorded -again: - - # echo > set_ftrace_filter - # cat set_ftrace_filter - # - -Again, now we want to append. - - # echo sys_nanosleep > set_ftrace_filter - # cat set_ftrace_filter -sys_nanosleep - # echo 'hrtimer_*' >> set_ftrace_filter - # cat set_ftrace_filter -hrtimer_run_queues -hrtimer_run_pending -hrtimer_init -hrtimer_cancel -hrtimer_try_to_cancel -hrtimer_forward -hrtimer_start -hrtimer_reprogram -hrtimer_force_reprogram -hrtimer_get_next_event -hrtimer_interrupt -sys_nanosleep -hrtimer_nanosleep -hrtimer_wakeup -hrtimer_get_remaining -hrtimer_get_res -hrtimer_init_sleeper - - -The set_ftrace_notrace prevents those functions from being -traced. - - # echo '*preempt*' '*lock*' > set_ftrace_notrace - -Produces: - -# tracer: ftrace -# -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | - bash-4043 [01] 115.281644: finish_task_switch <-schedule - bash-4043 [01] 115.281645: hrtick_set <-schedule - bash-4043 [01] 115.281645: hrtick_clear <-hrtick_set - bash-4043 [01] 115.281646: wait_for_completion <-__stop_machine_run - bash-4043 [01] 115.281647: wait_for_common <-wait_for_completion - bash-4043 [01] 115.281647: kthread_stop <-stop_machine_run - bash-4043 [01] 115.281648: init_waitqueue_head <-kthread_stop - bash-4043 [01] 115.281648: wake_up_process <-kthread_stop - bash-4043 [01] 115.281649: try_to_wake_up <-wake_up_process - -We can see that there's no more lock or preempt tracing. - - -Dynamic ftrace with the function graph tracer ---------------------------------------------- - -Although what has been explained above concerns both the -function tracer and the function-graph-tracer, there are some -special features only available in the function-graph tracer. - -If you want to trace only one function and all of its children, -you just have to echo its name into set_graph_function: - - echo __do_fault > set_graph_function - -will produce the following "expanded" trace of the __do_fault() -function: - - 0) | __do_fault() { - 0) | filemap_fault() { - 0) | find_lock_page() { - 0) 0.804 us | find_get_page(); - 0) | __might_sleep() { - 0) 1.329 us | } - 0) 3.904 us | } - 0) 4.979 us | } - 0) 0.653 us | _spin_lock(); - 0) 0.578 us | page_add_file_rmap(); - 0) 0.525 us | native_set_pte_at(); - 0) 0.585 us | _spin_unlock(); - 0) | unlock_page() { - 0) 0.541 us | page_waitqueue(); - 0) 0.639 us | __wake_up_bit(); - 0) 2.786 us | } - 0) + 14.237 us | } - 0) | __do_fault() { - 0) | filemap_fault() { - 0) | find_lock_page() { - 0) 0.698 us | find_get_page(); - 0) | __might_sleep() { - 0) 1.412 us | } - 0) 3.950 us | } - 0) 5.098 us | } - 0) 0.631 us | _spin_lock(); - 0) 0.571 us | page_add_file_rmap(); - 0) 0.526 us | native_set_pte_at(); - 0) 0.586 us | _spin_unlock(); - 0) | unlock_page() { - 0) 0.533 us | page_waitqueue(); - 0) 0.638 us | __wake_up_bit(); - 0) 2.793 us | } - 0) + 14.012 us | } - -You can also expand several functions at once: - - echo sys_open > set_graph_function - echo sys_close >> set_graph_function - -Now if you want to go back to trace all functions you can clear -this special filter via: - - echo > set_graph_function - - -Filter commands ---------------- - -A few commands are supported by the set_ftrace_filter interface. -Trace commands have the following format: - -<function>:<command>:<parameter> - -The following commands are supported: - -- mod - This command enables function filtering per module. The - parameter defines the module. For example, if only the write* - functions in the ext3 module are desired, run: - - echo 'write*:mod:ext3' > set_ftrace_filter - - This command interacts with the filter in the same way as - filtering based on function names. Thus, adding more functions - in a different module is accomplished by appending (>>) to the - filter file. Remove specific module functions by prepending - '!': - - echo '!writeback*:mod:ext3' >> set_ftrace_filter - -- traceon/traceoff - These commands turn tracing on and off when the specified - functions are hit. The parameter determines how many times the - tracing system is turned on and off. If unspecified, there is - no limit. For example, to disable tracing when a schedule bug - is hit the first 5 times, run: - - echo '__schedule_bug:traceoff:5' > set_ftrace_filter - - These commands are cumulative whether or not they are appended - to set_ftrace_filter. To remove a command, prepend it by '!' - and drop the parameter: - - echo '!__schedule_bug:traceoff' > set_ftrace_filter - - -trace_pipe ----------- - -The trace_pipe outputs the same content as the trace file, but -the effect on the tracing is different. Every read from -trace_pipe is consumed. This means that subsequent reads will be -different. The trace is live. - - # echo function > current_tracer - # cat trace_pipe > /tmp/trace.out & -[1] 4153 - # echo 1 > tracing_on - # usleep 1 - # echo 0 > tracing_on - # cat trace -# tracer: function -# -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | - - # - # cat /tmp/trace.out - bash-4043 [00] 41.267106: finish_task_switch <-schedule - bash-4043 [00] 41.267106: hrtick_set <-schedule - bash-4043 [00] 41.267107: hrtick_clear <-hrtick_set - bash-4043 [00] 41.267108: wait_for_completion <-__stop_machine_run - bash-4043 [00] 41.267108: wait_for_common <-wait_for_completion - bash-4043 [00] 41.267109: kthread_stop <-stop_machine_run - bash-4043 [00] 41.267109: init_waitqueue_head <-kthread_stop - bash-4043 [00] 41.267110: wake_up_process <-kthread_stop - bash-4043 [00] 41.267110: try_to_wake_up <-wake_up_process - bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up - - -Note, reading the trace_pipe file will block until more input is -added. By changing the tracer, trace_pipe will issue an EOF. We -needed to set the function tracer _before_ we "cat" the -trace_pipe file. - - -trace entries -------------- - -Having too much or not enough data can be troublesome in -diagnosing an issue in the kernel. The file buffer_size_kb is -used to modify the size of the internal trace buffers. The -number listed is the number of entries that can be recorded per -CPU. To know the full size, multiply the number of possible CPUS -with the number of entries. - - # cat buffer_size_kb -1408 (units kilobytes) - -Note, to modify this, you must have tracing completely disabled. -To do that, echo "nop" into the current_tracer. If the -current_tracer is not set to "nop", an EINVAL error will be -returned. - - # echo nop > current_tracer - # echo 10000 > buffer_size_kb - # cat buffer_size_kb -10000 (units kilobytes) - -The number of pages which will be allocated is limited to a -percentage of available memory. Allocating too much will produce -an error. - - # echo 1000000000000 > buffer_size_kb --bash: echo: write error: Cannot allocate memory - # cat buffer_size_kb -85 - ------------ - -More details can be found in the source code, in the -kernel/trace/*.c files. diff --git a/Documentation/trace/function-graph-fold.vim b/Documentation/trace/function-graph-fold.vim deleted file mode 100644 index 0544b504c8b..00000000000 --- a/Documentation/trace/function-graph-fold.vim +++ /dev/null @@ -1,42 +0,0 @@ -" Enable folding for ftrace function_graph traces. -" -" To use, :source this file while viewing a function_graph trace, or use vim's -" -S option to load from the command-line together with a trace. You can then -" use the usual vim fold commands, such as "za", to open and close nested -" functions. While closed, a fold will show the total time taken for a call, -" as would normally appear on the line with the closing brace. Folded -" functions will not include finish_task_switch(), so folding should remain -" relatively sane even through a context switch. -" -" Note that this will almost certainly only work well with a -" single-CPU trace (e.g. trace-cmd report --cpu 1). - -function! FunctionGraphFoldExpr(lnum) - let line = getline(a:lnum) - if line[-1:] == '{' - if line =~ 'finish_task_switch() {$' - return '>1' - endif - return 'a1' - elseif line[-1:] == '}' - return 's1' - else - return '=' - endif -endfunction - -function! FunctionGraphFoldText() - let s = split(getline(v:foldstart), '|', 1) - if getline(v:foldend+1) =~ 'finish_task_switch() {$' - let s[2] = ' task switch ' - else - let e = split(getline(v:foldend), '|', 1) - let s[2] = e[2] - endif - return join(s, '|') -endfunction - -setlocal foldexpr=FunctionGraphFoldExpr(v:lnum) -setlocal foldtext=FunctionGraphFoldText() -setlocal foldcolumn=12 -setlocal foldmethod=expr diff --git a/Documentation/trace/kprobetrace.txt b/Documentation/trace/kprobetrace.txt deleted file mode 100644 index d0d0bb9e3e2..00000000000 --- a/Documentation/trace/kprobetrace.txt +++ /dev/null @@ -1,172 +0,0 @@ - Kprobe-based Event Tracing - ========================== - - Documentation is written by Masami Hiramatsu - - -Overview --------- -These events are similar to tracepoint based events. Instead of Tracepoint, -this is based on kprobes (kprobe and kretprobe). So it can probe wherever -kprobes can probe (this means, all functions body except for __kprobes -functions). Unlike the Tracepoint based event, this can be added and removed -dynamically, on the fly. - -To enable this feature, build your kernel with CONFIG_KPROBE_TRACING=y. - -Similar to the events tracer, this doesn't need to be activated via -current_tracer. Instead of that, add probe points via -/sys/kernel/debug/tracing/kprobe_events, and enable it via -/sys/kernel/debug/tracing/events/kprobes/<EVENT>/enabled. - - -Synopsis of kprobe_events -------------------------- - p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS] : Set a probe - r[:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS] : Set a return probe - -:[GRP/]EVENT : Clear a probe - - GRP : Group name. If omitted, use "kprobes" for it. - EVENT : Event name. If omitted, the event name is generated - based on SYM+offs or MEMADDR. - MOD : Module name which has given SYM. - SYM[+offs] : Symbol+offset where the probe is inserted. - MEMADDR : Address where the probe is inserted. - - FETCHARGS : Arguments. Each probe can have up to 128 args. - %REG : Fetch register REG - @ADDR : Fetch memory at ADDR (ADDR should be in kernel) - @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol) - $stackN : Fetch Nth entry of stack (N >= 0) - $stack : Fetch stack address. - $retval : Fetch return value.(*) - +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(**) - NAME=FETCHARG : Set NAME as the argument name of FETCHARG. - FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types - (u8/u16/u32/u64/s8/s16/s32/s64), "string" and bitfield - are supported. - - (*) only for return probe. - (**) this is useful for fetching a field of data structures. - -Types ------ -Several types are supported for fetch-args. Kprobe tracer will access memory -by given type. Prefix 's' and 'u' means those types are signed and unsigned -respectively. Traced arguments are shown in decimal (signed) or hex (unsigned). -String type is a special type, which fetches a "null-terminated" string from -kernel space. This means it will fail and store NULL if the string container -has been paged out. -Bitfield is another special type, which takes 3 parameters, bit-width, bit- -offset, and container-size (usually 32). The syntax is; - - b<bit-width>@<bit-offset>/<container-size> - - -Per-Probe Event Filtering -------------------------- - Per-probe event filtering feature allows you to set different filter on each -probe and gives you what arguments will be shown in trace buffer. If an event -name is specified right after 'p:' or 'r:' in kprobe_events, it adds an event -under tracing/events/kprobes/<EVENT>, at the directory you can see 'id', -'enabled', 'format' and 'filter'. - -enabled: - You can enable/disable the probe by writing 1 or 0 on it. - -format: - This shows the format of this probe event. - -filter: - You can write filtering rules of this event. - -id: - This shows the id of this probe event. - - -Event Profiling ---------------- - You can check the total number of probe hits and probe miss-hits via -/sys/kernel/debug/tracing/kprobe_profile. - The first column is event name, the second is the number of probe hits, -the third is the number of probe miss-hits. - - -Usage examples --------------- -To add a probe as a new event, write a new definition to kprobe_events -as below. - - echo 'p:myprobe do_sys_open dfd=%ax filename=%dx flags=%cx mode=+4($stack)' > /sys/kernel/debug/tracing/kprobe_events - - This sets a kprobe on the top of do_sys_open() function with recording -1st to 4th arguments as "myprobe" event. Note, which register/stack entry is -assigned to each function argument depends on arch-specific ABI. If you unsure -the ABI, please try to use probe subcommand of perf-tools (you can find it -under tools/perf/). -As this example shows, users can choose more familiar names for each arguments. - - echo 'r:myretprobe do_sys_open $retval' >> /sys/kernel/debug/tracing/kprobe_events - - This sets a kretprobe on the return point of do_sys_open() function with -recording return value as "myretprobe" event. - You can see the format of these events via -/sys/kernel/debug/tracing/events/kprobes/<EVENT>/format. - - cat /sys/kernel/debug/tracing/events/kprobes/myprobe/format -name: myprobe -ID: 780 -format: - field:unsigned short common_type; offset:0; size:2; signed:0; - field:unsigned char common_flags; offset:2; size:1; signed:0; - field:unsigned char common_preempt_count; offset:3; size:1;signed:0; - field:int common_pid; offset:4; size:4; signed:1; - - field:unsigned long __probe_ip; offset:12; size:4; signed:0; - field:int __probe_nargs; offset:16; size:4; signed:1; - field:unsigned long dfd; offset:20; size:4; signed:0; - field:unsigned long filename; offset:24; size:4; signed:0; - field:unsigned long flags; offset:28; size:4; signed:0; - field:unsigned long mode; offset:32; size:4; signed:0; - - -print fmt: "(%lx) dfd=%lx filename=%lx flags=%lx mode=%lx", REC->__probe_ip, -REC->dfd, REC->filename, REC->flags, REC->mode - - You can see that the event has 4 arguments as in the expressions you specified. - - echo > /sys/kernel/debug/tracing/kprobe_events - - This clears all probe points. - - Or, - - echo -:myprobe >> kprobe_events - - This clears probe points selectively. - - Right after definition, each event is disabled by default. For tracing these -events, you need to enable it. - - echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable - echo 1 > /sys/kernel/debug/tracing/events/kprobes/myretprobe/enable - - And you can see the traced information via /sys/kernel/debug/tracing/trace. - - cat /sys/kernel/debug/tracing/trace -# tracer: nop -# -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | - <...>-1447 [001] 1038282.286875: myprobe: (do_sys_open+0x0/0xd6) dfd=3 filename=7fffd1ec4440 flags=8000 mode=0 - <...>-1447 [001] 1038282.286878: myretprobe: (sys_openat+0xc/0xe <- do_sys_open) $retval=fffffffffffffffe - <...>-1447 [001] 1038282.286885: myprobe: (do_sys_open+0x0/0xd6) dfd=ffffff9c filename=40413c flags=8000 mode=1b6 - <...>-1447 [001] 1038282.286915: myretprobe: (sys_open+0x1b/0x1d <- do_sys_open) $retval=3 - <...>-1447 [001] 1038282.286969: myprobe: (do_sys_open+0x0/0xd6) dfd=ffffff9c filename=4041c6 flags=98800 mode=10 - <...>-1447 [001] 1038282.286976: myretprobe: (sys_open+0x1b/0x1d <- do_sys_open) $retval=3 - - - Each line shows when the kernel hits an event, and <- SYMBOL means kernel -returns from SYMBOL(e.g. "sys_open+0x1b/0x1d <- do_sys_open" means kernel -returns from do_sys_open to sys_open+0x1b). - diff --git a/Documentation/trace/mmiotrace.txt b/Documentation/trace/mmiotrace.txt deleted file mode 100644 index 664e7386d89..00000000000 --- a/Documentation/trace/mmiotrace.txt +++ /dev/null @@ -1,164 +0,0 @@ - In-kernel memory-mapped I/O tracing - - -Home page and links to optional user space tools: - - http://nouveau.freedesktop.org/wiki/MmioTrace - -MMIO tracing was originally developed by Intel around 2003 for their Fault -Injection Test Harness. In Dec 2006 - Jan 2007, using the code from Intel, -Jeff Muizelaar created a tool for tracing MMIO accesses with the Nouveau -project in mind. Since then many people have contributed. - -Mmiotrace was built for reverse engineering any memory-mapped IO device with -the Nouveau project as the first real user. Only x86 and x86_64 architectures -are supported. - -Out-of-tree mmiotrace was originally modified for mainline inclusion and -ftrace framework by Pekka Paalanen <pq@iki.fi>. - - -Preparation ------------ - -Mmiotrace feature is compiled in by the CONFIG_MMIOTRACE option. Tracing is -disabled by default, so it is safe to have this set to yes. SMP systems are -supported, but tracing is unreliable and may miss events if more than one CPU -is on-line, therefore mmiotrace takes all but one CPU off-line during run-time -activation. You can re-enable CPUs by hand, but you have been warned, there -is no way to automatically detect if you are losing events due to CPUs racing. - - -Usage Quick Reference ---------------------- - -$ mount -t debugfs debugfs /sys/kernel/debug -$ echo mmiotrace > /sys/kernel/debug/tracing/current_tracer -$ cat /sys/kernel/debug/tracing/trace_pipe > mydump.txt & -Start X or whatever. -$ echo "X is up" > /sys/kernel/debug/tracing/trace_marker -$ echo nop > /sys/kernel/debug/tracing/current_tracer -Check for lost events. - - -Usage ------ - -Make sure debugfs is mounted to /sys/kernel/debug. -If not (requires root privileges): -$ mount -t debugfs debugfs /sys/kernel/debug - -Check that the driver you are about to trace is not loaded. - -Activate mmiotrace (requires root privileges): -$ echo mmiotrace > /sys/kernel/debug/tracing/current_tracer - -Start storing the trace: -$ cat /sys/kernel/debug/tracing/trace_pipe > mydump.txt & -The 'cat' process should stay running (sleeping) in the background. - -Load the driver you want to trace and use it. Mmiotrace will only catch MMIO -accesses to areas that are ioremapped while mmiotrace is active. - -During tracing you can place comments (markers) into the trace by -$ echo "X is up" > /sys/kernel/debug/tracing/trace_marker -This makes it easier to see which part of the (huge) trace corresponds to -which action. It is recommended to place descriptive markers about what you -do. - -Shut down mmiotrace (requires root privileges): -$ echo nop > /sys/kernel/debug/tracing/current_tracer -The 'cat' process exits. If it does not, kill it by issuing 'fg' command and -pressing ctrl+c. - -Check that mmiotrace did not lose events due to a buffer filling up. Either -$ grep -i lost mydump.txt -which tells you exactly how many events were lost, or use -$ dmesg -to view your kernel log and look for "mmiotrace has lost events" warning. If -events were lost, the trace is incomplete. You should enlarge the buffers and -try again. Buffers are enlarged by first seeing how large the current buffers -are: -$ cat /sys/kernel/debug/tracing/buffer_size_kb -gives you a number. Approximately double this number and write it back, for -instance: -$ echo 128000 > /sys/kernel/debug/tracing/buffer_size_kb -Then start again from the top. - -If you are doing a trace for a driver project, e.g. Nouveau, you should also -do the following before sending your results: -$ lspci -vvv > lspci.txt -$ dmesg > dmesg.txt -$ tar zcf pciid-nick-mmiotrace.tar.gz mydump.txt lspci.txt dmesg.txt -and then send the .tar.gz file. The trace compresses considerably. Replace -"pciid" and "nick" with the PCI ID or model name of your piece of hardware -under investigation and your nickname. - - -How Mmiotrace Works -------------------- - -Access to hardware IO-memory is gained by mapping addresses from PCI bus by -calling one of the ioremap_*() functions. Mmiotrace is hooked into the -__ioremap() function and gets called whenever a mapping is created. Mapping is -an event that is recorded into the trace log. Note that ISA range mappings -are not caught, since the mapping always exists and is returned directly. - -MMIO accesses are recorded via page faults. Just before __ioremap() returns, -the mapped pages are marked as not present. Any access to the pages causes a -fault. The page fault handler calls mmiotrace to handle the fault. Mmiotrace -marks the page present, sets TF flag to achieve single stepping and exits the -fault handler. The instruction that faulted is executed and debug trap is -entered. Here mmiotrace again marks the page as not present. The instruction -is decoded to get the type of operation (read/write), data width and the value -read or written. These are stored to the trace log. - -Setting the page present in the page fault handler has a race condition on SMP -machines. During the single stepping other CPUs may run freely on that page -and events can be missed without a notice. Re-enabling other CPUs during -tracing is discouraged. - - -Trace Log Format ----------------- - -The raw log is text and easily filtered with e.g. grep and awk. One record is -one line in the log. A record starts with a keyword, followed by keyword- -dependent arguments. Arguments are separated by a space, or continue until the -end of line. The format for version 20070824 is as follows: - -Explanation Keyword Space-separated arguments ---------------------------------------------------------------------------- - -read event R width, timestamp, map id, physical, value, PC, PID -write event W width, timestamp, map id, physical, value, PC, PID -ioremap event MAP timestamp, map id, physical, virtual, length, PC, PID -iounmap event UNMAP timestamp, map id, PC, PID -marker MARK timestamp, text -version VERSION the string "20070824" -info for reader LSPCI one line from lspci -v -PCI address map PCIDEV space-separated /proc/bus/pci/devices data -unk. opcode UNKNOWN timestamp, map id, physical, data, PC, PID - -Timestamp is in seconds with decimals. Physical is a PCI bus address, virtual -is a kernel virtual address. Width is the data width in bytes and value is the -data value. Map id is an arbitrary id number identifying the mapping that was -used in an operation. PC is the program counter and PID is process id. PC is -zero if it is not recorded. PID is always zero as tracing MMIO accesses -originating in user space memory is not yet supported. - -For instance, the following awk filter will pass all 32-bit writes that target -physical addresses in the range [0xfb73ce40, 0xfb800000[ - -$ awk '/W 4 / { adr=strtonum($5); if (adr >= 0xfb73ce40 && -adr < 0xfb800000) print; }' - - -Tools for Developers --------------------- - -The user space tools include utilities for: -- replacing numeric addresses and values with hardware register names -- replaying MMIO logs, i.e., re-executing the recorded writes - - diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl deleted file mode 100644 index 0a120aae33c..00000000000 --- a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl +++ /dev/null @@ -1,418 +0,0 @@ -#!/usr/bin/perl -# This is a POC (proof of concept or piece of crap, take your pick) for reading the -# text representation of trace output related to page allocation. It makes an attempt -# to extract some high-level information on what is going on. The accuracy of the parser -# may vary considerably -# -# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe -# other options -# --prepend-parent Report on the parent proc and PID -# --read-procstat If the trace lacks process info, get it from /proc -# --ignore-pid Aggregate processes of the same name together -# -# Copyright (c) IBM Corporation 2009 -# Author: Mel Gorman <mel@csn.ul.ie> -use strict; -use Getopt::Long; - -# Tracepoint events -use constant MM_PAGE_ALLOC => 1; -use constant MM_PAGE_FREE => 2; -use constant MM_PAGE_FREE_BATCHED => 3; -use constant MM_PAGE_PCPU_DRAIN => 4; -use constant MM_PAGE_ALLOC_ZONE_LOCKED => 5; -use constant MM_PAGE_ALLOC_EXTFRAG => 6; -use constant EVENT_UNKNOWN => 7; - -# Constants used to track state -use constant STATE_PCPU_PAGES_DRAINED => 8; -use constant STATE_PCPU_PAGES_REFILLED => 9; - -# High-level events extrapolated from tracepoints -use constant HIGH_PCPU_DRAINS => 10; -use constant HIGH_PCPU_REFILLS => 11; -use constant HIGH_EXT_FRAGMENT => 12; -use constant HIGH_EXT_FRAGMENT_SEVERE => 13; -use constant HIGH_EXT_FRAGMENT_MODERATE => 14; -use constant HIGH_EXT_FRAGMENT_CHANGED => 15; - -my %perprocesspid; -my %perprocess; -my $opt_ignorepid; -my $opt_read_procstat; -my $opt_prepend_parent; - -# Catch sigint and exit on request -my $sigint_report = 0; -my $sigint_exit = 0; -my $sigint_pending = 0; -my $sigint_received = 0; -sub sigint_handler { - my $current_time = time; - if ($current_time - 2 > $sigint_received) { - print "SIGINT received, report pending. Hit ctrl-c again to exit\n"; - $sigint_report = 1; - } else { - if (!$sigint_exit) { - print "Second SIGINT received quickly, exiting\n"; - } - $sigint_exit++; - } - - if ($sigint_exit > 3) { - print "Many SIGINTs received, exiting now without report\n"; - exit; - } - - $sigint_received = $current_time; - $sigint_pending = 1; -} -$SIG{INT} = "sigint_handler"; - -# Parse command line options -GetOptions( - 'ignore-pid' => \$opt_ignorepid, - 'read-procstat' => \$opt_read_procstat, - 'prepend-parent' => \$opt_prepend_parent, -); - -# Defaults for dynamically discovered regex's -my $regex_fragdetails_default = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([-0-9]*) fallback_order=([-0-9]*) pageblock_order=([-0-9]*) alloc_migratetype=([-0-9]*) fallback_migratetype=([-0-9]*) fragmenting=([-0-9]) change_ownership=([-0-9])'; - -# Dyanically discovered regex -my $regex_fragdetails; - -# Static regex used. Specified like this for readability and for use with /o -# (process_pid) (cpus ) ( time ) (tpoint ) (details) -my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)'; -my $regex_statname = '[-0-9]*\s\((.*)\).*'; -my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*'; - -sub generate_traceevent_regex { - my $event = shift; - my $default = shift; - my $regex; - - # Read the event format or use the default - if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) { - $regex = $default; - } else { - my $line; - while (!eof(FORMAT)) { - $line = <FORMAT>; - if ($line =~ /^print fmt:\s"(.*)",.*/) { - $regex = $1; - $regex =~ s/%p/\([0-9a-f]*\)/g; - $regex =~ s/%d/\([-0-9]*\)/g; - $regex =~ s/%lu/\([0-9]*\)/g; - } - } - } - - # Verify fields are in the right order - my $tuple; - foreach $tuple (split /\s/, $regex) { - my ($key, $value) = split(/=/, $tuple); - my $expected = shift; - if ($key ne $expected) { - print("WARNING: Format not as expected '$key' != '$expected'"); - $regex =~ s/$key=\((.*)\)/$key=$1/; - } - } - - if (defined shift) { - die("Fewer fields than expected in format"); - } - - return $regex; -} -$regex_fragdetails = generate_traceevent_regex("kmem/mm_page_alloc_extfrag", - $regex_fragdetails_default, - "page", "pfn", - "alloc_order", "fallback_order", "pageblock_order", - "alloc_migratetype", "fallback_migratetype", - "fragmenting", "change_ownership"); - -sub read_statline($) { - my $pid = $_[0]; - my $statline; - - if (open(STAT, "/proc/$pid/stat")) { - $statline = <STAT>; - close(STAT); - } - - if ($statline eq '') { - $statline = "-1 (UNKNOWN_PROCESS_NAME) R 0"; - } - - return $statline; -} - -sub guess_process_pid($$) { - my $pid = $_[0]; - my $statline = $_[1]; - - if ($pid == 0) { - return "swapper-0"; - } - - if ($statline !~ /$regex_statname/o) { - die("Failed to math stat line for process name :: $statline"); - } - return "$1-$pid"; -} - -sub parent_info($$) { - my $pid = $_[0]; - my $statline = $_[1]; - my $ppid; - - if ($pid == 0) { - return "NOPARENT-0"; - } - - if ($statline !~ /$regex_statppid/o) { - die("Failed to match stat line process ppid:: $statline"); - } - - # Read the ppid stat line - $ppid = $1; - return guess_process_pid($ppid, read_statline($ppid)); -} - -sub process_events { - my $traceevent; - my $process_pid; - my $cpus; - my $timestamp; - my $tracepoint; - my $details; - my $statline; - - # Read each line of the event log -EVENT_PROCESS: - while ($traceevent = <STDIN>) { - if ($traceevent =~ /$regex_traceevent/o) { - $process_pid = $1; - $tracepoint = $4; - - if ($opt_read_procstat || $opt_prepend_parent) { - $process_pid =~ /(.*)-([0-9]*)$/; - my $process = $1; - my $pid = $2; - - $statline = read_statline($pid); - - if ($opt_read_procstat && $process eq '') { - $process_pid = guess_process_pid($pid, $statline); - } - - if ($opt_prepend_parent) { - $process_pid = parent_info($pid, $statline) . " :: $process_pid"; - } - } - - # Unnecessary in this script. Uncomment if required - # $cpus = $2; - # $timestamp = $3; - } else { - next; - } - - # Perl Switch() sucks majorly - if ($tracepoint eq "mm_page_alloc") { - $perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++; - } elsif ($tracepoint eq "mm_page_free") { - $perprocesspid{$process_pid}->{MM_PAGE_FREE}++ - } elsif ($tracepoint eq "mm_page_free_batched") { - $perprocesspid{$process_pid}->{MM_PAGE_FREE_BATCHED}++; - } elsif ($tracepoint eq "mm_page_pcpu_drain") { - $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++; - $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++; - } elsif ($tracepoint eq "mm_page_alloc_zone_locked") { - $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++; - $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++; - } elsif ($tracepoint eq "mm_page_alloc_extfrag") { - - # Extract the details of the event now - $details = $5; - - my ($page, $pfn); - my ($alloc_order, $fallback_order, $pageblock_order); - my ($alloc_migratetype, $fallback_migratetype); - my ($fragmenting, $change_ownership); - - if ($details !~ /$regex_fragdetails/o) { - print "WARNING: Failed to parse mm_page_alloc_extfrag as expected\n"; - next; - } - - $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++; - $page = $1; - $pfn = $2; - $alloc_order = $3; - $fallback_order = $4; - $pageblock_order = $5; - $alloc_migratetype = $6; - $fallback_migratetype = $7; - $fragmenting = $8; - $change_ownership = $9; - - if ($fragmenting) { - $perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++; - if ($fallback_order <= 3) { - $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++; - } else { - $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++; - } - } - if ($change_ownership) { - $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++; - } - } else { - $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++; - } - - # Catch a full pcpu drain event - if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} && - $tracepoint ne "mm_page_pcpu_drain") { - - $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++; - $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0; - } - - # Catch a full pcpu refill event - if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} && - $tracepoint ne "mm_page_alloc_zone_locked") { - $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++; - $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0; - } - - if ($sigint_pending) { - last EVENT_PROCESS; - } - } -} - -sub dump_stats { - my $hashref = shift; - my %stats = %$hashref; - - # Dump per-process stats - my $process_pid; - my $max_strlen = 0; - - # Get the maximum process name - foreach $process_pid (keys %perprocesspid) { - my $len = length($process_pid); - if ($len > $max_strlen) { - $max_strlen = $len; - } - } - $max_strlen += 2; - - printf("\n"); - printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", - "Process", "Pages", "Pages", "Pages", "Pages", "PCPU", "PCPU", "PCPU", "Fragment", "Fragment", "MigType", "Fragment", "Fragment", "Unknown"); - printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", - "details", "allocd", "allocd", "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing", "Changed", "Severe", "Moderate", ""); - - printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", - "", "", "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", ""); - - foreach $process_pid (keys %stats) { - # Dump final aggregates - if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) { - $stats{$process_pid}->{HIGH_PCPU_DRAINS}++; - $stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0; - } - if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) { - $stats{$process_pid}->{HIGH_PCPU_REFILLS}++; - $stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0; - } - - printf("%-" . $max_strlen . "s %8d %10d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d\n", - $process_pid, - $stats{$process_pid}->{MM_PAGE_ALLOC}, - $stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}, - $stats{$process_pid}->{MM_PAGE_FREE}, - $stats{$process_pid}->{MM_PAGE_FREE_BATCHED}, - $stats{$process_pid}->{MM_PAGE_PCPU_DRAIN}, - $stats{$process_pid}->{HIGH_PCPU_DRAINS}, - $stats{$process_pid}->{HIGH_PCPU_REFILLS}, - $stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}, - $stats{$process_pid}->{HIGH_EXT_FRAG}, - $stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}, - $stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}, - $stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}, - $stats{$process_pid}->{EVENT_UNKNOWN}); - } -} - -sub aggregate_perprocesspid() { - my $process_pid; - my $process; - undef %perprocess; - - foreach $process_pid (keys %perprocesspid) { - $process = $process_pid; - $process =~ s/-([0-9])*$//; - if ($process eq '') { - $process = "NO_PROCESS_NAME"; - } - - $perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC}; - $perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}; - $perprocess{$process}->{MM_PAGE_FREE} += $perprocesspid{$process_pid}->{MM_PAGE_FREE}; - $perprocess{$process}->{MM_PAGE_FREE_BATCHED} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_BATCHED}; - $perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}; - $perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}; - $perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}; - $perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}; - $perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG}; - $perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}; - $perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}; - $perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}; - $perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN}; - } -} - -sub report() { - if (!$opt_ignorepid) { - dump_stats(\%perprocesspid); - } else { - aggregate_perprocesspid(); - dump_stats(\%perprocess); - } -} - -# Process events or signals until neither is available -sub signal_loop() { - my $sigint_processed; - do { - $sigint_processed = 0; - process_events(); - - # Handle pending signals if any - if ($sigint_pending) { - my $current_time = time; - - if ($sigint_exit) { - print "Received exit signal\n"; - $sigint_pending = 0; - } - if ($sigint_report) { - if ($current_time >= $sigint_received + 2) { - report(); - $sigint_report = 0; - $sigint_pending = 0; - $sigint_processed = 1; - } - } - } - } while ($sigint_pending || $sigint_processed); -} - -signal_loop(); -report(); diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl deleted file mode 100644 index 4a37c4759cd..00000000000 --- a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl +++ /dev/null @@ -1,714 +0,0 @@ -#!/usr/bin/perl -# This is a POC for reading the text representation of trace output related to -# page reclaim. It makes an attempt to extract some high-level information on -# what is going on. The accuracy of the parser may vary -# -# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe -# other options -# --read-procstat If the trace lacks process info, get it from /proc -# --ignore-pid Aggregate processes of the same name together -# -# Copyright (c) IBM Corporation 2009 -# Author: Mel Gorman <mel@csn.ul.ie> -use strict; -use Getopt::Long; - -# Tracepoint events -use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN => 1; -use constant MM_VMSCAN_DIRECT_RECLAIM_END => 2; -use constant MM_VMSCAN_KSWAPD_WAKE => 3; -use constant MM_VMSCAN_KSWAPD_SLEEP => 4; -use constant MM_VMSCAN_LRU_SHRINK_ACTIVE => 5; -use constant MM_VMSCAN_LRU_SHRINK_INACTIVE => 6; -use constant MM_VMSCAN_LRU_ISOLATE => 7; -use constant MM_VMSCAN_WRITEPAGE_FILE_SYNC => 8; -use constant MM_VMSCAN_WRITEPAGE_ANON_SYNC => 9; -use constant MM_VMSCAN_WRITEPAGE_FILE_ASYNC => 10; -use constant MM_VMSCAN_WRITEPAGE_ANON_ASYNC => 11; -use constant MM_VMSCAN_WRITEPAGE_ASYNC => 12; -use constant EVENT_UNKNOWN => 13; - -# Per-order events -use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11; -use constant MM_VMSCAN_WAKEUP_KSWAPD_PERORDER => 12; -use constant MM_VMSCAN_KSWAPD_WAKE_PERORDER => 13; -use constant HIGH_KSWAPD_REWAKEUP_PERORDER => 14; - -# Constants used to track state -use constant STATE_DIRECT_BEGIN => 15; -use constant STATE_DIRECT_ORDER => 16; -use constant STATE_KSWAPD_BEGIN => 17; -use constant STATE_KSWAPD_ORDER => 18; - -# High-level events extrapolated from tracepoints -use constant HIGH_DIRECT_RECLAIM_LATENCY => 19; -use constant HIGH_KSWAPD_LATENCY => 20; -use constant HIGH_KSWAPD_REWAKEUP => 21; -use constant HIGH_NR_SCANNED => 22; -use constant HIGH_NR_TAKEN => 23; -use constant HIGH_NR_RECLAIMED => 24; -use constant HIGH_NR_CONTIG_DIRTY => 25; - -my %perprocesspid; -my %perprocess; -my %last_procmap; -my $opt_ignorepid; -my $opt_read_procstat; - -my $total_wakeup_kswapd; -my ($total_direct_reclaim, $total_direct_nr_scanned); -my ($total_direct_latency, $total_kswapd_latency); -my ($total_direct_nr_reclaimed); -my ($total_direct_writepage_file_sync, $total_direct_writepage_file_async); -my ($total_direct_writepage_anon_sync, $total_direct_writepage_anon_async); -my ($total_kswapd_nr_scanned, $total_kswapd_wake); -my ($total_kswapd_writepage_file_sync, $total_kswapd_writepage_file_async); -my ($total_kswapd_writepage_anon_sync, $total_kswapd_writepage_anon_async); -my ($total_kswapd_nr_reclaimed); - -# Catch sigint and exit on request -my $sigint_report = 0; -my $sigint_exit = 0; -my $sigint_pending = 0; -my $sigint_received = 0; -sub sigint_handler { - my $current_time = time; - if ($current_time - 2 > $sigint_received) { - print "SIGINT received, report pending. Hit ctrl-c again to exit\n"; - $sigint_report = 1; - } else { - if (!$sigint_exit) { - print "Second SIGINT received quickly, exiting\n"; - } - $sigint_exit++; - } - - if ($sigint_exit > 3) { - print "Many SIGINTs received, exiting now without report\n"; - exit; - } - - $sigint_received = $current_time; - $sigint_pending = 1; -} -$SIG{INT} = "sigint_handler"; - -# Parse command line options -GetOptions( - 'ignore-pid' => \$opt_ignorepid, - 'read-procstat' => \$opt_read_procstat, -); - -# Defaults for dynamically discovered regex's -my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flags=([A-Z_|]*)'; -my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)'; -my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)'; -my $regex_kswapd_sleep_default = 'nid=([0-9]*)'; -my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)'; -my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)'; -my $regex_lru_shrink_inactive_default = 'nid=([0-9]*) zid=([0-9]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*) flags=([A-Z_|]*)'; -my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)'; -my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) flags=([A-Z_|]*)'; - -# Dyanically discovered regex -my $regex_direct_begin; -my $regex_direct_end; -my $regex_kswapd_wake; -my $regex_kswapd_sleep; -my $regex_wakeup_kswapd; -my $regex_lru_isolate; -my $regex_lru_shrink_inactive; -my $regex_lru_shrink_active; -my $regex_writepage; - -# Static regex used. Specified like this for readability and for use with /o -# (process_pid) (cpus ) ( time ) (tpoint ) (details) -my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)'; -my $regex_statname = '[-0-9]*\s\((.*)\).*'; -my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*'; - -sub generate_traceevent_regex { - my $event = shift; - my $default = shift; - my $regex; - - # Read the event format or use the default - if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) { - print("WARNING: Event $event format string not found\n"); - return $default; - } else { - my $line; - while (!eof(FORMAT)) { - $line = <FORMAT>; - $line =~ s/, REC->.*//; - if ($line =~ /^print fmt:\s"(.*)".*/) { - $regex = $1; - $regex =~ s/%s/\([0-9a-zA-Z|_]*\)/g; - $regex =~ s/%p/\([0-9a-f]*\)/g; - $regex =~ s/%d/\([-0-9]*\)/g; - $regex =~ s/%ld/\([-0-9]*\)/g; - $regex =~ s/%lu/\([0-9]*\)/g; - } - } - } - - # Can't handle the print_flags stuff but in the context of this - # script, it really doesn't matter - $regex =~ s/\(REC.*\) \? __print_flags.*//; - - # Verify fields are in the right order - my $tuple; - foreach $tuple (split /\s/, $regex) { - my ($key, $value) = split(/=/, $tuple); - my $expected = shift; - if ($key ne $expected) { - print("WARNING: Format not as expected for event $event '$key' != '$expected'\n"); - $regex =~ s/$key=\((.*)\)/$key=$1/; - } - } - - if (defined shift) { - die("Fewer fields than expected in format"); - } - - return $regex; -} - -$regex_direct_begin = generate_traceevent_regex( - "vmscan/mm_vmscan_direct_reclaim_begin", - $regex_direct_begin_default, - "order", "may_writepage", - "gfp_flags"); -$regex_direct_end = generate_traceevent_regex( - "vmscan/mm_vmscan_direct_reclaim_end", - $regex_direct_end_default, - "nr_reclaimed"); -$regex_kswapd_wake = generate_traceevent_regex( - "vmscan/mm_vmscan_kswapd_wake", - $regex_kswapd_wake_default, - "nid", "order"); -$regex_kswapd_sleep = generate_traceevent_regex( - "vmscan/mm_vmscan_kswapd_sleep", - $regex_kswapd_sleep_default, - "nid"); -$regex_wakeup_kswapd = generate_traceevent_regex( - "vmscan/mm_vmscan_wakeup_kswapd", - $regex_wakeup_kswapd_default, - "nid", "zid", "order"); -$regex_lru_isolate = generate_traceevent_regex( - "vmscan/mm_vmscan_lru_isolate", - $regex_lru_isolate_default, - "isolate_mode", "order", - "nr_requested", "nr_scanned", "nr_taken", - "contig_taken", "contig_dirty", "contig_failed"); -$regex_lru_shrink_inactive = generate_traceevent_regex( - "vmscan/mm_vmscan_lru_shrink_inactive", - $regex_lru_shrink_inactive_default, - "nid", "zid", - "nr_scanned", "nr_reclaimed", "priority", - "flags"); -$regex_lru_shrink_active = generate_traceevent_regex( - "vmscan/mm_vmscan_lru_shrink_active", - $regex_lru_shrink_active_default, - "nid", "zid", - "lru", - "nr_scanned", "nr_rotated", "priority"); -$regex_writepage = generate_traceevent_regex( - "vmscan/mm_vmscan_writepage", - $regex_writepage_default, - "page", "pfn", "flags"); - -sub read_statline($) { - my $pid = $_[0]; - my $statline; - - if (open(STAT, "/proc/$pid/stat")) { - $statline = <STAT>; - close(STAT); - } - - if ($statline eq '') { - $statline = "-1 (UNKNOWN_PROCESS_NAME) R 0"; - } - - return $statline; -} - -sub guess_process_pid($$) { - my $pid = $_[0]; - my $statline = $_[1]; - - if ($pid == 0) { - return "swapper-0"; - } - - if ($statline !~ /$regex_statname/o) { - die("Failed to math stat line for process name :: $statline"); - } - return "$1-$pid"; -} - -# Convert sec.usec timestamp format -sub timestamp_to_ms($) { - my $timestamp = $_[0]; - - my ($sec, $usec) = split (/\./, $timestamp); - return ($sec * 1000) + ($usec / 1000); -} - -sub process_events { - my $traceevent; - my $process_pid; - my $cpus; - my $timestamp; - my $tracepoint; - my $details; - my $statline; - - # Read each line of the event log -EVENT_PROCESS: - while ($traceevent = <STDIN>) { - if ($traceevent =~ /$regex_traceevent/o) { - $process_pid = $1; - $timestamp = $3; - $tracepoint = $4; - - $process_pid =~ /(.*)-([0-9]*)$/; - my $process = $1; - my $pid = $2; - - if ($process eq "") { - $process = $last_procmap{$pid}; - $process_pid = "$process-$pid"; - } - $last_procmap{$pid} = $process; - - if ($opt_read_procstat) { - $statline = read_statline($pid); - if ($opt_read_procstat && $process eq '') { - $process_pid = guess_process_pid($pid, $statline); - } - } - } else { - next; - } - - # Perl Switch() sucks majorly - if ($tracepoint eq "mm_vmscan_direct_reclaim_begin") { - $timestamp = timestamp_to_ms($timestamp); - $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}++; - $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN} = $timestamp; - - $details = $5; - if ($details !~ /$regex_direct_begin/o) { - print "WARNING: Failed to parse mm_vmscan_direct_reclaim_begin as expected\n"; - print " $details\n"; - print " $regex_direct_begin\n"; - next; - } - my $order = $1; - $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]++; - $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER} = $order; - } elsif ($tracepoint eq "mm_vmscan_direct_reclaim_end") { - # Count the event itself - my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}; - $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}++; - - # Record how long direct reclaim took this time - if (defined $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}) { - $timestamp = timestamp_to_ms($timestamp); - my $order = $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER}; - my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}); - $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] = "$order-$latency"; - } - } elsif ($tracepoint eq "mm_vmscan_kswapd_wake") { - $details = $5; - if ($details !~ /$regex_kswapd_wake/o) { - print "WARNING: Failed to parse mm_vmscan_kswapd_wake as expected\n"; - print " $details\n"; - print " $regex_kswapd_wake\n"; - next; - } - - my $order = $2; - $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER} = $order; - if (!$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}) { - $timestamp = timestamp_to_ms($timestamp); - $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}++; - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = $timestamp; - $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]++; - } else { - $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}++; - $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]++; - } - } elsif ($tracepoint eq "mm_vmscan_kswapd_sleep") { - - # Count the event itself - my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}; - $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}++; - - # Record how long kswapd was awake - $timestamp = timestamp_to_ms($timestamp); - my $order = $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER}; - my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}); - $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index] = "$order-$latency"; - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = 0; - } elsif ($tracepoint eq "mm_vmscan_wakeup_kswapd") { - $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}++; - - $details = $5; - if ($details !~ /$regex_wakeup_kswapd/o) { - print "WARNING: Failed to parse mm_vmscan_wakeup_kswapd as expected\n"; - print " $details\n"; - print " $regex_wakeup_kswapd\n"; - next; - } - my $order = $3; - $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]++; - } elsif ($tracepoint eq "mm_vmscan_lru_isolate") { - $details = $5; - if ($details !~ /$regex_lru_isolate/o) { - print "WARNING: Failed to parse mm_vmscan_lru_isolate as expected\n"; - print " $details\n"; - print " $regex_lru_isolate/o\n"; - next; - } - my $isolate_mode = $1; - my $nr_scanned = $4; - my $nr_contig_dirty = $7; - - # To closer match vmstat scanning statistics, only count isolate_both - # and isolate_inactive as scanning. isolate_active is rotation - # isolate_inactive == 1 - # isolate_active == 2 - # isolate_both == 3 - if ($isolate_mode != 2) { - $perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned; - } - $perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty; - } elsif ($tracepoint eq "mm_vmscan_lru_shrink_inactive") { - $details = $5; - if ($details !~ /$regex_lru_shrink_inactive/o) { - print "WARNING: Failed to parse mm_vmscan_lru_shrink_inactive as expected\n"; - print " $details\n"; - print " $regex_lru_shrink_inactive/o\n"; - next; - } - my $nr_reclaimed = $4; - $perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED} += $nr_reclaimed; - } elsif ($tracepoint eq "mm_vmscan_writepage") { - $details = $5; - if ($details !~ /$regex_writepage/o) { - print "WARNING: Failed to parse mm_vmscan_writepage as expected\n"; - print " $details\n"; - print " $regex_writepage\n"; - next; - } - - my $flags = $3; - my $file = 0; - my $sync_io = 0; - if ($flags =~ /RECLAIM_WB_FILE/) { - $file = 1; - } - if ($flags =~ /RECLAIM_WB_SYNC/) { - $sync_io = 1; - } - if ($sync_io) { - if ($file) { - $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}++; - } else { - $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}++; - } - } else { - if ($file) { - $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}++; - } else { - $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}++; - } - } - } else { - $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++; - } - - if ($sigint_pending) { - last EVENT_PROCESS; - } - } -} - -sub dump_stats { - my $hashref = shift; - my %stats = %$hashref; - - # Dump per-process stats - my $process_pid; - my $max_strlen = 0; - - # Get the maximum process name - foreach $process_pid (keys %perprocesspid) { - my $len = length($process_pid); - if ($len > $max_strlen) { - $max_strlen = $len; - } - } - $max_strlen += 2; - - # Work out latencies - printf("\n") if !$opt_ignorepid; - printf("Reclaim latencies expressed as order-latency_in_ms\n") if !$opt_ignorepid; - foreach $process_pid (keys %stats) { - - if (!$stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[0] && - !$stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[0]) { - next; - } - - printf "%-" . $max_strlen . "s ", $process_pid if !$opt_ignorepid; - my $index = 0; - while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] || - defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) { - - if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) { - printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid; - my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]); - $total_direct_latency += $latency; - } else { - printf("%s ", $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) if !$opt_ignorepid; - my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]); - $total_kswapd_latency += $latency; - } - $index++; - } - print "\n" if !$opt_ignorepid; - } - - # Print out process activity - printf("\n"); - printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s\n", "Process", "Direct", "Wokeup", "Pages", "Pages", "Pages", "Pages", "Time"); - printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s\n", "details", "Rclms", "Kswapd", "Scanned", "Rclmed", "Sync-IO", "ASync-IO", "Stalled"); - foreach $process_pid (keys %stats) { - - if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) { - next; - } - - $total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}; - $total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}; - $total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED}; - $total_direct_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED}; - $total_direct_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}; - $total_direct_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}; - $total_direct_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}; - - $total_direct_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}; - - my $index = 0; - my $this_reclaim_delay = 0; - while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) { - my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]); - $this_reclaim_delay += $latency; - $index++; - } - - printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8u %8u %8.3f", - $process_pid, - $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}, - $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}, - $stats{$process_pid}->{HIGH_NR_SCANNED}, - $stats{$process_pid}->{HIGH_NR_RECLAIMED}, - $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}, - $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}, - $this_reclaim_delay / 1000); - - if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) { - print " "; - for (my $order = 0; $order < 20; $order++) { - my $count = $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]; - if ($count != 0) { - print "direct-$order=$count "; - } - } - } - if ($stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}) { - print " "; - for (my $order = 0; $order < 20; $order++) { - my $count = $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]; - if ($count != 0) { - print "wakeup-$order=$count "; - } - } - } - if ($stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY}) { - print " "; - my $count = $stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY}; - if ($count != 0) { - print "contig-dirty=$count "; - } - } - - print "\n"; - } - - # Print out kswapd activity - printf("\n"); - printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Kswapd", "Kswapd", "Order", "Pages", "Pages", "Pages", "Pages"); - printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Rclmed", "Sync-IO", "ASync-IO"); - foreach $process_pid (keys %stats) { - - if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) { - next; - } - - $total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}; - $total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED}; - $total_kswapd_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED}; - $total_kswapd_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}; - $total_kswapd_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}; - $total_kswapd_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}; - $total_kswapd_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}; - - printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8i %8u", - $process_pid, - $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}, - $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}, - $stats{$process_pid}->{HIGH_NR_SCANNED}, - $stats{$process_pid}->{HIGH_NR_RECLAIMED}, - $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}, - $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}); - - if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) { - print " "; - for (my $order = 0; $order < 20; $order++) { - my $count = $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]; - if ($count != 0) { - print "wake-$order=$count "; - } - } - } - if ($stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}) { - print " "; - for (my $order = 0; $order < 20; $order++) { - my $count = $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]; - if ($count != 0) { - print "rewake-$order=$count "; - } - } - } - printf("\n"); - } - - # Print out summaries - $total_direct_latency /= 1000; - $total_kswapd_latency /= 1000; - print "\nSummary\n"; - print "Direct reclaims: $total_direct_reclaim\n"; - print "Direct reclaim pages scanned: $total_direct_nr_scanned\n"; - print "Direct reclaim pages reclaimed: $total_direct_nr_reclaimed\n"; - print "Direct reclaim write file sync I/O: $total_direct_writepage_file_sync\n"; - print "Direct reclaim write anon sync I/O: $total_direct_writepage_anon_sync\n"; - print "Direct reclaim write file async I/O: $total_direct_writepage_file_async\n"; - print "Direct reclaim write anon async I/O: $total_direct_writepage_anon_async\n"; - print "Wake kswapd requests: $total_wakeup_kswapd\n"; - printf "Time stalled direct reclaim: %-1.2f seconds\n", $total_direct_latency; - print "\n"; - print "Kswapd wakeups: $total_kswapd_wake\n"; - print "Kswapd pages scanned: $total_kswapd_nr_scanned\n"; - print "Kswapd pages reclaimed: $total_kswapd_nr_reclaimed\n"; - print "Kswapd reclaim write file sync I/O: $total_kswapd_writepage_file_sync\n"; - print "Kswapd reclaim write anon sync I/O: $total_kswapd_writepage_anon_sync\n"; - print "Kswapd reclaim write file async I/O: $total_kswapd_writepage_file_async\n"; - print "Kswapd reclaim write anon async I/O: $total_kswapd_writepage_anon_async\n"; - printf "Time kswapd awake: %-1.2f seconds\n", $total_kswapd_latency; -} - -sub aggregate_perprocesspid() { - my $process_pid; - my $process; - undef %perprocess; - - foreach $process_pid (keys %perprocesspid) { - $process = $process_pid; - $process =~ s/-([0-9])*$//; - if ($process eq '') { - $process = "NO_PROCESS_NAME"; - } - - $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN} += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}; - $perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE} += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}; - $perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}; - $perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}; - $perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED}; - $perprocess{$process}->{HIGH_NR_RECLAIMED} += $perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED}; - $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}; - $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}; - $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}; - $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}; - - for (my $order = 0; $order < 20; $order++) { - $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]; - $perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]; - $perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]; - - } - - # Aggregate direct reclaim latencies - my $wr_index = $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END}; - my $rd_index = 0; - while (defined $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]) { - $perprocess{$process}->{HIGH_DIRECT_RECLAIM_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]; - $rd_index++; - $wr_index++; - } - $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index; - - # Aggregate kswapd latencies - my $wr_index = $perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP}; - my $rd_index = 0; - while (defined $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]) { - $perprocess{$process}->{HIGH_KSWAPD_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]; - $rd_index++; - $wr_index++; - } - $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index; - } -} - -sub report() { - if (!$opt_ignorepid) { - dump_stats(\%perprocesspid); - } else { - aggregate_perprocesspid(); - dump_stats(\%perprocess); - } -} - -# Process events or signals until neither is available -sub signal_loop() { - my $sigint_processed; - do { - $sigint_processed = 0; - process_events(); - - # Handle pending signals if any - if ($sigint_pending) { - my $current_time = time; - - if ($sigint_exit) { - print "Received exit signal\n"; - $sigint_pending = 0; - } - if ($sigint_report) { - if ($current_time >= $sigint_received + 2) { - report(); - $sigint_report = 0; - $sigint_pending = 0; - $sigint_processed = 1; - } - } - } - } while ($sigint_pending || $sigint_processed); -} - -signal_loop(); -report(); diff --git a/Documentation/trace/ring-buffer-design.txt b/Documentation/trace/ring-buffer-design.txt deleted file mode 100644 index 7d350b49658..00000000000 --- a/Documentation/trace/ring-buffer-design.txt +++ /dev/null @@ -1,955 +0,0 @@ - Lockless Ring Buffer Design - =========================== - -Copyright 2009 Red Hat Inc. - Author: Steven Rostedt <srostedt@redhat.com> - License: The GNU Free Documentation License, Version 1.2 - (dual licensed under the GPL v2) -Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto, - and Frederic Weisbecker. - - -Written for: 2.6.31 - -Terminology used in this Document ---------------------------------- - -tail - where new writes happen in the ring buffer. - -head - where new reads happen in the ring buffer. - -producer - the task that writes into the ring buffer (same as writer) - -writer - same as producer - -consumer - the task that reads from the buffer (same as reader) - -reader - same as consumer. - -reader_page - A page outside the ring buffer used solely (for the most part) - by the reader. - -head_page - a pointer to the page that the reader will use next - -tail_page - a pointer to the page that will be written to next - -commit_page - a pointer to the page with the last finished non-nested write. - -cmpxchg - hardware-assisted atomic transaction that performs the following: - - A = B iff previous A == C - - R = cmpxchg(A, C, B) is saying that we replace A with B if and only if - current A is equal to C, and we put the old (current) A into R - - R gets the previous A regardless if A is updated with B or not. - - To see if the update was successful a compare of R == C may be used. - -The Generic Ring Buffer ------------------------ - -The ring buffer can be used in either an overwrite mode or in -producer/consumer mode. - -Producer/consumer mode is where if the producer were to fill up the -buffer before the consumer could free up anything, the producer -will stop writing to the buffer. This will lose most recent events. - -Overwrite mode is where if the producer were to fill up the buffer -before the consumer could free up anything, the producer will -overwrite the older data. This will lose the oldest events. - -No two writers can write at the same time (on the same per-cpu buffer), -but a writer may interrupt another writer, but it must finish writing -before the previous writer may continue. This is very important to the -algorithm. The writers act like a "stack". The way interrupts works -enforces this behavior. - - - writer1 start - <preempted> writer2 start - <preempted> writer3 start - writer3 finishes - writer2 finishes - writer1 finishes - -This is very much like a writer being preempted by an interrupt and -the interrupt doing a write as well. - -Readers can happen at any time. But no two readers may run at the -same time, nor can a reader preempt/interrupt another reader. A reader -cannot preempt/interrupt a writer, but it may read/consume from the -buffer at the same time as a writer is writing, but the reader must be -on another processor to do so. A reader may read on its own processor -and can be preempted by a writer. - -A writer can preempt a reader, but a reader cannot preempt a writer. -But a reader can read the buffer at the same time (on another processor) -as a writer. - -The ring buffer is made up of a list of pages held together by a linked list. - -At initialization a reader page is allocated for the reader that is not -part of the ring buffer. - -The head_page, tail_page and commit_page are all initialized to point -to the same page. - -The reader page is initialized to have its next pointer pointing to -the head page, and its previous pointer pointing to a page before -the head page. - -The reader has its own page to use. At start up time, this page is -allocated but is not attached to the list. When the reader wants -to read from the buffer, if its page is empty (like it is on start-up), -it will swap its page with the head_page. The old reader page will -become part of the ring buffer and the head_page will be removed. -The page after the inserted page (old reader_page) will become the -new head page. - -Once the new page is given to the reader, the reader could do what -it wants with it, as long as a writer has left that page. - -A sample of how the reader page is swapped: Note this does not -show the head page in the buffer, it is for demonstrating a swap -only. - - +------+ - |reader| RING BUFFER - |page | - +------+ - +---+ +---+ +---+ - | |-->| |-->| | - | |<--| |<--| | - +---+ +---+ +---+ - ^ | ^ | - | +-------------+ | - +-----------------+ - - - +------+ - |reader| RING BUFFER - |page |-------------------+ - +------+ v - | +---+ +---+ +---+ - | | |-->| |-->| | - | | |<--| |<--| |<-+ - | +---+ +---+ +---+ | - | ^ | ^ | | - | | +-------------+ | | - | +-----------------+ | - +------------------------------------+ - - +------+ - |reader| RING BUFFER - |page |-------------------+ - +------+ <---------------+ v - | ^ +---+ +---+ +---+ - | | | |-->| |-->| | - | | | | | |<--| |<-+ - | | +---+ +---+ +---+ | - | | | ^ | | - | | +-------------+ | | - | +-----------------------------+ | - +------------------------------------+ - - +------+ - |buffer| RING BUFFER - |page |-------------------+ - +------+ <---------------+ v - | ^ +---+ +---+ +---+ - | | | | | |-->| | - | | New | | | |<--| |<-+ - | | Reader +---+ +---+ +---+ | - | | page ----^ | | - | | | | - | +-----------------------------+ | - +------------------------------------+ - - - -It is possible that the page swapped is the commit page and the tail page, -if what is in the ring buffer is less than what is held in a buffer page. - - - reader page commit page tail page - | | | - v | | - +---+ | | - | |<----------+ | - | |<------------------------+ - | |------+ - +---+ | - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |--->| |--->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -This case is still valid for this algorithm. -When the writer leaves the page, it simply goes into the ring buffer -since the reader page still points to the next location in the ring -buffer. - - -The main pointers: - - reader page - The page used solely by the reader and is not part - of the ring buffer (may be swapped in) - - head page - the next page in the ring buffer that will be swapped - with the reader page. - - tail page - the page where the next write will take place. - - commit page - the page that last finished a write. - -The commit page only is updated by the outermost writer in the -writer stack. A writer that preempts another writer will not move the -commit page. - -When data is written into the ring buffer, a position is reserved -in the ring buffer and passed back to the writer. When the writer -is finished writing data into that position, it commits the write. - -Another write (or a read) may take place at anytime during this -transaction. If another write happens it must finish before continuing -with the previous write. - - - Write reserve: - - Buffer page - +---------+ - |written | - +---------+ <--- given back to writer (current commit) - |reserved | - +---------+ <--- tail pointer - | empty | - +---------+ - - Write commit: - - Buffer page - +---------+ - |written | - +---------+ - |written | - +---------+ <--- next position for write (current commit) - | empty | - +---------+ - - - If a write happens after the first reserve: - - Buffer page - +---------+ - |written | - +---------+ <-- current commit - |reserved | - +---------+ <--- given back to second writer - |reserved | - +---------+ <--- tail pointer - - After second writer commits: - - - Buffer page - +---------+ - |written | - +---------+ <--(last full commit) - |reserved | - +---------+ - |pending | - |commit | - +---------+ <--- tail pointer - - When the first writer commits: - - Buffer page - +---------+ - |written | - +---------+ - |written | - +---------+ - |written | - +---------+ <--(last full commit and tail pointer) - - -The commit pointer points to the last write location that was -committed without preempting another write. When a write that -preempted another write is committed, it only becomes a pending commit -and will not be a full commit until all writes have been committed. - -The commit page points to the page that has the last full commit. -The tail page points to the page with the last write (before -committing). - -The tail page is always equal to or after the commit page. It may -be several pages ahead. If the tail page catches up to the commit -page then no more writes may take place (regardless of the mode -of the ring buffer: overwrite and produce/consumer). - -The order of pages is: - - head page - commit page - tail page - -Possible scenario: - tail page - head page commit page | - | | | - v v v - +---+ +---+ +---+ +---+ -<---| |--->| |--->| |--->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -There is a special case that the head page is after either the commit page -and possibly the tail page. That is when the commit (and tail) page has been -swapped with the reader page. This is because the head page is always -part of the ring buffer, but the reader page is not. Whenever there -has been less than a full page that has been committed inside the ring buffer, -and a reader swaps out a page, it will be swapping out the commit page. - - - reader page commit page tail page - | | | - v | | - +---+ | | - | |<----------+ | - | |<------------------------+ - | |------+ - +---+ | - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |--->| |--->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - ^ - | - head page - - -In this case, the head page will not move when the tail and commit -move back into the ring buffer. - -The reader cannot swap a page into the ring buffer if the commit page -is still on that page. If the read meets the last commit (real commit -not pending or reserved), then there is nothing more to read. -The buffer is considered empty until another full commit finishes. - -When the tail meets the head page, if the buffer is in overwrite mode, -the head page will be pushed ahead one. If the buffer is in producer/consumer -mode, the write will fail. - -Overwrite mode: - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |--->| |--->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - ^ - | - head page - - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |--->| |--->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - ^ - | - head page - - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |--->| |--->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - ^ - | - head page - -Note, the reader page will still point to the previous head page. -But when a swap takes place, it will use the most recent head page. - - -Making the Ring Buffer Lockless: --------------------------------- - -The main idea behind the lockless algorithm is to combine the moving -of the head_page pointer with the swapping of pages with the reader. -State flags are placed inside the pointer to the page. To do this, -each page must be aligned in memory by 4 bytes. This will allow the 2 -least significant bits of the address to be used as flags, since -they will always be zero for the address. To get the address, -simply mask out the flags. - - MASK = ~3 - - address & MASK - -Two flags will be kept by these two bits: - - HEADER - the page being pointed to is a head page - - UPDATE - the page being pointed to is being updated by a writer - and was or is about to be a head page. - - - reader page - | - v - +---+ - | |------+ - +---+ | - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-H->| |--->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - - -The above pointer "-H->" would have the HEADER flag set. That is -the next page is the next page to be swapped out by the reader. -This pointer means the next page is the head page. - -When the tail page meets the head pointer, it will use cmpxchg to -change the pointer to the UPDATE state: - - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-H->| |--->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-U->| |--->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -"-U->" represents a pointer in the UPDATE state. - -Any access to the reader will need to take some sort of lock to serialize -the readers. But the writers will never take a lock to write to the -ring buffer. This means we only need to worry about a single reader, -and writes only preempt in "stack" formation. - -When the reader tries to swap the page with the ring buffer, it -will also use cmpxchg. If the flag bit in the pointer to the -head page does not have the HEADER flag set, the compare will fail -and the reader will need to look for the new head page and try again. -Note, the flags UPDATE and HEADER are never set at the same time. - -The reader swaps the reader page as follows: - - +------+ - |reader| RING BUFFER - |page | - +------+ - +---+ +---+ +---+ - | |--->| |--->| | - | |<---| |<---| | - +---+ +---+ +---+ - ^ | ^ | - | +---------------+ | - +-----H-------------+ - -The reader sets the reader page next pointer as HEADER to the page after -the head page. - - - +------+ - |reader| RING BUFFER - |page |-------H-----------+ - +------+ v - | +---+ +---+ +---+ - | | |--->| |--->| | - | | |<---| |<---| |<-+ - | +---+ +---+ +---+ | - | ^ | ^ | | - | | +---------------+ | | - | +-----H-------------+ | - +--------------------------------------+ - -It does a cmpxchg with the pointer to the previous head page to make it -point to the reader page. Note that the new pointer does not have the HEADER -flag set. This action atomically moves the head page forward. - - +------+ - |reader| RING BUFFER - |page |-------H-----------+ - +------+ v - | ^ +---+ +---+ +---+ - | | | |-->| |-->| | - | | | |<--| |<--| |<-+ - | | +---+ +---+ +---+ | - | | | ^ | | - | | +-------------+ | | - | +-----------------------------+ | - +------------------------------------+ - -After the new head page is set, the previous pointer of the head page is -updated to the reader page. - - +------+ - |reader| RING BUFFER - |page |-------H-----------+ - +------+ <---------------+ v - | ^ +---+ +---+ +---+ - | | | |-->| |-->| | - | | | | | |<--| |<-+ - | | +---+ +---+ +---+ | - | | | ^ | | - | | +-------------+ | | - | +-----------------------------+ | - +------------------------------------+ - - +------+ - |buffer| RING BUFFER - |page |-------H-----------+ <--- New head page - +------+ <---------------+ v - | ^ +---+ +---+ +---+ - | | | | | |-->| | - | | New | | | |<--| |<-+ - | | Reader +---+ +---+ +---+ | - | | page ----^ | | - | | | | - | +-----------------------------+ | - +------------------------------------+ - -Another important point: The page that the reader page points back to -by its previous pointer (the one that now points to the new head page) -never points back to the reader page. That is because the reader page is -not part of the ring buffer. Traversing the ring buffer via the next pointers -will always stay in the ring buffer. Traversing the ring buffer via the -prev pointers may not. - -Note, the way to determine a reader page is simply by examining the previous -pointer of the page. If the next pointer of the previous page does not -point back to the original page, then the original page is a reader page: - - - +--------+ - | reader | next +----+ - | page |-------->| |<====== (buffer page) - +--------+ +----+ - | | ^ - | v | next - prev | +----+ - +------------->| | - +----+ - -The way the head page moves forward: - -When the tail page meets the head page and the buffer is in overwrite mode -and more writes take place, the head page must be moved forward before the -writer may move the tail page. The way this is done is that the writer -performs a cmpxchg to convert the pointer to the head page from the HEADER -flag to have the UPDATE flag set. Once this is done, the reader will -not be able to swap the head page from the buffer, nor will it be able to -move the head page, until the writer is finished with the move. - -This eliminates any races that the reader can have on the writer. The reader -must spin, and this is why the reader cannot preempt the writer. - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-H->| |--->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-U->| |--->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -The following page will be made into the new head page. - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-U->| |-H->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -After the new head page has been set, we can set the old head page -pointer back to NORMAL. - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |--->| |-H->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -After the head page has been moved, the tail page may now move forward. - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |--->| |-H->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - - -The above are the trivial updates. Now for the more complex scenarios. - - -As stated before, if enough writes preempt the first write, the -tail page may make it all the way around the buffer and meet the commit -page. At this time, we must start dropping writes (usually with some kind -of warning to the user). But what happens if the commit was still on the -reader page? The commit page is not part of the ring buffer. The tail page -must account for this. - - - reader page commit page - | | - v | - +---+ | - | |<----------+ - | | - | |------+ - +---+ | - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-H->| |--->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - ^ - | - tail page - -If the tail page were to simply push the head page forward, the commit when -leaving the reader page would not be pointing to the correct page. - -The solution to this is to test if the commit page is on the reader page -before pushing the head page. If it is, then it can be assumed that the -tail page wrapped the buffer, and we must drop new writes. - -This is not a race condition, because the commit page can only be moved -by the outermost writer (the writer that was preempted). -This means that the commit will not move while a writer is moving the -tail page. The reader cannot swap the reader page if it is also being -used as the commit page. The reader can simply check that the commit -is off the reader page. Once the commit page leaves the reader page -it will never go back on it unless a reader does another swap with the -buffer page that is also the commit page. - - -Nested writes -------------- - -In the pushing forward of the tail page we must first push forward -the head page if the head page is the next page. If the head page -is not the next page, the tail page is simply updated with a cmpxchg. - -Only writers move the tail page. This must be done atomically to protect -against nested writers. - - temp_page = tail_page - next_page = temp_page->next - cmpxchg(tail_page, temp_page, next_page) - -The above will update the tail page if it is still pointing to the expected -page. If this fails, a nested write pushed it forward, the the current write -does not need to push it. - - - temp page - | - v - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |--->| |--->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -Nested write comes in and moves the tail page forward: - - tail page (moved by nested writer) - temp page | - | | - v v - +---+ +---+ +---+ +---+ -<---| |--->| |--->| |--->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -The above would fail the cmpxchg, but since the tail page has already -been moved forward, the writer will just try again to reserve storage -on the new tail page. - -But the moving of the head page is a bit more complex. - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-H->| |--->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -The write converts the head page pointer to UPDATE. - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-U->| |--->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -But if a nested writer preempts here, it will see that the next -page is a head page, but it is also nested. It will detect that -it is nested and will save that information. The detection is the -fact that it sees the UPDATE flag instead of a HEADER or NORMAL -pointer. - -The nested writer will set the new head page pointer. - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-U->| |-H->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -But it will not reset the update back to normal. Only the writer -that converted a pointer from HEAD to UPDATE will convert it back -to NORMAL. - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-U->| |-H->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -After the nested writer finishes, the outermost writer will convert -the UPDATE pointer to NORMAL. - - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |--->| |-H->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - - -It can be even more complex if several nested writes came in and moved -the tail page ahead several pages: - - -(first writer) - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-H->| |--->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -The write converts the head page pointer to UPDATE. - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-U->| |--->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -Next writer comes in, and sees the update and sets up the new -head page. - -(second writer) - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-U->| |-H->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -The nested writer moves the tail page forward. But does not set the old -update page to NORMAL because it is not the outermost writer. - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-U->| |-H->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -Another writer preempts and sees the page after the tail page is a head page. -It changes it from HEAD to UPDATE. - -(third writer) - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-U->| |-U->| |---> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -The writer will move the head page forward: - - -(third writer) - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-U->| |-U->| |-H-> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -But now that the third writer did change the HEAD flag to UPDATE it -will convert it to normal: - - -(third writer) - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-U->| |--->| |-H-> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - - -Then it will move the tail page, and return back to the second writer. - - -(second writer) - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-U->| |--->| |-H-> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - - -The second writer will fail to move the tail page because it was already -moved, so it will try again and add its data to the new tail page. -It will return to the first writer. - - -(first writer) - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-U->| |--->| |-H-> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -The first writer cannot know atomically if the tail page moved -while it updates the HEAD page. It will then update the head page to -what it thinks is the new head page. - - -(first writer) - - tail page - | - v - +---+ +---+ +---+ +---+ -<---| |--->| |-U->| |-H->| |-H-> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -Since the cmpxchg returns the old value of the pointer the first writer -will see it succeeded in updating the pointer from NORMAL to HEAD. -But as we can see, this is not good enough. It must also check to see -if the tail page is either where it use to be or on the next page: - - -(first writer) - - A B tail page - | | | - v v v - +---+ +---+ +---+ +---+ -<---| |--->| |-U->| |-H->| |-H-> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -If tail page != A and tail page != B, then it must reset the pointer -back to NORMAL. The fact that it only needs to worry about nested -writers means that it only needs to check this after setting the HEAD page. - - -(first writer) - - A B tail page - | | | - v v v - +---+ +---+ +---+ +---+ -<---| |--->| |-U->| |--->| |-H-> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - -Now the writer can update the head page. This is also why the head page must -remain in UPDATE and only reset by the outermost writer. This prevents -the reader from seeing the incorrect head page. - - -(first writer) - - A B tail page - | | | - v v v - +---+ +---+ +---+ +---+ -<---| |--->| |--->| |--->| |-H-> ---->| |<---| |<---| |<---| |<--- - +---+ +---+ +---+ +---+ - diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt deleted file mode 100644 index 058cc6c9dc5..00000000000 --- a/Documentation/trace/tracepoint-analysis.txt +++ /dev/null @@ -1,327 +0,0 @@ - Notes on Analysing Behaviour Using Events and Tracepoints - - Documentation written by Mel Gorman - PCL information heavily based on email from Ingo Molnar - -1. Introduction -=============== - -Tracepoints (see Documentation/trace/tracepoints.txt) can be used without -creating custom kernel modules to register probe functions using the event -tracing infrastructure. - -Simplistically, tracepoints represent important events that can be -taken in conjunction with other tracepoints to build a "Big Picture" of -what is going on within the system. There are a large number of methods for -gathering and interpreting these events. Lacking any current Best Practises, -this document describes some of the methods that can be used. - -This document assumes that debugfs is mounted on /sys/kernel/debug and that -the appropriate tracing options have been configured into the kernel. It is -assumed that the PCL tool tools/perf has been installed and is in your path. - -2. Listing Available Events -=========================== - -2.1 Standard Utilities ----------------------- - -All possible events are visible from /sys/kernel/debug/tracing/events. Simply -calling - - $ find /sys/kernel/debug/tracing/events -type d - -will give a fair indication of the number of events available. - -2.2 PCL (Performance Counters for Linux) -------- - -Discovery and enumeration of all counters and events, including tracepoints, -are available with the perf tool. Getting a list of available events is a -simple case of: - - $ perf list 2>&1 | grep Tracepoint - ext4:ext4_free_inode [Tracepoint event] - ext4:ext4_request_inode [Tracepoint event] - ext4:ext4_allocate_inode [Tracepoint event] - ext4:ext4_write_begin [Tracepoint event] - ext4:ext4_ordered_write_end [Tracepoint event] - [ .... remaining output snipped .... ] - - -3. Enabling Events -================== - -3.1 System-Wide Event Enabling ------------------------------- - -See Documentation/trace/events.txt for a proper description on how events -can be enabled system-wide. A short example of enabling all events related -to page allocation would look something like: - - $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done - -3.2 System-Wide Event Enabling with SystemTap ---------------------------------------------- - -In SystemTap, tracepoints are accessible using the kernel.trace() function -call. The following is an example that reports every 5 seconds what processes -were allocating the pages. - - global page_allocs - - probe kernel.trace("mm_page_alloc") { - page_allocs[execname()]++ - } - - function print_count() { - printf ("%-25s %-s\n", "#Pages Allocated", "Process Name") - foreach (proc in page_allocs-) - printf("%-25d %s\n", page_allocs[proc], proc) - printf ("\n") - delete page_allocs - } - - probe timer.s(5) { - print_count() - } - -3.3 System-Wide Event Enabling with PCL ---------------------------------------- - -By specifying the -a switch and analysing sleep, the system-wide events -for a duration of time can be examined. - - $ perf stat -a \ - -e kmem:mm_page_alloc -e kmem:mm_page_free \ - -e kmem:mm_page_free_batched \ - sleep 10 - Performance counter stats for 'sleep 10': - - 9630 kmem:mm_page_alloc - 2143 kmem:mm_page_free - 7424 kmem:mm_page_free_batched - - 10.002577764 seconds time elapsed - -Similarly, one could execute a shell and exit it as desired to get a report -at that point. - -3.4 Local Event Enabling ------------------------- - -Documentation/trace/ftrace.txt describes how to enable events on a per-thread -basis using set_ftrace_pid. - -3.5 Local Event Enablement with PCL ------------------------------------ - -Events can be activated and tracked for the duration of a process on a local -basis using PCL such as follows. - - $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \ - -e kmem:mm_page_free_batched ./hackbench 10 - Time: 0.909 - - Performance counter stats for './hackbench 10': - - 17803 kmem:mm_page_alloc - 12398 kmem:mm_page_free - 4827 kmem:mm_page_free_batched - - 0.973913387 seconds time elapsed - -4. Event Filtering -================== - -Documentation/trace/ftrace.txt covers in-depth how to filter events in -ftrace. Obviously using grep and awk of trace_pipe is an option as well -as any script reading trace_pipe. - -5. Analysing Event Variances with PCL -===================================== - -Any workload can exhibit variances between runs and it can be important -to know what the standard deviation is. By and large, this is left to the -performance analyst to do it by hand. In the event that the discrete event -occurrences are useful to the performance analyst, then perf can be used. - - $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free - -e kmem:mm_page_free_batched ./hackbench 10 - Time: 0.890 - Time: 0.895 - Time: 0.915 - Time: 1.001 - Time: 0.899 - - Performance counter stats for './hackbench 10' (5 runs): - - 16630 kmem:mm_page_alloc ( +- 3.542% ) - 11486 kmem:mm_page_free ( +- 4.771% ) - 4730 kmem:mm_page_free_batched ( +- 2.325% ) - - 0.982653002 seconds time elapsed ( +- 1.448% ) - -In the event that some higher-level event is required that depends on some -aggregation of discrete events, then a script would need to be developed. - -Using --repeat, it is also possible to view how events are fluctuating over -time on a system-wide basis using -a and sleep. - - $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \ - -e kmem:mm_page_free_batched \ - -a --repeat 10 \ - sleep 1 - Performance counter stats for 'sleep 1' (10 runs): - - 1066 kmem:mm_page_alloc ( +- 26.148% ) - 182 kmem:mm_page_free ( +- 5.464% ) - 890 kmem:mm_page_free_batched ( +- 30.079% ) - - 1.002251757 seconds time elapsed ( +- 0.005% ) - -6. Higher-Level Analysis with Helper Scripts -============================================ - -When events are enabled the events that are triggering can be read from -/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary -options exist as well. By post-processing the output, further information can -be gathered on-line as appropriate. Examples of post-processing might include - - o Reading information from /proc for the PID that triggered the event - o Deriving a higher-level event from a series of lower-level events. - o Calculating latencies between two events - -Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example -script that can read trace_pipe from STDIN or a copy of a trace. When used -on-line, it can be interrupted once to generate a report without exiting -and twice to exit. - -Simplistically, the script just reads STDIN and counts up events but it -also can do more such as - - o Derive high-level events from many low-level events. If a number of pages - are freed to the main allocator from the per-CPU lists, it recognises - that as one per-CPU drain even though there is no specific tracepoint - for that event - o It can aggregate based on PID or individual process number - o In the event memory is getting externally fragmented, it reports - on whether the fragmentation event was severe or moderate. - o When receiving an event about a PID, it can record who the parent was so - that if large numbers of events are coming from very short-lived - processes, the parent process responsible for creating all the helpers - can be identified - -7. Lower-Level Analysis with PCL -================================ - -There may also be a requirement to identify what functions within a program -were generating events within the kernel. To begin this sort of analysis, the -data must be recorded. At the time of writing, this required root: - - $ perf record -c 1 \ - -e kmem:mm_page_alloc -e kmem:mm_page_free \ - -e kmem:mm_page_free_batched \ - ./hackbench 10 - Time: 0.894 - [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ] - -Note the use of '-c 1' to set the event period to sample. The default sample -period is quite high to minimise overhead but the information collected can be -very coarse as a result. - -This record outputted a file called perf.data which can be analysed using -perf report. - - $ perf report - # Samples: 30922 - # - # Overhead Command Shared Object - # ........ ......... ................................ - # - 87.27% hackbench [vdso] - 6.85% hackbench /lib/i686/cmov/libc-2.9.so - 2.62% hackbench /lib/ld-2.9.so - 1.52% perf [vdso] - 1.22% hackbench ./hackbench - 0.48% hackbench [kernel] - 0.02% perf /lib/i686/cmov/libc-2.9.so - 0.01% perf /usr/bin/perf - 0.01% perf /lib/ld-2.9.so - 0.00% hackbench /lib/i686/cmov/libpthread-2.9.so - # - # (For more details, try: perf report --sort comm,dso,symbol) - # - -According to this, the vast majority of events triggered on events -within the VDSO. With simple binaries, this will often be the case so let's -take a slightly different example. In the course of writing this, it was -noticed that X was generating an insane amount of page allocations so let's look -at it: - - $ perf record -c 1 -f \ - -e kmem:mm_page_alloc -e kmem:mm_page_free \ - -e kmem:mm_page_free_batched \ - -p `pidof X` - -This was interrupted after a few seconds and - - $ perf report - # Samples: 27666 - # - # Overhead Command Shared Object - # ........ ....... ....................................... - # - 51.95% Xorg [vdso] - 47.95% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 - 0.09% Xorg /lib/i686/cmov/libc-2.9.so - 0.01% Xorg [kernel] - # - # (For more details, try: perf report --sort comm,dso,symbol) - # - -So, almost half of the events are occurring in a library. To get an idea which -symbol: - - $ perf report --sort comm,dso,symbol - # Samples: 27666 - # - # Overhead Command Shared Object Symbol - # ........ ....... ....................................... ...... - # - 51.95% Xorg [vdso] [.] 0x000000ffffe424 - 47.93% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixmanFillsse2 - 0.09% Xorg /lib/i686/cmov/libc-2.9.so [.] _int_malloc - 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixman_region32_copy_f - 0.01% Xorg [kernel] [k] read_hpet - 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path - 0.00% Xorg [kernel] [k] ftrace_trace_userstack - -To see where within the function pixmanFillsse2 things are going wrong: - - $ perf annotate pixmanFillsse2 - [ ... ] - 0.00 : 34eeb: 0f 18 08 prefetcht0 (%eax) - : } - : - : extern __inline void __attribute__((__gnu_inline__, __always_inline__, _ - : _mm_store_si128 (__m128i *__P, __m128i __B) : { - : *__P = __B; - 12.40 : 34eee: 66 0f 7f 80 40 ff ff movdqa %xmm0,-0xc0(%eax) - 0.00 : 34ef5: ff - 12.40 : 34ef6: 66 0f 7f 80 50 ff ff movdqa %xmm0,-0xb0(%eax) - 0.00 : 34efd: ff - 12.39 : 34efe: 66 0f 7f 80 60 ff ff movdqa %xmm0,-0xa0(%eax) - 0.00 : 34f05: ff - 12.67 : 34f06: 66 0f 7f 80 70 ff ff movdqa %xmm0,-0x90(%eax) - 0.00 : 34f0d: ff - 12.58 : 34f0e: 66 0f 7f 40 80 movdqa %xmm0,-0x80(%eax) - 12.31 : 34f13: 66 0f 7f 40 90 movdqa %xmm0,-0x70(%eax) - 12.40 : 34f18: 66 0f 7f 40 a0 movdqa %xmm0,-0x60(%eax) - 12.31 : 34f1d: 66 0f 7f 40 b0 movdqa %xmm0,-0x50(%eax) - -At a glance, it looks like the time is being spent copying pixmaps to -the card. Further investigation would be needed to determine why pixmaps -are being copied around so much but a starting point would be to take an -ancient build of libpixmap out of the library path where it was totally -forgotten about from months ago! diff --git a/Documentation/trace/tracepoints.txt b/Documentation/trace/tracepoints.txt deleted file mode 100644 index c0e1ceed75a..00000000000 --- a/Documentation/trace/tracepoints.txt +++ /dev/null @@ -1,116 +0,0 @@ - Using the Linux Kernel Tracepoints - - Mathieu Desnoyers - - -This document introduces Linux Kernel Tracepoints and their use. It -provides examples of how to insert tracepoints in the kernel and -connect probe functions to them and provides some examples of probe -functions. - - -* Purpose of tracepoints - -A tracepoint placed in code provides a hook to call a function (probe) -that you can provide at runtime. A tracepoint can be "on" (a probe is -connected to it) or "off" (no probe is attached). When a tracepoint is -"off" it has no effect, except for adding a tiny time penalty -(checking a condition for a branch) and space penalty (adding a few -bytes for the function call at the end of the instrumented function -and adds a data structure in a separate section). When a tracepoint -is "on", the function you provide is called each time the tracepoint -is executed, in the execution context of the caller. When the function -provided ends its execution, it returns to the caller (continuing from -the tracepoint site). - -You can put tracepoints at important locations in the code. They are -lightweight hooks that can pass an arbitrary number of parameters, -which prototypes are described in a tracepoint declaration placed in a -header file. - -They can be used for tracing and performance accounting. - - -* Usage - -Two elements are required for tracepoints : - -- A tracepoint definition, placed in a header file. -- The tracepoint statement, in C code. - -In order to use tracepoints, you should include linux/tracepoint.h. - -In include/trace/subsys.h : - -#include <linux/tracepoint.h> - -DECLARE_TRACE(subsys_eventname, - TP_PROTO(int firstarg, struct task_struct *p), - TP_ARGS(firstarg, p)); - -In subsys/file.c (where the tracing statement must be added) : - -#include <trace/subsys.h> - -DEFINE_TRACE(subsys_eventname); - -void somefct(void) -{ - ... - trace_subsys_eventname(arg, task); - ... -} - -Where : -- subsys_eventname is an identifier unique to your event - - subsys is the name of your subsystem. - - eventname is the name of the event to trace. - -- TP_PROTO(int firstarg, struct task_struct *p) is the prototype of the - function called by this tracepoint. - -- TP_ARGS(firstarg, p) are the parameters names, same as found in the - prototype. - -Connecting a function (probe) to a tracepoint is done by providing a -probe (function to call) for the specific tracepoint through -register_trace_subsys_eventname(). Removing a probe is done through -unregister_trace_subsys_eventname(); it will remove the probe. - -tracepoint_synchronize_unregister() must be called before the end of -the module exit function to make sure there is no caller left using -the probe. This, and the fact that preemption is disabled around the -probe call, make sure that probe removal and module unload are safe. -See the "Probe example" section below for a sample probe module. - -The tracepoint mechanism supports inserting multiple instances of the -same tracepoint, but a single definition must be made of a given -tracepoint name over all the kernel to make sure no type conflict will -occur. Name mangling of the tracepoints is done using the prototypes -to make sure typing is correct. Verification of probe type correctness -is done at the registration site by the compiler. Tracepoints can be -put in inline functions, inlined static functions, and unrolled loops -as well as regular functions. - -The naming scheme "subsys_event" is suggested here as a convention -intended to limit collisions. Tracepoint names are global to the -kernel: they are considered as being the same whether they are in the -core kernel image or in modules. - -If the tracepoint has to be used in kernel modules, an -EXPORT_TRACEPOINT_SYMBOL_GPL() or EXPORT_TRACEPOINT_SYMBOL() can be -used to export the defined tracepoints. - -* Probe / tracepoint example - -See the example provided in samples/tracepoints - -Compile them with your kernel. They are built during 'make' (not -'make modules') when CONFIG_SAMPLE_TRACEPOINTS=m. - -Run, as root : -modprobe tracepoint-sample (insmod order is not important) -modprobe tracepoint-probe-sample -cat /proc/tracepoint-sample (returns an expected error) -rmmod tracepoint-sample tracepoint-probe-sample -dmesg diff --git a/Documentation/trace/uprobetracer.txt b/Documentation/trace/uprobetracer.txt deleted file mode 100644 index 24ce6823a09..00000000000 --- a/Documentation/trace/uprobetracer.txt +++ /dev/null @@ -1,113 +0,0 @@ - Uprobe-tracer: Uprobe-based Event Tracing - ========================================= - Documentation written by Srikar Dronamraju - -Overview --------- -Uprobe based trace events are similar to kprobe based trace events. -To enable this feature, build your kernel with CONFIG_UPROBE_EVENT=y. - -Similar to the kprobe-event tracer, this doesn't need to be activated via -current_tracer. Instead of that, add probe points via -/sys/kernel/debug/tracing/uprobe_events, and enable it via -/sys/kernel/debug/tracing/events/uprobes/<EVENT>/enabled. - -However unlike kprobe-event tracer, the uprobe event interface expects the -user to calculate the offset of the probepoint in the object - -Synopsis of uprobe_tracer -------------------------- - p[:[GRP/]EVENT] PATH:SYMBOL[+offs] [FETCHARGS] : Set a probe - - GRP : Group name. If omitted, use "uprobes" for it. - EVENT : Event name. If omitted, the event name is generated - based on SYMBOL+offs. - PATH : path to an executable or a library. - SYMBOL[+offs] : Symbol+offset where the probe is inserted. - - FETCHARGS : Arguments. Each probe can have up to 128 args. - %REG : Fetch register REG - -Event Profiling ---------------- - You can check the total number of probe hits and probe miss-hits via -/sys/kernel/debug/tracing/uprobe_profile. - The first column is event name, the second is the number of probe hits, -the third is the number of probe miss-hits. - -Usage examples --------------- -To add a probe as a new event, write a new definition to uprobe_events -as below. - - echo 'p: /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events - - This sets a uprobe at an offset of 0x4245c0 in the executable /bin/bash - - echo > /sys/kernel/debug/tracing/uprobe_events - - This clears all probe points. - -The following example shows how to dump the instruction pointer and %ax -a register at the probed text address. Here we are trying to probe -function zfree in /bin/zsh - - # cd /sys/kernel/debug/tracing/ - # cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp - 00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh - # objdump -T /bin/zsh | grep -w zfree - 0000000000446420 g DF .text 0000000000000012 Base zfree - -0x46420 is the offset of zfree in object /bin/zsh that is loaded at -0x00400000. Hence the command to probe would be : - - # echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events - -Please note: User has to explicitly calculate the offset of the probepoint -in the object. We can see the events that are registered by looking at the -uprobe_events file. - - # cat uprobe_events - p:uprobes/p_zsh_0x46420 /bin/zsh:0x00046420 arg1=%ip arg2=%ax - -The format of events can be seen by viewing the file events/uprobes/p_zsh_0x46420/format - - # cat events/uprobes/p_zsh_0x46420/format - name: p_zsh_0x46420 - ID: 922 - format: - field:unsigned short common_type; offset:0; size:2; signed:0; - field:unsigned char common_flags; offset:2; size:1; signed:0; - field:unsigned char common_preempt_count; offset:3; size:1; signed:0; - field:int common_pid; offset:4; size:4; signed:1; - field:int common_padding; offset:8; size:4; signed:1; - - field:unsigned long __probe_ip; offset:12; size:4; signed:0; - field:u32 arg1; offset:16; size:4; signed:0; - field:u32 arg2; offset:20; size:4; signed:0; - - print fmt: "(%lx) arg1=%lx arg2=%lx", REC->__probe_ip, REC->arg1, REC->arg2 - -Right after definition, each event is disabled by default. For tracing these -events, you need to enable it by: - - # echo 1 > events/uprobes/enable - -Lets disable the event after sleeping for some time. - # sleep 20 - # echo 0 > events/uprobes/enable - -And you can see the traced information via /sys/kernel/debug/tracing/trace. - - # cat trace - # tracer: nop - # - # TASK-PID CPU# TIMESTAMP FUNCTION - # | | | | | - zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79 - zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79 - zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79 - zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79 - -Each line shows us probes were triggered for a pid 24842 with ip being -0x446421 and contents of ax register being 79. |