summaryrefslogtreecommitdiffstats
path: root/Documentation/trace
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/trace')
-rw-r--r--Documentation/trace/events-kmem.txt107
-rw-r--r--Documentation/trace/events-power.txt90
-rw-r--r--Documentation/trace/events.txt280
-rw-r--r--Documentation/trace/ftrace-design.txt411
-rw-r--r--Documentation/trace/ftrace.txt1848
-rw-r--r--Documentation/trace/function-graph-fold.vim42
-rw-r--r--Documentation/trace/kprobetrace.txt172
-rw-r--r--Documentation/trace/mmiotrace.txt164
-rw-r--r--Documentation/trace/postprocess/trace-pagealloc-postprocess.pl418
-rw-r--r--Documentation/trace/postprocess/trace-vmscan-postprocess.pl714
-rw-r--r--Documentation/trace/ring-buffer-design.txt955
-rw-r--r--Documentation/trace/tracepoint-analysis.txt327
-rw-r--r--Documentation/trace/tracepoints.txt116
-rw-r--r--Documentation/trace/uprobetracer.txt113
14 files changed, 0 insertions, 5757 deletions
diff --git a/Documentation/trace/events-kmem.txt b/Documentation/trace/events-kmem.txt
deleted file mode 100644
index 19480041006..00000000000
--- a/Documentation/trace/events-kmem.txt
+++ /dev/null
@@ -1,107 +0,0 @@
- Subsystem Trace Points: kmem
-
-The kmem tracing system captures events related to object and page allocation
-within the kernel. Broadly speaking there are five major subheadings.
-
- o Slab allocation of small objects of unknown type (kmalloc)
- o Slab allocation of small objects of known type
- o Page allocation
- o Per-CPU Allocator Activity
- o External Fragmentation
-
-This document describes what each of the tracepoints is and why they
-might be useful.
-
-1. Slab allocation of small objects of unknown type
-===================================================
-kmalloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
-kmalloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
-kfree call_site=%lx ptr=%p
-
-Heavy activity for these events may indicate that a specific cache is
-justified, particularly if kmalloc slab pages are getting significantly
-internal fragmented as a result of the allocation pattern. By correlating
-kmalloc with kfree, it may be possible to identify memory leaks and where
-the allocation sites were.
-
-
-2. Slab allocation of small objects of known type
-=================================================
-kmem_cache_alloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
-kmem_cache_alloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
-kmem_cache_free call_site=%lx ptr=%p
-
-These events are similar in usage to the kmalloc-related events except that
-it is likely easier to pin the event down to a specific cache. At the time
-of writing, no information is available on what slab is being allocated from,
-but the call_site can usually be used to extrapolate that information.
-
-3. Page allocation
-==================
-mm_page_alloc page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s
-mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
-mm_page_free page=%p pfn=%lu order=%d
-mm_page_free_batched page=%p pfn=%lu order=%d cold=%d
-
-These four events deal with page allocation and freeing. mm_page_alloc is
-a simple indicator of page allocator activity. Pages may be allocated from
-the per-CPU allocator (high performance) or the buddy allocator.
-
-If pages are allocated directly from the buddy allocator, the
-mm_page_alloc_zone_locked event is triggered. This event is important as high
-amounts of activity imply high activity on the zone->lock. Taking this lock
-impairs performance by disabling interrupts, dirtying cache lines between
-CPUs and serialising many CPUs.
-
-When a page is freed directly by the caller, the only mm_page_free event
-is triggered. Significant amounts of activity here could indicate that the
-callers should be batching their activities.
-
-When pages are freed in batch, the also mm_page_free_batched is triggered.
-Broadly speaking, pages are taken off the LRU lock in bulk and
-freed in batch with a page list. Significant amounts of activity here could
-indicate that the system is under memory pressure and can also indicate
-contention on the zone->lru_lock.
-
-4. Per-CPU Allocator Activity
-=============================
-mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
-mm_page_pcpu_drain page=%p pfn=%lu order=%d cpu=%d migratetype=%d
-
-In front of the page allocator is a per-cpu page allocator. It exists only
-for order-0 pages, reduces contention on the zone->lock and reduces the
-amount of writing on struct page.
-
-When a per-CPU list is empty or pages of the wrong type are allocated,
-the zone->lock will be taken once and the per-CPU list refilled. The event
-triggered is mm_page_alloc_zone_locked for each page allocated with the
-event indicating whether it is for a percpu_refill or not.
-
-When the per-CPU list is too full, a number of pages are freed, each one
-which triggers a mm_page_pcpu_drain event.
-
-The individual nature of the events is so that pages can be tracked
-between allocation and freeing. A number of drain or refill pages that occur
-consecutively imply the zone->lock being taken once. Large amounts of per-CPU
-refills and drains could imply an imbalance between CPUs where too much work
-is being concentrated in one place. It could also indicate that the per-CPU
-lists should be a larger size. Finally, large amounts of refills on one CPU
-and drains on another could be a factor in causing large amounts of cache
-line bounces due to writes between CPUs and worth investigating if pages
-can be allocated and freed on the same CPU through some algorithm change.
-
-5. External Fragmentation
-=========================
-mm_page_alloc_extfrag page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d
-
-External fragmentation affects whether a high-order allocation will be
-successful or not. For some types of hardware, this is important although
-it is avoided where possible. If the system is using huge pages and needs
-to be able to resize the pool over the lifetime of the system, this value
-is important.
-
-Large numbers of this event implies that memory is fragmenting and
-high-order allocations will start failing at some time in the future. One
-means of reducing the occurrence of this event is to increase the size of
-min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where
-pageblock_size is usually the size of the default hugepage size.
diff --git a/Documentation/trace/events-power.txt b/Documentation/trace/events-power.txt
deleted file mode 100644
index cf794af2285..00000000000
--- a/Documentation/trace/events-power.txt
+++ /dev/null
@@ -1,90 +0,0 @@
-
- Subsystem Trace Points: power
-
-The power tracing system captures events related to power transitions
-within the kernel. Broadly speaking there are three major subheadings:
-
- o Power state switch which reports events related to suspend (S-states),
- cpuidle (C-states) and cpufreq (P-states)
- o System clock related changes
- o Power domains related changes and transitions
-
-This document describes what each of the tracepoints is and why they
-might be useful.
-
-Cf. include/trace/events/power.h for the events definitions.
-
-1. Power state switch events
-============================
-
-1.1 New trace API
------------------
-
-A 'cpu' event class gathers the CPU-related events: cpuidle and
-cpufreq.
-
-cpu_idle "state=%lu cpu_id=%lu"
-cpu_frequency "state=%lu cpu_id=%lu"
-
-A suspend event is used to indicate the system going in and out of the
-suspend mode:
-
-machine_suspend "state=%lu"
-
-
-Note: the value of '-1' or '4294967295' for state means an exit from the current state,
-i.e. trace_cpu_idle(4, smp_processor_id()) means that the system
-enters the idle state 4, while trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id())
-means that the system exits the previous idle state.
-
-The event which has 'state=4294967295' in the trace is very important to the user
-space tools which are using it to detect the end of the current state, and so to
-correctly draw the states diagrams and to calculate accurate statistics etc.
-
-1.2 DEPRECATED trace API
-------------------------
-
-A new Kconfig option CONFIG_EVENT_POWER_TRACING_DEPRECATED with the default value of
-'y' has been created. This allows the legacy trace power API to be used conjointly
-with the new trace API.
-The Kconfig option, the old trace API (in include/trace/events/power.h) and the
-old trace points will disappear in a future release (namely 2.6.41).
-
-power_start "type=%lu state=%lu cpu_id=%lu"
-power_frequency "type=%lu state=%lu cpu_id=%lu"
-power_end "cpu_id=%lu"
-
-The 'type' parameter takes one of those macros:
- . POWER_NONE = 0,
- . POWER_CSTATE = 1, /* C-State */
- . POWER_PSTATE = 2, /* Frequency change or DVFS */
-
-The 'state' parameter is set depending on the type:
- . Target C-state for type=POWER_CSTATE,
- . Target frequency for type=POWER_PSTATE,
-
-power_end is used to indicate the exit of a state, corresponding to the latest
-power_start event.
-
-2. Clocks events
-================
-The clock events are used for clock enable/disable and for
-clock rate change.
-
-clock_enable "%s state=%lu cpu_id=%lu"
-clock_disable "%s state=%lu cpu_id=%lu"
-clock_set_rate "%s state=%lu cpu_id=%lu"
-
-The first parameter gives the clock name (e.g. "gpio1_iclk").
-The second parameter is '1' for enable, '0' for disable, the target
-clock rate for set_rate.
-
-3. Power domains events
-=======================
-The power domain events are used for power domains transitions
-
-power_domain_target "%s state=%lu cpu_id=%lu"
-
-The first parameter gives the power domain name (e.g. "mpu_pwrdm").
-The second parameter is the power domain target state.
-
diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt
deleted file mode 100644
index bb24c2a0e87..00000000000
--- a/Documentation/trace/events.txt
+++ /dev/null
@@ -1,280 +0,0 @@
- Event Tracing
-
- Documentation written by Theodore Ts'o
- Updated by Li Zefan and Tom Zanussi
-
-1. Introduction
-===============
-
-Tracepoints (see Documentation/trace/tracepoints.txt) can be used
-without creating custom kernel modules to register probe functions
-using the event tracing infrastructure.
-
-Not all tracepoints can be traced using the event tracing system;
-the kernel developer must provide code snippets which define how the
-tracing information is saved into the tracing buffer, and how the
-tracing information should be printed.
-
-2. Using Event Tracing
-======================
-
-2.1 Via the 'set_event' interface
----------------------------------
-
-The events which are available for tracing can be found in the file
-/sys/kernel/debug/tracing/available_events.
-
-To enable a particular event, such as 'sched_wakeup', simply echo it
-to /sys/kernel/debug/tracing/set_event. For example:
-
- # echo sched_wakeup >> /sys/kernel/debug/tracing/set_event
-
-[ Note: '>>' is necessary, otherwise it will firstly disable
- all the events. ]
-
-To disable an event, echo the event name to the set_event file prefixed
-with an exclamation point:
-
- # echo '!sched_wakeup' >> /sys/kernel/debug/tracing/set_event
-
-To disable all events, echo an empty line to the set_event file:
-
- # echo > /sys/kernel/debug/tracing/set_event
-
-To enable all events, echo '*:*' or '*:' to the set_event file:
-
- # echo *:* > /sys/kernel/debug/tracing/set_event
-
-The events are organized into subsystems, such as ext4, irq, sched,
-etc., and a full event name looks like this: <subsystem>:<event>. The
-subsystem name is optional, but it is displayed in the available_events
-file. All of the events in a subsystem can be specified via the syntax
-"<subsystem>:*"; for example, to enable all irq events, you can use the
-command:
-
- # echo 'irq:*' > /sys/kernel/debug/tracing/set_event
-
-2.2 Via the 'enable' toggle
----------------------------
-
-The events available are also listed in /sys/kernel/debug/tracing/events/ hierarchy
-of directories.
-
-To enable event 'sched_wakeup':
-
- # echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable
-
-To disable it:
-
- # echo 0 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable
-
-To enable all events in sched subsystem:
-
- # echo 1 > /sys/kernel/debug/tracing/events/sched/enable
-
-To enable all events:
-
- # echo 1 > /sys/kernel/debug/tracing/events/enable
-
-When reading one of these enable files, there are four results:
-
- 0 - all events this file affects are disabled
- 1 - all events this file affects are enabled
- X - there is a mixture of events enabled and disabled
- ? - this file does not affect any event
-
-2.3 Boot option
----------------
-
-In order to facilitate early boot debugging, use boot option:
-
- trace_event=[event-list]
-
-event-list is a comma separated list of events. See section 2.1 for event
-format.
-
-3. Defining an event-enabled tracepoint
-=======================================
-
-See The example provided in samples/trace_events
-
-4. Event formats
-================
-
-Each trace event has a 'format' file associated with it that contains
-a description of each field in a logged event. This information can
-be used to parse the binary trace stream, and is also the place to
-find the field names that can be used in event filters (see section 5).
-
-It also displays the format string that will be used to print the
-event in text mode, along with the event name and ID used for
-profiling.
-
-Every event has a set of 'common' fields associated with it; these are
-the fields prefixed with 'common_'. The other fields vary between
-events and correspond to the fields defined in the TRACE_EVENT
-definition for that event.
-
-Each field in the format has the form:
-
- field:field-type field-name; offset:N; size:N;
-
-where offset is the offset of the field in the trace record and size
-is the size of the data item, in bytes.
-
-For example, here's the information displayed for the 'sched_wakeup'
-event:
-
-# cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/format
-
-name: sched_wakeup
-ID: 60
-format:
- field:unsigned short common_type; offset:0; size:2;
- field:unsigned char common_flags; offset:2; size:1;
- field:unsigned char common_preempt_count; offset:3; size:1;
- field:int common_pid; offset:4; size:4;
- field:int common_tgid; offset:8; size:4;
-
- field:char comm[TASK_COMM_LEN]; offset:12; size:16;
- field:pid_t pid; offset:28; size:4;
- field:int prio; offset:32; size:4;
- field:int success; offset:36; size:4;
- field:int cpu; offset:40; size:4;
-
-print fmt: "task %s:%d [%d] success=%d [%03d]", REC->comm, REC->pid,
- REC->prio, REC->success, REC->cpu
-
-This event contains 10 fields, the first 5 common and the remaining 5
-event-specific. All the fields for this event are numeric, except for
-'comm' which is a string, a distinction important for event filtering.
-
-5. Event filtering
-==================
-
-Trace events can be filtered in the kernel by associating boolean
-'filter expressions' with them. As soon as an event is logged into
-the trace buffer, its fields are checked against the filter expression
-associated with that event type. An event with field values that
-'match' the filter will appear in the trace output, and an event whose
-values don't match will be discarded. An event with no filter
-associated with it matches everything, and is the default when no
-filter has been set for an event.
-
-5.1 Expression syntax
----------------------
-
-A filter expression consists of one or more 'predicates' that can be
-combined using the logical operators '&&' and '||'. A predicate is
-simply a clause that compares the value of a field contained within a
-logged event with a constant value and returns either 0 or 1 depending
-on whether the field value matched (1) or didn't match (0):
-
- field-name relational-operator value
-
-Parentheses can be used to provide arbitrary logical groupings and
-double-quotes can be used to prevent the shell from interpreting
-operators as shell metacharacters.
-
-The field-names available for use in filters can be found in the
-'format' files for trace events (see section 4).
-
-The relational-operators depend on the type of the field being tested:
-
-The operators available for numeric fields are:
-
-==, !=, <, <=, >, >=
-
-And for string fields they are:
-
-==, !=
-
-Currently, only exact string matches are supported.
-
-5.2 Setting filters
--------------------
-
-A filter for an individual event is set by writing a filter expression
-to the 'filter' file for the given event.
-
-For example:
-
-# cd /sys/kernel/debug/tracing/events/sched/sched_wakeup
-# echo "common_preempt_count > 4" > filter
-
-A slightly more involved example:
-
-# cd /sys/kernel/debug/tracing/events/signal/signal_generate
-# echo "((sig >= 10 && sig < 15) || sig == 17) && comm != bash" > filter
-
-If there is an error in the expression, you'll get an 'Invalid
-argument' error when setting it, and the erroneous string along with
-an error message can be seen by looking at the filter e.g.:
-
-# cd /sys/kernel/debug/tracing/events/signal/signal_generate
-# echo "((sig >= 10 && sig < 15) || dsig == 17) && comm != bash" > filter
--bash: echo: write error: Invalid argument
-# cat filter
-((sig >= 10 && sig < 15) || dsig == 17) && comm != bash
-^
-parse_error: Field not found
-
-Currently the caret ('^') for an error always appears at the beginning of
-the filter string; the error message should still be useful though
-even without more accurate position info.
-
-5.3 Clearing filters
---------------------
-
-To clear the filter for an event, write a '0' to the event's filter
-file.
-
-To clear the filters for all events in a subsystem, write a '0' to the
-subsystem's filter file.
-
-5.3 Subsystem filters
----------------------
-
-For convenience, filters for every event in a subsystem can be set or
-cleared as a group by writing a filter expression into the filter file
-at the root of the subsystem. Note however, that if a filter for any
-event within the subsystem lacks a field specified in the subsystem
-filter, or if the filter can't be applied for any other reason, the
-filter for that event will retain its previous setting. This can
-result in an unintended mixture of filters which could lead to
-confusing (to the user who might think different filters are in
-effect) trace output. Only filters that reference just the common
-fields can be guaranteed to propagate successfully to all events.
-
-Here are a few subsystem filter examples that also illustrate the
-above points:
-
-Clear the filters on all events in the sched subsystem:
-
-# cd /sys/kernel/debug/tracing/events/sched
-# echo 0 > filter
-# cat sched_switch/filter
-none
-# cat sched_wakeup/filter
-none
-
-Set a filter using only common fields for all events in the sched
-subsystem (all events end up with the same filter):
-
-# cd /sys/kernel/debug/tracing/events/sched
-# echo common_pid == 0 > filter
-# cat sched_switch/filter
-common_pid == 0
-# cat sched_wakeup/filter
-common_pid == 0
-
-Attempt to set a filter using a non-common field for all events in the
-sched subsystem (all events but those that have a prev_pid field retain
-their old filters):
-
-# cd /sys/kernel/debug/tracing/events/sched
-# echo prev_pid == 0 > filter
-# cat sched_switch/filter
-prev_pid == 0
-# cat sched_wakeup/filter
-common_pid == 0
diff --git a/Documentation/trace/ftrace-design.txt b/Documentation/trace/ftrace-design.txt
deleted file mode 100644
index 79fcafc7fd6..00000000000
--- a/Documentation/trace/ftrace-design.txt
+++ /dev/null
@@ -1,411 +0,0 @@
- function tracer guts
- ====================
- By Mike Frysinger
-
-Introduction
-------------
-
-Here we will cover the architecture pieces that the common function tracing
-code relies on for proper functioning. Things are broken down into increasing
-complexity so that you can start simple and at least get basic functionality.
-
-Note that this focuses on architecture implementation details only. If you
-want more explanation of a feature in terms of common code, review the common
-ftrace.txt file.
-
-Ideally, everyone who wishes to retain performance while supporting tracing in
-their kernel should make it all the way to dynamic ftrace support.
-
-
-Prerequisites
--------------
-
-Ftrace relies on these features being implemented:
- STACKTRACE_SUPPORT - implement save_stack_trace()
- TRACE_IRQFLAGS_SUPPORT - implement include/asm/irqflags.h
-
-
-HAVE_FUNCTION_TRACER
---------------------
-
-You will need to implement the mcount and the ftrace_stub functions.
-
-The exact mcount symbol name will depend on your toolchain. Some call it
-"mcount", "_mcount", or even "__mcount". You can probably figure it out by
-running something like:
- $ echo 'main(){}' | gcc -x c -S -o - - -pg | grep mcount
- call mcount
-We'll make the assumption below that the symbol is "mcount" just to keep things
-nice and simple in the examples.
-
-Keep in mind that the ABI that is in effect inside of the mcount function is
-*highly* architecture/toolchain specific. We cannot help you in this regard,
-sorry. Dig up some old documentation and/or find someone more familiar than
-you to bang ideas off of. Typically, register usage (argument/scratch/etc...)
-is a major issue at this point, especially in relation to the location of the
-mcount call (before/after function prologue). You might also want to look at
-how glibc has implemented the mcount function for your architecture. It might
-be (semi-)relevant.
-
-The mcount function should check the function pointer ftrace_trace_function
-to see if it is set to ftrace_stub. If it is, there is nothing for you to do,
-so return immediately. If it isn't, then call that function in the same way
-the mcount function normally calls __mcount_internal -- the first argument is
-the "frompc" while the second argument is the "selfpc" (adjusted to remove the
-size of the mcount call that is embedded in the function).
-
-For example, if the function foo() calls bar(), when the bar() function calls
-mcount(), the arguments mcount() will pass to the tracer are:
- "frompc" - the address bar() will use to return to foo()
- "selfpc" - the address bar() (with mcount() size adjustment)
-
-Also keep in mind that this mcount function will be called *a lot*, so
-optimizing for the default case of no tracer will help the smooth running of
-your system when tracing is disabled. So the start of the mcount function is
-typically the bare minimum with checking things before returning. That also
-means the code flow should usually be kept linear (i.e. no branching in the nop
-case). This is of course an optimization and not a hard requirement.
-
-Here is some pseudo code that should help (these functions should actually be
-implemented in assembly):
-
-void ftrace_stub(void)
-{
- return;
-}
-
-void mcount(void)
-{
- /* save any bare state needed in order to do initial checking */
-
- extern void (*ftrace_trace_function)(unsigned long, unsigned long);
- if (ftrace_trace_function != ftrace_stub)
- goto do_trace;
-
- /* restore any bare state */
-
- return;
-
-do_trace:
-
- /* save all state needed by the ABI (see paragraph above) */
-
- unsigned long frompc = ...;
- unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE;
- ftrace_trace_function(frompc, selfpc);
-
- /* restore all state needed by the ABI */
-}
-
-Don't forget to export mcount for modules !
-extern void mcount(void);
-EXPORT_SYMBOL(mcount);
-
-
-HAVE_FUNCTION_TRACE_MCOUNT_TEST
--------------------------------
-
-This is an optional optimization for the normal case when tracing is turned off
-in the system. If you do not enable this Kconfig option, the common ftrace
-code will take care of doing the checking for you.
-
-To support this feature, you only need to check the function_trace_stop
-variable in the mcount function. If it is non-zero, there is no tracing to be
-done at all, so you can return.
-
-This additional pseudo code would simply be:
-void mcount(void)
-{
- /* save any bare state needed in order to do initial checking */
-
-+ if (function_trace_stop)
-+ return;
-
- extern void (*ftrace_trace_function)(unsigned long, unsigned long);
- if (ftrace_trace_function != ftrace_stub)
-...
-
-
-HAVE_FUNCTION_GRAPH_TRACER
---------------------------
-
-Deep breath ... time to do some real work. Here you will need to update the
-mcount function to check ftrace graph function pointers, as well as implement
-some functions to save (hijack) and restore the return address.
-
-The mcount function should check the function pointers ftrace_graph_return
-(compare to ftrace_stub) and ftrace_graph_entry (compare to
-ftrace_graph_entry_stub). If either of those is not set to the relevant stub
-function, call the arch-specific function ftrace_graph_caller which in turn
-calls the arch-specific function prepare_ftrace_return. Neither of these
-function names is strictly required, but you should use them anyway to stay
-consistent across the architecture ports -- easier to compare & contrast
-things.
-
-The arguments to prepare_ftrace_return are slightly different than what are
-passed to ftrace_trace_function. The second argument "selfpc" is the same,
-but the first argument should be a pointer to the "frompc". Typically this is
-located on the stack. This allows the function to hijack the return address
-temporarily to have it point to the arch-specific function return_to_handler.
-That function will simply call the common ftrace_return_to_handler function and
-that will return the original return address with which you can return to the
-original call site.
-
-Here is the updated mcount pseudo code:
-void mcount(void)
-{
-...
- if (ftrace_trace_function != ftrace_stub)
- goto do_trace;
-
-+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
-+ extern void (*ftrace_graph_return)(...);
-+ extern void (*ftrace_graph_entry)(...);
-+ if (ftrace_graph_return != ftrace_stub ||
-+ ftrace_graph_entry != ftrace_graph_entry_stub)
-+ ftrace_graph_caller();
-+#endif
-
- /* restore any bare state */
-...
-
-Here is the pseudo code for the new ftrace_graph_caller assembly function:
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
-void ftrace_graph_caller(void)
-{
- /* save all state needed by the ABI */
-
- unsigned long *frompc = &...;
- unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE;
- /* passing frame pointer up is optional -- see below */
- prepare_ftrace_return(frompc, selfpc, frame_pointer);
-
- /* restore all state needed by the ABI */
-}
-#endif
-
-For information on how to implement prepare_ftrace_return(), simply look at the
-x86 version (the frame pointer passing is optional; see the next section for
-more information). The only architecture-specific piece in it is the setup of
-the fault recovery table (the asm(...) code). The rest should be the same
-across architectures.
-
-Here is the pseudo code for the new return_to_handler assembly function. Note
-that the ABI that applies here is different from what applies to the mcount
-code. Since you are returning from a function (after the epilogue), you might
-be able to skimp on things saved/restored (usually just registers used to pass
-return values).
-
-#ifdef CONFIG_FUNCTION_GRAPH_TRACER
-void return_to_handler(void)
-{
- /* save all state needed by the ABI (see paragraph above) */
-
- void (*original_return_point)(void) = ftrace_return_to_handler();
-
- /* restore all state needed by the ABI */
-
- /* this is usually either a return or a jump */
- original_return_point();
-}
-#endif
-
-
-HAVE_FUNCTION_GRAPH_FP_TEST
----------------------------
-
-An arch may pass in a unique value (frame pointer) to both the entering and
-exiting of a function. On exit, the value is compared and if it does not
-match, then it will panic the kernel. This is largely a sanity check for bad
-code generation with gcc. If gcc for your port sanely updates the frame
-pointer under different optimization levels, then ignore this option.
-
-However, adding support for it isn't terribly difficult. In your assembly code
-that calls prepare_ftrace_return(), pass the frame pointer as the 3rd argument.
-Then in the C version of that function, do what the x86 port does and pass it
-along to ftrace_push_return_trace() instead of a stub value of 0.
-
-Similarly, when you call ftrace_return_to_handler(), pass it the frame pointer.
-
-
-HAVE_FTRACE_NMI_ENTER
----------------------
-
-If you can't trace NMI functions, then skip this option.
-
-<details to be filled>
-
-
-HAVE_SYSCALL_TRACEPOINTS
-------------------------
-
-You need very few things to get the syscalls tracing in an arch.
-
-- Support HAVE_ARCH_TRACEHOOK (see arch/Kconfig).
-- Have a NR_syscalls variable in <asm/unistd.h> that provides the number
- of syscalls supported by the arch.
-- Support the TIF_SYSCALL_TRACEPOINT thread flags.
-- Put the trace_sys_enter() and trace_sys_exit() tracepoints calls from ptrace
- in the ptrace syscalls tracing path.
-- If the system call table on this arch is more complicated than a simple array
- of addresses of the system calls, implement an arch_syscall_addr to return
- the address of a given system call.
-- If the symbol names of the system calls do not match the function names on
- this arch, define ARCH_HAS_SYSCALL_MATCH_SYM_NAME in asm/ftrace.h and
- implement arch_syscall_match_sym_name with the appropriate logic to return
- true if the function name corresponds with the symbol name.
-- Tag this arch as HAVE_SYSCALL_TRACEPOINTS.
-
-
-HAVE_FTRACE_MCOUNT_RECORD
--------------------------
-
-See scripts/recordmcount.pl for more info. Just fill in the arch-specific
-details for how to locate the addresses of mcount call sites via objdump.
-This option doesn't make much sense without also implementing dynamic ftrace.
-
-
-HAVE_DYNAMIC_FTRACE
--------------------
-
-You will first need HAVE_FTRACE_MCOUNT_RECORD and HAVE_FUNCTION_TRACER, so
-scroll your reader back up if you got over eager.
-
-Once those are out of the way, you will need to implement:
- - asm/ftrace.h:
- - MCOUNT_ADDR
- - ftrace_call_adjust()
- - struct dyn_arch_ftrace{}
- - asm code:
- - mcount() (new stub)
- - ftrace_caller()
- - ftrace_call()
- - ftrace_stub()
- - C code:
- - ftrace_dyn_arch_init()
- - ftrace_make_nop()
- - ftrace_make_call()
- - ftrace_update_ftrace_func()
-
-First you will need to fill out some arch details in your asm/ftrace.h.
-
-Define MCOUNT_ADDR as the address of your mcount symbol similar to:
- #define MCOUNT_ADDR ((unsigned long)mcount)
-Since no one else will have a decl for that function, you will need to:
- extern void mcount(void);
-
-You will also need the helper function ftrace_call_adjust(). Most people
-will be able to stub it out like so:
- static inline unsigned long ftrace_call_adjust(unsigned long addr)
- {
- return addr;
- }
-<details to be filled>
-
-Lastly you will need the custom dyn_arch_ftrace structure. If you need
-some extra state when runtime patching arbitrary call sites, this is the
-place. For now though, create an empty struct:
- struct dyn_arch_ftrace {
- /* No extra data needed */
- };
-
-With the header out of the way, we can fill out the assembly code. While we
-did already create a mcount() function earlier, dynamic ftrace only wants a
-stub function. This is because the mcount() will only be used during boot
-and then all references to it will be patched out never to return. Instead,
-the guts of the old mcount() will be used to create a new ftrace_caller()
-function. Because the two are hard to merge, it will most likely be a lot
-easier to have two separate definitions split up by #ifdefs. Same goes for
-the ftrace_stub() as that will now be inlined in ftrace_caller().
-
-Before we get confused anymore, let's check out some pseudo code so you can
-implement your own stuff in assembly:
-
-void mcount(void)
-{
- return;
-}
-
-void ftrace_caller(void)
-{
- /* implement HAVE_FUNCTION_TRACE_MCOUNT_TEST if you desire */
-
- /* save all state needed by the ABI (see paragraph above) */
-
- unsigned long frompc = ...;
- unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE;
-
-ftrace_call:
- ftrace_stub(frompc, selfpc);
-
- /* restore all state needed by the ABI */
-
-ftrace_stub:
- return;
-}
-
-This might look a little odd at first, but keep in mind that we will be runtime
-patching multiple things. First, only functions that we actually want to trace
-will be patched to call ftrace_caller(). Second, since we only have one tracer
-active at a time, we will patch the ftrace_caller() function itself to call the
-specific tracer in question. That is the point of the ftrace_call label.
-
-With that in mind, let's move on to the C code that will actually be doing the
-runtime patching. You'll need a little knowledge of your arch's opcodes in
-order to make it through the next section.
-
-Every arch has an init callback function. If you need to do something early on
-to initialize some state, this is the time to do that. Otherwise, this simple
-function below should be sufficient for most people:
-
-int __init ftrace_dyn_arch_init(void *data)
-{
- /* return value is done indirectly via data */
- *(unsigned long *)data = 0;
-
- return 0;
-}
-
-There are two functions that are used to do runtime patching of arbitrary
-functions. The first is used to turn the mcount call site into a nop (which
-is what helps us retain runtime performance when not tracing). The second is
-used to turn the mcount call site into a call to an arbitrary location (but
-typically that is ftracer_caller()). See the general function definition in
-linux/ftrace.h for the functions:
- ftrace_make_nop()
- ftrace_make_call()
-The rec->ip value is the address of the mcount call site that was collected
-by the scripts/recordmcount.pl during build time.
-
-The last function is used to do runtime patching of the active tracer. This
-will be modifying the assembly code at the location of the ftrace_call symbol
-inside of the ftrace_caller() function. So you should have sufficient padding
-at that location to support the new function calls you'll be inserting. Some
-people will be using a "call" type instruction while others will be using a
-"branch" type instruction. Specifically, the function is:
- ftrace_update_ftrace_func()
-
-
-HAVE_DYNAMIC_FTRACE + HAVE_FUNCTION_GRAPH_TRACER
-------------------------------------------------
-
-The function grapher needs a few tweaks in order to work with dynamic ftrace.
-Basically, you will need to:
- - update:
- - ftrace_caller()
- - ftrace_graph_call()
- - ftrace_graph_caller()
- - implement:
- - ftrace_enable_ftrace_graph_caller()
- - ftrace_disable_ftrace_graph_caller()
-
-<details to be filled>
-Quick notes:
- - add a nop stub after the ftrace_call location named ftrace_graph_call;
- stub needs to be large enough to support a call to ftrace_graph_caller()
- - update ftrace_graph_caller() to work with being called by the new
- ftrace_caller() since some semantics may have changed
- - ftrace_enable_ftrace_graph_caller() will runtime patch the
- ftrace_graph_call location with a call to ftrace_graph_caller()
- - ftrace_disable_ftrace_graph_caller() will runtime patch the
- ftrace_graph_call location with nops
diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
deleted file mode 100644
index 6f51fed45f2..00000000000
--- a/Documentation/trace/ftrace.txt
+++ /dev/null
@@ -1,1848 +0,0 @@
- ftrace - Function Tracer
- ========================
-
-Copyright 2008 Red Hat Inc.
- Author: Steven Rostedt <srostedt@redhat.com>
- License: The GNU Free Documentation License, Version 1.2
- (dual licensed under the GPL v2)
-Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton,
- John Kacur, and David Teigland.
-Written for: 2.6.28-rc2
-
-Introduction
-------------
-
-Ftrace is an internal tracer designed to help out developers and
-designers of systems to find what is going on inside the kernel.
-It can be used for debugging or analyzing latencies and
-performance issues that take place outside of user-space.
-
-Although ftrace is the function tracer, it also includes an
-infrastructure that allows for other types of tracing. Some of
-the tracers that are currently in ftrace include a tracer to
-trace context switches, the time it takes for a high priority
-task to run after it was woken up, the time interrupts are
-disabled, and more (ftrace allows for tracer plugins, which
-means that the list of tracers can always grow).
-
-
-Implementation Details
-----------------------
-
-See ftrace-design.txt for details for arch porters and such.
-
-
-The File System
----------------
-
-Ftrace uses the debugfs file system to hold the control files as
-well as the files to display output.
-
-When debugfs is configured into the kernel (which selecting any ftrace
-option will do) the directory /sys/kernel/debug will be created. To mount
-this directory, you can add to your /etc/fstab file:
-
- debugfs /sys/kernel/debug debugfs defaults 0 0
-
-Or you can mount it at run time with:
-
- mount -t debugfs nodev /sys/kernel/debug
-
-For quicker access to that directory you may want to make a soft link to
-it:
-
- ln -s /sys/kernel/debug /debug
-
-Any selected ftrace option will also create a directory called tracing
-within the debugfs. The rest of the document will assume that you are in
-the ftrace directory (cd /sys/kernel/debug/tracing) and will only concentrate
-on the files within that directory and not distract from the content with
-the extended "/sys/kernel/debug/tracing" path name.
-
-That's it! (assuming that you have ftrace configured into your kernel)
-
-After mounting the debugfs, you can see a directory called
-"tracing". This directory contains the control and output files
-of ftrace. Here is a list of some of the key files:
-
-
- Note: all time values are in microseconds.
-
- current_tracer:
-
- This is used to set or display the current tracer
- that is configured.
-
- available_tracers:
-
- This holds the different types of tracers that
- have been compiled into the kernel. The
- tracers listed here can be configured by
- echoing their name into current_tracer.
-
- tracing_on:
-
- This sets or displays whether writing to the trace
- ring buffer is enabled. Echo 0 into this file to disable
- the tracer or 1 to enable it.
-
- trace:
-
- This file holds the output of the trace in a human
- readable format (described below).
-
- trace_pipe:
-
- The output is the same as the "trace" file but this
- file is meant to be streamed with live tracing.
- Reads from this file will block until new data is
- retrieved. Unlike the "trace" file, this file is a
- consumer. This means reading from this file causes
- sequential reads to display more current data. Once
- data is read from this file, it is consumed, and
- will not be read again with a sequential read. The
- "trace" file is static, and if the tracer is not
- adding more data,they will display the same
- information every time they are read.
-
- trace_options:
-
- This file lets the user control the amount of data
- that is displayed in one of the above output
- files.
-
- tracing_max_latency:
-
- Some of the tracers record the max latency.
- For example, the time interrupts are disabled.
- This time is saved in this file. The max trace
- will also be stored, and displayed by "trace".
- A new max trace will only be recorded if the
- latency is greater than the value in this
- file. (in microseconds)
-
- buffer_size_kb:
-
- This sets or displays the number of kilobytes each CPU
- buffer can hold. The tracer buffers are the same size
- for each CPU. The displayed number is the size of the
- CPU buffer and not total size of all buffers. The
- trace buffers are allocated in pages (blocks of memory
- that the kernel uses for allocation, usually 4 KB in size).
- If the last page allocated has room for more bytes
- than requested, the rest of the page will be used,
- making the actual allocation bigger than requested.
- ( Note, the size may not be a multiple of the page size
- due to buffer management overhead. )
-
- This can only be updated when the current_tracer
- is set to "nop".
-
- tracing_cpumask:
-
- This is a mask that lets the user only trace
- on specified CPUS. The format is a hex string
- representing the CPUS.
-
- set_ftrace_filter:
-
- When dynamic ftrace is configured in (see the
- section below "dynamic ftrace"), the code is dynamically
- modified (code text rewrite) to disable calling of the
- function profiler (mcount). This lets tracing be configured
- in with practically no overhead in performance. This also
- has a side effect of enabling or disabling specific functions
- to be traced. Echoing names of functions into this file
- will limit the trace to only those functions.
-
- This interface also allows for commands to be used. See the
- "Filter commands" section for more details.
-
- set_ftrace_notrace:
-
- This has an effect opposite to that of
- set_ftrace_filter. Any function that is added here will not
- be traced. If a function exists in both set_ftrace_filter
- and set_ftrace_notrace, the function will _not_ be traced.
-
- set_ftrace_pid:
-
- Have the function tracer only trace a single thread.
-
- set_graph_function:
-
- Set a "trigger" function where tracing should start
- with the function graph tracer (See the section
- "dynamic ftrace" for more details).
-
- available_filter_functions:
-
- This lists the functions that ftrace
- has processed and can trace. These are the function
- names that you can pass to "set_ftrace_filter" or
- "set_ftrace_notrace". (See the section "dynamic ftrace"
- below for more details.)
-
-
-The Tracers
------------
-
-Here is the list of current tracers that may be configured.
-
- "function"
-
- Function call tracer to trace all kernel functions.
-
- "function_graph"
-
- Similar to the function tracer except that the
- function tracer probes the functions on their entry
- whereas the function graph tracer traces on both entry
- and exit of the functions. It then provides the ability
- to draw a graph of function calls similar to C code
- source.
-
- "irqsoff"
-
- Traces the areas that disable interrupts and saves
- the trace with the longest max latency.
- See tracing_max_latency. When a new max is recorded,
- it replaces the old trace. It is best to view this
- trace with the latency-format option enabled.
-
- "preemptoff"
-
- Similar to irqsoff but traces and records the amount of
- time for which preemption is disabled.
-
- "preemptirqsoff"
-
- Similar to irqsoff and preemptoff, but traces and
- records the largest time for which irqs and/or preemption
- is disabled.
-
- "wakeup"
-
- Traces and records the max latency that it takes for
- the highest priority task to get scheduled after
- it has been woken up.
- Traces all tasks as an average developer would expect.
-
- "wakeup_rt"
-
- Traces and records the max latency that it takes for just
- RT tasks (as the current "wakeup" does). This is useful
- for those interested in wake up timings of RT tasks.
-
- "hw-branch-tracer"
-
- Uses the BTS CPU feature on x86 CPUs to traces all
- branches executed.
-
- "nop"
-
- This is the "trace nothing" tracer. To remove all
- tracers from tracing simply echo "nop" into
- current_tracer.
-
-
-Examples of using the tracer
-----------------------------
-
-Here are typical examples of using the tracers when controlling
-them only with the debugfs interface (without using any
-user-land utilities).
-
-Output format:
---------------
-
-Here is an example of the output format of the file "trace"
-
- --------
-# tracer: function
-#
-# TASK-PID CPU# TIMESTAMP FUNCTION
-# | | | | |
- bash-4251 [01] 10152.583854: path_put <-path_walk
- bash-4251 [01] 10152.583855: dput <-path_put
- bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput
- --------
-
-A header is printed with the tracer name that is represented by
-the trace. In this case the tracer is "function". Then a header
-showing the format. Task name "bash", the task PID "4251", the
-CPU that it was running on "01", the timestamp in <secs>.<usecs>
-format, the function name that was traced "path_put" and the
-parent function that called this function "path_walk". The
-timestamp is the time at which the function was entered.
-
-Latency trace format
---------------------
-
-When the latency-format option is enabled, the trace file gives
-somewhat more information to see why a latency happened.
-Here is a typical trace.
-
-# tracer: irqsoff
-#
-irqsoff latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 97 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
- -----------------
- | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
- -----------------
- => started at: apic_timer_interrupt
- => ended at: do_softirq
-
-# _------=> CPU#
-# / _-----=> irqs-off
-# | / _----=> need-resched
-# || / _---=> hardirq/softirq
-# ||| / _--=> preempt-depth
-# |||| /
-# ||||| delay
-# cmd pid ||||| time | caller
-# \ / ||||| \ | /
- <idle>-0 0d..1 0us+: trace_hardirqs_off_thunk (apic_timer_interrupt)
- <idle>-0 0d.s. 97us : __do_softirq (do_softirq)
- <idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq)
-
-
-This shows that the current tracer is "irqsoff" tracing the time
-for which interrupts were disabled. It gives the trace version
-and the version of the kernel upon which this was executed on
-(2.6.26-rc8). Then it displays the max latency in microsecs (97
-us). The number of trace entries displayed and the total number
-recorded (both are three: #3/3). The type of preemption that was
-used (PREEMPT). VP, KP, SP, and HP are always zero and are
-reserved for later use. #P is the number of online CPUS (#P:2).
-
-The task is the process that was running when the latency
-occurred. (swapper pid: 0).
-
-The start and stop (the functions in which the interrupts were
-disabled and enabled respectively) that caused the latencies:
-
- apic_timer_interrupt is where the interrupts were disabled.
- do_softirq is where they were enabled again.
-
-The next lines after the header are the trace itself. The header
-explains which is which.
-
- cmd: The name of the process in the trace.
-
- pid: The PID of that process.
-
- CPU#: The CPU which the process was running on.
-
- irqs-off: 'd' interrupts are disabled. '.' otherwise.
- Note: If the architecture does not support a way to
- read the irq flags variable, an 'X' will always
- be printed here.
-
- need-resched: 'N' task need_resched is set, '.' otherwise.
-
- hardirq/softirq:
- 'H' - hard irq occurred inside a softirq.
- 'h' - hard irq is running
- 's' - soft irq is running
- '.' - normal context.
-
- preempt-depth: The level of preempt_disabled
-
-The above is mostly meaningful for kernel developers.
-
- time: When the latency-format option is enabled, the trace file
- output includes a timestamp relative to the start of the
- trace. This differs from the output when latency-format
- is disabled, which includes an absolute timestamp.
-
- delay: This is just to help catch your eye a bit better. And
- needs to be fixed to be only relative to the same CPU.
- The marks are determined by the difference between this
- current trace and the next trace.
- '!' - greater than preempt_mark_thresh (default 100)
- '+' - greater than 1 microsecond
- ' ' - less than or equal to 1 microsecond.
-
- The rest is the same as the 'trace' file.
-
-
-trace_options
--------------
-
-The trace_options file is used to control what gets printed in
-the trace output. To see what is available, simply cat the file:
-
- cat trace_options
- print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
- noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj
-
-To disable one of the options, echo in the option prepended with
-"no".
-
- echo noprint-parent > trace_options
-
-To enable an option, leave off the "no".
-
- echo sym-offset > trace_options
-
-Here are the available options:
-
- print-parent - On function traces, display the calling (parent)
- function as well as the function being traced.
-
- print-parent:
- bash-4000 [01] 1477.606694: simple_strtoul <-strict_strtoul
-
- noprint-parent:
- bash-4000 [01] 1477.606694: simple_strtoul
-
-
- sym-offset - Display not only the function name, but also the
- offset in the function. For example, instead of
- seeing just "ktime_get", you will see
- "ktime_get+0xb/0x20".
-
- sym-offset:
- bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0
-
- sym-addr - this will also display the function address as well
- as the function name.
-
- sym-addr:
- bash-4000 [01] 1477.606694: simple_strtoul <c0339346>
-
- verbose - This deals with the trace file when the
- latency-format option is enabled.
-
- bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
- (+0.000ms): simple_strtoul (strict_strtoul)
-
- raw - This will display raw numbers. This option is best for
- use with user applications that can translate the raw
- numbers better than having it done in the kernel.
-
- hex - Similar to raw, but the numbers will be in a hexadecimal
- format.
-
- bin - This will print out the formats in raw binary.
-
- block - TBD (needs update)
-
- stacktrace - This is one of the options that changes the trace
- itself. When a trace is recorded, so is the stack
- of functions. This allows for back traces of
- trace sites.
-
- userstacktrace - This option changes the trace. It records a
- stacktrace of the current userspace thread.
-
- sym-userobj - when user stacktrace are enabled, look up which
- object the address belongs to, and print a
- relative address. This is especially useful when
- ASLR is on, otherwise you don't get a chance to
- resolve the address to object/file/line after
- the app is no longer running
-
- The lookup is performed when you read
- trace,trace_pipe. Example:
-
- a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
-x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
-
- sched-tree - trace all tasks that are on the runqueue, at
- every scheduling event. Will add overhead if
- there's a lot of tasks running at once.
-
- latency-format - This option changes the trace. When
- it is enabled, the trace displays
- additional information about the
- latencies, as described in "Latency
- trace format".
-
- overwrite - This controls what happens when the trace buffer is
- full. If "1" (default), the oldest events are
- discarded and overwritten. If "0", then the newest
- events are discarded.
-
-ftrace_enabled
---------------
-
-The following tracers (listed below) give different output
-depending on whether or not the sysctl ftrace_enabled is set. To
-set ftrace_enabled, one can either use the sysctl function or
-set it via the proc file system interface.
-
- sysctl kernel.ftrace_enabled=1
-
- or
-
- echo 1 > /proc/sys/kernel/ftrace_enabled
-
-To disable ftrace_enabled simply replace the '1' with '0' in the
-above commands.
-
-When ftrace_enabled is set the tracers will also record the
-functions that are within the trace. The descriptions of the
-tracers will also show an example with ftrace enabled.
-
-
-irqsoff
--------
-
-When interrupts are disabled, the CPU can not react to any other
-external event (besides NMIs and SMIs). This prevents the timer
-interrupt from triggering or the mouse interrupt from letting
-the kernel know of a new mouse event. The result is a latency
-with the reaction time.
-
-The irqsoff tracer tracks the time for which interrupts are
-disabled. When a new maximum latency is hit, the tracer saves
-the trace leading up to that latency point so that every time a
-new maximum is reached, the old saved trace is discarded and the
-new trace is saved.
-
-To reset the maximum, echo 0 into tracing_max_latency. Here is
-an example:
-
- # echo irqsoff > current_tracer
- # echo latency-format > trace_options
- # echo 0 > tracing_max_latency
- # echo 1 > tracing_on
- # ls -ltr
- [...]
- # echo 0 > tracing_on
- # cat trace
-# tracer: irqsoff
-#
-irqsoff latency trace v1.1.5 on 2.6.26
---------------------------------------------------------------------
- latency: 12 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
- -----------------
- | task: bash-3730 (uid:0 nice:0 policy:0 rt_prio:0)
- -----------------
- => started at: sys_setpgid
- => ended at: sys_setpgid
-
-# _------=> CPU#
-# / _-----=> irqs-off
-# | / _----=> need-resched
-# || / _---=> hardirq/softirq
-# ||| / _--=> preempt-depth
-# |||| /
-# ||||| delay
-# cmd pid ||||| time | caller
-# \ / ||||| \ | /
- bash-3730 1d... 0us : _write_lock_irq (sys_setpgid)
- bash-3730 1d..1 1us+: _write_unlock_irq (sys_setpgid)
- bash-3730 1d..2 14us : trace_hardirqs_on (sys_setpgid)
-
-
-Here we see that that we had a latency of 12 microsecs (which is
-very good). The _write_lock_irq in sys_setpgid disabled
-interrupts. The difference between the 12 and the displayed
-timestamp 14us occurred because the clock was incremented
-between the time of recording the max latency and the time of
-recording the function that had that latency.
-
-Note the above example had ftrace_enabled not set. If we set the
-ftrace_enabled, we get a much larger output:
-
-# tracer: irqsoff
-#
-irqsoff latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 50 us, #101/101, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
- -----------------
- | task: ls-4339 (uid:0 nice:0 policy:0 rt_prio:0)
- -----------------
- => started at: __alloc_pages_internal
- => ended at: __alloc_pages_internal
-
-# _------=> CPU#
-# / _-----=> irqs-off
-# | / _----=> need-resched
-# || / _---=> hardirq/softirq
-# ||| / _--=> preempt-depth
-# |||| /
-# ||||| delay
-# cmd pid ||||| time | caller
-# \ / ||||| \ | /
- ls-4339 0...1 0us+: get_page_from_freelist (__alloc_pages_internal)
- ls-4339 0d..1 3us : rmqueue_bulk (get_page_from_freelist)
- ls-4339 0d..1 3us : _spin_lock (rmqueue_bulk)
- ls-4339 0d..1 4us : add_preempt_count (_spin_lock)
- ls-4339 0d..2 4us : __rmqueue (rmqueue_bulk)
- ls-4339 0d..2 5us : __rmqueue_smallest (__rmqueue)
- ls-4339 0d..2 5us : __mod_zone_page_state (__rmqueue_smallest)
- ls-4339 0d..2 6us : __rmqueue (rmqueue_bulk)
- ls-4339 0d..2 6us : __rmqueue_smallest (__rmqueue)
- ls-4339 0d..2 7us : __mod_zone_page_state (__rmqueue_smallest)
- ls-4339 0d..2 7us : __rmqueue (rmqueue_bulk)
- ls-4339 0d..2 8us : __rmqueue_smallest (__rmqueue)
-[...]
- ls-4339 0d..2 46us : __rmqueue_smallest (__rmqueue)
- ls-4339 0d..2 47us : __mod_zone_page_state (__rmqueue_smallest)
- ls-4339 0d..2 47us : __rmqueue (rmqueue_bulk)
- ls-4339 0d..2 48us : __rmqueue_smallest (__rmqueue)
- ls-4339 0d..2 48us : __mod_zone_page_state (__rmqueue_smallest)
- ls-4339 0d..2 49us : _spin_unlock (rmqueue_bulk)
- ls-4339 0d..2 49us : sub_preempt_count (_spin_unlock)
- ls-4339 0d..1 50us : get_page_from_freelist (__alloc_pages_internal)
- ls-4339 0d..2 51us : trace_hardirqs_on (__alloc_pages_internal)
-
-
-
-Here we traced a 50 microsecond latency. But we also see all the
-functions that were called during that time. Note that by
-enabling function tracing, we incur an added overhead. This
-overhead may extend the latency times. But nevertheless, this
-trace has provided some very helpful debugging information.
-
-
-preemptoff
-----------
-
-When preemption is disabled, we may be able to receive
-interrupts but the task cannot be preempted and a higher
-priority task must wait for preemption to be enabled again
-before it can preempt a lower priority task.
-
-The preemptoff tracer traces the places that disable preemption.
-Like the irqsoff tracer, it records the maximum latency for
-which preemption was disabled. The control of preemptoff tracer
-is much like the irqsoff tracer.
-
- # echo preemptoff > current_tracer
- # echo latency-format > trace_options
- # echo 0 > tracing_max_latency
- # echo 1 > tracing_on
- # ls -ltr
- [...]
- # echo 0 > tracing_on
- # cat trace
-# tracer: preemptoff
-#
-preemptoff latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 29 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
- -----------------
- | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
- -----------------
- => started at: do_IRQ
- => ended at: __do_softirq
-
-# _------=> CPU#
-# / _-----=> irqs-off
-# | / _----=> need-resched
-# || / _---=> hardirq/softirq
-# ||| / _--=> preempt-depth
-# |||| /
-# ||||| delay
-# cmd pid ||||| time | caller
-# \ / ||||| \ | /
- sshd-4261 0d.h. 0us+: irq_enter (do_IRQ)
- sshd-4261 0d.s. 29us : _local_bh_enable (__do_softirq)
- sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq)
-
-
-This has some more changes. Preemption was disabled when an
-interrupt came in (notice the 'h'), and was enabled while doing
-a softirq. (notice the 's'). But we also see that interrupts
-have been disabled when entering the preempt off section and
-leaving it (the 'd'). We do not know if interrupts were enabled
-in the mean time.
-
-# tracer: preemptoff
-#
-preemptoff latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 63 us, #87/87, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
- -----------------
- | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
- -----------------
- => started at: remove_wait_queue
- => ended at: __do_softirq
-
-# _------=> CPU#
-# / _-----=> irqs-off
-# | / _----=> need-resched
-# || / _---=> hardirq/softirq
-# ||| / _--=> preempt-depth
-# |||| /
-# ||||| delay
-# cmd pid ||||| time | caller
-# \ / ||||| \ | /
- sshd-4261 0d..1 0us : _spin_lock_irqsave (remove_wait_queue)
- sshd-4261 0d..1 1us : _spin_unlock_irqrestore (remove_wait_queue)
- sshd-4261 0d..1 2us : do_IRQ (common_interrupt)
- sshd-4261 0d..1 2us : irq_enter (do_IRQ)
- sshd-4261 0d..1 2us : idle_cpu (irq_enter)
- sshd-4261 0d..1 3us : add_preempt_count (irq_enter)
- sshd-4261 0d.h1 3us : idle_cpu (irq_enter)
- sshd-4261 0d.h. 4us : handle_fasteoi_irq (do_IRQ)
-[...]
- sshd-4261 0d.h. 12us : add_preempt_count (_spin_lock)
- sshd-4261 0d.h1 12us : ack_ioapic_quirk_irq (handle_fasteoi_irq)
- sshd-4261 0d.h1 13us : move_native_irq (ack_ioapic_quirk_irq)
- sshd-4261 0d.h1 13us : _spin_unlock (handle_fasteoi_irq)
- sshd-4261 0d.h1 14us : sub_preempt_count (_spin_unlock)
- sshd-4261 0d.h1 14us : irq_exit (do_IRQ)
- sshd-4261 0d.h1 15us : sub_preempt_count (irq_exit)
- sshd-4261 0d..2 15us : do_softirq (irq_exit)
- sshd-4261 0d... 15us : __do_softirq (do_softirq)
- sshd-4261 0d... 16us : __local_bh_disable (__do_softirq)
- sshd-4261 0d... 16us+: add_preempt_count (__local_bh_disable)
- sshd-4261 0d.s4 20us : add_preempt_count (__local_bh_disable)
- sshd-4261 0d.s4 21us : sub_preempt_count (local_bh_enable)
- sshd-4261 0d.s5 21us : sub_preempt_count (local_bh_enable)
-[...]
- sshd-4261 0d.s6 41us : add_preempt_count (__local_bh_disable)
- sshd-4261 0d.s6 42us : sub_preempt_count (local_bh_enable)
- sshd-4261 0d.s7 42us : sub_preempt_count (local_bh_enable)
- sshd-4261 0d.s5 43us : add_preempt_count (__local_bh_disable)
- sshd-4261 0d.s5 43us : sub_preempt_count (local_bh_enable_ip)
- sshd-4261 0d.s6 44us : sub_preempt_count (local_bh_enable_ip)
- sshd-4261 0d.s5 44us : add_preempt_count (__local_bh_disable)
- sshd-4261 0d.s5 45us : sub_preempt_count (local_bh_enable)
-[...]
- sshd-4261 0d.s. 63us : _local_bh_enable (__do_softirq)
- sshd-4261 0d.s1 64us : trace_preempt_on (__do_softirq)
-
-
-The above is an example of the preemptoff trace with
-ftrace_enabled set. Here we see that interrupts were disabled
-the entire time. The irq_enter code lets us know that we entered
-an interrupt 'h'. Before that, the functions being traced still
-show that it is not in an interrupt, but we can see from the
-functions themselves that this is not the case.
-
-Notice that __do_softirq when called does not have a
-preempt_count. It may seem that we missed a preempt enabling.
-What really happened is that the preempt count is held on the
-thread's stack and we switched to the softirq stack (4K stacks
-in effect). The code does not copy the preempt count, but
-because interrupts are disabled, we do not need to worry about
-it. Having a tracer like this is good for letting people know
-what really happens inside the kernel.
-
-
-preemptirqsoff
---------------
-
-Knowing the locations that have interrupts disabled or
-preemption disabled for the longest times is helpful. But
-sometimes we would like to know when either preemption and/or
-interrupts are disabled.
-
-Consider the following code:
-
- local_irq_disable();
- call_function_with_irqs_off();
- preempt_disable();
- call_function_with_irqs_and_preemption_off();
- local_irq_enable();
- call_function_with_preemption_off();
- preempt_enable();
-
-The irqsoff tracer will record the total length of
-call_function_with_irqs_off() and
-call_function_with_irqs_and_preemption_off().
-
-The preemptoff tracer will record the total length of
-call_function_with_irqs_and_preemption_off() and
-call_function_with_preemption_off().
-
-But neither will trace the time that interrupts and/or
-preemption is disabled. This total time is the time that we can
-not schedule. To record this time, use the preemptirqsoff
-tracer.
-
-Again, using this trace is much like the irqsoff and preemptoff
-tracers.
-
- # echo preemptirqsoff > current_tracer
- # echo latency-format > trace_options
- # echo 0 > tracing_max_latency
- # echo 1 > tracing_on
- # ls -ltr
- [...]
- # echo 0 > tracing_on
- # cat trace
-# tracer: preemptirqsoff
-#
-preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 293 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
- -----------------
- | task: ls-4860 (uid:0 nice:0 policy:0 rt_prio:0)
- -----------------
- => started at: apic_timer_interrupt
- => ended at: __do_softirq
-
-# _------=> CPU#
-# / _-----=> irqs-off
-# | / _----=> need-resched
-# || / _---=> hardirq/softirq
-# ||| / _--=> preempt-depth
-# |||| /
-# ||||| delay
-# cmd pid ||||| time | caller
-# \ / ||||| \ | /
- ls-4860 0d... 0us!: trace_hardirqs_off_thunk (apic_timer_interrupt)
- ls-4860 0d.s. 294us : _local_bh_enable (__do_softirq)
- ls-4860 0d.s1 294us : trace_preempt_on (__do_softirq)
-
-
-
-The trace_hardirqs_off_thunk is called from assembly on x86 when
-interrupts are disabled in the assembly code. Without the
-function tracing, we do not know if interrupts were enabled
-within the preemption points. We do see that it started with
-preemption enabled.
-
-Here is a trace with ftrace_enabled set:
-
-
-# tracer: preemptirqsoff
-#
-preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 105 us, #183/183, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
- -----------------
- | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
- -----------------
- => started at: write_chan
- => ended at: __do_softirq
-
-# _------=> CPU#
-# / _-----=> irqs-off
-# | / _----=> need-resched
-# || / _---=> hardirq/softirq
-# ||| / _--=> preempt-depth
-# |||| /
-# ||||| delay
-# cmd pid ||||| time | caller
-# \ / ||||| \ | /
- ls-4473 0.N.. 0us : preempt_schedule (write_chan)
- ls-4473 0dN.1 1us : _spin_lock (schedule)
- ls-4473 0dN.1 2us : add_preempt_count (_spin_lock)
- ls-4473 0d..2 2us : put_prev_task_fair (schedule)
-[...]
- ls-4473 0d..2 13us : set_normalized_timespec (ktime_get_ts)
- ls-4473 0d..2 13us : __switch_to (schedule)
- sshd-4261 0d..2 14us : finish_task_switch (schedule)
- sshd-4261 0d..2 14us : _spin_unlock_irq (finish_task_switch)
- sshd-4261 0d..1 15us : add_preempt_count (_spin_lock_irqsave)
- sshd-4261 0d..2 16us : _spin_unlock_irqrestore (hrtick_set)
- sshd-4261 0d..2 16us : do_IRQ (common_interrupt)
- sshd-4261 0d..2 17us : irq_enter (do_IRQ)
- sshd-4261 0d..2 17us : idle_cpu (irq_enter)
- sshd-4261 0d..2 18us : add_preempt_count (irq_enter)
- sshd-4261 0d.h2 18us : idle_cpu (irq_enter)
- sshd-4261 0d.h. 18us : handle_fasteoi_irq (do_IRQ)
- sshd-4261 0d.h. 19us : _spin_lock (handle_fasteoi_irq)
- sshd-4261 0d.h. 19us : add_preempt_count (_spin_lock)
- sshd-4261 0d.h1 20us : _spin_unlock (handle_fasteoi_irq)
- sshd-4261 0d.h1 20us : sub_preempt_count (_spin_unlock)
-[...]
- sshd-4261 0d.h1 28us : _spin_unlock (handle_fasteoi_irq)
- sshd-4261 0d.h1 29us : sub_preempt_count (_spin_unlock)
- sshd-4261 0d.h2 29us : irq_exit (do_IRQ)
- sshd-4261 0d.h2 29us : sub_preempt_count (irq_exit)
- sshd-4261 0d..3 30us : do_softirq (irq_exit)
- sshd-4261 0d... 30us : __do_softirq (do_softirq)
- sshd-4261 0d... 31us : __local_bh_disable (__do_softirq)
- sshd-4261 0d... 31us+: add_preempt_count (__local_bh_disable)
- sshd-4261 0d.s4 34us : add_preempt_count (__local_bh_disable)
-[...]
- sshd-4261 0d.s3 43us : sub_preempt_count (local_bh_enable_ip)
- sshd-4261 0d.s4 44us : sub_preempt_count (local_bh_enable_ip)
- sshd-4261 0d.s3 44us : smp_apic_timer_interrupt (apic_timer_interrupt)
- sshd-4261 0d.s3 45us : irq_enter (smp_apic_timer_interrupt)
- sshd-4261 0d.s3 45us : idle_cpu (irq_enter)
- sshd-4261 0d.s3 46us : add_preempt_count (irq_enter)
- sshd-4261 0d.H3 46us : idle_cpu (irq_enter)
- sshd-4261 0d.H3 47us : hrtimer_interrupt (smp_apic_timer_interrupt)
- sshd-4261 0d.H3 47us : ktime_get (hrtimer_interrupt)
-[...]
- sshd-4261 0d.H3 81us : tick_program_event (hrtimer_interrupt)
- sshd-4261 0d.H3 82us : ktime_get (tick_program_event)
- sshd-4261 0d.H3 82us : ktime_get_ts (ktime_get)
- sshd-4261 0d.H3 83us : getnstimeofday (ktime_get_ts)
- sshd-4261 0d.H3 83us : set_normalized_timespec (ktime_get_ts)
- sshd-4261 0d.H3 84us : clockevents_program_event (tick_program_event)
- sshd-4261 0d.H3 84us : lapic_next_event (clockevents_program_event)
- sshd-4261 0d.H3 85us : irq_exit (smp_apic_timer_interrupt)
- sshd-4261 0d.H3 85us : sub_preempt_count (irq_exit)
- sshd-4261 0d.s4 86us : sub_preempt_count (irq_exit)
- sshd-4261 0d.s3 86us : add_preempt_count (__local_bh_disable)
-[...]
- sshd-4261 0d.s1 98us : sub_preempt_count (net_rx_action)
- sshd-4261 0d.s. 99us : add_preempt_count (_spin_lock_irq)
- sshd-4261 0d.s1 99us+: _spin_unlock_irq (run_timer_softirq)
- sshd-4261 0d.s. 104us : _local_bh_enable (__do_softirq)
- sshd-4261 0d.s. 104us : sub_preempt_count (_local_bh_enable)
- sshd-4261 0d.s. 105us : _local_bh_enable (__do_softirq)
- sshd-4261 0d.s1 105us : trace_preempt_on (__do_softirq)
-
-
-This is a very interesting trace. It started with the preemption
-of the ls task. We see that the task had the "need_resched" bit
-set via the 'N' in the trace. Interrupts were disabled before
-the spin_lock at the beginning of the trace. We see that a
-schedule took place to run sshd. When the interrupts were
-enabled, we took an interrupt. On return from the interrupt
-handler, the softirq ran. We took another interrupt while
-running the softirq as we see from the capital 'H'.
-
-
-wakeup
-------
-
-In a Real-Time environment it is very important to know the
-wakeup time it takes for the highest priority task that is woken
-up to the time that it executes. This is also known as "schedule
-latency". I stress the point that this is about RT tasks. It is
-also important to know the scheduling latency of non-RT tasks,
-but the average schedule latency is better for non-RT tasks.
-Tools like LatencyTop are more appropriate for such
-measurements.
-
-Real-Time environments are interested in the worst case latency.
-That is the longest latency it takes for something to happen,
-and not the average. We can have a very fast scheduler that may
-only have a large latency once in a while, but that would not
-work well with Real-Time tasks. The wakeup tracer was designed
-to record the worst case wakeups of RT tasks. Non-RT tasks are
-not recorded because the tracer only records one worst case and
-tracing non-RT tasks that are unpredictable will overwrite the
-worst case latency of RT tasks.
-
-Since this tracer only deals with RT tasks, we will run this
-slightly differently than we did with the previous tracers.
-Instead of performing an 'ls', we will run 'sleep 1' under
-'chrt' which changes the priority of the task.
-
- # echo wakeup > current_tracer
- # echo latency-format > trace_options
- # echo 0 > tracing_max_latency
- # echo 1 > tracing_on
- # chrt -f 5 sleep 1
- # echo 0 > tracing_on
- # cat trace
-# tracer: wakeup
-#
-wakeup latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 4 us, #2/2, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
- -----------------
- | task: sleep-4901 (uid:0 nice:0 policy:1 rt_prio:5)
- -----------------
-
-# _------=> CPU#
-# / _-----=> irqs-off
-# | / _----=> need-resched
-# || / _---=> hardirq/softirq
-# ||| / _--=> preempt-depth
-# |||| /
-# ||||| delay
-# cmd pid ||||| time | caller
-# \ / ||||| \ | /
- <idle>-0 1d.h4 0us+: try_to_wake_up (wake_up_process)
- <idle>-0 1d..4 4us : schedule (cpu_idle)
-
-
-Running this on an idle system, we see that it only took 4
-microseconds to perform the task switch. Note, since the trace
-marker in the schedule is before the actual "switch", we stop
-the tracing when the recorded task is about to schedule in. This
-may change if we add a new marker at the end of the scheduler.
-
-Notice that the recorded task is 'sleep' with the PID of 4901
-and it has an rt_prio of 5. This priority is user-space priority
-and not the internal kernel priority. The policy is 1 for
-SCHED_FIFO and 2 for SCHED_RR.
-
-Doing the same with chrt -r 5 and ftrace_enabled set.
-
-# tracer: wakeup
-#
-wakeup latency trace v1.1.5 on 2.6.26-rc8
---------------------------------------------------------------------
- latency: 50 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
- -----------------
- | task: sleep-4068 (uid:0 nice:0 policy:2 rt_prio:5)
- -----------------
-
-# _------=> CPU#
-# / _-----=> irqs-off
-# | / _----=> need-resched
-# || / _---=> hardirq/softirq
-# ||| / _--=> preempt-depth
-# |||| /
-# ||||| delay
-# cmd pid ||||| time | caller
-# \ / ||||| \ | /
-ksoftirq-7 1d.H3 0us : try_to_wake_up (wake_up_process)
-ksoftirq-7 1d.H4 1us : sub_preempt_count (marker_probe_cb)
-ksoftirq-7 1d.H3 2us : check_preempt_wakeup (try_to_wake_up)
-ksoftirq-7 1d.H3 3us : update_curr (check_preempt_wakeup)
-ksoftirq-7 1d.H3 4us : calc_delta_mine (update_curr)
-ksoftirq-7 1d.H3 5us : __resched_task (check_preempt_wakeup)
-ksoftirq-7 1d.H3 6us : task_wake_up_rt (try_to_wake_up)
-ksoftirq-7 1d.H3 7us : _spin_unlock_irqrestore (try_to_wake_up)
-[...]
-ksoftirq-7 1d.H2 17us : irq_exit (smp_apic_timer_interrupt)
-ksoftirq-7 1d.H2 18us : sub_preempt_count (irq_exit)
-ksoftirq-7 1d.s3 19us : sub_preempt_count (irq_exit)
-ksoftirq-7 1..s2 20us : rcu_process_callbacks (__do_softirq)
-[...]
-ksoftirq-7 1..s2 26us : __rcu_process_callbacks (rcu_process_callbacks)
-ksoftirq-7 1d.s2 27us : _local_bh_enable (__do_softirq)
-ksoftirq-7 1d.s2 28us : sub_preempt_count (_local_bh_enable)
-ksoftirq-7 1.N.3 29us : sub_preempt_count (ksoftirqd)
-ksoftirq-7 1.N.2 30us : _cond_resched (ksoftirqd)
-ksoftirq-7 1.N.2 31us : __cond_resched (_cond_resched)
-ksoftirq-7 1.N.2 32us : add_preempt_count (__cond_resched)
-ksoftirq-7 1.N.2 33us : schedule (__cond_resched)
-ksoftirq-7 1.N.2 33us : add_preempt_count (schedule)
-ksoftirq-7 1.N.3 34us : hrtick_clear (schedule)
-ksoftirq-7 1dN.3 35us : _spin_lock (schedule)
-ksoftirq-7 1dN.3 36us : add_preempt_count (_spin_lock)
-ksoftirq-7 1d..4 37us : put_prev_task_fair (schedule)
-ksoftirq-7 1d..4 38us : update_curr (put_prev_task_fair)
-[...]
-ksoftirq-7 1d..5 47us : _spin_trylock (tracing_record_cmdline)
-ksoftirq-7 1d..5 48us : add_preempt_count (_spin_trylock)
-ksoftirq-7 1d..6 49us : _spin_unlock (tracing_record_cmdline)
-ksoftirq-7 1d..6 49us : sub_preempt_count (_spin_unlock)
-ksoftirq-7 1d..4 50us : schedule (__cond_resched)
-
-The interrupt went off while running ksoftirqd. This task runs
-at SCHED_OTHER. Why did not we see the 'N' set early? This may
-be a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K
-stacks configured, the interrupt and softirq run with their own
-stack. Some information is held on the top of the task's stack
-(need_resched and preempt_count are both stored there). The
-setting of the NEED_RESCHED bit is done directly to the task's
-stack, but the reading of the NEED_RESCHED is done by looking at
-the current stack, which in this case is the stack for the hard
-interrupt. This hides the fact that NEED_RESCHED has been set.
-We do not see the 'N' until we switch back to the task's
-assigned stack.
-
-function
---------
-
-This tracer is the function tracer. Enabling the function tracer
-can be done from the debug file system. Make sure the
-ftrace_enabled is set; otherwise this tracer is a nop.
-
- # sysctl kernel.ftrace_enabled=1
- # echo function > current_tracer
- # echo 1 > tracing_on
- # usleep 1
- # echo 0 > tracing_on
- # cat trace
-# tracer: function
-#
-# TASK-PID CPU# TIMESTAMP FUNCTION
-# | | | | |
- bash-4003 [00] 123.638713: finish_task_switch <-schedule
- bash-4003 [00] 123.638714: _spin_unlock_irq <-finish_task_switch
- bash-4003 [00] 123.638714: sub_preempt_count <-_spin_unlock_irq
- bash-4003 [00] 123.638715: hrtick_set <-schedule
- bash-4003 [00] 123.638715: _spin_lock_irqsave <-hrtick_set
- bash-4003 [00] 123.638716: add_preempt_count <-_spin_lock_irqsave
- bash-4003 [00] 123.638716: _spin_unlock_irqrestore <-hrtick_set
- bash-4003 [00] 123.638717: sub_preempt_count <-_spin_unlock_irqrestore
- bash-4003 [00] 123.638717: hrtick_clear <-hrtick_set
- bash-4003 [00] 123.638718: sub_preempt_count <-schedule
- bash-4003 [00] 123.638718: sub_preempt_count <-preempt_schedule
- bash-4003 [00] 123.638719: wait_for_completion <-__stop_machine_run
- bash-4003 [00] 123.638719: wait_for_common <-wait_for_completion
- bash-4003 [00] 123.638720: _spin_lock_irq <-wait_for_common
- bash-4003 [00] 123.638720: add_preempt_count <-_spin_lock_irq
-[...]
-
-
-Note: function tracer uses ring buffers to store the above
-entries. The newest data may overwrite the oldest data.
-Sometimes using echo to stop the trace is not sufficient because
-the tracing could have overwritten the data that you wanted to
-record. For this reason, it is sometimes better to disable
-tracing directly from a program. This allows you to stop the
-tracing at the point that you hit the part that you are
-interested in. To disable the tracing directly from a C program,
-something like following code snippet can be used:
-
-int trace_fd;
-[...]
-int main(int argc, char *argv[]) {
- [...]
- trace_fd = open(tracing_file("tracing_on"), O_WRONLY);
- [...]
- if (condition_hit()) {
- write(trace_fd, "0", 1);
- }
- [...]
-}
-
-
-Single thread tracing
----------------------
-
-By writing into set_ftrace_pid you can trace a
-single thread. For example:
-
-# cat set_ftrace_pid
-no pid
-# echo 3111 > set_ftrace_pid
-# cat set_ftrace_pid
-3111
-# echo function > current_tracer
-# cat trace | head
- # tracer: function
- #
- # TASK-PID CPU# TIMESTAMP FUNCTION
- # | | | | |
- yum-updatesd-3111 [003] 1637.254676: finish_task_switch <-thread_return
- yum-updatesd-3111 [003] 1637.254681: hrtimer_cancel <-schedule_hrtimeout_range
- yum-updatesd-3111 [003] 1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel
- yum-updatesd-3111 [003] 1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel
- yum-updatesd-3111 [003] 1637.254685: fget_light <-do_sys_poll
- yum-updatesd-3111 [003] 1637.254686: pipe_poll <-do_sys_poll
-# echo -1 > set_ftrace_pid
-# cat trace |head
- # tracer: function
- #
- # TASK-PID CPU# TIMESTAMP FUNCTION
- # | | | | |
- ##### CPU 3 buffer started ####
- yum-updatesd-3111 [003] 1701.957688: free_poll_entry <-poll_freewait
- yum-updatesd-3111 [003] 1701.957689: remove_wait_queue <-free_poll_entry
- yum-updatesd-3111 [003] 1701.957691: fput <-free_poll_entry
- yum-updatesd-3111 [003] 1701.957692: audit_syscall_exit <-sysret_audit
- yum-updatesd-3111 [003] 1701.957693: path_put <-audit_syscall_exit
-
-If you want to trace a function when executing, you could use
-something like this simple program:
-
-#include <stdio.h>
-#include <stdlib.h>
-#include <sys/types.h>
-#include <sys/stat.h>
-#include <fcntl.h>
-#include <unistd.h>
-#include <string.h>
-
-#define _STR(x) #x
-#define STR(x) _STR(x)
-#define MAX_PATH 256
-
-const char *find_debugfs(void)
-{
- static char debugfs[MAX_PATH+1];
- static int debugfs_found;
- char type[100];
- FILE *fp;
-
- if (debugfs_found)
- return debugfs;
-
- if ((fp = fopen("/proc/mounts","r")) == NULL) {
- perror("/proc/mounts");
- return NULL;
- }
-
- while (fscanf(fp, "%*s %"
- STR(MAX_PATH)
- "s %99s %*s %*d %*d\n",
- debugfs, type) == 2) {
- if (strcmp(type, "debugfs") == 0)
- break;
- }
- fclose(fp);
-
- if (strcmp(type, "debugfs") != 0) {
- fprintf(stderr, "debugfs not mounted");
- return NULL;
- }
-
- strcat(debugfs, "/tracing/");
- debugfs_found = 1;
-
- return debugfs;
-}
-
-const char *tracing_file(const char *file_name)
-{
- static char trace_file[MAX_PATH+1];
- snprintf(trace_file, MAX_PATH, "%s/%s", find_debugfs(), file_name);
- return trace_file;
-}
-
-int main (int argc, char **argv)
-{
- if (argc < 1)
- exit(-1);
-
- if (fork() > 0) {
- int fd, ffd;
- char line[64];
- int s;
-
- ffd = open(tracing_file("current_tracer"), O_WRONLY);
- if (ffd < 0)
- exit(-1);
- write(ffd, "nop", 3);
-
- fd = open(tracing_file("set_ftrace_pid"), O_WRONLY);
- s = sprintf(line, "%d\n", getpid());
- write(fd, line, s);
-
- write(ffd, "function", 8);
-
- close(fd);
- close(ffd);
-
- execvp(argv[1], argv+1);
- }
-
- return 0;
-}
-
-
-hw-branch-tracer (x86 only)
----------------------------
-
-This tracer uses the x86 last branch tracing hardware feature to
-collect a branch trace on all cpus with relatively low overhead.
-
-The tracer uses a fixed-size circular buffer per cpu and only
-traces ring 0 branches. The trace file dumps that buffer in the
-following format:
-
-# tracer: hw-branch-tracer
-#
-# CPU# TO <- FROM
- 0 scheduler_tick+0xb5/0x1bf <- task_tick_idle+0x5/0x6
- 2 run_posix_cpu_timers+0x2b/0x72a <- run_posix_cpu_timers+0x25/0x72a
- 0 scheduler_tick+0x139/0x1bf <- scheduler_tick+0xed/0x1bf
- 0 scheduler_tick+0x17c/0x1bf <- scheduler_tick+0x148/0x1bf
- 2 run_posix_cpu_timers+0x9e/0x72a <- run_posix_cpu_timers+0x5e/0x72a
- 0 scheduler_tick+0x1b6/0x1bf <- scheduler_tick+0x1aa/0x1bf
-
-
-The tracer may be used to dump the trace for the oops'ing cpu on
-a kernel oops into the system log. To enable this,
-ftrace_dump_on_oops must be set. To set ftrace_dump_on_oops, one
-can either use the sysctl function or set it via the proc system
-interface.
-
- sysctl kernel.ftrace_dump_on_oops=n
-
-or
-
- echo n > /proc/sys/kernel/ftrace_dump_on_oops
-
-If n = 1, ftrace will dump buffers of all CPUs, if n = 2 ftrace will
-only dump the buffer of the CPU that triggered the oops.
-
-Here's an example of such a dump after a null pointer
-dereference in a kernel module:
-
-[57848.105921] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
-[57848.106019] IP: [<ffffffffa0000006>] open+0x6/0x14 [oops]
-[57848.106019] PGD 2354e9067 PUD 2375e7067 PMD 0
-[57848.106019] Oops: 0002 [#1] SMP
-[57848.106019] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:20:05.0/local_cpus
-[57848.106019] Dumping ftrace buffer:
-[57848.106019] ---------------------------------
-[...]
-[57848.106019] 0 chrdev_open+0xe6/0x165 <- cdev_put+0x23/0x24
-[57848.106019] 0 chrdev_open+0x117/0x165 <- chrdev_open+0xfa/0x165
-[57848.106019] 0 chrdev_open+0x120/0x165 <- chrdev_open+0x11c/0x165
-[57848.106019] 0 chrdev_open+0x134/0x165 <- chrdev_open+0x12b/0x165
-[57848.106019] 0 open+0x0/0x14 [oops] <- chrdev_open+0x144/0x165
-[57848.106019] 0 page_fault+0x0/0x30 <- open+0x6/0x14 [oops]
-[57848.106019] 0 error_entry+0x0/0x5b <- page_fault+0x4/0x30
-[57848.106019] 0 error_kernelspace+0x0/0x31 <- error_entry+0x59/0x5b
-[57848.106019] 0 error_sti+0x0/0x1 <- error_kernelspace+0x2d/0x31
-[57848.106019] 0 page_fault+0x9/0x30 <- error_sti+0x0/0x1
-[57848.106019] 0 do_page_fault+0x0/0x881 <- page_fault+0x1a/0x30
-[...]
-[57848.106019] 0 do_page_fault+0x66b/0x881 <- is_prefetch+0x1ee/0x1f2
-[57848.106019] 0 do_page_fault+0x6e0/0x881 <- do_page_fault+0x67a/0x881
-[57848.106019] 0 oops_begin+0x0/0x96 <- do_page_fault+0x6e0/0x881
-[57848.106019] 0 trace_hw_branch_oops+0x0/0x2d <- oops_begin+0x9/0x96
-[...]
-[57848.106019] 0 ds_suspend_bts+0x2a/0xe3 <- ds_suspend_bts+0x1a/0xe3
-[57848.106019] ---------------------------------
-[57848.106019] CPU 0
-[57848.106019] Modules linked in: oops
-[57848.106019] Pid: 5542, comm: cat Tainted: G W 2.6.28 #23
-[57848.106019] RIP: 0010:[<ffffffffa0000006>] [<ffffffffa0000006>] open+0x6/0x14 [oops]
-[57848.106019] RSP: 0018:ffff880235457d48 EFLAGS: 00010246
-[...]
-
-
-function graph tracer
----------------------------
-
-This tracer is similar to the function tracer except that it
-probes a function on its entry and its exit. This is done by
-using a dynamically allocated stack of return addresses in each
-task_struct. On function entry the tracer overwrites the return
-address of each function traced to set a custom probe. Thus the
-original return address is stored on the stack of return address
-in the task_struct.
-
-Probing on both ends of a function leads to special features
-such as:
-
-- measure of a function's time execution
-- having a reliable call stack to draw function calls graph
-
-This tracer is useful in several situations:
-
-- you want to find the reason of a strange kernel behavior and
- need to see what happens in detail on any areas (or specific
- ones).
-
-- you are experiencing weird latencies but it's difficult to
- find its origin.
-
-- you want to find quickly which path is taken by a specific
- function
-
-- you just want to peek inside a working kernel and want to see
- what happens there.
-
-# tracer: function_graph
-#
-# CPU DURATION FUNCTION CALLS
-# | | | | | | |
-
- 0) | sys_open() {
- 0) | do_sys_open() {
- 0) | getname() {
- 0) | kmem_cache_alloc() {
- 0) 1.382 us | __might_sleep();
- 0) 2.478 us | }
- 0) | strncpy_from_user() {
- 0) | might_fault() {
- 0) 1.389 us | __might_sleep();
- 0) 2.553 us | }
- 0) 3.807 us | }
- 0) 7.876 us | }
- 0) | alloc_fd() {
- 0) 0.668 us | _spin_lock();
- 0) 0.570 us | expand_files();
- 0) 0.586 us | _spin_unlock();
-
-
-There are several columns that can be dynamically
-enabled/disabled. You can use every combination of options you
-want, depending on your needs.
-
-- The cpu number on which the function executed is default
- enabled. It is sometimes better to only trace one cpu (see
- tracing_cpu_mask file) or you might sometimes see unordered
- function calls while cpu tracing switch.
-
- hide: echo nofuncgraph-cpu > trace_options
- show: echo funcgraph-cpu > trace_options
-
-- The duration (function's time of execution) is displayed on
- the closing bracket line of a function or on the same line
- than the current function in case of a leaf one. It is default
- enabled.
-
- hide: echo nofuncgraph-duration > trace_options
- show: echo funcgraph-duration > trace_options
-
-- The overhead field precedes the duration field in case of
- reached duration thresholds.
-
- hide: echo nofuncgraph-overhead > trace_options
- show: echo funcgraph-overhead > trace_options
- depends on: funcgraph-duration
-
- ie:
-
- 0) | up_write() {
- 0) 0.646 us | _spin_lock_irqsave();
- 0) 0.684 us | _spin_unlock_irqrestore();
- 0) 3.123 us | }
- 0) 0.548 us | fput();
- 0) + 58.628 us | }
-
- [...]
-
- 0) | putname() {
- 0) | kmem_cache_free() {
- 0) 0.518 us | __phys_addr();
- 0) 1.757 us | }
- 0) 2.861 us | }
- 0) ! 115.305 us | }
- 0) ! 116.402 us | }
-
- + means that the function exceeded 10 usecs.
- ! means that the function exceeded 100 usecs.
-
-
-- The task/pid field displays the thread cmdline and pid which
- executed the function. It is default disabled.
-
- hide: echo nofuncgraph-proc > trace_options
- show: echo funcgraph-proc > trace_options
-
- ie:
-
- # tracer: function_graph
- #
- # CPU TASK/PID DURATION FUNCTION CALLS
- # | | | | | | | | |
- 0) sh-4802 | | d_free() {
- 0) sh-4802 | | call_rcu() {
- 0) sh-4802 | | __call_rcu() {
- 0) sh-4802 | 0.616 us | rcu_process_gp_end();
- 0) sh-4802 | 0.586 us | check_for_new_grace_period();
- 0) sh-4802 | 2.899 us | }
- 0) sh-4802 | 4.040 us | }
- 0) sh-4802 | 5.151 us | }
- 0) sh-4802 | + 49.370 us | }
-
-
-- The absolute time field is an absolute timestamp given by the
- system clock since it started. A snapshot of this time is
- given on each entry/exit of functions
-
- hide: echo nofuncgraph-abstime > trace_options
- show: echo funcgraph-abstime > trace_options
-
- ie:
-
- #
- # TIME CPU DURATION FUNCTION CALLS
- # | | | | | | | |
- 360.774522 | 1) 0.541 us | }
- 360.774522 | 1) 4.663 us | }
- 360.774523 | 1) 0.541 us | __wake_up_bit();
- 360.774524 | 1) 6.796 us | }
- 360.774524 | 1) 7.952 us | }
- 360.774525 | 1) 9.063 us | }
- 360.774525 | 1) 0.615 us | journal_mark_dirty();
- 360.774527 | 1) 0.578 us | __brelse();
- 360.774528 | 1) | reiserfs_prepare_for_journal() {
- 360.774528 | 1) | unlock_buffer() {
- 360.774529 | 1) | wake_up_bit() {
- 360.774529 | 1) | bit_waitqueue() {
- 360.774530 | 1) 0.594 us | __phys_addr();
-
-
-You can put some comments on specific functions by using
-trace_printk() For example, if you want to put a comment inside
-the __might_sleep() function, you just have to include
-<linux/ftrace.h> and call trace_printk() inside __might_sleep()
-
-trace_printk("I'm a comment!\n")
-
-will produce:
-
- 1) | __might_sleep() {
- 1) | /* I'm a comment! */
- 1) 1.449 us | }
-
-
-You might find other useful features for this tracer in the
-following "dynamic ftrace" section such as tracing only specific
-functions or tasks.
-
-dynamic ftrace
---------------
-
-If CONFIG_DYNAMIC_FTRACE is set, the system will run with
-virtually no overhead when function tracing is disabled. The way
-this works is the mcount function call (placed at the start of
-every kernel function, produced by the -pg switch in gcc),
-starts of pointing to a simple return. (Enabling FTRACE will
-include the -pg switch in the compiling of the kernel.)
-
-At compile time every C file object is run through the
-recordmcount.pl script (located in the scripts directory). This
-script will process the C object using objdump to find all the
-locations in the .text section that call mcount. (Note, only the
-.text section is processed, since processing other sections like
-.init.text may cause races due to those sections being freed).
-
-A new section called "__mcount_loc" is created that holds
-references to all the mcount call sites in the .text section.
-This section is compiled back into the original object. The
-final linker will add all these references into a single table.
-
-On boot up, before SMP is initialized, the dynamic ftrace code
-scans this table and updates all the locations into nops. It
-also records the locations, which are added to the
-available_filter_functions list. Modules are processed as they
-are loaded and before they are executed. When a module is
-unloaded, it also removes its functions from the ftrace function
-list. This is automatic in the module unload code, and the
-module author does not need to worry about it.
-
-When tracing is enabled, kstop_machine is called to prevent
-races with the CPUS executing code being modified (which can
-cause the CPU to do undesirable things), and the nops are
-patched back to calls. But this time, they do not call mcount
-(which is just a function stub). They now call into the ftrace
-infrastructure.
-
-One special side-effect to the recording of the functions being
-traced is that we can now selectively choose which functions we
-wish to trace and which ones we want the mcount calls to remain
-as nops.
-
-Two files are used, one for enabling and one for disabling the
-tracing of specified functions. They are:
-
- set_ftrace_filter
-
-and
-
- set_ftrace_notrace
-
-A list of available functions that you can add to these files is
-listed in:
-
- available_filter_functions
-
- # cat available_filter_functions
-put_prev_task_idle
-kmem_cache_create
-pick_next_task_rt
-get_online_cpus
-pick_next_task_fair
-mutex_lock
-[...]
-
-If I am only interested in sys_nanosleep and hrtimer_interrupt:
-
- # echo sys_nanosleep hrtimer_interrupt \
- > set_ftrace_filter
- # echo function > current_tracer
- # echo 1 > tracing_on
- # usleep 1
- # echo 0 > tracing_on
- # cat trace
-# tracer: ftrace
-#
-# TASK-PID CPU# TIMESTAMP FUNCTION
-# | | | | |
- usleep-4134 [00] 1317.070017: hrtimer_interrupt <-smp_apic_timer_interrupt
- usleep-4134 [00] 1317.070111: sys_nanosleep <-syscall_call
- <idle>-0 [00] 1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt
-
-To see which functions are being traced, you can cat the file:
-
- # cat set_ftrace_filter
-hrtimer_interrupt
-sys_nanosleep
-
-
-Perhaps this is not enough. The filters also allow simple wild
-cards. Only the following are currently available
-
- <match>* - will match functions that begin with <match>
- *<match> - will match functions that end with <match>
- *<match>* - will match functions that have <match> in it
-
-These are the only wild cards which are supported.
-
- <match>*<match> will not work.
-
-Note: It is better to use quotes to enclose the wild cards,
- otherwise the shell may expand the parameters into names
- of files in the local directory.
-
- # echo 'hrtimer_*' > set_ftrace_filter
-
-Produces:
-
-# tracer: ftrace
-#
-# TASK-PID CPU# TIMESTAMP FUNCTION
-# | | | | |
- bash-4003 [00] 1480.611794: hrtimer_init <-copy_process
- bash-4003 [00] 1480.611941: hrtimer_start <-hrtick_set
- bash-4003 [00] 1480.611956: hrtimer_cancel <-hrtick_clear
- bash-4003 [00] 1480.611956: hrtimer_try_to_cancel <-hrtimer_cancel
- <idle>-0 [00] 1480.612019: hrtimer_get_next_event <-get_next_timer_interrupt
- <idle>-0 [00] 1480.612025: hrtimer_get_next_event <-get_next_timer_interrupt
- <idle>-0 [00] 1480.612032: hrtimer_get_next_event <-get_next_timer_interrupt
- <idle>-0 [00] 1480.612037: hrtimer_get_next_event <-get_next_timer_interrupt
- <idle>-0 [00] 1480.612382: hrtimer_get_next_event <-get_next_timer_interrupt
-
-
-Notice that we lost the sys_nanosleep.
-
- # cat set_ftrace_filter
-hrtimer_run_queues
-hrtimer_run_pending
-hrtimer_init
-hrtimer_cancel
-hrtimer_try_to_cancel
-hrtimer_forward
-hrtimer_start
-hrtimer_reprogram
-hrtimer_force_reprogram
-hrtimer_get_next_event
-hrtimer_interrupt
-hrtimer_nanosleep
-hrtimer_wakeup
-hrtimer_get_remaining
-hrtimer_get_res
-hrtimer_init_sleeper
-
-
-This is because the '>' and '>>' act just like they do in bash.
-To rewrite the filters, use '>'
-To append to the filters, use '>>'
-
-To clear out a filter so that all functions will be recorded
-again:
-
- # echo > set_ftrace_filter
- # cat set_ftrace_filter
- #
-
-Again, now we want to append.
-
- # echo sys_nanosleep > set_ftrace_filter
- # cat set_ftrace_filter
-sys_nanosleep
- # echo 'hrtimer_*' >> set_ftrace_filter
- # cat set_ftrace_filter
-hrtimer_run_queues
-hrtimer_run_pending
-hrtimer_init
-hrtimer_cancel
-hrtimer_try_to_cancel
-hrtimer_forward
-hrtimer_start
-hrtimer_reprogram
-hrtimer_force_reprogram
-hrtimer_get_next_event
-hrtimer_interrupt
-sys_nanosleep
-hrtimer_nanosleep
-hrtimer_wakeup
-hrtimer_get_remaining
-hrtimer_get_res
-hrtimer_init_sleeper
-
-
-The set_ftrace_notrace prevents those functions from being
-traced.
-
- # echo '*preempt*' '*lock*' > set_ftrace_notrace
-
-Produces:
-
-# tracer: ftrace
-#
-# TASK-PID CPU# TIMESTAMP FUNCTION
-# | | | | |
- bash-4043 [01] 115.281644: finish_task_switch <-schedule
- bash-4043 [01] 115.281645: hrtick_set <-schedule
- bash-4043 [01] 115.281645: hrtick_clear <-hrtick_set
- bash-4043 [01] 115.281646: wait_for_completion <-__stop_machine_run
- bash-4043 [01] 115.281647: wait_for_common <-wait_for_completion
- bash-4043 [01] 115.281647: kthread_stop <-stop_machine_run
- bash-4043 [01] 115.281648: init_waitqueue_head <-kthread_stop
- bash-4043 [01] 115.281648: wake_up_process <-kthread_stop
- bash-4043 [01] 115.281649: try_to_wake_up <-wake_up_process
-
-We can see that there's no more lock or preempt tracing.
-
-
-Dynamic ftrace with the function graph tracer
----------------------------------------------
-
-Although what has been explained above concerns both the
-function tracer and the function-graph-tracer, there are some
-special features only available in the function-graph tracer.
-
-If you want to trace only one function and all of its children,
-you just have to echo its name into set_graph_function:
-
- echo __do_fault > set_graph_function
-
-will produce the following "expanded" trace of the __do_fault()
-function:
-
- 0) | __do_fault() {
- 0) | filemap_fault() {
- 0) | find_lock_page() {
- 0) 0.804 us | find_get_page();
- 0) | __might_sleep() {
- 0) 1.329 us | }
- 0) 3.904 us | }
- 0) 4.979 us | }
- 0) 0.653 us | _spin_lock();
- 0) 0.578 us | page_add_file_rmap();
- 0) 0.525 us | native_set_pte_at();
- 0) 0.585 us | _spin_unlock();
- 0) | unlock_page() {
- 0) 0.541 us | page_waitqueue();
- 0) 0.639 us | __wake_up_bit();
- 0) 2.786 us | }
- 0) + 14.237 us | }
- 0) | __do_fault() {
- 0) | filemap_fault() {
- 0) | find_lock_page() {
- 0) 0.698 us | find_get_page();
- 0) | __might_sleep() {
- 0) 1.412 us | }
- 0) 3.950 us | }
- 0) 5.098 us | }
- 0) 0.631 us | _spin_lock();
- 0) 0.571 us | page_add_file_rmap();
- 0) 0.526 us | native_set_pte_at();
- 0) 0.586 us | _spin_unlock();
- 0) | unlock_page() {
- 0) 0.533 us | page_waitqueue();
- 0) 0.638 us | __wake_up_bit();
- 0) 2.793 us | }
- 0) + 14.012 us | }
-
-You can also expand several functions at once:
-
- echo sys_open > set_graph_function
- echo sys_close >> set_graph_function
-
-Now if you want to go back to trace all functions you can clear
-this special filter via:
-
- echo > set_graph_function
-
-
-Filter commands
----------------
-
-A few commands are supported by the set_ftrace_filter interface.
-Trace commands have the following format:
-
-<function>:<command>:<parameter>
-
-The following commands are supported:
-
-- mod
- This command enables function filtering per module. The
- parameter defines the module. For example, if only the write*
- functions in the ext3 module are desired, run:
-
- echo 'write*:mod:ext3' > set_ftrace_filter
-
- This command interacts with the filter in the same way as
- filtering based on function names. Thus, adding more functions
- in a different module is accomplished by appending (>>) to the
- filter file. Remove specific module functions by prepending
- '!':
-
- echo '!writeback*:mod:ext3' >> set_ftrace_filter
-
-- traceon/traceoff
- These commands turn tracing on and off when the specified
- functions are hit. The parameter determines how many times the
- tracing system is turned on and off. If unspecified, there is
- no limit. For example, to disable tracing when a schedule bug
- is hit the first 5 times, run:
-
- echo '__schedule_bug:traceoff:5' > set_ftrace_filter
-
- These commands are cumulative whether or not they are appended
- to set_ftrace_filter. To remove a command, prepend it by '!'
- and drop the parameter:
-
- echo '!__schedule_bug:traceoff' > set_ftrace_filter
-
-
-trace_pipe
-----------
-
-The trace_pipe outputs the same content as the trace file, but
-the effect on the tracing is different. Every read from
-trace_pipe is consumed. This means that subsequent reads will be
-different. The trace is live.
-
- # echo function > current_tracer
- # cat trace_pipe > /tmp/trace.out &
-[1] 4153
- # echo 1 > tracing_on
- # usleep 1
- # echo 0 > tracing_on
- # cat trace
-# tracer: function
-#
-# TASK-PID CPU# TIMESTAMP FUNCTION
-# | | | | |
-
- #
- # cat /tmp/trace.out
- bash-4043 [00] 41.267106: finish_task_switch <-schedule
- bash-4043 [00] 41.267106: hrtick_set <-schedule
- bash-4043 [00] 41.267107: hrtick_clear <-hrtick_set
- bash-4043 [00] 41.267108: wait_for_completion <-__stop_machine_run
- bash-4043 [00] 41.267108: wait_for_common <-wait_for_completion
- bash-4043 [00] 41.267109: kthread_stop <-stop_machine_run
- bash-4043 [00] 41.267109: init_waitqueue_head <-kthread_stop
- bash-4043 [00] 41.267110: wake_up_process <-kthread_stop
- bash-4043 [00] 41.267110: try_to_wake_up <-wake_up_process
- bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up
-
-
-Note, reading the trace_pipe file will block until more input is
-added. By changing the tracer, trace_pipe will issue an EOF. We
-needed to set the function tracer _before_ we "cat" the
-trace_pipe file.
-
-
-trace entries
--------------
-
-Having too much or not enough data can be troublesome in
-diagnosing an issue in the kernel. The file buffer_size_kb is
-used to modify the size of the internal trace buffers. The
-number listed is the number of entries that can be recorded per
-CPU. To know the full size, multiply the number of possible CPUS
-with the number of entries.
-
- # cat buffer_size_kb
-1408 (units kilobytes)
-
-Note, to modify this, you must have tracing completely disabled.
-To do that, echo "nop" into the current_tracer. If the
-current_tracer is not set to "nop", an EINVAL error will be
-returned.
-
- # echo nop > current_tracer
- # echo 10000 > buffer_size_kb
- # cat buffer_size_kb
-10000 (units kilobytes)
-
-The number of pages which will be allocated is limited to a
-percentage of available memory. Allocating too much will produce
-an error.
-
- # echo 1000000000000 > buffer_size_kb
--bash: echo: write error: Cannot allocate memory
- # cat buffer_size_kb
-85
-
------------
-
-More details can be found in the source code, in the
-kernel/trace/*.c files.
diff --git a/Documentation/trace/function-graph-fold.vim b/Documentation/trace/function-graph-fold.vim
deleted file mode 100644
index 0544b504c8b..00000000000
--- a/Documentation/trace/function-graph-fold.vim
+++ /dev/null
@@ -1,42 +0,0 @@
-" Enable folding for ftrace function_graph traces.
-"
-" To use, :source this file while viewing a function_graph trace, or use vim's
-" -S option to load from the command-line together with a trace. You can then
-" use the usual vim fold commands, such as "za", to open and close nested
-" functions. While closed, a fold will show the total time taken for a call,
-" as would normally appear on the line with the closing brace. Folded
-" functions will not include finish_task_switch(), so folding should remain
-" relatively sane even through a context switch.
-"
-" Note that this will almost certainly only work well with a
-" single-CPU trace (e.g. trace-cmd report --cpu 1).
-
-function! FunctionGraphFoldExpr(lnum)
- let line = getline(a:lnum)
- if line[-1:] == '{'
- if line =~ 'finish_task_switch() {$'
- return '>1'
- endif
- return 'a1'
- elseif line[-1:] == '}'
- return 's1'
- else
- return '='
- endif
-endfunction
-
-function! FunctionGraphFoldText()
- let s = split(getline(v:foldstart), '|', 1)
- if getline(v:foldend+1) =~ 'finish_task_switch() {$'
- let s[2] = ' task switch '
- else
- let e = split(getline(v:foldend), '|', 1)
- let s[2] = e[2]
- endif
- return join(s, '|')
-endfunction
-
-setlocal foldexpr=FunctionGraphFoldExpr(v:lnum)
-setlocal foldtext=FunctionGraphFoldText()
-setlocal foldcolumn=12
-setlocal foldmethod=expr
diff --git a/Documentation/trace/kprobetrace.txt b/Documentation/trace/kprobetrace.txt
deleted file mode 100644
index d0d0bb9e3e2..00000000000
--- a/Documentation/trace/kprobetrace.txt
+++ /dev/null
@@ -1,172 +0,0 @@
- Kprobe-based Event Tracing
- ==========================
-
- Documentation is written by Masami Hiramatsu
-
-
-Overview
---------
-These events are similar to tracepoint based events. Instead of Tracepoint,
-this is based on kprobes (kprobe and kretprobe). So it can probe wherever
-kprobes can probe (this means, all functions body except for __kprobes
-functions). Unlike the Tracepoint based event, this can be added and removed
-dynamically, on the fly.
-
-To enable this feature, build your kernel with CONFIG_KPROBE_TRACING=y.
-
-Similar to the events tracer, this doesn't need to be activated via
-current_tracer. Instead of that, add probe points via
-/sys/kernel/debug/tracing/kprobe_events, and enable it via
-/sys/kernel/debug/tracing/events/kprobes/<EVENT>/enabled.
-
-
-Synopsis of kprobe_events
--------------------------
- p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS] : Set a probe
- r[:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS] : Set a return probe
- -:[GRP/]EVENT : Clear a probe
-
- GRP : Group name. If omitted, use "kprobes" for it.
- EVENT : Event name. If omitted, the event name is generated
- based on SYM+offs or MEMADDR.
- MOD : Module name which has given SYM.
- SYM[+offs] : Symbol+offset where the probe is inserted.
- MEMADDR : Address where the probe is inserted.
-
- FETCHARGS : Arguments. Each probe can have up to 128 args.
- %REG : Fetch register REG
- @ADDR : Fetch memory at ADDR (ADDR should be in kernel)
- @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
- $stackN : Fetch Nth entry of stack (N >= 0)
- $stack : Fetch stack address.
- $retval : Fetch return value.(*)
- +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(**)
- NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
- FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
- (u8/u16/u32/u64/s8/s16/s32/s64), "string" and bitfield
- are supported.
-
- (*) only for return probe.
- (**) this is useful for fetching a field of data structures.
-
-Types
------
-Several types are supported for fetch-args. Kprobe tracer will access memory
-by given type. Prefix 's' and 'u' means those types are signed and unsigned
-respectively. Traced arguments are shown in decimal (signed) or hex (unsigned).
-String type is a special type, which fetches a "null-terminated" string from
-kernel space. This means it will fail and store NULL if the string container
-has been paged out.
-Bitfield is another special type, which takes 3 parameters, bit-width, bit-
-offset, and container-size (usually 32). The syntax is;
-
- b<bit-width>@<bit-offset>/<container-size>
-
-
-Per-Probe Event Filtering
--------------------------
- Per-probe event filtering feature allows you to set different filter on each
-probe and gives you what arguments will be shown in trace buffer. If an event
-name is specified right after 'p:' or 'r:' in kprobe_events, it adds an event
-under tracing/events/kprobes/<EVENT>, at the directory you can see 'id',
-'enabled', 'format' and 'filter'.
-
-enabled:
- You can enable/disable the probe by writing 1 or 0 on it.
-
-format:
- This shows the format of this probe event.
-
-filter:
- You can write filtering rules of this event.
-
-id:
- This shows the id of this probe event.
-
-
-Event Profiling
----------------
- You can check the total number of probe hits and probe miss-hits via
-/sys/kernel/debug/tracing/kprobe_profile.
- The first column is event name, the second is the number of probe hits,
-the third is the number of probe miss-hits.
-
-
-Usage examples
---------------
-To add a probe as a new event, write a new definition to kprobe_events
-as below.
-
- echo 'p:myprobe do_sys_open dfd=%ax filename=%dx flags=%cx mode=+4($stack)' > /sys/kernel/debug/tracing/kprobe_events
-
- This sets a kprobe on the top of do_sys_open() function with recording
-1st to 4th arguments as "myprobe" event. Note, which register/stack entry is
-assigned to each function argument depends on arch-specific ABI. If you unsure
-the ABI, please try to use probe subcommand of perf-tools (you can find it
-under tools/perf/).
-As this example shows, users can choose more familiar names for each arguments.
-
- echo 'r:myretprobe do_sys_open $retval' >> /sys/kernel/debug/tracing/kprobe_events
-
- This sets a kretprobe on the return point of do_sys_open() function with
-recording return value as "myretprobe" event.
- You can see the format of these events via
-/sys/kernel/debug/tracing/events/kprobes/<EVENT>/format.
-
- cat /sys/kernel/debug/tracing/events/kprobes/myprobe/format
-name: myprobe
-ID: 780
-format:
- field:unsigned short common_type; offset:0; size:2; signed:0;
- field:unsigned char common_flags; offset:2; size:1; signed:0;
- field:unsigned char common_preempt_count; offset:3; size:1;signed:0;
- field:int common_pid; offset:4; size:4; signed:1;
-
- field:unsigned long __probe_ip; offset:12; size:4; signed:0;
- field:int __probe_nargs; offset:16; size:4; signed:1;
- field:unsigned long dfd; offset:20; size:4; signed:0;
- field:unsigned long filename; offset:24; size:4; signed:0;
- field:unsigned long flags; offset:28; size:4; signed:0;
- field:unsigned long mode; offset:32; size:4; signed:0;
-
-
-print fmt: "(%lx) dfd=%lx filename=%lx flags=%lx mode=%lx", REC->__probe_ip,
-REC->dfd, REC->filename, REC->flags, REC->mode
-
- You can see that the event has 4 arguments as in the expressions you specified.
-
- echo > /sys/kernel/debug/tracing/kprobe_events
-
- This clears all probe points.
-
- Or,
-
- echo -:myprobe >> kprobe_events
-
- This clears probe points selectively.
-
- Right after definition, each event is disabled by default. For tracing these
-events, you need to enable it.
-
- echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
- echo 1 > /sys/kernel/debug/tracing/events/kprobes/myretprobe/enable
-
- And you can see the traced information via /sys/kernel/debug/tracing/trace.
-
- cat /sys/kernel/debug/tracing/trace
-# tracer: nop
-#
-# TASK-PID CPU# TIMESTAMP FUNCTION
-# | | | | |
- <...>-1447 [001] 1038282.286875: myprobe: (do_sys_open+0x0/0xd6) dfd=3 filename=7fffd1ec4440 flags=8000 mode=0
- <...>-1447 [001] 1038282.286878: myretprobe: (sys_openat+0xc/0xe <- do_sys_open) $retval=fffffffffffffffe
- <...>-1447 [001] 1038282.286885: myprobe: (do_sys_open+0x0/0xd6) dfd=ffffff9c filename=40413c flags=8000 mode=1b6
- <...>-1447 [001] 1038282.286915: myretprobe: (sys_open+0x1b/0x1d <- do_sys_open) $retval=3
- <...>-1447 [001] 1038282.286969: myprobe: (do_sys_open+0x0/0xd6) dfd=ffffff9c filename=4041c6 flags=98800 mode=10
- <...>-1447 [001] 1038282.286976: myretprobe: (sys_open+0x1b/0x1d <- do_sys_open) $retval=3
-
-
- Each line shows when the kernel hits an event, and <- SYMBOL means kernel
-returns from SYMBOL(e.g. "sys_open+0x1b/0x1d <- do_sys_open" means kernel
-returns from do_sys_open to sys_open+0x1b).
-
diff --git a/Documentation/trace/mmiotrace.txt b/Documentation/trace/mmiotrace.txt
deleted file mode 100644
index 664e7386d89..00000000000
--- a/Documentation/trace/mmiotrace.txt
+++ /dev/null
@@ -1,164 +0,0 @@
- In-kernel memory-mapped I/O tracing
-
-
-Home page and links to optional user space tools:
-
- http://nouveau.freedesktop.org/wiki/MmioTrace
-
-MMIO tracing was originally developed by Intel around 2003 for their Fault
-Injection Test Harness. In Dec 2006 - Jan 2007, using the code from Intel,
-Jeff Muizelaar created a tool for tracing MMIO accesses with the Nouveau
-project in mind. Since then many people have contributed.
-
-Mmiotrace was built for reverse engineering any memory-mapped IO device with
-the Nouveau project as the first real user. Only x86 and x86_64 architectures
-are supported.
-
-Out-of-tree mmiotrace was originally modified for mainline inclusion and
-ftrace framework by Pekka Paalanen <pq@iki.fi>.
-
-
-Preparation
------------
-
-Mmiotrace feature is compiled in by the CONFIG_MMIOTRACE option. Tracing is
-disabled by default, so it is safe to have this set to yes. SMP systems are
-supported, but tracing is unreliable and may miss events if more than one CPU
-is on-line, therefore mmiotrace takes all but one CPU off-line during run-time
-activation. You can re-enable CPUs by hand, but you have been warned, there
-is no way to automatically detect if you are losing events due to CPUs racing.
-
-
-Usage Quick Reference
----------------------
-
-$ mount -t debugfs debugfs /sys/kernel/debug
-$ echo mmiotrace > /sys/kernel/debug/tracing/current_tracer
-$ cat /sys/kernel/debug/tracing/trace_pipe > mydump.txt &
-Start X or whatever.
-$ echo "X is up" > /sys/kernel/debug/tracing/trace_marker
-$ echo nop > /sys/kernel/debug/tracing/current_tracer
-Check for lost events.
-
-
-Usage
------
-
-Make sure debugfs is mounted to /sys/kernel/debug.
-If not (requires root privileges):
-$ mount -t debugfs debugfs /sys/kernel/debug
-
-Check that the driver you are about to trace is not loaded.
-
-Activate mmiotrace (requires root privileges):
-$ echo mmiotrace > /sys/kernel/debug/tracing/current_tracer
-
-Start storing the trace:
-$ cat /sys/kernel/debug/tracing/trace_pipe > mydump.txt &
-The 'cat' process should stay running (sleeping) in the background.
-
-Load the driver you want to trace and use it. Mmiotrace will only catch MMIO
-accesses to areas that are ioremapped while mmiotrace is active.
-
-During tracing you can place comments (markers) into the trace by
-$ echo "X is up" > /sys/kernel/debug/tracing/trace_marker
-This makes it easier to see which part of the (huge) trace corresponds to
-which action. It is recommended to place descriptive markers about what you
-do.
-
-Shut down mmiotrace (requires root privileges):
-$ echo nop > /sys/kernel/debug/tracing/current_tracer
-The 'cat' process exits. If it does not, kill it by issuing 'fg' command and
-pressing ctrl+c.
-
-Check that mmiotrace did not lose events due to a buffer filling up. Either
-$ grep -i lost mydump.txt
-which tells you exactly how many events were lost, or use
-$ dmesg
-to view your kernel log and look for "mmiotrace has lost events" warning. If
-events were lost, the trace is incomplete. You should enlarge the buffers and
-try again. Buffers are enlarged by first seeing how large the current buffers
-are:
-$ cat /sys/kernel/debug/tracing/buffer_size_kb
-gives you a number. Approximately double this number and write it back, for
-instance:
-$ echo 128000 > /sys/kernel/debug/tracing/buffer_size_kb
-Then start again from the top.
-
-If you are doing a trace for a driver project, e.g. Nouveau, you should also
-do the following before sending your results:
-$ lspci -vvv > lspci.txt
-$ dmesg > dmesg.txt
-$ tar zcf pciid-nick-mmiotrace.tar.gz mydump.txt lspci.txt dmesg.txt
-and then send the .tar.gz file. The trace compresses considerably. Replace
-"pciid" and "nick" with the PCI ID or model name of your piece of hardware
-under investigation and your nickname.
-
-
-How Mmiotrace Works
--------------------
-
-Access to hardware IO-memory is gained by mapping addresses from PCI bus by
-calling one of the ioremap_*() functions. Mmiotrace is hooked into the
-__ioremap() function and gets called whenever a mapping is created. Mapping is
-an event that is recorded into the trace log. Note that ISA range mappings
-are not caught, since the mapping always exists and is returned directly.
-
-MMIO accesses are recorded via page faults. Just before __ioremap() returns,
-the mapped pages are marked as not present. Any access to the pages causes a
-fault. The page fault handler calls mmiotrace to handle the fault. Mmiotrace
-marks the page present, sets TF flag to achieve single stepping and exits the
-fault handler. The instruction that faulted is executed and debug trap is
-entered. Here mmiotrace again marks the page as not present. The instruction
-is decoded to get the type of operation (read/write), data width and the value
-read or written. These are stored to the trace log.
-
-Setting the page present in the page fault handler has a race condition on SMP
-machines. During the single stepping other CPUs may run freely on that page
-and events can be missed without a notice. Re-enabling other CPUs during
-tracing is discouraged.
-
-
-Trace Log Format
-----------------
-
-The raw log is text and easily filtered with e.g. grep and awk. One record is
-one line in the log. A record starts with a keyword, followed by keyword-
-dependent arguments. Arguments are separated by a space, or continue until the
-end of line. The format for version 20070824 is as follows:
-
-Explanation Keyword Space-separated arguments
----------------------------------------------------------------------------
-
-read event R width, timestamp, map id, physical, value, PC, PID
-write event W width, timestamp, map id, physical, value, PC, PID
-ioremap event MAP timestamp, map id, physical, virtual, length, PC, PID
-iounmap event UNMAP timestamp, map id, PC, PID
-marker MARK timestamp, text
-version VERSION the string "20070824"
-info for reader LSPCI one line from lspci -v
-PCI address map PCIDEV space-separated /proc/bus/pci/devices data
-unk. opcode UNKNOWN timestamp, map id, physical, data, PC, PID
-
-Timestamp is in seconds with decimals. Physical is a PCI bus address, virtual
-is a kernel virtual address. Width is the data width in bytes and value is the
-data value. Map id is an arbitrary id number identifying the mapping that was
-used in an operation. PC is the program counter and PID is process id. PC is
-zero if it is not recorded. PID is always zero as tracing MMIO accesses
-originating in user space memory is not yet supported.
-
-For instance, the following awk filter will pass all 32-bit writes that target
-physical addresses in the range [0xfb73ce40, 0xfb800000[
-
-$ awk '/W 4 / { adr=strtonum($5); if (adr >= 0xfb73ce40 &&
-adr < 0xfb800000) print; }'
-
-
-Tools for Developers
---------------------
-
-The user space tools include utilities for:
-- replacing numeric addresses and values with hardware register names
-- replaying MMIO logs, i.e., re-executing the recorded writes
-
-
diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
deleted file mode 100644
index 0a120aae33c..00000000000
--- a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
+++ /dev/null
@@ -1,418 +0,0 @@
-#!/usr/bin/perl
-# This is a POC (proof of concept or piece of crap, take your pick) for reading the
-# text representation of trace output related to page allocation. It makes an attempt
-# to extract some high-level information on what is going on. The accuracy of the parser
-# may vary considerably
-#
-# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
-# other options
-# --prepend-parent Report on the parent proc and PID
-# --read-procstat If the trace lacks process info, get it from /proc
-# --ignore-pid Aggregate processes of the same name together
-#
-# Copyright (c) IBM Corporation 2009
-# Author: Mel Gorman <mel@csn.ul.ie>
-use strict;
-use Getopt::Long;
-
-# Tracepoint events
-use constant MM_PAGE_ALLOC => 1;
-use constant MM_PAGE_FREE => 2;
-use constant MM_PAGE_FREE_BATCHED => 3;
-use constant MM_PAGE_PCPU_DRAIN => 4;
-use constant MM_PAGE_ALLOC_ZONE_LOCKED => 5;
-use constant MM_PAGE_ALLOC_EXTFRAG => 6;
-use constant EVENT_UNKNOWN => 7;
-
-# Constants used to track state
-use constant STATE_PCPU_PAGES_DRAINED => 8;
-use constant STATE_PCPU_PAGES_REFILLED => 9;
-
-# High-level events extrapolated from tracepoints
-use constant HIGH_PCPU_DRAINS => 10;
-use constant HIGH_PCPU_REFILLS => 11;
-use constant HIGH_EXT_FRAGMENT => 12;
-use constant HIGH_EXT_FRAGMENT_SEVERE => 13;
-use constant HIGH_EXT_FRAGMENT_MODERATE => 14;
-use constant HIGH_EXT_FRAGMENT_CHANGED => 15;
-
-my %perprocesspid;
-my %perprocess;
-my $opt_ignorepid;
-my $opt_read_procstat;
-my $opt_prepend_parent;
-
-# Catch sigint and exit on request
-my $sigint_report = 0;
-my $sigint_exit = 0;
-my $sigint_pending = 0;
-my $sigint_received = 0;
-sub sigint_handler {
- my $current_time = time;
- if ($current_time - 2 > $sigint_received) {
- print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
- $sigint_report = 1;
- } else {
- if (!$sigint_exit) {
- print "Second SIGINT received quickly, exiting\n";
- }
- $sigint_exit++;
- }
-
- if ($sigint_exit > 3) {
- print "Many SIGINTs received, exiting now without report\n";
- exit;
- }
-
- $sigint_received = $current_time;
- $sigint_pending = 1;
-}
-$SIG{INT} = "sigint_handler";
-
-# Parse command line options
-GetOptions(
- 'ignore-pid' => \$opt_ignorepid,
- 'read-procstat' => \$opt_read_procstat,
- 'prepend-parent' => \$opt_prepend_parent,
-);
-
-# Defaults for dynamically discovered regex's
-my $regex_fragdetails_default = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([-0-9]*) fallback_order=([-0-9]*) pageblock_order=([-0-9]*) alloc_migratetype=([-0-9]*) fallback_migratetype=([-0-9]*) fragmenting=([-0-9]) change_ownership=([-0-9])';
-
-# Dyanically discovered regex
-my $regex_fragdetails;
-
-# Static regex used. Specified like this for readability and for use with /o
-# (process_pid) (cpus ) ( time ) (tpoint ) (details)
-my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
-my $regex_statname = '[-0-9]*\s\((.*)\).*';
-my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
-
-sub generate_traceevent_regex {
- my $event = shift;
- my $default = shift;
- my $regex;
-
- # Read the event format or use the default
- if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
- $regex = $default;
- } else {
- my $line;
- while (!eof(FORMAT)) {
- $line = <FORMAT>;
- if ($line =~ /^print fmt:\s"(.*)",.*/) {
- $regex = $1;
- $regex =~ s/%p/\([0-9a-f]*\)/g;
- $regex =~ s/%d/\([-0-9]*\)/g;
- $regex =~ s/%lu/\([0-9]*\)/g;
- }
- }
- }
-
- # Verify fields are in the right order
- my $tuple;
- foreach $tuple (split /\s/, $regex) {
- my ($key, $value) = split(/=/, $tuple);
- my $expected = shift;
- if ($key ne $expected) {
- print("WARNING: Format not as expected '$key' != '$expected'");
- $regex =~ s/$key=\((.*)\)/$key=$1/;
- }
- }
-
- if (defined shift) {
- die("Fewer fields than expected in format");
- }
-
- return $regex;
-}
-$regex_fragdetails = generate_traceevent_regex("kmem/mm_page_alloc_extfrag",
- $regex_fragdetails_default,
- "page", "pfn",
- "alloc_order", "fallback_order", "pageblock_order",
- "alloc_migratetype", "fallback_migratetype",
- "fragmenting", "change_ownership");
-
-sub read_statline($) {
- my $pid = $_[0];
- my $statline;
-
- if (open(STAT, "/proc/$pid/stat")) {
- $statline = <STAT>;
- close(STAT);
- }
-
- if ($statline eq '') {
- $statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
- }
-
- return $statline;
-}
-
-sub guess_process_pid($$) {
- my $pid = $_[0];
- my $statline = $_[1];
-
- if ($pid == 0) {
- return "swapper-0";
- }
-
- if ($statline !~ /$regex_statname/o) {
- die("Failed to math stat line for process name :: $statline");
- }
- return "$1-$pid";
-}
-
-sub parent_info($$) {
- my $pid = $_[0];
- my $statline = $_[1];
- my $ppid;
-
- if ($pid == 0) {
- return "NOPARENT-0";
- }
-
- if ($statline !~ /$regex_statppid/o) {
- die("Failed to match stat line process ppid:: $statline");
- }
-
- # Read the ppid stat line
- $ppid = $1;
- return guess_process_pid($ppid, read_statline($ppid));
-}
-
-sub process_events {
- my $traceevent;
- my $process_pid;
- my $cpus;
- my $timestamp;
- my $tracepoint;
- my $details;
- my $statline;
-
- # Read each line of the event log
-EVENT_PROCESS:
- while ($traceevent = <STDIN>) {
- if ($traceevent =~ /$regex_traceevent/o) {
- $process_pid = $1;
- $tracepoint = $4;
-
- if ($opt_read_procstat || $opt_prepend_parent) {
- $process_pid =~ /(.*)-([0-9]*)$/;
- my $process = $1;
- my $pid = $2;
-
- $statline = read_statline($pid);
-
- if ($opt_read_procstat && $process eq '') {
- $process_pid = guess_process_pid($pid, $statline);
- }
-
- if ($opt_prepend_parent) {
- $process_pid = parent_info($pid, $statline) . " :: $process_pid";
- }
- }
-
- # Unnecessary in this script. Uncomment if required
- # $cpus = $2;
- # $timestamp = $3;
- } else {
- next;
- }
-
- # Perl Switch() sucks majorly
- if ($tracepoint eq "mm_page_alloc") {
- $perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++;
- } elsif ($tracepoint eq "mm_page_free") {
- $perprocesspid{$process_pid}->{MM_PAGE_FREE}++
- } elsif ($tracepoint eq "mm_page_free_batched") {
- $perprocesspid{$process_pid}->{MM_PAGE_FREE_BATCHED}++;
- } elsif ($tracepoint eq "mm_page_pcpu_drain") {
- $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++;
- $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++;
- } elsif ($tracepoint eq "mm_page_alloc_zone_locked") {
- $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++;
- $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++;
- } elsif ($tracepoint eq "mm_page_alloc_extfrag") {
-
- # Extract the details of the event now
- $details = $5;
-
- my ($page, $pfn);
- my ($alloc_order, $fallback_order, $pageblock_order);
- my ($alloc_migratetype, $fallback_migratetype);
- my ($fragmenting, $change_ownership);
-
- if ($details !~ /$regex_fragdetails/o) {
- print "WARNING: Failed to parse mm_page_alloc_extfrag as expected\n";
- next;
- }
-
- $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++;
- $page = $1;
- $pfn = $2;
- $alloc_order = $3;
- $fallback_order = $4;
- $pageblock_order = $5;
- $alloc_migratetype = $6;
- $fallback_migratetype = $7;
- $fragmenting = $8;
- $change_ownership = $9;
-
- if ($fragmenting) {
- $perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++;
- if ($fallback_order <= 3) {
- $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++;
- } else {
- $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++;
- }
- }
- if ($change_ownership) {
- $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++;
- }
- } else {
- $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
- }
-
- # Catch a full pcpu drain event
- if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} &&
- $tracepoint ne "mm_page_pcpu_drain") {
-
- $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++;
- $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
- }
-
- # Catch a full pcpu refill event
- if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} &&
- $tracepoint ne "mm_page_alloc_zone_locked") {
- $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++;
- $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
- }
-
- if ($sigint_pending) {
- last EVENT_PROCESS;
- }
- }
-}
-
-sub dump_stats {
- my $hashref = shift;
- my %stats = %$hashref;
-
- # Dump per-process stats
- my $process_pid;
- my $max_strlen = 0;
-
- # Get the maximum process name
- foreach $process_pid (keys %perprocesspid) {
- my $len = length($process_pid);
- if ($len > $max_strlen) {
- $max_strlen = $len;
- }
- }
- $max_strlen += 2;
-
- printf("\n");
- printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
- "Process", "Pages", "Pages", "Pages", "Pages", "PCPU", "PCPU", "PCPU", "Fragment", "Fragment", "MigType", "Fragment", "Fragment", "Unknown");
- printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
- "details", "allocd", "allocd", "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing", "Changed", "Severe", "Moderate", "");
-
- printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
- "", "", "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", "");
-
- foreach $process_pid (keys %stats) {
- # Dump final aggregates
- if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) {
- $stats{$process_pid}->{HIGH_PCPU_DRAINS}++;
- $stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
- }
- if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) {
- $stats{$process_pid}->{HIGH_PCPU_REFILLS}++;
- $stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
- }
-
- printf("%-" . $max_strlen . "s %8d %10d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d\n",
- $process_pid,
- $stats{$process_pid}->{MM_PAGE_ALLOC},
- $stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED},
- $stats{$process_pid}->{MM_PAGE_FREE},
- $stats{$process_pid}->{MM_PAGE_FREE_BATCHED},
- $stats{$process_pid}->{MM_PAGE_PCPU_DRAIN},
- $stats{$process_pid}->{HIGH_PCPU_DRAINS},
- $stats{$process_pid}->{HIGH_PCPU_REFILLS},
- $stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG},
- $stats{$process_pid}->{HIGH_EXT_FRAG},
- $stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED},
- $stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE},
- $stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE},
- $stats{$process_pid}->{EVENT_UNKNOWN});
- }
-}
-
-sub aggregate_perprocesspid() {
- my $process_pid;
- my $process;
- undef %perprocess;
-
- foreach $process_pid (keys %perprocesspid) {
- $process = $process_pid;
- $process =~ s/-([0-9])*$//;
- if ($process eq '') {
- $process = "NO_PROCESS_NAME";
- }
-
- $perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC};
- $perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED};
- $perprocess{$process}->{MM_PAGE_FREE} += $perprocesspid{$process_pid}->{MM_PAGE_FREE};
- $perprocess{$process}->{MM_PAGE_FREE_BATCHED} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_BATCHED};
- $perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN};
- $perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS};
- $perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS};
- $perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG};
- $perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG};
- $perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED};
- $perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE};
- $perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE};
- $perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN};
- }
-}
-
-sub report() {
- if (!$opt_ignorepid) {
- dump_stats(\%perprocesspid);
- } else {
- aggregate_perprocesspid();
- dump_stats(\%perprocess);
- }
-}
-
-# Process events or signals until neither is available
-sub signal_loop() {
- my $sigint_processed;
- do {
- $sigint_processed = 0;
- process_events();
-
- # Handle pending signals if any
- if ($sigint_pending) {
- my $current_time = time;
-
- if ($sigint_exit) {
- print "Received exit signal\n";
- $sigint_pending = 0;
- }
- if ($sigint_report) {
- if ($current_time >= $sigint_received + 2) {
- report();
- $sigint_report = 0;
- $sigint_pending = 0;
- $sigint_processed = 1;
- }
- }
- }
- } while ($sigint_pending || $sigint_processed);
-}
-
-signal_loop();
-report();
diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
deleted file mode 100644
index 4a37c4759cd..00000000000
--- a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
+++ /dev/null
@@ -1,714 +0,0 @@
-#!/usr/bin/perl
-# This is a POC for reading the text representation of trace output related to
-# page reclaim. It makes an attempt to extract some high-level information on
-# what is going on. The accuracy of the parser may vary
-#
-# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
-# other options
-# --read-procstat If the trace lacks process info, get it from /proc
-# --ignore-pid Aggregate processes of the same name together
-#
-# Copyright (c) IBM Corporation 2009
-# Author: Mel Gorman <mel@csn.ul.ie>
-use strict;
-use Getopt::Long;
-
-# Tracepoint events
-use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN => 1;
-use constant MM_VMSCAN_DIRECT_RECLAIM_END => 2;
-use constant MM_VMSCAN_KSWAPD_WAKE => 3;
-use constant MM_VMSCAN_KSWAPD_SLEEP => 4;
-use constant MM_VMSCAN_LRU_SHRINK_ACTIVE => 5;
-use constant MM_VMSCAN_LRU_SHRINK_INACTIVE => 6;
-use constant MM_VMSCAN_LRU_ISOLATE => 7;
-use constant MM_VMSCAN_WRITEPAGE_FILE_SYNC => 8;
-use constant MM_VMSCAN_WRITEPAGE_ANON_SYNC => 9;
-use constant MM_VMSCAN_WRITEPAGE_FILE_ASYNC => 10;
-use constant MM_VMSCAN_WRITEPAGE_ANON_ASYNC => 11;
-use constant MM_VMSCAN_WRITEPAGE_ASYNC => 12;
-use constant EVENT_UNKNOWN => 13;
-
-# Per-order events
-use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11;
-use constant MM_VMSCAN_WAKEUP_KSWAPD_PERORDER => 12;
-use constant MM_VMSCAN_KSWAPD_WAKE_PERORDER => 13;
-use constant HIGH_KSWAPD_REWAKEUP_PERORDER => 14;
-
-# Constants used to track state
-use constant STATE_DIRECT_BEGIN => 15;
-use constant STATE_DIRECT_ORDER => 16;
-use constant STATE_KSWAPD_BEGIN => 17;
-use constant STATE_KSWAPD_ORDER => 18;
-
-# High-level events extrapolated from tracepoints
-use constant HIGH_DIRECT_RECLAIM_LATENCY => 19;
-use constant HIGH_KSWAPD_LATENCY => 20;
-use constant HIGH_KSWAPD_REWAKEUP => 21;
-use constant HIGH_NR_SCANNED => 22;
-use constant HIGH_NR_TAKEN => 23;
-use constant HIGH_NR_RECLAIMED => 24;
-use constant HIGH_NR_CONTIG_DIRTY => 25;
-
-my %perprocesspid;
-my %perprocess;
-my %last_procmap;
-my $opt_ignorepid;
-my $opt_read_procstat;
-
-my $total_wakeup_kswapd;
-my ($total_direct_reclaim, $total_direct_nr_scanned);
-my ($total_direct_latency, $total_kswapd_latency);
-my ($total_direct_nr_reclaimed);
-my ($total_direct_writepage_file_sync, $total_direct_writepage_file_async);
-my ($total_direct_writepage_anon_sync, $total_direct_writepage_anon_async);
-my ($total_kswapd_nr_scanned, $total_kswapd_wake);
-my ($total_kswapd_writepage_file_sync, $total_kswapd_writepage_file_async);
-my ($total_kswapd_writepage_anon_sync, $total_kswapd_writepage_anon_async);
-my ($total_kswapd_nr_reclaimed);
-
-# Catch sigint and exit on request
-my $sigint_report = 0;
-my $sigint_exit = 0;
-my $sigint_pending = 0;
-my $sigint_received = 0;
-sub sigint_handler {
- my $current_time = time;
- if ($current_time - 2 > $sigint_received) {
- print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
- $sigint_report = 1;
- } else {
- if (!$sigint_exit) {
- print "Second SIGINT received quickly, exiting\n";
- }
- $sigint_exit++;
- }
-
- if ($sigint_exit > 3) {
- print "Many SIGINTs received, exiting now without report\n";
- exit;
- }
-
- $sigint_received = $current_time;
- $sigint_pending = 1;
-}
-$SIG{INT} = "sigint_handler";
-
-# Parse command line options
-GetOptions(
- 'ignore-pid' => \$opt_ignorepid,
- 'read-procstat' => \$opt_read_procstat,
-);
-
-# Defaults for dynamically discovered regex's
-my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flags=([A-Z_|]*)';
-my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)';
-my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)';
-my $regex_kswapd_sleep_default = 'nid=([0-9]*)';
-my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
-my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)';
-my $regex_lru_shrink_inactive_default = 'nid=([0-9]*) zid=([0-9]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*) flags=([A-Z_|]*)';
-my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
-my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) flags=([A-Z_|]*)';
-
-# Dyanically discovered regex
-my $regex_direct_begin;
-my $regex_direct_end;
-my $regex_kswapd_wake;
-my $regex_kswapd_sleep;
-my $regex_wakeup_kswapd;
-my $regex_lru_isolate;
-my $regex_lru_shrink_inactive;
-my $regex_lru_shrink_active;
-my $regex_writepage;
-
-# Static regex used. Specified like this for readability and for use with /o
-# (process_pid) (cpus ) ( time ) (tpoint ) (details)
-my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
-my $regex_statname = '[-0-9]*\s\((.*)\).*';
-my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
-
-sub generate_traceevent_regex {
- my $event = shift;
- my $default = shift;
- my $regex;
-
- # Read the event format or use the default
- if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
- print("WARNING: Event $event format string not found\n");
- return $default;
- } else {
- my $line;
- while (!eof(FORMAT)) {
- $line = <FORMAT>;
- $line =~ s/, REC->.*//;
- if ($line =~ /^print fmt:\s"(.*)".*/) {
- $regex = $1;
- $regex =~ s/%s/\([0-9a-zA-Z|_]*\)/g;
- $regex =~ s/%p/\([0-9a-f]*\)/g;
- $regex =~ s/%d/\([-0-9]*\)/g;
- $regex =~ s/%ld/\([-0-9]*\)/g;
- $regex =~ s/%lu/\([0-9]*\)/g;
- }
- }
- }
-
- # Can't handle the print_flags stuff but in the context of this
- # script, it really doesn't matter
- $regex =~ s/\(REC.*\) \? __print_flags.*//;
-
- # Verify fields are in the right order
- my $tuple;
- foreach $tuple (split /\s/, $regex) {
- my ($key, $value) = split(/=/, $tuple);
- my $expected = shift;
- if ($key ne $expected) {
- print("WARNING: Format not as expected for event $event '$key' != '$expected'\n");
- $regex =~ s/$key=\((.*)\)/$key=$1/;
- }
- }
-
- if (defined shift) {
- die("Fewer fields than expected in format");
- }
-
- return $regex;
-}
-
-$regex_direct_begin = generate_traceevent_regex(
- "vmscan/mm_vmscan_direct_reclaim_begin",
- $regex_direct_begin_default,
- "order", "may_writepage",
- "gfp_flags");
-$regex_direct_end = generate_traceevent_regex(
- "vmscan/mm_vmscan_direct_reclaim_end",
- $regex_direct_end_default,
- "nr_reclaimed");
-$regex_kswapd_wake = generate_traceevent_regex(
- "vmscan/mm_vmscan_kswapd_wake",
- $regex_kswapd_wake_default,
- "nid", "order");
-$regex_kswapd_sleep = generate_traceevent_regex(
- "vmscan/mm_vmscan_kswapd_sleep",
- $regex_kswapd_sleep_default,
- "nid");
-$regex_wakeup_kswapd = generate_traceevent_regex(
- "vmscan/mm_vmscan_wakeup_kswapd",
- $regex_wakeup_kswapd_default,
- "nid", "zid", "order");
-$regex_lru_isolate = generate_traceevent_regex(
- "vmscan/mm_vmscan_lru_isolate",
- $regex_lru_isolate_default,
- "isolate_mode", "order",
- "nr_requested", "nr_scanned", "nr_taken",
- "contig_taken", "contig_dirty", "contig_failed");
-$regex_lru_shrink_inactive = generate_traceevent_regex(
- "vmscan/mm_vmscan_lru_shrink_inactive",
- $regex_lru_shrink_inactive_default,
- "nid", "zid",
- "nr_scanned", "nr_reclaimed", "priority",
- "flags");
-$regex_lru_shrink_active = generate_traceevent_regex(
- "vmscan/mm_vmscan_lru_shrink_active",
- $regex_lru_shrink_active_default,
- "nid", "zid",
- "lru",
- "nr_scanned", "nr_rotated", "priority");
-$regex_writepage = generate_traceevent_regex(
- "vmscan/mm_vmscan_writepage",
- $regex_writepage_default,
- "page", "pfn", "flags");
-
-sub read_statline($) {
- my $pid = $_[0];
- my $statline;
-
- if (open(STAT, "/proc/$pid/stat")) {
- $statline = <STAT>;
- close(STAT);
- }
-
- if ($statline eq '') {
- $statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
- }
-
- return $statline;
-}
-
-sub guess_process_pid($$) {
- my $pid = $_[0];
- my $statline = $_[1];
-
- if ($pid == 0) {
- return "swapper-0";
- }
-
- if ($statline !~ /$regex_statname/o) {
- die("Failed to math stat line for process name :: $statline");
- }
- return "$1-$pid";
-}
-
-# Convert sec.usec timestamp format
-sub timestamp_to_ms($) {
- my $timestamp = $_[0];
-
- my ($sec, $usec) = split (/\./, $timestamp);
- return ($sec * 1000) + ($usec / 1000);
-}
-
-sub process_events {
- my $traceevent;
- my $process_pid;
- my $cpus;
- my $timestamp;
- my $tracepoint;
- my $details;
- my $statline;
-
- # Read each line of the event log
-EVENT_PROCESS:
- while ($traceevent = <STDIN>) {
- if ($traceevent =~ /$regex_traceevent/o) {
- $process_pid = $1;
- $timestamp = $3;
- $tracepoint = $4;
-
- $process_pid =~ /(.*)-([0-9]*)$/;
- my $process = $1;
- my $pid = $2;
-
- if ($process eq "") {
- $process = $last_procmap{$pid};
- $process_pid = "$process-$pid";
- }
- $last_procmap{$pid} = $process;
-
- if ($opt_read_procstat) {
- $statline = read_statline($pid);
- if ($opt_read_procstat && $process eq '') {
- $process_pid = guess_process_pid($pid, $statline);
- }
- }
- } else {
- next;
- }
-
- # Perl Switch() sucks majorly
- if ($tracepoint eq "mm_vmscan_direct_reclaim_begin") {
- $timestamp = timestamp_to_ms($timestamp);
- $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}++;
- $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN} = $timestamp;
-
- $details = $5;
- if ($details !~ /$regex_direct_begin/o) {
- print "WARNING: Failed to parse mm_vmscan_direct_reclaim_begin as expected\n";
- print " $details\n";
- print " $regex_direct_begin\n";
- next;
- }
- my $order = $1;
- $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]++;
- $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER} = $order;
- } elsif ($tracepoint eq "mm_vmscan_direct_reclaim_end") {
- # Count the event itself
- my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END};
- $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}++;
-
- # Record how long direct reclaim took this time
- if (defined $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}) {
- $timestamp = timestamp_to_ms($timestamp);
- my $order = $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER};
- my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN});
- $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] = "$order-$latency";
- }
- } elsif ($tracepoint eq "mm_vmscan_kswapd_wake") {
- $details = $5;
- if ($details !~ /$regex_kswapd_wake/o) {
- print "WARNING: Failed to parse mm_vmscan_kswapd_wake as expected\n";
- print " $details\n";
- print " $regex_kswapd_wake\n";
- next;
- }
-
- my $order = $2;
- $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER} = $order;
- if (!$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}) {
- $timestamp = timestamp_to_ms($timestamp);
- $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}++;
- $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = $timestamp;
- $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]++;
- } else {
- $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}++;
- $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]++;
- }
- } elsif ($tracepoint eq "mm_vmscan_kswapd_sleep") {
-
- # Count the event itself
- my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP};
- $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}++;
-
- # Record how long kswapd was awake
- $timestamp = timestamp_to_ms($timestamp);
- my $order = $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER};
- my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN});
- $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index] = "$order-$latency";
- $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = 0;
- } elsif ($tracepoint eq "mm_vmscan_wakeup_kswapd") {
- $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}++;
-
- $details = $5;
- if ($details !~ /$regex_wakeup_kswapd/o) {
- print "WARNING: Failed to parse mm_vmscan_wakeup_kswapd as expected\n";
- print " $details\n";
- print " $regex_wakeup_kswapd\n";
- next;
- }
- my $order = $3;
- $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]++;
- } elsif ($tracepoint eq "mm_vmscan_lru_isolate") {
- $details = $5;
- if ($details !~ /$regex_lru_isolate/o) {
- print "WARNING: Failed to parse mm_vmscan_lru_isolate as expected\n";
- print " $details\n";
- print " $regex_lru_isolate/o\n";
- next;
- }
- my $isolate_mode = $1;
- my $nr_scanned = $4;
- my $nr_contig_dirty = $7;
-
- # To closer match vmstat scanning statistics, only count isolate_both
- # and isolate_inactive as scanning. isolate_active is rotation
- # isolate_inactive == 1
- # isolate_active == 2
- # isolate_both == 3
- if ($isolate_mode != 2) {
- $perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
- }
- $perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty;
- } elsif ($tracepoint eq "mm_vmscan_lru_shrink_inactive") {
- $details = $5;
- if ($details !~ /$regex_lru_shrink_inactive/o) {
- print "WARNING: Failed to parse mm_vmscan_lru_shrink_inactive as expected\n";
- print " $details\n";
- print " $regex_lru_shrink_inactive/o\n";
- next;
- }
- my $nr_reclaimed = $4;
- $perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED} += $nr_reclaimed;
- } elsif ($tracepoint eq "mm_vmscan_writepage") {
- $details = $5;
- if ($details !~ /$regex_writepage/o) {
- print "WARNING: Failed to parse mm_vmscan_writepage as expected\n";
- print " $details\n";
- print " $regex_writepage\n";
- next;
- }
-
- my $flags = $3;
- my $file = 0;
- my $sync_io = 0;
- if ($flags =~ /RECLAIM_WB_FILE/) {
- $file = 1;
- }
- if ($flags =~ /RECLAIM_WB_SYNC/) {
- $sync_io = 1;
- }
- if ($sync_io) {
- if ($file) {
- $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}++;
- } else {
- $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}++;
- }
- } else {
- if ($file) {
- $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}++;
- } else {
- $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}++;
- }
- }
- } else {
- $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
- }
-
- if ($sigint_pending) {
- last EVENT_PROCESS;
- }
- }
-}
-
-sub dump_stats {
- my $hashref = shift;
- my %stats = %$hashref;
-
- # Dump per-process stats
- my $process_pid;
- my $max_strlen = 0;
-
- # Get the maximum process name
- foreach $process_pid (keys %perprocesspid) {
- my $len = length($process_pid);
- if ($len > $max_strlen) {
- $max_strlen = $len;
- }
- }
- $max_strlen += 2;
-
- # Work out latencies
- printf("\n") if !$opt_ignorepid;
- printf("Reclaim latencies expressed as order-latency_in_ms\n") if !$opt_ignorepid;
- foreach $process_pid (keys %stats) {
-
- if (!$stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[0] &&
- !$stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[0]) {
- next;
- }
-
- printf "%-" . $max_strlen . "s ", $process_pid if !$opt_ignorepid;
- my $index = 0;
- while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] ||
- defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) {
-
- if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
- printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid;
- my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
- $total_direct_latency += $latency;
- } else {
- printf("%s ", $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) if !$opt_ignorepid;
- my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]);
- $total_kswapd_latency += $latency;
- }
- $index++;
- }
- print "\n" if !$opt_ignorepid;
- }
-
- # Print out process activity
- printf("\n");
- printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s\n", "Process", "Direct", "Wokeup", "Pages", "Pages", "Pages", "Pages", "Time");
- printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s\n", "details", "Rclms", "Kswapd", "Scanned", "Rclmed", "Sync-IO", "ASync-IO", "Stalled");
- foreach $process_pid (keys %stats) {
-
- if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
- next;
- }
-
- $total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
- $total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
- $total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
- $total_direct_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED};
- $total_direct_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
- $total_direct_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
- $total_direct_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
-
- $total_direct_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
-
- my $index = 0;
- my $this_reclaim_delay = 0;
- while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
- my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
- $this_reclaim_delay += $latency;
- $index++;
- }
-
- printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8u %8u %8.3f",
- $process_pid,
- $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
- $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
- $stats{$process_pid}->{HIGH_NR_SCANNED},
- $stats{$process_pid}->{HIGH_NR_RECLAIMED},
- $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC},
- $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC},
- $this_reclaim_delay / 1000);
-
- if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
- print " ";
- for (my $order = 0; $order < 20; $order++) {
- my $count = $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
- if ($count != 0) {
- print "direct-$order=$count ";
- }
- }
- }
- if ($stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}) {
- print " ";
- for (my $order = 0; $order < 20; $order++) {
- my $count = $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
- if ($count != 0) {
- print "wakeup-$order=$count ";
- }
- }
- }
- if ($stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY}) {
- print " ";
- my $count = $stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY};
- if ($count != 0) {
- print "contig-dirty=$count ";
- }
- }
-
- print "\n";
- }
-
- # Print out kswapd activity
- printf("\n");
- printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Kswapd", "Kswapd", "Order", "Pages", "Pages", "Pages", "Pages");
- printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Rclmed", "Sync-IO", "ASync-IO");
- foreach $process_pid (keys %stats) {
-
- if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
- next;
- }
-
- $total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
- $total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
- $total_kswapd_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED};
- $total_kswapd_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
- $total_kswapd_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
- $total_kswapd_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
- $total_kswapd_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
-
- printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8i %8u",
- $process_pid,
- $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
- $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
- $stats{$process_pid}->{HIGH_NR_SCANNED},
- $stats{$process_pid}->{HIGH_NR_RECLAIMED},
- $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC},
- $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC});
-
- if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
- print " ";
- for (my $order = 0; $order < 20; $order++) {
- my $count = $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
- if ($count != 0) {
- print "wake-$order=$count ";
- }
- }
- }
- if ($stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}) {
- print " ";
- for (my $order = 0; $order < 20; $order++) {
- my $count = $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order];
- if ($count != 0) {
- print "rewake-$order=$count ";
- }
- }
- }
- printf("\n");
- }
-
- # Print out summaries
- $total_direct_latency /= 1000;
- $total_kswapd_latency /= 1000;
- print "\nSummary\n";
- print "Direct reclaims: $total_direct_reclaim\n";
- print "Direct reclaim pages scanned: $total_direct_nr_scanned\n";
- print "Direct reclaim pages reclaimed: $total_direct_nr_reclaimed\n";
- print "Direct reclaim write file sync I/O: $total_direct_writepage_file_sync\n";
- print "Direct reclaim write anon sync I/O: $total_direct_writepage_anon_sync\n";
- print "Direct reclaim write file async I/O: $total_direct_writepage_file_async\n";
- print "Direct reclaim write anon async I/O: $total_direct_writepage_anon_async\n";
- print "Wake kswapd requests: $total_wakeup_kswapd\n";
- printf "Time stalled direct reclaim: %-1.2f seconds\n", $total_direct_latency;
- print "\n";
- print "Kswapd wakeups: $total_kswapd_wake\n";
- print "Kswapd pages scanned: $total_kswapd_nr_scanned\n";
- print "Kswapd pages reclaimed: $total_kswapd_nr_reclaimed\n";
- print "Kswapd reclaim write file sync I/O: $total_kswapd_writepage_file_sync\n";
- print "Kswapd reclaim write anon sync I/O: $total_kswapd_writepage_anon_sync\n";
- print "Kswapd reclaim write file async I/O: $total_kswapd_writepage_file_async\n";
- print "Kswapd reclaim write anon async I/O: $total_kswapd_writepage_anon_async\n";
- printf "Time kswapd awake: %-1.2f seconds\n", $total_kswapd_latency;
-}
-
-sub aggregate_perprocesspid() {
- my $process_pid;
- my $process;
- undef %perprocess;
-
- foreach $process_pid (keys %perprocesspid) {
- $process = $process_pid;
- $process =~ s/-([0-9])*$//;
- if ($process eq '') {
- $process = "NO_PROCESS_NAME";
- }
-
- $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN} += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
- $perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE} += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
- $perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
- $perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
- $perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
- $perprocess{$process}->{HIGH_NR_RECLAIMED} += $perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED};
- $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
- $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
- $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
- $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
-
- for (my $order = 0; $order < 20; $order++) {
- $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
- $perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
- $perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
-
- }
-
- # Aggregate direct reclaim latencies
- my $wr_index = $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END};
- my $rd_index = 0;
- while (defined $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]) {
- $perprocess{$process}->{HIGH_DIRECT_RECLAIM_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index];
- $rd_index++;
- $wr_index++;
- }
- $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index;
-
- # Aggregate kswapd latencies
- my $wr_index = $perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP};
- my $rd_index = 0;
- while (defined $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]) {
- $perprocess{$process}->{HIGH_KSWAPD_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index];
- $rd_index++;
- $wr_index++;
- }
- $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index;
- }
-}
-
-sub report() {
- if (!$opt_ignorepid) {
- dump_stats(\%perprocesspid);
- } else {
- aggregate_perprocesspid();
- dump_stats(\%perprocess);
- }
-}
-
-# Process events or signals until neither is available
-sub signal_loop() {
- my $sigint_processed;
- do {
- $sigint_processed = 0;
- process_events();
-
- # Handle pending signals if any
- if ($sigint_pending) {
- my $current_time = time;
-
- if ($sigint_exit) {
- print "Received exit signal\n";
- $sigint_pending = 0;
- }
- if ($sigint_report) {
- if ($current_time >= $sigint_received + 2) {
- report();
- $sigint_report = 0;
- $sigint_pending = 0;
- $sigint_processed = 1;
- }
- }
- }
- } while ($sigint_pending || $sigint_processed);
-}
-
-signal_loop();
-report();
diff --git a/Documentation/trace/ring-buffer-design.txt b/Documentation/trace/ring-buffer-design.txt
deleted file mode 100644
index 7d350b49658..00000000000
--- a/Documentation/trace/ring-buffer-design.txt
+++ /dev/null
@@ -1,955 +0,0 @@
- Lockless Ring Buffer Design
- ===========================
-
-Copyright 2009 Red Hat Inc.
- Author: Steven Rostedt <srostedt@redhat.com>
- License: The GNU Free Documentation License, Version 1.2
- (dual licensed under the GPL v2)
-Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto,
- and Frederic Weisbecker.
-
-
-Written for: 2.6.31
-
-Terminology used in this Document
----------------------------------
-
-tail - where new writes happen in the ring buffer.
-
-head - where new reads happen in the ring buffer.
-
-producer - the task that writes into the ring buffer (same as writer)
-
-writer - same as producer
-
-consumer - the task that reads from the buffer (same as reader)
-
-reader - same as consumer.
-
-reader_page - A page outside the ring buffer used solely (for the most part)
- by the reader.
-
-head_page - a pointer to the page that the reader will use next
-
-tail_page - a pointer to the page that will be written to next
-
-commit_page - a pointer to the page with the last finished non-nested write.
-
-cmpxchg - hardware-assisted atomic transaction that performs the following:
-
- A = B iff previous A == C
-
- R = cmpxchg(A, C, B) is saying that we replace A with B if and only if
- current A is equal to C, and we put the old (current) A into R
-
- R gets the previous A regardless if A is updated with B or not.
-
- To see if the update was successful a compare of R == C may be used.
-
-The Generic Ring Buffer
------------------------
-
-The ring buffer can be used in either an overwrite mode or in
-producer/consumer mode.
-
-Producer/consumer mode is where if the producer were to fill up the
-buffer before the consumer could free up anything, the producer
-will stop writing to the buffer. This will lose most recent events.
-
-Overwrite mode is where if the producer were to fill up the buffer
-before the consumer could free up anything, the producer will
-overwrite the older data. This will lose the oldest events.
-
-No two writers can write at the same time (on the same per-cpu buffer),
-but a writer may interrupt another writer, but it must finish writing
-before the previous writer may continue. This is very important to the
-algorithm. The writers act like a "stack". The way interrupts works
-enforces this behavior.
-
-
- writer1 start
- <preempted> writer2 start
- <preempted> writer3 start
- writer3 finishes
- writer2 finishes
- writer1 finishes
-
-This is very much like a writer being preempted by an interrupt and
-the interrupt doing a write as well.
-
-Readers can happen at any time. But no two readers may run at the
-same time, nor can a reader preempt/interrupt another reader. A reader
-cannot preempt/interrupt a writer, but it may read/consume from the
-buffer at the same time as a writer is writing, but the reader must be
-on another processor to do so. A reader may read on its own processor
-and can be preempted by a writer.
-
-A writer can preempt a reader, but a reader cannot preempt a writer.
-But a reader can read the buffer at the same time (on another processor)
-as a writer.
-
-The ring buffer is made up of a list of pages held together by a linked list.
-
-At initialization a reader page is allocated for the reader that is not
-part of the ring buffer.
-
-The head_page, tail_page and commit_page are all initialized to point
-to the same page.
-
-The reader page is initialized to have its next pointer pointing to
-the head page, and its previous pointer pointing to a page before
-the head page.
-
-The reader has its own page to use. At start up time, this page is
-allocated but is not attached to the list. When the reader wants
-to read from the buffer, if its page is empty (like it is on start-up),
-it will swap its page with the head_page. The old reader page will
-become part of the ring buffer and the head_page will be removed.
-The page after the inserted page (old reader_page) will become the
-new head page.
-
-Once the new page is given to the reader, the reader could do what
-it wants with it, as long as a writer has left that page.
-
-A sample of how the reader page is swapped: Note this does not
-show the head page in the buffer, it is for demonstrating a swap
-only.
-
- +------+
- |reader| RING BUFFER
- |page |
- +------+
- +---+ +---+ +---+
- | |-->| |-->| |
- | |<--| |<--| |
- +---+ +---+ +---+
- ^ | ^ |
- | +-------------+ |
- +-----------------+
-
-
- +------+
- |reader| RING BUFFER
- |page |-------------------+
- +------+ v
- | +---+ +---+ +---+
- | | |-->| |-->| |
- | | |<--| |<--| |<-+
- | +---+ +---+ +---+ |
- | ^ | ^ | |
- | | +-------------+ | |
- | +-----------------+ |
- +------------------------------------+
-
- +------+
- |reader| RING BUFFER
- |page |-------------------+
- +------+ <---------------+ v
- | ^ +---+ +---+ +---+
- | | | |-->| |-->| |
- | | | | | |<--| |<-+
- | | +---+ +---+ +---+ |
- | | | ^ | |
- | | +-------------+ | |
- | +-----------------------------+ |
- +------------------------------------+
-
- +------+
- |buffer| RING BUFFER
- |page |-------------------+
- +------+ <---------------+ v
- | ^ +---+ +---+ +---+
- | | | | | |-->| |
- | | New | | | |<--| |<-+
- | | Reader +---+ +---+ +---+ |
- | | page ----^ | |
- | | | |
- | +-----------------------------+ |
- +------------------------------------+
-
-
-
-It is possible that the page swapped is the commit page and the tail page,
-if what is in the ring buffer is less than what is held in a buffer page.
-
-
- reader page commit page tail page
- | | |
- v | |
- +---+ | |
- | |<----------+ |
- | |<------------------------+
- | |------+
- +---+ |
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-This case is still valid for this algorithm.
-When the writer leaves the page, it simply goes into the ring buffer
-since the reader page still points to the next location in the ring
-buffer.
-
-
-The main pointers:
-
- reader page - The page used solely by the reader and is not part
- of the ring buffer (may be swapped in)
-
- head page - the next page in the ring buffer that will be swapped
- with the reader page.
-
- tail page - the page where the next write will take place.
-
- commit page - the page that last finished a write.
-
-The commit page only is updated by the outermost writer in the
-writer stack. A writer that preempts another writer will not move the
-commit page.
-
-When data is written into the ring buffer, a position is reserved
-in the ring buffer and passed back to the writer. When the writer
-is finished writing data into that position, it commits the write.
-
-Another write (or a read) may take place at anytime during this
-transaction. If another write happens it must finish before continuing
-with the previous write.
-
-
- Write reserve:
-
- Buffer page
- +---------+
- |written |
- +---------+ <--- given back to writer (current commit)
- |reserved |
- +---------+ <--- tail pointer
- | empty |
- +---------+
-
- Write commit:
-
- Buffer page
- +---------+
- |written |
- +---------+
- |written |
- +---------+ <--- next position for write (current commit)
- | empty |
- +---------+
-
-
- If a write happens after the first reserve:
-
- Buffer page
- +---------+
- |written |
- +---------+ <-- current commit
- |reserved |
- +---------+ <--- given back to second writer
- |reserved |
- +---------+ <--- tail pointer
-
- After second writer commits:
-
-
- Buffer page
- +---------+
- |written |
- +---------+ <--(last full commit)
- |reserved |
- +---------+
- |pending |
- |commit |
- +---------+ <--- tail pointer
-
- When the first writer commits:
-
- Buffer page
- +---------+
- |written |
- +---------+
- |written |
- +---------+
- |written |
- +---------+ <--(last full commit and tail pointer)
-
-
-The commit pointer points to the last write location that was
-committed without preempting another write. When a write that
-preempted another write is committed, it only becomes a pending commit
-and will not be a full commit until all writes have been committed.
-
-The commit page points to the page that has the last full commit.
-The tail page points to the page with the last write (before
-committing).
-
-The tail page is always equal to or after the commit page. It may
-be several pages ahead. If the tail page catches up to the commit
-page then no more writes may take place (regardless of the mode
-of the ring buffer: overwrite and produce/consumer).
-
-The order of pages is:
-
- head page
- commit page
- tail page
-
-Possible scenario:
- tail page
- head page commit page |
- | | |
- v v v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-There is a special case that the head page is after either the commit page
-and possibly the tail page. That is when the commit (and tail) page has been
-swapped with the reader page. This is because the head page is always
-part of the ring buffer, but the reader page is not. Whenever there
-has been less than a full page that has been committed inside the ring buffer,
-and a reader swaps out a page, it will be swapping out the commit page.
-
-
- reader page commit page tail page
- | | |
- v | |
- +---+ | |
- | |<----------+ |
- | |<------------------------+
- | |------+
- +---+ |
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
- ^
- |
- head page
-
-
-In this case, the head page will not move when the tail and commit
-move back into the ring buffer.
-
-The reader cannot swap a page into the ring buffer if the commit page
-is still on that page. If the read meets the last commit (real commit
-not pending or reserved), then there is nothing more to read.
-The buffer is considered empty until another full commit finishes.
-
-When the tail meets the head page, if the buffer is in overwrite mode,
-the head page will be pushed ahead one. If the buffer is in producer/consumer
-mode, the write will fail.
-
-Overwrite mode:
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
- ^
- |
- head page
-
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
- ^
- |
- head page
-
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
- ^
- |
- head page
-
-Note, the reader page will still point to the previous head page.
-But when a swap takes place, it will use the most recent head page.
-
-
-Making the Ring Buffer Lockless:
---------------------------------
-
-The main idea behind the lockless algorithm is to combine the moving
-of the head_page pointer with the swapping of pages with the reader.
-State flags are placed inside the pointer to the page. To do this,
-each page must be aligned in memory by 4 bytes. This will allow the 2
-least significant bits of the address to be used as flags, since
-they will always be zero for the address. To get the address,
-simply mask out the flags.
-
- MASK = ~3
-
- address & MASK
-
-Two flags will be kept by these two bits:
-
- HEADER - the page being pointed to is a head page
-
- UPDATE - the page being pointed to is being updated by a writer
- and was or is about to be a head page.
-
-
- reader page
- |
- v
- +---+
- | |------+
- +---+ |
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-
-The above pointer "-H->" would have the HEADER flag set. That is
-the next page is the next page to be swapped out by the reader.
-This pointer means the next page is the head page.
-
-When the tail page meets the head pointer, it will use cmpxchg to
-change the pointer to the UPDATE state:
-
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-"-U->" represents a pointer in the UPDATE state.
-
-Any access to the reader will need to take some sort of lock to serialize
-the readers. But the writers will never take a lock to write to the
-ring buffer. This means we only need to worry about a single reader,
-and writes only preempt in "stack" formation.
-
-When the reader tries to swap the page with the ring buffer, it
-will also use cmpxchg. If the flag bit in the pointer to the
-head page does not have the HEADER flag set, the compare will fail
-and the reader will need to look for the new head page and try again.
-Note, the flags UPDATE and HEADER are never set at the same time.
-
-The reader swaps the reader page as follows:
-
- +------+
- |reader| RING BUFFER
- |page |
- +------+
- +---+ +---+ +---+
- | |--->| |--->| |
- | |<---| |<---| |
- +---+ +---+ +---+
- ^ | ^ |
- | +---------------+ |
- +-----H-------------+
-
-The reader sets the reader page next pointer as HEADER to the page after
-the head page.
-
-
- +------+
- |reader| RING BUFFER
- |page |-------H-----------+
- +------+ v
- | +---+ +---+ +---+
- | | |--->| |--->| |
- | | |<---| |<---| |<-+
- | +---+ +---+ +---+ |
- | ^ | ^ | |
- | | +---------------+ | |
- | +-----H-------------+ |
- +--------------------------------------+
-
-It does a cmpxchg with the pointer to the previous head page to make it
-point to the reader page. Note that the new pointer does not have the HEADER
-flag set. This action atomically moves the head page forward.
-
- +------+
- |reader| RING BUFFER
- |page |-------H-----------+
- +------+ v
- | ^ +---+ +---+ +---+
- | | | |-->| |-->| |
- | | | |<--| |<--| |<-+
- | | +---+ +---+ +---+ |
- | | | ^ | |
- | | +-------------+ | |
- | +-----------------------------+ |
- +------------------------------------+
-
-After the new head page is set, the previous pointer of the head page is
-updated to the reader page.
-
- +------+
- |reader| RING BUFFER
- |page |-------H-----------+
- +------+ <---------------+ v
- | ^ +---+ +---+ +---+
- | | | |-->| |-->| |
- | | | | | |<--| |<-+
- | | +---+ +---+ +---+ |
- | | | ^ | |
- | | +-------------+ | |
- | +-----------------------------+ |
- +------------------------------------+
-
- +------+
- |buffer| RING BUFFER
- |page |-------H-----------+ <--- New head page
- +------+ <---------------+ v
- | ^ +---+ +---+ +---+
- | | | | | |-->| |
- | | New | | | |<--| |<-+
- | | Reader +---+ +---+ +---+ |
- | | page ----^ | |
- | | | |
- | +-----------------------------+ |
- +------------------------------------+
-
-Another important point: The page that the reader page points back to
-by its previous pointer (the one that now points to the new head page)
-never points back to the reader page. That is because the reader page is
-not part of the ring buffer. Traversing the ring buffer via the next pointers
-will always stay in the ring buffer. Traversing the ring buffer via the
-prev pointers may not.
-
-Note, the way to determine a reader page is simply by examining the previous
-pointer of the page. If the next pointer of the previous page does not
-point back to the original page, then the original page is a reader page:
-
-
- +--------+
- | reader | next +----+
- | page |-------->| |<====== (buffer page)
- +--------+ +----+
- | | ^
- | v | next
- prev | +----+
- +------------->| |
- +----+
-
-The way the head page moves forward:
-
-When the tail page meets the head page and the buffer is in overwrite mode
-and more writes take place, the head page must be moved forward before the
-writer may move the tail page. The way this is done is that the writer
-performs a cmpxchg to convert the pointer to the head page from the HEADER
-flag to have the UPDATE flag set. Once this is done, the reader will
-not be able to swap the head page from the buffer, nor will it be able to
-move the head page, until the writer is finished with the move.
-
-This eliminates any races that the reader can have on the writer. The reader
-must spin, and this is why the reader cannot preempt the writer.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-The following page will be made into the new head page.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-After the new head page has been set, we can set the old head page
-pointer back to NORMAL.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-After the head page has been moved, the tail page may now move forward.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-
-The above are the trivial updates. Now for the more complex scenarios.
-
-
-As stated before, if enough writes preempt the first write, the
-tail page may make it all the way around the buffer and meet the commit
-page. At this time, we must start dropping writes (usually with some kind
-of warning to the user). But what happens if the commit was still on the
-reader page? The commit page is not part of the ring buffer. The tail page
-must account for this.
-
-
- reader page commit page
- | |
- v |
- +---+ |
- | |<----------+
- | |
- | |------+
- +---+ |
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
- ^
- |
- tail page
-
-If the tail page were to simply push the head page forward, the commit when
-leaving the reader page would not be pointing to the correct page.
-
-The solution to this is to test if the commit page is on the reader page
-before pushing the head page. If it is, then it can be assumed that the
-tail page wrapped the buffer, and we must drop new writes.
-
-This is not a race condition, because the commit page can only be moved
-by the outermost writer (the writer that was preempted).
-This means that the commit will not move while a writer is moving the
-tail page. The reader cannot swap the reader page if it is also being
-used as the commit page. The reader can simply check that the commit
-is off the reader page. Once the commit page leaves the reader page
-it will never go back on it unless a reader does another swap with the
-buffer page that is also the commit page.
-
-
-Nested writes
--------------
-
-In the pushing forward of the tail page we must first push forward
-the head page if the head page is the next page. If the head page
-is not the next page, the tail page is simply updated with a cmpxchg.
-
-Only writers move the tail page. This must be done atomically to protect
-against nested writers.
-
- temp_page = tail_page
- next_page = temp_page->next
- cmpxchg(tail_page, temp_page, next_page)
-
-The above will update the tail page if it is still pointing to the expected
-page. If this fails, a nested write pushed it forward, the the current write
-does not need to push it.
-
-
- temp page
- |
- v
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-Nested write comes in and moves the tail page forward:
-
- tail page (moved by nested writer)
- temp page |
- | |
- v v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-The above would fail the cmpxchg, but since the tail page has already
-been moved forward, the writer will just try again to reserve storage
-on the new tail page.
-
-But the moving of the head page is a bit more complex.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-The write converts the head page pointer to UPDATE.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-But if a nested writer preempts here, it will see that the next
-page is a head page, but it is also nested. It will detect that
-it is nested and will save that information. The detection is the
-fact that it sees the UPDATE flag instead of a HEADER or NORMAL
-pointer.
-
-The nested writer will set the new head page pointer.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-But it will not reset the update back to normal. Only the writer
-that converted a pointer from HEAD to UPDATE will convert it back
-to NORMAL.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-After the nested writer finishes, the outermost writer will convert
-the UPDATE pointer to NORMAL.
-
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-
-It can be even more complex if several nested writes came in and moved
-the tail page ahead several pages:
-
-
-(first writer)
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-The write converts the head page pointer to UPDATE.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-Next writer comes in, and sees the update and sets up the new
-head page.
-
-(second writer)
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-The nested writer moves the tail page forward. But does not set the old
-update page to NORMAL because it is not the outermost writer.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-Another writer preempts and sees the page after the tail page is a head page.
-It changes it from HEAD to UPDATE.
-
-(third writer)
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-U->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-The writer will move the head page forward:
-
-
-(third writer)
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-U->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-But now that the third writer did change the HEAD flag to UPDATE it
-will convert it to normal:
-
-
-(third writer)
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-
-Then it will move the tail page, and return back to the second writer.
-
-
-(second writer)
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-
-The second writer will fail to move the tail page because it was already
-moved, so it will try again and add its data to the new tail page.
-It will return to the first writer.
-
-
-(first writer)
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-The first writer cannot know atomically if the tail page moved
-while it updates the HEAD page. It will then update the head page to
-what it thinks is the new head page.
-
-
-(first writer)
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-Since the cmpxchg returns the old value of the pointer the first writer
-will see it succeeded in updating the pointer from NORMAL to HEAD.
-But as we can see, this is not good enough. It must also check to see
-if the tail page is either where it use to be or on the next page:
-
-
-(first writer)
-
- A B tail page
- | | |
- v v v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-If tail page != A and tail page != B, then it must reset the pointer
-back to NORMAL. The fact that it only needs to worry about nested
-writers means that it only needs to check this after setting the HEAD page.
-
-
-(first writer)
-
- A B tail page
- | | |
- v v v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-Now the writer can update the head page. This is also why the head page must
-remain in UPDATE and only reset by the outermost writer. This prevents
-the reader from seeing the incorrect head page.
-
-
-(first writer)
-
- A B tail page
- | | |
- v v v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
deleted file mode 100644
index 058cc6c9dc5..00000000000
--- a/Documentation/trace/tracepoint-analysis.txt
+++ /dev/null
@@ -1,327 +0,0 @@
- Notes on Analysing Behaviour Using Events and Tracepoints
-
- Documentation written by Mel Gorman
- PCL information heavily based on email from Ingo Molnar
-
-1. Introduction
-===============
-
-Tracepoints (see Documentation/trace/tracepoints.txt) can be used without
-creating custom kernel modules to register probe functions using the event
-tracing infrastructure.
-
-Simplistically, tracepoints represent important events that can be
-taken in conjunction with other tracepoints to build a "Big Picture" of
-what is going on within the system. There are a large number of methods for
-gathering and interpreting these events. Lacking any current Best Practises,
-this document describes some of the methods that can be used.
-
-This document assumes that debugfs is mounted on /sys/kernel/debug and that
-the appropriate tracing options have been configured into the kernel. It is
-assumed that the PCL tool tools/perf has been installed and is in your path.
-
-2. Listing Available Events
-===========================
-
-2.1 Standard Utilities
-----------------------
-
-All possible events are visible from /sys/kernel/debug/tracing/events. Simply
-calling
-
- $ find /sys/kernel/debug/tracing/events -type d
-
-will give a fair indication of the number of events available.
-
-2.2 PCL (Performance Counters for Linux)
--------
-
-Discovery and enumeration of all counters and events, including tracepoints,
-are available with the perf tool. Getting a list of available events is a
-simple case of:
-
- $ perf list 2>&1 | grep Tracepoint
- ext4:ext4_free_inode [Tracepoint event]
- ext4:ext4_request_inode [Tracepoint event]
- ext4:ext4_allocate_inode [Tracepoint event]
- ext4:ext4_write_begin [Tracepoint event]
- ext4:ext4_ordered_write_end [Tracepoint event]
- [ .... remaining output snipped .... ]
-
-
-3. Enabling Events
-==================
-
-3.1 System-Wide Event Enabling
-------------------------------
-
-See Documentation/trace/events.txt for a proper description on how events
-can be enabled system-wide. A short example of enabling all events related
-to page allocation would look something like:
-
- $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
-
-3.2 System-Wide Event Enabling with SystemTap
----------------------------------------------
-
-In SystemTap, tracepoints are accessible using the kernel.trace() function
-call. The following is an example that reports every 5 seconds what processes
-were allocating the pages.
-
- global page_allocs
-
- probe kernel.trace("mm_page_alloc") {
- page_allocs[execname()]++
- }
-
- function print_count() {
- printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
- foreach (proc in page_allocs-)
- printf("%-25d %s\n", page_allocs[proc], proc)
- printf ("\n")
- delete page_allocs
- }
-
- probe timer.s(5) {
- print_count()
- }
-
-3.3 System-Wide Event Enabling with PCL
----------------------------------------
-
-By specifying the -a switch and analysing sleep, the system-wide events
-for a duration of time can be examined.
-
- $ perf stat -a \
- -e kmem:mm_page_alloc -e kmem:mm_page_free \
- -e kmem:mm_page_free_batched \
- sleep 10
- Performance counter stats for 'sleep 10':
-
- 9630 kmem:mm_page_alloc
- 2143 kmem:mm_page_free
- 7424 kmem:mm_page_free_batched
-
- 10.002577764 seconds time elapsed
-
-Similarly, one could execute a shell and exit it as desired to get a report
-at that point.
-
-3.4 Local Event Enabling
-------------------------
-
-Documentation/trace/ftrace.txt describes how to enable events on a per-thread
-basis using set_ftrace_pid.
-
-3.5 Local Event Enablement with PCL
------------------------------------
-
-Events can be activated and tracked for the duration of a process on a local
-basis using PCL such as follows.
-
- $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \
- -e kmem:mm_page_free_batched ./hackbench 10
- Time: 0.909
-
- Performance counter stats for './hackbench 10':
-
- 17803 kmem:mm_page_alloc
- 12398 kmem:mm_page_free
- 4827 kmem:mm_page_free_batched
-
- 0.973913387 seconds time elapsed
-
-4. Event Filtering
-==================
-
-Documentation/trace/ftrace.txt covers in-depth how to filter events in
-ftrace. Obviously using grep and awk of trace_pipe is an option as well
-as any script reading trace_pipe.
-
-5. Analysing Event Variances with PCL
-=====================================
-
-Any workload can exhibit variances between runs and it can be important
-to know what the standard deviation is. By and large, this is left to the
-performance analyst to do it by hand. In the event that the discrete event
-occurrences are useful to the performance analyst, then perf can be used.
-
- $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free
- -e kmem:mm_page_free_batched ./hackbench 10
- Time: 0.890
- Time: 0.895
- Time: 0.915
- Time: 1.001
- Time: 0.899
-
- Performance counter stats for './hackbench 10' (5 runs):
-
- 16630 kmem:mm_page_alloc ( +- 3.542% )
- 11486 kmem:mm_page_free ( +- 4.771% )
- 4730 kmem:mm_page_free_batched ( +- 2.325% )
-
- 0.982653002 seconds time elapsed ( +- 1.448% )
-
-In the event that some higher-level event is required that depends on some
-aggregation of discrete events, then a script would need to be developed.
-
-Using --repeat, it is also possible to view how events are fluctuating over
-time on a system-wide basis using -a and sleep.
-
- $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \
- -e kmem:mm_page_free_batched \
- -a --repeat 10 \
- sleep 1
- Performance counter stats for 'sleep 1' (10 runs):
-
- 1066 kmem:mm_page_alloc ( +- 26.148% )
- 182 kmem:mm_page_free ( +- 5.464% )
- 890 kmem:mm_page_free_batched ( +- 30.079% )
-
- 1.002251757 seconds time elapsed ( +- 0.005% )
-
-6. Higher-Level Analysis with Helper Scripts
-============================================
-
-When events are enabled the events that are triggering can be read from
-/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary
-options exist as well. By post-processing the output, further information can
-be gathered on-line as appropriate. Examples of post-processing might include
-
- o Reading information from /proc for the PID that triggered the event
- o Deriving a higher-level event from a series of lower-level events.
- o Calculating latencies between two events
-
-Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
-script that can read trace_pipe from STDIN or a copy of a trace. When used
-on-line, it can be interrupted once to generate a report without exiting
-and twice to exit.
-
-Simplistically, the script just reads STDIN and counts up events but it
-also can do more such as
-
- o Derive high-level events from many low-level events. If a number of pages
- are freed to the main allocator from the per-CPU lists, it recognises
- that as one per-CPU drain even though there is no specific tracepoint
- for that event
- o It can aggregate based on PID or individual process number
- o In the event memory is getting externally fragmented, it reports
- on whether the fragmentation event was severe or moderate.
- o When receiving an event about a PID, it can record who the parent was so
- that if large numbers of events are coming from very short-lived
- processes, the parent process responsible for creating all the helpers
- can be identified
-
-7. Lower-Level Analysis with PCL
-================================
-
-There may also be a requirement to identify what functions within a program
-were generating events within the kernel. To begin this sort of analysis, the
-data must be recorded. At the time of writing, this required root:
-
- $ perf record -c 1 \
- -e kmem:mm_page_alloc -e kmem:mm_page_free \
- -e kmem:mm_page_free_batched \
- ./hackbench 10
- Time: 0.894
- [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
-
-Note the use of '-c 1' to set the event period to sample. The default sample
-period is quite high to minimise overhead but the information collected can be
-very coarse as a result.
-
-This record outputted a file called perf.data which can be analysed using
-perf report.
-
- $ perf report
- # Samples: 30922
- #
- # Overhead Command Shared Object
- # ........ ......... ................................
- #
- 87.27% hackbench [vdso]
- 6.85% hackbench /lib/i686/cmov/libc-2.9.so
- 2.62% hackbench /lib/ld-2.9.so
- 1.52% perf [vdso]
- 1.22% hackbench ./hackbench
- 0.48% hackbench [kernel]
- 0.02% perf /lib/i686/cmov/libc-2.9.so
- 0.01% perf /usr/bin/perf
- 0.01% perf /lib/ld-2.9.so
- 0.00% hackbench /lib/i686/cmov/libpthread-2.9.so
- #
- # (For more details, try: perf report --sort comm,dso,symbol)
- #
-
-According to this, the vast majority of events triggered on events
-within the VDSO. With simple binaries, this will often be the case so let's
-take a slightly different example. In the course of writing this, it was
-noticed that X was generating an insane amount of page allocations so let's look
-at it:
-
- $ perf record -c 1 -f \
- -e kmem:mm_page_alloc -e kmem:mm_page_free \
- -e kmem:mm_page_free_batched \
- -p `pidof X`
-
-This was interrupted after a few seconds and
-
- $ perf report
- # Samples: 27666
- #
- # Overhead Command Shared Object
- # ........ ....... .......................................
- #
- 51.95% Xorg [vdso]
- 47.95% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1
- 0.09% Xorg /lib/i686/cmov/libc-2.9.so
- 0.01% Xorg [kernel]
- #
- # (For more details, try: perf report --sort comm,dso,symbol)
- #
-
-So, almost half of the events are occurring in a library. To get an idea which
-symbol:
-
- $ perf report --sort comm,dso,symbol
- # Samples: 27666
- #
- # Overhead Command Shared Object Symbol
- # ........ ....... ....................................... ......
- #
- 51.95% Xorg [vdso] [.] 0x000000ffffe424
- 47.93% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixmanFillsse2
- 0.09% Xorg /lib/i686/cmov/libc-2.9.so [.] _int_malloc
- 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixman_region32_copy_f
- 0.01% Xorg [kernel] [k] read_hpet
- 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path
- 0.00% Xorg [kernel] [k] ftrace_trace_userstack
-
-To see where within the function pixmanFillsse2 things are going wrong:
-
- $ perf annotate pixmanFillsse2
- [ ... ]
- 0.00 : 34eeb: 0f 18 08 prefetcht0 (%eax)
- : }
- :
- : extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
- : _mm_store_si128 (__m128i *__P, __m128i __B) : {
- : *__P = __B;
- 12.40 : 34eee: 66 0f 7f 80 40 ff ff movdqa %xmm0,-0xc0(%eax)
- 0.00 : 34ef5: ff
- 12.40 : 34ef6: 66 0f 7f 80 50 ff ff movdqa %xmm0,-0xb0(%eax)
- 0.00 : 34efd: ff
- 12.39 : 34efe: 66 0f 7f 80 60 ff ff movdqa %xmm0,-0xa0(%eax)
- 0.00 : 34f05: ff
- 12.67 : 34f06: 66 0f 7f 80 70 ff ff movdqa %xmm0,-0x90(%eax)
- 0.00 : 34f0d: ff
- 12.58 : 34f0e: 66 0f 7f 40 80 movdqa %xmm0,-0x80(%eax)
- 12.31 : 34f13: 66 0f 7f 40 90 movdqa %xmm0,-0x70(%eax)
- 12.40 : 34f18: 66 0f 7f 40 a0 movdqa %xmm0,-0x60(%eax)
- 12.31 : 34f1d: 66 0f 7f 40 b0 movdqa %xmm0,-0x50(%eax)
-
-At a glance, it looks like the time is being spent copying pixmaps to
-the card. Further investigation would be needed to determine why pixmaps
-are being copied around so much but a starting point would be to take an
-ancient build of libpixmap out of the library path where it was totally
-forgotten about from months ago!
diff --git a/Documentation/trace/tracepoints.txt b/Documentation/trace/tracepoints.txt
deleted file mode 100644
index c0e1ceed75a..00000000000
--- a/Documentation/trace/tracepoints.txt
+++ /dev/null
@@ -1,116 +0,0 @@
- Using the Linux Kernel Tracepoints
-
- Mathieu Desnoyers
-
-
-This document introduces Linux Kernel Tracepoints and their use. It
-provides examples of how to insert tracepoints in the kernel and
-connect probe functions to them and provides some examples of probe
-functions.
-
-
-* Purpose of tracepoints
-
-A tracepoint placed in code provides a hook to call a function (probe)
-that you can provide at runtime. A tracepoint can be "on" (a probe is
-connected to it) or "off" (no probe is attached). When a tracepoint is
-"off" it has no effect, except for adding a tiny time penalty
-(checking a condition for a branch) and space penalty (adding a few
-bytes for the function call at the end of the instrumented function
-and adds a data structure in a separate section). When a tracepoint
-is "on", the function you provide is called each time the tracepoint
-is executed, in the execution context of the caller. When the function
-provided ends its execution, it returns to the caller (continuing from
-the tracepoint site).
-
-You can put tracepoints at important locations in the code. They are
-lightweight hooks that can pass an arbitrary number of parameters,
-which prototypes are described in a tracepoint declaration placed in a
-header file.
-
-They can be used for tracing and performance accounting.
-
-
-* Usage
-
-Two elements are required for tracepoints :
-
-- A tracepoint definition, placed in a header file.
-- The tracepoint statement, in C code.
-
-In order to use tracepoints, you should include linux/tracepoint.h.
-
-In include/trace/subsys.h :
-
-#include <linux/tracepoint.h>
-
-DECLARE_TRACE(subsys_eventname,
- TP_PROTO(int firstarg, struct task_struct *p),
- TP_ARGS(firstarg, p));
-
-In subsys/file.c (where the tracing statement must be added) :
-
-#include <trace/subsys.h>
-
-DEFINE_TRACE(subsys_eventname);
-
-void somefct(void)
-{
- ...
- trace_subsys_eventname(arg, task);
- ...
-}
-
-Where :
-- subsys_eventname is an identifier unique to your event
- - subsys is the name of your subsystem.
- - eventname is the name of the event to trace.
-
-- TP_PROTO(int firstarg, struct task_struct *p) is the prototype of the
- function called by this tracepoint.
-
-- TP_ARGS(firstarg, p) are the parameters names, same as found in the
- prototype.
-
-Connecting a function (probe) to a tracepoint is done by providing a
-probe (function to call) for the specific tracepoint through
-register_trace_subsys_eventname(). Removing a probe is done through
-unregister_trace_subsys_eventname(); it will remove the probe.
-
-tracepoint_synchronize_unregister() must be called before the end of
-the module exit function to make sure there is no caller left using
-the probe. This, and the fact that preemption is disabled around the
-probe call, make sure that probe removal and module unload are safe.
-See the "Probe example" section below for a sample probe module.
-
-The tracepoint mechanism supports inserting multiple instances of the
-same tracepoint, but a single definition must be made of a given
-tracepoint name over all the kernel to make sure no type conflict will
-occur. Name mangling of the tracepoints is done using the prototypes
-to make sure typing is correct. Verification of probe type correctness
-is done at the registration site by the compiler. Tracepoints can be
-put in inline functions, inlined static functions, and unrolled loops
-as well as regular functions.
-
-The naming scheme "subsys_event" is suggested here as a convention
-intended to limit collisions. Tracepoint names are global to the
-kernel: they are considered as being the same whether they are in the
-core kernel image or in modules.
-
-If the tracepoint has to be used in kernel modules, an
-EXPORT_TRACEPOINT_SYMBOL_GPL() or EXPORT_TRACEPOINT_SYMBOL() can be
-used to export the defined tracepoints.
-
-* Probe / tracepoint example
-
-See the example provided in samples/tracepoints
-
-Compile them with your kernel. They are built during 'make' (not
-'make modules') when CONFIG_SAMPLE_TRACEPOINTS=m.
-
-Run, as root :
-modprobe tracepoint-sample (insmod order is not important)
-modprobe tracepoint-probe-sample
-cat /proc/tracepoint-sample (returns an expected error)
-rmmod tracepoint-sample tracepoint-probe-sample
-dmesg
diff --git a/Documentation/trace/uprobetracer.txt b/Documentation/trace/uprobetracer.txt
deleted file mode 100644
index 24ce6823a09..00000000000
--- a/Documentation/trace/uprobetracer.txt
+++ /dev/null
@@ -1,113 +0,0 @@
- Uprobe-tracer: Uprobe-based Event Tracing
- =========================================
- Documentation written by Srikar Dronamraju
-
-Overview
---------
-Uprobe based trace events are similar to kprobe based trace events.
-To enable this feature, build your kernel with CONFIG_UPROBE_EVENT=y.
-
-Similar to the kprobe-event tracer, this doesn't need to be activated via
-current_tracer. Instead of that, add probe points via
-/sys/kernel/debug/tracing/uprobe_events, and enable it via
-/sys/kernel/debug/tracing/events/uprobes/<EVENT>/enabled.
-
-However unlike kprobe-event tracer, the uprobe event interface expects the
-user to calculate the offset of the probepoint in the object
-
-Synopsis of uprobe_tracer
--------------------------
- p[:[GRP/]EVENT] PATH:SYMBOL[+offs] [FETCHARGS] : Set a probe
-
- GRP : Group name. If omitted, use "uprobes" for it.
- EVENT : Event name. If omitted, the event name is generated
- based on SYMBOL+offs.
- PATH : path to an executable or a library.
- SYMBOL[+offs] : Symbol+offset where the probe is inserted.
-
- FETCHARGS : Arguments. Each probe can have up to 128 args.
- %REG : Fetch register REG
-
-Event Profiling
----------------
- You can check the total number of probe hits and probe miss-hits via
-/sys/kernel/debug/tracing/uprobe_profile.
- The first column is event name, the second is the number of probe hits,
-the third is the number of probe miss-hits.
-
-Usage examples
---------------
-To add a probe as a new event, write a new definition to uprobe_events
-as below.
-
- echo 'p: /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events
-
- This sets a uprobe at an offset of 0x4245c0 in the executable /bin/bash
-
- echo > /sys/kernel/debug/tracing/uprobe_events
-
- This clears all probe points.
-
-The following example shows how to dump the instruction pointer and %ax
-a register at the probed text address. Here we are trying to probe
-function zfree in /bin/zsh
-
- # cd /sys/kernel/debug/tracing/
- # cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp
- 00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
- # objdump -T /bin/zsh | grep -w zfree
- 0000000000446420 g DF .text 0000000000000012 Base zfree
-
-0x46420 is the offset of zfree in object /bin/zsh that is loaded at
-0x00400000. Hence the command to probe would be :
-
- # echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
-
-Please note: User has to explicitly calculate the offset of the probepoint
-in the object. We can see the events that are registered by looking at the
-uprobe_events file.
-
- # cat uprobe_events
- p:uprobes/p_zsh_0x46420 /bin/zsh:0x00046420 arg1=%ip arg2=%ax
-
-The format of events can be seen by viewing the file events/uprobes/p_zsh_0x46420/format
-
- # cat events/uprobes/p_zsh_0x46420/format
- name: p_zsh_0x46420
- ID: 922
- format:
- field:unsigned short common_type; offset:0; size:2; signed:0;
- field:unsigned char common_flags; offset:2; size:1; signed:0;
- field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
- field:int common_pid; offset:4; size:4; signed:1;
- field:int common_padding; offset:8; size:4; signed:1;
-
- field:unsigned long __probe_ip; offset:12; size:4; signed:0;
- field:u32 arg1; offset:16; size:4; signed:0;
- field:u32 arg2; offset:20; size:4; signed:0;
-
- print fmt: "(%lx) arg1=%lx arg2=%lx", REC->__probe_ip, REC->arg1, REC->arg2
-
-Right after definition, each event is disabled by default. For tracing these
-events, you need to enable it by:
-
- # echo 1 > events/uprobes/enable
-
-Lets disable the event after sleeping for some time.
- # sleep 20
- # echo 0 > events/uprobes/enable
-
-And you can see the traced information via /sys/kernel/debug/tracing/trace.
-
- # cat trace
- # tracer: nop
- #
- # TASK-PID CPU# TIMESTAMP FUNCTION
- # | | | | |
- zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
- zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
- zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
- zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
-
-Each line shows us probes were triggered for a pid 24842 with ip being
-0x446421 and contents of ax register being 79.