-rw-r--r--  tapsets/contextinfo/contextinfo.txt      72
-rw-r--r--  tapsets/dynamic_cg/dynamic_cg.txt        64
-rw-r--r--  tapsets/dynamic_cg/tapset.stp             7
-rw-r--r--  tapsets/dynamic_cg/usage.stp             30
-rw-r--r--  tapsets/profile/profile_tapset.txt      503
-rw-r--r--  tapsets/timestamp/timestamp_tapset.txt  327
6 files changed, 0 insertions, 1003 deletions
diff --git a/tapsets/contextinfo/contextinfo.txt b/tapsets/contextinfo/contextinfo.txt
deleted file mode 100644
index 5f7f725f..00000000
--- a/tapsets/contextinfo/contextinfo.txt
+++ /dev/null
@@ -1,72 +0,0 @@
-* Application name: probe context information variables
-* Contact: fche
-* Motivation: let probes know where/how they were fired; introspective
- probe handlers
-* Background: discussions on mailing lists
-* Target software: various
-* Type of description: tapset variables
-* Interesting probe points: n/a
-* Interesting values:
-
- $pp_alias: string: the string specification of the probe point, as found
- in the original .stp file, before alias and other
- expansion
- $pp: string: representation of this probe point, after alias and wildcard
- expansion
- $pp_function: string: source function (if available)
- $pp_srcfile: string: source file name (if available)
- $pp_srcline: number: line number in source file (if available)
-
- $function[pc]: string: function name containing given address
- $module[pc]: string: kernel module name containing given address
- $address[sym]: number: base address of given function symbol
-
- $pc: number: PC snapshot at invocation
- $stack[depth]: number: PC of caller at given depth, if available
-
- $pid, $tgid, $uid, $comm : number/string : current-> fields
-
-* Dependencies:
-
- Debug-info files
-
-* Restrictions:
-
- The $pp series of variables are computed at translation time, and thus
- are only applicable to those probes that have related debug-info points.
-
- $pc should be directly available.
-
- The $function series of read-only pseudo-arrays are calculated at
- run time, from symbol table information passed in some way.
- $stack[0] might take some probing in the registers, or (eek!) on the
- target stack frame. Conservatively returning 0 instead may be okay.
-
- The current-based series of values ($pid etc.), for kernel-targeted
- probes, need to check for !in_interrupt() before dereferencing current->.
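-
-  A minimal sketch of that guard (assuming kernel headers of the day;
-  names here are illustrative, not the translator's actual output):
-
-    #include <linux/sched.h>      /* current */
-    #include <linux/hardirq.h>    /* in_interrupt() */
-    #include <linux/string.h>     /* strlcpy() */
-
-    /* fill in $pid / $comm only when current is meaningful */
-    static void fill_context_vars(long *pid, char *comm, size_t len)
-    {
-        if (!in_interrupt()) {
-            *pid = current->pid;
-            strlcpy(comm, current->comm, len);
-        } else {
-            *pid = -1;            /* sentinel: no task context */
-            comm[0] = '\0';
-        }
-    }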
-
-* Data collection:
-
- Several of the variables are translation-time constants, so these don't
- have run-time collection needs.
-
- For a kernel/module probe, $function[] could be computed from the kallsyms
-  lookup functions. Alternatively, the translator could emit a copy of the
- target symbol table into the probe C code, which $function[] could
- search. The $stack[] elements would be served by the runtime on a
- best-effort basis.
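-
-  A sketch of the kallsyms route (assuming the in-kernel
-  kallsyms_lookup() interface; whether it is exported to modules
-  varies by kernel version):
-
-    #include <linux/kallsyms.h>
-    #include <linux/string.h>
-
-    /* resolve $function[pc] / $module[pc] for a kernel text address */
-    static void resolve_addr(unsigned long pc, char *func, char *mod,
-                             size_t len)
-    {
-        unsigned long size, offset;
-        char *modname;
-        char namebuf[KSYM_NAME_LEN];
-        const char *name;
-
-        name = kallsyms_lookup(pc, &size, &offset, &modname, namebuf);
-        strlcpy(func, name ? name : "?", len);
-        strlcpy(mod, modname ? modname : "kernel", len);
-    }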
-
-* Data presentation:
-
- n/a: variables are simple
-
-* Competition:
-
- unknown
-
-* Cross-references:
-
- http://sources.redhat.com/ml/systemtap/2005-q2/msg00395.html
- http://sources.redhat.com/ml/systemtap/2005-q2/msg00281.html
-
-* Associated files:
diff --git a/tapsets/dynamic_cg/dynamic_cg.txt b/tapsets/dynamic_cg/dynamic_cg.txt
deleted file mode 100644
index 35de29f1..00000000
--- a/tapsets/dynamic_cg/dynamic_cg.txt
+++ /dev/null
@@ -1,64 +0,0 @@
-* Application name: Dynamic Callgraph
-* Contact: William Cohen, wcohen@redhat.com
-
-* Motivation:
-
-Dynamic Callgraph would provide information to allow developers to see
-what other functions a function is calling. This could show that some
-unexpected functions are getting called. DTrace has an instrumentation
-provider that generates a trace of the functions called and returned.
-
-* Background:
-
-There have been times that people in Red Hat support have narrowed a
-problem to a specific function and the functions it calls. Rather
-than instrumenting the function's children by hand, a tapset that
-provides a dynamic callgraph would allow quicker determination of the
-functions called. There are also cases in the kernel code where the
-function being called cannot be determined statically, e.g. the
-function to call is stored in a data structure.
-
-* Target software:
-
-Ideally both kernel and user space, but kernel space only would
-be sufficient for many cases.
-
-* Type of description: tapset and scripting command
-  a tapset to provide support for capturing call/return information
-  scripting commands to turn the capture on and off
-
-* Interesting probe points:
-
-* Interesting values:
-
-* Dependencies:
-- P6/x86-64 processors have the debug hardware to trap control flow
-  changes (see the sketch after this list).
-- Need to have the kernel maintain the debug hardware on a per process basis.
- The DebugCtlMSR is not currently stored in the context
- (only debug registers 0, 1, 2, 3, 6, and 7 are virtualized)
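-
-A sketch of arming the branch-trap hardware referred to above (bit
-positions are from the IA-32 manuals; the per-process save/restore
-problem noted above is deliberately ignored here):
-
-    #include <asm/msr.h>              /* rdmsr()/wrmsr() */
-    #include <asm/ptrace.h>           /* struct pt_regs */
-
-    #define MSR_DEBUGCTL   0x1d9      /* IA32_DEBUGCTL on P6-family */
-    #define DEBUGCTL_BTF   (1 << 1)   /* trap on branches, not every insn */
-    #define EFLAGS_TF      0x100      /* single-step trap flag */
-
-    static void arm_branch_trap(struct pt_regs *regs)
-    {
-        unsigned int lo, hi;
-
-        rdmsr(MSR_DEBUGCTL, lo, hi);
-        wrmsr(MSR_DEBUGCTL, lo | DEBUGCTL_BTF, hi);
-        regs->eflags |= EFLAGS_TF;    /* with BTF set, #DB fires per branch */
-    }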
-
-* Restrictions:
- May be difficult to implement on ppc: returns may look like regular jumps
- and trapping on all branches could cause problems with atomic operations
-  Won't work on pre-P6 x86 processors
-  Won't provide data for inlined functions
-
-* Data collection:
- Track whether the instruction was a call or a return and the target address.
-
-* Data presentation:
-  - post-process addresses in user space to convert them into function names
-  - trace showing calls and returns
-  - maybe further post-process to build a dynamic callgraph or to
-    determine that a function is being called far too often
-
-* Competition:
- DTrace already implements tracing of function calls and returns.
-
-* Cross-references:
-
-* Associated files:
-
- $dynamic_call_graph = 1; // turn on tracing of calls for thread
- $dynamic_call_graph = 0; // turn off tracing of calls for thread
-
diff --git a/tapsets/dynamic_cg/tapset.stp b/tapsets/dynamic_cg/tapset.stp
deleted file mode 100644
index c731fac9..00000000
--- a/tapsets/dynamic_cg/tapset.stp
+++ /dev/null
@@ -1,7 +0,0 @@
-global $dynamic_call_graph
-probe kernel.perfctr.call(1) {
- if ($dynamic_call_graph) trace_sym ($pc);
-}
-probe kernel.perfctr.return(1) {
- if ($dynamic_call_graph) trace_sym ($pc);
-}
diff --git a/tapsets/dynamic_cg/usage.stp b/tapsets/dynamic_cg/usage.stp
deleted file mode 100644
index 1625768b..00000000
--- a/tapsets/dynamic_cg/usage.stp
+++ /dev/null
@@ -1,30 +0,0 @@
-
-probe kernel.sys_open.entry()
-{
-	$dynamic_call_graph = 1;
-}
-
-# What you would see in the output would be something of this kind
-# call sys_open
-# call getname
-# call do_getname
-# return do_getname
-# return getname
-# call get_unused_fd
-# call find_next_zero_bit
-# return find_next_zero_bit
-# return get_unused_fd
-# call filp_open
- .....
-return sys_open
-
-# The above probe could be customized to a particular process as well,
-# like in the following
-
-probe kernel.sys_open.entry()
-{
-	if ($pid == 1234)
-		$dynamic_call_graph = 1;
-}
-
-
diff --git a/tapsets/profile/profile_tapset.txt b/tapsets/profile/profile_tapset.txt
deleted file mode 100644
index e8899dc7..00000000
--- a/tapsets/profile/profile_tapset.txt
+++ /dev/null
@@ -1,503 +0,0 @@
-* Application name: Stopwatch and Profiling for systemtap
-
-* Contact:
- Will Cohen wcohen@redhat.com
- Charles Spirakis charles.spirakis@intel.com
-
-* Motivation:
- Allow SW developers to improve the performance of their
-  code. The methodologies used are stopwatch (sometimes known
- as event counting) and profiling.
-
-* Background:
- Will has experience with oprofile
- Charles has experience with vtune
-
-* Target software:
- Initially the kernel, but longer term, both kernel and user.
-
-* Type of description:
- General information regarding requirements and usage models.
-
-* Interesting probe points:
-  When doing profiling you have "asynchronous-event" probe points
-  (i.e. you get an interrupt and you'll want to capture information
-  about where that interrupt happened).
-
- When doing stopwatch, interesting probe points will be
- function entry/exits, queue add/remove, queue entity lifecycle,
- and any other code where you want to measure time
- or events (cpu resource utilization) associated with a path of code
- (frame buffer drawing measurements, graphic T&L pipeline
- measurements, etc).
-
-* Interesting values:
- For profiling, the pt_regs structure from the interrupt handler. The
- most commonly used items would be the instruction pointer and the
- call stack pointer.
-
- For stopwatch, most of the accesses are likely to be pmu read
- operations.
-
- In addition, given the large variety of pmu capabilities, access
- to the pmu registers themselves (read and write) would be very
- important.
-
- Different pmu's have different events, but for script portability,
- we may want to have a subset of predefined events and have something
- map that into a pmu's particular event (similar to what papi does).
-
- Given the variety of performance events and pmu architectures, we
- may want to try and have a standardized library/api as part of the
-  translator to map events (or specialized event information) into
- register/value pairs used during the actual systemtap run.
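-
-  A sketch of such a mapping table (the layout is illustrative; the
-  encodings below are P6-family values, 0x79 being the CPU_CLK_UNHALTED
-  example cited under Papi in the cross-references):
-
-    struct event_map {
-        const char *generic;    /* portable name used in scripts */
-        unsigned int raw;       /* pmu-specific event encoding   */
-    };
-
-    static const struct event_map pentium_m_map[] = {
-        { "cpu_cycles",           0x79 },  /* CPU_CLK_UNHALTED */
-        { "instructions_retired", 0xc0 },  /* INST_RETIRED     */
-    };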
-
- ??? Classify values as consumed from lower level vs. provided to higher
- level ???
-
-* Dependencies:
- Need some form of arbitration of the pmu to make sure the data provided
- is valid (see perfmon below).
-
- Two common usage models are aggregated data (oprofile) and
- trace history (papi/vtune). Currently these tools all do the
- aggregation in user-mode and we may want to look at what
- they do and why.
-
- The unofficial rule of thumb is that profiling should be
- as unobtrusive as possible and definitely < 1% overhead.
-
-  When doing stopwatch or profiling, there is a need to be able to
-  sequence the data. For timing, this is important for accurately
-  computing start/stop deltas and watching control/data flow.
-  For profiling, it is needed to support trace history.
-
- There needs to be a timesource that has reasonable granularity
- and is reasonably precise.
-
- Per-thread virtualization (of time and events)
-
- System wide mode for pmu events
-
-* Restrictions:
-  Currently access to the pmu is a bit of a free-for-all, with no
-  single entity providing arbitration. The perfmon2 patch for 2.6
-  (see the cross-reference section below) is attempting to
-  provide much of the infrastructure needed by profiling tools
-  (like oprofile and papi) across architectures (Pentium M, ia64
-  and x86_64 initially, though I understand Stephane has contacted
-  someone at IBM for a powerpc version as well).
-
-  Andrew Morton wants perfmon and perfctr to be merged. Regardless
-  of what happens, both pmu libraries are geared more for
-  user->kernel access than kernel->kernel access, and we
-  will need to see what can be EXPORT()'ed to make them more
-  kernel-module friendly.
-
-* Data collection:
- Pmu counters tend to be different widths on different
- architectures. It would be useful to standardize the
- width (in software) to 64-bits to make math operations
-  (such as comparisons, deltas, etc) easier.
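-
-  A minimal sketch of the usual widening trick (relies on unsigned
-  wraparound arithmetic; assumes each counter is sampled more often
-  than it can wrap once):
-
-    #include <linux/types.h>
-
-    struct wide_counter {
-        u64 value;      /* accumulated 64-bit count */
-        u32 last_raw;   /* last raw pmu sample (32-bit hardware here) */
-    };
-
-    static u64 widen(struct wide_counter *c, u32 raw)
-    {
-        c->value += (u32)(raw - c->last_raw);  /* wraps correctly */
-        c->last_raw = raw;
-        return c->value;
-    }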
-
- The goal of profiling is to go from:
- pid/ip -> path/image -> source file/line number
-
- This implies the need to have a (reasonably quick) mechanism to
- translate pid/ip to path/image. Potentially reuse the dcookie
- methodology from oprofile but may need to extend that model if there
- is a goal to support anonymous maps (dynamically generated code).
-
- Need the ability to map the current pid to a process name.
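-
-  For the current task the name is directly at hand (a sketch, assuming
-  the probe runs in task context):
-
-    #include <linux/sched.h>       /* current, TASK_COMM_LEN */
-    #include <linux/string.h>
-
-    static void current_comm(char *buf, size_t len)
-    {
-        strlcpy(buf, current->comm, len);  /* name for current->pid */
-    }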
-
- Need to make a decision on how much will be handled via associative
- arrays in the kernel and how much will be handled in user space
- (potentially part of post processing). Given the volume of data that
- can be generated during profiling, it may make sense to follow the
-  trend of current performance tools and attempt to put merging and
- aggregation in the user space instead of kernel space.
-
- To keep the overhead of collection low, it may be useful to look
- into having some of the information needed be collected at interrupt
- time and other pieces of information be collected after the
- interrupt (top/bottom style). For example, although it may be
-  convenient to have a syntax like:
-
- val = associate_image($pt_regs->eip)
-
- it may be preferable to use a marker in the output stream instead
- (oprofile used a dcookie) and then do a lookup later (either in the
- kernel and add a marker->name entry to the output stream or in user
- space similar to what oprofile did). This concept could be extended
- to cover the lookup of the pid name as well.
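-
-  A sketch of the record such a deferred-lookup scheme implies (field
-  names are illustrative only, not a proposed systemtap format):
-
-    #include <linux/types.h>
-
-    /* written at interrupt time: cheap stores only, no name lookups */
-    struct sample {
-        u32 pid;        /* resolved to a process name later           */
-        u64 ip;         /* raw interrupted address                    */
-        u64 cookie;     /* opaque marker; cookie->image emitted later */
-    };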
-
- Stack information will need to be collected at interrupt time
- (based on the interrupted pt_regs->esp) so the routine to obtain
- the stack trace should be reasonably fast. Due to asynchronous probes,
- the stack may be in user space.
-
- Depending on whether support of anonymous maps is important, it may
- be useful to have a more generic method of mapping ip->path/module
-  which would allow dynamic code generators (usually found in user
-  space) to provide ip->image map information as part of
- the regular systemtap data stream. If we allow for a user-mode api
- to add data to a systemtap stream, we could have a very general
- purpose merge/aggregation tool for profiling from a variety of
- sources.
-
-* Data presentation:
- Generally data will be presented to the user as either an inorder
- stream (trace history) or aggregated in some form to produce a
- histogram or min/max/average/std.
-
- When aggregated, the data may be clumped by pid (each running of
- the app provides unique data), process name (the data for an app
- is merged for all runs), or it may be clumped by the loaded image
- (to get information about shared libraries regardless of the app
-  that loaded it). Assuming an increase in multi-processor and
-  multi-threaded applications, grouping the data by thread group
-  id is likely to be useful as well. Ideally, if symbols/debug
-  information is available, additional aggregation could be done
-  at the function, basic block, or source line level.
-
-* Competition:
- See the cross-reference list below
-
-* Cross-references:
- Oprofile
-
- Oprofile is a profiling tool that provides time and event based
- sampling. Its collection methodology has a "file view" of the
- world and only captures the minimum information needed to get
- the image that corresponds to the interrupted instruction
- address. It aggregates the data (no time information) to keep
- the total data size to a minimum even on long runs. Oprofile
- allows for optional "escape" sequences in a data stream to add
- information. It can handle non-maskable interrupts (NMI) as well
- as maskable interrupts to obtain samples in areas where
- maskable interrupts are normally disabled. Work is being done
- to allow oprofile to handle anonymous maps (ie. dynamically
-   generated code from JVMs).
-
- http://oprofile.sourceforge.net/news/
-
- Papi
-
- Papi is a profiling tool that can aggregate data or keep a trace
- history. It uses tables to map generic event concepts (for example,
- PAPI_TOT_CYC) into architecture specific events (for example,
- CPU_CLK_UNHALTED, value 0x79 on the Pentium M). Interrupts can be
- time based and it can capture event counts (i.e. every 5ms,
- capture cpu cycles and instructions retired) in addition to
- the instruction pointer. Papi is built on top of other performance
- monitoring support such as ia64 perfmon and i386 perfctr in the Linux
- kernel.
-
- http://icl.cs.utk.edu/papi/
-
- Perfmon2 infrastructure
-
-   Perfmon2 is a profiling infrastructure currently in the Linux 2.6
-   kernel for ia64. It handles arbitration and virtualization of the
-   pmu resources, extends the pmu's to a logical 64 bits regardless of
-   the underlying hardware size, context switches the counters when
-   needed to allow for per-process or system-wide use, and has the
-   ability to choose a subset of the cpu's on a system when doing
-   system-wide profiling. Oprofile on Linux 2.6 for ia64 has been
-   ported to use the perfmon2 interface. Currently, there are patches
-   submitted to the Linux kernel mailing list to port the perfmon2
-   infrastructure on the 2.6 kernel to the Pentium M and x86_64.
-
- http://www.hpl.hp.com/techreports/2004/HPL-2004-200R1.html
-
- Shark
-
- Shark is a profiling tool from Apple that focuses on time and event
- based statistical stack sampling. On each profile interrupt, in
- addition to capturing the instruction pointer, it also captures
- a stack trace so you know both where you were and how you got there.
-
- http://developer.apple.com/tools/sharkoptimize.html
-
- Vtune
-
- Vtune is a profiling tool that provides time and event based
- sampling. It does collection based on a "process view" of the
- world. It keeps a trace history so that you can aggregate the
- data during post processing in various ways, it can capture
- architectural specific data in addition to ip (such as branch
- history buffers), and it can use architectural specific abilities
-   to get exact ip addresses for certain events. It currently handles
-   anonymous mappings (dynamically generated code from JVMs).
-
- http://www.intel.com/software/products/vtune/vlin/index.htm
-
-
-* Associated files:
- Should the usage models be split into a separate file?
-
-Usage Models:
- Below are some typical usage models. This isn't an attempt
-  to propose syntax; it's an attempt to create something
-  concrete enough to help people understand the goals:
-  (description, pseudo code, desired output).
-
-Description: Statistical stack sampling (ala shark)
-
- probe kernel.time_ms(10)
- {
- i = associate_image($pt_regs->eip);
- s = stack($pt_regs->esp);
- stp($current->pid, $pid_name, $pt_regs->eip, i, s)
- }
-
- Output desired:
-  For each process/process name, aggregate (histogram) based
- on eip (regardless how I got there), stack (what was the
- most common calling path), or both (what was the most common
- path to the most common eip).
-  Could be implemented by generating a trace history and letting the
-  user post-process it (eats disk space, but one run can be viewed
-  multiple ways), or by having the user define what is wanted
-  in the script and doing the post-processing ourselves (saves disk
-  space, but more work for us).
-
-Description: Time based aggregation (ala oprofile)
-
- probe kernel.time_ms(10)
- {
- i = associate_image($pt_regs->eip);
-    stp($current->pid, $pid_name, $pt_regs->eip, i);
- }
-
- Output desired:
- Histogram separated by process name, pid/eip, pid/image
-
-Description: Time a routine part 1
- time between the function call and return:
-
-  probe kernel.function("sys_execve")
- {
- $thread->mystart = $timestamp
- }
-  probe kernel.function("sys_execve").return
- {
- delta = $timestamp - $thread->mystart
-
- // do statistical operations...
- }
-
- Output desired:
- Be able to do statistics for the time it takes for an exec
- function to execute. The time needs to have a fine enough
- granularity to have meaning (i.e. using jiffies probably wouldn't work)
- and the time needs to be smp correct even if the probe entry
- and the return execute on different processors.
-
-Description: Time a routine part 2
- count the number of events between the
- function call and return:
-
- probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_execve")
- {
- $thread->myclocks = $pmu[0];
- $thread->myinstr_ret = $pmu[1];
- }
- probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_execve").return
- {
- $thread->myclocks = $pmu[0] - $thread->myclocks;
- $thread->myinstr_ret = $pmu[1] - $thread->myinstr_ret;
-
- cycles_per_instruction = $thread->myclocks / $thread->myinstr_ret
-
- // Do statistical operations...
- }
-
- Desired Output:
- Produce min/max/average for cycles, instructions retired,
- and cycles_per_instruction. The pmu must be virtualized if the
- probe entry and probe exit can happen on different processors. The
-  pmu should be virtualized if there can be pre-emption (or waits) in
-  the function itself, to get more useful information (the actual count
-  of events in the function vs. a count of events in the whole system
-  between when the function starts and when it ends)
-
-Description: Time a routine part 3
- reminder of threading issues
-
- probe kernel.function("sys_fork")
- {
- $thread->mystart = $timestamp
- }
- probe kernel.function("sys_fork").return
- {
- delta = $timestamp - $thread->mystart
-
-     if (parent) {
- // do statistical operations for time it takes parent
- } else {
- // do statistical operations for time it takes child
- }
- }
-
- Desired Output:
- Produce min/max/average for the parent and the child. The
- time needs to have a fine enough granularity to have
- meaning (i.e. using jiffies probably wouldn't work)
- and the time needs to be smp correct even if the probe entry
- and the probe return execute on different processors.
-
-Description: Time a routine part 4
- reminder of threading issues
-
- probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_fork")
- {
- $thread->myclocks = $pmu[0];
- $thread->myinstr = $pmu[1];
- }
- probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_fork").return
- {
- $thread->myclocks = $pmu[0] - $thread->myclocks;
- $thread->myinstr = $pmu[1] - $thread->myinstr;
-
- cycles_per_instruction = $thread->myclocks / $thread->myinstr
-
-     if (parent) {
- // Do statistical operations...
- } else {
- // Do statistical operations...
- }
- }
-
- Desired Output:
- Produce min/max/average for cycles, instructions retired,
- and cycles_per_instruction. The pmu must be virtualized if the
- probe entry and probe exit can happen on different processors. The
-  pmu should be virtualized if there can be pre-emption (or waits) in
-  the function itself, to get more useful information (the actual count
-  of events in the function vs. a count of events in the whole system
-  between when the function starts and when it ends)
-
-Description: Beginnings of "papi" style collection
-
- probe kernel.startwatch("cpu_cycles").startwatch("instructions_retired").time_ms(10)
- {
- i = associate_image($pt_regs->eip);
-    stp($current->pid, $pid_name, $pt_regs->eip, i, $pmu[0], $pmu[1]);
- }
-
- Desired output:
- Trace history or aggregation based on process name, image
-
-Description: Find the path leading to high latency cache miss
- that stalled for more than 128 cycles (ia64 only)
-
- probe kernel.startwatch("branch_event,pmc[12]=0x3e0f").pmu_profile("data_ear_event:1000,pmc[11]=0x5000f")
- {
- //
- // on ia64, when using the data ear event, the precise eip is
- // saved in pmd[17], so no need for pt_regs->eip (and the
- // associated skid)...
- //
- i = associate_image($pmu->pmd[17]);
- stp($current->pid, $pid_name, $pmu->pmd[17], i, // the basics
- $pmu->pmd[2], // precise data address
- $pmu->pmd[3], // latency information
- $pmu->pmd[8], // branch history buffer
- $pmu->pmd[9], // "
- $pmu->pmd[10], // "
- $pmu->pmd[11], // "
- $pmu->pmd[12], // "
- $pmu->pmd[13], // "
- $pmu->pmd[14], // "
- $pmu->pmd[15], // "
- $pmu->pmd[16]); // indication of which was most recent branch
- }
-
- Desired output:
- Aggregate data based on pid, process name, eip, latency, and
- data address. Each pmd on ia64 is 64 bits long, thus the capturing
-  of just the 12 pmd's listed here is 96 bytes of information every
- interrupt for each cpu. Profiling can have a very high amount of
- data collected...
-
-Description: Pmu event collection of data but use NMI
- instead of the regular interrupt.
-
-NMI is useful for getting visibility on locks and other code which is
-normally hidden behind interrupt disable code. However, handling an
-NMI is more difficult to do properly. Potentially the compiler can be
-more restrictive on what's allowed in the handler when NMI's are
-selected as the interrupt method.
-
-
- probe kernel.nmi.pmu_profile("instructions_retired:1000000")
- {
- i = associate_image($pt_regs->eip);
-    stp($pid_name, $pt_regs->eip, i);
- }
-
- Desired Output:
- Same as the earlier oprofile style example
-
-
-Description: Timing items in a queue
-
- Two possibilities - use associative arrays or post process
-
-Associative arrays:
-
- probe kernel.function("add queue function")
- {
- start[$arg->queue_entry] = $timestamp;
- }
- probe kernel.function("remove queue function")
- {
- delta = $timestamp - start[$arg->queue_entry];
-
- // Do statistics on the delta value and the queue entry
- }
-
-Post process:
-
- probe kernel.function("add queue function")
- {
- stp("add", $timestamp, $arg->queue_entry)
- }
- probe kernel.function("remove queue function")
- {
- stp("remove", $timestamp, $arg->queue_entry)
- }
-
-Desired Output:
- For each queue_entry, calculate the delta and do appropriate
- statistics.
-
-
-Description: Following an item as it moves to different queues/lists
-
- Two possibilities - use associative arrays or post process
-    This example probes the generic list_add function.
-
-Associative arrays:
-
- probe kernel.function("list_add")
- {
- delta = $timestamp - start[$arg->head, $arg->new];
- start[$arg->head, $arg->new] = $timestamp;
- // Do statistics on the delta value and queue
- }
-
-
-Post process:
-
- probe kernel.function("list_add")
- {
- stp("add", $timestamp, $arg->head, $arg->new)
- }
-
-Desired Output:
- For each (queue, queue_entry) pair, calculate the delta and do appropriate
- statistics.
-
diff --git a/tapsets/timestamp/timestamp_tapset.txt b/tapsets/timestamp/timestamp_tapset.txt
deleted file mode 100644
index dcbd5813..00000000
--- a/tapsets/timestamp/timestamp_tapset.txt
+++ /dev/null
@@ -1,327 +0,0 @@
-* Application name: sequence numbers and timestamps
-
-* Contact:
- Martin Hunt hunt@redhat.com
- Will Cohen wcohen@redhat.com
- Charles Spirakis charles.spirakis@intel.com
-
-* Motivation:
- On multi-processor systems, it is important to have a way
-    to correlate information gathered between cpus. There are two
- forms of correlation:
-
- a) putting information into the correct sequence order
- b) providing accurate time deltas between information
-
- If the resolution of the time deltas is high enough, it can
- also be used to order information.
-
-* Background:
- Discussion started due to relayfs and per-cpu buffers, but this
-    is needed by many people.
-
-* Target software:
- Any software which wants to correlate data that was gathered
- on a multi-processor system, but the scope will be defined
- specifically for systemtap's needs.
-
-* Type of description:
- General information and discussion regarding sequencing and timing.
-
-* Interesting probe points:
- Any probe points where you are trying to get the time between two
- probe points. For example, timing how long a function takes and
- putting probe points at the function entry and function exit.
-
-* Interesting values:
- Possible ways to order data from multiple sources include:
-
-Retrieve the sequence/time from a global area
-
- High Precision Event Timer (HPET)
- Possible implementation:
- multimedia/HPET timer
-      arch/i386/kernel/timers/timer_hpet.c
- Advantages:
-      granularity can vary (HPET spec says the minimum frequency of the
-      HPET timer is 10 MHz, i.e. ~100ns resolution), can be treated as
-      read-only, can bypass cache update and avoid being cached at all
-      if desired, designed to be used as an smp timestamp (see
-      specification)
-    Disadvantages:
-      may not be available on all platforms, may not be synchronized on
-      NUMA systems (ie counts for all processors within a numa node are
-      comparable, but counts for processors between nodes may not be
-      comparable), potential resource conflict if timers are used by
-      other software
-
- Real Time Clock (RTC)
- Possible implementation:
- "external" chip (clock chip) which has time information, accessed via
- ioport or memory-mapped io
- Advantages:
- can be treated as read-only, can bypass cache update and avoid being
- cached at all if desired
- Disadvantages:
- may not be available on all platforms, low granularity (for rtc,
- ~1ms), usually slow access
-
- ACPI Power Management Timer (pm timer)
- Possible implementation:
-      implemented as part of the ACPI specification at 3.579545 MHz
- arch/i386/kernel/timers/timer_pm.c
- Advantages:
-      not affected by throttling, halting or power saving states, moderate
-      granularity (3.58 MHz, ~300ns resolution), designed for use by an OS
-      to keep track of time during sleep/power states
- Disadvantages:
- may not be available on all platforms, slower access than hpet timer
- (but still much faster than RTC)
-
- Chipset counter
- Possible implementation:
-      timer on a processor chipset, ??SGI implementation??, do we know of
-      any other implementations?
- Advantages:
-      likely to be based on the pci bus clock (33 MHz = ~30ns) or
-      front-side-bus clock (200 MHz = ~5ns)
- Disadvantages:
- may not be available on all platforms
-
- Sequence Number
- Possible implementation:
- atomic_t global variable, cache aligned, placed in struct to keep
- variable on a cache line by itself
- Advantages:
-      guaranteed correct ordering (even on NUMA systems), architecture
- independent, platform independent
- Disadvantages:
- potential for cache-line ping-pong, doesn't scale, no time
- information (ordering data only), access can be slower on NUMA systems
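-
-    A sketch of the cache-aligned atomic described above:
-
-      #include <asm/atomic.h>
-      #include <linux/cache.h>
-
-      /* keep the counter on its own cache line to limit ping-pong */
-      static struct {
-          atomic_t seq;
-      } seq_src ____cacheline_aligned_in_smp;
-
-      static inline int next_seq(void)
-      {
-          return atomic_inc_return(&seq_src.seq);  /* globally ordered */
-      }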
-
- Jiffies
- Possible implementation:
- OS counts the number of "clock interrupts" since power on.
- Advantages:
-      platform independent, architecture independent, one writer, many
-      readers (less cache ping-pong)
- Disadvantages:
- low resolution (usually 10ms, sometimes 1ms).
-
-    do_gettimeofday()
- Possible implementation:
- arch/i386/kernel/time.c
- Advantages:
-      platform independent, architecture independent, one writer, many
-      readers (less cache ping-pong), microsecond accuracy
-    Disadvantages:
-      the time unit increment value used by this routine changes
-      based on information from ntp (i.e. if ntp needs to speed up / slow
-      down the clock, then callers to this routine will be affected). This
-      is a disadvantage for timing short intervals, but an advantage
-      for timing long intervals.
-
-
-Retrieve the sequence/time from a cpu-unique area
-
- Timestamp counter (TSC)
- Possible implementation:
-      a count of the number of core cycles the processor has executed
-      since power on; due to the lack of synchronization between cpus, one
-      would also need to keep track of which cpu the tsc came from
- Advantages:
- no external bus access, high granularity (cpu core cycles),
- available on most (not all) architectures, platform independent
- Disadvantages:
-      not synchronized between cpus; since it is a count of cpu cycles,
-      the count can be affected by throttling, halting and power saving
-      states, and it may not correlate to "actual" time (ie, just because
-      a 1GHz processor showed a delta of 1G cycles, that doesn't mean 1
-      second has passed)
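-
-    A sketch of a cpu-tagged tsc sample (x86 inline asm; the cpu id must
-    travel with the value, since raw counts from different cpus are not
-    comparable):
-
-      #include <linux/smp.h>
-      #include <linux/types.h>
-
-      struct tsc_sample {
-          u64 tsc;
-          int cpu;
-      };
-
-      static void sample_tsc(struct tsc_sample *s)
-      {
-          u32 lo, hi;
-
-          s->cpu = smp_processor_id();   /* caller disables preemption */
-          asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
-          s->tsc = ((u64)hi << 32) | lo;
-      }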
-
- APIC timer
- Possible implementation:
- timer implemented within the processor
- Advantages:
-      no external bus access, moderate to high granularity (usually
-      counting based on the front-side-bus clock or core clock)
- Disadvantages:
- not synchronized between cpus, may be affected by throttling,
- halting/power saving states, may not correlate to "actual" time.
-
- PMU event
- Possible implementation:
-      program a performance counter with a specific event related to time
- Advantages:
-      no external bus access, moderate to high granularity (usually
-      counting based on the front-side-bus clock or core clock), can be
- virtualized to give moderate to high granularity for individual
- thread paths
- Disadvantages:
- not synchronized between cpus, may be affected by throttling,
- halting/power saving states, may not correlate to "actual" time,
- processor dependent
-
-
- For reference, as a quick baseline, on Martin's dual-processor system,
- he gets the following performance measurements:
-
- kprobe overhead: 1200-1500ns (depending on OS and processor)
- atomic read plus increment: 40ns (single processor access, no conflict)
- monotonic_clock() 550ns
- do_gettimeofday() 590ns
-
-* Dependencies:
- Not Applicable
-
-* Restrictions:
- Certain timers may already be in use by other parts of the kernel
- depending on how it is configured (for example, RTC is used by the
- watchdog code). Some kernels may not compile in the necessary code
-    (for example, using the pm timer requires ACPI). Some platforms
-    or architectures may not have the timer requested (for example,
-    there is no HPET timer on older systems).
-
-* Data collection:
-    For data collection, it is probably best to keep the concepts of
-    sequence ordering and timestamp separate within
-    systemtap (for both the user as well as the implementation).
-
-    For sequence ordering, the initial implementation should use ?? the
- atomic_t form for the sequence ordering (since it is guaranteed
- to be platform and architecture neutral)?? and modify/change the
- implementation later if there is a problem.
-
- For timestamp, the initial implementation should use
- ?? hpet timer ?? pm timer ?? do_gettimeofday ?? cpu # + tsc ??
- some combination (do_gettimeofday + cpu # & low bits of tsc)?
-
- We could do something like what LTT does (see below) to
- generate 64-bit timestamps containing the nanoseconds
- since Jan 1, 1970.
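-
-    A sketch of the do_gettimeofday() variant (the source is only
-    microsecond-resolution, so the low digits of the result carry no
-    information):
-
-      #include <linux/time.h>
-      #include <linux/types.h>
-
-      static u64 timestamp_ns(void)
-      {
-          struct timeval tv;
-
-          do_gettimeofday(&tv);
-          return (u64)tv.tv_sec * 1000000000ULL + (u64)tv.tv_usec * 1000ULL;
-      }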
-
- Assuming the implementation keeps these concepts separate now
- (ordering data vs. timing deltas), it is always possible to
- merge them in the future if a high granularity, numa/smp
- synchronized timesource becomes available for a large number
- of platforms and/or processors.
-
-* Data presentation:
- In general, users prefer output which is based on "actual" time (ie,
- they prefer an output that says the delta is XXX nanoseconds instead
-    of YYY cpu cycles). Most of the time users want deltas (how long did
-    this take), but occasionally they want absolute times (when / what time
-    was this information collected).
-
-* Competition:
-    DTrace has output in nanoseconds (and it is comparable between
-    processors on an mp system), but it is unclear what the actual
-    resolution is. Even if the SPARC machine does have hardware
-    that provides nanosecond resolution, on x86-64 it is likely
-    to have the same problems as discussed here since the Solaris
-    Opteron box tends to be a pretty vanilla box.
-
- From Joshua Stone (joshua.i.stone at intel.com):
-
- == BEGIN ===
- DTrace gives you three built-in variables:
-
- uint64_t timestamp: The current value of a nanosecond timestamp
- counter. This counter increments from an arbitrary point in the
- past and should only be used for relative computations.
-
- uint64_t vtimestamp: The current value of a nanosecond timestamp
- counter that is virtualized to the amount of time that the current
- thread has been running on a CPU, minus the time spent in DTrace
- predicates and actions. This counter increments from an arbitrary
- point in the past and should only be used for relative time computations.
-
- uint64_t walltimestamp: The current number of nanoseconds since
-    00:00 Universal Coordinated Time, January 1, 1970.
-
- As for how they are implemented, the only detail I found is that
- timestamp is "similar to the Solaris library routine gethrtime".
- The manpage for gethrtime is here:
- http://docs.sun.com/app/docs/doc/816-5168/6mbb3hr8u?a=view
- == END ==
-
- What LTT does:
-
- "Cycle counters are fast to read but may reflect time
- inaccurately. Indeed, the exact clock frequency varies
- with time as the processor temperature changes, influenced
- by the external temperature and its workload. Moreover, in
- SMP systems, the clock of individual processors may vary
- independently.
-
- LTT corrects the clock inaccuracy by reading the real time
- clock value and the 64 bits cycle counter periodically, at
- the beginning of each block, and at each 10ms. This way, it
- is sufficient to read only the lower 32 bits of the cycle
- counter at each event. The associated real time value may
- then be obtained by linear interpolation between the nearest
- full cycle counter and real time values. Therefore, for the
- average cost of reading and storing the lower 32 bits of the
- cycle counter at each event, the real time with full resolution
- is obtained at analysis time."
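-
-    A sketch of the interpolation LTT describes (one (real time, cycle
-    count) reference pair per resync period; names and scaling are
-    illustrative):
-
-      #include <linux/types.h>
-
-      struct clock_ref {
-          u64 ref_ns;            /* real time at the last resync       */
-          u32 ref_tsc;           /* low 32 bits of the cycle counter   */
-          u32 ns_per_1k_cycles;  /* rate measured over the last period */
-      };
-
-      static u64 event_time_ns(struct clock_ref *r, u32 event_tsc)
-      {
-          u32 cycles = event_tsc - r->ref_tsc;  /* safe: resync < wrap */
-          return r->ref_ns + ((u64)cycles * r->ns_per_1k_cycles) / 1000;
-      }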
-
-
-* Cross-references:
-    The profile_tapset is very dependent on sequencing when
-    ordering data (i.e. taking a trace history), as well as on high
-    granularity when calculating time deltas.
-
-* Associated files:
-
- Profile tapset requirements
- .../src/tapsets/profile/profile_tapset.txt
-
- Intel high precision event timers specification:
- http://www.intel.com/hardwaredesign/hpetspec.htm
-
- ACPI specification:
- http://www.acpi.info/DOWNLOADS/ACPIspec-2-0b.pdf
-
- From an internal email sent by Tony Luck (tony.luck at intel.com)
- regarding a clustered environment. For the summary below, hpet and
- pm timer were not an option. For systemtap, they should be considered,
-    especially since the pm timer and hpet were designed to be timestamps.
-
- == BEGIN ===
- For extremely short intervals (<100ns) get some h/w help (oscilloscope
- or logic analyser). Delays reading TSC and pipeline effects could skew
- your results horribly. Having a 2GHz clock doesn't mean that you can
- measure 500ps intervals.
-
- For short intervals (100ns to 10 ms) TSC is your best choice ... but you
- need to sample it on the same cpu, and converting the difference between
- two TSC values to real time will require some system dependent math to find
- the nominal frequency of the system (you may be able to ignore temperature
- effects, unless your system is in an extremely hostile environment). But
- beware of systems that change the TSC rate when making frequency
- adjustments for power saving. It shouldn't be hard to measure the
- system clock frequency to about five significant digits of accuracy,
- /proc/cpuinfo is probably good enough.
-
- For medium intervals (10 ms to a minute) then "gettimeofday()" or
- "clock_gettime()" on a system *NOT* running NTP may be best, but you will
- need to adjust for systematic error to account for the system clock running
- fast/slow. Many Linux systems ship with a utility named "clockdiff" that
- you can use to measure the system drift against a reference system
-    (a system that is nearby on the network, running NTP, preferably a
- low "stratum" one).
-
- Just run clockdiff every five minutes for an hour or two, and plot the
- results to see what systematic drift your system has without NTP. N.B. if
- you find the drift is > 10 seconds per day, then NTP may have
- trouble keeping this system synced using only drift corrections,
- you might see "steps" when running NTP. Check /var/log/messages for
- complaints from NTP.
-
- For long intervals (above a minute). Then you need "gettimeofday()" on a
- system that uses NTP to keep it in touch with reality. Assuming reasonable
- network connectivity, NTP will maintain the time within a small number of
- milliseconds of reality ... so your results should be good for
- 4-5 significant figures for 1 minute intervals, and better for longer
- intervals.
- == END ==
-