-rw-r--r--  tapsets/contextinfo/contextinfo.txt      72
-rw-r--r--  tapsets/dynamic_cg/dynamic_cg.txt        64
-rw-r--r--  tapsets/dynamic_cg/tapset.stp             7
-rw-r--r--  tapsets/dynamic_cg/usage.stp             30
-rw-r--r--  tapsets/profile/profile_tapset.txt      503
-rw-r--r--  tapsets/timestamp/timestamp_tapset.txt  327
6 files changed, 0 insertions, 1003 deletions
diff --git a/tapsets/contextinfo/contextinfo.txt b/tapsets/contextinfo/contextinfo.txt
deleted file mode 100644
index 5f7f725f..00000000
--- a/tapsets/contextinfo/contextinfo.txt
+++ /dev/null
@@ -1,72 +0,0 @@
-* Application name: probe context information variables
-* Contact: fche
-* Motivation: let probes know where/how they were fired; introspective
- probe handlers
-* Background: discussions on mailing lists
-* Target software: various
-* Type of description: tapset variables
-* Interesting probe points: n/a
-* Interesting values:
-
- $pp_alias: string: the string specification of the probe point, as found
- in the original .stp file, before alias and other
- expansion
- $pp: string: representation of this probe point, after alias and wildcard
- expansion
- $pp_function: string: source function (if available)
- $pp_srcfile: string: source file name (if available)
- $pp_srcline: number: line number in source file (if available)
-
- $function[pc]: string: function name containing given address
- $module[pc]: string: kernel module name containing given address
- $address[sym]: number: base address of given function symbol
-
- $pc: number: PC snapshot at invocation
- $stack[depth]: number: PC of caller at given depth, if available
-
- $pid, $tgid, $uid, $comm : number/string : current-> fields
-
-* Dependencies:
-
- Debug-info files
-
-* Restrictions:
-
- The $pp series of variables are computed at translation time, and thus
- are only applicable to those probes that have related debug-info points.
-
- $pc should be directly available.
-
- The $function series of read-only pseudo-arrays are calculated at
- run time, from symbol table information passed in some way.
- $stack[0] might take some probing in the registers, or (eek!) on the
- target stack frame. Conservatively returning 0 instead may be okay.
-
- The current-based series of values ($pid etc.), for kernel-targeted
- probes, need to check for !in_interrupt() before dereferencing current->.
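-
-  A minimal sketch of that guard (assuming kernel headers of the day;
-  names here are illustrative, not the translator's actual output):
-
-    #include <linux/sched.h>      /* current */
-    #include <linux/hardirq.h>    /* in_interrupt() */
-    #include <linux/string.h>     /* strlcpy() */
-
-    /* fill in $pid / $comm only when current is meaningful */
-    static void fill_context_vars(long *pid, char *comm, size_t len)
-    {
-        if (!in_interrupt()) {
-            *pid = current->pid;
-            strlcpy(comm, current->comm, len);
-        } else {
-            *pid = -1;            /* sentinel: no task context */
-            comm[0] = '\0';
-        }
-    }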
-
-* Data collection:
-
- Several of the variables are translation-time constants, so these don't
- have run-time collection needs.
-
- For a kernel/module probe, $function[] could be computed from the kallsyms
-  lookup functions. Alternatively, the translator could emit a copy of the
- target symbol table into the probe C code, which $function[] could
- search. The $stack[] elements would be served by the runtime on a
- best-effort basis.
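-
-  A sketch of the kallsyms route (assuming the in-kernel
-  kallsyms_lookup() interface; whether it is exported to modules
-  varies by kernel version):
-
-    #include <linux/kallsyms.h>
-    #include <linux/string.h>
-
-    /* resolve $function[pc] / $module[pc] for a kernel text address */
-    static void resolve_addr(unsigned long pc, char *func, char *mod,
-                             size_t len)
-    {
-        unsigned long size, offset;
-        char *modname;
-        char namebuf[KSYM_NAME_LEN];
-        const char *name;
-
-        name = kallsyms_lookup(pc, &size, &offset, &modname, namebuf);
-        strlcpy(func, name ? name : "?", len);
-        strlcpy(mod, modname ? modname : "kernel", len);
-    }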
-
-* Data presentation:
-
- n/a: variables are simple
-
-* Competition:
-
- unknown
-
-* Cross-references:
-
- http://sources.redhat.com/ml/systemtap/2005-q2/msg00395.html
- http://sources.redhat.com/ml/systemtap/2005-q2/msg00281.html
-
-* Associated files:
diff --git a/tapsets/dynamic_cg/dynamic_cg.txt b/tapsets/dynamic_cg/dynamic_cg.txt
deleted file mode 100644
index 35de29f1..00000000
--- a/tapsets/dynamic_cg/dynamic_cg.txt
+++ /dev/null
@@ -1,64 +0,0 @@
-* Application name: Dynamic Callgraph
-* Contact: William Cohen, wcohen@redhat.com
-
-* Motivation:
-
-Dynamic Callgraph would provide information to allow developers to see
-what other functions a function is calling. This could show that some
-unexpected functions are getting called. DTrace has an instrumentation
-provider that generates a trace of the functions called and returned.
-
-* Background:
-
-There have been times that people in Red Hat support have narrowed a
-problem to a specific function and the functions it calls. Rather
-than instrumenting the function's children by hand, a tapset that
-provides a dynamic callgraph would allow quicker determination of the
-functions called. There are also cases in the kernel code where the
-function being called cannot be determined statically, e.g. the
-function to call is stored in a data structure.
-
-* Target software:
-
-Ideally both kernel and user space, but kernel space only would
-be sufficient for many cases.
-
-* Type of description: tapset and scripting command
-  a tapset to provide support for capturing call/return information
-  scripting commands to turn the capture on and off
-
-* Interesting probe points:
-
-* Interesting values:
-
-* Dependencies:
-- P6/x86-64 processors have the debug hardware to trap control flow
-  changes (see the sketch after this list).
-- Need to have the kernel maintain the debug hardware on a per process basis.
- The DebugCtlMSR is not currently stored in the context
- (only debug registers 0, 1, 2, 3, 6, and 7 are virtualized)
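-
-A sketch of arming the branch-trap hardware referred to above (bit
-positions are from the IA-32 manuals; the per-process save/restore
-problem noted above is deliberately ignored here):
-
-    #include <asm/msr.h>              /* rdmsr()/wrmsr() */
-    #include <asm/ptrace.h>           /* struct pt_regs */
-
-    #define MSR_DEBUGCTL   0x1d9      /* IA32_DEBUGCTL on P6-family */
-    #define DEBUGCTL_BTF   (1 << 1)   /* trap on branches, not every insn */
-    #define EFLAGS_TF      0x100      /* single-step trap flag */
-
-    static void arm_branch_trap(struct pt_regs *regs)
-    {
-        unsigned int lo, hi;
-
-        rdmsr(MSR_DEBUGCTL, lo, hi);
-        wrmsr(MSR_DEBUGCTL, lo | DEBUGCTL_BTF, hi);
-        regs->eflags |= EFLAGS_TF;    /* with BTF set, #DB fires per branch */
-    }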
-
-* Restrictions:
- May be difficult to implement on ppc: returns may look like regular jumps
- and trapping on all branches could cause problems with atomic operations
-  Won't work on pre-P6 x86 processors
-  Won't provide data for inlined functions
-
-* Data collection:
- Track whether the instruction was a call or a return and the target address.
-
-* Data presentation:
-  - post-process addresses in user space to convert them into function names
-  - trace showing calls and returns
-  - maybe further post-process to build a dynamic callgraph or to
-    determine that a function is being called far too often
-
-* Competition:
- DTrace already implements tracing of function calls and returns.
-
-* Cross-references:
-
-* Associated files:
-
- $dynamic_call_graph = 1; // turn on tracing of calls for thread
- $dynamic_call_graph = 0; // turn off tracing of calls for thread
-
diff --git a/tapsets/dynamic_cg/tapset.stp b/tapsets/dynamic_cg/tapset.stp
deleted file mode 100644
index c731fac9..00000000
--- a/tapsets/dynamic_cg/tapset.stp
+++ /dev/null
@@ -1,7 +0,0 @@
-global $dynamic_call_graph
-probe kernel.perfctr.call(1) {
- if ($dynamic_call_graph) trace_sym ($pc);
-}
-probe kernel.perfctr.return(1) {
- if ($dynamic_call_graph) trace_sym ($pc);
-}
diff --git a/tapsets/dynamic_cg/usage.stp b/tapsets/dynamic_cg/usage.stp
deleted file mode 100644
index 1625768b..00000000
--- a/tapsets/dynamic_cg/usage.stp
+++ /dev/null
@@ -1,30 +0,0 @@
-
-probe kernel.sys_open.entry()
-{
-	$dynamic_call_graph = 1;
-}
-
-# What you would see in the output would be something of this kind
-# call sys_open
-# call getname
-# call do_getname
-# return do_getname
-# return getname
-# call get_unused_fd
-# call find_next_zero_bit
-# return find_next_zero_bit
-# return get_unused_fd
-# call filp_open
- .....
-return sys_open
-
-# The above probe could be customized to a particular process as well,
-# like in the following
-
-probe kernel.sys_open.entry()
-{
-	if ($pid == 1234)
-		$dynamic_call_graph = 1;
-}
-
-
diff --git a/tapsets/profile/profile_tapset.txt b/tapsets/profile/profile_tapset.txt
deleted file mode 100644
index e8899dc7..00000000
--- a/tapsets/profile/profile_tapset.txt
+++ /dev/null
@@ -1,503 +0,0 @@
-* Application name: Stopwatch and Profiling for systemtap
-
-* Contact:
- Will Cohen wcohen@redhat.com
- Charles Spirakis charles.spirakis@intel.com
-
-* Motivation:
- Allow SW developers to improve the performance of their
-  code. The methodologies used are stopwatch (sometimes known
- as event counting) and profiling.
-
-* Background:
- Will has experience with oprofile
- Charles has experience with vtune
-
-* Target software:
- Initially the kernel, but longer term, both kernel and user.
-
-* Type of description:
- General information regarding requirements and usage models.
-
-* Interesting probe points:
-  When doing profiling you have "asynchronous-event" probe points
-  (i.e. you get an interrupt and you'll want to capture information
-  about where that interrupt happened).
-
- When doing stopwatch, interesting probe points will be
- function entry/exits, queue add/remove, queue entity lifecycle,
- and any other code where you want to measure time
- or events (cpu resource utilization) associated with a path of code
- (frame buffer drawing measurements, graphic T&L pipeline
- measurements, etc).
-
-* Interesting values:
- For profiling, the pt_regs structure from the interrupt handler. The
- most commonly used items would be the instruction pointer and the
- call stack pointer.
-
- For stopwatch, most of the accesses are likely to be pmu read
- operations.
-
- In addition, given the large variety of pmu capabilities, access
- to the pmu registers themselves (read and write) would be very
- important.
-
- Different pmu's have different events, but for script portability,
- we may want to have a subset of predefined events and have something
- map that into a pmu's particular event (similar to what papi does).
-
- Given the variety of performance events and pmu architectures, we
- may want to try and have a standardized library/api as part of the
-  translator to map events (or specialized event information) into
- register/value pairs used during the actual systemtap run.
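-
-  A sketch of such a mapping table (the layout is illustrative; the
-  encodings below are P6-family values, 0x79 being the CPU_CLK_UNHALTED
-  example cited under Papi in the cross-references):
-
-    struct event_map {
-        const char *generic;    /* portable name used in scripts */
-        unsigned int raw;       /* pmu-specific event encoding   */
-    };
-
-    static const struct event_map pentium_m_map[] = {
-        { "cpu_cycles",           0x79 },  /* CPU_CLK_UNHALTED */
-        { "instructions_retired", 0xc0 },  /* INST_RETIRED     */
-    };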
-
- ??? Classify values as consumed from lower level vs. provided to higher
- level ???
-
-* Dependencies:
- Need some form of arbitration of the pmu to make sure the data provided
- is valid (see perfmon below).
-
- Two common usage models are aggregated data (oprofile) and
- trace history (papi/vtune). Currently these tools all do the
- aggregation in user-mode and we may want to look at what
- they do and why.
-
- The unofficial rule of thumb is that profiling should be
- as unobtrusive as possible and definitely < 1% overhead.
-
-  When doing stopwatch or profiling, there is a need to be able to
-  sequence the data. For timing, this is important for accurately
-  computing start/stop deltas and watching control/data flow.
-  For profiling, it is needed to support trace history.
-
- There needs to be a timesource that has reasonable granularity
- and is reasonably precise.
-
- Per-thread virtualization (of time and events)
-
- System wide mode for pmu events
-
-* Restrictions:
-  Currently access to the pmu is a bit of a free-for-all, with no
-  single entity providing arbitration. The perfmon2 patch for 2.6
-  (see the cross-reference section below) is attempting to
-  provide much of the infrastructure needed by profiling tools
-  (like oprofile and papi) across architectures (Pentium M, ia64
-  and x86_64 initially, though I understand Stephane has contacted
-  someone at IBM for a powerpc version as well).
-
-  Andrew Morton wants perfmon and perfctr to be merged. Regardless
-  of what happens, both pmu libraries are geared more for
-  user->kernel access than kernel->kernel access, and we
-  will need to see what can be EXPORT()'ed to make them more
-  kernel-module friendly.
-
-* Data collection:
- Pmu counters tend to be different widths on different
- architectures. It would be useful to standardize the
- width (in software) to 64-bits to make math operations
-  (such as comparisons, deltas, etc) easier.
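-
-  A minimal sketch of the usual widening trick (relies on unsigned
-  wraparound arithmetic; assumes each counter is sampled more often
-  than it can wrap once):
-
-    #include <linux/types.h>
-
-    struct wide_counter {
-        u64 value;      /* accumulated 64-bit count */
-        u32 last_raw;   /* last raw pmu sample (32-bit hardware here) */
-    };
-
-    static u64 widen(struct wide_counter *c, u32 raw)
-    {
-        c->value += (u32)(raw - c->last_raw);  /* wraps correctly */
-        c->last_raw = raw;
-        return c->value;
-    }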
-
- The goal of profiling is to go from:
- pid/ip -> path/image -> source file/line number
-
- This implies the need to have a (reasonably quick) mechanism to
- translate pid/ip to path/image. Potentially reuse the dcookie
- methodology from oprofile but may need to extend that model if there
- is a goal to support anonymous maps (dynamically generated code).
-
- Need the ability to map the current pid to a process name.
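-
-  For the current task the name is directly at hand (a sketch, assuming
-  the probe runs in task context):
-
-    #include <linux/sched.h>       /* current, TASK_COMM_LEN */
-    #include <linux/string.h>
-
-    static void current_comm(char *buf, size_t len)
-    {
-        strlcpy(buf, current->comm, len);  /* name for current->pid */
-    }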
-
- Need to make a decision on how much will be handled via associative
- arrays in the kernel and how much will be handled in user space
- (potentially part of post processing). Given the volume of data that
- can be generated during profiling, it may make sense to follow the
-  trend of current performance tools and attempt to put merging and
- aggregation in the user space instead of kernel space.
-
- To keep the overhead of collection low, it may be useful to look
- into having some of the information needed be collected at interrupt
- time and other pieces of information be collected after the
- interrupt (top/bottom style). For example, although it may be
-  convenient to have a syntax like:
-
- val = associate_image($pt_regs->eip)
-
- it may be preferable to use a marker in the output stream instead
- (oprofile used a dcookie) and then do a lookup later (either in the
- kernel and add a marker->name entry to the output stream or in user
- space similar to what oprofile did). This concept could be extended
- to cover the lookup of the pid name as well.
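-
-  A sketch of the record such a deferred-lookup scheme implies (field
-  names are illustrative only, not a proposed systemtap format):
-
-    #include <linux/types.h>
-
-    /* written at interrupt time: cheap stores only, no name lookups */
-    struct sample {
-        u32 pid;        /* resolved to a process name later           */
-        u64 ip;         /* raw interrupted address                    */
-        u64 cookie;     /* opaque marker; cookie->image emitted later */
-    };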
-
- Stack information will need to be collected at interrupt time
- (based on the interrupted pt_regs->esp) so the routine to obtain
- the stack trace should be reasonably fast. Due to asynchronous probes,
- the stack may be in user space.
-
- Depending on whether support of anonymous maps is important, it may
- be useful to have a more generic method of mapping ip->path/module
-  which would allow dynamic code generators (usually found in user
-  space) to provide ip->image map information as part of
- the regular systemtap data stream. If we allow for a user-mode api
- to add data to a systemtap stream, we could have a very general
- purpose merge/aggregation tool for profiling from a variety of
- sources.
-
-* Data presentation:
- Generally data will be presented to the user as either an inorder
- stream (trace history) or aggregated in some form to produce a
- histogram or min/max/average/std.
-
- When aggregated, the data may be clumped by pid (each running of
- the app provides unique data), process name (the data for an app
- is merged for all runs), or it may be clumped by the loaded image
- (to get information about shared libraries regardless of the app
-  that loaded it). Assuming an increase in multi-processor and
-  multi-threaded applications, grouping the data by thread group
-  id is likely to be useful as well. Ideally, if symbols/debug
-  information is available, additional aggregation could be done
-  at the function, basic block, or source line level.
-
-* Competition:
- See the cross-reference list below
-
-* Cross-references:
- Oprofile
-
- Oprofile is a profiling tool that provides time and event based
- sampling. Its collection methodology has a "file view" of the
- world and only captures the minimum information needed to get
- the image that corresponds to the interrupted instruction
- address. It aggregates the data (no time information) to keep
- the total data size to a minimum even on long runs. Oprofile
- allows for optional "escape" sequences in a data stream to add
- information. It can handle non-maskable interrupts (NMI) as well
- as maskable interrupts to obtain samples in areas where
- maskable interrupts are normally disabled. Work is being done
- to allow oprofile to handle anonymous maps (ie. dynamically
-   generated code from JVMs).
-
- http://oprofile.sourceforge.net/news/
-
- Papi
-
- Papi is a profiling tool that can aggregate data or keep a trace
- history. It uses tables to map generic event concepts (for example,
- PAPI_TOT_CYC) into architecture specific events (for example,
- CPU_CLK_UNHALTED, value 0x79 on the Pentium M). Interrupts can be
- time based and it can capture event counts (i.e. every 5ms,
- capture cpu cycles and instructions retired) in addition to
- the instruction pointer. Papi is built on top of other performance
- monitoring support such as ia64 perfmon and i386 perfctr in the Linux
- kernel.
-
- http://icl.cs.utk.edu/papi/
-
- Perfmon2 infrastructure
-
-   Perfmon2 is a profiling infrastructure currently in the Linux 2.6
-   kernel for ia64. It handles arbitration and virtualization of the
-   pmu resources, extends the pmu's to a logical 64 bits regardless of
-   the underlying hardware size, context switches the counters when
-   needed to allow for per-process or system-wide use, and has the
-   ability to choose a subset of the cpu's on a system when doing
-   system-wide profiling. Oprofile on Linux 2.6 for ia64 has been
-   ported to use the perfmon2 interface. Currently, there are patches
-   submitted to the Linux kernel mailing list to port the perfmon2
-   infrastructure on the 2.6 kernel to the Pentium M and x86_64.
-
- http://www.hpl.hp.com/techreports/2004/HPL-2004-200R1.html
-
- Shark
-
- Shark is a profiling tool from Apple that focuses on time and event
- based statistical stack sampling. On each profile interrupt, in
- addition to capturing the instruction pointer, it also captures
- a stack trace so you know both where you were and how you got there.
-
- http://developer.apple.com/tools/sharkoptimize.html
-
- Vtune
-
- Vtune is a profiling tool that provides time and event based
- sampling. It does collection based on a "process view" of the
- world. It keeps a trace history so that you can aggregate the
- data during post processing in various ways, it can capture
- architectural specific data in addition to ip (such as branch
- history buffers), and it can use architectural specific abilities
-   to get exact ip addresses for certain events. It currently handles
-   anonymous mappings (dynamically generated code from JVMs).
-
- http://www.intel.com/software/products/vtune/vlin/index.htm
-
-
-* Associated files:
- Should the usage models be split into a separate file?
-
-Usage Models:
- Below are some typical usage models. This isn't an attempt
-  to propose syntax; it's an attempt to create something
-  concrete enough to help people understand the goals:
-  (description, pseudo code, desired output).
-
-Description: Statistical stack sampling (ala shark)
-
- probe kernel.time_ms(10)
- {
- i = associate_image($pt_regs->eip);
- s = stack($pt_regs->esp);
- stp($current->pid, $pid_name, $pt_regs->eip, i, s)
- }
-
- Output desired:
-  For each process/process name, aggregate (histogram) based
- on eip (regardless how I got there), stack (what was the
- most common calling path), or both (what was the most common
- path to the most common eip).
-  Could be implemented by generating a trace history and letting the
-  user post-process it (eats disk space, but one run can be viewed
-  multiple ways), or by having the user define what is wanted
-  in the script and doing the post-processing ourselves (saves disk
-  space, but more work for us).
-
-Description: Time based aggregation (ala oprofile)
-
- probe kernel.time_ms(10)
- {
- i = associate_image($pt_regs->eip);
-    stp($current->pid, $pid_name, $pt_regs->eip, i);
- }
-
- Output desired:
- Histogram separated by process name, pid/eip, pid/image
-
-Description: Time a routine part 1
- time between the function call and return:
-
-  probe kernel.function("sys_execve")
- {
- $thread->mystart = $timestamp
- }
-  probe kernel.function("sys_execve").return
- {
- delta = $timestamp - $thread->mystart
-
- // do statistical operations...
- }
-
- Output desired:
- Be able to do statistics for the time it takes for an exec
- function to execute. The time needs to have a fine enough
- granularity to have meaning (i.e. using jiffies probably wouldn't work)
- and the time needs to be smp correct even if the probe entry
- and the return execute on different processors.
-
-Description: Time a routine part 2
- count the number of events between the
- function call and return:
-
- probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_execve")
- {
- $thread->myclocks = $pmu[0];
- $thread->myinstr_ret = $pmu[1];
- }
- probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_execve").return
- {
- $thread->myclocks = $pmu[0] - $thread->myclocks;
- $thread->myinstr_ret = $pmu[1] - $thread->myinstr_ret;
-
- cycles_per_instruction = $thread->myclocks / $thread->myinstr_ret
-
- // Do statistical operations...
- }
-
- Desired Output:
- Produce min/max/average for cycles, instructions retired,
- and cycles_per_instruction. The pmu must be virtualized if the
- probe entry and probe exit can happen on different processors. The
-  pmu should be virtualized if there can be pre-emption (or waits) in
-  the function itself, to get more useful information (the actual count
-  of events in the function vs. a count of events in the whole system
-  between when the function starts and when it ends)
-
-Description: Time a routine part 3
- reminder of threading issues
-
- probe kernel.function("sys_fork")
- {
- $thread->mystart = $timestamp
- }
- probe kernel.function("sys_fork").return
- {
- delta = $timestamp - $thread->mystart
-
-     if (parent) {
- // do statistical operations for time it takes parent
- } else {
- // do statistical operations for time it takes child
- }
- }
-
- Desired Output:
- Produce min/max/average for the parent and the child. The
- time needs to have a fine enough granularity to have
- meaning (i.e. using jiffies probably wouldn't work)
- and the time needs to be smp correct even if the probe entry
- and the probe return execute on different processors.
-
-Description: Time a routine part 4
- reminder of threading issues
-
- probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_fork")
- {
- $thread->myclocks = $pmu[0];
- $thread->myinstr = $pmu[1];
- }
- probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_fork").return
- {
- $thread->myclocks = $pmu[0] - $thread->myclocks;
- $thread->myinstr = $pmu[1] - $thread->myinstr;
-
- cycles_per_instruction = $thread->myclocks / $thread->myinstr
-
-     if (parent) {
- // Do statistical operations...
- } else {
- // Do statistical operations...
- }
- }
-
- Desired Output:
- Produce min/max/average for cycles, instructions retired,
- and cycles_per_instruction. The pmu must be virtualized if the
- probe entry and probe exit can happen on different processors. The
-  pmu should be virtualized if there can be pre-emption (or waits) in
-  the function itself, to get more useful information (the actual count
-  of events in the function vs. a count of events in the whole system
-  between when the function starts and when it ends)
-
-Description: Beginnings of "papi" style collection
-
- probe kernel.startwatch("cpu_cycles").startwatch("instructions_retired").time_ms(10)
- {
- i = associate_image($pt_regs->eip);
-    stp($current->pid, $pid_name, $pt_regs->eip, i, $pmu[0], $pmu[1]);
- }
-
- Desired output:
- Trace history or aggregation based on process name, image
-
-Description: Find the path leading to high latency cache miss
- that stalled for more than 128 cycles (ia64 only)
-
- probe kernel.startwatch("branch_event,pmc[12]=0x3e0f").pmu_profile("data_ear_event:1000,pmc[11]=0x5000f")
- {
- //
- // on ia64, when using the data ear event, the precise eip is
- // saved in pmd[17], so no need for pt_regs->eip (and the
- // associated skid)...
- //
- i = associate_image($pmu->pmd[17]);
- stp($current->pid, $pid_name, $pmu->pmd[17], i, // the basics
- $pmu->pmd[2], // precise data address
- $pmu->pmd[3], // latency information
- $pmu->pmd[8], // branch history buffer
- $pmu->pmd[9], // "
- $pmu->pmd[10], // "
- $pmu->pmd[11], // "
- $pmu->pmd[12], // "
- $pmu->pmd[13], // "
- $pmu->pmd[14], // "
- $pmu->pmd[15], // "
- $pmu->pmd[16]); // indication of which was most recent branch
- }
-
- Desired output:
- Aggregate data based on pid, process name, eip, latency, and
- data address. Each pmd on ia64 is 64 bits long, thus the capturing
-  of just the 12 pmd's listed here is 96 bytes of information every
- interrupt for each cpu. Profiling can have a very high amount of
- data collected...
-
-Description: Pmu event collection of data but use NMI
- instead of the regular interrupt.
-
-NMI is useful for getting visibility on locks and other code which is
-normally hidden behind interrupt disable code. However, handling an
-NMI is more difficult to do properly. Potentially the compiler can be
-more restrictive on what's allowed in the handler when NMI's are
-selected as the interrupt method.
-
-
- probe kernel.nmi.pmu_profile("instructions_retired:1000000")
- {
- i = associate_image($pt_regs->eip);
-    stp($pid_name, $pt_regs->eip, i);
- }
-
- Desired Output:
- Same as the earlier oprofile style example
-
-
-Description: Timing items in a queue
-
- Two possibilities - use associative arrays or post process
-
-Associative arrays:
-
- probe kernel.function("add queue function")
- {
- start[$arg->queue_entry] = $timestamp;
- }
- probe kernel.function("remove queue function")
- {
- delta = $timestamp - start[$arg->queue_entry];
-
- // Do statistics on the delta value and the queue entry
- }
-
-Post process:
-
- probe kernel.function("add queue function")
- {
- stp("add", $timestamp, $arg->queue_entry)
- }
- probe kernel.function("remove queue function")
- {
- stp("remove", $timestamp, $arg->queue_entry)
- }
-
-Desired Output:
- For each queue_entry, calculate the delta and do appropriate
- statistics.
-
-
-Description: Following an item as it moves to different queues/lists
-
- Two possibilities - use associative arrays or post process
-    This example probes the generic list_add function.
-
-Associative arrays:
-
- probe kernel.function("list_add")
- {
- delta = $timestamp - start[$arg->head, $arg->new];
- start[$arg->head, $arg->new] = $timestamp;
- // Do statistics on the delta value and queue
- }
-
-
-Post process:
-
- probe kernel.function("list_add")
- {
- stp("add", $timestamp, $arg->head, $arg->new)
- }
-
-Desired Output:
- For each (queue, queue_entry) pair, calculate the delta and do appropriate
- statistics.
-
diff --git a/tapsets/timestamp/timestamp_tapset.txt b/tapsets/timestamp/timestamp_tapset.txt
deleted file mode 100644
index dcbd5813..00000000
--- a/tapsets/timestamp/timestamp_tapset.txt
+++ /dev/null
@@ -1,327 +0,0 @@
-* Application name: sequence numbers and timestamps
-
-* Contact:
- Martin Hunt hunt@redhat.com
- Will Cohen wcohen@redhat.com
- Charles Spirakis charles.spirakis@intel.com
-
-* Motivation:
- On multi-processor systems, it is important to have a way
-    to correlate information gathered between cpus. There are two
- forms of correlation:
-
- a) putting information into the correct sequence order
- b) providing accurate time deltas between information
-
- If the resolution of the time deltas is high enough, it can
- also be used to order information.
-
-* Background:
- Discussion started due to relayfs and per-cpu buffers, but this
-    is needed by many people.
-
-* Target software:
- Any software which wants to correlate data that was gathered
- on a multi-processor system, but the scope will be defined
- specifically for systemtap's needs.
-
-* Type of description:
- General information and discussion regarding sequencing and timing.
-
-* Interesting probe points:
- Any probe points where you are trying to get the time between two
- probe points. For example, timing how long a function takes and
- putting probe points at the function entry and function exit.
-
-* Interesting values:
- Possible ways to order data from multiple sources include:
-
-Retrieve the sequence/time from a global area
-
- High Precision Event Timer (HPET)
- Possible implementation:
- multimedia/HPET timer
-      arch/i386/kernel/timers/timer_hpet.c
- Advantages:
-      granularity can vary (HPET spec says the minimum frequency of the
-      HPET timer is 10 MHz, i.e. ~100ns resolution), can be treated as
-      read-only, can bypass cache update and avoid being cached at all
-      if desired, designed to be used as an smp timestamp (see
-      specification)
-    Disadvantages:
-      may not be available on all platforms, may not be synchronized on
-      NUMA systems (ie counts for all processors within a numa node are
-      comparable, but counts for processors between nodes may not be
-      comparable), potential resource conflict if timers are used by
-      other software
-
- Real Time Clock (RTC)
- Possible implementation:
- "external" chip (clock chip) which has time information, accessed via
- ioport or memory-mapped io
- Advantages:
- can be treated as read-only, can bypass cache update and avoid being
- cached at all if desired
- Disadvantages:
- may not be available on all platforms, low granularity (for rtc,
- ~1ms), usually slow access
-
- ACPI Power Management Timer (pm timer)
- Possible implementation:
-      implemented as part of the ACPI specification at 3.579545 MHz
- arch/i386/kernel/timers/timer_pm.c
- Advantages:
-      not affected by throttling, halting or power saving states, moderate
-      granularity (3.58 MHz, ~300ns resolution), designed for use by an OS
-      to keep track of time during sleep/power states
- Disadvantages:
- may not be available on all platforms, slower access than hpet timer
- (but still much faster than RTC)
-
- Chipset counter
- Possible implementation:
-      timer on a processor chipset, ??SGI implementation??, do we know of
-      any other implementations?
- Advantages:
-      likely to be based on the pci bus clock (33 MHz = ~30ns) or
-      front-side-bus clock (200 MHz = ~5ns)
- Disadvantages:
- may not be available on all platforms
-
- Sequence Number
- Possible implementation:
- atomic_t global variable, cache aligned, placed in struct to keep
- variable on a cache line by itself
- Advantages:
-      guaranteed correct ordering (even on NUMA systems), architecture
- independent, platform independent
- Disadvantages:
- potential for cache-line ping-pong, doesn't scale, no time
- information (ordering data only), access can be slower on NUMA systems
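-
-    A sketch of the cache-aligned atomic described above:
-
-      #include <asm/atomic.h>
-      #include <linux/cache.h>
-
-      /* keep the counter on its own cache line to limit ping-pong */
-      static struct {
-          atomic_t seq;
-      } seq_src ____cacheline_aligned_in_smp;
-
-      static inline int next_seq(void)
-      {
-          return atomic_inc_return(&seq_src.seq);  /* globally ordered */
-      }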
-
- Jiffies
- Possible implementation:
- OS counts the number of "clock interrupts" since power on.
- Advantages:
-      platform independent, architecture independent, one writer, many
-      readers (less cache ping-pong)
- Disadvantages:
- low resolution (usually 10ms, sometimes 1ms).
-
-    do_gettimeofday()
- Possible implementation:
- arch/i386/kernel/time.c
- Advantages:
-      platform independent, architecture independent, one writer, many
-      readers (less cache ping-pong), microsecond accuracy
-    Disadvantages:
-      the time unit increment value used by this routine changes
-      based on information from ntp (i.e. if ntp needs to speed up / slow
-      down the clock, then callers to this routine will be affected). This
-      is a disadvantage for timing short intervals, but an advantage
-      for timing long intervals.
-
-
-Retrieve the sequence/time from a cpu-unique area
-
- Timestamp counter (TSC)
- Possible implementation:
-      a count of the number of core cycles the processor has executed
-      since power on; due to the lack of synchronization between cpus, one
-      would also need to keep track of which cpu the tsc came from
- Advantages:
- no external bus access, high granularity (cpu core cycles),
- available on most (not all) architectures, platform independent
- Disadvantages:
-      not synchronized between cpus; since it is a count of cpu cycles,
-      the count can be affected by throttling, halting and power saving
-      states, and it may not correlate to "actual" time (ie, just because
-      a 1GHz processor showed a delta of 1G cycles, that doesn't mean 1
-      second has passed)
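-
-    A sketch of a cpu-tagged tsc sample (x86 inline asm; the cpu id must
-    travel with the value, since raw counts from different cpus are not
-    comparable):
-
-      #include <linux/smp.h>
-      #include <linux/types.h>
-
-      struct tsc_sample {
-          u64 tsc;
-          int cpu;
-      };
-
-      static void sample_tsc(struct tsc_sample *s)
-      {
-          u32 lo, hi;
-
-          s->cpu = smp_processor_id();   /* caller disables preemption */
-          asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
-          s->tsc = ((u64)hi << 32) | lo;
-      }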
-
- APIC timer
- Possible implementation:
- timer implemented within the processor
- Advantages:
-      no external bus access, moderate to high granularity (usually
-      counting based on the front-side-bus clock or core clock)
- Disadvantages:
- not synchronized between cpus, may be affected by throttling,
- halting/power saving states, may not correlate to "actual" time.
-
- PMU event
- Possible implementation:
-      program a performance counter with a specific event related to time
- Advantages:
-      no external bus access, moderate to high granularity (usually
-      counting based on the front-side-bus clock or core clock), can be
- virtualized to give moderate to high granularity for individual
- thread paths
- Disadvantages:
- not synchronized between cpus, may be affected by throttling,
- halting/power saving states, may not correlate to "actual" time,
- processor dependent
-
-
- For reference, as a quick baseline, on Martin's dual-processor system,
- he gets the following performance measurements:
-
- kprobe overhead: 1200-1500ns (depending on OS and processor)
- atomic read plus increment: 40ns (single processor access, no conflict)
- monotonic_clock() 550ns
- do_gettimeofday() 590ns
-
-* Dependencies:
- Not Applicable
-
-* Restrictions:
- Certain timers may already be in use by other parts of the kernel
- depending on how it is configured (for example, RTC is used by the
- watchdog code). Some kernels may not compile in the necessary code
-    (for example, using the pm timer requires ACPI). Some platforms
-    or architectures may not have the timer requested (for example,
-    there is no HPET timer on older systems).
-
-* Data collection:
-    For data collection, it is probably best to keep the concepts of
-    sequence ordering and timestamp separate within
-    systemtap (for both the user as well as the implementation).
-
-    For sequence ordering, the initial implementation should use ?? the
- atomic_t form for the sequence ordering (since it is guaranteed
- to be platform and architecture neutral)?? and modify/change the
- implementation later if there is a problem.
-
- For timestamp, the initial implementation should use
- ?? hpet timer ?? pm timer ?? do_gettimeofday ?? cpu # + tsc ??
- some combination (do_gettimeofday + cpu # & low bits of tsc)?
-
- We could do something like what LTT does (see below) to
- generate 64-bit timestamps containing the nanoseconds
- since Jan 1, 1970.
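-
-    A sketch of the do_gettimeofday() variant (the source is only
-    microsecond-resolution, so the low digits of the result carry no
-    information):
-
-      #include <linux/time.h>
-      #include <linux/types.h>
-
-      static u64 timestamp_ns(void)
-      {
-          struct timeval tv;
-
-          do_gettimeofday(&tv);
-          return (u64)tv.tv_sec * 1000000000ULL + (u64)tv.tv_usec * 1000ULL;
-      }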
-
- Assuming the implementation keeps these concepts separate now
- (ordering data vs. timing deltas), it is always possible to
- merge them in the future if a high granularity, numa/smp
- synchronized timesource becomes available for a large number
- of platforms and/or processors.
-
-* Data presentation:
- In general, users prefer output which is based on "actual" time (ie,
- they prefer an output that says the delta is XXX nanoseconds instead
-    of YYY cpu cycles). Most of the time users want deltas (how long did
-    this take), but occasionally they want absolute times (when / what time
-    was this information collected).
-
-* Competition:
-    DTrace has output in nanoseconds (and it is comparable between
-    processors on an mp system), but it is unclear what the actual
-    resolution is. Even if the SPARC machine does have hardware
-    that provides nanosecond resolution, on x86-64 it is likely
-    to have the same problems as discussed here since the Solaris
-    Opteron box tends to be a pretty vanilla box.
-
- From Joshua Stone (joshua.i.stone at intel.com):
-
- == BEGIN ===
- DTrace gives you three built-in variables:
-
- uint64_t timestamp: The current value of a nanosecond timestamp
- counter. This counter increments from an arbitrary point in the
- past and should only be used for relative computations.
-
- uint64_t vtimestamp: The current value of a nanosecond timestamp
- counter that is virtualized to the amount of time that the current
- thread has been running on a CPU, minus the time spent in DTrace
- predicates and actions. This counter increments from an arbitrary
- point in the past and should only be used for relative time computations.
-
- uint64_t walltimestamp: The current number of nanoseconds since
-    00:00 Universal Coordinated Time, January 1, 1970.
-
- As for how they are implemented, the only detail I found is that
- timestamp is "similar to the Solaris library routine gethrtime".
- The manpage for gethrtime is here:
- http://docs.sun.com/app/docs/doc/816-5168/6mbb3hr8u?a=view
- == END ==
-
- What LTT does:
-
- "Cycle counters are fast to read but may reflect time
- inaccurately. Indeed, the exact clock frequency varies
- with time as the processor temperature changes, influenced
- by the external temperature and its workload. Moreover, in
- SMP systems, the clock of individual processors may vary
- independently.
-
- LTT corrects the clock inaccuracy by reading the real time
- clock value and the 64 bits cycle counter periodically, at
- the beginning of each block, and at each 10ms. This way, it
- is sufficient to read only the lower 32 bits of the cycle
- counter at each event. The associated real time value may
- then be obtained by linear interpolation between the nearest
- full cycle counter and real time values. Therefore, for the
- average cost of reading and storing the lower 32 bits of the
- cycle counter at each event, the real time with full resolution
- is obtained at analysis time."
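-
-    A sketch of the interpolation LTT describes (one (real time, cycle
-    count) reference pair per resync period; names and scaling are
-    illustrative):
-
-      #include <linux/types.h>
-
-      struct clock_ref {
-          u64 ref_ns;            /* real time at the last resync       */
-          u32 ref_tsc;           /* low 32 bits of the cycle counter   */
-          u32 ns_per_1k_cycles;  /* rate measured over the last period */
-      };
-
-      static u64 event_time_ns(struct clock_ref *r, u32 event_tsc)
-      {
-          u32 cycles = event_tsc - r->ref_tsc;  /* safe: resync < wrap */
-          return r->ref_ns + ((u64)cycles * r->ns_per_1k_cycles) / 1000;
-      }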
-
-
-* Cross-references:
-    The profile_tapset is very dependent on sequencing when
-    ordering data (i.e. taking a trace history), as well as on high
-    granularity when calculating time deltas.
-
-* Associated files:
-
- Profile tapset requirements
- .../src/tapsets/profile/profile_tapset.txt
-
- Intel high precision event timers specification:
- http://www.intel.com/hardwaredesign/hpetspec.htm
-
- ACPI specification:
- http://www.acpi.info/DOWNLOADS/ACPIspec-2-0b.pdf
-
- From an internal email sent by Tony Luck (tony.luck at intel.com)
- regarding a clustered environment. For the summary below, hpet and
- pm timer were not an option. For systemtap, they should be considered,
-    especially since the pm timer and hpet were designed to be timestamps.
-
- == BEGIN ===
- For extremely short intervals (<100ns) get some h/w help (oscilloscope
- or logic analyser). Delays reading TSC and pipeline effects could skew
- your results horribly. Having a 2GHz clock doesn't mean that you can
- measure 500ps intervals.
-
- For short intervals (100ns to 10 ms) TSC is your best choice ... but you
- need to sample it on the same cpu, and converting the difference between
- two TSC values to real time will require some system dependent math to find
- the nominal frequency of the system (you may be able to ignore temperature
- effects, unless your system is in an extremely hostile environment). But
- beware of systems that change the TSC rate when making frequency
- adjustments for power saving. It shouldn't be hard to measure the
- system clock frequency to about five significant digits of accuracy,
- /proc/cpuinfo is probably good enough.
-
- For medium intervals (10 ms to a minute) then "gettimeofday()" or
- "clock_gettime()" on a system *NOT* running NTP may be best, but you will
- need to adjust for systematic error to account for the system clock running
- fast/slow. Many Linux systems ship with a utility named "clockdiff" that
- you can use to measure the system drift against a reference system
-    (a system that is nearby on the network, running NTP, preferably a
- low "stratum" one).
-
- Just run clockdiff every five minutes for an hour or two, and plot the
- results to see what systematic drift your system has without NTP. N.B. if
- you find the drift is > 10 seconds per day, then NTP may have
- trouble keeping this system synced using only drift corrections,
- you might see "steps" when running NTP. Check /var/log/messages for
- complaints from NTP.
-
- For long intervals (above a minute). Then you need "gettimeofday()" on a
- system that uses NTP to keep it in touch with reality. Assuming reasonable
- network connectivity, NTP will maintain the time within a small number of
- milliseconds of reality ... so your results should be good for
- 4-5 significant figures for 1 minute intervals, and better for longer
- intervals.
- == END ==
-