author     cspiraki <cspiraki>  2005-06-14 02:43:57 +0000
committer  cspiraki <cspiraki>  2005-06-14 02:43:57 +0000
commit     5ece94302a589a2f5e8f3025d797b2f8a8194efc (patch)
tree       95f098cad9cc85edc39441f72caa2623aa355900
parent     56e12059fd1e281889b708b3d351092d9e3ed0be (diff)
Initial description of the requirements for enabling
stopwatch and profiling for systemtap.
-rw-r--r--  tapsets/profile/profile_tapset.txt  503
1 files changed, 503 insertions, 0 deletions
diff --git a/tapsets/profile/profile_tapset.txt b/tapsets/profile/profile_tapset.txt
new file mode 100644
index 00000000..e8899dc7
--- /dev/null
+++ b/tapsets/profile/profile_tapset.txt
@@ -0,0 +1,503 @@

* Application name: Stopwatch and Profiling for systemtap

* Contact:
  Will Cohen wcohen@redhat.com
  Charles Spirakis charles.spirakis@intel.com

* Motivation:
  Allow software developers to improve the performance of their
  code. The methodologies used are stopwatch (sometimes known as
  event counting) and profiling.

* Background:
  Will has experience with oprofile.
  Charles has experience with vtune.

* Target software:
  Initially the kernel, but longer term both kernel and user code.

* Type of description:
  General information regarding requirements and usage models.

* Interesting probe points:
  When doing profiling you have "asynchronous-event" probe points
  (i.e. you get an interrupt and you want to capture information
  about where that interrupt happened).

  When doing stopwatch, interesting probe points will be function
  entry/exit, queue add/remove, queue entity lifecycle, and any
  other code where you want to measure time or events (cpu resource
  utilization) associated with a path of code (frame buffer drawing
  measurements, graphics T&L pipeline measurements, etc.).

* Interesting values:
  For profiling, the pt_regs structure from the interrupt handler.
  The most commonly used items would be the instruction pointer and
  the stack pointer.

  For stopwatch, most of the accesses are likely to be pmu read
  operations.

  In addition, given the large variety of pmu capabilities, access
  to the pmu registers themselves (read and write) would be very
  important.

  Different pmu's have different events, but for script portability
  we may want to have a subset of predefined events and have
  something map them onto a pmu's particular event, similar to what
  papi does (a sketch of such a table follows at the end of this
  section).

  Given the variety of performance events and pmu architectures, we
  may want to try to have a standardized library/api as part of the
  translator to map events (or specialized event information) into
  register/value pairs used during the actual systemtap run.

  ??? Classify values as consumed from lower level vs. provided to
  higher level ???
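
  A minimal sketch, in C, of the papi-style event table described
  above. The table layout and lookup helper are illustrative rather
  than a proposed api; the Pentium M encoding is the one cited in
  the Papi cross-reference below.

      #include <stdint.h>
      #include <string.h>

      /* One row per (generic event, architecture) pair. */
      struct event_map {
          const char *generic;  /* portable name used in scripts  */
          const char *arch;     /* cpu family this row applies to */
          const char *native;   /* native event name              */
          uint32_t    code;     /* value programmed into the pmu  */
      };

      static const struct event_map map[] = {
          /* papi maps PAPI_TOT_CYC to CPU_CLK_UNHALTED (0x79) on
           * the Pentium M */
          { "cpu_cycles", "pentium-m", "CPU_CLK_UNHALTED", 0x79 },
      };

      /* Resolve a portable event name for one architecture;
       * returns NULL if unknown. */
      static const struct event_map *
      lookup_event(const char *generic, const char *arch)
      {
          size_t i;

          for (i = 0; i < sizeof(map) / sizeof(map[0]); i++)
              if (!strcmp(map[i].generic, generic) &&
                  !strcmp(map[i].arch, arch))
                  return &map[i];
          return NULL;
      }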
* Dependencies:
  Need some form of arbitration of the pmu to make sure the data
  provided is valid (see perfmon below).

  Two common usage models are aggregated data (oprofile) and trace
  history (papi/vtune). Currently these tools all do the aggregation
  in user mode, and we may want to look at what they do and why.

  The unofficial rule of thumb is that profiling should be as
  unobtrusive as possible, and definitely < 1% overhead.

  When doing stopwatch or profiling, there is a need to be able to
  sequence the data. For timing, this is important to be able to
  accurately compute start/stop deltas and watch control/data flow.
  For profiling, it is needed to support trace history.

  There needs to be a timesource that has reasonable granularity and
  is reasonably precise.

  Per-thread virtualization (of time and events).

  System-wide mode for pmu events.

* Restrictions:
  Currently access to the pmu is a bit of a free-for-all with no
  single entity providing arbitration. The perfmon2 patch for 2.6
  (see the cross-reference section below) is attempting to provide
  much of the infrastructure needed by profiling tools (like
  oprofile and papi) across architectures (Pentium M, ia64 and
  x86_64 initially, though I understand Stephane has contacted
  someone at IBM for a powerpc version as well).

  Andrew Morton wants perfmon and perfctr to be merged. Regardless
  of what happens, both pmu libraries are geared more toward
  user->kernel access than kernel->kernel access, and we will need
  to see what can be EXPORT_SYMBOL()'ed to make them more
  kernel-module friendly.

* Data collection:
  Pmu counters tend to be different widths on different
  architectures. It would be useful to standardize the width (in
  software) to 64 bits to make math operations (such as comparisons,
  deltas, etc.) easier (see the sketch at the end of this section).

  The goal of profiling is to go from:

      pid/ip -> path/image -> source file/line number

  This implies the need for a (reasonably quick) mechanism to
  translate pid/ip to path/image. Potentially reuse the dcookie
  methodology from oprofile, but that model may need to be extended
  if there is a goal to support anonymous maps (dynamically
  generated code).

  Need the ability to map the current pid to a process name.

  Need to make a decision on how much will be handled via
  associative arrays in the kernel and how much will be handled in
  user space (potentially as part of post processing). Given the
  volume of data that can be generated during profiling, it may make
  sense to follow the trend of current performance tools and attempt
  to put merging and aggregation in user space instead of kernel
  space.

  To keep the overhead of collection low, it may be useful to look
  into having some of the information be collected at interrupt time
  and other pieces of information be collected after the interrupt
  (top/bottom style). For example, although it may be convenient to
  have a syntax like:

      val = associate_image($pt_regs->eip)

  it may be preferable to use a marker in the output stream instead
  (oprofile used a dcookie) and then do a lookup later (either in
  the kernel, adding a marker->name entry to the output stream, or
  in user space similar to what oprofile did). This concept could be
  extended to cover the lookup of the pid name as well.

  Stack information will need to be collected at interrupt time
  (based on the interrupted pt_regs->esp), so the routine to obtain
  the stack trace should be reasonably fast. Because these probes
  are asynchronous, the stack may be in user space.

  Depending on whether support of anonymous maps is important, it
  may be useful to have a more generic method of mapping
  ip->path/module, which would allow dynamic code generators
  (usually found in user space) to provide ip->image map information
  as part of the regular systemtap data stream. If we allow for a
  user-mode api to add data to a systemtap stream, we could have a
  very general purpose merge/aggregation tool for profiling from a
  variety of sources.
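
  One common way to get the 64-bit software width mentioned above is
  to accumulate masked deltas of the narrower hardware counter. A
  minimal sketch, where read_hw_counter() and the 40-bit width are
  hypothetical stand-ins for whatever a given architecture provides:

      #include <stdint.h>

      #define HW_BITS 40   /* assumed hardware counter width */
      #define HW_MASK ((UINT64_C(1) << HW_BITS) - 1)

      extern uint64_t read_hw_counter(void);   /* hypothetical raw pmu read */

      struct soft_counter {
          uint64_t total;   /* 64-bit software value                 */
          uint64_t last;    /* last raw hardware value seen (masked) */
      };

      /* Fold the latest hardware reading into the 64-bit total.
       * Must be called at least once per hardware wraparound
       * period, and `last' must be seeded from the counter before
       * the first call. */
      static uint64_t soft_counter_read(struct soft_counter *c)
      {
          uint64_t now = read_hw_counter() & HW_MASK;

          c->total += (now - c->last) & HW_MASK;   /* mask handles wraparound */
          c->last = now;
          return c->total;
      }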
* Data presentation:
  Generally data will be presented to the user either as an in-order
  stream (trace history) or aggregated in some form to produce a
  histogram or min/max/average/std (a sketch of the running
  aggregation follows at the end of this section).

  When aggregated, the data may be grouped by pid (each run of the
  app provides unique data), by process name (the data for an app is
  merged across all runs), or by the loaded image (to get
  information about shared libraries regardless of the app that
  loaded them). Assuming an increase in multiprocessor and
  multithreaded applications, grouping the data by thread group id
  is likely to be useful as well. Ideally, if symbols/debug
  information is available, additional aggregation could be done at
  the function, basic block or source line level.
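
  A minimal sketch of the min/max/average/std aggregation mentioned
  above; keeping running sums means individual samples never need to
  be stored (all names are illustrative):

      #include <stdint.h>
      #include <math.h>

      struct stats {
          uint64_t n;
          double   sum, sum_sq;   /* running sums for average and std */
          double   min, max;
      };

      static void stats_add(struct stats *s, double x)
      {
          if (s->n == 0 || x < s->min) s->min = x;
          if (s->n == 0 || x > s->max) s->max = x;
          s->n++;
          s->sum += x;
          s->sum_sq += x * x;
      }

      static double stats_avg(const struct stats *s)
      {
          return s->n ? s->sum / s->n : 0.0;
      }

      static double stats_std(const struct stats *s)
      {
          double avg = stats_avg(s);
          double var = s->n ? s->sum_sq / s->n - avg * avg : 0.0;

          return var > 0.0 ? sqrt(var) : 0.0;   /* guard against fp rounding */
      }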
* Competition:
  See the cross-reference list below.

* Cross-references:
  Oprofile

  Oprofile is a profiling tool that provides time and event based
  sampling. Its collection methodology has a "file view" of the
  world and only captures the minimum information needed to get the
  image that corresponds to the interrupted instruction address. It
  aggregates the data (no time information) to keep the total data
  size to a minimum even on long runs. Oprofile allows for optional
  "escape" sequences in a data stream to add information. It can
  handle non-maskable interrupts (NMI) as well as maskable
  interrupts to obtain samples in areas where maskable interrupts
  are normally disabled. Work is being done to allow oprofile to
  handle anonymous maps (i.e. dynamically generated code from
  jvm's).

  http://oprofile.sourceforge.net/news/

  Papi

  Papi is a profiling tool that can aggregate data or keep a trace
  history. It uses tables to map generic event concepts (for
  example, PAPI_TOT_CYC) onto architecture-specific events (for
  example, CPU_CLK_UNHALTED, value 0x79, on the Pentium M).
  Interrupts can be time based, and it can capture event counts
  (i.e. every 5 ms, capture cpu cycles and instructions retired) in
  addition to the instruction pointer. Papi is built on top of other
  performance monitoring support such as ia64 perfmon and i386
  perfctr in the Linux kernel.

  http://icl.cs.utk.edu/papi/

  Perfmon2 infrastructure

  Perfmon2 is a profiling infrastructure currently in the Linux 2.6
  kernel for ia64. It handles arbitration and virtualization of the
  pmu resources, extends the pmu's to a logical 64 bits regardless
  of the underlying hardware size, context switches the counters
  when needed to allow for per-process or system-wide use, and has
  the ability to choose a subset of the cpu's on a system when doing
  system-wide profiling. Oprofile on Linux 2.6 for ia64 has been
  ported to use the perfmon2 interface. Currently, there are patches
  submitted to the Linux Kernel Mailing List to port the perfmon2
  infrastructure to the Pentium M and x86_64 on the 2.6 kernel.

  http://www.hpl.hp.com/techreports/2004/HPL-2004-200R1.html

  Shark

  Shark is a profiling tool from Apple that focuses on time and
  event based statistical stack sampling. On each profile interrupt,
  in addition to capturing the instruction pointer, it also captures
  a stack trace, so you know both where you were and how you got
  there.

  http://developer.apple.com/tools/sharkoptimize.html

  Vtune

  Vtune is a profiling tool that provides time and event based
  sampling. It does collection based on a "process view" of the
  world. It keeps a trace history so that you can aggregate the data
  during post processing in various ways, it can capture
  architecture-specific data in addition to the ip (such as branch
  history buffers), and it can use architecture-specific abilities
  to get exact ip addresses for certain events. It currently handles
  anonymous mappings (dynamically generated code from jvm's).

  http://www.intel.com/software/products/vtune/vlin/index.htm


* Associated files:
  Should the usage models be split into a separate file?


Usage Models:
  Below are some typical usage models. This isn't an attempt to
  propose syntax; it's an attempt to create something concrete
  enough to help people understand the goals (description, pseudo
  code, desired output).

Description: Statistical stack sampling (ala shark)

  probe kernel.time_ms(10)
  {
      i = associate_image($pt_regs->eip);
      s = stack($pt_regs->esp);
      stp($current->pid, $pid_name, $pt_regs->eip, i, s)
  }

  Output desired:
      For each pid/process name, aggregate (histogram) based on eip
      (regardless of how it was reached), on stack (what was the most
      common calling path), or on both (what was the most common path
      to the most common eip).
      Could be implemented by generating a trace history and letting
      the user post process (eats disk space, but one run can be
      viewed multiple ways), or by having the user define what is
      wanted in the script and doing the post processing ourselves
      (saves disk space, but more work for us).

Description: Time based aggregation (ala oprofile)

  probe kernel.time_ms(10)
  {
      i = associate_image($pt_regs->eip);
      stp($current->pid, $pid_name, $pt_regs->eip, i);
  }

  Output desired:
      Histogram separated by process name, pid/eip, pid/image.

Description: Time a routine, part 1 - time between the function call
and return:

  probe kernel.function("sys_execve")
  {
      $thread->mystart = $timestamp
  }
  probe kernel.function("sys_execve").return
  {
      delta = $timestamp - $thread->mystart

      // do statistical operations...
  }

  Output desired:
      Be able to do statistics on the time it takes an exec call to
      execute. The time needs to have a fine enough granularity to
      have meaning (i.e. using jiffies probably wouldn't work), and
      the time needs to be smp correct even if the probe entry and
      the return execute on different processors.

Description: Time a routine, part 2 - count the number of events
between the function call and return:

  probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_execve")
  {
      $thread->myclocks = $pmu[0];
      $thread->myinstr_ret = $pmu[1];
  }
  probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_execve").return
  {
      $thread->myclocks = $pmu[0] - $thread->myclocks;
      $thread->myinstr_ret = $pmu[1] - $thread->myinstr_ret;

      cycles_per_instruction = $thread->myclocks / $thread->myinstr_ret

      // Do statistical operations...
  }

  Desired Output:
      Produce min/max/average for cycles, instructions retired, and
      cycles_per_instruction. The pmu must be virtualized if the
      probe entry and probe exit can happen on different processors.
      The pmu should be virtualized if there can be preemption (or
      waits) in the function itself, to get more useful information
      (the actual count of events in the function vs. a count of
      events in the whole system between when the function started
      and when it ended). One way to implement this virtualization
      is sketched below.
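
The virtualization requirement above amounts to per-thread accounting
at context switch. A minimal sketch of the idea in C, with a
hypothetical pmu_read64() and hypothetical context-switch hooks (this
is not proposed script syntax):

      #include <stdint.h>

      extern uint64_t pmu_read64(void);   /* hypothetical: 64-bit pmu
                                           * value on this cpu */

      struct thread_count {
          uint64_t accumulated;  /* events from earlier scheduling periods */
          uint64_t baseline;     /* counter value when last put on cpu     */
      };

      /* Charge the thread only for events that occur while it is
       * actually running. */
      static void sched_in(struct thread_count *t)
      {
          t->baseline = pmu_read64();
      }

      static void sched_out(struct thread_count *t)
      {
          t->accumulated += pmu_read64() - t->baseline;
      }

      /* Virtualized per-thread count; valid while the thread runs. */
      static uint64_t thread_count_read(const struct thread_count *t)
      {
          return t->accumulated + (pmu_read64() - t->baseline);
      }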
Description: Time a routine, part 3 - reminder of threading issues

  probe kernel.function("sys_fork")
  {
      $thread->mystart = $timestamp
  }
  probe kernel.function("sys_fork").return
  {
      delta = $timestamp - $thread->mystart

      if (parent) {
          // do statistical operations for the time the parent takes
      } else {
          // do statistical operations for the time the child takes
      }
  }

  Desired Output:
      Produce min/max/average for the parent and the child. The time
      needs to have a fine enough granularity to have meaning (i.e.
      using jiffies probably wouldn't work), and the time needs to be
      smp correct even if the probe entry and the probe return
      execute on different processors.

Description: Time a routine, part 4 - reminder of threading issues

  probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_fork")
  {
      $thread->myclocks = $pmu[0];
      $thread->myinstr = $pmu[1];
  }
  probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_fork").return
  {
      $thread->myclocks = $pmu[0] - $thread->myclocks;
      $thread->myinstr = $pmu[1] - $thread->myinstr;

      cycles_per_instruction = $thread->myclocks / $thread->myinstr

      if (parent) {
          // Do statistical operations...
      } else {
          // Do statistical operations...
      }
  }

  Desired Output:
      Produce min/max/average for cycles, instructions retired, and
      cycles_per_instruction. The pmu must be virtualized if the
      probe entry and probe exit can happen on different processors.
      The pmu should be virtualized if there can be preemption (or
      waits) in the function itself, to get more useful information
      (the actual count of events in the function vs. a count of
      events in the whole system between when the function started
      and when it ended).

Description: Beginnings of "papi" style collection

  probe kernel.startwatch("cpu_cycles").startwatch("instructions_retired").time_ms(10)
  {
      i = associate_image($pt_regs->eip);
      stp($current->pid, $pid_name, $pt_regs->eip, i, $pmu[0], $pmu[1]);
  }

  Desired output:
      Trace history or aggregation based on process name, image.

Description: Find the path leading to a high latency cache miss that
stalled for more than 128 cycles (ia64 only)

  probe kernel.startwatch("branch_event,pmc[12]=0x3e0f").pmu_profile("data_ear_event:1000,pmc[11]=0x5000f")
  {
      //
      // on ia64, when using the data ear event, the precise eip is
      // saved in pmd[17], so there is no need for pt_regs->eip (and
      // the associated skid)...
      //
      i = associate_image($pmu->pmd[17]);
      stp($current->pid, $pid_name, $pmu->pmd[17], i, // the basics
          $pmu->pmd[2],   // precise data address
          $pmu->pmd[3],   // latency information
          $pmu->pmd[8],   // branch history buffer
          $pmu->pmd[9],   // "
          $pmu->pmd[10],  // "
          $pmu->pmd[11],  // "
          $pmu->pmd[12],  // "
          $pmu->pmd[13],  // "
          $pmu->pmd[14],  // "
          $pmu->pmd[15],  // "
          $pmu->pmd[16]); // indication of which was the most recent branch
  }

  Desired output:
      Aggregate data based on pid, process name, eip, latency, and
      data address. Each pmd on ia64 is 64 bits long, so capturing
      just the 12 pmd's listed here is 96 bytes of information on
      every interrupt for each cpu (at, say, 1000 samples per second
      that is roughly 96 KB/s per cpu). Profiling can collect a very
      large amount of data...

Description: Pmu event collection of data, but using NMI instead of
the regular interrupt.

NMI is useful for getting visibility into locks and other code which
is normally hidden behind interrupt-disable code. However, handling
an NMI is more difficult to do properly. Potentially the compiler can
be more restrictive about what's allowed in the handler when NMIs are
selected as the interrupt method (an NMI-safe buffering pattern is
sketched after the example below).

  probe kernel.nmi.pmu_profile("instructions_retired:1000000")
  {
      i = associate_image($pt_regs->eip);
      stp($pid_name, $pt_regs->eip, i);
  }

  Desired Output:
      Same as the earlier oprofile style example.
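
One reason NMI handlers are restrictive is that they cannot take
locks, so samples are typically pushed into a per-cpu,
single-producer ring buffer and drained outside the handler. A
minimal sketch of that pattern, with illustrative names and sizes:

      #include <stdint.h>

      #define RING_SIZE 1024   /* power of two; entries per cpu */

      struct sample {
          uint64_t ip;
          uint32_t pid;
      };

      struct ring {
          volatile uint32_t head;   /* written only by the nmi handler */
          volatile uint32_t tail;   /* written only by the drain code  */
          struct sample buf[RING_SIZE];
      };

      /* Called from the nmi handler: no locks, drop the sample if full. */
      static int ring_push(struct ring *r, struct sample s)
      {
          uint32_t head = r->head;

          if (head - r->tail >= RING_SIZE)
              return 0;                        /* full: sample dropped */
          r->buf[head & (RING_SIZE - 1)] = s;
          /* a real implementation needs a write barrier here so the
           * entry is visible before the new head is published */
          r->head = head + 1;
          return 1;
      }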
Description: Timing items in a queue

  Two possibilities - use associative arrays or post process.

Associative arrays:

  probe kernel.function("add queue function")
  {
      start[$arg->queue_entry] = $timestamp;
  }
  probe kernel.function("remove queue function")
  {
      delta = $timestamp - start[$arg->queue_entry];

      // Do statistics on the delta value and the queue entry
  }

Post process:

  probe kernel.function("add queue function")
  {
      stp("add", $timestamp, $arg->queue_entry)
  }
  probe kernel.function("remove queue function")
  {
      stp("remove", $timestamp, $arg->queue_entry)
  }

Desired Output:
  For each queue_entry, calculate the delta and do appropriate
  statistics.


Description: Following an item as it moves to different queues/lists

  Two possibilities - use associative arrays or post process.

Associative arrays:

  probe kernel.function("list_add")
  {
      delta = $timestamp - start[$arg->head, $arg->new];
      start[$arg->head, $arg->new] = $timestamp;
      // Do statistics on the delta value and queue
  }

Post process:

  probe kernel.function("list_add")
  {
      stp("add", $timestamp, $arg->head, $arg->new)
  }

Desired Output:
  For each (queue, queue_entry) pair, calculate the delta and do
  appropriate statistics (a user-space post-processing sketch
  follows).
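
For the post-process variants above, the user-space side just matches
each "remove" record against the outstanding "add" for the same queue
entry. A minimal sketch in C, using a small open-addressed table
keyed by the entry pointer (the record format and all names are
illustrative):

      #include <stdint.h>
      #include <stdio.h>

      #define TABLE_SIZE 4096   /* power of two; assumed never full */

      static uint64_t keys[TABLE_SIZE];    /* entry pointers (0 = empty) */
      static uint64_t starts[TABLE_SIZE];  /* timestamp of matching "add" */

      /* Open addressing with linear probing; slots are never freed
       * in this sketch, so a re-used entry pointer simply lands in
       * the same slot again. */
      static unsigned slot(uint64_t key)
      {
          unsigned i = (unsigned)(key >> 4) & (TABLE_SIZE - 1);

          while (keys[i] != 0 && keys[i] != key)
              i = (i + 1) & (TABLE_SIZE - 1);
          return i;
      }

      /* Feed trace records in order; prints one delta per remove. */
      static void record(const char *op, uint64_t timestamp, uint64_t entry)
      {
          unsigned i = slot(entry);

          if (op[0] == 'a') {                     /* "add"    */
              keys[i] = entry;
              starts[i] = timestamp;
          } else if (keys[i] == entry) {          /* "remove" */
              printf("%#llx: %llu\n", (unsigned long long)entry,
                     (unsigned long long)(timestamp - starts[i]));
          }
      }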