* Application name: Stopwatch and Profiling for systemtap

* Contact:
  Will Cohen <wcohen@redhat.com>
  Charles Spirakis <charles.spirakis@intel.com>

* Motivation: Allow software developers to improve the performance of their
  code. The methodologies used are stopwatch (sometimes known as event
  counting) and profiling.

* Background:
  Will has experience with oprofile.
  Charles has experience with vtune.

* Target software: Initially the kernel, but longer term both kernel and user space.

* Type of description: General information regarding requirements and usage models.

* Interesting probe points:
  For profiling, the interesting probe points are "asynchronous-event" probe
  points: an interrupt fires and you want to capture information about where
  that interrupt happened.
  For stopwatch, interesting probe points are function entry/exit, queue
  add/remove, queue entity lifecycle, and any other code where you want to
  measure time or events (cpu resource utilization) associated with a path of
  code (frame buffer drawing measurements, graphics T&L pipeline measurements,
  etc.).

* Interesting values:
  For profiling, the pt_regs structure from the interrupt handler. The most
  commonly used items would be the instruction pointer and the call stack
  pointer.
  For stopwatch, most of the accesses are likely to be pmu read operations. In
  addition, given the large variety of pmu capabilities, access to the pmu
  registers themselves (read and write) would be very important. Different
  pmu's have different events, but for script portability we may want a subset
  of predefined events and something that maps each one onto a pmu's
  particular event (similar to what papi does). Given the variety of
  performance events and pmu architectures, we may want a standardized
  library/api as part of the translator to map events (or specialized event
  information) into register/value pairs used during the actual systemtap run
  (a hypothetical sketch of such a mapping appears after the Restrictions item
  below).
  ??? Classify values as consumed from lower level vs. provided to higher level ???

* Dependencies:
  Need some form of arbitration of the pmu to make sure the data provided is
  valid (see perfmon below).
  Two common usage models are aggregated data (oprofile) and trace history
  (papi/vtune). Currently these tools all do the aggregation in user mode; we
  may want to look at what they do and why.
  The unofficial rule of thumb is that profiling should be as unobtrusive as
  possible, and definitely under 1% overhead.
  When doing stopwatch or profiling, there is a need to be able to sequence
  the data. For timing this is important in order to accurately compute
  start/stop deltas and to follow control/data flow; for profiling it is
  needed to support trace history. There needs to be a timesource that has
  reasonable granularity and is reasonably precise.
  Per-thread virtualization (of time and events).
  System-wide mode for pmu events.

* Restrictions:
  Currently access to the pmu is a bit of a free-for-all, with no single
  entity providing arbitration. The perfmon2 patch for 2.6 (see the
  cross-reference section below) is attempting to provide much of the
  infrastructure needed by profiling tools (like oprofile and papi) across
  architectures (Pentium M, ia64, and x86_64 initially, though Stephane has
  contacted someone at IBM about a powerpc version as well). Andrew Morton
  wants perfmon and perfctr to be merged. Regardless of what happens, both
  pmu libraries are geared more toward user->kernel access than
  kernel->kernel access, and we will need to see what can be EXPORT()'ed to
  make them more kernel-module friendly.
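As a purely illustrative example of the event-mapping idea mentioned under
"Interesting values": the translator could carry a per-architecture table that
maps the generic event names used in scripts onto a native event name and the
value programmed into the event-select register, much as papi's preset tables
do. The sketch below is plain C with hypothetical names; the encodings are
placeholders for illustration, not verified values.

      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>

      struct event_map {
          const char *generic_name;   /* name used in a systemtap script */
          const char *native_name;    /* architecture-specific event name */
          uint32_t    event_select;   /* value written to the event-select register */
      };

      /* hypothetical Pentium M style table; encodings are placeholders */
      static const struct event_map pentium_m_events[] = {
          { "cpu_cycles",           "CPU_CLK_UNHALTED", 0x79 },
          { "instructions_retired", "INST_RETIRED",     0xc0 },
      };

      static const struct event_map *lookup_event(const char *generic_name)
      {
          for (size_t i = 0;
               i < sizeof(pentium_m_events) / sizeof(pentium_m_events[0]); i++)
              if (strcmp(pentium_m_events[i].generic_name, generic_name) == 0)
                  return &pentium_m_events[i];
          return NULL;    /* event not available on this pmu */
      }

A script would then refer only to a name like "cpu_cycles"; the translator (or
a runtime library) would resolve it to the register/value pair for the pmu it
is actually running on.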
* Data collection:
  Pmu counters tend to be different widths on different architectures. It
  would be useful to standardize the width (in software) to 64 bits to make
  math operations (such as comparisons, deltas, etc.) easier (a hypothetical
  sketch of this widening follows this item).
  The goal of profiling is to go from:
      pid/ip -> path/image -> source file/line number
  This implies the need for a (reasonably quick) mechanism to translate
  pid/ip to path/image. Potentially reuse the dcookie methodology from
  oprofile, but that model may need to be extended if there is a goal to
  support anonymous maps (dynamically generated code). Need the ability to
  map the current pid to a process name. Need to decide how much will be
  handled via associative arrays in the kernel and how much will be handled
  in user space (potentially as part of post processing). Given the volume of
  data that can be generated during profiling, it may make sense to follow
  the trend of current performance tools and put merging and aggregation in
  user space instead of kernel space.
  To keep the overhead of collection low, it may be useful to collect some of
  the needed information at interrupt time and other pieces after the
  interrupt (top/bottom style). For example, although it may be convenient to
  have a syntax like:
      val = associate_image($pt_regs->eip)
  it may be preferable to use a marker in the output stream instead (oprofile
  used a dcookie) and then do a lookup later (either in the kernel, adding a
  marker->name entry to the output stream, or in user space, similar to what
  oprofile did). This concept could be extended to cover the lookup of the
  pid name as well.
  Stack information will need to be collected at interrupt time (based on the
  interrupted pt_regs->esp), so the routine to obtain the stack trace should
  be reasonably fast. Due to asynchronous probes, the stack may be in user
  space.
  Depending on whether support for anonymous maps is important, it may be
  useful to have a more generic method of mapping ip->path/module, which
  would allow dynamic code generators (usually found in user space) to
  provide ip->image map information as part of the regular systemtap data
  stream. If we allow a user-mode api to add data to a systemtap stream, we
  could have a very general purpose merge/aggregation tool for profiling from
  a variety of sources.
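As a concrete illustration of the 64-bit standardization above, the following
sketch (plain C, assuming a hypothetical pmu_read_raw() and a 40-bit hardware
counter) shows the usual way a narrower hardware counter is folded into a
64-bit software value:

      #include <stdint.h>

      #define PMU_COUNTER_BITS  40                          /* hardware width; varies by pmu */
      #define PMU_COUNTER_MASK  ((1ULL << PMU_COUNTER_BITS) - 1)

      struct sw_counter {
          uint64_t total;      /* accumulated 64-bit software value */
          uint64_t last_raw;   /* raw hardware value at the previous read */
      };

      /* hypothetical raw read of one hardware pmu counter */
      extern uint64_t pmu_read_raw(int counter);

      /*
       * Fold the raw hardware counter into a 64-bit software counter.
       * Correct as long as the counter is read at least once per wrap
       * period, because the masked subtraction absorbs a single wrap.
       */
      static uint64_t sw_counter_read(struct sw_counter *c, int counter)
      {
          uint64_t raw = pmu_read_raw(counter) & PMU_COUNTER_MASK;

          c->total += (raw - c->last_raw) & PMU_COUNTER_MASK;
          c->last_raw = raw;
          return c->total;
      }

Script-visible values would then always be 64 bits wide, regardless of the
underlying pmu.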
* Data presentation:
  Generally data will be presented to the user either as an in-order stream
  (trace history) or aggregated in some form to produce a histogram or
  min/max/average/std. When aggregated, the data may be clumped by pid (each
  run of the app provides unique data), by process name (the data for an app
  is merged across all runs), or by the loaded image (to get information
  about shared libraries regardless of the app that loaded them). Assuming an
  increase in multi-processor and multi-threaded applications, grouping the
  data by thread group id is likely to be useful as well. Ideally, if
  symbols/debug information is available, additional aggregation could be
  done at the function, basic block, or source line level.

* Competition: See the cross-reference list below.

* Cross-references:

  Oprofile
  Oprofile is a profiling tool that provides time and event based sampling.
  Its collection methodology has a "file view" of the world and only captures
  the minimum information needed to get the image that corresponds to the
  interrupted instruction address. It aggregates the data (no time
  information) to keep the total data size to a minimum even on long runs.
  Oprofile allows for optional "escape" sequences in a data stream to add
  information. It can handle non-maskable interrupts (NMI) as well as
  maskable interrupts, to obtain samples in areas where maskable interrupts
  are normally disabled. Work is being done to allow oprofile to handle
  anonymous maps (i.e. dynamically generated code from jvm's).
  http://oprofile.sourceforge.net/news/

  Papi
  Papi is a profiling tool that can aggregate data or keep a trace history.
  It uses tables to map generic event concepts (for example, PAPI_TOT_CYC)
  into architecture-specific events (for example, CPU_CLK_UNHALTED, value
  0x79 on the Pentium M). Interrupts can be time based, and it can capture
  event counts (e.g. every 5 ms, capture cpu cycles and instructions retired)
  in addition to the instruction pointer. Papi is built on top of other
  performance monitoring support, such as ia64 perfmon and i386 perfctr in
  the Linux kernel.
  http://icl.cs.utk.edu/papi/

  Perfmon2 infrastructure
  Perfmon2 is a profiling infrastructure currently in the Linux 2.6 kernel
  for ia64. It handles arbitration and virtualization of the pmu resources,
  extends the pmu counters to a logical 64 bits regardless of the underlying
  hardware size, context switches the counters when needed to allow
  per-process or system-wide use, and has the ability to choose a subset of
  the cpu's on a system when doing system-wide profiling. Oprofile on Linux
  2.6 for ia64 has been ported to use the perfmon2 interface. Currently,
  there are patches submitted to the Linux kernel mailing list to port the
  2.6 perfmon2 infrastructure to the Pentium M and x86_64.
  http://www.hpl.hp.com/techreports/2004/HPL-2004-200R1.html

  Shark
  Shark is a profiling tool from Apple that focuses on time and event based
  statistical stack sampling. On each profile interrupt, in addition to
  capturing the instruction pointer, it also captures a stack trace, so you
  know both where you were and how you got there.
  http://developer.apple.com/tools/sharkoptimize.html

  Vtune
  Vtune is a profiling tool that provides time and event based sampling. It
  does collection based on a "process view" of the world. It keeps a trace
  history so that you can aggregate the data during post processing in
  various ways, it can capture architecture-specific data in addition to the
  ip (such as branch history buffers), and it can use architecture-specific
  abilities to get exact ip addresses for certain events. It currently
  handles anonymous mappings (dynamically generated code from jvm's).
  http://www.intel.com/software/products/vtune/vlin/index.htm

* Associated files: Should the usage models be split into a separate file?

Usage Models:

Below are some typical usage models. This isn't an attempt to propose syntax;
it's an attempt to create something concrete enough to help people understand
the goals: (description, pseudo code, desired output).

Description: Statistical stack sampling (ala shark)

    probe kernel.time_ms(10) {
        i = associate_image($pt_regs->eip);
        s = stack($pt_regs->esp);
        stp($current->pid, $pid_name, $pt_regs->eip, i, s)
    }

Output desired: For each process/process name, aggregate (histogram) based on
eip (regardless of how I got there), stack (what was the most common calling
path), or both (what was the most common path to the most common eip). Could
be implemented by generating a trace history and letting the user post
process (eats disk space, but one run can be viewed multiple ways), or could
have the user define what was wanted in the script and do the post processing
ourselves (saves disk space, but more work for us).
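To make the trace-history option concrete, here is a minimal user-space
post-processing sketch (plain C). The record format of "<pid> <name> <hex eip>"
per line is an assumption, and only the eip histogram is shown, not the
per-process or stack groupings:

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    #define TABLE_SIZE 65536   /* power of two; assumes far fewer distinct eips */

    struct bucket { uint64_t eip; uint64_t count; };
    static struct bucket table[TABLE_SIZE];

    /* open-addressing insert/increment keyed by eip */
    static void count_sample(uint64_t eip)
    {
        uint64_t i = (eip * 0x9e3779b97f4a7c15ULL) & (TABLE_SIZE - 1);

        while (table[i].count != 0 && table[i].eip != eip)
            i = (i + 1) & (TABLE_SIZE - 1);
        table[i].eip = eip;
        table[i].count++;
    }

    int main(void)
    {
        uint64_t pid, eip;
        char name[64];

        /* assumed record format, one sample per line: "<pid> <name> <hex eip>" */
        while (scanf("%" SCNu64 " %63s %" SCNx64, &pid, name, &eip) == 3)
            count_sample(eip);

        for (uint64_t i = 0; i < TABLE_SIZE; i++)
            if (table[i].count != 0)
                printf("%#" PRIx64 "\t%" PRIu64 "\n", table[i].eip, table[i].count);
        return 0;
    }

The same stream could be re-processed later keyed by pid, process name, or
stack instead, which is the main attraction of keeping the trace history.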
Description: Time based aggregation (ala oprofile)

    probe kernel.time_ms(10) {
        i = associate_image($pt_regs->eip);
        stp($current->pid, $pid_name, $pt_regs->eip, i);
    }

Output desired: Histogram separated by process name, pid/eip, pid/image.

Description: Time a routine, part 1 - time between the function call and return

    probe kernel.function("sys_execve") {
        $thread->mystart = $timestamp
    }
    probe kernel.function("sys_execve").return {
        delta = $timestamp - $thread->mystart
        // do statistical operations...
    }

Output desired: Be able to do statistics on the time it takes for an exec to
execute. The time needs to have a fine enough granularity to have meaning
(i.e. using jiffies probably wouldn't work), and the time needs to be smp
correct even if the probe entry and the return execute on different
processors.

Description: Time a routine, part 2 - count the number of events between the
function call and return

    probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_execve") {
        $thread->myclocks = $pmu[0];
        $thread->myinstr_ret = $pmu[1];
    }
    probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_execve").return {
        $thread->myclocks = $pmu[0] - $thread->myclocks;
        $thread->myinstr_ret = $pmu[1] - $thread->myinstr_ret;
        cycles_per_instruction = $thread->myclocks / $thread->myinstr_ret
        // do statistical operations...
    }

Desired output: Produce min/max/average for cycles, instructions retired, and
cycles_per_instruction. The pmu must be virtualized if the probe entry and
probe exit can happen on different processors. The pmu should be virtualized
if there can be preemption (or waits) in the function itself, to get more
useful information (the actual count of events in the function vs. a count of
events in the whole system between when the function started and when it
ended).

Description: Time a routine, part 3 - reminder of threading issues

    probe kernel.function("sys_fork") {
        $thread->mystart = $timestamp
    }
    probe kernel.function("sys_fork").return {
        delta = $timestamp - $thread->mystart
        if (parent) {
            // do statistical operations for the time it takes the parent
        } else {
            // do statistical operations for the time it takes the child
        }
    }

Desired output: Produce min/max/average for the parent and the child. The
time needs to have a fine enough granularity to have meaning (i.e. using
jiffies probably wouldn't work), and the time needs to be smp correct even if
the probe entry and the probe return execute on different processors.

Description: Time a routine, part 4 - reminder of threading issues

    probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_fork") {
        $thread->myclocks = $pmu[0];
        $thread->myinstr = $pmu[1];
    }
    probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_fork").return {
        $thread->myclocks = $pmu[0] - $thread->myclocks;
        $thread->myinstr = $pmu[1] - $thread->myinstr;
        cycles_per_instruction = $thread->myclocks / $thread->myinstr
        if (parent) {
            // do statistical operations...
        } else {
            // do statistical operations...
        }
    }

Desired output: Produce min/max/average for cycles, instructions retired, and
cycles_per_instruction. The pmu must be virtualized if the probe entry and
probe exit can happen on different processors. The pmu should be virtualized
if there can be preemption (or waits) in the function itself, to get more
useful information (the actual count of events in the function vs. a count of
events in the whole system between when the function started and when it
ended).
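The virtualization requirement in parts 2 and 4 amounts to charging a thread
only for events that occur while it is actually on a cpu. A minimal sketch of
the bookkeeping, in plain C with hypothetical names and ignoring counter wrap
(the real work would belong to the pmu arbitration layer, e.g. perfmon2):

    #include <stdint.h>

    struct thread_counter {
        uint64_t virt;       /* events charged to this thread so far */
        uint64_t base_raw;   /* raw counter value when the thread was scheduled in */
    };

    /* called when the thread is scheduled onto a cpu */
    static void counter_sched_in(struct thread_counter *t, uint64_t raw_now)
    {
        t->base_raw = raw_now;
    }

    /* called when the thread is scheduled off a cpu */
    static void counter_sched_out(struct thread_counter *t, uint64_t raw_now)
    {
        t->virt += raw_now - t->base_raw;
    }

    /* value a probe would see while the thread is running */
    static uint64_t counter_read(const struct thread_counter *t, uint64_t raw_now)
    {
        return t->virt + (raw_now - t->base_raw);
    }

With this, the entry and return probes above would effectively read the
per-thread virtual value rather than the raw counter, so events from other
threads (or other cpus) are not counted against the function being timed.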
Description: Beginnings of "papi" style collection

    probe kernel.startwatch("cpu_cycles").startwatch("instructions_retired").time_ms(10) {
        i = associate_image($pt_regs->eip);
        stp($current->pid, $pid_name, $pt_regs->eip, i, $pmu[0], $pmu[1]);
    }

Desired output: Trace history or aggregation based on process name, image.

Description: Find the path leading to a high latency cache miss that stalled
for more than 128 cycles (ia64 only)

    probe kernel.startwatch("branch_event,pmc[12]=0x3e0f").pmu_profile("data_ear_event:1000,pmc[11]=0x5000f") {
        //
        // on ia64, when using the data ear event, the precise eip is
        // saved in pmd[17], so no need for pt_regs->eip (and the
        // associated skid)...
        //
        i = associate_image($pmu->pmd[17]);
        stp($current->pid, $pid_name, $pmu->pmd[17], i, // the basics
            $pmu->pmd[2],   // precise data address
            $pmu->pmd[3],   // latency information
            $pmu->pmd[8],   // branch history buffer
            $pmu->pmd[9],   //   "
            $pmu->pmd[10],  //   "
            $pmu->pmd[11],  //   "
            $pmu->pmd[12],  //   "
            $pmu->pmd[13],  //   "
            $pmu->pmd[14],  //   "
            $pmu->pmd[15],  //   "
            $pmu->pmd[16]); // indication of which was the most recent branch
    }

Desired output: Aggregate data based on pid, process name, eip, latency, and
data address. Each pmd on ia64 is 64 bits long, so capturing just the 12
pmd's listed here is 96 bytes of information on every interrupt for each cpu.
Profiling can collect a very large amount of data...

Description: Pmu event collection of data, but using NMI instead of the
regular interrupt. NMI is useful for getting visibility into locks and other
code which is normally hidden behind interrupt-disable code. However,
handling an NMI is more difficult to do properly. Potentially the compiler
can be more restrictive about what's allowed in the handler when NMI is
selected as the interrupt method.

    probe kernel.nmi.pmu_profile("instructions_retired:1000000") {
        i = associate_image($pt_regs->eip);
        stp($pid_name, $pt_regs->eip, i);
    }

Desired output: Same as the earlier oprofile style example.

Description: Timing items in a queue
Two possibilities - use associative arrays or post process.

Associative arrays:

    probe kernel.function("add queue function") {
        start[$arg->queue_entry] = $timestamp;
    }
    probe kernel.function("remove queue function") {
        delta = $timestamp - start[$arg->queue_entry];
        // do statistics on the delta value and the queue entry
    }

Post process:

    probe kernel.function("add queue function") {
        stp("add", $timestamp, $arg->queue_entry)
    }
    probe kernel.function("remove queue function") {
        stp("remove", $timestamp, $arg->queue_entry)
    }

Desired output: For each queue_entry, calculate the delta and do appropriate
statistics (a hypothetical user-space sketch for post processing this
add/remove stream appears at the end of this document).

Description: Following an item as it moves to different queues/lists
Two possibilities - use associative arrays or post process.

Associative arrays:

    probe kernel.function("list_add") {
        delta = $timestamp - start[$arg->head, $arg->new];
        start[$arg->head, $arg->new] = $timestamp;
        // do statistics on the delta value and the queue
    }

Post process:

    probe kernel.function("list_add") {
        stp("add", $timestamp, $arg->head, $arg->new)
    }

Desired output: For each (queue, queue_entry) pair, calculate the delta and
do appropriate statistics.
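For the post-process variants of the last two usage models, a minimal
user-space sketch in plain C is shown below. The record format of
"<add|remove> <timestamp> <entry>" per line is an assumption, and only the
single-queue case is handled (the queue/list-following case would key on the
(queue, entry) pair instead):

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    #define MAX_PENDING 1024

    struct pending { uint64_t entry; uint64_t t_add; };
    static struct pending pend[MAX_PENDING];
    static int npend;

    int main(void)
    {
        char op[16];
        uint64_t t, entry;
        uint64_t n = 0, sum = 0, min = UINT64_MAX, max = 0;

        /* assumed record format, one per line: "<add|remove> <timestamp> <entry>" */
        while (scanf("%15s %" SCNu64 " %" SCNu64, op, &t, &entry) == 3) {
            if (op[0] == 'a' && npend < MAX_PENDING) {        /* "add" record */
                pend[npend].entry = entry;
                pend[npend].t_add = t;
                npend++;
            } else if (op[0] == 'r') {                        /* "remove" record */
                for (int i = 0; i < npend; i++) {
                    if (pend[i].entry == entry) {
                        uint64_t delta = t - pend[i].t_add;
                        if (delta < min) min = delta;
                        if (delta > max) max = delta;
                        sum += delta;
                        n++;
                        pend[i] = pend[--npend];              /* discard the matched add */
                        break;
                    }
                }
            }
        }
        if (n)
            printf("n=%" PRIu64 " min=%" PRIu64 " max=%" PRIu64 " avg=%" PRIu64 "\n",
                   n, min, max, sum / n);
        return 0;
    }

Whether this pairing and statistics work is done in the kernel (associative
arrays) or in user space (post processing) is exactly the trade-off raised in
the Data collection section above.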