 tapsets/timestamp/timestamp_tapset.txt | 327 ++++++++++++++++++++++++++++++++
 1 file changed, 327 insertions(+), 0 deletions(-)
diff --git a/tapsets/timestamp/timestamp_tapset.txt b/tapsets/timestamp/timestamp_tapset.txt
new file mode 100644
index 00000000..dcbd5813
--- /dev/null
+++ b/tapsets/timestamp/timestamp_tapset.txt
@@ -0,0 +1,327 @@

* Application name: sequence numbers and timestamps

* Contact:
  Martin Hunt hunt@redhat.com
  Will Cohen wcohen@redhat.com
  Charles Spirakis charles.spirakis@intel.com

* Motivation:
  On multi-processor systems, it is important to have a way
  to correlate information gathered between cpus. There are two
  forms of correlation:

  a) putting information into the correct sequence order
  b) providing accurate time deltas between pieces of information

  If the resolution of the time deltas is high enough, it can
  also be used to order information.

* Background:
  Discussion started due to relayfs and per-cpu buffers, but this
  is needed by many people.

* Target software:
  Any software which wants to correlate data that was gathered
  on a multi-processor system, but the scope will be defined
  specifically for systemtap's needs.

* Type of description:
  General information and discussion regarding sequencing and timing.

* Interesting probe points:
  Any probe points where you are trying to get the time between two
  probe points. For example, timing how long a function takes by
  putting probe points at the function entry and function exit.
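  As a rough illustration of that entry/exit pattern, here is a minimal
  userspace C sketch (not systemtap code; the work() function and the
  use of clock_gettime() are illustrative assumptions only):

      #include <stdio.h>
      #include <time.h>

      /* Hypothetical function whose duration we want to measure. */
      static void work(void)
      {
          for (volatile long i = 0; i < 1000000; i++)
              ;
      }

      int main(void)
      {
          struct timespec t0, t1;

          /* "Probe point" at function entry. */
          clock_gettime(CLOCK_MONOTONIC, &t0);
          work();
          /* "Probe point" at function exit. */
          clock_gettime(CLOCK_MONOTONIC, &t1);

          long long delta_ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL
                             + (t1.tv_nsec - t0.tv_nsec);
          printf("work() took %lld ns\n", delta_ns);
          return 0;
      }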
* Interesting values:
  Possible ways to order data from multiple sources include:

Retrieve the sequence/time from a global area

  High Precision Event Timer (HPET)
  Possible implementation:
    multimedia/HPET timer
    arch/i386/kernel/timer_hpet.c
  Advantages:
    granularity can vary (the HPET spec says the minimum frequency of
    the HPET timer is 10MHz, ~100ns resolution), can be treated as
    read-only, can bypass cache update and avoid being cached at all if
    desired, designed to be used as an smp timestamp (see specification)
  Disadvantages:
    may not be available on all platforms, may not be synchronized on
    NUMA systems (i.e. counts for all processors within a numa node are
    comparable, but counts for processors between nodes may not be
    comparable), potential resource conflict if the timers are used by
    other software

  Real Time Clock (RTC)
  Possible implementation:
    "external" chip (clock chip) which has time information, accessed
    via ioport or memory-mapped io
  Advantages:
    can be treated as read-only, can bypass cache update and avoid
    being cached at all if desired
  Disadvantages:
    may not be available on all platforms, low granularity (for the
    rtc, ~1ms), usually slow access

  ACPI Power Management Timer (pm timer)
  Possible implementation:
    implemented as part of the ACPI specification at 3.579545MHz
    arch/i386/kernel/timers/timer_pm.c
  Advantages:
    not affected by throttling, halting or power-saving states,
    moderate granularity (3.58MHz, ~300ns resolution), designed for use
    by an OS to keep track of time during sleep/power states
  Disadvantages:
    may not be available on all platforms, slower access than the hpet
    timer (but still much faster than the RTC)

  Chipset counter
  Possible implementation:
    timer on a processor chipset, ??SGI implementation??, do we know of
    any other implementations?
  Advantages:
    likely to be based on the PCI bus clock (33MHz, ~30ns) or the
    front-side-bus clock (200MHz, ~5ns)
  Disadvantages:
    may not be available on all platforms

  Sequence Number
  Possible implementation:
    atomic_t global variable, cache aligned, placed in a struct to keep
    the variable on a cache line by itself (a sketch appears after the
    baseline numbers below)
  Advantages:
    guaranteed correct ordering (even on NUMA systems), architecture
    independent, platform independent
  Disadvantages:
    potential for cache-line ping-pong, doesn't scale, no time
    information (ordering data only), access can be slower on NUMA
    systems

  Jiffies
  Possible implementation:
    the OS counts the number of "clock interrupts" since power on
  Advantages:
    platform independent, architecture independent, one writer, many
    readers (less cache ping-pong)
  Disadvantages:
    low resolution (usually 10ms, sometimes 1ms)

  Do_gettimeofday
  Possible implementation:
    arch/i386/kernel/time.c
  Advantages:
    platform independent, architecture independent, one writer, many
    readers (less cache ping-pong), accuracy of microseconds
  Disadvantages:
    the time unit increment value used by this routine changes based on
    information from NTP (i.e. if NTP needs to speed up / slow down the
    clock, then callers of this routine will be affected). This is a
    disadvantage for timing short intervals, but an advantage for
    timing long intervals.


Retrieve the sequence/time from a cpu-unique area

  Timestamp counter (TSC)
  Possible implementation:
    count of the number of core cycles the processor has executed since
    power on; due to the lack of synchronization between cpus, one
    would also need to keep track of which cpu the TSC came from
  Advantages:
    no external bus access, high granularity (cpu core cycles),
    available on most (not all) architectures, platform independent
  Disadvantages:
    not synchronized between cpus; since it is a count of cpu cycles,
    the count can be affected by throttling, halting and power-saving
    states, and may not correlate to "actual" time (i.e. just because a
    1GHz processor showed a delta of 1G cycles doesn't mean 1 second
    has passed)

  APIC timer
  Possible implementation:
    timer implemented within the processor
  Advantages:
    no external bus access, moderate to high granularity (usually
    counting based on the front-side-bus clock or core clock)
  Disadvantages:
    not synchronized between cpus, may be affected by throttling,
    halting/power-saving states, may not correlate to "actual" time

  PMU event
  Possible implementation:
    program a performance counter with a specific event related to time
  Advantages:
    no external bus access, moderate to high granularity (usually
    counting based on the front-side-bus clock or core clock), can be
    virtualized to give moderate to high granularity for individual
    thread paths
  Disadvantages:
    not synchronized between cpus, may be affected by throttling,
    halting/power-saving states, may not correlate to "actual" time,
    processor dependent


  For reference, as a quick baseline, on Martin's dual-processor
  system, he gets the following performance measurements:

    kprobe overhead:             1200-1500ns (depending on OS and processor)
    atomic read plus increment:  40ns (single processor access, no conflict)
    monotonic_clock():           550ns
    do_gettimeofday():           590ns
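  A minimal sketch of the cache-aligned atomic sequence number
  described above (kernel-style C; the struct and function names are
  illustrative, and header choices assume a kernel of this era):

      #include <asm/atomic.h>
      #include <linux/cache.h>

      /* Wrap the counter in a struct aligned to a cache line so the
       * hot atomic variable does not share a line with other data. */
      struct seq_counter {
          atomic_t count;
      } ____cacheline_aligned_in_smp;

      static struct seq_counter trace_seq = {
          .count = ATOMIC_INIT(0),
      };

      /* Returns a globally ordered sequence number;
       * atomic_inc_return() provides the cross-cpu ordering. */
      static inline int next_seq(void)
      {
          return atomic_inc_return(&trace_seq.count);
      }

  This is the "guaranteed ordering, no time information" trade-off from
  the list: every caller pays for the shared cache line, but the
  numbers are comparable across all cpus, including NUMA nodes.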
* Dependencies:
  Not applicable.

* Restrictions:
  Certain timers may already be in use by other parts of the kernel,
  depending on how it is configured (for example, the RTC is used by
  the watchdog code). Some kernels may not compile in the necessary
  code (for example, using the pm timer requires ACPI). Some platforms
  or architectures may not have the timer requested (for example, there
  is no HPET timer on older systems).

* Data collection:
  For data collection, it is probably best to keep the concepts of
  sequence ordering and timestamps separate within systemtap (for both
  the user as well as the implementation).

  For sequence ordering, the initial implementation should use ?? the
  atomic_t form for the sequence ordering (since it is guaranteed to be
  platform and architecture neutral) ?? and modify/change the
  implementation later if there is a problem.

  For the timestamp, the initial implementation should use
  ?? hpet timer ?? pm timer ?? do_gettimeofday ?? cpu # + tsc ??
  some combination (do_gettimeofday + cpu # & low bits of tsc)?
  (A sketch of one such combination follows this section.)

  We could do something like what LTT does (see below) to generate
  64-bit timestamps containing the nanoseconds since Jan 1, 1970.

  Assuming the implementation keeps these concepts separate now
  (ordering data vs. timing deltas), it is always possible to merge
  them in the future if a high-granularity, numa/smp-synchronized
  timesource becomes available for a large number of platforms and/or
  processors.
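  One possible shape for the "do_gettimeofday + cpu # & low bits of
  tsc" combination mentioned above (a hedged sketch only, not a
  committed design; the field widths and the function name are
  assumptions, and real widths would need more thought, e.g. more than
  16 cpus and tsc wrap):

      #include <linux/types.h>
      #include <linux/time.h>
      #include <linux/timex.h>   /* get_cycles() */
      #include <linux/smp.h>     /* smp_processor_id() */

      /* Illustrative packing: 52 bits of microseconds since the
       * epoch, 4 bits of cpu number, and the 8 low bits of the cycle
       * counter to break ties within one microsecond. */
      static u64 combined_stamp(void)
      {
          struct timeval tv;
          u64 us;

          do_gettimeofday(&tv);
          us = (u64)tv.tv_sec * 1000000ULL + tv.tv_usec;

          return (us << 12)
               | ((u64)(smp_processor_id() & 0xf) << 8)
               | (get_cycles() & 0xff);
      }

  The resulting values sort first by wall-clock microsecond, then
  roughly by cycle count, so they can serve both the ordering and the
  (coarse) delta use cases with a single 64-bit value.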
* Data presentation:
  In general, users prefer output which is based on "actual" time
  (i.e. they prefer output that says the delta is XXX nanoseconds
  instead of YYY cpu cycles). Most of the time users want deltas (how
  long did this take?), but occasionally they want absolute times
  (when / at what time was this information collected?).

* Competition:
  DTrace has output in nanoseconds (and it is comparable between
  processors on an MP system), but it is unclear what the actual
  resolution is. Even if the sparc machines do have hardware that
  provides nanosecond resolution, on x86-64 Solaris is likely to have
  the same problems as discussed here, since the Solaris Opteron boxes
  tend to be pretty vanilla hardware.

  From Joshua Stone (joshua.i.stone at intel.com):

  === BEGIN ===
  DTrace gives you three built-in variables:

  uint64_t timestamp: The current value of a nanosecond timestamp
  counter. This counter increments from an arbitrary point in the past
  and should only be used for relative computations.

  uint64_t vtimestamp: The current value of a nanosecond timestamp
  counter that is virtualized to the amount of time that the current
  thread has been running on a CPU, minus the time spent in DTrace
  predicates and actions. This counter increments from an arbitrary
  point in the past and should only be used for relative time
  computations.

  uint64_t walltimestamp: The current number of nanoseconds since
  00:00 Universal Coordinated Time, January 1, 1970.

  As for how they are implemented, the only detail I found is that
  timestamp is "similar to the Solaris library routine gethrtime".
  The manpage for gethrtime is here:
  http://docs.sun.com/app/docs/doc/816-5168/6mbb3hr8u?a=view
  === END ===

  What LTT does:

  "Cycle counters are fast to read but may reflect time inaccurately.
  Indeed, the exact clock frequency varies with time as the processor
  temperature changes, influenced by the external temperature and its
  workload. Moreover, in SMP systems, the clocks of individual
  processors may vary independently.

  LTT corrects the clock inaccuracy by reading the real time clock
  value and the 64-bit cycle counter periodically, at the beginning of
  each block and every 10ms. This way, it is sufficient to read only
  the lower 32 bits of the cycle counter at each event. The associated
  real time value may then be obtained by linear interpolation between
  the nearest full cycle counter and real time values. Therefore, for
  the average cost of reading and storing the lower 32 bits of the
  cycle counter at each event, the real time with full resolution is
  obtained at analysis time."

  (A sketch of this interpolation scheme appears at the end of this
  document.)

* Cross-references:
  The profile_tapset depends heavily on sequencing and time, both when
  ordering data (i.e. taking a trace history) and when calculating time
  deltas (where high granularity is needed).

* Associated files:

  Profile tapset requirements:
  .../src/tapsets/profile/profile_tapset.txt

  Intel high precision event timers specification:
  http://www.intel.com/hardwaredesign/hpetspec.htm

  ACPI specification:
  http://www.acpi.info/DOWNLOADS/ACPIspec-2-0b.pdf

  From an internal email sent by Tony Luck (tony.luck at intel.com)
  regarding a clustered environment. For the summary below, hpet and
  the pm timer were not an option. For systemtap, they should be
  considered, especially since the pm timer and hpet were designed to
  be used as timestamps.

  === BEGIN ===
  For extremely short intervals (<100ns) get some h/w help
  (oscilloscope or logic analyser). Delays reading the TSC and pipeline
  effects could skew your results horribly. Having a 2GHz clock doesn't
  mean that you can measure 500ps intervals.

  For short intervals (100ns to 10ms) the TSC is your best choice ...
  but you need to sample it on the same cpu, and converting the
  difference between two TSC values to real time will require some
  system-dependent math to find the nominal frequency of the system
  (you may be able to ignore temperature effects, unless your system is
  in an extremely hostile environment). But beware of systems that
  change the TSC rate when making frequency adjustments for power
  saving. It shouldn't be hard to measure the system clock frequency to
  about five significant digits of accuracy; /proc/cpuinfo is probably
  good enough.

  For medium intervals (10ms to a minute), "gettimeofday()" or
  "clock_gettime()" on a system *NOT* running NTP may be best, but you
  will need to adjust for systematic error to account for the system
  clock running fast/slow. Many Linux systems ship with a utility named
  "clockdiff" that you can use to measure the system drift against a
  reference system (a system that is nearby on the network, running
  NTP, preferably a low-"stratum" one).

  Just run clockdiff every five minutes for an hour or two, and plot
  the results to see what systematic drift your system has without NTP.
  N.B. if you find the drift is > 10 seconds per day, then NTP may have
  trouble keeping this system synced using only drift corrections, and
  you might see "steps" when running NTP. Check /var/log/messages for
  complaints from NTP.

  For long intervals (above a minute), you need "gettimeofday()" on a
  system that uses NTP to keep it in touch with reality. Assuming
  reasonable network connectivity, NTP will maintain the time within a
  small number of milliseconds of reality ... so your results should be
  good to 4-5 significant figures for 1-minute intervals, and better
  for longer intervals.
  === END ===
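  The LTT-style interpolation sketch referenced above (a hedged,
  userspace illustration; the heartbeat layout, names, and sample
  values are assumptions, not LTT's actual code):

      #include <stdint.h>
      #include <stdio.h>

      /* A periodic "heartbeat": the full 64-bit cycle count paired
       * with the real time (ns) read at the same moment. LTT records
       * one at the start of each block and every 10ms. */
      struct heartbeat {
          uint64_t cycles;   /* full cycle counter */
          uint64_t real_ns;  /* real time clock, nanoseconds */
      };

      /* Reconstruct a full cycle count from an event's stored low
       * 32 bits, assuming the event lies at or after heartbeat 'a'
       * and the counter wrapped at most once in between (valid when
       * heartbeats come more often than one 32-bit wrap period). */
      static uint64_t expand_cycles(const struct heartbeat *a,
                                    uint32_t lo32)
      {
          uint64_t full = (a->cycles & ~0xffffffffULL) | lo32;
          if (full < a->cycles)   /* low bits wrapped past 'a' */
              full += 1ULL << 32;
          return full;
      }

      /* Linear interpolation between the two heartbeats bracketing
       * the event: real = real_a + (cycles - cycles_a) * slope. */
      static uint64_t event_real_ns(const struct heartbeat *a,
                                    const struct heartbeat *b,
                                    uint32_t event_lo32)
      {
          uint64_t ev = expand_cycles(a, event_lo32);
          double slope = (double)(b->real_ns - a->real_ns)
                       / (double)(b->cycles - a->cycles);
          return a->real_ns + (uint64_t)((ev - a->cycles) * slope);
      }

      int main(void)
      {
          /* Invented sample values: ~1GHz clock, heartbeats 10ms
           * apart; the event occurred 5ms after heartbeat 'a'. */
          struct heartbeat a = { 1000000000ULL, 5000000000ULL };
          struct heartbeat b = { 1010000000ULL, 5010000000ULL };
          uint32_t event = 1000000000u + 5000000u;

          printf("event at %llu ns\n",
                 (unsigned long long)event_real_ns(&a, &b, event));
          return 0;
      }

  As the quote notes, this recovers full-resolution real time at
  analysis time for the per-event cost of storing only 32 bits.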