diff options
Diffstat (limited to 'tapsets/timestamp/timestamp_tapset.txt')
-rw-r--r-- | tapsets/timestamp/timestamp_tapset.txt | 327 |
1 files changed, 0 insertions, 327 deletions
diff --git a/tapsets/timestamp/timestamp_tapset.txt b/tapsets/timestamp/timestamp_tapset.txt deleted file mode 100644 index dcbd5813..00000000 --- a/tapsets/timestamp/timestamp_tapset.txt +++ /dev/null @@ -1,327 +0,0 @@ -* Application name: sequence numbers and timestamps - -* Contact: - Martin Hunt hunt@redhat.com - Will Cohen wcohen@redhat.com - Charles Spirakis charles.spirakis@intel.com - -* Motivation: - On multi-processor systems, it is important to have a way - to correlate information gathered between cpu's. There are two - forms of correlation: - - a) putting information into the correct sequence order - b) providing accurate time deltas between information - - If the resolution of the time deltas is high enough, it can - also be used to order information. - -* Background: - Discussion started due to relayfs and per-cpu buffers, but this - is neede by many people. - -* Target software: - Any software which wants to correlate data that was gathered - on a multi-processor system, but the scope will be defined - specifically for systemtap's needs. - -* Type of description: - General information and discussion regarding sequencing and timing. - -* Interesting probe points: - Any probe points where you are trying to get the time between two - probe points. For example, timing how long a function takes and - putting probe points at the function entry and function exit. - -* Interesting values: - Possible ways to order data from multiple sources include: - -Retrieve the sequence/time from a global area - - High Precision Event Timer (HPET) - Possible implementation: - multimedia/HPET timer - arch/i386/kernel/timer_hpet.c - Advantages: - granularity can vary (HPET spec says minimum freq of HPET timer - is 10Mhz =~100ns resolution), can be treated as read-only, - can bypass cache update and avoid being caced at all if desired, - desigend to be used as an smp timestamp (see specification) - Disadvantages: - may not be available on all platforms, may not be synchronized on - NUMA systems (ie counts for all processors within a numa node is - comparable, but counts for processors between nodes may not be - comparable), potential resource conflict if timers used by - other software - - Real Time Clock (RTC) - Possible implementation: - "external" chip (clock chip) which has time information, accessed via - ioport or memory-mapped io - Advantages: - can be treated as read-only, can bypass cache update and avoid being - cached at all if desired - Disadvantages: - may not be available on all platforms, low granularity (for rtc, - ~1ms), usually slow access - - ACPI Power Management Timer (pm timer) - Possible implementation: - implemented as part of the ACPI specification at 3.579545Mhz - arch/i386/kernel/timers/timer_pm.c - Advantages: - not affected by throttling, halting or power saving states, moderate - granularity (3.5Mhz ~ 300ns resolution), desigend for use by an OS - to keep track of time during sleep/power states - Disadvantages: - may not be available on all platforms, slower access than hpet timer - (but still much faster than RTC) - - Chipset counter - Possible implementation: - timer on a processor chipset, ??SGI implementation??, do we know of any - other implementations ? - Advantages: - likely to be based on pci bus clock (33Mhz = ~30ns) or front-side-bus - clock (200Mhz = ~5ns) - Disadvantages: - may not be available on all platforms - - Sequence Number - Possible implementation: - atomic_t global variable, cache aligned, placed in struct to keep - variable on a cache line by itself - Advantages: - guaranteed correct ordering (even on NUMA systems), architecure - independent, platform independent - Disadvantages: - potential for cache-line ping-pong, doesn't scale, no time - information (ordering data only), access can be slower on NUMA systems - - Jiffies - Possible implementation: - OS counts the number of "clock interrupts" since power on. - Advantages: - platform independent, architecture independent, one writer, many - readers (less cache ping-pong) - cached at all if desired - Disadvantages: - low resolution (usually 10ms, sometimes 1ms). - - Do_gettimeofday - Possible implementation: - arch/i386/kernel/time.c - Advantages: - platform independent, architecture independent, 1 writer, many - readers (less cache ping-pong), accuracy of micro-seconds - Disadvantages: - the time unit increment value used by this routine changes - based on information from ntp (i.e if ntp needs to speed up / slow - down the clock, then callers to this routine will be affected). This - is a disadvantage for timing short intervals, but an advantage - for timing long intervals. - - -Retrieve the sequence/time from a cpu-unique area - - Timestamp counter (TSC) - Possible implementation: - count of the number of core cycles the processor has executed since - power on, due to lack of synchronization between cpus, would also - need to keep track of which cpu the tsc came from - Advantages: - no external bus access, high granularity (cpu core cycles), - available on most (not all) architectures, platform independent - Disadvantages: - not synchronized between cpus, since it is a count of cpu cycles - count can be affected by throttling, halting and power saving states, - may not correlate to "actual" time (ie, just because a 1G processor - showed a delta of 1G cycles, doesn't mean 1 second has passed) - - APIC timer - Possible implementation: - timer implemented within the processor - Advantages: - no external bus access, moderate to high granularity (usually - counting based front-side bus clock or core clock) - Disadvantages: - not synchronized between cpus, may be affected by throttling, - halting/power saving states, may not correlate to "actual" time. - - PMU event - Possible implementation: - program a perfmonance counter with a specific event related to time - Advantages: - no external bus access, moderate to high granularity (usually - counting based front-side bus clock or core clock), can be - virtualized to give moderate to high granularity for individual - thread paths - Disadvantages: - not synchronized between cpus, may be affected by throttling, - halting/power saving states, may not correlate to "actual" time, - processor dependent - - - For reference, as a quick baseline, on Martin's dual-processor system, - he gets the following performance measurements: - - kprobe overhead: 1200-1500ns (depending on OS and processor) - atomic read plus increment: 40ns (single processor access, no conflict) - monotonic_clock() 550ns - do_gettimeofday() 590ns - -* Dependencies: - Not Applicable - -* Restrictions: - Certain timers may already be in use by other parts of the kernel - depending on how it is configured (for example, RTC is used by the - watchdog code). Some kernels may not compile in the necessary code - (for example, if using the pm timer, need ACPI). Some platforms - or architectures may not have the timer requested (for example, - there is no HPET timer on older systesm). - -* Data collection: - For data collection, it is probably best to keep the concept - between sequence ordering and timestamp separate within - systemtap (for both the user as well as the implementation). - - For sequence ording, the initial implementation should use ?? the - atomic_t form for the sequence ordering (since it is guaranteed - to be platform and architecture neutral)?? and modify/change the - implementation later if there is a problem. - - For timestamp, the initial implementation should use - ?? hpet timer ?? pm timer ?? do_gettimeofday ?? cpu # + tsc ?? - some combination (do_gettimeofday + cpu # & low bits of tsc)? - - We could do something like what LTT does (see below) to - generate 64-bit timestamps containing the nanoseconds - since Jan 1, 1970. - - Assuming the implementation keeps these concepts separate now - (ordering data vs. timing deltas), it is always possible to - merge them in the future if a high granularity, numa/smp - synchronized timesource becomes available for a large number - of platforms and/or processors. - -* Data presentation: - In general, users prefer output which is based on "actual" time (ie, - they prefer an output that says the delta is XXX nanoseconds instead - of YYY cpu cycles). Most of the time users want delta's (how long did - this take), but occasionally they want absolute times (when / what time - was this information collected) - -* Competition: - DTRACE has output in nanoseconds (and it is comparable between - processors on an mp system), but it is unclear what the actual - resolution is. Even if the sparc machine does have hardware - that provides nanosecond resolution, on x86-64 they are likely - to have the same problems as discussed here since the solaris - opteron box tends to be a pretty vanilla box. - - From Joshua Stone (joshua.i.stone at intel.com): - - == BEGIN === - DTrace gives you three built-in variables: - - uint64_t timestamp: The current value of a nanosecond timestamp - counter. This counter increments from an arbitrary point in the - past and should only be used for relative computations. - - uint64_t vtimestamp: The current value of a nanosecond timestamp - counter that is virtualized to the amount of time that the current - thread has been running on a CPU, minus the time spent in DTrace - predicates and actions. This counter increments from an arbitrary - point in the past and should only be used for relative time computations. - - uint64_t walltimestamp: The current number of nanoseconds since - 00:00 Universal Coordinated Time, January 1, 1970. - - As for how they are implemented, the only detail I found is that - timestamp is "similar to the Solaris library routine gethrtime". - The manpage for gethrtime is here: - http://docs.sun.com/app/docs/doc/816-5168/6mbb3hr8u?a=view - == END == - - What LTT does: - - "Cycle counters are fast to read but may reflect time - inaccurately. Indeed, the exact clock frequency varies - with time as the processor temperature changes, influenced - by the external temperature and its workload. Moreover, in - SMP systems, the clock of individual processors may vary - independently. - - LTT corrects the clock inaccuracy by reading the real time - clock value and the 64 bits cycle counter periodically, at - the beginning of each block, and at each 10ms. This way, it - is sufficient to read only the lower 32 bits of the cycle - counter at each event. The associated real time value may - then be obtained by linear interpolation between the nearest - full cycle counter and real time values. Therefore, for the - average cost of reading and storing the lower 32 bits of the - cycle counter at each event, the real time with full resolution - is obtained at analysis time." - - -* Cross-references: - The profile_tapset is very dependent on sequencing and time when - ordering of data (i.e. taking a trace history) as well as high - granularity (when calculating time deltas). - -* Associated files: - - Profile tapset requirements - .../src/tapsets/profile/profile_tapset.txt - - Intel high precision event timers specification: - http://www.intel.com/hardwaredesign/hpetspec.htm - - ACPI specification: - http://www.acpi.info/DOWNLOADS/ACPIspec-2-0b.pdf - - From an internal email sent by Tony Luck (tony.luck at intel.com) - regarding a clustered environment. For the summary below, hpet and - pm timer were not an option. For systemtap, they should be considered, - especially since pm timer and hpet were designed to be a timestamp. - - == BEGIN === - For extremely short intervals (<100ns) get some h/w help (oscilloscope - or logic analyser). Delays reading TSC and pipeline effects could skew - your results horribly. Having a 2GHz clock doesn't mean that you can - measure 500ps intervals. - - For short intervals (100ns to 10 ms) TSC is your best choice ... but you - need to sample it on the same cpu, and converting the difference between - two TSC values to real time will require some system dependent math to find - the nominal frequency of the system (you may be able to ignore temperature - effects, unless your system is in an extremely hostile environment). But - beware of systems that change the TSC rate when making frequency - adjustments for power saving. It shouldn't be hard to measure the - system clock frequency to about five significant digits of accuracy, - /proc/cpuinfo is probably good enough. - - For medium intervals (10 ms to a minute) then "gettimeofday()" or - "clock_gettime()" on a system *NOT* running NTP may be best, but you will - need to adjust for systematic error to account for the system clock running - fast/slow. Many Linux systems ship with a utility named "clockdiff" that - you can use to measure the system drift against a reference system - (a system that is nearby on the network, running NTP, prefereably a - low "stratum" one). - - Just run clockdiff every five minutes for an hour or two, and plot the - results to see what systematic drift your system has without NTP. N.B. if - you find the drift is > 10 seconds per day, then NTP may have - trouble keeping this system synced using only drift corrections, - you might see "steps" when running NTP. Check /var/log/messages for - complaints from NTP. - - For long intervals (above a minute). Then you need "gettimeofday()" on a - system that uses NTP to keep it in touch with reality. Assuming reasonable - network connectivity, NTP will maintain the time within a small number of - milliseconds of reality ... so your results should be good for - 4-5 significant figures for 1 minute intervals, and better for longer - intervals. - == END == - |