* Application name: sequence numbers and timestamps
* Contact:
Martin Hunt hunt@redhat.com
Will Cohen wcohen@redhat.com
Charles Spirakis charles.spirakis@intel.com
* Motivation:
On multi-processor systems, it is important to have a way
to correlate information gathered on different CPUs. There are two
forms of correlation:
a) putting information into the correct sequence order
b) providing accurate time deltas between pieces of information
If the resolution of the time deltas is high enough, the deltas can
also be used to order information.
* Background:
Discussion started due to relayfs and per-cpu buffers, but this
capability is needed by many people.
* Target software:
Any software that wants to correlate data gathered on a
multi-processor system, though the scope here will be defined
specifically for systemtap's needs.
* Type of description:
General information and discussion regarding sequencing and timing.
* Interesting probe points:
Any pair of probe points between which you want to measure elapsed
time. For example, timing how long a function takes by putting probe
points at the function entry and function exit (see the sketch below).
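As an illustration, here is a minimal sketch (kernel-style C, assuming
a 2.6-era kernel) of that entry/exit pattern. The handler and per-cpu
variable names are hypothetical; in practice they would be wired up to
kprobes or another probe mechanism:

    #include <linux/time.h>
    #include <linux/percpu.h>
    #include <linux/kernel.h>

    /* per-cpu slot for the entry timestamp (hypothetical name) */
    static DEFINE_PER_CPU(struct timeval, entry_tv);

    static void probe_entry(void)
    {
            /* timestamp taken at function entry; probe handlers run
               with preemption disabled, so per-cpu access is safe */
            do_gettimeofday(&__get_cpu_var(entry_tv));
    }

    static void probe_exit(void)
    {
            struct timeval now, *then = &__get_cpu_var(entry_tv);
            long delta_us;

            do_gettimeofday(&now);
            delta_us = (now.tv_sec - then->tv_sec) * 1000000L
                     + (now.tv_usec - then->tv_usec);
            printk(KERN_DEBUG "function took %ld us\n", delta_us);
    }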
* Interesting values:
Possible ways to order data from multiple sources include:
Retrieve the sequence/time from a global area
High Precision Event Timer (HPET)
Possible implementation:
multimedia/HPET timer
arch/i386/kernel/timer_hpet.c
Advantages:
granularity can vary (the HPET spec says the minimum frequency of the
HPET timer is 10MHz, ~100ns resolution), can be treated as read-only,
can bypass cache update and avoid being cached at all if desired,
designed to be used as an SMP timestamp (see specification)
Disadvantages:
may not be available on all platforms, may not be synchronized on
NUMA systems (i.e., counts for all processors within a NUMA node are
comparable, but counts for processors in different nodes may not be),
potential resource conflict if the timers are used by other software
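For illustration, a sketch of reading the HPET main counter: the 0xf0
offset is the main-counter register from the HPET specification, and
hpet_base stands in for the ioremap()ed register block that would be
found via ACPI enumeration (assumed already done):

    #include <asm/io.h>

    #define HPET_COUNTER    0x0f0   /* main counter value register (HPET spec) */

    static void __iomem *hpet_base; /* ioremap()ed HPET block, from ACPI */

    static inline unsigned long hpet_read32(void)
    {
            /* the low 32 bits are enough for short deltas */
            return readl(hpet_base + HPET_COUNTER);
    }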
Real Time Clock (RTC)
Possible implementation:
"external" chip (clock chip) which has time information, accessed via
ioport or memory-mapped io
Advantages:
can be treated as read-only, can bypass cache update and avoid being
cached at all if desired
Disadvantages:
may not be available on all platforms, low granularity (~1ms for the
RTC), usually slow to access
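For illustration, a sketch of the classic PC CMOS access path (index
port 0x70, data port 0x71); this is exactly the slow ioport access
noted above, and reading register 0 (seconds) is just an example:

    #include <asm/io.h>

    static unsigned char rtc_read_seconds(void)
    {
            outb(0x00, 0x70);       /* select CMOS register 0: seconds */
            return inb(0x71);       /* read it back via the data port */
    }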
ACPI Power Management Timer (pm timer)
Possible implementation:
implemented as part of the ACPI specification, running at 3.579545MHz
arch/i386/kernel/timers/timer_pm.c
Advantages:
not affected by throttling, halting, or power-saving states, moderate
granularity (3.58MHz, ~280ns resolution), designed for use by an OS
to keep track of time during sleep/power states
Disadvantages:
may not be available on all platforms, slower access than hpet timer
(but still much faster than RTC)
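A sketch of reading the 24-bit PM timer, along the lines of what
arch/i386/kernel/timers/timer_pm.c does; pmtmr_ioport would come from
the ACPI FADT and is assumed to be already discovered here:

    #include <asm/io.h>
    #include <linux/types.h>

    #define ACPI_PM_MASK    0xffffff        /* the counter is 24 bits wide */

    static unsigned int pmtmr_ioport;       /* I/O port from the ACPI FADT */

    static inline u32 read_pmtmr(void)
    {
            /* mask to 24 bits; some chipsets implement the full 32 */
            return inl(pmtmr_ioport) & ACPI_PM_MASK;
    }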
Chipset counter
Possible implementation:
timer on a processor chipset, ??SGI implementation??, do we know of
any other implementations?
Advantages:
likely to be based on the PCI bus clock (33MHz = ~30ns) or the
front-side-bus clock (200MHz = ~5ns)
Disadvantages:
may not be available on all platforms
Sequence Number
Possible implementation:
atomic_t global variable, cache aligned, placed in a struct to keep
the variable on a cache line by itself (see the sketch below)
Advantages:
guaranteed correct ordering (even on NUMA systems), architecture
independent, platform independent
Disadvantages:
potential for cache-line ping-pong, doesn't scale, no time
information (ordering data only), access can be slower on NUMA systems
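A minimal sketch of the atomic_t scheme above; the wrapper struct plus
the alignment attribute is what keeps the counter on a cache line of
its own, so writers from different CPUs contend on only that line:

    #include <asm/atomic.h>
    #include <linux/cache.h>

    /* cache-aligned so the counter sits on its own cache line */
    static struct {
            atomic_t seq;
    } ____cacheline_aligned sequence = { ATOMIC_INIT(0) };

    static inline int next_sequence(void)
    {
            /* atomic_inc_return hands each caller a unique, ordered number */
            return atomic_inc_return(&sequence.seq);
    }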
Jiffies
Possible implementation:
the OS counts the number of "clock interrupts" since boot.
Advantages:
platform independent, architecture independent, one writer, many
readers (less cache ping-pong)
Disadvantages:
low resolution (usually 10ms, sometimes 1ms).
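A trivial sketch of a jiffies-based delta; the resolution is bounded
by the clock-interrupt rate noted above (jiffies_to_msecs does the
HZ-dependent conversion):

    #include <linux/jiffies.h>

    static unsigned int jiffies_delta_ms(unsigned long start, unsigned long end)
    {
            /* resolution is one clock tick: 1-10ms depending on HZ */
            return jiffies_to_msecs(end - start);
    }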
do_gettimeofday()
Possible implementation:
arch/i386/kernel/time.c
Advantages:
platform independent, architecture independent, one writer, many
readers (less cache ping-pong), microsecond accuracy
Disadvantages:
the time unit increment value used by this routine changes
based on information from NTP (i.e., if NTP needs to speed up or slow
down the clock, then callers of this routine will be affected). This
is a disadvantage for timing short intervals, but an advantage
for timing long intervals.
Retrieve the sequence/time from a cpu-unique area
Timestamp counter (TSC)
Possible implementation:
a count of the number of core cycles the processor has executed since
power on; due to the lack of synchronization between CPUs, you would
also need to keep track of which CPU the TSC came from (see the
sketch below)
Advantages:
no external bus access, high granularity (cpu core cycles),
available on most (not all) architectures, platform independent
Disadvantages:
not synchronized between CPUs; since it is a count of CPU cycles, the
count can be affected by throttling, halting, and power-saving states;
may not correlate to "actual" time (i.e., just because a 1GHz processor
showed a delta of 1G cycles doesn't mean one second has passed)
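A sketch of sampling the TSC together with the CPU number, as the
implementation note above suggests; rdtscll() is the i386/x86-64
macro from asm/msr.h, and the struct name is illustrative:

    #include <linux/smp.h>
    #include <asm/msr.h>

    struct tsc_sample {
            unsigned long long tsc;
            int cpu;
    };

    static inline void take_tsc_sample(struct tsc_sample *s)
    {
            s->cpu = get_cpu();     /* pin to this CPU while reading */
            rdtscll(s->tsc);        /* 64-bit cycle count, this CPU only */
            put_cpu();
    }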
APIC timer
Possible implementation:
timer implemented within the processor
Advantages:
no external bus access, moderate to high granularity (usually
counting based on the front-side-bus clock or core clock)
Disadvantages:
not synchronized between CPUs, may be affected by throttling,
halting/power-saving states, may not correlate to "actual" time.
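For illustration, reading the local APIC timer is a single register
access; note the current-count register counts down from its
programmed initial count, so a "timestamp" is really
(initial count - current count):

    #include <asm/apicdef.h>
    #include <asm/apic.h>

    static inline unsigned int apic_timer_current_count(void)
    {
            return apic_read(APIC_TMCCT);   /* per-CPU, counts down */
    }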
PMU event
Possible implementation:
program a performance counter with a specific event related to time
Advantages:
no external bus access, moderate to high granularity (usually
counting based on the front-side-bus clock or core clock), can be
virtualized to give moderate to high granularity for individual
thread paths
Disadvantages:
not synchronized between CPUs, may be affected by throttling,
halting/power-saving states, may not correlate to "actual" time,
processor dependent
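Programming the counter is processor dependent (and not shown), but
once a time-related event is counting, the RDPMC instruction reads it
cheaply; a hedged i386 sketch:

    static inline unsigned long long read_pmc(int counter)
    {
            unsigned int lo, hi;

            /* RDPMC: ECX selects the counter, result in EDX:EAX */
            asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (counter));
            return ((unsigned long long)hi << 32) | lo;
    }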
For reference, as a quick baseline, on Martin's dual-processor system,
he gets the following performance measurements:
kprobe overhead:             1200-1500ns (depending on OS and processor)
atomic read plus increment:    40ns (single-processor access, no conflict)
monotonic_clock():            550ns
do_gettimeofday():            590ns
* Dependencies:
Not Applicable
* Restrictions:
Certain timers may already be in use by other parts of the kernel,
depending on how the kernel is configured (for example, the RTC is
used by the watchdog code). Some kernels may not compile in the
necessary code (for example, the PM timer requires ACPI). Some
platforms or architectures may not have the timer requested (for
example, there is no HPET timer on older systems).
* Data collection:
For data collection, it is probably best to keep the concepts of
sequence ordering and timestamping separate within systemtap (for
both the user and the implementation).
For sequence ordering, the initial implementation should use ?? the
atomic_t form (since it is guaranteed to be platform and architecture
neutral) ?? and change the implementation later if there is a problem.
For timestamp, the initial implementation should use
?? hpet timer ?? pm timer ?? do_gettimeofday ?? cpu # + tsc ??
some combination (do_gettimeofday + cpu # & low bits of tsc; see the
sketch at the end of this section)?
We could do something like what LTT does (see below) to
generate 64-bit timestamps containing the nanoseconds
since Jan 1, 1970.
Assuming the implementation keeps these concepts separate now
(ordering data vs. timing deltas), it is always possible to merge
them in the future if a high-granularity, NUMA/SMP-synchronized
timesource becomes available for a large number of platforms
and/or processors.
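A sketch of the "combination" idea floated above, packing a
do_gettimeofday() reading, the CPU number, and the low TSC bits into
one 64-bit tag. The field widths (32/20/4/8 bits) are purely
illustrative; the low TSC bits break ties within one microsecond on
a single CPU:

    #include <linux/time.h>
    #include <linux/smp.h>
    #include <linux/types.h>
    #include <asm/msr.h>

    static u64 make_timetag(void)
    {
            struct timeval tv;
            unsigned long long tsc;
            int cpu = get_cpu();

            do_gettimeofday(&tv);
            rdtscll(tsc);
            put_cpu();

            /* seconds:32 | microseconds:20 | cpu:4 | tsc low bits:8 */
            return ((u64)tv.tv_sec << 32) | ((u64)tv.tv_usec << 12)
                 | ((u64)(cpu & 0xf) << 8) | (tsc & 0xff);
    }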
* Data presentation:
In general, users prefer output which is based on "actual" time (i.e.,
they prefer an output that says the delta is XXX nanoseconds instead
of YYY cpu cycles). Most of the time users want deltas (how long did
this take), but occasionally they want absolute times (when was this
information collected).
* Competition:
DTrace has output in nanoseconds (and it is comparable between
processors on an MP system), but it is unclear what the actual
resolution is. Even if the SPARC machines have hardware that
provides nanosecond resolution, on x86-64 DTrace is likely to
face the same problems discussed here, since the Solaris
Opteron boxes tend to be pretty vanilla hardware.
From Joshua Stone (joshua.i.stone at intel.com):
== BEGIN ===
DTrace gives you three built-in variables:
uint64_t timestamp: The current value of a nanosecond timestamp
counter. This counter increments from an arbitrary point in the
past and should only be used for relative computations.
uint64_t vtimestamp: The current value of a nanosecond timestamp
counter that is virtualized to the amount of time that the current
thread has been running on a CPU, minus the time spent in DTrace
predicates and actions. This counter increments from an arbitrary
point in the past and should only be used for relative time computations.
uint64_t walltimestamp: The current number of nanoseconds since
00:00 Universal Coordinated Time, January 1, 1970.
As for how they are implemented, the only detail I found is that
timestamp is "similar to the Solaris library routine gethrtime".
The manpage for gethrtime is here:
http://docs.sun.com/app/docs/doc/816-5168/6mbb3hr8u?a=view
== END ==
What LTT does:
"Cycle counters are fast to read but may reflect time
inaccurately. Indeed, the exact clock frequency varies
with time as the processor temperature changes, influenced
by the external temperature and its workload. Moreover, in
SMP systems, the clock of individual processors may vary
independently.
LTT corrects the clock inaccuracy by reading the real time
clock value and the 64 bits cycle counter periodically, at
the beginning of each block, and at each 10ms. This way, it
is sufficient to read only the lower 32 bits of the cycle
counter at each event. The associated real time value may
then be obtained by linear interpolation between the nearest
full cycle counter and real time values. Therefore, for the
average cost of reading and storing the lower 32 bits of the
cycle counter at each event, the real time with full resolution
is obtained at analysis time."
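For illustration, a sketch of the interpolation LTT describes, with
hypothetical names: given two (cycle counter, real time) reference
pairs bracketing an event, the 32 low cycle-counter bits logged with
the event are enough to recover its real time:

    #include <linux/types.h>

    struct time_ref {
            u64 cycles;     /* full cycle counter at the reference point */
            u64 ns;         /* real time at the reference point, in ns */
    };

    static u64 interpolate_ns(const struct time_ref *before,
                              const struct time_ref *after,
                              u32 event_cycles_lo)
    {
            /* rebuild the full count: high bits come from "before",
               plus one wrap if the low 32 bits rolled over in between */
            u64 cycles = (before->cycles & ~0xffffffffULL) | event_cycles_lo;
            if (cycles < before->cycles)
                    cycles += 1ULL << 32;

            /* linear interpolation between the two reference points
               (the references are assumed ~10ms apart, so the
               intermediate product fits comfortably in 64 bits) */
            return before->ns + (cycles - before->cycles)
                 * (after->ns - before->ns)
                 / (after->cycles - before->cycles);
    }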
* Cross-references:
The profile tapset depends heavily on sequencing (when ordering
data, i.e., taking a trace history) as well as on high granularity
(when calculating time deltas).
* Associated files:
Profile tapset requirements
.../src/tapsets/profile/profile_tapset.txt
Intel high precision event timers specification:
http://www.intel.com/hardwaredesign/hpetspec.htm
ACPI specification:
http://www.acpi.info/DOWNLOADS/ACPIspec-2-0b.pdf
From an internal email sent by Tony Luck (tony.luck at intel.com)
regarding a clustered environment. For the summary below, HPET and
the PM timer were not an option. For systemtap, they should be
considered, especially since the PM timer and HPET were designed to
be used as timestamps.
== BEGIN ===
For extremely short intervals (<100ns) get some h/w help (oscilloscope
or logic analyser). Delays reading TSC and pipeline effects could skew
your results horribly. Having a 2GHz clock doesn't mean that you can
measure 500ps intervals.
For short intervals (100ns to 10 ms) TSC is your best choice ... but you
need to sample it on the same cpu, and converting the difference between
two TSC values to real time will require some system dependent math to find
the nominal frequency of the system (you may be able to ignore temperature
effects, unless your system is in an extremely hostile environment). But
beware of systems that change the TSC rate when making frequency
adjustments for power saving. It shouldn't be hard to measure the
system clock frequency to about five significant digits of accuracy,
/proc/cpuinfo is probably good enough.
For medium intervals (10 ms to a minute) then "gettimeofday()" or
"clock_gettime()" on a system *NOT* running NTP may be best, but you will
need to adjust for systematic error to account for the system clock running
fast/slow. Many Linux systems ship with a utility named "clockdiff" that
you can use to measure the system drift against a reference system
(a system that is nearby on the network, running NTP, preferably a
low "stratum" one).
Just run clockdiff every five minutes for an hour or two, and plot the
results to see what systematic drift your system has without NTP. N.B. if
you find the drift is > 10 seconds per day, then NTP may have
trouble keeping this system synced using only drift corrections,
you might see "steps" when running NTP. Check /var/log/messages for
complaints from NTP.
For long intervals (above a minute). Then you need "gettimeofday()" on a
system that uses NTP to keep it in touch with reality. Assuming reasonable
network connectivity, NTP will maintain the time within a small number of
milliseconds of reality ... so your results should be good for
4-5 significant figures for 1 minute intervals, and better for longer
intervals.
== END ==