* Application name: Stopwatch and Profiling for systemtap
* Contact:
Will Cohen wcohen@redhat.com
Charles Spirakis charles.spirakis@intel.com
* Motivation:
Allow SW developers to improve the performance of their
code. The methodologies used are stopwatch (sometimes known
as event counting) and profiling.
* Background:
Will has experience with oprofile
Charles has experience with vtune
* Target software:
Initially the kernel, but longer term, both kernel and user.
* Type of description:
General information regarding requirements and usage models.
* Interesting probe points:
When doing profiling you have "asynchronous-event" probe points
(i.e. you get an interrupt and you want to capture information
about where that interrupt happened).
When doing stopwatch, interesting probe points will be
function entry/exits, queue add/remove, queue entity lifecycle,
and any other code where you want to measure time
or events (cpu resource utilization) associated with a path of code
(frame buffer drawing measurements, graphic T&L pipeline
measurements, etc).
* Interesting values:
For profiling, the pt_regs structure from the interrupt handler. The
most commonly used items would be the instruction pointer and the
call stack pointer.
For stopwatch, most of the accesses are likely to be pmu read
operations.
In addition, given the large variety of pmu capabilities, access
to the pmu registers themselves (read and write) would be very
important.
Different pmu's have different events, but for script portability,
we may want to have a subset of predefined events and have something
map that into a pmu's particular event (similar to what papi does).
Given the variety of performance events and pmu architectures, we
may want to try and have a standardized library/api as part of the
translator to map events (or specialized event information) into
register/value pairs used during the actual systemtap run (a small
mapping-table sketch appears below).
??? Classify values as consumed from lower level vs. provided to higher
level ???
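As a concrete illustration of the event-mapping idea above, a minimal
C sketch of a PAPI-style table follows. The structure and lookup helper
are assumptions for illustration only; the CPU_CLK_UNHALTED/0x79
Pentium M encoding is the one cited in the papi description later in
this file, and the second row is a placeholder.
    /*
     * Sketch only (not proposed systemtap syntax): map portable event
     * names to pmu-specific encodings, PAPI-style.
     */
    #include <stdint.h>
    #include <string.h>

    struct event_map {
        const char *generic_name;   /* portable name used in scripts    */
        const char *native_name;    /* pmu-specific event name          */
        uint32_t    native_code;    /* raw encoding programmed into pmu */
    };

    static const struct event_map pentium_m_events[] = {
        { "cpu_cycles",           "CPU_CLK_UNHALTED", 0x79 },
        { "instructions_retired", "INST_RETIRED",     0xc0 },  /* placeholder */
    };

    /* Resolve a portable name to a raw encoding; returns 1 on success. */
    static int lookup_event(const char *name, uint32_t *code)
    {
        size_t i;
        for (i = 0; i < sizeof(pentium_m_events)/sizeof(pentium_m_events[0]); i++) {
            if (strcmp(name, pentium_m_events[i].generic_name) == 0) {
                *code = pentium_m_events[i].native_code;
                return 1;
            }
        }
        return 0;
    }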
* Dependencies:
Need some form of arbitration of the pmu to make sure the data provided
is valid (see perfmon below).
Two common usage models are aggregated data (oprofile) and
trace history (papi/vtune). Currently these tools all do the
aggregation in user-mode and we may want to look at what
they do and why.
The unofficial rule of thumb is that profiling should be
as unobtrusive as possible and definitely < 1% overhead.
When doing stopwatch or profiling, there is a need to be able to
sequence the data. For timing, this is important in order to
accurately compute start/stop deltas and watch control/data flow.
For profiling, sequencing is needed to support trace history.
There needs to be a timesource that has reasonable granularity
and is reasonably precise (a minimal sketch follows this section).
Per-thread virtualization (of time and events) is also needed,
as is a system-wide mode for pmu events.
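As a minimal sketch of the timesource requirement, assuming a kernel
that exports a monotonic clock such as ktime_get(); jiffies is too
coarse, and raw per-cpu cycle counters are not guaranteed to be
synchronized across processors:
    /*
     * Sketch only: an smp-safe, nanosecond-granularity timestamp
     * helper, assuming ktime_get() (monotonic clock) is available.
     */
    #include <linux/ktime.h>
    #include <linux/types.h>

    static inline u64 probe_timestamp_ns(void)
    {
        return ktime_to_ns(ktime_get());  /* nanoseconds, monotonic */
    }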
* Restrictions:
Currently access to the pmu is a bit of a free-for-all with no
single entity providing arbitration. The perfmon2 patch for 2.6
(see the cross-reference section below) is attempting to
provide much of the infrastructure needed by profiling tools
(like oprofile and papi) across architectures (Pentium M, ia64
and x86_64 initially, though I understand Stephane has contacted
someone at IBM for a powerpc version as well).
Andrew Morton wants perfmon and perfctr to be merged. Regardless
of what happens, both pmu libraries are geared more toward
user->kernel access than kernel->kernel access, and we
will need to see what can be exported (presumably via
EXPORT_SYMBOL) to make them more kernel-module friendly.
* Data collection:
Pmu counters tend to be different widths on different
architectures. It would be useful to standardize the
width (in software) to 64 bits to make math operations
(such as comparisons, deltas, etc.) easier; a sketch of that
widening follows.
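A minimal sketch of the 64-bit widening, assuming the raw counter is
sampled often enough that it cannot wrap more than once between two
reads (the names are illustrative):
    #include <stdint.h>

    struct wide_counter {
        uint64_t accumulated;   /* running 64-bit software total     */
        uint64_t last_raw;      /* previous raw sample (width bits)  */
        unsigned int width;     /* hardware counter width, e.g. 40   */
    };

    /* Fold a new raw hardware sample into the 64-bit software counter. */
    static uint64_t wide_counter_update(struct wide_counter *wc, uint64_t raw)
    {
        uint64_t mask  = (wc->width >= 64) ? ~0ULL : ((1ULL << wc->width) - 1);
        uint64_t delta = (raw - wc->last_raw) & mask;   /* tolerates one wrap */

        wc->last_raw     = raw & mask;
        wc->accumulated += delta;
        return wc->accumulated;
    }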
The goal of profiling is to go from:
pid/ip -> path/image -> source file/line number
This implies the need to have a (reasonably quick) mechanism to
translate pid/ip to path/image. Potentially reuse the dcookie
methodology from oprofile but may need to extend that model if there
is a goal to support anonymous maps (dynamically generated code).
Need the ability to map the current pid to a process name.
Need to make a decision on how much will be handled via associative
arrays in the kernel and how much will be handled in user space
(potentially part of post processing). Given the volume of data that
can be generated during profiling, it may make sense to follow the
trend of current performance tools and attempt to put merging and
aggregation in user space instead of kernel space.
To keep the overhead of collection low, it may be useful to look
into having some of the information needed be collected at interrupt
time and other pieces of information be collected after the
interrupt (top/bottom style). For example, although it may be
convenient to have a syntax like:
val = associate_image($pt_regs->eip)
it may be preferable to use a marker in the output stream instead
(oprofile used a dcookie) and then do a lookup later (either in the
kernel, adding a marker->name entry to the output stream, or in user
space, similar to what oprofile did). This concept could be extended
to cover the lookup of the pid name as well; a sketch of the record
layout follows.
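A sketch of what such two-phase collection could look like at the
record level; the structures and field names below are illustrative
assumptions, not a proposed systemtap format:
    #include <stdint.h>

    /* Emitted from the interrupt handler: small, fixed size, no lookups. */
    struct profile_sample {
        uint32_t pid;
        uint64_t ip;
        uint64_t image_cookie;   /* opaque marker, e.g. oprofile's dcookie */
    };

    /* Emitted later, outside interrupt context, at most once per cookie.
     * Post-processing joins samples to paths via image_cookie. */
    struct cookie_record {
        uint64_t image_cookie;
        char     path[256];      /* resolved image path, or process name  */
    };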
Stack information will need to be collected at interrupt time
(based on the interrupted pt_regs->esp) so the routine to obtain
the stack trace should be reasonably fast. Due to asynchronous probes,
the stack may be in user space.
Depending on whether support of anonymous maps is important, it may
be useful to have a more generic method of mapping ip->path/module
which would allow dynamic code generators (usually found in user
space) to provide ip->image map information as part of
the regular systemtap data stream. If we allow for a user-mode api
to add data to a systemtap stream, we could have a very general
purpose merge/aggregation tool for profiling from a variety of
sources (an illustrative record layout follows).
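If a user-mode api of that kind existed, a dynamic code generator
could publish records along these lines; the layout is an assumption
for illustration only:
    #include <stdint.h>

    /* Hypothetical record a jvm/jit could add to the data stream so that
     * samples landing in anonymous maps can still be resolved to a name. */
    struct anon_map_record {
        uint32_t pid;
        uint64_t start;          /* start of the jitted code region   */
        uint64_t end;            /* end of the jitted code region     */
        char     name[128];      /* e.g. the jitted method's name     */
    };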
* Data presentation:
Generally data will be presented to the user as either an in-order
stream (trace history) or aggregated in some form to produce a
histogram or min/max/average/std.
When aggregated, the data may be clumped by pid (each run of
the app provides unique data), by process name (the data for an app
is merged across all runs), or by loaded image
(to get information about shared libraries regardless of the app
that loaded them). Assuming an increase in multi-processor and
multi-threaded applications, grouping the data by thread group
id is likely to be useful as well. Ideally, if symbols/debug
information is available, additional aggregation could be done
at the function, basic block, or source line level; a minimal
aggregation sketch follows this section.
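As a minimal sketch of the per-key aggregate this implies (whether
kept in a kernel associative array or built during user-space post
processing), the C structure below tracks enough state to report min,
max, average, and standard deviation; the key choice (pid, process
name, image, thread group id, function, ...) is left open and the
names are illustrative:
    #include <stdint.h>
    #include <math.h>

    struct aggregate {
        uint64_t count;
        uint64_t min, max;
        double   sum, sum_sq;    /* for average and standard deviation */
    };

    static void aggregate_add(struct aggregate *a, uint64_t v)
    {
        if (a->count == 0 || v < a->min) a->min = v;
        if (a->count == 0 || v > a->max) a->max = v;
        a->count++;
        a->sum    += (double)v;
        a->sum_sq += (double)v * (double)v;
    }

    static double aggregate_avg(const struct aggregate *a)
    {
        return a->count ? a->sum / (double)a->count : 0.0;
    }

    static double aggregate_std(const struct aggregate *a)
    {
        double avg = aggregate_avg(a);
        double var = a->count ? a->sum_sq / (double)a->count - avg * avg : 0.0;
        return var > 0.0 ? sqrt(var) : 0.0;
    }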
* Competition:
See the cross-reference list below
* Cross-references:
Oprofile
Oprofile is a profiling tool that provides time and event based
sampling. Its collection methodology has a "file view" of the
world and only captures the minimum information needed to get
the image that corresponds to the interrupted instruction
address. It aggregates the data (no time information) to keep
the total data size to a minimum even on long runs. Oprofile
allows for optional "escape" sequences in a data stream to add
information. It can handle non-maskable interrupts (NMI) as well
as maskable interrupts to obtain samples in areas where
maskable interrupts are normally disabled. Work is being done
to allow oprofile to handle anonymous maps (i.e. dynamically
generated code from jvm's).
http://oprofile.sourceforge.net/news/
Papi
Papi is a profiling tool that can aggregate data or keep a trace
history. It uses tables to map generic event concepts (for example,
PAPI_TOT_CYC) into architecture specific events (for example,
CPU_CLK_UNHALTED, value 0x79 on the Pentium M). Interrupts can be
time based and it can capture event counts (i.e. every 5ms,
capture cpu cycles and instructions retired) in addition to
the instruction pointer. Papi is built on top of other performance
monitoring support such as ia64 perfmon and i386 perfctr in the Linux
kernel.
http://icl.cs.utk.edu/papi/
Perfmon2 infrastructure
Perfmon2 is a profiling infrastructure currently in the Linux 2.6
kernel for ia64. It handles arbitration and virtualization
of the pmu resources, extends
the pmu's to a logical 64-bits regardless of the underlying hardware
size, context switches of the counters when needed to allow for
per-process or system-wide use, and has the ability to choose a subset
of the cpu's on a system when doing system-wide profiling. Oprofile on
Linux 2.6 for ia64 has been ported to use the perfmon2 interface. Currently,
patches have been submitted to the Linux Kernel Mailing List for
the 2.6 kernel to port the perfmon2
infrastructure to the Pentium M and x86_64.
http://www.hpl.hp.com/techreports/2004/HPL-2004-200R1.html
Shark
Shark is a profiling tool from Apple that focuses on time and event
based statistical stack sampling. On each profile interrupt, in
addition to capturing the instruction pointer, it also captures
a stack trace so you know both where you were and how you got there.
http://developer.apple.com/tools/sharkoptimize.html
Vtune
Vtune is a profiling tool that provides time and event based
sampling. It does collection based on a "process view" of the
world. It keeps a trace history so that you can aggregate the
data during post processing in various ways, it can capture
architecture-specific data in addition to ip (such as branch
history buffers), and it can use architecture-specific abilities
to get exact ip addresses for certain events. It currently handles
anonymous mappings (dynamically generated code from jvm's).
http://www.intel.com/software/products/vtune/vlin/index.htm
* Associated files:
Should the usage models be split into a separate file?
Usage Models:
Below are some typical usage models. This isn't an attempt
to propose syntax; it's an attempt to create something
concrete enough to help people understand the goals:
(description, pseudo code, desired output).
Description: Statistical stack sampling (ala shark)
probe kernel.time_ms(10)
{
i = associate_image($pt_regs->eip);
s = stack($pt_regs->esp);
stp($current->pid, $pid_name, $pt_regs->eip, i, s)
}
Output desired:
For each process/process name, aggregate (histogram) based
on eip (regardless of how I got there), stack (what was the
most common calling path), or both (what was the most common
path to the most common eip).
Could be implemented by generating a trace history and letting the
user post process (eats disk space, but one run can be viewed
multiple ways), or by having the user define what is wanted
in the script and doing the post processing ourselves (saves disk space,
but more work for us).
Description: Time based aggregation (ala oprofile)
probe kernel.time_ms(10)
{
i = associate_image($pt_regs->eip);
stp($current->pid, $pid_name, $pt_regs->eip, i);
}
Output desired:
Histogram separated by process name, pid/eip, pid/image
Description: Time a routine part 1
time between the function call and return:
probe kernel.function("sys_exec")
{
$thread->mystart = $timestamp
}
probe kernel.function("sys_exec").return
{
delta = $timestamp - $thread->mystart
// do statistical operations...
}
Output desired:
Be able to do statistics for the time it takes for an exec
function to execute. The time needs to have a fine enough
granularity to have meaning (i.e. using jiffies probably wouldn't work)
and the time needs to be smp correct even if the probe entry
and the return execute on different processors.
Description: Time a routine part 2
count the number of events between the
function call and return:
probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_execve")
{
$thread->myclocks = $pmu[0];
$thread->myinstr_ret = $pmu[1];
}
probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_execve").return
{
$thread->myclocks = $pmu[0] - $thread->myclocks;
$thread->myinstr_ret = $pmu[1] - $thread->myinstr_ret;
cycles_per_instruction = $thread->myclocks / $thread->myinstr_ret
// Do statistical operations...
}
Desired Output:
Produce min/max/average for cycles, instructions retired,
and cycles_per_instruction. The pmu must be virtualized if the
probe entry and probe exit can happen on different processors. The
pmu should be virtualized if there can be pre-emtption (or waits) in
the function itself to get more useful information (the actual count
of events in the function vs. a count of events in the whole system
between when the function starts and when it ended)
Description: Time a routine part 3
reminder of threading issues
probe kernel.function("sys_fork")
{
$thread->mystart = $timestamp
}
probe kernel.function("sys_fork").return
{
delta = $timestamp - $thread->mystart
if (parent) {
// do statistical operations for time it takes parent
} else {
// do statistical operations for time it takes child
}
}
Desired Output:
Produce min/max/average for the parent and the child. The
time needs to have a fine enough granularity to have
meaning (i.e. using jiffies probably wouldn't work)
and the time needs to be smp correct even if the probe entry
and the probe return execute on different processors.
Description: Time a routine part 4
reminder of threading issues
probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_fork")
{
$thread->myclocks = $pmu[0];
$thread->myinstr = $pmu[1];
}
probe kernel.virtual.startwatch("cpu_cycles").virtual.startwatch("instructions_retired").function("sys_fork").return
{
$thread->myclocks = $pmu[0] - $thread->myclocks;
$thread->myinstr = $pmu[1] - $thread->myinstr;
cycles_per_instruction = $thread->myclocks / $thread->myinstr
if (parent) {
// Do statistical operations...
} else {
// Do statistical operations...
}
}
Desired Output:
Produce min/max/average for cycles, instructions retired,
and cycles_per_instruction. The pmu must be virtualized if the
probe entry and probe exit can happen on different processors. The
pmu should be virtualized if there can be pre-emtption (or waits) in
the function itself to get more useful information (the actual count
of events in the function vs. a count of events in the whole system
between when the function starts and when it ended)
Description: Beginnings of "papi" style collection
probe kernel.startwatch("cpu_cycles").startwatch("instructions_retired").time_ms(10)
{
i = associate_image($pt_regs->eip);
stp($current->pid, $pid_name, $pt_regs->eip, i, $pmu[0], $pmu[1]);
}
Desired output:
Trace history or aggregation based on process name, image
Description: Find the path leading to a high-latency cache miss
that stalled for more than 128 cycles (ia64 only)
probe kernel.startwatch("branch_event,pmc[12]=0x3e0f").pmu_profile("data_ear_event:1000,pmc[11]=0x5000f")
{
//
// on ia64, when using the data ear event, the precise eip is
// saved in pmd[17], so no need for pt_regs->eip (and the
// associated skid)...
//
i = associate_image($pmu->pmd[17]);
stp($current->pid, $pid_name, $pmu->pmd[17], i, // the basics
$pmu->pmd[2], // precise data address
$pmu->pmd[3], // latency information
$pmu->pmd[8], // branch history buffer
$pmu->pmd[9], // "
$pmu->pmd[10], // "
$pmu->pmd[11], // "
$pmu->pmd[12], // "
$pmu->pmd[13], // "
$pmu->pmd[14], // "
$pmu->pmd[15], // "
$pmu->pmd[16]); // indication of which was most recent branch
}
Desired output:
Aggregate data based on pid, process name, eip, latency, and
data address. Each pmd on ia64 is 64 bits long, thus capturing
just the 12 pmd's listed here is 96 bytes of information every
interrupt for each cpu. Profiling can collect a very large amount
of data...
Description: Pmu event data collection, but using NMI
instead of the regular interrupt.
NMI is useful for getting visibility into locks and other code which is
normally hidden behind interrupt-disable code. However, handling an
NMI is more difficult to do properly. Potentially the compiler can be
more restrictive about what's allowed in the handler when NMI's are
selected as the interrupt method.
probe kernel.nmi.pmu_profile("instructions_retired:1000000")
{
i = associate_image($pt_regs->eip);
stp($pid_name, $pt_regs->eip, i);
}
Desired Output:
Same as the earlier oprofile style example
Description: Timing items in a queue
Two possibilities - use associative arrays or post process
Associative arrays:
probe kernel.function("add queue function")
{
start[$arg->queue_entry] = $timestamp;
}
probe kernel.function("remove queue function")
{
delta = $timestamp - start[$arg->queue_entry];
// Do statistics on the delta value and the queue entry
}
Post process:
probe kernel.function("add queue function")
{
stp("add", $timestamp, $arg->queue_entry)
}
probe kernel.function("remove queue function")
{
stp("remove", $timestamp, $arg->queue_entry)
}
Desired Output:
For each queue_entry, calculate the delta and do appropriate
statistics.
Description: Following an item as it moves to different queues/lists
Two possibilities - use associative arrays or post process
This example only instruments the add side (list_add); the delta for a
(queue, queue_entry) pair is the time since that entry was last added
to that queue.
Associative arrays:
probe kernel.function("list_add")
{
delta = $timestamp - start[$arg->head, $arg->new];
start[$arg->head, $arg->new] = $timestamp;
// Do statistics on the delta value and queue
}
Post process:
probe kernel.function("list_add")
{
stp("add", $timestamp, $arg->head, $arg->new)
}
Desired Output:
For each (queue, queue_entry) pair, calculate the delta and do appropriate
statistics.