.\" -*- nroff -*-
.TH STAPPROBES 5 @DATE@ "Red Hat"
.SH NAME
stapprobes \- systemtap probe points
.\" macros
.de SAMPLE
.br
.RS
.nf
.nh
..
.de ESAMPLE
.hy
.fi
.RE
..
.SH DESCRIPTION
The following sections enumerate the variety of probe points supported
by the systemtap translator, and additional aliases defined by
standard tapset scripts.
.PP
The general probe point syntax is a dotted-symbol sequence. This
allows a breakdown of the event namespace into parts, somewhat like
the Domain Name System does on the Internet. Each component
identifier may be parametrized by a string or number literal, with a
syntax like a function call. A component may include a "*" character,
to expand to a set of matching probe points. Probe aliases likewise
expand to other probe points. Each and every resulting probe point is
normally resolved to some low-level system instrumentation facility
(e.g., a kprobe address, marker, or a timer configuration), otherwise
the elaboration phase will fail.
.PP
However, a probe point may be followed by a "?" character, to indicate
that it is optional, and that no error should result if it fails to
resolve. Optionalness passes down through all levels of
alias/wildcard expansion. Alternately, a probe point may be followed
by a "!" character, to indicate that it is both optional and
sufficient. (Think vaguely of the Prolog cut operator.) If it does
resolve, then no further probe points in the same comma-separated list
will be resolved. Therefore, the "!" sufficiency mark only makes
sense in a list of probe point alternatives.
.PP
Additionally, a probe point may be followed by an "if (expr)" statement, in
order to enable/disable the probe point on-the-fly. With the "if" statement,
if the "expr" is false when the probe point is hit, the whole probe body,
including the body of any alias, is skipped. The condition is stacked up through
all levels of alias/wildcard expansion, so the final condition becomes
the logical-and of the conditions of all expanded aliases/wildcards.
.PP
These are all syntactically valid probe points:
.SAMPLE
kernel.function("foo").return
syscall(22)
user.inode("/bin/vi").statement(0x2222)
end
syscall.*
kernel.function("no_such_function") ?
module("awol").function("no_such_function") !
signal.*? if (switch)
.ESAMPLE
Probes may be broadly classified into "synchronous" and
"asynchronous". A "synchronous" event is deemed to occur when any
processor executes an instruction matched by the specification. This
gives these probes a reference point (instruction address) from which
more contextual data may be available. Other families of probe points
refer to "asynchronous" events such as timers/counters rolling over,
where there is no fixed reference point that is related. Each probe
point specification may match multiple locations (for example, using
wildcards or aliases), and all of them are then probed. A probe
declaration may also contain several comma-separated specifications,
all of which are probed.
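.PP
As an illustrative sketch, the following script combines the "!", "?",
and "if" suffixes described above. The kernel function names
"do_sys_open" and "sys_open" and the global "enabled" are assumptions
chosen for the example; substitute names that exist on the target
kernel.
.SAMPLE
global enabled = 1

# Try do_sys_open first. If it resolves, sys_open is not resolved.
# Both alternatives are optional and carry the same condition.
probe kernel.function("do_sys_open") ! if (enabled),
      kernel.function("sys_open") ? if (enabled)
{
  printf("%s opened a file\\n", execname())
}

# Toggle the condition every five seconds.
probe timer.s(5) { enabled = !enabled }
.ESAMPLE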
.SS BEGIN/END/ERROR
The probe points
.IR begin " and " end
are defined by the translator to refer to the time of session startup
and shutdown. All "begin" probe handlers are run, in some sequence,
during the startup of the session. All global variables will have
been initialized prior to this point. All "end" probes are run, in
some sequence, during the
.I normal
shutdown of a session, such as in the aftermath of an
.I exit ()
function call, or an interruption from the user. In the case of an
error-triggered shutdown, "end" probes are not run. There are no
target variables available in either context.
.PP
If the order of execution among "begin" or "end" probes is significant,
then an optional sequence number may be provided:
.SAMPLE
begin(N)
end(N)
.ESAMPLE
The number N may be positive or negative. The probe handlers are run in
increasing order, and the order between handlers with the same sequence
number is unspecified. When "begin" or "end" are given without a
sequence, they are effectively sequence zero.
.PP
The
.IR error
probe point is similar to the
.IR end
probe, except that each such probe handler is run when the session ends
after errors have occurred. In such cases, "end" probes are skipped,
but each "error" probe is still attempted. This kind of probe can be
used to clean up or emit a "final gasp". It may also be numerically
parametrized to set a sequence.
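.PP
A minimal sketch of sequenced handlers follows; the messages are
illustrative only. The "begin" handlers run in the order \-1, 0, 1.
Because the last one calls exit(), the session shuts down normally,
so the "end" handler should run rather than the "error" handler.
.SAMPLE
probe begin(\-1) { printf("first\\n") }
probe begin     { printf("second (sequence zero)\\n") }
probe begin(1)  { printf("third\\n"); exit() }
probe end       { printf("normal shutdown\\n") }
probe error     { printf("shutdown after an error\\n") }
.ESAMPLE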
.SS NEVER
The probe point
.IR never
is specially defined by the translator to mean "never". Its probe
handler is never run, though its statements are analyzed for symbol /
type correctness as usual. This probe point may be useful in
conjunction with optional probes.
.SS TIMERS
Intervals defined by the standard kernel "jiffies" timer may be used
to trigger probe handlers asynchronously. Two probe point variants
are supported by the translator:
.SAMPLE
timer.jiffies(N)
timer.jiffies(N).randomize(M)
.ESAMPLE
The probe handler is run every N jiffies (a kernel-defined unit of
time, typically between 1 and 60 ms). If the "randomize" component is
given, a uniformly distributed random value in the range [\-M..+M] is
added to N every time the handler is run. N is restricted to a
reasonable range (1 to around a million), and M is restricted to be
smaller than N. There are no target variables provided in either
context. It is possible for such probes to be run concurrently on
a multi-processor computer.
.PP
Alternatively, intervals may be specified in units of time.
There are two probe point variants similar to the jiffies timer:
.SAMPLE
timer.ms(N)
timer.ms(N).randomize(M)
.ESAMPLE
Here, N and M are specified in milliseconds, but the full options for units
are seconds (s/sec), milliseconds (ms/msec), microseconds (us/usec),
nanoseconds (ns/nsec), and hertz (hz). Randomization is not supported for
hertz timers.
The actual resolution of the timers depends on the target kernel. For
kernels prior to 2.6.17, timers are limited to jiffies resolution, so
intervals are rounded up to the nearest jiffies interval. After 2.6.17,
the implementation uses hrtimers for tighter precision, though the actual
resolution will be arch-dependent. In either case, if the "randomize"
component is given, then the random value will be added to the interval
before any rounding occurs.
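.PP
As a hedged illustration of the interval timers above, this sketch
counts firings of a randomized 100 ms timer over roughly five seconds.
The variable name "ticks" is arbitrary, and the reported count will
vary with the timer resolution of the target kernel.
.SAMPLE
global ticks

# Fires every 100 ms, give or take up to 20 ms.
probe timer.ms(100).randomize(20) { ticks++ }

# Report and stop after five seconds.
probe timer.s(5)
{
  printf("%d ticks\\n", ticks)
  exit()
}
.ESAMPLE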
.PP
Profiling timers are also available to provide probes that execute on all
CPUs at the rate of the system tick. This probe takes no parameters.
.SAMPLE
timer.profile
.ESAMPLE
Full context information of the interrupted process is available, making
this probe suitable for a time-based sampling profiler.
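.PP
A time-based profiler along these lines might be sketched as follows.
The ten-second duration and the array name "samples" are arbitrary
choices for the example.
.SAMPLE
global samples

# Count one sample per executable name on every profiling tick.
probe timer.profile { samples[execname()]++ }

# Stop sampling after ten seconds.
probe timer.s(10) { exit() }

# Print the per-executable sample counts at normal shutdown.
probe end
{
  foreach (name in samples)
    printf("%s %d\\n", name, samples[name])
}
.ESAMPLE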
.SS DWARF
This family of probe points uses symbolic debugging information for
the target kernel/module/program, as may be found in unstripped
executables, or the separate
.I debuginfo
packages. They allow placement of probes logically into the execution
path of the target program, by specifying a set of points in the
source or object code. When a matching statement executes on any
processor, the probe handler is run in that context.
.PP
Points in a kernel are identified by
module, source file, line number, function name, or some
combination of these.
.PP
Here is a list of probe point families currently supported. The
.B .function
variant places a probe near the beginning of the named function, so that
parameters are available as context variables. The
.B .return
variant places a probe at the moment
.B after
the return from the named function, so the return value is available
as the "$return" context variable. The
.B .inline
modifier for
.B .function
filters the results to include only instances of inlined functions.
The
.B .call
modifier selects the opposite subset. Inline functions do not have an
identifiable return point, so
.B .return
is not supported on
.B .inline
probes. The
.B .statement
variant places a probe at the exact spot, exposing those local variables
that are visible there.
.SAMPLE
kernel.function(PATTERN)
.br
kernel.function(PATTERN).call
.br
kernel.function(PATTERN).return
.br
kernel.function(PATTERN).inline
.br
module(MPATTERN).function(PATTERN)
.br
module(MPATTERN).function(PATTERN).call
.br
module(MPATTERN).function(PATTERN).return
.br
module(MPATTERN).function(PATTERN).inline
.br
.br
kernel.statement(PATTERN)
.br
kernel.statement(ADDRESS).absolute
.br
module(MPATTERN).statement(PATTERN)
.ESAMPLE
In the above list, MPATTERN stands for a string literal that aims to
identify the loaded kernel module of interest. It may include "*", "[]",
and "?" wildcards. PATTERN stands for a string literal that
aims to identify a point in the program. It is made up of three
parts:
.IP \(bu 4
The first part is the name of a function, as would appear in the
.I nm
program's output. This part may use the "*" and "?" wildcarding
operators to match multiple names.
.IP \(bu 4
The second part is optional and begins with the "@" character.
It is followed by the path to the source file containing the function,
which may include a wildcard pattern, such as mm/slab*.
In most cases, the path should be relative to the top of the
linux source directory, although an absolute path may be necessary for some kernels.
If a relative pathname doesn't work, try absolute.
.IP \(bu 4
Finally, the third part is optional if the file name part was given,
and identifies the line number in the source file preceded by a ":"
or a "+". The line number is assumed to be an
absolute line number if preceded by a ":", or relative to the entry of
the function if preceded by a "+".
.PP
As an alternative, PATTERN may be a numeric constant, indicating a
(module-relative or kernel-_stext-relative) address. In guru mode
only, absolute kernel addresses may be specified with the ".absolute"
suffix.
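.PP
For illustration, the sketch below exercises the three parts of a
string PATTERN. The function and file names ("vfs_*", "fs/open.c",
"bio_init", "fs/bio.c") are assumptions about the target kernel source
and may not resolve on every kernel. The wildcard probe can fire very
often, so it is best tried on a lightly loaded system.
.SAMPLE
# Function name with a wildcard; non-inlined instances only.
probe kernel.function("vfs_*").call
{
  printf("%s called %s\\n", execname(), probefunc())
}

# Any function defined in a given source file.
probe kernel.function("*@fs/open.c")
{
  printf("%s\\n", pp())
}

# The statement three lines past the entry of a function.
probe kernel.statement("bio_init@fs/bio.c+3")
{
  printf("%s\\n", pp())
}
.ESAMPLE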
.PP
Some of the source-level context variables, such as function parameters,
locals, globals visible in the compilation unit, may be visible to
probe handlers. They may refer to these variables by prefixing their
name with "$" within the scripts. In addition, a special syntax
allows limited traversal of structures, pointers, and arrays.
.TP
$var
refers to an in-scope variable "var". If it's an integer-like type,
it will be cast to a 64-bit int for systemtap script use. String-like
pointers (char *) may be copied to systemtap string values using the
.IR kernel_string " or " user_string
functions.
.TP
$var\->field
traversal to a structure's field. The indirection operator
may be repeated to follow more levels of pointers.
.TP
$var[N]
indexes into an array. The index is given with a
literal number.
.PP
For ".return" probes, context variables other than the "$return"
value itself are only available for the function call parameters.
The expressions evaluate to the
.IR entry-time
values of those variables, since that is when a snapshot is taken.
Other local variables are not generally accessible, since by the time
a ".return" probe hits, the probed function will have already returned.
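.PP
The following sketch shows context variables in use. It assumes the
target kernel has a function "vfs_write" with parameters named "file"
and "count", and that "struct file" has an "f_flags" field; these
names come from common kernel sources and are not guaranteed for
every version.
.SAMPLE
# At entry: a scalar parameter and a field reached through a pointer.
probe kernel.function("vfs_write")
{
  printf("%s: count=%d flags=0x%x\\n",
         execname(), $count, $file\->f_flags)
}

# At return: $count is the entry-time value, $return the result.
probe kernel.function("vfs_write").return
{
  printf("%s: asked=%d wrote=%d\\n", execname(), $count, $return)
}
.ESAMPLE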
.SS USER-SPACE
Early prototype support for user-space probing is available in the
form of a non-symbolic probe point:
.SAMPLE
process(PID).statement(ADDRESS).absolute
.ESAMPLE
This probe point is analogous to
.I kernel.statement(ADDRESS).absolute
in that both use raw (unverified) virtual addresses and provide
no $variables. The target PID parameter must identify a running
process, and ADDRESS should identify a valid instruction address.
All threads of that process will be probed.
.PP
Additional user-space probing is available in the following forms:
.SAMPLE
process(PID).begin
process("PATH").begin
process(PID).thread.begin
process("PATH").thread.begin
process(PID).end
process("PATH").end
process(PID).thread.end
process("PATH").thread.end
process(PID).syscall
process("PATH").syscall
process(PID).syscall.return
process("PATH").syscall.return
.ESAMPLE
.PP
A
.B .begin
probe gets called when a new process described by PID or PATH gets created.
A
.B .thread.begin
probe gets called when a new thread described by PID or PATH gets created.
A
.B .end
probe gets called when a process described by PID or PATH dies.
A
.B .thread.end
probe gets called when a thread described by PID or PATH dies.
A
.B .syscall
probe gets called when a thread described by PID or PATH makes a
system call. The system call number is available in the "$syscall"
context variable.
A
.B .syscall.return
probe gets called when a thread described by PID or PATH returns from a
system call. The system call number is available in the "$syscall"
context variable.
.PP
Note that
.I PATH
pathnames must be absolute.
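.PP
As a small sketch of these probes, the script below follows the
lifetime of one program. The path "/bin/ls" is only an example; any
absolute path to an executable should work in its place.
.SAMPLE
probe process("/bin/ls").begin
{
  printf("ls started, pid %d\\n", pid())
}

# Report each system call by number.
probe process("/bin/ls").syscall
{
  printf("pid %d: syscall %d\\n", pid(), $syscall)
}

probe process("/bin/ls").end
{
  printf("ls exited, pid %d\\n", pid())
}
.ESAMPLE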
.SS PROCFS
These probe points allow procfs "files" in
/proc/systemtap/MODNAME to be created, read and written
.RI ( MODNAME
is the name of the systemtap module). The
.I proc
filesystem is a pseudo-filesystem which is used as an interface to
kernel data structures. There are four probe point variants supported
by the translator:
.SAMPLE
procfs("PATH").read
procfs("PATH").write
procfs.read
procfs.write
.ESAMPLE
.I PATH
is the file name (relative to /proc/systemtap/MODNAME) to be created.
If no
.I PATH
is specified (as in the last two variants above),
.I PATH
defaults to "command".
.PP
When a user reads /proc/systemtap/MODNAME/PATH, the corresponding
procfs
.I read
probe is triggered. The string data to be read should be assigned to
a variable named
.IR $value ,
like this:
.SAMPLE
procfs("PATH").read { $value = "100\\n" }
.ESAMPLE
.PP
When a user writes into /proc/systemtap/MODNAME/PATH, the
corresponding procfs
.I write
probe is triggered. The data the user wrote is available in the
string variable named
.IR $value ,
like this:
.SAMPLE
procfs("PATH").write { printf("user wrote: %s", $value) }
.ESAMPLE
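.PP
The two probes can be paired so that a value written by the user is
returned on the next read. This is only a sketch: the file name
"status" and the global "message" are hypothetical names chosen for
the example.
.SAMPLE
global message = "hello\\n"

# Reading /proc/systemtap/MODNAME/status returns the stored string.
probe procfs("status").read { $value = message }

# Writing to the same file replaces the stored string.
probe procfs("status").write { message = $value }
.ESAMPLE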
.SS MARKERS
This family of probe points hooks up to static probing markers
inserted into the kernel or modules. These markers are special macro
calls inserted by kernel developers to make probing faster and more
reliable than with DWARF-based probes. Further, DWARF debugging
information is
.I not
required to probe markers.
Marker probe points begin with
.BR kernel .
The next part names the marker itself:
.BR mark("name") .
The marker name string, which may contain the usual wildcard characters,
is matched against the names given to the marker macros when the kernel
and/or module was compiled. Optionally, you can specify
.BR format("format") .
Specifying the marker format string allows differentiation between two
markers with the same name but different marker format strings.
The handler associated with a marker-based probe may read the
optional parameters specified at the macro call site. These are
named
.BR $arg1 " through " $argNN ,
where NN is the number of parameters supplied by the macro. Number
and string parameters are passed in a type-safe manner.
The marker format string associated with a marker is available in
.BR $format .
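.PP
As a rough sketch, a handler can report which markers fire and with
what format string. The second probe assumes that a marker named
"getuid" exists in the target kernel and takes at least one numeric
parameter; both are assumptions made for illustration.
.SAMPLE
# Report every marker hit along with its format string.
probe kernel.mark("*")
{
  printf("%s: %s\\n", pp(), $format)
}

# Read the first parameter of one specific marker.
probe kernel.mark("getuid")
{
  printf("getuid marker: arg1=%d\\n", $arg1)
}
.ESAMPLE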
.SS PERFORMANCE MONITORING HARDWARE
The perfmon family of probe points is used to access the performance
monitoring hardware available in modern processors. This family of
probe points requires perfmon2 support in the kernel to access that
hardware.
.PP
Performance monitor hardware probe points begin with
.BR perfmon .
The next part names the event being counted:
.BR counter("event") .
The event names are processor implementation specific, with the
exception of the generic
.BR cycles " and " instructions
events, which are available on all processors. This sets up a counter
on the processor to count the number of events occurring on the
processor. For more details on the performance monitoring events
available on a specific processor, use the perfmon2 command:
.SAMPLE
pfmon \-l
.ESAMPLE
.TP
$counter
is a handle used in the body of the probe for operations
involving the counter associated with the probe.
.TP
read_counter
is a function that is passed the handle for the perfmon probe and returns
the current count for the event.
.SH EXAMPLES
.PP
Here are some example probe points, defining the associated events.
.TP
begin, end, end
refers to the startup and normal shutdown of the session. In this
case, the handler would run once during startup and twice during
shutdown.
.TP
timer.jiffies(1000).randomize(200)
refers to a periodic interrupt, every 1000 +/\- 200 jiffies.
.TP
kernel.function("*init*"), kernel.function("*exit*")
refers to all kernel functions with "init" or "exit" in the name.
.TP
kernel.function("*@kernel/sched.c:240")
refers to any functions within the "kernel/sched.c" file that span
line 240.
.TP
kernel.mark("getuid")
refers to an STAP_MARK(getuid, ...) macro call in the kernel.
.TP
module("usb*").function("*sync*").return
refers to the moment of return from all functions with "sync" in the
name in any of the USB drivers.
.TP
kernel.statement(0xc0044852)
refers to the first byte of the statement whose compiled instructions
include the given address in the kernel.
.TP
kernel.statement("*@kernel/sched.c:2917")
refers to the statement of line 2917 within "kernel/sched.c".
.TP
kernel.statement("bio_init@fs/bio.c+3")
refers to the statement at line bio_init+3 within "fs/bio.c".
.TP
syscall.*.return
refers to the group of probe aliases with any name in the second position.
.SH SEE ALSO
.IR stap (1),
.IR stapprobes.iosched (5),
.IR stapprobes.netdev (5),
.IR stapprobes.nfs (5),
.IR stapprobes.nfsd (5),
.IR stapprobes.pagefault (5),
.IR stapprobes.process (5),
.IR stapprobes.rpc (5),
.IR stapprobes.scsi (5),
.IR stapprobes.signal (5),
.IR stapprobes.socket (5),
.IR stapprobes.tcp (5),
.IR stapprobes.udp (5),
.IR proc (5)