.TH LKET 5 @DATE@ "IBM" .SH NAME LKET \- Linux Kernel Event Trace tool based on SystemTap .\" macros .de SAMPLE .br .RS .nf .nh .. .de ESAMPLE .hy .fi .RE .. .SH DESCRIPTION The Linux Kernel Event Trace (LKET) tool is an extension to the tapsets library available on SystemTap. Its goal is to utilize the dynamic probing capabilities provided through SystemTap to create a set of standard hooks that probe pre-defined places in the kernel. It can be used to collect important information that can be used as a starting point to analyze a performance problem in the system. The LKET tapsets are designed to only trace the events selected by the user. Once the data has been collected, it is then post-processed according to the need of the user. Trace data can be processed in various different ways to generate simple to complex reports. .SH EVENT HOOKS The following sections enumerate the variety of event hooks implemented in LKET and their trace data format. The trace data generated by different event hooks contain common data as well as some data specific to that event hook. the INT8, INT16, INT32, INT64 and STRING appeared in trace data format represents 8-bit, 16-bit, 32-bit, 64-bit binary data and NULL-terminated string respectively. The data common(i.e. .I common_data in the following subsecions) to all event hooks is: .RS .B usec(INT64),(tid<<32 | groupID<<24 | hookID<<16 | cpu_id<<8)(INT64) .RE Each event hook group is a collection of those hooks that have similarities of what they could trace. And the ID of each event hook (HookID) is defined in the context of its corresponding group. .SS EVENT REGISTER (GROUPID=1) Event register is not actually an event. It is used to log the metadata of the trace data, including the extra trace data appended by user. See .B EVENT REGISTER and .B CUSTOMIZED TRACE DATA for more details. .P .TP .B register_sys_event(HOOKID=1) This is a function used to register event hooks available in LKET. It should be called from register_event.stp:register_sys_events(). .TP .B register_user_event(HOOKID=2) This is a function used to log the metadata of the extra trace data appended by user for a specific event. It should be called in the probe .I register_event .SS SYSTEM CALLS (GROUPID=2) You could use .I addevent.syscall to trace the entry and return of all system calls. It contains two sub event hooks: .P .TP .B addevent.syscall.entry (HOOKID=1) Trace entry of all system calls. Data format is: .I common_data, syscall_name(STRING) .TP .B addevent.syscall.return (HOOKID=2) Trace return of all system calls. Data format is: .I common_data, syscall_name(STRING) .SS PROCESS CREATION (GROUPID=3) This group contains three sub event hooks. You could use .I addevent.process to trace fork and execve of processes(note that process_snapshot() won't be included). .P .TP .B process_snapshot()(HOOKID=1) This event hook isn't a probe definition but a function. It is called by LKET silently to take a snapshot of all running processes. Data format is: .I common_data, tid(INT32), pid(INT32), ppid(INT32), process_name(STRING) .P .TP .B addevent.process.fork (HOOKID=2) Trace fork of processes Data format is: .I common_data, new_tid(INT32), new_pid(INT32), ppid(INT32) .TP .B addevent.process.execve (HOOKID=3) Trace execve of new processes Data format is: .I common_data, pid(INT32), new_process_name(STRING) .SS IO SCHEDULER ACTIVITIES (GROUPID=4) You could use .I addevent.ioscheduler to trace the IO scheduler activities. It contains three sub event hooks: .P .TP .B addevent.ioscheduler.elv_add_request (HOOKID=1) Trace when a request is added to the request queue Data format is: .I common_data, elevator_name(STRING), disk_major(INT8), disk_minor(INT8), .I request(INT64), request_flags(INT64) .TP .B addevent.ioscheduler.elv_next_request (HOOKID=2) Trace when a request is retrieved from request queue Data format is: .I common_data, elevator_name(STRING), disk_major(INT8), disk_minor(INT8), .I request(INT64), request_flags(INT64) .TP .B addevent.ioscheduler.elv_completed_request (HOOKID=3) Trace when a request is completed Data format is: .I common_data, elevator_name(STRING), disk_major(INT8), disk_minor(INT8), .I request(INT64), request_flags(INT64) .SS TASK SCHEDULE ACTIVITIES (GROUPID=5) You could use .I addevent.tskdispatch to trace the task scheduler activities. It contains two sub event hooks: .P .TP .B addevent.tskdispatch.ctxswitch (HOOKID=1) Trace the process context switch Data format is: .I common_data, prev_pid(INT32), next_pid(INT32), prev_state(INT8) .TP .B addevent.tskdispatch.cpuidle (HOOKID=2) Trace when cpu goes idle Data format is: .I common_data, current_pid(INT32) .SS SCSI ACTIVITIES (GROUPID=6) You could use .I addevent.scsi to trace the scsi layer activities. It contains four sub event hooks: .P .TP .B addevent.scsi.ioentry (HOOKID=1) mid-layer prepares a IO request Data format is: .I common_data, disk_major(INT8), disk_minor(INT8), device_state(INT8), request(INT64) .TP .B addevent.scsi.iodispatching (HOOKID=2) Dispatch a command to the low-level driver Data format is: .I common_data, device_state(INT8), scsi_info(INT32), data_direction(INT8), .I reqbuf_addr(INT64), reqbuf_len(INT32), cmd_identifier(INT64), request(INT64) Where .I scsi_info is the combination of: .SAMPLE ((cmd->device->host->host_no & 0xFF) << 24) | ((cmd->device->channel & 0xFF) << 16) | ((cmd->device->lun & 0xFF) << 8) | (cmd->device->id & 0xFF) .ESAMPLE .TP .B addevent.scsi.iodone (HOOKID=3) I/O is done by low-level driver Data format is: .I common_data, scsi_info(INT32), data_direction(INT8), cmd_identifier(INT64), .I request(INT64) .TP .B addevent.scsi.iocompleted (HOOKID=4) mid-layer processed the completed IO Data format is: .I common_data, scsi_info(INT32), data_direction(INT8), cmd_identifier(INT64), .I bytes_done(INT32), request(INT64) .SS PAGE FAULT (GROUPID=7) You could use .I addevent.pagefault to trace page fault events. It contains only one sub event hooks: .P .TP .B addevent.pagefault (HOOKID=1) Data format is: .I common_data, memory_address(INT64), write_access(INT8) .SS NETWORK DEVICE ACTIVITIES (GROUPID=8) You could use .I addevent.netdev to trace the network device activities. It contains two sub event hooks: .P .TP .B addevent.netdev.receive (HOOKID=1) network device receives a packet Data format is: .I common_data, netdev_name(STRING), data_length(INT32), protocol(INT16), .I buffer_length(INT32) .TP .BR addevent.netdev.transmit (HOOKID=2) A packet will be sent out by network device Data format is: .I common_data, netdev_name(STRING), data_length(INT32), protocol(INT16), .I buffer_length(INT32) .SS IO SYSCALLS (GROUPID=9) You could use .I addevent.iosyscall to trace the detail activities of io related system calls. It contains 16 entry hooks and 16 corresponding return hooks. All the return hooks will only log the common_data and the return value. So in the following subsections, only the entry hooks will be listed: .P .TP .B addevent.iosyscall.open.entry (HOOKID=1) the entry of sys_open Data format is: .I common_data, filename(STRING), flags(INT32), mode(INT32) .TP .B addevent.iosyscall.close.entry (HOOKID=3) the entry of sys_close Data format is: .I common_data, fd(INT64) .TP .B addevent.iosyscall.read.entry (HOOKID=5) the entry of sys_read Data format is: .I common_data, fd(INT64), buf_addr(INT64), count(INT64) .TP .B addevent.iosyscall.write.entry (HOOKID=7) the entry of sys_write Data format is: .I common_data, fd(INT64), buf_addr(INT64), count(INT64) .TP .B addevent.iosyscall.readv.entry (HOOKID=9) the entry of sys_readv Data format is: .I common_data, fd(INT64), vector_addr(INT64), count(INT64) .TP .B addevent.iosyscall.writev.entry (HOOKID=11) the entry of sys_writev Data format is: .I common_data, fd(INT64), vector_addr(INT64), count(INT64) .TP .B addevent.iosyscall.pread64.entry (HOOKID=13) the entry of sys_pread64 Data format is: .I common_data, fd(INT64), buff_addr(INT64), count(INT64), offset(INT64) .TP .B addevent.iosyscall.pwrite64.entry (HOOKID=15) the entry of sys_pwrite64 Data format is: .I common_data, fd(INT64), buff_addr(INT64), count(INT64), offset(INT64) .TP .B addevent.iosyscall.readahead.entry (HOOKID=17) the entry of sys_readahead Data format is: .I common_data, fd(INT64), offset(INT64), count(INT64) .TP .B addevent.iosyscall.senfile.entry (HOOKID=19) the entry of sys_sendfile and sys_sendfile64 Data format is: .I common_data, out_fd(INT64), in_fd(INT64), offset_uaddr(INT64), count(INT64) .TP .B addevent.iosyscall.lseek.entry (HOOKID=21) the entry of sys_lseek Data format is: .I common_data, fd(INT64), offset(INT64), whence(INT8) .TP .B addevent.iosyscall.llseek.entry (HOOKID=23) the entry of sys_llseek Data format is: .I common_data, fd(INT64), offset_high(INT64), offset_low(INT64), .I result_addr(INT64), whence(INT8) .TP .B addevent.iosyscall.sync.entry (HOOKID=25) the entry of sys_sync Data format is: .I common_data .TP .B addevent.iosyscall.fsync.entry (HOOKID=27) the entry of sys_fsync Data format is: .I common_data, fd(INT64) .TP .B addevent.iosyscall.fdatasync.entry (HOOKID=29) the entry of sys_fdatasync Data format is: .I common_data, fd(INT64) .TP .B addevent.iosyscall.flock.entry (HOOKID=31) the entry of sys_flock Data format is: .I common_data, fd(INT64), operation(INT32) .SS Asynchronous IO (GROUPID=10) You could use .I addevent.aio to trace the detail activities of AIO related calls(most of them are AIO system calls). It contains 6 entry hooks and 6 corresponding return hooks. All the return hooks will only log the common_data and the return value. So in the following subsections, only the entry hooks will be listed: .P .TP .B addevent.aio.io_setup.entry (HOOKID=1) Fired by calling io_setup from user space. The corresponding system call is sys_io_setup, which will create an aio_context capable of receiving at least maxevents. Data format is: .I common_data, nr_events(INT32), ctxp_uaddr(INT64) .TP .B addevent.aio.io_submit.entry (HOOKID=3) Fired by calling io_submit from user space. The corresponding system call is sys_io_submit which will queue the nr iocbs pointed to by iocbpp_uaddr for processing. Data format is: .I common_data, ctx_id(INT64), nr(INT32), iocbpp_uaddr(INT64) .TP .B addevent.aio.io_submit_one.entry (HOOKID=5) Called by sys_io_submit. It will iterate iocbpp and process them one by one Data format is: .I common_data, ctx(INT64), user_iocb_uaddr(INT64), aio_lio_opcode(INT16), .I aio_reqprio(INT16), aio_fildes(INT32), aio_buf(INT64), aio_nbytes(INT64), .I aio_offset(INT64) .TP .B addevent.aio.io_getevents.entry (HOOKID=7) Fired by calling io_getevents from user space. The corresponding system call is sys_io_getevents, which will attempt to read at least min_nr events and up to nr events from the completion queue for the aio_context specified by ctx_id. Data format is: .I common_data, ctx_id(INT64), min_nr(INT32), nr(INT32), events_uaddr(INT64), .I tv_sec(INT32), tv_nsec(INT32) .TP .B addevent.aio.io_destroy.entry (HOOKID=9) Fired by calling io_destroy from user space. The corresponding system call is sys_io_destroy, which will destroy the aio_context specified. Data format is: .I common_data, ctx(INT64) .TP .B addevent.aio.io_cancel.entry (HOOKID=11) Fired by calling io_cancel from user space. The corresponding system call is sys_io_cancel, which will attempt to cancel an iocb previously passed to io_submit. Data format is: .I common_data, ctx_id(INT64), iocb_uaddr(INT64), result_uaddr(INT64) .SH TRACE DATA FORMAT By default, LKET will log the trace data in binary format. To get a better performance for binary tracing, the "-b" option should be turned on for stap and thus -M option has to be added to stop stpd merging per-cpu files. You could use the command .I lket-b2a to convert the binary trace data generated by LKET into readable data in ascii format. .I lket-b2a uses the pre-cpu binary trace data files as inputs, and generates an output file named .I lket.out You should use "stap -b -M" with LKET to get those pre-cpu files (stpd_cpu*) before using it. If you want LKET to log trace data in ASCII format directly, you should: .SAMPLE stap -D ASCII_TRACE ... .ESAMPLE .SH EVENT REGISTER LKET provides a way to log the metadata of the trace data by events registering. Two function is provided: .P .IP .SB register_sys_event(grpid:long, hookid:long, fmt:string, names:string) .IP .SB register_user_event(grpid:long, hookid:long, fmt:string, names:string) .P The .I grpid and .I hookid is the groupid and hookid of the event you want to register. .I fmt contains a set of fomat tokens seperated by ":". The valid format tokens are: .B UINT8, .B UINT16, .B UINT32, .B UINT64 and .B STRING which represents 8-bit, 16-bit, 32-bit, 64-bit binary data and NULL-terminated respectively. .I names contains a set of names seperated by ":". The names contains in .I names should match the format tokens contains in .I fmt .B register_sys_event is used to register the newly added event hooks. For example, supposing you want to add a new event hook to trace the entry of sys_open, and you want this event hook to log the fd, flag and mode paremeters for you. You should add: .SAMPLE register_sys_event(GROUP_IOSYSCALL, HOOKID_IOSYSCALL_OPEN_ENTRY, "STRING:INT32:INT32", "filename:flags:mode") .ESAMPLE into the function .B register_sys_events in LKET/register_event.stp .B register_user_event is used for user to add extra trace data for a event hook. See the section .B CUSTOMIZED TRACE DATA for more detail .SH CUSTOMIZED TRACE DATA LKET defines a set of event hooks and will log the predefined trace data for you, but what if you want to trace extra data for that event? LKET provides a way to do this without modifying the codes in the tapset of that event hook. You can simply use printf to trace extra data. For example, supposing you want to trace sk_buff->mac_len and sk_buff->priority besides the sk_buff->len, sk_buff->protocol and sk_buff->truesize for the .B netdev event hooks: .SAMPLE probe register_event { register_user_event(GROUP_NETDEV, HOOKID_NETDEV_TRANSMIT, "INT32:INT32", "mac_len:priority") } probe addevent.netdev { printf("%4b%4b", skb->mac_len, skb->priority) } .ESAMPLE .SH EXAMPLES Here are some examples of using LKET: .TP To turn on all event hooks: stap -e "probe addevent.* {}" -I /usr/share/systemtap/tapsets/LKET -bM .TP To probe syscall: stap -e "probe addevent.syscall {}" -I /usr/share/systemtap/tapsets/LKET -bM .TP To only probe syscall.entry: stap -e "probe addevent.syscall.entry {}" -I /usr/share/systemtap/tapsets/LKET -bM .TP To probe netdev transmition and log extra data of mac_len and priority: stap -e "probe addevent.netdev.transmit { printf(\\"%4b%4b\\", skb->mac_len, skb->priority) }" -I /usr/share/systemtap/tapsets/LKET -bM .SH SEE ALSO .IR stap (1)