Evaluating Realtime Linux system performance with rteval
Clark Williams
--------------------------------------------------------

Abstract
--------

One of the problems of developing and fielding any software product that runs on a wide range of hardware platforms is determining how well the product performs on those platforms. This is especially true of something as closely tied to the hardware as a Realtime Linux system. So, how do we measure the performance of a realtime Linux kernel on a particular hardware platform? What defines "good performance" for a realtime system?

A realtime system is one which must meet deadlines of some sort. These deadlines may be periodic (e.g. occurring every 200 milliseconds) or they may be some time limit following the occurrence of an event (e.g. no more than 100 milliseconds following the arrival of a network packet). To give a realtime application the best chance of meeting its deadlines, a realtime OS must minimize the time between the occurrence of an event and the servicing of that event (the latency).

This paper describes the 'rteval' program, a Python 2.x program developed at Red Hat to help quantify realtime performance on the MRG Realtime kernel. Rteval is an attempt to put together a synthetic benchmark which mimics a well-behaved realtime application running on a heavily loaded realtime Linux system. It uses the 'cyclictest' program in the role of the realtime application and uses two loads, a parallel build of a Linux kernel and the scheduler benchmark 'hackbench', to boost the system load. Rteval runs for a specified length of time (typically 12 hours). When an rteval run completes, a statistical analysis of the results is performed and an XML file is generated containing the system state, the raw result data and the statistical analysis results; optionally the XML is sent via XML-RPC to a database for reporting.

The Load Applications
---------------------

Rteval uses two system loads. The first is a loop running a scheduler benchmark called 'hackbench'. Hackbench creates pairs of threads which send data from a sender to a receiver via a pipe or a socket. It creates many small processes which do lots of I/O, exercising the kernel scheduler by forcing many scheduling decisions. The rteval wrapper class for hackbench runs the hackbench program continually until signaled to stop by the main logic. The number of hackbench threads is determined by the number of cpu cores available on the test system.

The other load used by rteval is a parallel compile of the Linux kernel. Rteval has a module named 'kcompile' which controls the kernel build process, invoking make with twice the number of online cpus as simultaneous build jobs. The clean, bzImage and modules targets are built with an 'allmodconfig' configuration file to maximize the amount of compilation done. This results in a large amount of process creation (preprocessors, compilers, assemblers and linkers) as well as a moderately heavy file I/O load. The kernel build is repeated until the rteval runtime is reached.

The intent behind the load programs is to generate enough threads doing a balanced mix of operations (disk I/O, computation, IPC, etc.) that there is never a time when a processor core in the system under test does not have a process ready to run. The success of the loads can be measured by watching the system load average (either by examining /proc/loadavg or by running the 'uptime' or 'top' programs).
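As a rough illustration of this wrapper pattern (a minimal sketch, not rteval's actual module code), each load can be run in a thread that re-executes its command until the main logic signals a stop. The hackbench arguments, the kernel source path and the stop-event mechanism shown here are illustrative assumptions:

    import subprocess
    import threading
    import multiprocessing

    def run_load(cmd, stop_event):
        # Re-run the load command until the main logic signals a stop,
        # mirroring how rteval's load wrappers loop for the full run time.
        while not stop_event.is_set():
            subprocess.call(cmd)

    stop = threading.Event()
    ncpus = multiprocessing.cpu_count()

    # hackbench load, scaled to the number of cores
    # (the exact arguments are illustrative, not rteval's own)
    hackbench = ['hackbench', '-g', str(ncpus)]

    # kernel compile load: twice the number of online cpus as build jobs;
    # 'linux' stands in for wherever the kernel source tree lives
    kcompile = ['make', '-C', 'linux', '-j', str(ncpus * 2),
                'bzImage', 'modules']

    loads = []
    for cmd in (hackbench, kcompile):
        t = threading.Thread(target=run_load, args=(cmd, stop))
        t.start()
        loads.append(t)

    # ... run the measurement for the requested duration, then:
    # stop.set(); for t in loads: t.join()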
The Measurement Application
---------------------------

The cyclictest program is used as the realtime application. Cyclictest measures the delay between a timer's expiration and the time when the program waiting on that timer actually runs. It does this by taking a timestamp (t1) just before calling the timer wait function, then sleeping for a specified interval. Upon waking up, a second timestamp (t2) is taken, and the difference is calculated between the expected wakeup time (t1 + interval) and the actual wakeup time (t2). This difference is the event latency. For example, if the initial timestamp t1 is 1000 and the interval is 100, the expected wakeup time is 1100; if the actual wakeup timestamp (t2) is 1110, cyclictest reports a latency of 10.

Cyclictest is run in one of two modes, with either the --smp option or the --numa option, chosen based on the number of memory nodes detected on the system. Both modes create a measurement thread for each online cpu in the system, and these threads run with the SCHED_FIFO scheduling policy at priority 95. All memory allocated by cyclictest is locked into memory using the mlockall(2) system call (to eliminate major page faults). The measurement threads all run with the same interval (100 microseconds), using the clock_gettime(2) call to take timestamps and the clock_nanosleep(2) call to actually arm a timer. Cyclictest keeps a histogram of observed latency values for each thread, which is dumped to standard output and read by rteval when the run is complete.

The Results
-----------

The main idea behind the 'rteval' program was to get two pieces of information about the performance of the RT Linux kernel on a particular hardware platform:

1. What is the maximum latency seen?
2. How variable are the service times?

The first is easy: just iterate through the histogram returned for each cpu and find the highest index with a non-zero value. The second is a little more complicated.

Early in rteval development, rather than use a histogram, cyclictest would simply dump every sample to a file and rteval would parse the file after the run. Unfortunately, when you are sampling once every 100 microseconds on each cpu in the system, you generate a *lot* of data, especially when rteval is meant to run for many hours, possibly days. In that mode cyclictest writes a 26-character string for each sample, and sampling at 100us produces 10,000 samples per second per cpu, so a 1 hour run on a four core box yields:

    10,000 * 60 * 60 * 4 == 144,000,000 samples/hr

Multiply that by the 26-character string written for each sample and you write 3,744,000,000 bytes per hour to disk. A 12 hour run on a four core system would generate about 44 gigabytes of data. This was deemed excessive...

So the decision was made to record the latency values in histogram form, one histogram per measurement thread. This has the advantage of using only a fixed amount of memory to record samples, but the disadvantage of losing temporal ordering information, which would have allowed periodic latencies to be detected by looking at the timestamps of spikes. It also complicates statistics calculations that presume you have the entire data set for analysis. This was worked around by treating each non-zero histogram bucket as a series of identical samples for that index value.
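As a sketch of that workaround (illustrative only, not rteval's statistics module), the maximum latency and the moments can be recovered directly from a histogram mapping a latency value in microseconds to the number of samples observed at that value; population variance is shown here, though the exact estimator rteval uses may differ:

    import math

    def histogram_stats(hist):
        # hist maps latency (us) -> number of samples at that latency
        nsamples = sum(hist.values())

        # maximum latency: highest bucket index with a non-zero count
        maxlat = max(lat for lat, count in hist.items() if count > 0)

        # treat each non-zero bucket as 'count' samples of value 'lat'
        mean = sum(lat * count
                   for lat, count in hist.items()) / float(nsamples)
        variance = sum(count * (lat - mean) ** 2
                       for lat, count in hist.items()) / float(nsamples)
        return maxlat, mean, variance, math.sqrt(variance)

    # e.g. 9000 samples at 5us, 990 at 10us, 10 outliers at 87us
    print histogram_stats({5: 9000, 10: 990, 87: 10})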
The variability calculation is basically a stock standard deviation calculation: a mean is calculated for the data set, then the variance and standard deviation are derived from it. Other measures of variability, such as Mean Absolute Deviation, are calculated as well, but to date Standard Deviation has been a reliable indicator of the variability of service times. This variability is sometimes called 'jitter' in realtime parlance, after the plot the data would make on an oscilloscope. In addition to the mean, variance and standard deviation, rteval's statistics code calculates and stores the usual suspects, such as min, max, mode, median and range, both for each cpu core and aggregated across all cores.

Another challenge was identifying the underlying hardware platform so that runs on the same system could be grouped properly. The Desktop Management Interface (DMI) tables maintained by the BIOS were a good starting point, since they record information about the cpu, memory and peripheral devices in the system. Added to that information is some state captured while the test was running: kernel version, active clocksource, number of NUMA nodes available, kernel threads and their priorities, kernel modules loaded, and the state of the network interfaces.

Problems
--------

Using rteval has helped Red Hat locate areas that cause performance problems with realtime Linux kernels. Some of these problem areas are:

1. BIOS/SMI issues

Many systems use System Management Interrupts (SMIs) to perform system-critical operations without help from the running operating system, by trapping back into BIOS code. Unfortunately, this causes 'gaps in time' for the kernel, since nothing else can run while an SMI is being handled in the BIOS. Most of the time the SMI impact is negligible, since it is mostly thermal management: reading a thermocouple and turning fans on or off. Sometimes, though, the operation takes a long time (e.g. when EDAC logic needs to correct a memory error) and that can cause deadlines to be missed by many hundreds of microseconds. Red Hat has been working with hardware vendors to identify these hotspots and reduce their impact.

2. Kernel scalability issues

In the past few years the number of cores per socket on a motherboard has gone up from 2 to 8, exposing scalability problems in the kernel. One area that has received a lot of attention is the load balancer for SCHED_OTHER tasks. This is the logic in the kernel that attempts to make sure that each core in the system has tasks to run and that no one core is overloaded with tasks. During a load balancer pass, a core with a long run queue (indicating that many tasks are ready on that core) will have some of those tasks migrated to other cores, which requires holding the run queue locks of both the source and destination cores (meaning nothing can be scheduled on those cores in the meantime). In a stock Linux kernel, long SCHED_OTHER load balancer passes result in better cpu utilization and an overall throughput gain. Unfortunately, long load balancer passes can also result in missed deadlines, because a task on a core's run queue cannot run while the load balancer is running. To compensate for this, on realtime Linux the load balancer has a lower target number of migrations per pass and checks for contention on the run queue locks (a sign that a task is trying to be scheduled on one of the cores on which the balancer is operating). Research in this area is ongoing.
There is also a load balancer for realtime (SCHED_FIFO and SCHED_RR) threads, and similar research is being done toward reducing the overhead of that load balancer as well.

3. NUMA

In conjunction with the increase in the number of cpu cores per die has come the desire to reduce the amount of interconnect traces between cpus and memory nodes (as you pack the cores tighter, you have less room to run connections to memory and to other cores). One way to do this is to route a core's address/data/signal lines to some sort of switch module, such as AMD's HyperTransport (HT) mechanism or Intel's QuickPath Interconnect (QPI). With a series of switches in place, many cores can access memory and other cpu resources through the switch network, without the programs running on them knowing that they are going through a switch. This results in a Non-Uniform Memory Access (NUMA) architecture, which means that some memory accesses take longer than others because they traverse the switch network. NUMA is great for scaling up throughput-oriented servers, but it tends to hurt determinism if programs are not aware of the memory topology. A --numa option was added to the cyclictest program to use the libnuma library to bind each measurement thread to its local memory node and to allocate its memory on the closest node, minimizing the time required to access memory.

Further Development
-------------------

Once we started getting rteval run information, it was natural to want to store it in a database for further analysis (especially for watching for performance regressions). David Sommerseth created a set of tables for a PostgreSQL database and then added an option to rteval to ship the results back to a database server using XML-RPC. This option is currently used internally at Red Hat to ship rteval run data back to our internal DB server. There are no plans to open this data up to the public, but the XML-RPC code is there if someone else wants to use the facility. (No, there are no backdoors in the code that ships run data back to Red Hat; it's Python code, look and see!)
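For anyone wanting to reuse that facility, the client side amounts to little more than the standard library's XML-RPC support. The sketch below is illustrative only; the server URL, report filename and method name are made-up placeholders, not rteval's actual endpoint or API:

    import xmlrpclib

    # Placeholder endpoint; rteval's real server URL and method names
    # are defined by its own XML-RPC module.
    server = xmlrpclib.ServerProxy('http://rteval-db.example.com/RPC2')

    with open('summary.xml') as f:
        report = f.read()

    # Ship the XML report; Binary wrapping keeps the payload 8-bit safe.
    server.SubmitReport(xmlrpclib.Binary(report))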