Evaluating Realtime Linux system performance with rteval
Clark Williams
--------------------------------------------------------

Abstract
--------

One of the problems of developing and fielding any software product that runs on a wide range of hardware platforms is determining how well the product performs on those platforms. This is especially true of something as closely tied to the hardware as a Realtime Linux system. So, how do we measure the performance of a realtime Linux kernel on a particular hardware platform? What defines "good performance" for a realtime system?

A realtime system is one which must meet deadlines of some sort. These deadlines may be periodic (e.g. occurring every 200 milliseconds) or they may be some time limit following the occurrence of an event (e.g. no more than 100 milliseconds following the arrival of a network packet). To give a realtime application the best chance of meeting its deadlines, a realtime OS must minimize the time between the occurrence of an event and the servicing of that event (the latency).

This paper describes the 'rteval' program, a Python 2.x program developed at Red Hat to help quantify realtime performance on the MRG Realtime kernel. Rteval is an attempt to put together a synthetic benchmark which mimics a well-behaved realtime application running on a heavily loaded realtime Linux system. It uses the 'cyclictest' program in the role of the realtime application and uses two loads, a parallel build of a Linux kernel and the scheduler benchmark 'hackbench', to boost the system load. Rteval runs for a specified length of time (typically 12 hours). When an rteval run completes, a statistical analysis of the results is performed and an XML file is generated containing the system state, the raw result data and the statistical analysis results; optionally the XML is sent via XML-RPC to a database for reporting.

The Load Applications
---------------------

Rteval uses two system loads. The first is a loop running a scheduler benchmark called 'hackbench'. Hackbench creates pairs of threads which send data from a sender to a receiver via a pipe or a socket. It creates many small processes which do lots of I/O, exercising the kernel scheduler by forcing many scheduling decisions. The rteval wrapper class for hackbench runs the hackbench program continually until signaled to stop by the main logic. The number of hackbench threads is determined by the number of cpu cores available on the test system.

The other load used by rteval is a parallel compile of the Linux kernel. Rteval has a module named 'kcompile' which controls the kernel build process, invoking make with twice the number of online cpus as simultaneous build jobs. The clean, bzImage and modules targets are built with an 'allmodconfig' configuration file to maximize the amount of compilation done. This results in a large amount of process creation (preprocessors, compilers, assemblers and linkers) as well as a moderately heavy file I/O load. The kernel build is repeated until the rteval runtime is reached.

The intent behind the load programs is to generate enough threads doing a balanced mix of operations (disk I/O, computation, IPC, etc.) that there is never a time when a processor core in the system under test does not have a process ready to run. The success of the loads can be measured by watching the system load average (either by examining /proc/loadavg or by running the 'uptime' or 'top' programs).
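As a rough illustration of this wrapper pattern (a minimal sketch, not rteval's actual module code), each load can be run in a thread that re-executes its command until the main logic signals a stop. The hackbench arguments, the kernel source path and the stop-event mechanism shown here are illustrative assumptions:

    import subprocess
    import threading
    import multiprocessing

    def run_load(cmd, stop_event):
        # Re-run the load command until the main logic signals a stop,
        # mirroring how rteval's load wrappers loop for the full run time.
        while not stop_event.is_set():
            subprocess.call(cmd)

    stop = threading.Event()
    ncpus = multiprocessing.cpu_count()

    # hackbench load, scaled to the number of cores
    # (the exact arguments are illustrative, not rteval's own)
    hackbench = ['hackbench', '-g', str(ncpus)]

    # kernel compile load: twice the number of online cpus as build jobs;
    # 'linux' stands in for wherever the kernel source tree lives
    kcompile = ['make', '-C', 'linux', '-j', str(ncpus * 2),
                'bzImage', 'modules']

    loads = []
    for cmd in (hackbench, kcompile):
        t = threading.Thread(target=run_load, args=(cmd, stop))
        t.start()
        loads.append(t)

    # ... run the measurement for the requested duration, then:
    # stop.set(); for t in loads: t.join()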
The Measurement Application
---------------------------

The cyclictest program is used as the realtime application. Cyclictest measures the delay between a timer's expiration and the time when the program waiting on that timer actually runs. It does this by taking a timestamp (t1) just before calling the timer wait function, then sleeping for a specified interval. Upon waking up, a second timestamp (t2) is taken, and the difference is calculated between the expected wakeup time (t1 + interval) and the actual wakeup time (t2). This difference is the event latency. For example, if the initial timestamp t1 is 1000 and the interval is 100, the expected wakeup time is 1100; if the actual wakeup timestamp (t2) is 1110, cyclictest reports a latency of 10.

Cyclictest is run in one of two modes, with either the --smp option or the --numa option, chosen based on the number of memory nodes detected on the system. Both modes create a measurement thread for each online cpu in the system, and these threads run with the SCHED_FIFO scheduling policy at priority 95. All memory allocated by cyclictest is locked into memory using the mlockall(2) system call (to eliminate major page faults). The measurement threads all run with the same interval (100 microseconds), using the clock_gettime(2) call to take timestamps and the clock_nanosleep(2) call to actually arm a timer. Cyclictest keeps a histogram of observed latency values for each thread, which is dumped to standard output and read by rteval when the run is complete.

The Results
-----------

The main idea behind the 'rteval' program was to get two pieces of information about the performance of the RT Linux kernel on a particular hardware platform:

1. What is the maximum latency seen?
2. How variable are the service times?

The first is easy: just iterate through the histogram returned for each cpu and find the highest index with a non-zero value. The second is a little more complicated.

Early in rteval development, rather than use a histogram, cyclictest would simply dump every sample to a file and rteval would parse the file after the run. Unfortunately, when you are sampling once every 100 microseconds on each cpu in the system, you generate a *lot* of data, especially when rteval is meant to run for many hours, possibly days. In that mode cyclictest writes a 26-character string for each sample, and sampling at 100us produces 10,000 samples per second per cpu, so a 1 hour run on a four core box yields:

    10,000 * 60 * 60 * 4 == 144,000,000 samples/hr

Multiply that by the 26-character string written for each sample and you write 3,744,000,000 bytes per hour to disk. A 12 hour run on a four core system would generate about 44 gigabytes of data. This was deemed excessive...

So the decision was made to record the latency values in histogram form, one histogram per measurement thread. This has the advantage of using only a fixed amount of memory to record samples, but the disadvantage of losing temporal ordering information, which would have allowed periodic latencies to be detected by looking at the timestamps of spikes. It also complicates statistics calculations that presume you have the entire data set for analysis. This was worked around by treating each non-zero histogram bucket as a series of identical samples for that index value.
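As a sketch of that workaround (illustrative only, not rteval's statistics module), the maximum latency and the moments can be recovered directly from a histogram mapping a latency value in microseconds to the number of samples observed at that value; population variance is shown here, though the exact estimator rteval uses may differ:

    import math

    def histogram_stats(hist):
        # hist maps latency (us) -> number of samples at that latency
        nsamples = sum(hist.values())

        # maximum latency: highest bucket index with a non-zero count
        maxlat = max(lat for lat, count in hist.items() if count > 0)

        # treat each non-zero bucket as 'count' samples of value 'lat'
        mean = sum(lat * count
                   for lat, count in hist.items()) / float(nsamples)
        variance = sum(count * (lat - mean) ** 2
                       for lat, count in hist.items()) / float(nsamples)
        return maxlat, mean, variance, math.sqrt(variance)

    # e.g. 9000 samples at 5us, 990 at 10us, 10 outliers at 87us
    print histogram_stats({5: 9000, 10: 990, 87: 10})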
The variability calculation is basically a stock standard deviation calculation: a mean is calculated for the data set, then the variance and standard deviation are derived from it. Other measures of variability, such as Mean Absolute Deviation, are calculated as well, but to date Standard Deviation has been a reliable indicator of the variability of service times. This variability is sometimes called 'jitter' in realtime parlance, after the plot the data would make on an oscilloscope. In addition to the mean, variance and standard deviation, rteval's statistics code calculates and stores the usual suspects, such as min, max, mode, median and range, both for each cpu core and aggregated across all cores.

Another challenge was identifying the underlying hardware platform so that runs on the same system could be grouped properly. The Desktop Management Interface (DMI) tables maintained by the BIOS were a good starting point, since they record information about the cpu, memory and peripheral devices in the system. Added to that information is some state captured while the test was running: kernel version, active clocksource, number of NUMA nodes available, kernel threads and their priorities, kernel modules loaded, and the state of the network interfaces.

Problems
--------

Using rteval has helped Red Hat locate areas that cause performance problems with realtime Linux kernels. Some of these problem areas are:

1. BIOS/SMI issues

Many systems use System Management Interrupts (SMIs) to perform system-critical operations without help from the running operating system, by trapping back into BIOS code. Unfortunately, this causes 'gaps in time' for the kernel, since nothing else can run while an SMI is being handled in the BIOS. Most of the time the SMI impact is negligible, since it is mostly thermal management: reading a thermocouple and turning fans on or off. Sometimes, though, the operation takes a long time (e.g. when EDAC logic needs to correct a memory error) and that can cause deadlines to be missed by many hundreds of microseconds. Red Hat has been working with hardware vendors to identify these hotspots and reduce their impact.

2. Kernel scalability issues

In the past few years the number of cores per socket on a motherboard has gone up from 2 to 8, exposing scalability problems in the kernel. One area that has received a lot of attention is the load balancer for SCHED_OTHER tasks. This is the logic in the kernel that attempts to make sure that each core in the system has tasks to run and that no one core is overloaded with tasks. During a load balancer pass, a core with a long run queue (indicating that many tasks are ready on that core) will have some of those tasks migrated to other cores, which requires holding the run queue locks of both the source and destination cores (meaning nothing can be scheduled on those cores in the meantime). In a stock Linux kernel, long SCHED_OTHER load balancer passes result in better cpu utilization and an overall throughput gain. Unfortunately, long load balancer passes can also result in missed deadlines, because a task on a core's run queue cannot run while the load balancer is running. To compensate for this, on realtime Linux the load balancer has a lower target number of migrations per pass and checks for contention on the run queue locks (a sign that a task is trying to be scheduled on one of the cores on which the balancer is operating). Research in this area is ongoing.
There is also a load balancer for realtime (SCHED_FIFO and SCHED_RR) threads, and similar research is being done toward reducing the overhead of that load balancer as well.

3. NUMA

In conjunction with the increase in the number of cpu cores per die has come the desire to reduce the amount of interconnect traces between cpus and memory nodes (as you pack the cores tighter, you have less room to run connections to memory and to other cores). One way to do this is to route a core's address/data/signal lines to some sort of switch module, such as AMD's HyperTransport (HT) mechanism or Intel's QuickPath Interconnect (QPI). With a series of switches in place, many cores can access memory and other cpu resources through the switch network, without the programs running on them knowing that they are going through a switch. This results in a Non-Uniform Memory Access (NUMA) architecture, which means that some memory accesses take longer than others because they traverse the switch network. NUMA is great for scaling up throughput-oriented servers, but it tends to hurt determinism if programs are not aware of the memory topology. A --numa option was added to the cyclictest program to use the libnuma library to bind each measurement thread to its local memory node and to allocate its memory on the closest node, minimizing the time required to access memory.

Further Development
-------------------

Once we started getting rteval run information, it was natural to want to store it in a database for further analysis (especially for watching for performance regressions). David Sommerseth created a set of tables for a PostgreSQL database and then added an option to rteval to ship the results back to a database server using XML-RPC. This option is currently used internally at Red Hat to ship rteval run data back to our internal DB server. There are no plans to open this data up to the public, but the XML-RPC code is there if someone else wants to use the facility. (No, there are no backdoors in the code that ships run data back to Red Hat; it's Python code, look and see!)
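For anyone wanting to reuse that facility, the client side amounts to little more than the standard library's XML-RPC support. The sketch below is illustrative only; the server URL, report filename and method name are made-up placeholders, not rteval's actual endpoint or API:

    import xmlrpclib

    # Placeholder endpoint; rteval's real server URL and method names
    # are defined by its own XML-RPC module.
    server = xmlrpclib.ServerProxy('http://rteval-db.example.com/RPC2')

    with open('summary.xml') as f:
        report = f.read()

    # Ship the XML report; Binary wrapping keeps the payload 8-bit safe.
    server.SubmitReport(xmlrpclib.Binary(report))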