author    Clark Williams <williams@redhat.com>    2010-06-14 15:04:36 -0500
committer Clark Williams <williams@redhat.com>    2010-06-14 15:04:36 -0500
commit    aef0ad6ec0f80c97e03891b97dd8f7221ec64c18 (patch)
tree      ac06a8378d3f02c8f1e6d907c21d38b8f682d06b /doc
parent    aafe28b72299ce0631174f116825ec93bd50a20d (diff)
updated doc/rteval.txt with better mlockall and load balancer explanations
Make clear that mlockall does not prevent page faults but just ensures that regions are kept in memory. Also explain how the OS load balancer can cause latencies.

Signed-off-by: Clark Williams <williams@redhat.com>
Diffstat (limited to 'doc')
-rw-r--r--  doc/rteval.txt  70
1 file changed, 37 insertions, 33 deletions
diff --git a/doc/rteval.txt b/doc/rteval.txt
index 40b4da9..68fd375 100644
--- a/doc/rteval.txt
+++ b/doc/rteval.txt
@@ -86,13 +86,13 @@ The cyclictest program is run in one of two modes, with either the
detected on the system. Both of these cases create a measurement
thread for each online cpu in the system and these threads are run
with a SCHED_FIFO scheduling policy at priority 95. All memory
-allocations done by cyclictest are locked into page tables using the
-mlockall(2) system call (to prevent page faults). The measurement
-threads are run with the same interval (100 microseconds) using the
-clock_gettime(2) call to get time stamps and the clock_nanosleep(2)
-call to actually invoke a timer. Cyclictest keeps a histogram of
-observed latency values for each thread, which is dumped to standard
-output and read by rteval when the run is complete.
+allocations done by cyclictest are locked into memory using the
+mlockall(2) system call (to eliminate major page faults). The
+measurement threads are run with the same interval (100 microseconds)
+using the clock_gettime(2) call to get time stamps and the
+clock_nanosleep(2) call to actually invoke a timer. Cyclictest keeps a
+histogram of observed latency values for each thread, which is dumped
+to standard output and read by rteval when the run is complete.
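
As a concrete illustration of the measurement loop just described, here
is a minimal, hypothetical C sketch (not taken from cyclictest itself):
it locks its address space with mlockall(2), switches itself to
SCHED_FIFO priority 95, and records how late each clock_nanosleep(2)
wakeup arrives relative to the requested 100 microsecond interval,
binning the result into a histogram. cyclictest does this with one
thread per online cpu; the sketch keeps everything in a single thread
for brevity.

    /* latency_sketch.c - illustrative only, not part of rteval or cyclictest.
     * Build (assumption): gcc -Wall latency_sketch.c -o latency_sketch -lrt
     * Needs privileges sufficient for SCHED_FIFO and mlockall. */
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <time.h>

    #define INTERVAL_NS 100000L         /* 100 microsecond measurement interval */
    #define LOOPS       10000L
    #define HIST_SIZE   1000            /* histogram buckets, 1us per bucket */

    static unsigned long hist[HIST_SIZE];

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 95 };
        struct timespec next, now;
        long i;

        /* keep current and future pages resident to avoid major page faults */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            perror("mlockall");

        /* run the measurement thread as SCHED_FIFO priority 95 */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
            perror("sched_setscheduler");

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (i = 0; i < LOOPS; i++) {
            /* program the next absolute wakeup, 100us in the future */
            next.tv_nsec += INTERVAL_NS;
            while (next.tv_nsec >= 1000000000L) {
                next.tv_nsec -= 1000000000L;
                next.tv_sec++;
            }
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
            clock_gettime(CLOCK_MONOTONIC, &now);

            /* latency = how far past the requested wakeup we actually woke */
            long lat_us = (now.tv_sec - next.tv_sec) * 1000000L +
                          (now.tv_nsec - next.tv_nsec) / 1000L;
            if (lat_us < 0)
                lat_us = 0;
            if (lat_us >= HIST_SIZE)
                lat_us = HIST_SIZE - 1;
            hist[lat_us]++;
        }

        /* dump the histogram to standard output, one bucket per line */
        for (i = 0; i < HIST_SIZE; i++)
            if (hist[i])
                printf("%06ld %lu\n", i, hist[i]);
        return 0;
    }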
The Results
-----------
@@ -183,23 +183,27 @@ impact.
In the past few years, the number of cores per socket on a motherboard
has gone up from 2 to 8, resulting in some scalability problems in the
kernel. One area that has received a lot of attention is the load
-balancer. This is logic in the kernel that attempts to make sure that
-each core in the system has tasks to run and that no one core is
-overloaded with tasks. During a load balancer pass, a core with a long
-run queue (indicating there are many tasks ready on that core) will
-have some of those tasks migrated to other cores, which requires that
-both the current and destination cores run queue locks being held
-(meaning nothing can run on those cores).
-
-In a stock Linux kernel long load balancer passes result in more
-utilization of cpus and an overall througput gain. Unfortunately long
-load balancer passes can result in missed deadlines because a task on
-the run queue for a core cannot run while the loadbalancer is
-running. To compensate for this on realtime Linux the load balancer
-has a lower number of target migrations and looks for contention on
-the run queue locks (meaning that a task is trying to be scheduled on
-one of the cores on which the balancer is operating). Research in this
-area is ongoing.
+balancer for SCHED_OTHER tasks. This is logic in the kernel that
+attempts to make sure that each core in the system has tasks to run
+and that no one core is overloaded with tasks. During a load balancer
+pass, a core with a long run queue (indicating there are many tasks
+ready on that core) will have some of those tasks migrated to other
+cores, which requires that both the current and destination cores'
+run queue locks be held (meaning nothing can run on those cores).
+
+In a stock Linux kernel, long SCHED_OTHER load balancer passes result
+in higher utilization of cpus and an overall throughput gain.
+Unfortunately, long load balancer passes can result in missed
+deadlines because a task on the run queue for a core cannot run while
+the load balancer is running. To compensate for this, on realtime
+Linux the load balancer has a lower number of target migrations and
+looks for contention on the run queue locks (meaning that a task is
+trying to be scheduled on one of the cores on which the balancer is
+operating). Research in this area is ongoing.
+
+There is also a load balancer for realtime (SCHED_FIFO and SCHED_RR)
+threads, and similar research is being done toward reducing the
+overhead of this load balancer as well.
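
As an aside, this migration behavior is one reason latency-sensitive
applications often pin their critical threads to a fixed cpu, taking
them out of the balancer's candidate set. Below is a minimal,
hypothetical sketch of doing that (not part of rteval or cyclictest;
pin_and_make_rt is an invented helper name).

    /* pin_rt_thread.c - illustrative sketch only; assumes Linux with glibc.
     * Pins the calling thread to a single cpu and switches it to SCHED_FIFO
     * so the SCHED_OTHER load balancer never migrates it. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* hypothetical helper: returns 0 on success, -1 on failure */
    static int pin_and_make_rt(int cpu, int priority)
    {
        cpu_set_t mask;
        struct sched_param sp = { .sched_priority = priority };

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);

        /* restrict the calling thread to the chosen cpu */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return -1;
        }

        /* make it a realtime thread; it is now handled by the realtime
         * load balancer rather than the SCHED_OTHER one */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
            perror("sched_setscheduler");
            return -1;
        }
        return 0;
    }

    int main(void)
    {
        if (pin_and_make_rt(1, 80) == 0)
            printf("running as SCHED_FIFO/80 on cpu 1\n");
        return 0;
    }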
<what other areas?>
@@ -210,15 +214,15 @@ has been the desire to reduce the amount of interconnect traces
between cpus and memory nodes (as you pack the cores tighter, you have
less room to run connections to memory and other cores). One way to do
this is to route a core's address/data/signal lines to some sort of
-switch module, such as AMD's HyperTransport mechanism. With a series
-of switches in place many cores can access memory and other cpu
-resources through the switch network without programs running on them
-knowing they're going through a switch. This results in a Non-Uniform
-Memory Access (NUMA) architecuture, which means that some memory
-accesses will take longer than others due to traversing the switch
-network. NUMA is great for scaling up throughput oriented servers, but
-tends to hurt determinism, if the programs are not aware of the memory
-topology.
+switch module, such as AMD's HyperTransport (HT) mechanism or Intel's
+QuickPath Interconnect (QPI). With a series of switches in place, many
+cores can access memory and other cpu resources through the switch
+network without programs running on them knowing they're going through
+a switch. This results in a Non-Uniform Memory Access (NUMA)
+architecture, which means that some memory accesses will take longer
+than others due to traversing the switch network. NUMA is great for
+scaling up throughput-oriented servers, but tends to hurt determinism
+if the programs are not aware of the memory topology.
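
Being aware of the memory topology usually comes down to keeping a
thread and the memory it touches on the same node. A short,
hypothetical libnuma sketch of that idea (not taken from cyclictest;
it assumes libnuma 2.x and linking with -lnuma) might look like this:

    /* numa_local.c - illustrative libnuma sketch, not part of rteval.
     * Build (assumption): gcc -Wall numa_local.c -o numa_local -lnuma */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        size_t sz = 1 << 20;            /* 1MB of measurement buffers */
        int cpu, node;
        void *buf;

        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        /* find the memory node that backs the cpu we are running on */
        cpu = sched_getcpu();
        node = numa_node_of_cpu(cpu);

        /* keep the thread on that node and allocate its memory there,
         * so accesses never traverse the switch network */
        numa_run_on_node(node);
        buf = numa_alloc_onnode(sz, node);
        if (!buf) {
            perror("numa_alloc_onnode");
            return 1;
        }

        printf("cpu %d uses node-%d local memory at %p\n", cpu, node, buf);
        numa_free(buf, sz);
        return 0;
    }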
A --numa option was added to the cyclictest program to use the libnuma
library to bind threads to local memory nodes and allocate memory on