author    Clark Williams <williams@redhat.com>    2010-06-14 15:04:36 -0500
committer Clark Williams <williams@redhat.com>    2010-06-14 15:04:36 -0500
commit    aef0ad6ec0f80c97e03891b97dd8f7221ec64c18 (patch)
tree      ac06a8378d3f02c8f1e6d907c21d38b8f682d06b /doc
parent    aafe28b72299ce0631174f116825ec93bd50a20d (diff)
updated doc/rteval.txt with better mlockall and load balancer explanations
Make clear that mlockall does not prevent page faults but just ensures that regions are kept in memory. Also explain how the OS load balancer can cause latencies.

Signed-off-by: Clark Williams <williams@redhat.com>
Diffstat (limited to 'doc')
-rw-r--r--  doc/rteval.txt  70
1 file changed, 37 insertions, 33 deletions
diff --git a/doc/rteval.txt b/doc/rteval.txt
index 40b4da9..68fd375 100644
--- a/doc/rteval.txt
+++ b/doc/rteval.txt
@@ -86,13 +86,13 @@ The cyclictest program is run in one of two modes, with either the
detected on the system. Both of these cases create a measurement
thread for each online cpu in the system and these threads are run
with a SCHED_FIFO scheduling policy at priority 95. All memory
-allocations done by cyclictest are locked into page tables using the
-mlockall(2) system call (to prevent page faults). The measurement
-threads are run with the same interval (100 microseconds) using the
-clock_gettime(2) call to get time stamps and the clock_nanosleep(2)
-call to actually invoke a timer. Cyclictest keeps a histogram of
-observed latency values for each thread, which is dumped to standard
-output and read by rteval when the run is complete.
+allocations done by cyclictest are locked into memory using the
+mlockall(2) system call (to eliminate major page faults). The
+measurement threads are run with the same interval (100 microseconds)
+using the clock_gettime(2) call to get time stamps and the
+clock_nanosleep(2) call to actually invoke a timer. Cyclictest keeps a
+histogram of observed latency values for each thread, which is dumped
+to standard output and read by rteval when the run is complete.
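
As a concrete illustration of the measurement loop just described, here
is a minimal, hypothetical C sketch (not taken from cyclictest itself):
it locks its address space with mlockall(2), switches itself to
SCHED_FIFO priority 95, and records how late each clock_nanosleep(2)
wakeup arrives relative to the requested 100 microsecond interval,
binning the result into a histogram. cyclictest does this with one
thread per online cpu; the sketch keeps everything in a single thread
for brevity.

    /* latency_sketch.c - illustrative only, not part of rteval or cyclictest.
     * Build (assumption): gcc -Wall latency_sketch.c -o latency_sketch -lrt
     * Needs privileges sufficient for SCHED_FIFO and mlockall. */
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <time.h>

    #define INTERVAL_NS 100000L         /* 100 microsecond measurement interval */
    #define LOOPS       10000L
    #define HIST_SIZE   1000            /* histogram buckets, 1us per bucket */

    static unsigned long hist[HIST_SIZE];

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 95 };
        struct timespec next, now;
        long i;

        /* keep current and future pages resident to avoid major page faults */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            perror("mlockall");

        /* run the measurement thread as SCHED_FIFO priority 95 */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
            perror("sched_setscheduler");

        clock_gettime(CLOCK_MONOTONIC, &next);
        for (i = 0; i < LOOPS; i++) {
            /* program the next absolute wakeup, 100us in the future */
            next.tv_nsec += INTERVAL_NS;
            while (next.tv_nsec >= 1000000000L) {
                next.tv_nsec -= 1000000000L;
                next.tv_sec++;
            }
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
            clock_gettime(CLOCK_MONOTONIC, &now);

            /* latency = how far past the requested wakeup we actually woke */
            long lat_us = (now.tv_sec - next.tv_sec) * 1000000L +
                          (now.tv_nsec - next.tv_nsec) / 1000L;
            if (lat_us < 0)
                lat_us = 0;
            if (lat_us >= HIST_SIZE)
                lat_us = HIST_SIZE - 1;
            hist[lat_us]++;
        }

        /* dump the histogram to standard output, one bucket per line */
        for (i = 0; i < HIST_SIZE; i++)
            if (hist[i])
                printf("%06ld %lu\n", i, hist[i]);
        return 0;
    }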
The Results
-----------
@@ -183,23 +183,27 @@ impact.
In the past few years, the number of cores per socket on a motherboard
has gone up from 2 to 8, resulting in some scalability problems in the
kernel. One area that has received a lot of attention is the load
-balancer. This is logic in the kernel that attempts to make sure that
-each core in the system has tasks to run and that no one core is
-overloaded with tasks. During a load balancer pass, a core with a long
-run queue (indicating there are many tasks ready on that core) will
-have some of those tasks migrated to other cores, which requires that
-both the current and destination cores run queue locks being held
-(meaning nothing can run on those cores).
-
-In a stock Linux kernel long load balancer passes result in more
-utilization of cpus and an overall througput gain. Unfortunately long
-load balancer passes can result in missed deadlines because a task on
-the run queue for a core cannot run while the loadbalancer is
-running. To compensate for this on realtime Linux the load balancer
-has a lower number of target migrations and looks for contention on
-the run queue locks (meaning that a task is trying to be scheduled on
-one of the cores on which the balancer is operating). Research in this
-area is ongoing.
+balancer for SCHED_OTHER tasks. This is logic in the kernel that
+attempts to make sure that each core in the system has tasks to run
+and that no one core is overloaded with tasks. During a load balancer
+pass, a core with a long run queue (indicating there are many tasks
+ready on that core) will have some of those tasks migrated to other
+cores, which requires that both the current and destination cores'
+run queue locks be held (meaning nothing can run on those cores).
+
+In a stock Linux kernel, long SCHED_OTHER load balancer passes result
+in higher utilization of cpus and an overall throughput gain.
+Unfortunately, long load balancer passes can result in missed
+deadlines because a task on the run queue for a core cannot run while
+the load balancer is running. To compensate for this, on realtime
+Linux the load balancer has a lower number of target migrations and
+looks for contention on the run queue locks (meaning that a task is
+trying to be scheduled on one of the cores on which the balancer is
+operating). Research in this area is ongoing.
+
+There is also a load balancer for realtime (SCHED_FIFO and SCHED_RR)
+threads, and similar research is being done toward reducing the
+overhead of this load balancer as well.
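
As an aside, this migration behavior is one reason latency-sensitive
applications often pin their critical threads to a fixed cpu, taking
them out of the balancer's candidate set. Below is a minimal,
hypothetical sketch of doing that (not part of rteval or cyclictest;
pin_and_make_rt is an invented helper name).

    /* pin_rt_thread.c - illustrative sketch only; assumes Linux with glibc.
     * Pins the calling thread to a single cpu and switches it to SCHED_FIFO
     * so the SCHED_OTHER load balancer never migrates it. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* hypothetical helper: returns 0 on success, -1 on failure */
    static int pin_and_make_rt(int cpu, int priority)
    {
        cpu_set_t mask;
        struct sched_param sp = { .sched_priority = priority };

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);

        /* restrict the calling thread to the chosen cpu */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return -1;
        }

        /* make it a realtime thread; it is now handled by the realtime
         * load balancer rather than the SCHED_OTHER one */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
            perror("sched_setscheduler");
            return -1;
        }
        return 0;
    }

    int main(void)
    {
        if (pin_and_make_rt(1, 80) == 0)
            printf("running as SCHED_FIFO/80 on cpu 1\n");
        return 0;
    }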
<what other areas?>
@@ -210,15 +214,15 @@ has been the desire to reduce the amount of interconnect traces
between cpus and memory nodes (as you pack the cores tighter, you have
less room to run connections to memory and other cores). One way to do
this is to route a core's address/data/signal lines to some sort of
-switch module, such as AMD's HyperTransport mechanism. With a series
-of switches in place many cores can access memory and other cpu
-resources through the switch network without programs running on them
-knowing they're going through a switch. This results in a Non-Uniform
-Memory Access (NUMA) architecuture, which means that some memory
-accesses will take longer than others due to traversing the switch
-network. NUMA is great for scaling up throughput oriented servers, but
-tends to hurt determinism, if the programs are not aware of the memory
-topology.
+switch module, such as AMD's HyperTransport (HT) mechanism or Intel's
+QuickPath Interconnect (QPI). With a series of switches in place, many
+cores can access memory and other cpu resources through the switch
+network without programs running on them knowing they're going through
+a switch. This results in a Non-Uniform Memory Access (NUMA)
+architecture, which means that some memory accesses will take longer
+than others due to traversing the switch network. NUMA is great for
+scaling up throughput-oriented servers, but tends to hurt determinism
+if the programs are not aware of the memory topology.
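
Being aware of the memory topology usually comes down to keeping a
thread and the memory it touches on the same node. A short,
hypothetical libnuma sketch of that idea (not taken from cyclictest;
it assumes libnuma 2.x and linking with -lnuma) might look like this:

    /* numa_local.c - illustrative libnuma sketch, not part of rteval.
     * Build (assumption): gcc -Wall numa_local.c -o numa_local -lnuma */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        size_t sz = 1 << 20;            /* 1MB of measurement buffers */
        int cpu, node;
        void *buf;

        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        /* find the memory node that backs the cpu we are running on */
        cpu = sched_getcpu();
        node = numa_node_of_cpu(cpu);

        /* keep the thread on that node and allocate its memory there,
         * so accesses never traverse the switch network */
        numa_run_on_node(node);
        buf = numa_alloc_onnode(sz, node);
        if (!buf) {
            perror("numa_alloc_onnode");
            return 1;
        }

        printf("cpu %d uses node-%d local memory at %p\n", cpu, node, buf);
        numa_free(buf, sz);
        return 0;
    }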
A --numa option was added to the cyclictest program to use the libnuma
library to bind threads to local memory nodes and allocate memory on