Diffstat (limited to 'Documentation/scheduler')
 Documentation/scheduler/00-INDEX              |  14 -
 Documentation/scheduler/sched-arch.txt        |  89 -
 Documentation/scheduler/sched-bwc.txt         | 122 -
 Documentation/scheduler/sched-design-CFS.txt  | 244 -
 Documentation/scheduler/sched-domains.txt     |  77 -
 Documentation/scheduler/sched-nice-design.txt | 108 -
 Documentation/scheduler/sched-rt-group.txt    | 183 -
 Documentation/scheduler/sched-stats.txt       | 154 -
 8 files changed, 0 insertions(+), 991 deletions(-)
diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX
deleted file mode 100644
index d2651c47ae2..00000000000
--- a/Documentation/scheduler/00-INDEX
+++ /dev/null
@@ -1,14 +0,0 @@
-00-INDEX
- - this file.
-sched-arch.txt
- - CPU Scheduler implementation hints for architecture specific code.
-sched-design-CFS.txt
- - goals, design and implementation of the Completely Fair Scheduler.
-sched-domains.txt
- - information on scheduling domains.
-sched-nice-design.txt
- - How and why the scheduler's nice levels are implemented.
-sched-rt-group.txt
- - real-time group scheduling.
-sched-stats.txt
- - information on schedstats (Linux Scheduler Statistics).
diff --git a/Documentation/scheduler/sched-arch.txt b/Documentation/scheduler/sched-arch.txt
deleted file mode 100644
index 28aa1075e29..00000000000
--- a/Documentation/scheduler/sched-arch.txt
+++ /dev/null
@@ -1,89 +0,0 @@
- CPU Scheduler implementation hints for architecture specific code
-
- Nick Piggin, 2005
-
-Context switch
-==============
-1. Runqueue locking
-By default, the switch_to arch function is called with the runqueue
-locked. This is usually not a problem unless switch_to may need to
-take the runqueue lock. This is usually due to a wake up operation in
-the context switch. See arch/ia64/include/asm/system.h for an example.
-
-To request the scheduler call switch_to with the runqueue unlocked,
-you must `#define __ARCH_WANT_UNLOCKED_CTXSW` in a header file
-(typically the one where switch_to is defined).
-
-Unlocked context switches introduce only a very minor performance
-penalty to the core scheduler implementation in the CONFIG_SMP case.
-
-2. Interrupt status
-By default, the switch_to arch function is called with interrupts
-disabled. Interrupts may be enabled over the call if it is likely to
-introduce a significant interrupt latency by adding the line
-`#define __ARCH_WANT_INTERRUPTS_ON_CTXSW` in the same place as for
-unlocked context switches. This define also implies
-`__ARCH_WANT_UNLOCKED_CTXSW`. See arch/arm/include/asm/system.h for an
-example.
-
-
-CPU idle
-========
-Your cpu_idle routines need to obey the following rules:
-
-1. Preempt should now be disabled over idle routines. It should only
-   be enabled to call schedule() and then disabled again.
-
-2. need_resched/TIF_NEED_RESCHED is only ever set, and will never
- be cleared until the running task has called schedule(). Idle
- threads need only ever query need_resched, and may never set or
- clear it.
-
-3. When cpu_idle finds (need_resched() == 'true'), it should call
- schedule(). It should not call schedule() otherwise.
-
-4. The only time interrupts need to be disabled when checking
- need_resched is if we are about to sleep the processor until
- the next interrupt (this doesn't provide any protection of
- need_resched, it prevents losing an interrupt).
-
- 4a. Common problem with this type of sleep appears to be:
- local_irq_disable();
- if (!need_resched()) {
- local_irq_enable();
- *** resched interrupt arrives here ***
- __asm__("sleep until next interrupt");
- }
-
-5. TIF_POLLING_NRFLAG can be set by idle routines that do not
-   need an interrupt to wake them up when need_resched goes high.
-   In other words, they must be periodically polling need_resched,
-   although it may be reasonable to do some background work or enter
-   a low CPU priority state.
-
- 5a. If TIF_POLLING_NRFLAG is set, and we do decide to enter
- an interrupt sleep, it needs to be cleared then a memory
- barrier issued (followed by a test of need_resched with
- interrupts disabled, as explained in 3).
-
-arch/x86/kernel/process.c has examples of both polling and
-sleeping idle functions.
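-
-A minimal sketch of a sleeping idle loop that obeys rules 3, 4 and 4a.
-This is illustrative only: safe_halt() is assumed to atomically
-re-enable interrupts and halt, as the x86 "sti; hlt" pair does.
-
-	void cpu_idle(void)
-	{
-		for (;;) {
-			local_irq_disable();		/* rule 4 */
-			if (!need_resched()) {
-				/* no window for the race in 4a */
-				safe_halt();
-				local_irq_disable();
-			}
-			local_irq_enable();
-			if (need_resched())		/* rule 3 */
-				schedule();
-		}
-	}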
-
-
-Possible arch/ problems
-=======================
-
-Possible arch problems I found (and either tried to fix or didn't):
-
-h8300 - Is such sleeping racy vs interrupts? (See #4a).
-	The H8/300 manual I found indicates yes, however disabling IRQs
-	over the sleep means only NMIs can wake it up, so it can't be
-	fixed easily without spin waiting.
-
-ia64 - is safe_halt call racy vs interrupts? (does it sleep?) (See #4a)
-
-sh64 - Is sleeping racy vs interrupts? (See #4a)
-
-sparc - IRQs on at this point(?), change local_irq_save to _disable.
- - TODO: needs secondary CPUs to disable preempt (See #1)
-
diff --git a/Documentation/scheduler/sched-bwc.txt b/Documentation/scheduler/sched-bwc.txt
deleted file mode 100644
index f6b1873f68a..00000000000
--- a/Documentation/scheduler/sched-bwc.txt
+++ /dev/null
@@ -1,122 +0,0 @@
-CFS Bandwidth Control
-=====================
-
-[ This document only discusses CPU bandwidth control for SCHED_NORMAL.
- The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.txt ]
-
-CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
-specification of the maximum CPU bandwidth available to a group or hierarchy.
-
-The bandwidth allowed for a group is specified using a quota and period. Within
-each given "period" (microseconds), a group is allowed to consume only up to
-"quota" microseconds of CPU time. When the CPU bandwidth consumption of a
-group exceeds this limit (for that period), the tasks belonging to its
-hierarchy will be throttled and are not allowed to run again until the next
-period.
-
-A group's unused runtime is globally tracked, being refreshed back to
-quota at each period boundary. As threads consume this bandwidth it is
-transferred to cpu-local "silos" on a demand basis. The amount transferred
-within each of these updates is tunable and described as the "slice".
-
-Management
-----------
-Quota and period are managed within the cpu subsystem via cgroupfs.
-
-cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
-cpu.cfs_period_us: the length of a period (in microseconds)
-cpu.stat: exports throttling statistics [explained further below]
-
-The default values are:
-	cpu.cfs_period_us=100ms
-	cpu.cfs_quota_us=-1
-
-A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
-bandwidth restriction in place; such a group is described as an unconstrained
-bandwidth group. This represents the traditional work-conserving behavior for
-CFS.
-
-Writing any (valid) positive value(s) will enact the specified bandwidth limit.
-The minimum value allowed for either quota or period is 1ms. There is also an
-upper bound on the period length of 1s. Additional restrictions exist when
-bandwidth limits are used in a hierarchical fashion, these are explained in
-more detail below.
-
-Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
-and return the group to an unconstrained state once more.
-
-Any updates to a group's bandwidth specification will result in it becoming
-unthrottled if it is in a constrained state.
-
-System wide settings
---------------------
-For efficiency run-time is transferred between the global pool and CPU local
-"silos" in a batch fashion. This greatly reduces global accounting pressure
-on large systems. The amount transferred each time such an update is required
-is described as the "slice".
-
-This is tunable via procfs:
- /proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)
-
-Larger slice values will reduce transfer overheads, while smaller values allow
-for more fine-grained consumption.
-
-Statistics
-----------
-A group's bandwidth statistics are exported via 3 fields in cpu.stat.
-
-cpu.stat:
-- nr_periods: Number of enforcement intervals that have elapsed.
-- nr_throttled: Number of times the group has been throttled/limited.
-- throttled_time: The total time duration (in nanoseconds) for which entities
- of the group have been throttled.
-
-This interface is read-only.
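-
-A minimal sketch of reading these counters from C (the mount point and
-group name below are illustrative, not part of the interface):
-
-	#include <stdio.h>
-
-	int main(void)
-	{
-		char name[64];
-		long long val;
-		FILE *f = fopen("/sys/fs/cgroup/cpu/mygroup/cpu.stat", "r");
-
-		if (!f)
-			return 1;
-		/* each line of cpu.stat is "<counter> <value>" */
-		while (fscanf(f, "%63s %lld", name, &val) == 2)
-			printf("%s = %lld\n", name, val);
-		fclose(f);
-		return 0;
-	}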
-
-Hierarchical considerations
----------------------------
-The interface enforces that an individual entity's bandwidth is always
-attainable, that is: max(c_i) <= C. However, over-subscription in the
-aggregate case is explicitly allowed to enable work-conserving semantics
-within a hierarchy.
- e.g. \Sum (c_i) may exceed C
-[ Where C is the parent's bandwidth, and c_i its children ]
-
-
-There are two ways in which a group may become throttled:
- a. it fully consumes its own quota within a period
- b. a parent's quota is fully consumed within its period
-
-In case b) above, even though the child may have runtime remaining it will
-not be allowed to run until the parent's runtime is refreshed.
-
-Examples
---------
-1. Limit a group to 1 CPU worth of runtime.
-
- If period is 250ms and quota is also 250ms, the group will get
- 1 CPU worth of runtime every 250ms.
-
- # echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
- # echo 250000 > cpu.cfs_period_us /* period = 250ms */
-
-2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.
-
- With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
- runtime every 500ms.
-
- # echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
- # echo 500000 > cpu.cfs_period_us /* period = 500ms */
-
- The larger period here allows for increased burst capacity.
-
-3. Limit a group to 20% of 1 CPU.
-
- With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
-
- # echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
- # echo 50000 > cpu.cfs_period_us /* period = 50ms */
-
- By using a small period here we are ensuring a consistent latency
- response at the expense of burst capacity.
-
diff --git a/Documentation/scheduler/sched-design-CFS.txt b/Documentation/scheduler/sched-design-CFS.txt
deleted file mode 100644
index d529e02d928..00000000000
--- a/Documentation/scheduler/sched-design-CFS.txt
+++ /dev/null
@@ -1,244 +0,0 @@
- =============
- CFS Scheduler
- =============
-
-
-1. OVERVIEW
-
-CFS stands for "Completely Fair Scheduler," and is the new "desktop" process
-scheduler implemented by Ingo Molnar and merged in Linux 2.6.23. It is the
-replacement for the previous vanilla scheduler's SCHED_OTHER interactivity
-code.
-
-80% of CFS's design can be summed up in a single sentence: CFS basically models
-an "ideal, precise multi-tasking CPU" on real hardware.
-
-"Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100% physical
-power and which can run each task at precise equal speed, in parallel, each at
-1/nr_running speed. For example: if there are 2 tasks running, then it runs
-each at 50% physical power --- i.e., actually in parallel.
-
-On real hardware, we can run only a single task at once, so we have to
-introduce the concept of "virtual runtime." The virtual runtime of a task
-specifies when its next timeslice would start execution on the ideal
-multi-tasking CPU described above. In practice, the virtual runtime of a task
-is its actual runtime normalized to the total number of running tasks.
-
-
-
-2. FEW IMPLEMENTATION DETAILS
-
-In CFS the virtual runtime is expressed and tracked via the per-task
-p->se.vruntime (nanosec-unit) value. This way, it's possible to accurately
-timestamp and measure the "expected CPU time" a task should have gotten.
-
-[ small detail: on "ideal" hardware, at any time all tasks would have the same
- p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
- would ever get "out of balance" from the "ideal" share of CPU time. ]
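-
-As a sketch of that accounting (the real update lives in sched/fair.c;
-NICE_0_LOAD is the weight of a nice-0 task, so heavier tasks accrue
-vruntime more slowly):
-
-	/* charge the running entity for delta_exec nanoseconds of CPU */
-	se->vruntime += delta_exec * NICE_0_LOAD / se->load.weight;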
-
-CFS's task picking logic is based on this p->se.vruntime value and it is thus
-very simple: it always tries to run the task with the smallest p->se.vruntime
-value (i.e., the task which executed least so far). CFS always tries to split
-up CPU time between runnable tasks as close to "ideal multitasking hardware" as
-possible.
-
-Most of the rest of CFS's design just falls out of this really simple concept,
-with a few add-on embellishments like nice levels, multiprocessing and various
-algorithm variants to recognize sleepers.
-
-
-
-3. THE RBTREE
-
-CFS's design is quite radical: it does not use the old data structures for the
-runqueues, but it uses a time-ordered rbtree to build a "timeline" of future
-task execution, and thus has no "array switch" artifacts (by which both the
-previous vanilla scheduler and RSDL/SD are affected).
-
-CFS also maintains the rq->cfs.min_vruntime value, which is a monotonic
-increasing value tracking the smallest vruntime among all tasks in the
-runqueue. The total amount of work done by the system is tracked using
-min_vruntime; that value is used to place newly activated entities on the left
-side of the tree as much as possible.
-
-The total number of running tasks in the runqueue is accounted through the
-rq->cfs.load value, which is the sum of the weights of the tasks queued on the
-runqueue.
-
-CFS maintains a time-ordered rbtree, where all runnable tasks are sorted by the
-p->se.vruntime key (there is a subtraction using rq->cfs.min_vruntime to
-account for possible wraparounds). CFS picks the "leftmost" task from this
-tree and sticks to it.
-As the system progresses forwards, the executed tasks are put into the tree
-more and more to the right --- slowly but surely giving a chance for every task
-to become the "leftmost task" and thus get on the CPU within a deterministic
-amount of time.
-
-Summing up, CFS works like this: it runs a task a bit, and when the task
-schedules (or a scheduler tick happens) the task's CPU usage is "accounted
-for": the (small) time it just spent using the physical CPU is added to
-p->se.vruntime. Once p->se.vruntime gets high enough so that another task
-becomes the "leftmost task" of the time-ordered rbtree it maintains (plus a
-small amount of "granularity" distance relative to the leftmost task so that we
-do not over-schedule tasks and thrash the cache), then the new leftmost task is
-picked and the current task is preempted.
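-
-A sketch of the pick itself (the in-tree code caches the leftmost node
-rather than searching the tree on every pick; this is the simple form):
-
-	static struct sched_entity *pick_leftmost(struct cfs_rq *cfs_rq)
-	{
-		struct rb_node *left = rb_first(&cfs_rq->tasks_timeline);
-
-		if (!left)
-			return NULL;
-		return rb_entry(left, struct sched_entity, run_node);
-	}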
-
-
-
-4. SOME FEATURES OF CFS
-
-CFS uses nanosecond granularity accounting and does not rely on any jiffies or
-other HZ detail. Thus the CFS scheduler has no notion of "timeslices" in the
-way the previous scheduler had, and has no heuristics whatsoever. There is
-only one central tunable (you have to switch on CONFIG_SCHED_DEBUG):
-
- /proc/sys/kernel/sched_min_granularity_ns
-
-which can be used to tune the scheduler from "desktop" (i.e., low latencies) to
-"server" (i.e., good batching) workloads. It defaults to a setting suitable
-for desktop workloads. SCHED_BATCH is handled by the CFS scheduler module too.
-
-Due to its design, the CFS scheduler is not prone to any of the "attacks" that
-exist today against the heuristics of the stock scheduler: fiftyp.c, thud.c,
-chew.c, ring-test.c, massive_intr.c all work fine and do not impact
-interactivity and produce the expected behavior.
-
-The CFS scheduler has a much stronger handling of nice levels and SCHED_BATCH
-than the previous vanilla scheduler: both types of workloads are isolated much
-more aggressively.
-
-SMP load-balancing has been reworked/sanitized: the runqueue-walking
-assumptions are gone from the load-balancing code now, and iterators of the
-scheduling modules are used. The balancing code got quite a bit simpler as a
-result.
-
-
-
-5. Scheduling policies
-
-CFS implements three scheduling policies:
-
- - SCHED_NORMAL (traditionally called SCHED_OTHER): The scheduling
- policy that is used for regular tasks.
-
- - SCHED_BATCH: Does not preempt nearly as often as regular tasks
- would, thereby allowing tasks to run longer and make better use of
- caches but at the cost of interactivity. This is well suited for
- batch jobs.
-
- - SCHED_IDLE: This is even weaker than nice 19, but it is not a true
-   idle timer scheduler, so as to avoid getting into priority inversion
-   problems which would deadlock the machine.
-
-SCHED_FIFO/_RR are implemented in sched/rt.c and are as specified by
-POSIX.
-
-The command chrt from util-linux-ng 2.13.1.1 can set all of these except
-SCHED_IDLE.
-
-
-
-6. SCHEDULING CLASSES
-
-The new CFS scheduler has been designed in such a way to introduce "Scheduling
-Classes," an extensible hierarchy of scheduler modules. These modules
-encapsulate scheduling policy details and are handled by the scheduler core
-without the core code assuming too much about them.
-
-sched/fair.c implements the CFS scheduler described above.
-
-sched/rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than
-the previous vanilla scheduler did. It uses 100 runqueues (for all 100 RT
-priority levels, instead of 140 in the previous scheduler) and it needs no
-expired array.
-
-Scheduling classes are implemented through the sched_class structure, which
-contains hooks to functions that must be called whenever an interesting event
-occurs.
-
-This is the (partial) list of the hooks:
-
- - enqueue_task(...)
-
- Called when a task enters a runnable state.
- It puts the scheduling entity (task) into the red-black tree and
- increments the nr_running variable.
-
- - dequeue_task(...)
-
- When a task is no longer runnable, this function is called to keep the
- corresponding scheduling entity out of the red-black tree. It decrements
- the nr_running variable.
-
- - yield_task(...)
-
- This function is basically just a dequeue followed by an enqueue, unless the
- compat_yield sysctl is turned on; in that case, it places the scheduling
- entity at the right-most end of the red-black tree.
-
- - check_preempt_curr(...)
-
- This function checks if a task that entered the runnable state should
- preempt the currently running task.
-
- - pick_next_task(...)
-
- This function chooses the most appropriate task eligible to run next.
-
- - set_curr_task(...)
-
- This function is called when a task changes its scheduling class or changes
- its task group.
-
- - task_tick(...)
-
-   This function is mostly called from time tick functions; it might lead
-   to a process switch. This drives the running preemption.
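-
-A sketch of how a class wires these hooks up (field set abridged; see
-the full struct sched_class in the kernel sources):
-
-	static const struct sched_class fair_sched_class = {
-		.enqueue_task		= enqueue_task_fair,
-		.dequeue_task		= dequeue_task_fair,
-		.yield_task		= yield_task_fair,
-		.check_preempt_curr	= check_preempt_wakeup,
-		.pick_next_task		= pick_next_task_fair,
-		.set_curr_task		= set_curr_task_fair,
-		.task_tick		= task_tick_fair,
-	};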
-
-
-
-
-7. GROUP SCHEDULER EXTENSIONS TO CFS
-
-Normally, the scheduler operates on individual tasks and strives to provide
-fair CPU time to each task. Sometimes, it may be desirable to group tasks and
-provide fair CPU time to each such task group. For example, it may be
-desirable to first provide fair CPU time to each user on the system and then to
-each task belonging to a user.
-
-CONFIG_CGROUP_SCHED strives to achieve exactly that. It lets tasks be
-grouped and divides CPU time fairly among such groups.
-
-CONFIG_RT_GROUP_SCHED allows grouping of real-time (i.e., SCHED_FIFO and
-SCHED_RR) tasks.
-
-CONFIG_FAIR_GROUP_SCHED allows grouping of CFS (i.e., SCHED_NORMAL and
-SCHED_BATCH) tasks.
-
- These options need CONFIG_CGROUPS to be defined, and let the administrator
- create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See
- Documentation/cgroups/cgroups.txt for more information about this filesystem.
-
-When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each
-group created using the pseudo filesystem. See example steps below to create
-task groups and modify their CPU share using the "cgroups" pseudo filesystem.
-
- # mount -t tmpfs cgroup_root /sys/fs/cgroup
- # mkdir /sys/fs/cgroup/cpu
- # mount -t cgroup -ocpu none /sys/fs/cgroup/cpu
- # cd /sys/fs/cgroup/cpu
-
- # mkdir multimedia # create "multimedia" group of tasks
- # mkdir browser # create "browser" group of tasks
-
- # #Configure the multimedia group to receive twice the CPU bandwidth
- # #that of browser group
-
- # echo 2048 > multimedia/cpu.shares
- # echo 1024 > browser/cpu.shares
-
- # firefox & # Launch firefox and move it to "browser" group
- # echo <firefox_pid> > browser/tasks
-
- # #Launch gmplayer (or your favourite movie player)
- # echo <movie_player_pid> > multimedia/tasks
diff --git a/Documentation/scheduler/sched-domains.txt b/Documentation/scheduler/sched-domains.txt
deleted file mode 100644
index 443f0c76bab..00000000000
--- a/Documentation/scheduler/sched-domains.txt
+++ /dev/null
@@ -1,77 +0,0 @@
-Each CPU has a "base" scheduling domain (struct sched_domain). The domain
-hierarchy is built from these base domains via the ->parent pointer. ->parent
-MUST be NULL terminated, and domain structures should be per-CPU as they are
-locklessly updated.
-
-Each scheduling domain spans a number of CPUs (stored in the ->span field).
-A domain's span MUST be a superset of its child's span (this restriction
-could be relaxed if the need arises), and a base domain for CPU i MUST
-span at least i. The top domain for each CPU will generally span all CPUs
-in the system, although strictly it doesn't have to; if it doesn't, some
-CPUs will never be given tasks to run unless the CPUs allowed mask is
-explicitly set. A sched domain's span means "balance process load among
-these CPUs".
-
-Each scheduling domain must have one or more CPU groups (struct sched_group)
-which are organised as a circular one way linked list from the ->groups
-pointer. The union of cpumasks of these groups MUST be the same as the
-domain's span. The intersection of cpumasks from any two of these groups
-MUST be the empty set. The group pointed to by the ->groups pointer MUST
-contain the CPU to which the domain belongs. Groups may be shared among
-CPUs as they contain read only data after they have been set up.
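-
-A sketch of walking that circular list (this do/while shape is the idiom
-the balancing code itself uses):
-
-	struct sched_group *sg = sd->groups;
-
-	do {
-		/* e.g. sum up the load of the CPUs in sg's cpumask */
-		sg = sg->next;
-	} while (sg != sd->groups);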
-
-Balancing within a sched domain occurs between groups. That is, each group
-is treated as one entity. The load of a group is defined as the sum of the
-load of each of its member CPUs, and only when the load of a group becomes
-out of balance are tasks moved between groups.
-
-In kernel/sched.c, trigger_load_balance() is run periodically on each CPU
-through scheduler_tick(). It raises a softirq after the next regularly scheduled
-rebalancing event for the current runqueue has arrived. The actual load
-balancing workhorse, run_rebalance_domains()->rebalance_domains(), is then run
-in softirq context (SCHED_SOFTIRQ).
-
-The latter function takes two arguments: the current CPU and whether it
-was idle at the time the scheduler_tick() happened. It iterates over all
-sched domains our CPU is on, starting from its base domain and going up
-the ->parent chain.
-While doing that, it checks to see if the current domain has exhausted its
-rebalance interval. If so, it runs load_balance() on that domain. It then checks
-the parent sched_domain (if it exists), and the parent of the parent and so
-forth.
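-
-In outline (simplified: the real code converts the per-domain interval
-to jiffies and scales it by busy/idle state):
-
-	for (sd = rq->sd; sd; sd = sd->parent) {
-		if (time_after_eq(jiffies,
-				  sd->last_balance + sd->balance_interval))
-			load_balance(this_cpu, this_rq, sd, idle);
-	}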
-
-Initially, load_balance() finds the busiest group in the current sched domain.
-If it succeeds, it looks for the busiest runqueue of all the CPUs' runqueues in
-that group. If it manages to find such a runqueue, it locks both our initial
-CPU's runqueue and the newly found busiest one and starts moving tasks from it
-to our runqueue. The exact number of tasks amounts to an imbalance previously
-computed while iterating over this sched domain's groups.
-
-*** Implementing sched domains ***
-The "base" domain will "span" the first level of the hierarchy. In the case
-of SMT, you'll span all siblings of the physical CPU, with each group being
-a single virtual CPU.
-
-In SMP, the parent of the base domain will span all physical CPUs in the
-node, with each group being a single physical CPU. Then with NUMA, the
-parent of the SMP domain will span the entire machine, with each group
-having the cpumask of a node. Alternatively, you could do multi-level
-NUMA; Opteron, for example, might have just one domain covering its one
-NUMA level.
-
-The implementor should read comments in include/linux/sched.h:
-struct sched_domain fields, SD_FLAG_*, SD_*_INIT to get an idea of
-the specifics and what to tune.
-
-Architectures may override the default SD_*_INIT flags while using the
-generic domain builder in kernel/sched.c if they wish to retain the
-traditional SMT->SMP->NUMA topology (or some subset of that). This can
-be done by #define'ing ARCH_HAS_SCHED_TUNE.
-
-Alternatively, the architecture may completely override the generic domain
-builder by #define'ing ARCH_HAS_SCHED_DOMAIN, and exporting your
-arch_init_sched_domains function. This function will attach domains to all
-CPUs using cpu_attach_domain.
-
-The sched-domains debugging infrastructure can be enabled by enabling
-CONFIG_SCHED_DEBUG. This enables an error checking parse of the sched domains
-which should catch most possible errors (described above). It also prints out
-the domain structure in a visual format.
diff --git a/Documentation/scheduler/sched-nice-design.txt b/Documentation/scheduler/sched-nice-design.txt
deleted file mode 100644
index 3ac1e46d536..00000000000
--- a/Documentation/scheduler/sched-nice-design.txt
+++ /dev/null
@@ -1,108 +0,0 @@
-This document explains the thinking about the revamped and streamlined
-nice-levels implementation in the new Linux scheduler.
-
-Nice levels were always pretty weak under Linux and people continuously
-pestered us to make nice +19 tasks use up much less CPU time.
-
-Unfortunately that was not that easy to implement under the old
-scheduler (otherwise we'd have done it long ago), because nice level
-support was historically coupled to timeslice length, and timeslice
-units were driven by the HZ tick, so the smallest timeslice was 1/HZ.
-
-In the O(1) scheduler (in 2003) we changed negative nice levels to be
-much stronger than they were before in 2.4 (and people were happy about
-that change), and we also intentionally calibrated the linear timeslice
-rule so that nice +19 level would be _exactly_ 1 jiffy. To better
-understand it, the timeslice graph went like this (cheesy ASCII art
-alert!):
-
-
- A
- \ | [timeslice length]
- \ |
- \ |
- \ |
- \ |
- \|___100msecs
- |^ . _
- | ^ . _
- | ^ . _
- -*----------------------------------*-----> [nice level]
- -20 | +19
- |
- |
-
-So if someone really wanted to renice tasks, +19 would give a much
-bigger hit than the normal linear rule would. (The solution of
-changing the ABI to extend priorities was discarded early on.)
-
-This approach worked to some degree for some time, but later on with
-HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage which
-we felt to be a bit excessive. Excessive _not_ because it's too small of
-a CPU utilization, but because it causes too frequent (once per
-millisec) rescheduling. (and would thus thrash the cache, etc. Remember,
-this was long ago when hardware was weaker and caches were smaller, and
-people were running number crunching apps at nice +19.)
-
-So for HZ=1000 we changed nice +19 to 5msecs, because that felt like the
-right minimal granularity - and this translates to 5% CPU utilization.
-But the fundamental HZ-sensitive property for nice+19 still remained,
-and we never got a single complaint about nice +19 being too _weak_ in
-terms of CPU utilization, we only got complaints about it (still) being
-too _strong_ :-)
-
-To sum it up: we always wanted to make nice levels more consistent, but
-within the constraints of HZ and jiffies and their nasty design level
-coupling to timeslices and granularity it was not really viable.
-
-The second (less frequent but still periodically occurring) complaint
-about Linux's nice level support was its asymmetry around the origin
-(which you can see demonstrated in the picture above), or more
-accurately: the fact that nice level behavior depended on the _absolute_
-nice level as well, while the nice API itself is fundamentally
-"relative":
-
- int nice(int inc);
-
- asmlinkage long sys_nice(int increment)
-
-(the first one is the glibc API, the second one is the syscall API.)
-Note that the 'inc' is relative to the current nice level. Tools like
-bash's "nice" command mirror this relative API.
-
-With the old scheduler, if you for example started a niced task with +1
-and another task with +2, the CPU split between the two tasks would
-depend on the nice level of the parent shell - if it was at nice -10 the
-CPU split was different than if it was at +5 or +10.
-
-A third complaint against Linux's nice level support was that negative
-nice levels were not 'punchy enough', so lots of people had to resort to
-running audio (and other multimedia) apps under RT priorities such as
-SCHED_FIFO. But this caused other problems: SCHED_FIFO is not starvation
-proof, and a buggy SCHED_FIFO app can also lock up the system for good.
-
-The new scheduler in v2.6.23 addresses all three types of complaints:
-
-To address the first complaint (of nice +19 not being weak enough),
-enough), the scheduler was decoupled from 'time slice' and HZ concepts
-(and granularity was made a separate concept from nice levels) and thus
-it was possible to implement better and more consistent nice +19
-support: with the new scheduler nice +19 tasks get a HZ-independent
-1.5%, instead of the variable 3%-5%-9% range they got in the old
-scheduler.
-
-To address the second complaint (of nice levels not being consistent),
-the new scheduler makes nice(1) have the same CPU utilization effect on
-tasks, regardless of their absolute nice levels. So on the new
-scheduler, running a nice +10 and a nice +11 task has the same CPU
-utilization "split" between them as running a nice -5 and a nice -4
-task. (one will get 55% of the CPU, the other 45%.) That is why nice
-levels were changed to be "multiplicative" (or exponential) - that way
-it does not matter which nice level you start out from, the 'relative
-result' will always be the same.
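-
-A small sketch of that property (1.25 is roughly the weight ratio CFS
-uses between adjacent nice levels; the exact table values yield the
-55%/45% split quoted above):
-
-	#include <stdio.h>
-
-	int main(void)
-	{
-		double w_lo = 1.25, w_hi = 1.0;	/* any two adjacent levels */
-		double sum = w_lo + w_hi;
-
-		printf("%.1f%% / %.1f%%\n",
-		       100.0 * w_lo / sum, 100.0 * w_hi / sum);
-		return 0;	/* prints 55.6% / 44.4% */
-	}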
-
-The third complaint (of negative nice levels not being "punchy" enough
-and forcing audio apps to run under the more dangerous SCHED_FIFO
-scheduling policy) is addressed by the new scheduler almost
-automatically: stronger negative nice levels are an automatic
-side-effect of the recalibrated dynamic range of nice levels.
diff --git a/Documentation/scheduler/sched-rt-group.txt b/Documentation/scheduler/sched-rt-group.txt
deleted file mode 100644
index 71b54d54998..00000000000
--- a/Documentation/scheduler/sched-rt-group.txt
+++ /dev/null
@@ -1,183 +0,0 @@
- Real-Time group scheduling
- --------------------------
-
-CONTENTS
-========
-
-0. WARNING
-1. Overview
- 1.1 The problem
- 1.2 The solution
-2. The interface
- 2.1 System-wide settings
- 2.2 Default behaviour
- 2.3 Basis for grouping tasks
-3. Future plans
-
-
-0. WARNING
-==========
-
- Fiddling with these settings can result in an unstable system; the knobs
- are root-only and it is assumed that root knows what they are doing.
-
-Most notable:
-
- * very small values in sched_rt_period_us can result in an unstable
- system when the period is smaller than either the available hrtimer
- resolution, or the time it takes to handle the budget refresh itself.
-
- * very small values in sched_rt_runtime_us can result in an unstable
- system when the runtime is so small the system has difficulty making
- forward progress (NOTE: the migration thread and kstopmachine both
- are real-time processes).
-
-1. Overview
-===========
-
-
-1.1 The problem
----------------
-
-Realtime scheduling is all about determinism: a group has to be able to rely on
-the amount of bandwidth (eg. CPU time) being constant. In order to schedule
-multiple groups of realtime tasks, each group must be assigned a fixed portion
-of the CPU time available. Without a minimum guarantee a realtime group can
-obviously fall short. A fuzzy upper limit is of no use since it cannot be
-relied upon. Which leaves us with just the single fixed portion.
-
-1.2 The solution
-----------------
-
-CPU time is divided by means of specifying how much time can be spent running
-in a given period. We allocate this "run time" for each realtime group which
-the other realtime groups will not be permitted to use.
-
-Any time not allocated to a realtime group will be used to run normal priority
-tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by
-SCHED_OTHER.
-
-Let's consider an example: a fixed-frame-rate realtime renderer must
-deliver 25 frames a second, which yields a period of 0.04s per frame. Now
-say it will also have to play some music and respond to input, leaving
-around 80% of CPU time dedicated to the graphics. We can then give this
-group a run time of 0.8 * 0.04s = 0.032s.
-
-This way the graphics group will have a 0.04s period with a 0.032s run time
-limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but
-needs only about 3% CPU time to do so, it can do with 0.03 * 0.005s =
-0.00015s. So this group can be scheduled with a period of 0.005s and a run time
-of 0.00015s.
-
-The remaining CPU time will be used for user input and other tasks. Because
-realtime tasks have explicitly allocated the CPU time they need to perform
-their tasks, buffer underruns in the graphics or audio can be eliminated.
-
-NOTE: the above example is not fully implemented yet. We still
-lack an EDF scheduler to make non-uniform periods usable.
-
-
-2. The Interface
-================
-
-
-2.1 System wide settings
-------------------------
-
-The system wide settings are configured under the /proc virtual file system:
-
-/proc/sys/kernel/sched_rt_period_us:
- The scheduling period that is equivalent to 100% CPU bandwidth
-
-/proc/sys/kernel/sched_rt_runtime_us:
- A global limit on how much time realtime scheduling may use. Even without
- CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime
- processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth
- available to all realtime groups.
-
- * Time is specified in us because the interface is s32. This gives an
- operating range from 1us to about 35 minutes.
- * sched_rt_period_us takes values from 1 to INT_MAX.
- * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).
- * A run time of -1 specifies runtime == period, ie. no limit.
-
-
-2.2 Default behaviour
----------------------
-
-The default values are sched_rt_period_us = 1000000 (1s) and
-sched_rt_runtime_us = 950000 (0.95s). This leaves 0.05s to be used by
-SCHED_OTHER (non-RT tasks). These defaults were chosen so that a runaway
-realtime task will not lock up the machine but will leave a little time
-to recover it. By setting runtime to -1 you'd get the old behaviour back.
-
-By default all bandwidth is assigned to the root group and new groups get the
-period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
-want to assign bandwidth to another group, reduce the root group's bandwidth
-and assign some or all of the difference to another group.
-
-Realtime group scheduling means you have to assign a portion of total CPU
-bandwidth to the group before it will accept realtime tasks. Therefore you will
-not be able to run realtime tasks as any user other than root until you have
-done that, even if the user has the rights to run processes with realtime
-priority!
-
-
-2.3 Basis for grouping tasks
-----------------------------
-
-Enabling CONFIG_RT_GROUP_SCHED lets you explicitly allocate real
-CPU bandwidth to task groups.
-
-This uses the cgroup virtual file system and "<cgroup>/cpu.rt_runtime_us"
-to control the CPU time reserved for each control group.
-
-For more information on working with control groups, you should read
-Documentation/cgroups/cgroups.txt as well.
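-
-A minimal sketch of assigning run time from C (the group path and value
-are illustrative; remember to first free up bandwidth under the root
-group's limit):
-
-	#include <stdio.h>
-
-	int main(void)
-	{
-		FILE *f = fopen("/sys/fs/cgroup/cpu/mygroup/cpu.rt_runtime_us", "w");
-
-		if (!f)
-			return 1;
-		fprintf(f, "%d\n", 300000);	/* 0.3s of each 1s period */
-		fclose(f);
-		return 0;
-	}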
-
-Group settings are checked against the following limits in order to keep the
-configuration schedulable:
-
- \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period
-
-For now, this can be simplified to just the following (but see Future plans):
-
- \Sum_{i} runtime_{i} <= global_runtime
-
-
-3. Future plans
-===============
-
-There is work in progress to make the scheduling period for each group
-("<cgroup>/cpu.rt_period_us") configurable as well.
-
-The constraint on the period is that a subgroup must have a smaller or
-equal period to its parent. But realistically it's not very useful _yet_,
-as it's prone to starvation without deadline scheduling.
-
-Consider two sibling groups A and B; both have 50% bandwidth, but A's
-period is twice the length of B's.
-
-* group A: period=100000us, runtime=10000us
- - this runs for 0.01s once every 0.1s
-
-* group B: period= 50000us, runtime=10000us
- - this runs for 0.01s twice every 0.1s (or once every 0.05 sec).
-
-This means that currently a while (1) loop in A will run for the full period of
-B and can starve B's tasks (assuming they are of lower priority) for a whole
-period.
-
-The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring
-full deadline scheduling to the Linux kernel. Deadline scheduling the above
-groups and treating end of the period as a deadline will ensure that they both
-get their allocated time.
-
-Implementing SCHED_EDF might take a while to complete. Priority Inheritance is
-the biggest challenge, as the current Linux PI infrastructure is geared towards
-the limited static priority levels 0-99. With deadline scheduling you need to
-do deadline inheritance (since priority is inversely proportional to the
-deadline delta (deadline - now)).
-
-This means the whole PI machinery will have to be reworked - and that is one of
-the most complex pieces of code we have.
diff --git a/Documentation/scheduler/sched-stats.txt b/Documentation/scheduler/sched-stats.txt
deleted file mode 100644
index 8259b34a66a..00000000000
--- a/Documentation/scheduler/sched-stats.txt
+++ /dev/null
@@ -1,154 +0,0 @@
-Version 15 of schedstats dropped counters for some sched_yield:
-yld_exp_empty, yld_act_empty and yld_both_empty. Otherwise, it is
-identical to version 14.
-
-Version 14 of schedstats includes support for sched_domains, which hit the
-mainline kernel in 2.6.20 although it is identical to the stats from version
-12 which was in the kernel from 2.6.13-2.6.19 (version 13 never saw a kernel
-release). Some counters make more sense to be per-runqueue; others to be
-per-domain. Note that domains (and their associated information) will only
-be pertinent and available on machines utilizing CONFIG_SMP.
-
-In version 14 of schedstat, there is at least one level of domain
-statistics for each cpu listed, and there may well be more than one
-domain. Domains have no particular names in this implementation, but
-the highest numbered one typically arbitrates balancing across all the
-cpus on the machine, while domain0 is the most tightly focused domain,
-sometimes balancing only between pairs of cpus. At this time, there
-are no architectures which need more than three domain levels. The first
-field in the domain stats is a bit map indicating which cpus are affected
-by that domain.
-
-These fields are counters, and only increment. Programs which make use
-of these will need to start with a baseline observation and then calculate
-the change in the counters at each subsequent observation. A perl script
-which does this for many of the fields is available at
-
- http://eaglet.rain.com/rick/linux/schedstat/
-
-Note that any such script will necessarily be version-specific, as the main
-reason to change versions is changes in the output format. For those wishing
-to write their own scripts, the fields are described here.
-
-CPU statistics
---------------
-cpu<N> 1 2 3 4 5 6 7 8 9
-
-First field is a sched_yield() statistic:
- 1) # of times sched_yield() was called
-
-Next three are schedule() statistics:
- 2) This field is a legacy array expiration count field used in the O(1)
- scheduler. We kept it for ABI compatibility, but it is always set to zero.
- 3) # of times schedule() was called
- 4) # of times schedule() left the processor idle
-
-Next two are try_to_wake_up() statistics:
- 5) # of times try_to_wake_up() was called
- 6) # of times try_to_wake_up() was called to wake up the local cpu
-
-Next three are statistics describing scheduling latency:
- 7) sum of all time spent running by tasks on this processor (in jiffies)
- 8) sum of all time spent waiting to run by tasks on this processor (in
- jiffies)
- 9) # of timeslices run on this cpu
-
-
-Domain statistics
------------------
-One of these is produced per domain for each cpu described. (Note that if
-CONFIG_SMP is not defined, *no* domains are utilized and these lines
-will not appear in the output.)
-
-domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
-
-The first field is a bit mask indicating what cpus this domain operates over.
-
-The next 24 are a variety of load_balance() statistics grouped into types
-of idleness (idle, busy, and newly idle):
-
- 1) # of times in this domain load_balance() was called when the
- cpu was idle
- 2) # of times in this domain load_balance() checked but found
- the load did not require balancing when the cpu was idle
- 3) # of times in this domain load_balance() tried to move one or
- more tasks and failed, when the cpu was idle
- 4) sum of imbalances discovered (if any) with each call to
- load_balance() in this domain when the cpu was idle
- 5) # of times in this domain pull_task() was called when the cpu
- was idle
- 6) # of times in this domain pull_task() was called even though
- the target task was cache-hot when idle
- 7) # of times in this domain load_balance() was called but did
- not find a busier queue while the cpu was idle
- 8) # of times in this domain a busier queue was found while the
- cpu was idle but no busier group was found
-
- 9) # of times in this domain load_balance() was called when the
- cpu was busy
- 10) # of times in this domain load_balance() checked but found the
- load did not require balancing when busy
- 11) # of times in this domain load_balance() tried to move one or
- more tasks and failed, when the cpu was busy
- 12) sum of imbalances discovered (if any) with each call to
- load_balance() in this domain when the cpu was busy
- 13) # of times in this domain pull_task() was called when busy
- 14) # of times in this domain pull_task() was called even though the
- target task was cache-hot when busy
- 15) # of times in this domain load_balance() was called but did not
- find a busier queue while the cpu was busy
- 16) # of times in this domain a busier queue was found while the cpu
- was busy but no busier group was found
-
- 17) # of times in this domain load_balance() was called when the
- cpu was just becoming idle
- 18) # of times in this domain load_balance() checked but found the
- load did not require balancing when the cpu was just becoming idle
- 19) # of times in this domain load_balance() tried to move one or more
- tasks and failed, when the cpu was just becoming idle
- 20) sum of imbalances discovered (if any) with each call to
- load_balance() in this domain when the cpu was just becoming idle
- 21) # of times in this domain pull_task() was called when newly idle
- 22) # of times in this domain pull_task() was called even though the
- target task was cache-hot when just becoming idle
- 23) # of times in this domain load_balance() was called but did not
- find a busier queue while the cpu was just becoming idle
- 24) # of times in this domain a busier queue was found while the cpu
- was just becoming idle but no busier group was found
-
- Next three are active_load_balance() statistics:
- 25) # of times active_load_balance() was called
- 26) # of times active_load_balance() tried to move a task and failed
- 27) # of times active_load_balance() successfully moved a task
-
- Next three are sched_balance_exec() statistics:
- 28) sbe_cnt is not used
- 29) sbe_balanced is not used
- 30) sbe_pushed is not used
-
- Next three are sched_balance_fork() statistics:
- 31) sbf_cnt is not used
- 32) sbf_balanced is not used
- 33) sbf_pushed is not used
-
- Next three are try_to_wake_up() statistics:
- 34) # of times in this domain try_to_wake_up() awoke a task that
- last ran on a different cpu in this domain
- 35) # of times in this domain try_to_wake_up() moved a task to the
- waking cpu because it was cache-cold on its own cpu anyway
- 36) # of times in this domain try_to_wake_up() started passive balancing
-
-/proc/<pid>/schedstat
----------------------
-schedstats also adds a new /proc/<pid>/schedstat file to include some of
-the same information on a per-process level. The three fields in this
-file are, for that process:
- 1) time spent on the cpu
- 2) time spent waiting on a runqueue
- 3) # of timeslices run on this cpu
-
-A program could be easily written to make use of these extra fields to
-report on how well a particular process or set of processes is faring
-under the scheduler's policies. A simple version of such a program is
-available at
- http://eaglet.rain.com/rick/linux/schedstat/v12/latency.c
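-
-A minimal C sketch of such a reader (fields as listed above; pass a pid,
-or it defaults to the calling process):
-
-	#include <stdio.h>
-
-	int main(int argc, char **argv)
-	{
-		char path[64];
-		unsigned long long oncpu, waiting, slices;
-		FILE *f;
-
-		snprintf(path, sizeof(path), "/proc/%s/schedstat",
-			 argc > 1 ? argv[1] : "self");
-		f = fopen(path, "r");
-		if (!f)
-			return 1;
-		if (fscanf(f, "%llu %llu %llu", &oncpu, &waiting, &slices) == 3)
-			printf("run %llu  wait %llu  timeslices %llu\n",
-			       oncpu, waiting, slices);
-		fclose(f);
-		return 0;
-	}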