summaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
...
| * [PATCH] ptrace/coredump/exit_group deadlockAndrea Arcangeli2005-10-302-5/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I could seldom reproduce a deadlock with a task not killable in T state (TASK_STOPPED, not TASK_TRACED) by attaching a NPTL threaded program to gdb, by segfaulting the task and triggering a core dump while some other task is executing exit_group and while one task is in ptrace_attached TASK_STOPPED state (not TASK_TRACED yet). This originated from a gdb bugreport (the fact gdb was segfaulting the task wasn't a kernel bug), but I just incidentally noticed the gdb bug triggered a real kernel bug as well. Most threads hangs in exit_mm because the core_dumping is still going, the core dumping hangs because the stopped task doesn't exit, the stopped task can't wakeup because it has SIGNAL_GROUP_EXIT set, hence the deadlock. To me it seems that the problem is that the force_sig_specific(SIGKILL) in zap_threads is a noop if the task has PF_PTRACED set (like in this case because gdb is attached). The __ptrace_unlink does nothing because the signal->flags is set to SIGNAL_GROUP_EXIT|SIGNAL_STOP_DEQUEUED (verified). The above info also shows that the stopped task hit a race and got the stop signal (presumably by the ptrace_attach, only the attach, state is still TASK_STOPPED and gdb hangs waiting the core before it can set it to TASK_TRACED) after one of the thread invoked the core dump (it's the core dump that sets signal->flags to SIGNAL_GROUP_EXIT). So beside the fact nobody would wakeup the task in __ptrace_unlink (the state is _not_ TASK_TRACED), there's a secondary problem in the signal handling code, where a task should ignore the ptrace-sigstops as long as SIGNAL_GROUP_EXIT is set (or the wakeup in __ptrace_unlink path wouldn't be enough). So I attempted to make this patch that seems to fix the problem. There were various ways to fix it, perhaps you prefer a different one, I just opted to the one that looked safer to me. I also removed the clearing of the stopped bits from the zap_other_threads (zap_other_threads was safe unlike zap_threads). I don't like useless code, this whole NPTL signal/ptrace thing is already unreadable enough and full of corner cases without confusing useless code into it to make it even less readable. And if this code is really needed, then you may want to explain why it's not being done in the other paths that sets SIGNAL_GROUP_EXIT at least. Even after this patch I still wonder who serializes the read of p->ptrace in zap_threads. Patch is called ptrace-core_dump-exit_group-deadlock-1. This was the trace I've got: test T ffff81003e8118c0 0 14305 1 14311 14309 (NOTLB) ffff810058ccdde8 0000000000000082 000001f4000037e1 ffff810000000013 00000000000000f8 ffff81003e811b00 ffff81003e8118c0 ffff810011362100 0000000000000012 ffff810017ca4180 Call Trace:<ffffffff801317ed>{try_to_wake_up+893} <ffffffff80141677>{finish_stop+87} <ffffffff8014367f>{get_signal_to_deliver+1359} <ffffffff8010d3ad>{do_signal+157} <ffffffff8013deee>{ptrace_check_attach+222} <ffffffff80111575>{sys_ptrace+2293} <ffffffff80131810>{default_wake_function+0} <ffffffff80196399>{sys_ioctl+73} <ffffffff8010dd27>{sysret_signal+28} <ffffffff8010e00f>{ptregscall_common+103} test D ffff810011362100 0 14309 1 14305 14312 (NOTLB) ffff810053c81cf8 0000000000000082 0000000000000286 0000000000000001 0000000000000195 ffff810011362340 ffff810011362100 ffff81002e338040 ffff810001e0ca80 0000000000000001 Call Trace:<ffffffff801317ed>{try_to_wake_up+893} <ffffffff8044677d>{wait_for_completion+173} <ffffffff80131810>{default_wake_function+0} <ffffffff80137435>{exit_mm+149} <ffffffff801381af>{do_exit+479} <ffffffff80138d0c>{do_group_exit+252} <ffffffff801436db>{get_signal_to_deliver+1451} <ffffffff8010d3ad>{do_signal+157} <ffffffff8013deee>{ptrace_check_attach+222} <ffffffff80140850>{specific_send_sig_info+2 <ffffffff8014208a>{force_sig_info+186} <ffffffff804479a0>{do_int3+112} <ffffffff8010e308>{retint_signal+61} test D ffff81002e338040 0 14311 1 14716 14305 (NOTLB) ffff81005ca8dcf8 0000000000000082 0000000000000286 0000000000000001 0000000000000120 ffff81002e338280 ffff81002e338040 ffff8100481cb740 ffff810001e0ca80 0000000000000001 Call Trace:<ffffffff801317ed>{try_to_wake_up+893} <ffffffff8044677d>{wait_for_completion+173} <ffffffff80131810>{default_wake_function+0} <ffffffff80137435>{exit_mm+149} <ffffffff801381af>{do_exit+479} <ffffffff80142d0e>{__dequeue_signal+558} <ffffffff80138d0c>{do_group_exit+252} <ffffffff801436db>{get_signal_to_deliver+1451} <ffffffff8010d3ad>{do_signal+157} <ffffffff8013deee>{ptrace_check_attach+222} <ffffffff80140850>{specific_send_sig_info+208} <ffffffff8014208a>{force_sig_info+186} <ffffffff804479a0>{do_int3+112} <ffffffff8010e308>{retint_signal+61} test D ffff810017ca4180 0 14312 1 14309 13882 (NOTLB) ffff81005d15fcb8 0000000000000082 ffff81005d15fc58 ffffffff80130816 0000000000000897 ffff810017ca43c0 ffff810017ca4180 ffff81003e8118c0 0000000000000082 ffffffff801317ed Call Trace:<ffffffff80130816>{activate_task+150} <ffffffff801317ed>{try_to_wake_up+893} <ffffffff8044677d>{wait_for_completion+173} <ffffffff80131810>{default_wake_function+0} <ffffffff8018cdc3>{do_coredump+819} <ffffffff80445f52>{thread_return+82} <ffffffff801436d4>{get_signal_to_deliver+1444} <ffffffff8010d3ad>{do_signal+157} <ffffffff8013deee>{ptrace_check_attach+222} <ffffffff80140850>{specific_send_sig_info+2 <ffffffff804472e5>{_spin_unlock_irqrestore+5} <ffffffff8014208a>{force_sig_info+186} <ffffffff804476ff>{do_general_protection+159} <ffffffff8010e308>{retint_signal+61} Signed-off-by: Andrea Arcangeli <andrea@suse.de> Cc: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] cpusets: automatic numa mempolicy rebindingPaul Jackson2005-10-301-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch automatically updates a tasks NUMA mempolicy when its cpuset memory placement changes. It does so within the context of the task, without any need to support low level external mempolicy manipulation. If a system is not using cpusets, or if running on a system with just the root (all-encompassing) cpuset, then this remap is a no-op. Only when a task is moved between cpusets, or a cpusets memory placement is changed does the following apply. Otherwise, the main routine below, rebind_policy() is not even called. When mixing cpusets, scheduler affinity, and NUMA mempolicies, the essential role of cpusets is to place jobs (several related tasks) on a set of CPUs and Memory Nodes, the essential role of sched_setaffinity is to manage a jobs processor placement within its allowed cpuset, and the essential role of NUMA mempolicy (mbind, set_mempolicy) is to manage a jobs memory placement within its allowed cpuset. However, CPU affinity and NUMA memory placement are managed within the kernel using absolute system wide numbering, not cpuset relative numbering. This is ok until a job is migrated to a different cpuset, or what's the same, a jobs cpuset is moved to different CPUs and Memory Nodes. Then the CPU affinity and NUMA memory placement of the tasks in the job need to be updated, to preserve their cpuset-relative position. This can be done for CPU affinity using sched_setaffinity() from user code, as one task can modify anothers CPU affinity. This cannot be done from an external task for NUMA memory placement, as that can only be modified in the context of the task using it. However, it easy enough to remap a tasks NUMA mempolicy automatically when a task is migrated, using the existing cpuset mechanism to trigger a refresh of a tasks memory placement after its cpuset has changed. All that is needed is the old and new nodemask, and notice to the task that it needs to rebind its mempolicy. The tasks mems_allowed has the old mask, the tasks cpuset has the new mask, and the existing cpuset_update_current_mems_allowed() mechanism provides the notice. The bitmap/cpumask/nodemask remap operators provide the cpuset relative calculations. This patch leaves open a couple of issues: 1) Updating vma and shmfs/tmpfs/hugetlbfs memory policies: These mempolicies may reference nodes outside of those allowed to the current task by its cpuset. Tasks are migrated as part of jobs, which reside on what might be several cpusets in a subtree. When such a job is migrated, all NUMA memory policy references to nodes within that cpuset subtree should be translated, and references to any nodes outside that subtree should be left untouched. A future patch will provide the cpuset mechanism needed to mark such subtrees. With that patch, we will be able to correctly migrate these other memory policies across a job migration. 2) Updating cpuset, affinity and memory policies in user space: This is harder. Any placement state stored in user space using system-wide numbering will be invalidated across a migration. More work will be required to provide user code with a migration-safe means to manage its cpuset relative placement, while preserving the current API's that pass system wide numbers, not cpuset relative numbers across the kernel-user boundary. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] cpusets: simple renamePaul Jackson2005-10-301-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for renaming cpusets. Only allow simple rename of cpuset directories in place. Don't allow moving cpusets elsewhere in hierarchy or renaming the special cpuset files in each cpuset directory. The usefulness of this simple rename became apparent when developing task migration facilities. It allows building a second cpuset hierarchy using new names and containing new CPUs and Memory Nodes, moving tasks from the old to the new cpusets, removing the old cpusets, and then renaming the new cpusets to be just like the old names, so that any knowledge that the tasks had of their cpuset names will still be valid. Leaf node cpusets can be migrated to other CPUs or Memory Nodes by just updating their 'cpus' and 'mems' files, but because no cpuset can contain CPUs or Nodes not in its parent cpuset, one cannot do this in a cpuset hierarchy without first expanding all the non-leaf cpusets to contain the union of both the old and new CPUs and Nodes, which would obfuscate the one-to-one migration of a task from one cpuset to another required to correctly migrate the physical page frames currently allocated to that task. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] cpusets: dual semaphore locking overhaulPaul Jackson2005-10-301-137/+281
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Overhaul cpuset locking. Replace single semaphore with two semaphores. The suggestion to use two locks was made by Roman Zippel. Both locks are global. Code that wants to modify cpusets must first acquire the exclusive manage_sem, which allows them read-only access to cpusets, and holds off other would-be modifiers. Before making actual changes, the second semaphore, callback_sem must be acquired as well. Code that needs only to query cpusets must acquire callback_sem, which is also a global exclusive lock. The earlier problems with double tripping are avoided, because it is allowed for holders of manage_sem to nest the second callback_sem lock, and only callback_sem is needed by code called from within __alloc_pages(), where the double tripping had been possible. This is not quite the same as a normal read/write semaphore, because obtaining read-only access with intent to change must hold off other such attempts, while allowing read-only access w/o such intention. Changing cpusets involves several related checks and changes, which must be done while allowing read-only queries (to avoid the double trip), but while ensuring nothing changes (holding off other would be modifiers.) This overhaul of cpuset locking also makes careful use of task_lock() to guard access to the task->cpuset pointer, closing a couple of race conditions noticed while reading this code (thanks, Roman). I've never seen these races fail in any use or test. See further the comments in the code. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] cpusets: remove depth counted locking hackPaul Jackson2005-10-301-65/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove a rather hackish depth counter on cpuset locking. The depth counter was avoiding a possible double trip on the global cpuset_sem semaphore. It worked, but now an improved version of cpuset locking is available, to come in the next patch, using two global semaphores. This patch reverses "cpuset semaphore depth check deadlock fix" The kernel still works, even after this patch, except for some rare and difficult to reproduce race conditions when agressively creating and destroying cpusets marked with the notify_on_release option, on very large systems. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] cpuset cleanupPaul Jackson2005-10-301-1/+0
| | | | | | | | | | | | | | | | Remove one more useless line from cpuset_common_file_read(). Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] posix-timers: use schedule_timeout() in common_nsleep()Oleg Nesterov2005-10-301-18/+1
| | | | | | | | | | | | | | | | | | common_nsleep() reimplements schedule_timeout_interruptible() for unknown reason. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] Unify sys_tkill() and sys_tgkill()Vadim Lobanov2005-10-301-46/+25
| | | | | | | | | | | | | | | | | | | | | | | | The majority of the sys_tkill() and sys_tgkill() function code is duplicated between the two of them. This patch pulls the duplication out into a separate function -- do_tkill() -- and lets sys_tkill() and sys_tgkill() be simple wrappers around it. This should make it easier to maintain in light of future changes. Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] kill sigqueue->lockOleg Nesterov2005-10-301-6/+4
| | | | | | | | | | | | | | | | | | | | This lock is used in sigqueue_free(), but it is always equal to current->sighand->siglock, so we don't need to keep it in the struct sigqueue. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] remove timer debug fieldAndrew Morton2005-10-301-35/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove timer_list.magic and associated debugging code. I originally added this when a spinlock was added to timer_list - this meant that an all-zeroes timer became illegal and init_timer() was required. That spinlock isn't even there any more, although timer.base must now be initialised. I'll keep this debugging code in -mm. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] Use alloc_percpu to allocate workqueues locallyChristoph Lameter2005-10-301-13/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch makes the workqueus use alloc_percpu instead of an array. The workqueues are placed on nodes local to each processor. The workqueue structure can grow to a significant size on a system with lots of processors if this patch is not applied. 64 bit architectures with all debugging features enabled and configured for 512 processors will not be able to boot without this patch. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] ntp whitespace cleanupAndrew Morton2005-10-301-126/+126
| | | | | | | | | | | | | | Fix bizarre 4-space coding style in the NTP code. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] NTP shift_right cleanupjohn stultz2005-10-302-61/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Create a macro shift_right() that avoids the numerous ugly conditionals in the NTP code that look like: if(a < 0) b = -(-a >> shift); else b = a >> shift; Replacing it with: b = shift_right(a, shift); This should have zero effect on the logic, however it should probably have a bit of testing just to be sure. Also replace open-coded min/max with the macros. Signed-off-by : John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] Add kthread_stop_sem()Alan Stern2005-10-301-2/+11
| | | | | | | | | | | | | | | | | | Enhance the kthread API by adding kthread_stop_sem, for use in stopping threads that spend their idle time waiting on a semaphore. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] introduce setup_timer() helperOleg Nesterov2005-10-301-6/+2
| | | | | | | | | | | | | | | | | | | | | | Every user of init_timer() also needs to initialize ->function and ->data fields. This patch adds a simple setup_timer() helper for that. The schedule_timeout() is patched as an example of usage. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] introduce .valid callback for pm_opsShaohua Li2005-10-301-1/+4
| | | | | | | | | | | | | | | | | | | | | | Add pm_ops.valid callback, so only the available pm states show in /sys/power/state. And this also makes an earlier states error report at enter_state before we do actual suspend/resume. Signed-off-by: Shaohua Li<shaohua.li@intel.com> Acked-by: Pavel Machek<pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] swsusp: two simplificationsRafael J. Wysocki2005-10-302-7/+3
| | | | | | | | | | | | | | | | | | | | | | The following patch simplifies the progress meter in disk.c:free_some_memory() and makes disk.c:pm_suspend_disk() call device_resume() explicitly in the suspend path. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] swsusp: get rid of unnecessary wrapper functionRafael J. Wysocki2005-10-301-7/+1
| | | | | | | | | | | | | | | | | | The following patch merges two functions in a trivial way. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] swsusp: cleanupsPavel Machek2005-10-301-12/+10
| | | | | | | | | | | | | | | | | | Reduce number of ifdefs somehow, and fix whitespace a bit. No real code changes. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] swsusp: remove unneccessary includesPavel Machek2005-10-302-31/+3
| | | | | | | | | | | | | | | | Cleanup comments and remove unneccessary includes. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] swsusp: rework memory freeing on resumeRafael J. Wysocki2005-10-304-83/+45
| | | | | | | | | | | | | | | | | | | | | | | | The following patch makes swsusp use the PG_nosave and PG_nosave_free flags to mark pages that should be freed in case of an error during resume. This allows us to simplify the code and to use swsusp_free() in all of the swsusp's resume error paths, which makes them actually work. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] swsusp: reduce the use of global variablesRafael J. Wysocki2005-10-303-43/+43
| | | | | | | | | | | | Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] swsusp: move snapshot functionality to separate fileRafael J. Wysocki2005-10-304-437/+486
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following patch moves the functionality of swsusp related to creating and handling the snapshot of memory to a separate file, snapshot.c This should enable us to untangle the code in the future and eventually to implement some parts of swsusp.c in the user space. The patch does not change the code. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] swsusp: rework image freeingRafael J. Wysocki2005-10-301-84/+75
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following patch makes swsusp use PG_nosave and PG_nosave_free flags to mark pages that should be freed after the state of the system has been restored from the image (or in case of an error during suspend). This allows us to avoid storing metadata in swap twice and to reduce the amount of memory needed by swsusp.  Additionally, it allows us to simplify the code by removing a couple of functions that are no longer necessary. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] create and destroy cpufreq sysfs entries based on cpu notifiersAshok Raj2005-10-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cpufreq entries in sysfs should only be populated when CPU is online state. When we either boot with maxcpus=x and then boot the other cpus by echoing to sysfs online file, these entries should be created and destroyed when CPU_DEAD is notified. Same treatement as cache entries under sysfs. We place the processor in the lowest frequency, so hw managed P-State transitions can still work on the other threads to save power. Primary goal was to just make these directories appear/disapper dynamically. There is one in this patch i had to do, which i really dont like myself but probably best if someone handling the cpufreq infrastructure could give this code right treatment if this is not acceptable. I guess its probably good for the first cut. - Converting lock_cpu_hotplug()/unlock_cpu_hotplug() to disable/enable preempt. The locking was smack in the middle of the notification path, when the hotplug is already holding the lock. I tried another solution to avoid this so avoid taking locks if we know we are from notification path. The solution was getting very ugly and i decided this was probably good for this iteration until someone who understands cpufreq could do a better job than me. (akpm: export cpucontrol to GPL modules: drivers/cpufreq/cpufreq_stats.c now does lock_cpu_hotplug()) Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: Zwane Mwaikambo <zwane@holomorphy.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] Remove redundant configs.oBrian Gerst2005-10-301-1/+0
| | | | | | | | | | | | | | | | | | | | Since CONFIG_IKCONFIG_PROC already depends on CONFIG_IKCONFIG, adding configs.o again is redundant. Signed-off-by: Brian Gerst <bgerst@didntduck.org> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] mm: split page table lockHugh Dickins2005-10-291-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with a many-threaded application which concurrently initializes different parts of a large anonymous area. This patch corrects that, by using a separate spinlock per page table page, to guard the page table entries in that page, instead of using the mm's single page_table_lock. (But even then, page_table_lock is still used to guard page table allocation, and anon_vma allocation.) In this implementation, the spinlock is tucked inside the struct page of the page table page: with a BUILD_BUG_ON in case it overflows - which it would in the case of 32-bit PA-RISC with spinlock debugging enabled. Splitting the lock is not quite for free: another cacheline access. Ideally, I suppose we would use split ptlock only for multi-threaded processes on multi-cpu machines; but deciding that dynamically would have its own costs. So for now enable it by config, at some number of cpus - since the Kconfig language doesn't support inequalities, let preprocessor compare that with NR_CPUS. But I don't think it's worth being user-configurable: for good testing of both split and unsplit configs, split now at 4 cpus, and perhaps change that to 8 later. There is a benefit even for singly threaded processes: kswapd can be attacking one part of the mm while another part is busy faulting. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] mm: follow_page with inner ptlockHugh Dickins2005-10-291-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Final step in pushing down common core's page_table_lock. follow_page no longer wants caller to hold page_table_lock, uses pte_offset_map_lock itself; and so no page_table_lock is taken in get_user_pages itself. But get_user_pages (and get_futex_key) do then need follow_page to pin the page for them: take Daniel's suggestion of bitflags to follow_page. Need one for WRITE, another for TOUCH (it was the accessed flag before: vanished along with check_user_page_readable, but surely get_numa_maps is wrong to mark every page it finds as accessed), another for GET. And another, ANON to dispose of untouched_anonymous_page: it seems silly for that to descend a second time, let follow_page observe if there was no page table and return ZERO_PAGE if so. Fix minor bug in that: check VM_LOCKED - make_pages_present ought to make readonly anonymous present. Give get_numa_maps a cond_resched while we're there. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] mm: ptd_alloc take ptlockHugh Dickins2005-10-291-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Second step in pushing down the page_table_lock. Remove the temporary bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not to hold page_table_lock, whether it's on init_mm or a user mm; take page_table_lock internally to check if a racing task already allocated. Convert their callers from common code. But avoid coming back to change them again later: instead of moving the spin_lock(&mm->page_table_lock) down, switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which encapsulate the mapping+locking and unlocking+unmapping together, and in the end may use alternatives to the mm page_table_lock itself. These callers all hold mmap_sem (some exclusively, some not), so at no level can a page table be whipped away from beneath them; and pte_alloc uses the "atomic" pmd_present to test whether it needs to allocate. It appears that on all arches we can safely descend without page_table_lock. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] mm: update_hiwaters just in timeHugh Dickins2005-10-292-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | update_mem_hiwater has attracted various criticisms, in particular from those concerned with mm scalability. Originally it was called whenever rss or total_vm got raised. Then many of those callsites were replaced by a timer tick call from account_system_time. Now Frank van Maarseveen reports that to be found inadequate. How about this? Works for Frank. Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros update_hiwater_rss and update_hiwater_vm. Don't attempt to keep mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually by 1): those are hot paths. Do the opposite, update only when about to lower rss (usually by many), or just before final accounting in do_exit. Handle mm->hiwater_vm in the same way, though it's much less of an issue. Demand that whoever collects these hiwater statistics do the work of taking the maximum with rss or total_vm. And there has been no collector of these hiwater statistics in the tree. The new convention needs an example, so match Frank's usage by adding a VmPeak line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS (High-Water-Mark or High-Water-Memory). There was a particular anomaly during mremap move, that hiwater_vm might be captured too high. A fleeting such anomaly remains, but it's quickly corrected now, whereas before it would stick. What locking? None: if the app is racy then these statistics will be racy, it's not worth any overhead to make them exact. But whenever it suits, hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under page_table_lock (for now) or with preemption disabled (later on): without going to any trouble, minimize the time between reading current values and updating, to minimize those occasions when a racing thread bumps a count up and back down in between. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] core remove PageReservedNick Piggin2005-10-291-9/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove PageReserved() calls from core code by tightening VM_RESERVED handling in mm/ to cover PageReserved functionality. PageReserved special casing is removed from get_page and put_page. All setting and clearing of PageReserved is retained, and it is now flagged in the page_alloc checks to help ensure we don't introduce any refcount based freeing of Reserved pages. MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being deprecated. We never completely handled it correctly anyway, and is be reintroduced in future if required (Hugh has a proof of concept). Once PageReserved() calls are removed from kernel/power/swsusp.c, and all arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can be trivially removed. Last real user of PageReserved is swsusp, which uses PageReserved to determine whether a struct page points to valid memory or not. This still needs to be addressed (a generic page_is_ram() should work). A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and thus mapcounted and count towards shared rss). These writes to the struct page could cause excessive cacheline bouncing on big systems. There are a number of ways this could be addressed if it is an issue. Signed-off-by: Nick Piggin <npiggin@suse.de> Refcount bug fix for filemap_xip.c Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] mm: dup_mmap down new mmap_semHugh Dickins2005-10-291-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | One anomaly remains from when Andrea rationalized the responsibilities of mmap_sem and page_table_lock: in dup_mmap we add vmas to the child holding its page_table_lock, but not the mmap_sem which normally guards the vma list and rbtree. Which could be an issue for unuse_mm: though since it just walks down the list (today with page_table_lock, tomorrow not), it's probably okay. Will need a memory barrier? Oh, keep it simple, Nick and I agreed, no harm in taking child's mmap_sem here. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] mm: dup_mmap use oldmm moreHugh Dickins2005-10-291-6/+6
| | | | | | | | | | | | | | | | | | | | Use the parent's oldmm throughout dup_mmap, instead of perversely going back to current->mm. (Can you hear the sigh of relief from those mpnts? Usually I squash them, but not today.) Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] mm: rss = file_rss + anon_rssHugh Dickins2005-10-292-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I was lazy when we added anon_rss, and chose to change as few places as possible. So currently each anonymous page has to be counted twice, in rss and in anon_rss. Which won't be so good if those are atomic counts in some configurations. Change that around: keep file_rss and anon_rss separately, and add them together (with get_mm_rss macro) when the total is needed - reading two atomics is much cheaper than updating two atomics. And update anon_rss upfront, typically in memory.c, not tucked away in page_add_anon_rmap. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] mm: mm_init set_mm_countersHugh Dickins2005-10-291-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | How is anon_rss initialized? In dup_mmap, and by mm_alloc's memset; but that's not so good if an mm_counter_t is a special type. And how is rss initialized? By set_mm_counter, all over the place. Come on, we just need to initialize them both at once by set_mm_counter in mm_init (which follows the memcpy when forking). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] mm: vm_stat_account unshackledHugh Dickins2005-10-291-1/+1
| | | | | | | | | | | | | | | | | | | | The original vm_stat_account has fallen into disuse, with only one user, and only one user of vm_stat_unaccount. It's easier to keep track if we convert them all to __vm_stat_account, then free it from its __shackles. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] TIMERS: add missing compensation for HZ == 250YOSHIFUJI Hideaki2005-10-291-0/+9
| | | | | | | | | | | | | | | | | | | | Add missing compensation for (HZ == 250) != (1 << SHIFT_HZ) in second_overflow(). Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] missing exports of do_settimeofday() variantsAl Viro2005-10-291-0/+1
| | | | | | | | | | | | | | | | | | frv, sh64, ia64 and sparc64 do not have do_settimeofday() exported (the last two are using variant in kernel/time.c). Exports added to match the rest of architectures. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] fix ->signal->live leak in copy_process()Oleg Nesterov2005-10-291-0/+2
| | | | | | | | | | | | | | | | | | exit_signal() (called from copy_process's error path) should decrement ->signal->live, otherwise forking process will miss 'group_dead' in do_exit(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] gfp_t: kernel/*Al Viro2005-10-284-9/+8
| | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* | Merge in v2.6.14 by handPaul Mackerras2005-10-282-42/+53
|\|
| * [PATCH] Yet more posix-cpu-timer fixesRoland McGrath2005-10-271-4/+7
| | | | | | | | | | | | | | | | | | | | This just makes sure that a thread's expiry times can't get reset after it clears them in do_exit. This is what allowed us to re-introduce the stricter BUG_ON() check in a362f463a6d316d14daed0f817e151835ce97ff7. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * Revert "remove false BUG_ON() from run_posix_cpu_timers()"Linus Torvalds2005-10-272-18/+26
| | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 3de463c7d9d58f8cf3395268230cb20a4c15bffa. Roland has another patch that allows us to leave the BUG_ON() in place by just making sure that the condition it tests for really is always true. That goes in next. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] Fix cpu timers expiration timeOleg Nesterov2005-10-261-3/+3
| | | | | | | | | | | | | | | | | | | | | | There's a silly off-by-one error in the code that updates the expiration of posix CPU timers, causing them to not be properly updated when they hit exactly on their expiration time (which should be the normal case). This causes them to then fire immediately again, and only _then_ get properly updated. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * posix cpu timers: fix timer orderingLinus Torvalds2005-10-261-6/+4
| | | | | | | | | | | | | | Pointed out by Oleg Nesterov, who has been walking over the code forwards and backwards. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] export cpu_online_mapAndrew Morton2005-10-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | With CONFIG_SMP=n: *** Warning: "cpu_online_map" [drivers/firmware/dcdbas.ko] undefined! due to set_cpus_allowed(). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] posix-timers: fix posix_cpu_timer_set() vs run_posix_cpu_timers() raceOleg Nesterov2005-10-241-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This might be harmless, but looks like a race from code inspection (I was unable to trigger it). I must admit, I don't understand why we can't return TIMER_RETRY after 'spin_unlock(&p->sighand->siglock)' without doing bump_cpu_timer(), but this is what original code does. posix_cpu_timer_set: read_lock(&tasklist_lock); spin_lock(&p->sighand->siglock); list_del_init(&timer->it.cpu.entry); spin_unlock(&p->sighand->siglock); We are probaly deleting the timer from run_posix_cpu_timers's 'firing' local list_head while run_posix_cpu_timers() does list_for_each_safe. Various bad things can happen, for example we can just delete this timer so that list_for_each() will not notice it and run_posix_cpu_timers() will not reset '->firing' flag. In that case, .... if (timer->it.cpu.firing) { read_unlock(&tasklist_lock); timer->it.cpu.firing = -1; return TIMER_RETRY; } sys_timer_settime() goes to 'retry:', calls posix_cpu_timer_set() again, it returns TIMER_RETRY ... Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] posix-timers: exit path cleanupOleg Nesterov2005-10-241-0/+6
| | | | | | | | | | | | | | No need to rebalance when task exited Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] posix-timers: remove false BUG_ON() from run_posix_cpu_timers()Oleg Nesterov2005-10-242-26/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | do_exit() clears ->it_##clock##_expires, but nothing prevents another cpu to attach the timer to exiting process after that. After exit_notify() does 'write_unlock_irq(&tasklist_lock)' and before do_exit() calls 'schedule() local timer interrupt can find tsk->exit_state != 0. If that state was EXIT_DEAD (or another cpu does sys_wait4) interrupted task has ->signal == NULL. At this moment exiting task has no pending cpu timers, they were cleaned up in __exit_signal()->posix_cpu_timers_exit{,_group}(), so we can just return from irq. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * [PATCH] posix-timers: fix cleanup_timers() and run_posix_cpu_timers() racesOleg Nesterov2005-10-241-19/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1. cleanup_timers() sets timer->task = NULL under tasklist + ->sighand locks. That means that this code in posix_cpu_timer_del() and posix_cpu_timer_set() lock_timer(timer); if (timer->task == NULL) return; read_lock(tasklist); put_task_struct(timer->task) is racy. With this patch timer->task modified and accounted only under timer->it_lock. Sadly, this means that dead task_struct won't be freed until timer deleted or armed. 2. run_posix_cpu_timers() collects expired timers into local list under tasklist + ->sighand again. That means that posix_cpu_timer_del() should check timer->it.cpu.firing under these locks too. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>