summaryrefslogtreecommitdiffstats
path: root/arch/x86/xen
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'cputime' of git://git390.osdl.marist.edu/pub/scm/linux-2.6Linus Torvalds2009-01-031-6/+4
|\ | | | | | | | | | | | | | | | | | | * 'cputime' of git://git390.osdl.marist.edu/pub/scm/linux-2.6: [PATCH] fast vdso implementation for CLOCK_THREAD_CPUTIME_ID [PATCH] improve idle cputime accounting [PATCH] improve precision of idle time detection. [PATCH] improve precision of process accounting. [PATCH] idle cputime accounting [PATCH] fix scaled & unscaled cputime accounting
| * [PATCH] idle cputime accountingMartin Schwidefsky2008-12-311-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The cpu time spent by the idle process actually doing something is currently accounted as idle time. This is plain wrong, the architectures that support VIRT_CPU_ACCOUNTING=y can do better: distinguish between the time spent doing nothing and the time spent by idle doing work. The first is accounted with account_idle_time and the second with account_system_time. The architectures that use the account_xxx_time interface directly and not the account_xxx_ticks interface now need to do the check for the idle process in their arch code. In particular to improve the system vs true idle time accounting the arch code needs to measure the true idle time instead of just testing for the idle process. To improve the tick based accounting as well we would need an architecture primitive that can tell us if the pt_regs of the interrupted context points to the magic instruction that halts the cpu. In addition idle time is no more added to the stime of the idle process. This field now contains the system time of the idle process as it should be. On systems without VIRT_CPU_ACCOUNTING this will always be zero as every tick that occurs while idle is running will be accounted as idle time. This patch contains the necessary common code changes to be able to distinguish idle system time and true idle time. The architectures with support for VIRT_CPU_ACCOUNTING need some changes to exploit this. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* | Merge branch 'cpus4096-for-linus-2' of ↵Linus Torvalds2009-01-025-20/+34
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'cpus4096-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (66 commits) x86: export vector_used_by_percpu_irq x86: use logical apicid in x2apic_cluster's x2apic_cpu_mask_to_apicid_and() sched: nominate preferred wakeup cpu, fix x86: fix lguest used_vectors breakage, -v2 x86: fix warning in arch/x86/kernel/io_apic.c sched: fix warning in kernel/sched.c sched: move test_sd_parent() to an SMP section of sched.h sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0 sched: activate active load balancing in new idle cpus sched: bias task wakeups to preferred semi-idle packages sched: nominate preferred wakeup cpu sched: favour lower logical cpu number for sched_mc balance sched: framework for sched_mc/smt_power_savings=N sched: convert BALANCE_FOR_xx_POWER to inline functions x86: use possible_cpus=NUM to extend the possible cpus allowed x86: fix cpu_mask_to_apicid_and to include cpu_online_mask x86: update io_apic.c to the new cpumask code x86: Introduce topology_core_cpumask()/topology_thread_cpumask() x86: xen: use smp_call_function_many() x86: use work_on_cpu in x86/kernel/cpu/mcheck/mce_amd_64.c ... Fixed up trivial conflict in kernel/time/tick-sched.c manually
| * x86: xen: use smp_call_function_many()Mike Travis2008-12-161-5/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | Impact: use new API, remove cpumask from stack. Change smp_call_function_mask() callers to smp_call_function_many(). This removes a cpumask from the stack, and falls back should allocating the cpumask var fail (only possible with CONFIG_CPUMASKS_OFFSTACK). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Mike Travis <travis@sgi.com> Cc: jeremy@xensource.com
| * x86: cosmetic changes apic-related files.Mike Travis2008-12-161-5/+6
| | | | | | | | | | | | | | | | This patch simply changes cpumask_t to struct cpumask and similar trivial modernizations. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Mike Travis <travis@sgi.com>
| * xen: convert to cpumask_var_t and new cpumask primitives.Mike Travis2008-12-163-5/+9
| | | | | | | | | | | | | | | | Simple change, and eventual space saving when NR_CPUS >> nr_cpu_ids. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Mike Travis <travis@sgi.com> Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
| * x86 smp: modify send_IPI_mask interface to accept cpumask_t pointersMike Travis2008-12-161-9/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: cleanup, change parameter passing * Change genapic interfaces to accept cpumask_t pointers where possible. * Modify external callers to use cpumask_t pointers in function calls. * Create new send_IPI_mask_allbutself which is the same as the send_IPI_mask functions but removes smp_processor_id() from list. This removes another common need for a temporary cpumask_t variable. * Functions that used a temp cpumask_t variable for: cpumask_t allbutme = cpu_online_map; cpu_clear(smp_processor_id(), allbutme); if (!cpus_empty(allbutme)) ... become: if (!cpus_equal(cpu_online_map, cpumask_of_cpu(cpu))) ... * Other minor code optimizations (like using cpus_clear instead of CPU_MASK_NONE, etc.) Applies to linux-2.6.tip/master. Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Ingo Molnar <mingo@elte.hu>
| * cpumask: convert struct clock_event_device to cpumask pointers.Rusty Russell2008-12-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Impact: change calling convention of existing clock_event APIs struct clock_event_timer's cpumask field gets changed to take pointer, as does the ->broadcast function. Another single-patch change. For safety, we BUG_ON() in clockevents_register_device() if it's not set. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Ingo Molnar <mingo@elte.hu>
* | xen: clean up asm/xen/hypervisor.hJeremy Fitzhardinge2008-12-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | Impact: cleanup hypervisor.h had accumulated a lot of crud, including lots of spurious #includes. Clean it all up, and go around fixing up everything else accordingly. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | xen: whitespace/checkpatch cleanupTej2008-12-164-19/+25
|/ | | | | | | | Impact: cleanup Signed-off-by: Tej <bewith.tej@gmail.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds2008-11-301-7/+14
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86: always define DECLARE_PCI_UNMAP* macros x86: fixup config space size of CPU functions for AMD family 11h x86, bts: fix wrmsr and spinlock over kmalloc x86, pebs: fix PEBS record size configuration x86, bts: turn macro into static inline function x86, bts: exclude ds.c from build when disabled arch/x86/kernel/pci-calgary_64.c: change simple_strtol to simple_strtoul x86: use limited register constraint for setnz xen: pin correct PGD on suspend x86: revert irq number limitation x86: fixing __cpuinit/__init tangle, xsave_cntxt_init() x86: fix __cpuinit/__init tangle in init_thread_xstate() oprofile: fix an overflow in ppro code
| * xen: pin correct PGD on suspendIan Campbell2008-11-231-7/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: fix Xen guest boot failure commit eefb47f6a1e855653d275cb90592a3587ea93a09 ("xen: use spin_lock_nest_lock when pinning a pagetable") changed xen_pgd_walk to walk over mm->pgd rather than taking pgd as an argument. This breaks xen_mm_(un)pin_all() because it makes init_mm.pgd readonly instead of the pgd we are interested in and therefore the pin subsequently fails. (XEN) mm.c:2280:d15 Bad type (saw 00000000e8000001 != exp 0000000060000000) for mfn bc464 (pfn 21ca7) (XEN) mm.c:2665:d15 Error while pinning mfn bc464 [ 14.586913] 1 multicall(s) failed: cpu 0 [ 14.586926] Pid: 14, comm: kstop/0 Not tainted 2.6.28-rc5-x86_32p-xenU-00172-gee2f6cc #200 [ 14.586940] Call Trace: [ 14.586955] [<c030c17a>] ? printk+0x18/0x1e [ 14.586972] [<c0103df3>] xen_mc_flush+0x163/0x1d0 [ 14.586986] [<c0104bc1>] __xen_pgd_pin+0xa1/0x110 [ 14.587000] [<c015a330>] ? stop_cpu+0x0/0xf0 [ 14.587015] [<c0104d7b>] xen_mm_pin_all+0x4b/0x70 [ 14.587029] [<c022bcb9>] xen_suspend+0x39/0xe0 [ 14.587042] [<c015a330>] ? stop_cpu+0x0/0xf0 [ 14.587054] [<c015a3cd>] stop_cpu+0x9d/0xf0 [ 14.587067] [<c01417cd>] run_workqueue+0x8d/0x150 [ 14.587080] [<c030e4b3>] ? _spin_unlock_irqrestore+0x23/0x40 [ 14.587094] [<c014558a>] ? prepare_to_wait+0x3a/0x70 [ 14.587107] [<c0141918>] worker_thread+0x88/0xf0 [ 14.587120] [<c01453c0>] ? autoremove_wake_function+0x0/0x50 [ 14.587133] [<c0141890>] ? worker_thread+0x0/0xf0 [ 14.587146] [<c014509c>] kthread+0x3c/0x70 [ 14.587157] [<c0145060>] ? kthread+0x0/0x70 [ 14.587170] [<c0109d1b>] kernel_thread_helper+0x7/0x10 [ 14.587181] call 1/3: op=14 arg=[c0415000] result=0 [ 14.587192] call 2/3: op=14 arg=[e1ca2000] result=0 [ 14.587204] call 3/3: op=26 arg=[c1808860] result=-22 Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | xen_play_dead() is __cpuinitAl Viro2008-11-301-1/+1
| | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | xen_setup_vcpu_info_placement() is not init on x86Al Viro2008-11-301-1/+1
|/ | | | | | | ... so get xen-ops.h in agreement with xen/smp.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds2008-11-071-2/+2
|\ | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, xen: fix use of pgd_page now that it really does return a page
| * x86, xen: fix use of pgd_page now that it really does return a pageJeremy Fitzhardinge2008-11-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: fix 32-bit Xen guest boot crash On 32-bit PAE, pud_page, for no good reason, didn't really return a struct page *. Since Jan Beulich's fix "i386/PAE: fix pud_page()", pud_page does return a struct page *. Because PAE has 3 pagetable levels, the pud level is folded into the pgd level, so pgd_page() is the same as pud_page(), and now returns a struct page *. Update the xen/mmu.c code which uses pgd_page() accordingly. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | xen: make sure stray alias mappings are gone before pinningJeremy Fitzhardinge2008-11-072-5/+9
|/ | | | | | | | | | | | | | | | | | | | Xen requires that all mappings of pagetable pages are read-only, so that they can't be updated illegally. As a result, if a page is being turned into a pagetable page, we need to make sure all its mappings are RO. If the page had been used for ioremap or vmalloc, it may still have left over mappings as a result of not having been lazily unmapped. This change makes sure we explicitly mop them all up before pinning the page. Unlike aliases created by kmap, the there can be vmalloc aliases even for non-high pages, so we must do the flush unconditionally. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Linux Memory Management List <linux-mm@kvack.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* Merge branch 'tracing-fixes-for-linus' of ↵Linus Torvalds2008-10-281-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits) ftrace: fix current_tracer error return tracing: fix a build error on alpha ftrace: use a real variable for ftrace_nop in x86 tracing/ftrace: make boot tracer select the sched_switch tracer tracepoint: check if the probe has been registered asm-generic: define DIE_OOPS in asm-generic trace: fix printk warning for u64 ftrace: warning in kernel/trace/ftrace.c ftrace: fix build failure ftrace, powerpc, sparc64, x86: remove notrace from arch ftrace file ftrace: remove ftrace hash ftrace: remove mcount set ftrace: remove daemon ftrace: disable dynamic ftrace for all archs that use daemon ftrace: add ftrace warn on to disable ftrace ftrace: only have ftrace_kill atomic ftrace: use probe_kernel ftrace: comment arch ftrace code ftrace: return error on failed modified text. ftrace: dynamic ftrace process only text section ...
| * Merge branch 'tracing/ftrace' into tracing/urgentIngo Molnar2008-10-221-1/+1
| |\
| | * ftrace: rename FTRACE to FUNCTION_TRACERSteven Rostedt2008-10-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Due to confusion between the ftrace infrastructure and the gcc profiling tracer "ftrace", this patch renames the config options from FTRACE to FUNCTION_TRACER. The other two names that are offspring from FTRACE DYNAMIC_FTRACE and FTRACE_MCOUNT_RECORD will stay the same. This patch was generated mostly by script, and partially by hand. Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | xen: fix Xen domU boot with batched mprotectChris Lalancette2008-10-271-4/+14
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: fix guest kernel boot crash on certain configs Recent i686 2.6.27 kernels with a certain amount of memory (between 736 and 855MB) have a problem booting under a hypervisor that supports batched mprotect (this includes the RHEL-5 Xen hypervisor as well as any 3.3 or later Xen hypervisor). The problem ends up being that xen_ptep_modify_prot_commit() is using virt_to_machine to calculate which pfn to update. However, this only works for pages that are in the p2m list, and the pages coming from change_pte_range() in mm/mprotect.c are kmap_atomic pages. Because of this, we can run into the situation where the lookup in the p2m table returns an INVALID_MFN, which we then try to pass to the hypervisor, which then (correctly) denies the request to a totally bogus pfn. The right thing to do is to use arbitrary_virt_to_machine, so that we can be sure we are modifying the right pfn. This unfortunately introduces a performance penalty because of a full page-table-walk, but we can avoid that penalty for pages in the p2m list by checking if virt_addr_valid is true, and if so, just doing the lookup in the p2m table. The attached patch implements this, and allows my 2.6.27 i686 based guest with 768MB of memory to boot on a RHEL-5 hypervisor again. Thanks to Jeremy for the suggestions about how to fix this particular issue. Signed-off-by: Chris Lalancette <clalance@redhat.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Chris Lalancette <clalance@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | Merge branch 'genirq-v28-for-linus' of ↵Linus Torvalds2008-10-202-3/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip This merges branches irq/genirq, irq/sparseirq-v4, timers/hpet-percpu and x86/uv. The sparseirq branch is just preliminary groundwork: no sparse IRQs are actually implemented by this tree anymore - just the new APIs are added while keeping the old way intact as well (the new APIs map 1:1 to irq_desc[]). The 'real' sparse IRQ support will then be a relatively small patch ontop of this - with a v2.6.29 merge target. * 'genirq-v28-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (178 commits) genirq: improve include files intr_remapping: fix typo io_apic: make irq_mis_count available on 64-bit too genirq: fix name space collisions of nr_irqs in arch/* genirq: fix name space collision of nr_irqs in autoprobe.c genirq: use iterators for irq_desc loops proc: fixup irq iterator genirq: add reverse iterator for irq_desc x86: move ack_bad_irq() to irq.c x86: unify show_interrupts() and proc helpers x86: cleanup show_interrupts genirq: cleanup the sparseirq modifications genirq: remove artifacts from sparseirq removal genirq: revert dynarray genirq: remove irq_to_desc_alloc genirq: remove sparse irq code genirq: use inline function for irq_to_desc genirq: consolidate nr_irqs and for_each_irq_desc() x86: remove sparse irq from Kconfig genirq: define nr_irqs for architectures with GENERIC_HARDIRQS=n ...
| * | genirq: revert dynarrayThomas Gleixner2008-10-161-1/+1
| | | | | | | | | | | | | | | | | | Revert the dynarray changes. They need more thought and polishing. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | xen: Fix bug `do_IRQ: cannot handle IRQ -1 vector 0x6 cpu 1'Alex Nixon2008-10-161-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Following commit 9c3f2468d8339866d9ef6a25aae31a8909c6be0d, do_IRQ() looks up the IRQ number in the per-cpu variable vector_irq. This commit makes Xen initialise an identity vector_irq map for both X86_32 and X86_64. Signed-off-by: Alex Nixon <alex.nixon@citrix.com> Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | x86: move kstat_irqs from kstat to irq_descYinghai Lu2008-10-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | based on Eric's patch ... together mold it with dyn_array for irq_desc, will allcate kstat_irqs for nr_irq_desc alltogether if needed. -- at that point nr_cpus is known already. v2: make sure system without generic_hardirqs works they don't have irq_desc v3: fix merging v4: [mingo@elte.hu] fix typo [ mingo@elte.hu ] irq: build fix fix: arch/x86/xen/spinlock.c: In function 'xen_spin_lock_slow': arch/x86/xen/spinlock.c:90: error: 'struct kernel_stat' has no member named 'irqs' Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | mm: rewrite vmap layerNick Piggin2008-10-202-0/+2
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and provide a fast, scalable percpu frontend for small vmaps (requires a slightly different API, though). The biggest problem with vmap is actually vunmap. Presently this requires a global kernel TLB flush, which on most architectures is a broadcast IPI to all CPUs to flush the cache. This is all done under a global lock. As the number of CPUs increases, so will the number of vunmaps a scaled workload will want to perform, and so will the cost of a global TLB flush. This gives terrible quadratic scalability characteristics. Another problem is that the entire vmap subsystem works under a single lock. It is a rwlock, but it is actually taken for write in all the fast paths, and the read locking would likely never be run concurrently anyway, so it's just pointless. This is a rewrite of vmap subsystem to solve those problems. The existing vmalloc API is implemented on top of the rewritten subsystem. The TLB flushing problem is solved by using lazy TLB unmapping. vmap addresses do not have to be flushed immediately when they are vunmapped, because the kernel will not reuse them again (would be a use-after-free) until they are reallocated. So the addresses aren't allocated again until a subsequent TLB flush. A single TLB flush then can flush multiple vunmaps from each CPU. XEN and PAT and such do not like deferred TLB flushing because they can't always handle multiple aliasing virtual addresses to a physical address. They now call vm_unmap_aliases() in order to flush any deferred mappings. That call is very expensive (well, actually not a lot more expensive than a single vunmap under the old scheme), however it should be OK if not called too often. The virtual memory extent information is stored in an rbtree rather than a linked list to improve the algorithmic scalability. There is a per-CPU allocator for small vmaps, which amortizes or avoids global locking. To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces must be used in place of vmap and vunmap. Vmalloc does not use these interfaces at the moment, so it will not be quite so scalable (although it will use lazy TLB flushing). As a quick test of performance, I ran a test that loops in the kernel, linearly mapping then touching then unmapping 4 pages. Different numbers of tests were run in parallel on an 4 core, 2 socket opteron. Results are in nanoseconds per map+touch+unmap. threads vanilla vmap rewrite 1 14700 2900 2 33600 3000 4 49500 2800 8 70631 2900 So with a 8 cores, the rewritten version is already 25x faster. In a slightly more realistic test (although with an older and less scalable version of the patch), I ripped the not-very-good vunmap batching code out of XFS, and implemented the large buffer mapping with vm_map_ram and vm_unmap_ram... along with a couple of other tricks, I was able to speed up a large directory workload by 20x on a 64 CPU system. I believe vmap/vunmap is actually sped up a lot more than 20x on such a system, but I'm running into other locks now. vmap is pretty well blown off the profiles. Before: 1352059 total 0.1401 798784 _write_lock 8320.6667 <- vmlist_lock 529313 default_idle 1181.5022 15242 smp_call_function 15.8771 <- vmap tlb flushing 2472 __get_vm_area_node 1.9312 <- vmap 1762 remove_vm_area 4.5885 <- vunmap 316 map_vm_area 0.2297 <- vmap 312 kfree 0.1950 300 _spin_lock 3.1250 252 sn_send_IPI_phys 0.4375 <- tlb flushing 238 vmap 0.8264 <- vmap 216 find_lock_page 0.5192 196 find_next_bit 0.3603 136 sn2_send_IPI 0.2024 130 pio_phys_write_mmr 2.0312 118 unmap_kernel_range 0.1229 After: 78406 total 0.0081 40053 default_idle 89.4040 33576 ia64_spinlock_contention 349.7500 1650 _spin_lock 17.1875 319 __reg_op 0.5538 281 _atomic_dec_and_lock 1.0977 153 mutex_unlock 1.5938 123 iget_locked 0.1671 117 xfs_dir_lookup 0.1662 117 dput 0.1406 114 xfs_iget_core 0.0268 92 xfs_da_hashname 0.1917 75 d_alloc 0.0670 68 vmap_page_range 0.0462 <- vmap 58 kmem_cache_alloc 0.0604 57 memset 0.0540 52 rb_next 0.1625 50 __copy_user 0.0208 49 bitmap_find_free_region 0.2188 <- vmap 46 ia64_sn_udelay 0.1106 45 find_inode_fast 0.1406 42 memcmp 0.2188 42 finish_task_switch 0.1094 42 __d_lookup 0.0410 40 radix_tree_lookup_slot 0.1250 37 _spin_unlock_irqrestore 0.3854 36 xfs_bmapi 0.0050 36 kmem_cache_free 0.0256 35 xfs_vn_getattr 0.0322 34 radix_tree_lookup 0.1062 33 __link_path_walk 0.0035 31 xfs_da_do_buf 0.0091 30 _xfs_buf_find 0.0204 28 find_get_page 0.0875 27 xfs_iread 0.0241 27 __strncpy_from_user 0.2812 26 _xfs_buf_initialize 0.0406 24 _xfs_buf_lookup_pages 0.0179 24 vunmap_page_range 0.0250 <- vunmap 23 find_lock_page 0.0799 22 vm_map_ram 0.0087 <- vmap 20 kfree 0.0125 19 put_page 0.0330 18 __kmalloc 0.0176 17 xfs_da_node_lookup_int 0.0086 17 _read_lock 0.0885 17 page_waitqueue 0.0664 vmap has gone from being the top 5 on the profiles and flushing the crap out of all TLBs, to using less than 1% of kernel time. [akpm@linux-foundation.org: cleanups, section fix] [akpm@linux-foundation.org: fix build on alpha] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Krzysztof Helt <krzysztof.h1@poczta.fm> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | x86: paravirt: factor out cpu_khz to common codeGlauber Costa2008-10-151-9/+2
|/ | | | | | | | | KVM intends to use paravirt code to calibrate khz. Xen current code will do just fine. So as a first step, factor out code to pvclock.c. Signed-off-by: Glauber Costa <gcosta@redhat.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
* Merge branch 'linus' into x86/xenIngo Molnar2008-10-121-14/+51
|\ | | | | | | | | | | | | Conflicts: arch/x86/kernel/cpu/common.c arch/x86/kernel/process_64.c arch/x86/xen/enlighten.c
| * Merge branch 'x86/apic' into x86-v28-for-linus-phase4-BIngo Molnar2008-10-111-4/+41
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: arch/x86/kernel/apic_32.c arch/x86/kernel/apic_64.c arch/x86/kernel/setup.c drivers/pci/intel-iommu.c include/asm-x86/cpufeature.h include/asm-x86/dma-mapping.h
| | * Merge branch 'linus' into x86/x2apicIngo Molnar2008-07-254-8/+8
| | |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/pci/dmar.c Signed-off-by: Ingo Molnar <mingo@elte.hu>
| | * \ Merge branch 'linus' into x86/x2apicIngo Molnar2008-07-2213-343/+1425
| | |\ \
| | * | | x86, xen: fix apic_ops build on UPIngo Molnar2008-07-201-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fix: arch/x86/xen/enlighten.c:615: error: variable ‘xen_basic_apic_ops’ has initializer but incomplete type arch/x86/xen/enlighten.c:616: error: unknown field ‘read’ specified in initializer [...] Signed-off-by: Ingo Molnar <mingo@elte.hu>
| | * | | Merge branch 'x86/apic' into x86/x2apicIngo Molnar2008-07-181-1/+0
| | |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: arch/x86/kernel/paravirt.c arch/x86/kernel/smpboot.c arch/x86/kernel/vmi_32.c arch/x86/lguest/boot.c arch/x86/xen/enlighten.c include/asm-x86/apic.h include/asm-x86/paravirt.h Signed-off-by: Ingo Molnar <mingo@elte.hu>
| | * \ \ \ Merge branch 'linus' into x86/x2apicIngo Molnar2008-07-184-97/+53
| | |\ \ \ \
| | * | | | | x86: let 32bit use apic_ops too - fixYinghai Lu2008-07-141-10/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fix for pv - clean up the namespace there too. Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| | * | | | | x2apic: xen64 paravirt basic apic opsSuresh Siddha2008-07-121-2/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Define the Xen specific basic apic ops, in additon to paravirt apic ops, with some misc warning fixes. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: akpm@linux-foundation.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
| | | | | | |
| | \ \ \ \ \
| | \ \ \ \ \
| | \ \ \ \ \
| *---. \ \ \ \ \ Merge branches 'x86/alternatives', 'x86/cleanups', 'x86/commandline', ↵Ingo Molnar2008-10-061-10/+10
| |\ \ \ \ \ \ \ \ | | | |_|_|_|_|_|/ | | |/| | | | | | | | | | | | | | | 'x86/crashdump', 'x86/debug', 'x86/defconfig', 'x86/doc', 'x86/exports', 'x86/fpu', 'x86/gart', 'x86/idle', 'x86/mm', 'x86/mtrr', 'x86/nmi-watchdog', 'x86/oprofile', 'x86/paravirt', 'x86/reboot', 'x86/sparse-fixes', 'x86/tsc', 'x86/urgent' and 'x86/vmalloc' into x86-v28-for-linus-phase1
| | | | * | | | | x86, paravirt_ops: use unsigned long instead of u32 for alloc_p*() pfn argsEduardo Habkost2008-08-221-10/+10
| | | |/ / / / / | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch changes the pfn args from 'u32' to 'unsigned long' on alloc_p*() functions on paravirt_ops, and the corresponding implementations for Xen and VMI. The prototypes for CONFIG_PARAVIRT=n are already using unsigned long, so paravirt.h now matches the prototypes on asm-x86/pgalloc.h. It shouldn't result in any changes on generated code on 32-bit, with or without CONFIG_PARAVIRT. On both cases, 'codiff -f' didn't show any change after applying this patch. On 64-bit, there are (expected) binary changes only when CONFIG_PARAVIRT is enabled, as the patch is really supposed to change the size of the pfn args. [ v2: KVM_GUEST: use the right parameter type on kvm_release_pt() ] Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Acked-by: Jeremy Fitzhardinge <jeremy@goop.org> Acked-by: Zachary Amsden <zach@vmware.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | | | | | | xen: do not reserve 2 pages of padding between hypervisor and fixmap.Ian Campbell2008-10-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When reserving space for the hypervisor the Xen paravirt backend adds an extra two pages (this was carried forward from the 2.6.18-xen tree which had them "for safety"). Depending on various CONFIG options this can cause the boot time fixmaps to span multiple PMDs which is not supported and triggers a WARN in early_ioremap_init(). This was exposed by 2216d199b1430d1c0affb1498a9ebdbd9c0de439 which moved the dmi table parsing earlier. x86: fix CONFIG_X86_RESERVE_LOW_64K=y The bad_bios_dmi_table() quirk never triggered because we do DMI setup too late. Move it a bit earlier. There is no real reason to reserve these two extra pages and the fixmap already incorporates FIX_HOLE which serves the same purpose. None of the other callers of reserve_top_address do this. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | | | | | | xen: use spin_lock_nest_lock when pinning a pagetableJeremy Fitzhardinge2008-10-092-29/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When pinning/unpinning a pagetable with split pte locks, we can end up holding multiple pte locks at once (we need to hold the locks while there's a pending batched hypercall affecting the pte page). Because all the pte locks are in the same lock class, lockdep thinks that we're potentially taking a lock recursively. This warning is spurious because we always take the pte locks while holding mm->page_table_lock. lockdep now has spin_lock_nest_lock to express this kind of dominant lock use, so use it here so that lockdep knows what's going on. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | | | | | | xen: clean up x86-64 warningsJeremy Fitzhardinge2008-10-032-54/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are a couple of Xen features which rely on directly accessing per-cpu data via a segment register, which is not yet available on x86-64. In the meantime, just disable direct access to the vcpu info structure; this leaves some of the code as dead, but it will come to life in time, and the warnings are suppressed. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | | | | | | xen: make CONFIG_XEN_SAVE_RESTORE depend on CONFIG_XENChuck Ebbert2008-09-301-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Xen options need to depend on XEN. Also, add newline at end of file. Without this patch you need to disable CONFIG_PM in order to disable CPU hotplugging. Signed-off-by: Chuck Ebbert <cebbert@redhat.com> Acked-by Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | | | | | | Merge branch 'timers/urgent' into x86/xenIngo Molnar2008-09-231-1/+1
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: arch/x86/kernel/process_32.c arch/x86/kernel/process_64.c Manual merge: arch/x86/kernel/smpboot.c Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * | | | | | | xen: fix for xen guest with mem > 3.7GJeremy Fitzhardinge2008-09-141-1/+1
| | |/ / / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PFN_PHYS() can truncate large addresses unless its passed a suitable large type. This is fixed more generally in the patch series introducing phys_addr_t, but we need a short-term fix to solve a Xen regression reported by Roberto De Ioris. Reported-by: Roberto De Ioris <roberto@unbit.it> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | | | | | xen: fix pinning when not using split pte locksJeremy Fitzhardinge2008-09-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We only pin PTE pages when using split PTE locks, so don't do the pin/unpin when attaching/detaching pte pages to a pinned pagetable. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | | | | | Merge branch 'core/xen' into x86/xenIngo Molnar2008-09-101-1/+1
|\ \ \ \ \ \ \
| * | | | | | | mm: define USE_SPLIT_PTLOCKS rather than repeating expressionJeremy Fitzhardinge2008-09-101-1/+1
| |/ / / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Define USE_SPLIT_PTLOCKS as a constant expression rather than repeating "NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS" all over the place. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
| * / / / / / x86, xen: Use native_pte_flags instead of native_pte_val for .pte_flagsEduardo Habkost2008-09-061-1/+1
| |/ / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using native_pte_val triggers the BUG_ON() in the paravirt_ops version of pte_flags(). Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | | | | xen: make CPU hotplug functions staticAlex Nixon2008-09-082-10/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's no need for these functions to be accessed from outside of xen/smp.c Signed-off-by: Alex Nixon <alex.nixon@citrix.com> Acked-by: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* | | | | | x86, xen: fix build when !CONFIG_HOTPLUG_CPUAlex Nixon2008-09-081-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Alex Nixon <alex.nixon@citrix.com> Acked-by: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>