Commit Graph

13155 Commits (da8b066262e12d1d0a3b1e6d3486e500169bf730)

Author SHA1 Message Date
Tejun Heo 959d851caa Merge branch 'for-3.5' of ../cgroup into block/for-3.5/core-merged
cgroup/for-3.5 contains the following changes which blk-cgroup needs
to proceed with the on-going cleanup.

* Dynamic addition and removal of cftypes to make config/stat file
  handling modular for policies.

* cgroup removal update to not wait for css references to drain to fix
  blkcg removal hang caused by cfq caching cfqgs.

Pull in cgroup/for-3.5 into block/for-3.5/core.  This causes the
following conflicts in block/blk-cgroup.c.

* 761b3ef50e "cgroup: remove cgroup_subsys argument from callbacks"
  conflicts with blkiocg_pre_destroy() addition and blkiocg_attach()
  removal.  Resolved by removing @subsys from all subsys methods.

* 676f7c8f84 "cgroup: relocate cftype and cgroup_subsys definitions in
  controllers" conflicts with ->pre_destroy() and ->attach() updates
  and removal of modular config.  Resolved by dropping forward
  declarations of the methods and applying updates to the relocated
  blkio_subsys.

* 4baf6e3325 "cgroup: convert all non-memcg controllers to the new
  cftype interface" builds upon the previous item.  Resolved by adding
  ->base_cftypes to the relocated blkio_subsys.

Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01 12:55:00 -07:00
Tejun Heo 48ddbe1946 cgroup: make css->refcnt clearing on cgroup removal optional
Currently, cgroup removal tries to drain all css references.  If there
are active css references, the removal logic waits and retries
->pre_detroy() until either all refs drop to zero or removal is
cancelled.

This semantics is unusual and adds non-trivial complexity to cgroup
core and IMHO is fundamentally misguided in that it couples internal
implementation details (references to internal data structure) with
externally visible operation (rmdir).  To userland, this is a behavior
peculiarity which is unnecessary and difficult to expect (css refs is
otherwise invisible from userland), and, to policy implementations,
this is an unnecessary restriction (e.g. blkcg wants to hold css refs
for caching purposes but can't as that becomes visible as rmdir hang).

Unfortunately, memcg currently depends on ->pre_destroy() retrials and
cgroup removal vetoing and can't be immmediately switched to the new
behavior.  This patch introduces the new behavior of not waiting for
css refs to drain and maintains the old behavior for subsystems which
have __DEPRECATED_clear_css_refs set.

Once, memcg is updated, we can drop the code paths for the old
behavior as proposed in the following patch.  Note that the following
patch is incorrect in that dput work item is in cgroup and may lose
some of dputs when multiples css's are released back-to-back, and
__css_put() triggers check_for_release() when refcnt reaches 0 instead
of 1; however, it shows what part can be removed.

  http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251

Note that, in not-too-distant future, cgroup core will start emitting
warning messages for subsys which require the old behavior, so please
get moving.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2012-04-01 12:09:56 -07:00
Tejun Heo 28b4c27b8e cgroup: use negative bias on css->refcnt to block css_tryget()
When a cgroup is about to be removed, cgroup_clear_css_refs() is
called to check and ensure that there are no active css references.

This is currently achieved by dropping the refcnt to zero iff it has
only the base ref.  If all css refs could be dropped to zero, ref
clearing is successful and CSS_REMOVED is set on all css.  If not, the
base ref is restored.  While css ref is zero w/o CSS_REMOVED set, any
css_tryget() attempt on it busy loops so that they are atomic
w.r.t. the whole css ref clearing.

This does work but dropping and re-instating the base ref is somewhat
hairy and makes it difficult to add more logic to the put path as
there are two of them - the regular css_put() and the reversible base
ref clearing.

This patch updates css ref clearing such that blocking new
css_tryget() and putting the base ref are separate operations.
CSS_DEACT_BIAS, defined as INT_MIN, is added to css->refcnt and
css_tryget() busy loops while refcnt is negative.  After all css refs
are deactivated, if they were all one, ref clearing succeeded and
CSS_REMOVED is set and the base ref is put using the regular
css_put(); otherwise, CSS_DEACT_BIAS is subtracted from the refcnts
and the original postive values are restored.

css_refcnt() accessor which always returns the unbiased positive
reference counts is added and used to simplify refcnt usages.  While
at it, relocate and reformat comments in cgroup_has_css_refs().

This separates css->refcnt deactivation and putting the base ref,
which enables the next patch to make ref clearing optional.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
2012-04-01 12:09:56 -07:00
Tejun Heo 79578621b4 cgroup: implement cgroup_rm_cftypes()
Implement cgroup_rm_cftypes() which removes an array of cftypes from a
subsystem.  It can be called whether the target subsys is attached or
not.  cgroup core will remove the specified file from all existing
cgroups.

This will be used to improve sub-subsys modularity and will be helpful
for unified hierarchy.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
2012-04-01 12:09:56 -07:00
Tejun Heo 05ef1d7c4a cgroup: introduce struct cfent
This patch adds cfent (cgroup file entry) which is the association
between a cgroup and a file.  This is in-cgroup representation of
files under a cgroup directory.  This simplifies walking walking
cgroup files and thus cgroup_clear_directory(), which is now
implemented in two parts - cgroup_rm_file() and a loop around it.

cgroup_rm_file() will be used to implement cftype removal and cfent is
scheduled to serve cgroup specific per-file data (e.g. for sysfs-like
"sever" semantics).

v2: - cfe was freed from cgroup_rm_file() which led to use-after-free
      if the file had openers at the time of removal.  Moved to
      cgroup_diput().

    - cgroup_clear_directory() triggered WARN_ON_ONCE() if d_subdirs
      wasn't empty after removing all files.  This triggered
      spuriously if some files were open during directory clearing.
      Removed.

v3: - In cgroup_diput(), WARN_ONCE(!list_empty(&cfe->node)) could be
      spuriously triggered for root cgroups because they don't go
      through cgroup_clear_directory() on unmount.  Don't trigger WARN
      for root cgroups.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Glauber Costa <glommer@parallels.com>
2012-04-01 12:09:56 -07:00
Tejun Heo f6ea93723d cgroup: relocate __d_cgrp() and __d_cft()
Move the two macros upwards as they'll be used earlier in the file.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
2012-04-01 12:09:55 -07:00
Tejun Heo db0416b649 cgroup: remove cgroup_add_file[s]()
No controller is using cgroup_add_files[s]().  Unexport them, and
convert cgroup_add_files() to handle NULL entry terminated array
instead of taking count explicitly and continue creation on failure
for internal use.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
2012-04-01 12:09:55 -07:00
Tejun Heo 4baf6e3325 cgroup: convert all non-memcg controllers to the new cftype interface
Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
net_cls and device controllers to use the new cftype based interface.
Termination entry is added to cftype arrays and populate callbacks are
replaced with cgroup_subsys->base_cftypes initializations.

This is functionally identical transformation.  There shouldn't be any
visible behavior change.

memcg is rather special and will be converted separately.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Vivek Goyal <vgoyal@redhat.com>
2012-04-01 12:09:55 -07:00
Tejun Heo 6e6ff25bd5 cgroup: merge cft_release_agent cftype array into the base files array
Now that cftype can express whether a file should only be on root,
cft_release_agent can be merged into the base files cftypes array.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
2012-04-01 12:09:55 -07:00
Tejun Heo 8e3f6541d4 cgroup: implement cgroup_add_cftypes() and friends
Currently, cgroup directories are populated by subsys->populate()
callback explicitly creating files on each cgroup creation.  This
level of flexibility isn't needed or desirable.  It provides largely
unused flexibility which call for abuses while severely limiting what
the core layer can do through the lack of structure and conventions.

Per each cgroup file type, the only distinction that cgroup users is
making is whether a cgroup is root or not, which can easily be
expressed with flags.

This patch introduces cgroup_add_cftypes().  These deal with cftypes
instead of individual files - controllers indicate that certain types
of files exist for certain subsystem.  Newly added CFTYPE_*_ON_ROOT
flags indicate whether a cftype should be excluded or created only on
the root cgroup.

cgroup_add_cftypes() can be called any time whether the target
subsystem is currently attached or not.  cgroup core will create files
on the existing cgroups as necessary.

Also, cgroup_subsys->base_cftypes is added to ease registration of the
base files for the subsystem.  If non-NULL on subsys init, the cftypes
pointed to by ->base_cftypes are automatically registered on subsys
init / load.

Further patches will convert the existing users and remove the file
based interface.  Note that this interface allows dynamic addition of
files to an active controller.  This will be used for sub-controller
modularity and unified hierarchy in the longer term.

This patch implements the new mechanism but doesn't apply it to any
user.

v2: replaced DECLARE_CGROUP_CFTYPES[_COND]() with
    cgroup_subsys->base_cftypes, which works better for cgroup_subsys
    which is loaded as module.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
2012-04-01 12:09:55 -07:00
Tejun Heo b0ca5a84fc cgroup: build list of all cgroups under a given cgroupfs_root
Build a list of all cgroups anchored at cgroupfs_root->allcg_list and
going through cgroup->allcg_node.  The list is protected by
cgroup_mutex and will be used to improve cgroup file handling.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
2012-04-01 12:09:54 -07:00
Tejun Heo ff4c8d503e cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir()
cgroup_populate_dir() currently clears all files and then repopulate
the directory; however, the clearing part is only useful when it's
called from cgroup_remount().  Relocate the invocation to
cgroup_remount().

This is to prepare for further cgroup file handling updates.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
2012-04-01 12:09:54 -07:00
Tejun Heo 8b5a5a9dbc cgroup: deprecate remount option changes
This patch marks the following features for deprecation.

* Rebinding subsys by remount: Never reached useful state - only works
  on empty hierarchies.

* release_agent update by remount: release_agent itself will be
  replaced with conventional fsnotify notification.

v2: Lennart pointed out that "name=" is necessary for mounts w/o any
    controller attached.  Drop "name=" deprecation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
2012-04-01 12:09:54 -07:00
Linus Torvalds f22e08a79f Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar.

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix incorrect usage of for_each_cpu_mask() in select_fallback_rq()
  sched: Fix __schedule_bug() output when called from an interrupt
  sched/arch: Introduce the finish_arch_post_lock_switch() scheduler callback
2012-03-31 13:35:31 -07:00
Linus Torvalds f187e9fd68 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates and fixes from Ingo Molnar:
 "It's mostly fixes, but there's also two late items:

   - preliminary GTK GUI support for perf report
   - PMU raw event format descriptors in sysfs, to be parsed by tooling

  The raw event format in sysfs is a new ABI.  For example for the 'CPU'
  PMU we have:

    aldebaran:~> ll /sys/bus/event_source/devices/cpu/format/*
    -r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/any
    -r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/cmask
    -r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/edge
    -r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/event
    -r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/inv
    -r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/offcore_rsp
    -r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/pc
    -r--r--r--. 1 root root 4096 Mar 31 10:29 /sys/bus/event_source/devices/cpu/format/umask

  those lists of fields contain a specific format:

    aldebaran:~> cat /sys/bus/event_source/devices/cpu/format/offcore_rsp
    config1:0-63

  So, those who wish to specify raw events can now use the following
  event format:

    -e cpu/cmask=1,event=2,umask=3

  Most people will not want to specify any events (let alone raw
  events), they'll just use whatever default event the tools use.

  But for more obscure PMU events that have no cross-architecture
  generic events the above syntax is more usable and a bit more
  structured than specifying hex numbers."

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
  perf tools: Remove auto-generated bison/flex files
  perf annotate: Fix off by one symbol hist size allocation and hit accounting
  perf tools: Add missing ref-cycles event back to event parser
  perf annotate: addr2line wants addresses in same format as objdump
  perf probe: Finder fails to resolve function name to address
  tracing: Fix ent_size in trace output
  perf symbols: Handle NULL dso in dso__name_len
  perf symbols: Do not include libgen.h
  perf tools: Fix bug in raw sample parsing
  perf tools: Fix display of first level of callchains
  perf tools: Switch module.h into export.h
  perf: Move mmap page data_head offset assertion out of header
  perf: Fix mmap_page capabilities and docs
  perf diff: Fix to work with new hists design
  perf tools: Fix modifier to be applied on correct events
  perf tools: Fix various casting issues for 32 bits
  perf tools: Simplify event_read_id exit path
  tracing: Fix ftrace stack trace entries
  tracing: Move the tracing_on/off() declarations into CONFIG_TRACING
  perf report: Add a simple GTK2-based 'perf report' browser
  ...
2012-03-31 13:34:04 -07:00
Srivatsa S. Bhat e3831edd59 sched: Fix incorrect usage of for_each_cpu_mask() in select_fallback_rq()
The function for_each_cpu_mask() expects a *pointer* to struct
cpumask as its second argument, whereas select_fallback_rq()
passes the value itself.

And moreover, for_each_cpu_mask() has been marked as obselete
in include/linux/cpumask.h. So move to the more appropriate
for_each_cpu() variant.

Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Dave Jones <davej@redhat.com>
Cc: Liu Chuansheng <chuansheng.liu@intel.com>
Cc: vapier@gentoo.org
Cc: rusty@rustcorp.com.au
Link: http://lkml.kernel.org/r/4F75BED4.9050005@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-31 10:43:36 +02:00
Linus Torvalds f558c27238 Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull genirq updates from Thomas Gleixner.

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Adjust irq thread affinity on IRQ_SET_MASK_OK_NOCOPY return value
  genirq: Respect NUMA node affinity in setup_irq_irq affinity()
  genirq: Get rid of unneeded force parameter in irq_finalize_oneshot()
  genirq: Minor readablity improvement in irq_wake_thread()
2012-03-30 18:08:05 -07:00
Linus Torvalds 3a0d184943 Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core locking updates from Thomas Gleixner.

* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Mark get_robust_list as deprecated
  futex: Do not leak robust list to unprivileged process
2012-03-30 18:07:13 -07:00
Jiang Liu f5cb92ac82 genirq: Adjust irq thread affinity on IRQ_SET_MASK_OK_NOCOPY return value
irq_move_masked_irq() checks the return code of
chip->irq_set_affinity() only for 0, but IRQ_SET_MASK_OK_NOCOPY is
also a valid return code, which is there to avoid a redundant copy of
the cpumask. But in case of IRQ_SET_MASK_OK_NOCOPY we not only avoid
the redundant copy, we also fail to adjust the thread affinity of an
eventually threaded interrupt handler.

Handle IRQ_SET_MASK_OK (==0) and IRQ_SET_MASK_OK_NOCOPY(==1) return
values correctly by checking the valid return values seperately.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Keping Chen <chenkeping@huawei.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1333120296-13563-2-git-send-email-jiang.liu@huawei.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-03-30 23:13:46 +02:00
Linus Torvalds 1338631433 Merge branch 'for-3.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull urgent cgroup fix from Tejun Heo:
 "Commit 61d1d219c4 ('cgroup: remove extra calls to
  find_existing_css_set') which was part of the rc1 cgroup pull request
  made writes to the cgroup "tasks" file return an uninitialized retval
  on success which can cause boot failures with systemd.

  The change stayed in linux-next for quite some time but gcc
  interestingly failed to emit warning about using uninitialized
  variable and the problem seems to materialize only for certain build
  combinations (probably depends on register allocation).

  It's just missing local variable initialization and the fix is trivial
  & safe.  As the problem is critical when it materializes, I'm
  fast-tracking it.  Also included is Li's email address change in
  MAINTAINERS."

* 'for-3.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: cgroup_attach_task() could return -errno after success
  cgroup: update MAINTAINERS entry
2012-03-29 22:33:10 -07:00
Tejun Heo 8f121918f2 cgroup: cgroup_attach_task() could return -errno after success
61d1d219c4 "cgroup: remove extra calls to find_existing_css_set" made
cgroup_task_migrate() return void.  An unfortunate side effect was
that cgroup_attach_task() was depending on that function's return
value to clear its @retval on the success path.  On cgroup mounts
without any subsystem with ->can_attach() callback,
cgroup_attach_task() ended up returning @retval without initializing
it on success.

For some reason, gcc failed to warn about it and it didn't cause
cgroup_attach_task() to return non-zero value in many cases, probably
due to difference in register allocation.  When the problem
materializes, systemd fails to populate /systemd cgroup mount and
fails to boot.

Fix it by initializing @retval to zero on declaration.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jiri Kosina <jkosina@suse.cz>
LKML-Reference: <alpine.LNX.2.00.1203282354440.25526@pobox.suse.cz>
Reviewed-by: Mandeep Singh Baines <msb@chromium.org>
Acked-by: Li Zefan <lizefan@huawei.com>
2012-03-29 22:03:33 -07:00
Linus Torvalds a591afc01d Merge branch 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x32 support for x86-64 from Ingo Molnar:
 "This tree introduces the X32 binary format and execution mode for x86:
  32-bit data space binaries using 64-bit instructions and 64-bit kernel
  syscalls.

  This allows applications whose working set fits into a 32 bits address
  space to make use of 64-bit instructions while using a 32-bit address
  space with shorter pointers, more compressed data structures, etc."

Fix up trivial context conflicts in arch/x86/{Kconfig,vdso/vma.c}

* 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
  x32: Fix alignment fail in struct compat_siginfo
  x32: Fix stupid ia32/x32 inversion in the siginfo format
  x32: Add ptrace for x32
  x32: Switch to a 64-bit clock_t
  x32: Provide separate is_ia32_task() and is_x32_task() predicates
  x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls
  x86/x32: Fix the binutils auto-detect
  x32: Warn and disable rather than error if binutils too old
  x32: Only clear TIF_X32 flag once
  x32: Make sure TS_COMPAT is cleared for x32 tasks
  fs: Remove missed ->fds_bits from cessation use of fd_set structs internally
  fs: Fix close_on_exec pointer in alloc_fdtable
  x32: Drop non-__vdso weak symbols from the x32 VDSO
  x32: Fix coding style violations in the x32 VDSO code
  x32: Add x32 VDSO support
  x32: Allow x32 to be configured
  x32: If configured, add x32 system calls to system call tables
  x32: Handle process creation
  x32: Signal-related system calls
  x86: Add #ifdef CONFIG_COMPAT to <asm/sys_ia32.h>
  ...
2012-03-29 18:12:23 -07:00
Linus Torvalds 12679a2d7e Merge branch 'for-linus' of git://git.linaro.org/people/rmk/linux-arm
Pull more ARM updates from Russell King.

This got a fair number of conflicts with the <asm/system.h> split, but
also with some other sparse-irq and header file include cleanups.  They
all looked pretty trivial, though.

* 'for-linus' of git://git.linaro.org/people/rmk/linux-arm: (59 commits)
  ARM: fix Kconfig warning for HAVE_BPF_JIT
  ARM: 7361/1: provide XIP_VIRT_ADDR for no-MMU builds
  ARM: 7349/1: integrator: convert to sparse irqs
  ARM: 7259/3: net: JIT compiler for packet filters
  ARM: 7334/1: add jump label support
  ARM: 7333/2: jump label: detect %c support for ARM
  ARM: 7338/1: add support for early console output via semihosting
  ARM: use set_current_blocked() and block_sigmask()
  ARM: exec: remove redundant set_fs(USER_DS)
  ARM: 7332/1: extract out code patch function from kprobes
  ARM: 7331/1: extract out insn generation code from ftrace
  ARM: 7330/1: ftrace: use canonical Thumb-2 wide instruction format
  ARM: 7351/1: ftrace: remove useless memory checks
  ARM: 7316/1: kexec: EOI active and mask all interrupts in kexec crash path
  ARM: Versatile Express: add NO_IOPORT
  ARM: get rid of asm/irq.h in asm/prom.h
  ARM: 7319/1: Print debug info for SIGBUS in user faults
  ARM: 7318/1: gic: refactor irq_start assignment
  ARM: 7317/1: irq: avoid NULL check in for_each_irq_desc loop
  ARM: 7315/1: perf: add support for the Cortex-A7 PMU
  ...
2012-03-29 16:53:48 -07:00
Linus Torvalds 7fda0412c5 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar.

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  cpusets: Remove an unused variable
  sched/rt: Improve pick_next_highest_task_rt()
  sched: Fix select_fallback_rq() vs cpu_active/cpu_online
  sched/x86/smp: Do not enable IRQs over calibrate_delay()
  sched: Fix compiler warning about declared inline after use
  MAINTAINERS: Update email address for SCHEDULER and PERF EVENTS
2012-03-29 14:46:05 -07:00
Linus Torvalds 6b8212a313 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 updates from Ingo Molnar.

This touches some non-x86 files due to the sanitized INLINE_SPIN_UNLOCK
config usage.

Fixed up trivial conflicts due to just header include changes (removing
headers due to cpu_idle() merge clashing with the <asm/system.h> split).

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/apic/amd: Be more verbose about LVT offset assignments
  x86, tls: Off by one limit check
  x86/ioapic: Add io_apic_ops driver layer to allow interception
  x86/olpc: Add debugfs interface for EC commands
  x86: Merge the x86_32 and x86_64 cpu_idle() functions
  x86/kconfig: Remove CONFIG_TR=y from the defconfigs
  x86: Stop recursive fault in print_context_stack after stack overflow
  x86/io_apic: Move and reenable irq only when CONFIG_GENERIC_PENDING_IRQ=y
  x86/apic: Add separate apic_id_valid() functions for selected apic drivers
  locking/kconfig: Simplify INLINE_SPIN_UNLOCK usage
  x86/kconfig: Update defconfigs
  x86: Fix excessive MSR print out when show_msr is not specified
2012-03-29 14:28:26 -07:00
Linus Torvalds bcd550745f Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer core updates from Thomas Gleixner.

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  ia64: vsyscall: Add missing paranthesis
  alarmtimer: Don't call rtc_timer_init() when CONFIG_RTC_CLASS=n
  x86: vdso: Put declaration before code
  x86-64: Inline vdso clock_gettime helpers
  x86-64: Simplify and optimize vdso clock_gettime monotonic variants
  kernel-time: fix s/then/than/ spelling errors
  time: remove no_sync_cmos_clock
  time: Avoid scary backtraces when warning of > 11% adj
  alarmtimer: Make sure we initialize the rtctimer
  ntp: Fix leap-second hrtimer livelock
  x86, tsc: Skip refined tsc calibration on systems with reliable TSC
  rtc: Provide flag for rtc devices that don't support UIE
  ia64: vsyscall: Use seqcount instead of seqlock
  x86: vdso: Use seqcount instead of seqlock
  x86: vdso: Remove bogus locking in update_vsyscall_tz()
  time: Remove bogus comments
  time: Fix change_clocksource locking
  time: x86: Fix race switching from vsyscall to non-vsyscall clock
2012-03-29 14:16:48 -07:00
Grant Likely 092b2fb076 irqdomain: Remove powerpc dependency from debugfs file
The debugfs code is really generic for all platforms.  This patch removes the
powerpc-specific directory reference and makes it available to all
architectures.

Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2012-03-29 14:31:02 -06:00
Ingo Molnar 1f56ee7b68 Merge branch 'sched/arch' into sched/urgent
Merge reason: It has not gone upstream via the ARM tree, merge it via
              the scheduler tree.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-29 12:48:22 +02:00
Kees Cook ec0c4274e3 futex: Mark get_robust_list as deprecated
Notify get_robust_list users that the syscall is going away.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: kernel-hardening@lists.openwall.com
Cc: spender@grsecurity.net
Link: http://lkml.kernel.org/r/20120323190855.GA27213@www.outflux.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-03-29 11:37:17 +02:00
Kees Cook bdbb776f88 futex: Do not leak robust list to unprivileged process
It was possible to extract the robust list head address from a setuid
process if it had used set_robust_list(), allowing an ASLR info leak. This
changes the permission checks to be the same as those used for similar
info that comes out of /proc.

Running a setuid program that uses robust futexes would have had:
  cred->euid != pcred->euid
  cred->euid == pcred->uid
so the old permissions check would allow it. I'm not aware of any setuid
programs that use robust futexes, so this is just a preventative measure.

(This patch is based on changes from grsecurity.)

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: kernel-hardening@lists.openwall.com
Cc: spender@grsecurity.net
Link: http://lkml.kernel.org/r/20120319231253.GA20893@www.outflux.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-03-29 11:37:17 +02:00
Prarit Bhargava 241fc640be genirq: Respect NUMA node affinity in setup_irq_irq affinity()
We respect node affinity of devices already in the irq descriptor
allocation, but we ignore it for the initial interrupt affinity
setup, so the interrupt might be routed to a different node.

Restrict the default affinity mask to the node on which the irq
descriptor is allocated.

[ tglx: Massaged changelog ]

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1332788538-17425-1-git-send-email-prarit@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-03-29 11:31:53 +02:00
Alexander Gordeev f3f79e38d5 genirq: Get rid of unneeded force parameter in irq_finalize_oneshot()
The only place irq_finalize_oneshot() is called with force parameter set
is the threaded handler error exit path. But IRQTF_RUNTHREAD is dropped
at this point and irq_wake_thread() is not going to set it again,
since PF_EXITING is set for this thread already. So irq_finalize_oneshot()
will drop the threads bit in threads_oneshot anyway and hence the force
parameter is superfluous.

Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/20120321162234.GP24806@dhcp-26-207.brq.redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-03-29 11:31:53 +02:00
Alexander Gordeev 69592db298 genirq: Minor readablity improvement in irq_wake_thread()
exit_irq_thread() clears IRQTF_RUNTHREAD flag and drops the thread's bit in
desc->threads_oneshot then. The bit must not be set again in between and it
does not, since irq_wake_thread() sees PF_EXITING flag first and returns.

Due to above the order or checking PF_EXITING and IRQTF_RUNTHREAD flags in
irq_wake_thread() is important. This change just makes it more visible in the
source code.

Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/20120321162212.GO24806@dhcp-26-207.brq.redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-03-29 11:31:52 +02:00
Stephen Boyd 6135fc1eb4 sched: Fix __schedule_bug() output when called from an interrupt
If schedule is called from an interrupt handler __schedule_bug()
will call show_regs() with the registers saved during the
interrupt handling done in do_IRQ(). This means we'll see the
registers and the backtrace for the process that was interrupted
and not the full backtrace explaining who called schedule().

This is due to 838225b ("sched: use show_regs() to improve
__schedule_bug() output", 2007-10-24) which improperly assumed
that get_irq_regs() would return the registers for the current
stack because it is being called from within an interrupt
handler. Simply remove the show_reg() code so that we dump a
backtrace for the interrupt handler that called schedule().

[ I ran across this when I was presented with a scheduling while
  atomic log with a stacktrace pointing at spin_unlock_irqrestore().
  It made no sense and I had to guess what interrupt handler could
  be called and poke around for someone calling schedule() in an
  interrupt handler. A simple test of putting an msleep() in
  an interrupt handler works better with this patch because you
  can actually see the msleep() call in the backtrace. ]

Also-reported-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Satyam Sharma <satyam@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1332979847-27102-1-git-send-email-sboyd@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-29 08:34:45 +02:00
Linus Torvalds 532bfc851a Merge branch 'akpm' (Andrew's patch-bomb)
Merge third batch of patches from Andrew Morton:
 - Some MM stragglers
 - core SMP library cleanups (on_each_cpu_mask)
 - Some IPI optimisations
 - kexec
 - kdump
 - IPMI
 - the radix-tree iterator work
 - various other misc bits.

 "That'll do for -rc1.  I still have ~10 patches for 3.4, will send
  those along when they've baked a little more."

* emailed from Andrew Morton <akpm@linux-foundation.org>: (35 commits)
  backlight: fix typo in tosa_lcd.c
  crc32: add help text for the algorithm select option
  mm: move hugepage test examples to tools/testing/selftests/vm
  mm: move slabinfo.c to tools/vm
  mm: move page-types.c from Documentation to tools/vm
  selftests/Makefile: make `run_tests' depend on `all'
  selftests: launch individual selftests from the main Makefile
  radix-tree: use iterators in find_get_pages* functions
  radix-tree: rewrite gang lookup using iterator
  radix-tree: introduce bit-optimized iterator
  fs/proc/namespaces.c: prevent crash when ns_entries[] is empty
  nbd: rename the nbd_device variable from lo to nbd
  pidns: add reboot_pid_ns() to handle the reboot syscall
  sysctl: use bitmap library functions
  ipmi: use locks on watchdog timeout set on reboot
  ipmi: simplify locking
  ipmi: fix message handling during panics
  ipmi: use a tasklet for handling received messages
  ipmi: increase KCS timeouts
  ipmi: decrease the IPMI message transaction time in interrupt mode
  ...
2012-03-28 17:19:28 -07:00
Daniel Lezcano cf3f89214e pidns: add reboot_pid_ns() to handle the reboot syscall
In the case of a child pid namespace, rebooting the system does not really
makes sense.  When the pid namespace is used in conjunction with the other
namespaces in order to create a linux container, the reboot syscall leads
to some problems.

A container can reboot the host.  That can be fixed by dropping the
sys_reboot capability but we are unable to correctly to poweroff/
halt/reboot a container and the container stays stuck at the shutdown time
with the container's init process waiting indefinitively.

After several attempts, no solution from userspace was found to reliabily
handle the shutdown from a container.

This patch propose to make the init process of the child pid namespace to
exit with a signal status set to : SIGINT if the child pid namespace
called "halt/poweroff" and SIGHUP if the child pid namespace called
"reboot".  When the reboot syscall is called and we are not in the initial
pid namespace, we kill the pid namespace for "HALT", "POWEROFF",
"RESTART", and "RESTART2".  Otherwise we return EINVAL.

Returning EINVAL is also an easy way to check if this feature is supported
by the kernel when invoking another 'reboot' option like CAD.

By this way the parent process of the child pid namespace knows if it
rebooted or not and can take the right decision.

Test case:
==========

#include <alloca.h>
#include <stdio.h>
#include <sched.h>
#include <unistd.h>
#include <signal.h>
#include <sys/reboot.h>
#include <sys/types.h>
#include <sys/wait.h>

#include <linux/reboot.h>

static int do_reboot(void *arg)
{
        int *cmd = arg;

        if (reboot(*cmd))
                printf("failed to reboot(%d): %m\n", *cmd);
}

int test_reboot(int cmd, int sig)
{
        long stack_size = 4096;
        void *stack = alloca(stack_size) + stack_size;
        int status;
        pid_t ret;

        ret = clone(do_reboot, stack, CLONE_NEWPID | SIGCHLD, &cmd);
        if (ret < 0) {
                printf("failed to clone: %m\n");
                return -1;
        }

        if (wait(&status) < 0) {
                printf("unexpected wait error: %m\n");
                return -1;
        }

        if (!WIFSIGNALED(status)) {
                printf("child process exited but was not signaled\n");
                return -1;
        }

        if (WTERMSIG(status) != sig) {
                printf("signal termination is not the one expected\n");
                return -1;
        }

        return 0;
}

int main(int argc, char *argv[])
{
        int status;

        status = test_reboot(LINUX_REBOOT_CMD_RESTART, SIGHUP);
        if (status < 0)
                return 1;
        printf("reboot(LINUX_REBOOT_CMD_RESTART) succeed\n");

        status = test_reboot(LINUX_REBOOT_CMD_RESTART2, SIGHUP);
        if (status < 0)
                return 1;
        printf("reboot(LINUX_REBOOT_CMD_RESTART2) succeed\n");

        status = test_reboot(LINUX_REBOOT_CMD_HALT, SIGINT);
        if (status < 0)
                return 1;
        printf("reboot(LINUX_REBOOT_CMD_HALT) succeed\n");

        status = test_reboot(LINUX_REBOOT_CMD_POWER_OFF, SIGINT);
        if (status < 0)
                return 1;
        printf("reboot(LINUX_REBOOT_CMD_POWERR_OFF) succeed\n");

        status = test_reboot(LINUX_REBOOT_CMD_CAD_ON, -1);
        if (status >= 0) {
                printf("reboot(LINUX_REBOOT_CMD_CAD_ON) should have failed\n");
                return 1;
        }
        printf("reboot(LINUX_REBOOT_CMD_CAD_ON) has failed as expected\n");

        return 0;
}

[akpm@linux-foundation.org: tweak and add comments]
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Tested-by: Serge Hallyn <serge.hallyn@canonical.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-28 17:14:36 -07:00
Akinobu Mita 5a04cca6c3 sysctl: use bitmap library functions
Use bitmap_set() instead of using set_bit() for each bit.  This conversion
is valid because the bitmap is private in the function call and atomic
bitops were unnecessary.

This also includes minor change.
- Use bitmap_copy() for shorter typing

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-28 17:14:36 -07:00
Zhenzhong Duan eaa3be6add kexec: add further check to crashkernel
When using crashkernel=2M-256M, the kernel doesn't give any warning.  This
is misleading sometimes.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-28 17:14:36 -07:00
Will Deacon d034cfab4f kexec: crash: don't save swapper_pg_dir for !CONFIG_MMU configurations
nommu platforms don't have very interesting swapper_pg_dir pointers and
usually just #define them to NULL, meaning that we can't include them in
the vmcoreinfo on the kexec crash path.

This patch only saves the swapper_pg_dir if we have an MMU.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-28 17:14:36 -07:00
Gilad Ben-Yossef b3a7e98e02 smp: add func to IPI cpus based on parameter func
Add the on_each_cpu_cond() function that wraps on_each_cpu_mask() and
calculates the cpumask of cpus to IPI by calling a function supplied as a
parameter in order to determine whether to IPI each specific cpu.

The function works around allocation failure of cpumask variable in
CONFIG_CPUMASK_OFFSTACK=y by itereating over cpus sending an IPI a time
via smp_call_function_single().

The function is useful since it allows to seperate the specific code that
decided in each case whether to IPI a specific cpu for a specific request
from the common boilerplate code of handling creating the mask, handling
failures etc.

[akpm@linux-foundation.org: s/gfpflags/gfp_flags/]
[akpm@linux-foundation.org: avoid double-evaluation of `info' (per Michal), parenthesise evaluation of `cond_func']
[akpm@linux-foundation.org: s/CPU/CPUs, use all 80 cols in comment]
Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Sasha Levin <levinsasha928@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Avi Kivity <avi@redhat.com>
Acked-by: Michal Nazarewicz <mina86@mina86.org>
Cc: Kosaki Motohiro <kosaki.motohiro@gmail.com>
Cc: Milton Miller <miltonm@bga.com>
Reviewed-by: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-28 17:14:35 -07:00
Gilad Ben-Yossef 3fc498f165 smp: introduce a generic on_each_cpu_mask() function
We have lots of infrastructure in place to partition multi-core systems
such that we have a group of CPUs that are dedicated to specific task:
cgroups, scheduler and interrupt affinity, and cpuisol= boot parameter.
Still, kernel code will at times interrupt all CPUs in the system via IPIs
for various needs.  These IPIs are useful and cannot be avoided
altogether, but in certain cases it is possible to interrupt only specific
CPUs that have useful work to do and not the entire system.

This patch set, inspired by discussions with Peter Zijlstra and Frederic
Weisbecker when testing the nohz task patch set, is a first stab at trying
to explore doing this by locating the places where such global IPI calls
are being made and turning the global IPI into an IPI for a specific group
of CPUs.  The purpose of the patch set is to get feedback if this is the
right way to go for dealing with this issue and indeed, if the issue is
even worth dealing with at all.  Based on the feedback from this patch set
I plan to offer further patches that address similar issue in other code
paths.

This patch creates an on_each_cpu_mask() and on_each_cpu_cond()
infrastructure API (the former derived from existing arch specific
versions in Tile and Arm) and uses them to turn several global IPI
invocation to per CPU group invocations.

Core kernel:

on_each_cpu_mask() calls a function on processors specified by cpumask,
which may or may not include the local processor.

You must not call this function with disabled interrupts or from a
hardware interrupt handler or from a bottom half handler.

arch/arm:

Note that the generic version is a little different then the Arm one:

1. It has the mask as first parameter
2. It calls the function on the calling CPU with interrupts disabled,
   but this should be OK since the function is called on the other CPUs
   with interrupts disabled anyway.

arch/tile:

The API is the same as the tile private one, but the generic version
also calls the function on the with interrupts disabled in UP case

This is OK since the function is called on the other CPUs
with interrupts disabled.

Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Sasha Levin <levinsasha928@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Avi Kivity <avi@redhat.com>
Acked-by: Michal Nazarewicz <mina86@mina86.org>
Cc: Kosaki Motohiro <kosaki.motohiro@gmail.com>
Cc: Milton Miller <miltonm@bga.com>
Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-28 17:14:35 -07:00
Linus Torvalds 0195c00244 Disintegrate and delete asm/system.h
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIVAwUAT3NKzROxKuMESys7AQKElw/+JyDxJSlj+g+nymkx8IVVuU8CsEwNLgRk
 8KEnRfLhGtkXFLSJYWO6jzGo16F8Uqli1PdMFte/wagSv0285/HZaKlkkBVHdJ/m
 u40oSjgT013bBh6MQ0Oaf8pFezFUiQB5zPOA9QGaLVGDLXCmgqUgd7exaD5wRIwB
 ZmyItjZeAVnDfk1R+ZiNYytHAi8A5wSB+eFDCIQYgyulA1Igd1UnRtx+dRKbvc/m
 rWQ6KWbZHIdvP1ksd8wHHkrlUD2pEeJ8glJLsZUhMm/5oMf/8RmOCvmo8rvE/qwl
 eDQ1h4cGYlfjobxXZMHqAN9m7Jg2bI946HZjdb7/7oCeO6VW3FwPZ/Ic75p+wp45
 HXJTItufERYk6QxShiOKvA+QexnYwY0IT5oRP4DrhdVB/X9cl2MoaZHC+RbYLQy+
 /5VNZKi38iK4F9AbFamS7kd0i5QszA/ZzEzKZ6VMuOp3W/fagpn4ZJT1LIA3m4A9
 Q0cj24mqeyCfjysu0TMbPtaN+Yjeu1o1OFRvM8XffbZsp5bNzuTDEvviJ2NXw4vK
 4qUHulhYSEWcu9YgAZXvEWDEM78FXCkg2v/CrZXH5tyc95kUkMPcgG+QZBB5wElR
 FaOKpiC/BuNIGEf02IZQ4nfDxE90QwnDeoYeV+FvNj9UEOopJ5z5bMPoTHxm4cCD
 NypQthI85pc=
 =G9mT
 -----END PGP SIGNATURE-----

Merge tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system

Pull "Disintegrate and delete asm/system.h" from David Howells:
 "Here are a bunch of patches to disintegrate asm/system.h into a set of
  separate bits to relieve the problem of circular inclusion
  dependencies.

  I've built all the working defconfigs from all the arches that I can
  and made sure that they don't break.

  The reason for these patches is that I recently encountered a circular
  dependency problem that came about when I produced some patches to
  optimise get_order() by rewriting it to use ilog2().

  This uses bitops - and on the SH arch asm/bitops.h drags in
  asm-generic/get_order.h by a circuituous route involving asm/system.h.

  The main difficulty seems to be asm/system.h.  It holds a number of
  low level bits with no/few dependencies that are commonly used (eg.
  memory barriers) and a number of bits with more dependencies that
  aren't used in many places (eg.  switch_to()).

  These patches break asm/system.h up into the following core pieces:

    (1) asm/barrier.h

        Move memory barriers here.  This already done for MIPS and Alpha.

    (2) asm/switch_to.h

        Move switch_to() and related stuff here.

    (3) asm/exec.h

        Move arch_align_stack() here.  Other process execution related bits
        could perhaps go here from asm/processor.h.

    (4) asm/cmpxchg.h

        Move xchg() and cmpxchg() here as they're full word atomic ops and
        frequently used by atomic_xchg() and atomic_cmpxchg().

    (5) asm/bug.h

        Move die() and related bits.

    (6) asm/auxvec.h

        Move AT_VECTOR_SIZE_ARCH here.

  Other arch headers are created as needed on a per-arch basis."

Fixed up some conflicts from other header file cleanups and moving code
around that has happened in the meantime, so David's testing is somewhat
weakened by that.  We'll find out anything that got broken and fix it..

* tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits)
  Delete all instances of asm/system.h
  Remove all #inclusions of asm/system.h
  Add #includes needed to permit the removal of asm/system.h
  Move all declarations of free_initmem() to linux/mm.h
  Disintegrate asm/system.h for OpenRISC
  Split arch_align_stack() out from asm-generic/system.h
  Split the switch_to() wrapper out of asm-generic/system.h
  Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
  Create asm-generic/barrier.h
  Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h
  Disintegrate asm/system.h for Xtensa
  Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt]
  Disintegrate asm/system.h for Tile
  Disintegrate asm/system.h for Sparc
  Disintegrate asm/system.h for SH
  Disintegrate asm/system.h for Score
  Disintegrate asm/system.h for S390
  Disintegrate asm/system.h for PowerPC
  Disintegrate asm/system.h for PA-RISC
  Disintegrate asm/system.h for MN10300
  ...
2012-03-28 15:58:21 -07:00
David Howells 9ffc93f203 Remove all #inclusions of asm/system.h
Remove all #inclusions of asm/system.h preparatory to splitting and killing
it.  Performed with the following command:

perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`

Signed-off-by: David Howells <dhowells@redhat.com>
2012-03-28 18:30:03 +01:00
David Howells 96f951edb1 Add #includes needed to permit the removal of asm/system.h
asm/system.h is a cause of circular dependency problems because it contains
commonly used primitive stuff like barrier definitions and uncommonly used
stuff like switch_to() that might require MMU definitions.

asm/system.h has been disintegrated by this point on all arches into the
following common segments:

 (1) asm/barrier.h

     Moved memory barrier definitions here.

 (2) asm/cmpxchg.h

     Moved xchg() and cmpxchg() here.  #included in asm/atomic.h.

 (3) asm/bug.h

     Moved die() and similar here.

 (4) asm/exec.h

     Moved arch_align_stack() here.

 (5) asm/elf.h

     Moved AT_VECTOR_SIZE_ARCH here.

 (6) asm/switch_to.h

     Moved switch_to() here.

Signed-off-by: David Howells <dhowells@redhat.com>
2012-03-28 18:30:03 +01:00
David Howells d550bbd40c Disintegrate asm/system.h for Sparc
Disintegrate asm/system.h for Sparc.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: sparclinux@vger.kernel.org
2012-03-28 18:30:03 +01:00
Dan Carpenter 160594e99d cpusets: Remove an unused variable
We don't use "cpu" any more after 2baab4e904 "sched: Fix
select_fallback_rq() vs cpu_active/cpu_online".

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120328104608.GD29022@elgon.mountain
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-28 13:40:44 +02:00
Steven Rostedt 12b5da349a tracing: Fix ent_size in trace output
When reading the trace file, the records of each of the per_cpu buffers
are examined to find the next event to print out. At the point of looking
at the event, the size of the event is recorded. But if the first event is
chosen, the other events in the other CPU buffers will reset the event size
that is stored in the iterator descriptor, causing the event size passed to
the output functions to be incorrect.

In most cases this is not a problem, but for the case of stack traces, it
is. With the change to the stack tracing to record a dynamic number of
back traces, the output depends on the size of the entry instead of the
fixed 8 back traces. When the entry size is not correct, the back traces
would not be fully printed.

Note, reading from the per-cpu trace files were not affected.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-03-27 12:05:44 -04:00
Michael J Wang 1b028abc77 sched/rt: Improve pick_next_highest_task_rt()
Avoid extra work by continuing on to the next rt_rq if the highest
prio task in current rt_rq is the same priority as our candidate
task.

More detailed explanation:  if next is not NULL, then we have found a
candidate task, and its priority is next->prio.  Now we are looking
for an even higher priority task in the other rt_rq's.  idx is the
highest priority in the current candidate rt_rq.  In the current 3.3
code, if idx is equal to next->prio, we would start scanning the tasks
in that rt_rq and replace the current candidate task with a task from
that rt_rq.  But the new task would only have a priority that is equal
to our previous candidate task, so we have not advanced our goal of
finding a higher prio task.  So we should avoid the extra work by
continuing on to the next rt_rq if idx is equal to next->prio.

Signed-off-by: Michael J Wang <mjwang@broadcom.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/2EF88150C0EF2C43A218742ED384C1BC0FC83D6B@IRVEXCHMB08.corp.ad.broadcom.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-27 14:52:12 +02:00
Peter Zijlstra 2baab4e904 sched: Fix select_fallback_rq() vs cpu_active/cpu_online
Commit 5fbd036b55 ("sched: Cleanup cpu_active madness"), which was
supposed to finally sort the cpu_active mess, instead uncovered more.

Since CPU_STARTING is ran before setting the cpu online, there's a
(small) window where the cpu has active,!online.

If during this time there's a wakeup of a task that used to reside on
that cpu select_task_rq() will use select_fallback_rq() to compute an
alternative cpu to run on since we find !online.

select_fallback_rq() however will compute the new cpu against
cpu_active, this means that it can return the same cpu it started out
with, the !online one, since that cpu is in fact marked active.

This results in us trying to scheduling a task on an offline cpu and
triggering a WARN in the IPI code.

The solution proposed by Chuansheng Liu of setting cpu_active in
set_cpu_online() is buggy, firstly not all archs actually use
set_cpu_online(), secondly, not all archs call set_cpu_online() with
IRQs disabled, this means we would introduce either the same race or
the race from fd8a7de17 ("x86: cpu-hotplug: Prevent softirq wakeup on
wrong CPU") -- albeit much narrower.

[ By setting online first and active later we have a window of
  online,!active, fresh and bound kthreads have task_cpu() of 0 and
  since cpu0 isn't in tsk_cpus_allowed() we end up in
  select_fallback_rq() which excludes !active, resulting in a reset
  of ->cpus_allowed and the thread running all over the place. ]

The solution is to re-work select_fallback_rq() to require active
_and_ online. This makes the active,!online case work as expected,
OTOH archs running CPU_STARTING after setting online are now
vulnerable to the issue from fd8a7de17 -- these are alpha and
blackfin.

Reported-by: Chuansheng Liu <chuansheng.liu@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: linux-alpha@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-hubqk1i10o4dpvlm06gq7v6j@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-27 14:50:14 +02:00
Ingo Molnar 7fd52392c5 Merge branch 'linus' into perf/urgent
Merge reason: we need to fix a non-trivial merge conflict.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-26 17:19:03 +02:00