Commit Graph

15434 Commits (0cf062d0ffa33d491e2695b0d298ccf9cbb58d3d)

Author SHA1 Message Date
Guillaume Knispel 5ae87e79ec poll/select: avoid arithmetic overflow in __estimate_accuracy()
__estimate_accuracy() was prone to integer overflow, for example if *tv ==
{2147, 483648000} on a 32 bit computer (or even for delays as small as
{429, 500000000} if the task is niced).

Because the result was already forced between 0 and 100ms, the effect of
the overflow was not too problematic, but the use of the hrtimer range
feature was not optimal in overflow cases.

This patch ensures that there can not be an integer overflow in this
function.

Signed-off-by: Guillaume Knispel <gknispel@proformatique.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23 07:39:27 -07:00
Roel Kluin ca976c53de smbfs: read buffer overflow
This function uses signed integers for the unix_date and local variables -
if a negative number is supplied and the leap-year condition is not met,
month will be 0, leading to a read of day_n[-1]

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-23 07:39:27 -07:00
Tao Ma b80474b432 ocfs2: Use buffer IO if we are appending a file.
In ocfs2_file_aio_write, we will prevent direct io if
we find that we are appending(changing i_size) and call
generic_file_aio_write_nolock. But actually O_DIRECT flag
is there and this function will call generic_file_direct_write
eventually which will update i_size and leave di->i_size
alone. The bug is
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1173.

So this patch let ocfs2_direct_IO returns 0 directly if we
are appending so that buffered write will be called and
di->i_size get updated successfully. And this is also
what we want in ocfs2_file_aio_write.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-09-23 01:54:49 -07:00
Wengang Wang 83e32d9044 ocfs2: add spinlock protection when dealing with lockres->purge.
when we check/modify lockres->purge, we should with the protection of lockres->spinlock.
in dlm_purge_lockres(), the checking/modifying is not with the protectin.
this patch fixes it.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-09-23 01:54:48 -07:00
Coly Li d92bc5127b dlmglue.c: add missed mlog lines
This patch adds the missed mlog_exit() and mlog_exit_void() lines when routines
return.

Signed-off-by: Coly Li <coly.li@suse.de>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-09-23 01:54:47 -07:00
Sunil Mushran a2f2ddbf2b ocfs2: __ocfs2_abort() should not enable panic for local mounts
In a clustered setup, we have to panic the box on journal abort. This is
because we don't have the facility to go hard readonly. With hard ro, another
node would detect node failure and initiate recovery.

Having said that, we shouldn't force panic if the volume is mounted locally.
This patch defers the handling to the mount option, errors.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-09-23 01:54:46 -07:00
Tao Ma bd50873dc7 ocfs2: Add ioctl for reflink.
The ioctl will take 3 parameters: old_path, new_path and
preserve and call vfs_reflink. It is useful when we backport
reflink features to old kernels.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:51 -07:00
Tao Ma 64871b8d62 ocfs2: Enable refcount tree support.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:50 -07:00
Tao Ma 09bf27a000 ocfs2: Implement ocfs2_reflink.
Implement ocfs2_reflink.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:49 -07:00
Tao Ma 0fe9b66c65 ocfs2: Add preserve to reflink.
reflink has 2 options for the destination file:
1. snapshot: reflink will attempt to preserve ownership, permissions,
   and all other security state in order to create a full snapshot.
2. new file: it will acquire the data extent sharing but will see the
   file's security state and attributes initialized as a new file.

So add the option to ocfs2.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:49 -07:00
Tao Ma bc13d34757 ocfs2: Create reflinked file in orphan dir.
reflink is a very complicated process, so it can't be integrated
into one transaction. So if the system panic in the operation, we
may leave a unfinished inode in the destication directory.

So we will try to create an inode in orphan_dir first, reflink it
to the src file and then move it to the destication file in the end.
In that way we won't be afraid of any corruption during the reflink.

This patch adds 2 functions for orphan_dir operation:
1. Create a new inode in orphand dir.
2. Move an inode to a target dir.

Note:
fsck.ocfs2 should work for us to remove the unfinished file in the
orphan_dir.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:48 -07:00
Tao Ma 19bd341f6a ocfs2: Use proper parameter for some inode operation.
In order to make the original function more suitable for reflink,
we modify the following inode operations. Both are tiny.

1. ocfs2_mknod_locked only use dentry for mlog, so move it to
   the caller so that reflink can use it without dentry.
2. ocfs2_prepare_orphan_dir only want inode to get its ip_blkno.
   So use ip_blkno instead.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:47 -07:00
Tao Ma c18b812d12 ocfs2: Make transaction extend more efficient.
In ocfs2_extend_rotate_transaction, op_credits is the orignal
credits in the handle and we only want to extend the credits
for the rotation, but the old solution always double it. It
is harmless for some minor operations, but for actions like
reflink we may rotate tree many times and cause the credits
increase dramatically. So this patch try to only increase
the desired credits.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:46 -07:00
Tao Ma 7540c1a77b ocfs2: Don't merge in 1st refcount ops of reflink.
Actually the whole reflink will touch refcount tree 2 times:
1. It will add the clusters in the extent record to the tree if it
   isn't refcounted before.
2. It will add 1 refcount to these clusters when it add these
   extent records to the tree.

So actually we shouldn't do merge in the 1st operation since the 2nd
one will soon be called and we may have to split it again. Do a merge
first and split soon is a waste of time. So we only merge in the 2nd
round. This is done by adding a new internal __ocfs2_increase_refcount
and call it with "not-merge" for 1st refcount operation in reflink.

This also has a side-effect that we don't need to worry too much about
the metadata allocation in the 2nd round since it will only merge and
no split will happen for those records.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:46 -07:00
Tao Ma ce9c5a54c0 ocfs2: Modify removing xattr process for refcount.
The old xattr value remove is quite simple, it just erase the
tree and free the clusters. But as we have added refcount support,
The process is a little complicated.

We have to lock the refcount tree at the beginning, what's more,
we may split the refcount tree in some cases, so meta/credits are
needed.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:45 -07:00
Tao Ma 2999d12f4d ocfs2: Add reflink support for xattr.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:45 -07:00
Tao Ma a7fe7a3a1a ocfs2: Create an xattr indexed block if needed.
With reflink, there is a need that we create a new xattr indexed
block from the very beginning. So add a new parameter for
ocfs2_create_xattr_block.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:44 -07:00
Tao Ma 8b2c0dba51 ocfs2: Call refcount tree remove process properly.
Now with xattr refcount support, we need to check whether
we have xattr refcounted before we remove the refcount tree.

Now the mechanism is:
1) Check whether i_clusters == 0, if no, exit.
2) check whether we have i_xattr_loc in dinode. if yes, exit.
2) Check whether we have inline xattr stored outside, if yes, exit.
4) Remove the tree.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:44 -07:00
Tao Ma 0129241e2b ocfs2: Attach xattr clusters to refcount tree.
In ocfs2, when xattr's value is larger than OCFS2_XATTR_INLINE_SIZE,
it will be kept outside of the blocks we store xattr entry. And they
are stored in a b-tree also. So this patch try to attach all these
clusters to refcount tree also.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:43 -07:00
Tao Ma 47bca4950b ocfs2: Abstract ocfs2 xattr tree extend rec iteration process.
Currently we have ocfs2_iterate_xattr_buckets which can receive
a para and a callback to iterate a series of bucket. It is good.
But actually the 2 callers ocfs2_xattr_tree_list_index_block and
ocfs2_delete_xattr_index_block are almost the same. The only
difference is that the latter need to handle the extent record
also. So add a new function named ocfs2_iterate_xattr_index_block.
It can be given func callback which are used for exten record.
So now we only have one iteration function for the xattr index
block. Ane what's more, it is useful for our future reflink
operations.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:43 -07:00
Tao Ma 5aea1f0ef4 ocfs2: Abstract the creation of xattr block.
In xattr reflink, we also need to create xattr block, so
abstract the process out.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:42 -07:00
Tao Ma fd68a894fc ocfs2: Remove inode from ocfs2_xattr_bucket_get_name_value.
In ocfs2_xattr_bucket_get_name_value, actually we only use
super_block. So use it.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:41 -07:00
Tao Ma 492a8a33e1 ocfs2: Add CoW support for xattr.
In order to make 2 transcation(xattr and cow) independent with each other,
we CoW the whole xattr out in case we are setting them.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:41 -07:00
Tao Ma 913580b4cd ocfs2: Abstract duplicate clusters process in CoW.
We currently use pagecache to duplicate clusters in CoW,
but it isn't suitable for xattr case. So abstract it out
so that the caller can decide which method it use.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:40 -07:00
Tao Ma 1061f9c1c9 ocfs2: Return extent flags for xattr value tree.
With the new refcount tree, xattr value can also be refcounted
among multiple files. So return the appropriate extent flags
so that CoW can used it later.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:39 -07:00
Tao Ma a9063ab9a3 ocfs2: handle file attributes issue for reflink.
A reflink creates a snapshot of a file, that means the attributes
must be identical except for three exceptions - nlink, ino, and ctime.

As for time changes, Here is a brief description:

1. Source file:
   1) atime: Ignore. Let the lazy atime code handle that.
   2) mtime: don't touch.
   3) ctime: If we change the tree (adding REFCOUNTED to at least one
             extent), update it.
2. Destination file:
   1) atime: ignore.
   2) mtime: we want it to appear identical to the source.
   3) ctime: update.

The idea here is that an ls -l will show the same time for the
src and target - it shows mtime.  Backup software like rsync and tar
will treat the new file correctly too.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:39 -07:00
Tao Ma 110a045aca ocfs2: Add normal functions for reflink a normal file's extents.
2 major functions are added in this patch.

ocfs2_attach_refcount_tree will create a new refcount tree to the
old file if it doesn't have one and insert all the extent records
to the tree if they are not refcounted.

ocfs2_create_reflink_node will:
1. set the refcount tree to the new file.
2. call ocfs2_duplicate_extent_list which will iterate all the
   extents for the old file, insert it to the new file and increase
   the corresponding referennce count.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:38 -07:00
Tao Ma 37f8a2bfaa ocfs2: CoW a reflinked cluster when it is truncated.
When we truncate a file to a specific size which resides in a reflinked
cluster, we need to CoW it since ocfs2_zero_range_for_truncate will
zero the space after the size(just another type of write).

So we add a "max_cpos" in ocfs2_refcount_cow so that it will stop when
it hit the max cluster offset.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:38 -07:00
Tao Ma 293b2f70b4 ocfs2: Integrate CoW in file write.
When we use mmap, we CoW the refcountd clusters in
ocfs2_write_begin_nolock. While for normal file
io(including directio), we do CoW in
ocfs2_prepare_inode_for_write.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:37 -07:00
Tao Ma 6ae23c5555 ocfs2: CoW refcount tree improvement.
During CoW, if the old extent record is refcounted, we allocate
som new clusters and do CoW. Actually we can have some improvement
here. If the old extent has refcount=1, that means now it is only
used by this file. So we don't need to allocate new clusters, just
remove the refcounted flag and it is OK. We also have to remove
it from the refcount tree while not deleting it.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:36 -07:00
Tao Ma 6f70fa5199 ocfs2: Add CoW support.
This patch try CoW support for a refcounted record.

the whole process will be:
1. Calculate how many clusters we need to CoW and where we start.
   Extents that are not completely encompassed by the write will
   be broken on 1MB boundaries.
2. Do CoW for the clusters with the help of page cache.
3. Change the b-tree structure with the new allocated clusters.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:36 -07:00
Tao Ma bcbbb24a6a ocfs2: Decrement refcount when truncating refcounted extents.
Add 'Decrement refcount for delete' in to the normal truncate
process. So for a refcounted extent record, call refcount rec
decrementation instead of cluster free.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:35 -07:00
Tao Ma 1aa75fea64 ocfs2: Add functions for extents refcounted.
Add function ocfs2_mark_extent_refcounted which can mark
an extent refcounted.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:34 -07:00
Tao Ma 1823cb0b9f ocfs2: Add support of decrementing refcount for delete.
Given a physical cpos and length, decrement the refcount
in the tree. If the refcount for any portion of the extent goes
to zero, that portion is queued for freeing.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:33 -07:00
Tao Ma e73a819db9 ocfs2: Add support for incrementing refcount in the tree.
Given a physical cpos and length, increment the refcount
in the tree. If the extent has not been seen before, a refcount
record is created for it. Refcount records may be merged or
split by this operation.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:33 -07:00
Tao Ma e2e9f6082b ocfs2: move tree path functions to alloc.h.
Now fs/ocfs2/alloc.c has more than 7000 lines. It contains our
basic b-tree operation. Although we have already make our b-tree
operation generic, the basic structrue ocfs2_path which is used
to iterate one b-tree branch is still static and limited to only
used in alloc.c. As refcount tree need them and I don't want to
add any more b-tree unrelated code to alloc.c, export them out.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:32 -07:00
Tao Ma fe92441595 ocfs2: Add refcount b-tree as a new extent tree.
Add refcount b-tree as a new extent tree so that it can
use the b-tree to store and maniuplate ocfs2_refcount_rec.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:31 -07:00
Tao Ma 555936bfcb ocfs2: Abstract extent split process.
ocfs2_mark_extent_written actually does the following things:
1. check the parameters.
2. initialize the left_path and split_rec.
3. call __ocfs2_mark_extent_written. it will do:
   1) check the flags of unwritten
   2) do the real split work.
The whole process is packed tightly somehow. So this patch
will abstract 2 different functions so that future b-tree
operation can work with it.

1. __ocfs2_split_extent will accept path and split_rec and do
  the real split work.
2. ocfs2_change_extent_flag will accept a new flag and initialize
   path and split_rec.

So now ocfs2_mark_extent_written will do:
1. check the parameters.
2. call ocfs2_change_extent_flag.
   1) initalize the left_path and split_rec.
   2) check whether the new flags conflict with the old one.
   3) call __ocfs2_split_extent to do the split.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:31 -07:00
Tao Ma 853a3a1439 ocfs2: Wrap ocfs2_extent_contig in ocfs2_extent_tree.
Add a new operation eo_ocfs2_extent_contig int the extent tree's
operations vector. So that with the new refcount tree, We want
this so that refcount trees can always return CONTIG_NONE and
prevent extent merging.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:30 -07:00
Tao Ma 8bf396de98 ocfs2: Basic tree root operation.
Add basic refcount tree root operation.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:30 -07:00
Tao Ma 374a263e79 ocfs2: Add refcount tree lock mechanism.
Implement locking around struct ocfs2_refcount_tree.  This protects
all read/write operations on refcount trees.  ocfs2_refcount_tree
has its own lock and its own caching_info, protecting buffers among
multiple nodes.

User must call ocfs2_lock_refcount_tree before his operation on
the tree and unlock it after that.

ocfs2_refcount_trees are referenced by the block number of the
refcount tree root block, So we create an rb-tree on the ocfs2_super
to look them up.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:29 -07:00
Tao Ma c732eb16bf ocfs2: Add caching info for refcount tree.
refcount tree should use its own caching info so that when
we downconvert the refcount tree lock, we can drop all the
cached buffer head.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:28 -07:00
Tao Ma 8dec98edfe ocfs2: Add new refcount tree lock resource in dlmglue.
refcount tree lock resource is used to protect refcount
tree read/write among multiple nodes.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:28 -07:00
Tao Ma a433848132 ocfs2: Abstract caching info checkpoint.
In meta downconvert, we need to checkpoint the metadata in an inode.
For refcount tree, we also need it. So abstract the process out.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:27 -07:00
Tao Ma f2c870e3b1 ocfs2: Add ocfs2_read_refcount_block.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:26 -07:00
Tao Ma 93c97087a6 ocfs2: Add metaecc for ocfs2_refcount_block.
Add metaecc and journal trigger for ocfs2_refcount_block.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:26 -07:00
Tao Ma 721f69c404 ocfs2: Define refcount tree structure.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:25 -07:00
Linus Torvalds a87e84b5cd Merge branch 'for-2.6.32' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.32' of git://linux-nfs.org/~bfields/linux: (68 commits)
  nfsd4: nfsv4 clients should cross mountpoints
  nfsd: revise 4.1 status documentation
  sunrpc/cache: avoid variable over-loading in cache_defer_req
  sunrpc/cache: use list_del_init for the list_head entries in cache_deferred_req
  nfsd: return success for non-NFS4 nfs4_state_start
  nfsd41: Refactor create_client()
  nfsd41: modify nfsd4.1 backchannel to use new xprt class
  nfsd41: Backchannel: Implement cb_recall over NFSv4.1
  nfsd41: Backchannel: cb_sequence callback
  nfsd41: Backchannel: Setup sequence information
  nfsd41: Backchannel: Server backchannel RPC wait queue
  nfsd41: Backchannel: Add sequence arguments to callback RPC arguments
  nfsd41: Backchannel: callback infrastructure
  nfsd4: use common rpc_cred for all callbacks
  nfsd4: allow nfs4 state startup to fail
  SUNRPC: Defer the auth_gss upcall when the RPC call is asynchronous
  nfsd4: fix null dereference creating nfsv4 callback client
  nfsd4: fix whitespace in NFSPROC4_CLNT_CB_NULL definition
  nfsd41: sunrpc: add new xprt class for nfsv4.1 backchannel
  sunrpc/cache: simplify cache_fresh_locked and cache_fresh_unlocked.
  ...
2009-09-22 07:54:33 -07:00
Linus Torvalds 342ff1a1b5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
  trivial: fix typo in aic7xxx comment
  trivial: fix comment typo in drivers/ata/pata_hpt37x.c
  trivial: typo in kernel-parameters.txt
  trivial: fix typo in tracing documentation
  trivial: add __init/__exit macros in drivers/gpio/bt8xxgpio.c
  trivial: add __init macro/ fix of __exit macro location in ipmi_poweroff.c
  trivial: remove unnecessary semicolons
  trivial: Fix duplicated word "options" in comment
  trivial: kbuild: remove extraneous blank line after declaration of usage()
  trivial: improve help text for mm debug config options
  trivial: doc: hpfall: accept disk device to unload as argument
  trivial: doc: hpfall: reduce risk that hpfall can do harm
  trivial: SubmittingPatches: Fix reference to renumbered step
  trivial: fix typos "man[ae]g?ment" -> "management"
  trivial: media/video/cx88: add __init/__exit macros to cx88 drivers
  trivial: fix typo in CONFIG_DEBUG_FS in gcov doc
  trivial: fix missing printk space in amd_k7_smp_check
  trivial: fix typo s/ketymap/keymap/ in comment
  trivial: fix typo "to to" in multiple files
  trivial: fix typos in comments s/DGBU/DBGU/
  ...
2009-09-22 07:51:45 -07:00
Michael S. Tsirkin 3d2d827f5c mm: move use_mm/unuse_mm from aio.c to mm/
Anyone who wants to do copy to/from user from a kernel thread, needs
use_mm (like what fs/aio has).  Move that into mm/, to make reusing and
exporting easier down the line, and make aio use it.  Next intended user,
besides aio, will be vhost-net.

Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:42 -07:00
Eric B Munson 6bfde05bf5 hugetlbfs: allow the creation of files suitable for MAP_PRIVATE on the vfs internal mount
This patchset adds a flag to mmap that allows the user to request that an
anonymous mapping be backed with huge pages.  This mapping will borrow
functionality from the huge page shm code to create a file on the kernel
internal mount and use it to approximate an anonymous mapping.  The
MAP_HUGETLB flag is a modifier to MAP_ANONYMOUS and will not work without
both flags being preset.

A new flag is necessary because there is no other way to hook into huge
pages without creating a file on a hugetlbfs mount which wouldn't be
MAP_ANONYMOUS.

To userspace, this mapping will behave just like an anonymous mapping
because the file is not accessible outside of the kernel.

This patchset is meant to simplify the programming model.  Presently there
is a large chunk of boiler platecode, contained in libhugetlbfs, required
to create private, hugepage backed mappings.  This patch set would allow
use of hugepages without linking to libhugetlbfs or having hugetblfs
mounted.

Unification of the VM code would provide these same benefits, but it has
been resisted each time that it has been suggested for several reasons: it
would break PAGE_SIZE assumptions across the kernel, it makes page-table
abstractions really expensive, and it does not provide any benefit on
architectures that do not support huge pages, incurring fast path
penalties without providing any benefit on these architectures.

This patch:

There are two means of creating mappings backed by huge pages:

        1. mmap() a file created on hugetlbfs
        2. Use shm which creates a file on an internal mount which essentially
           maps it MAP_SHARED

The internal mount is only used for shared mappings but there is very
little that stops it being used for private mappings. This patch extends
hugetlbfs_file_setup() to deal with the creation of files that will be
mapped MAP_PRIVATE on the internal hugetlbfs mount. This extended API is
used in a subsequent patch to implement the MAP_HUGETLB mmap() flag.

Signed-off-by: Eric Munson <ebmunson@us.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:41 -07:00
Hugh Dickins 3f96b79ad9 tmpfs: depend on shmem
CONFIG_SHMEM off gives you (ramfs masquerading as) tmpfs, even when
CONFIG_TMPFS is off: that's a little anomalous, and I'd intended to make
more sense of it by removing CONFIG_TMPFS altogether, always enabling its
code when CONFIG_SHMEM; but so many defconfigs have CONFIG_SHMEM on
CONFIG_TMPFS off that we'd better leave that as is.

But there is no point in asking for CONFIG_TMPFS if CONFIG_SHMEM is off:
make TMPFS depend on SHMEM, which also prevents TMPFS_POSIX_ACL
shmem_acl.o being pointlessly built into the kernel when SHMEM is off.

And a selfish change, to prevent the world from being rebuilt when I
switch between CONFIG_SHMEM on and off: the only CONFIG_SHMEM in the
header files is mm.h shmem_lock() - give that a shmem.c stub instead.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:41 -07:00
Hugh Dickins f3e8fccd06 mm: add get_dump_page
In preparation for the next patch, add a simple get_dump_page(addr)
interface for the CONFIG_ELF_CORE dumpers to use, instead of calling
get_user_pages() directly.  They're not interested in errors: they
just want to use holes as much as possible, to save space and make
sure that the data is aligned where the headers said it would be.

Oh, and don't use that horrid DUMP_SEEK(off) macro!

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:40 -07:00
KOSAKI Motohiro 5d863b8968 oom: fix oom_adjust_write() input sanity check
Andrew Morton pointed out oom_adjust_write() has very strange EIO
and new line handling. this patch fixes it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:39 -07:00
KOSAKI Motohiro 495789a51a oom: make oom_score to per-process value
oom-killer kills a process, not task.  Then oom_score should be calculated
as per-process too.  it makes consistency more and makes speed up
select_bad_process().

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:39 -07:00
KOSAKI Motohiro 28b83c5193 oom: move oom_adj value from task_struct to signal_struct
Currently, OOM logic callflow is here.

    __out_of_memory()
        select_bad_process()            for each task
            badness()                   calculate badness of one task
                oom_kill_process()      search child
                    oom_kill_task()     kill target task and mm shared tasks with it

example, process-A have two thread, thread-A and thread-B and it have very
fat memory and each thread have following oom_adj and oom_score.

     thread-A: oom_adj = OOM_DISABLE, oom_score = 0
     thread-B: oom_adj = 0,           oom_score = very-high

Then, select_bad_process() select thread-B, but oom_kill_task() refuse
kill the task because thread-A have OOM_DISABLE.  Thus __out_of_memory()
call select_bad_process() again.  but select_bad_process() select the same
task.  It mean kernel fall in livelock.

The fact is, select_bad_process() must select killable task.  otherwise
OOM logic go into livelock.

And root cause is, oom_adj shouldn't be per-thread value.  it should be
per-process value because OOM-killer kill a process, not thread.  Thus
This patch moves oomkilladj (now more appropriately named oom_adj) from
struct task_struct to struct signal_struct.  it naturally prevent
select_bad_process() choose wrong task.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:39 -07:00
Jan Beulich 4481374ce8 mm: replace various uses of num_physpages by totalram_pages
Sizing of memory allocations shouldn't depend on the number of physical
pages found in a system, as that generally includes (perhaps a huge amount
of) non-RAM pages.  The amount of what actually is usable as storage
should instead be used as a basis here.

Some of the calculations (i.e.  those not intending to use high memory)
should likely even use (totalram_pages - totalhigh_pages).

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Dave Airlie <airlied@linux.ie>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:38 -07:00
KAMEZAWA Hiroyuki 73d7c33e81 kcore: /proc/kcore should use vread
/proc/kcore has its own routine to access vmallc area.  It can be replaced
with vread().  And by this, /proc/kcore can do safe access to vmalloc
area.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Mike Smith <scgtrp@gmail.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:34 -07:00
Moussa A. Ba 398499d5f3 pagemap clear_refs: modify to specify anon or mapped vma clearing
The patch makes the clear_refs more versatile in adding the option to
select anonymous pages or file backed pages for clearing.  This addition
has a measurable impact on user space application performance as it
decreases the number of pagewalks in scenarios where one is only
interested in a specific type of page (anonymous or file mapped).

The patch adds anonymous and file backed filters to the clear_refs interface.

echo 1 > /proc/PID/clear_refs resets the bits on all pages
echo 2 > /proc/PID/clear_refs resets the bits on anonymous pages only
echo 3 > /proc/PID/clear_refs resets the bits on file backed pages only

Any other value is ignored

Signed-off-by: Moussa A. Ba <moussa.a.ba@gmail.com>
Signed-off-by: Jared E. Hulbert <jaredeh@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:33 -07:00
Hugh Dickins 9a84089514 ksm: identify PageKsm pages
KSM will need to identify its kernel merged pages unambiguously, and
/proc/kpageflags will probably like to do so too.

Since KSM will only be substituting anonymous pages, statistics are best
preserved by making a PageKsm page a special PageAnon page: one with no
anon_vma.

But KSM then needs its own page_add_ksm_rmap() - keep it in ksm.h near
PageKsm; and do_wp_page() must COW them, unlike singly mapped PageAnons.

Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Chris Wright <chrisw@redhat.com>
Signed-off-by: Izik Eidus <ieidus@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:31 -07:00
KOSAKI Motohiro 4b02108ac1 mm: oom analysis: add shmem vmstat
Recently we encountered OOM problems due to memory use of the GEM cache.
Generally a large amuont of Shmem/Tmpfs pages tend to create a memory
shortage problem.

We often use the following calculation to determine the amount of shmem
pages:

shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES

however the expression does not consider isolated and mlocked pages.

This patch adds explicit accounting for pages used by shmem and tmpfs.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:27 -07:00
KOSAKI Motohiro c6a7f5728a mm: oom analysis: Show kernel stack usage in /proc/meminfo and OOM log output
The amount of memory allocated to kernel stacks can become significant and
cause OOM conditions.  However, we do not display the amount of memory
consumed by stacks.

Add code to display the amount of memory used for stacks in /proc/meminfo.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:27 -07:00
Alexey Dobriyan 83d5cde47d const: make block_device_operations const
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:25 -07:00
Alexey Dobriyan 7b021967c5 const: make lock_manager_operations const
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:25 -07:00
Alexey Dobriyan 6aed62853c const: make file_lock_operations const
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:25 -07:00
Alexey Dobriyan 6e1d5dcc2b const: mark remaining inode_operations as const
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:24 -07:00
Alexey Dobriyan 7f09410bbc const: mark remaining address_space_operations const
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:24 -07:00
Alexey Dobriyan ac4cfdd6d1 const: mark remaining export_operations const
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:24 -07:00
Alexey Dobriyan b87221de6a const: mark remaining super_operations const
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:24 -07:00
Alexey Dobriyan 0d54b217a2 const: make struct super_block::s_qcop const
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:24 -07:00
Alexey Dobriyan 61e225dc34 const: make struct super_block::dq_op const
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:24 -07:00
Jan Kara 580be0837a fs: make sure data stored into inode is properly seen before unlocking new inode
In theory it could happen that on one CPU we initialize a new inode but
clearing of I_NEW | I_LOCK gets reordered before some of the
initialization.  Thus on another CPU we return not fully uptodate inode
from iget_locked().

This seems to fix a corruption issue on ext3 mounted over NFS.

[akpm@linux-foundation.org: add some commentary]
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-09-22 07:17:24 -07:00
Steve Dickson 3c394ddaa7 nfsd4: nfsv4 clients should cross mountpoints
Allow NFS v4 clients to seamlessly cross mount point without
have to set either the 'crossmnt' or the 'nohide' export
options.

Signed-Off-By: Steve Dickson <steved@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-09-21 16:02:25 -04:00
Linus Torvalds 43c1266ce4 Merge branch 'perfcounters-rename-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perfcounters-rename-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf: Tidy up after the big rename
  perf: Do the big rename: Performance Counters -> Performance Events
  perf_counter: Rename 'event' to event_id/hw_event
  perf_counter: Rename list_entry -> group_entry, counter_list -> group_list

Manually resolved some fairly trivial conflicts with the tracing tree in
include/trace/ftrace.h and kernel/trace/trace_syscalls.c.
2009-09-21 09:15:07 -07:00
Linus Torvalds 58e75a0973 Merge branch 'writeback' of git://git.kernel.dk/linux-2.6-block
* 'writeback' of git://git.kernel.dk/linux-2.6-block:
  nfs: initialize the backing_dev_info when creating the server
  writeback: make balance_dirty_pages() gradually back more off
  writeback: don't use schedule_timeout() without setting runstate
  nfs: nfs_kill_super() should call bdi_unregister() after killing super
2009-09-21 09:04:30 -07:00
Jens Axboe 48d0764998 nfs: initialize the backing_dev_info when creating the server
NFS may free the server structure without ever having used the
bdi, so we either need to flag the bdi as being uninitialized or
initialize it up front. This does the latter.

This fixes a crash with mounting more than one NFS file system,
should people ever need that kind of obscure NFS functionality.

Tested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-21 15:40:33 +02:00
Jens Axboe 92f25053c0 nfs: nfs_kill_super() should call bdi_unregister() after killing super
Otherwise we could be attempting to flush data for a writeback
thread and bdi that have already disappeared.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-09-21 15:40:32 +02:00
Joe Perches a419aef8b8 trivial: remove unnecessary semicolons
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-09-21 15:14:58 +02:00
Anand Gadiyar fd589a8f0a trivial: fix typo "to to" in multiple files
Signed-off-by: Anand Gadiyar <gadiyar@ti.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-09-21 15:14:55 +02:00
Anand Gadiyar 411c940385 trivial: fix typo "for for" in multiple files
trivial: fix typo "for for" in multiple files

Signed-off-by: Anand Gadiyar <gadiyar@ti.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2009-09-21 15:14:54 +02:00
Ingo Molnar cdd6c482c9 perf: Do the big rename: Performance Counters -> Performance Events
Bye-bye Performance Counters, welcome Performance Events!

In the past few months the perfcounters subsystem has grown out its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.

Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.

All in one, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)

The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.

Thanks goes to Stephane Eranian who first observed this misnomer and
suggested a rename.

User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)

This patch has been generated via the following script:

  FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

  sed -i \
    -e 's/PERF_EVENT_/PERF_RECORD_/g' \
    -e 's/PERF_COUNTER/PERF_EVENT/g' \
    -e 's/perf_counter/perf_event/g' \
    -e 's/nb_counters/nb_events/g' \
    -e 's/swcounter/swevent/g' \
    -e 's/tpcounter_event/tp_event/g' \
    $FILES

  for N in $(find . -name perf_counter.[ch]); do
    M=$(echo $N | sed 's/perf_counter/perf_event/g')
    mv $N $M
  done

  FILES=$(find . -name perf_event.*)

  sed -i \
    -e 's/COUNTER_MASK/REG_MASK/g' \
    -e 's/COUNTER/EVENT/g' \
    -e 's/\<event\>/event_id/g' \
    -e 's/counter/event/g' \
    -e 's/Counter/Event/g' \
    $FILES

... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.

Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.

( NOTE: 'counters' are still the proper terminology when we deal
  with hardware registers - and these sed scripts are a bit
  over-eager in renaming them. I've undone some of that, but
  in case there's something left where 'counter' would be
  better than 'event' we can undo that on an individual basis
  instead of touching an otherwise nicely automated patch. )

Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21 14:28:04 +02:00
Artem Bityutskiy 7cce2f4cb7 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 into linux-next
Conflicts:
	fs/ubifs/super.c

Merge the upstream tree in order to resolve a conflict with the
per-bdi writeback changes from the linux-2.6-block tree.
2009-09-21 12:09:22 +03:00
David Woodhouse 6469f540ea Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Conflicts:
	drivers/mtd/mtdcore.c

Merged in order that I can apply the Nomadik nand/onenand support patches.
2009-09-20 05:55:36 -07:00
David Woodhouse dd799983e9 jffs2: Use SLAB_HWCACHE_ALIGN for jffs2_raw_{dirent,inode} slabs
We may end up doing DMA to/from these. Until the new MTD API fixes the
issues, this should stop things from falling over.

Original idea from Gilles Casse <list@gcasse.net>

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2009-09-19 16:14:01 -07:00
Linus Torvalds 3530c18862 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (64 commits)
  ext4: Update documentation about quota mount options
  ext4: replace MAX_DEFRAG_SIZE with EXT_MAX_BLOCK
  ext4: Fix the alloc on close after a truncate hueristic
  ext4: Add a tracepoint for ext4_alloc_da_blocks()
  ext4: store EXT4_EXT_MIGRATE in i_state instead of i_flags
  ext4: limit block allocations for indirect-block files to < 2^32
  ext4: Fix different block exchange issue in EXT4_IOC_MOVE_EXT
  ext4: Add null extent check to ext_get_path
  ext4: Replace BUG_ON() with ext4_error() in move_extents.c
  ext4: Replace get_ext_path macro with an inline funciton
  ext4: Fix include/trace/events/ext4.h to work with Systemtap
  ext4: Fix initalization of s_flex_groups
  ext4: Always set dx_node's fake_dirent explicitly.
  ext4: Fix async commit mode to be safe by using a barrier
  ext4: Don't update superblock write time when filesystem is read-only
  ext4: Clarify the locking details in mballoc
  ext4: check for need init flag in ext4_mb_load_buddy
  ext4: move ext4_mb_init_group() function earlier in the mballoc.c
  ext4: Make non-journal fsync work properly
  ext4: Assure that metadata blocks are written during fsync in no journal mode
  ...
2009-09-18 10:56:26 -07:00
Linus Torvalds 9eead2a811 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: add fusectl interface to max_background
  fuse: limit user-specified values of max background requests
  fuse: use drop_nlink() instead of direct nlink manipulation
  fuse: document protocol version negotiation
  fuse: make the number of max background requests and congestion threshold tunable
2009-09-18 09:23:03 -07:00
Linus Torvalds 5ce0028987 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
  dlm: use kernel_sendpage
  dlm: fix connection close handling
  dlm: fix double-release of socket in error exit path
2009-09-18 09:19:10 -07:00
Linus Torvalds 2511817cf9 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
  ext3: Flush disk caches on fsync when needed
  ext3: Add locking to ext3_do_update_inode
  ext3: Fix possible deadlock between ext3_truncate() and ext3_get_blocks()
  jbd: Annotate transaction start also for journal_restart()
  jbd: Journal block numbers can ever be only 32-bit use unsigned int for them
  ext3: Update MAINTAINERS for ext3 and JBD
  JBD: round commit timer up to avoid uncommitted transaction
2009-09-18 09:18:52 -07:00
Linus Torvalds 79b520e87e Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs: (39 commits)
  xfs: includecheck fix for fs/xfs/xfs_iops.c
  xfs: switch to seq_file
  xfs: Record new maintainer information
  xfs: use correct log reservation when handling ENOSPC in xfs_create
  xfs: xfs_showargs() reports group *and* project quotas enabled
  xfs: un-static xfs_inobt_lookup
  xfs: actually enable the swapext compat handler
  xfs: simplify xfs_trans_iget
  xfs: merge fsync and O_SYNC handling
  xfs: speed up free inode search
  xfs: rationalize xfs_inobt_lookup*
  xfs: untangle xfs_dialloc
  xfs: factor out debug checks from xfs_dialloc and xfs_difree
  xfs: improve xfs_inobt_update prototype
  xfs: improve xfs_inobt_get_rec prototype
  xfs: factor out inode initialisation
  fs/xfs: Correct redundant test
  xfs: remove XFS_INO64_OFFSET
  un-static xfs_read_agf
  xfs: add more statics & drop some unused functions
  ...
2009-09-17 09:54:37 -07:00
Eric Sandeen 0a80e9867d ext4: replace MAX_DEFRAG_SIZE with EXT_MAX_BLOCK
There's no reason to redefine the maximum allowable offset
in an extent-based file just for defrag; 
EXT_MAX_BLOCK already does this.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-17 11:55:58 -04:00
Theodore Ts'o 5534fb5bb3 ext4: Fix the alloc on close after a truncate hueristic
In an attempt to avoid doing an unneeded flush after opening a
(previously non-existent) file with O_CREAT|O_TRUNC, the code only
triggered the hueristic if ei->disksize was non-zero.  Turns out that
the VFS doesn't call ->truncate() if the file doesn't exist, and
ei->disksize is always zero even if the file previously existed.  So
remove the test, since it isn't necessary and in fact disabled the
hueristic.

Thanks to Clemens Eisserer that he was seeing problems with files
written using kwrite and eclipse after sudden crashes caused by a
buggy Intel video driver.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-17 09:34:16 -04:00
Artem Bityutskiy e055f7e873 UBIFS: fix debugging dump
In 'dbg_check_space_info()' we want to dump current lprops statistics,
but actually dump old statistics. Fix this.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2009-09-17 15:08:31 +03:00
Theodore Ts'o fb40ba0d98 ext4: Add a tracepoint for ext4_alloc_da_blocks()
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-16 19:30:40 -04:00
Theodore Ts'o 1b9c12f44c ext4: store EXT4_EXT_MIGRATE in i_state instead of i_flags
EXT4_EXT_MIGRATE is only intended to be used for an in-memory flag,
and the hex value assigned to it collides with FS_DIRECTIO_FL (which
is also stored in i_flags).  There's no reason for the
EXT4_EXT_MIGRATE bit to be stored in i_flags, so we switch it to use
i_state instead.

Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-17 08:32:22 -04:00
Eric Sandeen fb0a387dcd ext4: limit block allocations for indirect-block files to < 2^32
Today, the ext4 allocator will happily allocate blocks past
2^32 for indirect-block files, which results in the block
numbers getting truncated, and corruption ensues.

This patch limits such allocations to < 2^32, and adds
BUG_ONs if we do get blocks larger than that.

This should address RH Bug 519471, ext4 bitmap allocator 
must limit blocks to < 2^32

* ext4_find_goal() is modified to choose a goal < UINT_MAX,
  so that our starting point is in an acceptable range.

* ext4_xattr_block_set() is modified such that the goal block
  is < UINT_MAX, as above.

* ext4_mb_regular_allocator() is modified so that the group
  search does not continue into groups which are too high

* ext4_mb_use_preallocated() has a check that we don't use
  preallocated space which is too far out

* ext4_alloc_blocks() and ext4_xattr_block_set() add some BUG_ONs

No attempt has been made to limit inode locations to < 2^32,
so we may wind up with blocks far from their inodes.  Doing
this much already will lead to some odd ENOSPC issues when the
"lower 32" gets full, and further restricting inodes could
make that even weirder.

For high inodes, choosing a goal of the original, % UINT_MAX,
may be a bit odd, but then we're in an odd situation anyway,
and I don't know of a better heuristic.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-16 14:45:10 -04:00
Akira Fujita c40ce3c9ea ext4: Fix different block exchange issue in EXT4_IOC_MOVE_EXT
If logical block offset of original file which is passed to
EXT4_IOC_MOVE_EXT is different from donor file's,
a calculation error occurs in ext4_calc_swap_extents(),
therefore wrong block is exchanged between original file and donor file.
As a result, we hit ext4_error() in check_block_validity().
To detect the logical offset difference in EXT4_IOC_MOVE_EXT,
add checks to mext_calc_swap_extents() and handle it as error,
since data exchange must be done between the same blocks in EXT4_IOC_MOVE_EXT.

Reported-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-16 14:25:39 -04:00
Akira Fujita 347fa6f1c7 ext4: Add null extent check to ext_get_path
There is the possibility that path structure which is taken
by ext4_ext_find_extent() indicates null extents.
Because during data block exchanging in ext4_move_extents(),
constitution of an extent tree may be changed.
As a solution, the patch adds null extent check
to ext_get_path().

Reported-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-16 14:25:07 -04:00
Akira Fujita 2147b1a6a4 ext4: Replace BUG_ON() with ext4_error() in move_extents.c
Replace BUG_ON calls with a call to ext4_error()
to print an error message if EXT4_IOC_MOVE_EXT failed
with some kind of reasons.  This will help to debug.
Ted pointed this out, thanks.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-16 13:46:35 -04:00
Akira Fujita e8505970af ext4: Replace get_ext_path macro with an inline funciton
Replace get_ext_path macro with an inline function,
since this macro looks like a function call but its arguments
get modified. Ted pointed this out, thanks.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-09-16 13:46:38 -04:00
Jan Kara 56fcad29d4 ext3: Flush disk caches on fsync when needed
In case we fsync() a file and inode is not dirty, we don't force a transaction
to disk and hence don't flush disk caches. Thus file data could be just in disk
caches and not on persistent storage. Fix the problem by flushing disk caches
if we didn't force a transaction commit.

Signed-off-by: Jan Kara <jack@suse.cz>
2009-09-16 17:44:11 +02:00