Commit graph

19271 commits

Author SHA1 Message Date
Sage Weil
52dfb8ac0e ceph: constify dentry_operations
This makes checkpatch happy.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-03 10:25:30 -07:00
Sage Weil
213c99ee0c ceph: whitespace cleanup
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-03 10:25:11 -07:00
Matthieu CASTET
c18de72fb3 UBIFS: fix a memory leak on error path.
In 'mount_ubifs()', in case of 'ubifs_leb_unmap()' falure,
free allocated resources.

Signed-off-by: Matthieu CASTET <matthieu.castet@parrot.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2010-08-03 08:58:09 +03:00
Greg Farnum
40819f6fb2 ceph: add flock/fcntl lock support
Implement flock inode operation to support advisory file locking.  All
lock/unlock operations are synchronous with the MDS.  Lock state is
sent when reconnecting to a recovering MDS to restore the shared lock
state.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 16:10:53 -07:00
Greg Farnum
fbaad9797a ceph: define on-wire types, constants for file locking support
Define the MDS operations and data types for doing file advisory locking
with the MDS.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 15:48:54 -07:00
Greg Farnum
c6f3fdc592 ceph: add CEPH_FEATURE_FLOCK to the supported feature bits
This informs the server that we will accept v2 client_caps format and v2
client_reconnect format messages.

Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 15:48:51 -07:00
Sage Weil
20cb34ae9e ceph: support v2 reconnect encoding
Encode either old or v2 encoding of client_reconnect message, depending on
whether the peer has the FLOCK feature bit.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 15:48:50 -07:00
Sage Weil
ce1fbc8dd6 ceph: support v2 client_caps encoding
Add support for v2 encoding of MClientCaps, which includes a flock blob.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 15:48:49 -07:00
Sage Weil
cbbfe49905 ceph: move AES iv definition to shared header
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 15:48:31 -07:00
Eric Van Hensbergen
327aec03ac 9p: fix sparse warnings in new xattr code
fixes:

  CHECK   fs/9p/xattr.c
	fs/9p/xattr.c:73:6: warning: Using plain integer as NULL pointer
	fs/9p/xattr.c:135:6: warning: Using plain integer as NULL pointer

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-08-02 14:28:38 -05:00
Eric Van Hensbergen
ea1375333e fs/9p: remove sparse warning in vfs_inode
make v9fs_dentry_from_dir_inode static

Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-08-02 14:28:37 -05:00
Aneesh Kumar K.V
a534c8d15b fs/9p: Prevent parallel rename when doing fid_lookup
During fid lookup we need to make sure that the dentry->d_parent doesn't
change so that we can safely walk the parent dentries. To ensure that
we need to prevent cross directory rename during fid_lookup. Add a
per superblock rename_sem rw_semaphore to prevent parallel fid lookup and
rename.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-08-02 14:28:35 -05:00
Aneesh Kumar K.V
ebf46264a0 fs/9p: Add support user. xattr
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-08-02 14:28:35 -05:00
M. Mohan Kumar
ef56547efa 9p: Implement LOPEN
Implement 9p2000.L version of open(LOPEN) interface in 9p client.

For LOPEN, no need to convert the flags to and from 9p mode to VFS mode.

Synopsis:

    size[4] Tlopen tag[2] fid[4] mode[4]

    size[4] Rlopen tag[2] qid[13] iounit[4]

[Fix mode bit format - jvrao@linux.vnet.ibm.com]

Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbegren <ericvh@gmail.com>
2010-08-02 14:28:32 -05:00
Venkateswararao Jujjuri (JV)
5643135a28 fs/9p: This patch implements TLCREATE for 9p2000.L protocol.
SYNOPSIS

    size[4] Tlcreate tag[2] fid[4] name[s] flags[4] mode[4] gid[4]

    size[4] Rlcreate tag[2] qid[13] iounit[4]

DESCRIPTION

The Tlreate request asks the file server to create a new regular file with the
name supplied, in the directory (dir) represented by fid.
The mode argument specifies the permissions to use. New file is created with
the uid if the fid and with supplied gid.

The flags argument represent Linux access mode flags with which the caller
is requesting to open the file with. Protocol allows all the Linux access
modes but it is upto the server to allow/disallow any of these acess modes.
If the server doesn't support any of the access mode, it is expected to
return error.

Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-08-02 14:28:32 -05:00
M. Mohan Kumar
01a622bd74 9p: Implement TMKDIR
Implement TMKDIR as part of 2000.L Work

Synopsis

    size[4] Tmkdir tag[2] fid[4] name[s] mode[4] gid[4]

    size[4] Rmkdir tag[2] qid[13]

Description

    mkdir asks the file server to create a directory with given name,
    mode and gid. The qid for the new directory is returned with
    the mkdir reply message.

Note: 72 is selected as the opcode for TMKDIR from the reserved list.

Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-08-02 14:28:31 -05:00
M. Mohan Kumar
4b43516ab1 9p: Implement TMKNOD
Synopsis

    size[4] Tmknod tag[2] fid[4] name[s] mode[4] major[4] minor[4] gid[4]

    size[4] Rmknod tag[2] qid[13]

Description

    mknod asks the file server to create a device node with given major and
    minor number, mode and gid. The qid for the new device node is returned
    with the mknod reply message.

[sripathik@in.ibm.com: Fix error handling code]

Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-08-02 14:28:30 -05:00
Venkateswararao Jujjuri (JV)
50cc42ff3d 9p: Define and implement TSYMLINK for 9P2000.L
Create a symbolic link

SYNOPSIS

size[4] Tsymlink tag[2] fid[4] name[s] symtgt[s] gid[4]

size[4] Rsymlink tag[2] qid[13]

DESCRIPTION

Create a symbolic link named 'name' pointing to 'symtgt'.
gid represents the effective group id of the caller.
The  permissions of a symbolic link are irrelevant hence it is omitted
from the protocol.

Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Reviewed-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-08-02 14:28:29 -05:00
Eric Van Hensbergen
09d34ee5f9 9p: Define and implement TLINK for 9P2000.L
This patch adds a helper function to get the dentry from inode and
uses it in creating a Hardlink

SYNOPSIS

size[4] Tlink tag[2] dfid[4] oldfid[4] newpath[s]

size[4] Rlink tag[2]

DESCRIPTION

Create a link 'newpath' in directory pointed by dfid linking to oldfid path.

[sripathik@in.ibm.com : p9_client_link should not free req structure
if p9_client_rpc has returned an error.]

Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-08-02 14:28:09 -05:00
Sripathi Kodi
87d7845aa0 9p: Implement client side of setattr for 9P2000.L protocol.
SYNOPSIS

      size[4] Tsetattr tag[2] attr[n]

      size[4] Rsetattr tag[2]

    DESCRIPTION

      The setattr command changes some of the file status information.
      attr resembles the iattr structure used in Linux kernel. It
      specifies which status parameter is to be changed and to what
      value. It is laid out as follows:

         valid[4]
            specifies which status information is to be changed. Possible
            values are:
            ATTR_MODE       (1 << 0)
            ATTR_UID        (1 << 1)
            ATTR_GID        (1 << 2)
            ATTR_SIZE       (1 << 3)
            ATTR_ATIME      (1 << 4)
            ATTR_MTIME      (1 << 5)
            ATTR_ATIME_SET  (1 << 7)
            ATTR_MTIME_SET  (1 << 8)

            The last two bits represent whether the time information
            is being sent by the client's user space. In the absense
            of these bits the server always uses server's time.

         mode[4]
            File permission bits

         uid[4]
            Owner id of file

         gid[4]
            Group id of the file

         size[8]
            File size

         atime_sec[8]
            Time of last file access, seconds

         atime_nsec[8]
            Time of last file access, nanoseconds

         mtime_sec[8]
            Time of last file modification, seconds

         mtime_nsec[8]
            Time of last file modification, nanoseconds

Explanation of the patches:
--------------------------

*) The kernel just copies relevent contents of iattr structure to
   p9_iattr_dotl structure and passes it down to the client. The
   only check it has is calling inode_change_ok()
*) The p9_iattr_dotl structure does not have ctime and ia_file
   parameters because I don't think these are needed in our case.
   The client user space can request updating just ctime by calling
   chown(fd, -1, -1). This is handled on server side without a need
   for putting ctime on the wire.
*) The server currently supports changing mode, time, ownership and
   size of the file.
*) 9P RFC says "Either all the changes in wstat request happen, or
   none of them does: if the request succeeds, all changes were made;
   if it fails, none were."
   I have not done anything to implement this specifically because I
   don't see a reason.

Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-08-02 14:25:10 -05:00
Sripathi Kodi
f085312204 9p: getattr client implementation for 9P2000.L protocol.
SYNOPSIS

              size[4] Tgetattr tag[2] fid[4] request_mask[8]

              size[4] Rgetattr tag[2] lstat[n]

           DESCRIPTION

              The getattr transaction inquires about the file identified by fid.
              request_mask is a bit mask that specifies which fields of the
              stat structure is the client interested in.

              The reply will contain a machine-independent directory entry,
              laid out as follows:

                 st_result_mask[8]
                    Bit mask that indicates which fields in the stat structure
                    have been populated by the server

                 qid.type[1]
                    the type of the file (directory, etc.), represented as a bit
                    vector corresponding to the high 8 bits of the file's mode
                    word.

                 qid.vers[4]
                    version number for given path

                 qid.path[8]
                    the file server's unique identification for the file

                 st_mode[4]
                    Permission and flags

                 st_uid[4]
                    User id of owner

                 st_gid[4]
                    Group ID of owner

                 st_nlink[8]
                    Number of hard links

                 st_rdev[8]
                    Device ID (if special file)

                 st_size[8]
                    Size, in bytes

                 st_blksize[8]
                    Block size for file system IO

                 st_blocks[8]
                    Number of file system blocks allocated

                 st_atime_sec[8]
                    Time of last access, seconds

                 st_atime_nsec[8]
                    Time of last access, nanoseconds

                 st_mtime_sec[8]
                    Time of last modification, seconds

                 st_mtime_nsec[8]
                    Time of last modification, nanoseconds

                 st_ctime_sec[8]
                    Time of last status change, seconds

                 st_ctime_nsec[8]
                    Time of last status change, nanoseconds

                 st_btime_sec[8]
                    Time of creation (birth) of file, seconds

                 st_btime_nsec[8]
                    Time of creation (birth) of file, nanoseconds

                 st_gen[8]
                    Inode generation

                 st_data_version[8]
                    Data version number

              request_mask and result_mask bit masks contain the following bits
                 #define P9_STATS_MODE          0x00000001ULL
                 #define P9_STATS_NLINK         0x00000002ULL
                 #define P9_STATS_UID           0x00000004ULL
                 #define P9_STATS_GID           0x00000008ULL
                 #define P9_STATS_RDEV          0x00000010ULL
                 #define P9_STATS_ATIME         0x00000020ULL
                 #define P9_STATS_MTIME         0x00000040ULL
                 #define P9_STATS_CTIME         0x00000080ULL
                 #define P9_STATS_INO           0x00000100ULL
                 #define P9_STATS_SIZE          0x00000200ULL
                 #define P9_STATS_BLOCKS        0x00000400ULL

                 #define P9_STATS_BTIME         0x00000800ULL
                 #define P9_STATS_GEN           0x00001000ULL
                 #define P9_STATS_DATA_VERSION  0x00002000ULL

                 #define P9_STATS_BASIC         0x000007ffULL
                 #define P9_STATS_ALL           0x00003fffULL

        This patch implements the client side of getattr implementation for
        9P2000.L. It introduces a new structure p9_stat_dotl for getting
        Linux stat information along with QID. The data layout is similar to
        stat structure in Linux user space with the following major
        differences:

        inode (st_ino) is not part of data. Instead qid is.

        device (st_dev) is not part of data because this doesn't make sense
        on the client.

        All time variables are 64 bit wide on the wire. The kernel seems to use
        32 bit variables for these variables. However, some of the architectures
        have used 64 bit variables and glibc exposes 64 bit variables to user
        space on some architectures. Hence to be on the safer side we have made
        these 64 bit in the protocol. Refer to the comments in
        include/asm-generic/stat.h

        There are some additional fields: st_btime_sec, st_btime_nsec, st_gen,
        st_data_version apart from the bitmask, st_result_mask. The bit mask
        is filled by the server to indicate which stat fields have been
        populated by the server. Currently there is no clean way for the
        server to obtain these additional fields, so it sends back just the
        basic fields.

Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Signed-off-by: Eric Van Hensbegren <ericvh@gmail.com>
2010-08-02 14:25:09 -05:00
Aneesh Kumar K.V
9ffaf63e34 fs/9p: Pass the correct user credentials during attach
We need to make sure we pass the right uid value
during attach. dotl is similar to dotu in this regard.
Without this mapped security model on dotl doesn't work

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-08-02 14:25:08 -05:00
Sripathi Kodi
7751bdb3a0 9p: readdir implementation for 9p2000.L
This patch implements the kernel part of readdir() implementation for 9p2000.L

    Change from V3: Instead of inode, server now sends qids for each dirent

    SYNOPSIS

    size[4] Treaddir tag[2] fid[4] offset[8] count[4]
    size[4] Rreaddir tag[2] count[4] data[count]

    DESCRIPTION

    The readdir request asks the server to read the directory specified by 'fid'
    at an offset specified by 'offset' and return as many dirent structures as
    possible that fit into count bytes. Each dirent structure is laid out as
    follows.

            qid.type[1]
              the type of the file (directory, etc.), represented as a bit
              vector corresponding to the high 8 bits of the file's mode
              word.

            qid.vers[4]
              version number for given path

            qid.path[8]
              the file server's unique identification for the file

            offset[8]
              offset into the next dirent.

            type[1]
              type of this directory entry.

            name[256]
              name of this directory entry.

    This patch adds v9fs_dir_readdir_dotl() as the readdir() call for 9p2000.L.
    This function sends P9_TREADDIR command to the server. In response the server
    sends a buffer filled with dirent structures. This is different from the
    existing v9fs_dir_readdir() call which receives stat structures from the server.
    This results in significant speedup of readdir() on large directories.
    For example, doing 'ls >/dev/null' on a directory with 10000 files on my
    laptop takes 1.088 seconds with the existing code, but only takes 0.339 seconds
    with the new readdir.

Signed-off-by: Sripathi Kodi <sripathik@in.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-08-02 14:25:07 -05:00
M. Mohan Kumar
97e8442b09 9p: Make use of iounit for read/write
Change the v9fs_file_readn function to limit the maximum transfer size
based on the iounit or msize.

Also remove the redundant check for limiting the transfer size in
v9fs_file_write. This check is done by p9_client_write.

Signed-off-by: M. Mohan Kumar <mohan@in.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-08-02 14:25:06 -05:00
Sage Weil
73a7e693f9 ceph: fix decoding of pool snap info
The pool info contains a vector for snap_info_t, not snap ids.  This fixes
the broken decoding, which would declare teh update corrupt when a pool
snapshot was created.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-02 11:10:07 -07:00
Alex Elder
6b0a2996a0 Merge branch 'v2.6.35' 2010-08-02 10:24:57 -05:00
Justin P. Mattock
581b7e9fc0 udf: super.c Fix warning: variable 'sbi' set but not used
This fixes this warning when building the kernel:
  CC      fs/udf/super.o
fs/udf/super.c: In function 'udf_load_sequence':
fs/udf/super.c:1582:22: warning: variable 'sbi' set but not used
Please have a look, when you have time and let me know.

Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-08-02 14:57:40 +02:00
Huang Weiyi
de67445f0e udf: remove duplicated #include
Remove duplicated #include('s) in
  fs/udf/file.c

Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-08-02 14:57:39 +02:00
Theodore Ts'o
a51dca9cd3 jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop
By using an atomic_t for t_updates and t_outstanding credits, this
should allow us to not need to take transaction t_handle_lock in
jbd2_journal_stop().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-02 08:43:25 -04:00
Jeff Layton
cb76d5e250 cifs: fsc should not default to "on"
I'm not sure why this was merged with this flag hardcoded on, but it
seems quite dangerous. Turn it off.

Also, mount.cifs hands unrecognized options off to the kernel so there
should be no need for changes there in order to support this.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:41 +00:00
Steve French
f67909cf80 [CIFS] remove redundant path walking in dfs_do_refmount
Reviewed-by: Dave Howells <dhowells@redhat.com>
Signed-off-by: Igor Mammedov <niallain@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:40 +00:00
Jeff Layton
f636a34802 cifs: ignore the "mand", "nomand" and "_netdev" mount options
These are all handled by the userspace mount programs, but older versions
of mount.cifs also handed them off to the kernel. Ignore them.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:40 +00:00
Jeff Layton
3572d2857f cifs: map NT_STATUS_ERROR_WRITE_PROTECTED to -EROFS
Seems like a more sensible mapping than -EIO.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:40 +00:00
Jeff Layton
f30b9c1184 cifs: don't allow cifs_iget to match inodes of the wrong type
If the type is different from what we think it should be, then don't
match the existing inode.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:39 +00:00
Steve French
9f841593ff [CIFS] relinquish fscache cookie before freeing CIFSTconInfo
Doh, fix a use after free bug.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Reviewed-and-Tested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:39 +00:00
Jeff Layton
3e4b3e1f68 cifs: add separate cred_uid field to sesInfo
Right now, there's no clear separation between the uid that owns the
credentials used to do the mount and the overriding owner of the files
on that mount.

Add a separate cred_uid field that is set to the real uid
of the mount user. Unlike the linux_uid, the uid= option does not
override this parameter. The parm is sent to cifs.upcall, which can then
preferentially use the creduid= parm instead of the uid= parm for
finding credentials.

This is not the only way to solve this. We could try to do all of this
in kernel instead by having a module parameter that affects what gets
passed in the uid= field of the upcall. That said, we have a lot more
flexibility to change things in userspace so I think it probably makes
sense to do it this way.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:39 +00:00
Kulikov Vasiliy
f55fdcca6b fs: cifs: check kmalloc() result
If kmalloc() fails exit with -ENOMEM.

Signed-off-by: Kulikov Vasiliy <segooon@gmail.com>
Acked-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:39 +00:00
Steve French
0ccd48025f [CIFS] Missing ifdef
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:38 +00:00
Steve French
d0e6f44e6c [CIFS] Missing line from previous commit
CC: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:38 +00:00
Steve French
c5e04a3e49 [CIFS] Fix build break when CONFIG_CIFS_FSCACHE disabled
CC: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:38 +00:00
Suresh Jayaraman
fa1df75d4d cifs: add mount option to enable local caching
Add a mount option 'fsc' to enable local caching on CIFS.

I considered adding a separate debug bit for caching, but it appears that
debugging would be relatively easier with the normal CIFS_INFO level.

As the cifs-utils (userspace) changes are not done yet, this patch enables
'fsc' by default to enable testing.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:37 +00:00
Suresh Jayaraman
56698236e1 cifs: read pages from FS-Cache
Read pages from a FS-Cache data storage object into a CIFS inode.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:37 +00:00
Suresh Jayaraman
9dc06558c2 cifs: store pages into local cache
Store pages from an CIFS inode into the data storage object associated with
that inode.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:37 +00:00
Suresh Jayaraman
85f2d6b44d cifs: FS-Cache page management
Takes care of invalidation and release of FS-Cache marked pages and also
invalidation of the FsCache page flag when the inode is removed.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:36 +00:00
Suresh Jayaraman
9451a9a52f cifs: define inode-level cache object and register them
Define inode-level data storage objects (managed by cifsInodeInfo structs).
Each inode-level object is created in a super-block level object and is itself
a data storage object in to which pages from the inode are stored.

The inode object is keyed by UniqueId. The coherency data being used is
LastWriteTime, LastChangeTime and end of file reported by the server.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:36 +00:00
Suresh Jayaraman
d03382ce9a cifs: define superblock-level cache index objects and register them
Define superblock-level cache index objects (managed by cifsTconInfo structs).
Each superblock object is created in a server-level index object and in itself
an index into which inode-level objects are inserted.

The superblock object is keyed by sharename. The UniqueId/IndexNumber is used to
validate that the exported share is the same since we accessed it last time.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:36 +00:00
Jeff Layton
8913007e67 cifs: remove unused cifsUidInfo struct
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:35 +00:00
Jeff Layton
4ff67b720c cifs: clean up cifs_find_smb_ses (try #2)
This patch replaces the earlier patch by the same name. The only
difference is that MAX_PASSWORD_SIZE has been increased to attempt to
match the limits that windows enforces.

Do a better job of matching sessions by authtype. Matching by username
for a Kerberos session is incorrect, and anonymous sessions need special
handling.

Also, in the case where we do match by username, we also need to match
by password. That ensures that someone else doesn't "borrow" an existing
session without needing to know the password.

Finally, passwords can be longer than 16 bytes. Bump MAX_PASSWORD_SIZE
to 512 to match the size that the userspace mount helper allows.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:35 +00:00
Jeff Layton
daf5b0b6f3 cifs: match secType when searching for existing tcp session
The secType is a per-tcp session entity, but the current routine doesn't
verify that it is acceptible when attempting to match an existing TCP
session.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:35 +00:00
Jeff Layton
4515148ef7 cifs: move address comparison into separate function
Move the address comparator out of cifs_find_tcp_session and into a
separate function for cleanliness. Also change the argument to
that function to a "struct sockaddr" pointer. Passing pointers to
sockaddr_storage is a little odd since that struct is generally for
declaring static storage.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:34 +00:00
Jeff Layton
50d971602a cifs: set the port in sockaddr in a more clearly defined fashion
This patch should replace the patch I sent a couple of weeks ago to
set the port in cifs_convert_address.

Currently we set this in cifs_find_tcp_session, but that's more of a
side effect than anything. Add a new function called cifs_fill_sockaddr.
Have it call cifs_convert_address and then set the port.

This also allows us to skip passing in the port as a separate parm to
cifs_find_tcp_session.

Also, change cifs_convert_address take a struct sockaddr * rather than
void * to make it clearer how this function should be called.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:34 +00:00
Suresh Jayaraman
488f1d2d6c cifs: define server-level cache index objects and register them
Define server-level cache index objects (as managed by TCP_ServerInfo structs)
and register then with FS-Cache. Each server object is created in the CIFS
top-level index object and is itself an index into which superblock-level
objects are inserted.

The server objects are now keyed by {IPaddress,family,port} tuple.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:34 +00:00
Suresh Jayaraman
f579cf3cfd cifs: register CIFS for caching
Define CIFS for FS-Cache and register for caching. Upon registration the
top-level index object cookie will be stuck to the netfs definition by
FS-Cache.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:33 +00:00
Joe Perches
c21dfb699f fs/cifs: Remove unnecessary casts of private_data
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:33 +00:00
Suresh Jayaraman
3feb41cff8 cifs: add kernel config option for CIFS Client caching support
Add a kernel config option to enable local caching for CIFS.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:33 +00:00
Suresh Jayaraman
c6332e237f cifs: remove unused ip_address field in struct TCP_Server_Info
The ip_address field is not used and seems redundant as there is union addr
already and I don't see any future use as well.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:33 +00:00
Suresh Jayaraman
e4317ceca2 cifs: remove an potentially confusing, obsolete comment
The recent commit 6ca9f3bae8 modified the code so
that filp is full instantiated whenever the file is created and passed back.
The below comment is no longer true, remove it.

Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:32 +00:00
Suresh Jayaraman
abd2e44dca cifs: guard cifsglob.h against multiple inclusion
Add conditional compile macros to guard the header file against multiple
inclusion.

Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-08-02 12:40:32 +00:00
Steven Whitehouse
0809f6ec18 GFS2: Fix recovery stuck bug (try #2)
This is a clean up of the code which deals with LM_FLAG_NOEXP
which aims to remove any possible race conditions by using
gl_spin to cover the gap between testing for the LM_FLAG_NOEXP
and the GL_FROZEN flag.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-08-02 10:15:17 +01:00
Ingo Molnar
3772b73472 Merge commit 'v2.6.35' into perf/core
Conflicts:
	tools/perf/Makefile
	tools/perf/util/hist.c

Merge reason: Resolve the conflicts and update to latest upstream.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-08-02 08:31:54 +02:00
Eric Paris
d09ca73979 security: make LSMs explicitly mask off permissions
SELinux needs to pass the MAY_ACCESS flag so it can handle auditting
correctly.  Presently the masking of MAY_* flags is done in the VFS.  In
order to allow LSMs to decide what flags they care about and what flags
they don't just pass them all and the each LSM mask off what they don't
need.  This patch should contain no functional changes to either the VFS or
any LSM.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by:  Stephen D. Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
2010-08-02 15:35:07 +10:00
Eric Paris
9cfcac810e vfs: re-introduce MAY_CHDIR
Currently MAY_ACCESS means that filesystems must check the permissions
right then and not rely on cached results or the results of future
operations on the object.  This can be because of a call to sys_access() or
because of a call to chdir() which needs to check search without relying on
any future operations inside that dir.  I plan to use MAY_ACCESS for other
purposes in the security system, so I split the MAY_ACCESS and the
MAY_CHDIR cases.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by:  Stephen D. Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
2010-08-02 15:35:06 +10:00
Tetsuo Handa
ea0d3ab239 LSM: Remove unused arguments from security_path_truncate().
When commit be6d3e56a6 "introduce new LSM hooks
where vfsmount is available." was proposed, regarding security_path_truncate(),
only "struct file *" argument (which AppArmor wanted to use) was removed.
But length and time_attrs arguments are not used by TOMOYO nor AppArmor.
Thus, let's remove these arguments.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: James Morris <jmorris@namei.org>
2010-08-02 15:33:40 +10:00
Theodore Ts'o
8b67f04ab9 ext4: Add mount options in superblock
Allow mount options to be stored in the superblock.  Also add default
mount option bits for nobarrier, block_validity, discard, and nodelalloc.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-01 23:14:20 -04:00
Sage Weil
2d9c98ae97 ceph: make ->sync_fs not wait if wait==0
The ->sync_fs() super op only needs to wait if wait is true.  Otherwise,
just get some dirty cap writeback started.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:42 -07:00
Sage Weil
b8cd07e78e ceph: warn on missing snap realm
Well, this Shouldn't Happen, so it would be helpful to know the caller when
it does.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:42 -07:00
Sage Weil
effcb9ed43 ceph: print useful error message when crush rule not found
Include the crush_ruleset in the error message.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:42 -07:00
Sage Weil
a8b763a9b3 ceph: use %pU to print uuid (fsid)
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:42 -07:00
Sage Weil
f0b18d9f22 ceph: sync header defs with server code
Define ROLLBACK op, IFLOCK inode lock (for advisory file locking).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:42 -07:00
Sage Weil
5cd068c200 ceph: clean up header guards
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:42 -07:00
Sage Weil
9688f19a18 ceph: strip misleading/obsolete version, feature info
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Sage Weil
6a2593823a ceph: specify supported features in super.h
Specify the supported/required feature bits in super.h client code instead
of using the definitions from the shared kernel/userspace headers (which
will go away shortly).

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Sage Weil
c309f0ab26 ceph: clean up fsid mount option
Specify the fsid mount option in hex, not via the major/minor u64 hackery we had
before.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Sage Weil
e0f9f9ee8f ceph: remove unused 'monport' mount option
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Greg Farnum
e55b71f802 ceph: handle ESTALE properly; on receipt send to authority if it wasn't
Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Greg Farnum
2bc50259fa ceph: add ceph_get_cap_for_mds function.
Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Sage Weil
154f42c2c3 ceph: connect to export targets on cap export
When we get a cap EXPORT message, make sure we are connected to all export
targets to ensure we can handle the matching IMPORT.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:41 -07:00
Sage Weil
cb170a2215 ceph: connect to export targets if mds is laggy
If an MDS we are talking to may have failed, we need to open sessions to
its potential export targets to ensure that any in-progress migration that
may have involved some of our caps is properly handled.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:40 -07:00
Sage Weil
ed0552a1a2 ceph: introduce helper to connect to mds export targets
There are a few cases where we need to open sessions with a given mds's
potential export targets.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:40 -07:00
Sage Weil
796d6955a5 ceph: only set num_pages in calc_layout
Setting it elsewhere is unnecessary and more fragile.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:40 -07:00
Yehuda Sadeh
37151668ba ceph: do caps accounting per mds_client
Caps related accounting is now being done per mds client instead
of just being global. This prepares ground work for a later revision
of the caps preallocated reservation list.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:40 -07:00
Sage Weil
0deb01c999 ceph: track laggy state of mds from mdsmap
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:40 -07:00
Yehuda Sadeh
cd84db6e40 ceph: code cleanup
Mainly fixing minor issues reported by sparse.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:40 -07:00
Sage Weil
ca81f3f6bd ceph: skip if no auth cap in flush_snaps
If we have a capsnap but no auth cap (e.g. because it is migrating to
another mds), bail out and do nothing for now.  Do NOT remove the capsnap
from the flush list.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil
3b454c4945 ceph: simplify caps revocation, fix for multimds
The caps revocation should either initiate writeback, invalidateion, or
call check_caps to ack or do the dirty work.  The primary question is
whether we can get away with only checking the auth cap or whether all
caps need to be checked.

The old code was doing...something else.  At the very least, revocations
from non-auth MDSs could break by triggering the "check auth cap only"
case.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil
38e8883ee3 ceph: simplify add_cap_releases
No functional change, aside from more useful debug output.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil
ee6b272b9c ceph: drop unused argument
Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil
2962507ca2 ceph: perform lazy reads when file mode and caps permit
If the file mode is marked as "lazy," perform cached/buffered reads when
the caps permit it.  Adjust the rdcache_gen and invalidation logic
accordingly so that we manage our cache based on the FILE_CACHE -or-
FILE_LAZYIO cap bits.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil
33caad324b ceph: perform lazy writes when file mode and caps permit
If we have marked a file as "lazy" (using the ceph ioctl), perform buffered
writes when the MDS caps allow it.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil
8c6e9229fc ceph: add LAZYIO ioctl to mark a file description for lazy consistency
Allow an application to mark a file descriptor for lazy file consistency
semantics, allowing buffered reads and writes when multiple clients are
accessing the same file.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:39 -07:00
Sage Weil
84d9509234 ceph: request FILE_LAZYIO cap when LAZY file mode is set
Also clean up the file flags -> file mode -> wanted caps functions while
we're at it.  This resyncs this file with userspace.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-08-01 20:11:38 -07:00
Trond Myklebust
77a63f3d1e NFS: Fix a typo in include/linux/nfs_fs.h
nfs_commit_inode() needs to be defined irrespectively of whether or not
we are supporting NFSv3 and NFSv4.

Allow the compiler to optimise away code in the NFSv2-only case by
converting it into an inlined stub function.

Reported-and-tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-01 15:10:01 -07:00
Dmitry Monakhov
ca0e05e4b1 ext4: force block allocation on quota_off
Perform full sync procedure so that any delayed allocation blocks are
allocated so quota will be consistent.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-01 17:48:36 -04:00
Eric Sandeen
437f88cc03 ext4: fix freeze deadlock under IO
Commit 6b0310fbf0 caused a regression resulting in deadlocks
when freezing a filesystem which had active IO; the vfs_check_frozen
level (SB_FREEZE_WRITE) did not let the freeze-related IO syncing
through.  Duh.

Changing the test to FREEZE_TRANS should let the normal freeze
syncing get through the fs, but still block any transactions from
starting once the fs is completely frozen.

I tested this by running fsstress in the background while periodically
snapshotting the fs and running fsck on the result.  I ran into
occasional deadlocks, but different ones.  I think this is a
fine fix for the problem at hand, and the other deadlocky things
will need more investigation.

Reported-by: Phillip Susi <psusi@cfl.rr.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-08-01 17:33:29 -04:00
Linus Torvalds
fc71ff8a6c Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFS: Ensure that writepage respects the nonblock flag
  NFS: kswapd must not block in nfs_release_page
  nfs: include space for the NUL in root path
2010-07-30 19:02:21 -07:00
David Howells
51c20fcced CIFS: Remove __exit mark from cifs_exit_dns_resolver()
Remove the __exit mark from cifs_exit_dns_resolver() as it's called by the
module init routine in case of error, and so may have been discarded during
linkage.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-30 18:56:09 -07:00
Trond Myklebust
cfb506e1d3 NFS: Ensure that writepage respects the nonblock flag
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-07-30 15:38:56 -04:00
Trond Myklebust
b608b283a9 NFS: kswapd must not block in nfs_release_page
See https://bugzilla.kernel.org/show_bug.cgi?id=16056

If other processes are blocked waiting for kswapd to free up some memory so
that they can make progress, then we cannot allow kswapd to block on those
processes.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@kernel.org
2010-07-30 15:38:42 -04:00
Dan Carpenter
674b222292 nfs: include space for the NUL in root path
In root_nfs_name() it does the following:

        if (strlen(buf) + strlen(cp) > NFS_MAXPATHLEN) {
                printk(KERN_ERR "Root-NFS: Pathname for remote directory too long.\n");
                return -1;
        }
        sprintf(nfs_export_path, buf, cp);

In the original code if (strlen(buf) + strlen(cp) == NFS_MAXPATHLEN)
then the sprintf() would lead to an overflow.  Generally the rest of the
code assumes that the path can have NFS_MAXPATHLEN (1024) characters and
a NUL terminator so the fix is to add space to the nfs_export_path[]
buffer.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-07-30 15:33:39 -04:00
Trond Myklebust
77041ed9b4 NFSv4: Ensure the lockowners are labelled using the fl_owner and/or fl_pid
flock locks want to be labelled using the process pid, while posix locks
want to be labelled using the fl_owner.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-07-30 14:46:10 -04:00
Trond Myklebust
d3c7b7ccc1 NFSv4: Add support for the RELEASE_LOCKOWNER operation
This is needed by NFSv4.0 servers in order to keep the number of locking
stateids at a manageable level.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-07-30 14:46:10 -04:00
Trond Myklebust
daccbded7f NFSv4: Clean up for lockowner XDR encoding
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-07-30 14:46:09 -04:00
Trond Myklebust
f11ac8db5d NFSv4: Ensure that we track the NFSv4 lock state in read/write requests.
This patch fixes bugzilla entry 14501:
  https://bugzilla.kernel.org/show_bug.cgi?id=14501

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-07-30 14:41:56 -04:00
J. Bruce Fields
039a87ca53 nfsd: minor nfsd read api cleanup
Christoph points that the NFSv2/v3 callers know which case they want
here, so we may as well just call the file=NULL case directly instead of
making this conditional.

Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-30 12:54:54 -04:00
Abhijith Das
c639d5d8f6 GFS2: Fix typo in stuffed file data copy handling
trunc_start() in bmap.c incorrectly uses sizeof(struct gfs2_inode) instead of
sizeof(struct gfs2_dinode).

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-07-30 16:34:06 +01:00
Andi Kleen
6904996101 gcc-4.6: nfsd: fix initialized but not read warnings
Fixes at least one real minor bug: the nfs4 recovery dir sysctl
would not return its status properly.

Also I finished Al's 1e41568d73 ("Take ima_path_check() in nfsd
past dentry_open() in nfsd_open()") commit, it moved the IMA
code, but left the old path initializer in there.

The rest is just dead code removed I think, although I was not
fully sure about the "is_borc" stuff. Some more review
would be still good.

Found by gcc 4.6's new warnings.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-29 19:32:17 -04:00
J. Bruce Fields
f9d7562fdb nfsd4: share file descriptors between stateid's
The vfs doesn't really allow us to "upgrade" a file descriptor from
read-only to read-write, and our attempt to do so in nfs4_upgrade_open
is ugly and incomplete.

Move to a different scheme where we keep multiple opens, shared between
open stateid's, in the nfs4_file struct.  Each file will be opened at
most 3 times (for read, write, and read-write), and those opens will be
shared between all clients and openers.  On upgrade we will do another
open if necessary instead of attempting to upgrade an existing open.
We keep count of the number of readers and writers so we know when to
close the shared files.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-07-29 18:19:23 -04:00
David Howells
de09a9771a CRED: Fix get_task_cred() and task_state() to not resurrect dead credentials
It's possible for get_task_cred() as it currently stands to 'corrupt' a set of
credentials by incrementing their usage count after their replacement by the
task being accessed.

What happens is that get_task_cred() can race with commit_creds():

	TASK_1			TASK_2			RCU_CLEANER
	-->get_task_cred(TASK_2)
	rcu_read_lock()
	__cred = __task_cred(TASK_2)
				-->commit_creds()
				old_cred = TASK_2->real_cred
				TASK_2->real_cred = ...
				put_cred(old_cred)
				  call_rcu(old_cred)
		[__cred->usage == 0]
	get_cred(__cred)
		[__cred->usage == 1]
	rcu_read_unlock()
							-->put_cred_rcu()
							[__cred->usage == 1]
							panic()

However, since a tasks credentials are generally not changed very often, we can
reasonably make use of a loop involving reading the creds pointer and using
atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero.

If successful, we can safely return the credentials in the knowledge that, even
if the task we're accessing has released them, they haven't gone to the RCU
cleanup code.

We then change task_state() in procfs to use get_task_cred() rather than
calling get_cred() on the result of __task_cred(), as that suffers from the
same problem.

Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be
tripped when it is noticed that the usage count is not zero as it ought to be,
for example:

kernel BUG at kernel/cred.c:168!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/mm/ksm/run
CPU 0
Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex
745
RIP: 0010:[<ffffffff81069881>]  [<ffffffff81069881>] __put_cred+0xc/0x45
RSP: 0018:ffff88019e7e9eb8  EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff
RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0
RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0
R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001
FS:  00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0)
Stack:
 ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45
<0> ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000
<0> ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246
Call Trace:
 [<ffffffff810698cd>] put_cred+0x13/0x15
 [<ffffffff81069b45>] commit_creds+0x16b/0x175
 [<ffffffff8106aace>] set_current_groups+0x47/0x4e
 [<ffffffff8106ac89>] sys_setgroups+0xf6/0x105
 [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b
Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00
48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 <0f> 0b eb fe 65 48 8b
04 25 00 cc 00 00 48 3b b8 58 04 00 00 75
RIP  [<ffffffff81069881>] __put_cred+0xc/0x45
 RSP <ffff88019e7e9eb8>
---[ end trace df391256a100ebdd ]---

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-29 15:16:17 -07:00
J. Bruce Fields
0292191417 nfsd4: fix openmode checking on IO using lock stateid
It is legal to perform a write using the lock stateid that was
originally associated with a read lock, or with a file that was
originally opened for read, but has since been upgraded.

So, when checking the openmode, check the mode associated with the
open stateid from which the lock was derived.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-29 16:37:12 -04:00
J. Bruce Fields
21fb4016bd nfsd4: miscellaneous process_open2 cleanup
Move more work into helper functions.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-29 16:34:29 -04:00
J. Bruce Fields
c3e4808086 nfsd4: don't pretend to support write delegations
The delegation code mostly pretends to support either read or write
delegations.  However, correct support for write delegations would
require, for example, breaking of delegations (and/or implementation of
cb_getattr) on stat.  Currently all that stops us from handing out
delegations is a subtle reference-counting issue.

Avoid confusion by adding an earlier check that explicitly refuses write
delegations.

For now, though, I'm not going so far as to rip out existing
half-support for write delegations, in case we get around to using that
soon.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-29 16:05:51 -04:00
Theodore Ts'o
4538821993 ext4: drop inode from orphan list if ext4_delete_inode() fails
There were some error paths in ext4_delete_inode() which was not
dropping the inode from the orphan list.  This could lead to a BUG_ON
on umount when the orphan list is discovered to be non-empty.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-29 15:06:10 -04:00
Steven Whitehouse
7cdee5dbf4 Revert "GFS2: recovery stuck on transaction lock"
This reverts commit b7dc2df572.

The initial patch didn't quite work since it doesn't cover all
the possible routes by which the GLF_FROZEN flag might be set.
A revised fix is coming up in the next patch.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-07-29 14:39:29 +01:00
Steven Whitehouse
d5341a9241 GFS2: Make "try" lock not try quite so hard
This looks like a big change, but in reality its only a single line of actual
code change, the rest is just moving a function to before its new caller.
The "try" flag for glocks is a rather subtle and delicate setting since it
requires that the state machine tries just hard enough to ensure that it has
a good chance of getting the requested lock, but no so hard that the
request can land up blocked behind another.

The patch adds in an additional check which will fail any queued try
locks if there is another request blocking the try lock request which
is not granted and compatible, nor in progress already. The check is made
only after all pending locks which may be granted have been granted.

I've checked this with the reproducer for the reported flock bug which
this is intended to fix, and it now passes.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-07-29 09:37:38 +01:00
David Rientjes
4244b52e18 GFS2: remove dependency on __GFP_NOFAIL
The k[mc]allocs in dr_split_leaf() and dir_double_exhash() are failable,
so remove __GFP_NOFAIL from their masks.

Cc: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-07-29 09:37:18 +01:00
Bob Peterson
461cb419f0 GFS2: Simplify gfs2_write_alloc_required
Function gfs2_write_alloc_required always returned zero as its
return code.  Therefore, it doesn't need to return a return code
at all.  Given that, we can use the return value to return whether
or not the dinode needs block allocations rather than passing
that value in, which in turn simplifies a bunch of error checking.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-07-29 09:36:56 +01:00
Steven Whitehouse
ba6e93645f GFS2: Wait for journal id on mount if not specified on mount command line
This patch implements a wait for the journal id in the case that it has
not been specified on the command line. This is to allow the future
removal of the mount.gfs2 helper. The journal id would instead be
directly communicated by gfs_controld to the file system. Here is a
comparison of the two systems:

Current:
1. mount calls mount.gfs2
2. mount.gfs2 connects to gfs_controld to retrieve the journal id
3. mount.gfs2 adds the journal id to the mount command line and calls
the mount system call
4. gfs_controld receives the status of the mount request via a uevent

Proposed:
1. mount calls the mount system call (no mount.gfs2 helper)
2. gfs_controld receives a uevent for a gfs2 fs which it doesn't know
about already
3. gfs_controld assigns a journal id to it via sysfs
4. the mount system call then completes as normal (sending a uevent
according to status)

The advantage of the proposed system is that it is completely backward
compatible with the current system both at the kernel and at the
userland levels. The "first" parameter can also be set the same way,
with the restriction that it must be set before the journal id is
assigned.

In addition, if mount becomes stuck waiting for a reply from
gfs_controld which never arrives, then it is killable and will abort the
mount gracefully.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-07-29 09:36:35 +01:00
Steven Whitehouse
30116ff6c6 GFS2: Use nobh_writepage
Use nobh_writepage rather than calling mpage_writepage directly.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
2010-07-29 09:36:14 +01:00
Andre Osterhues
a6f80fb7b5 ecryptfs: Bugfix for error related to ecryptfs_hash_buckets
The function ecryptfs_uid_hash wrongly assumes that the
second parameter to hash_long() is the number of hash
buckets instead of the number of hash bits.
This patch fixes that and renames the variable
ecryptfs_hash_buckets to ecryptfs_hash_bits to make it
clearer.

Fixes: CVE-2010-2492

Signed-off-by: Andre Osterhues <aosterhues@escrypt.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-28 19:59:24 -07:00
Linus Torvalds
6c50e1a49b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: use complete_all and wake_up_all
  ceph: Correct obvious typo of Kconfig variable "CRYPTO_AES"
  ceph: fix dentry lease release
  ceph: fix leak of dentry in ceph_init_dentry() error path
  ceph: fix pg_mapping leak on pg_temp updates
  ceph: fix d_release dop for snapdir, snapped dentries
  ceph: avoid dcache readdir for snapdir
2010-07-28 11:10:53 -07:00
Steven Whitehouse
d2a97a4e99 GFS2: Use kmalloc when possible for ->readdir()
If we don't need a huge amount of memory in ->readdir() then
we can use kmalloc rather than vmalloc to allocate it. This
should cut down on the greater overheads associated with
vmalloc for smaller directories.

We may be able to eliminate vmalloc entirely at some stage,
but this is easy to do right away.

Also using GFP_NOFS to avoid any issues wrt to deleting inodes
while under a glock, and suggestion from Linus to factor out
the alloc/dealloc.

I've given this a test with a variety of different sized
directories and it seems to work ok.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-07-28 11:10:03 -07:00
Eric Paris
1968f5eed5 fanotify: use both marks when possible
fanotify currently, when given a vfsmount_mark will look up (if it exists)
the corresponding inode mark.  This patch drops that lookup and uses the
mark provided.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:55 -04:00
Eric Paris
ce8f76fb73 fsnotify: pass both the vfsmount mark and inode mark
should_send_event() and handle_event() will both need to look up the inode
event if they get a vfsmount event.  Lets just pass both at the same time
since we have them both after walking the lists in lockstep.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:54 -04:00
Eric Paris
613a807fe7 fsnotify: walk the inode and vfsmount lists simultaneously
We currently walk the list of marks on an inode followed by the list of
marks on the vfsmount.  These are in order (by the memory address of the
group) so lets walk them both together.  Eventually we can pass both the
inode mark and the vfsmount mark to helpers simultaneously.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:54 -04:00
Eric Paris
84a5b68e8d fsnotify: rework ignored mark flushing
currently ignored_mark clearing is done in a seperate list traversal
before the actual list traversal to send events.  There is no need for
this.  Do them at the same time.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:54 -04:00
Eric Paris
02436668d9 fsnotify: remove global fsnotify groups lists
The global fsnotify groups lists were invented as a way to increase the
performance of fsnotify by shortcutting events which were not interesting.
With the changes to walk the object lists rather than global groups lists
these shortcuts are not useful.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:54 -04:00
Eric Paris
43709a288e fsnotify: remove group->mask
group->mask is now useless.  It was originally a shortcut for fsnotify to
save on performance.  These checks are now redundant, so we remove them.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:54 -04:00
Eric Paris
03930979af fsnotify: remove the global masks
Because we walk the object->fsnotify_marks list instead of the global
fsnotify groups list we don't need the fsnotify_inode_mask and
fsnotify_vfsmount_mask as these were simply shortcuts in fsnotify() for
performance.  They are now extra checks, rip them out.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:54 -04:00
Eric Paris
2612abb51b fsnotify: cleanup should_send_event
The change to use srcu and walk the object list rather than the global
fsnotify_group list means that should_send_event is no longer needed for a
number of groups and can be simplified for others.  Do that.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:53 -04:00
Eric Paris
0215054f37 fanotify: use the mark in handler functions
fanotify now gets a mark in the should_send_event and handle_event
functions.  Rather than look up the mark themselves fanotify should just use
the mark it was handed.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:53 -04:00
Eric Paris
c496313fcc dnotify: use the mark in handler functions
dnotify now gets a mark in the should_send_event and handle_event
functions.  Rather than look up the mark themselves dnotify should just use
the mark it was handed.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:53 -04:00
Eric Paris
7f6b6117e1 inotify: use the mark in handler functions
inotify now gets a mark in the should_send_event and handle_event
functions.  Rather than look up the mark themselves inotify should just use
the mark it was handed.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:52 -04:00
Eric Paris
3a9b16b407 fsnotify: send fsnotify_mark to groups in event handling functions
With the change of fsnotify to use srcu walking the marks list instead of
walking the global groups list we now know the mark in question.  The code can
send the mark to the group's handling functions and the groups won't have to
find those marks themselves.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:52 -04:00
Andreas Gruenbacher
8778abb9a8 fsnotify: Exchange list heads instead of moving elements
Instead of moving list elements from destroy_list to &private_destroy_list,
exchange the list heads.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:52 -04:00
Eric Paris
75c1be487a fsnotify: srcu to protect read side of inode and vfsmount locks
Currently reading the inode->i_fsnotify_marks or
vfsmount->mnt_fsnotify_marks lists are protected by a spinlock on both the
read and the write side.  This patch protects the read side of those lists
with a new single srcu.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:52 -04:00
Eric Paris
700307a29a fsnotify: use an explicit flag to indicate fsnotify_destroy_mark has been called
Currently fsnotify check is mark->group is NULL to decide if
fsnotify_destroy_mark() has already been called or not.  With the upcoming
rcu work it is a heck of a lot easier to use an explicit flag than worry
about group being set to NULL.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:52 -04:00
Eric Paris
a4c6e9961f fsnotify: use _rcu functions for mark list traversal
In preparation for srcu locking use all _rcu appropiete functions for mark
list addition, removal, and traversal.  The operations are still done under a
spinlock at the end of this patch.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:51 -04:00
Eric Paris
0c6532e4e3 fsnotify: place marks on object in order of group memory address
fsnotify_marks currently are placed on objects (inodes or vfsmounts) in
arbitrary order.  This patch places them in order of the group memory address.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:51 -04:00
Eric Paris
c1e5c95402 vfs/fsnotify: fsnotify_close can delay the final work in fput
fanotify almost works like so:

user context calls fsnotify_* function with a struct file.
   fsnotify takes a reference on the struct path
user context goes about it's buissiness

at some later point in time the fsnotify listener gets the struct path
   fanotify listener calls dentry_open() to create a file which userspace can deal with
      listener drops the reference on the struct path
at some later point the listener calls close() on it's new file

With the switch from struct path to struct file this presents a problem for
fput() and fsnotify_close().  fsnotify_close() is called when the filp has
already reached 0 and __fput() wants to do it's cleanup.

The solution presented here is a bit odd.  If an event is created from a
struct file we take a reference on the file.  We check however if the f_count
was already 0 and if so we take an EXTRA reference EVEN THOUGH IT WAS ZERO.
In __fput() (where we know the f_count hit 0 once) we check if the f_count is
non-zero and if so we drop that 'extra' ref and return without destroying the
file.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:51 -04:00
Eric Paris
3bcf3860a4 fsnotify: store struct file not struct path
Al explains that calling dentry_open() with a mnt/dentry pair is only
garunteed to be safe if they are already used in an open struct file.  To
make sure this is the case don't store and use a struct path in fsnotify,
always use a struct file.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:51 -04:00
Eric Paris
f70ab54cc6 fsnotify: fsnotify_add_notify_event should return an event
Rather than the horrific void ** argument and such just to pass the
fanotify_merge event back to the caller of fsnotify_add_notify_event() have
those things return an event if it was different than the event suggusted to
be added.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:50 -04:00
Eric Paris
5ba08e2eeb fsnotify: add pr_debug throughout
It can be hard to debug fsnotify since there are so few printks.  Use
pr_debug to allow for dynamic debugging.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:50 -04:00
Eric Paris
80af258867 fanotify: groups can specify their f_flags for new fd
Currently fanotify fds opened for thier listeners are done with f_flags
equal to O_RDONLY | O_LARGEFILE.  This patch instead takes f_flags from the
fanotify_init syscall and uses those when opening files in the context of
the listener.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:50 -04:00
Eric Paris
20dee624ca fsnotify: check to make sure all fsnotify bits are unique
This patch adds a check to make sure that all fsnotify bits are unique and we
cannot accidentally use the same bit for 2 different fsnotify event types.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:50 -04:00
Jerome Marchand
44b350fc23 inotify: Fix mask checks
The mask checks in inotify_update_existing_watch() and
inotify_new_watch() are useless because inotify_arg_to_mask() sets
FS_IN_IGNORED and FS_EVENT_ON_CHILD bits anyway.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:49 -04:00
Eric Paris
f874e1ac21 inotify: force inotify and fsnotify use same bits
inotify uses bits called IN_* and fsnotify uses bits called FS_*.  These
need to line up.  This patch adds build time checks to make sure noone can
change these bits so they are not the same.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:49 -04:00
Eric Paris
8c1934c8d7 inotify: allow users to request not to recieve events on unlinked children
An inotify watch on a directory will send events for children even if those
children have been unlinked.  This patch add a new inotify flag IN_EXCL_UNLINK
which allows a watch to specificy they don't care about unlinked children.
This should fix performance problems seen by tasks which add a watch to
/tmp and then are overrun with events when other processes are reading and
writing to unlinked files they created in /tmp.

https://bugzilla.kernel.org/show_bug.cgi?id=16296

Requested-by: Matthias Clasen <mclasen@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:49 -04:00
Eric Paris
611da04f7a inotify: send IN_UNMOUNT events
Since the .31 or so notify rewrite inotify has not sent events about
inodes which are unmounted.  This patch restores those events.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:49 -04:00
Eric Paris
ff311008ab inotify: fix inotify oneshot support
During the large inotify rewrite to fsnotify I completely dropped support
for IN_ONESHOT.  Reimplement that support.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 10:18:48 -04:00
Tejun Heo
e4e047a220 fsnotify: update gfp/slab.h includes
Implicit slab.h inclusion via percpu.h is about to go away.  Make sure
gfp.h or slab.h is included as necessary.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:03 -04:00
Eric Paris
08ae89380a fanotify: drop the useless priority argument
The priority argument in fanotify is useless.  Kill it.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:03 -04:00
Eric Paris
269ed32a9c fanotify: default Kconfig to n
fanotify has default to y in linux-next since it's inception but default to
n in the final push to Linus.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:03 -04:00
H Hartley Sweeten
0a24887afa inotify_user.c: make local symbol static
The symbol inotify_max_user_watches is not used outside this
file and should be static.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Eric Paris <eparis@parisplace.org>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:02 -04:00
Eric Paris
b31d397e43 fsnotify: call iput on inodes when no longer marked
fsnotify takes an igrab on an inode when it adds a mark.  The code was
supposed to drop the reference when the mark was removed but didn't.
This caused problems when an fs was unmounted because those inodes would
clearly not be gone.  Thus resulting in the most devistating of messages:

VFS: Busy inodes after unmount of loop0. Self-destruct in 5 seconds.
>>> Have a nice day...

Jiri Slaby bisected the problem to a patch in the fsnotify tree.  The
code snippets below show my stupidity quite clearly.

void fsnotify_destroy_inode_mark(struct fsnotify_mark *mark)
{
	...
	mark->inode = NULL;
	...
}

void fsnotify_destroy_mark(struct fsnotify_mark *mark)
{
	struct inode *inode = NULL;
	...
	if (mark->flags & FSNOTIFY_MARK_FLAG_INODE) {
		fsnotify_destroy_inode_mark(mark);
		inode = mark->i.inode;
	}
	...
	if (inode)
		iput(inode);
	...
}

Obviously the intent was to capture the inode before it was set to NULL in
fsnotify_destory_inode_mark() so we wouldn't be leaking inodes forever.
Instead we leaked them (and exploded on umount)

Reported-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:02 -04:00
Jean-Christophe Dubois
98b5c10d32 fanotify: do not always return 0 in fsnotify
It seems to me you are always returning 0 in fsnotify, when you should return
the error (EPERM) returned by fanotify.

Signed-off-by: Jean-Christophe DUBOIS <jcd@tribudubois.net>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:02 -04:00
Eric Paris
8860f060e4 fanotify: do not return 0 in a void function
remove_access_response() is supposed to have a void return, but was
returning 0;

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:02 -04:00
Eric Paris
b2d879096a fanotify: userspace interface for permission responses
fanotify groups need to respond to events which include permissions types.
To do so groups will send a response using write() on the fanotify_fd they
have open.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:02 -04:00
Eric Paris
9e66e4233d fanotify: permissions and blocking
This is the backend work needed for fanotify to support the new
FS_OPEN_PERM and FS_ACCESS_PERM fsnotify events.  This is done using the
new fsnotify secondary queue.  No userspace interface is provided actually
respond to or request these events.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:02 -04:00
Eric Paris
c4ec54b40d fsnotify: new fsnotify hooks and events types for access decisions
introduce a new fsnotify hook, fsnotify_perm(), which is called from the
security code.  This hook is used to allow fsnotify groups to make access
control decisions about events on the system.  We also must change the
generic fsnotify function to return an error code if we intend these hooks
to be in any way useful.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:01 -04:00
Eric Paris
59b0df211b fsnotify: use unsigned char * for dentry->d_name.name
fsnotify was using char * when it passed around the d_name.name string
internally but it is actually an unsigned char *.  This patch switches
fsnotify to use unsigned and should silence some pointer signess warnings
which have popped out of xfs.  I do not add -Wpointer-sign to the fsnotify
code as there are still issues with kstrdup and strlen which would pop
out needless warnings.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:01 -04:00
Eric Paris
43ed7e16a8 fanotify: use merge argument to determine actual event added to queue
fanotify needs to know the actual event added to queues so it can be
correctly checked for return values from userspace.  To do this we need to
pass that information from the merger code back to the main even handling
routine.  Currently that information is unused, but it will be.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:01 -04:00
Eric Paris
6e5f77b32e fsnotify: intoduce a notification merge argument
Each group can define their own notification (and secondary_q) merge
function.  Inotify does tail drop, fanotify does matching and drop which
can actually allocate a completely new event.  But for fanotify to properly
deal with permissions events it needs to know the new event which was
ultimately added to the notification queue.  This patch just implements a
void ** argument which is passed to the merge function.  fanotify can use
this field to pass the new event back to higher layers.

Signed-off-by: Eric Paris <eparis@redhat.com>
for fanotify to properly deal with permissions events
2010-07-28 09:59:01 -04:00
Eric Paris
cb2d429faf fsnotify: add group priorities
This introduces an ordering to fsnotify groups.  With purely asynchronous
notification based "things" implementing fsnotify (inotify, dnotify) ordering
isn't particularly important.  But if people want to use fsnotify for the
basis of sycronous notification or blocking notification ordering becomes
important.

eg. A Hierarchical Storage Management listener would need to get its event
before an AV scanner could get its event (since the HSM would need to
bring the data in for the AV scanner to scan.)  Typically asynchronous notification
would want to run after the AV scanner made any relevant access decisions
so as to not send notification about an event that was denied.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:01 -04:00
Eric Paris
4d92604cc9 fanotify: clear all fanotify marks
fanotify listeners may want to clear all marks.  They may want to do this
to destroy all of their inode marks which have nothing but ignores.
Realistically this is useful for av vendors who update policy and want to
clear all of their cached allows.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:00 -04:00
Eric Paris
c9778a98e7 fanotify: allow ignored_masks to survive modify
Some users may want to truely ignore an inode even if it has been modified.
Say you are wanting a mount which contains a log file and you really don't
want any notification about that file.  This patch allows the listener to
do that.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:00 -04:00
Eric Paris
c908370fc1 fsnotify: allow ignored_mask to survive modification
Some inodes a group may want to never hear about a set of events even if
the inode is modified.  We add a new mark flag which indicates that these
marks should not have their ignored_mask cleared on modification.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:00 -04:00
Eric Paris
e898386146 fsnotify: clear ignored mask on modify
On inode modification we clear the ignored mask for all of the marks on the
inode.  This allows userspace to ignore accesses to inodes until there is
something different.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:00 -04:00
Eric Paris
b9e4e3bd04 fanotify: allow users to set an ignored_mask
Change the sys_fanotify_mark() system call so users can set ignored_masks
on inodes.  Remember, if a user new sets a real mask, and only sets ignored
masks, the ignore will never be pinned in memory.  Thus ignored_masks can
be lost under memory pressure and the user may again get events they
previously thought were ignored.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:00 -04:00
Eric Paris
32a4df13b8 fanotify: ignored_mask to ignore events
When fanotify receives an event it will check event->mask & ~ignored_mask.
If no bits are left the event will not be sent.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:00 -04:00
Eric Paris
33af5e32e0 fsnotify: ignored_mask - excluding notification
The ignored_mask is a new mask which is part of fsnotify marks.  A group's
should_send_event() function can use the ignored mask to determine that
certain events are not of interest.  In particular if a group registers a
mask including FS_OPEN on a vfsmount they could add FS_OPEN to the
ignored_mask for individual inodes and not send open events for those
inodes.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:59:00 -04:00
Eric Paris
90b1e7a578 fsnotify: allow marks to not pin inodes in core
inotify marks must pin inodes in core.  dnotify doesn't technically need to
since they are closed when the directory is closed.  fanotify also need to
pin inodes in core as it works today.  But the next step is to introduce
the concept of 'ignored masks' which is actually a mask of events for an
inode of no interest.  I claim that these should be liberally sent to the
kernel and should not pin the inode in core.  If the inode is brought back
in the listener will get an event it may have thought excluded, but this is
not a serious situation and one any listener should deal with.

This patch lays the ground work for non-pinning inode marks by using lazy
inode pinning.  We do not pin a mark until it has a non-zero mask entry.  If a
listener new sets a mask we never pin the inode.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:59 -04:00
Andreas Gruenbacher
33d3dfff45 fanotify: remove outgoing function checks in fanotify.h
A number of validity checks on outgoing data are done in static inlines but
are only used in one place.  Instead just do them where they are used for
readability.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:59 -04:00
Andreas Gruenbacher
88380fe66e fanotify: remove fanotify.h declarations
fanotify_mark_validate functions are all needlessly declared in headers as
static inlines.  Instead just do the checks where they are needed for code
readability.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:59 -04:00
Andreas Gruenbacher
f3640192c0 fanotify: split fanotify_remove_mark
split fanotify_remove_mark into fanotify_remove_inode_mark and
fanotify_remove_vfsmount_mark.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:59 -04:00
Andreas Gruenbacher
eac8e9e80c fanotify: rename FAN_MARK_ON_VFSMOUNT to FAN_MARK_MOUNT
the term 'vfsmount' isn't sensicle to userspace.  instead call is 'mount.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:59 -04:00
Eric Paris
0ff21db9fc fanotify: hooks the fanotify_mark syscall to the vfsmount code
Create a new fanotify_mark flag which indicates we should attach the mark
to the vfsmount holding the object referenced by dfd and pathname rather
than the inode itself.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:59 -04:00
Andreas Gruenbacher
90dd201d1a fanotify: remove fanotify_add_mark
fanotify_add_mark now does nothing useful anymore, drop it.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:58 -04:00
Andreas Gruenbacher
52202dfbd9 fanotify: do not return pointer from fanotify_add_*_mark
No need to return the mark from fanotify_add_*_mark to fanotify_add_mark

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:58 -04:00
Andreas Gruenbacher
912ee3946c fanotify: do not call fanotify_update_object_mask in fanotify_add_mark
Recalculate masks in fanotify_add_mark, don't use
fanotify_update_object_mask.  This gets us one step closers to readable
code.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:58 -04:00
Andreas Gruenbacher
088b09b0ac fanotify: do not call fanotify_update_object_mask in fanotify_remove_mark
Recalculate masks in fanotify_remove_mark, don't use
fanotify_update_object_mask.  This gets us one step closers to readable
code.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:58 -04:00
Andreas Gruenbacher
c6223f4649 fanotify: remove fanotify_update_mark
fanotify_update_mark() doesn't do much useful;  remove it.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:58 -04:00
Eric Paris
88826276dc fanotify: infrastructure to add an remove marks on vfsmounts
infrastructure work to add and remove marks on vfsmounts.  This should get
every set up except wiring the functions to the syscalls.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:57 -04:00
Eric Paris
1c529063a3 fanotify: should_send_event needs to handle vfsmounts
currently should_send_event in fanotify only cares about marks on inodes.
This patch extends that interface to indicate that it cares about events
that happened on vfsmounts.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:57 -04:00
Andreas Gruenbacher
ca9c726eea fsnotify: Infrastructure for per-mount watches
Per-mount watches allow groups to listen to fsnotify events on an entire
mount.  This patch simply adds and initializes the fields needed in the
vfsmount struct to make this happen.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:57 -04:00
Eric Paris
0d48b7f01f fsnotify: vfsmount marks generic functions
Much like inode-mark.c has all of the code dealing with marks on inodes
this patch adds a vfsmount-mark.c which has similar code but is intended
for marks on vfsmounts.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:57 -04:00
Andreas Gruenbacher
2504c5d63b fsnotify/vfsmount: add fsnotify fields to struct vfsmount
This patch adds the list and mask fields needed to support vfsmount marks.
These are the same fields fsnotify needs on an inode.  They are not used,
just declared and we note where the cleanup hook should be (the function is
not yet defined)

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:57 -04:00
Eric Paris
ba643f04cd fsnotify: clear marks to 0 in fsnotify_init_mark
Currently fsnotify_init_mark sets some fields to 0/NULL.  Some users
already used some sorts of zalloc, some didn't.  This patch uses memset to
explicitly zero everything in the fsnotify_mark when it is initialized so we
don't have to be careful if fields are later added to marks.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:57 -04:00
Eric Paris
5444e2981c fsnotify: split generic and inode specific mark code
currently all marking is done by functions in inode-mark.c.  Some of this
is pretty generic and should be instead done in a generic function and we
should only put the inode specific code in inode-mark.c

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:57 -04:00
Andreas Gruenbacher
32c3263221 fanotify: Add pids to events
Pass the process identifiers of the triggering processes to fanotify
listeners: this information is useful for event filtering and logging.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:56 -04:00
Andreas Gruenbacher
22aa425dec fanotify: create_fd cleanup
Code cleanup which does the fd creation work seperately from the userspace
metadata creation.  It fits better with the other code.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:56 -04:00
Heiko Carstens
9bbfc964b8 fanotify: CONFIG_HAVE_SYSCALL_WRAPPERS for sys_fanotify_mark
Please note that you need the patch below in addition, otherwise the
syscall wrapper stuff won't work on those 32 bit architectures which enable
the wrappers.

When enabled the syscall wrapper defines always take long parameters and then
cast them to whatever is needed. This approach doesn't work for the 32 bit
case where the original syscall takes a long long parameter, since we would
lose the upper 32 bits.
So syscalls with 64 bit arguments are special cases wrt to syscall wrappers
and enp up in the ugliness below (see also sys_fallocate). In addition these
special cased syscall wrappers have the drawback that ftrace syscall tracing
doesn't work on them, since they don't get defined by using the usual macros.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:56 -04:00
Paul Mundt
ef601a9cfd fanotify: select ANON_INODES.
fanotify references anon_inode_getfd(), which is only available with
ANON_INODES enabled. Presently this bails out with the following:

  LD      vmlinux
fs/built-in.o: In function `sys_fanotify_init':
(.text+0x26d1c): undefined reference to `anon_inode_getfd'
make: *** [vmlinux] Error 1

which is trivially corrected by adding an ANON_INODES select.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:56 -04:00
Eric Paris
a1014f1023 fanotify: send events using read
Send events to userspace by reading the file descriptor from fanotify_init().
One will get blocks of data which look like:

struct fanotify_event_metadata {
	__u32 event_len;
	__u32 vers;
	__s32 fd;
	__u64 mask;
	__s64 pid;
	__u64 cookie;
} __attribute__ ((packed));

Simple code to retrieve and deal with events is below

	while ((len = read(fan_fd, buf, sizeof(buf))) > 0) {
		struct fanotify_event_metadata *metadata;

		metadata = (void *)buf;
		while(FAN_EVENT_OK(metadata, len)) {
			[PROCESS HERE!!]
			if (metadata->fd >= 0 && close(metadata->fd) != 0)
				goto fail;
			metadata = FAN_EVENT_NEXT(metadata, len);
		}
	}

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:56 -04:00
Eric Paris
2a3edf8604 fanotify: fanotify_mark syscall implementation
NAME
	fanotify_mark - add, remove, or modify an fanotify mark on a
filesystem object

SYNOPSIS
	int fanotify_mark(int fanotify_fd, unsigned int flags, u64 mask,
			  int dfd, const char *pathname)

DESCRIPTION
	fanotify_mark() is used to add remove or modify a mark on a filesystem
	object.  Marks are used to indicate that the fanotify group is
	interested in events which occur on that object.  At this point in
	time marks may only be added to files and directories.

	fanotify_fd must be a file descriptor returned by fanotify_init()

	The flags field must contain exactly one of the following:

	FAN_MARK_ADD - or the bits in mask and ignored mask into the mark
	FAN_MARK_REMOVE - bitwise remove the bits in mask and ignored mark
		from the mark

	The following values can be OR'd into the flags field:

	FAN_MARK_DONT_FOLLOW - same meaning as O_NOFOLLOW as described in open(2)
	FAN_MARK_ONLYDIR - same meaning as O_DIRECTORY as described in open(2)

	dfd may be any of the following:
	AT_FDCWD: the object will be lookup up based on pathname similar
		to open(2)

	file descriptor of a directory: if pathname is not NULL the
		object to modify will be lookup up similar to openat(2)

	file descriptor of the final object: if pathname is NULL the
		object to modify will be the object referenced by dfd

	The mask is the bitwise OR of the set of events of interest such as:
	FAN_ACCESS		- object was accessed (read)
	FAN_MODIFY		- object was modified (write)
	FAN_CLOSE_WRITE		- object was writable and was closed
	FAN_CLOSE_NOWRITE	- object was read only and was closed
	FAN_OPEN		- object was opened
	FAN_EVENT_ON_CHILD	- interested in objected that happen to
				  children.  Only relavent when the object
				  is a directory
	FAN_Q_OVERFLOW		- event queue overflowed (not implemented)

RETURN VALUE
	On success, this system call returns 0. On error, -1 is
	returned, and errno is set to indicate the error.

ERRORS
	EINVAL An invalid value was specified in flags.

	EINVAL An invalid value was specified in mask.

	EINVAL An invalid value was specified in ignored_mask.

	EINVAL fanotify_fd is not a file descriptor as returned by
	fanotify_init()

	EBADF fanotify_fd is not a valid file descriptor

	EBADF dfd is not a valid file descriptor and path is NULL.

	ENOTDIR dfd is not a directory and path is not NULL

	EACCESS no search permissions on some part of the path

	ENENT file not found

	ENOMEM Insufficient kernel memory is available.

CONFORMING TO
	These system calls are Linux-specific.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:56 -04:00
Eric Paris
bbaa4168b2 fanotify: sys_fanotify_mark declartion
This patch simply declares the new sys_fanotify_mark syscall

int fanotify_mark(int fanotify_fd, unsigned int flags, u64_mask,
		  int dfd const char *pathname)

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:55 -04:00
Eric Paris
52c923dd07 fanotify: fanotify_init syscall implementation
NAME
	fanotify_init - initialize an fanotify group

SYNOPSIS
	int fanotify_init(unsigned int flags, unsigned int event_f_flags, int priority);

DESCRIPTION
	fanotify_init() initializes a new fanotify instance and returns a file
	descriptor associated with the new fanotify event queue.

	The following values can be OR'd into the flags field:

	FAN_NONBLOCK Set the O_NONBLOCK file status flag on the new open file description.
		Using this flag saves extra calls to fcntl(2) to achieve the same
		result.

	FAN_CLOEXEC Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor.
		See the description of the O_CLOEXEC flag in open(2) for reasons why
		this may be useful.

	The event_f_flags argument is unused and must be set to 0

	The priority argument is unused and must be set to 0

RETURN VALUE
	On success, this system call return a new file descriptor. On error, -1 is
	returned, and errno is set to indicate the error.

ERRORS
	EINVAL An invalid value was specified in flags.

	EINVAL A non-zero valid was passed in event_f_flags or in priority

	ENFILE The system limit on the total number of file descriptors has been reached.

	ENOMEM Insufficient kernel memory is available.

CONFORMING TO
	These system calls are Linux-specific.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:55 -04:00
Eric Paris
11637e4b7d fanotify: fanotify_init syscall declaration
This patch defines a new syscall fanotify_init() of the form:

int sys_fanotify_init(unsigned int flags, unsigned int event_f_flags,
		      unsigned int priority)

This syscall is used to create and fanotify group.  This is very similar to
the inotify_init() syscall.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:55 -04:00
Eric Paris
9dced01a09 fanotify: do not clone on merge unless needed
Currently if 2 events are going to be merged on the notication queue with
different masks the second event will be cloned and will replace the first
event.  However if this notification queue is the only place referencing
the event in question there is no reason not to just update the event in
place.  We can tell this if the event->refcnt == 1.  Since we hold a
reference for each queue this event is on we know that when refcnt == 1
this is the only queue.  The other concern is that it might be about to be
added to a new queue, but this can't be the case since fsnotify holds a
reference on the event until it is finished adding it to queues.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:55 -04:00
Eric Paris
a12a7dd328 fanotify: merge notification events with different masks
Instead of just merging fanotify events if they are exactly the same, merge
notification events with different masks.  To do this we have to clone the
old event, update the mask in the new event with the new merged mask, and
put the new event in place of the old event.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:55 -04:00
Eric Paris
767cd46c33 fanotify:drop notification if they exist in the outgoing queue
fanotify listeners get an open file descriptor to the object in question so
the ordering of operations is not as important as in other notification
systems.  inotify will drop events if the last event in the event FIFO is
the same as the current event.  This patch will drop fanotify events if
they are the same as another event anywhere in the event FIFO.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:54 -04:00
Eric Paris
ff0b16a985 fanotify: fscking all notification system
fanotify is a novel file notification system which bases notification on
giving userspace both an event type (open, close, read, write) and an open
file descriptor to the object in question.  This should address a number of
races and problems with other notification systems like inotify and dnotify
and should allow the future implementation of blocking or access controlled
notification.  These are useful for on access scanners or hierachical storage
management schemes.

This patch just implements the basics of the fsnotify functions.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:54 -04:00
Andreas Gruenbacher
3556608709 fsnotify: take inode->i_lock inside fsnotify_find_mark_entry()
All callers to fsnotify_find_mark_entry() except one take and
release inode->i_lock around the call.  Take the lock inside
fsnotify_find_mark_entry() instead.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:54 -04:00
Eric Paris
ef5e2b785f dnotify: rename mark_entry to mark
nomenclature change.  Used to call things 'entries' but now we just call
them 'marks.'  Do those changes for dnotify.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:54 -04:00
Eric Paris
000285deb9 inotify: rename mark_entry to just mark
rename anything in inotify that deals with mark_entry to just be mark.  It
makes a lot more sense.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:54 -04:00
Eric Paris
841bdc10f5 fsnotify: rename mark_entry to just mark
previously I used mark_entry when talking about marks on inodes.  The
_entry is pretty useless.  Just use "mark" instead.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:53 -04:00
Eric Paris
d07754412f fsnotify: rename fsnotify_find_mark_entry to fsnotify_find_mark
the _entry portion of fsnotify functions is useless.  Drop it.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:53 -04:00
Eric Paris
e61ce86737 fsnotify: rename fsnotify_mark_entry to just fsnotify_mark
The name is long and it serves no real purpose.  So rename
fsnotify_mark_entry to just fsnotify_mark.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:53 -04:00
Andreas Gruenbacher
72acc85442 fsnotify: kill FSNOTIFY_EVENT_FILE
Some fsnotify operations send a struct file.  This is more information than
we technically need.  We instead send a struct path in all cases instead of
sometimes a path and sometimes a file.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:53 -04:00
Eric Paris
098cf2fc77 fsnotify: add flags to fsnotify_mark_entries
To differentiate between inode and vfsmount (or other future) types of
marks we add a flags field and set the inode bit on inode marks (the only
currently supported type of mark)

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:52 -04:00
Eric Paris
2823e04de4 fsnotify: put inode specific fields in an fsnotify_mark in a union
The addition of marks on vfs mounts will be simplified if the inode
specific parts of a mark and the vfsmnt specific parts of a mark are
actually in a union so naming can be easy.  This patch just implements the
inode struct and the union.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:52 -04:00
Eric Paris
3a9fb89f4c fsnotify: include vfsmount in should_send_event when appropriate
To ensure that a group will not duplicate events when it receives it based
on the vfsmount and the inode should_send_event test we should distinguish
those two cases.  We pass a vfsmount to this function so groups can make
their own determinations.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:52 -04:00
Eric Paris
7131485a93 fsnotify: mount point listeners list and global mask
currently all of the notification systems implemented select which inodes
they care about and receive messages only about those inodes (or the
children of those inodes.)  This patch begins to flesh out fsnotify support
for the concept of listeners that want to hear notification for an inode
accessed below a given monut point.  This patch implements a second list
of fsnotify groups to hold these types of groups and a second global mask
to hold the events of interest for this type of group.

The reason we want a second group list and mask is because the inode based
notification should_send_event support which makes each group look for a mark
on the given inode.  With one nfsmount listener that means that every group would
have to take the inode->i_lock, look for their mark, not find one, and return
for every operation.   By seperating vfsmount from inode listeners only when
there is a inode listener will the inode groups have to look for their
mark and take the inode lock.  vfsmount listeners will have to grab the lock and
look for a mark but there should be fewer of them, and one vfsmount listener
won't cause the i_lock to be grabbed and released for every fsnotify group
on every io operation.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:52 -04:00
Eric Paris
4ca763523e fsnotify: add groups to fsnotify_inode_groups when registering inode watch
Currently all fsnotify groups are added immediately to the
fsnotify_inode_groups list upon creation.  This means, even groups with no
watches (common for audit) will be on the global tracking list and will
get checked for every event.  This patch adds groups to the global list on
when the first inode mark is added to the group.

Signed-of-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:51 -04:00
Eric Paris
36fddebaa8 fsnotify: initialize the group->num_marks in a better place
Currently the comments say that group->num_marks is held because the group
is on the fsnotify_group list.  This isn't strictly the case, we really
just hold the num_marks for the life of the group (any time group->refcnt
is != 0)  This patch moves the initialization stuff and makes it clear when
it is really being held.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:51 -04:00
Eric Paris
19c2a0e1a2 fsnotify: rename fsnotify_groups to fsnotify_inode_groups
Simple renaming patch.  fsnotify is about to support mount point listeners
so I am renaming fsnotify_groups and fsnotify_mask to indicate these are lists
used only for groups which have watches on inodes.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:51 -04:00
Eric Paris
0d2e2a1d00 fsnotify: drop mask argument from fsnotify_alloc_group
Nothing uses the mask argument to fsnotify_alloc_group.  This patch drops
that argument.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:51 -04:00
Eric Paris
ffab83402f fsnotify: fsnotify_obtain_group should be fsnotify_alloc_group
fsnotify_obtain_group was intended to be able to find an already existing
group.  Nothing uses that functionality.  This just renames it to
fsnotify_alloc_group so it is clear what it is doing.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:50 -04:00
Eric Paris
cd7752ce7c fsnotify: fsnotify_obtain_group kzalloc cleanup
fsnotify_obtain_group uses kzalloc but then proceedes to set things to 0.
This patch just deletes those useless lines.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:50 -04:00
Eric Paris
74be0cc828 fsnotify: remove group_num altogether
The original fsnotify interface has a group-num which was intended to be
able to find a group after it was added.  I no longer think this is a
necessary thing to do and so we remove the group_num.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:50 -04:00
Eric Paris
cac69dad32 fsnotify: lock annotation for event replacement
fsnotify_replace_event need to lock both the old and the new event.  This
causes lockdep to get all pissed off since it dosn't know this is safe.
It's safe in this case since the new event is impossible to be reached from
other places in the kernel.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:50 -04:00
Eric Paris
1201a5361b fsnotify: replace an event on a list
fanotify would like to clone events already on its notification list, make
changes to the new event, and then replace the old event on the list with
the new event.  This patch implements the replace functionality of that
process.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:49 -04:00
Eric Paris
b4e4e14073 fsnotify: clone existing events
fsnotify_clone_event will take an event, clone it, and return the cloned
event to the caller.  Since events may be in use by multiple fsnotify
groups simultaneously certain event entries (such as the mask) cannot be
changed after the event was created.  Since fanotify would like to merge
events happening on the same file it needs a new clean event to work with
so it can change any fields it wishes.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:49 -04:00
Eric Paris
74766bbfa9 fsnotify: per group notification queue merge types
inotify only wishes to merge a new event with the last event on the
notification fifo.  fanotify is willing to merge any events including by
means of bitwise OR masks of multiple events together.  This patch moves
the inotify event merging logic out of the generic fsnotify notification.c
and into the inotify code.  This allows each use of fsnotify to provide
their own merge functionality.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:49 -04:00
Eric Paris
28c60e37f8 fsnotify: send struct file when sending events to parents when possible
fanotify needs a path in order to open an fd to the object which changed.
Currently notifications to inode's parents are done using only the inode.
For some parental notification we have the entire file, send that so
fanotify can use it.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:48 -04:00
Eric Paris
2a12a9d781 fsnotify: pass a file instead of an inode to open, read, and write
fanotify, the upcoming notification system actually needs a struct path so it can
do opens in the context of listeners, and it needs a file so it can get f_flags
from the original process.  Close was the only operation that already was passing
a struct file to the notification hook.  This patch passes a file for access,
modify, and open as well as they are easily available to these hooks.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:32 -04:00
Eric Paris
8112e2d6a7 fsnotify: include data in should_send calls
fanotify is going to need to look at file->private_data to know if an event
should be sent or not.  This passes the data (which might be a file,
dentry, inode, or none) to the should_send function calls so fanotify can
get that information when available

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:31 -04:00
Eric Paris
7b0a04fbfb fsnotify: provide the data type to should_send_event
fanotify is only interested in event types which contain enough information
to open the original file in the context of the fanotify listener.  Since
fanotify may not want to send events if that data isn't present we pass
the data type to the should_send_event function call so fanotify can express
its lack of interest.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:31 -04:00
Eric Paris
d7f0ce4e43 inotify: do not spam console without limit
inotify was supposed to have a dmesg printk ratelimitor which would cause
inotify to only emit one message per boot.  The static bool was never set
so it kept firing messages.  This patch correctly limits warnings in multiple
places.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:31 -04:00
Eric Paris
2dfc1cae4c inotify: remove inotify in kernel interface
nothing uses inotify in the kernel, drop it!

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:31 -04:00
Eric Paris
7050c48826 inotify: do not reuse watch descriptors
Prior to 2.6.31 inotify would not reuse watch descriptors until all of
them had been used at least once.  After the rewrite inotify would reuse
watch descriptors.  The selinux utility 'restorecond' was found to have
problems when watch descriptors were reused.  This patch reverts to the
pre inotify rewrite behavior to not reuse watch descriptors.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:20 -04:00
Eric Paris
6f3a539e3b fsnotify: use kmem_cache_zalloc to simplify event initialization
fsnotify event initialization is done entry by entry with almost everything
set to either 0 or NULL.  Use kmem_cache_zalloc and only initialize things
that need non-zero initialization.  Also means we don't have to change
initialization entries based on the config options.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:20 -04:00
Eric Paris
f0553af054 fsnotify: kzalloc fsnotify groups
Use kzalloc for fsnotify_groups so that none of the fields can leak any
information accidentally.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:20 -04:00
Eric Paris
31ddd3268d inotify: use container_of instead of casting
inotify_free_mark casts directly from an fsnotify_mark_entry to an
inotify_inode_mark_entry.  This works, but should use container_of instead
for future proofing.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:19 -04:00
Eric Paris
b4277d3dd5 fsnotify: use fsnotify_create_event to allocate the q_overflow event
Currently fsnotify defines a static fsnotify event which is sent when a
group overflows its allotted queue length.  This patch just allocates that
event from the event cache rather than defining it statically.  There is no
known reason that the current implementation is wrong, but this makes sure the
event is initialized and created like any other.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:19 -04:00
Eric Paris
40554c3dae fsnotify: allow addition of duplicate fsnotify marks
This patch allows a task to add a second fsnotify mark to an inode for the
same group.  This mark will be added to the end of the inode's list and
this will never be found by the stand fsnotify_find_mark() function.   This
is useful if a user wants to add a new mark before removing the old one.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:17 -04:00
Eric Paris
9e1c74321d fsnotify: duplicate fsnotify_mark_entry data between 2 marks
Simple copy fsnotify information from one mark to another in preparation
for the second mark to replace the first.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:17 -04:00
Eric Paris
b7ba837153 inotify: simplify the inotify idr handling
This patch moves all of the idr editing operations into their own idr
functions.  It makes it easier to prove locking correctness and to to
understand the code flow.

Signed-off-by: Eric Paris <eparis@redhat.com>
2010-07-28 09:58:16 -04:00
J. Bruce Fields
fa0a21269f nfsd: bypass readahead cache when have struct file
The readahead cache compensates for the fact that the NFS server
currently does an open and close on every IO operation in the NFSv2 and
NFSv3 case.

In the NFSv4 case we have long-lived struct files associated with client
opens, so there's no need for this.  In fact, concurrent IO's using
trying to modify the same file->f_ra may cause problems.

So, don't bother with the readahead cache in that case.

Note eventually we'll likely do this in the v2/v3 case as well by
keeping a cache of struct files instead of struct file_ra_state's.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-27 18:15:54 -04:00
Yehuda Sadeh
03066f2345 ceph: use complete_all and wake_up_all
This fixes an issue triggered by running concurrent syncs. One of the syncs
would go through while the other would just hang indefinitely. In any case, we
never actually want to wake a single waiter, so the *_all functions should
be used.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-27 13:11:17 -07:00
Latchesar Ionkov
da7ddd3296 9p: Pass the correct end of buffer to p9stat_read
Pass the correct end of the buffer to p9stat_read.

Signed-off-by: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
2010-07-27 14:52:04 -05:00
Theodore Ts'o
f613dfcb33 ext4: check to make make sure bd_dev is set before dereferencing it
There are some drivers which may not set bdev->bd_dev.  So make sure
it is non-NULL before dereferencing it.

Google-Bug-Id: 1773557

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:08 -04:00
Eric Sandeen
cc937db74b jbd2: Make barrier messages less scary
Saying things like "sync failed" when a device does
not support barriers makes users slightly more worried than
they need to be; rather than talking about sync failures,
let's just state the barrier-based facts.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:08 -04:00
Eric Sandeen
e3570639c8 ext4: don't print scary messages for allocation failures post-abort
I often get emails containing the "This should not happen!!" message,
conveniently trimmed to remove things like:

sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 03 13 c9 70 00 00 28 00
end_request: I/O error, dev sda, sector 51628400
Aborting journal on device dm-0-8.
EXT4-fs error (device dm-0): ext4_journal_start_sb: Detected aborted journal
EXT4-fs (dm-0): Remounting filesystem read-only

I don't think there is any value to the verbosity if the reason is
due to a filesystem abort; it just obfuscates the root cause.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:08 -04:00
Toshiyuki Okajima
d889dc8382 ext4: fix EFBIG edge case when writing to large non-extent file
By running the following reproducer, we can confirm that the write 
system call returns with 0 when it should return the error EFBIG.

#!/bin/sh

/bin/dd if=/dev/zero of=./img bs=1k count=1 seek=1024k > /dev/null 2>&1
/sbin/mkfs.ext3 -Fq ./img
/bin/mount -o loop -t ext4 ./img /mnt
/bin/touch /mnt/file
strace /bin/dd if=/dev/zero of=/mnt/file conv=notrunc bs=1k count=1 seek=$((2194719883264/1024)) 2>&1 | /bin/egrep "write.* 1024\) = "
/bin/umount /mnt
exit

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Eric Sandeen <sandeen@redhat.com>
2010-07-27 11:56:07 -04:00
Eric Sandeen
79e8303677 ext4: fix ext4_get_blocks references
ext4_get_blocks got renamed to ext4_map_blocks, but left stale
comments and a prototype littered around.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:07 -04:00
Jan Kara
62d2b5f2dc ext4: Always journal quota file modifications
When journaled quota options are not specified, we do writes
to quota files just in data=ordered mode. This actually causes
warnings from JBD2 about dirty journaled buffer because ext4_getblk
unconditionally treats a block allocated by it as metadata. Since
quota actually is filesystem metadata, the easiest way to get rid
of the warning is to always treat quota writes as metadata...

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:07 -04:00
Cyrill Gorcunov
dcc7dae3cb ext4: Fix potential memory leak in ext4_fill_super
Under heavy memory pressure we may hit out of memory
situation and as result kstrdup'ed options will not be
freed. Fix it.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:07 -04:00
Theodore Ts'o
0c095c7f11 ext4: Don't error out the fs if the user tries to make a file too big
If the user attempts to make a non-extent-mapped file to be too large,
return EFBIG, but don't call ext4_std_err() which will end up marking
the file system as containing an error.

Thanks to Toshiyuki Okajima-san at Fujitsu for pointing this out.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:06 -04:00
Eric Sandeen
506bf2d821 ext4: allocate stripe-multiple IOs on stripe boundaries
For some reason, today mballoc only allocates IOs which are exactly
stripe-sized on a stripe boundary.  If you have a multiple (say, a
128k IO on a 64k stripe) you may end up unaligned.

It seems to me that a simple change to align stripe-multiple IOs
on stripe boundaries would be a very good idea, unless this breaks
some other mballoc heuristic for some reason...

Reported-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:06 -04:00
jiayingz@google.com (Jiaying Zhang)
5b3ff237be ext4: move aio completion after unwritten extent conversion
This patch is to be applied upon Christoph's "direct-io: move aio_complete
into ->end_io" patch. It adds iocb and result fields to struct ext4_io_end_t,
so that we can call aio_complete from  ext4_end_io_nolock() after the extent
conversion has finished.

I have verified with Christoph's aio-dio test that used to fail after a few
runs on an original kernel but now succeeds on the patched kernel.

See http://thread.gmane.org/gmane.comp.file-systems.ext4/19659 for details.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:06 -04:00
Christoph Hellwig
552ef8024f direct-io: move aio_complete into ->end_io
Filesystems with unwritten extent support must not complete an AIO request
until the transaction to convert the extent has been commited.  That means
the aio_complete calls needs to be moved into the ->end_io callback so
that the filesystem can control when to call it exactly.

This makes a bit of a mess out of dio_complete and the ->end_io callback
prototype even more complicated. 

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz> 
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:06 -04:00
Jiaying Zhang
5c521830cf ext4: Support discard requests when running in no-journal mode
Issue discard request in ext4_free_blocks() when ext4 has no journal and
is mounted with discard option.
    
Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:05 -04:00
Theodore Ts'o
47def82672 jbd2: Remove __GFP_NOFAIL from jbd2 layer
__GFP_NOFAIL is going away, so add our own retry loop.  Also add
jbd2__journal_start() and jbd2__journal_restart() which take a gfp
mask, so that file systems can optionally (re)start transaction
handles using GFP_KERNEL.  If they do this, then they need to be
prepared to handle receiving an PTR_ERR(-ENOMEM) error, and be ready
to reflect that error up to userspace.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:05 -04:00
Amir G
4038968738 ext4: Fix block bitmap inconsistencies after a crash when deleting files
We have experienced bitmap inconsistencies after crash during file
delete under heavy load.  The crash is not file system related and I
the following patch in ext4_free_branches() fixes the recovery
problem.

If the transaction is restarted and there is a crash before the new
transaction is committed, then after recovery, the blocks that this
indirect block points to have been freed, but the indirect block
itself has not been freed and may still point to some of the free
blocks (because of the ext4_forget()).

So ext4_forget() should be called inside ext4_free_blocks() to avoid
this problem.

Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:05 -04:00
Joe Perches
a271fe8527 ext4: Remove unnecessary casts of private_data
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:04 -04:00
Theodore Ts'o
e5880d76ae ext4: fix potential NULL dereference while tracing
The allocation_context pointer can be NULL.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:04 -04:00
Theodore Ts'o
89eeddf033 ext4: Define s_jnl_backup_type in superblock
This has been in use by e2fsprogs for a while; define it to keep the
super block fields in sync.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:04 -04:00
Theodore Ts'o
66e61a9e95 ext4: Once a day, printk file system error information to dmesg
This allows us to grab any file system error messages by scraping
/var/log/messages.  This will make it easy for us to do error analysis
across the very large number of machines as we deploy ext4 across the
fleet.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:04 -04:00
Theodore Ts'o
1c13d5c087 ext4: Save error information to the superblock for analysis
Save number of file system errors, and the time function name, line
number, block number, and inode number of the first and most recent
errors reported on the file system in the superblock.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:03 -04:00
Theodore Ts'o
c398eda0e4 ext4: Pass line numbers to ext4_error() and friends
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:56:40 -04:00
Theodore Ts'o
60fd4da34d ext4: Cleanup ext4_check_dir_entry so __func__ is now implicit
Also start passing the line number to ext4_check_dir since we're going
to need it in upcoming patch.
    
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2010-07-27 11:54:40 -04:00
Christoph Hellwig
209fb87a25 xfs simplify and speed up direct I/O completions
Our current handling of direct I/O completions is rather suboptimal,
because we defer it to a workqueue more often than needed, and we
perform a much to aggressive flush of the workqueue in case unwritten
extent conversions happen.

This patch changes the direct I/O reads to not even use a completion
handler, as we don't bother to use it at all, and to perform the unwritten
extent conversions in caller context for synchronous direct I/O.

For a small I/O size direct I/O workload on a consumer grade SSD, such as
the untar of a kernel tree inside qemu this patch gives speedups of
about 5%.  Getting us much closer to the speed of a native block device,
or a fully allocated XFS file.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-07-26 16:09:19 -05:00
Christoph Hellwig
fb511f2150 xfs: move aio completion after unwritten extent conversion
If we write into an unwritten extent using AIO we need to complete the AIO
request after the extent conversion has finished.  Without that a read could
race to see see the extent still unwritten and return zeros.   For synchronous
I/O we already take care of that by flushing the xfsconvertd workqueue (which
might be a bit of overkill).

To do that add iocb and result fields to struct xfs_ioend, so that we can
call aio_complete from xfs_end_io after the extent conversion has happened.
Note that we need a new result field as io_error is used for positive errno
values, while the AIO code can return negative error values and positive
transfer sizes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-07-26 16:09:10 -05:00
Christoph Hellwig
40e2e97316 direct-io: move aio_complete into ->end_io
Filesystems with unwritten extent support must not complete an AIO request
until the transaction to convert the extent has been commited.  That means
the aio_complete calls needs to be moved into the ->end_io callback so
that the filesystem can control when to call it exactly.

This makes a bit of a mess out of dio_complete and the ->end_io callback
prototype even more complicated.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-07-26 16:09:02 -05:00
Dave Chinner
696123fca8 xfs: fix big endian build
Commit 0fd7275cc42ab734eaa1a2c747e65479bd1e42af ("xfs: fix gcc 4.6
set but not read and unused statement warnings") failed to convert
some code inside XFS_NATIVE_HOST (big endian host code only) and
hence fails to build on such machines. Fix it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-07-26 16:07:38 -05:00
Eric W. Biederman
d33002129e sysfs: allow creating symlinks from untagged to tagged directories
Supporting symlinks from untagged to tagged directories is reasonable,
and needed to support CONFIG_SYSFS_DEPRECATED.  So don't fail a prior
allowing that case to work.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-26 12:02:41 -07:00
Eric W. Biederman
521d045354 sysfs: sysfs_delete_link handle symlinks from untagged to tagged directories.
This happens for network devices when SYSFS_DEPRECATED is enabled.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-26 12:02:41 -07:00
Eric W. Biederman
96d6523adf sysfs: Don't allow the creation of symlinks we can't remove
Recently my tagged sysfs support revealed a flaw in the device core
that a few rare drivers are running into such that we don't always put
network devices in a class subdirectory named net/.

Since we are not creating the class directory the network devices wind
up in a non-tagged directory, but the symlinks to the network devices
from /sys/class/net are in a tagged directory.  All of which works
until we go to remove or rename the symlink.  When we remove or rename
a symlink we look in the namespace of the target of the symlink.
Since the target of the symlink is in a non-tagged sysfs directory we
don't have a namespace to look in, and we fail to remove the symlink.

Detect this problem up front and simply don't create symlinks we won't
be able to remove later.  This prevents symlink leakage and fails in
a much clearer and more understandable way.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-26 12:02:41 -07:00
Christoph Hellwig
ecd7f082d6 xfs: clean up xfs_bmap_get_bp
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26 13:16:53 -05:00
Christoph Hellwig
5d18898b20 xfs: simplify xfs_truncate_file
xfs_truncate_file is only used for truncating quota files.  Move it to
xfs_qm_syscalls.c so it can be marked static and take advatange of the
fact by removing the unused page cache validation and taking the iget
into the helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26 13:16:52 -05:00
Christoph Hellwig
939d723b72 xfs: kill the b_strat callback in xfs_buf
The b_strat callback is used by xfs_buf_iostrategy to perform additional
checks before submitting a buffer.  It is used in xfs_bwrite and when
writing out delayed buffers.  In xfs_bwrite it we can de-virtualize the
call easily as b_strat is set a few lines above the call to
xfs_buf_iostrategy.  For the delayed buffers the rationale is a bit
more complicated:

 - there are three callers of xfs_buf_delwri_queue, which places buffers
   on the delwri list:
    (1) xfs_bdwrite - this sets up b_strat, so it's fine
    (2) xfs_buf_iorequest.  None of the callers can have XBF_DELWRI set:
	- xlog_bdstrat is only used for log buffers, which are never delwri
	- _xfs_buf_read explicitly clears the delwri flag
	- xfs_buf_iodone_work retries log buffers only
	- xfsbdstrat - only used for reads, superblock writes without the
	  delwri flag, log I/O and file zeroing with explicitly allocated
	  buffers.
	- xfs_buf_iostrategy - only calls xfs_buf_iorequest if b_strat is
	  not set
    (3) xfs_buf_unlock
	- only puts the buffer on the delwri list if the DELWRI flag is
	  already set.  The DELWRI flag is only ever set in xfs_bwrite,
	  xfs_buf_iodone_callbacks, or xfs_trans_log_buf.  For
	  xfs_buf_iodone_callbacks and xfs_trans_log_buf we require
	  an initialized buf item, which means b_strat was set to
	  xfs_bdstrat_cb in xfs_buf_item_init.

Conclusion: we can just get rid of the callback and replace it with
explicit calls to xfs_bdstrat_cb.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26 13:16:52 -05:00
Christoph Hellwig
a64afb057b xfs: remove obsolete osyncisosync mount option
Since Linux 2.6.33 the kernel has support for real O_SYNC, which made
the osyncisosync option a no-op.  Warn the users about this and remove
the mount flag for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26 13:16:51 -05:00
Christoph Hellwig
0664ce8d0f xfs: clean up filestreams helpers
Move xfs_filestream_peek_ag, xxfs_filestream_get_ag and xfs_filestream_put_ag
from xfs_filestream.h to xfs_filestream.c where it's only callers are, and
remove the inline marker while we're at it to let the compiler decide on the
inlining.  Also don't return a value from xfs_filestream_put_ag because
we don't need it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26 13:16:51 -05:00
Christoph Hellwig
73523a2ecf xfs: fix gcc 4.6 set but not read and unused statement warnings
[hch: dropped a few hunks that need structural changes instead]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26 13:16:51 -05:00
Tony Luck
0f1a932f5d xfs: Fix build when CONFIG_XFS_POSIX_ACL=n
When CONFIG_XFS_POSIX_ACL is not set "xfs_check_acl" is #defined
to NULL - which breaks the code attempting to add a tracepoint
on this function.

Only define the tracepoint when the function exists.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26 13:16:50 -05:00
Kulikov Vasiliy
3f34885cd7 xfs: fix unsigned underflow in xfs_free_eofblocks
map_len is unsigned. Checking map_len <= 0 is buggy when it should be
below zero. So, check exact expression instead of map_len.

Signed-off-by: Kulikov Vasiliy <segooon@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26 13:16:50 -05:00
Dave Chinner
aea1b95321 xfs: use GFP_NOFS for page cache allocation
Avoid a lockdep warning by preventing page cache allocation from
recursing back into the filesystem during memory reclaim.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26 13:16:49 -05:00
Dave Chinner
4a7edddcb5 xfs: fix memory reclaim recursion deadlock on locked inode buffer
Calling into memory reclaim with a locked inode buffer can deadlock
if memory reclaim tries to lock the inode buffer during inode
teardown. Convert the relevant memory allocations to use KM_NOFS to
avoid this deadlock condition.

Reported-by: Peter Watkins <treestem@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26 13:16:49 -05:00
Dave Chinner
438697064a xfs: fix xfs_trans_add_item() lockdep warnings
xfs_trans_add_item() is called with ip->i_ilock held, which means it
is unsafe for memory reclaim to recurse back into the filesystem
(ilock is required in writeback). Hence the allocation needs to be
KM_NOFS to avoid recursion.

Lockdep report indicating memory allocation being called with the
ip->i_ilock held is as follows:

[ 1749.866796] =================================
[ 1749.867788] [ INFO: inconsistent lock state ]
[ 1749.868327] 2.6.35-rc3-dgc+ #25
[ 1749.868741] ---------------------------------
[ 1749.868741] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
[ 1749.868741] dd/2835 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 1749.868741]  (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff813170fb>] xfs_ilock+0x10b/0x190
[ 1749.868741] {IN-RECLAIM_FS-W} state was registered at:
[ 1749.868741]   [<ffffffff810b3a97>] __lock_acquire+0x437/0x1450
[ 1749.868741]   [<ffffffff810b4b56>] lock_acquire+0xa6/0x160
[ 1749.868741]   [<ffffffff810a20b5>] down_write_nested+0x65/0xb0
[ 1749.868741]   [<ffffffff813170fb>] xfs_ilock+0x10b/0x190
[ 1749.868741]   [<ffffffff8134e819>] xfs_reclaim_inode+0x99/0x310
[ 1749.868741]   [<ffffffff8134f56b>] xfs_inode_ag_walk+0x8b/0x150
[ 1749.868741]   [<ffffffff8134f6bb>] xfs_inode_ag_iterator+0x8b/0xf0
[ 1749.868741]   [<ffffffff8134f7a8>] xfs_reclaim_inode_shrink+0x88/0x90
[ 1749.868741]   [<ffffffff81119d07>] shrink_slab+0x137/0x1a0
[ 1749.868741]   [<ffffffff8111bbe1>] balance_pgdat+0x421/0x6a0
[ 1749.868741]   [<ffffffff8111bf7d>] kswapd+0x11d/0x320
[ 1749.868741]   [<ffffffff8109ce56>] kthread+0x96/0xa0
[ 1749.868741]   [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10
[ 1749.868741] irq event stamp: 4234335
[ 1749.868741] hardirqs last  enabled at (4234335): [<ffffffff81147d25>] kmem_cache_free+0x115/0x220
[ 1749.868741] hardirqs last disabled at (4234334): [<ffffffff81147c4d>] kmem_cache_free+0x3d/0x220
[ 1749.868741] softirqs last  enabled at (4233112): [<ffffffff81084dd2>] __do_softirq+0x142/0x260
[ 1749.868741] softirqs last disabled at (4233095): [<ffffffff81035edc>] call_softirq+0x1c/0x50
[ 1749.868741] 
[ 1749.868741] other info that might help us debug this:
[ 1749.868741] 2 locks held by dd/2835:
[ 1749.868741]  #0:  (&(&ip->i_iolock)->mr_lock#2){+.+.+.}, at: [<ffffffff81316edd>] xfs_ilock_nowait+0xed/0x200
[ 1749.868741]  #1:  (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff813170fb>] xfs_ilock+0x10b/0x190
[ 1749.868741] 
[ 1749.868741] stack backtrace:
[ 1749.868741] Pid: 2835, comm: dd Not tainted 2.6.35-rc3-dgc+ #25
[ 1749.868741] Call Trace:
[ 1749.868741]  [<ffffffff810b1faa>] print_usage_bug+0x18a/0x190
[ 1749.868741]  [<ffffffff8104264f>] ? save_stack_trace+0x2f/0x50
[ 1749.868741]  [<ffffffff810b2400>] ? check_usage_backwards+0x0/0xf0
[ 1749.868741]  [<ffffffff810b2f11>] mark_lock+0x331/0x400
[ 1749.868741]  [<ffffffff810b3047>] mark_held_locks+0x67/0x90
[ 1749.868741]  [<ffffffff810b3111>] lockdep_trace_alloc+0xa1/0xe0
[ 1749.868741]  [<ffffffff81147419>] kmem_cache_alloc+0x39/0x1e0
[ 1749.868741]  [<ffffffff8133f954>] kmem_zone_alloc+0x94/0xe0
[ 1749.868741]  [<ffffffff8133f9be>] kmem_zone_zalloc+0x1e/0x50
[ 1749.868741]  [<ffffffff81335f02>] xfs_trans_add_item+0x72/0xb0
[ 1749.868741]  [<ffffffff81339e41>] xfs_trans_ijoin+0xa1/0xd0
[ 1749.868741]  [<ffffffff81319f82>] xfs_itruncate_finish+0x312/0x5d0
[ 1749.868741]  [<ffffffff8133cb87>] xfs_free_eofblocks+0x227/0x280
[ 1749.868741]  [<ffffffff8133cd18>] xfs_release+0x138/0x190
[ 1749.868741]  [<ffffffff813464c5>] xfs_file_release+0x15/0x20
[ 1749.868741]  [<ffffffff81150ebf>] fput+0x13f/0x260
[ 1749.868741]  [<ffffffff8114d8c2>] filp_close+0x52/0x80
[ 1749.868741]  [<ffffffff8114d9a9>] sys_close+0xb9/0x120
[ 1749.868741]  [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26 13:16:49 -05:00
Dave Chinner
2f11feabb1 xfs: simplify and remove xfs_ireclaim
xfs_ireclaim has to get and put te pag structure because it is only
called with the inode to reclaim. The one caller of this function
already has a reference on the pag and a pointer to is, so move the
radix tree delete to the caller and remove xfs_ireclaim completely.
This avoids a xfs_perag_get/put on every inode being reclaimed.

The overhead was noticed in a bug report at:

https://bugzilla.kernel.org/show_bug.cgi?id=16348

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26 13:16:48 -05:00
Dave Chinner
ec53d1dbb3 xfs: don't block on buffer read errors
xfs_buf_read() fails to detect dispatch errors before attempting to
wait on sychronous IO. If there was an error, it will get stuck
forever, waiting for an I/O that was never started. Make sure the
error is detected correctly.

Further, such a failure can leave locked pages in the page cache
which will cause a later operation to hang on the page. Ensure that
we correctly process pages in the buffers when we get a dispatch
error.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2010-07-26 13:16:48 -05:00
Dave Chinner
a4190f90b4 xfs: move inode shrinker unregister even earlier
I missed Dave Chinner's second revision of this change, and pushed
his first version out to the repository instead.

	commit a476c59ebb279d738718edc0e3fb76aab3687114
	Author: Dave Chinner <dchinner@redhat.com>

This commit compensates for that by moving a block of code up a bit
further, with a result that matches the the effect of Dave's second
version.

Dave's first version was:
	Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Dave's second version was:
	Reviewed-by: Christoph Hellwig <hch@lst.de>

Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2010-07-26 13:16:47 -05:00
Christoph Hellwig
fa17b25e9f xfs: remove a dmapi leftover
The open_exec file operation is only added by the external dmapi
patch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-07-26 13:16:47 -05:00
Christoph Hellwig
78558fe8d8 xfs: writepage always has buffers
These days we always have buffers thanks to ->page_mkwrite.  And we
already have an assert a few lines above tripping in case that was
not true due to a bug.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-07-26 13:16:46 -05:00
Christoph Hellwig
d4f7a5cbd5 xfs: allow writeback from kswapd
We only need disable I/O from direct or memcg reclaim.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-07-26 13:16:46 -05:00
Christoph Hellwig
651701d71d xfs: remove incorrect log write optimization
We do need a barrier for the first buffer of a split log write.
Otherwise we might incorrectly stamp the tail LSN into transactions
in the first part of the split write, or not flush data I/O before
updating the inode size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-07-26 13:16:45 -05:00
Dave Chinner
2727ccc950 xfs: unregister inode shrinker before freeing filesystem structures
Currently we don't remove the XFS mount from the shrinker list until
late in the unmount path. By this time, we have already torn down
the internals of the filesystem (e.g. the per-ag structures), and
hence if the shrinker is executed between the teardown and the
unregistering, the shrinker will get NULL per-ag structure pointers
and panic trying to dereference them.

Fix this by removing the xfs mount from the shrinker list before
tearing down it's internal structures.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2010-07-26 13:16:45 -05:00
Christoph Hellwig
cca28fb83d xfs: split xfs_itrace_entry
Replace the xfs_itrace_entry catchall with specific trace points.  For
most simple callers we now use the simple inode class, which used to
be the iget class, but add more details tracing for namespace events,
which now includes the name of the directory entries manipulated.

Remove the xfs_inactive trace point, which is a duplicate of the clear_inode
one, and the xfs_change_file_space trace point, which is immediately
followed by the more specific alloc/free space trace points.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:44 -05:00
Christoph Hellwig
f2d6761433 xfs: remove xfs_iput
xfs_iput is just a small wrapper for xfs_iunlock + IRELE.  Having this
out of line wrapper means the trace events in those two can't track
their caller properly.  So just remove the wrapper and opencode the
unlock + rele in the few callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:44 -05:00
Christoph Hellwig
ef35e9255d xfs: remove xfs_iput_new
We never get an i_mode of 0 or a locked VFS inode until we pass in the
XFS_IGET_CREATE flag to xfs_iget, which makes xfs_iput_new equivalent to
xfs_iput for the only caller.  In addition to that xfs_nfs_get_inode
does not even need to lock the inode given that the generation never changes
for a life inode, so just pass a 0 lock_flags to xfs_iget and release
the inode using IRELE in the error path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:44 -05:00
Christoph Hellwig
d2e078c33c xfs: some iget tracing cleanups / fixes
The xfs_iget_alloc/found tracepoints are a bit misnamed and misplaced.
Rename them to xfs_iget_hit/xfs_iget_miss and move them to the beggining
of the xfs_iget_cache_hit/miss functions.  Add a new xfs_iget_reclaim_fail
tracepoint for the case where we fail to re-initialize a VFS inode,
and add a second instance of the xfs_iget_skip tracepoint for the case
of a failed igrab() call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:43 -05:00
Christoph Hellwig
807cbbdb43 xfs: do not use emums for flags used in tracing
The tracing code can't print flags defined as enums.  Most flags that
we want to print are defines as macros already, but move the few remaining
ones over to make the trace output more useful.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:43 -05:00
Christoph Hellwig
64c8614941 xfs: remove explicit xfs_sync_data/xfs_sync_attr calls on umount
On the final put of a superblock the VFS already calls sync_filesystem
for us to write out all data and wait for it.  No need to start another
asynchronous writeback inside ->put_super.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:42 -05:00
Christoph Hellwig
f2bde9b89b xfs: small cleanups for xfs_iomap / __xfs_get_blocks
Remove the flags argument to  __xfs_get_blocks as we can easily derive
it from the direct argument, and remove the unused BMAPI_MMAP flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:42 -05:00
Christoph Hellwig
3070451eea xfs: reduce stack usage in xfs_iomap
xfs_iomap passes a xfs_bmbt_irec pointer to xfs_iomap_write_direct and
xfs_iomap_write_allocate to give them the results of our read-only
xfs_bmapi query.  Instead of allocating a new xfs_bmbt_irec on stack
for the next call to xfs_bmapi re use the one we got passed as it's not
used after this point.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:42 -05:00
Christoph Hellwig
7a36c8a98a xfs: avoid synchronous transaction in xfs_fs_write_inode
We already rely on the fact that the sync code will cause a synchronous
log force later on (currently via xfs_fs_sync_fs -> xfs_quiesce_data ->
xfs_sync_data), so no need to do this here.  This allows us to avoid
a lot of synchronous log forces during sync, which pays of especially
with delayed logging enabled.   Some compilebench numbers that show
this:

xfs (delayed logging, 256k logbufs)
===================================

intial create		  25.94 MB/s	  25.75 MB/s	  25.64 MB/s
create			   8.54 MB/s	   9.12 MB/s	   9.15 MB/s
patch			   2.47 MB/s	   2.47 MB/s	   3.17 MB/s
compile			  29.65 MB/s	  30.51 MB/s	  27.33 MB/s
clean			  90.92 MB/s	  98.83 MB/s	 128.87 MB/s
read tree		  11.90 MB/s	  11.84 MB/s	   8.56 MB/s
read compiled		  28.75 MB/s	  29.96 MB/s	  24.25 MB/s
delete tree		8.39 seconds	8.12 seconds	8.46 seconds
delete compiled		8.35 seconds	8.44 seconds	5.11 seconds
stat tree		6.03 seconds	5.59 seconds	5.19 seconds
stat compiled tree	9.00 seconds	9.52 seconds	8.49 seconds

xfs + write_inode log_force removal
===================================
intial create		  25.87 MB/s	  25.76 MB/s	  25.87 MB/s
create			  15.18 MB/s	  14.80 MB/s	  14.94 MB/s
patch			   3.13 MB/s	   3.14 MB/s	   3.11 MB/s
compile			  36.74 MB/s	  37.17 MB/s	  36.84 MB/s
clean			 226.02 MB/s	 222.58 MB/s	 217.94 MB/s
read tree		  15.14 MB/s	  15.02 MB/s	  15.14 MB/s
read compiled tree	  29.30 MB/s	  29.31 MB/s	  29.32 MB/s
delete tree		6.22 seconds	6.14 seconds	6.15 seconds
delete compiled tree	5.75 seconds	5.92 seconds	5.81 seconds
stat tree		4.60 seconds	4.51 seconds	4.56 seconds
stat compiled tree	4.07 seconds	3.87 seconds	3.96 seconds

In addition to that also remove the delwri inode flush that is unessecary
now that bulkstat is always coherent.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:41 -05:00
Christoph Hellwig
20cb52ebd1 xfs: simplify xfs_vm_writepage
The writepage implementation in XFS still tries to deal with dirty but
unmapped buffers which used to caused by writes through shared mmaps.  Since
the introduction of ->page_mkwrite these can't happen anymore, so remove the
code dealing with them.

Note that the all_bh variable which causes us to start I/O on all buffers on
the pages was controlled by the count of unmapped buffers, which also
included those not actually dirty.  It's now unconditionally initialized to
0 but set to 1 for the case of small file size extensions.  It probably can
be removed entirely, but that's left for another patch.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:41 -05:00
Christoph Hellwig
89f3b36396 xfs: simplify xfs_vm_releasepage
Currently the xfs releasepage implementation has code to deal with converting
delayed allocated and unwritten space.  But we never get called for those as
we always convert delayed and unwritten space when cleaning a page, or drop
the state from the buffers in block_invalidatepage.  We still keep a WARN_ON
on those cases for now, but remove all the case dealing with it, which allows
to fold xfs_page_state_convert into xfs_vm_writepage and remove the !startio
case from the whole writeback path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:40 -05:00
Eric Sandeen
3d9b02e3c7 xfs: fix corruption case for block size < page size
xfstests 194 first truncats a file back and then extends it again by
truncating it to a larger size.  This causes discard_buffer to drop
the mapped, but not the uptodate bit and thus creates something that
xfs_page_state_convert takes for unmapped space created by mmap because
it doesn't check for the dirty bit, which also gets cleared by
discard_buffer and checked by other ->writepage implementations like
block_write_full_page.  Handle this kind of buffers early, and unlike
Eric's first version of the patch simply ASSERT that the buffers is
dirty, given that the mmap write case can't happen anymore since the
introduction of ->page_mkwrite.  The now dead code dealing with that
will be deleted in a follow on patch.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:40 -05:00
Christoph Hellwig
b4e9181e77 xfs: remove unused delta tracking code in xfs_bmapi
This code was introduced four years ago in commit
3e57ecf640 without any review and has
been unused since.  Remove it just as the rest of the code introduced
in that commit to reduce that stack usage and complexity in this central
piece of code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:39 -05:00
Christoph Hellwig
cd8b0bb3c4 xfs: remove unused XFS_BMAPI_ flags
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:39 -05:00
Christoph Hellwig
a59f55703c xfs: remove the unused XFS_TRANS_NOSLEEP/XFS_TRANS_WAIT flags
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:38 -05:00
Christoph Hellwig
9134c2332e xfs: remove the unused XFS_LOG_SLEEP and XFS_LOG_NOSLEEP flags
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:38 -05:00
Christoph Hellwig
dbb2f6529f xfs: kill the unused xlog_debug variable
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:37 -05:00
Christoph Hellwig
4e0d5f926b xfs: fix the xfs_log_iovec i_addr type
By making this member a void pointer we can get rid of a lot of pointless
casts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:36 -05:00
Christoph Hellwig
898621d5a7 xfs: simplify inode to transaction joining
Currently we need to either call IHOLD or xfs_trans_ihold on an inode when
joining it to a transaction via xfs_trans_ijoin.

This patches instead makes xfs_trans_ijoin usable on it's own by doing
an implicity xfs_trans_ihold, which also allows us to drop the third
argument.  For the case where we want to hold a reference on the inode
a xfs_trans_ijoin_ref wrapper is added which does the IHOLD and marks
the inode for needing an xfs_iput.  In addition to the cleaner interface
to the caller this also simplifies the implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:36 -05:00
Christoph Hellwig
4d16e9246f xfs: simplify buffer pinning
Get rid of the xfs_buf_pin/xfs_buf_unpin/xfs_buf_ispin helpers and opencode
them in their only callers, just like we did for the inode pinning a while
ago.  Also remove duplicate trace points - the bufitem tracepoints cover
all the information that is present in a buffer tracepoint.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:36 -05:00
Christoph Hellwig
ca30b2a7b7 xfs: give li_cb callbacks the correct prototype
Stop the function pointer casting madness and give all the li_cb instances
correct prototype.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:35 -05:00
Christoph Hellwig
7bfa31d8e0 xfs: give xfs_item_ops methods the correct prototypes
Stop the function pointer casting madness and give all the xfs_item_ops the
correct prototypes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:35 -05:00
Christoph Hellwig
9412e3181c xfs: merge iop_unpin_remove into iop_unpin
The unpin_remove item operation instances always share most of the
implementation with the respective unpin implementation.  So instead
of keeping two different entry points add a remove flag to the unpin
operation and share the code more easily.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:34 -05:00
Christoph Hellwig
e98c414f9a xfs: simplify log item descriptor tracking
Currently we track log item descriptor belonging to a transaction using a
complex opencoded chunk allocator.  This code has been there since day one
and seems to work around the lack of an efficient slab allocator.

This patch replaces it with dynamically allocated log item descriptors
from a dedicated slab pool, linked to the transaction by a linked list.

This allows to greatly simplify the log item descriptor tracking to the
point where it's just a couple hundred lines in xfs_trans.c instead of
a separate file.  The external API has also been simplified while we're
at it - the xfs_trans_add_item and xfs_trans_del_item functions to add/
delete items from a transaction have been simplified to the bare minium,
and the xfs_trans_find_item function is replaced with a direct dereference
of the li_desc field.  All debug code walking the list of log items in
a transaction is down to a simple list_for_each_entry.

Note that we could easily use a singly linked list here instead of the
double linked list from list.h as the fastpath only does deletion from
sequential traversal.  But given that we don't have one available as
a library function yet I use the list.h functions for simplicity.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:34 -05:00
Christoph Hellwig
3400777ff0 xfs: remove unneeded #include statements
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
2010-07-26 13:16:33 -05:00
Christoph Hellwig
288699feca xfs: drop dmapi hooks
Dmapi support was never merged upstream, but we still have a lot of hooks
bloating XFS for it, all over the fast pathes of the filesystem.

This patch drops over 700 lines of dmapi overhead.  If we'll ever get HSM
support in mainline at least the namespace events can be done much saner
in the VFS instead of the individual filesystem, so it's not like this
is much help for future work.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2010-07-26 13:16:33 -05:00
Ryusuke Konishi
89c0fd014d nilfs2: reject filesystem with unsupported block size
This inserts sanity check that refuses to mount a filesystem with
unsupported block size.

Previously, kernel code of nilfs was looking only limitation of
devices though mkfs.nilfs2 limits the range of block sizes; there was
no check that prevents rec_len overflow with larger block sizes.

With this change, block sizes larger than 64KB or smaller than 1KB
will get rejected explicitly by kernel.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-25 23:29:21 +09:00
Ryusuke Konishi
6cda9fa257 nilfs2: avoid rec_len overflow with 64KB block size
With 64KB blocksize, a directory entry can have size 64KB which does
not fit into 16 bits we have for entry length.  So this patch stores
0xffff instead and converts value when read from / written to disk.

Nilfs derives its directory implementation from ext2 filesystem, and
this draws upon the corresponding change on ext2.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-25 20:46:43 +09:00
Robert P. J. Day
25848b3ec6 ceph: Correct obvious typo of Kconfig variable "CRYPTO_AES"
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-24 21:36:07 -07:00
Tejun Heo
40f2b6ffe5 fscache: fix build on !CONFIG_SYSCTL
Commit 8b8edefa (fscache: convert object to use workqueue instead of
slow-work) made fscache_exit() call unregister_sysctl_table()
unconditionally breaking build when sysctl is disabled.  Fix it by
putting it inside CONFIG_SYSCTL.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: David Howells <dhowells@redhat.com>
2010-07-24 11:10:09 +02:00
Ryusuke Konishi
c28e69d933 nilfs2: simplify nilfs_get_page function
Implementation of nilfs_get_page() is a bit old as below:

 - A common read_mapping_page inline function is now available instead
   of its read_cache_page use.
 - wait_on_page_locked() use in the function is eliminable since
   read_cache_page function does the same thing through wait_on_page_read().
 - PageUptodate() check is eliminable for the same reason.

This renews nilfs_get_page() based on these points.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-24 17:36:29 +09:00
Sage Weil
1dadcce358 ceph: fix dentry lease release
When we embed a dentry lease release notification in a request, invalidate
our lease so we don't think we still have it.  Otherwise we can get all
sorts of incorrect client behavior when multiple clients are interacting
with the same part of the namespace.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23 13:54:21 -07:00
Sage Weil
8c696737aa ceph: fix leak of dentry in ceph_init_dentry() error path
If we fail to allocate a ceph_dentry_info, don't leak the dn reference.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23 10:02:07 -07:00
Sage Weil
bc4fdca857 ceph: fix pg_mapping leak on pg_temp updates
Free the ceph_pg_mapping structs when they are removed from the pg_temp
rbtree.  Also fix a leak in the __insert_pg_mapping() error path.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23 10:02:06 -07:00
Sage Weil
252af52146 ceph: fix d_release dop for snapdir, snapped dentries
We need to set the d_release dop for snapdir and snapped dentries so that
the ceph_dentry_info struct gets released.  We also use the dcache to
cache readdir results when possible, which only works if we know when
dentries are dropped from the cache.  Since we don't use the dcache for
readdir in the hidden snapdir, avoid that case in ceph_dentry_release.

Signed-off-by: Sage Weil <sage@newdream.net>
2010-07-23 10:02:05 -07:00
J. Bruce Fields
af4718f3f9 nfsd: minor nfsd_svc() cleanup
More idiomatic to put the error case in the if clause.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-23 08:51:27 -04:00
J. Bruce Fields
59db4a0c10 nfsd: move more into nfsd_startup()
This is just cleanup--it's harmless to call nfsd_rachache_init,
nfsd_init_socks, and nfsd_reset_versions more than once.  But there's no
point to it.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-23 08:51:26 -04:00
Jeff Layton
ac77efbe2b nfsd: just keep single lockd reference for nfsd
Right now, nfsd keeps a lockd reference for each socket that it has
open. This is unnecessary and complicates the error handling on
startup and shutdown. Change it to just do a lockd_up when starting
the first nfsd thread just do a single lockd_down when taking down the
last nfsd thread. Because of the strange way the sv_count is handled
this requires an extra flag to tell whether the nfsd_serv holds a
reference for lockd or not.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-23 08:51:26 -04:00
Jeff Layton
628b368728 nfsd: clean up nfsd_create_serv error handling
There doesn't seem to be any need to reset the nfssvc_boot time if the
nfsd startup failed.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-23 08:51:25 -04:00
Jeff Layton
0cd14a061e nfsd: fix error handling in __write_ports_addxprt
__write_ports_addxprt calls nfsd_create_serv. That increases the
refcount of nfsd_serv (which is tracked in sv_nrthreads). The service
only decrements the thread count on error, not on success like
__write_ports_addfd does, so using this interface leaves the nfsd
thread count high.

Fix this by having this function call svc_destroy() on error to release
the reference (and possibly to tear down the service) and simply
decrement the refcount without tearing down the service on success.

This makes the sv_threads handling work basically the same in both
__write_ports_addxprt and __write_ports_addfd.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-23 08:51:24 -04:00
Jeff Layton
78a8d7c8ca nfsd: fix error handling when starting nfsd with rpcbind down
The refcounting for nfsd is a little goofy. What happens is that we
create the nfsd RPC service, attach sockets to it but don't actually
start the threads until someone writes to the "threads" procfile. To do
this, __write_ports_addfd will create the nfsd service and then will
decrement the refcount when exiting but won't actually destroy the
service.

This is fine when there aren't errors, but when there are this can
cause later attempts to start nfsd to fail. nfsd_serv will be set,
and that causes __write_versions to return EBUSY.

Fix this by calling svc_destroy on nfsd_serv when this function is
going to return error.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-23 08:51:23 -04:00
Jeff Layton
4ad9a344be nfsd4: fix v4 state shutdown error paths
If someone tries to shut down the laundry_wq while it isn't up it'll
cause an oops.

This can happen because write_ports can create a nfsd_svc before we
really start the nfs server, and we may fail before the server is ever
started.

Also make sure state is shutdown on error paths in nfsd_svc().

Use a common global nfsd_up flag instead of nfs4_init, and create common
helper functions for nfsd start/shutdown, as there will be other work
that we want done only when we the number of nfsd threads transitions
between zero and nonzero.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-23 08:51:22 -04:00
J. Bruce Fields
55b13354d7 nfsd: remove unused assignment from nfsd_link
Trivial cleanup, since "dest" is never used.

Reported-by: Anshul Madan <Anshul.Madan@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2010-07-23 08:50:39 -04:00
Tejun Heo
6ecd7c2dd9 gfs2: use workqueue instead of slow-work
Workqueue can now handle high concurrency.  Convert gfs to use
workqueue instead of slow-work.

* Steven pointed out that recovery path might be run from allocation
  path and thus requires forward progress guarantee without memory
  allocation.  Create and use gfs_recovery_wq with rescuer.  Please
  note that forward progress wasn't guaranteed with slow-work.

* Updated to use non-reentrant workqueue.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2010-07-23 13:14:25 +02:00
Dave Chinner
aa32a79638 ext3: default to ordered mode
data=writeback mode is dangerous as it leads to higher data loss and stale data
exposure when systems crash. It should not be the default, especially when all
major distros ensure their ext3 filesystems default to ordered mode. Change the
default mode to the safer data=ordered mode, because we should be caring far
more about avoiding stale data exposure than performance.

CC: linux-ext4@vger.kernel.org
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2010-07-23 12:50:55 +02:00
Jan Kara
43d2932d88 quota: Use mark_inode_dirty_sync instead of mark_inode_dirty
Quota code never touches file data. It just modifies i_blocks + i_bytes
of inodes and inode flags of quota files. So use mark_inode_dirty_sync
instead of mark_inode_dirty.

Signed-off-by: Jan Kara <jack@suse.cz>
2010-07-23 12:50:46 +02:00
Ryusuke Konishi
c5ca48aabe nilfs2: reject incompatible filesystem
This forces nilfs to check compatibility of feature flags so as to
reject a filesystem with unknown features when it mounts or remounts
the filesystem.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:16 +09:00
Ryusuke Konishi
03bdb5ac58 nilfs2: apply read-ahead for nilfs_btree_lookup_contig
This applies read-ahead to nilfs_btree_do_lookup and
nilfs_btree_lookup_contig functions and extends them to read ahead
siblings of level 1 btree nodes that hold data blocks.

At present, the read-ahead is not applied to most btree operations;
only get_block() callback function, which is used during read of
regular files or directories, receives the benefit.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:16 +09:00
Ryusuke Konishi
4e13e66bee nilfs2: introduce check flag to btree node buffer
nilfs_btree_get_block() now may return untested buffer due to
read-ahead.  This adds a new flag for buffer heads so that the btree
code can check whether the buffer is already verified or not.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:15 +09:00
Ryusuke Konishi
464ece8863 nilfs2: add btree get block function with readahead option
This adds __nilfs_btree_get_block() function that can issue a series
of read-ahead requests for sibling btree nodes.

This read-ahead needs parent node block, so nilfs_btree_readahead_info
structure is added to pass the information that
__nilfs_btree_get_block() needs.

This also replaces the previous nilfs_btree_get_block() implementation
with a wrapper function of __nilfs_btree_get_block().

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:15 +09:00
Ryusuke Konishi
26dfdd8e29 nilfs2: add read ahead mode to nilfs_btnode_submit_block
This adds mode argument to nilfs_btnode_submit_block() function and
allows it to issue a read-ahead request.

An optional submit_ptr argument is also added to store the actual
block address for which bio is sent.  submit_ptr is used for a series
of read-ahead requests, and helps to decide if each requested block is
continous to the previous one on disk.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:15 +09:00
Ryusuke Konishi
f8e6cc013b nilfs2: fix buffer head leak in nilfs_btnode_submit_block
nilfs_btnode_submit_block() refers to buffer head just before
returning from the function, but it releases the buffer head earlier
than that if nilfs_dat_translate() gets an error.

This has potential for oops in the erroneous case.  This fixes the
issue.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:15 +09:00
Ryusuke Konishi
7c397a81fe nilfs2: eliminate inline keywords in btree implementation
This removes all inline uses from btree.c.  Gcc now agressively apply
inline expansion even for the functions declared without the keyword;
the inline use in btree.c looks excessive.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:15 +09:00
Ryusuke Konishi
5ad2686e92 nilfs2: get maximum number of child nodes from bmap object
The patch "reduce repetitive calculation of max number of child nodes"
gathered up the calculation of maximum number of child nodes into
nilfs_btree_nchildren_per_block() function.  This makes the function
get resultant value from a private variable in bmap object instead of
calculating it for each call.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:14 +09:00
Ryusuke Konishi
9b7b265c9a nilfs2: reduce repetitive calculation of max number of child nodes
The current btree implementation repeats the same calculation on the
maximum number of child nodes.  This is because a few low level
routines use the calculation for index addressing in a btree node
block.

This reduces the calculation by explicitly passing the maximum number
of child nodes (ncmax) through their argument.

This changes parameter passing of the following functions:

 - nilfs_btree_node_dptrs
 - nilfs_btree_node_get_ptr
 - nilfs_btree_node_set_ptr
 - nilfs_btree_node_init
 - nilfs_btree_node_move_left
 - nilfs_btree_node_move_right
 - nilfs_btree_node_insert
 - nilfs_btree_node_delete, and
 - nilfs_btree_get_node

The following functions are removed:

 - nilfs_btree_node_nchildren_min
 - nilfs_btree_node_nchildren_max

Most middle level btree operations are rewritten to pass a proper
ncmax value depending on whether each occurrence of node is "root" or
not.

A constant NILFS_BTREE_ROOT_NCHILDREN_MAX is used for the root node,
whereas nilfs_btree_nchildren_per_block() function is used for
non-root nodes.  If a node could be either root or a non-root node, an
output argument of nilfs_btree_get_node() is used to set up ncmax.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:14 +09:00
Ryusuke Konishi
ea64ab87cd nilfs2: optimize calculation of min/max number of btree node children
nilfs_btree_node_nchildren_max() and nilfs_btree_node_nchildren_min()
functions switch return value depending on whether target node is the
root or a node block.  In most uses of these functions, however, the
node type is fixed, and moreover the same calculation is repeatedly
performed in loop.

This unfold these functions depending on context and move them outside
loops wherever possible.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:14 +09:00
Ryusuke Konishi
364ec2d700 nilfs2: remove redundant pointer checks in bmap lookup functions
nilfs_bmap_lookup and its variants are supposed to take a valid
pointer argument to return a block address, thus pointer checks in
nilfs_btree_lookup and nilfs_direct_lookup are needless.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:14 +09:00
Ryusuke Konishi
05d0e94b66 nilfs2: get rid of nilfs_bmap_union
This removes nilfs_bmap_union and finally unifies three structures and
the union in bmap/btree code into one.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:14 +09:00
Ryusuke Konishi
dc935be2a0 nilfs2: unify bmap set_target_v operations
This unifies two similar functions nilfs_btree_set_target_v and
nilfs_direct_set_target_v into one, nilfs_bmap_set_target_v.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:14 +09:00
Ryusuke Konishi
e7c274f808 nilfs2: get rid of nilfs_btree uses
This replaces all uses of nilfs_btree struct in implementation of
btree mapping with nilfs_bmap struct.

Name of local variable "btree" is kept not to bloat amount of change.
And, a part of local variables "bmap" is renamed to "btree" to uniform
naming rule.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:13 +09:00
Ryusuke Konishi
10ff885ba6 nilfs2: get rid of nilfs_direct uses
This replaces all uses of nilfs_direct struct in implementation of
direct mapping with nilfs_bmap struct.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:13 +09:00
Ryusuke Konishi
583ada4761 nilfs2: remove constant qualifier from argument of bmap propagate
The first argument of bops->bop_propagate operation takes a constant
qualifier, and causes compilation error when removed cast to pointer
of nilfs_btree structure type.  This fixes the issue to prepare for
succesive removal of nilfs_btree struct.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:13 +09:00
Ryusuke Konishi
25b8d7ded0 nilfs2: get rid of private conversion macros on bmap key and pointer
Will remove nilfs_bmap_key_to_dkey(), nilfs_bmap_dkey_to_key(),
nilfs_bmap_ptr_to_dptr(), and nilfs_bmap_dptr_to_ptr() for simplicity.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
2010-07-23 10:02:13 +09:00