116 Commits (6841ebee6b02abe178abd30f40806e385cd96777)

Author SHA1 Message Date
Al Viro 6131ffaa1f more file_inode() open-coded instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-27 16:59:05 -05:00
Eric Wong 6a4e922c3d fuse: avoid out-of-scope stack access
All pointers within fuse_req must point to valid memory once
fuse_force_forget() returns.

This bug appeared in "fuse: implement NFS-like readdirplus support"
and was never in any official Linux release.

I tested the fuse_force_forget() code path by injecting a fake -ENOMEM and
verified the FORGET operation was called properly in userspace.

Signed-off-by: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-02-04 15:22:23 +01:00
Maxim Patlasov 85f40aec88 fuse: use req->page_descs[] for argpages cases
Previously, anyone who set flag 'argpages' only filled req->pages[] and set
per-request page_offset. This patch re-works all cases where argpages=1 to
fill req->page_descs[] properly.

Having req->page_descs[] filled properly allows fuse_copy_pages() to be
reworked to copy page fragments described by req->page_descs[]. This will be
useful for the next patches optimizing direct_IO.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:27 +01:00
Maxim Patlasov b2430d7567 fuse: add per-page descriptor <offset, length> to fuse_req
The ability to save page pointers along with lengths and offsets in fuse_req
will be useful to cover several iovec-s with a single fuse_req.

Per-request page_offset is removed because anybody who needs it can use
req->page_descs[0].offset instead.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:27 +01:00
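
To illustrate the per-page <offset, length> descriptors introduced in the commit above (b2430d7567), here is a minimal userspace sketch that splits a byte range into one descriptor per page. The names page_desc, fill_page_descs and SKETCH_PAGE_SIZE are hypothetical, the 4096-byte page size is an assumption, and this is not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size, for illustration only */

/* Hypothetical userspace mirror of a per-page <offset, length> descriptor. */
struct page_desc {
    unsigned int offset;  /* start of the fragment within its page */
    unsigned int length;  /* number of bytes used in that page */
};

/*
 * Describe `len` bytes that begin `offset` bytes into the first page.
 * Fills at most `max` descriptors and returns how many were filled.
 */
size_t fill_page_descs(struct page_desc *descs, size_t max,
                       unsigned int offset, size_t len)
{
    size_t n = 0;

    assert(offset < SKETCH_PAGE_SIZE);
    while (len > 0 && n < max) {
        size_t room = SKETCH_PAGE_SIZE - offset;
        size_t chunk = len < room ? len : room;

        descs[n].offset = offset;
        descs[n].length = (unsigned int)chunk;
        len -= chunk;
        offset = 0;  /* only the first page can start mid-page */
        n++;
    }
    return n;
}
```

With these assumptions, 5000 bytes starting at offset 100 are described by two entries: 3996 bytes at offset 100, then 1004 bytes at offset 0.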
Maxim Patlasov 4d53dc99ba fuse: rework fuse_retrieve()
The patch reworks fuse_retrieve() to allocate only so many page pointers
as needed. The core part of the patch is the following calculation:

	num_pages = (num + offset + PAGE_SIZE - 1) >> PAGE_SHIFT;

(thanks to Miklos for the formula). All other changes are mostly shuffling lines.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:26 +01:00
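
The page-count formula quoted in the commit above (4d53dc99ba) can be tried out in userspace. The sketch below assumes 4 KiB pages; pages_needed and the SKETCH_ constants are names made up for this example.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12                        /* assumed: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1ul << SKETCH_PAGE_SHIFT)

/* Pages touched by `num` bytes starting `offset` bytes into the first page. */
size_t pages_needed(size_t num, size_t offset)
{
    assert(offset < SKETCH_PAGE_SIZE);
    /* Round (num + offset) up to a whole number of pages. */
    return (num + offset + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}
```

Adding the offset before rounding up is what makes a page-sized read that starts one byte into a page count as two pages rather than one.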
Maxim Patlasov b111c8c0e3 fuse: categorize fuse_get_req()
The patch sorts all fuse_get_req() invocations into two categories:
 - fuse_get_req_nopages(fc) - when the caller doesn't care about req->pages
 - fuse_get_req(fc, n) - when the caller needs n page pointers (n > 0)

Adding fuse_get_req_nopages() avoids numerous fuse_get_req(fc, 0) calls
scattered over the code. Now it's clear at first glance when a caller needs a
fuse_req with page pointers.

The patch doesn't make any logic changes. In the multi-page case, it naively
allocates an array of FUSE_MAX_PAGES_PER_REQ page pointers. This will be
amended by future patches.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:25 +01:00
Maxim Patlasov 4250c0668e fuse: general infrastructure for pages[] of variable size
The patch removes the inline array of FUSE_MAX_PAGES_PER_REQ page pointers
from fuse_req. Instead, req->pages may now point either to a small inline
array or to an array allocated dynamically.

This essentially means that all callers of fuse_request_alloc[_nofs] should
pass the number of pages needed explicitly.

The patch doesn't make any logic changes.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:25 +01:00
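
The "small inline array or dynamically allocated array" layout described in the commit above (4250c0668e) can be sketched in userspace as follows. The struct and function names are hypothetical, the inline capacity of one pointer is an assumption, and calloc()/free() stand in for the kernel allocators.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_INLINE_PAGES 1  /* assumed capacity of the small inline array */

/* Hypothetical stand-in for a request that owns `max_pages` page pointers. */
struct sketch_req {
    void **pages;                            /* inline_pages or a heap array */
    unsigned int max_pages;
    void *inline_pages[SKETCH_INLINE_PAGES];
};

/* The caller states explicitly how many page pointers it needs. */
struct sketch_req *sketch_req_alloc(unsigned int npages)
{
    struct sketch_req *req = calloc(1, sizeof(*req));

    if (!req)
        return NULL;
    if (npages <= SKETCH_INLINE_PAGES) {
        req->pages = req->inline_pages;      /* no second allocation */
    } else {
        req->pages = calloc(npages, sizeof(*req->pages));
        if (!req->pages) {
            free(req);
            return NULL;
        }
    }
    req->max_pages = npages;
    return req;
}

void sketch_req_free(struct sketch_req *req)
{
    if (!req)
        return;
    assert(req->pages != NULL);
    if (req->pages != req->inline_pages)     /* free only the heap array */
        free(req->pages);
    free(req);
}
```

The point of the design is that small requests pay for a single allocation, while large ones get exactly as many pointers as they ask for.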
Anand V. Avati 0b05b18381 fuse: implement NFS-like readdirplus support
This patch implements readdirplus support in FUSE, similar to NFS.
The payload returned in the readdirplus call contains
'fuse_entry_out' structure thereby providing all the necessary inputs
for 'faking' a lookup() operation on the spot.

If the dentry and inode already existed (e.g. in a re-run of ls -l)
then just the inode attribute timeout and dentry timeout are refreshed.

With a simple client->network->server implementation of a FUSE based
filesystem, the following performance observations were made:

Test: Performing a filesystem crawl over 20,000 files with

sh# time ls -lR /mnt

Without readdirplus:
Run 1: 18.1s
Run 2: 16.0s
Run 3: 16.2s

With readdirplus:
Run 1: 4.1s
Run 2: 3.8s
Run 3: 3.8s

The performance improvement is significant as it avoided 20,000 upcalls
(lookup). Cache consistency is no worse than it already is.

Signed-off-by: Anand V. Avati <avati@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24 16:21:25 +01:00
Wei Yongjun 8f706111a8 fuse: remove unused variable in fuse_try_move_page()
The variables mapping and index are initialized but never used
otherwise, so remove them.

dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-17 13:09:59 +01:00
Eric W. Biederman 499dcf2024 userns: Support fuse interacting with multiple user namespaces
Use kuid_t and kgid_t in struct fuse_conn and struct fuse_mount_data.

The connection between a fuse filesystem and a fuse daemon is
established when a fuse filesystem is mounted and provided with a file
descriptor the fuse daemon created by opening /dev/fuse.

For now restrict the communication of uids and gids between the fuse
filesystem and the fuse daemon to the initial user namespace.  Enforce
this by verifying the file descriptor passed to the mount of fuse was
opened in the initial user namespace.  Ensuring the mount happens in
the initial user namespace is not necessary as mounts from non-initial
user namespaces are not yet allowed.

In fuse_req_init_context convert the current fsuid and fsgid into the
initial user namespace for the request that will be sent to the fuse
daemon.

In fuse_fill_attr convert the uid and gid passed from the fuse daemon
from the initial user namespace into kuids and kgids.

In iattr_to_fattr called from fuse_setattr convert kuids and kgids
into the uids and gids in the initial user namespace before passing
them to the fuse filesystem.

In fuse_change_attributes_common called from fuse_dentry_revalidate,
fuse_permission, fuse_getattr, and fuse_setattr, and fuse_iget convert
the uid and gid from the fuse daemon into a kuid and a kgid to store
on the fuse inode.

By default fuse mounts are restricted to tasks whose uid, suid, and
euid match the fuse user_id and whose gid, sgid, and egid match
the fuse group id.  Convert the user_id and group_id mount options
into kuids and kgids at mount time, and use uid_eq and gid_eq to
compare them in fuse_allow_task.

Cc: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-14 22:05:33 -08:00
Al Viro cb0942b812 make get_file() return its argument
simplifies a bunch of callers...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:10:25 -04:00
Miklos Szeredi c9e67d4837 fuse: fix retrieve length
In some cases fuse_retrieve() would return a short byte count if offset was
non-zero.  The data returned was correct, though.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
2012-09-04 18:45:54 +02:00
Cong Wang 2408f6ef6b fuse: remove the second argument of k[un]map_atomic()
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-20 21:48:22 +08:00
John Muir 451d0f5999 FUSE: Notifying the kernel of deletion.
Allows a FUSE file-system to tell the kernel when a file or directory is
deleted. If the specified dentry has the specified inode number, the kernel will
unhash it.

The current 'fuse_notify_inval_entry' does not cause the kernel to clean up
directories that are in use properly, and as a result the users of those
directories see incorrect semantics from the file-system. The error condition
seen when 'fuse_notify_inval_entry' is used to notify of a deleted directory is
avoided when 'fuse_notify_delete' is used instead.

The following scenario demonstrates the difference:
1. User A chdirs into 'testdir' and starts reading 'testfile'.
2. User B rm -rf 'testdir'.
3. User B creates 'testdir'.
4. User C chdirs into 'testdir'.

If you run the above within the same machine on any file-system (including fuse
file-systems), there is no problem: user C is able to chdir into the new
testdir. The old testdir is removed from the dentry tree, but still open by user
A.

If operations 2 and 3 are performed via the network such that the fuse
file-system uses one of the notify functions to tell the kernel that the nodes
are gone, then the following error occurs for user C while user A holds the
original directory open:

muirj@empacher:~> ls /test/testdir
ls: cannot access /test/testdir: No such file or directory

The issue here is that the kernel still has a dentry for testdir, and so it is
requesting the attributes for the old directory, while the file-system is
responding that the directory no longer exists.

If, on the other hand, the file-system can notify the kernel that the
directory is deleted using the new 'fuse_notify_delete' function, then the above
ls will find the new directory as expected.

Signed-off-by: John Muir <john@jmuir.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2011-12-13 11:58:49 +01:00
Miklos Szeredi 48706d0a91 fuse: fix fuse_retrieve
Fix two bugs in fuse_retrieve():

 - retrieving more than one page would yield repeated instances of the
   first page

 - if more than FUSE_MAX_PAGES_PER_REQ pages were requested then the
   request page array would overflow

fuse_retrieve() was added in 2.6.36 and these bugs had been there since the
beginning.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@vger.kernel.org
2011-12-13 10:36:59 +01:00
Miklos Szeredi 5dfcc87fd7 fuse: fix memory leak
kmemleak is reporting that 32 bytes are being leaked by FUSE:

  unreferenced object 0xe373b270 (size 32):
  comm "fusermount", pid 1207, jiffies 4294707026 (age 2675.187s)
  hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<b05517d7>] kmemleak_alloc+0x27/0x50
    [<b0196435>] kmem_cache_alloc+0xc5/0x180
    [<b02455be>] fuse_alloc_forget+0x1e/0x20
    [<b0245670>] fuse_alloc_inode+0xb0/0xd0
    [<b01b1a8c>] alloc_inode+0x1c/0x80
    [<b01b290f>] iget5_locked+0x8f/0x1a0
    [<b0246022>] fuse_iget+0x72/0x1a0
    [<b02461da>] fuse_get_root_inode+0x8a/0x90
    [<b02465cf>] fuse_fill_super+0x3ef/0x590
    [<b019e56f>] mount_nodev+0x3f/0x90
    [<b0244e95>] fuse_mount+0x15/0x20
    [<b019d1bc>] mount_fs+0x1c/0xc0
    [<b01b5811>] vfs_kern_mount+0x41/0x90
    [<b01b5af9>] do_kern_mount+0x39/0xd0
    [<b01b7585>] do_mount+0x2e5/0x660
    [<b01b7966>] sys_mount+0x66/0xa0

This leak report is consistent and happens once per boot on
3.1.0-rc5-dirty.

This happens if a FORGET request is queued after the fuse device was
released.

Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-12 11:47:10 -07:00
Miklos Szeredi c2183d1e9b fuse: check size of FUSE_NOTIFY_INVAL_ENTRY message
FUSE_NOTIFY_INVAL_ENTRY didn't check the length of the write so the
message processing could overrun and result in a "kernel BUG at
fs/fuse/dev.c:629!"

Reported-by: Han-Wen Nienhuys <hanwenn@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
2011-08-24 10:20:17 +02:00
Miklos Szeredi ef6a3c6311 mm: add replace_page_cache_page() function
This function basically does:

     remove_from_page_cache(old);
     page_cache_release(old);
     add_to_page_cache_locked(new);

Except it does this atomically, so there's no possibility for the "add" to
fail because of a race.

If memory cgroups are enabled, then the memory cgroup charge is also moved
from the old page to the new.

This function is currently used by fuse to move pages into the page cache
on read, instead of copying the page contents.

[minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22 17:44:02 -07:00
Bryan Green 357ccf2b69 fuse: wakeup pollers on connection release/abort
If a fuse dev connection is broken, wake up any
processes that are blocking, in a poll system call,
on one of the files in the now defunct filesystem.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2011-03-21 13:58:05 +01:00
Miklos Szeredi 02c048b919 fuse: allow batching of FORGET requests
Terje Malmedal reports that a fuse filesystem with 32 million inodes
on a machine with lots of memory can take up to 30 minutes to process
FORGET requests when all those inodes are evicted from the icache.

To solve this, create a BATCH_FORGET request that allows up to about
8000 FORGET requests to be sent in a single message.

This request is only sent if userspace supports interface version 7.16
or later, otherwise fall back to sending individual FORGET messages.

Reported-by: Terje Malmedal <terje.malmedal@usit.uio.no>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-12-07 20:16:56 +01:00
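
The "up to about 8000" figure in the commit above (02c048b919) follows from how many fixed-size forget entries fit into one message. The sketch below is a back-of-the-envelope check: the struct layouts are local mirrors written for this example and should be treated as assumptions, as should the read buffer size used in the note that follows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the wire structures; layouts are assumptions here. */
struct sketch_in_header {
    uint32_t len, opcode;
    uint64_t unique, nodeid;
    uint32_t uid, gid, pid, padding;
};                                    /* 40 bytes */

struct sketch_batch_forget_in {
    uint32_t count, dummy;
};                                    /* 8 bytes */

struct sketch_forget_one {
    uint64_t nodeid, nlookup;
};                                    /* 16 bytes */

/* How many forget entries fit in one message read into `bufsize` bytes. */
size_t max_forgets(size_t bufsize)
{
    size_t fixed = sizeof(struct sketch_in_header) +
                   sizeof(struct sketch_batch_forget_in);

    assert(sizeof(struct sketch_forget_one) == 16);  /* layout sanity check */
    if (bufsize < fixed)
        return 0;
    return (bufsize - fixed) / sizeof(struct sketch_forget_one);
}
```

Assuming a userspace read buffer of 128 KiB plus one 4 KiB page, max_forgets() yields 8445 entries, which is in line with the "about 8000" quoted in the message.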
Miklos Szeredi 07e77dca8a fuse: separate queue for FORGET requests
Terje Malmedal reports that a fuse filesystem with 32 million inodes
on a machine with lots of memory can go unresponsive for up to 30
minutes when all those inodes are evicted from the icache.

The reason is that FORGET messages, sent when the inode is evicted,
are queued up together with regular filesystem requests, and while the
huge queue of FORGET messages is processed no other filesystem
operation can proceed.

Since a full fuse request structure is allocated for each inode, these
take up quite a bit of memory as well.

To solve these issues, create a slim 'fuse_forget_link' structure
containing just the minimum of information required to send the FORGET
request and chain these on a separate queue.

When userspace is asking for a request make sure that FORGET and
non-FORGET requests are selected fairly: for every 8 non-FORGET requests
allow 16 FORGET requests.  This will make sure FORGETs do not pile up, yet
other requests are also allowed to proceed while the queued FORGETs
are processed.

Reported-by: Terje Malmedal <terje.malmedal@usit.uio.no>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-12-07 20:16:56 +01:00
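
The 8:16 fairness rule described in the commit above (07e77dca8a) can be modelled with a single signed counter. This is a userspace sketch of one way to get that ratio, not a copy of the kernel code; sketch_sched and pick_next are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scheduler state: budget of FORGETs left in the current burst. */
struct sketch_sched {
    int forget_batch;
};

/*
 * Decide which queue to serve next.
 * Returns 'F' for a FORGET, 'R' for a regular request, 0 if both are empty.
 */
char pick_next(struct sketch_sched *s, int have_forget, int have_regular)
{
    assert(s != NULL);
    if (have_forget) {
        /* Serve a FORGET if nothing else waits or the burst budget remains. */
        if (!have_regular || s->forget_batch-- > 0)
            return 'F';
        /* After 8 regular requests in a row, grant a burst of 16 FORGETs. */
        if (s->forget_batch <= -8)
            s->forget_batch = 16;
    }
    return have_regular ? 'R' : 0;
}
```

When both queues stay non-empty, this alternates between 8 regular requests and 16 FORGETs, so neither kind can starve the other.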
Miklos Szeredi 0be8557bcd fuse: use release_pages()
Replace iterated page_cache_release() with release_pages(), which is
faster and shorter.

Needs release_pages() to be exported to modules.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27 18:03:17 -07:00
Miklos Szeredi b6777c40c7 fuse: use clear_highpage() and KM_USER0 instead of KM_USER1
Commit 7909b1c640 ("fuse: don't use atomic kmap") removed KM_USER0 usage
from fuse/dev.c.  Switch KM_USER1 uses to KM_USER0 for clarity.  Also
replace open coded clear_highpage().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:13 -07:00
Jan Beulich 3ecb01df32 use clear_page()/copy_page() in favor of memset()/memcpy() on whole pages
After all that's what they are intended for.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-26 16:52:13 -07:00
Geert Uytterhoeven 0157443c56 fuse: Initialize total_len in fuse_retrieve()
fs/fuse/dev.c:1357: warning: ‘total_len’ may be used uninitialized in this
function

Initialize total_len to zero, else its value will be undefined.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-10-04 10:45:32 +02:00
Miklos Szeredi b9ca67b2dd fuse: fix lock annotations
Sparse doesn't understand lock annotations of the form
__releases(&foo->lock).  Change them to __releases(foo->lock).  Same
for __acquires().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-09-07 13:42:41 +02:00
Miklos Szeredi 595afaf9e6 fuse: flush background queue on connection close
David Bartley reported that fuse can hang in fuse_get_req_nofail() when
the connection to the filesystem server is no longer active.

If bg_queue is not empty then flush_bg_queue() called from
request_end() can put more requests on to the pending queue.  If this
happens while ending requests on the processing queue then those
background requests will be queued to the pending list and never
ended.

Another problem is that fuse_dev_release() didn't wake up processes
sleeping on blocked_waitq.

Solve this by:

 a) flushing the background queue before calling end_requests() on the
    pending and processing queues

 b) setting blocked = 0 and waking up processes waiting on
    blocked_waitq()

Thanks to David for an excellent bug report.

Reported-by: David Bartley <andareed@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
2010-09-07 13:42:41 +02:00
Miklos Szeredi 2d45ba381a fuse: add retrieve request
Userspace filesystem can request data to be retrieved from the inode's
mapping.  This request is synchronous and the retrieved data is queued
as a new request.  If the write to the fuse device returns an error
then the retrieve request was not completed and a reply will not be
sent.

Only present pages are returned in the retrieve reply.  Retrieving
stops when it finds a non-present page and only data prior to that is
returned.

This request doesn't change the dirty state of pages.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-07-12 14:41:40 +02:00
Miklos Szeredi a1d75f2582 fuse: add store request
Userspace filesystem can request data to be stored in the inode's
mapping.  This request is synchronous and has no reply.  If the write
to the fuse device returns an error then the store request was not
fully completed (but may have updated some pages).

If the stored data overflows the current file size, then the size is
extended, similarly to a write(2) on the filesystem.

Pages which have been completely stored are marked uptodate.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-07-12 14:41:40 +02:00
Miklos Szeredi 7909b1c640 fuse: don't use atomic kmap
Don't use atomic kmap for mapping userspace buffers in device
read/write/splice.

This is necessary because the next patch (adding store notify)
requires that caller of fuse_copy_page() may sleep between
invocations.  The simplest way to ensure this is to change the atomic
kmaps to non-atomic ones.

Thankfully architectures where kmap() is not a no-op are going out of
fashion, so we can ignore the (probably negligible) performance impact
of this change.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-07-12 14:41:40 +02:00
Linus Torvalds 003386fff3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  mm: export generic_pipe_buf_*() to modules
  fuse: support splice() reading from fuse device
  fuse: allow splice to move pages
  mm: export remove_from_page_cache() to modules
  mm: export lru_cache_add_*() to modules
  fuse: support splice() writing to fuse device
  fuse: get page reference for readpages
  fuse: use get_user_pages_fast()
  fuse: remove unneeded variable
2010-05-30 09:16:14 -07:00
Kay Sievers 578454ff7e driver core: add devname module aliases to allow module on-demand auto-loading
This adds:
  alias: devname:<name>
to some common kernel modules, which will allow the on-demand loading
of the kernel module when the device node is accessed.

Ideally all these modules would be compiled-in, but distros seem so
much in love with their modularization that we need to cover the common
cases with this new facility. It will allow us to remove a bunch of pretty
useless init scripts and modprobes from init scripts.

The static device node aliases will be carried in the module itself. The
program depmod will extract this information to a file in the module directory:
  $ cat /lib/modules/2.6.34-00650-g537b60d-dirty/modules.devname
  # Device nodes to trigger on-demand module loading.
  microcode cpu/microcode c10:184
  fuse fuse c10:229
  ppp_generic ppp c108:0
  tun net/tun c10:200
  dm_mod mapper/control c10:235

Udev will pick up the depmod created file on startup and create all the
static device nodes which the kernel modules specify, so that these modules
get automatically loaded when the device node is accessed:
  $ /sbin/udevd --debug
  ...
  static_dev_create_from_modules: mknod '/dev/cpu/microcode' c10:184
  static_dev_create_from_modules: mknod '/dev/fuse' c10:229
  static_dev_create_from_modules: mknod '/dev/ppp' c108:0
  static_dev_create_from_modules: mknod '/dev/net/tun' c10:200
  static_dev_create_from_modules: mknod '/dev/mapper/control' c10:235
  udev_rules_apply_static_dev_perms: chmod '/dev/net/tun' 0666
  udev_rules_apply_static_dev_perms: chmod '/dev/fuse' 0666

A few device nodes are switched to statically allocated numbers, to allow
the static nodes to work. This might also be useful for systems which still run
a plain static /dev, which is completely unsafe to use with any dynamic minor
numbers.

Note:
The devname aliases must be limited to the *common* and *single*instance*
device nodes, like the misc devices, and never be used for conceptually limited
systems like the loop devices, which should rather get fixed properly and get a
control node for losetup to talk to, instead of creating a random number of
device nodes in advance, regardless if they are ever used.

This facility is to hide the mess distros are creating with too modularized
kernels, and just to hide that these modules are not compiled-in, and not to
paper-over broken concepts. Thanks! :)

Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-Off-By: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-25 15:08:26 -07:00
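
The modules.devname lines shown in the commit above (578454ff7e) have a simple "module node typeMAJOR:MINOR" shape. The following sketch parses one such line in userspace; the struct, its field names and parse_devname_line are hypothetical, and accepting 'b' (block) alongside 'c' is an assumption, since the sample output only shows character devices.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed entry; field names are hypothetical. */
struct devname_entry {
    char module[64];
    char node[64];            /* path relative to /dev */
    char type;                /* 'c' (character) or 'b' (block, assumed) */
    unsigned int dev_major;
    unsigned int dev_minor;
};

/* Returns 1 on success, 0 for comments, blank lines or malformed input. */
int parse_devname_line(const char *line, struct devname_entry *out)
{
    assert(line != NULL && out != NULL);
    while (*line == ' ' || *line == '\t')
        line++;
    if (*line == '#' || *line == '\n' || *line == '\0')
        return 0;
    /* Example input: "fuse fuse c10:229" */
    if (sscanf(line, "%63s %63s %c%u:%u", out->module, out->node,
               &out->type, &out->dev_major, &out->dev_minor) != 5)
        return 0;
    return out->type == 'c' || out->type == 'b';
}
```

A consumer such as the udev behaviour described in the message would then create /dev/<node> with the parsed type and numbers; that step is left out here.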
Miklos Szeredi c3021629a0 fuse: support splice() reading from fuse device
Allow userspace filesystem implementation to use splice() to read from
the fuse device.

The userspace filesystem can now transfer data coming from a WRITE
request to an arbitrary file descriptor (regular file, block device or
socket) without having to go through a userspace buffer.

The semantics of using splice() to read messages are:

 1)  with a single splice() call move the whole message from the fuse
     device to a temporary pipe
 2)  read the header from the pipe and determine the message type
 3a) if message is a WRITE then splice data from pipe to destination
 3b) else read rest of message to userspace buffer

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-05-25 15:06:07 +02:00
Miklos Szeredi ce534fb052 fuse: allow splice to move pages
When splicing buffers to the fuse device with SPLICE_F_MOVE, try to
move pages from the pipe buffer into the page cache.  This allows
populating the fuse filesystem's cache without ever touching the page
contents, i.e. zero copy read capability.

The following steps are performed when trying to move a page into the
page cache:

 - buf->ops->confirm() to make sure the new page is uptodate
 - buf->ops->steal() to try to remove the new page from its previous place
 - remove_from_page_cache() on the old page
 - add_to_page_cache_locked() on the new page

If any of the above steps fail (non fatally) then the code falls back
to copying the page.  In particular ->steal() will fail if there are
external references (other than the page cache and the pipe buffer) to
the page.

Also since the remove_from_page_cache() + add_to_page_cache_locked()
are non-atomic it is possible that the page cache is repopulated in
between the two and add_to_page_cache_locked() will fail.  This could
be fixed by creating a new atomic replace_page_cache_page() function.

fuse_readpages_end() needed to be reworked so it works even if
page->mapping is NULL for some or all pages which can happen if the
add_to_page_cache_locked() failed.

A number of sanity checks were added to make sure the stolen pages
don't have weird flags set, etc...  These could be moved into generic
splice/steal code.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-05-25 15:06:07 +02:00
Miklos Szeredi dd3bb14f44 fuse: support splice() writing to fuse device
Allow userspace filesystem implementation to use splice() to write to
the fuse device.  The semantics of using splice() are:

 1) buffer the message header and data in a temporary pipe
 2) with a *single* splice() call move the message from the temporary pipe
    to the fuse device

The READ reply message has the most interesting use for this, since
now the data from an arbitrary file descriptor (which could be a
regular file, a block device or a socket) can be transferred into the
fuse device without having to go through a userspace buffer.  It will
also allow zero copy moving of pages.

One caveat is that the protocol on the fuse device requires the length
of the whole message to be written into the header.  But the length of
the data transferred into the temporary pipe may not be known in
advance.  The current library implementation works around this by
using vmsplice to write the header and modifying the header after
splicing the data into the pipe (error handling omitted):

	struct fuse_out_header out;

	iov.iov_base = &out;
	iov.iov_len = sizeof(struct fuse_out_header);
	vmsplice(pip[1], &iov, 1, 0);
	len = splice(input_fd, input_offset, pip[1], NULL, len, 0);
	/* retrospectively modify the header: */
	out.len = len + sizeof(struct fuse_out_header);
	splice(pip[0], NULL, fuse_chan_fd(req->ch), NULL, out.len, flags);

This works since vmsplice only saves a pointer to the data, it does
not copy the data itself.

Since pipes are currently limited to 16 pages and messages need to be
spliced atomically, the length of the data is limited to 15 pages (or
60kB for 4k pages).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-05-25 15:06:06 +02:00
Miklos Szeredi 1bf94ca73e fuse: use get_user_pages_fast()
Replace uses of get_user_pages() with get_user_pages_fast().  It looks
nicer and should be faster in most cases.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-05-25 15:06:06 +02:00
Fang Wenqi b2d82ee3c8 fuse: fix large stack use
gcc 4.4 warns about:
  fs/fuse/dev.c: In function ‘fuse_notify_inval_entry’:
  fs/fuse/dev.c:925: warning: the frame size of 1060 bytes is larger than 1024 bytes

The problem is that we declare two structures and a large array on the stack.
Move the array away from the stack and allocate memory for it dynamically.

Signed-off-by: Fang Wenqi <antonf@turbolinux.com.cn>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-02-05 12:08:31 +01:00
Miklos Szeredi b21dda438b fuse: cleanup in fuse_notify_inval_...()
Small cleanup in fuse_notify_inval_inode() and
fuse_notify_inval_entry().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2010-02-05 12:08:31 +01:00
Linus Torvalds 9eead2a811 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: add fusectl interface to max_background
  fuse: limit user-specified values of max background requests
  fuse: use drop_nlink() instead of direct nlink manipulation
  fuse: document protocol version negotiation
  fuse: make the number of max background requests and congestion threshold tunable
2009-09-18 09:23:03 -07:00
Linus Torvalds 81e4e1ba7e Revert "fuse: Fix build error" as unnecessary
This reverts commit 097041e576.

Trond had a better fix, which is the parent of this one ("Fix compile
error due to congestion_wait() changes")

Requested-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-11 11:22:34 -07:00
Larry Finger 097041e576 fuse: Fix build error
When building v2.6.31-rc2-344-g69ca06c, the following build errors are
found due to missing includes:

 CC [M]  fs/fuse/dev.o
fs/fuse/dev.c: In function ‘request_end’:
fs/fuse/dev.c:289: error: ‘BLK_RW_SYNC’ undeclared (first use in this function)
...
fs/nfs/write.c: In function ‘nfs_set_page_writeback’:
fs/nfs/write.c:207: error: ‘BLK_RW_ASYNC’ undeclared (first use in this function)

Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-07-10 19:09:46 -07:00
Jens Axboe 8aa7e847d8 Fix congestion_wait() sync/async vs read/write confusion
Commit 1faa16d228 accidentally broke
the bdi congestion wait queue logic, causing us to wait on congestion
for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-07-10 20:31:53 +02:00
Csaba Henk 7a6d3c8b30 fuse: make the number of max background requests and congestion threshold tunable
The practical values for these limits depend on the design of the
filesystem server so let userspace set them at initialization time.

Signed-off-by: Csaba Henk <csaba@gluster.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-07-07 17:28:52 +02:00
John Muir 3b463ae0c6 fuse: invalidation reverse calls
Add notification messages that allow the filesystem to invalidate VFS
caches.

Two notifications are added:

 1) inode invalidation

   - invalidate cached attributes
   - invalidate a range of pages in the page cache (this is optional)

 2) dentry invalidation

   - try to invalidate a subtree in the dentry cache

Care must be taken while accessing the 'struct super_block' for the
mount, as it can go away while an invalidation is in progress.  To
prevent this, introduce a rw-semaphore, that is taken for read during
the invalidation and taken for write in the ->kill_sb callback.

Cc: Csaba Henk <csaba@gluster.com>
Cc: Anand Avati <avati@zresearch.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-06-30 20:12:24 +02:00
Csaba Henk b4c458b3a2 fuse: fix return value of fuse_dev_write()
On 64 bit systems -- where sizeof(ssize_t) > sizeof(int) -- the following test
exposes a bug due to a non-careful return of an int or unsigned value:

implement a FUSE filesystem which sends an unsolicited notification to
the kernel with invalid opcode. The respective write to /dev/fuse
will return (1 << 32) - EINVAL with errno == 0 instead of -1 with
errno == EINVAL.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
2009-06-30 20:06:23 +02:00
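
The odd return value described in the commit above (b4c458b3a2) is what happens when a negative error code passes through an unsigned 32-bit value before being widened to 64 bits. The sketch below reproduces the arithmetic in userspace; it uses int64_t in place of ssize_t so the result does not depend on the host, and the function names are made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Error code stored in an unsigned 32-bit value, then widened: zero-extended. */
int64_t widen_via_unsigned(int err)
{
    unsigned int u = (unsigned int)err;

    assert(sizeof(u) == 4);  /* the sketch assumes a 32-bit unsigned int */
    return (int64_t)u;       /* high 32 bits become 0, the sign is lost */
}

/* Error code kept signed all the way: sign-extended, stays negative. */
int64_t widen_via_signed(int err)
{
    return (int64_t)err;
}
```

widen_via_unsigned(-EINVAL) yields (1 << 32) - EINVAL, a large positive number that a caller would mistake for a byte count, whereas widen_via_signed(-EINVAL) stays at -EINVAL.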
Tejun Heo 08cbf542bf fuse: export symbols to be used by CUSE
Export the following symbols for CUSE.

fuse_conn_put()
fuse_conn_get()
fuse_conn_kill()
fuse_send_init()
fuse_do_open()
fuse_sync_release()
fuse_direct_io()
fuse_do_ioctl()
fuse_file_poll()
fuse_request_alloc()
fuse_get_req()
fuse_put_request()
fuse_request_send()
fuse_abort_conn()
fuse_dev_release()
fuse_dev_operations

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:42 +02:00
Tejun Heo a325f9b922 fuse: update fuse_conn_init() and separate out fuse_conn_kill()
Update fuse_conn_init() such that it doesn't take @sb and move bdi
registration into a separate function.  Also separate out
fuse_conn_kill() from fuse_put_super().

These will be used to implement cuse.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-04-28 16:56:41 +02:00
Miklos Szeredi f6d47a1761 fuse: fix poll notify
Move fuse_copy_finish() to before calling fuse_notify_poll_wakeup().
This is not a big issue because fuse_notify_poll_wakeup() should be
atomic, but it's cleaner this way, and later uses of notification will
need to be able to finish the copying before performing some actions.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2009-01-26 15:00:59 +01:00
Miklos Szeredi 26c3679101 fuse: destroy bdi on umount
If a fuse filesystem is unmounted but the device file descriptor
remains open and a new mount reuses the old device number, then the
mount fails with EEXIST and the following warning is printed in the
kernel log:

  WARNING: at fs/sysfs/dir.c:462 sysfs_add_one+0x35/0x3d()
  sysfs: duplicate filename '0:15' can not be created

The cause is that the bdi belonging to the fuse filesystem was
destroyed only after the device file was released.  Fix this by
calling bdi_destroy() from fuse_put_super() instead.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
2009-01-26 15:00:59 +01:00
Linus Torvalds 5fec8bdbf9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: clean up annotations of fc->lock
  fuse: fix sparse warning in ioctl
  fuse: update interface version
  fuse: add fuse_conn->release()
  fuse: separate out fuse_conn_init() from new_conn()
  fuse: add fuse_ prefix to several functions
  fuse: implement poll support
  fuse: implement unsolicited notification
  fuse: add file kernel handle
  fuse: implement ioctl support
  fuse: don't let fuse_req->end() put the base reference
  fuse: move FUSE_MINOR to miscdevice.h
  fuse: style fixes
2009-01-06 17:01:20 -08:00