linux

Commit Graph

Author	SHA1	Message	Date
Eric Sandeen	982363c97f	ecryptfs: propagate key errors up at mount time Mounting with invalid key signatures should probably fail, if they were specifically requested but not available. Also fix case checks in process_request_key_err() for the right sign of the errnos, as spotted by Jan Tluka. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jan Tluka <jtluka@redhat.com> Acked-by: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:31 -07:00
Tyler Hicks	6c4c17b073	ecryptfs: discard ecryptfsd registration messages in miscdev The userspace eCryptfs daemon sends HELO and QUIT messages to the kernel for per-user daemon (un)registration. These messages are required when netlink is used as the transport, but (un)registration is handled by opening and closing the device file when miscdev is the transport. These messages should be discarded in the miscdev transport so that a daemon isn't registered twice. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:31 -07:00
Michael Halcrow	746f1e558b	eCryptfs: Privileged kthread for lower file opens eCryptfs would really like to have read-write access to all files in the lower filesystem. Right now, the persistent lower file may be opened read-only if the attempt to open it read-write fails. One way to keep from having to do that is to have a privileged kthread that can open the lower persistent file on behalf of the user opening the eCryptfs file; this patch implements this functionality. This patch will properly allow a less-privileged user to open the eCryptfs file, followed by a more-privileged user opening the eCryptfs file, with the first user only being able to read and the second user being able to both read and write. eCryptfs currently does this wrong; it will wind up calling vfs_write() on a file that was opened read-only. This is fixed in this patch. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:30 -07:00
Ulrich Drepper	9fe5ad9c8c	flag parameters add-on: remove epoll_create size param Remove the size parameter from the new epoll_create syscall and renames the syscall itself. The updated test program follows. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <time.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_epoll_create2 # ifdef __x86_64__ # define __NR_epoll_create2 291 # elif defined __i386__ # define __NR_epoll_create2 329 # else # error "need __NR_epoll_create2" # endif #endif #define EPOLL_CLOEXEC O_CLOEXEC int main (void) { int fd = syscall (__NR_epoll_create2, 0); if (fd == -1) { puts ("epoll_create2(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("epoll_create2(0) set close-on-exec flag"); return 1; } close (fd); fd = syscall (__NR_epoll_create2, EPOLL_CLOEXEC); if (fd == -1) { puts ("epoll_create2(EPOLL_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("epoll_create2(EPOLL_CLOEXEC) set close-on-exec flag"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:29 -07:00
Ulrich Drepper	e38b36f325	flag parameters: check magic constants This patch adds test that ensure the boundary conditions for the various constants introduced in the previous patches is met. No code is generated. [akpm@linux-foundation.org: fix alpha] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:29 -07:00
Ulrich Drepper	510df2dd48	flag parameters: NONBLOCK in inotify_init This patch adds non-blocking support for inotify_init1. The additional changes needed are minimal. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_inotify_init1 # ifdef __x86_64__ # define __NR_inotify_init1 294 # elif defined __i386__ # define __NR_inotify_init1 332 # else # error "need __NR_inotify_init1" # endif #endif #define IN_NONBLOCK O_NONBLOCK int main (void) { int fd = syscall (__NR_inotify_init1, 0); if (fd == -1) { puts ("inotify_init1(0) failed"); return 1; } int fl = fcntl (fd, F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if (fl & O_NONBLOCK) { puts ("inotify_init1(0) set non-blocking mode"); return 1; } close (fd); fd = syscall (__NR_inotify_init1, IN_NONBLOCK); if (fd == -1) { puts ("inotify_init1(IN_NONBLOCK) failed"); return 1; } fl = fcntl (fd, F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if ((fl & O_NONBLOCK) == 0) { puts ("inotify_init1(IN_NONBLOCK) set non-blocking mode"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:29 -07:00
Ulrich Drepper	be61a86d72	flag parameters: NONBLOCK in pipe This patch adds O_NONBLOCK support to pipe2. It is minimally more involved than the patches for eventfd et.al but still trivial. The interfaces of the create_write_pipe and create_read_pipe helper functions were changed and the one other caller as well. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_pipe2 # ifdef __x86_64__ # define __NR_pipe2 293 # elif defined __i386__ # define __NR_pipe2 331 # else # error "need __NR_pipe2" # endif #endif int main (void) { int fds[2]; if (syscall (__NR_pipe2, fds, 0) == -1) { puts ("pipe2(0) failed"); return 1; } for (int i = 0; i < 2; ++i) { int fl = fcntl (fds[i], F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if (fl & O_NONBLOCK) { printf ("pipe2(0) set non-blocking mode for fds[%d]\n", i); return 1; } close (fds[i]); } if (syscall (__NR_pipe2, fds, O_NONBLOCK) == -1) { puts ("pipe2(O_NONBLOCK) failed"); return 1; } for (int i = 0; i < 2; ++i) { int fl = fcntl (fds[i], F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if ((fl & O_NONBLOCK) == 0) { printf ("pipe2(O_NONBLOCK) does not set non-blocking mode for fds[%d]\n", i); return 1; } close (fds[i]); } puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:29 -07:00
Ulrich Drepper	6b1ef0e60d	flag parameters: NONBLOCK in timerfd_create This patch adds support for the TFD_NONBLOCK flag to timerfd_create. The additional changes needed are minimal. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <time.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_timerfd_create # ifdef __x86_64__ # define __NR_timerfd_create 283 # elif defined __i386__ # define __NR_timerfd_create 322 # else # error "need __NR_timerfd_create" # endif #endif #define TFD_NONBLOCK O_NONBLOCK int main (void) { int fd = syscall (__NR_timerfd_create, CLOCK_REALTIME, 0); if (fd == -1) { puts ("timerfd_create(0) failed"); return 1; } int fl = fcntl (fd, F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if (fl & O_NONBLOCK) { puts ("timerfd_create(0) set non-blocking mode"); return 1; } close (fd); fd = syscall (__NR_timerfd_create, CLOCK_REALTIME, TFD_NONBLOCK); if (fd == -1) { puts ("timerfd_create(TFD_NONBLOCK) failed"); return 1; } fl = fcntl (fd, F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if ((fl & O_NONBLOCK) == 0) { puts ("timerfd_create(TFD_NONBLOCK) set non-blocking mode"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:29 -07:00
Ulrich Drepper	e7d476dfdf	flag parameters: NONBLOCK in eventfd This patch adds support for the EFD_NONBLOCK flag to eventfd2. The additional changes needed are minimal. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_eventfd2 # ifdef __x86_64__ # define __NR_eventfd2 290 # elif defined __i386__ # define __NR_eventfd2 328 # else # error "need __NR_eventfd2" # endif #endif #define EFD_NONBLOCK O_NONBLOCK int main (void) { int fd = syscall (__NR_eventfd2, 1, 0); if (fd == -1) { puts ("eventfd2(0) failed"); return 1; } int fl = fcntl (fd, F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if (fl & O_NONBLOCK) { puts ("eventfd2(0) sets non-blocking mode"); return 1; } close (fd); fd = syscall (__NR_eventfd2, 1, EFD_NONBLOCK); if (fd == -1) { puts ("eventfd2(EFD_NONBLOCK) failed"); return 1; } fl = fcntl (fd, F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if ((fl & O_NONBLOCK) == 0) { puts ("eventfd2(EFD_NONBLOCK) does not set non-blocking mode"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:29 -07:00
Ulrich Drepper	5fb5e04926	flag parameters: NONBLOCK in signalfd This patch adds support for the SFD_NONBLOCK flag to signalfd4. The additional changes needed are minimal. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <signal.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_signalfd4 # ifdef __x86_64__ # define __NR_signalfd4 289 # elif defined __i386__ # define __NR_signalfd4 327 # else # error "need __NR_signalfd4" # endif #endif #define SFD_NONBLOCK O_NONBLOCK int main (void) { sigset_t ss; sigemptyset (&ss); sigaddset (&ss, SIGUSR1); int fd = syscall (__NR_signalfd4, -1, &ss, 8, 0); if (fd == -1) { puts ("signalfd4(0) failed"); return 1; } int fl = fcntl (fd, F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if (fl & O_NONBLOCK) { puts ("signalfd4(0) set non-blocking mode"); return 1; } close (fd); fd = syscall (__NR_signalfd4, -1, &ss, 8, SFD_NONBLOCK); if (fd == -1) { puts ("signalfd4(SFD_NONBLOCK) failed"); return 1; } fl = fcntl (fd, F_GETFL); if (fl == -1) { puts ("fcntl failed"); return 1; } if ((fl & O_NONBLOCK) == 0) { puts ("signalfd4(SFD_NONBLOCK) does not set non-blocking mode"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:29 -07:00
Ulrich Drepper	99829b8329	flag parameters: NONBLOCK in anon_inode_getfd Building on the previous change to anon_inode_getfd, this patch introduces support for handling of O_NONBLOCK in addition to the already supported O_CLOEXEC. Following patches will take advantage of this support. As can be seen, the additional support for supporting this functionality is minimal. Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:28 -07:00
Ulrich Drepper	4006553b06	flag parameters: inotify_init This patch introduces the new syscall inotify_init1 (note: the 1 stands for the one parameter the syscall takes, as opposed to no parameter before). The values accepted for this parameter are function-specific and defined in the inotify.h header. Here the values must match the O_* flags, though. In this patch CLOEXEC support is introduced. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_inotify_init1 # ifdef __x86_64__ # define __NR_inotify_init1 294 # elif defined __i386__ # define __NR_inotify_init1 332 # else # error "need __NR_inotify_init1" # endif #endif #define IN_CLOEXEC O_CLOEXEC int main (void) { int fd; fd = syscall (__NR_inotify_init1, 0); if (fd == -1) { puts ("inotify_init1(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("inotify_init1(0) set close-on-exit"); return 1; } close (fd); fd = syscall (__NR_inotify_init1, IN_CLOEXEC); if (fd == -1) { puts ("inotify_init1(IN_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("inotify_init1(O_CLOEXEC) does not set close-on-exit"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [akpm@linux-foundation.org: add sys_ni stub] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:28 -07:00
Ulrich Drepper	ed8cae8ba0	flag parameters: pipe This patch introduces the new syscall pipe2 which is like pipe but it also takes an additional parameter which takes a flag value. This patch implements the handling of O_CLOEXEC for the flag. I did not add support for the new syscall for the architectures which have a special sys_pipe implementation. I think the maintainers of those archs have the chance to go with the unified implementation but that's up to them. The implementation introduces do_pipe_flags. I did that instead of changing all callers of do_pipe because some of the callers are written in assembler. I would probably screw up changing the assembly code. To avoid breaking code do_pipe is now a small wrapper around do_pipe_flags. Once all callers are changed over to do_pipe_flags the old do_pipe function can be removed. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_pipe2 # ifdef __x86_64__ # define __NR_pipe2 293 # elif defined __i386__ # define __NR_pipe2 331 # else # error "need __NR_pipe2" # endif #endif int main (void) { int fd[2]; if (syscall (__NR_pipe2, fd, 0) != 0) { puts ("pipe2(0) failed"); return 1; } for (int i = 0; i < 2; ++i) { int coe = fcntl (fd[i], F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { printf ("pipe2(0) set close-on-exit for fd[%d]\n", i); return 1; } } close (fd[0]); close (fd[1]); if (syscall (__NR_pipe2, fd, O_CLOEXEC) != 0) { puts ("pipe2(O_CLOEXEC) failed"); return 1; } for (int i = 0; i < 2; ++i) { int coe = fcntl (fd[i], F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { printf ("pipe2(O_CLOEXEC) does not set close-on-exit for fd[%d]\n", i); return 1; } } close (fd[0]); close (fd[1]); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:28 -07:00
Ulrich Drepper	336dd1f70f	flag parameters: dup2 This patch adds the new dup3 syscall. It extends the old dup2 syscall by one parameter which is meant to hold a flag value. Support for the O_CLOEXEC flag is added in this patch. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <time.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_dup3 # ifdef __x86_64__ # define __NR_dup3 292 # elif defined __i386__ # define __NR_dup3 330 # else # error "need __NR_dup3" # endif #endif int main (void) { int fd = syscall (__NR_dup3, 1, 4, 0); if (fd == -1) { puts ("dup3(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("dup3(0) set close-on-exec flag"); return 1; } close (fd); fd = syscall (__NR_dup3, 1, 4, O_CLOEXEC); if (fd == -1) { puts ("dup3(O_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("dup3(O_CLOEXEC) set close-on-exec flag"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:28 -07:00
Ulrich Drepper	a0998b50c3	flag parameters: epoll_create This patch adds the new epoll_create2 syscall. It extends the old epoll_create syscall by one parameter which is meant to hold a flag value. In this patch the only flag support is EPOLL_CLOEXEC which causes the close-on-exec flag for the returned file descriptor to be set. A new name EPOLL_CLOEXEC is introduced which in this implementation must have the same value as O_CLOEXEC. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <time.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_epoll_create2 # ifdef __x86_64__ # define __NR_epoll_create2 291 # elif defined __i386__ # define __NR_epoll_create2 329 # else # error "need __NR_epoll_create2" # endif #endif #define EPOLL_CLOEXEC O_CLOEXEC int main (void) { int fd = syscall (__NR_epoll_create2, 1, 0); if (fd == -1) { puts ("epoll_create2(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("epoll_create2(0) set close-on-exec flag"); return 1; } close (fd); fd = syscall (__NR_epoll_create2, 1, EPOLL_CLOEXEC); if (fd == -1) { puts ("epoll_create2(EPOLL_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("epoll_create2(EPOLL_CLOEXEC) set close-on-exec flag"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:28 -07:00
Ulrich Drepper	11fcb6c146	flag parameters: timerfd_create The timerfd_create syscall already has a flags parameter. It just is unused so far. This patch changes this by introducing the TFD_CLOEXEC flag to set the close-on-exec flag for the returned file descriptor. A new name TFD_CLOEXEC is introduced which in this implementation must have the same value as O_CLOEXEC. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <time.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_timerfd_create # ifdef __x86_64__ # define __NR_timerfd_create 283 # elif defined __i386__ # define __NR_timerfd_create 322 # else # error "need __NR_timerfd_create" # endif #endif #define TFD_CLOEXEC O_CLOEXEC int main (void) { int fd = syscall (__NR_timerfd_create, CLOCK_REALTIME, 0); if (fd == -1) { puts ("timerfd_create(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("timerfd_create(0) set close-on-exec flag"); return 1; } close (fd); fd = syscall (__NR_timerfd_create, CLOCK_REALTIME, TFD_CLOEXEC); if (fd == -1) { puts ("timerfd_create(TFD_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("timerfd_create(TFD_CLOEXEC) set close-on-exec flag"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:27 -07:00
Ulrich Drepper	b087498eb5	flag parameters: eventfd This patch adds the new eventfd2 syscall. It extends the old eventfd syscall by one parameter which is meant to hold a flag value. In this patch the only flag support is EFD_CLOEXEC which causes the close-on-exec flag for the returned file descriptor to be set. A new name EFD_CLOEXEC is introduced which in this implementation must have the same value as O_CLOEXEC. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_eventfd2 # ifdef __x86_64__ # define __NR_eventfd2 290 # elif defined __i386__ # define __NR_eventfd2 328 # else # error "need __NR_eventfd2" # endif #endif #define EFD_CLOEXEC O_CLOEXEC int main (void) { int fd = syscall (__NR_eventfd2, 1, 0); if (fd == -1) { puts ("eventfd2(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("eventfd2(0) sets close-on-exec flag"); return 1; } close (fd); fd = syscall (__NR_eventfd2, 1, EFD_CLOEXEC); if (fd == -1) { puts ("eventfd2(EFD_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("eventfd2(EFD_CLOEXEC) does not set close-on-exec flag"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [akpm@linux-foundation.org: add sys_ni stub] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:27 -07:00
Ulrich Drepper	9deb27baed	flag parameters: signalfd This patch adds the new signalfd4 syscall. It extends the old signalfd syscall by one parameter which is meant to hold a flag value. In this patch the only flag support is SFD_CLOEXEC which causes the close-on-exec flag for the returned file descriptor to be set. A new name SFD_CLOEXEC is introduced which in this implementation must have the same value as O_CLOEXEC. The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #include <fcntl.h> #include <signal.h> #include <stdio.h> #include <unistd.h> #include <sys/syscall.h> #ifndef __NR_signalfd4 # ifdef __x86_64__ # define __NR_signalfd4 289 # elif defined __i386__ # define __NR_signalfd4 327 # else # error "need __NR_signalfd4" # endif #endif #define SFD_CLOEXEC O_CLOEXEC int main (void) { sigset_t ss; sigemptyset (&ss); sigaddset (&ss, SIGUSR1); int fd = syscall (__NR_signalfd4, -1, &ss, 8, 0); if (fd == -1) { puts ("signalfd4(0) failed"); return 1; } int coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if (coe & FD_CLOEXEC) { puts ("signalfd4(0) set close-on-exec flag"); return 1; } close (fd); fd = syscall (__NR_signalfd4, -1, &ss, 8, SFD_CLOEXEC); if (fd == -1) { puts ("signalfd4(SFD_CLOEXEC) failed"); return 1; } coe = fcntl (fd, F_GETFD); if (coe == -1) { puts ("fcntl failed"); return 1; } if ((coe & FD_CLOEXEC) == 0) { puts ("signalfd4(SFD_CLOEXEC) does not set close-on-exec flag"); return 1; } close (fd); puts ("OK"); return 0; } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [akpm@linux-foundation.org: add sys_ni stub] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:27 -07:00
Ulrich Drepper	7d9dbca342	flag parameters: anon_inode_getfd extension This patch just extends the anon_inode_getfd interface to take an additional parameter with a flag value. The flag value is passed on to get_unused_fd_flags in anticipation for a use with the O_CLOEXEC flag. No actual semantic changes here, the changed callers all pass 0 for now. [akpm@linux-foundation.org: KVM fix] Signed-off-by: Ulrich Drepper <drepper@redhat.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:27 -07:00
Akinobu Mita	6e2c10a12a	binfmt_misc: use simple_read_from_buffer() Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:27 -07:00
Jon Tollefson	f4a67cceee	fs: check for statfs overflow Adds a check for an overflow in the filesystem size so if someone is checking with statfs() on a 16G blocksize hugetlbfs in a 32bit binary that it will report back EOVERFLOW instead of a size of 0. Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Jon Tollefson <kniht@linux.vnet.ibm.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:19 -07:00
Andi Kleen	a137e1cc6d	hugetlbfs: per mount huge page sizes Add the ability to configure the hugetlb hstate used on a per mount basis. - Add a new pagesize= option to the hugetlbfs mount that allows setting the page size - This option causes the mount code to find the hstate corresponding to the specified size, and sets up a pointer to the hstate in the mount's superblock. - Change the hstate accessors to use this information rather than the global_hstate they were using (requires a slight change in mm/memory.c so we don't NULL deref in the error-unmap path -- see comments). [np: take hstate out of hugetlbfs inode and vma->vm_private_data] Acked-by: Adam Litke <agl@us.ibm.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:17 -07:00
Andi Kleen	a551643895	hugetlb: modular state for hugetlb page size The goal of this patchset is to support multiple hugetlb page sizes. This is achieved by introducing a new struct hstate structure, which encapsulates the important hugetlb state and constants (eg. huge page size, number of huge pages currently allocated, etc). The hstate structure is then passed around the code which requires these fields, they will do the right thing regardless of the exact hstate they are operating on. This patch adds the hstate structure, with a single global instance of it (default_hstate), and does the basic work of converting hugetlb to use the hstate. Future patches will add more hstate structures to allow for different hugetlbfs mounts to have different page sizes. [akpm@linux-foundation.org: coding-style fixes] Acked-by: Adam Litke <agl@us.ibm.com> Acked-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:17 -07:00
Eric Dumazet	a47a126ad5	vmallocinfo: add NUMA information Christoph recently added /proc/vmallocinfo file to get information about vmalloc allocations. This patch adds NUMA specific information, giving number of pages allocated on each memory node. This should help to check that vmalloc() is able to respect NUMA policies. Example of output on a four nodes machine (one cpu per node) 1) network hash tables are evenly spreaded on four nodes (OK) (Same point for inodes and dentries hash tables) 2) iptables tables (x_tables) are correctly allocated on each cpu node (OK). 3) sys_swapon() allocates its memory from one node only. 4) each loaded module is using memory on one node. Sysadmins could tune their setup to change points 3) and 4) if necessary. grep "pages=" /proc/vmallocinfo 0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204/0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128 0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204/0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64 0xffffc2000031a000-0xffffc2000031d000 12288 alloc_large_system_hash+0x204/0x2c0 pages=2 vmalloc N1=1 N2=1 0xffffc2000031f000-0xffffc2000032b000 49152 cramfs_uncompress_init+0x2e/0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3 0xffffc2000033e000-0xffffc20000341000 12288 sys_swapon+0x640/0xac0 pages=2 vmalloc N0=2 0xffffc20000341000-0xffffc20000344000 12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N0=2 0xffffc20000344000-0xffffc20000347000 12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N1=2 0xffffc20000347000-0xffffc2000034a000 12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N2=2 0xffffc2000034a000-0xffffc2000034d000 12288 xt_alloc_table_info+0xfe/0x130 [x_tables] pages=2 vmalloc N3=2 0xffffc20004381000-0xffffc20004402000 528384 alloc_large_system_hash+0x204/0x2c0 pages=128 vmalloc N0=32 N1=32 N2=32 N3=32 0xffffc20004402000-0xffffc20004803000 4198400 alloc_large_system_hash+0x204/0x2c0 pages=1024 vmalloc vpages N0=256 N1=256 N2=256 N3=256 0xffffc20004803000-0xffffc20004904000 1052672 alloc_large_system_hash+0x204/0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64 0xffffc20004904000-0xffffc20004bec000 `3047424` sys_swapon+0x640/0xac0 pages=743 vmalloc vpages N0=743 0xffffffffa0000000-0xffffffffa000f000 61440 sys_init_module+0xc27/0x1d00 pages=14 vmalloc N1=14 0xffffffffa000f000-0xffffffffa0014000 20480 sys_init_module+0xc27/0x1d00 pages=4 vmalloc N0=4 0xffffffffa0014000-0xffffffffa0017000 12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N0=2 0xffffffffa0017000-0xffffffffa0022000 45056 sys_init_module+0xc27/0x1d00 pages=10 vmalloc N1=10 0xffffffffa0022000-0xffffffffa0028000 24576 sys_init_module+0xc27/0x1d00 pages=5 vmalloc N3=5 0xffffffffa0028000-0xffffffffa0050000 163840 sys_init_module+0xc27/0x1d00 pages=39 vmalloc N1=39 0xffffffffa0050000-0xffffffffa0052000 8192 sys_init_module+0xc27/0x1d00 pages=1 vmalloc N1=1 0xffffffffa0052000-0xffffffffa0056000 16384 sys_init_module+0xc27/0x1d00 pages=3 vmalloc N1=3 0xffffffffa0056000-0xffffffffa0081000 176128 sys_init_module+0xc27/0x1d00 pages=42 vmalloc N3=42 0xffffffffa0081000-0xffffffffa00ae000 184320 sys_init_module+0xc27/0x1d00 pages=44 vmalloc N3=44 0xffffffffa00ae000-0xffffffffa00b1000 12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2 0xffffffffa00b1000-0xffffffffa00b9000 32768 sys_init_module+0xc27/0x1d00 pages=7 vmalloc N0=7 0xffffffffa00b9000-0xffffffffa00c4000 45056 sys_init_module+0xc27/0x1d00 pages=10 vmalloc N3=10 0xffffffffa00c6000-0xffffffffa00e0000 106496 sys_init_module+0xc27/0x1d00 pages=25 vmalloc N2=25 0xffffffffa00e0000-0xffffffffa00f1000 69632 sys_init_module+0xc27/0x1d00 pages=16 vmalloc N2=16 0xffffffffa00f1000-0xffffffffa00f4000 12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2 0xffffffffa00f4000-0xffffffffa00f7000 12288 sys_init_module+0xc27/0x1d00 pages=2 vmalloc N3=2 [akpm@linux-foundation.org: fix comment] Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:17 -07:00
Pavel Machek	cce7708158	SYNC_FILE_RANGE_WRITE may and will block. Document that. [akpm@linux-foundation.org: fix comment text] Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:17 -07:00
Mel Gorman	04f2cbe356	hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed After patch 2 in this series, a process that successfully calls mmap() for a MAP_PRIVATE mapping will be guaranteed to successfully fault until a process calls fork(). At that point, the next write fault from the parent could fail due to COW if the child still has a reference. We only reserve pages for the parent but a copy must be made to avoid leaking data from the parent to the child after fork(). Reserves could be taken for both parent and child at fork time to guarantee faults but if the mapping is large it is highly likely we will not have sufficient pages for the reservation, and it is common to fork only to exec() immediatly after. A failure here would be very undesirable. Note that the current behaviour of mainline with MAP_PRIVATE pages is pretty bad. The following situation is allowed to occur today. 1. Process calls mmap(MAP_PRIVATE) 2. Process calls mlock() to fault all pages and makes sure it succeeds 3. Process forks() 4. Process writes to MAP_PRIVATE mapping while child still exists 5. If the COW fails at this point, the process gets SIGKILLed even though it had taken care to ensure the pages existed This patch improves the situation by guaranteeing the reliability of the process that successfully calls mmap(). When the parent performs COW, it will try to satisfy the allocation without using reserves. If that fails the parent will steal the page leaving any children without a page. Faults from the child after that point will result in failure. If the child COW happens first, an attempt will be made to allocate the page without reserves and the child will get SIGKILLed on failure. To summarise the new behaviour: 1. If the original mapper performs COW on a private mapping with multiple references, it will attempt to allocate a hugepage from the pool or the buddy allocator without using the existing reserves. On fail, VMAs mapping the same area are traversed and the page being COW'd is unmapped where found. It will then steal the original page as the last mapper in the normal way. 2. The VMAs the pages were unmapped from are flagged to note that pages with data no longer exist. Future no-page faults on those VMAs will terminate the process as otherwise it would appear that data was corrupted. A warning is printed to the console that this situation occured. 2. If the child performs COW first, it will attempt to satisfy the COW from the pool if there are enough pages or via the buddy allocator if overcommit is allowed and the buddy allocator can satisfy the request. If it fails, the child will be killed. If the pool is large enough, existing applications will not notice that the reserves were a factor. Existing applications depending on the no-reserves been set are unlikely to exist as for much of the history of hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that point or failing the mmap(). [npiggin@suse.de: fix CONFIG_HUGETLB=n build] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:16 -07:00
Mel Gorman	a1e78772d7	hugetlb: reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings until fork() This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings in a similar manner to the reservations taken for MAP_SHARED mappings. The reserve count is accounted both globally and on a per-VMA basis for private mappings. This guarantees that a process that successfully calls mmap() will successfully fault all pages in the future unless fork() is called. The characteristics of private mappings of hugetlbfs files behaviour after this patch are; 1. The process calling mmap() is guaranteed to succeed all future faults until it forks(). 2. On fork(), the parent may die due to SIGKILL on writes to the private mapping if enough pages are not available for the COW. For reasonably reliable behaviour in the face of a small huge page pool, children of hugepage-aware processes should not reference the mappings; such as might occur when fork()ing to exec(). 3. On fork(), the child VMAs inherit no reserves. Reads on pages already faulted by the parent will succeed. Successful writes will depend on enough huge pages being free in the pool. 4. Quotas of the hugetlbfs mount are checked at reserve time for the mapper and at fault time otherwise. Before this patch, all reads or writes in the child potentially needs page allocations that can later lead to the death of the parent. This applies to reads and writes of uninstantiated pages as well as COW. After the patch it is only a write to an instantiated page that causes problems. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:16 -07:00
Kentaro Makita	da3bbdd463	fix soft lock up at NFS mount via per-SB LRU-list of unused dentries [Summary] Split LRU-list of unused dentries to one per superblock to avoid soft lock up during NFS mounts and remounting of any filesystem. Previously I posted here: http://lkml.org/lkml/2008/3/5/590 [Descriptions] - background dentry_unused is a list of dentries which are not referenced. dentry_unused grows up when references on directories or files are released. This list can be very long if there is huge free memory. - the problem When shrink_dcache_sb() is called, it scans all dentry_unused linearly under spin_lock(), and if dentry->d_sb is differnt from given superblock, scan next dentry. This scan costs very much if there are many entries, and very ineffective if there are many superblocks. IOW, When we need to shrink unused dentries on one dentry, but scans unused dentries on all superblocks in the system. For example, we scan 500 dentries to unmount a filesystem, but scans 1,000,000 or more unused dentries on other superblocks. In our case , At mounting NFS, shrink_dcache_sb() is called to shrink unused dentries on NFS, but scans 100,000,000 unused dentries on superblocks in the system such as local ext3 filesystems. I hear NFS mounting took 1 min on some system in use. : NFS uses virtual filesystem in rpc layer, so NFS is affected by this problem. 100,000,000 is possible number on large systems. Per-superblock LRU of unused dentried can reduce the cost in reasonable manner. - How to fix I found this problem is solved by David Chinner's "Per-superblock unused dentry LRU lists V3"(1), so I rebase it and add some fix to reclaim with fairness, which is in Andrew Morton's comments(2). 1) http://lkml.org/lkml/2006/5/25/318 2) http://lkml.org/lkml/2006/5/25/320 Split LRU-list of unused dentries to each superblocks. Then, NFS mounting will check dentries under a superblock instead of all. But this spliting will break LRU of dentry-unused. So, I've attempted to make reclaim unused dentrins with fairness by calculate number of dentries to scan on this sb based on following way number of dentries to scan on this sb = count * (number of dentries on this sb / number of dentries in the machine) - ToDo - I have to measuring performance number and do stress tests. - When unmount occurs during prune_dcache(), scanning on same superblock, It is unable to reach next superblock because it is gone away. We restart scannig superblock from first one, it causes unfairness of reclaim unused dentries on first superblock. But I think this happens very rarely. - Test Results Result on 6GB boxes with excessive unused dentries. Without patch: $ cat /proc/sys/fs/dentry-state 10181835 10180203 45 0 0 0 # mount -t nfs 10.124.60.70:/work/kernel-src nfs real 0m1.830s user 0m0.001s sys 0m1.653s With this patch: $ cat /proc/sys/fs/dentry-state 10236610 10234751 45 0 0 0 # mount -t nfs 10.124.60.70:/work/kernel-src nfs real 0m0.106s user 0m0.002s sys 0m0.032s [akpm@linux-foundation.org: fix comments] Signed-off-by: Kentaro Makita <k-makita@np.css.fujitsu.com> Cc: Neil Brown <neilb@suse.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: David Chinner <dgc@sgi.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:15 -07:00
Jan Beulich	42b7772812	mm: remove double indirection on tlb parameter to free_pgd_range() & Co The double indirection here is not needed anywhere and hence (at least) confusing. Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Acked-by: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:15 -07:00
Adrian Bunk	c748e1340e	mm/vmstat.c: proper externs This patch adds proper extern declarations for five variables in include/linux/vmstat.h Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-24 10:47:14 -07:00
Shirish Pargaonkar	ef571cadd5	[CIFS] Fix warnings from checkpatch Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-07-24 15:56:05 +00:00
Shirish Pargaonkar	b1910ad622	[CIFS] Fix improper endian conversion of ACL subauth field In mode_to_acl when converting a Unix mode to a Windows ACL the subauth fields of the SID in the ACL were translated incorrectly on bigendian architectures Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-07-24 14:53:20 +00:00
Shirish Pargaonkar	76c510ad2e	[CIFS] Fix possible double free if search immediately after search rewind fails Signed-off-by: Shirish Pargaonkar <shirishp@us.ibm.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-07-24 14:48:33 +00:00
Steve French	99b1f5b2f6	[CIFS] remove checkpatch warning Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-07-24 02:37:45 +00:00
Alexey Dobriyan	f984c7b982	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Steven French <sfrench@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-07-24 02:34:24 +00:00
Harvey Harrison	5ca33c6ac3	cifs: assorted endian annotations fs/cifs/cifssmb.c:3917:13: warning: incorrect type in assignment (different base types) fs/cifs/cifssmb.c:3917:13: expected bool [unsigned] [usertype] is_unicode fs/cifs/cifssmb.c:3917:13: got restricted __le16 The comment explains why __force is used here. fs/cifs/connect.c:458:16: warning: cast to restricted __be32 fs/cifs/connect.c:458:16: warning: cast to restricted __be32 fs/cifs/connect.c:458:16: warning: cast to restricted __be32 fs/cifs/connect.c:458:16: warning: cast to restricted __be32 fs/cifs/connect.c:458:16: warning: cast to restricted __be32 fs/cifs/connect.c:458:16: warning: cast to restricted __be32 Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-07-24 01:14:41 +00:00
Jeff Layton	8efdbde647	[CIFS] break ATTR_SIZE changes out into their own function Move the code that handles ATTR_SIZE changes to its own function. This makes for a smaller function and reduces the level of indentation. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-07-23 21:28:12 +00:00
Jeff Layton	09e50d55a9	lockdep: annotate cifs in-kernel sockets Put CIFS sockets in their own class to avoid some lockdep warnings. CIFS sockets are not exposed to user-space, and so are not subject to the same deadlock scenarios. A similar change was made a couple of years ago for RPC sockets in commit `ed07536ed6`. This patch should prevent lockdep false-positives like this one: ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.18-98.el5.jtltest.38.bz456320.1debug #1 ------------------------------------------------------- test5/2483 is trying to acquire lock: (sk_lock-AF_INET){--..}, at: [<ffffffff800270d2>] tcp_sendmsg+0x1c/0xb2f but task is already holding lock: (&inode->i_alloc_sem){--..}, at: [<ffffffff8002e454>] notify_change+0xf5/0x2e0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&inode->i_alloc_sem){--..}: [<ffffffff800a817c>] __lock_acquire+0x9a9/0xadf [<ffffffff800a8a72>] lock_acquire+0x55/0x70 [<ffffffff8002e454>] notify_change+0xf5/0x2e0 [<ffffffff800a4e36>] down_write+0x3c/0x68 [<ffffffff8002e454>] notify_change+0xf5/0x2e0 [<ffffffff800e358d>] do_truncate+0x50/0x6b [<ffffffff8005197c>] get_write_access+0x40/0x46 [<ffffffff80012cf1>] may_open+0x1d3/0x22e [<ffffffff8001bc81>] open_namei+0x2c6/0x6dd [<ffffffff800289c6>] do_filp_open+0x1c/0x38 [<ffffffff800683ef>] _spin_unlock+0x17/0x20 [<ffffffff800167a7>] get_unused_fd+0xf9/0x107 [<ffffffff8001a704>] do_sys_open+0x44/0xbe [<ffffffff80060116>] system_call+0x7e/0x83 [<ffffffffffffffff>] 0xffffffffffffffff -> #2 (&sysfs_inode_imutex_key){--..}: [<ffffffff800a817c>] __lock_acquire+0x9a9/0xadf [<ffffffff8010f6df>] create_dir+0x26/0x1d7 [<ffffffff800a8a72>] lock_acquire+0x55/0x70 [<ffffffff8010f6df>] create_dir+0x26/0x1d7 [<ffffffff800671c0>] mutex_lock_nested+0x104/0x29c [<ffffffff800a819d>] __lock_acquire+0x9ca/0xadf [<ffffffff8010f6df>] create_dir+0x26/0x1d7 [<ffffffff8010fc67>] sysfs_create_dir+0x58/0x76 [<ffffffff8015144c>] kobject_add+0xdb/0x198 [<ffffffff801be765>] class_device_add+0xb2/0x465 [<ffffffff8005a6ff>] kobject_get+0x12/0x17 [<ffffffff80225265>] register_netdevice+0x270/0x33e [<ffffffff8022538c>] register_netdev+0x59/0x67 [<ffffffff80464d40>] net_olddevs_init+0xb/0xac [<ffffffff80448a79>] init+0x1f9/0x2fc [<ffffffff80068885>] _spin_unlock_irq+0x24/0x27 [<ffffffff80067f86>] trace_hardirqs_on_thunk+0x35/0x37 [<ffffffff80061079>] child_rip+0xa/0x11 [<ffffffff80068885>] _spin_unlock_irq+0x24/0x27 [<ffffffff800606a8>] restore_args+0x0/0x30 [<ffffffff80179a59>] acpi_ds_init_one_object+0x0/0x80 [<ffffffff80448880>] init+0x0/0x2fc [<ffffffff8006106f>] child_rip+0x0/0x11 [<ffffffffffffffff>] 0xffffffffffffffff -> #1 (rtnl_mutex){--..}: [<ffffffff800a817c>] __lock_acquire+0x9a9/0xadf [<ffffffff8025acf8>] ip_mc_leave_group+0x23/0xb7 [<ffffffff800a8a72>] lock_acquire+0x55/0x70 [<ffffffff8025acf8>] ip_mc_leave_group+0x23/0xb7 [<ffffffff800671c0>] mutex_lock_nested+0x104/0x29c [<ffffffff8025acf8>] ip_mc_leave_group+0x23/0xb7 [<ffffffff802451b0>] do_ip_setsockopt+0x6d1/0x9bf [<ffffffff800a575e>] lock_release_holdtime+0x27/0x48 [<ffffffff800a575e>] lock_release_holdtime+0x27/0x48 [<ffffffff8006a85e>] do_page_fault+0x503/0x835 [<ffffffff8012cbf6>] socket_has_perm+0x5b/0x68 [<ffffffff80245556>] ip_setsockopt+0x22/0x78 [<ffffffff8021c973>] sys_setsockopt+0x91/0xb7 [<ffffffff800602a6>] tracesys+0xd5/0xdf [<ffffffffffffffff>] 0xffffffffffffffff -> #0 (sk_lock-AF_INET){--..}: [<ffffffff800a5037>] print_stack_trace+0x59/0x68 [<ffffffff800a8092>] __lock_acquire+0x8bf/0xadf [<ffffffff800a8a72>] lock_acquire+0x55/0x70 [<ffffffff800270d2>] tcp_sendmsg+0x1c/0xb2f [<ffffffff80035466>] lock_sock+0xd4/0xe4 [<ffffffff80096e91>] _local_bh_enable+0xcb/0xe0 [<ffffffff800606a8>] restore_args+0x0/0x30 [<ffffffff800270d2>] tcp_sendmsg+0x1c/0xb2f [<ffffffff80057540>] sock_sendmsg+0xf3/0x110 [<ffffffff800a2bb6>] autoremove_wake_function+0x0/0x2e [<ffffffff800a10e4>] kernel_text_address+0x1a/0x26 [<ffffffff8006f4e2>] dump_trace+0x211/0x23a [<ffffffff800a6d3d>] find_usage_backwards+0x5f/0x88 [<ffffffff8840221a>] MD5Final+0xaf/0xc2 [cifs] [<ffffffff884032ec>] cifs_calculate_signature+0x55/0x69 [cifs] [<ffffffff8021d891>] kernel_sendmsg+0x35/0x47 [<ffffffff883ff38e>] smb_send+0xa3/0x151 [cifs] [<ffffffff883ff5de>] SendReceive+0x1a2/0x448 [cifs] [<ffffffff800a812f>] __lock_acquire+0x95c/0xadf [<ffffffff883e758a>] CIFSSMBSetEOF+0x20d/0x25b [cifs] [<ffffffff883fa430>] cifs_set_file_size+0x110/0x3b7 [cifs] [<ffffffff883faa89>] cifs_setattr+0x3b2/0x6f6 [cifs] [<ffffffff8002e454>] notify_change+0xf5/0x2e0 [<ffffffff8002e4a4>] notify_change+0x145/0x2e0 [<ffffffff800e358d>] do_truncate+0x50/0x6b [<ffffffff8005197c>] get_write_access+0x40/0x46 [<ffffffff80012cf1>] may_open+0x1d3/0x22e [<ffffffff8001bc81>] open_namei+0x2c6/0x6dd [<ffffffff800289c6>] do_filp_open+0x1c/0x38 [<ffffffff800683ef>] _spin_unlock+0x17/0x20 [<ffffffff800167a7>] get_unused_fd+0xf9/0x107 [<ffffffff8001a704>] do_sys_open+0x44/0xbe [<ffffffff800602a6>] tracesys+0xd5/0xdf [<ffffffffffffffff>] 0xffffffffffffffff other info that might help us debug this: 2 locks held by test5/2483: #0: (&inode->i_mutex){--..}, at: [<ffffffff800e3582>] do_truncate+0x45/0x6b #1: (&inode->i_alloc_sem){--..}, at: [<ffffffff8002e454>] notify_change+0xf5/0x2e0 stack backtrace: Call Trace: [<ffffffff800a6a7b>] print_circular_bug_tail+0x65/0x6e [<ffffffff800a5037>] print_stack_trace+0x59/0x68 [<ffffffff800a8092>] __lock_acquire+0x8bf/0xadf [<ffffffff800a8a72>] lock_acquire+0x55/0x70 [<ffffffff800270d2>] tcp_sendmsg+0x1c/0xb2f [<ffffffff80035466>] lock_sock+0xd4/0xe4 [<ffffffff80096e91>] _local_bh_enable+0xcb/0xe0 [<ffffffff800606a8>] restore_args+0x0/0x30 [<ffffffff800270d2>] tcp_sendmsg+0x1c/0xb2f [<ffffffff80057540>] sock_sendmsg+0xf3/0x110 [<ffffffff800a2bb6>] autoremove_wake_function+0x0/0x2e [<ffffffff800a10e4>] kernel_text_address+0x1a/0x26 [<ffffffff8006f4e2>] dump_trace+0x211/0x23a [<ffffffff800a6d3d>] find_usage_backwards+0x5f/0x88 [<ffffffff8840221a>] :cifs:MD5Final+0xaf/0xc2 [<ffffffff884032ec>] :cifs:cifs_calculate_signature+0x55/0x69 [<ffffffff8021d891>] kernel_sendmsg+0x35/0x47 [<ffffffff883ff38e>] :cifs:smb_send+0xa3/0x151 [<ffffffff883ff5de>] :cifs:SendReceive+0x1a2/0x448 [<ffffffff800a812f>] __lock_acquire+0x95c/0xadf [<ffffffff883e758a>] :cifs:CIFSSMBSetEOF+0x20d/0x25b [<ffffffff883fa430>] :cifs:cifs_set_file_size+0x110/0x3b7 [<ffffffff883faa89>] :cifs:cifs_setattr+0x3b2/0x6f6 [<ffffffff8002e454>] notify_change+0xf5/0x2e0 [<ffffffff8002e4a4>] notify_change+0x145/0x2e0 [<ffffffff800e358d>] do_truncate+0x50/0x6b [<ffffffff8005197c>] get_write_access+0x40/0x46 [<ffffffff80012cf1>] may_open+0x1d3/0x22e [<ffffffff8001bc81>] open_namei+0x2c6/0x6dd [<ffffffff800289c6>] do_filp_open+0x1c/0x38 [<ffffffff800683ef>] _spin_unlock+0x17/0x20 [<ffffffff800167a7>] get_unused_fd+0xf9/0x107 [<ffffffff8001a704>] do_sys_open+0x44/0xbe [<ffffffff800602a6>] tracesys+0xd5/0xdf Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-07-23 18:25:38 +00:00
Linus Torvalds	c010b2f76c	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (82 commits) ipw2200: Call netif_*_queue() interfaces properly. netxen: Needs to include linux/vmalloc.h [netdrvr] atl1d: fix !CONFIG_PM build r6040: rework init_one error handling r6040: bump release number to 0.18 r6040: handle RX fifo full and no descriptor interrupts r6040: change the default waiting time r6040: use definitions for magic values in descriptor status r6040: completely rework the RX path r6040: call napi_disable when puting down the interface and set lp->dev accordingly. mv643xx_eth: fix NETPOLL build r6040: rework the RX buffers allocation routine r6040: fix scheduling while atomic in r6040_tx_timeout r6040: fix null pointer access and tx timeouts r6040: prefix all functions with r6040 rndis_host: support WM6 devices as modems at91_ether: use netstats in net_device structure sfc: Create one RX queue and interrupt per CPU package by default sfc: Use a separate workqueue for resets sfc: I2C adapter initialisation fixes ...	2008-07-22 19:09:51 -07:00
Adrian Bunk	8086cd451f	netns: make get_proc_net() static get_proc_net() can now become static. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-22 14:19:19 -07:00
Linus Torvalds	53baaaa968	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (79 commits) arm: bus_id -> dev_name() and dev_set_name() conversions sparc64: fix up bus_id changes in sparc core code 3c59x: handle pci_name() being const MTD: handle pci_name() being const HP iLO driver sysdev: Convert the x86 mce tolerant sysdev attribute to generic attribute sysdev: Add utility functions for simple int/ulong variable sysdev attributes sysdev: Pass the attribute to the low level sysdev show/store function driver core: Suppress sysfs warnings for device_rename(). kobject: Transmit return value of call_usermodehelper() to caller sysfs-rules.txt: reword API stability statement debugfs: Implement debugfs_remove_recursive() HOWTO: change email addresses of James in HOWTO always enable FW_LOADER unless EMBEDDED=y uio-howto.tmpl: use unique output names uio-howto.tmpl: use standard copyright/legal markings sysfs: don't call notify_change sysdev: fix debugging statements in registration code. kobject: should use kobject_put() in kset-example kobject: reorder kobject to save space on 64 bit builds ...	2008-07-22 13:13:47 -07:00
Alexey Dobriyan	ee1e6ab605	proc: fix /proc/*/pagemap some more struct pagemap_walk was placed on stack, some hooks are initialized, the rest (->pgd_entry, ->pud_entry, ->pte_entry) are valid but junk. Reported-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Vegard Nossum" <vegard.nossum@gmail.com> Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-22 09:59:41 -07:00
John Reiser	6519108746	execve filename: document and export via auxiliary vector The Linux kernel puts the filename argument of execve() into the new address space. Many developers are surprised to learn this. Those who know and could use it, object "But it's not documented." Those who want to use it dislike the expression (char *)(1+ strlen(env[-1+ n_env]) + env[-1+ n_env]) because it requires locating the last original environment variable, and assumes that the filename follows the characters. This patch documents the insertion of the filename, and makes it easier to find by adding a new tag AT_EXECFN in the ElfXX_auxv_t; see <elf.h>. In many cases readlink("/proc/self/exe",) gives the same answer. But if all the original pages get unmapped, then the kernel erases the symlink for /proc/self/exe. This can happen when a program decompressor does a good job of cleaning up after uncompressing directly to memory, so that the address space of the target program looks the same as if compression had never happened. One example is http://upx.sourceforge.net . One notable use of the underlying concept (what path containED the executable) is glibc expanding $ORIGIN in DT_RUNPATH. In practice for the near term, it may be a good idea for user-mode code to use both /proc/self/exe and AT_EXECFN as fall-back methods for each other. /proc/self/exe can fail due to unmapping, AT_EXECFN can fail because it won't be present on non-new systems. The auxvec or {AT_EXECFN}.d_val also can get overwritten, although in nearly all cases this would be the result of a bug. The runtime cost is one NEW_AUX_ENT using two words of stack space. The underlying value is maintained already as bprm->exec; setup_arg_pages() in fs/exec.c slides it for stack_shift, etc. Signed-off-by: John Reiser <jreiser@BitWagon.com> Cc: Roland McGrath <roland@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-22 09:59:40 -07:00
Jan Beulich	04e1e0ccca	[CIFS] Fix compiler warning on 64-bit Signed-off-by: Steve French <sfrench@us.ibm.com>	2008-07-22 13:04:18 +00:00
Cornelia Huck	36ce6dad6e	driver core: Suppress sysfs warnings for device_rename(). driver core: Suppress sysfs warnings for device_rename(). Renaming network devices to an already existing name is not something we want sysfs to print a scary warning for, since the callers can deal with this correctly. So let's introduce sysfs_create_link_nowarn() which gets rid of the common warning. Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-07-21 21:55:01 -07:00
Haavard Skinnemoen	9505e63756	debugfs: Implement debugfs_remove_recursive() debugfs_remove_recursive() will remove a dentry and all its children. Drivers can use this to zap their whole debugfs tree so that they don't need to keep track of every single debugfs dentry they created. It may fail to remove the whole tree in certain cases: sh-3.2# rmmod atmel-mci < /sys/kernel/debug/mmc0/ios/clock mmc0: card b368 removed atmel_mci atmel_mci.0: Lost dma0chan1, falling back to PIO sh-3.2# ls /sys/kernel/debug/mmc0/ ios But I'm not sure if that case can be handled in any sane manner. Signed-off-by: Haavard Skinnemoen <haavard.skinnemoen@atmel.com> Cc: Pierre Ossman <drzeus-list@drzeus.cx> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-07-21 21:54:59 -07:00
Miklos Szeredi	93265d13ea	sysfs: don't call notify_change sysfs_chmod_file() calls notify_change() to change the permission bits on a sysfs file. Replace with explicit call to sysfs_setattr() and fsnotify_change(). This is equivalent, except that security_inode_setattr() is not called. This function is called by drivers, so the security checks do not make any sense. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-07-21 21:54:57 -07:00
Kay Sievers	aab0de2451	driver core: remove KOBJ_NAME_LEN define Kobjects do not have a limit in name size since a while, so stop pretending that they do. Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-07-21 21:54:52 -07:00
Greg Kroah-Hartman	6143b59970	device create: coda: convert device_create to device_create_drvdata device_create() is race-prone, so use the race-free device_create_drvdata() instead as device_create() is going away. Cc: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2008-07-21 21:54:41 -07:00
Linus Torvalds	14b395e35d	Merge branch 'for-2.6.27' of git://linux-nfs.org/~bfields/linux * 'for-2.6.27' of git://linux-nfs.org/~bfields/linux: (51 commits) nfsd: nfs4xdr.c do-while is not a compound statement nfsd: Use C99 initializers in fs/nfsd/nfs4xdr.c lockd: Pass "struct sockaddr *" to new failover-by-IP function lockd: get host reference in nlmsvc_create_block() instead of callers lockd: minor svclock.c style fixes lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_lock lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_testlock lockd: nlm_release_host() checks for NULL, caller needn't file lock: reorder struct file_lock to save space on 64 bit builds nfsd: take file and mnt write in nfs4_upgrade_open nfsd: document open share bit tracking nfsd: tabulate nfs4 xdr encoding functions nfsd: dprint operation names svcrdma: Change WR context get/put to use the kmem cache svcrdma: Create a kmem cache for the WR contexts svcrdma: Add flush_scheduled_work to module exit function svcrdma: Limit ORD based on client's advertised IRD svcrdma: Remove unused wait q from svcrdma_xprt structure svcrdma: Remove unneeded spin locks from __svc_rdma_free svcrdma: Add dma map count and WARN_ON ...	2008-07-20 21:21:46 -07:00
Linus Torvalds	db6d8c7a40	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (1232 commits) iucv: Fix bad merging. net_sched: Add size table for qdiscs net_sched: Add accessor function for packet length for qdiscs net_sched: Add qdisc_enqueue wrapper highmem: Export totalhigh_pages. ipv6 mcast: Omit redundant address family checks in ip6_mc_source(). net: Use standard structures for generic socket address structures. ipv6 netns: Make several "global" sysctl variables namespace aware. netns: Use net_eq() to compare net-namespaces for optimization. ipv6: remove unused macros from net/ipv6.h ipv6: remove unused parameter from ip6_ra_control tcp: fix kernel panic with listening_get_next tcp: Remove redundant checks when setting eff_sacks tcp: options clean up tcp: Fix MD5 signatures for non-linear skbs sctp: Update sctp global memory limit allocations. sctp: remove unnecessary byteshifting, calculate directly in big-endian sctp: Allow only 1 listening socket with SO_REUSEADDR sctp: Do not leak memory on multiple listen() calls sctp: Support ipv6only AF_INET6 sockets. ...	2008-07-20 17:43:29 -07:00
Linus Torvalds	f7df406dce	Merge branch 'configfs-fixup-ptr-error' of git://oss.oracle.com/git/jlbec/linux-2.6 * 'configfs-fixup-ptr-error' of git://oss.oracle.com/git/jlbec/linux-2.6: configfs: Allow ->make_item() and ->make_group() to return detailed errors. Revert "configfs: Allow ->make_item() and ->make_group() to return detailed errors."	2008-07-20 17:17:52 -07:00
Alan Cox	a352def21a	tty: Ldisc revamp Move the line disciplines towards a conventional ->ops arrangement. For the moment the actual 'tty_ldisc' struct in the tty is kept as part of the tty struct but this can then be changed if it turns out that when it all settles down we want to refcount ldiscs separately to the tty. Pull the ldisc code out of /proc and put it with our ldisc code. Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-20 17:12:34 -07:00
David S. Miller	407d819cf0	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/holtmann/bluetooth-2.6	2008-07-19 00:30:39 -07:00
Harvey Harrison	5108b27651	nfsd: nfs4xdr.c do-while is not a compound statement The WRITEMEM macro produces sparse warnings of the form: fs/nfsd/nfs4xdr.c:2668:2: warning: do-while statement is not a compound statement Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-07-18 15:18:35 -04:00
J. Bruce Fields	ad1060c89c	nfsd: Use C99 initializers in fs/nfsd/nfs4xdr.c Thanks to problem report and original patch from Harvey Harrison. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Cc: Harvey Harrison <harvey.harrison@gmail.com> Cc: Benny Halevy <bhalevy@panasas.com>	2008-07-18 15:04:58 -04:00
Pavel Emelyanov	b6fcbdb4f2	proc: consolidate per-net single-release callers They are symmetrical to single_open ones :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:07:44 -07:00
Pavel Emelyanov	de05c557b2	proc: consolidate per-net single_open callers There are already 7 of them - time to kill some duplicate code. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2008-07-18 04:07:21 -07:00
David S. Miller	49997d7515	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6 Conflicts: Documentation/powerpc/booting-without-of.txt drivers/atm/Makefile drivers/net/fs_enet/fs_enet-main.c drivers/pci/pci-acpi.c net/8021q/vlan.c net/iucv/iucv.c	2008-07-18 02:39:39 -07:00
Joel Becker	a6795e9ebb	configfs: Allow ->make_item() and ->make_group() to return detailed errors. The configfs operations ->make_item() and ->make_group() currently return a new item/group. A return of NULL signifies an error. Because of this, -ENOMEM is the only return code bubbled up the stack. Multiple folks have requested the ability to return specific error codes when these operations fail. This patch adds that ability by changing the ->make_item/group() ops to return ERR_PTR() values. These errors are bubbled up appropriately. NULL returns are changed to -ENOMEM for compatibility. Also updated are the in-kernel users of configfs. This is a rework of reverted commit `11c3b79218`. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-17 15:21:29 -07:00
Joel Becker	f89ab8619e	Revert "configfs: Allow ->make_item() and ->make_group() to return detailed errors." This reverts commit `11c3b79218`. The code will move to PTR_ERR(). Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-17 14:53:48 -07:00
Linus Torvalds	5b664cb235	Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: [PATCH] ocfs2: fix oops in mmap_truncate testing configfs: call drop_link() to cleanup after create_link() failure configfs: Allow ->make_item() and ->make_group() to return detailed errors. configfs: Fix failing mkdir() making racing rmdir() fail configfs: Fix deadlock with racing rmdir() and rename() configfs: Make configfs_new_dirent() return error code instead of NULL configfs: Protect configfs_dirent s_links list mutations configfs: Introduce configfs_dirent_lock ocfs2: Don't snprintf() without a format. ocfs2: Fix CONFIG_OCFS2_DEBUG_FS #ifdefs ocfs2/net: Silence build warnings on sparc64 ocfs2: Handle error during journal load ocfs2: Silence an error message in ocfs2_file_aio_read() ocfs2: use simple_read_from_buffer() ocfs2: fix printk format warnings with OCFS2_FS_STATS=n [PATCH 2/2] ocfs2: Instrument fs cluster locks [PATCH 1/2] ocfs2: Add CONFIG_OCFS2_FS_STATS config option	2008-07-17 10:55:51 -07:00
Coly Li	c0420ad2ca	[PATCH] ocfs2: fix oops in mmap_truncate testing This patch fixes a mmap_truncate bug which was found by ocfs2 test suite. In an ocfs2 cluster more than 1 node, run program mmap_truncate, which races mmap writes and truncates from multiple processes. While the test is running, a stat from another node forces writeout, causing an oops in ocfs2_get_block() because it sees a buffer to write which isn't allocated. This patch fixed the bug by clear dirty and uptodate bits in buffer, leave the buffer unmapped and return. Fix is suggested by Mark Fasheh, and I code up the patch. Signed-off-by: Coly Li <coyli@suse.de> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-07-16 16:13:04 -07:00
Linus Torvalds	9c1be0c471	Merge branch 'for_linus' of git://git.infradead.org/~dedekind/ubifs-2.6 * 'for_linus' of git://git.infradead.org/~dedekind/ubifs-2.6: UBIFS: include to compilation UBIFS: add new flash file system UBIFS: add brief documentation MAINTAINERS: add UBIFS section do_mounts: allow UBI root device name VFS: export sync_sb_inodes VFS: move inode_lock into sync_sb_inodes	2008-07-16 15:02:57 -07:00
Linus Torvalds	8df1b049bc	Merge git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (82 commits) NFSv4: Remove BKL from the nfsv4 state recovery SUNRPC: Remove the BKL from the callback functions NFS: Remove BKL from the readdir code NFS: Remove BKL from the symlink code NFS: Remove BKL from the sillydelete operations NFS: Remove the BKL from the rename, rmdir and unlink operations NFS: Remove BKL from NFS lookup code NFS: Remove the BKL from nfs_link() NFS: Remove the BKL from the inode creation operations NFS: Remove BKL usage from open() NFS: Remove BKL usage from the write path NFS: Remove the BKL from the permission checking code NFS: Remove attribute update related BKL references NFS: Remove BKL requirement from attribute updates NFS: Protect inode->i_nlink updates using inode->i_lock nfs: set correct fl_len in nlmclnt_test() SUNRPC: Support registering IPv6 interfaces with local rpcbind daemon SUNRPC: Refactor rpcb_register to make rpcbindv4 support easier SUNRPC: None of rpcb_create's callers wants a privileged source port SUNRPC: Introduce a specific rpcb_create for contacting localhost ...	2008-07-16 14:49:49 -07:00
Randy Dunlap	3c3622dcb6	Fix compile issues in fs/compat_ioctl.c when CONFIG_BLOCK is disabled Fix fs/compat_ioctl.c to handle CONFIG_BLOCK=n, CONFIG_SCSI=n to avoid build errors: In file included from include/scsi/scsi.h:12, from fs/compat_ioctl.c:71: include/scsi/scsi_cmnd.h:27:25: warning: "BLK_MAX_CDB" is not defined include/scsi/scsi_cmnd.h:28:3: error: #error MAX_COMMAND_SIZE can not be bigger than BLK_MAX_CDB In file included from include/scsi/scsi.h:12, from fs/compat_ioctl.c:71: include/scsi/scsi_cmnd.h: In function 'scsi_bidi_cmnd': include/scsi/scsi_cmnd.h:182: error: implicit declaration of function 'blk_bidi_rq' include/scsi/scsi_cmnd.h:183: error: dereferencing pointer to incomplete type include/scsi/scsi_cmnd.h: In function 'scsi_in': include/scsi/scsi_cmnd.h:189: error: dereferencing pointer to incomplete type Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-16 11:46:16 -07:00
Trond Myklebust	cadc723cc1	Merge branch 'bkl-removal' into next	2008-07-15 18:34:58 -04:00
Trond Myklebust	e89e896d31	Merge branch 'devel' into next Conflicts: fs/nfs/file.c Fix up the conflict with Jon Corbet's bkl-removal tree	2008-07-15 18:34:16 -04:00
Trond Myklebust	f839c4c199	NFSv4: Remove BKL from the nfsv4 state recovery Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:57 -04:00
Trond Myklebust	a86dc496b7	SUNRPC: Remove the BKL from the callback functions Push it into those callback functions that actually need it. Note that all the NFS operations use their own locking, so don't need the BKL. Ditto for the rpcbind client. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:57 -04:00
Trond Myklebust	c3cc8c019c	NFS: Remove BKL from the readdir code Page accesses are serialised using the page locks, whereas all attribute updates are serialised using the inode->i_lock. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:56 -04:00
Trond Myklebust	76566991f9	NFS: Remove BKL from the symlink code Page cache accesses are serialised using page locks, whereas attribute updates are serialised using inode->i_lock. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:56 -04:00
Trond Myklebust	52e2e8d37e	NFS: Remove BKL from the sillydelete operations Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:55 -04:00
Trond Myklebust	bd9bb454b7	NFS: Remove the BKL from the rename, rmdir and unlink operations Attribute updates are safe, and dentry operations are protected using VFS level locks. Defer removing the BKL from sillyrename until a separate patch. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:55 -04:00
Trond Myklebust	fc0f684c21	NFS: Remove BKL from NFS lookup code All dentry-related operations are already BKL-safe, since they are protected by the VFS locking. No extra locks should be needed in the NFS code. In the case of nfs_revalidate_inode(), we're only doing an attribute update (protected by the inode->i_lock). In the case of nfs_lookup(), we're instantiating a new dentry, so there should be no contention possible until after we call d_materialise_unique. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:54 -04:00
Trond Myklebust	fc81af535e	NFS: Remove the BKL from nfs_link() Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:54 -04:00
Trond Myklebust	f1e2eda235	NFS: Remove the BKL from the inode creation operations nfs_instantiate() does not require the BKL, neither do the attribute updates or the RPC code. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:53 -04:00
Trond Myklebust	bba67e0e3f	NFS: Remove BKL usage from open() All the NFSv4 stateful operations are already protected by other locks (in particular by the rpc_sequence locks. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:53 -04:00
Trond Myklebust	b6a2e569e2	NFS: Remove BKL usage from the write path Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:52 -04:00
Trond Myklebust	4d80f2ecd5	NFS: Remove the BKL from the permission checking code Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:52 -04:00
Trond Myklebust	fa6dc9dc59	NFS: Remove attribute update related BKL references Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:51 -04:00
Trond Myklebust	a3d01454bc	NFS: Remove BKL requirement from attribute updates The main problem is dealing with inode->i_size: we need to set the inode->i_lock on all attribute updates, and so vmtruncate won't cut it. Make an NFS-private version of vmtruncate that has the necessary locking semantics. The result should be that the following inode attribute updates are protected by inode->i_lock nfsi->cache_validity nfsi->read_cache_jiffies nfsi->attrtimeo nfsi->attrtimeo_timestamp nfsi->change_attr nfsi->last_updated nfsi->cache_change_attribute nfsi->access_cache nfsi->access_cache_entry_lru nfsi->access_cache_inode_lru nfsi->acl_access nfsi->acl_default nfsi->nfs_page_tree nfsi->ncommit nfsi->npages nfsi->open_files nfsi->silly_list nfsi->acl nfsi->open_states inode->i_size inode->i_atime inode->i_mtime inode->i_ctime inode->i_nlink inode->i_uid inode->i_gid The following is protected by dir->i_mutex nfsi->cookieverf Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:51 -04:00
Trond Myklebust	1b83d70703	NFS: Protect inode->i_nlink updates using inode->i_lock Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:10:50 -04:00
Felix Blyakher	d67d1c7bf9	nfs: set correct fl_len in nlmclnt_test() fcntl(F_GETLK) on an nfs client incorrectly returns the values for the conflicting lock. fl_len value is always 1. If the conflicting lock is (0, 4095) the F_GETLK request for (1024, 10) returns (0, 1), which doesn't even cover the requested range, and is quite confusing. The fix is trivial, set fl_end from the fl_end value recieved from the nfs server. Signed-off-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-15 18:08:59 -04:00
Chuck Lever	367c8c7bd9	lockd: Pass "struct sockaddr *" to new failover-by-IP function Pass a more generic socket address type to nlmsvc_unlock_all_by_ip() to allow for future support of IPv6. Also provide additional sanity checking in failover_unlock_ip() when constructing the server's IP address. As an added bonus, provide clean kerneldoc comments on related NLM interfaces which were recently added. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-07-15 16:11:29 -04:00
Ingo Molnar	1a781a777b	Merge branch 'generic-ipi' into generic-ipi-for-linus Conflicts: arch/powerpc/Kconfig arch/s390/kernel/time.c arch/x86/kernel/apic_32.c arch/x86/kernel/cpu/perfctr-watchdog.c arch/x86/kernel/i8259_64.c arch/x86/kernel/ldt.c arch/x86/kernel/nmi_64.c arch/x86/kernel/smpboot.c arch/x86/xen/smp.c include/asm-x86/hw_irq_32.h include/asm-x86/hw_irq_64.h include/asm-x86/mach-default/irq_vectors.h include/asm-x86/mach-voyager/irq_vectors.h include/asm-x86/smp.h kernel/Makefile Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-15 21:55:59 +02:00
J. Bruce Fields	560de0e659	lockd: get host reference in nlmsvc_create_block() instead of callers It may not be obvious (till you look at the definition of nlm_alloc_call()) that a function like nlmsvc_create_block() should consume a reference on success or failure, so I find it clearer if it takes the reference it needs itself. And both callers already do this immediately before the call anyway. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-07-15 15:40:25 -04:00
J. Bruce Fields	6d7bbbbacc	lockd: minor svclock.c style fixes Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-07-15 15:28:43 -04:00
Jeff Layton	6cde4de807	lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_lock nlmsvc_lock calls nlmsvc_lookup_host to find a nlm_host struct. The callers of this function, however, call nlmsvc_retrieve_args or nlm4svc_retrieve_args, which also return a nlm_host struct. Change nlmsvc_lock to take a host arg instead of calling nlmsvc_lookup_host itself and change the callers to pass a pointer to the nlm_host they've already found. Since nlmsvc_testlock() now just uses the caller's reference, we no longer need to get or release it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-07-15 14:53:33 -04:00
Jeff Layton	8f920d5e29	lockd: eliminate duplicate nlmsvc_lookup_host call from nlmsvc_testlock nlmsvc_testlock calls nlmsvc_lookup_host to find a nlm_host struct. The callers of this functions, however, call nlmsvc_retrieve_args or nlm4svc_retrieve_args, which also return a nlm_host struct. Change nlmsvc_testlock to take a host arg instead of calling nlmsvc_lookup_host itself and change the callers to pass a pointer to the nlm_host they've already found. We take a reference to host in the place where nlmsvc_testlock() previous did a new lookup, so the reference counting is unchanged from before. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-07-15 14:26:52 -04:00
Linus Torvalds	e4e0fadcd9	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6: jfs: remove DIRENTSIZ JFS: diAlloc() should return -EIO rather than EIO JFS: skip bad iput() call in error path JFS: switch to seq_files JFS: 0 is not valid errno value so return NULL from jfs_lookup	2008-07-15 11:03:19 -07:00
Linus Torvalds	38c46578ff	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: [GFS2] Fix GFS2's use of do_div() in its quota calculations [GFS2] Remove unused declaration [GFS2] Remove support for unused and pointless flag [GFS2] Replace rgrp "recent list" with mru list [GFS2] Allow local DF locks when holding a cached EX glock [GFS2] Fix delayed demote race [GFS2] don't call permission() [GFS2] Fix module building [GFS2] Glock documentation [GFS2] Remove all_list from lock_dlm [GFS2] Remove obsolete conversion deadlock avoidance code [GFS2] Remove remote lock dropping code [GFS2] kernel panic mounting volume [GFS2] Revise readpage locking [GFS2] Fix ordering of args for list_add [GFS2] trivial sparse lock annotations [GFS2] No lock_nolock [GFS2] Fix ordering bug in lock_dlm [GFS2] Clean up the glock core	2008-07-15 10:38:46 -07:00
Jeff Layton	b0e92aae15	lockd: nlm_release_host() checks for NULL, caller needn't No need to check for a NULL argument twice. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2008-07-15 12:35:20 -04:00
Linus Torvalds	8d2567a620	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (61 commits) ext4: Documention update for new ordered mode and delayed allocation ext4: do not set extents feature from the kernel ext4: Don't allow nonextenst mount option for large filesystem ext4: Enable delalloc by default. ext4: delayed allocation i_blocks fix for stat ext4: fix delalloc i_disksize early update issue ext4: Handle page without buffers in ext4_*_writepage() ext4: Add ordered mode support for delalloc ext4: Invert lock ordering of page_lock and transaction start in delalloc mm: Add range_cont mode for writeback ext4: delayed allocation ENOSPC handling percpu_counter: new function percpu_counter_sum_and_set ext4: Add delayed allocation support in data=writeback mode vfs: add hooks for ext4's delayed allocation support jbd2: Remove data=ordered mode support using jbd buffer heads ext4: Use new framework for data=ordered mode in JBD2 jbd2: Implement data=ordered mode handling via inodes vfs: export filemap_fdatawrite_range() ext4: Fix lock inversion in ext4_ext_truncate() ext4: Invert the locking order of page_lock and transaction start ...	2008-07-15 08:36:38 -07:00
Artem Bityutskiy	0d7eff873c	UBIFS: include to compilation Add UBIFS to Makefile and Kbuild. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>	2008-07-15 17:35:24 +03:00
Artem Bityutskiy	1e51764a3c	UBIFS: add new flash file system This is a new flash file system. See http://www.linux-mtd.infradead.org/doc/ubifs.html Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>	2008-07-15 17:35:15 +03:00
Linus Torvalds	d1794f2c5b	Merge branch 'bkl-removal' of git://git.lwn.net/linux-2.6 * 'bkl-removal' of git://git.lwn.net/linux-2.6: (146 commits) IB/umad: BKL is not needed for ib_umad_open() IB/uverbs: BKL is not needed for ib_uverbs_open() bf561-coreb: BKL unneeded for open() Call fasync() functions without the BKL snd/PCM: fasync BKL pushdown ipmi: fasync BKL pushdown ecryptfs: fasync BKL pushdown Bluetooth VHCI: fasync BKL pushdown tty_io: fasync BKL pushdown tun: fasync BKL pushdown i2o: fasync BKL pushdown mpt: fasync BKL pushdown Remove BKL from remote_llseek v2 Make FAT users happier by not deadlocking x86-mce: BKL pushdown vmwatchdog: BKL pushdown vmcp: BKL pushdown via-pmu: BKL pushdown uml-random: BKL pushdown uml-mmapper: BKL pushdown ...	2008-07-14 14:48:31 -07:00
Jonathan Corbet	2fceef397f	Merge commit 'v2.6.26' into bkl-removal	2008-07-14 15:29:34 -06:00
Louis Rilling	e752065175	configfs: call drop_link() to cleanup after create_link() failure When allow_link() succeeds but create_link() fails, the subsystem is not informed of the failure. This patch fixes this by calling drop_link() on create_link() failures. Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-14 13:57:16 -07:00
Joel Becker	11c3b79218	configfs: Allow ->make_item() and ->make_group() to return detailed errors. The configfs operations ->make_item() and ->make_group() currently return a new item/group. A return of NULL signifies an error. Because of this, -ENOMEM is the only return code bubbled up the stack. Multiple folks have requested the ability to return specific error codes when these operations fail. This patch adds that ability by changing the ->make_item/group() ops to return an int. Also updated are the in-kernel users of configfs. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-14 13:57:16 -07:00
Louis Rilling	6d8344baee	configfs: Fix failing mkdir() making racing rmdir() fail When fixing the rename() vs rmdir() deadlock, we stopped locking default groups' inodes in configfs_detach_prep(), letting racing mkdir() in default groups proceed concurrently. This enables races like below happen, which leads to a failing mkdir() making rmdir() fail, despite the group to remove having no user-created directory under it in the end. process A: process B: /* PWD=A/B */ mkdir("C") make_item("C") attach_group("C") rmdir("A") detach_prep("A") detach_prep("B") error because of "C" return -ENOTEMPTY attach_group("C/D") error (eg -ENOMEM) return -ENOMEM This patch prevents such scenarii by making rmdir() wait as long as detach_prep() fails because a racing mkdir() is in the middle of attach_group(). To achieve this, mkdir() sets a flag CONFIGFS_USET_IN_MKDIR in parent's configfs_dirent before calling attach_group(), and clears the flag once attach_group() is done. detach_prep() fails with -EAGAIN whenever the flag is hit and returns the guilty inode's mutex so that rmdir() can wait on it. Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-14 13:57:16 -07:00
Louis Rilling	b3e76af874	configfs: Fix deadlock with racing rmdir() and rename() This patch fixes the deadlock between racing sys_rename() and configfs_rmdir(). The idea is to avoid locking i_mutexes of default groups in configfs_detach_prep(), and rely instead on the new configfs_dirent_lock to protect against configfs_dirent's linkage mutations. To ensure that an mkdir() racing with rmdir() will not create new items in a to-be-removed default group, we make configfs_new_dirent() check for the CONFIGFS_USET_DROPPING flag right before linking the new dirent, and return error if the flag is set. This makes racing mkdir()/symlink()/dir_open() fail in places where errors could already happen, resp. in (attach_item()\|attach_group())/create_link()/new_dirent(). configfs_depend() remains safe since it locks all the path from configfs root, and is thus mutually exclusive with rmdir(). An advantage of this is that now detach_groups() unconditionnaly takes the default groups i_mutex, which makes it more consistent with populate_groups(). Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-14 13:57:16 -07:00
Louis Rilling	107ed40bd0	configfs: Make configfs_new_dirent() return error code instead of NULL This patch makes configfs_new_dirent return negative error code instead of NULL, which will be useful in the next patch to differentiate ENOMEM from ENOENT. Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-14 13:57:16 -07:00
Louis Rilling	5301a77da2	configfs: Protect configfs_dirent s_links list mutations Symlinks to a config_item are listed under its configfs_dirent s_links, but the list mutations are not protected by any common lock. This patch uses the configfs_dirent_lock spinlock to add the necessary protection. Note: we should also protect the list_empty() test in configfs_detach_prep() but 1/ the lock should not be released immediately because nothing would prevent the list from being filled after a successful list_empty() test, making the problem tricky, 2/ this will be solved by the rmdir() vs rename() deadlock bugfix. Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-14 13:57:16 -07:00
Louis Rilling	6f61076406	configfs: Introduce configfs_dirent_lock This patch introduces configfs_dirent_lock spinlock to protect configfs_dirent traversals against linkage mutations (add/del/move). This will allow configfs_detach_prep() to avoid locking i_mutexes. Locking rules for configfs_dirent linkage mutations are the same plus the requirement of taking configfs_dirent_lock. For configfs_dirent walking, one can either take appropriate i_mutex as before, or take configfs_dirent_lock. The spinlock could actually be a mutex, but the critical sections are either O(1) or should not be too long (default groups walking in last patch). ChangeLog: - Clarify the comment on configfs_dirent_lock usage - Move sd->s_element init before linking the new dirent - In lseek(), do not release configfs_dirent_lock before the dirent is relinked. Signed-off-by: Louis Rilling <Louis.Rilling@kerlabs.com> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-14 13:57:15 -07:00
Joel Becker	fe9f387740	ocfs2: Don't snprintf() without a format. Some system files are per-slot. Their names include the slot number. ocfs2_sprintf_system_inode_name() uses the system inode definitions to fill in the slot number with snprintf(). For global system files, there is no node number, and the name was printed as a format with no arguments. -Wformat-nonliteral and -Wformat-security don't like this. Instead, use a static "%s" format and the name as the argument. Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-14 13:57:15 -07:00
Joel Becker	e407e39783	ocfs2: Fix CONFIG_OCFS2_DEBUG_FS #ifdefs A couple places use OCFS2_DEBUG_FS where they really mean CONFIG_OCFS2_DEBUG_FS. Reported-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Joel Becker <joel.becker@oracle.com>	2008-07-14 13:57:15 -07:00
Sunil Mushran	461c6a30ec	ocfs2/net: Silence build warnings on sparc64 suseconds_t is type long on most arches except sparc64 where it is type int. This patch silences the following warnings that are generated when building on it. netdebug.c: In function 'nst_seq_show': netdebug.c:152: warning: format '%lu' expects type 'long unsigned int', but argument 13 has type 'suseconds_t' netdebug.c:152: warning: format '%lu' expects type 'long unsigned int', but argument 15 has type 'suseconds_t' netdebug.c:152: warning: format '%lu' expects type 'long unsigned int', but argument 17 has type 'suseconds_t' netdebug.c: In function 'sc_seq_show': netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 19 has type 'suseconds_t' netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 21 has type 'suseconds_t' netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 23 has type 'suseconds_t' netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 25 has type 'suseconds_t' netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 27 has type 'suseconds_t' netdebug.c:332: warning: format '%lu' expects type 'long unsigned int', but argument 29 has type 'suseconds_t' Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-07-14 13:57:15 -07:00
Wengang Wang	01af482037	ocfs2: Handle error during journal load This patch ensures the mount fails if the fs is unable to load the journal. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Acked-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-07-14 13:57:15 -07:00
Sunil Mushran	56753bd3b9	ocfs2: Silence an error message in ocfs2_file_aio_read() This patch silences an EINVAL error message in ocfs2_file_aio_read() that is always due to a user error. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-07-14 13:57:15 -07:00
Akinobu Mita	7600c72b75	ocfs2: use simple_read_from_buffer() Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-07-14 13:57:15 -07:00
Randy Dunlap	dd25e55ea1	ocfs2: fix printk format warnings with OCFS2_FS_STATS=n Fix printk format warnings when OCFS2_FS_STATS=n: linux-next-20080528/fs/ocfs2/dlmglue.c: In function 'ocfs2_dlm_seq_show': linux-next-20080528/fs/ocfs2/dlmglue.c:2623: warning: format '%llu' expects type 'long long unsigned int', but argument 3 has type 'int' linux-next-20080528/fs/ocfs2/dlmglue.c:2623: warning: format '%llu' expects type 'long long unsigned int', but argument 4 has type 'int' linux-next-20080528/fs/ocfs2/dlmglue.c:2623: warning: format '%llu' expects type 'long long unsigned int', but argument 7 has type 'int' linux-next-20080528/fs/ocfs2/dlmglue.c:2623: warning: format '%llu' expects type 'long long unsigned int', but argument 8 has type 'int' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-07-14 13:57:14 -07:00
Sunil Mushran	8ddb7b004d	[PATCH 2/2] ocfs2: Instrument fs cluster locks This patch adds code to track the number of times the fs takes various cluster locks as well as the times associated with it. The information is made available to users via debugfs. This patch was originally written by Jan Kara <jack@suse.cz>. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-07-14 13:57:14 -07:00
Sunil Mushran	ce7231e92d	[PATCH 1/2] ocfs2: Add CONFIG_OCFS2_FS_STATS config option This patch adds config option CONFIG_OCFS2_FS_STATS to allow building the fs with instrumentation enabled. An upcoming patch will provide support to instrument cluster locking, which is a crucial overhead in a cluster file system. This config option allows users to avoid the cpu and memory overhead that is involved in gathering such statistics. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-07-14 13:57:14 -07:00
Linus Torvalds	a3da5bf84a	Merge branch 'x86/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (821 commits) x86: make 64bit hpet_set_mapping to use ioremap too, v2 x86: get x86_phys_bits early x86: max_low_pfn_mapped fix #4 x86: change _node_to_cpumask_ptr to return const ptr x86: I/O APIC: remove an IRQ2-mask hack x86: fix numaq_tsc_disable calling x86, e820: remove end_user_pfn x86: max_low_pfn_mapped fix, #3 x86: max_low_pfn_mapped fix, #2 x86: max_low_pfn_mapped fix, #1 x86_64: fix delayed signals x86: remove conflicting nx6325 and nx6125 quirks x86: Recover timer_ack lost in the merge of the NMI watchdog x86: I/O APIC: Never configure IRQ2 x86: L-APIC: Always fully configure IRQ0 x86: L-APIC: Set IRQ0 as edge-triggered x86: merge dwarf2 headers x86: use AS_CFI instead of UNWIND_INFO x86: use ignore macro instead of hash comment x86: use matching CFI_ENDPROC ...	2008-07-14 13:43:24 -07:00
Linus Torvalds	847106ff62	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (25 commits) security: remove register_security hook security: remove dummy module fix security: remove dummy module security: remove unused sb_get_mnt_opts hook LSM/SELinux: show LSM mount options in /proc/mounts SELinux: allow fstype unknown to policy to use xattrs if present security: fix return of void-valued expressions SELinux: use do_each_thread as a proper do/while block SELinux: remove unused and shadowed addrlen variable SELinux: more user friendly unknown handling printk selinux: change handling of invalid classes (Was: Re: 2.6.26-rc5-mm1 selinux whine) SELinux: drop load_mutex in security_load_policy SELinux: fix off by 1 reference of class_to_string in context_struct_compute_av SELinux: open code sidtab lock SELinux: open code load_mutex SELinux: open code policy_rwlock selinux: fix endianness bug in network node address handling selinux: simplify ioctl checking SELinux: enable processes with mac_admin to get the raw inode contexts Security: split proc ptrace checking into read vs. attach ...	2008-07-14 13:36:55 -07:00
Linus Torvalds	dddec01eb8	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: (37 commits) splice: fix generic_file_splice_read() race with page invalidation ramfs: enable splice write drivers/block/pktcdvd.c: avoid useless memset cdrom: revert commit `22a9189` (cdrom: use kmalloced buffers instead of buffers on stack) scsi: sr avoids useless buffer allocation block: blk_rq_map_kern uses the bounce buffers for stack buffers block: add blk_queue_update_dma_pad DAC960: push down BKL pktcdvd: push BKL down into driver paride: push ioctl down into driver block: use get_unaligned_* helpers block: extend queue_flag bitops block: request_module(): use format string Add bvec_merge_data to handle stacked devices and ->merge_bvec() block: integrity flags can't use bit ops on unsigned short cmdfilter: extend default read filter sg: fix odd style (extra parenthesis) introduced by cmd filter patch block: add bounce support to blk_rq_map_user_iov cfq-iosched: get rid of enable_idle being unused warning allow userspace to modify scsi command filter on per device basis ...	2008-07-14 13:15:14 -07:00
Marcel Holtmann	40be492fe4	[Bluetooth] Export details about authentication requirements With the Simple Pairing support, the authentication requirements are an explicit setting during the bonding process. Track and enforce the requirements and allow higher layers like L2CAP and RFCOMM to increase them if needed. This patch introduces a new IOCTL that allows to query the current authentication requirements. It is also possible to detect Simple Pairing support in the kernel this way. Signed-off-by: Marcel Holtmann <marcel@holtmann.org>	2008-07-14 20:13:50 +02:00
Artem Bityutskiy	4ee6afd344	VFS: export sync_sb_inodes This patch exports the 'sync_sb_inodes()' which is needed for UBIFS because it has to force write-back from time to time. Namely, the UBIFS budgeting subsystem forces write-back when its pessimistic callculations show that there is no free space on the media. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2008-07-14 19:10:52 +03:00
Hans Reiser	ae8547b0a9	VFS: move inode_lock into sync_sb_inodes This patch makes 'sync_sb_inodes()' lock 'inode_lock', rather than expect that the caller will do this. This change was previously done by Hans Reiser <reiser@namesys.com> and sat in the -mm tree. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>	2008-07-14 19:10:52 +03:00
Ingo Molnar	d59fdcf2ac	Merge commit 'v2.6.26' into x86/core	2008-07-14 11:37:46 +02:00
Eric Paris	2069f45784	LSM/SELinux: show LSM mount options in /proc/mounts This patch causes SELinux mount options to show up in /proc/mounts. As with other code in the area seq_put errors are ignored. Other LSM's will not have their mount options displayed until they fill in their own security_sb_show_options() function. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: James Morris <jmorris@namei.org>	2008-07-14 15:02:05 +10:00
Stephen Smalley	006ebb40d3	Security: split proc ptrace checking into read vs. attach Enable security modules to distinguish reading of process state via proc from full ptrace access by renaming ptrace_may_attach to ptrace_may_access and adding a mode argument indicating whether only read access or full attach access is requested. This allows security modules to permit access to reading process state without granting full ptrace access. The base DAC/capability checking remains unchanged. Read access to /proc/pid/mem continues to apply a full ptrace attach check since check_mem_permission() already requires the current task to already be ptracing the target. The other ptrace checks within proc for elements like environ, maps, and fds are changed to pass the read mode instead of attach. In the SELinux case, we model such reading of process state as a reading of a proc file labeled with the target process' label. This enables SELinux policy to permit such reading of process state without permitting control or manipulation of the target process, as there are a number of cases where programs probe for such information via proc but do not need to be able to control the target (e.g. procps, lsof, PolicyKit, ConsoleKit). At present we have to choose between allowing full ptrace in policy (more permissive than required/desired) or breaking functionality (or in some cases just silencing the denials via dontaudit rules but this can hide genuine attacks). This version of the patch incorporates comments from Casey Schaufler (change/replace existing ptrace_may_attach interface, pass access mode), and Chris Wright (provide greater consistency in the checking). Note that like their predecessors __ptrace_may_attach and ptrace_may_attach, the __ptrace_may_access and ptrace_may_access interfaces use different return value conventions from each other (0 or -errno vs. 1 or 0). I retained this difference to avoid any changes to the caller logic but made the difference clearer by changing the latter interface to return a bool rather than an int and by adding a comment about it to ptrace.h for any future callers. Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: James Morris <jmorris@namei.org>	2008-07-14 15:01:47 +10:00
Jeff Layton	536abdb080	cifs: fix wksidarr declaration to be big-endian friendly The current definition of wksidarr works fine on little endian arches (since cpu_to_le32 is a no-op there), but on big-endian arches, it fails to compile with this error: error: braced-group within expression allowed only inside a function The problem is that this static declaration has cpu_to_le32 embedded within it, and that expands into a function macro. We need to use __constant_cpu_to_le32() instead. Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: Steven French <sfrench@us.ibm.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-12 14:33:42 -07:00
Jeff Layton	e911d0cc87	cifs: fix inode leak in cifs_get_inode_info_unix Try this: mount a share with unix extensions create a file on it umount the share You'll get the following message in the ring buffer: VFS: Busy inodes after unmount of cifs. Self-destruct in 5 seconds. Have a nice day... ...the problem is that cifs_get_inode_info_unix is creating and hashing a new inode even when it's going to return error anyway. The first lookup when creating a file returns an error so we end up leaking this inode before we do the actual create. This appears to be a regression caused by commit `0e4bbde94f`. The following patch seems to fix it for me, and fixes a minor formatting nit as well. Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Steven French <sfrench@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-12 14:33:42 -07:00
Ingo Molnar	ae94b8075a	Merge branch 'linus' into x86/core Conflicts: arch/x86/mm/ioremap.c Signed-off-by: Ingo Molnar <mingo@elte.hu>	2008-07-12 07:29:02 +02:00
Eric Sandeen	e4079a11f5	ext4: do not set extents feature from the kernel We've talked for a while about getting rid of any feature- setting from the kernel; this gets rid of the code which would set the INCOMPAT_EXTENTS flag on the first file write when mounted as ext4[dev]. With this patch, if the extents feature is not already set on disk, then mounting as ext4 will fall back to noextents with a warning, and if -o extents is explicitly requested, the mount will fail, also with warning. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	c07651b556	ext4: Don't allow nonextenst mount option for large filesystem The block mapped inode format can address only blocks within 232. This causes a number of issues, the biggest of which is that the block allocator needs to be taught that certain inodes can not utilize block numbers > 232. So until this is fixed, it is simplest to fail mounting of file systems with more than 2**32 blocks if the -o noextents option is given. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	dd919b9822	ext4: Enable delalloc by default. Enable delalloc by default to ensure it gets sufficient testing and because it makes the filesystem much more efficient. Add a nodealalloc option to disable delayed allocation, and update ext4_show_options to show delayed allocation off if it is disabled. If the data=journal mount option is used, disable delayed allocation since the delalloc code doesn't support data=journal yet. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Mingming Cao <cmm@us.ibm.com>	2008-07-11 19:27:31 -04:00
Mingming Cao	3e3398a08d	ext4: delayed allocation i_blocks fix for stat Right now i_blocks is not getting updated until the blocks are actually allocaed on disk. This means with delayed allocation, right after files are copied, "ls -sF" shoes the file as taking 0 blocks on disk. "du" also shows the files taking zero space, which is highly confusing to the user. Since delayed allocation already keeps track of per-inode total number of blocks that are subject to delayed allocation, this patch fix this by using that to adjust the value returned by stat(2). When real block allocation is done, the i_blocks will get updated. Since the reserved blocks for delayed allocation will be decreased, this will be keep value returned by stat(2) consistent. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Mingming Cao	632eaeab1f	ext4: fix delalloc i_disksize early update issue Ext4_da_write_end() used walk_page_buffers() with a callback function of ext4_bh_unmapped_or_delay() to check if it extended the file size without allocating any blocks (since in this case i_disksize needs to be updated). However, this is didn't work proprely because the buffer head has not been marked dirty yet --- this is done later in block_commit_write() --- which caused ext4_bh_unmapped_or_delay() to always return false. In addition, walk_page_buffers() checks all of the buffer heads covering the page, and the only buffer_head that should be checked is the one covering the end of the write. Otherwise, given a 1k blocksize filesystem and a 4k page size, the buffer head covering the first 1k stripe of the file could be unmapped (because it was a sparse file), and the second or third buffer_head covering that page could be mapped, and using walk_page_buffers() would fail in this case since it would stop at the first unmapped buffer_head and return true. The core problem is that walk_page_buffers() was intended to do work in a callback function, and a non-zero return value indicated a failure, which termined the walk of the buffer heads covering the page. It was not intended to be used with a boolean function, such as ext4_bh_unmapped_or_delay(). Add addtional fix from Aneesh to protect i_disksize update rave with truncate. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	f0e6c98593	ext4: Handle page without buffers in ext4__writepage() It can happen that buffers are removed from the page before it gets marked dirty and then is passed to writepage(). In writepage() we just initialize the buffers and check whether they are mapped and non delay. If they are mapped and non delay we write the page. Otherwise we mark them dirty. With this change we don't do block allocation at all in ext4__write_page. writepage() can get called under many condition and with a locking order of journal_start -> lock_page, we should not try to allocate blocks in writepage() which get called after taking page lock. writepage() can get called via shrink_page_list even with a journal handle which was created for doing inode update. For example when doing ext4_da_write_begin we create a journal handle with credit 1 expecting a i_disksize update for the inode. But ext4_da_write_begin can cause shrink_page_list via _grab_page_cache. So having a valid handle via ext4_journal_current_handle is not a guarantee that we can use the handle for block allocation in writepage, since we shouldn't be using credits that had been reserved for other updates. That it could result in we running out of credits when we update inodes. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	cd1aac3292	ext4: Add ordered mode support for delalloc This provides a new ordered mode implementation which gets rid of using buffer heads to enforce the ordering between metadata change with the related data chage. Instead, in the new ordering mode, it keeps track of all of the inodes touched by each transaction on a list, and when that transaction is committed, it flushes all of the dirty pages for those inodes. In addition, the new ordered mode reverses the lock ordering of the page lock and transaction lock, which provides easier support for delayed allocation. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Mingming Cao	61628a3f3a	ext4: Invert lock ordering of page_lock and transaction start in delalloc With the reverse locking, we need to start a transation before taking the page lock, so in ext4_da_writepages() we need to break the write-out into chunks, and restart the journal for each chunck to ensure the write-out fits in a single transaction. Updated patch from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> which fixes delalloc sync hang with journal lock inversion, and address the performance regression issue. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Mingming Cao	d2a1763791	ext4: delayed allocation ENOSPC handling This patch does block reservation for delayed allocation, to avoid ENOSPC later at page flush time. Blocks(data and metadata) are reserved at da_write_begin() time, the freeblocks counter is updated by then, and the number of reserved blocks is store in per inode counter. At the writepage time, the unused reserved meta blocks are returned back. At unlink/truncate time, reserved blocks are properly released. Updated fix from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> to fix the oldallocator block reservation accounting with delalloc, added lock to guard the counters and also fix the reservation for meta blocks. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-14 17:52:37 -04:00
Mingming Cao	e8ced39d5e	percpu_counter: new function percpu_counter_sum_and_set Delayed allocation need to check free blocks at every write time. percpu_counter_read_positive() is not quit accurate. delayed allocation need a more accurate accounting, but using percpu_counter_sum_positive() is frequently is quite expensive. This patch added a new function to update center counter when sum per-cpu counter, to increase the accurate rate for next percpu_counter_read() and require less calling expensive percpu_counter_sum(). Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Alex Tomas	64769240bd	ext4: Add delayed allocation support in data=writeback mode Updated with fixes from Mingming Cao <cmm@us.ibm.com> to unlock and release the page from page cache if the delalloc write_begin failed, and properly handle preallocated blocks. Also added a fix to clear buffer_delay in block_write_full_page() after allocating a delayed buffer. Updated with fixes from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> to update i_disksize properly and to add bmap support for delayed allocation. Updated with a fix from Valerie Clement <valerie.clement@bull.net> to avoid filesystem corruption when the filesystem is mounted with the delalloc option and blocksize < pagesize. Signed-off-by: Alex Tomas <alex@clusterfs.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2008-07-11 19:27:31 -04:00
Alex Tomas	29a814d2ee	vfs: add hooks for ext4's delayed allocation support Export mpage_bio_submit() and __mpage_writepage() for the benefit of ext4's delayed allocation support. Also change __block_write_full_page so that if buffers that have the BH_Delay flag set it will call get_block() to get the physical block allocated, just as in the !BH_Mapped case. Signed-off-by: Alex Tomas <alex@clusterfs.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Jan Kara	87c89c232c	jbd2: Remove data=ordered mode support using jbd buffer heads Signed-off-by: Jan Kara <jack@suse.cz>	2008-07-11 19:27:31 -04:00
Jan Kara	678aaf4814	ext4: Use new framework for data=ordered mode in JBD2 This patch makes ext4 use inode-based implementation of data=ordered mode in JBD2. It allows us to unify some data=ordered and data=writeback paths (especially writepage since we don't have to start a transaction anymore) and remove some buffer walking. Updated fix from Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> to fix file system hang due to corrupt jinode values. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Jan Kara	c851ed5401	jbd2: Implement data=ordered mode handling via inodes This patch adds necessary framework into JBD2 to be able to track inodes with each transaction and write-out their dirty data during transaction commit time. This new ordered mode brings all sorts of advantages such as possibility to get rid of journal heads and buffer heads for data buffers in ordered mode, better ordering of writes on transaction commit, simplification of some JBD code, no more anonymous pages when truncate of data being committed happens. Also with this new ordered mode, delayed allocation on ordered mode is much simpler. Signed-off-by: Jan Kara <jack@suse.cz>	2008-07-11 19:27:31 -04:00
Jan Kara	9ddfc3dc75	ext4: Fix lock inversion in ext4_ext_truncate() We cannot call ext4_orphan_add() from under i_data_sem because that causes a lock ordering violation between i_data_sem and and the superblock lock. Updated with Aneesh's locking order fix Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Jan Kara	cf108bca46	ext4: Invert the locking order of page_lock and transaction start This changes are needed to support data=ordered mode handling via inodes. This enables us to get rid of the journal heads and buffer heads for data buffers in the ordered mode. With the changes, during tranasaction commit we writeout the inode pages using the writepages()/writepage(). That implies we take page lock during transaction commit. This can cause a deadlock with the locking order page_lock -> jbd2_journal_start, since the jbd2_journal_start can wait for the journal_commit to happen and the journal_commit now needs to take the page lock. To avoid this dead lock reverse the locking order. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Jan Kara	c7d206b337	vfs: Move mark_inode_dirty() from under page lock in generic_write_end() There's no need to call mark_inode_dirty() under page lock in generic_write_end(). It unnecessarily makes hold time of page lock longer and more importantly it forces locking order of page lock and transaction start for journaling filesystems. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	2e9ee85035	ext4: Use page_mkwrite vma_operations to get mmap write notification. We would like to get notified when we are doing a write on mmap section. This is needed with respect to preallocated area. We split the preallocated area into initialzed extent and uninitialzed extent in the call back. This let us handle ENOSPC better. Otherwise we get ENOSPC in the writepage and that would result in data loss. The changes are also needed to handle ENOSPC when writing to an mmap section of files with holes. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Frederic Bohe	5f21b0e642	ext4: fix online resize with mballoc Update group infos when updating a group's descriptor. Add group infos when adding a group's descriptor. Refresh cache pages used by mb_alloc when changes occur. This will probably need modifications when META_BG resizing will be allowed. Signed-off-by: Frederic Bohe <frederic.bohe@bull.net> Signed-off-by: Mingming Cao <cmm@us.ibm.com>	2008-07-11 19:27:31 -04:00
Eric Sandeen	953e622b60	ext4: use atomic functions to set bh_state Use the BUFFER_FNS functions (set_buffer_foo) to set buffer head state atomically instead of nonatomic __set_bit(). Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Jan Kara	47b4a50beb	ext4: Set journal pointer to NULL when journal is released Set sbi->s_journal to NULL after we call journal_destroy(). This will be later needed because after journal_destroy() is called, ext4_clear_inode() can still be called for some inodes (e.g. root inode) and we'll need to detect there that journal doesn't exists anymore. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Mingming Cao	0703143107	ext4: mballoc avoid use root reserved blocks for non root allocation mballoc allocation missed check for blocks reserved for root users. Add ext4_has_free_blocks() check before allocation. Also modified ext4_has_free_blocks() to support multiple block allocation request. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Eric Sandeen	d755fb3842	ext4: call blkdev_issue_flush on fsync To ensure that bits are truly on-disk after an fsync, we should call blkdev_issue_flush if barriers are supported. Inspired by an old thread on barriers, by reiserfs & xfs which do the same, and by a patch SuSE ships with their kernel Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	654b4908bc	ext4: cleanup block allocator Move the code related to block allocation to a single function and add helper funtions to differient allocation for data and meta data blocks Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	7061eba75c	ext4: Use inode preallocation with -o noextents When mballoc is enabled, block allocation for old block-based files are allocated using mballoc allocator instead of old block-based allocator. The old ext3 block reservation is turned off when mballoc is turned on. However, the in-core preallocation is not enabled for block-based/ non-extent based file block allocation. This result in performance regression, as now we don't have "reservation" ore in-core preallocation to prevent interleaved fragmentation in multiple writes workload. This patch fix this by enable per inode in-core preallocation for non extent files when mballoc is used. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Akinobu Mita	6afd670713	ext4: fix ext4_init_block_bitmap() for metablock block group When meta_bg feature is enabled and s_first_meta_bg != 0, ext4_init_block_bitmap() miscalculates the number of block used by the group descriptor table (0 or 1 for metablock block group) This patch fixes this by using ext4_bg_num_gdb() Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Stephen Tweedie <sct@redhat.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Andreas Dilger <adilger@sun.com>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	7477827f66	ext4: Fix sparse warning Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Li Zefan	d9c769b769	ext4: cleanup never-used magic numbers from htree code dx_root_limit() will had some dead code which forced it to always return 20, and dx_node_limit to always return 22 for debugging purposes. Remove it. Acked-by: Andreas Dilger <adilger@sun.com> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Shen Feng	9102e4fa80	ext4: Fix ext4_ext_journal_restart() to reflect errors up to the caller Fix ext4_ext_journal_restart() so it returns any errors reported by ext4_journal_extend() and ext4_journal_restart(). Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Shen Feng	1973adcba5	ext4: Make ext4_ext_find_extent fills ext_path completely When pos=0 or depth, the fields of ext4_ext_path is are not completely filled. This patch also removes some unnecessary code. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Shen Feng	787e0981fa	ext4: return error when calling ext4_ext_split failed ext4_ext_create_new_leaf must return error when its calling to ext4_ext_split failed. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	a379cd1d6b	ext4: Update i_disksize properly when allocating from fallocate area. When allocating unitialized space at the end of file which had been preallocated with the FALLOC_FL_KEEP_SIZE option, the file size is not updated at that time. But the later we are not updating the file size when writing to that preallocated space. These changes are for code correctness. This patch allows us to update the i_disksize at the write_end() callback of filesystem properly. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Shen Feng	363d4251d4	ext4: remove quota allocation when ext4_mb_new_blocks fails Quota allocation is not removed when ext4_mb_new_blocks calls kmem_cache_alloc failed. Also make sure the allocation context is freed on the error path. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Li Zefan	f9a8ac99fd	ext4: remove redundant code in ext4_fill_super() The previous sb_min_blocksize() has already set the block size. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Mingming Cao	530576bbf3	jbd2: fix race between jbd2_journal_try_to_free_buffers() and jbd2 commit transaction journal_try_to_free_buffers() could race with jbd commit transaction when the later is holding the buffer reference while waiting for the data buffer to flush to disk. If the caller of journal_try_to_free_buffers() request tries hard to release the buffers, it will treat the failure as error and return back to the caller. We have seen the directo IO failed due to this race. Some of the caller of releasepage() also expecting the buffer to be dropped when passed with GFP_KERNEL mask to the releasepage()->journal_try_to_free_buffers(). With this patch, if the caller is passing the GFP_KERNEL to indicating this call could wait, in case of try_to_free_buffers() failed, let's waiting for journal_commit_transaction() to finish commit the current committing transaction , then try to free those buffers again with journal locked. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Reviewed-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-13 21:06:39 -04:00
Jose R. Santos	772cb7c83b	ext4: New inode allocation for FLEX_BG meta-data groups. This patch mostly controls the way inode are allocated in order to make ialloc aware of flex_bg block group grouping. It achieves this by bypassing the Orlov allocator when block group meta-data are packed toghether through mke2fs. Since the impact on the block allocator is minimal, this patch should have little or no effect on other block allocation algorithms. By controlling the inode allocation, it can basically control where the initial search for new block begins and thus indirectly manipulate the block allocator. This allocator favors data and meta-data locality so the disk will gradually be filled from block group zero upward. This helps improve performance by reducing seek time. Since the group of inode tables within one flex_bg are treated as one giant inode table, uninitialized block groups would not need to partially initialize as many inode table as with Orlov which would help fsck time as the filesystem usage goes up. Signed-off-by: Jose R. Santos <jrs@us.ibm.com> Signed-off-by: Valerie Clement <valerie.clement@bull.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Theodore Ts'o	736603ab29	jbd2: Add commit time into the commit block Carlo Wood has demonstrated that it's possible to recover deleted files from the journal. Something that will make this easier is if we can put the time of the commit into commit block. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Stoyan Gaydarov	4db9c54a53	ext4: replace __FUNCTION__ occurrences __FUNCTION__ is gcc-specific, use __func__ instead Signed-off-by: Stoyan Gaydarov <stoyboyker@gmail.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-13 21:03:29 -04:00
Shen Feng	7e5a8cdd84	ext4: fix error processing in mb_free_blocks The error processing of the return value of mb_free_blocks is meanless because it only returns 0. This fix includes - make mb_free_blocks return void - remove the error processing part in callers - unlock group before calling ext4_error in mb_free_blocks Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Cc: Mingming Cao <cmm@us.ibm.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-13 21:03:31 -04:00
Shen Feng	cfbe7e4f5e	ext4: error proc entry creation when the fs/ext4 is not correctly created When the directory fs/ext4 is not correctly created under proc, the entry under this directory should not be created. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-13 21:03:31 -04:00
Li Zefan	f795e14073	ext4: fix build failure if DX_DEBUG is enabled ext4_next_entry() is used by the debugging function dx_show_leaf(), so it must be defined before that function. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Theodore Ts'o	7ad72ca60b	ext4: Remove unused variable from ext4_show_options Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Theodore Ts'o	574ca174c9	ext4: Rename read_block_bitmap() to ext4_read_block_bitmap() Since this a non-static function, make it be ext4 specific to avoid conflicts with potentially other filesystems. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Shen Feng	3537576a70	ext4: remove double definitions of xattr macros remove the definitions of macros XATTR_TRUSTED_PREFIX and XATTR_USER_PREFIX since they are defined in linux/xattr.h Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Shen Feng	74767c5a2d	ext4: miscellaneous error checks and coding cleanups for mballoc ext4_mb_seq_history_open(): check if sbi->s_mb_history is NULL ext4_mb_history_init(): replace kmalloc and memset with kzalloc ext4_mb_init_backend(): remove memset since kzalloc is used ext4_mb_init(): the return value of ext4_mb_init_backend is int, but i is unsigned, replace it with a new int variable. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Shen Feng	fdf6c7a768	ext4: add error processing when calling ext4_mb_init_cache in mballoc Add error processing for ext4_mb_load_buddy when it calls ext4_mb_init_cache. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Mingming Cao	31b481dc7c	ext4: Fix ext4_mb_init_cache return error ext4_mb_init_cache() incorrectly always return EIO on success. This causes the caller of ext4_mb_init_cache() fail when it checks the return value. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Shen Feng	69baee062a	ext4: improve some code in rb tree part of dir.c * remove unnecessary code in free_rb_tree_fname * rename free_rb_tree_fname to ext4_htree_create_dir_info since it and ext4_htree_free_dir_info are a pair * replace kmalloc with kzalloc in ext4_htree_free_dir_info All these make the code more readable and simple. PS: this patch is also suitable for ext3. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Alexey Dobriyan	91d9982779	ext4: switch to seq_files Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2008-07-11 19:27:31 -04:00
Julia Lawall	07d45f1267	ext4: Use BUG_ON() instead of BUG() if (...) BUG(); should be replaced with BUG_ON(...) when the test has no side-effects to allow a definition of BUG_ON that drops the code completely. The semantic patch that makes this change is as follows: (http://www.emn.fr/x-info/coccinelle/) // <smpl> @ disable unlikely @ expression E,f; @@ ( if (<... f(...) ...>) { BUG(); } \| - if (unlikely(E)) { BUG(); } + BUG_ON(E); ) @@ expression E,f; @@ ( if (<... f(...) ...>) { BUG(); } \| - if (E) { BUG(); } + BUG_ON(E); ) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	ed8f9c751f	ext4: start searching for the right extent from the goal group. With mballoc we search for the best extent using different criteria. We should always use the goal group when we are starting with a new criteria. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Shen Feng	8a35694e11	ext4: fix comments to say "ext4" Change second/third to fourth. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Aneesh Kumar K.V	e7dfb2463e	ext4: Fix mb_find_next_bit not to return larger than max Some architectures implement ext4_find_next_bit and ext4_find_next_zero_bit in such a way that they return greater than max for some input values. Make sure mb_find_next_bit and mb_find_next_zero_bit return the right values. On 2.6.25 we have include/asm-x86/bitops_32.h static inline unsigned find_first_bit(const unsigned long addr, unsigned size) { unsigned x = 0; while (x < size) { unsigned long val = addr++; if (val) return __ffs(val) + x; x += (sizeof(*addr)<<3); } return x; } This can return value greater than size. Reported and fixed here for lustre https://bugzilla.lustre.org/show_bug.cgi?id=15932 https://bugzilla.lustre.org/attachment.cgi?id=17205 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Duane Griffin	f3b35f063e	ext4: validate directory entry data before use ext4_dx_find_entry uses ext4_next_entry without verifying that the entry is valid. If its rec_len == 0 this causes an infinite loop. Refactor the loop to check the validity of entries before checking whether they match and moving onto the next one. There are other uses of ext4_next_entry in this file which also look problematic. They should be reviewed and fixed if/when we have a test-case that triggers them. This patch fixes the first case (image hdb.25.softlockup.gz) reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882. Signed-off-by: Duane Griffin <duaneg@dghda.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Duane Griffin	71dc8fbcf5	ext4: handle deleting corrupted indirect blocks While freeing indirect blocks we attach a journal head to the parent buffer head, free the blocks, then journal the parent. If the indirect block list is corrupted and points to the parent the journal head will be detached when the block is cleared, causing an OOPS. Check for that explicitly and handle it gracefully. This patch fixes the third case (image hdb.20000057.nullderef.gz) reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882. Signed-off-by: Duane Griffin <duaneg@dghda.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Duane Griffin	91ef4caf80	ext4: handle corrupted orphan list at mount If the orphan node list includes valid, untruncatable nodes with nlink > 0 the ext4_orphan_cleanup loop which attempts to delete them will not do so, causing it to loop forever. Fix by checking for such nodes in the ext4_orphan_get function. This patch fixes the second case (image hdb.20000009.softlockup.gz) reported in http://bugzilla.kernel.org/show_bug.cgi?id=10882. Signed-off-by: Duane Griffin <duaneg@dghda.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2008-07-11 19:27:31 -04:00
Dave Chinner	49641f1acf	Fix reference counting race on log buffers When we release the iclog, we do an atomic_dec_and_lock to determine if we are the last reference and need to trigger update of log headers and writeout. However, in xlog_state_get_iclog_space() we also need to check if we have the last reference count there. If we do, we release the log buffer, otherwise we decrement the reference count. But the compare and decrement in xlog_state_get_iclog_space() is not atomic, so both places can see a reference count of 2 and neither will release the iclog. That leads to a filesystem hang. Close the race by replacing the atomic_read() and atomic_dec() pair with atomic_add_unless() to ensure that they are executed atomically. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Tim Shimmin <tes@sgi.com> Tested-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-11 11:37:18 -07:00
David Howells	4abaca17e7	[GFS2] Fix GFS2's use of do_div() in its quota calculations Fix GFS2's need_sync()'s use of do_div() on an s64 by using div_s64() instead. This does assume that gt_quota_scale_den can be cast to an s32. This was introduced by patch `b3b94faa5f`. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2008-07-11 14:35:01 +01:00
Hugh Dickins	96a8e13ed4	exec: fix stack excutability without PT_GNU_STACK Kernel Bugzilla #11063 points out that on some architectures (e.g. x86_32) exec'ing an ELF without a PT_GNU_STACK program header should default to an executable stack; but this got broken by the unlimited argv feature because stack vma is now created before the right personality has been established: so breaking old binaries using nested function trampolines. Therefore re-evaluate VM_STACK_FLAGS in setup_arg_pages, where stack vm_flags used to be set, before the mprotect_fixup. Checking through our existing VM_flags, none would have changed since insert_vm_struct: so this seems safer than finding a way through the personality labyrinth. Reported-by: pageexec@freemail.hu Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2008-07-10 13:25:43 -07:00
Mark Fasheh	e988cf1cfe	ocfs2: Fix flags in ocfs2_file_lock The stack-glue merge changed the way we use flags in dlmglue in that we now use the fs/dlm equivalents. Unfortunately, a merge error left the new flock code only partially updated. This took a while to show up though, because the lock level constants are actually identical between o2dlm and fs/dlm. The _CONVERT and _NOQUEUE flags have different values though, which is eventually causing a crash in flags_to_o2dlm(). Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2008-07-10 09:25:39 -07:00
Li Xiaodong	a93a6ce242	[GFS2] Remove unused declaration The implementation of gfs2_inode_attr_in is removed. So remove its declaration. Signed-off-by: Li Xiaodong <lixd@cn.fujitsu.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2008-07-10 16:22:23 +01:00
Steven Whitehouse	c9f6a6bbc2	[GFS2] Remove support for unused and pointless flag The ability to mark files for direct i/o access when opened normally is both unused and pointless, so this patch removes support for that feature. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2008-07-10 16:09:29 +01:00
Steven Whitehouse	9cabcdbd46	[GFS2] Replace rgrp "recent list" with mru list This patch removes the "recent list" which is used during allocation and replaces it with the (already existing) mru list used during deletion. The "recent list" was not a true mru list leading to a number of inefficiencies including a "next" function which made scanning the list an order N^2 operation wrt to the number of list elements. This should increase allocation performance with large numbers of rgrps. Its also a useful preparation and cleanup before some further changes which are planned in this area. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2008-07-10 15:54:12 +01:00
Chuck Lever	f45663ce5f	NFS: Allow either strict or sloppy mount option parsing The kernel's NFS client mount option parser currently doesn't allow unrecognized or incorrect mount options. This prevents misspellings or incorrectly specified mount options from possibly causing silent data corruption. However, NFS mount options are not standardized, so different operating systems can use differently spelled mount options to support similar features, or can support mount options which no other operating system supports. "Sloppy" mount option parsing, which allows the parser to ignore any option it doesn't recognize, is needed to support automounters that often use maps that are shared between heterogenous operating systems. The legacy mount command ignores the validity of the values of mount options entirely, except for the "sec=" and "proto=" options. If an incorrect value is specified, the out-of-range value is passed to the kernel; if a value is specified that contains non-numeric characters, it appears as though the legacy mount command sets that option to zero (probably incorrect behavior in general). In any case, this sets a precedent which we will partially follow for the kernel mount option parser: + if "sloppy" is not set, the parser will be strict about both unrecognized options (same as legacy) and invalid option values (stricter than legacy) + if "sloppy" is set, the parser will ignore unrecognized options and invalid option values (same as legacy) An "invalid" option value in this case means that either the type (integer, short, or string) or sign (for integer values) of the specified value is incorrect. This patch does two things: it changes the NFS client's mount option parsing loop so that it parses the whole string instead of failing at the first unrecognized option or invalid option value. An unrecognized option or an invalid option value cause the option to be skipped. Then, the patch adds a "sloppy" mount option that allows the parsing to succeed anyway if there were any problems during parsing. When parsing a set of options is complete, if there are errors and "sloppy" was specified, return success anyway. Otherwise, only return success if there are no errors. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-09 12:09:44 -04:00
Chuck Lever	6738b2512b	NFS4: Set security flavor default for NFSv4 mounts like other defaults Set the default security flavor when we set the other mount option default values for NFSv4. This cleans up the NFSv4 mount option parsing path to look like the NFSv2/v3 one. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-09 12:09:43 -04:00
Chuck Lever	dd07c94750	NFS: Set security flavor default for NFSv2/3 mounts like other defaults Set the default security flavor when we set the other mount option default values. After this change, only the legacy user-space mount path needs to set the NFS_MOUNT_SECFLAVOUR flag. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-09 12:09:42 -04:00
Chuck Lever	01060c896e	NFS: Refactor logic for parsing NFS security flavor mount options Clean up: Refactor the NFS mount option parsing function to extract the security flavor parsing logic into a separate function. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-09 12:09:41 -04:00
Chuck Lever	0e0cab744b	NFS: use documenting macro constants for initializing ac{reg, dir}{min, max} Clean up. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-09 12:09:40 -04:00
Chuck Lever	ed596a8adb	NFS: Move the nfs_set_port() call out of nfs_parse_mount_options() The remount path does not need to set the port in the server address. Since it's not really a part of option parsing, move the nfs_set_port() call to nfs_parse_mount_options()'s callers. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-09 12:09:39 -04:00
Trond Myklebust	259875efed	NFS: set transport defaults after mount option parsing is finished Move the UDP/TCP default timeo/retrans settings for text mounts to nfs_init_timeout_values(), which was were they were always being initialised (and sanity checked) for binary mounts. Document the default timeout values using appropriate #defines. Ensure that we initialise and sanity check the transport protocols that may have been specified by the user. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-09 12:09:38 -04:00
Chuck Lever	40fef8a649	SUNRPC: Use only rpcbind v2 for AF_INET requests Some server vendors support the higher versions of rpcbind only for AF_INET6. The kernel doesn't need to use v3 or v4 for AF_INET anyway, so change the kernel's rpcbind client to query AF_INET servers over rpcbind v2 only. This has a few interesting benefits: 1. If the rpcbind request is going over TCP, and the server doesn't support rpcbind versions 3 or 4, the client reduces by two the number of ephemeral ports left in TIME_WAIT for each rpcbind request. This will help during NFS mount storms. 2. The rpcbind interaction with servers that don't support rpcbind versions 3 or 4 will use less network traffic. Also helpful during mount storms. 3. We can eliminate the kernel build option that controls whether the kernel's rpcbind client uses rpcbind version 3 and 4 for AF_INET servers. Less complicated kernel configuration... Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-09 12:09:37 -04:00
Jeff Layton	5afc597c5f	nfs4: fix potential race with rapid nfs_callback_up/down cycle If the nfsv4 callback thread is rapidly brought up and down, it's possible that nfs_callback_svc might never get a chance to run. If this happens, the cleanup at thread exit might never occur, throwing the refcounting off and nfs_callback_info in an incorrect state. Move the clean functions into nfs_callback_down. Also change the nfs_callback_info struct to track the svc_rqst rather than svc_serv since we need to know that to call svc_exit_thread. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-09 12:09:32 -04:00
Jeff Layton	ee84dfc454	nfs4: remove BKL from nfs_callback_up and nfs_callback_down The nfs_callback_mutex is sufficient protection. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2008-07-09 12:09:31 -04:00

... 2 3 4 5 6 ...

9615 Commits (b1da47e29e467f1ec36dc78d009bfb109fd533c7)