Commit graph

27622 commits

Author SHA1 Message Date
Chris Mason
cd1cfc4915 Btrfs: add a barrier before a waitqueue_active check
We were missing wakeups on the delayed ref waitqueue due
to races on waitqueue_active.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-25 16:15:08 -04:00
Chris Mason
e9fbcb4220 Btrfs: call the ordered free operation without any locks held
Each ordered operation has a free callback, and this was called with the
worker spinlock held.  Josef made the free callback also call iput,
which we can't do with the spinlock.

This drops the spinlock for the free operation and grabs it again before
moving through the rest of the list.  We'll circle back around to this
and find a cleaner way that doesn't bounce the lock around so much.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
cc: stable@kernel.org
2012-07-25 16:15:07 -04:00
Mitch Harder
2b0ce2c290 Btrfs: Check INCOMPAT flags on remount and add helper function
In support of the recently added capability to remount with lzo
compression, provide a helper function to check the compression
INCOMPAT flags when remounting with lzo compression, and set
the flags if necessary.

Also, implement the new helper function when defragmenting with
explicit lzo compression and when setting the default subvolume.

Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-25 16:14:31 -04:00
Chris Mason
b478b2baa3 Merge branch 'qgroup' of git://git.jan-o-sch.net/btrfs-unstable into for-linus
Conflicts:
	fs/btrfs/ioctl.c
	fs/btrfs/ioctl.h
	fs/btrfs/transaction.c
	fs/btrfs/transaction.h

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-25 16:11:38 -04:00
Liu Bo
67c9684f48 Btrfs: improve multi-thread buffer read
While testing with my buffer read fio jobs[1], I find that btrfs does not
perform well enough.

Here is a scenario in fio jobs:

We have 4 threads, "t1 t2 t3 t4", starting to buffer read a same file,
and all of them will race on add_to_page_cache_lru(), and if one thread
successfully puts its page into the page cache, it takes the responsibility
to read the page's data.

And what's more, reading a page needs a period of time to finish, in which
other threads can slide in and process rest pages:

     t1          t2          t3          t4
   add Page1
   read Page1  add Page2
     |         read Page2  add Page3
     |            |        read Page3  add Page4
     |            |           |        read Page4
-----|------------|-----------|-----------|--------
     v            v           v           v
    bio          bio         bio         bio

Now we have four bios, each of which holds only one page since we need to
maintain consecutive pages in bio.  Thus, we can end up with far more bios
than we need.

Here we're going to
a) delay the real read-page section and
b) try to put more pages into page cache.

With that said, we can make each bio hold more pages and reduce the number
of bios we need.

Here is some numbers taken from fio results:
         w/o patch                 w patch
       -------------  --------  ---------------
READ:    745MB/s        +25%       934MB/s

[1]:
[global]
group_reporting
thread
numjobs=4
bs=32k
rw=read
ioengine=sync
directory=/mnt/btrfs/

[READ]
filename=foobar
size=2000M
invalidate=1

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:10 -04:00
Liu Bo
df57dbe6bf Btrfs: make btrfs's allocation smoothly with preallocation
For backref walking, we've introduce delayed ref's sequence.  However,
it changes our preallocation behavior.

The story is that when we preallocate an extent and then mark it written
piece by piece, the ideal case should be that we don't need to COW the
extent, which is why we use 'preallocate'.

But we may not make use of preallocation, since when we check for cross refs on
the extent, we may have two ref entries which have the same content except
the sequence value, and we recognize them as cross refs and do COW to allocate
another extent.

So we end up with several pieces of space instead of an whole extent.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:10 -04:00
Josef Bacik
51561ffec9 Btrfs: lock the transition from dirty to writeback for an eb
There is a small window where an eb can have no IO bits set on it, which
could potentially result in extent_buffer_under_io() returning false when we
want it to return true, which could result in not fun things happening.  So
in order to protect this case we need to hold the refs_lock when we make
this transition to make sure we get reliable results out of
extent_buffer_udner_io().  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:09 -04:00
Josef Bacik
594831c4b2 Btrfs: fix potential race in extent buffer freeing
This sounds sort of impossible but it is the only thing I can think of and
at the very least it is theoretically possible so here it goes.

If we are in try_release_extent_buffer we will check that the ref count on
the extent buffer is 1 and not under IO, and then go down and clear the tree
ref.  If between this check and clearing the tree ref somebody else comes in
and grabs a ref on the eb and the marks it dirty before
try_release_extent_buffer() does it's tree ref clear we can end up with a
dirty eb that will be freed while it is still dirty which will result in a
panic.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:09 -04:00
Josef Bacik
e64860aa05 Btrfs: don't return true in releasepage unless we actually freed the eb
I noticed while looking at an extent_buffer race that we will
unconditionally return 1 if we get down to release_extent_buffer after
clearing the tree ref.  However we can easily race in here and get a ref on
the eb and not actually free the eb.  So make release_extent_buffer return 1
if it free'd the eb and 0 if not so we can be a little kinder to the vm.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:08 -04:00
Stefan Behrens
a98cdb85b9 Btrfs: suppress printk() if all device I/O stats are zero
Code is added to suppress the I/O stats printing at mount time if all
statistic values are zero.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-07-23 16:28:07 -04:00
Stefan Behrens
5021976d8d Btrfs: remove unwanted printk() for btrfs device I/O stats
People complained about the annoying kernel log message
"btrfs: no dev_stats entry found ... (OK on first mount after mkfs)"
everytime a filesystem is mounted for the first time after running
mkfs. Since the distribution of the btrfs-progs is not synchronized
to the kernel version, mkfs like it is now will be used also in the
future. Then this message is not useful to find errors, it is just
annoying. This commit removes the printk().

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-07-23 16:28:07 -04:00
Li Zefan
18077bb413 Btrfs: rewrite BTRFS_SETGET_FUNCS
BTRFS_SETGET_FUNCS macro is used to generate btrfs_set_foo() and
btrfs_foo() functions, which read and write specific fields in the
extent buffer.

The total number of set/get functions is ~200, but in fact we only
need 8 functions: 2 for u8 field, 2 for u16, 2 for u32 and 2 for u64.

It results in redunction of ~37K bytes.

   text    data     bss     dec     hex filename
 629661   12489     216  642366   9cd3e fs/btrfs/btrfs.o.orig
 592637   12489     216  605342   93c9e fs/btrfs/btrfs.o

Signed-off-by: Li Zefan <lizefan@huawei.com>
2012-07-23 16:28:06 -04:00
Li Zefan
293f7e0740 Btrfs: zero unused bytes in inode item
The otime field is not zeroed, so users will see random otime in an old
filesystem with a new kernel which has otime support in the future.

The reserved bytes are also not zeroed, and we'll have compatibility
issue if we make use of those bytes.

Signed-off-by: Li Zefan <lizefan@huawei.com>
2012-07-23 16:28:05 -04:00
Li Zefan
b4d7c3c945 Btrfs: kill free_space pointer from inode structure
Inodes always allocate free space with BTRFS_BLOCK_GROUP_DATA type,
which means every inode has the same BTRFS_I(inode)->free_space pointer.

This shrinks struct btrfs_inode by 4 bytes (or 8 bytes on 64 bits).

Signed-off-by: Li Zefan <lizefan@huawei.com>
2012-07-23 16:28:05 -04:00
Anand Jain
d5b025d510 btrfs read error corrected message floods the console during recovery
Changing printk_in_rcu to printk_ratelimited_in_rcu will suffice

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:04 -04:00
Jan Schmidt
e6466e354a Btrfs: fix buffer leak in btrfs_next_old_leaf
When calling btrfs_next_old_leaf, we were leaking an extent buffer in the
rare case of using the deadlock avoidance code needed for the tree mod log.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:03 -04:00
Liu Bo
f6175efab1 Btrfs: do not count in readonly bytes
If a block group is ro, do not count its entries in when we dump space info.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:03 -04:00
Liu Bo
799ffc3c31 Btrfs: add ro notification to dump_space_info
Block group has ro attributes, make dump_space_info show it.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:02 -04:00
Liu Bo
cf7c1ef6e1 Btrfs: fix a bug of writting free space cache during balance
Here is the whole story:
1)
A free space cache consists of two parts:
o  free space cache inode, which is special becase it's stored in root tree.
o  free space info, which is stored as the above inode's file data.

But we only build up another new inode and does not flush its free space info
onto disk when we _clear and setup_ free space cache, and this ends up with
that the block group cache's cache_state remains DC_SETUP instead of DC_WRITTEN.

And holding DC_SETUP means that we will not truncate this free space cache inode,
which means the disk offset of its file extent will remain _unchanged_ at least
until next transaction finishes committing itself.

2)
We can set a block group readonly when we relocate the block group.

However,
if the readonly block group covers the disk offset where our free space cache
inode is going to write, it will force the free space cache inode into
cow_file_range() and it'll end up hitting a BUG_ON.

3)
Due to the above analysis, we fix this bug by adding the missing dirty flag.

4)
However, it's not over, there is still another case, nospace_cache.

With nospace_cache, we do not want to set dirty flag, instead we just truncate
free space cache inode and bail out with setting cache state DC_WRITTEN.

We can benifit from it since it saves us another 'pre-allocation' part which
usually costs a lot.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:02 -04:00
Liu Bo
0678938423 Btrfs: do not abort transaction in prealloc case
During disk balance, we prealloc new file extent for file data relocation,
but we may fail in 'no available space' case, and it leads to flipping btrfs
into readonly.

It is not necessary to bail out and abort transaction since we do have several
ways to rescue ourselves from ENOSPC case.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:01 -04:00
Liu Bo
83eea1f1ba Btrfs: kill root from btrfs_is_free_space_inode
Since root can be fetched via BTRFS_I macro directly, we can save an args
for btrfs_is_free_space_inode().

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:00 -04:00
Liu Bo
51a8cf9d2d Btrfs: fix btrfs_is_free_space_inode to recognize btree inode
For btree inode, its root is also 'tree root', so btree inode can be
misunderstood as a free space inode.

We should add one more check for btree inode.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:28:00 -04:00
Stefan Behrens
c0901581ad Btrfs: avoid I/O repair BUG() from btree_read_extent_buffer_pages()
From btree_read_extent_buffer_pages(), currently repair_io_failure()
can be called with mirror_num being zero when submit_one_bio() returned
an error before. This used to cause a BUG_ON(!mirror_num) in
repair_io_failure() and indeed this is not a case that needs the I/O
repair code to rewrite disk blocks.
This commit prevents calling repair_io_failure() in this case and thus
avoids the BUG_ON() and malfunction.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:59 -04:00
Josef Bacik
f4c738c2e7 Btrfs: rework shrink_delalloc
So shrink_delalloc has grown all sorts of cruft over the years thanks to
many reworkings of how we track enospc.  What happens now as we fill up the
disk is we will loop for freaking ever hoping to reclaim a arbitrary amount
of space of metadata, this was from when everybody flushed at the same time.
Now we only have people flushing one at a time.  So instead of trying to
reclaim a huge amount of space, just try to flush a decent chunk of space,
and stop looping as soon as we have enough free space to satisfy our
reservation.  This makes xfstests 224 go much faster.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:58 -04:00
Liu Bo
b9ca0664dc Btrfs: do not set subvolume flags in readonly mode
$ mkfs.btrfs /dev/sdb7
$ btrfstune -S1 /dev/sdb7
$ mount /dev/sdb7 /mnt/btrfs
mount: block device /dev/sdb7 is write-protected, mounting read-only
$ btrfs dev add /dev/sdb8 /mnt/btrfs/

Now we get a btrfs in which mnt flags has readonly but sb flags does
not.  So for those ioctls that only check sb flags with MS_RDONLY, it
is going to be a problem.
Setting subvolume flags is such an ioctl, we should use mnt_want_write_file()
to check RO flags.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-07-23 16:27:58 -04:00
Liu Bo
e54bfa3104 Btrfs: use mnt_want_write_file instead of mnt_want_write
mnt_want_write_file is faster when file has been opened for write.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-07-23 16:27:57 -04:00
Liu Bo
768e9dfe82 Btrfs: remove redundant r/o check for superblock
mnt_want_write() and mnt_want_write_file() will check sb->s_flags with
MS_RDONLY, and we don't need to do it ourselves.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-07-23 16:27:56 -04:00
Liu Bo
a874a63e13 Btrfs: check write access to mount earlier while creating snapshots
Move check of write access to mount into upper functions so that we can
use mnt_want_write_file instead, which is faster than mnt_want_write.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
2012-07-23 16:27:56 -04:00
Liu Bo
287082b0bd Btrfs: fix typo in cow_file_range_async and async_cow_submit
It should be 10 * 1024 * 1024.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-07-23 16:27:55 -04:00
Josef Bacik
0e72110692 Btrfs: change how we indicate we're adding csums
There is weird logic I had to put in place to make sure that when we were
adding csums that we'd used the delalloc block rsv instead of the global
block rsv.  Part of this meant that we had to free up our transaction
reservation before we ran the delayed refs since csum deletion happens
during the delayed ref work.  The problem with this is that when we release
a reservation we will add it to the global reserve if it is not full in
order to keep us going along longer before we have to force a transaction
commit.  By releasing our reservation before we run delayed refs we don't
get the opportunity to drain down the global reserve for the work we did, so
we won't refill it as often.  This isn't a problem per-se, it just results
in us possibly committing transactions more and more often, and in rare
cases could cause those WARN_ON()'s to pop in use_block_rsv because we ran
out of space in our block rsv.

This also helps us by holding onto space while the delayed refs run so we
don't end up with as many people trying to do things at the same time, which
again will help us not force commits or hit the use_block_rsv warnings.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:55 -04:00
Tsutomu Itoh
b995929515 Btrfs: return error of btrfs_update_inode() to caller
We didn't check error of btrfs_update_inode(), but that error looks
easy to bubble back up.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:54 -04:00
Dan Carpenter
23291a044c Btrfs: fix error handling in __add_reloc_root()
We dereferenced "node" in the error message after freeing it.  Also
btrfs_panic() can return so we should return an error code instead of
continuing.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-07-23 16:27:53 -04:00
Ilya Dryomov
44c44af2f4 Btrfs: do not ignore errors from btrfs_cleanup_fs_roots() when mounting
There used to be a BUG_ON(ret) there before EH patch (79787eaa) went in.
Bail out with EINVAL.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-07-23 16:27:53 -04:00
Ilya Dryomov
fed425c742 Btrfs: do not return EINVAL instead of ENOMEM from open_ctree()
When bailing from open_ctree() err is returned, not ret.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-07-23 16:27:52 -04:00
Josef Bacik
02db0844be Btrfs: add DEVICE_READY ioctl
This will be used in conjunction with btrfs device ready <dev>.  This is
needed for initrd's to have a nice and lightweight way to tell if all of the
devices needed for a file system are in the cache currently.  This keeps
them from having to do mount+sleep loops waiting for devices to show up.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:42 -04:00
Josef Bacik
96c3f4331a Btrfs: flush delayed inodes if we're short on space
Those crazy gentoo guys have been complaining about ENOSPC errors on their
portage volumes.  This is because doing things like untar tends to create
lots of new files which will soak up all the reservation space in the
delayed inodes.  Usually this gets papered over by the fact that we will try
and commit the transaction, however if this happens in the wrong spot or we
choose not to commit the transaction you will be screwed.  So add the
ability to expclitly flush delayed inodes to free up space.  Please test
this out guys to make sure it works since as usual I cannot reproduce.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 15:41:40 -04:00
David Sterba
b27f7c0c15 btrfs: join DEV_STATS ioctls to one
Commit c11d2c236c (Btrfs: add ioctl to get and reset the device
stats) introduced two ioctls doing almost the same thing distinguished
by just the ioctl number which encodes "do reset after read". I have
suggested

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg16604.html

to implement it via the ioctl args. This hasn't happen, and I think we
should use a more clean way to pass flags and should not waste ioctl
numbers.

CC: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: David Sterba <dsterba@suse.cz>
2012-07-23 15:41:40 -04:00
Andrew Mahone
a43a211133 btrfs: ignore unfragmented file checks in defrag when compression enabled - rebased
Rebased on btrfs-next and retested.

Inform should_defrag_range if BTRFS_DEFRAG_RANGE_COMPRESS is set. If so, skip
checks for adjacent extents and extent size when deciding whether to defrag,
as these can prevent an uncompressed and unfragmented file from being
compressed as requested.

Signed-off-by: Andrew Mahone <andrew.mahone@gmail.com>
2012-07-23 15:41:39 -04:00
Dan Carpenter
e4b50e14c8 Btrfs: small naming cleanup in join_transaction()
"root->fs_info" and "fs_info" are the same, but "fs_info" is prefered
because it is shorter and that's what is used in the rest of the
function.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
2012-07-23 15:41:39 -04:00
Alexander Block
2bc5565286 Btrfs: don't update atime on RO subvolumes
Before the update_time inode operation was indroduced, it was
not possible to prevent updates of atime on RO subvolumes. VFS
was only able to check for RO on the mount, but did not know
anything about btrfs subvolumes.

btrfs_update_time does now check if the root is RO and skip
updating of times.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
2012-07-23 15:41:38 -04:00
Arnd Hannemann
063849eafd Btrfs: allow mount -o remount,compress=no
Btrfs allows to turn on compression on a mounted and used filesystem
by issuing mount -o remount,compress=lzo.
This patch allows to turn compression off again
while the filesystem is mounted. As suggested by David Sterba
if the compress-force option was set, it is implicitly cleared
if compression is turned off.

Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Arnd Hannemann <arnd@arndnet.de>
2012-07-23 15:41:38 -04:00
Josef Bacik
c5c3c5f31e Btrfs: remove ->dirty_inode
We do all of our inode updating when we change it, and now that we do
->update_time we don't need ->dirty_inode for atime updates anymore, so just
remove it.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2012-07-23 15:41:38 -04:00
Chris Mason
cbea5ac1ee Btrfs: reduce calls to wake_up on uncontended locks
The btrfs locks were unconditionally calling wake_up as the
locks were released.  This lead to extra thrashing on the waitqueue,
especially for locks that were dominated by readers.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-23 15:36:18 -04:00
Chris Mason
e39e64ac0c Btrfs: don't wait around for new log writers on an SSD
Waiting on spindles improves performance, but ssds want all the
IO as quickly as we can push it down.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-07-23 15:36:17 -04:00
Linus Torvalds
ce9f8d6b39 Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
Pull pnfs/ore fixes from Boaz Harrosh:
 "These are catastrophic fixes to the pnfs objects-layout that were just
  discovered.  They are also destined for @stable.

  I have found these and worked on them at around RC1 time but
  unfortunately went to the hospital for kidney stones and had a very
  slow recovery.  I refrained from sending them as is, before proper
  testing, and surly I have found a bug just yesterday.

  So now they are all well tested, and have my sign-off.  Other then
  fixing the problem at hand, and assuming there are no bugs at the new
  code, there is low risk to any surrounding code.  And in anyway they
  affect only these paths that are now broken.  That is RAID5 in pnfs
  objects-layout code.  It does also affect exofs (which was not broken)
  but I have tested exofs and it is lower priority then objects-layout
  because no one is using exofs, but objects-layout has lots of users."

* 'for-linus' of git://git.open-osd.org/linux-open-osd:
  pnfs-obj: Fix __r4w_get_page when offset is beyond i_size
  pnfs-obj: don't leak objio_state if ore_write/read fails
  ore: Unlock r4w pages in exact reverse order of locking
  ore: Remove support of partial IO request (NFS crash)
  ore: Fix NFS crash by supporting any unaligned RAID IO
2012-07-20 11:43:53 -07:00
Linus Torvalds
1793416287 Fix a bug in UBIFS free space fix-up reported already twice recently:
http://lists.infradead.org/pipermail/linux-mtd/2012-May/041408.html
 http://lists.infradead.org/pipermail/linux-mtd/2012-June/042422.html
 
 and we finally have the fix. I am quite confident the fix is correct
 because I could reproduce the problem with nandsim and verify the
 fix. It was also verified by Iwo (the reporter).
 
 I am also confident that this is OK to merge the fix so late because
 this patch affects only the fixup functionality, which is not used by
 most users.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQCQZ1AAoJECmIfjd9wqK0fosP/RD3Ruo5ILvTtBThKJPUoeld
 kihD9w3rk26cILlpGA3Cs/kaoOj/wPtjMVKGVkw50cWKRQemFLMh4ZcbCepfae+b
 g+YsH+ihkINgjdpKM351lgSCS+NEPJ695zmxNJ+/zjM5+ewfP6vK0qivnjF7w81k
 jLAVt80a1nhjNPyDMeQVr69HxBegYuX927LL4onJULYqvmrSiX/5tXzI+02emjDf
 9gA99fyc4pLNAJzzQyr44pogNaSME+Q90p4PAd11tlaVfn1kXgCXA3Ybv2cy7cer
 ipQfHQzfMjiCMO7Kpt5Ja3necuTarZsHV4UtmXhc4uIOr5p57dJX7RfBzA3j4RmV
 2ZFynqjl7n6ZT0pAM/0F9h9FyjZrCcgg1BGcEsqfJv2Yu7txOX1Qo2gkEvYJl8Sx
 Q2G6xNdzyib8MXClm4L2Zix16WqAF7CyUZo+szUTpdO8PPzgJ/vNpAk+3yqoVeep
 0Dr0HmTMRuP6tJGa9TH58QlvhClkXGSb7ukk1UlV4RVXtvvYtjVBwUXoHSUHNDJO
 HB9B+7ViTIjm9fdILqCX5wtrnZZQgFd1hBiQ/13/ZFrtB1hz5WfOdgfLRIBifjbq
 hGkwQyb5zsWTm7KGTOV0Yncmbnkut4zSJpMCbjZvcPJ2r5zwNwImKdvLBJ7oCKmd
 nPZ2dJmJYdKw2L00SzGZ
 =bUPG
 -----END PGP SIGNATURE-----

Merge tag 'upstream-3.5-rc8' of git://git.infradead.org/linux-ubifs

Pull UBIFS free space fix-up bugfix from Artem Bityutskiy:
 "It's been reported already twice recently:

    http://lists.infradead.org/pipermail/linux-mtd/2012-May/041408.html
    http://lists.infradead.org/pipermail/linux-mtd/2012-June/042422.html

  and we finally have the fix.  I am quite confident the fix is correct
  because I could reproduce the problem with nandsim and verify the fix.
  It was also verified by Iwo (the reporter).

  I am also confident that this is OK to merge the fix so late because
  this patch affects only the fixup functionality, which is not used by
  most users."

* tag 'upstream-3.5-rc8' of git://git.infradead.org/linux-ubifs:
  UBIFS: fix a bug in empty space fix-up
2012-07-20 11:42:30 -07:00
Boaz Harrosh
c999ff6802 pnfs-obj: Fix __r4w_get_page when offset is beyond i_size
It is very common for the end of the file to be unaligned on
stripe size. But since we know it's beyond file's end then
the XOR should be preformed with all zeros.

Old code used to just read zeros out of the OSD devices, which is a great
waist. But what scares me more about this situation is that, we now have
pages attached to the file's mapping that are beyond i_size. I don't
like the kind of bugs this calls for.

Fix both birds, by returning a global zero_page, if offset is beyond
i_size.

TODO:
	Change the API to ->__r4w_get_page() so a NULL can be
	returned without being considered as error, since XOR API
	treats NULL entries as zero_pages.

[Bug since 3.2. Should apply the same way to all Kernels since]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-07-20 11:50:31 +03:00
Boaz Harrosh
9909d45a85 pnfs-obj: don't leak objio_state if ore_write/read fails
[Bug since 3.2 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-07-20 11:50:30 +03:00
Boaz Harrosh
537632e0a5 ore: Unlock r4w pages in exact reverse order of locking
The read-4-write pages are locked in address ascending order.
But where unlocked in a way easiest for coding. Fix that,
locks should be released in opposite order of locking, .i.e
descending address order.

I have not hit this dead-lock. It was found by inspecting the
dbug print-outs. I suspect there is an higher lock at caller that
protects us, but fix it regardless.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-07-20 11:49:25 +03:00
Boaz Harrosh
62b62ad873 ore: Remove support of partial IO request (NFS crash)
Do to OOM situations the ore might fail to allocate all resources
needed for IO of the full request. If some progress was possible
it would proceed with a partial/short request, for the sake of
forward progress.

Since this crashes NFS-core and exofs is just fine without it just
remove this contraption, and fail.

TODO:
	Support real forward progress with some reserved allocations
	of resources, such as mem pools and/or bio_sets

[Bug since 3.2 Kernel]
CC: Stable Tree <stable@kernel.org>
CC: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-07-20 11:47:43 +03:00