Commit Graph

92 Commits (5051f76883897ea3d3d034c92e7b84236da2ec57)

Author SHA1 Message Date
Joel Becker c0e4133851 ocfs2: Attach the connection to the lksb
We're going to want it in the ast functions, so we convert union
ocfs2_dlm_lksb to struct ocfs2_dlm_lksb and let it carry the connection.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:14 -08:00
Tiger Yang b89c54282d ocfs2: add extent block stealing for ocfs2 v5
This patch add extent block (metadata) stealing mechanism for
extent allocation. This mechanism is same as the inode stealing.
if no room in slot specific extent_alloc, we will try to
allocate extent block from the next slot.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Acked-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-26 15:41:07 -08:00
Sunil Mushran a191282601 ocfs2: Prevent a livelock in dlmglue
There is possibility of a livelock in __ocfs2_cluster_lock(). If a node were
to get an ast for an upconvert request, followed immediately by a bast,
there is a small window where the fs may downconvert the lock before the
process requesting the upconvert is able to take the lock.

This patch adds a new flag to indicate that the upconvert is still in
progress and that the dc thread should not downconvert it right now.

Wengang Wang <wen.gang.wang@oracle.com> and Joel Becker
<joel.becker@oracle.com> contributed heavily to this patch.

Reported-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2010-02-02 23:51:13 -08:00
Linus Torvalds 45e62974fb Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
  ocfs2/trivial: Use le16_to_cpu for a disk value in xattr.c
  ocfs2/trivial: Use proper mask for 2 places in hearbeat.c
  Ocfs2: Let ocfs2 support fiemap for symlink and fast symlink.
  Ocfs2: Should ocfs2 support fiemap for S_IFDIR inode?
  ocfs2: Use FIEMAP_EXTENT_SHARED
  fiemap: Add new extent flag FIEMAP_EXTENT_SHARED
  ocfs2: replace u8 by __u8 in ocfs2_fs.h
  ocfs2: explicit declare uninitialized var in user_cluster_connect()
  ocfs2-devel: remove redundant OCFS2_MOUNT_POSIX_ACL check in ocfs2_get_acl_nolock()
  ocfs2: return -EAGAIN instead of EAGAIN in dlm
  ocfs2/cluster: Make fence method configurable - v2
  ocfs2: Set MS_POSIXACL on remount
  ocfs2: Make acl use the default
  ocfs2: Always include ACL support
2009-12-24 12:59:11 -08:00
Sunil Mushran 7aee47b0bb ocfs2: Trivial cleanup of jbd compatibility layer removal
Mainline commit 53ef99cad9 removed the
JBD compatibility layer from OCFS2. This patch removes the last remaining
remnants of that.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-11-13 15:45:05 -08:00
Jan Kara 5297aad80c ocfs2: Make acl use the default
Change acl mount options handling to match the one of XFS and BTRFS and
hopefully it is also easier to use now. When admin does not specify any
acl mount option, acls are enabled if and only if the filesystem has
xattr feature enabled. If admin specifies 'acl' mount option, we fail
the mount if the filesystem does not have xattr feature and thus acls
cannot be enabled.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-10-28 23:06:32 -07:00
Tao Ma 1aa75fea64 ocfs2: Add functions for extents refcounted.
Add function ocfs2_mark_extent_refcounted which can mark
an extent refcounted.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:34 -07:00
Tao Ma 374a263e79 ocfs2: Add refcount tree lock mechanism.
Implement locking around struct ocfs2_refcount_tree.  This protects
all read/write operations on refcount trees.  ocfs2_refcount_tree
has its own lock and its own caching_info, protecting buffers among
multiple nodes.

User must call ocfs2_lock_refcount_tree before his operation on
the tree and unlock it after that.

ocfs2_refcount_trees are referenced by the block number of the
refcount tree root block, So we create an rb-tree on the ocfs2_super
to look them up.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:29 -07:00
Tao Ma f2c870e3b1 ocfs2: Add ocfs2_read_refcount_block.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
2009-09-22 20:09:26 -07:00
Joel Becker 0cf2f7632b ocfs2: Pass struct ocfs2_caching_info to the journal functions.
The next step in divorcing metadata I/O management from struct inode is
to pass struct ocfs2_caching_info to the journal functions.  Thus the
journal locks a metadata cache with the cache io_lock function.  It also
can compare ci_last_trans and ci_created_trans directly.

This is a large patch because of all the places we change
ocfs2_journal_access..(handle, inode, ...) to
ocfs2_journal_access..(handle, INODE_CACHE(inode), ...).

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-09-04 16:07:50 -07:00
Joel Becker 292dd27ec7 ocfs2: move ip_created_trans to struct ocfs2_caching_info
Similar ip_last_trans, ip_created_trans tracks the creation of a journal
managed inode.  This specifically tracks what transaction created the
inode.  This is so the code can know if the inode has ever been written
to disk.

This behavior is desirable for any journal managed object.  We move it
to struct ocfs2_caching_info as ci_created_trans so that any object
using ocfs2_caching_info can rely on this behavior.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-09-04 16:07:49 -07:00
Joel Becker 66fb345ddd ocfs2: move ip_last_trans to struct ocfs2_caching_info
We have the read side of metadata caching isolated to struct
ocfs2_caching_info, now we need the write side.  This means the journal
functions.  The journal only does a couple of things with struct inode.

This change moves the ip_last_trans field onto struct
ocfs2_caching_info as ci_last_trans.  This field tells the journal
whether a pending journal flush is required.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-09-04 16:07:49 -07:00
Joel Becker 6e5a3d7538 ocfs2: Change metadata caching locks to an operations structure.
We don't really want to cart around too many new fields on the
ocfs2_caching_info structure.  So let's wrap all our access of the
parent object in a set of operations.  One pointer on caching_info, and
more flexibility to boot.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-09-04 16:07:48 -07:00
Joel Becker 47460d65a4 ocfs2: Make the ocfs2_caching_info structure self-contained.
We want to use the ocfs2_caching_info structure in places that are not
inodes.  To do that, it can no longer rely on referencing the inode
directly.

This patch moves the flags to ocfs2_caching_info->ci_flags, stores
pointers to the parent's locks on the ocfs2_caching_info, and renames
the constants and flags to reflect its independant state.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-09-04 16:07:47 -07:00
Jan Kara f7b1aa69be ocfs2: Fix deadlock on umount
In commit ea455f8ab6, we moved the dentry lock
put process into ocfs2_wq. This causes problems during umount because ocfs2_wq
can drop references to inodes while they are being invalidated by
invalidate_inodes() causing all sorts of nasty things (invalidate_inodes()
ending in an infinite loop, "Busy inodes after umount" messages etc.).

We fix the problem by stopping ocfs2_wq from doing any further releasing of
inode references on the superblock being unmounted, wait until it finishes
the current round of releasing and finally cleaning up all the references in
dentry_lock_list from ocfs2_put_super().

The issue was tracked down by Tao Ma <tao.ma@oracle.com>.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-07-21 15:47:55 -07:00
Jan Kara cb25797d45 ocfs2: Add lockdep annotations
Add lockdep support to OCFS2. The support also covers all of the cluster
locks except for open locks, journal locks, and local quotafile locks. These
are special because they are acquired for a node, not for a particular process
and lockdep cannot deal with such type of locking.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-22 14:34:26 -07:00
Sunil Mushran 692684e19e ocfs2: Stop orphan scan as early as possible during umount
Currently if the orphan scan fires a tick before the user issues the umount,
the umount will wait for the queued orphan scan tasks to complete.

This patch makes the umount stop the orphan scan as early as possible so as
to reduce the probability of the queued tasks slowing down the umount.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-22 14:24:51 -07:00
Joel Becker 73be192b17 ocfs2: Add statistics for the checksum and ecc operations.
It would be nice to know how often we get checksum failures.  Even
better, how many of them we can fix with the single bit ecc.  So, we add
a statistics structure.  The structure can be installed into debugfs
wherever the user wants.

For ocfs2, we'll put it in the superblock-specific debugfs directory and
pass it down from our higher-level functions.  The stats are only
registered with debugfs when the filesystem supports metadata ecc.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:15:36 -07:00
Srinivas Eeda 15633a220f ocfs2 patch to track delayed orphan scan timer statistics
Patch to track delayed orphan scan timer statistics.

Modifies ocfs2_osb_dump to print the following:
  Orphan Scan=> Local: 10  Global: 21  Last Scan: 67 seconds ago

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:31 -07:00
Srinivas Eeda 83273932fb ocfs2: timer to queue scan of all orphan slots
When a dentry is unlinked, the unlinking node takes an EX on the dentry lock
before moving the dentry to the orphan directory. Other nodes that have
this dentry in cache have a PR on the same dentry lock.  When the EX is
requested, the other nodes flag the corresponding inode as MAYBE_ORPHANED
during downconvert.  The inode is finally deleted when the last node to iput
the inode sees that i_nlink==0 and the MAYBE_ORPHANED flag is set.

A problem arises if a node is forced to free dentry locks because of memory
pressure. If this happens, the node will no longer get downconvert
notifications for the dentries that have been unlinked on another node.
If it also happens that node is actively using the corresponding inode and
happens to be the one performing the last iput on that inode, it will fail
to delete the inode as it will not have the MAYBE_ORPHANED flag set.

This patch fixes this shortcoming by introducing a periodic scan of the
orphan directories to delete such inodes. Care has been taken to distribute
the workload across the cluster so that no one node has to perform the task
all the time.

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
2009-06-03 19:14:31 -07:00
Srinivas Eeda 9140db04ef ocfs2: recover orphans in offline slots during recovery and mount
During recovery, a node recovers orphans in it's slot and the dead node(s). But
if the dead nodes were holding orphans in offline slots, they will be left
unrecovered.

If the dead node is the last one to die and is holding orphans in other slots
and is the first one to mount, then it only recovers it's own slot, which
leaves orphans in offline slots.

This patch queues complete_recovery to clean orphans for all offline slots
during mount and node recovery.

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-04-03 11:39:26 -07:00
wengang wang 6ca497a83e ocfs2: fix rare stale inode errors when exporting via nfs
For nfs exporting, ocfs2_get_dentry() returns the dentry for fh.
ocfs2_get_dentry() may read from disk when the inode is not in memory,
without any cross cluster lock. this leads to the file system loading a
stale inode.

This patch fixes above problem.

Solution is that in case of inode is not in memory, we get the cluster
lock(PR) of alloc inode where the inode in question is allocated from (this
causes node on which deletion is done sync the alloc inode) before reading
out the inode itsself. then we check the bitmap in the group (the inode in
question allcated from) to see if the bit is clear. if it's clear then it's
stale. if the bit is set, we then check generation as the existing code
does.

We have to read out the inode in question from disk first to know its alloc
slot and allot bit. And if its not stale we read it out using ocfs2_iget().
The second read should then be from cache.

And also we have to add a per superblock nfs_sync_lock to cover the lock for
alloc inode and that for inode in question. this is because ocfs2_get_dentry()
and ocfs2_delete_inode() lock on them in reverse order. nfs_sync_lock is locked
in EX mode in ocfs2_get_dentry() and in PR mode in ocfs2_delete_inode(). so
that mutliple ocfs2_delete_inode() can run concurrently in normal case.

[mfasheh@suse.com: build warning fixes and comment cleanups]
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-04-03 11:39:25 -07:00
Tao Ma feb473a6e8 ocfs2: Optimize inode group allocation by recording last used group.
In ocfs2, the block group search looks for the "emptiest" group
to allocate from. So if the allocator has many equally(or almost
equally) empty groups, new block group will tend to get spread
out amongst them.

So we add osb_inode_alloc_group in ocfs2_super to record the last
used inode allocation group.
For more details, please see
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy.

I have done some basic test and the results are a ten times improvement on
some cold-cache stat workloads.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-04-03 11:39:18 -07:00
Mark Fasheh 1d46dc08d3 ocfs2: fix leaf start calculation in ocfs2_dx_dir_rebalance()
ocfs2_dx_dir_rebalance() is passed the block offset of a dx leaf which needs
rebalancing. Since we rebalance an entire cluster at a time however, this
function needs to calculate the beginning of that cluster, in blocks. The
calculation was wrong, which would result in a read of non-leaf blocks. Fix
the calculation by adding ocfs2_block_to_cluster_start() which is a more
straight-forward way of determining this.

Reported-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-04-03 11:39:17 -07:00
Mark Fasheh 198a1ca3b7 ocfs2: Increase max links count
Since we've now got a directory format capable of handling a large number of
entries, we can increase the maximum link count supported. This only gets
increased if the directory indexing feature is turned on.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
2009-04-03 11:39:16 -07:00
Mark Fasheh 9b7895efac ocfs2: Add a name indexed b-tree to directory inodes
This patch makes use of Ocfs2's flexible btree code to add an additional
tree to directory inodes. The new tree stores an array of small,
fixed-length records in each leaf block. Each record stores a hash value,
and pointer to a block in the traditional (unindexed) directory tree where a
dirent with the given name hash resides. Lookup exclusively uses this tree
to find dirents, thus providing us with constant time name lookups.

Some of the hashing code was copied from ext3. Unfortunately, it has lots of
unfixed checkpatch errors. I left that as-is so that tracking changes would
be easier.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
2009-04-03 11:39:15 -07:00
Sunil Mushran 59b526a307 ocfs2: Remove debugfs file local_alloc_stats
This patch removes the debugfs file local_alloc_stats as that information
is now included in the fs_state debugfs file.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-04-03 11:39:15 -07:00
Sunil Mushran 50397507e8 ocfs2: Expose the file system state via debugfs
This patch creates a per mount debugfs file, fs_state, which exposes
information like, cluster stack in use, states of the downconvert, recovery
and commit threads, number of journal txns, some allocation stats, list of
all slots, etc.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-04-03 11:39:15 -07:00
Tao Ma c8b9cf9a7c ocfs2: lock the metaecc process for xattr bucket
For other metadata in ocfs2, metaecc is checked in ocfs2_read_blocks
with io_mutex held. While for xattr bucket, it is calculated by
the whole buckets. So we have to add a spin_lock to prevent multiple
processes calculating metaecc.

Signed-off-by: Tao Ma <tao.ma@oracle.com>
Tested-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-26 11:51:11 -08:00
Jan Kara ea455f8ab6 ocfs2: Push out dropping of dentry lock to ocfs2_wq
Dropping of last reference to dentry lock is a complicated operation involving
dropping of reference to inode. This can get complicated and quota code in
particular needs to obtain some quota locks which leads to potential deadlock.
Thus we defer dropping of inode reference to ocfs2_wq.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-02-02 14:20:16 -08:00
Mark Fasheh 87d35a74b1 ocfs2: Add directory block trailers.
Future ocfs2 features metaecc and indexed directories need to store a
little bit of data in each dirblock.  For compatibility, we place this
in a trailer at the end of the dirblock.  The trailer plays itself as an
empty dirent, so that if the features are turned off, it can be reused
without requiring a tunefs scan.

This code adds the trailer and validates it when the block is read in.

[ Mark is the original author, but I reinserted this code before his
  dir index work.  -- Joel ]

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05 08:40:34 -08:00
Joel Becker 13723d00e3 ocfs2: Use metadata-specific ocfs2_journal_access_*() functions.
The per-metadata-type ocfs2_journal_access_*() functions hook up jbd2
commit triggers and allow us to compute metadata ecc right before the
buffers are written out.  This commit provides ecc for inodes, extent
blocks, group descriptors, and quota blocks.  It is not safe to use
extened attributes and metaecc at the same time yet.

The ocfs2_extent_tree and ocfs2_path abstractions in alloc.c both hide
the type of block at their root.  Before, it didn't matter, but now the
root block must use the appropriate ocfs2_journal_access_*() function.
To keep this abstract, the structures now have a pointer to the matching
journal_access function and a wrapper call to call it.

A few places use naked ocfs2_write_block() calls instead of adding the
blocks to the journal.  We make sure to calculate their checksum and ecc
before the write.

Since we pass around the journal_access functions.  Let's typedef them
in ocfs2.h.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05 08:40:32 -08:00
Joel Becker 70ad1ba7b4 ocfs2: Add the underlying blockcheck code.
This is the code that computes crc32 and ecc for ocfs2 metadata blocks.
There are high-level functions that check whether the filesystem has the
ecc feature, mid-level functions that work on a single block or array of
buffer_heads, and the low-level ecc hamming code that can handle
multiple buffers like crc32_le().

It's not hooked up to the filesystem yet.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05 08:40:31 -08:00
Jan Kara 19ece546a4 ocfs2: Enable quota accounting on mount, disable on umount
Enable quota usage tracking on mount and disable it on umount. Also
add support for quota on and quota off quotactls and usrquota and
grpquota mount options. Add quota features among supported ones.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05 08:40:24 -08:00
Jan Kara 2205363dce ocfs2: Implement quota recovery
Implement functions for recovery after a crash. Functions just
read local quota file and sync info to global quota file.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05 08:40:24 -08:00
Joel Becker 5e96581a37 ocfs2: Wrap extent block reads in a dedicated function.
We weren't consistently checking extent blocks after we read them.
Most places checked the signature, but none checked h_blkno or
h_fs_signature.  Create a toplevel ocfs2_read_extent_block() that does
the read and the validation.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05 08:36:53 -08:00
Joel Becker 4203530613 ocfs2: Morph the haphazard OCFS2_IS_VALID_GROUP_DESC() checks.
Random places in the code would check a group descriptor bh to see if it
was valid. The previous commit unified descriptor block reads,
validating all block reads in the same place.  Thus, these checks are no
longer necessary.  Rather than eliminate them, however, we change them
to BUG_ON() checks.  This ensures the assumptions remain true.  All of
the code paths to these checks have been audited to ensure they come
from a validated descriptor read.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05 08:36:53 -08:00
Joel Becker 10995aa245 ocfs2: Morph the haphazard OCFS2_IS_VALID_DINODE() checks.
Random places in the code would check a dinode bh to see if it was
valid.  Not only did they do different levels of validation, they
handled errors in different ways.

The previous commit unified inode block reads, validating all block
reads in the same place.  Thus, these haphazard checks are no longer
necessary.  Rather than eliminate them, however, we change them to
BUG_ON() checks.  This ensures the assumptions remain true.  All of the
code paths to these checks have been audited to ensure they come from a
validated inode read.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05 08:36:52 -08:00
Tiger Yang 929fb014e0 ocfs2: add POSIX ACL API
This patch adds POSIX ACL(access control lists) APIs in ocfs2. We convert
struct posix_acl to many ocfs2_acl_entry and regard them as an extended
attribute entry.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05 08:34:20 -08:00
Coly Li 3b5da0189c ocfs2: comments typo fix
This patch fixes two typos in comments of ocfs2.

Signed-off-by: Coly Li <coyli@suse.de>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-12-01 14:46:31 -08:00
Joel Becker f6087fb799 ocfs2: Check xattr block signatures properly.
The xattr.c code is currently memcmp()ing naking buffer pointers.
Create the OCFS2_IS_VALID_XATTR_BLOCK() macro to match its peers and use
that.

In addition, failed signature checks were returning -EFAULT, which is
completely wrong.  Return -EIO.

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-11-10 09:51:44 -08:00
Joel Becker 2b4e30fbde ocfs2: Switch over to JBD2.
ocfs2 wants JBD2 for many reasons, not the least of which is that JBD is
limiting our maximum filesystem size.

It's a pretty trivial change.  Most functions are just renamed.  The
only functional change is moving to Jan's inode-based ordered data mode.
It's better, too.

Because JBD2 reads and writes JBD journals, this is compatible with any
existing filesystem.  It can even interact with JBD-based ocfs2 as long
as the journal is formated for JBD.

We provide a compatibility option so that paranoid people can still use
JBD for the time being.  This will go away shortly.

[ Moved call of ocfs2_begin_ordered_truncate() from ocfs2_delete_inode() to
  ocfs2_truncate_for_delete(). --Mark ]

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 17:02:43 -07:00
Joel Becker 12462f1d9f ocfs2: Add the 'inode64' mount option.
Now that ocfs2 limits inode numbers to 32bits, add a mount option to
disable the limit.  This parallels XFS.  64bit systems can handle the
larger inode numbers.

[ Added description of inode64 mount option in ocfs2.txt. --Mark ]

Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:08 -07:00
Tiger Yang 8154da3d21 ocfs2: Add incompatible flag for extended attribute
This patch adds the s_incompat flag for extended attribute support. This
helps us ensure that older versions of Ocfs2 or ocfs2-tools will not be able
to mount a volume with xattr support.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:03 -07:00
Tiger Yang cf1d6c763f ocfs2: Add extended attribute support
This patch implements storing extended attributes both in inode or a single
external block. We only store EA's in-inode when blocksize > 512 or that
inode block has free space for it. When an EA's value is larger than 80
bytes, we will store the value via b-tree outside inode or block.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:02 -07:00
Tiger Yang fdd77704a8 ocfs2: reserve inline space for extended attribute
Add the structures and helper functions we want for handling inline extended
attributes. We also update the inline-data handlers so that they properly
function in the event that we have both inline data and inline attributes
sharing an inode block.

Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 16:57:02 -07:00
Mark Fasheh 9a8ff578fb ocfs2: track local alloc state via debugfs
A per-mount debugfs file, "local_alloc" is created which when read will
expose live state of the nodes local alloc file. Performance impact is
minimal, only a bit of memory overhead per mount point. Still, the code is
hidden behind CONFIG_OCFS2_FS_STATS. This feature will help us debug
local alloc performance problems on a live system.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 13:57:58 -07:00
Mark Fasheh 9c7af40b21 ocfs2: throttle back local alloc when low on disk space
Ocfs2's local allocator disables itself for the duration of a mount point
when it has trouble allocating a large enough area from the primary bitmap.
That can cause performance problems, especially for disks which were only
temporarily full or fragmented. This patch allows for the allocator to
shrink it's window first, before being disabled. Later, it can also be
re-enabled so that any performance drop is minimized.

To do this, we allow the value of osb->local_alloc_bits to be shrunk when
needed. The default value is recorded in a mostly read-only variable so that
we can re-initialize when required.

Locking had to be updated so that we could protect changes to
local_alloc_bits. Mostly this involves protecting various local alloc values
with the osb spinlock. A new state is also added, OCFS2_LA_THROTTLED, which
is used when the local allocator is has shrunk, but is not disabled. If the
available space dips below 1 megabyte, the local alloc file is disabled. In
either case, local alloc is re-enabled 30 seconds after the event, or when
an appropriate amount of bits is seen in the primary bitmap.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 13:57:57 -07:00
Mark Fasheh ebcee4b5c9 ocfs2: Track local alloc bits internally
Do this instead of tracking absolute local alloc size. This avoids
needless re-calculatiion of bits from bytes in localalloc.c. Additionally,
the value is now in a more natural unit for internal file system bitmap
work.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-10-13 13:57:57 -07:00
Sunil Mushran 539d826409 [PATCH 2/2] ocfs2: Fix race between mount and recovery
As the fs recovery is asynchronous, there is a small chance that another
node can mount (and thus recover) the slot before the recovery thread
gets to it.

If this happens, the recovery thread will block indefinitely on the
journal/slot lock as that lock will be held for the duration of the mount
(by design) by the node assigned to that slot.

The solution implemented is to keep track of the journal replays using
a recovery generation in the journal inode, which will be incremented by the
thread replaying that journal. The recovery thread, before attempting the
blocking lock on the journal/slot lock, will compare the generation on disk
with what it has cached and skip recovery if it does not match.

This bug appears to have been inadvertently introduced during the mount/umount
vote removal by mainline commit 34d024f843. In the
mount voting scheme, the messaging would indirectly indicate that the slot
was being recovered.

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2008-07-31 16:21:14 -07:00