110342 Commits (82124d60354846623a4b94af335717a5e142a074)

Author SHA1 Message Date
Kiyoshi Ueda 82124d6035 block: add request submission interface
This patch adds blk_insert_cloned_request(), a generic request
submission interface for request stacking drivers.
Request-based dm will use it to submit their clones to underlying
devices.

blk_rq_check_limits() is also added because it is possible that
the lower queue has stronger limitations than the upper queue
if multiple drivers are stacking at request-level.
Not only for blk_insert_cloned_request()'s internal use, the function
will be used by request-based dm when the queue limitation is
modified (e.g. by replacing dm's table).

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:18 +02:00
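The limit check described in the commit above (a lower queue may be stricter than the upper one) can be illustrated with a small userspace sketch. This is not the kernel's blk_rq_check_limits(); the struct names, fields, and rq_fits_limits() are hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a lower queue's limits. */
struct lower_limits_sketch {
    unsigned int max_sectors;   /* largest request size, in sectors */
    unsigned int max_segments;  /* most scatter-gather segments allowed */
};

/* Hypothetical, simplified stand-in for a cloned request. */
struct clone_rq_sketch {
    unsigned int nr_sectors;
    unsigned int nr_segments;
};

/* Return true if the cloned request fits within the lower queue's limits. */
static bool rq_fits_limits(const struct lower_limits_sketch *lower,
                           const struct clone_rq_sketch *rq)
{
    assert(lower != NULL && rq != NULL);

    if (rq->nr_sectors > lower->max_sectors)
        return false;   /* request is larger than the lower queue accepts */
    if (rq->nr_segments > lower->max_segments)
        return false;   /* too many segments for the lower queue */
    return true;
}
```

A stacking driver would presumably run a check like this before handing a clone to the underlying device, and again after the limits change.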
Kiyoshi Ueda 32fab448e5 block: add request update interface
This patch adds blk_update_request(), which updates struct request
by completing its data part, but doesn't complete the struct
request itself.
Though it looks like end_that_request_first() of older kernels,
blk_update_request() should be used only by request stacking drivers.

Request-based dm will use it in bio->bi_end_io callback to update
the original request when a data part of a cloned request completes.
The following is additional background on why request-based
dm needs this interface.

  - Request stacking drivers can't use blk_end_request() directly from
    the lower driver's completion context (bio->bi_end_io or rq->end_io),
    because some device drivers (e.g. ide) may try to complete
    their request with queue lock held, and it may cause deadlock.
    See below for detailed description of possible deadlock:
    <http://marc.info/?l=linux-kernel&m=120311479108569&w=2>

  - To solve that, request-based dm offloads the completion of
    cloned struct request to softirq context (i.e. using
    blk_complete_request() from rq->end_io).

  - Though it is possible to use the same solution from bio->bi_end_io,
    it will delay the notification of bio completion to the original
    submitter.  Also, it will cause inefficient partial completion,
    because the lower driver can't perform the cloned request anymore
    and request-based dm needs to requeue and redispatch it to
    the lower driver again later.  That's not good.

  - So request-based dm needs blk_update_request() to perform the bio
    completion in the lower driver's completion context, which is more
    efficient.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:18 +02:00
Jens Axboe e3335de940 block: blk_cleanup_queue() should call blk_sync_queue()
When a driver calls blk_cleanup_queue(), the device should be fully idle.
However, the block layer may have pending plugging timers and the IO
schedulers may have pending work in the work queues. So quiesce the device
by waiting for the timer and flushing the work queues.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:18 +02:00
Chris Lalancette 9246b5f06d block: Expand Xen blkfront for > 16 xvd
Until recently, the maximum number of xvd block devices you could attach
to a Xen domU was 16. This limitation turned out to be problematic for
some users, so it was expanded to handle a much larger number of disks.
However, this requires a couple of changes in the way that blkfront
scans for disks. This functionality is already present in the Xen
linux-2.6.18-xen.hg tree; the attached patch adds this functionality to
the mainline xen-blkfront implementation. I successfully tested it on a
2.6.25 tree, and build tested it on 2.6.27-rc3.

Signed-off-by: Chris Lalancette <clalance@redhat.com>
Acked-by: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:18 +02:00
Jens Axboe 9c02f2b02e block: cleanup some of the integrity stuff in blkdev.h
Don't put functions that are only used in fs/bio-integrity.c in
blkdev.h, it's much cleaner to just keep it in there. Also kill
completely unused bdev_get_tag_size()

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:17 +02:00
Jens Axboe 7ba1fbaa4a block: use rq complete marking in blk_abort_request()
We cannot abort a request if we raced with the timeout handler already,
or with the IO completion. So make blk_abort_request() mark the request
as complete, and only continue if we succeeded.

Found and suggested by Mike Anderson <andmike@linux.vnet.ibm.com>

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:17 +02:00
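The "mark complete, and only continue if we succeeded" idea in the commit above can be sketched in userspace with C11 atomics. This is an illustration of the pattern only, not the kernel code; struct abort_rq_sketch, mark_rq_complete(), and abort_request_sketch() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a request with a "complete" marker. */
struct abort_rq_sketch {
    atomic_bool complete;
};

/* Atomically claim the request; true only for the first caller. */
static bool mark_rq_complete(struct abort_rq_sketch *rq)
{
    assert(rq != NULL);
    /* atomic_exchange returns the previous value of the flag. */
    return !atomic_exchange(&rq->complete, true);
}

/* Abort only if neither the timeout handler nor IO completion got there first. */
static bool abort_request_sketch(struct abort_rq_sketch *rq)
{
    if (!mark_rq_complete(rq))
        return false;   /* lost the race: someone else owns completion */
    /* ... abort handling would go here ... */
    return true;
}
```

Whichever path sets the marker first owns the request; every later caller backs off.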
Jens Axboe 581d4e28d9 block: add fault injection mechanism for faking request timeouts
Only works for the generic request timer handling. Allows one to
sporadically ignore request completions, thus exercising the timeout
handling.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:17 +02:00
Jens Axboe 0a0d96b03a block: add bio_kmalloc()
Not all callers need (or want!) the mempool backing guarantee, it
essentially means that you can only use bio_alloc() for short allocations
and not for preallocating some bio's at setup or init time.

So add bio_kmalloc() which does the same thing as bio_alloc(), except
it just uses kmalloc() as the backing instead of the bio mempools.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:17 +02:00
Hugh Dickins 3e6053d76d block: adjust blkdev_issue_discard for swap
Two mods to blkdev_issue_discard(), thinking ahead to its use on swap:

1. Add gfp_mask argument, so swap allocation can use it where GFP_KERNEL
   might deadlock but GFP_NOIO is safe.

2. Enlarge nr_sects argument from unsigned to sector_t: unsigned long is
   enough to cover a whole swap area, but sector_t suits any partition.

Change sb_issue_discard()'s nr_blocks to sector_t too; but no need seen
for a gfp_mask there, just pass GFP_KERNEL down to blkdev_issue_discard().

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:17 +02:00
FUJITA Tomonori 4677735f03 sg: remove unnecessary blk_rq_unmap_user
blk_rq_unmap_user in sg_finish_rem_req can take care of all the cases.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:16 +02:00
FUJITA Tomonori 0b6cb26c66 sg: remove sg_read_xfer
sg_read_xfer was used to copy data to user space for READ
commands. blk_rq_unmap_user does the job so sg_read_xfer does nothing
useful.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:15 +02:00
FUJITA Tomonori c3919af235 sg: remove sg_write_xfer
sg_write_xfer was used to copy data from user space for WRITE
commands. blk_rq_map_user_iov and blk_rq_map_user do the job so
sg_write_xfer does nothing useful.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:14 +02:00
FUJITA Tomonori 626710c9d6 sg: incorporate sg_build_direct into sg_start_req
Calling blk_rq_map_user() in a single place is better than in
two different places. It makes the code more understandable.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:14 +02:00
FUJITA Tomonori 44c7b0eaa0 sg: remove __sg_start_req
__sg_start_req() was used temporarily to call blk_get_request() while
converting sg to use the block layer.

Now sg always calls blk_get_request() so we can move blk_get_request()
to sg_start_req(). We don't need __sg_start_req anymore.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:14 +02:00
FUJITA Tomonori fd1c1de076 sg: remove b_malloc_len in sg_scatter_hold struct
It's not used for anything useful after the block layer conversion.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:14 +02:00
FUJITA Tomonori 7e56cb0f7e sg: remove SG_ALLOW_DIO_CODE define
sg had lots of its own functions for direct IO, but now sg uses the
block layer functions for it. Only five lines remain for direct IO.
The SG_ALLOW_DIO_CODE define was used to compile out the direct IO
code, but we don't need the define. If someone wants to remove the
direct IO code, they can do so easily without the define.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:14 +02:00
FUJITA Tomonori a91a3a20e0 sg: rename sg_cmd_done sg_rq_end_io
The old sg_rq_end_io() was used to wrap sg_cmd_done while converting sg
to use the block layer (in order to cover the difference between
scsi_execute_async and blk_execute_rq_nowait). Now we don't need it, so
let's remove it.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:14 +02:00
Mike Anderson 224cb3e981 dm: Call blk_abort_queue on failed paths
Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:14 +02:00
Mike Anderson 11914a53d2 block: Add interface to abort queued requests
Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:13 +02:00
Jens Axboe 242f9dcb8b block: unify request timeout handling
Right now SCSI and others do their own command timeout handling.
Move those bits to the block layer.

Instead of having a timer per command, we try to be a bit more clever
and simply have one per-queue. This avoids the overhead of having to
tear down and setup a timer for each command, so it will result in a lot
less timer fiddling.

Signed-off-by: Mike Anderson <andmike@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:13 +02:00
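The "one timer per queue instead of one per command" idea in the commit above can be sketched in userspace as follows. This is a simplified illustration, not the block layer's implementation; struct timeout_queue_sketch, MAX_INFLIGHT, and both functions are hypothetical, and times are plain integers rather than jiffies.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INFLIGHT 8

/* Hypothetical queue: one shared timer covering all in-flight requests. */
struct timeout_queue_sketch {
    unsigned long deadlines[MAX_INFLIGHT]; /* 0 means the slot is unused */
    unsigned long timer_expires;           /* 0 means the timer is not armed */
};

/* Track a new request; touch the shared timer only if this deadline is earlier. */
static int queue_add_timeout(struct timeout_queue_sketch *q, unsigned long deadline)
{
    assert(q != NULL && deadline != 0);

    for (size_t i = 0; i < MAX_INFLIGHT; i++) {
        if (q->deadlines[i] == 0) {
            q->deadlines[i] = deadline;
            if (q->timer_expires == 0 || deadline < q->timer_expires)
                q->timer_expires = deadline;
            return (int)i;
        }
    }
    return -1; /* no free slot */
}

/* Timer fired at `now`: expire overdue requests, re-arm for the next earliest. */
static unsigned int queue_timer_fired(struct timeout_queue_sketch *q, unsigned long now)
{
    unsigned int expired = 0;
    unsigned long next = 0;

    assert(q != NULL);
    for (size_t i = 0; i < MAX_INFLIGHT; i++) {
        unsigned long d = q->deadlines[i];

        if (d == 0)
            continue;
        if (d <= now) {
            q->deadlines[i] = 0;    /* request timed out */
            expired++;
        } else if (next == 0 || d < next) {
            next = d;               /* earliest deadline still pending */
        }
    }
    q->timer_expires = next;
    return expired;
}
```

Adding a request with a later deadline leaves the timer untouched, which is where the saving over per-command timers comes from in this sketch.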
Andrew Patterson 608aeef17a Call flush_disk() after detecting an online resize.
We call flush_disk() to make sure the buffer cache for the disk is
flushed after a disk resize. There are two resize cases, growing and
shrinking. Given that users can shrink/then grow a disk before
revalidate_disk() is called, we treat the grow case identically to
shrinking. We need to flush the buffer cache after an online shrink
because, as James Bottomley puts it,

     The two use cases for shrinking I can see are

     1. planned: the fs is already shrunk to within the new boundaries
        and all data is relocated, so invalidate is fine (any dirty
        buffers that might exist in the shrunk region are there only
        because they were relocated but not yet written to their
        original location).
     2. unplanned:  In this case, the fs is probably toast, so whether
        we invalidate or not isn't going to make a whole lot of
        difference; it's still going to try to read or write from
        sectors beyond the new size and get I/O errors.

Immediately invalidating shrunk disks will cause errors for outstanding
I/Os for reads/writes beyond the new end of the disk to be generated
earlier than if we waited for the normal buffer cache operation. It also
removes a potential security hole where we might keep old data around
from beyond the end of the shrunk disk if the disk was not invalidated.

Signed-off-by: Andrew Patterson <andrew.patterson@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:13 +02:00
Andrew Patterson 56ade44b46 Added flush_disk to factor out common buffer cache flushing code.
We need to be able to flush the buffer cache for more than
just when a disk is changed, so we factor out common cache flush code
in check_disk_change() to an internal flush_disk() routine.  This
routine will then be used for both disk changes and disk resizes (in a
later patch).

Include the disk name in the text indicating that there are busy
inodes on the device and increase the KERN severity of the message.

Signed-off-by: Andrew Patterson <andrew.patterson@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:13 +02:00
Andrew Patterson f98a8cae12 SCSI sd driver calls revalidate_disk wrapper.
Modify the SCSI disk driver to call the revalidate_disk()
wrapper. This allows us to do some housekeeping such as accounting for
a disk being resized online. The wrapper will call
sd_revalidate_disk() at the appropriate time.

Signed-off-by: Andrew Patterson <andrew.patterson@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:13 +02:00
Andrew Patterson 9bc3ffbfbd Check for device resize when rescanning partitions
Check for device resize in the rescan_partitions() routine. If the device
has been resized, the bdev size is set to match. The rescan_partitions()
routine is called when opening the device and when calling the
BLKRRPART ioctl.

Signed-off-by: Andrew Patterson <andrew.patterson@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:12 +02:00
Andrew Patterson c3279d1454 Adjust block device size after an online resize of a disk.
The revalidate_disk routine now checks if a disk has been resized by
comparing the gendisk capacity to the bdev inode size.  If they are
different (usually because the disk has been resized underneath the kernel)
the bdev inode size is adjusted to match the capacity.

Signed-off-by: Andrew Patterson <andrew.patterson@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:12 +02:00
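The comparison described in the commit above (gendisk capacity versus bdev inode size) can be sketched in userspace as below. The struct names and sync_bdev_size() are hypothetical; the sketch assumes capacity is counted in 512-byte sectors and the inode size in bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins: capacity in 512-byte sectors, inode size in bytes. */
struct gendisk_sketch { unsigned long long capacity_sectors; };
struct bdev_sketch    { unsigned long long inode_size_bytes; };

/* Adjust the bdev size to the disk capacity; return true if it changed. */
static bool sync_bdev_size(const struct gendisk_sketch *disk,
                           struct bdev_sketch *bdev)
{
    unsigned long long disk_size;

    assert(disk != NULL && bdev != NULL);
    disk_size = disk->capacity_sectors << 9;    /* sectors to bytes */

    if (bdev->inode_size_bytes == disk_size)
        return false;                           /* no resize detected */

    bdev->inode_size_bytes = disk_size;         /* follow the new capacity */
    return true;
}
```

The return value lets a caller decide whether follow-up work, such as flushing cached buffers, is needed.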
Andrew Patterson 0c002c2f74 Wrapper for lower-level revalidate_disk routines.
This is a wrapper for the lower-level revalidate_disk call-backs such
as sd_revalidate_disk(). It allows us to perform pre and post
operations when calling them.

We will use this wrapper in a later patch to adjust block device sizes
after an online resize (a _post_ operation).

Signed-off-by: Andrew Patterson <andrew.patterson@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:12 +02:00
Tejun Heo 243294dae0 block: fix duplicate headers for /proc/partitions
seqf can be started multiple times for a read and the header should be
printed only for the initial one.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:12 +02:00
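The fix in the commit above amounts to emitting the header only when iteration starts at position zero. A minimal userspace sketch of that pattern follows; part_start_sketch() and its buffer interface are hypothetical and do not use the kernel seq_file API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical "start" callback: it may run several times for one read,
 * each time with the position reached so far. Only the initial call
 * (pos == 0) writes the header. Returns the number of bytes written.
 */
static size_t part_start_sketch(char *buf, size_t len, long pos)
{
    assert(buf != NULL && len > 0);

    buf[0] = '\0';
    if (pos == 0)
        snprintf(buf, len, "major minor  #blocks  name\n\n");
    return strlen(buf);
}
```

A restarted iteration (pos greater than zero) then produces no duplicate header.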
FUJITA Tomonori fad7f01e61 sg: set dxferp to NULL for READ with the older SG interface
With the older SG interface, we don't know a user-space address to
transfer data when executing a SCSI command. So we can't pass a
user-space address to blk_rq_map_user.

This patch fixes sg to pass a NULL user-space address to
blk_rq_map_user so that it just sets up a request and bios with page
frames properly without data transfer.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:12 +02:00
FUJITA Tomonori 818827669d block: make blk_rq_map_user take a NULL user-space buffer
This patch changes blk_rq_map_user to accept a NULL user-space buffer
with a READ command if rq_map_data is not NULL. Thus a caller can pass
page frames to blk_rq_map_user to just set up a request and bios with
page frames properly. bio_uncopy_user (called via blk_rq_unmap_user)
doesn't copy data to user space with such a request.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:11 +02:00
Jens Axboe 839e96afba block: update comment on end_request()
It refers to functions that no longer exist after the IO completion
changes.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:11 +02:00
Tejun Heo 55dc7db70a init: DEBUG_BLOCK_EXT_DEVT requires explicit root= param
DEBUG_BLOCK_EXT_DEVT shuffles SCSI and IDE device numbers, so a root
device number set using rdev becomes meaningless.  Root devices should
be explicitly specified using textual names.  Warn about it if root
can't be found and DEBUG_BLOCK_EXT_DEVT is enabled.  Also, add warning
to the help text.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:11 +02:00
Tejun Heo 2bbedcb4c1 block: don't test for partition size in bdget_disk() and blk_lookup_devt()
bdget_disk() and blk_lookup_devt() never cared whether the specified
partition (or disk) is zero sized or not.  I got confused while
converting those not to depend on consecutive minor numbers in commit
5a6411b1178baf534aa9138052864dfa89d3eada and later when dev0 was added
it broke callers which expected to get valid return for zero sized
disk devices.

So, they never needed nr_sects checks in the first place.  Kill them.

This problem was spotted and debugged by Bartlomiej Zolnierkiewicz.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:11 +02:00
Jens Axboe 759f8ca304 Change default value of CONFIG_DEBUG_BLOCK_EXT_DEVT to 'n'
It's a debug option that you would explicitly enable to test this
feature, we should default it to 'n' to prevent accidental surprises
for now.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:11 +02:00
Harvey Harrison aeb3d3a81e block: kmalloc args reversed, small function definition fixes
Noticed by sparse:
block/blk-softirq.c:156:12: warning: symbol 'blk_softirq_init' was not declared. Should it be static?
block/genhd.c:583:28: warning: function 'bdget_disk' with external linkage has definition
block/genhd.c:659:17: warning: incorrect type in argument 1 (different base types)
block/genhd.c:659:17:    expected unsigned int [unsigned] [usertype] size
block/genhd.c:659:17:    got restricted gfp_t
block/genhd.c:659:29: warning: incorrect type in argument 2 (different base types)
block/genhd.c:659:29:    expected restricted gfp_t [usertype] flags
block/genhd.c:659:29:    got unsigned int
block: kmalloc args reversed

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:11 +02:00
FUJITA Tomonori 01cfcddd98 sg: use blk_rq_aligned helper function
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Douglas Gilbert <dougg@torque.net>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:11 +02:00
FUJITA Tomonori 879040742c block: add blk_rq_aligned helper function
This adds the blk_rq_aligned helper function to see if the alignment and
padding requirements are satisfied for DMA transfer. This also converts
blk_rq_map_kern and __blk_rq_map_user to use the helper function.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:11 +02:00
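An alignment-and-padding check of the kind the commit above describes can be sketched in userspace like this. It is an illustration under assumptions, not the kernel helper: rq_aligned_sketch() and its parameters are hypothetical, and both masks are assumed to have the form 2^n - 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if both the buffer address and its length satisfy the
 * combined DMA alignment and padding masks (each assumed to be 2^n - 1).
 */
static bool rq_aligned_sketch(unsigned long dma_alignment,
                              unsigned long dma_pad_mask,
                              uintptr_t addr, size_t len)
{
    unsigned long mask = dma_alignment | dma_pad_mask;

    assert(((mask + 1) & mask) == 0);   /* mask must be of the form 2^n - 1 */
    return !(addr & mask) && !(len & mask);
}
```

When the check fails, a caller would typically fall back to copying through a bounce buffer rather than mapping the memory directly.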
FUJITA Tomonori 4d8ab62e08 bio: convert bio_copy_kern to use bio_copy_user
bio_copy_kern and bio_copy_user are very similar. This converts
bio_copy_kern to use bio_copy_user.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:10 +02:00
FUJITA Tomonori 10db10d144 sg: convert the indirect IO path to use the block layer
This patch converts the indirect IO path (including mmap IO and old
struct sg_header) to use the block layer functions (blk_get_request,
blk_execute_rq_nowait, blk_rq_map_user, etc) instead of
scsi_execute_async().

[Jens: fixed compile error with SCSI logging enabled]

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Douglas Gilbert <dougg@torque.net>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:10 +02:00
FUJITA Tomonori 6e5a30cba5 sg: convert the direct IO path to use the block layer
This patch converts the direct IO path (SG_FLAG_DIRECT_IO) to use the
block layer functions (blk_get_request, blk_execute_rq_nowait,
blk_rq_map_user, etc) instead of scsi_execute_async().

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Douglas Gilbert <dougg@torque.net>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:10 +02:00
FUJITA Tomonori 10865dfa34 sg: convert the non-data path to use the block layer
This patch converts the non data path to use the block layer functions
(blk_get_request, blk_execute_rq_nowait, etc) instead of
scsi_execute_async().

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Douglas Gilbert <dougg@torque.net>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:10 +02:00
FUJITA Tomonori 152e283fdf block: introduce struct rq_map_data to use reserved pages
This patch introduces struct rq_map_data to enable bio_copy_user_iov()
to use reserved pages.

Currently, bio_copy_user_iov allocates bounce pages but
drivers/scsi/sg.c wants to allocate pages by itself and use
them. struct rq_map_data can be used to pass allocated pages to
bio_copy_user_iov.

The current users of bio_copy_user_iov simply pass NULL (they don't
want to use pre-allocated pages).

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Douglas Gilbert <dougg@torque.net>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:10 +02:00
FUJITA Tomonori a3bce90edd block: add gfp_mask argument to blk_rq_map_user and blk_rq_map_user_iov
Currently, blk_rq_map_user and blk_rq_map_user_iov always do
GFP_KERNEL allocation.

This adds gfp_mask argument to blk_rq_map_user and blk_rq_map_user_iov
so sg can use it (sg always does GFP_ATOMIC allocation).

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Douglas Gilbert <dougg@torque.net>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:10 +02:00
Aaron Carroll 45333d5a31 cfq-iosched: fix queue depth detection
CFQ's detection of queueing devices assumes a non-queuing device and detects
if the queue depth reaches a certain threshold.  Under some workloads (e.g.
synchronous reads), CFQ effectively forces a unit queue depth, thus defeating
the detection logic.  This leads to poor performance on queuing hardware,
since the idle window remains enabled.

This patch inverts the sense of the logic: assume a queuing-capable device,
and detect if the depth does not exceed the threshold.

Signed-off-by: Aaron Carroll <aaronc@gelato.unsw.edu.au>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:09 +02:00
Jens Axboe 605401618c block: don't use bio_has_data() in the completion path
We should just check for rq->bio, as that is really the information
we are looking for. Even if the bio attached doesn't carry data,
we still need to do IO post processing on it.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:09 +02:00
Jens Axboe ab780f1ece block: inherit CPU completion on bio->rq and rq->rq merges
Somewhat incomplete, as we do allow merges of requests and bios
that have different completion CPUs given. This is done on the
assumption that a larger IO is still more beneficial than CPU
locality.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:09 +02:00
Jens Axboe c7c22e4d5c block: add support for IO CPU affinity
This patch adds support for controlling the IO completion CPU of
either all requests on a queue, or on a per-request basis. We export
a sysfs variable (rq_affinity) which, if set, migrates completions
of requests to the CPU that originally submitted it. A bio helper
(bio_set_completion_cpu()) is also added, so that queuers can ask
for completion on that specific CPU.

In testing, this has been shown to cut the system time by as much
as 20-40% on synthetic workloads where CPU affinity is desired.

This requires a little help from the architecture, so it'll only
work as designed for archs that are using the new generic smp
helper infrastructure.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:09 +02:00
Jens Axboe 18887ad910 block: make kblockd_schedule_work() take the queue as parameter
Preparatory patch for checking queuing affinity.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:09 +02:00
Jens Axboe b646fc59b3 block: split softirq handling into blk-softirq.c
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:09 +02:00
Jens Axboe 0835da67c1 block: use linux/uaccess.h in elevator.c instead of asm variant
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:09 +02:00
Tejun Heo 3e1a7ff8a0 block: allow disk to have extended device number
Now that disk and partition handlings are mostly unified, it's easy to
allow disk to have extended device number.  This patch makes
add_disk() use extended device number if disk->minors is zero.  Both
sd and ide-disk are updated to use this.

* sd_format_disk_name() is implemented which can generically determine
  the drive name.  This removes disk number restriction stemming from
  limited device names.

* If sd index goes over SD_MAX_DISKS (which can be increased now BTW),
  sd simply doesn't initialize minors letting block layer choose
  extended device number.

* If CONFIG_DEBUG_EXT_DEVT is set, both sd and ide-disk always set
  minors to 0 and use extended device numbers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-10-09 08:56:08 +02:00