summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/disk-io.h
Commit message (Collapse)AuthorAgeFilesLines
* Btrfs: leave btree locks spinning more oftenChris Mason2009-03-241-0/+1
| | | | | | | | | | | | | | | | | | | | btrfs_mark_buffer dirty would set dirty bits in the extent_io tree for the buffers it was dirtying. This may require a kmalloc and it was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to set any btree locks they were holding to blocking first. This commit changes dirty tracking for extent buffers to just use a flag in the extent buffer. Now that we have one and only one extent buffer per page, this can be safely done without losing dirty bits along the way. This also introduces a path->leave_spinning flag that callers of btrfs_search_slot can use to indicate they will properly deal with a path returned where all the locks are spinning instead of blocking. Many of the btree search callers now expect spinning paths, resulting in better btree concurrency overall. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: make a lockdep class for the extent buffer locksChris Mason2009-02-121-0/+10
| | | | | | | | | | | | | | | | | | | | Btrfs is currently using spin_lock_nested with a nested value based on the tree depth of the block. But, this doesn't quite work because the max tree depth is bigger than what spin_lock_nested can deal with, and because locks are sometimes taken before the level field is filled in. The solution here is to use lockdep_set_class_and_name instead, and to set the class before unlocking the pages when the block is read from the disk and just after init of a freshly allocated tree block. btrfs_clear_path_blocking is also changed to take the locks in the proper order, and it also makes sure all the locks currently held are properly set to blocking before it tries to retake the spinlocks. Otherwise, lockdep gets upset about bad lock orderin. The lockdep magic cam from Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix tree logs parallel syncYan Zheng2009-01-211-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | To improve performance, btrfs_sync_log merges tree log sync requests. But it wrongly merges sync requests for different tree logs. If multiple tree logs are synced at the same time, only one of them actually gets synced. This patch has following changes to fix the bug: Move most tree log related fields in btrfs_fs_info to btrfs_root. This allows merging sync requests separately for each tree log. Don't insert root item into the log root tree immediately after log tree is allocated. Root item for log tree is inserted when log tree get synced for the first time. This allows syncing the log root tree without first syncing all log trees. At tree-log sync, btrfs_sync_log first sync the log tree; then updates corresponding root item in the log root tree; sync the log root tree; then update the super block. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
* Btrfs: superblock duplicationYan Zheng2008-12-081-2/+15
| | | | | | | | | | | This patch implements superblock duplication. Superblocks are stored at offset 16K, 64M and 256G on every devices. Spaces used by superblocks are preserved by the allocator, which uses a reverse mapping function to find the logical addresses that correspond to superblocks. Thank you, Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
* Btrfs: mount ro and remount supportYan Zheng2008-11-121-0/+2
| | | | | | | | | | | This patch adds mount ro and remount support. The main changes in patch are: adding btrfs_remount and related helper function; splitting the transaction related code out of close_ctree into btrfs_commit_super; updating allocator to properly handle read only block group. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
* Btrfs: Add ordered async work queuesChris Mason2008-11-061-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Btrfs uses kernel threads to create async work queues for cpu intensive operations such as checksumming and decompression. These work well, but they make it difficult to keep IO order intact. A single writepages call from pdflush or fsync will turn into a number of bios, and each bio is checksummed in parallel. Once the checksum is computed, the bio is sent down to the disk, and since we don't control the order in which the parallel operations happen, they might go down to the disk in almost any order. The code deals with this somewhat by having deep work queues for a single kernel thread, making it very likely that a single thread will process all the bios for a single inode. This patch introduces an explicitly ordered work queue. As work structs are placed into the queue they are put onto the tail of a list. They have three callbacks: ->func (cpu intensive processing here) ->ordered_func (order sensitive processing here) ->ordered_free (free the work struct, all processing is done) The work struct has three callbacks. The func callback does the cpu intensive work, and when it completes the work struct is marked as done. Every time a work struct completes, the list is checked to see if the head is marked as done. If so the ordered_func callback is used to do the order sensitive processing and the ordered_free callback is used to do any cleanup. Then we loop back and check the head of the list again. This patch also changes the checksumming code to use the ordered workqueues. One a 4 drive array, it increases streaming writes from 280MB/s to 350MB/s. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add zlib compression supportChris Mason2008-10-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a large change for adding compression on reading and writing, both for inline and regular extents. It does some fairly large surgery to the writeback paths. Compression is off by default and enabled by mount -o compress. Even when the -o compress mount option is not used, it is possible to read compressed extents off the disk. If compression for a given set of pages fails to make them smaller, the file is flagged to avoid future compression attempts later. * While finding delalloc extents, the pages are locked before being sent down to the delalloc handler. This allows the delalloc handler to do complex things such as cleaning the pages, marking them writeback and starting IO on their behalf. * Inline extents are inserted at delalloc time now. This allows us to compress the data before inserting the inline extent, and it allows us to insert an inline extent that spans multiple pages. * All of the in-memory extent representations (extent_map.c, ordered-data.c etc) are changed to record both an in-memory size and an on disk size, as well as a flag for compression. From a disk format point of view, the extent pointers in the file are changed to record the on disk size of a given extent and some encoding flags. Space in the disk format is allocated for compression encoding, as well as encryption and a generic 'other' field. Neither the encryption or the 'other' field are currently used. In order to limit the amount of data read for a single random read in the file, the size of a compressed extent is limited to 128k. This is a software only limit, the disk format supports u64 sized compressed extents. In order to limit the ram consumed while processing extents, the uncompressed size of a compressed extent is limited to 256k. This is a software only limit and will be subject to tuning later. Checksumming is still done on compressed extents, and it is done on the uncompressed version of the data. This way additional encodings can be layered on without having to figure out which encoding to checksum. Compression happens at delalloc time, which is basically singled threaded because it is usually done by a single pdflush thread. This makes it tricky to spread the compression load across all the cpus on the box. We'll have to look at parallel pdflush walks of dirty inodes at a later time. Decompression is hooked into readpages and it does spread across CPUs nicely. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Tree logging fixesChris Mason2008-09-251-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Pin down data blocks to prevent them from being reallocated like so: trans 1: allocate file extent trans 2: free file extent trans 3: free file extent during old snapshot deletion trans 3: allocate file extent to new file trans 3: fsync new file Before the tree logging code, this was legal because the fsync would commit the transation that did the final data extent free and the transaction that allocated the extent to the new file at the same time. With the tree logging code, the tree log subtransaction can commit before the transaction that freed the extent. If we crash, we're left with two different files using the extent. * Don't wait in start_transaction if log replay is going on. This avoids deadlocks from iput while we're cleaning up link counts in the replay code. * Don't deadlock in replay_one_name by trying to read an inode off the disk while holding paths for the directory * Hold the buffer lock while we mark a buffer as written. This closes a race where someone is changing a buffer while we write it. They are supposed to mark it dirty again after they change it, but this violates the cow rules. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add a write ahead tree log to optimize synchronous operationsChris Mason2008-09-251-1/+7
| | | | | | | | | | | File syncs and directory syncs are optimized by copying their items into a special (copy-on-write) log tree. There is one log tree per subvolume and the btrfs super block points to a tree of log tree roots. After a crash, items are copied out of the log tree and back into the subvolume. See tree-log.c for all the details. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Wait for async bio submissions to make some progress at queue timeChris Mason2008-09-251-0/+1
| | | | | | | | Before, the btrfs bdi congestion function was used to test for too many async bios. This keeps that check to throttle pdflush, but also adds a check while queuing bios. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Transaction commit: don't use filemap_fdatawaitChris Mason2008-09-251-0/+1
| | | | | | | | | | | | After writing out all the remaining btree blocks in the transaction, the commit code would use filemap_fdatawait to make sure it was all on disk. This means it would wait for blocks written by other procs as well. The new code walks the list of blocks for this transaction again and waits only for those required by this transaction. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Online btree defragmentation fixesChris Mason2008-09-251-6/+0
| | | | | | | | | | The btree defragger wasn't making forward progress because the new key wasn't being saved by the btrfs_search_forward function. This also disables the automatic btree defrag, it wasn't scaling well to huge filesystems. The auto-defrag needs to be done differently. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Add btrfs_end_transaction_throttle to force writers to wait for pending commitsChris Mason2008-09-251-1/+0
| | | | | | | | | | | | | | | | The existing throttle mechanism was often not sufficient to prevent new writers from coming in and making a given transaction run forever. This adds an explicit wait at the end of most operations so they will allow the current transaction to close. There is no wait inside file_write, inode updates, or cow filling, all which have different deadlock possibilities. This is a temporary measure until better asynchronous commit support is added. This code leads to stalls as it waits for data=ordered writeback, and it really needs to be fixed. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add mount -o degraded to allow mounts to continue with missing devicesChris Mason2008-09-251-1/+2
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Handle write errors on raid1 and raid10Chris Mason2008-09-251-1/+1
| | | | | | | | | | | | When duplicate copies exist, writes are allowed to fail to one of those copies. This changeset includes a few changes that allow the FS to continue even when some IOs fail. It also adds verification of the parent generation number for btree blocks. This generation is stored in the pointer to a block, and it ensures that missed writes to are detected. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Pass down the expected generation number when reading tree blocksChris Mason2008-09-251-3/+4
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Create a work queue for bio writesChris Mason2008-09-251-0/+3
| | | | | | | This allows checksumming to happen in parallel among many cpus, and keeps us from bogging down pdflush with the checksumming code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Write out all super blocks on commit, and bring back proper barrier ↵Chris Mason2008-09-251-0/+1
| | | | | | support Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Handle data block end_io through the async work queueChris Mason2008-09-251-0/+2
| | | | | | | Before it was done by the bio end_io routine, the work queue code is able to scale much better with faster IO subsystems. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Verify checksums on tree blocks found without read_tree_blockChris Mason2008-09-251-0/+2
| | | | | | | | | | | | | | | | | Checksums were only verified by btrfs_read_tree_block, which meant the functions to probe the page cache for blocks were not validating checksums. Normally this is fine because the buffers will only be in cache if they have already been validated. But, there is a window while the buffer is being read from disk where it could be up to date in the cache but not yet verified. This patch makes sure all buffers go through checksum verification before they are used. This is safer, and it prevents modification of buffers before they go through the csum code. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add support for device scanning and detection ioctlsChris Mason2008-09-251-1/+3
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add support for multiple devices per filesystemChris Mason2008-09-251-0/+2
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add some simple throttling to wait for data=ordered and snapshot deletionChris Mason2008-09-251-0/+1
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add data=ordered supportChris Mason2008-09-251-0/+2
| | | | | | | | This forces file data extents down the disk along with the metadata that references them. The current implementation is fairly simple, and just writes out all of the dirty pages in an inode before the commit. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Support for online FS resize (grow and shrink)Chris Mason2008-09-251-0/+2
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add back file data checksummingChris Mason2008-09-251-0/+2
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add back the online defragging codeChris Mason2008-09-251-0/+7
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Allow tree blocks larger than the page sizeChris Mason2008-09-251-4/+5
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Create extent_buffer interface for large blocksizesChris Mason2008-09-251-44/+11
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Use balance_dirty_pages_nr on btree blocksChris Mason2008-09-251-1/+1
| | | | | | | | btrfs_btree_balance_dirty is changed to pass the number of pages dirtied for more accurate dirty throttling. This lets the VM make better decisions about when to force some writeback. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Add support for defragging files via btrfsctl -d. Avoid OOM on extent treeChris Mason2007-09-101-0/+2
| | | | | | defrag. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add per-root block accounting and sysfs entriesJosef Bacik2007-08-291-1/+2
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Extent based page cache code. This uses an rbtree of extents and testsChris Mason2007-08-271-1/+0
| | | | | | instead of buffer heads. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add BH_Defrag to mark buffers that are in need of defraggingChris Mason2007-08-101-0/+2
| | | | | | | | This allows the tree walking code to defrag only the newly allocated buffers, it seems to be a good balance between perfect defragging and the performance hit of repeatedly reallocating blocks. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: crash recovery fixesChris Mason2007-06-281-0/+1
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add the ability to find and remove dead roots after a crash.Chris Mason2007-06-221-0/+3
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: add GPLv2Chris Mason2007-06-121-0/+18
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix page cache memory leakChris Mason2007-05-021-0/+2
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: directory readaheadChris Mason2007-05-011-0/+6
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: add a device id to device itemsChris Mason2007-04-121-0/+1
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: add disk ioctl, mostly workingChris Mason2007-04-121-0/+6
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: create a logical->phsyical block number mapping schemeChris Mason2007-04-111-0/+1
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: groundwork for subvolume and snapshot rootsChris Mason2007-04-091-0/+2
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: still corruption huntingChris Mason2007-04-021-3/+1
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: verify csums on readChris Mason2007-03-291-0/+2
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: use a btree inode instead of sb_getblkChris Mason2007-03-281-2/+3
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: transaction reworkChris Mason2007-03-221-4/+4
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Mountable btrfs, with readdirChris Mason2007-03-221-23/+28
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: initial move to kernel module landChris Mason2007-03-211-0/+1
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: transaction handles everywhereChris Mason2007-03-161-6/+10
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>