summaryrefslogtreecommitdiffstats
path: root/xlators/cluster/ec/src/ec-inode-write.c
Commit message (Collapse)AuthorAgeFilesLines
* cluster/ec: quorum-count implementationPranith Kumar K2019-09-081-32/+29
| | | | | | fixes: #721 Change-Id: I5333540e3c635ccf441cf1f4696e4c8986e38ea8 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: Update lock->good_mask on parent fop failurePranith Kumar K2019-08-071-0/+2
| | | | | | | | | | When discard/truncate performs write fop, it should do so after updating lock->good_mask to make sure readv happens on the correct mask fixes bz#1727081 Change-Id: Idfef0bbcca8860d53707094722e6ba3f81c583b7 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: Fix reopen flags to avoid misbehaviorPranith Kumar K2019-07-301-2/+5
| | | | | | | | | | | | | | | | | | | | | | | Problem: when a file needs to be re-opened O_APPEND and O_EXCL flags are not filtered in EC. - O_APPEND should be filtered because EC doesn't send O_APPEND below EC for open to make sure writes happen on the individual fragments instead of at the end of the file. - O_EXCL should be filtered because shd could have created the file so even when file exists open should succeed - O_CREAT should be filtered because open happens with gfid as parameter. So open fop will create just the gfid which will lead to problems. Fix: Filter out these two flags in reopen. Change-Id: Ia280470fcb5188a09caa07bf665a2a94bce23bc4 Fixes: bz#1733935 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: Always read from good-maskPranith Kumar K2019-07-261-5/+22
| | | | | | | | | | There are cases where fop->mask may have fop->healing added and readv shouldn't be wound on fop->healing. To avoid this always wind readv to lock->good_mask fixes bz#1727081 Change-Id: I2226ef0229daf5ff315d51e868b980ee48060b87 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: fix EIO error for concurrent writes on sparse filesXavi Hernandez2019-07-241-9/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | EC doesn't allow concurrent writes on overlapping areas, they are serialized. However non-overlapping writes are serviced in parallel. When a write is not aligned, EC first needs to read the entire chunk from disk, apply the modified fragment and write it again. The problem appears on sparse files because a write to an offset implicitly creates data on offsets below it (so, in some way, they are overlapping). For example, if a file is empty and we read 10 bytes from offset 10, read() will return 0 bytes. Now, if we write one byte at offset 1M and retry the same read, the system call will return 10 bytes (all containing 0's). So if we have two writes, the first one at offset 10 and the second one at offset 1M, EC will send both in parallel because they do not overlap. However, the first one will try to read missing data from the first chunk (i.e. offsets 0 to 9) to recombine the entire chunk and do the final write. This read will happen in parallel with the write to 1M. What could happen is that half of the bricks process the write before the read, and the half do the read before the write. Some bricks will return 10 bytes of data while the otherw will return 0 bytes (because the file on the brick has not been expanded yet). When EC tries to recombine the answers from the bricks, it can't, because it needs more than half consistent answers to recover the data. So this read fails with EIO error. This error is propagated to the parent write, which is aborted and EIO is returned to the application. The issue happened because EC assumed that a write to a given offset implies that offsets below it exist. This fix prevents the read of the chunk from bricks if the current size of the file is smaller than the read chunk offset. This size is correctly tracked, so this fixes the issue. Also modifying ec-stripe.t file for Test #13 within it. In this patch, if a file size is less than the offset we are writing, we fill zeros in head and tail and do not consider it strip cache miss. That actually make sense as we know what data that part holds and there is no need of reading it from bricks. Change-Id: Ic342e8c35c555b8534109e9314c9a0710b6225d6 Fixes: bz#1730715 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* multiple files: another attempt to remove includesYaniv Kaul2019-06-141-4/+0
| | | | | | | | | | | | | | | | | | There are many include statements that are not needed. A previous more ambitious attempt failed because of *BSD plafrom (see https://review.gluster.org/#/c/glusterfs/+/21929/ ) Now trying a more conservative reduction. It does not solve all circular deps that we have, but it does reduce some of them. There is just too much to handle reasonably (dht-common.h includes dht-lock.h which includes dht-common.h ...), but it does reduce the overall number of lines of include we need to look at in the future to understand and fix the mess later one. Change-Id: I550cd001bdefb8be0fe67632f783c0ef6bee3f9f updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* cluster/ec: fix fd reopenXavi Hernandez2019-04-231-37/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently EC tries to reopen fd's that have been opened while a brick was down. This is done as part of regular write operations, just after having acquired the locks, and it's sent as a sub-fop of the main write fop. There were two problems: 1. The reopen was attempted on all UP bricks, even if a previous lock didn't succeed. This is incorrect because most probably the open will fail. 2. If reopen is sent and fails, the error is propagated to the main operation, causing it to fail when it shouldn't. To fix this, we only attempt reopens on bricks where the current fop owns a lock, and we prevent any error to be propagated to the main fop. To implement this behaviour an argument used to indicate the minimum number of required answers has overloaded to also include some flags. To make the change consistent, it has been necessary to rename the argument, which means that a lot of files have been changed. However there are no functional changes. This change has also uncovered a problem in discard code, which didn't correctely process requests of small sizes because no real discard fop was being processed, only a write of 0's on some region. In this case some fields of the fop remained uninitialized or with incorrect values. To fix this, a new function has been created to simulate success on a fop and it's used in the discard case. Thanks to Pranith for providing a test script that has also detected an issue in this patch. This patch includes a small modification of this script to force data to be written into bricks before stopping them. Change-Id: If272343873369186c2fb8f43c1d9c52c3ea304ec Fixes: bz#1699866 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* ec: fix truncate lock to cover the write in tuncate cleanKinglong Mee2019-04-121-2/+6
| | | | | | | | | | ec_truncate_clean does writing under the lock granted for truncate, but the lock is calculated by ec_adjust_offset_up, so that, the write in ec_truncate_clean is out of lock. Updates: bz#1699189 Change-Id: Idbe1fd48d26afe49c36b77db9f12e0907f5a4134 Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
* libglusterfs: Move devel headers under glusterfs directoryShyamsundarR2018-12-051-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | libglusterfs devel package headers are referenced in code using include semantics for a program, this while it works can be better especially when dealing with out of tree xlator builds or in general out of tree devel package usage. Towards this, the following changes are done, - moved all devel headers under a glusterfs directory - Included these headers using system header notation <> in all code outside of libglusterfs - Included these headers using own program notation "" within libglusterfs This change although big, is just moving around the headers and making it correct when including these headers from other sources. This helps us correctly include libglusterfs includes without namespace conflicts. Change-Id: Id2a98854e671a7ee5d73be44da5ba1a74252423b Updates: bz#1193929 Signed-off-by: ShyamsundarR <srangana@redhat.com>
* all: fix warnings on non 64-bits architecturesXavi Hernandez2018-10-101-18/+18
| | | | | | | | | | When compiling in other architectures there appear many warnings. Some of them are actual problems that prevent gluster to work correctly on those architectures. Change-Id: Icdc7107a2bc2da662903c51910beddb84bdf03c0 fixes: bz#1632717 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* Land part 2 of clang-format changesGluster Ant2018-09-121-900/+840
| | | | | Change-Id: Ia84cc24c8924e6d22d02ac15f611c10e26db99b4 Signed-off-by: Nigel Babu <nigelb@redhat.com>
* All: run codespell on the code and fix issues.Yaniv Kaul2018-07-221-1/+1
| | | | | | | | | | | | Please review, it's not always just the comments that were fixed. I've had to revert of course all calls to creat() that were changed to create() ... Only compile-tested! Change-Id: I7d02e82d9766e272a7fd9cc68e51901d69e5aab5 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* cluster/ec: Fix bugs in stripe-cache featureAshish Pandey2017-12-051-0/+3
| | | | | | | | | | | | | | 1 - This patch fixes a bug in ec_update_stripe() that prevented some stripes to be updated after a write. 2 - This patch also include code modification for the case in which a file does not exist and we write on unaligned offset and user size, the last stripe on which "end" will fall should also be cached. Change-Id: I069cb4be1c8d59c206e3b35a6991e1fbdbc9b474 BUG: 1520758 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* cluster/ec: EC DISCARD doesn't punch hole properlySunil Kumar Acharya2017-11-281-2/+4
| | | | | | | | | | | | | | Problem: DISCARD operation on EC volume was punching hole of lesser size than the specified size in some cases. Solution: EC was not handling punch hole for tail part in some cases. Updated the code to handle it appropriately. BUG: 1516206 Change-Id: If3e69e417c3e5034afee04e78f5f78855e65f932 Signed-off-by: Sunil Kumar Acharya <sheggodu@redhat.com>
* cluster/ec: Keep last written strip in in-memory cacheAshish Pandey2017-11-101-14/+197
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Consider an EC volume with configuration 4 + 2. The stripe size for this would be 512 * 4 = 2048. That means, 2048 bytes of user data stored in one stripe. Let's say 2048 + 512 = 2560 bytes are already written on this volume. 512 Bytes would be in second stripe. Now, if there are sequential writes with offset 2560 and of size 1 Byte, we have to read the whole stripe, encode it with 1 Byte and then again have to write it back. Next, write with offset 2561 and size of 1 Byte will again READ-MODIFY-WRITE the whole stripe. This is causing bad performance because of lots of READ request travelling over the network. There are some tools and scenario's where such kind of load is coming and users are not aware of that. Example: fio and zip Solution: One possible solution to deal with this issue is to keep last stripe in memory. This way, we need not to read it again and we can save READ fop going over the network. Considering the above example, we have to keep last 2048 bytes (maximum) in memory per file. Change-Id: I3f95e6fc3ff81953646d374c445a40c6886b0b85 BUG: 1471753 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* cluster/ec: Implement DISCARD FOP for ECSunil Kumar Acharya2017-10-251-42/+323
| | | | | | | | | | | Updates #254 This code change implements DISCARD FOP support for EC. BUG: 1461018 Change-Id: I09a9cb2aa9d91ec27add4f422dc9074af5b8b2db Signed-off-by: Sunil Kumar Acharya <sheggodu@redhat.com>
* cluster/ec: Allow parallel writes in EC if possiblePranith Kumar K2017-10-241-42/+82
| | | | | | | | | | | | | | | | | | Problem: Ec at the moment sends one modification fop after another, so if some of the disks become slow, for a while then the wait time for the writes that are waiting in the queue becomes really bad. Fix: Allow parallel writes when possible. For this we need to make 3 changes. 1) Each fop now has range parameters they will be updating. 2) Xattrop is changed to handle parallel xattrop requests where some would be modifying just dirty xattr. 3) Fops that refer to size now take locks and update the locks. Fixes #251 Change-Id: Ibc3c15372f91bbd6fb617f0d99399b3149fa64b2 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* cluster/ec: add functions for stripe alignmentXavier Hernandez2017-10-131-5/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch removes old functions to align offsets and sizes to stripe size boundaries and adds new ones to offer more possibilities. The new functions are: * ec_adjust_offset_down() Aligns a given offset to a multiple of the stripe size equal or smaller than the initial one. It returns the size of the gap between the aligned offset and the given one. * ec_adjust_offset_up() Aligns a given offset to a multiple of the stripe size equal or greater than the initial one. It returns the size of the skipped region between the given offset and the aligned one. If an overflow happens, the returned valid has negative sign (but correct value) and the offset is set to the maximum value (not aligned). * ec_adjust_size_down() Aligns the given size to a multiple of the stripe size equal or smaller than the initial one. It returns the size of the missed region between the aligned size and the given one. * ec_adjust_size_up() Aligns the given size to a multiple of the stripe size equal or greater than the initial one. It returns the size of the gap between the given size and the aligned one. If an overflow happens, the returned value has negative sign (but correct value) and the size is set to the maximum value (not aligned). These functions have been defined in ec-helpers.h as static inline since they are very small and compilers can optimize them (specially the 'scale' argument). Change-Id: I4c91009ad02f76c73772034dfde27ee1c78a80d7 Signed-off-by: Xavier Hernandez <jahernan@redhat.com>
* ec/cluster: Update failure of fop on a brick properlyAshish Pandey2017-07-311-7/+16
| | | | | | | | | | | | | | | | | | | | | | | | Problem: In case of truncate, if writev or open fails on a brick, in some cases it does not mark the failure onlock->good_mask. This causes the update of size and version on all the bricks even if it has failed on one of the brick. That ultimately causes a data corruption. Solution: In callback of such writev and open calls, mark fop->good for parent too. Thanks Pranith Kumar K <pkarampu@redhat.com> for finding the root cause. Change-Id: I8a1da2888bff53b91a0d362b8c44fcdf658e7466 BUG: 1476205 Signed-off-by: Ashish Pandey <aspandey@redhat.com> Reviewed-on: https://review.gluster.org/17906 Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/ec: Implement FALLOCATE FOP for ECSunil Kumar Acharya2017-05-231-1/+202
| | | | | | | | | | | | | | | FALLOCATE file operations is not implemented in the existing EC code. This change set implements it for EC. BUG: 1448293 Change-Id: Id9ed914db984c327c16878a5b2304a0ea461b623 Signed-off-by: Sunil Kumar Acharya <sheggodu@redhat.com> Reviewed-on: https://review.gluster.org/15200 Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/ec: Check xdata to avoid memory leakAshish Pandey2016-12-021-0/+3
| | | | | | | | | | | | | | | | | | | | | | Problem: ec_writev_start calls ec_make_internal_fop_xdata to set "yes" in xdata before ec_readv (an internal fop) is called for head and tail. Second call to this function is overwriting the previous allocated dict_t to "xdata", which results in memory leak. Solution: In ec_make_internal_fop_xdata, check if *xdata is NULL or not to avoid overwriting *xdata. Change-Id: I49b83923e11aff9b92d002e86424c0c2e1f5f74f BUG: 1400818 Signed-off-by: Ashish Pandey <aspandey@redhat.com> Reviewed-on: http://review.gluster.org/16007 Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org>
* cluster/ec: Add support for hardware accelerationXavier Hernandez2016-09-081-92/+91
| | | | | | | | | | | | | | | | | | | | | | | This patch implements functionalities for fast encoding/decoding using hardware support. Currently optimized x86_64, SSE and AVX is added. Additionally this patch implements a caching mecanism for inverse matrices to reduce computation time, as well as a new method for computing the inverse that takes quadratic time instead of cubic. Finally some unnecessary memory copies have been eliminated to further increase performance. Change-Id: I26c75f26fb4201bd22b51335448ea4357235065a BUG: 1289922 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/12837 Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Smoke: Gluster Build System <jenkins@build.gluster.org> NetBSD-regression: NetBSD Build System <jenkins@build.gluster.org> CentOS-regression: Gluster Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/ec: Create copy of dict for setting internal xattrsPranith Kumar K2015-12-011-11/+11
| | | | | | | | | | | | | | | | | | | | | | Problem: Ec takes a ref of the request xdata and sets trusted.ec.version/algo etc xattrs as part of it. But this request xdata could be using same dictionary to do the operation on multiple subvolumes, due to which other subvolumes will have internal xattrs of ec in it and will be created on subvols where they are not supposed to appear. Fix: Take a copy of the request xdata/dict to prevent this from happening. Most of the debugging work and test script is contributed by Nitya. BUG: 1286910 Change-Id: If146435dfb89656158dbed3862a3e9a0cda60581 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/12831 Tested-by: NetBSD Build System <jenkins@build.gluster.org> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es>
* cluster/ec: Mark internal fops appropriatelyXavier Hernandez2015-11-191-13/+48
| | | | | | | | | | | | | 1) Mark read fops in read-modify-write by EC as internal. 2) Handle uid/gid set/reset correctly BUG: 1282761 Change-Id: I5c1ce0cd6213367eaead5fed33aa2397c4e46df7 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/12599 Tested-by: Gluster Build System <jenkins@build.gluster.com> Tested-by: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* fd: Do fd_bind on successful openPranith Kumar K2015-08-281-0/+1
| | | | | | | | | | | | | | | - fd_unref should decrement fd->inode->fd_count only if it is present in the inode's fd list. - successful open/opendir should perform fd_bind. Change-Id: I81dd04f330e2fee86369a6dc7147af44f3d49169 BUG: 1207735 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/11044 Reviewed-by: Anoop C S <anoopcs@redhat.com> Tested-by: NetBSD Build System <jenkins@build.gluster.org> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Raghavendra G <rgowdapp@redhat.com>
* cluster/ec: Fix tracking of good bricksXavier Hernandez2015-08-061-28/+19
| | | | | | | | | | | | | | | | | | | The bitmask of good and bad bricks was kept in the context of the corresponding inode or fd. This was problematic when an external process (another client or the self-heal process) did heal the bricks but no one changed the bitmaks of other clients. This patch removes the bitmask stored in the context and calculates which bricks are healthy after locking them and doing the initial xattrop. After that, it's updated using the result of each fop. Change-Id: I225e31cd219a12af4ca58871d8a4bb6f742b223c BUG: 1236065 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/11844 Tested-by: NetBSD Build System <jenkins@build.gluster.org> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/ec: Minimize usage of EIO errorXavier Hernandez2015-07-281-330/+180
| | | | | | | | | | Change-Id: I82e245615419c2006a2d1b5e94ff0908d2f5e891 BUG: 1245276 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/11741 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: NetBSD Build System <jenkins@build.gluster.org>
* ec: Porting messages to new logging frameworkNandaja Varma2015-06-261-61/+120
| | | | | | | | | Change-Id: Ia05ae750a245a37d48978e5f37b52f4fb0507a8c BUG: 1194640 Signed-off-by: Nandaja Varma <nandaja.varma@gmail.com> Reviewed-on: http://review.gluster.org/10465 Tested-by: NetBSD Build System <jenkins@build.gluster.org> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es>
* cluster/ec: Forced unlock when lock contention is detectedXavier Hernandez2015-05-271-81/+78
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | EC uses an eager lock mechanism to optimize multiple read/write requests on the same entry or inode. This increases performance but can have adverse results when other clients try to access the same entry/inode. To solve this, this patch adds a functionality to detect when this happens and force an earlier release to not block other clients. The method consists on requesting GF_GLUSTERFS_INODELK_COUNT and GF_GLUSTERFS_ENTRYLK_COUNT for all fops that take a lock. When this count is greater than one, the lock is marked to be released. All fops already waiting for this lock will be executed normally before releasing the lock, but new requests that also require it will be blocked and restarted after the lock has been released and reacquired again. Another problem was that some operations did correctly lock the parent of an entry when needed, but got the size and version xattrs from the entry instead of the parent. This patch solves this problem by binding all queries of size and version to each lock and replacing all entrylk calls by inodelk ones to remove concurrent updates on directory metadata. This also allows rename to correctly update source and destination directories. Change-Id: I2df0b22bc6f407d49f3cbf0733b0720015bacfbd BUG: 1165041 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/10852 Tested-by: NetBSD Build System Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* cluster/ec: Fix iobuf mem leakRaghavendra Talur2015-04-101-2/+2
| | | | | | | | | | | | | | | | iobuf_get and iobref_add implicitly ref the iobuf. Hence, it is necessary to unref iobuf before setting it to NULL. Change-Id: Icadd8925574cf04fe708d8090868e49356653a8e Signed-off-by: Raghavendra Talur <rtalur@redhat.com> Reviewed-on: http://review.gluster.org/9818 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Niels de Vos <ndevos@redhat.com> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/ec: Have same ec_manager_* for [f]set/[f]removexattrPranith Kumar K2015-04-081-206/+80
| | | | | | | | | | | | | ec_manager_xxx() function for [f]set/[f]remove xattr is exactly same except the reporting part. So moved that to common function and use same ec_manager_xattr() function for all these fops. Change-Id: Iaa57023b800f8d1f3f6a827f4ceba9b0a0337336 BUG: 1199767 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/10036 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es>
* cluster/ec: Refactor inode-writevPranith Kumar K2015-04-061-441/+87
| | | | | | | | | | | | | | All _cbk() functions in inode-write.c do same things, i.e. store op_ret/op_errno, stat structures if they are available and combine them. Moved this common operation into one function ec_inode_write_cbk() and made all the other _cbk() functions to use this instead. Change-Id: I2387b9f2d9598ced6299a26ea1900e9deb9fadc4 BUG: 1199767 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/9981 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Dan Lambright <dlambrig@redhat.com>
* ec: Add trusted.ec.dirty xattrXavier Hernandez2015-02-231-2/+2
| | | | | | | | | | | | | | | | | This xattr will be incremented before each data modifying operation and decremented after it. This will add the possibility to detect partially updated writes and refuse them on reads. It will also be useful for interacting with index xlator and have a way to heal dispersed files from the self-heal daemon. Change-Id: Ie644a8dd074ae0f254c809c5863bdb030be5486a BUG: 1190581 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/9607 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* ec: Fix posix compliance failuresXavier Hernandez2015-01-281-34/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch solves some problems that caused dispersed volumes to not pass posix smoke tests: * Problems in open/create with O_WRONLY Opening files with -w- permissions using O_WRONLY returned an EACCES error because internally O_WRONLY was replaced with O_RDWR. * Problems with entrylk on renames. When source and destination were the same, ec tried to acquire the same entrylk twice, causing a deadlock. * Overwrite of a variable when reordering locks. On a rename, if the second lock needed to be placed at the beggining of the list, the 'lock' variable was overwritten and later its timer was cancelled, cancelling the incorrect one. * Handle O_TRUNC in open. When O_TRUNC was received in an open call, it was blindly propagated to child subvolumes. This caused a discrepancy between real file size and the size stored into trusted.ec.size xattr. This has been solved by removing O_TRUNC from open and later calling ftruncate. Change-Id: I20c3d6e1c11be314be86879be54b728e01013798 BUG: 1161886 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/9420 Reviewed-by: Dan Lambright <dlambrig@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Pranith Kumar Karampuri <pkarampu@redhat.com> Tested-by: Pranith Kumar Karampuri <pkarampu@redhat.com>
* ec: Remove O_APPEND from flags on create and open.Xavier Hernandez2015-01-091-51/+53
| | | | | | | | | | | | | | | | | | | Allowing O_APPEND flag to pass through to the brick files corrupts fragment contents because writes are not stored on the desired place. Write fop has been modified so that it uses current file size as its write offset. This guarantees that all writes, even those comming from different file descriptors and clients, will write to the end of the file. Change-Id: I9f721f12217a98231fe52e344166d1c94172c272 BUG: 1161621 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/9079 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Dan Lambright <dlambrig@redhat.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/ec: Handle internal xattr get/setPranith Kumar K2015-01-081-52/+45
| | | | | | | | | | | | | | | | | | | | Problem: Internal xattrs of EC like trusted.ec.size/config/version can be modified by users and that can lead to misbehavior in EC. Fix: Don't let the user modify the xattrs. Hide these xattrs in getfattr outputs. Change-Id: I39cec96ae12826b506b496fda7da74201015fd75 BUG: 1178688 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com> Reviewed-on: http://review.gluster.org/9385 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Emmanuel Dreyfus <manu@netbsd.org> Tested-by: Emmanuel Dreyfus <manu@netbsd.org> Reviewed-by: Xavier Hernandez <xhernandez@datalab.es>
* ec: Fix return errors when not enough bricksXavier Hernandez2014-12-051-0/+5
| | | | | | | | | | | | | | | | | | | | | Changes introduced by this patch: * Fix an incorrect error propagation when the state of the life cycle of a fop returns an error. * Fix incorrect unlocking of failed locks. * Return ENOTCONN if there aren't enough bricks online. * In readdir(p) check that the fd has been successfully open by a previous opendir. Change-Id: Ib44f25a1297849ebcbab839332f3b6359f275ebe BUG: 1162805 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/9098 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* ec: Change licenseXavier Hernandez2014-12-031-16/+6
| | | | | | | | | | Change-Id: Iae90ade2421898417b53dec0417a610cf306c44b BUG: 1168167 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/9201 Reviewed-by: Kaleb KEITHLEY <kkeithle@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* ec: Fix self-heal issuesXavier Hernandez2014-10-211-14/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: Doing an 'ls' of a directory that has been modified while one of the bricks was down, sometimes returns the old directory contents. Cause: Directories are not marked when they are modified as files are. The ec xlator balances requests amongst available and healthy bricks. Since there is no way to detect that a directory is out of date in one of the bricks, it is used from time to time to return the directory contents. Solution: Basically the solution consists in use versioning information also for directories, however some additional changes have been necessary. Changes: * Use directory versioning: This required to lock full directory instead of a single entry for all requests that add or remove entries from it. This is needed to allow atomic version update. This affects the following fops: create, mkdir, mknod, link, symlink, rename, unlink, rmdir Another side effect is that opendir requires to do a previous lookup to get versioning information and discard out of date bricks for subsequent readdir(p) calls. * Restrict directory self-heal: Till now, when one discrepancy was found in lookup, a self-heal was automatically started. This caused the versioning information of a bad directory to be healed instantly, making the original problem to reapear again. To solve this, when a missing directory is detected in one or more bricks on lookup or opendir fops, only a partial self-heal is performed on it. A partial self-heal basically creates the directory but does not restore any additional information. This avoids that an 'ls' could repair the directory and cause the problem to happen again. With this change, output of 'ls' is always consistent. However, since the directory has been created in the brick, this allows any other operation on it (create new files, for example) to succeed on all bricks and not add additional work to the self-heal process. To force a self-heal of a directory, any other operation must be done on it. For example a getxattr. With these changes, the correct healing procedure that would avoid inconsistent directory browsing consists on a post-order traversal of directoriesi being healed. This way, the directory contents will be healed before healing the directory itslef. * Additional changes to fix self-heal errors - Don't use fop->fd to decide between fd/loc. open, opendir and create have an fd, but the correct data is in loc. - Fix incorrect management of bad bricks per inode/fd. - Fix incorrect selection of fop's target bricks when there are bad bricks involved. - Improved ec_loc_parent() to always return a parent loc as complete as possible. Change-Id: Iaf3df174d7857da57d4a87b4a8740a7048b366ad BUG: 1149726 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/8916 Reviewed-by: Dan Lambright <dlambrig@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com>
* ec: Fix invalid inode lock in ftruncateXavier Hernandez2014-09-191-9/+9
| | | | | | | | | | | | | | | | | | | The fops 'truncate' and 'ftruncate' share some code and inodelk() was always made against the inode inside the loc_t structure instead of that of fd_t. Since ftruncate has the loc initialized to NULL, this fop was executed without any lock, allowing some concurrent modifications in the file size. Also changed the way in which 'fop' and 'ffop' are differentiated in shared code. Now it uses 'id' field instead of checking if 'fd' is NULL. Change-Id: Ibd18accf2652193b395a841b9029729e5f4867c6 BUG: 1140396 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/8695 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* ec: Optimize read/write performanceXavier Hernandez2014-09-151-52/+65
| | | | | | | | | | | | | | | | | | This patch significantly improves performance of read/write operations on a dispersed volume by reusing previous inodelk/ entrylk operations on the same inode/entry. This reduces the latency of each individual operation considerably. Inode version and size are also updated when needed instead of on each request. This gives an additional boost. Change-Id: I4b98d5508c86b53032e16e295f72a3f83fd8fcac BUG: 1122586 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/8369 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Reviewed-by: Dan Lambright <dlambrig@redhat.com>
* cluster/ec: Fix incorrect management of NFS requestsXavier Hernandez2014-08-021-38/+17
| | | | | | | | | | | | | | | | | | | Some operations, specially those comming from NFS, do not use a regular fd and use an anonymous fd (i.e. a previous open call has not been sent). Any context information created during open or create will not be present on these fd's, so we simply return NULL for contexts of those fd. Also it seems that NFS can send write requests with a very big buffer (higher that the default value of 128 KB). Some changes have been made to correctly handle these large buffers. Change-Id: I281476bd0d2cbaad231822248d6a616fcf5d4003 BUG: 1122417 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/8367 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* ec: Fixed coveriry scan issuesXavier Hernandez2014-07-211-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CID list: 1226163 Logically dead code 1226166 Missing break in switch 1226167 Missing break in switch 1226168 Missing break in switch 1226169 Missing break in switch 1226170 Missing break in switch 1226171 Missing break in switch 1226172 Missing break in switch 1226173 Missing break in switch 1226174 Missing break in switch 1226175 Missing break in switch 1226176 Missing break in switch 1226177 Missing break in switch 1226178 Data race condition 1226179 Data race condition 1226180 Data race condition 1226181 Thread deadlock 1226182 Uninitialized pointer read 1226183 Uninitialized pointer read 1226184 Read from pointer after free Change-Id: I4d33aa42289371927175c43bb29e018df64fb943 BUG: 789278 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/8317 Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>
* cluster/ec: Added erasure code translatorXavier Hernandez2014-07-111-0/+2235
Change-Id: I293917501d5c2ca4cdc6303df30cf0b568cea361 BUG: 1118629 Signed-off-by: Xavier Hernandez <xhernandez@datalab.es> Reviewed-on: http://review.gluster.org/7749 Reviewed-by: Krishnan Parthasarathi <kparthas@redhat.com> Reviewed-by: Jeff Darcy <jdarcy@redhat.com> Tested-by: Gluster Build System <jenkins@build.gluster.com> Reviewed-by: Vijay Bellur <vbellur@redhat.com>