* features/index: Optimize link-count fetching code path
Problem:
AFR requests 'link-count' in lookup to check whether there are any pending
heals. Based on this information, AFR sets dirent->inode to NULL in
readdirp while heals are ongoing, to prevent serving bad data. Even after
heals are completed, fetching the link-count xattr leads to an opendir of
the xattrop directory and a read of its contents, just to find out that no
healing is needed, for every lookup. This went undetected until this
GitHub issue because ZFS can in some cases make readdir() calls very slow.
Since GlusterFS performs a lot of lookups, this slowed down all operations
and increased load on the system.
Code problem:
On any xattrop operation, the index xlator adds an index entry to the
relevant directories and, once the xattrop completes, deletes or keeps the
index in that directory based on the value posix returns for the xattrop.
AFR sends an all-zero xattrop for the changelog xattrs. This manipulates
priv->pending_count and sets the count back to -1, so the next lookup
triggers an opendir/readdir to find the actual link-count, because the
in-memory priv->pending_count is negative.
Fix:
1) Don't add to index on all-zero xattrop for a key.
2) Set pending-count to -1 when the first gfid is added into xattrop
directory, so that the next lookup can compute the link-count.
fixes: #1764
Change-Id: I8a02c7e811a72c46d78ddb2d9d4fdc2222a444e9
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* addressed comments
Change-Id: Ide42bb1c1237b525d168bf1a9b82eb1bdc3bc283
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* tests: Handle base index absence
Change-Id: I3cf11a8644ccf23e01537228766f864b63c49556
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* Addressed LOCK based comments, .t comments
Change-Id: I5f53e40820cade3a44259c1ac1a7f3c5f2f0f310
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
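The two-part fix can be pictured with a small sketch. This is illustrative
Python, not the index xlator's C code; the state layout and names
(pending_count, entries, xattrop values as a dict) are assumptions made for
the example.

    def should_add_index(xattrop_values):
        # Fix 1: an all-zero xattrop (what AFR sends for its changelog xattrs)
        # must never create an index entry.
        return any(v != 0 for v in xattrop_values.values())

    class IndexDirState:
        def __init__(self):
            self.pending_count = 0   # in-memory count backing the link-count xattr
            self.entries = set()     # gfids currently present in the xattrop dir

        def add_entry(self, gfid, xattrop_values):
            if not should_add_index(xattrop_values):
                return
            if not self.entries:
                # Fix 2: the first real entry invalidates the cached count so
                # the next lookup recomputes the link-count.
                self.pending_count = -1
            self.entries.add(gfid)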
commit 8e7bfd6a58b444b26cb50fb98870e77302f3b9eb changed the syntax for
arbiter volume creation to 'replica 2 arbiter 1', while still allowing
the old syntax of 'replica 3 arbiter 1'. But while doing so, it also
removed a conditional check, thereby allowing replica count > 3. This
patch fixes it.
Fixes: #2192
Change-Id: Ie109325adb6d78e287e658fd5f59c26ad002e2d3
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
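As a rough sketch of the restored validation (illustrative only; the real
check lives in glusterd's volume-create path and these names are invented):

    def validate_arbiter_counts(replica_count, arbiter_count):
        if arbiter_count == 0:
            return                      # not an arbiter volume, nothing to check
        if arbiter_count != 1:
            raise ValueError("arbiter count must be 1")
        # New syntax: 'replica 2 arbiter 1'; old syntax: 'replica 3 arbiter 1'.
        # Without this check, replica counts > 3 slipped through.
        if replica_count not in (2, 3):
            raise ValueError("replica count must be 2 or 3 with an arbiter")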
TODO:
Remove 'slave-timeout' and 'slave-gluster-command-dir'.
These variables are defined in geo-replication/gsyncd.conf.in.
So I will remove them when I change that folder.
Change-Id: Ib9167ca586d83e01f8ec755cdf58b3438184c9dd
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
commit f5e1eb87d4af44be3b317b7f99ab88f89c2f0b1a meant to enable the
volume option only for replica volumes but inadvertently enabled
it for all volume types. Fixing it now.
Also found a bug in glusterd where disabling the option on plain
distribute volumes was succeeding even though setting it in the first
place fails. Fixed that too.
Fixes: #1483
Change-Id: Icb6c169a8eec44cc4fb4dd636405d3b3485e91b4
Reported-by: Sheetal Pamecha <spamecha@redhat.com>
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
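A hedged sketch of the two gates described above; the volume-type names and
the function are placeholders, not glusterd's actual option table:

    REPLICATE_TYPES = {"replicate", "distributed-replicate"}

    def validate_option_change(volume_type):
        # Apply the same gate to enabling and disabling, so turning the option
        # off on a plain distribute volume fails just as turning it on does.
        if volume_type not in REPLICATE_TYPES:
            raise ValueError("option is only supported on replicate volumes")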
1. The option has been enabled and tested for quite some time now in RHHI-V
downstream and I think it is safe to make it 'on' by default. Since it
is not possible to simply change it from 'off' to 'on' without breaking
rolling upgrades, old clients etc., I have made it the default only for new
volumes starting from op-version GD_OP_VERSION_9_0.
Note: If you do a volume reset, the option will be turned back off.
This is okay, as the dir's gfid will be captured in the 'xattrop' folder and
heals will proceed. There might be stale entries inside the 'entry-changes'
folder, which will be removed when we enable the option again.
2. I encountered a customer issue where entry heal was pending on a directory
with 236436 files in it and the glustershd.log output was just stuck at
"performing entry selfheal", so I have added DEBUG-level logs to give us
more information about whether entry heal and data heal are progressing
(metadata heal doesn't take much time). That way, we have a quick visual
indication that things are not 'stuck' if we briefly enable debug logs,
instead of taking statedumps or checking profile info etc.
Fixes: #1483
Change-Id: I4f116f8c92f8cd33f209b758ff14f3c7e1981422
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
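The default-value rule in point 1 boils down to something like the following
sketch. GD_OP_VERSION_9_0 is the constant named in the text, but its value
here and the function itself are illustrative assumptions:

    GD_OP_VERSION_9_0 = 90000   # assumed numeric encoding, for illustration only

    def granular_entry_heal_default(volume_created_at_op_version):
        # Only volumes created at op-version >= 9.0 get the new default, so
        # rolling upgrades and old clients keep the previous 'off' behaviour.
        return "on" if volume_created_at_op_version >= GD_OP_VERSION_9_0 else "off"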
Problem 1:
When a directory was renamed while a brick was down, entry-heal always did
an rm -rf on that directory on the sink at the old location and then
recreated the directory hierarchy (mkdir) at the new location. This is
inefficient.
Problem 2:
The order in which the renamed directory is healed may lead to a scenario
where the directory at the new location is created before it is deleted
from the old location, leaving two directories with the same gfid in posix.
Fix:
As part of heal, if the old location is healed first and the directory is
not present on the source brick, always rename it into a hidden directory
inside the sink brick, so that when heal is triggered for the new location,
shd can rename it from this hidden directory to the new location.
If the new-location heal is triggered first and it detects that the
directory already exists on the brick, it should skip healing the
directory until it appears in the hidden directory.
Credits: Ravi, for the rename-data-loss.t script
Fixes: #1211
Change-Id: I0cba2006f35cd03d314d18211ce0bd530e254843
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
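A rough sketch of the ordering rule in the fix. The sink/brick object, its
methods and the hidden-directory naming are all invented for illustration;
AFR's real entry self-heal is considerably more involved:

    def heal_old_location(sink, old_path, gfid, present_on_source):
        if not present_on_source:
            # Instead of rm -rf, park the directory in a hidden location keyed
            # by gfid so its contents survive until the new location is healed.
            sink.rename(old_path, "<hidden>/" + gfid)

    def heal_new_location(sink, new_path, gfid):
        hidden = "<hidden>/" + gfid
        if sink.has_directory_with_gfid(gfid) and not sink.exists(hidden):
            # The old location is not parked yet; skip for now so the brick
            # never holds two directories with the same gfid.
            return "skip"
        if sink.exists(hidden):
            sink.rename(hidden, new_path)   # move the parked directory into place
        else:
            sink.mkdir(new_path, gfid)      # nothing was parked: plain creation
        return "healed"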
* Also add some time gaps in other tests to see if we get things right.
* Create a directory 'tests/000/', which can host any tests that are flaky.
* Move all the tests mentioned in the issue to the above directory.
* As the above directory gets tested first, all flaky tests are reported quickly.
* Change `run-tests.sh` to continue running tests even if flaky tests fail.
Reference: gluster/project-infrastructure#72
Updates: #1000
Change-Id: Ifdafa38d083ebd80f7ae3cbbc9aa3b68b6d21d0e
Signed-off-by: Amar Tumballi <amar@kadalu.io>
Problem:
If the favourite child policy is set, automatic split-brain resolution
should work in all cases. This was failing when quorum count was set to a
non-zero value: the initial lookup before the read txn failed with
ENOTCONN, and since we didn't have a readable subvol, we failed the fop.
We were only looking at the split-brain resolution choice set through the
CLI command.
Fix:
We will now consider the favourite child policy if a split-brain choice
has not been set via the CLI command.
Change-Id: Id2016c3a90d0763ac6f1a0131571053f595576f0
Fixes: #1404
Signed-off-by: Mohammed Rafi KC <rafi.kavungal@iternity.com>
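A minimal sketch of the fallback, assuming hypothetical names for the CLI
choice and the policy callback:

    def pick_split_brain_source(cli_choice, favourite_child_policy, replies):
        if cli_choice is not None:
            return cli_choice                        # explicit CLI choice always wins
        if favourite_child_policy is not None:
            return favourite_child_policy(replies)   # new: fall back to the policy
        return None                                  # still an unresolved split-brain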
Problem:
AFR doesn't delay the post-op for the fsync fop. For fsync-heavy workloads
this leads to an unnecessary fxattrop/finodelk for every fsync, resulting
in bad performance.
Fix:
Delay the post-op for fsync as well. Add a special flag in xdata to
indicate that AFR shouldn't delay the post-op in cases where either the
process will terminate or a graph switch will happen; otherwise it leads
to unnecessary heals when the graph switch or process termination happens
before the delayed post-op completes.
Fixes: #1253
Change-Id: I531940d13269a111c49e0510d49514dc169f4577
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
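A sketch of the resulting decision; the xdata key name here is a
placeholder, not the actual flag introduced by the patch:

    def should_delay_post_op(fop, xdata):
        if xdata.get("no-delayed-post-op"):     # hypothetical key: process is about
            return False                        # to exit or a graph switch is imminent
        # fsync now joins the fops whose post-op is delayed/batched, avoiding a
        # fxattrop/finodelk round trip per fsync.
        return fop in {"writev", "ftruncate", "fsync"}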
Problem:
In an add-brick that increases the replica count, SHD was restarted after
the pending xattrs were set on the new bricks and the bricks were added.
But before SHD is restarted, there is a possibility that the old SHD does
a scan on the root directory, sees that no heal is needed, and deletes the
index for the root dir, leading to no heals until a lookup is executed on
the mount.
Fix:
Stop SHD, perform the pending-xattr setting and the addition of the new
bricks, and then restart SHD.
Fixes: #1240
Change-Id: I94fd7c6c909211b597185dfe097a559db6c0d00f
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
The general idea of the changes is to prevent resetting event generation
to zero in the inode ctx, since event gen is something that should
follow 'causal order'.
Change #1:
For a read txn, in the inode refresh cbk, if event_generation is found to
be zero, we fail the read fop. This is not needed because a change in
event gen is only a marker for the next inode refresh to happen and should
not be taken into account by the current read txn.
Change #2:
The event gen being zero above can happen if there is a racing lookup,
which resets event gen (in afr_lookup_done) when there are non-zero afr
xattrs. The reset is done only to trigger an inode refresh and a possible
client-side heal on the next lookup. That can be achieved by setting the
need_refresh flag in the inode ctx. So all occurrences of resetting event
gen to zero have been replaced with a call to afr_inode_need_refresh_set().
Change #3:
In both the lookup and discover paths, we do an inode refresh which is not
required since all three essentially do the same thing: update the inode
ctx with the good/bad copies from the brick replies. Inode refresh also
triggers background heals, but I think it is okay to do that when we call
refresh during the read and write txns and not in the lookup path.
The .t tests which relied on the inode refresh in the lookup path to
trigger heals have been changed to do a read txn so that the inode refresh
and the heal happen.
Change-Id: Iebf39a9be6ffd7ffd6e4046c96b0fa78ade6c5ec
Fixes: #1179
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
Reported-by: Erik Jacobson <erik.jacobson at hpe.com>
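Change #2 can be summarised with a small sketch; afr_inode_need_refresh_set()
is named in the text, while the ctx layout here is an illustrative stand-in:

    class AfrInodeCtx:
        def __init__(self):
            self.event_generation = 1
            self.need_refresh = False

    def afr_inode_need_refresh_set(ctx):
        ctx.need_refresh = True            # next lookup/txn refreshes the inode ctx

    def afr_lookup_done(ctx, has_nonzero_afr_xattrs):
        if has_nonzero_afr_xattrs:
            # Old behaviour: ctx.event_generation = 0  (breaks causal ordering
            # and makes a concurrent read txn fail needlessly)
            afr_inode_need_refresh_set(ctx)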
The current implementation assumes that the ping event will come after the
connect event, but that may not be the case when, after the socket
connection, fds need to be re-opened, which consumes more time. So handle
the ping/child-up events arriving in any order.
fixes: bz#1800583
Change-Id: I6bcdc0caa503bdc039ef2b4739fbf4afae121f05
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
fixes: bz#1760189
Change-Id: Iffbf8d6f4c50b8e2de8364658697bdbe96549f5d
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
After add-brick and rebalance, the ctime xattr is not present on
rebalanced directories on the new brick. This patch fixes that.
Note that ctime still doesn't support consistent time across distribute
sub-volumes.
This patch also fixes the in-memory inconsistency of the time attributes
when metadata is self-healed.
Change-Id: Ia20506f1839021bf61d4753191e7dc34b31bb2df
fixes: bz#1734026
Signed-off-by: Kotresh HR <khiremat@redhat.com>
Problem:
On an EC volume, during upgrade from an older version where the ctime
feature is not enabled (or not present) to a newer version where the
ctime feature is available (and enabled by default), self-heal hangs and
doesn't complete.
Cause:
The ctime feature has both client-side code (utime) and server-side code
(posix). The feature is driven from the client: only if the client side
sets the time in the frame should the server side set the time attributes
in the xattr. But posix setattr/fsetattr was not honouring that. When one
of the server nodes is upgraded, since ctime is enabled by default, it
starts setting the xattr on setattr/fsetattr on the upgraded node/brick.
On an EC volume the first two upgraded nodes (bricks) are not a problem
because there are 4 other bricks with consistent data. However, once the
third brick is upgraded, the new attribute (mdata xattr) causes a metadata
inconsistency across 3 bricks, which prevents the file from being
repaired.
Fix:
Don't create the mdata xattr on utimes/utimensat system calls.
Only update it if it is already present.
Change-Id: Ieacedecb8a738bb437283ef3e0f042fd49dc4c8c
fixes: bz#1720201
Signed-off-by: Kotresh HR <khiremat@redhat.com>
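The fix reduces to a simple rule on the posix side, sketched here with
placeholder names (not the posix xlator's actual functions):

    def posix_setattr_mdata(mdata_xattr_present, new_times):
        if mdata_xattr_present:
            return "update mdata xattr to %s" % (new_times,)  # keep existing metadata fresh
        # Never *create* the xattr from a utimes/utimensat-driven setattr;
        # creation stays with the client-driven ctime path, so old and new
        # bricks remain consistent during a rolling upgrade.
        return "skip"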
Problem:
The test case fails to heal the volume within $HEAL_TIMEOUT @195.
This happens because, as part of split-brain resolution, the file gets
expunged from the sink and the new entry mark for that file is done on the
source bricks as part of impunging. Since the source bricks' shd threads
failed to get the heal-domain lock, they wait for the heal-timeout of 10
minutes, which is greater than $HEAL_TIMEOUT.
Fix:
Set cluster.heal-timeout to 5 seconds to trigger the heal so that one of
the source bricks heals the file within $HEAL_TIMEOUT.
Change-Id: Ie73c578cc5361c0d617a48ccc86026734d20ba8c
fixes: bz#1718998
Signed-off-by: karthik-us <ksubrahm@redhat.com>
Problem:
afr_child_up_status_meta works only when a LOOKUP on $M0 is successful.
There are cases where quorum is not met and the LOOKUP fails on $M0, which
leads to failures similar to:
grep: /mnt/glusterfs/0/.meta/graphs/active/patchy-replicate-0/private: Transport endpoint is not connected
This was happening once in a while, depending on attribute-timeout and
md-cache not serving the lookup.
Fix:
Find the child-up status based on a statedump instead. Also changed the
mount options to include --entry-timeout=0 and --attribute-timeout=0.
updates bz#1193929
Change-Id: Ic0de72c3006d7399a5feb3e4d10d4748949b2ab3
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
Change-Id: Iceefe22af754096c599dc570d4894d14fce4deae
Updates: bz#1193929
Signed-off-by: Xavier Hernandez <xhernandez@redhat.com>
When the split-brain choice is changed from one brick to another brick,
inode-invalidate is not called, so the readv call is served from the
cache, leading to failures in split-brain-resolution.t.
Fixed it by calling inode_invalidate() when this happens.
updates bz#1193929
Change-Id: I2624614eec38c0303f3e1dc55dfae3d4b864218b
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
If a fop to create an entry fails on one of the data bricks, we mark the
pending changelog on the entry on the brick for which it was successful.
This is done as part of the post-op phase, to make sure that the entry
gets healed even if it gets renamed to some other path where its parent
was not marked as bad.
As this happens as part of the post-op, we should consult the thin-arbiter
to check whether the brick that was successful is the good brick or not.
This will avoid split-brain and other issues.
Change-Id: I12686675be98f02f70a5186b3ed748c541514d53
updates: bz#1662264
Signed-off-by: Ashish Pandey <aspandey@redhat.com>
Fixes: bz#1665358
Change-Id: Idbf88ec3ac683733b32c313377eeb72f2819bf0d
Signed-off-by: Amar Tumballi <amarts@redhat.com>
With this changeset, the default value for the AFR client-side heal
volume option is set to "off".
fixes: bz#1663102
Change-Id: Ie4016932339c4896487e3e7cb5caca68739b7ba2
Signed-off-by: Sunil Kumar Acharya <sheggodu@redhat.com>
Change-Id: Iec47856ce2819e7d7d38a60279602e53ba45858d
updates: bz#1624332
Signed-off-by: Ashish Pandey <aspandey@redhat.com>
Test: check the success/failure of the write fop while different
bricks/the TA process are down.
Change-Id: I3c376935df93ebf1f794c964bd19bc1280d91c59
updates: bz#1624332
Signed-off-by: Ashish Pandey <aspandey@redhat.com>
Based on the proposal to remove a few features as they are not actively
maintained [1], remove the tier translator from the build. Also make sure
no regression tests involving the tiering feature remain.
[1] https://lists.gluster.org/pipermail/gluster-users/2018-July/034400.html
Change-Id: I2c177f711f9b54b7b24e1a13525ff3132bd9a9c5
updates: bz#1642807
Signed-off-by: Amar Tumballi <amarts@redhat.com>
2 domain locking + xattrop for write-txn failures:
--------------------------------------------------
- A post-op wound on the TA takes the AFR_TA_DOM_NOTIFY range lock and the
AFR_TA_DOM_MODIFY full lock, does the xattrop on the TA, releases the
AFR_TA_DOM_MODIFY lock and stores in memory which brick is bad.
- All further write-txn failures are handled based on this in-memory value
without querying the TA.
- When shd heals the files, it does so by requesting a full lock on the
AFR_TA_DOM_NOTIFY domain. The client uses this as a cue (via upcall),
releases the AFR_TA_DOM_NOTIFY range lock and invalidates its in-memory
notion of which brick is bad. The next write-txn failure is wound on the
TA to update the in-memory state again.
- Any write txns that were incomplete when the AFR_TA_DOM_NOTIFY upcall
release request arrives are completed before the lock is released.
- Any write txns that arrive after the release request are kept in a
ta_waitq.
- After the release is complete, the ta_waitq elements are spliced to a
separate queue which is then processed one by one.
- For fops that come in parallel when the in-memory bad brick is still
unknown, only one is wound to the TA on the wire. The others are kept in a
ta_onwireq which is processed after we get the response from the TA.
Change-Id: I32c7b61a61776663601ab0040e2f0767eca1fd64
updates: bz#1579788
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
Signed-off-by: Ashish Pandey <aspandey@redhat.com>
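A toy model of the client-side bookkeeping described above. The queue names
ta_waitq and ta_onwireq come from the text; the class, callbacks and method
names are invented for the sketch and are not AFR's actual structures:

    from collections import deque

    class TAWriteFailureState:
        def __init__(self, query_ta):
            self.query_ta = query_ta    # stand-in for the on-wire xattrop on the TA
            self.bad_brick = None       # in-memory notion of which brick is bad
            self.on_wire = False        # at most one TA query in flight
            self.ta_onwireq = deque()   # txns waiting on that in-flight query
            self.ta_waitq = deque()     # txns arriving after an upcall release request
            self.releasing = False

        def write_txn_failed(self, txn):
            if self.releasing:
                self.ta_waitq.append(txn)         # handled after the release completes
            elif self.bad_brick is not None:
                txn.resolve(bad=self.bad_brick)   # no TA round trip needed
            elif self.on_wire:
                self.ta_onwireq.append(txn)       # piggy-back on the in-flight query
            else:
                self.on_wire = True
                self.query_ta(txn)                # xattrop on the TA under the MODIFY lock

        def ta_reply(self, bad_brick):
            self.bad_brick, self.on_wire = bad_brick, False
            while self.ta_onwireq:
                self.ta_onwireq.popleft().resolve(bad=bad_brick)

        def notify_release_done(self):
            # shd took AFR_TA_DOM_NOTIFY: the cached state is stale; queued txns
            # are spliced out and re-driven one by one.
            self.bad_brick, self.releasing = None, False
            pending, self.ta_waitq = self.ta_waitq, deque()
            for txn in pending:
                self.write_txn_failed(txn)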
With this change, when SHD starts the index crawl it requests all the
clients to release the AFR_TA_DOM_NOTIFY lock, so that clients know the
in-memory state is no longer valid and any new operations need to query
the thin-arbiter if required.
When SHD completes healing all the files without any failure, it will
again take the AFR_TA_DOM_NOTIFY lock and get the xattrs on the TA to see
whether any new failures happened in the meantime.
If there are new failures marked on the TA, SHD will start the crawl
immediately to heal those failures as well. If there are no new failures,
SHD will take the AFR_TA_DOM_MODIFY lock and unset the xattrs on the TA,
so that both data bricks are considered good thereafter.
Change-Id: I037b89a0823648f314580ba0716d877bd5ddb1f1
fixes: bz#1579788
Signed-off-by: karthik-us <ksubrahm@redhat.com>
If both data bricks are up, the read subvol will be based on read_subvols.
If only one data brick is up:
- First query the data brick that is up. If it blames the other brick,
allow the reads.
- If it doesn't, query the TA to obtain the source of truth.
TODO: See if in-memory state can be maintained for read txns (BZ 1624358).
updates: bz#1579788
Change-Id: I61eec35592af3a1aaf9f90846d9a358b2e4b2fcc
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
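A minimal sketch of that read-path decision, with invented helper names:

    def pick_read_subvol(brick0, brick1, read_subvols, ta_marks_bad):
        """read_subvols: normal selection; ta_marks_bad(brick): TA blames `brick`."""
        up = [b for b in (brick0, brick1) if b.is_up]
        if len(up) == 2:
            return read_subvols(up)          # both data bricks up: usual selection
        if not up:
            return None                      # no readable subvol
        alive = up[0]
        down = brick1 if alive is brick0 else brick0
        if alive.blames(down):
            return alive                     # the up copy is known to be good
        # The up brick does not blame the down one, so the thin-arbiter is the
        # source of truth: serve reads only if the TA does not blame this brick.
        return None if ta_marks_bad(alive) else alive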
Problem:
When name self-heal is triggered on the mount, it blocks the lookup until
the name self-heal completes. This can lead to hangs when a lot of clients
access a directory which needs a name heal and all of them trigger heals,
each waiting for the other clients to complete the heal.
Fix:
When a name heal is needed but a quorum number of names have the file and
pending xattrs exist on the parent, it is better to delegate the heal to
SHD, where it will be completed as part of the entry heal of the parent
directory. We could also do the same when a quorum number of names are not
present, but we don't have any known use-case where this is a frequent
occurrence, so that part is not changed at the moment. When there is a
gfid mismatch or a missing gfid it is important to complete the heal right
away, so that the next rename doesn't assume everything is fine and
perform a rename etc.
fixes bz#1622821
Change-Id: I8b002c85dffc6eb6f2833e742684a233daefeb2c
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
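The delegation rule can be sketched as below; the predicate names simply
mirror the conditions in the message and are not AFR functions:

    def should_do_inline_name_heal(quorum_names_have_file, parent_has_pending_xattrs,
                                   gfid_mismatch, gfid_missing):
        if gfid_mismatch or gfid_missing:
            return True    # must be fixed now, before a rename builds on bad state
        if quorum_names_have_file and parent_has_pending_xattrs:
            return False   # delegate to SHD: the parent's entry heal will cover it
        return True        # default: heal inline as before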
Problem:
When metadata self-heal is triggered on the mount, it blocks the lookup
until the metadata self-heal completes. This can lead to hangs when a lot
of clients access a directory which needs a metadata heal and all of them
trigger heals, each waiting for the other clients to complete the heal.
Fix:
Trigger the metadata heal that could block the lookup only when the heal
is needed but the pending xattrs are not set. This is the only case where,
without the heal, different clients may serve different metadata, which
should be avoided.
Updates bz#1622821
Change-Id: I6089e9fda0770a83fb287941b229c882711f4e66
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
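The companion rule for metadata heal, again with illustrative predicate
names:

    def should_do_inline_metadata_heal(heal_needed, pending_xattrs_set):
        # If pending xattrs are set, SHD (or a later txn) will heal it, so don't
        # block the lookup. Heal inline only when nothing else records that a
        # heal is needed, since that is the one case where clients could
        # otherwise keep serving differing metadata.
        return heal_needed and not pending_xattrs_set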
modifications
PROBLEM:
Stats of dentries that are readdirp'd ahead can become stale due to fops
like writes, truncates etc. that modify the file pointed to by those
dentries. When a readdir is finally wound at the offset corresponding to
these entries, the iatts returned to the application come from
readdir-ahead's cache, and are stale by then. The problem gets further
aggravated when caching translators/modules cache and continue to serve
this stale information.
FIX:
* Store the iatt in the context of the inode pointed to by the dentry.
* Whenever the inode pointed to by the dentry undergoes modification, in
  the cbk of the modification fop, update the iatt stored in the inode ctx
  to reflect the modification.
* When serving a readdirp response to the application, update the iatts of
  the dentries with the iatts stored in the context of the inodes pointed
  to by those dentries.
* Some fops don't have valid iatts in their responses. For example, a
  write response whose data is still cached in write-behind will have a
  zeroed-out stat. In this case keep only ia_type and ia_gfid and reset
  the rest of the iatt members to zero.
  - fuse-bridge in this case sends only the "entry" information back to
    the kernel and the attr is not sent.
  - gfapi sets entry->inode to NULL and zeroes out the entire stat.
* There is one tiny race between entry creation and a readdirp on its
  parent dir, which could cause the inode-ctx setting and inode-ctx
  reading to happen on two different inode objects. To prevent this, when
  entry->inode is not equal to linked_inode:
  - fuse-bridge is made to send only the "entry" information without
    attributes
  - gfapi sets entry->inode to NULL and zeroes out the entire stat.
Change-Id: Ia27ff49a61922e88c73a1547ad8aacc9968a69df
BUG: 1390050
Updates: bz#1390050
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
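An illustrative sketch of the FIX: remember the latest iatt per inode and
patch readdirp replies from it. Entries and inode contexts are plain dicts
here; readdir-ahead's real code works on gluster's structures:

    def on_modification_cbk(inode_ctx, post_iatt):
        # Writes, truncates, setattr etc. refresh the stat stored in the inode ctx.
        inode_ctx["iatt"] = post_iatt

    def fill_readdirp_reply(entries):
        for entry in entries:
            cached = entry["inode_ctx"].get("iatt")
            if cached:
                entry["stat"] = cached               # serve the refreshed stat
            else:
                # No trustworthy stat (e.g. write-behind unwound a zeroed iatt):
                # keep only ia_type/ia_gfid so upper layers don't cache garbage.
                entry["stat"] = {"ia_type": entry["stat"].get("ia_type"),
                                 "ia_gfid": entry["stat"].get("ia_gfid")}
        return entries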
fixes bz#1615789
Change-Id: I1f42e78fec5ddaf2a425dc4b82c9a20472aa146d
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
This test was retried once on build
https://build.gluster.org/job/regression-on-demand-multiplex/174/
(logs for the first try are not available with this build).
The test case was failing at line #47, where it checks for the heal count
to be 0. Line #51 had passed, which means the file's gfid split-brain got
resolved and both bricks had the same gfid.
At line #54 it failed again; that line checks the md5sum on both bricks.
At this point the md5sum on the brick where the file got impunged was the
same as that of the newly created empty file, which means the data heal
had not happened for the file.
At line #64 enabling granular-entry-heal failed, but without the logs it
is not possible to debug this issue.
Change-Id: I56d854dbb9e188cafedfd24a9d463603ae79bd06
fixes: bz#1615331
Signed-off-by: karthik-us <ksubrahm@redhat.com>
Please see BZ for details.
Change-Id: Id9273432874bc6a452ac96b2b8c7a61ea6c5b98d
Fixes: bz#1615239
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
Please see bug description for details.
Change-Id: Ieb6bce6d1d5c4c31f1878dd1a1c3d007d8ff81d5
fixes: bz#1614654
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
SHD keeps crawling in a loop as long as it healed at least one entry in
the previous run. A heal is counted as successful only if it completes
both the metadata heal and the entry/data heal, i.e. the entry needs to be
completely healed by that healer alone.
In the tests/basic/afr/granular-esh/replace-brick.t test, brick-0 is old
and brick-1 is new. After replace-brick only the root gfid will be present
in brick-0's index.
1) The shd thread corresponding to brick-0 does the metadata heal, which
creates the root gfid in brick-0's 'dirty' index.
2) Both healer threads corresponding to brick-0 and brick-1 now try to
heal the root gfid and brick-1 gets the heal-domain lock. brick-0's shd
thread experiences a failure and goes back to waiting for 10 minutes
(cluster.heal-timeout).
3) When brick-1's healer thread completes healing the root gfid it creates
5 files, which create indices on brick-0; so until brick-0 triggers one
more crawl, those won't be healed. $HEAL_TIMEOUT is set to 120 seconds,
which is less than cluster.heal-timeout, so decrease cluster.heal-timeout
to 5 seconds so that the next crawl is triggered and does the heals.
fixes bz#1613807
Change-Id: I881133fc28880d8615fbc4558a0dfa0dc63d7798
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
modified while in cache"
This reverts commit 7131de81f72dda0ef685ed60d0887c6e14289b8c.
With the latest master, I created a single brick volume and some files
inside it.
[root@rhgs313-6 ~]# umount -f /mnt/fuse1; mount -t glusterfs -s
192.168.122.6:/thunder /mnt/fuse1; ls -l /mnt/fuse1/; echo "Trying
again"; ls -l /mnt/fuse1
umount: /mnt/fuse1: not mounted
total 0
----------. 0 root root 0 Jan 1 1970 file-1
----------. 0 root root 0 Jan 1 1970 file-2
----------. 0 root root 0 Jan 1 1970 file-3
----------. 0 root root 0 Jan 1 1970 file-4
----------. 0 root root 0 Jan 1 1970 file-5
d---------. 0 root root 0 Jan 1 1970 subdir
Trying again
total 3
-rw-r--r--. 1 root root 33 Aug 3 14:06 file-1
-rw-r--r--. 1 root root 33 Aug 3 14:06 file-2
-rw-r--r--. 1 root root 33 Aug 3 14:06 file-3
-rw-r--r--. 1 root root 33 Aug 3 14:06 file-4
-rw-r--r--. 1 root root 33 Aug 3 14:06 file-5
d---------. 0 root root 0 Jan 1 1970 subdir
[root@rhgs313-6 ~]#
The conversation can be followed on gluster-devel in the thread with
subject "tests/bugs/distribute/bug-1122443.t - spurious failure".
git bisect pointed to this patch as the culprit.
Change-Id: I1eb46f6c196f44fde8ce991840a0e724e6f50862
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Updates: bz#1390050
invalidation
Invalidations are triggered mainly by two codepaths - upcall and
write-behind unwinding a cached write with a zeroed-out stat. For the
upcall case, the following race can happen:
* stat s1 is fetched from the brick
* an invalidation is detected on the brick
* the invalidation is propagated to md-cache and the cache is invalidated
* s1 updates md-cache with a stale state
For the write-behind case, imagine the following sequence of operations:
* A stat s1 was issued from application thread t1 when the size of the
  file was s1
* stat s1 completes on the brick stack, but is yet to reach md-cache
* A write w1 from application thread t2 extending the file to size s2 is
  cached in write-behind and the response is unwound with a zeroed-out stat
* md-cache, while handling the write cbk, invalidates its cache
* md-cache receives the response for s1 and updates the cache with the
  stale stat with size s1, overwriting the invalidation state
The fix is to remember when s1 was incident on md-cache and update the
cache with the results of s1 only if it was incident after the
invalidation of the cache.
This patch identified some bugs in regression tests, which are tracked in
https://bugzilla.redhat.com/show_bug.cgi?id=1608158. As a stop-gap measure
I am marking the following tests as bad:
basic/afr/split-brain-resolution.t
bugs/bug-1368312.t
bugs/replicate/bug-1238398-split-brain-resolution.t
bugs/replicate/bug-1417522-block-split-brain-resolution.t
bugs/replicate/bug-1438255-do-not-mark-self-accusing-xattrs.t
Change-Id: Ia4bb9dd36494944e2d91e9e71a79b5a3974a8c77
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Updates: bz#1512691
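The "remember when s1 was incident" fix amounts to comparing generations,
roughly as in this sketch (field and method names are invented, not
md-cache's):

    import itertools

    class MdCacheEntry:
        _gen = itertools.count(1)        # global, monotonically increasing counter

        def __init__(self):
            self.stat = None
            self.invalidated_at = 0      # generation of the last invalidation

        def snapshot_generation(self):
            """Call when a stat/lookup is wound; keep the value with the call."""
            return next(self._gen)

        def invalidate(self):
            self.invalidated_at = next(self._gen)
            self.stat = None

        def update_from_reply(self, stat, issued_at):
            if issued_at > self.invalidated_at:
                self.stat = stat         # the reply post-dates the invalidation
            # else: the reply is older than the invalidation; drop it so a stale
            # stat can no longer overwrite the invalidated state.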
libargp or argp-standalone is available on all commonly used
distributions. There is no need to bundle an unmaintained version of
argp-standalone in this repository anymore.
FreeBSD places the argp.h file in /usr/local/include when argp-standalone
is installed. This path is not added to CPPFLAGS by default, so that's
done in configure.ac as well.
Change-Id: I384a53ab0a008ec9d48fd83afeaf8fbc197e91ee
Fixes: bz#1609337
Signed-off-by: Niels de Vos <ndevos@redhat.com>
while in cache
PROBLEM:
Entries that are readdirp'd ahead can undergo modification, in terms of
writes and truncates, which modify their iatts. When a readdir is finally
wound at the offset corresponding to these entries, the iatts that are
returned to the application come from readdir-ahead's cache, and are stale
by then. The problem gets further aggravated when caching
translators/modules cache and continue to serve this stale information.
FIX:
Whenever a dentry undergoes modification, in the cbk of the modification
fop, a "dirty" flag (default 0) is set in its inode ctx. When it's time
for readdir-ahead to serve these entries, it reads the inode ctx and
checks whether the entry is "dirty"; if it is, it sets the entry's attrs
to all zeroes, as an indicator to fuse, md-cache etc. not to cache these
attributes.
There is also one tiny race between entry creation and a readdirp on its
parent dir, which could cause the inode-ctx setting and inode-ctx reading
to happen on two different inode objects. To prevent this, fuse-bridge is
made to drop entries for which dentry->inode is not the same as the linked
inode, in the readdirp cbk.
Change-Id: If7396507632b5268442ca580473d5155fee9cbef
BUG: 1390050
Updates: bz#1390050
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
fixes bz#1597156
Change-Id: I323eb9190e40b12df216698dcdba74a6d336beeb
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
fixes: bz#1596524
updates: gluster/glusterd2#515
Change-Id: I8a46fa2fd1fd2b0e9fbcecd3bb18d348aed9c6a9
Signed-off-by: Amar Tumballi <amarts@redhat.com>
BZ 1337791 marked this .t as bad based on the tar version being a likely
suspect. Undoing this to check as it passes on the latest jenkins slaves.
Change-Id: Ia581064a9c620351d3fe7aeef95d2644337952e1
fixes: bz#1595492
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
Reported-by: Yaniv Kaul <ykaul@redhat.com>
Updates: #363
This new value (3) will try to wind read requests to the AFR child with
the least number of pending requests in its queue.
Change-Id: If6bda2aac9bf7aec3fc39622f78659313c4b6508
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
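The new mode is essentially a min-queue-depth selection, as in this small
sketch (child objects and fields are illustrative):

    def pick_read_child(children):
        readable = [c for c in children if c.readable]
        if not readable:
            return None
        # Value 3 of the option described above: least pending requests wins.
        return min(readable, key=lambda c: c.pending_requests)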
BUG: 1557932
Change-Id: I3783e41b3812267bc10c0d05d062a31396ce135b
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
We are not seeing much improvement with this change. So removing the
feature so that it doesn't need to be maintained anymore.
Fixes: #414
Change-Id: Ic7969b151544daf2547bd262a9fa03f575626411
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
This uses the 'timeout' command with a 300-second default. Right now,
there is just one test which takes more than that on a properly set up
machine.
Ideally the default would be something like 30 seconds, and if a test is
supposed to take more than that, the owner should knowingly add a timeout
line to the test. That way, it also makes test writers think about a time
limit.
Change-Id: I747005ce1f208aeb2ecbf899e8feea487ecd21a0
Signed-off-by: Amar Tumballi <amarts@redhat.com>
Problem:
We currently don't have a roll-back/undoing of post-ops if quorum is not
met. Though the FOP is still unwound with a failure, the xattrs remain on
disk. Due to these partial post-ops and partial heals (healing only when
2 bricks are up), we can end up in split-brain purely from the AFR xattrs'
point of view, i.e. each brick is blamed by at least one of the others.
These scenarios are hit when there is frequent connect/disconnect of the
client/shd to the bricks while I/O or heal is in progress.
Fix:
Instead of undoing the post-op, pick a source based on the xattr values.
If 2 bricks blame one, the blamed one must be treated as the sink.
If there is no majority, all are sources. Once we pick a source,
self-heal will then do the heal instead of erroring out due to
split-brain.
Change-Id: I3d0224b883eb0945785ade0e9697a1c828aec0ae
BUG: 1539358
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
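A sketch of the source/sink selection in the fix for an n-brick replica;
blames[i][j] meaning "brick i's afr xattrs blame brick j" is an invented
representation, not AFR's actual data layout:

    def pick_sources(blames):
        n = len(blames)
        blame_count = [sum(1 for i in range(n) if i != j and blames[i][j])
                       for j in range(n)]
        majority = n // 2 + 1
        sinks = [j for j, c in enumerate(blame_count) if c >= majority]
        if sinks:
            # e.g. two bricks blame the third: that brick becomes the sink.
            return [j for j in range(n) if j not in sinks]
        # No brick is blamed by a majority: treat all as sources and let
        # self-heal reconcile instead of erroring out with split-brain.
        return list(range(n))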
This test was failing due to an infra issue. The infra issue is now
fixed.
BUG: 1517961
Change-Id: I09dfab9c0a3ebe73c738222e6269d9e35c85eddb
Signed-off-by: Nigel Babu <nigelb@redhat.com>
Change-Id: If6c36dc6c395730dfb17b5b4df6f24629d904926
BUG: 1517961
Signed-off-by: Amar Tumballi <amarts@redhat.com>