path: root/tests
Commit message | Author | Age | Files | Lines
* tests: avoid empty paths in environment variables (#2349) | Xavi Hernandez | 2021-04-23 | 1 | -10/+7

  Many variables containing paths in env.rc.in are defined in a way that leaves a trailing ':' in the variable when the previous value was empty or undefined. In the particular case of the 'LD_PRELOAD_PATH' variable, this causes the system to look for dynamic libraries in the current working directory. When this directory is inside a Gluster mount point, a significant delay is added each time a program is run (and the testing framework can run lots of programs for each test). This patch prevents variables containing paths from ending with a trailing ':'.

  Fixes: #2348
  Change-Id: I669f5a78e14f176c0a58824ba577330989d84769
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>

  * Fix PYTHONPATH name and duplication
  Change-Id: Iaa0b092118bb86856bbe621eb03fef6fa7478971
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
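The guard this fix applies in shell can be illustrated with a minimal Python sketch (the helper name is hypothetical, not the env.rc.in code): join only the non-empty components, so an empty previous value never leaves a stray ':' that would make the loader search the current working directory.

```python
def join_search_path(*parts):
    """Build a ':'-separated search path, skipping empty or None
    components so no leading/trailing ':' is ever produced."""
    return ":".join(p for p in parts if p)

# An empty previous value no longer leaves a trailing ':'.
assert join_search_path("/usr/lib", "") == "/usr/lib"
assert join_search_path("", "/usr/lib", "/opt/lib") == "/usr/lib:/opt/lib"
```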
* protocol/client: Fix lock memory leak (#2338) | Pranith Kumar Karampuri | 2021-04-22 | 4 | -22/+145

  Problem 1: when an overlapping lock is issued, the merged lock is not assigned an owner. When flush is issued on the fd, this particular lock is never freed, leading to a memory leak.
  Fix 1: assign the owner while merging the locks.

  Problem 2: on fd-destroy, lock structs could still be present in fdctx. For some reason, `flock -x` followed by closing the bash fd reaches this code path, which leaks the lock structs.
  Fix 2: when fdctx is being destroyed in the client, make sure to clean up any remaining lock structs.

  fixes: #2337
  Change-Id: I298124213ce5a1cf2b1f1756d5e8a9745d9c0a1c
  Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* dht: fix rebalance of sparse files (#2318) | Xavi Hernandez | 2021-04-09 | 2 | -0/+33

  The current implementation of rebalance for sparse files has a bug that, in some cases, causes a read of 0 bytes from the source subvolume. The posix xlator doesn't allow 0-byte reads and fails them with EINVAL, which causes rebalance to abort the migration.

  This patch implements a more robust way of finding data segments in a sparse file that avoids 0-byte reads, allowing the file to be migrated successfully.

  Fixes: #2317
  Change-Id: Iff168dda2fb0f2edf716b21eb04cc2cc8ac3915c
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
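The idea of locating data segments without ever issuing a 0-byte read can be sketched with a simple chunked scan (illustrative Python, not the actual rebalance code, which works at the xlator level):

```python
import os
import tempfile

def data_segments(path, block=4096):
    """Return (offset, length) pairs of non-zero regions, reading the
    file in fixed-size chunks so a zero-length read is never issued."""
    segs = []
    off = 0
    start = None
    with open(path, "rb") as f:
        while True:
            buf = f.read(block)
            if not buf:                 # EOF: never a 0-byte read request
                break
            if buf.strip(b"\0"):        # chunk contains some data
                if start is None:
                    start = off
            elif start is not None:     # all-zero chunk ends a segment
                segs.append((start, off - start))
                start = None
            off += len(buf)
    if start is not None:
        segs.append((start, off - start))
    return segs

# demo: 4 KiB of data, a 4 KiB hole, then 4 KiB of data
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"x" * 4096)
tmp.seek(4096, os.SEEK_CUR)             # leave a hole
tmp.write(b"y" * 4096)
tmp.close()

segs = data_segments(tmp.name)
assert segs == [(0, 4096), (8192, 4096)]
```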
* tests: Fix incorrect variables in throttle-rebal.t (#2316) | Jamie Nguyen | 2021-04-08 | 1 | -3/+3

  The final test doesn't test what it is meant to test. It still fails as expected, but only because at this point `THROTTLE_LEVEL` is still set to `garbage`. Easily fixed by correcting the typos in the variable names, which fixes https://github.com/gluster/glusterfs/issues/2315

  Signed-off-by: Jamie Nguyen <j@jamielinux.com>
* Removal of force option in snapshot create (#2110) | nishith-vihar | 2021-04-06 | 1 | -1/+1

  The snapshot create command can fail with the force option even though quorum is satisfied, and the option is redundant. This change deprecates the force option for the snapshot create command and checks that all bricks are online, instead of checking for quorum, when creating a snapshot.

  Fixes: #2099
  Change-Id: I45d866e67052fef982a60aebe8dec069e78015bd
  Signed-off-by: Nishith Vihar Sakinala <nsakinal@redhat.com>
* afr: don't reopen fds on which POSIX locks are held (#1980) | Karthik Subrahmanya | 2021-03-27 | 1 | -0/+206

  When client.strict-locks is enabled on a volume and POSIX locks are held on its files, do not re-open such fds after a client disconnects and reconnects. Re-opening them might lead to multiple clients acquiring the locks and cause data corruption.

  Change-Id: I8777ffbc2cc8d15ab57b58b72b56eb67521787c5
  Fixes: #1977
  Signed-off-by: karthik-us <ksubrahm@redhat.com>
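Why silently re-acquiring locks is dangerous can be demonstrated locally. The sketch below uses flock(2) rather than fcntl record locks (which, within a single process, would not conflict with each other), so two descriptors stand in for two clients; it illustrates the mutual-exclusion property at stake, not Gluster's implementation:

```python
import fcntl
import tempfile

# Two independent opens of the same file stand in for two clients.
backing = tempfile.NamedTemporaryFile()
c1 = open(backing.name, "rb")
c2 = open(backing.name, "rb")

fcntl.flock(c1, fcntl.LOCK_EX)      # "client 1" holds the exclusive lock
try:
    # "client 2" must NOT be able to grab the same lock.
    fcntl.flock(c2, fcntl.LOCK_EX | fcntl.LOCK_NB)
    conflict = False
except BlockingIOError:
    conflict = True

assert conflict                      # exclusion held, as it must

fcntl.flock(c1, fcntl.LOCK_UN)
c1.close()
c2.close()
```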
* cluster/dht: use readdir for fix-layout in rebalance (#2243) | Pranith Kumar Karampuri | 2021-03-22 | 2 | -0/+8

  Problem: on a cluster with 15 million files, fix-layout made no progress after being started. We tried an os.walk() + os.stat() on the backend filesystem directly: it took 2.5 days. We removed the os.stat() and re-ran it on another brick with a similar data set: it took 15 minutes. We realized that readdirp is extremely costly compared to readdir when the stat is not useful. The fix-layout operation only needs to know that an entry is a directory so that fix-layout can be triggered on it. Most modern filesystems provide this information in the readdir operation; we don't need readdirp, i.e. readdir+stat.

  Fix: use the readdir operation in fix-layout. Fall back to readdir+stat/lookup for filesystems that don't provide d_type in readdir.

  fixes: #2241
  Change-Id: I5fe2ecea25a399ad58e31a2e322caf69fc7f49eb
  Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
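The same readdir-vs-readdirp trade-off is visible in Python: `os.scandir()` exposes d_type-style information from the directory entry itself, and `DirEntry.is_dir()` only falls back to a stat when the filesystem reports DT_UNKNOWN, mirroring the commit's fallback. A small sketch:

```python
import os
import tempfile

# scandir gives entry-type info without a per-entry stat on most
# filesystems, analogous to readdir (with d_type) vs readdirp.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "subdir"))
open(os.path.join(root, "regular-file"), "w").close()

# is_dir() uses d_type when available; only DT_UNKNOWN triggers a stat,
# the same readdir+stat fallback the fix describes.
dirs = [e.name for e in os.scandir(root) if e.is_dir(follow_symlinks=False)]
assert dirs == ["subdir"]
```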
* cluster/dht: Provide option to disable fsync in data migration (#2259) | Pranith Kumar Karampuri | 2021-03-17 | 7 | -7/+48

  At the moment dht rebalance doesn't give any option to disable fsync after data migration. Making this an option lets admins take responsibility for their data in a way that is suitable for their cluster. The default value is still 'on', so the behavior is unchanged for people who don't care about this. For example: if the data to be migrated is already backed up or snapshotted, there is no need for an fsync right after migration, which can affect active I/O on the volume from applications.

  fixes: #2258
  Change-Id: I7a50b8d3a2f270d79920ef306ceb6ba6451150c4
  Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* features/index: Optimize link-count fetching code path (#1789) | Pranith Kumar Karampuri | 2021-03-10 | 16 | -93/+151

  Problem: AFR requests 'link-count' in lookup to check whether there are any pending heals. Based on this information, afr sets dirent->inode to NULL in readdirp while heals are ongoing, to prevent serving bad data. After heals complete, the link-count xattr leads to an opendir of the xattrop directory and a read of its contents just to figure out, for every lookup, that no healing is needed. This went undetected until this GitHub issue because ZFS can in some cases have very slow readdir() calls; since Glusterfs does a lot of lookups, this slowed down all operations and increased load on the system.

  Code problem: on any xattrop operation, the index xlator adds an index to the relevant dirs and, after the xattrop completes, deletes or keeps the index in that directory based on the value fetched from posix in the xattrop. AFR sends an all-zero xattrop for changelog xattrs. This manipulates priv->pending_count and sets the count back to -1; the next lookup then triggers opendir/readdir to find the actual link-count, because the in-memory priv->pending_count is negative.

  Fix:
  1) Don't add to the index on an all-zero xattrop for a key.
  2) Set pending-count to -1 when the first gfid is added into the xattrop directory, so that the next lookup can compute the link-count.

  fixes: #1764
  Change-Id: I8a02c7e811a72c46d78ddb2d9d4fdc2222a444e9
  Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>

  Follow-up commits squashed into this PR: 'addressed comments' (Change-Id: Ide42bb1c1237b525d168bf1a9b82eb1bdc3bc283), 'tests: Handle base index absence' (Change-Id: I3cf11a8644ccf23e01537228766f864b63c49556), 'Addressed LOCK based comments, .t comments' (Change-Id: I5f53e40820cade3a44259c1ac1a7f3c5f2f0f310), all Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
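Fix 1 above can be modeled with a toy pending-index (illustrative Python only; the real index xlator tracks gfids as entries in an on-disk xattrop directory):

```python
pending_index = set()   # stands in for the xattrop directory contents

def on_xattrop(gfid, changelog_deltas):
    """Toy model of the fix: an xattrop whose changelog deltas are all
    zero is a no-op and must not add the gfid to the pending index,
    so pending-count bookkeeping is never disturbed by it."""
    if not any(changelog_deltas):
        return                      # Fix 1: ignore all-zero xattrops
    pending_index.add(gfid)

on_xattrop("gfid-1", [0, 0, 0])     # AFR's all-zero changelog xattrop
assert not pending_index            # nothing queued, no spurious readdir
on_xattrop("gfid-2", [1, 0, 0])     # a real pending change
assert pending_index == {"gfid-2"}
```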
* afr: fix directory entry count (#2233) | Xavi Hernandez | 2021-03-09 | 2 | -0/+119

  AFR may hide some existing entries from a directory when reading it because they are generated internally for private management. However, the number of entries returned by the readdir() function was not updated accordingly, so it could be higher than the real number of entries present in the gf_dirent list. This may cause unexpected behavior in clients, including gfapi, which incorrectly assumed that there was an entry when the list was actually empty.

  This patch also makes the check in gfapi more robust to avoid similar issues that could appear in the future.

  Fixes: #2232
  Change-Id: I81ba3699248a53ebb0ee4e6e6231a4301436f763
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
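The invariant being restored, that the reported count must match the visible list after internal entries are hidden, can be sketched as (the internal names below are hypothetical, not AFR's actual private entries):

```python
INTERNAL_ENTRIES = {".private-heal-marker"}   # hypothetical internal names

def readdir_reply(raw_entries):
    """Hide internal entries AND return the corrected count, so the
    count always equals the length of the visible entry list."""
    visible = [e for e in raw_entries if e not in INTERNAL_ENTRIES]
    return visible, len(visible)

entries, count = readdir_reply([".private-heal-marker", "a.txt"])
assert entries == ["a.txt"]
assert count == len(entries) == 1    # count no longer overshoots

# The gfapi-side hardening: an "empty" reply really is empty.
entries, count = readdir_reply([".private-heal-marker"])
assert count == 0 and entries == []
```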
* cli: syntax check for arbiter volume creation (#2207) | Ravishankar N | 2021-03-05 | 1 | -0/+1

  Commit 8e7bfd6a58b444b26cb50fb98870e77302f3b9eb changed the syntax for arbiter volume creation to 'replica 2 arbiter 1', while still allowing the old syntax of 'replica 3 arbiter 1'. But while doing so, it also removed a conditional check, thereby allowing replica count > 3. This patch fixes it.

  Fixes: #2192
  Change-Id: Ie109325adb6d78e287e658fd5f59c26ad002e2d3
  Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/dht: Fix stack overflow in readdir(p) (#2170) | Xavi Hernandez | 2021-02-24 | 1 | -0/+33

  When parallel-readdir is enabled, readdir(p) requests sent by DHT can be immediately processed and answered in the same thread, before the call to STACK_WIND_COOKIE() completes. This means that the readdir(p) cbk is processed synchronously. In some cases it may decide to send another readdir(p) request, which causes a recursive call. When some special conditions happen and the directories are big, the number of nested calls can be so high that the process crashes with a stack overflow.

  This patch fixes this by not allowing nested readdir(p) calls. When a nested call is detected, it's queued instead of being sent. The queued request is processed when the current call finishes, by the top-level stack function.

  Fixes: #2169
  Change-Id: Id763a8a51fb3c3314588ec7c162f649babf33099
  Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
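The shape of the fix, replacing unbounded recursion with a queue drained by the top-level caller, can be sketched generically (illustrative Python, not DHT code; each "reply" here pretends to request one smaller follow-up):

```python
def drain(initial_requests):
    """Process callback-style requests iteratively: a callback that
    wants a follow-up request queues it instead of recursing, so the
    stack depth stays constant no matter how many follow-ups occur."""
    queue = list(initial_requests)
    served = []
    while queue:                    # the top-level loop does all work
        req = queue.pop(0)
        served.append(req)
        follow_up = req - 1         # toy stand-in for "cbk sends another"
        if follow_up > 0:
            queue.append(follow_up) # queued, NOT a recursive call
    return served

# 10,000 chained follow-ups would overflow the stack if each one were a
# nested call; as a queue they are handled at constant depth.
assert drain([3]) == [3, 2, 1]
assert len(drain([10000])) == 10000
```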
* tests: Move tests/basic/glusterd-restart-shd-mux.t to flaky (#2191) | mohit84 | 2021-02-24 | 1 | -0/+0

  The test case (tests/basic/glusterd-restart-shd-mux.t) was introduced as part of the shd mux feature, but we have observed that the feature is not stable and we already plan to revert it. For the time being, move the test case to flaky to avoid frequent regression failures.

  Fixes: #2190
  Change-Id: I4a06a5d9212fb952a864d0f26db8323690978bfc
  Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* Remove tests from components that are no longer in the tree (#2160) | Pranith Kumar Karampuri | 2021-02-13 | 8 | -288/+0

  fixes: #2159
  Change-Id: Ibaaebc48b803ca6ad4335c11818c0c71a13e9f07
  Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* tests: Handle nanosecond duration in profile info (#2135) | Pranith Kumar Karampuri | 2021-02-08 | 4 | -5/+5

  Problem: volume profile info now prints durations in nanoseconds. The tests were written when the duration was printed in microseconds, which leads to spurious failures.

  Fix: change the tests to handle nanosecond durations.

  fixes: #2134
  Change-Id: I94722be87000a485d98c8b0f6d8b7e1a526b07e7
  Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
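A test helper coping with such a unit change might normalize both forms before comparing (a hypothetical sketch; the actual .t tests are shell and parse the CLI output differently):

```python
def duration_in_us(text):
    """Parse a profile duration like '1500 ns' or '1.5 us' and
    normalize it to microseconds, so tests written against the old
    microsecond output keep working with nanosecond output."""
    value, unit = text.split()
    scale = {"us": 1.0, "ns": 1e-3}[unit]   # ns -> us
    return float(value) * scale

assert duration_in_us("1500 ns") == 1.5
assert duration_in_us("1.5 us") == 1.5      # old and new formats agree
```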
* glusterd-volgen: Add functionality to accept any custom xlator (#1974) | Ryo Furuhashi | 2021-02-05 | 2 | -0/+69

  Add a new function that allows users to insert custom xlators. This provides a way to add arbitrary processing to file operations: users can deploy a plugin (an xlator shared object) and integrate it into glusterfsd.

  To enable a custom xlator:
  1. put the xlator object (.so file) into "XLATOR_DIR/user/"
  2. set the option user.xlator.<xlator> to an existing xlator name, to specify its position in the graph
  3. restart the gluster volume

  Options for a custom xlator can be set as "user.xlator.<xlator>.<optkey>".

  Fixes: #1943
  Signed-off-by: Ryo Furuhashi <ryo.furuhashi.nh@hitachi.com>
  Co-authored-by: Yaniv Kaul <ykaul@redhat.com>
  Co-authored-by: Xavi Hernandez <xhernandez@users.noreply.github.com>
* Fix comments format, remove 'is a folder' warning | Michael Scherer | 2021-02-04 | 2 | -2/+2

  // is for C and C++; shell uses #. Vim's syntax coloring is misleading. This was displayed in each Jenkins log:

  ./tests/00-geo-rep/../include.rc: line 1: //: is a folder

  Likely no impact besides a wrong warning.

  Fix #2093
* cluster/dht: Allow fix-layout only on directories (#2109) | Pranith Kumar Karampuri | 2021-02-03 | 1 | -0/+33

  Problem: the fix-layout operation assumes that the path passed is a directory, i.e. layout->cnt == conf->subvolume_cnt. This leads to a crash when fix-layout is attempted on a file.

  Fix: disallow fix-layout on files.

  fixes: #2107
  Change-Id: I2116b8773059f67e3260e9207e20eab3de711417
  Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
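The guard amounts to validating the assumption up front instead of crashing on it later; a minimal sketch (illustrative Python, not the DHT C code):

```python
def fix_layout(entry):
    """Sketch of the fix: only a directory has a full layout
    (layout->cnt == subvolume_cnt in the real code), so reject
    anything else before touching the layout at all."""
    if entry.get("type") != "directory":
        return "ENOTSUP"        # refuse files instead of crashing
    return "OK"                 # safe to rebalance the layout here

assert fix_layout({"type": "file"}) == "ENOTSUP"
assert fix_layout({"type": "directory"}) == "OK"
```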
* features/shard: delay unlink of a file that has fd_count > 0 (#1563)Vinayak hariharmath2021-02-032-0/+105
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * features/shard: delay unlink of a file that has fd_count > 0 When there are multiple processes working on a file and if any process unlinks that file then unlink operation shouldn't harm other processes working on it. This is a posix a compliant behavior and this should be supported when shard feature is enabled also. Problem description: Let's consider 2 clients C1 and C2 working on a file F1 with 5 shards on gluster mount and gluster server has 4 bricks B1, B2, B3, B4. Assume that base file/shard is present on B1, 1st, 2nd shards on B2, 3rd and 4th shards on B3 and 5th shard falls on B4 C1 has opened the F1 in append mode and is writing to it. The write FOP goes to 5th shard in this case. So the inode->fd_count = 1 on B1(base file) and B4 (5th shard). C2 at the same time issued unlink to F1. On the server, the base file has fd_count = 1 (since C1 has opened the file), the base file is renamed under .glusterfs/unlink and returned to C2. Then unlink will be sent to shards on all bricks and shards on B2 and B3 will be deleted which have no open reference yet. 
C1 starts getting errors while accessing the remaining shards though it has open references for the file. This is one such undefined behavior. Likewise we will encounter many such undefined behaviors as we dont have one global lock to access all shards as one. Of Course having such global lock will lead to performance hit as it reduces window for parallel access of shards. Solution: The above undefined behavior can be addressed by delaying the unlink of a file when there are open references on it. File unlink happens in 2 steps. step 1: client creates marker file under .shard/remove_me and sends unlink on base file to the server step 2: on return from the server, the associated shards will be cleaned up and finally marker file will be removed. In step 2, the back ground deletion process does nameless lookup using marker file name (marker file is named after the gfid of the base file) in glusterfs/unlink dir. If the nameless look up is successful then that means the gfid still has open fds and deletion of shards has to be delayed. If nameless lookup fails then that indicates the gfid is unlinked and no open fds on that file (the gfid path is unlinked during final close on the file). The shards on which deletion is delayed are unlinked one the all open fds are closed and this is done through a thread which wakes up every 10 mins. Also removed active_fd_count from inode structure and referring fd_count wherever active_fd_count was used. fixes: #1358 Change-Id: I8985093386e26215e0b0dce294c534a66f6ca11c Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com> * features/shard: delay unlink of a file that has fd_count > 0 When there are multiple processes working on a file and if any process unlinks that file then unlink operation shouldn't harm other processes working on it. This is a posix a compliant behavior and this should be supported when shard feature is enabled also. 
Problem description: Let's consider 2 clients C1 and C2 working on a file F1 with 5 shards on gluster mount and gluster server has 4 bricks B1, B2, B3, B4. Assume that base file/shard is present on B1, 1st, 2nd shards on B2, 3rd and 4th shards on B3 and 5th shard falls on B4 C1 has opened the F1 in append mode and is writing to it. The write FOP goes to 5th shard in this case. So the inode->fd_count = 1 on B1(base file) and B4 (5th shard). C2 at the same time issued unlink to F1. On the server, the base file has fd_count = 1 (since C1 has opened the file), the base file is renamed under .glusterfs/unlink and returned to C2. Then unlink will be sent to shards on all bricks and shards on B2 and B3 will be deleted which have no open reference yet. C1 starts getting errors while accessing the remaining shards though it has open references for the file. This is one such undefined behavior. Likewise we will encounter many such undefined behaviors as we dont have one global lock to access all shards as one. Of Course having such global lock will lead to performance hit as it reduces window for parallel access of shards. Solution: The above undefined behavior can be addressed by delaying the unlink of a file when there are open references on it. File unlink happens in 2 steps. step 1: client creates marker file under .shard/remove_me and sends unlink on base file to the server step 2: on return from the server, the associated shards will be cleaned up and finally marker file will be removed. In step 2, the back ground deletion process does nameless lookup using marker file name (marker file is named after the gfid of the base file) in glusterfs/unlink dir. If the nameless look up is successful then that means the gfid still has open fds and deletion of shards has to be delayed. If nameless lookup fails then that indicates the gfid is unlinked and no open fds on that file (the gfid path is unlinked during final close on the file). 
The shards on which deletion is delayed are unlinked one the all open fds are closed and this is done through a thread which wakes up every 10 mins. Also removed active_fd_count from inode structure and referring fd_count wherever active_fd_count was used. fixes: #1358 Change-Id: Iec16d7ff5e05f29255491a43fbb6270c72868999 Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com> * features/shard: delay unlink of a file that has fd_count > 0 When there are multiple processes working on a file and if any process unlinks that file then unlink operation shouldn't harm other processes working on it. This is a posix a compliant behavior and this should be supported when shard feature is enabled also. Problem description: Let's consider 2 clients C1 and C2 working on a file F1 with 5 shards on gluster mount and gluster server has 4 bricks B1, B2, B3, B4. Assume that base file/shard is present on B1, 1st, 2nd shards on B2, 3rd and 4th shards on B3 and 5th shard falls on B4 C1 has opened the F1 in append mode and is writing to it. The write FOP goes to 5th shard in this case. So the inode->fd_count = 1 on B1(base file) and B4 (5th shard). C2 at the same time issued unlink to F1. On the server, the base file has fd_count = 1 (since C1 has opened the file), the base file is renamed under .glusterfs/unlink and returned to C2. Then unlink will be sent to shards on all bricks and shards on B2 and B3 will be deleted which have no open reference yet. C1 starts getting errors while accessing the remaining shards though it has open references for the file. This is one such undefined behavior. Likewise we will encounter many such undefined behaviors as we dont have one global lock to access all shards as one. Of Course having such global lock will lead to performance hit as it reduces window for parallel access of shards. Solution: The above undefined behavior can be addressed by delaying the unlink of a file when there are open references on it. File unlink happens in 2 steps. 
step 1: client creates marker file under .shard/remove_me and sends unlink on base file to the server step 2: on return from the server, the associated shards will be cleaned up and finally marker file will be removed. In step 2, the back ground deletion process does nameless lookup using marker file name (marker file is named after the gfid of the base file) in glusterfs/unlink dir. If the nameless look up is successful then that means the gfid still has open fds and deletion of shards has to be delayed. If nameless lookup fails then that indicates the gfid is unlinked and no open fds on that file (the gfid path is unlinked during final close on the file). The shards on which deletion is delayed are unlinked one the all open fds are closed and this is done through a thread which wakes up every 10 mins. Also removed active_fd_count from inode structure and referring fd_count wherever active_fd_count was used. fixes: #1358 Change-Id: I07e5a5bf9d33c24b63da72d4f3f59392c5421652 Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com> * features/shard: delay unlink of a file that has fd_count > 0 When there are multiple processes working on a file and if any process unlinks that file then unlink operation shouldn't harm other processes working on it. This is a posix a compliant behavior and this should be supported when shard feature is enabled also. Problem description: Let's consider 2 clients C1 and C2 working on a file F1 with 5 shards on gluster mount and gluster server has 4 bricks B1, B2, B3, B4. Assume that base file/shard is present on B1, 1st, 2nd shards on B2, 3rd and 4th shards on B3 and 5th shard falls on B4 C1 has opened the F1 in append mode and is writing to it. The write FOP goes to 5th shard in this case. So the inode->fd_count = 1 on B1(base file) and B4 (5th shard). C2 at the same time issued unlink to F1. 
On the server, the base file has fd_count = 1 (since C1 has opened the file), the base file is renamed under .glusterfs/unlink and returned to C2. Then unlink will be sent to shards on all bricks and shards on B2 and B3 will be deleted which have no open reference yet. C1 starts getting errors while accessing the remaining shards though it has open references for the file. This is one such undefined behavior. Likewise we will encounter many such undefined behaviors as we dont have one global lock to access all shards as one. Of Course having such global lock will lead to performance hit as it reduces window for parallel access of shards. Solution: The above undefined behavior can be addressed by delaying the unlink of a file when there are open references on it. File unlink happens in 2 steps. step 1: client creates marker file under .shard/remove_me and sends unlink on base file to the server step 2: on return from the server, the associated shards will be cleaned up and finally marker file will be removed. In step 2, the back ground deletion process does nameless lookup using marker file name (marker file is named after the gfid of the base file) in glusterfs/unlink dir. If the nameless look up is successful then that means the gfid still has open fds and deletion of shards has to be delayed. If nameless lookup fails then that indicates the gfid is unlinked and no open fds on that file (the gfid path is unlinked during final close on the file). The shards on which deletion is delayed are unlinked one the all open fds are closed and this is done through a thread which wakes up every 10 mins. Also removed active_fd_count from inode structure and referring fd_count wherever active_fd_count was used. 
fixes: #1358 Change-Id: I3679de8545f2e5b8027c4d5a6fd0592092e8dfbd Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com> * Update xlators/storage/posix/src/posix-entry-ops.c Co-authored-by: Xavi Hernandez <xhernandez@users.noreply.github.com> * features/shard: delay unlink of a file that has fd_count > 0 When there are multiple processes working on a file and if any process unlinks that file then unlink operation shouldn't harm other processes working on it. This is a posix a compliant behavior and this should be supported when shard feature is enabled also. Problem description: Let's consider 2 clients C1 and C2 working on a file F1 with 5 shards on gluster mount and gluster server has 4 bricks B1, B2, B3, B4. Assume that base file/shard is present on B1, 1st, 2nd shards on B2, 3rd and 4th shards on B3 and 5th shard falls on B4 C1 has opened the F1 in append mode and is writing to it. The write FOP goes to 5th shard in this case. So the inode->fd_count = 1 on B1(base file) and B4 (5th shard). C2 at the same time issued unlink to F1. On the server, the base file has fd_count = 1 (since C1 has opened the file), the base file is renamed under .glusterfs/unlink and returned to C2. Then unlink will be sent to shards on all bricks and shards on B2 and B3 will be deleted which have no open reference yet. C1 starts getting errors while accessing the remaining shards though it has open references for the file. This is one such undefined behavior. Likewise we will encounter many such undefined behaviors as we dont have one global lock to access all shards as one. Of Course having such global lock will lead to performance hit as it reduces window for parallel access of shards. Solution: The above undefined behavior can be addressed by delaying the unlink of a file when there are open references on it. File unlink happens in 2 steps. 
step 1: client creates marker file under .shard/remove_me and sends unlink on base file to the server step 2: on return from the server, the associated shards will be cleaned up and finally marker file will be removed. In step 2, the back ground deletion process does nameless lookup using marker file name (marker file is named after the gfid of the base file) in glusterfs/unlink dir. If the nameless look up is successful then that means the gfid still has open fds and deletion of shards has to be delayed. If nameless lookup fails then that indicates the gfid is unlinked and no open fds on that file (the gfid path is unlinked during final close on the file). The shards on which deletion is delayed are unlinked one the all open fds are closed and this is done through a thread which wakes up every 10 mins. Also removed active_fd_count from inode structure and referring fd_count wherever active_fd_count was used. fixes: #1358 Change-Id: I8985093386e26215e0b0dce294c534a66f6ca11c Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com> * Update fd.c * features/shard: delay unlink of a file that has fd_count > 0 When there are multiple processes working on a file and if any process unlinks that file then unlink operation shouldn't harm other processes working on it. This is a posix a compliant behavior and this should be supported when shard feature is enabled also. Problem description: Let's consider 2 clients C1 and C2 working on a file F1 with 5 shards on gluster mount and gluster server has 4 bricks B1, B2, B3, B4. Assume that base file/shard is present on B1, 1st, 2nd shards on B2, 3rd and 4th shards on B3 and 5th shard falls on B4 C1 has opened the F1 in append mode and is writing to it. The write FOP goes to 5th shard in this case. So the inode->fd_count = 1 on B1(base file) and B4 (5th shard). C2 at the same time issued unlink to F1. 
On the server, the base file has fd_count = 1 (since C1 has opened the file), so the base file is renamed under .glusterfs/unlink and returned to C2. Then unlink is sent to the shards on all bricks, and the shards on B2 and B3, which have no open reference yet, get deleted. C1 starts getting errors while accessing the remaining shards even though it has open references for the file. This is one such undefined behavior. Likewise we will encounter many such undefined behaviors, as we don't have one global lock to access all shards as one. Of course, having such a global lock would hurt performance, since it reduces the window for parallel access to the shards. Solution: The above undefined behavior can be addressed by delaying the unlink of a file when there are open references on it. File unlink happens in 2 steps. step 1: the client creates a marker file under .shard/remove_me and sends unlink on the base file to the server step 2: on return from the server, the associated shards are cleaned up and finally the marker file is removed. In step 2, the background deletion process does a nameless lookup using the marker file name (the marker file is named after the gfid of the base file) in the .glusterfs/unlink dir. If the nameless lookup succeeds, the gfid still has open fds and deletion of the shards has to be delayed. If the nameless lookup fails, the gfid is unlinked and there are no open fds on that file (the gfid path is unlinked during the final close on the file). The shards whose deletion is delayed are unlinked once all open fds are closed; this is done through a thread which wakes up every 10 mins. Also removed active_fd_count from the inode structure, referring to fd_count wherever active_fd_count was used. fixes: #1358 Change-Id: I8985093386e26215e0b0dce294c534a66f6ca11c Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com> Co-authored-by: Xavi Hernandez <xhernandez@users.noreply.github.com>
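The delayed-unlink flow described in the commit above can be sketched as a small in-memory model (all names here are illustrative stand-ins, not the real shard xlator API): a marker named after the base file's gfid stands in for the entry under .shard/remove_me, and shard deletion is deferred while open fds remain.

```python
import uuid

class ShardedFile:
    """Toy stand-in for a sharded file on the brick (illustrative only)."""
    def __init__(self, nshards=3):
        self.gfid = uuid.uuid4()
        self.open_fds = 0                       # brick-side fd count
        self.shards = ["shard.%d" % i for i in range(1, nshards + 1)]
        self.marker = None                      # marker under .shard/remove_me

def unlink(f, pending):
    # step 1: create the marker (named after the gfid), unlink the base file
    f.marker = str(f.gfid)
    if f.open_fds > 0:
        pending.append(f)                       # "nameless lookup" succeeds: delay
    else:
        delete_shards(f)                        # step 2 runs immediately

def delete_shards(f):
    # step 2: background deletion removes the shards, then the marker
    f.shards.clear()
    f.marker = None

def close_fd(f, pending):
    # the real implementation uses a reaper thread waking every 10 minutes
    f.open_fds -= 1
    if f.open_fds == 0 and f in pending:
        pending.remove(f)
        delete_shards(f)
```

With an fd still open, unlink() only parks the file on the pending list; the shards survive until the last close, which is the behavior the commit introduces.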
* tests: 00-georep-verify-non-root-setup.t back to tests/00-geo-rep/ (#2102)Shwetha Acharya2021-02-031-0/+0
| | | | | | | | | | | 00-georep-verify-non-root-setup.t should be moved back to tests/00-geo-rep/ from the tests/000-flaky/ directory, as the recent failures of this test were not caused by the test itself but by the libtirpc installed in the build environment. Fixes: #2101 Change-Id: I2b35e9ed95ad3de68ad8574ff76805f5db64c0b2 Signed-off-by: Shwetha K Acharya <sacharya@redhat.com>
* Move 00-georep-verify-non-root-setup.t to tests/000-flaky/ (#2087)Shwetha Acharya2021-02-011-0/+0
| | | | | | | | | | | | | Spurious failures of 00-georep-verify-non-root-setup.t are seen only on build machines. These failures are not reproducible even on softserve / centos / fedora machines. So, move test 00-georep-verify-non-root-setup.t to tests/000-flaky/ until the issue is root-caused on the build machines. Fixes: #2086 Change-Id: Id1eab598fa0f9ba5ba019e6b3f057a5b10fdb0ea Signed-off-by: Shwetha K Acharya <sacharya@redhat.com>
* features/shard: unlink fails due to nospace to mknod marker fileVinayakswami Hariharmath2021-01-261-0/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | When we hit the max capacity of the storage space, shard_unlink() starts failing if there is no space left on the brick to create a marker file. shard_unlink() happens in the steps below: 1. create a marker file named after the gfid of the base file under BRICK_PATH/.shard/.remove_me 2. unlink the base file 3. shard_delete_shards() deletes the shards in the background by picking the entries in BRICK_PATH/.shard/.remove_me If marker file creation fails, we can't really delete the shards, which eventually becomes a problem for a user who is trying to make space by deleting unwanted data. Solution: Create the marker file with xdata = GLUSTERFS_INTERNAL_FOP_KEY, which marks it as an internal op that is allowed to create under reserved space. Fixes: #2038 Change-Id: I7facebab940f9aeee81d489df429e00ef4fb7c5d Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
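The reserved-space exception in this fix can be illustrated with a simplified allocation check (hypothetical names; the real enforcement lives in the posix xlator's disk-space checks and keys off GLUSTERFS_INTERNAL_FOP_KEY in xdata):

```python
def allow_create(size, free_bytes, reserved_bytes, is_internal_fop):
    # internal FOPs (like the unlink marker file) may use reserved space,
    # so the marker can still be created on an otherwise full brick;
    # regular client FOPs must stay above the reserved watermark
    if is_internal_fop:
        return size <= free_bytes
    return size <= free_bytes - reserved_bytes
```

This is why marking the marker-file create as an internal op lets deletion proceed even when the brick is at capacity.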
* tests: fix tests/bugs/nfs/bug-1053579.t (#2034)Xavi Hernandez2021-01-221-23/+68
| | | | | | | | | | | | | | | | | | | | | | | | | | * tests: fix tests/bugs/nfs/bug-1053579.t On NFS the number of groups associated to a user that can be passed to the server is limited. This test created a user with 200 groups and checked that a file owned by the latest created group couldn't be accessed, under the assumption that the last group won't be passed to the server. However there's no guarantee on how the list of groups is generated, so the latest created group could be passed as one of the initial groups, allowing access to the file and causing the test to fail (because it expected access to be denied). Given that there's no way to be sure which groups will be passed, this patch changes the test so that a check is done for each group the user belongs to. Then we check that there have been some successes and some failures. Once 'manage-gids' is set, we do the same, but this time the number of failures must be 0. Fixes: #2033 Change-Id: Ide06da2861fcade2166372d1f3e9eb4ff2dd5f58 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
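The updated test's logic can be sketched as a simulation, under the assumption that the server sees only an unordered subset of the client's groups (16 is the AUTH_SYS auxiliary-group limit; all other names here are illustrative):

```python
MAX_AUX_GIDS = 16       # AUTH_SYS caps the auxiliary groups sent over the wire

def check_access(file_group, client_groups, manage_gids=False):
    if manage_gids:
        # server resolves the full group list itself: no truncation
        visible = set(client_groups)
    else:
        # only a subset reaches the server, and which groups survive
        # is not guaranteed to follow creation order
        visible = set(client_groups[:MAX_AUX_GIDS])
    return file_group in visible

groups = list(range(200))
results = [check_access(g, groups) for g in groups]
```

The test then asserts the mixed outcome (some successes, some failures) without manage-gids, and zero failures once it is enabled.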
* tests: ./tests/bugs/replicate/bug-921231.t is continuously failing (#2006)mohit842021-01-131-1/+1
| | | | | | | | | | | | The test case (./tests/bugs/replicate/bug-921231.t) is continuously failing. The test case is failing because inodelk_max_latency shows a wrong value in the profile. The value is not correct because the profile timestamp was recently changed from microseconds to nanoseconds by patch #1833. Fixes: #2005 Change-Id: Ieb683836938d986b56f70b2380103efe95657821 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* features/shard: avoid repetitive calls to gf_uuid_unparse() (#1689)Vinayak hariharmath2021-01-111-3/+6
| | | | | | | | | | | The issue is that shard_make_block_abspath() calls gf_uuid_unparse() every time it constructs a shard path. The gfid can be unparsed and saved once, then passed in while constructing each path. Thus we can avoid calling gf_uuid_unparse() repeatedly. Fixes: #1423 Change-Id: Ia26fbd5f09e812bbad9e5715242f14143c013c9c Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
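The optimization amounts to hoisting the stringification out of the per-shard path construction. A Python analogue (uuid's str() standing in for gf_uuid_unparse(); the path layout is illustrative):

```python
import uuid

def shard_paths_old(base_gfid, block_nums):
    # old behaviour: stringify ("unparse") the gfid once per shard path
    return ["/.shard/%s.%d" % (str(base_gfid), n) for n in block_nums]

def shard_paths_new(base_gfid, block_nums):
    # fixed behaviour: unparse once, reuse the saved string for every path
    gfid_str = str(base_gfid)
    return ["/.shard/%s.%d" % (gfid_str, n) for n in block_nums]
```

Both produce identical paths; the second avoids the repeated conversion on every call.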
* stripe cleanup: Remove the option from create and add-brick cmds (#1812)Sheetal Pamecha2021-01-052-75/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | * stripe cleanup: Remove the option from create and add-brick cmds This patch aims to remove the code for the stripe option instead of keeping default values of the stripe/stripe-count variables, setting and getting dict options, and similar redundant operations. Also removing tests for stripe volumes that have already been marked bad. Updates: #1000 Change-Id: Ic2b3cabd671f0c8dc0521384b164c3078f7ca7c6 Signed-off-by: Sheetal Pamecha <spamecha@redhat.com> * Fix regression error tests/000-flaky/basic_changelog_changelog-snapshot.t was failing due to a 0 return value Change-Id: I8ea0443669c63768760526db5aa1f205978e1dbb Signed-off-by: Sheetal Pamecha <spamecha@redhat.com> * add constant stripe_count value for upgrade scenarios Change-Id: I49f3da4f106c55f9da20d0b0a299275a19daf4ba * Fix clang-format warning Change-Id: I83bae85d10c8c5b3c66f56c9f8de1ec81d0bbc95
* geo-rep: remove offensive languageRavishankar N2020-12-3012-14/+14
| | | | | Change-Id: I9e34ccfd28029209262874993836dd36ad3c9c01 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* tests: remove offensive languageRavishankar N2020-12-3019-698/+699
| | | | | | | | | | TODO: Remove 'slave-timeout' and 'slave-gluster-command-dir'. These variables are defined in geo-replication/gsyncd.conf.in. So I will remove them when I change that folder. Change-Id: Ib9167ca586d83e01f8ec755cdf58b3438184c9dd Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cli/glusterd: conscious language changes for geo-repRavishankar N2020-12-301-1/+1
| | | | | | | | | | | | | | Replace master and slave terminology in geo-replication with primary and secondary respectively. All instances are replaced in cli and glusterd. Changes to other parts of the code to follow in separate patches. tests/00-geo-rep/* are passing thus far. Updates: #1415 Change-Id: Ifb12b7f5ce927a4a61bda1e953c1eb0fdfc8a7c5 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* core: Implement graceful shutdown for a brick process (#1751)mohit842020-12-162-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * core: Implement graceful shutdown for a brick process glusterd sends a SIGTERM to the brick process when stopping a volume if brick_mux is not enabled. In the brick_mux case, on receiving a terminate signal for the last brick, the brick process sends a SIGTERM to itself to stop. The current approach does not clean up resources when either the last brick is detached or brick_mux is not enabled. Solution: glusterd sends a terminate notification to the brick process when stopping a volume, allowing a graceful shutdown. Change-Id: I49b729e1205e75760f6eff9bf6803ed0dbf876ae Fixes: #1749 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> * core: Implement graceful shutdown for a brick process Resolve some reviewer comments Fixes: #1749 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> Change-Id: I50e6a9e2ec86256b349aef5b127cc5bbf32d2561 * core: Implement graceful shutdown for a brick process Implement a key cluster.brick-graceful-cleanup to enable graceful shutdown for a brick process. If the key's value is on, glusterd sends a detach request to stop the brick. Fixes: #1749 Change-Id: Iba8fb27ba15cc37ecd3eb48f0ea8f981633465c3 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> * core: Implement graceful shutdown for a brick process Resolve reviewer comments Fixes: #1749 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> Change-Id: I2a8eb4cf25cd8fca98d099889e4cae3954c8579e * core: Implement graceful shutdown for a brick process Resolve a reviewer comment specific to avoiding a memory leak Fixes: #1749 Change-Id: Ic2f09efe6190fd3776f712afc2d49b4e63de7d1f Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> * core: Implement graceful shutdown for a brick process Resolve a reviewer comment specific to avoiding a memory leak Fixes: #1749 Change-Id: I68fbbb39160a4595fb8b1b19836f44b356e89716 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* glusterd/cli: enhance rebalance-status after replace/reset-brick (#1869)Tamar Shacked2020-12-081-0/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * glusterd/cli: enhance rebalance-status after replace/reset-brick The rebalance status is reset during replace/reset-brick operations. This causes 'volume status' to show rebalance as "not started". Fix: change the rebalance-status to "reset due to (replace|reset)-brick" Change-Id: I6e3372d67355eb76c5965984a23f073289d4ff23 Signed-off-by: Tamar Shacked <tshacked@redhat.com> * glusterd/cli: enhance rebalance-status after replace/reset-brick The rebalance status is reset during replace/reset-brick operations. This causes 'volume status' to show rebalance as "not started". Fix: change the rebalance-status to "reset due to (replace|reset)-brick" Fixes: #1717 Signed-off-by: Tamar Shacked <tshacked@redhat.com> Change-Id: I1e3e373ca3b2007b5b7005b6c757fb43801fde33 * cli: change the rebal task ID to "None" in case the status is being reset The rebalance status is reset during replace/reset-brick operations. This causes 'volume status' to show rebalance as "not started". Fix: change the rebalance-status to "reset due to (replace|reset)-brick" Fixes: #1717 Change-Id: Ia73a8bea3dcd8e51acf4faa6434c3cb0d09856d0 Signed-off-by: Tamar Shacked <tshacked@redhat.com>
* glusterd: modify logic for checking hostname in add-brick (#1781)Sheetal Pamecha2020-12-071-4/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * glusterd: modify logic for checking hostname in add-brick Problem: the add-brick command parses only the bricks provided on the cli for a subvolume. If the bricks in the same subvolume are increased, they are not checked against the volume's existing bricks. Fixes: #1779 Change-Id: I768bcf7359a008f2d6baccef50e582536473a9dc Signed-off-by: Sheetal Pamecha <spamecha@redhat.com> * removed assignment of an unused variable Fixes: #1779 Change-Id: Id5ed776b28343e1225b9898e81502ce29fb480fa Signed-off-by: Sheetal Pamecha <spamecha@redhat.com> * few more changes Change-Id: I7bacedb984f968939b214f9d13546f4bf92e9df7 Signed-off-by: Sheetal Pamecha <spamecha@redhat.com> * few more changes Change-Id: I7bacedb984f968939b214f9d13546f4bf92e9df7 Signed-off-by: Sheetal Pamecha <spamecha@redhat.com> * correction in the last commit Signed-off-by: Sheetal Pamecha <spamecha@redhat.com> Change-Id: I1fd0d941cf3f32aa6e8c7850def78e5af0d88782
* all: change 'primary' to 'root' where it makes senseRavishankar N2020-12-021-2/+2
| | | | | | | | | | | | | | | | | | As a part of offensive language removal, we changed 'master' to 'primary' in some parts of the code that are *not* related to geo-replication via commits e4c9a14429c51d8d059287c2a2c7a76a5116a362 and 0fd92465333be674485b984e54b08df3e431bb0d. But it is better to use 'root' in some places to distinguish it from the geo-rep changes which use 'primary/secondary' instead of 'master/slave'. This patch mainly changes glusterfs_ctx_t->primary to glusterfs_ctx_t->root. Other places like meta xlator is also changed. gf-changelog.c is not changed since it is related to geo-rep. Updates: #1000 Change-Id: I3cd610f7bea06c7a28ae2c0104f34291023d1daf Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* DHT/Rebalance - Ensure Rebalance reports status only once upon stopping (#1783)Barak Sason Rofman2020-11-241-0/+75
| | | | | | | | | | | | | | | DHT/Rebalance - Ensure Rebalance reports status only once upon stopping Upon issuing the rebalance stop command, the status of rebalance is logged twice to the log file, which can sometimes result in inconsistent reports (one report states status stopped, while the other may report something else). This fix ensures rebalance reports its status only once and that the correct status is being reported. fixes: #1782 Change-Id: Id3206edfad33b3db60e9df8e95a519928dc7cb37 Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
* posix: fix io_uring crash in reconfigure (#1804)Ravishankar N2020-11-171-0/+20
| | | | | | | | | Call posix_io_uring_fini only if it was inited to begin with. Fixes: #1794 Reported-by: Mohit Agrawal <moagrawa@redhat.com> Signed-off-by: Ravishankar N <ravishankar@redhat.com> Change-Id: I0e840b6b1d1f26b104b30c8c4b88c14ce4aaac0d
* tests: Fix issues in CentOS 8 (#1756)Xavi Hernandez2020-11-062-6/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | * tests: Fix issues in CentOS 8 Due to some configuration changes in CentOS 8/RHEL 8, ssl-ciphers.t and bug-1053579.t were failing. The first one was failing because TLS v1.0 is disabled by default. The test has been updated to check that at least one of TLS v1.0, v1.1 or v1.2 succeeds. For the second case, the issue is that the test assumed that the latest group added to a user should always be listed last, but this is not always true because nsswitch.conf now uses 'sss' before 'files', which means that the data comes from a db that is not necessarily sorted. Updates: #1009 Change-Id: I4ca01a099854ec25926c3d76b3a98072175bab06 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com> * tests: Fix TLS version detection The old test didn't correctly determine which version of TLS should be allowed by openssl. Change-Id: Ic081c329d5ed1842fa9f5fd23742ae007738aec0 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* glusterd: fix bug in enabling granular-entry-heal (#1752)Ravishankar N2020-11-051-5/+20
| | | | | | | | | | | | | | commit f5e1eb87d4af44be3b317b7f99ab88f89c2f0b1a meant to enable the volume option only for replica volumes but inadvertently enabled it for all volume types. Fixing it now. Also found a bug in glusterd where disabling the option on plain distribute succeeded even though setting it in the first place fails. Fixed that too. Fixes: #1483 Change-Id: Icb6c169a8eec44cc4fb4dd636405d3b3485e91b4 Reported-by: Sheetal Pamecha <spamecha@redhat.com> Signed-off-by: Ravishankar N <ravishankar@redhat.com>
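The intended validation can be sketched like this (hypothetical names; the real checks live in glusterd's option handling): both enabling and disabling of the option must be rejected on non-replicate volumes, which is what the plain-distribute bug violated.

```python
def set_granular_entry_heal(volume, value):
    # enabling *and* disabling must fail on non-replicate volumes;
    # the bug was that disabling slipped through on plain distribute
    if volume["type"] != "replicate":
        raise ValueError("granular-entry-heal is valid only for replicate volumes")
    volume["granular-entry-heal"] = value
```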
* xlators: misc conscious language changes (#1715)Ravishankar N2020-11-021-2/+2
| | | | | | | | | | | | core: change xlator_t->ctx->master to xlator_t->ctx->primary afr: just changed comments. meta: change .meta/master to .meta/primary. Might break scripts. changelog: variable/function name changes only. These are unrelated to geo-rep. Fixes: #1713 Change-Id: I58eb5fcd75d65fc8269633acc41313503dccf5ff Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* extras/rebalance: Script to perform directory rebalance (#1676)Pranith Kumar Karampuri2020-10-302-0/+133
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * extras/rebalance: Script to perform directory rebalance How should the script be executed? $ /path/to/directory-rebalance.py <dir-to-rebalance> will do rebalance just for that directory. The script assumes that the fix-layout operation is completed for all the directories present inside <dir-to-rebalance> How does it work? For the given directory path that needs to be rebalanced, a full crawl is performed, and the files that need to be healed and the size of each file are first written to the index. Once building the index is completed, the index is read and for each file the script executes the equivalent of setfattr -n trusted.distribute.migrate-data -v 1 <path/to/file> Why does the script take two passes? Printing a sensible ETA has been a primary goal of the script. Without knowing the approximate size that will be rebalanced, it is difficult to find the ETA. Hence the script does one pass to find the files and sizes, which it writes to the index file, and then the next pass is done on the index file. It takes a minute or two for the ETA to converge, but in our testing it has been giving a reasonable ETA. What versions does the script support? For the script to work correctly, dht should handle the "trusted.distribute.migrate-data" setxattr correctly. fixes: #1654 Change-Id: Ie5070127bd45f1a1b9cd18ed029e364420c971c1 Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
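The two-pass structure described above can be sketched in Python (purely illustrative; the real script shells out to setfattr with trusted.distribute.migrate-data per file, and the function names here are made up):

```python
def build_index(crawl):
    # pass 1: the full crawl records each file path and its size
    return [(path, size) for path, size in crawl]

def rebalance(index, migrate_file, clock):
    # pass 2: migrate each file, estimating ETA from bytes done vs. total
    total = sum(size for _, size in index)
    done = 0
    start = clock()
    for path, size in index:
        migrate_file(path)      # stands in for the migrate-data setxattr
        done += size
        elapsed = clock() - start
        eta = elapsed * (total - done) / done if done else 0.0
        yield path, eta
```

Knowing the total size up front (from pass 1) is exactly what makes the ETA computable, which is why the script accepts the cost of crawling twice.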
* cluster/dht: Perform migrate-file with lk-owner (#1581)Pranith Kumar Karampuri2020-10-294-0/+99
| | | | | | | | | | | | | | | | | | | | | | * cluster/dht: Perform migrate-file with lk-owner 1) Added GF_ASSERT() calls in client-xlator to find these issues sooner. 2) Fuse is setting zero-lkowner with len as 8 when the fop doesn't have any lk-owner. Changed this to have len as 0 just as we have in fops triggered from xlators lower to fuse. * syncop: Avoid frame allocation if we can * cluster/dht: Set lkowner in daemon rebalance code path * cluster/afr: Set lkowner for ta-selfheal * cluster/ec: Destroy frame after heal is done * Don't assert for lk-owner in lk call * set lkowner for mandatory lock heal tests fixes: #1529 Change-Id: Ia803db6b00869316893abb1cf435b898eec31228 Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* tests: exclude more contrib/fuse-lib objects (#1694)Dmitry Antipov2020-10-271-0/+3
| | | | | | | Exclude more contrib/fuse-lib objects to avoid silly tests/basic/0symbol-check.t breakage. Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Fixes: #1692
* glusterd/afr: enable granular-entry-heal by default (#1621)Ravishankar N2020-10-2212-18/+528
| | | | | | | | | | | | | | | | | | | | | | | | | 1. The option has been enabled and tested for quite some time now in RHHI-V downstream and I think it is safe to make it 'on' by default. Since it is not possible to simply change it from 'off' to 'on' without breaking rolling upgrades, old clients etc., I have made it the default only for new volumes starting from op-version GD_OP_VERSION_9_0. Note: If you do a volume reset, the option will be turned back off. This is okay as the dir's gfid will be captured in the 'xattrop' folder and heals will proceed. There might be stale entries inside the 'entry-changes' folder, which will be removed when we enable the option again. 2. I encountered a customer issue where entry heal was pending on a dir with 236436 files in it and the glustershd.log output was just stuck at "performing entry selfheal", so I have added logs to give us more info in DEBUG level about whether entry heal and data heal are progressing (metadata heal doesn't take much time). That way, we have a quick visual indication that things are not 'stuck' if we briefly enable debug logs, instead of taking statedumps or checking profile info etc. Fixes: #1483 Change-Id: I4f116f8c92f8cd33f209b758ff14f3c7e1981422 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* test: The test case tests/bugs/bug-1064147.t is failing (#1662)mohit842020-10-191-0/+3
| | | | | | | | | | | | The test case tests/bugs/bug-1064147.t is failing when comparing the root permission with the permission changed while one of the bricks was down. The permissions did not match because the layout did not exist on root at the time of healing the permission, so the correct permission was not healed on the newly started brick. Fixes: #1661 Change-Id: If63ea47576dd14f4b91681dd390e2f84f8b6ac18 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* tests/00-geo-rep: 00-georep-verify-non-root-setup.t fails on devel branch Shwetha Acharya2020-10-151-1/+1
| | | | | | | | | | (#1639) Increasing the timeout value to overcome the encountered delay in the execution of the test Change-Id: Id40d92366738439634a6b06d447a43a2c6cdbf44 Updates: #1594 Signed-off-by: Shwetha K Acharya <sacharya@redhat.com>
* io-stats: Configure ios_sample_buf_size based on sample_interval value (#1574)mohit842020-10-151-4/+4
| | | | | | | | | | | The io-stats xlator declares a 64k-entry ios_sample_buf_size object (~10M) per xlator, but when sample_interval is 0 this big buffer is not required, so declare the default value only when sample_interval is not 0. The new change helps reduce the RSS size of brick and shd processes when the number of volumes is huge. Change-Id: I3e82cca92e40549355edfac32580169f3ce51af8 Fixes: #1542 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
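The sizing change reduces to a conditional default, sketched here with illustrative names and values (the real buffer holds sampling structs adding up to roughly 10M per xlator):

```python
DEFAULT_IOS_SAMPLE_BUF_SIZE = 65536   # 64k samples, ~10M of memory per xlator

def ios_sample_buf_size(sample_interval):
    # with sampling disabled (interval 0), the big ring buffer is never
    # consulted, so allocate a minimal one instead of the 10M default
    return DEFAULT_IOS_SAMPLE_BUF_SIZE if sample_interval > 0 else 1
```

With hundreds of volumes, avoiding the unused 10M allocation per io-stats instance is what produces the RSS reduction the commit describes.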
* tests/00-geo-rep: 00-georep-verify-non-root-setup.t fails on devel branch Shwetha Acharya2020-10-131-0/+4
| | | | | | | | | | | | | | (#1617) ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -p 22 nroot@127.0.0.1 /build/install/sbin/gluster --xml --remote-host=localhost volume info slave fails with error 255. Add ssh key cleanup code at the beginning of the test, in order to clean up any stale entries. Updates: #1594 Signed-off-by: Shwetha K Acharya <sacharya@redhat.com>
* mount/fuse: Fix graph-switch when reader-thread-count is setPranith Kumar K2020-10-051-0/+65
| | | | | | | | | | | | | | Problem: The current graph-switch code sets priv->handle_graph_switch to false even when graph-switch is in progress, which leads to crashes in some cases. Fix: priv->handle_graph_switch should be set to false only when graph-switch completes. fixes: #1539 Change-Id: I5b04f7220a0a6e65c5f5afa3e28d1afe9efcdc31 Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* cluster/afr: Heal directory rename without rmdir/mkdirPranith Kumar K2020-04-138-46/+750
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem1: When a directory was renamed while a brick was down, entry-heal always did an rm -rf on that directory on the sink at the old location and then did mkdir and recreated the directory hierarchy in the new location. This is inefficient. Problem2: The rename-dir heal order may lead to a scenario where the directory in the new location is created before it is deleted from the old location, leading to 2 directories with the same gfid in posix. Fix: As part of heal, if the old location is healed first and the directory is not present on the source brick, always rename it into a hidden directory inside the sink brick so that when heal is triggered for the new location, shd can rename it from this hidden directory to the new location. If the new-location heal is triggered first and it detects that the directory already exists in the brick, it should skip healing the directory until it appears in the hidden directory. Credits: Ravi for the rename-data-loss.t script Fixes: #1211 Change-Id: I0cba2006f35cd03d314d18211ce0bd530e254843 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
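The park-and-move strategy in this fix can be modelled with a dict as a toy brick namespace (the hidden directory name and all function names here are illustrative, not the actual on-disk paths):

```python
HIDDEN = ".glusterfs/parked"            # stand-in for the hidden directory

def heal_old_location(brick, old_path, present_on_source):
    # old-location heal: if the directory is gone from the source, park it
    # under the hidden directory instead of rm -rf, preserving its gfid
    gfid = brick.get(old_path)
    if gfid is not None and not present_on_source:
        del brick[old_path]
        brick["%s/%s" % (HIDDEN, gfid)] = gfid
    return gfid

def heal_new_location(brick, new_path, gfid):
    # new-location heal: rename from the hidden directory when available,
    # otherwise skip (avoids two directories with the same gfid)
    parked = "%s/%s" % (HIDDEN, gfid)
    if parked in brick:
        del brick[parked]
        brick[new_path] = gfid
        return True
    return False
```

Because the parked entry keeps its gfid, the eventual move to the new location is a rename rather than a delete-and-recreate, which is the whole point of the fix.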
* gfapi: Move the SECURE_ACCESS_FILE check out of glfs_mgmt_initMôshe van der Sterre2020-09-283-0/+218
| | | | | | | | glfs_mgmt_init is only called for glfs_set_volfile_server, but secure_mgmt is also required to use glfs_set_volfile with SSL. fixes: #829 Change-Id: Ibc769fe634d805e085232f85ce6e1c48bf4acc66
* glusterd: Fix Add-brick with increasing replica count failureSheetal Pamecha2020-09-231-0/+21
| | | | | | | | | | | | | Problem: the add-brick operation fails with a "multiple bricks on same server" error when the replica count is increased. This was happening because of extra runs in the loop that compares hostnames: if fewer bricks than the "replica" count were supplied, a brick would get compared with itself, resulting in the above error. Fixes: #1508 Change-Id: I8668e964340b7bf59728bb838525d2db062197ed Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
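The corrected comparison can be sketched as grouping bricks per replica subvolume and deduplicating hostnames within each group, so a brick is never matched against itself (a simplified illustration, not glusterd's actual code):

```python
def duplicate_host_in_subvol(bricks, replica):
    # walk the brick list one replica-set at a time; a host may appear
    # at most once per subvolume, and a lone brick in a short final
    # group never conflicts with itself
    for i in range(0, len(bricks), replica):
        hosts = [b.split(":", 1)[0] for b in bricks[i:i + replica]]
        if len(hosts) != len(set(hosts)):
            return True
    return False
```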