path: root/tests/bugs
Commit message | Author | Date | Files | Lines
* protocol/client: Fix lock memory leak (#2338) | Pranith Kumar Karampuri | 2021-04-22 | 3 | -22/+137
Problem-1: When an overlapping lock is issued, the merged lock is not assigned an owner. When a flush is issued on the fd, this particular lock is not freed, leading to a memory leak.
Fix-1: Assign the owner while merging the locks.
Problem-2: On fd-destroy, lock structs could still be present in fdctx. For some reason, with the flock -x command and closing of the bash fd, this code path is hit, which leaks the lock structs.
Fix-2: When fdctx is being destroyed in the client, make sure to clean up any lock structs.
fixes: #2337
Change-Id: I298124213ce5a1cf2b1f1756d5e8a9745d9c0a1c
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
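For illustration only, a minimal generic sketch of the idea behind Fix-1 (the struct and names are assumptions, not the client xlator's types): when two byte-range locks are merged, the merged lock must inherit an owner so a later flush by that owner can find and free it.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative byte-range lock; fields are assumptions, not GlusterFS structs. */
    struct brlock {
        uint64_t start;
        uint64_t end;    /* inclusive */
        uint64_t owner;  /* lock-owner id; must survive a merge */
    };

    /* Merge the newly issued lock 'src' into the existing lock 'dst'. The merged
     * lock keeps the owner of the new lock so that a flush by that owner frees it. */
    static void brlock_merge(struct brlock *dst, const struct brlock *src)
    {
        if (src->start < dst->start)
            dst->start = src->start;
        if (src->end > dst->end)
            dst->end = src->end;
        dst->owner = src->owner;  /* the missing assignment is what caused the leak */
    }

    int main(void)
    {
        struct brlock held = { 10, 20, 0 };   /* owner never set: leak-prone */
        struct brlock new_lock = { 15, 30, 42 };
        brlock_merge(&held, &new_lock);
        printf("merged: [%llu, %llu] owner=%llu\n",
               (unsigned long long)held.start, (unsigned long long)held.end,
               (unsigned long long)held.owner);
        return 0;
    }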
* dht: fix rebalance of sparse files (#2318) | Xavi Hernandez | 2021-04-09 | 1 | -0/+29
The current implementation of rebalance for sparse files has a bug that, in some cases, causes a read of 0 bytes from the source subvolume. The posix xlator doesn't allow 0-byte reads and fails them with EINVAL, which causes rebalance to abort the migration.
This patch implements a more robust way of finding data segments in a sparse file that avoids 0-byte reads, allowing the file to be migrated successfully.
Fixes: #2317
Change-Id: Iff168dda2fb0f2edf716b21eb04cc2cc8ac3915c
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
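One standard way to find data segments in a sparse file without ever issuing a 0-byte read is lseek() with SEEK_DATA/SEEK_HOLE. The sketch below only illustrates that technique on Linux; it is not the code used by the patch.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Print the data segments of a sparse file using SEEK_DATA/SEEK_HOLE. */
    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0)
            return 1;

        off_t data = 0;
        for (;;) {
            data = lseek(fd, data, SEEK_DATA);
            if (data < 0)
                break;                          /* ENXIO: no more data past offset */
            off_t hole = lseek(fd, data, SEEK_HOLE);
            if (hole < 0)
                break;
            /* [data, hole) is always non-empty, so a read here is never 0 bytes. */
            printf("data segment: %lld - %lld\n", (long long)data, (long long)hole);
            data = hole;
        }
        close(fd);
        return 0;
    }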
* Removal of force option in snapshot create (#2110) | nishith-vihar | 2021-04-06 | 1 | -1/+1
The force option for the snapshot create command can fail even though quorum is satisfied, and it is redundant. This change deprecates the force option for snapshot create and checks whether all bricks are online, instead of checking for quorum, when creating a snapshot.
Fixes: #2099
Change-Id: I45d866e67052fef982a60aebe8dec069e78015bd
Signed-off-by: Nishith Vihar Sakinala <nsakinal@redhat.com>
* afr: don't reopen fds on which POSIX locks are held (#1980) | Karthik Subrahmanya | 2021-03-27 | 1 | -0/+206
When client.strict-locks is enabled on a volume and there are POSIX locks held on files, do not re-open such fds after a client disconnects and reconnects. Re-opening them might lead to multiple clients acquiring the locks and cause data corruption.
Change-Id: I8777ffbc2cc8d15ab57b58b72b56eb67521787c5
Fixes: #1977
Signed-off-by: karthik-us <ksubrahm@redhat.com>
* cluster/dht: use readdir for fix-layout in rebalance (#2243) | Pranith Kumar Karampuri | 2021-03-22 | 2 | -0/+8
Problem: On a cluster with 15 million files, when fix-layout was started, it was not progressing at all. So we tried to do an os.walk() + os.stat() on the backend filesystem directly. It took 2.5 days. We removed os.stat() and re-ran it on another brick with a similar data set. It took 15 minutes. We realized that readdirp is extremely costly compared to readdir when the stat is not useful. The fix-layout operation only needs to know that an entry is a directory so that fix-layout can be triggered on it. Most modern filesystems provide this information in the readdir operation, so we don't need readdirp, i.e. readdir+stat.
Fix: Use the readdir operation in fix-layout. Do readdir+stat/lookup for filesystems that don't provide d_type in the readdir operation.
fixes: #2241
Change-Id: I5fe2ecea25a399ad58e31a2e322caf69fc7f49eb
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
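For reference, the same idea expressed with plain POSIX calls: trust d_type when the filesystem fills it in, and fall back to a per-entry stat only when it reports DT_UNKNOWN. This is a generic sketch, not the rebalance code itself.

    #include <stdio.h>
    #include <string.h>
    #include <limits.h>
    #include <dirent.h>
    #include <sys/stat.h>

    /* Return 1 if entry 'e' inside 'dirpath' is a directory, using d_type when available. */
    static int is_directory(const char *dirpath, const struct dirent *e)
    {
        if (e->d_type != DT_UNKNOWN)
            return e->d_type == DT_DIR;          /* cheap: no extra syscall */

        char path[PATH_MAX];
        struct stat st;
        snprintf(path, sizeof(path), "%s/%s", dirpath, e->d_name);
        return (lstat(path, &st) == 0) && S_ISDIR(st.st_mode);   /* fallback: readdir+stat */
    }

    int main(int argc, char **argv)
    {
        if (argc < 2)
            return 1;
        DIR *dir = opendir(argv[1]);
        if (!dir)
            return 1;
        struct dirent *e;
        while ((e = readdir(dir)) != NULL) {
            if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
                continue;
            if (is_directory(argv[1], e))
                printf("dir: %s\n", e->d_name);  /* a fix-layout would be triggered here */
        }
        closedir(dir);
        return 0;
    }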
* cluster/dht: Provide option to disable fsync in data migration (#2259) | Pranith Kumar Karampuri | 2021-03-17 | 6 | -7/+7
At the moment dht rebalance doesn't give any option to disable fsync after data migration. Making this an option lets admins take responsibility for their data in a way that is suitable for their cluster. The default value is still 'on', so the behavior is unchanged for people who don't care about this. For example: if the data that is going to be migrated is already backed up or snapshotted, there is no need for an fsync right after migration, which can affect active I/O on the volume from applications.
fixes: #2258
Change-Id: I7a50b8d3a2f270d79920ef306ceb6ba6451150c4
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
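A tiny sketch of the general pattern such an option toggles (the function name and do_fsync flag are illustrative assumptions, not the DHT code): flush the migrated copy to disk only when the option is enabled, otherwise rely on normal kernel writeback.

    #include <fcntl.h>
    #include <unistd.h>

    /* After copying a file's data to the destination fd, optionally flush it. */
    static int finish_migration(int dst_fd, int do_fsync)
    {
        if (do_fsync && fsync(dst_fd) != 0)
            return -1;   /* surface the error to the caller */
        return 0;        /* with the option off, rely on kernel writeback */
    }

    int main(void)
    {
        int fd = open("migrated-copy", O_CREAT | O_WRONLY, 0600);
        if (fd < 0)
            return 1;
        int ret = finish_migration(fd, 1);   /* 1 = option 'on' (the default) */
        close(fd);
        return ret != 0;
    }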
* features/index: Optimize link-count fetching code path (#1789) | Pranith Kumar Karampuri | 2021-03-10 | 7 | -45/+17
* features/index: Optimize link-count fetching code path
Problem: AFR requests 'link-count' in lookup to check if there are any pending heals. Based on this information, afr will set dirent->inode to NULL in readdirp while heals are ongoing, to prevent serving bad data. After heals are completed, the link-count xattr leads to an opendir of the xattrop directory and then reading its contents to figure out that no healing is needed, on every lookup. This was not detected until this github issue because ZFS in some cases can lead to very slow readdir() calls. Since Glusterfs does a lot of lookups, this was slowing down all operations and increasing load on the system.
Code problem: the index xlator, on any xattrop operation, adds an index to the relevant dirs and, after the xattrop operation is done, will delete or keep the index in that directory based on the value fetched in xattrop from posix. AFR sends an all-zero xattrop for changelog xattrs. This leads to priv->pending_count manipulation which sets the count back to -1. The next lookup operation then triggers opendir/readdir to find the actual link-count, because the in-memory priv->pending_count is negative.
Fix:
1) Don't add to the index on an all-zero xattrop for a key.
2) Set pending-count to -1 when the first gfid is added into the xattrop directory, so that the next lookup can compute the link-count.
fixes: #1764
Change-Id: I8a02c7e811a72c46d78ddb2d9d4fdc2222a444e9
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* addressed comments
Change-Id: Ide42bb1c1237b525d168bf1a9b82eb1bdc3bc283
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* tests: Handle base index absence
Change-Id: I3cf11a8644ccf23e01537228766f864b63c49556
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* Addressed LOCK based comments, .t comments
Change-Id: I5f53e40820cade3a44259c1ac1a7f3c5f2f0f310
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* afr: fix directory entry count (#2233) | Xavi Hernandez | 2021-03-09 | 2 | -0/+119
AFR may hide some existing entries from a directory when reading it because they are generated internally for private management. However, the number of entries returned by the readdir() function is not updated accordingly, so it may report a number higher than the real number of entries present in the gf_dirent list. This may cause unexpected behavior in clients, including gfapi, which incorrectly assumes that there was an entry when the list was actually empty.
This patch also makes the check in gfapi more robust to avoid similar issues that could appear in the future.
Fixes: #2232
Change-Id: I81ba3699248a53ebb0ee4e6e6231a4301436f763
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* cluster/dht: Fix stack overflow in readdir(p) (#2170) | Xavi Hernandez | 2021-02-24 | 1 | -0/+33
When parallel-readdir is enabled, readdir(p) requests sent by DHT can be immediately processed and answered in the same thread before the call to STACK_WIND_COOKIE() completes. This means that the readdir(p) cbk is processed synchronously. In some cases it may decide to send another readdir(p) request, which causes a recursive call. When some special conditions happen and the directories are big, it's possible that the number of nested calls is so high that the process crashes because of a stack overflow.
This patch fixes this by not allowing nested readdir(p) calls. When a nested call is detected, it's queued instead of being sent. The queued request is processed when the current call finishes, by the top-level stack function.
Fixes: #2169
Change-Id: Id763a8a51fb3c3314588ec7c162f649babf33099
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
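A generic illustration of the queue-instead-of-recurse pattern described above (not the DHT code; the request type and send_to_server() stub are assumptions, and a single dispatching thread is assumed): if a callback fires synchronously while a request is already being issued, the follow-up request is queued and drained by the outermost caller, keeping stack depth bounded.

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical request type, for illustration only. */
    struct request { struct request *next; int offset; };

    static bool in_progress = false;   /* are we already inside issue_request()? */
    static struct request *queue_head = NULL, *queue_tail = NULL;

    /* Stub standing in for the real I/O; in the real scenario its callback may
     * re-enter issue_request() synchronously on the same thread. */
    static void send_to_server(struct request *req) { (void)req; }

    void issue_request(struct request *req)
    {
        if (in_progress) {             /* nested call: queue instead of recursing */
            req->next = NULL;
            if (queue_tail)
                queue_tail->next = req;
            else
                queue_head = req;
            queue_tail = req;
            return;
        }

        in_progress = true;
        send_to_server(req);
        /* Drain whatever the synchronous callbacks queued while we were busy. */
        while (queue_head) {
            struct request *next = queue_head;
            queue_head = next->next;
            if (!queue_head)
                queue_tail = NULL;
            send_to_server(next);
        }
        in_progress = false;
    }

    int main(void)
    {
        struct request first = { NULL, 0 };
        issue_request(&first);
        return 0;
    }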
* tests: Handle nanosecond duration in profile info (#2135) | Pranith Kumar Karampuri | 2021-02-08 | 3 | -3/+3
Problem: volume profile info now prints durations in nanoseconds. The tests were written when the duration was printed in microseconds, which leads to spurious failures.
Fix: Change the tests to handle nanosecond durations.
fixes: #2134
Change-Id: I94722be87000a485d98c8b0f6d8b7e1a526b07e7
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* cluster/dht: Allow fix-layout only on directories (#2109) | Pranith Kumar Karampuri | 2021-02-03 | 1 | -0/+33
Problem: The fix-layout operation assumes that the path passed is a directory, i.e. layout->cnt == conf->subvolume_cnt. This leads to a crash when fix-layout is attempted on a file.
Fix: Disallow fix-layout on files.
fixes: #2107
Change-Id: I2116b8773059f67e3260e9207e20eab3de711417
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* features/shard: delay unlink of a file that has fd_count > 0 (#1563) | Vinayak hariharmath | 2021-02-03 | 2 | -0/+105
* features/shard: delay unlink of a file that has fd_count > 0
When there are multiple processes working on a file and any process unlinks that file, the unlink operation shouldn't harm the other processes working on it. This is POSIX-compliant behavior, and it should also be supported when the shard feature is enabled.
Problem description: Consider 2 clients C1 and C2 working on a file F1 with 5 shards on a gluster mount, where the gluster server has 4 bricks B1, B2, B3, B4. Assume that the base file/shard is present on B1, the 1st and 2nd shards on B2, the 3rd and 4th shards on B3, and the 5th shard on B4. C1 has opened F1 in append mode and is writing to it. The write FOP goes to the 5th shard in this case, so inode->fd_count = 1 on B1 (base file) and B4 (5th shard). C2 at the same time issues an unlink of F1. On the server, the base file has fd_count = 1 (since C1 has opened the file), so the base file is renamed under .glusterfs/unlink and the call returns to C2. Then unlink is sent to the shards on all bricks, and the shards on B2 and B3, which have no open reference yet, are deleted. C1 starts getting errors while accessing the remaining shards even though it has open references on the file. This is one such undefined behavior; we will encounter many more, since we don't have one global lock to access all shards as one. Of course, having such a global lock would hurt performance, as it reduces the window for parallel access to the shards.
Solution: The above undefined behavior can be addressed by delaying the unlink of a file while there are open references on it. File unlink happens in 2 steps.
step 1: the client creates a marker file under .shard/remove_me and sends the unlink of the base file to the server.
step 2: on return from the server, the associated shards are cleaned up and finally the marker file is removed.
In step 2, the background deletion process does a nameless lookup using the marker file name (the marker file is named after the gfid of the base file) in the .glusterfs/unlink dir. If the nameless lookup is successful, the gfid still has open fds and deletion of the shards has to be delayed. If the nameless lookup fails, the gfid is unlinked and there are no open fds on that file (the gfid path is unlinked during the final close on the file). The shards whose deletion is delayed are unlinked once all open fds are closed; this is done by a thread that wakes up every 10 minutes.
Also removed active_fd_count from the inode structure; fd_count is now used wherever active_fd_count was used.
fixes: #1358
Change-Id: I8985093386e26215e0b0dce294c534a66f6ca11c
Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
* features/shard: delay unlink of a file that has fd_count > 0
Change-Id: Iec16d7ff5e05f29255491a43fbb6270c72868999
Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
* features/shard: delay unlink of a file that has fd_count > 0
Change-Id: I07e5a5bf9d33c24b63da72d4f3f59392c5421652
Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
* features/shard: delay unlink of a file that has fd_count > 0
Change-Id: I3679de8545f2e5b8027c4d5a6fd0592092e8dfbd
Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
* Update xlators/storage/posix/src/posix-entry-ops.c
Co-authored-by: Xavi Hernandez <xhernandez@users.noreply.github.com>
* Update fd.c
Co-authored-by: Xavi Hernandez <xhernandez@users.noreply.github.com>
* features/shard: unlink fails due to nospace to mknod marker file | Vinayakswami Hariharmath | 2021-01-26 | 1 | -0/+56
When we hit the max capacity of the storage space, shard_unlink() starts failing if there is no space left on the brick to create a marker file.
shard_unlink() happens in the steps below:
1. create a marker file, named after the gfid of the base file, under BRICK_PATH/.shard/.remove_me
2. unlink the base file
3. shard_delete_shards() deletes the shards in the background by picking the entries in BRICK_PATH/.shard/.remove_me
If marker file creation fails, we can't really delete the shards, which is eventually a problem for a user who is trying to make space by deleting unwanted data.
Solution: Create the marker file with xdata = GLUSTERFS_INTERNAL_FOP_KEY, so it is considered an internal op and is allowed to be created in the reserved space.
Fixes: #2038
Change-Id: I7facebab940f9aeee81d489df429e00ef4fb7c5d
Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
* tests: fix tests/bugs/nfs/bug-1053579.t (#2034) | Xavi Hernandez | 2021-01-22 | 1 | -23/+68
* tests: fix tests/bugs/nfs/bug-1053579.t
On NFS the number of groups associated with a user that can be passed to the server is limited. This test created a user with 200 groups and checked that a file owned by the latest created group couldn't be accessed, under the assumption that the last group wouldn't be passed to the server. However, there's no guarantee on how the list of groups is generated, so the latest created group could be passed as one of the initial groups, allowing access to the file and causing the test to fail (because it expected access to be denied).
Given that there's no way to be sure which groups will be passed, this patch changes the test so that a check is done for each group the user belongs to. Then we check that there have been some successes and some failures. Once 'manage-gids' is set, we do the same, but this time the number of failures must be 0.
Fixes: #2033
Change-Id: Ide06da2861fcade2166372d1f3e9eb4ff2dd5f58
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* tests: ./tests/bugs/replicate/bug-921231.t is continuously failing (#2006) | mohit84 | 2021-01-13 | 1 | -1/+1
The test case (./tests/bugs/replicate/bug-921231.t) is continuously failing. It fails because inodelk_max_latency shows a wrong value in the profile. The value is not correct because the profile timestamp was recently changed from microseconds to nanoseconds by patch #1833.
Fixes: #2005
Change-Id: Ieb683836938d986b56f70b2380103efe95657821
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* features/shard: avoid repetitive calls to gf_uuid_unparse() (#1689) | Vinayak hariharmath | 2021-01-11 | 1 | -3/+6
The issue is that shard_make_block_abspath() calls gf_uuid_unparse() every time while constructing a shard path. The gfid can be parsed and saved once, then passed while constructing the path. Thus we can avoid calling gf_uuid_unparse() repeatedly.
Fixes: #1423
Change-Id: Ia26fbd5f09e812bbad9e5715242f14143c013c9c
Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
* stripe cleanup: Remove the option from create and add-brick cmds (#1812) | Sheetal Pamecha | 2021-01-05 | 2 | -75/+0
* stripe cleanup: Remove the option from create and add-brick cmds
This patch aims to remove the code for the stripe option instead of keeping default values for the stripe/stripe-count variables, setting and getting dict options, and similar redundant operations. Also removing tests for stripe volumes that have already been marked bad.
Updates: #1000
Change-Id: Ic2b3cabd671f0c8dc0521384b164c3078f7ca7c6
Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
* Fix regression error
tests/000-flaky/basic_changelog_changelog-snapshot.t was failing due to a 0 return value.
Change-Id: I8ea0443669c63768760526db5aa1f205978e1dbb
Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
* add constant stripe_count value for upgrade scenarios
Change-Id: I49f3da4f106c55f9da20d0b0a299275a19daf4ba
* Fix clang-format warning
Change-Id: I83bae85d10c8c5b3c66f56c9f8de1ec81d0bbc95
* tests: remove offensive language | Ravishankar N | 2020-12-30 | 1 | -2/+2
TODO: Remove 'slave-timeout' and 'slave-gluster-command-dir'. These variables are defined in geo-replication/gsyncd.conf.in, so I will remove them when I change that folder.
Change-Id: Ib9167ca586d83e01f8ec755cdf58b3438184c9dd
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* core: Implement graceful shutdown for a brick process (#1751) | mohit84 | 2020-12-16 | 2 | -0/+2
* core: Implement graceful shutdown for a brick process
glusterd sends a SIGTERM to the brick process at the time of stopping a volume if brick_mux is not enabled. In the brick_mux case, on receiving a terminate signal for the last brick, the brick process sends a SIGTERM to its own process to stop itself. The current approach does not clean up resources when either the last brick is detached or brick_mux is not enabled.
Solution: glusterd sends a terminate notification to the brick process at the time of stopping a volume, for a graceful shutdown.
Change-Id: I49b729e1205e75760f6eff9bf6803ed0dbf876ae
Fixes: #1749
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* core: Implement graceful shutdown for a brick process
Resolve some reviewer comments.
Fixes: #1749
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
Change-Id: I50e6a9e2ec86256b349aef5b127cc5bbf32d2561
* core: Implement graceful shutdown for a brick process
Implement a key cluster.brick-graceful-cleanup to enable graceful shutdown for a brick process. If the key value is on, glusterd sends a detach request to stop the brick.
Fixes: #1749
Change-Id: Iba8fb27ba15cc37ecd3eb48f0ea8f981633465c3
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* core: Implement graceful shutdown for a brick process
Resolve reviewer comments.
Fixes: #1749
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
Change-Id: I2a8eb4cf25cd8fca98d099889e4cae3954c8579e
* core: Implement graceful shutdown for a brick process
Resolve reviewer comment specific to avoiding a memory leak.
Fixes: #1749
Change-Id: Ic2f09efe6190fd3776f712afc2d49b4e63de7d1f
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* core: Implement graceful shutdown for a brick process
Resolve reviewer comment specific to avoiding a memory leak.
Fixes: #1749
Change-Id: I68fbbb39160a4595fb8b1b19836f44b356e89716
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* glusterd/cli: enhance rebalance-status after replace/reset-brick (#1869) | Tamar Shacked | 2020-12-08 | 1 | -0/+46
* glusterd/cli: enhance rebalance-status after replace/reset-brick
Rebalance status is being reset during replace/reset-brick operations. This causes 'volume status' to show rebalance as "not started".
Fix: change rebalance-status to "reset due to (replace|reset)-brick".
Change-Id: I6e3372d67355eb76c5965984a23f073289d4ff23
Signed-off-by: Tamar Shacked <tshacked@redhat.com>
* glusterd/cli: enhance rebalance-status after replace/reset-brick
Rebalance status is being reset during replace/reset-brick operations. This causes 'volume status' to show rebalance as "not started".
Fix: change rebalance-status to "reset due to (replace|reset)-brick".
Fixes: #1717
Signed-off-by: Tamar Shacked <tshacked@redhat.com>
Change-Id: I1e3e373ca3b2007b5b7005b6c757fb43801fde33
* cli: changing rebal task ID to "None" in case status is being reset
Rebalance status is being reset during replace/reset-brick operations. This causes 'volume status' to show rebalance as "not started".
Fix: change rebalance-status to "reset due to (replace|reset)-brick".
Fixes: #1717
Change-Id: Ia73a8bea3dcd8e51acf4faa6434c3cb0d09856d0
Signed-off-by: Tamar Shacked <tshacked@redhat.com>
* glusterd: modify logic for checking hostname in add-brick (#1781) | Sheetal Pamecha | 2020-12-07 | 1 | -4/+11
* glusterd: modify logic for checking hostname in add-brick
Problem: the add-brick command parses only the bricks provided on the cli for a subvolume. If bricks are added to the same subvolume, they are not checked against the volume's existing bricks.
Fixes: #1779
Change-Id: I768bcf7359a008f2d6baccef50e582536473a9dc
Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
* removed assignment of unused variable
Fixes: #1779
Change-Id: Id5ed776b28343e1225b9898e81502ce29fb480fa
Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
* few more changes
Change-Id: I7bacedb984f968939b214f9d13546f4bf92e9df7
Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
* correction in last commit
Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
Change-Id: I1fd0d941cf3f32aa6e8c7850def78e5af0d88782
* DHT/Rebalance - Ensure Rebalance reports status only once upon stopping (#1783) | Barak Sason Rofman | 2020-11-24 | 1 | -0/+75
DHT/Rebalance - Ensure Rebalance reports status only once upon stopping
Upon issuing the rebalance stop command, the status of rebalance is logged twice in the log file, which can sometimes result in inconsistent reports (one report states status stopped, while the other may report something else). This fix ensures rebalance reports its status only once and that the correct status is reported.
fixes: #1782
Change-Id: Id3206edfad33b3db60e9df8e95a519928dc7cb37
Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
* posix: fix io_uring crash in reconfigure (#1804) | Ravishankar N | 2020-11-17 | 1 | -0/+20
Call posix_io_uring_fini only if it was inited to begin with.
Fixes: #1794
Reported-by: Mohit Agrawal <moagrawa@redhat.com>
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
Change-Id: I0e840b6b1d1f26b104b30c8c4b88c14ce4aaac0d
* tests: Fix issues in CentOS 8 (#1756) | Xavi Hernandez | 2020-11-06 | 1 | -0/+4
* tests: Fix issues in CentOS 8
Due to some configuration changes in CentOS 8/RHEL 8, ssl-ciphers.t and bug-1053579.t were failing. The first one was failing because TLS v1.0 is disabled by default. The test has been updated to check that at least one of TLS v1.0, v1.1 or v1.2 succeeds. For the second case, the issue is that the test assumed that the latest added group of a user would always be listed last, but this is not always true because nsswitch.conf now uses 'sss' before 'files', which means that the data comes from a db that may not be sorted.
Updates: #1009
Change-Id: I4ca01a099854ec25926c3d76b3a98072175bab06
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* tests: Fix TLS version detection
The old test didn't correctly determine which version of TLS should be allowed by openssl.
Change-Id: Ic081c329d5ed1842fa9f5fd23742ae007738aec0
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* glusterd/afr: enable granular-entry-heal by default (#1621) | Ravishankar N | 2020-10-22 | 4 | -15/+22
1. The option has been enabled and tested for quite some time now in RHHI-V downstream and I think it is safe to make it 'on' by default. Since it is not possible to simply change it from 'off' to 'on' without breaking rolling upgrades, old clients etc., I have made it the default only for new volumes, starting from op-version GD_OP_VERSION_9_0.
Note: If you do a volume reset, the option will be turned back off. This is okay as the dir's gfid will be captured in the 'xattrop' folder and heals will proceed. There might be stale entries inside the 'entry-changes' folder, which will be removed when we enable the option again.
2. I encountered a customer issue where entry heal was pending on a dir with 236436 files in it and the glustershd.log output was just stuck at "performing entry selfheal", so I have added DEBUG-level logs to give us more info about whether entry heal and data heal are progressing (metadata heal doesn't take much time). That way, we have a quick visual indication that things are not 'stuck' if we briefly enable debug logs, instead of taking statedumps or checking profile info etc.
Fixes: #1483
Change-Id: I4f116f8c92f8cd33f209b758ff14f3c7e1981422
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* test: The test case tests/bugs/bug-1064147.t is failing (#1662) | mohit84 | 2020-10-19 | 1 | -0/+3
The test case tests/bugs/bug-1064147.t is failing when comparing the root permission with the permission changed while one of the bricks was down. The permissions did not match because no layout existed on root at the time of healing the permission, so the correct permission was not healed on the newly started brick.
Fixes: #1661
Change-Id: If63ea47576dd14f4b91681dd390e2f84f8b6ac18
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* io-stats: Configure ios_sample_buf_size based on sample_interval value (#1574) | mohit84 | 2020-10-15 | 1 | -4/+4
The io-stats xlator declares a 64k-object (10M) ios_sample_buf_size per xlator, but when sample_interval is 0 this big buffer is not required, so allocate the default size only when sample_interval is not 0. The change helps reduce the RSS size of brick and shd processes when the number of volumes is huge.
Change-Id: I3e82cca92e40549355edfac32580169f3ce51af8
Fixes: #1542
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* cluster/afr: Heal directory rename without rmdir/mkdir | Pranith Kumar K | 2020-04-13 | 1 | -3/+3
Problem1: When a directory is renamed while a brick is down, entry-heal always did an rm -rf on that directory on the sink at the old location and then did mkdir and re-created the directory hierarchy at the new location. This is inefficient.
Problem2: The rename-dir heal order may lead to a scenario where the directory at the new location is created before it is deleted from the old location, leading to 2 directories with the same gfid in posix.
Fix: As part of heal, if the old location is healed first and is not present on the source brick, always rename it into a hidden directory inside the sink brick, so that when heal is triggered at the new location, shd can rename it from this hidden directory to the new location. If the new-location heal is triggered first and it detects that the directory already exists on the brick, then it should skip healing the directory until it appears in the hidden directory.
Credits: Ravi for the rename-data-loss.t script
Fixes: #1211
Change-Id: I0cba2006f35cd03d314d18211ce0bd530e254843
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* glusterd: Fix Add-brick with increasing replica count failure | Sheetal Pamecha | 2020-09-23 | 1 | -0/+21
Problem: the add-brick operation fails with a "multiple bricks on same server" error when the replica count is increased. This was happening because of extra runs in the loop that compares hostnames: if fewer bricks than the "replica" count were supplied, a brick would get compared to itself, resulting in the above error.
Fixes: #1508
Change-Id: I8668e964340b7bf59728bb838525d2db062197ed
Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
* fuse: fetch arbitrary number of groups from /proc/[pid]/status | Csaba Henk | 2020-07-17 | 1 | -2/+11
Glusterfs so far constrained itself with an arbitrary limit (32) on the number of groups read from /proc/[pid]/status (this was the number of groups shown there prior to Linux commit v3.7-9553-g8d238027b87e (v3.8-rc1~74^2~59); since this commit, all groups are shown).
With this change we'll read groups up to the number Glusterfs supports in general (64k).
Note: the actual number of groups that are made use of in a regular Glusterfs setup shall still be capped at ~93 due to limitations of the RPC transport. To be able to handle more groups than that, brick-side gid resolution (the server.manage-gids option) can be used along with NIS, LDAP or another such networked directory service (see https://github.com/gluster/glusterdocs/blob/5ba15a2/docs/Administrator%20Guide/Handling-of-users-with-many-groups.md#limit-in-the-glusterfs-protocol ).
Also adding some diagnostic messages to frame_fill_groups().
Change-Id: I271f3dc3e6d3c44d6d989c7a2073ea5f16c26ee0
fixes: #1075
Signed-off-by: Csaba Henk <csaba@redhat.com>
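For illustration, a minimal standalone parser for the "Groups:" line of /proc/[pid]/status without a fixed 32-entry cap. This is not the fuse xlator's frame_fill_groups(); the 65536 ceiling and buffer sizes are assumptions mirroring the "64k" limit mentioned above, and a real implementation would handle arbitrarily long lines.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>

    #define MAX_GROUPS 65536   /* assumed general cap ("64k"), not the RPC limit */

    /* Read the supplementary groups of a process from /proc/[pid]/status.
     * Returns the number of groups parsed, or -1 on error. */
    static int read_groups(pid_t pid, gid_t *groups)
    {
        char path[64], line[8192];   /* line buffer sized for typical group counts */
        snprintf(path, sizeof(path), "/proc/%d/status", (int)pid);

        FILE *fp = fopen(path, "r");
        if (!fp)
            return -1;

        int n = 0;
        while (fgets(line, sizeof(line), fp)) {
            if (strncmp(line, "Groups:", 7) != 0)
                continue;
            char *p = line + 7;
            while (n < MAX_GROUPS) {
                char *end;
                long gid = strtol(p, &end, 10);
                if (end == p)
                    break;               /* no more numbers on the line */
                groups[n++] = (gid_t)gid;
                p = end;
            }
            break;
        }
        fclose(fp);
        return n;
    }

    int main(void)
    {
        static gid_t groups[MAX_GROUPS];
        int n = read_groups(getpid(), groups);
        for (int i = 0; i < n; i++)
            printf("gid %d\n", (int)groups[i]);
        return n < 0;
    }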
* tests: provide an option to mark tests as 'flaky' | Amar Tumballi | 2020-08-18 | 8 | -510/+0
* also add some time gap in other tests to see if we get things properly
* create a directory 'tests/000/', which can host any tests which are flaky
* move all the tests mentioned in the issue to the above directory
* as the above dir gets tested first, all flaky tests would be reported quickly
* change `run-tests.sh` to continue tests even if flaky tests fail
Reference: gluster/project-infrastructure#72
Updates: #1000
Change-Id: Ifdafa38d083ebd80f7ae3cbbc9aa3b68b6d21d0e
Signed-off-by: Amar Tumballi <amar@kadalu.io>
* features/shard: optimization over shard lookup in case of prealloc | Vinayakswami Hariharmath | 2020-08-06 | 1 | -0/+45
Assume that we are preallocating a VM of size 1TB with a shard block size of 64MB; then there will be ~16k shards. This creation happens in 2 steps in the shard_fallocate() path, i.e.
1. lookup for the shards, if any are already present, and
2. mknod for those shards that do not exist.
But in the case of fresh creation, we don't have to look up shards that are not present, as the file size will be 0. Through this, we can save a lookup on every shard that is not present. This optimization is quite useful when preallocating a big VM.
Also, if the file is already present and the call is to extend it to a bigger size, then we need not look up the non-existent shards. Just look up the preexisting shards, populate the inodes and issue mknod for the extended size.
Fixes: #1425
Change-Id: I60036fe8302c696e0ca80ff11ab0ef5bcdbd7880
Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
* glusterd: getspec() returns wrong response when volfile not found | Tamar Shacked | 2020-07-21 | 1 | -0/+3
In a cluster environment, getspec() detects that the volfile is not found, but further on this return code is overwritten by another call, so the error is lost and not handled. As a result the server responds with an ambiguous message, {op_ret = -1, op_errno = 0..}, which causes the client to get stuck.
Fix: server side: don't override the failure error.
fixes: #1375
Change-Id: Id394954d4d0746570c1ee7d98969649c305c6b0d
Signed-off-by: Tamar Shacked <tshacked@redhat.com>
* dht - fixing xattr inconsistency | Barak Sason Rofman | 2020-07-07 | 1 | -0/+54
The scenario of setting an xattr on a dir, killing one of the bricks, removing the xattr, then bringing back the brick results in xattr inconsistency: the downed brick will still have the xattr, but the rest won't. This patch adds a mechanism that removes the extra xattrs during lookup.
This patch is a modification of a previous patch, based on comments that were made after the merge: https://review.gluster.org/#/c/glusterfs/+/24613/
fixes: #1324
Change-Id: Ifec0b7aea6cd40daa8b0319b881191cf83e031d1
Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
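A generic illustration of the "remove the extra xattrs" idea (not the DHT heal code): list the xattrs actually present on a path and remove any user xattr that is not in an expected set. The function name, the "user." filter and the expected set are hypothetical.

    #include <stdio.h>
    #include <string.h>
    #include <sys/xattr.h>

    /* Remove any "user." xattr on 'path' that is not in the expected list. */
    static void reconcile_xattrs(const char *path,
                                 const char **expected, int n_expected)
    {
        char names[4096];
        ssize_t len = listxattr(path, names, sizeof(names));
        if (len < 0)
            return;

        /* 'names' holds concatenated NUL-terminated xattr names. */
        for (ssize_t off = 0; off < len; off += (ssize_t)strlen(names + off) + 1) {
            const char *name = names + off;
            if (strncmp(name, "user.", 5) != 0)
                continue;                       /* only touch user xattrs */
            int keep = 0;
            for (int i = 0; i < n_expected; i++)
                if (strcmp(name, expected[i]) == 0)
                    keep = 1;
            if (!keep && removexattr(path, name) == 0)
                printf("removed stale xattr %s from %s\n", name, path);
        }
    }

    int main(int argc, char **argv)
    {
        const char *expected[] = { "user.foo" };   /* hypothetical "good" set */
        if (argc > 1)
            reconcile_xattrs(argv[1], expected, 1);
        return 0;
    }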
* dht: Heal missing dir entry on brick in revalidate path | Susant Palai | 2020-06-23 | 1 | -0/+33
Mark the dir as missing in the layout structure, to be healed in dht_selfheal_directory.
fixes: #1327
Change-Id: If2c69294bd8107c26624cfe220f008bc3b952a4e
Signed-off-by: Susant Palai <spalai@redhat.com>
* tests: added volume operations to increase code coverage | nik-redhat | 2020-05-26 | 1 | -14/+0
Added tests for volume options like localtime-logging, fixed enable-shared-storage to include function coverage, and added a few negative tests for other volume options to increase the code coverage of the glusterd component.
Change-Id: Ib1706c1fd5bc98a64dcb5c8b15a121d639a597d7
Updates: #1052
Signed-off-by: nik-redhat <nladha@redhat.com>
* Revert "dht - fixing xattr inconsistency"Barak Sason Rofman2020-06-251-54/+0
| | | | | | | | | | | | This reverts commit 620158475f462251c996901a8e24306ef6cb4c42. The patch to revert is https://review.gluster.org/#/c/glusterfs/+/24613/ Reverting is required as comments were posted regarding a more efficient implementation were made after the patch was merged. A new patch will be posted to adress the comments will be posted. updates: #1324 Change-Id: I59205baefe1cada033c736d41ce9c51b21727d3f Signed-off-by: Barak Sason Rofman <redhat@gmail.com>
* dht - fixing xattr inconsistency | Barak Sason Rofman | 2020-06-21 | 1 | -0/+54
The scenario of setting an xattr on a dir, killing one of the bricks, removing the xattr, then bringing back the brick results in xattr inconsistency: the downed brick will still have the xattr, but the rest won't. This patch adds a mechanism that removes the extra xattrs during lookup.
fixes: #1324
Change-Id: Ibcc449bad6c7cb46bcae380e42e4496d733b453d
Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
* glusterd: add-brick command failure | Sanju Rakonde | 2020-06-16 | 1 | -0/+40
Problem: the add-brick operation fails when the replica or disperse count is not mentioned in the add-brick command.
Reason: with commit a113d93 we check the brick order while doing an add-brick operation for replica and disperse volumes. If the replica count or disperse count is not mentioned in the command, the dict get fails, resulting in add-brick operation failure.
fixes: #1306
Change-Id: Ie957540e303bfb5f2d69015661a60d7e72557353
Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
* tests/glusterd: spurious failure of tests/bugs/glusterd/mgmt-handshake-and-volume-sync-post-glusterd-restart.t | Sanju Rakonde | 2020-05-29 | 1 | -3/+15
Test Summary Report
-------------------
tests/bugs/glusterd/mgmt-handshake-and-volume-sync-post-glusterd-restart.t (Wstat: 0 Tests: 23 Failed: 3)
Failed tests: 21-23
After a glusterd restart, volume start is failing. It looks like it needs some time to sync the data, so a sleep is added for that.
Note: All other changes are made to avoid spurious failures in the future.
fixes: #1272
Change-Id: Ib184757fb936e03b5b6208465e44a8e790b71c1c
Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
* afr: more quorum checks in lookup and new entry marking | Ravishankar N | 2020-05-27 | 1 | -2/+0
Problem: See the github issue for details.
Fix:
- In lookup, if the entry exists in 2 out of 3 bricks, don't fail the lookup with ENOENT just because there is an entrylk on the parent. Consider quorum before deciding.
- If an entry FOP does not succeed on a quorum number of bricks, do not perform new-entry marking.
Fixes: #1303
Change-Id: I56df8c89ad53b29fa450c7930a7b7ccec9f4a6c5
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* Indicate timezone offsets in timestamps | Csaba Henk | 2020-03-12 | 1 | -4/+4
Logs and other output carrying timestamps will now have timezone offsets indicated, e.g.:
[2020-03-12 07:01:05.584482 +0000] I [MSGID: 106143] [glusterd-pmap.c:388:pmap_registry_remove] 0-pmap: removing brick (null) on port 49153
To this end:
- gf_time_fmt() now inserts the timezone offset via the %z strftime(3) template.
- A new utility function has been added, gf_time_fmt_tv(), that takes a struct timeval pointer (*tv) instead of a time_t value to specify the time. If tv->tv_usec is negative, gf_time_fmt_tv(... tv ...) is equivalent to gf_time_fmt(... tv->tv_sec ...). Otherwise it also inserts tv->tv_usec into the formatted string.
- Building timestamps of usec precision has been converted to gf_time_fmt_tv, which is necessary because the method of appending a period and the usec value to the end of the timestamp does not work if the timestamp has a zone offset; it is also beneficial in terms of eliminating repetition.
- The buffer passed to gf_time_fmt/gf_time_fmt_tv has been unified to be of GF_TIMESTR_SIZE size (256). We need slightly larger buffer space to accommodate the zone offset, and it's preferable to use a buffer which is undisputedly large enough.
This change does *not* do the following:
- Retain a method of timestamp creation without timezone offset. To my understanding we don't need such backward compatibility, as the code just emits timestamps to logs and other diagnostic texts and doesn't do any later processing on them that would rely on their format. An exception to this, i.e. a case where a timestamp is built for internal use, is graph.c:fill_uuid(). As far as I can see, what matters in that case is the uniqueness of the produced string, not the format.
- Implement a single-token (space-free) timestamp format. While some timestamp formats used to be single-token, now all of them will include a space preceding the offset indicator. Again, I did not see a use case where this could be significant in terms of representation.
- Move the codebase to a single unified timestamp format and drop the fmt argument of gf_time_fmt/gf_time_fmt_tv. While the gf_timefmt_FT format is almost ubiquitous, there are a few cases where different formats are used. I'm not convinced there is any reason not to use gf_timefmt_FT in those cases too, but I did not want to make a decision in this regard.
Change-Id: I0af73ab5d490cca7ed8d07a2ce7ac22a6df2920a
Updates: #837
Signed-off-by: Csaba Henk <csaba@redhat.com>
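A standalone sketch of the formatting approach described above, using standard strftime(3) %z and a struct timeval; this is not gf_time_fmt_tv() itself, and fmt_timestamp() is a made-up name. The point is that the microseconds are inserted before the zone offset rather than appended to the end of the finished string.

    #include <stdio.h>
    #include <time.h>
    #include <sys/time.h>

    /* Format "YYYY-MM-DD HH:MM:SS.uuuuuu +ZZZZ" into buf. Illustrative only. */
    static void fmt_timestamp(char *buf, size_t size, const struct timeval *tv)
    {
        struct tm tm;
        localtime_r(&tv->tv_sec, &tm);

        char date[64], zone[16];
        strftime(date, sizeof(date), "%Y-%m-%d %H:%M:%S", &tm);
        strftime(zone, sizeof(zone), "%z", &tm);   /* e.g. "+0000" */

        /* usec goes between the seconds and the zone offset. */
        snprintf(buf, size, "%s.%06ld %s", date, (long)tv->tv_usec, zone);
    }

    int main(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);

        char buf[256];   /* mirrors the "undisputedly large enough" buffer idea */
        fmt_timestamp(buf, sizeof(buf), &tv);
        printf("[%s]\n", buf);
        return 0;
    }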
* features/shard: Use fd lookup post file open | Vinayakswami Hariharmath | 2020-06-03 | 1 | -0/+34
Issue: When a process has an open fd and the same file is unlinked in the middle of the operations, path-based lookup fails with ENOENT or a stale file error.
Solution: When the file is already open and an fd is available, use fstat to get the file attributes.
Change-Id: I0e83aee9f11b616dcfe13769ebfcda6742e4e0f4
Fixes: #1281
Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
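The underlying POSIX behavior this fix relies on, shown standalone (the scratch file name is arbitrary): a path-based stat() fails with ENOENT once the file is unlinked, while fstat() on an fd opened earlier keeps working.

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/stat.h>

    int main(void)
    {
        const char *path = "demo-file";             /* scratch file for the demo */
        int fd = open(path, O_CREAT | O_RDWR, 0600);
        if (fd < 0)
            return 1;

        unlink(path);                               /* simulate a concurrent unlink */

        struct stat st;
        if (stat(path, &st) < 0)                    /* path-based lookup now fails */
            printf("stat(%s): %s\n", path, strerror(errno));

        if (fstat(fd, &st) == 0)                    /* fd-based lookup still works */
            printf("fstat: size=%lld nlink=%lld\n",
                   (long long)st.st_size, (long long)st.st_nlink);

        close(fd);
        return 0;
    }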
* test: Test case brick-mux-validation-in-cluster.t is failing on RHEL-8 | Mohit Agrawal | 2020-06-09 | 1 | -3/+1
Brick processes are not properly attached on a cluster node when some volume options are changed on a peer node while glusterd is down on that specific node.
Solution: At the time of a glusterd restart, it gets a friend update request from a peer node if the peer node has changes for a volume. If the brick process is started before the friend update request is received, the brick_mux behavior does not work properly: all bricks are attached to the same process even though the volume options are not the same. To avoid the issue, introduce an atomic flag volpeerupdate and update its value when glusterd receives a friend update request from a peer for a specific volume. If the volpeerupdate flag is 1, the volume is started by the glusterd_import_friend_volume synctask.
Change-Id: I4c026f1e7807ded249153670e6967a2be8d22cb7
Credit: Sanju Rakaonde <srakonde@redhat.com>
fixes: #1290
Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
* cluster/afr: Prioritize ENOSPC over other errors | karthik-us | 2020-05-21 | 1 | -0/+80
Problem: In a replicate/arbiter volume, if file creations or writes fail on a quorum number of bricks, and on one brick the failure is due to ENOSPC while on another brick it fails for a different reason, the operation may fail with an error other than ENOSPC in some cases.
Fix: Prioritize ENOSPC over other lower-priority errors, and do not set op_errno in posix_gfid_set if op_ret is 0, to avoid returning an errno which could be misinterpreted by __afr_dir_write_finalize(). Also remove the function afr_has_arbiter_fop_cbk_quorum(), which might consider a successful reply from a single brick as quorum success in some cases, whereas we always need the fop to be successful on a quorum number of bricks in an arbiter configuration.
Change-Id: I106e267f8b9451f681022f1cccb410d9bc824c08
Fixes: #1254
Signed-off-by: karthik-us <ksubrahm@redhat.com>
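A minimal illustration of the "prioritize ENOSPC" idea when combining per-brick results (the priority ordering and helper names are assumptions, not the AFR code): when replies disagree, the higher-priority errno wins, so a disk-full condition is never masked by a less actionable error from another brick.

    #include <stdio.h>
    #include <errno.h>

    /* Illustrative priority: disk-full and quota errors must not be masked. */
    static int errno_priority(int err)
    {
        switch (err) {
        case ENOSPC:
        case EDQUOT:
            return 2;    /* highest: tells the user exactly what to fix */
        case 0:
            return 0;    /* success carries no error to report */
        default:
            return 1;
        }
    }

    /* Combine the errno accumulated so far with one more brick's reply. */
    static int combine_errno(int current, int brick_errno)
    {
        return errno_priority(brick_errno) > errno_priority(current)
                   ? brick_errno
                   : current;
    }

    int main(void)
    {
        int combined = 0;
        combined = combine_errno(combined, EIO);     /* one brick failed with EIO */
        combined = combine_errno(combined, ENOSPC);  /* another ran out of space */
        printf("reported errno: %d (ENOSPC=%d)\n", combined, ENOSPC);
        return 0;
    }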
* open-behind: rewrite of internal logic | Xavi Hernandez | 2020-05-12 | 1 | -0/+1
There was a critical flaw in the previous implementation of open-behind. When an open is done in the background, it's necessary to take a reference on the fd_t object because once we "fake" the open answer, the fd could be destroyed. However, as long as there's a reference, the release function won't be called. So, if the application closes the file descriptor without having actually opened it, there will always remain at least 1 reference, causing a leak.
To avoid this problem, the previous implementation didn't take a reference on the fd_t, so there were races where the fd could be destroyed while it was still in use.
To fix this, I've implemented a new xlator cbk that gets called from fuse when the application closes a file descriptor. The whole logic of handling background opens has been simplified and it's more efficient now. A stub is created only if the fop needs to be delayed until an open completes; otherwise no memory allocations are needed.
Correctly handling the close request while the open is still pending has added a bit of complexity, but overall normal operation is simpler.
Change-Id: I6376a5491368e0e1c283cc452849032636261592
Fixes: #1225
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* io-cache,quick-read: deprecate volume options with flawed semantics or naming | Csaba Henk | 2020-05-14 | 3 | -6/+11
- performance.cache-size has flawed semantics, as it's dispatched to two independent translators, io-cache and quick-read.
- performance.qr-cache-timeout has a confusing name, as other options affecting quick-read have an unabbreviated "quick-read-..." prefix in their names.
We keep these options with unchanged operation, but in the help output we indicate their deprecation. The following better alternatives are introduced:
- performance.io-cache-size to tune the cache-size option of io-cache
- performance.quick-read-cache-size to tune the cache-size option of quick-read
- performance.quick-read-cache-timeout as a preferred synonym for performance.qr-cache-timeout
Fixes: #952
Change-Id: Ibd04fb638de8cac450ba992ad8a415154f9f4281
Signed-off-by: Csaba Henk <csaba@redhat.com>
* features/shard: Aggregate file size, block-count before unwinding removexattr | Krutika Dhananjay | 2020-05-22 | 1 | -0/+12
The posix translator returns pre and post bufs in the dict in {F}REMOVEXATTR fops. These iatts are further cached at layers like md-cache. The shard translator, in its current state, simply returns these values without updating the aggregated file size and block-count. This patch fixes this problem.
Change-Id: I4b2dd41ede472c5829af80a67401ec5a6376d872
Fixes: #1243
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
* features/shard: Aggregate size, block-count in iatt before unwinding setxattr | Krutika Dhananjay | 2020-05-15 | 1 | -0/+31
The posix translator returns pre and post bufs in the dict in {F}SETXATTR fops. These iatts are further cached at layers like md-cache. The shard translator, in its current state, simply returns these values without updating the aggregated file size and block-count. This patch fixes this problem.
Change-Id: I4da0eceb4235b91546df79270bcc0af8cd64e9ea
Fixes: #1243
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
* tests: Fix bug-1101647.t test case failure | karthik-us | 2020-04-09 | 1 | -0/+2
Problem: The tests/bugs/replicate/bug-1101647.t test case fails sporadically in the volume heal because the connection of the bricks with shd was not being checked before running the index heal.
Build link: https://build.gluster.org/job/regression-test-burn-in/5007/
Fix: Check the connection status of the bricks with shd before performing the index heal.
Change-Id: Ie7060f379b63bef39fd4f9804f6e22e0a25680c1
Updates: #1154
Signed-off-by: karthik-us <ksubrahm@redhat.com>