path: root/libglusterfs/src
Commit message | Author | Age | Files | Lines
* logging: fix a relock deadlock (#2332) | chenglin130 | 2021-04-15 | 1 | -1/+10
In gf_log_inject_timer_event(), the lock log.log_buf_lock is taken. Then, while the lock is held, any call to gf_msg() hangs the thread, because _gf_msg_internal() tries to lock log.log_buf_lock again. Use a PTHREAD_MUTEX_RECURSIVE mutex instead of the default type to fix this deadlock. Fixes: #2330 Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
* Remove some strlen() calls if using DICT_LIST_IMP (#2311) | Rinku Kothiya | 2021-04-06 | 1 | -8/+22
The code was optimized by avoiding some strlen() calls when DICT_LIST_IMP is used. Fixes: #2294 Change-Id: Ic5e784edb9538feb1d1b441c8514c76ba5266832 Signed-off-by: Rinku Kothiya <rkothiya@redhat.com>
* list.h: remove offensive language, introduce _list_del() (#2132) | Yaniv Kaul | 2021-04-02 | 1 | -8/+21
1. Replace the offensive variables with the values the Linux kernel uses. 2. Introduce an internal function, _list_del(), that can be used when list->next and list->prev are going to be assigned later on. (Too bad the code does not make more use of list_move() and list_move_tail(); that would also have helped readability.) * list.h: define LIST_POISON1 and LIST_POISON2 similar to the Linux kernel defines. Fixes: #2025 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
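A minimal sketch of the two functions this commit describes, using poison values in the style of the Linux kernel (the addresses below are illustrative; glusterfs's actual definitions may differ): `_list_del()` unlinks the entry without touching its own pointers, while `list_del()` additionally poisons them so a stale reuse faults predictably.

```c
struct list_head {
    struct list_head *next, *prev;
};

/* Poison values in the style of the Linux kernel (illustrative addresses). */
#define LIST_POISON1 ((struct list_head *)0x100)
#define LIST_POISON2 ((struct list_head *)0x122)

/* Unlink the entry; the caller will reassign entry->next/prev itself. */
static void _list_del(struct list_head *entry) {
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
}

/* Full delete: unlink, then poison the dangling pointers. */
static void list_del(struct list_head *entry) {
    _list_del(entry);
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
}
```

`_list_del()` is the useful primitive when the entry is about to be spliced elsewhere (as list_move() would do), since writing poison values only to overwrite them immediately is wasted work.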
* cluster/dht: use readdir for fix-layout in rebalance (#2243) | Pranith Kumar Karampuri | 2021-03-22 | 3 | -2/+29
Problem: On a cluster with 15 million files, when fix-layout was started, it was not progressing at all. So we tried a direct os.walk() + os.stat() on the backend filesystem. It took 2.5 days. We removed os.stat() and re-ran it on another brick with a similar data set. It took 15 minutes. We realized that readdirp is extremely costly compared to readdir when the stat is not useful. The fix-layout operation only needs to know that an entry is a directory so that fix-layout can be triggered on it. Most modern filesystems provide this information in the readdir operation; we don't need readdirp, i.e. readdir+stat. Fix: Use the readdir operation in fix-layout. Do readdir+stat/lookup for filesystems that don't provide d_type in the readdir operation. fixes: #2241 Change-Id: I5fe2ecea25a399ad58e31a2e322caf69fc7f49eb Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* dict: avoid hash calculation when hash_size=1 (link list imp) (#2171) | Tamar Shacked | 2021-03-01 | 1 | -42/+75
* dict: avoid hash calculation when hash_size=1 (link list imp) Currently dict_t is always constructed with dict::hash_size = 1. With this initialization the dict is implemented as a linked list and searching for a key is done by iteration using key comparison. Therefore we can avoid the hash calculation done on each set()/get() in this case. Fixes: #2013 Change-Id: Id93286a8036064d43142bc2b2f8d5a3be4e97fc4 Signed-off-by: Tamar Shacked <tshacked@redhat.com> * dict: avoid hash calculation when hash_size=1 (list imp) Same change as above, plus: use a new macro to delimit and avoid the blocks related to the hash implementation. Fixes: #2013 Change-Id: I31180b434a6e9e7bbb456c7ad888c147c4ce3308 Signed-off-by: Tamar Shacked <tshacked@redhat.com>
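A toy model of the lookup path this commit optimizes (the real data_pair_t has more fields; names here are illustrative): with hash_size == 1 every member lands in one bucket, so lookup is a straight list walk by key comparison and computing a hash first buys nothing.

```c
#include <string.h>

/* Simplified stand-in for glusterfs's data_pair_t chain. */
typedef struct pair {
    const char *key;
    const char *value;
    struct pair *next;
} pair_t;

/* hash_size == 1: lookup iterates the single chain, comparing keys.
 * No hash is computed, which is the saving the commit describes. */
static pair_t *dict_lookup_list(pair_t *head, const char *key) {
    for (pair_t *p = head; p != NULL; p = p->next)
        if (strcmp(p->key, key) == 0)
            return p;
    return NULL;
}
```

The trade-off is the usual one: O(n) lookup instead of O(1), which is fine because glusterfs dicts are typically small enough that hashing overhead dominated.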
* fuse: add an option to specify the mount display name (#1989) | Amar Tumballi | 2021-02-22 | 1 | -0/+1
* fuse: add an option to specify the mount display name. This PR fixes two things. 1. When a mount is specified with the volfile (-f) option, today you can't tell it is from glusterfs, as only the volfile is added as 'fsname'; so we add it as 'glusterfs:/<volname>'. 2. Provide an option for admins who want to show a mount source other than the default (useful when one is not using 'mount.glusterfs' but their own scripts). Updates: #1000 Change-Id: I19e78f309a33807dc5f1d1608a300d93c9996a2f Signed-off-by: Amar Tumballi <amar@kadalu.io>
* iobuf: use lists instead of iobufs in iobuf_arena struct (#2097) | Yaniv Kaul | 2021-02-16 | 2 | -14/+12
We only need the passive and active lists; there's no need for a full iobuf variable. Also ensured passive_list is placed before active_list, as it's always accessed first. Note: this almost brings us down to 2 cache lines for that structure. We can easily make other variables smaller (page_size could be 4 bytes) and fit exactly 2 cache lines. Fixes: #2096 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* posix: fix chmod error on symlinks (#2155) | Xavi Hernandez | 2021-02-11 | 3 | -0/+10
Since glibc 2.32, lchmod() returns EOPNOTSUPP instead of ENOSYS when called on symlinks. The man page says the returned code is ENOTSUP. EOPNOTSUPP and ENOTSUP are the same on Linux, but this patch handles all of these errors correctly. Fixes: #2154 Change-Id: Ib3bb3d86d421cba3d7ec8d66b6beb131ef6e0925 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
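The portable way to treat these as one condition is a predicate over errno (a sketch of the idea, not the posix xlator's actual helper): accept ENOSYS (glibc < 2.32), EOPNOTSUPP (glibc >= 2.32), and ENOTSUP (the documented value, distinct from EOPNOTSUPP on some non-Linux platforms).

```c
#include <errno.h>

/* True if an lchmod() failure means "not supported on symlinks",
 * across the ENOSYS / EOPNOTSUPP / ENOTSUP variants. */
static int lchmod_unsupported(int err) {
    return err == ENOSYS || err == EOPNOTSUPP || err == ENOTSUP;
}
```

On Linux ENOTSUP and EOPNOTSUPP compare equal, so the third test is redundant there, but keeping it makes the check correct on platforms where they differ.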
* Glustereventsd Default port change (#2091) | schaffung | 2021-02-10 | 1 | -1/+1
Issue: The default port of glustereventsd is currently 24009, which prevents glustereventsd from binding to the UDP port due to SELinux policies. Fix: Change the default port to something in the ephemeral range. Fixes: #2080 Change-Id: Ibdc87f83f82f69660dca95d6d14b226e10d8bd33 Signed-off-by: srijan-sivakumar <ssivakum@redhat.com>
* stack.h/c: remove unused variable and reorder struct | Yaniv Kaul | 2021-02-08 | 2 | -15/+6
- Removed the unused ref_count variable.
- Reordered the struct to bring related variables closer together.
- Changed 'complete' from a '_Bool' to an 'int32_t'.

Before:
```
struct _call_frame {
    call_stack_t *    root;          /*   0   8 */
    call_frame_t *    parent;        /*   8   8 */
    struct list_head  frames;        /*  16  16 */
    void *            local;         /*  32   8 */
    xlator_t *        this;          /*  40   8 */
    ret_fn_t          ret;           /*  48   8 */
    int32_t           ref_count;     /*  56   4 */

    /* XXX 4 bytes hole, try to pack */

    /* --- cacheline 1 boundary (64 bytes) --- */
    gf_lock_t         lock;          /*  64  40 */
    void *            cookie;        /* 104   8 */
    _Bool             complete;      /* 112   1 */

    /* XXX 3 bytes hole, try to pack */

    glusterfs_fop_t   op;            /* 116   4 */
    struct timespec   begin;         /* 120  16 */
    /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
    struct timespec   end;           /* 136  16 */
    const char *      wind_from;     /* 152   8 */
    const char *      wind_to;       /* 160   8 */
    const char *      unwind_from;   /* 168   8 */
    const char *      unwind_to;     /* 176   8 */

    /* size: 184, cachelines: 3, members: 17 */
    /* sum members: 177, holes: 2, sum holes: 7 */
    /* last cacheline: 56 bytes */
};
```

After:
```
struct _call_frame {
    call_stack_t *    root;          /*   0   8 */
    call_frame_t *    parent;        /*   8   8 */
    struct list_head  frames;        /*  16  16 */
    struct timespec   begin;         /*  32  16 */
    struct timespec   end;           /*  48  16 */
    /* --- cacheline 1 boundary (64 bytes) --- */
    void *            local;         /*  64   8 */
    gf_lock_t         lock;          /*  72  40 */
    void *            cookie;        /* 112   8 */
    xlator_t *        this;          /* 120   8 */
    /* --- cacheline 2 boundary (128 bytes) --- */
    ret_fn_t          ret;           /* 128   8 */
    glusterfs_fop_t   op;            /* 136   4 */
    int32_t           complete;      /* 140   4 */
    const char *      wind_from;     /* 144   8 */
    const char *      wind_to;       /* 152   8 */
    const char *      unwind_from;   /* 160   8 */
    const char *      unwind_to;     /* 168   8 */

    /* size: 176, cachelines: 3, members: 16 */
    /* last cacheline: 48 bytes */
};
```
Fixes: #2130 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* introduce microsleep to improve sleep precision (#2104) | renlei4 | 2021-02-06 | 3 | -0/+21
* syncop: introduce microsecond sleep support. Introduce the microsecond sleep function synctask_usleep, which can be used to improve precision over synctask_sleep. Change-Id: Ie7a15dda4afc09828bfbee13cb8683713d7902de * glusterd: use synctask_usleep in glusterd_proc_stop(). glusterd_proc_stop() slept 1s waiting for the process to stop before force-killing it, but in most cases the process stops within 100ms. This patch uses synctask_usleep to check the process's running state every 100ms instead of sleep(1), which can reduce the stop time by up to 1s. In some cases, like enabling quota on 100 volumes, the average execution time dropped from 2500ms to 500ms. fixes: #2116 Change-Id: I645e083076c205aa23b219abd0de652f7d95dca7
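The polling pattern in glusterd_proc_stop() can be sketched with plain POSIX usleep() standing in for the glusterfs-internal synctask_usleep (function names below are illustrative): re-check the condition every 100 ms up to a cap, so a process that stops quickly is detected in ~100 ms instead of a full second.

```c
#include <stdbool.h>
#include <unistd.h>

/* Poll `done` every 100 ms until it returns true or `max_ms` elapses.
 * Replaces a fixed sleep(1) with fine-grained checks. */
static bool wait_until(bool (*done)(void), int max_ms) {
    for (int waited = 0; waited < max_ms; waited += 100) {
        if (done())
            return true;
        usleep(100 * 1000);             /* 100 ms */
    }
    return done();                      /* one final check at the deadline */
}

/* Trivial predicate used for demonstration. */
static bool immediately_done(void) {
    return true;
}
```

The worst case is unchanged (the caller still waits up to `max_ms` before force-killing), but the common case returns after the first successful check.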
* features/shard: delay unlink of a file that has fd_count > 0 (#1563) | Vinayak hariharmath | 2021-02-03 | 1 | -0/+1
When there are multiple processes working on a file and any process unlinks that file, the unlink operation shouldn't harm the other processes working on it. This is POSIX-compliant behavior, and it should also be supported when the shard feature is enabled.

Problem description: Consider 2 clients, C1 and C2, working on a file F1 with 5 shards on a gluster mount, where the gluster server has 4 bricks B1, B2, B3, B4. Assume that the base file/shard is on B1, the 1st and 2nd shards on B2, the 3rd and 4th shards on B3, and the 5th shard on B4. C1 has opened F1 in append mode and is writing to it. The write FOP goes to the 5th shard in this case, so inode->fd_count = 1 on B1 (base file) and B4 (5th shard). C2 at the same time issues an unlink of F1. On the server, the base file has fd_count = 1 (since C1 has opened the file), so the base file is renamed under .glusterfs/unlink and the call returns to C2. Then the unlink is sent to the shards on all bricks, and the shards on B2 and B3, which have no open references yet, are deleted. C1 starts getting errors while accessing the remaining shards even though it has open references for the file. This is one such undefined behavior; we will encounter many more, as we don't have one global lock to access all shards as one. Of course, having such a global lock would hurt performance, as it reduces the window for parallel access to the shards.

Solution: The undefined behavior above can be addressed by delaying the unlink of a file while there are open references on it. File unlink happens in 2 steps. Step 1: the client creates a marker file under .shard/remove_me and sends the unlink of the base file to the server. Step 2: on return from the server, the associated shards are cleaned up and finally the marker file is removed. In step 2, the background deletion process does a nameless lookup using the marker file name (the marker file is named after the gfid of the base file) in the .glusterfs/unlink dir. If the nameless lookup succeeds, the gfid still has open fds and the deletion of shards has to be delayed. If the nameless lookup fails, the gfid is unlinked and there are no open fds on that file (the gfid path is unlinked during the final close of the file). The shards whose deletion was delayed are unlinked once all open fds are closed; this is done by a thread that wakes up every 10 minutes. Also removed active_fd_count from the inode structure, referring to fd_count wherever active_fd_count was used.

fixes: #1358 Change-Id: I8985093386e26215e0b0dce294c534a66f6ca11c Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com> Co-authored-by: Xavi Hernandez <xhernandez@users.noreply.github.com>
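The step-2 decision above boils down to one existence check. A sketch of that decision (the path layout and function name here are illustrative, not the exact on-brick layout or shard-xlator code): if the gfid entry still exists under .glusterfs/unlink, some client holds an open fd and shard deletion must be delayed; if the lookup fails, deletion is safe.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Returns 1 when the nameless lookup of the gfid under .glusterfs/unlink
 * fails (no open fds remain), i.e. the delayed shards may be deleted. */
static int shards_safe_to_delete(const char *brick_path, const char *gfid) {
    char path[4096];
    struct stat st;
    snprintf(path, sizeof(path), "%s/.glusterfs/unlink/%s", brick_path, gfid);
    return stat(path, &st) != 0;    /* lookup failed => safe to delete */
}
```

The periodic cleanup thread would call a check like this for each delayed gfid on every wakeup, deleting only the entries whose lookup now fails.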
* Revert "skip the lock when refcount is not zero" (#2053) | Vinayak hariharmath | 2021-01-26 | 1 | -5/+8
This reverts commit 50e953e2450b5183988c12e87bdfbc997e0ad8a8. Fixes: #2052 Change-Id: Ic0670a63423b5d79c3d48001e18910b1dbf7e98d
* AFR - fixing coverity issue (Argument cannot be negative) (#2026) | Barak Sason Rofman | 2021-01-22 | 1 | -1/+1
CID 1430124: A negative value is being passed to a parameter that cannot be negative. Modified the value which is being passed. Change-Id: I06dca105f7a78ae16145b0876910851fb631e366 Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
* locks: remove unused conditional switch to spin_lock code (#2007) | Vinayak hariharmath | 2021-01-19 | 4 | -84/+1
The use of spin_locks depends on the variable use_spinlocks, but that code was commented out in the current code base by https://review.gluster.org/#/c/glusterfs/+/14763/. So there is no use in conditionally switching between spin_lock and mutex; removing the dead code as part of this patch. Fixes: #1996 Change-Id: Ib005dd86969ce33d3409164ef3e1011bb3169129 Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
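After the cleanup, the lock abstraction reduces to a plain mutex wrapper. A minimal sketch of that shape (names mirror the idea of glusterfs's gf_lock_t/LOCK macros, but these are illustrative definitions, not the exact headers):

```c
#include <pthread.h>

/* With the spin_lock path gone, the lock type is simply a mutex and the
 * wrappers forward to pthreads directly - no runtime switch remains. */
typedef pthread_mutex_t gf_lock_t;

#define LOCK_INIT(x)    pthread_mutex_init((x), NULL)
#define LOCK(x)         pthread_mutex_lock(x)
#define UNLOCK(x)       pthread_mutex_unlock(x)
#define LOCK_DESTROY(x) pthread_mutex_destroy(x)
```

Dropping the conditional also removes a branch (or an init-time function-pointer indirection) from every lock/unlock, which is why dead-code removal here is more than cosmetic.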
* dict: dict_reset() delete all elements using iteration | Tamar Shacked | 2021-01-18 | 1 | -19/+35
Enhance the dict_reset() implementation by deleting all elements using iteration. Fixes: #1536 Change-Id: Ib4d4f80bd30d52c891eb0fd4b563db19134e2328 Signed-off-by: Tamar Shacked <tshacked@redhat.com>
* Removing unused memory allocation | Rinku Kothiya | 2021-01-18 | 1 | -15/+1
Removed an extra unused type and leftovers from the RDMA code. Fixes: #904 Change-Id: Id5d28622120578b7076d112e355ad8df116021dd Signed-off-by: Rinku Kothiya <rkothiya@redhat.com>
* core: Reduce calls to THIS wherever possible (#2010) | Karthik Subrahmanya | 2021-01-15 | 2 | -25/+33
In a few functions, 'THIS' is called inside a loop and saved for later use in 'old_THIS'. Instead, we can call 'THIS' only when 'old_THIS' is NULL and reuse that itself, reducing redundant calls. Change-Id: Ie5d4e5fe42bd4df02d101b4c199759cb84e6aee1 Fixes: #1755 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* avoiding memory allocation while printing trace | Rinku Kothiya | 2021-01-11 | 1 | -12/+3
Printing a trace can fail due to memory allocation issues; this patch avoids that. Fixes: #1966 Change-Id: I14157303a2ff5d19de0e4ece0a460ff0cbd58c26 Signed-off-by: Rinku Kothiya <rkothiya@redhat.com>
* skip the lock when refcount is not zero | Rinku Kothiya | 2021-01-08 | 1 | -8/+5
Fixes: #1380 Change-Id: I68bb46d2cf8b41c8e709fbeee4778e3cdfc2d46c Signed-off-by: Rinku Kothiya <rkothiya@redhat.com>
* posix: avoiding redundant access of dictionary (#1786) | Rinku Kothiya | 2021-01-05 | 3 | -7/+16
* posix: avoiding redundant access of dictionary. This patch fixes the redundant access of the dictionary for the same information by the macro PL_LOCAL_GET_REQUESTS. fixes: #1707 Change-Id: I48047537436ce920e74bc11cecd9773d7fe4457c Signed-off-by: Rinku Kothiya <rkothiya@redhat.com> * posix: avoiding redundant access of dictionary. Converted the macro SET_BIT to a function set_bit; removed the code deleting the key GLUSTERFS_INODELK_DOM_COUNT; assigned the value to local->bitfield. Change-Id: I101f3fda65e9e75e05907d671203c5d7f072fa8f Fixes: #1707 Signed-off-by: Rinku Kothiya <rkothiya@redhat.com> * posix: avoiding redundant access of dictionary. Deleted the GLUSTERFS_INODELK_DOM_COUNT key. Change-Id: I638269e6a9f6fc11351eaede4c103e032881fe12 Fixes: #1707 Signed-off-by: Rinku Kothiya <rkothiya@redhat.com> * posix: avoiding redundant access of dictionary. Fixed smoke test warnings. Fixes: #1707 Change-Id: I8682bd0e49f44cbc1442324e1756b56481f18ccd Signed-off-by: Rinku Kothiya <rkothiya@redhat.com>
* io-stats: Change latency to nanoseconds from microseconds (#1833)Shree Vatsa N2020-12-172-12/+6
| | | | | | | | | | | | | | | | | | | | - In the 'BUMP_THROUGHPUT' macro, changed 'elapsed' from microseconds to nanoseconds. - In the 'update_ios_latency' function, 'elapsed' is now in nanoseconds. - In the 'collect_ios_latency_sample' function, removed the nano-to-micro conversion and instead assigned the nanosecond value directly to 'tv_nsec' of 'timespec'. - In the 'ios_sample_t' struct, changed 'timeval' to 'timespec' to support the above change. - In the '_io_stats_write_latency_sample' function, changed the formula from 1e+6 to 1e+9 since 'ios_sample_t' now uses 'timespec'. - In 'BUMP_THROUGHPUT', '_ios_sample_t', 'collect_ios_latency_sample' and 'update_ios_latency', changed the 'elapsed' datatype from 'double' to 'int64_t'. - In glusterfs/libglusterfs/src/glusterfs/common-utils.h, changed the return type of the 'gf_tsdiff' function from 'double' to 'int64_t' since it can return negative values. - In glusterfs/libglusterfs/src/latency.c, libglusterfs/src/glusterfs/common-utils.h, xlators/debug/io-stats/src/io-stats.c and xlators/storage/posix/src/posix-helpers.c, 'elapsed' is now of type 'int64_t'. Fixes: #1825 Signed-off-by: Shree Vatsa N <vatsa@kadalu.io>
* core: Implement graceful shutdown for a brick process (#1751)mohit842020-12-161-0/+3
| | | | | | | | | | | | | | | | | | | | * core: Implement graceful shutdown for a brick process glusterd sends a SIGTERM to the brick process at the time of stopping a volume if brick_mux is not enabled. In the case of brick_mux, on receiving a terminate signal for the last brick, the brick process sends a SIGTERM to its own process to stop itself. The current approach does not clean up resources when either the last brick is detached or brick_mux is not enabled. Solution: glusterd sends a terminate notification to the brick process at the time of stopping a volume, for a graceful shutdown. Change-Id: I49b729e1205e75760f6eff9bf6803ed0dbf876ae Fixes: #1749 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> * core: Implement graceful shutdown for a brick process Resolve some reviewer comments Fixes: #1749 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> Change-Id: I50e6a9e2ec86256b349aef5b127cc5bbf32d2561 * core: Implement graceful shutdown for a brick process Implement a key cluster.brick-graceful-cleanup to enable graceful shutdown for a brick process. If the key value is on, glusterd sends a detach request to stop the brick. Fixes: #1749 Change-Id: Iba8fb27ba15cc37ecd3eb48f0ea8f981633465c3 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> * core: Implement graceful shutdown for a brick process Resolve reviewer comments Fixes: #1749 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> Change-Id: I2a8eb4cf25cd8fca98d099889e4cae3954c8579e * core: Implement graceful shutdown for a brick process Resolve a reviewer comment specific to avoiding a memory leak Fixes: #1749 Change-Id: Ic2f09efe6190fd3776f712afc2d49b4e63de7d1f Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> * core: Implement graceful shutdown for a brick process Resolve a reviewer comment specific to avoiding a memory leak Fixes: #1749 Change-Id: I68fbbb39160a4595fb8b1b19836f44b356e89716 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* core: Updated the GD_OP_VERSION (#1889)Rinku Kothiya2020-12-141-1/+3
| | | | | | fixes: #1888 Change-Id: Ibe336f6f7f19cd148523f65b6fa2b81dca1bd7b6 Signed-off-by: Rinku Kothiya <rkothiya@redhat.com>
* core: Convert mem_get(0) and mem_put functions to Macros (#1908)mohit842020-12-143-35/+32
| | | | | | | | | | | | | | | | | | | | * core: Convert mem_get(0) and mem_put functions to Macros Problem: Currently the mem_get(0) and mem_put functions access memory pools that are not required while the mem-pool is disabled. Change-Id: Ief9bdaeb8637f5bc2b097eb6099fb942130e08ae Solution: Convert the mem_get(0) functions into macros. Fixes: #1359 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> * core: Convert mem_get(0) and mem_put functions to Macros Resolve reviewer comments Fixes: #1359 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> Change-Id: I8dfdfc1a1cd9906e442271abefc7a635e632581e
* runner: fix for coverity issues in runner_start() (#1902)Nikhil Ladha2020-12-121-5/+7
| | | | | | | | | | | | | | Fix: Fixed dead code and a few unused assignments in runner_start() in the run.c file. CID: 1437641 CID: 1437644 CID: 1437646 Updates: #1060 Change-Id: I30ac234e9ff1f768b0e33a81eb3ffbf0de576784 Signed-off-by: nik-redhat <nladha@redhat.com>
* Improve error handling (#1914)Rinku Kothiya2020-12-121-0/+6
| | | | | | Fixes: #1678 Change-Id: I566cd8bfd22c0ef63fcd44a8cea32366388a93e5 Signed-off-by: Rinku Kothiya <rkothiya@redhat.com>
* core: Optimize _xlator->stats structure to make memory access friendly (#1866)mohit842020-12-046-59/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | * core: Optimize _xlator->stats structure to make memory access friendly The current xlator->stats layout is not efficient for frequently accessed memory variables; optimize the stats structure to make it access-friendly. Fixes: #1583 Change-Id: I5c9d263b11d9bbf0bf5501e461bdd3cce03591f9 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> * core: Optimize _xlator->stats structure to make memory access friendly Resolve reviewer comments Fixes: #1583 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> Change-Id: I44a728263bfc397158dc95e4a9bae393fd3c9883 * core: Optimize _xlator->stats structure to make memory access friendly Resolve reviewer comments Fixes: #1583 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> Change-Id: I55e093e3f639052644ce6379cbbe2a15b0ef4be7
* all: change 'primary' to 'root' where it makes senseRavishankar N2020-12-027-26/+26
| | | | | | | | | | | | | | | | | | As a part of offensive language removal, we changed 'master' to 'primary' in some parts of the code that are *not* related to geo-replication via commits e4c9a14429c51d8d059287c2a2c7a76a5116a362 and 0fd92465333be674485b984e54b08df3e431bb0d. But it is better to use 'root' in some places to distinguish it from the geo-rep changes which use 'primary/secondary' instead of 'master/slave'. This patch mainly changes glusterfs_ctx_t->primary to glusterfs_ctx_t->root. Other places like meta xlator is also changed. gf-changelog.c is not changed since it is related to geo-rep. Updates: #1000 Change-Id: I3cd610f7bea06c7a28ae2c0104f34291023d1daf Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* runner: moving to posix_spawnp instead of fork (#1644)Nikhil Ladha2020-12-023-54/+128
| | | | | | | | | | | | | | | | | | | | | | | | | * runner: moving to posix_spawnp instead of fork Removed fork() and implemented posix_spawnp() accordingly, as it provides much better performance than fork(). A more detailed description of the benefits can be found in the description of the issue linked below. Fixes: #810 Signed-off-by: nik-redhat <nladha@redhat.com> * Added the close_fds_except call Signed-off-by: nik-redhat <nladha@redhat.com> * Added comments Signed-off-by: nik-redhat <nladha@redhat.com> * Made the functions static Signed-off-by: nik-redhat <nladha@redhat.com>
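A minimal sketch of the fork-to-posix_spawnp move described above. `run_command` is a hypothetical helper, not the actual runner API; it launches a child via posix_spawnp() (which searches PATH like execvp) and reaps it, returning the child's exit status:

```c
#include <spawn.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Spawn argv[0] with posix_spawnp() instead of fork()+exec(),
 * wait for it, and return its exit status (or -1 on error). */
static int
run_command(char *const argv[])
{
    pid_t pid;
    int status;

    int ret = posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ);
    if (ret != 0) {
        fprintf(stderr, "posix_spawnp failed: %d\n", ret);
        return -1;
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Unlike fork(), posix_spawnp() avoids duplicating the parent's address space, which is what makes it cheaper for large processes.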
* glusterd[brick_mux]: Optimize friend handshake code to avoid call_bail (#1614)mohit842020-11-306-3/+177
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During the glusterd handshake, glusterd receives a volume dictionary from the peer end to compare against its own volume dictionary data. If the options differ, it sets a key to indicate that volume options have changed and calls the import synctask to delete/start the volume. In a brick_mux environment with a high number of volumes (5k), the dict API calls in the function glusterd_compare_friend_volume take a long time because the function glusterd_handle_friend_req saves all peer volume data in a single dictionary. Due to the time taken by glusterd_handle_friend_req, RPC requests receive a call_bail from the peer end, and gluster (CLI) is not able to show volume status. Solution: To optimize the code, make the changes below: 1) Populate a new, specific dictionary to save the peer end's version-specific data, so the function does not take much time to decide whether the peer end has volume updates. 2) In case a volume has a differing version, set the key in status_arr instead of saving it in a dictionary, to make the operation faster. Note: To validate the changes, the following procedure was used: 1) Set up 5100 distributed 3x1 volumes 2) Enable brick_mux 3) Start all the volumes 4) Kill all gluster processes on the 3rd node 5) Run a loop to update a volume option on the 1st node: for i in {1..5100}; do gluster v set vol$i performance.open-behind off; done 6) Start the glusterd process on the 3rd node 7) Wait for the handshake to finish and check that there are no call_bail messages in the logs Change-Id: Ibad7c23988539cc369ecc39dea2ea6985470bee1 Fixes: #1613 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* change 'master' xlator to 'primary' xlatorRavishankar N2020-11-303-8/+8
| | | | | | | | | These were the only offensive-language occurrences in the code (.c) after making the changes for geo-rep (which is tracked in issue 1415). Change-Id: I21cd558fdcf8098e988617991bd3673ef86e120d Updates: #1000 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* core: tcmu-runner process continuous growing logs lru_size showing -1 (#1776)mohit842020-11-242-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * core: tcmu-runner process continuous growing logs lru_size showing -1 At the time of calling inode_table_prune, the code checks whether the current lru_size is greater than lru_limit, but if the lru_list is empty it throws the log message "Empty inode lru list found but with (%d) lru_size". From reading the code, it seems lru_size is out of sync with the actual number of inodes in lru_list. Because of the continuous error messages, the entire disk fills up and the user has to restart the tcmu-runner process to use the volumes. The log message was introduced by the patch https://review.gluster.org/#/c/glusterfs/+/15087/. Solution: Introduce a flag in_lru_list to decide whether an inode is part of the lru_list or not. Fixes: #1775 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> Change-Id: I4b836bebf4b5db65fbf88ff41c6c88f4a7ac55c1 * core: tcmu-runner process continuous growing logs lru_size showing -1 Update the in_lru_list flag only while modifying lru_size Fixes: #1775 Change-Id: I3bea1c6e748b4f50437999bae59edeb3d7677f47 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> * core: tcmu-runner process continuous growing logs lru_size showing -1 Resolve comments in inode_table_destroy and inode_table_prune Fixes: #1775 Change-Id: I5aa4d8c254f0fe374daa5ec604f643dea8dd56ff Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> * core: tcmu-runner process continuous growing logs lru_size showing -1 Update in_lru_list only while updating lru_size Fixes: #1775 Change-Id: I950eb1f0010c3d4bcc44a33225a502d2291d1a83 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* enhancement/debug: Option to generate core dump without killing the process ↵Vinayak hariharmath2020-11-233-2/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | (#1814) Comments and idea proposed by Xavi Hernandez (jahernan@redhat.com): On production systems we sometimes see a log message saying that an assertion has failed, but it's hard to track why it failed without additional information (on debug builds, a GF_ASSERT() generates a core dump and kills the process, so it can be used to debug the issue, but many times we are only able to reproduce assertion failures on production systems, where GF_ASSERT() only logs a message and continues). In other cases we may have a core dump caused by a bug, but the core dump doesn't necessarily happen when the bug occurs. Sometimes the crash happens so much later that the causes that triggered the bug are lost. In these cases we can add more assertions to the places that touch the potential candidates for causing the bug, but the only thing we'll get is a log message, which may not be enough. One solution would be to always generate a core dump in case of assertion failure, but this was already discussed and it was decided that it was too drastic. If a core dump was really needed, a new macro was created to do so: GF_ABORT(); GF_ASSERT() would continue to not kill the process on production systems. I'm proposing to modify GF_ASSERT() on production builds so that it conditionally triggers a signal when a debugger is attached. When this happens, the debugger will generate a core dump and continue the process as if nothing had happened. If there's no debugger attached, GF_ASSERT() will behave as always. The idea is to use SIGCONT for this. This signal is harmless, so we can unmask it (we currently mask all unneeded signals) and raise it inside a GF_ASSERT() when some global variable is set to true. To produce the core dump, run the script under extras/debug/gfcore.py in another terminal. gdb breaks in and produces a core dump when GF_ASSERT is hit.
The script is copied from #1810, which was written by Xavi Hernandez (jahernan@redhat.com). Fixes: #1810 Change-Id: I6566ca2cae15501d8835c36f56be4c6950cb2a53 Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
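The proposal above can be sketched as a macro. `SOFT_ASSERT` and `assert_raise_sigcont` are hypothetical names, not the actual GF_ASSERT() implementation: a failed assertion logs as before, and additionally raises the harmless SIGCONT when the global flag has been set (e.g. by a debugger-side script). An attached debugger can trap the signal, dump core, and let the process continue; without a debugger, nothing else happens:

```c
#include <signal.h>
#include <stdio.h>

/* Set to 1 externally (e.g. by a debugger) to request a core dump
 * on assertion failure via SIGCONT; 0 keeps the old log-only path. */
static volatile sig_atomic_t assert_raise_sigcont = 0;

#define SOFT_ASSERT(cond)                                               \
    do {                                                                \
        if (!(cond)) {                                                  \
            fprintf(stderr, "Assertion failed: %s (%s:%d)\n", #cond,    \
                    __FILE__, __LINE__);                                \
            if (assert_raise_sigcont)                                   \
                raise(SIGCONT); /* harmless unless a debugger traps it */ \
        }                                                               \
    } while (0)
```

SIGCONT is a good choice here because its default disposition neither terminates nor stops a running process, so raising it in production without a debugger attached is a no-op.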
* posix: Attach a posix_spawn_disk_thread with glusterfs_ctx (#1595)mohit842020-11-091-0/+6
| | | | | | | | | | | | | Currently the posix xlator spawns posix_disk_space_threads per brick, and in a brick_mux environment, when glusterd attaches bricks at the maximum level (250) to a single brick process, 250 threads are spawned for all bricks and the brick process memory size also increases. Solution: Attach a posix_disk_space thread to glusterfs_ctx to spawn one thread per process instead of one per brick. Fixes: #1482 Change-Id: I8dd88f252a950495b71742e2a7588bd5bb019ec7 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* logger: Always print errors in english (#1657)Rinku Kothiya2020-11-071-1/+5
| | | | | | fixes: #1302 Change-Id: If0e21f016155276a953c64a8dd13ff3eb281d09d Signed-off-by: Rinku Kothiya <rkothiya@redhat.com>
* xlators: misc conscious language changes (#1715)Ravishankar N2020-11-026-24/+22
| | | | | | | | | | | | core:change xlator_t->ctx->master to xlator_t->ctx->primary afr: just changed comments. meta: change .meta/master to .meta/primary. Might break scripts. changelog: variable/function name changes only. These are unrelated to geo-rep. Fixes: #1713 Change-Id: I58eb5fcd75d65fc8269633acc41313503dccf5ff Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/dht: Perform migrate-file with lk-owner (#1581)Pranith Kumar Karampuri2020-10-291-3/+3
| | | | | | | | | | | | | | | | | | | | | | * cluster/dht: Perform migrate-file with lk-owner 1) Added GF_ASSERT() calls in client-xlator to find these issues sooner. 2) Fuse is setting zero-lkowner with len as 8 when the fop doesn't have any lk-owner. Changed this to have len as 0 just as we have in fops triggered from xlators lower to fuse. * syncop: Avoid frame allocation if we can * cluster/dht: Set lkowner in daemon rebalance code path * cluster/afr: Set lkowner for ta-selfheal * cluster/ec: Destroy frame after heal is done * Don't assert for lk-owner in lk call * set lkowner for mandatory lock heal tests fixes: #1529 Change-Id: Ia803db6b00869316893abb1cf435b898eec31228 Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* glusterd/afr: enable granular-entry-heal by default (#1621)Ravishankar N2020-10-221-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | 1. The option has been enabled and tested for quite some time now in RHHI-V downstream and I think it is safe to make it 'on' by default. Since it is not possible to simply change it from 'off' to 'on' without breaking rolling upgrades, old clients etc., I have made it the default only for new volumes starting from op-version GD_OP_VERSION_9_0. Note: If you do a volume reset, the option will be turned back off. This is okay as the dir's gfid will be captured in the 'xattrop' folder and heals will proceed. There might be stale entries inside the 'entry-changes' folder, which will be removed when we enable the option again. 2. I encountered a customer issue where entry heal was pending on a dir. with 236436 files in it and the glustershd.log output was just stuck at "performing entry selfheal", so I have added logs to give us more info in DEBUG level about whether entry heal and data heal are progressing (metadata heal doesn't take much time). That way, we have a quick visual indication that things are not 'stuck' if we briefly enable debug logs, instead of taking statedumps or checking profile info etc. Fixes: #1483 Change-Id: I4f116f8c92f8cd33f209b758ff14f3c7e1981422 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* libglusterfs/coverity: pointer to local outside the scope (#1675)Vinayak hariharmath2020-10-212-6/+10
| | | | | | | | | | | | | | issue: gf_store_read_and_tokenize() returns the address of the locally referred string. fix: pass the buf to gf_store_read_and_tokenize() and use it for tokenize. CID: 1430143 Updates: #1060 Change-Id: Ifc346540c263f58f4014ba2ba8c1d491c20ac609 Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
* core: configure optimum inode table hash_size for shd (#1576)mohit842020-10-112-23/+45
| | | | | | | | | | | | | | | | | | | | | In a brick_mux environment a shd process consumes high memory. After printing the statedump, I found that it allocates 1M per afr xlator for all bricks. With 4k volumes configured, it consumes almost 6G of RSS in total, of which 4G is consumed by inode tables: [cluster/replicate.test1-replicate-0 - usage-type gf_common_mt_list_head memusage] size=1273488 num_allocs=2 max_size=1273488 max_num_allocs=2 total_allocs=2 The inode_new_table function allocates memory (1M) for a list of inode and dentry hashes. For shd the lru_limit size is 1, so we don't need to create a big hash table; to reduce the RSS size of the shd process, pass an optimum bucket count at the time of creating the inode_table. Change-Id: I039716d42321a232fdee1ee8fd50295e638715bb Fixes: #1538 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
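The sizing idea above can be sketched with a hypothetical helper (`choose_hash_size` is illustrative, not the actual gluster code): when the table's lru_limit guarantees only a handful of inodes, use a proportionally small power-of-two bucket count instead of the fixed default whose list heads alone cost ~1MB per table:

```c
/* Pick a bucket count for an inode table: tables bounded to a small
 * lru_limit (e.g. shd's limit of 1) get a small power-of-two size;
 * unlimited or large tables keep the default. */
static int
choose_hash_size(int lru_limit, int default_buckets)
{
    if (lru_limit <= 0 || lru_limit >= default_buckets)
        return default_buckets; /* unlimited (or huge) table: keep default */

    int size = 1;
    while (size < lru_limit) /* round lru_limit up to a power of two */
        size <<= 1;
    return size;
}
```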
* rpcsvc: Add latency tracking for rpc programsPranith Kumar K2020-09-0410-67/+115
| | | | | | | | | | Added latency tracking to the rpc-handling code. With this change we should be able to monitor the amount of time the rpc-handling code consumes for each rpc call. fixes: #1466 Change-Id: I04fc7f3b12bfa5053c0fc36885f271cb78f581cd Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
* glusterfsd, libglusterfs, rpc: prefer libglusterfs time APIDmitry Antipov2020-09-031-1/+2
| | | | | | | | Use timespec_now_realtime() rather than clock_gettime(). Change-Id: I8fa00b7c0f7b388305c7d19574be3b409db68558 Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Updates: #1002
* build: extend --enable-valgrind to support Memcheck and DRDDmitry Antipov2020-08-313-4/+11
| | | | | | | | | Extend '-enable-valgrind' to '--enable=valgrind[=memcheck,drd]' to enable Memcheck or DRD Valgrind tool, respectively. Change-Id: I80d13d72ba9756e0cbcdbeb6766b5c98e3e8c002 Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Updates: #1002
* libglusterfs: fix dict leakRavishankar N2020-09-013-8/+25
| | | | | | | | | | | | | | Problem: gf_rev_dns_lookup_cached() allocated struct dnscache->dict if it was null but the freeing was left to the caller. Fix: Moved dict allocation and freeing into corresponding init and fini routines so that its easier for the caller to avoid such leaks. Updates: #1000 Change-Id: I90d6a6f85ca2dd4fe0ab461177aaa9ac9c1fbcf9 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* Events: Fixing coverity issues.Srijan Sivakumar2020-09-011-3/+5
| | | | | | | | | | Fixing resource leak reported by coverity scan. CID: 1431237 Change-Id: I2bed106b3dc4296c50d80542ee678d32c6928c25 Updates: #1060 Signed-off-by: Srijan Sivakumar <ssivakum@redhat.com>
* fuse: fetch arbitrary number of groups from /proc/[pid]/statusCsaba Henk2020-07-171-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | Glusterfs so far constrained itself with an arbitrary limit (32) for the number of groups read from /proc/[pid]/status (this was the number of groups shown there prior to Linux commit v3.7-9553-g8d238027b87e (v3.8-rc1~74^2~59); since this commit, all groups are shown). With this change we'll read groups up to the number Glusterfs supports in general (64k). Note: the actual number of groups that are made use of in a regular Glusterfs setup shall still be capped at ~93 due to limitations of the RPC transport. To be able to handle more groups than that, brick side gid resolution (server.manage-gids option) can be used along with NIS, LDAP or other such networked directory service (see https://github.com/gluster/glusterdocs/blob/5ba15a2/docs/Administrator%20Guide/Handling-of-users-with-many-groups.md#limit-in-the-glusterfs-protocol ). Also adding some diagnostic messages to frame_fill_groups(). Change-Id: I271f3dc3e6d3c44d6d989c7a2073ea5f16c26ee0 fixes: #1075 Signed-off-by: Csaba Henk <csaba@redhat.com>
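The "Groups:" line of /proc/[pid]/status is a whitespace-separated list of supplementary gids, and lifting the old 32-entry cap amounts to parsing as many as the caller's buffer allows. A minimal sketch of such a parser (`parse_groups_line` is a hypothetical helper, not the actual frame_fill_groups() code):

```c
#include <stdlib.h>
#include <string.h>

/* Parse a "Groups:\t4 24 27 ..." line into gids[], stopping only at
 * the caller-supplied cap. Returns the number of gids parsed, or -1
 * if the line is not a Groups: line. */
static int
parse_groups_line(const char *line, unsigned int *gids, int max)
{
    int n = 0;
    const char *p = line;

    if (strncmp(p, "Groups:", 7) != 0)
        return -1;
    p += 7;
    while (n < max) {
        char *end;
        unsigned long gid = strtoul(p, &end, 10); /* skips leading spaces */
        if (end == p)
            break; /* no more numbers on the line */
        gids[n++] = (unsigned int)gid;
        p = end;
    }
    return n;
}
```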
* libglusterfs: add functions to calculate time differenceDmitry Antipov2020-08-142-2/+33
| | | | | | | | | | Add gf_tvdiff() and gf_tsdiff() to calculate the difference between 'struct timeval' and 'struct timespec' values, use them where appropriate. Change-Id: I172be06ee84e99a1da76847c15e5ea3fbc059338 Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Updates: #1002
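The commit above adds gf_tvdiff() and gf_tsdiff() for 'struct timeval' and 'struct timespec' differences. A minimal sketch of what such helpers typically look like (illustrative bodies with hypothetical names `tv_diff_us`/`ts_diff_ns`, not the actual gluster code):

```c
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

/* Difference between two timeval samples, in microseconds.
 * int64_t so a negative result (end earlier than start) is valid. */
static inline int64_t
tv_diff_us(const struct timeval *start, const struct timeval *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000 +
           (end->tv_usec - start->tv_usec);
}

/* Difference between two timespec samples, in nanoseconds. */
static inline int64_t
ts_diff_ns(const struct timespec *start, const struct timespec *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000 +
           (end->tv_nsec - start->tv_nsec);
}
```

Widening the seconds term to int64_t before multiplying avoids overflow, since tv_sec alone times 1e9 exceeds 32-bit range after about two seconds.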
* posix: Implement a janitor thread to close fdMohit Agrawal2020-07-271-0/+7
| | | | | | | | | | | | | | Problem: In commit fb20713b380e1df8d7f9e9df96563be2f9144fd6 we used a synctask to close fds, but we have found that the patch reduces performance. Solution: Use a janitor thread to close fds; save the pfd ctx in the ctx janitor list, and also save the posix xlator in the pfd object to avoid a race condition during cleanup in a brick_mux environment. Change-Id: Ifb3d18a854b267333a3a9e39845bfefb83fbc092 Fixes: #1396 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
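The janitor pattern described above can be sketched as follows (hypothetical names, not the actual posix xlator code): fast-path code queues fds on a mutex-protected list instead of closing them inline, and a single background thread drains the list and performs the close() calls:

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

typedef struct fd_node {
    int fd;
    struct fd_node *next;
} fd_node_t;

static fd_node_t *janitor_head = NULL;
static pthread_mutex_t janitor_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called on the fast path: defer the close instead of doing it inline. */
static void
janitor_enqueue(int fd)
{
    fd_node_t *n = malloc(sizeof(*n));
    n->fd = fd;
    pthread_mutex_lock(&janitor_lock);
    n->next = janitor_head;
    janitor_head = n;
    pthread_mutex_unlock(&janitor_lock);
}

/* One pass of the janitor thread's loop body: detach the whole list
 * under the lock, then close() and free outside it. Returns the
 * number of fds closed. */
static int
janitor_drain(void)
{
    int closed = 0;
    pthread_mutex_lock(&janitor_lock);
    fd_node_t *list = janitor_head;
    janitor_head = NULL;
    pthread_mutex_unlock(&janitor_lock);
    while (list) {
        fd_node_t *next = list->next;
        close(list->fd);
        free(list);
        closed++;
        list = next;
    }
    return closed;
}
```

Detaching the entire list in one locked operation keeps the critical section tiny, so fop threads enqueuing fds never wait on the close() syscalls.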
* dict: Remove redundant checksSheetal Pamecha2020-08-062-41/+30
| | | | | | fixes: #1428 Change-Id: I0cb1c42d620ac1aeab8da25a2e1d7835219d2e4a Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>