path: root/xlators
Commit message | Author | Age | Files | Lines
* glusterd: After upgrade on release 9.1 glusterd protocol is broken (#2352) [devel] | mohit84 | 2021-04-23 | 1 | -3/+4
| * glusterd: After upgrade on release 9.1 glusterd protocol is broken After an upgrade to release-9 the glusterd protocol is broken because, on the upgraded nodes, glusterd is not able to find an actor at the expected index in the rpc procedure table. The new proc (GLUSTERD_MGMT_V3_POST_COMMIT) was introduced by a patch (https://review.gluster.org/#/c/glusterfs/+/24771/) in the middle of the table, so the indices of the existing actors changed and glusterd on the upgraded nodes fails. Solution: Move the proc (GLUSTERD_MGMT_V3_POST_COMMIT) to the last position in the proc table to avoid the issue. Fixes: #2351 Change-Id: I36575fd4302944336a75a8d4a305401a7128fd84 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
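A minimal sketch of why the ordering matters (hypothetical enum, not the actual glusterd procedure table): each actor is addressed by its numeric index on the wire, so new procedures can only be appended.

```c
/* Hypothetical illustration (not the real glusterd table): the numeric
 * value of each entry is part of the protocol, so inserting a new
 * procedure in the middle renumbers every actor after it and breaks
 * communication with peers running the older table. */
enum mgmt_v3_proc {
    MGMT_V3_NULL = 0,
    MGMT_V3_LOCK,        /* old peers expect index 1 */
    MGMT_V3_COMMIT,      /* old peers expect index 2 */
    MGMT_V3_UNLOCK,      /* old peers expect index 3 */
    MGMT_V3_POST_COMMIT, /* new proc appended last: existing indices unchanged */
    MGMT_V3_MAXVALUE,
};
```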
* protocol/client: Fix lock memory leak (#2338) | Pranith Kumar Karampuri | 2021-04-22 | 3 | -33/+68
| Problem-1: When an overlapping lock is issued, the merged lock is not assigned the owner. When flush is issued on the fd, this particular lock is not freed, leading to a memory leak. Fix-1: Assign the owner while merging the locks. Problem-2: On fd-destroy, lock structs could still be present in fdctx. For some reason, with the flock -x command and closing of the bash fd, this code path is hit, which leaks the lock structs. Fix-2: When fdctx is being destroyed in the client, make sure to clean up any lock structs. fixes: #2337 Change-Id: I298124213ce5a1cf2b1f1756d5e8a9745d9c0a1c Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* coverity: Removed structural dead code (#2320) | Nikhil Ladha | 2021-04-19 | 1 | -8/+2
| Issue: the `for` loop was executed only once, which Coverity reports as structurally dead code. Fix: Updated the code to use an `if` condition instead of the `for` loop. CID: 1437779 Updates: #1060 Change-Id: I2ca1d2c9d2842d586161fe971bb8c7b3444dfb2b Signed-off-by: nik-redhat <nladha@redhat.com>
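A schematic of the kind of change described (hypothetical code, not the actual fix): a loop whose body always exits on the first iteration is clearer as a plain conditional.

```c
/* Hypothetical before/after (not the actual coverity fix). */
static int first_entry_before(const int *entries, int count)
{
    int ret = -1;
    for (int i = 0; i < count; i++) {
        ret = entries[i];
        break; /* the loop never advances, so everything past it is dead */
    }
    return ret;
}

static int first_entry_after(const int *entries, int count)
{
    return (count > 0) ? entries[0] : -1;
}
```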
* dht: fix rebalance of sparse files (#2318) | Xavi Hernandez | 2021-04-09 | 1 | -56/+60
| | | | | | | | | | | | | | Current implementation of rebalance for sparse files has a bug that, in some cases, causes a read of 0 bytes from the source subvolume. Posix xlator doesn't allow 0 byte reads and fails them with EINVAL, which causes rebalance to abort the migration. This patch implements a more robust way of finding data segments in a sparse file that avoids 0 byte reads, allowing the file to be migrated successfully. Fixes: #2317 Change-Id: Iff168dda2fb0f2edf716b21eb04cc2cc8ac3915c Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
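One robust way to enumerate data segments of a sparse file without ever issuing a 0-byte read is to walk it with lseek(SEEK_DATA)/lseek(SEEK_HOLE); the sketch below illustrates that idea and is not the actual DHT migration code.

```c
#define _GNU_SOURCE /* SEEK_DATA / SEEK_HOLE */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Walk the data segments of a sparse file. Every reported segment has a
 * strictly positive length, so a copier built on this never reads 0 bytes. */
static int walk_data_segments(int fd, off_t size)
{
    off_t pos = 0;
    while (pos < size) {
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data == (off_t)-1) {
            if (errno == ENXIO)
                break;          /* the rest of the file is a hole */
            return -1;
        }
        off_t hole = lseek(fd, data, SEEK_HOLE);
        if (hole == (off_t)-1)
            return -1;
        printf("data segment: offset=%lld len=%lld\n",
               (long long)data, (long long)(hole - data));
        pos = hole;
    }
    return 0;
}
```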
* Removal of force option in snapshot create (#2110) | nishith-vihar | 2021-04-06 | 2 | -163/+25
| The force option for the snapshot create command can fail even though quorum is satisfied, and the option is redundant. This change deprecates the force option for snapshot create and, instead of checking for quorum, checks whether all bricks are online when creating a snapshot. Fixes: #2099 Change-Id: I45d866e67052fef982a60aebe8dec069e78015bd Signed-off-by: Nishith Vihar Sakinala <nsakinal@redhat.com>
* marker: initiate xattrs QUOTA_SIZE_KEY for empty volume (#2261) | chenglin130 | 2021-04-01 | 2 | -5/+12
| * marker: initiate quota xattrs for empty volume When a volume is empty, listing quota info fails after setting limit-usage:
|     # gluster volume quota gv0 list
|     /        N/A  N/A  N/A  N/A  N/A  N/A
| This happens because there is no QUOTA_SIZE_KEY in the xattrs of the volume's root directory:
|     # getfattr -d -m. -e hex /data/brick2/gv0
|     getfattr: Removing leading '/' from absolute path names
|     # file: data/brick2/gv0
|     security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
|     trusted.gfid=0x00000000000000000000000000000001
|     trusted.glusterfs.mdata=0x01000000000000000000000000603e70f6000000003b3f3c8000000000603e70f6000000003351d14000000000603e70f9000000000ff95b00
|     trusted.glusterfs.quota.limit-set.1=0x0000000000a00000ffffffffffffffff
|     trusted.glusterfs.volume-id=0xe27d61be048c4195a9e1ee349775eb59
| This patch fixes it by setting QUOTA_SIZE_KEY on the empty volume directory when quota is enabled:
|     # gluster volume quota gv0 list
|     Path  Hard-limit  Soft-limit  Used    Available  Soft-limit exceeded?  Hard-limit exceeded?
|     -------------------------------------------------------------------------------------------
|     /     4.0MB       80%(3.2MB)  0Bytes  4.0MB      No                    No
| Fixes: #2260 Change-Id: I6ab3e43d6ef33e5ce9531b48e62fce9e8b3fc555 Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
* xlators/mgmt: Fixing coverity issue 1445996 | Ashish Pandey | 2021-03-29 | 1 | -5/+7
| | | | | | | | Fixing "Null pointer dereferences" fixes: #2129 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* afr: don't reopen fds on which POSIX locks are held (#1980) | Karthik Subrahmanya | 2021-03-27 | 10 | -105/+439
| When client.strict-locks is enabled on a volume and POSIX locks are held on files, do not re-open such fds after client disconnect and reconnection; re-opening them might lead to multiple clients acquiring the locks and cause data corruption. Change-Id: I8777ffbc2cc8d15ab57b58b72b56eb67521787c5 Fixes: #1977 Signed-off-by: karthik-us <ksubrahm@redhat.com>
* afr: make fsync post-op aware of inodelk count (#2273) | Ravishankar N | 2021-03-25 | 2 | -17/+24
| Problem: Since commit bd540db1e, eager-locking was enabled for fsync. But on certain VM workloads with sharding enabled, the shard xlator keeps sending fsync on the base shard. This can cause blocked inodelks from other clients (including shd) to time out due to call bail. Fix: Make afr fsync aware of the inodelk count and not delay post-op + unlock when inodelk count > 1, just like writev. Code is restructured so that any fd-based AFR_DATA_TRANSACTION can be made aware by setting GLUSTERFS_INODELK_DOM_COUNT in the xdata request. Note: We do not know yet why VMs go into a paused state because of the blocked inodelks, but this patch should be a first step in reducing the occurrence. Updates: #2198 Change-Id: Ib91ebdd3101d590c326e69c829cf9335003e260b Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/dht: use readdir for fix-layout in rebalance (#2243) | Pranith Kumar Karampuri | 2021-03-22 | 4 | -92/+90
| Problem: On a cluster with 15 million files, when fix-layout was started, it was not progressing at all. So we tried to do an os.walk() + os.stat() on the backend filesystem directly. It took 2.5 days. We removed os.stat() and re-ran it on another brick with a similar data-set. It took 15 minutes. We realized that readdirp is extremely costly compared to readdir if the stat is not useful. The fix-layout operation only needs to know that the entry is a directory so that fix-layout can be triggered on it. Most modern filesystems provide this information in the readdir operation. We don't need readdirp, i.e. readdir+stat. Fix: Use the readdir operation in fix-layout. Do readdir+stat/lookup for filesystems that don't provide d_type in the readdir operation. fixes: #2241 Change-Id: I5fe2ecea25a399ad58e31a2e322caf69fc7f49eb Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
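A sketch of the readdir-based classification (illustrative, not the DHT rebalance code): use d_type when the filesystem reports it and fall back to a stat() only for DT_UNKNOWN entries.

```c
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <sys/stat.h>

/* Returns non-zero if the directory entry is a directory. Most filesystems
 * fill d_type, so the per-entry stat() is only needed for DT_UNKNOWN. */
static int is_directory(const char *parent, const struct dirent *entry)
{
    if (entry->d_type != DT_UNKNOWN)
        return entry->d_type == DT_DIR;

    char path[PATH_MAX];
    struct stat st;

    snprintf(path, sizeof(path), "%s/%s", parent, entry->d_name);
    if (stat(path, &st) != 0)
        return 0;
    return S_ISDIR(st.st_mode);
}
```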
* Cleanup unused this pointers (#2282) | Rinku Kothiya | 2021-03-22 | 8 | -43/+35
| | | | | | fixes: #2268 Change-Id: If00ee847e15ac7f7e5b0e12125a7d02a610b9708 Signed-off-by: Rinku Kothiya <rkothiya@redhat.com>
* afr, dht: Add ramifications of disabling ensure-durability | Pranith Kumar K | 2021-03-19 | 3 | -2/+9
| | | | | | | Also moved options to NO_DOC Change-Id: I86623f4139d156812e622a87655483c9d2491052 Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* Coverity Issues: 1447088, 1447089 (#2217) | Ashish Pandey | 2021-03-18 | 2 | -37/+36
| | | | | | | 1447088 - Resource leak 1447089 - Buffer not null terminated updates: #2216 Signed-off-by: Ashish Pandey <aspandey@redhat.com>
* afr: remove priv->root_inode (#2244) | Ravishankar N | 2021-03-17 | 4 | -8/+1
| priv->root_inode seems to be a remnant of the pump xlator and was getting populated in the discover code path. thin-arbiter code used it to populate loc info, but it seems that in the case of some daemons like quotad, the discover path for the root gfid is not hit, causing a crash. Fix: the root inode can be accessed via this->itable->root, so use that and remove the priv->root_inode instances from the afr code. Fixes: #2234 Change-Id: Iec59c157f963a4dc455652a5c85a797d00cba52a Signed-off-by: Ravishankar N <ravishankar@redhat.com>
* cluster/dht: Provide option to disable fsync in data migration (#2259) | Pranith Kumar Karampuri | 2021-03-17 | 4 | -12/+37
| At the moment dht rebalance doesn't give any option to disable fsync after data migration. Making this an option lets admins take responsibility for their data in a way that is suitable for their cluster. The default value is still 'on', so the behavior is intact for people who don't care about this. For example: If the data that is going to be migrated is already backed up or snapshotted, there is no need for fsync to happen right after migration, which can affect active I/O on the volume from applications. fixes: #2258 Change-Id: I7a50b8d3a2f270d79920ef306ceb6ba6451150c4 Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* String not null terminated (#2219) | nishith-vihar | 2021-03-11 | 1 | -0/+2
| CID: 1214629, 1274235, 1437648 The buffer is now null-terminated, resolving the issue. Change-Id: Ieb1d067d8dd860c55a8091dd6fbba1bcbb4dc19f Updates: #1060 Signed-off-by: Nishith Vihar Sakinala <nsakinal@redhat.com>
* cluster/dht: Fix use-after-free bug dht_queue_readdir(p) (#2242) | Pranith Kumar Karampuri | 2021-03-11 | 1 | -2/+9
| | | | | | | | | | | | | Problem: In dht_queue_readdir(p) 'frame' is accessed after unwind. This will lead to undefined behavior as frame would be freed upon unwind. Fix: Store the variables that are needed in local variables and use them instead. fixes: #2239 Change-Id: I6b2e48e87c85de27fad67a12d97abd91fa27c0c1 Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
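The general shape of the fix, with illustrative names (not the actual dht_queue_readdir() code): copy whatever is still needed out of the frame before unwinding, and touch only the copies afterwards.

```c
/* Schematic of the use-after-free and its fix: once the frame is unwound,
 * both the frame and its ->local are owned by the destroying code and must
 * not be read again, so any value still needed is copied to the stack first.
 * 'complete' stands in for STACK_UNWIND + frame destruction. */
struct demo_local {
    int op_ret;
};

struct demo_frame {
    struct demo_local *local;
};

static void finish_fixed(struct demo_frame *frame,
                         void (*complete)(struct demo_frame *))
{
    int op_ret = frame->local->op_ret; /* copy while the frame is still valid */
    complete(frame);                   /* frame and local are freed here */
    /* ... continue working with the stack copy 'op_ret' only ... */
    (void)op_ret;
}
```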
* features/index: Optimize link-count fetching code path (#1789) | Pranith Kumar Karampuri | 2021-03-10 | 1 | -19/+58
| * features/index: Optimize link-count fetching code path Problem: AFR requests 'link-count' in lookup to check if there are any pending heals. Based on this information, afr will set dirent->inode to NULL in readdirp when heals are ongoing, to prevent serving bad data. When heals are completed, fetching the link-count xattr leads to an opendir of the xattrop directory and then reading its contents to figure out that no healing is needed, for every lookup. This was not detected until this github issue because ZFS in some cases can lead to very slow readdir() calls. Since Glusterfs does a lot of lookups, this was slowing down all operations and increasing load on the system. Code problem: the index xlator, on any xattrop operation, adds an index to the relevant dirs and, after the xattrop operation is done, will delete/keep the index in that directory based on the value fetched in xattrop from posix. AFR sends all-zero xattrop for changelog xattrs. This leads to priv->pending_count manipulation which sets the count back to -1. The next lookup operation then triggers opendir/readdir to find the actual link-count in lookup because the in-memory priv->pending_count is negative. Fix: 1) Don't add to the index on all-zero xattrop for a key. 2) Set pending-count to -1 when the first gfid is added into the xattrop directory, so that the next lookup can compute the link-count. fixes: #1764 Change-Id: I8a02c7e811a72c46d78ddb2d9d4fdc2222a444e9 Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com> * addressed comments Change-Id: Ide42bb1c1237b525d168bf1a9b82eb1bdc3bc283 Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com> * tests: Handle base index absence Change-Id: I3cf11a8644ccf23e01537228766f864b63c49556 Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com> * Addressed LOCK based comments, .t comments Change-Id: I5f53e40820cade3a44259c1ac1a7f3c5f2f0f310 Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* afr: fix directory entry count (#2233) | Xavi Hernandez | 2021-03-09 | 1 | -3/+8
| AFR may hide some existing entries from a directory when reading it because they are generated internally for private management. However, the number of entries returned from the readdir() function is not updated accordingly, so it may return a number higher than the number of entries actually present in the gf_dirent list. This may cause unexpected behavior in clients, including gfapi, which incorrectly assumes that there was an entry when the list was actually empty. This patch also makes the check in gfapi more robust to avoid similar issues that could appear in the future. Fixes: #2232 Change-Id: I81ba3699248a53ebb0ee4e6e6231a4301436f763 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
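Schematic of the accounting fix (illustrative types, not the actual AFR function): whenever a hidden internal entry is dropped from the list, the count returned to the caller is decremented in the same place.

```c
#include <stdbool.h>

/* Illustrative entry filter: keep the reported count in sync with the list.
 * (Freeing of the unlinked entries is omitted for brevity.) */
struct entry {
    struct entry *next;
    bool internal; /* e.g. entries used only for private bookkeeping */
};

static int filter_entries(struct entry **head, int count)
{
    for (struct entry **pp = head; *pp != NULL; ) {
        if ((*pp)->internal) {
            *pp = (*pp)->next; /* unlink the hidden entry ... */
            count--;           /* ... and decrement the returned count */
        } else {
            pp = &(*pp)->next;
        }
    }
    return count; /* matches the entries actually left in the list */
}
```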
* EC - Fixing a Coverity issue (Uninitialized lock use) | Barak Sason Rofman | 2021-03-05 | 1 | -2/+2
| CID: 1444461 A lock is being destroyed, but in some code-flows it might be used later on; modified the code-flow to make sure the destroyed lock is never used. Change-Id: I9610d56d9cb8a8ab7062e9094493dba9afdd0b30 updates: #1060 Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
* quiesce: Resource leak coverity fix (#2215) | Sheetal Pamecha | 2021-03-05 | 1 | -0/+3
| | | | | | | Fixes CID: 1124725 Updates: #1060 Change-Id: Iced092c5ad1a9445e4c758f09a481501bae7275f Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
* glusterd: Fix deadlock while concurrent quota enable (#2118) | zhangxianwei8 | 2021-03-04 | 1 | -1/+1
| In glusterd_svc_start: 1) synctaskA gets attach_lock and then releases big_lock to execute runner_run. 2) synctaskB then gets big_lock but cannot get attach_lock, so it waits. 3) After executing runner_run, synctaskA tries to get big_lock again, but synctaskB holds it, so it waits too. This leads to deadlock. This patch uses runner_run_nowait to avoid the deadlock. fixes: #2117 Signed-off-by: Zhang Xianwei <zhang.xianwei8@zte.com.cn>
* afr: fix coverity issue introduced by 90cefde (#2201) | Xavi Hernandez | 2021-03-01 | 1 | -2/+2
| | | | | | | Fixes coverity issues 1447029 and 1447028. Updates: #2161 Change-Id: I6a564231d6aeb76de20675b7ced5d45eed8c377f Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* dht: fix use-after-free introduced by 70e6ee2 | Xavi Hernandez | 2021-02-26 | 2 | -3/+39
| | | | | | Change-Id: I97e73c0aae74fc5d80c975f56f2f7a64e3e1ae95 Updates: #2169 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* cluster/afr: Fix race in lockinfo (f)getxattr (#2162) | Xavi Hernandez | 2021-02-24 | 1 | -142/+112
| * cluster/afr: Fix race in lockinfo (f)getxattr A shared dictionary was updated outside the lock after having updated the number of remaining answers. This means that one thread may be processing the last answer and unwinding the request before another thread completes updating the dict.
|     Thread 1                          Thread 2
|     LOCK()
|     call_cnt-- (=1)
|     UNLOCK()
|                                       LOCK()
|                                       call_cnt-- (=0)
|                                       UNLOCK()
|                                       update_dict(dict)
|                                       if (call_cnt == 0) {
|                                           STACK_UNWIND(dict);
|                                       }
|     update_dict(dict)
|     if (call_cnt == 0) {
|         STACK_UNWIND(dict);
|     }
| The updates from thread 1 are lost. This patch also reduces the work done inside the locked region and reduces code duplication. Fixes: #2161 Change-Id: Idc0d34ab19ea6031de0641f7b05c624d90fac8fa Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
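The essence of the corrected callback, sketched with pthread primitives and illustrative names rather than the LOCK()/UNLOCK() shown in the diagram: the shared result is merged before the remaining-answers counter is decremented, inside the same locked region, so whichever thread observes zero also sees every update.

```c
#include <pthread.h>

struct gather {
    pthread_mutex_t lock;
    int call_cnt; /* answers still outstanding */
    /* ... shared dict / merged result lives here ... */
};

static void answer_cbk(struct gather *g, void (*merge)(struct gather *),
                       void (*unwind)(struct gather *))
{
    int last = 0;

    pthread_mutex_lock(&g->lock);
    merge(g);                    /* update shared state under the lock */
    last = (--g->call_cnt == 0); /* decrement only after merging */
    pthread_mutex_unlock(&g->lock);

    if (last)
        unwind(g); /* all answers merged; safe to reply */
}
```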
* cluster/dht: Fix stack overflow in readdir(p) (#2170) | Xavi Hernandez | 2021-02-24 | 2 | -10/+81
| | | | | | | | | | | | | | | | | | | | | | When parallel-readdir is enabled, readdir(p) requests sent by DHT can be immediately processed and answered in the same thread before the call to STACK_WIND_COOKIE() completes. This means that the readdir(p) cbk is processed synchronously. In some cases it may decide to send another readdir(p) request, which causes a recursive call. When some special conditions happen and the directories are big, it's possible that the number of nested calls is so high that the process crashes because of a stack overflow. This patch fixes this by not allowing nested readdir(p) calls. When a nested call is detected, it's queued instead of sending it. The queued request is processed when the current call finishes by the top level stack function. Fixes: #2169 Change-Id: Id763a8a51fb3c3314588ec7c162f649babf33099 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
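A schematic of the queue-instead-of-recurse approach with illustrative names (not the DHT implementation): nested requests are deferred and drained iteratively by the top-level caller, so the stack depth stays constant no matter how many continuations are needed.

```c
#include <stdbool.h>

struct dir_ctx {
    bool winding; /* a readdir is currently in flight */
    int queued;   /* continuations requested meanwhile */
};

static void send_readdir(struct dir_ctx *ctx, void (*wind)(struct dir_ctx *))
{
    if (ctx->winding) {
        ctx->queued++; /* nested call: defer instead of recursing */
        return;
    }
    ctx->winding = true;
    do {
        ctx->queued = 0;
        wind(ctx);             /* may synchronously request a continuation */
    } while (ctx->queued > 0); /* drain deferred requests iteratively */
    ctx->winding = false;
}
```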
* dht: Ongoing IO is failed during volume shrink operation (#2188) | mohit84 | 2021-02-24 | 1 | -11/+30
| In the commit (c878174) we introduced a check to avoid a stale layout issue. To avoid the stale layout issue, dht sets a key along with the layout at the time of winding a create fop, and posix validates the parent layout based on the key value. If the layout does not match, it throws an error. In the case of a volume shrink, the layout has been changed by the rebalance daemon, and if the layout does not match, dht is not able to wind the create fop successfully. Solution: To avoid the issue, populate the key only when dht winds the fop for the first time. After getting an error on the 2nd attempt, dht takes a lock and then reattempts to wind the fop. Fixes: #2187 Change-Id: Ie018386e7823a11eea415496bb226ca032453a55 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* fuse: add an option to specify the mount display name (#1989) | Amar Tumballi | 2021-02-22 | 1 | -29/+16
| * fuse: add an option to specify the mount display name There are two things this PR is fixing. 1. When a mount is specified with the volfile (-f) option, today you can't make out that it is from glusterfs, as only the volfile is added as 'fsname'; so we add it as 'glusterfs:/<volname>'. 2. Provide an option for admins who want to show a source of the mount other than the default (useful when one is not using 'mount.glusterfs' but their own scripts). Updates: #1000 Change-Id: I19e78f309a33807dc5f1d1608a300d93c9996a2f Signed-off-by: Amar Tumballi <amar@kadalu.io>
* glusterfs: the mount operation gets stuck when the vol doesn't exist (#2177) | zhangxyue | 2021-02-22 | 1 | -1/+4
| When passing a wrong volume name which doesn't exist, the mount gets stuck. The errno is initialized to 0 in glusterd-handshake.c; after initializing the errno, the process blocks in gf_fuse_umount.
* glusterd: Resolve use after free bug (#2181) | mohit84 | 2021-02-22 | 1 | -3/+2
| The commit 61ae58e67567ea4de8f8efc6b70a9b1f8e0f1bea introduced a coverity bug: an object is used after it has been cleaned up. Clean up the memory after coming out of the critical section. Fixes: #2180 Change-Id: Iee2050c4883a0dd44b8523bb822b664462ab6041 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* dht/linkfile - Remove unused code | Barak Sason Rofman | 2021-02-18 | 2 | -58/+0
| | | | | | | | A couple of methods are not being used, removing them. Change-Id: I5bb4b7f04bae9486cf9b7960cf5ed91d0b59c8c7 updates: #1000 Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
* glusterd: Rebalance cli is not showing correct status after reboot (#2172) | mohit84 | 2021-02-18 | 5 | -11/+91
| The rebalance CLI is not showing the correct status after a reboot. The CLI does not show the correct status because the defrag object is not valid at the time an rpc connection is created to show the status. The defrag object is not valid because, at the time glusterd starts, glusterd_restart_rebalance can be called almost at the same time by two different synctasks; glusterd gets a disconnect on the rpc object and cleans up the defrag object. Solution: To keep the defrag object valid, take a reference count before creating the defrag rpc object. Fixes: #1339 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> Change-Id: Ia284015d79beaa3d703ebabb92f26870a5aaafba
* mount: optimize parameter backup-volfile-servers (#2043) | chenglin130 | 2021-02-11 | 2 | -2/+48
| Optimize the backup-volfile-servers parameter to support IPv6 addresses. Fixes: #2042 Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
* posix: fix chmod error on symlinks (#2155) | Xavi Hernandez | 2021-02-11 | 1 | -5/+9
| | | | | | | | | After glibc 2.32, lchmod() is returning EOPNOTSUPP instead of ENOSYS when called on symlinks. The man page says that the returned code is ENOTSUP. They are the same in linux, but this patch correctly handles all errors. Fixes: #2154 Change-Id: Ib3bb3d86d421cba3d7ec8d66b6beb131ef6e0925 Signed-off-by: Xavi Hernandez xhernandez@redhat.com
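A sketch of the tolerant error handling described (illustrative helper names, not the exact posix xlator code), treating ENOSYS, EOPNOTSUPP and ENOTSUP alike:

```c
#define _GNU_SOURCE /* for lchmod() on some libc versions */
#include <errno.h>
#include <sys/stat.h>

static int mode_change_unsupported(int err)
{
    /* On Linux EOPNOTSUPP and ENOTSUP share a value; checking all three
     * keeps the intent explicit and stays correct where they differ. */
    return err == ENOSYS || err == EOPNOTSUPP || err == ENOTSUP;
}

static int safe_lchmod(const char *path, mode_t mode)
{
    if (lchmod(path, mode) == 0)
        return 0;
    if (mode_change_unsupported(errno))
        return 0; /* symlink permission bits are ignored on Linux anyway */
    return -1;    /* genuine failure: let the caller inspect errno */
}
```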
* glusterd: fix for starting brick on new port (#2090) | Nikhil Ladha | 2021-02-10 | 1 | -4/+2
| The errno set by the runner code was not correct when bind() fails because the port is already occupied in __socket_server_bind(). Fix: Updated the code to return the correct errno from __socket_server_bind() if bind() fails due to the EADDRINUSE error, and use the errno returned from runner_run() to retry allocating a new port for the brick process. Fixes: #1101 Change-Id: If124337f41344a04f050754e402490529ef4ecdc Signed-off-by: nik-redhat nladha@redhat.com
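Schematic of the errno propagation (illustrative helper names, not the actual __socket_server_bind()/runner code): hand the caller the real bind() errno so that only EADDRINUSE triggers a retry on another port.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Bind to a port and return 0 or the negated errno from bind(). */
static int bind_port(int sock, uint16_t port)
{
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) != 0)
        return -errno; /* e.g. -EADDRINUSE: the caller may retry */
    return 0;
}

static int bind_with_retry(int sock, uint16_t first_port, int attempts)
{
    for (int i = 0; i < attempts; i++) {
        int ret = bind_port(sock, (uint16_t)(first_port + i));
        if (ret == 0 || ret != -EADDRINUSE)
            return ret; /* success, or an error a new port won't fix */
    }
    return -EADDRINUSE;
}
```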
* stack.h/c: remove unused variable and reorder struct | Yaniv Kaul | 2021-02-08 | 1 | -2/+0
| - Removed unused ref_count variable
| - Reordered the struct to get related variables closer together.
| - Changed 'complete' from a '_Bool' to an 'int32_t'
|
| Before:
| ```
| struct _call_frame {
|     call_stack_t *root;           /*   0   8 */
|     call_frame_t *parent;         /*   8   8 */
|     struct list_head frames;      /*  16  16 */
|     void *local;                  /*  32   8 */
|     xlator_t *this;               /*  40   8 */
|     ret_fn_t ret;                 /*  48   8 */
|     int32_t ref_count;            /*  56   4 */
|
|     /* XXX 4 bytes hole, try to pack */
|
|     /* --- cacheline 1 boundary (64 bytes) --- */
|     gf_lock_t lock;               /*  64  40 */
|     void *cookie;                 /* 104   8 */
|     _Bool complete;               /* 112   1 */
|
|     /* XXX 3 bytes hole, try to pack */
|
|     glusterfs_fop_t op;           /* 116   4 */
|     struct timespec begin;        /* 120  16 */
|     /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
|     struct timespec end;          /* 136  16 */
|     const char *wind_from;        /* 152   8 */
|     const char *wind_to;          /* 160   8 */
|     const char *unwind_from;      /* 168   8 */
|     const char *unwind_to;        /* 176   8 */
|
|     /* size: 184, cachelines: 3, members: 17 */
|     /* sum members: 177, holes: 2, sum holes: 7 */
|     /* last cacheline: 56 bytes */
| ```
|
| After:
| ```
| struct _call_frame {
|     call_stack_t *root;           /*   0   8 */
|     call_frame_t *parent;         /*   8   8 */
|     struct list_head frames;      /*  16  16 */
|     struct timespec begin;        /*  32  16 */
|     struct timespec end;          /*  48  16 */
|     /* --- cacheline 1 boundary (64 bytes) --- */
|     void *local;                  /*  64   8 */
|     gf_lock_t lock;               /*  72  40 */
|     void *cookie;                 /* 112   8 */
|     xlator_t *this;               /* 120   8 */
|     /* --- cacheline 2 boundary (128 bytes) --- */
|     ret_fn_t ret;                 /* 128   8 */
|     glusterfs_fop_t op;           /* 136   4 */
|     int32_t complete;             /* 140   4 */
|     const char *wind_from;        /* 144   8 */
|     const char *wind_to;          /* 152   8 */
|     const char *unwind_from;      /* 160   8 */
|     const char *unwind_to;        /* 168   8 */
|
|     /* size: 176, cachelines: 3, members: 16 */
|     /* last cacheline: 48 bytes */
| ```
| Fixes: #2130 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
* cluster/ec: Change self-heal-window-size to 4MiB by default (#2071) | Xavi Hernandez | 2021-02-06 | 1 | -1/+1
| | | | | | | | | | | | | The current block size used for self-heal by default is 128 KiB. This requires a significant amount of management requests for a very small portion of data healed. With this patch the block size is increased to 4 MiB. For a standard EC volume configuration of 4+2, this means that each healed block of a file will update 1 MiB on each brick. Change-Id: Ifeec4a2d54988017d038085720513c121b03445b Updates: #2067 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* introduce microsleep to improve sleep precision (#2104) | renlei4 | 2021-02-06 | 1 | -3/+12
| * syncop: introduce microsecond sleep support Introduce the microsecond sleep function synctask_usleep, which can be used to improve precision compared to synctask_sleep. Change-Id: Ie7a15dda4afc09828bfbee13cb8683713d7902de * glusterd: use synctask_usleep in glusterd_proc_stop() glusterd_proc_stop() sleeps 1s waiting for the proc to stop before force-killing it, but in most cases the process can be stopped within 100ms. This patch uses synctask_usleep to check the proc's running state every 100ms instead of sleeping 1s, which can reduce the stop time by up to 1s. In some cases, like enabling quota on 100 volumes, the average execution time is reduced from 2500ms to 500ms. fixes: #2116 Change-Id: I645e083076c205aa23b219abd0de652f7d95dca7
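The idea in generic form (plain usleep/kill, not the synctask-based glusterd_proc_stop() code): after SIGTERM, poll every 100 ms for up to a second instead of sleeping a full second before escalating.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask a process to stop, polling its state every 100 ms; escalate to
 * SIGKILL only if it is still alive after roughly one second. */
static bool stop_process(pid_t pid)
{
    if (kill(pid, SIGTERM) != 0)
        return false;

    for (int i = 0; i < 10; i++) {
        usleep(100 * 1000); /* 100 ms */
        if (kill(pid, 0) != 0)
            return true;    /* process has already exited */
    }
    return kill(pid, SIGKILL) == 0; /* still alive after ~1s: force kill */
}
```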
* glusterd-volgen: Add functionality to accept any custom xlator (#1974) | Ryo Furuhashi | 2021-02-05 | 1 | -0/+155
| * glusterd-volgen: Add functionality to accept any custom xlator Add a new function which allows users to insert any custom xlators. It provides a way to add custom processing into file operations. Users can deploy a plugin (xlator shared object) and integrate it into glusterfsd. If users want to enable a custom xlator, do the following: 1. put the xlator object (.so file) into "XLATOR_DIR/user/" 2. set the option user.xlator.<xlator> to an existing xlator-name to specify its position in the graph 3. restart the gluster volume Options for a custom xlator can be set as "user.xlator.<xlator>.<optkey>". Fixes: #1943 Signed-off-by: Ryo Furuhashi <ryo.furuhashi.nh@hitachi.com> Co-authored-by: Yaniv Kaul <ykaul@redhat.com> Co-authored-by: Xavi Hernandez <xhernandez@users.noreply.github.com>
* dht: don't parse decommissioned-bricks option when in decommission (#2088) | Tamar Shacked | 2021-02-04 | 1 | -3/+6
| Scenario: 1) decommission start: the option decommissioned-bricks is added to the vol file and is parsed by dht. 2) another configuration change (like setting a new loglevel): the decommissioned-bricks option still exists in the vol file and is parsed again; this leads to invalid data. Fix: Prevent the parsing of "decommissioned-bricks" when decommission is running. This counts on the fact that once a decommission is running it cannot be started again. Fixes: #1992 Change-Id: I7a016750e2f865aee4cd620bd9033ec19421d47d Signed-off-by: Tamar Shacked <tshacked@redhat.com>
* cluster/dht: Allow fix-layout only on directories (#2109) | Pranith Kumar Karampuri | 2021-02-03 | 1 | -0/+4
| Problem: The fix-layout operation assumes that the path passed is a directory, i.e. layout->cnt == conf->subvolume_cnt. This will lead to a crash when fix-layout is attempted on a file. Fix: Disallow fix-layout on files. fixes: #2107 Change-Id: I2116b8773059f67e3260e9207e20eab3de711417 Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* cluster/afr: Change default self-heal-window-size to 1MB (#2068) | Pranith Kumar Karampuri | 2021-02-03 | 2 | -3/+9
| | | | | | | | | At the moment self-heal-window-size is 128KB. This leads to healing data in 128KB chunks. With the growth of data and the avg file sizes nowadays, 1MB seems like a better default. Change-Id: I70c42c83b16c7adb53d6b5762969e878477efb5c Fixes: #2067 Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* features/shard: delay unlink of a file that has fd_count > 0 (#1563) | Vinayak hariharmath | 2021-02-03 | 4 | -25/+285
| * features/shard: delay unlink of a file that has fd_count > 0 When there are multiple processes working on a file and any process unlinks that file, the unlink operation shouldn't harm the other processes working on it. This is POSIX-compliant behavior, and it should be supported also when the shard feature is enabled. Problem description: Let's consider 2 clients C1 and C2 working on a file F1 with 5 shards on a gluster mount, and the gluster server has 4 bricks B1, B2, B3, B4. Assume that the base file/shard is present on B1, the 1st and 2nd shards on B2, the 3rd and 4th shards on B3 and the 5th shard falls on B4. C1 has opened F1 in append mode and is writing to it. The write FOP goes to the 5th shard in this case. So the inode->fd_count = 1 on B1 (base file) and B4 (5th shard). C2 at the same time issued unlink on F1. On the server, the base file has fd_count = 1 (since C1 has opened the file), the base file is renamed under .glusterfs/unlink and returned to C2. Then unlink will be sent to the shards on all bricks, and the shards on B2 and B3, which have no open reference yet, will be deleted. C1 starts getting errors while accessing the remaining shards even though it has open references for the file. This is one such undefined behavior. Likewise we will encounter many such undefined behaviors as we don't have one global lock to access all shards as one. Of course, having such a global lock would lead to a performance hit as it reduces the window for parallel access of shards. Solution: The above undefined behavior can be addressed by delaying the unlink of a file when there are open references on it. File unlink happens in 2 steps. step 1: the client creates a marker file under .shard/remove_me and sends unlink on the base file to the server. step 2: on return from the server, the associated shards will be cleaned up and finally the marker file will be removed. In step 2, the background deletion process does a nameless lookup using the marker file name (the marker file is named after the gfid of the base file) in the glusterfs/unlink dir. If the nameless lookup is successful, then that means the gfid still has open fds and deletion of shards has to be delayed. If the nameless lookup fails, then that indicates the gfid is unlinked and there are no open fds on that file (the gfid path is unlinked during the final close on the file). The shards on which deletion is delayed are unlinked once all open fds are closed, and this is done through a thread which wakes up every 10 mins. Also removed active_fd_count from the inode structure, referring to fd_count wherever active_fd_count was used.
fixes: #1358 Change-Id: I8985093386e26215e0b0dce294c534a66f6ca11c Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com> * Update xlators/storage/posix/src/posix-entry-ops.c * Update fd.c Co-authored-by: Xavi Hernandez <xhernandez@users.noreply.github.com>
* features/shard: unlink fails due to nospace to mknod marker file | Vinayakswami Hariharmath | 2021-01-26 | 1 | -0/+20
| When we hit the max capacity of the storage space, shard_unlink() starts failing if there is no space left on the brick to create a marker file. shard_unlink() happens in the steps below: 1. create a marker file in the name of the gfid of the base file under BRICK_PATH/.shard/.remove_me 2. unlink the base file 3. shard_delete_shards() deletes the shards in the background by picking the entries in BRICK_PATH/.shard/.remove_me If marker file creation fails, then we can't really delete the shards, which is eventually a problem for users looking to make space by deleting unwanted data. Solution: Create the marker file by marking xdata = GLUSTERFS_INTERNAL_FOP_KEY, which is considered an internal op and is allowed to create files under reserved space. Fixes: #2038 Change-Id: I7facebab940f9aeee81d489df429e00ef4fb7c5d Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
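The shape of the fix, sketched with the glusterfs dict API (a sketch only; the exact shard call site differs): build an xdata dict that flags the marker-file create as an internal fop so posix allows it within the reserved space.

```c
#include <glusterfs/dict.h>
#include <glusterfs/glusterfs.h>

/* Sketch: build the xdata to pass with the marker-file create.
 * Any non-zero value for GLUSTERFS_INTERNAL_FOP_KEY marks the fop as
 * internal to glusterfs, which posix exempts from the reserve check. */
static dict_t *
make_internal_fop_xdata(void)
{
    dict_t *xdata = dict_new();

    if (!xdata)
        return NULL;

    if (dict_set_int32(xdata, GLUSTERFS_INTERNAL_FOP_KEY, 1) != 0) {
        dict_unref(xdata);
        return NULL;
    }
    return xdata; /* caller winds the marker-file create with this xdata */
}
```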
* glusterd: do not allow changing storage.linux-aio for running volume | Dmitry Antipov | 2021-01-22 | 1 | -10/+19
| Do not allow changing storage.linux-aio for a running volume; clean up the nearby storage.linux-io_uring error message as well. Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Updates: #2039
* AFR - fixing coverity issue (Argument cannot be negative) (#2026) | Barak Sason Rofman | 2021-01-22 | 1 | -1/+1
| CID 1430124 A negative value is being passed to a parameter that cannot be negative. Modified the value being passed. Change-Id: I06dca105f7a78ae16145b0876910851fb631e366 Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
* posix - fix coverity issue (Unchecked return value) | Barak Sason Rofman | 2021-01-21 | 2 | -2/+9
| CID 1291733 The return value of the method pthread_cancel was not being checked. Added a return value check and proper error handling. Change-Id: I8c52b0e462461fc59718deb3b7c2f1b4e55613c7 updates: #1060 Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
* glusterd - fixing coverity issues (#1947) | Barak Sason Rofman | 2021-01-21 | 1 | -16/+26
| | | | | | | | | | | | | | * glusterd - fixing coverity issues - Dereference after null check (CID 1437686) - Dereference null return value (CID 1437687) - A check for the return value of a memory allocation was missing, added it. - A value of a pointer was being dereferenced after a NULL-pointer check. With this change the pointer is no longer dereferenced. Change-Id: I10bf8a2cb08612981dbb788315dad7dbb4efe2cb updates: #1060 Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
* posix: implement AIO-based GF_FOP_FSYNC (#1953) | Dmitry Antipov | 2021-01-21 | 1 | -65/+179
| | | | | | | | | | Implement GF_FOP_FSYNC using io_submit() with IOCB_CMD_FSYNC and IOCB_CMD_FDSYNC operations. Refactor common code to posix_aio_cb_init() and posix_aio_cb_fini() as suggested by Ravishankar N. Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Updates: #1952
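A minimal standalone sketch of an AIO-based fsync using libaio (an illustration only; the posix xlator integrates with its shared io context and completion thread, and kernel/filesystem support for AIO fsync is required):

```c
#include <libaio.h>

/* Submit a single IOCB_CMD_FSYNC or IOCB_CMD_FDSYNC and wait for it.
 * Returns 0 on success, -1 on setup/submit failure, or the operation's
 * negative errno reported in the completion event. */
static int aio_fsync_once(int fd, int datasync)
{
    io_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;

    if (io_setup(1, &ctx) != 0)
        return -1;

    if (datasync)
        io_prep_fdsync(&cb, fd); /* IOCB_CMD_FDSYNC */
    else
        io_prep_fsync(&cb, fd);  /* IOCB_CMD_FSYNC */

    if (io_submit(ctx, 1, cbs) != 1 ||
        io_getevents(ctx, 1, 1, &ev, NULL) != 1) {
        io_destroy(ctx);
        return -1;
    }
    io_destroy(ctx);
    return (int)ev.res; /* 0 on success, -errno on failure */
}
```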
* Dereference after null reference (CID:1124543) (#2023) | nishith-vihar | 2021-01-20 | 1 | -6/+0
| The 'this' pointer was being dereferenced after a null check. This change avoids it. Change-Id: I7dedee44c08df481d2a037eb601f3d5c4d9284f5 Updates: #1060 Signed-off-by: Nishith Vihar Sakinala <nsakinal@redhat.com>