glusterfs.git - GlusterFS is a distributed file-system capable of scaling to several petabytes. It aggregates various storage bricks over Infiniband RDMA or TCP/IP interconnect into one large parallel network file system.

	Commit message (Collapse)	Author	Age	Files	Lines
*	cli: syntax check for arbiter volume creation (#2207) (#2222)	Ravishankar N	2021-03-23	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \|	commit 8e7bfd6a58b444b26cb50fb98870e77302f3b9eb changed the syntax for arbiter volume creation to 'replica 2 arbiter 1', while still allowing the old syntax of 'replica 3 arbiter 1'. But while doing so, it also removed a conditional check, thereby allowing replica count > 3. This patch fixes it. Updates: #2192 Change-Id: Ie109325adb6d78e287e658fd5f59c26ad002e2d3 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	glusterd: fix bug in enabling granular-entry-heal (#1752)	Ravishankar N	2020-11-05	1	-5/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	commit f5e1eb87d4af44be3b317b7f99ab88f89c2f0b1a meant to enable the volume option only for replica volumes but inadvertently enabled it for all volume types. Fixing it now. Also found a bug in glusterd where disabling the option on plain distribute was succeeding even though setting it in the fist place fails. Fixed that too. Fixes: #1483 Change-Id: Icb6c169a8eec44cc4fb4dd636405d3b3485e91b4 Reported-by: Sheetal Pamecha <spamecha@redhat.com> Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	xlators: misc conscious language changes (#1715)	Ravishankar N	2020-11-02	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	core:change xlator_t->ctx->master to xlator_t->ctx->primary afr: just changed comments. meta: change .meta/master to .meta/primary. Might break scripts. changelog: variable/function name changes only. These are unrelated to geo-rep. Fixes: #1713 Change-Id: I58eb5fcd75d65fc8269633acc41313503dccf5ff Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	extras/rebalance: Script to perform directory rebalance (#1676)	Pranith Kumar Karampuri	2020-10-30	1	-0/+129
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* extras/rebalance: Script to perform directory rebalance How should the script be executed? $ /path/to/directory-rebalance.py <dir-to-rebalance> will do rebalance just for that directory. The script assumes that fix-layout operation is completed for all the directories present inside the <dir-to-rebalance> How does it work? For the given directory path that needs to be rebalanced, full crawl is performed and the files that need to be healed and the size of each file is first written to the index. Once building the index is completed, the index is read and for each file the script executes equivalent of setfattr -n trusted.distribute.migrate-data -v 1 <path/to/file> Why does the script take two passes? Printing a sensible ETA has been a primary goal of the script. Without knowing the approximate size that will be rebalanced, it is difficult to find ETA. Hence the script does one pass to find files, sizes which it writes to the index file and then the next pass is done on the index file. It takes a minute or two for the ETA to converge but in our testing it has been giving a reasonable ETA What versions does the script support? For the script to work correctly, dht should handle "trusted.distribute.migrate-data" setxattr correctly. fixes: #1654 Change-Id: Ie5070127bd45f1a1b9cd18ed029e364420c971c1 Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
*	cluster/dht: Perform migrate-file with lk-owner (#1581)	Pranith Kumar Karampuri	2020-10-29	3	-0/+95
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* cluster/dht: Perform migrate-file with lk-owner 1) Added GF_ASSERT() calls in client-xlator to find these issues sooner. 2) Fuse is setting zero-lkowner with len as 8 when the fop doesn't have any lk-owner. Changed this to have len as 0 just as we have in fops triggered from xlators lower to fuse. * syncop: Avoid frame allocation if we can * cluster/dht: Set lkowner in daemon rebalance code path * cluster/afr: Set lkowner for ta-selfheal * cluster/ec: Destroy frame after heal is done * Don't assert for lk-owner in lk call * set lkowner for mandatory lock heal tests fixes: #1529 Change-Id: Ia803db6b00869316893abb1cf435b898eec31228 Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
*	tests: exclude more contrib/fuse-lib objects (#1694)	Dmitry Antipov	2020-10-27	1	-0/+3
\| \| \| \| \| \| \|	Exclude more contrib/fuse-lib objects to avoid silly tests/basic/0symbol-check.t breakage. Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Fixes: #1692
*	glusterd/afr: enable granular-entry-heal by default (#1621)	Ravishankar N	2020-10-22	8	-3/+506
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	1. The option has been enabled and tested for quite some time now in RHHI-V downstream and I think it is safe to make it 'on' by default. Since it is not possible to simply change it from 'off' to 'on' without breaking rolling upgrades, old clients etc., I have made it default only for new volumes starting from op-verison GD_OP_VERSION_9_0. Note: If you do a volume reset, the option will be turned back off. This is okay as the dir's gfid will be captured in 'xattrop' folder and heals will proceed. There might be stale entries inside entry-changes' folder, which will be removed when we enable the option again. 2. I encountered a cust. issue where entry heal was pending on a dir. with 236436 files in it and the glustershd.log output was just stuck at "performing entry selfheal", so I have added logs to give us more info in DEBUG level about whether entry heal and data heal are progressing (metadata heal doesn't take much time). That way, we have a quick visual indication to say things are not 'stuck' if we briefly enable debug logs, instead of taking statedumps or checking profile info etc. Fixes: #1483 Change-Id: I4f116f8c92f8cd33f209b758ff14f3c7e1981422 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	mount/fuse: Fix graph-switch when reader-thread-count is set	Pranith Kumar K	2020-10-05	1	-0/+65
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: The current graph-switch code sets priv->handle_graph_switch to false even when graph-switch is in progress which leads to crashes in some cases Fix: priv->handle_graph_switch should be set to false only when graph-switch completes. fixes: #1539 Change-Id: I5b04f7220a0a6e65c5f5afa3e28d1afe9efcdc31 Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
*	cluster/afr: Heal directory rename without rmdir/mkdir	Pranith Kumar K	2020-04-13	4	-0/+708
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem1: When a directory is renamed while a brick is down entry-heal always did an rm -rf on that directory on the sink on old location and did mkdir and created the directory hierarchy again in the new location. This is inefficient. Problem2: Renamedir heal order may lead to a scenario where directory in the new location could be created before deleting it from old location leading to 2 directories with same gfid in posix. Fix: As part of heal, if oldlocation is healed first and is not present in source-brick always rename it into a hidden directory inside the sink-brick so that when heal is triggered in new-location shd can rename it from this hidden directory to the new-location. If new-location heal is triggered first and it detects that the directory already exists in the brick, then it should skip healing the directory until it appears in the hidden directory. Credits: Ravi for rename-data-loss.t script Fixes: #1211 Change-Id: I0cba2006f35cd03d314d18211ce0bd530e254843 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	gfapi: Move the SECURE_ACCESS_FILE check out of glfs_mgmt_init	Môshe van der Sterre	2020-09-28	3	-0/+218
\| \| \| \| \| \| \| \|	glfs_mgmt_init is only called for glfs_set_volfile_server, but secure_mgmt is also required to use glfs_set_volfile with SSL. fixes: #829 Change-Id: Ibc769fe634d805e085232f85ce6e1c48bf4acc66
*	metadisp: new translator for data and metadata separation	Sheena Artrip	2020-01-29	6	-0/+641
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Summary: feature/metadisp is an xlator for performing "metadata dispersal" across multiple children. it does this by flattening the complex POSIX paths into /$GFID style paths, then forwarding the metadata operations to its first child and forwarding the data operations to its second child. The purpose of this xlator is to allow separation of data and metadata, in cases where metadata might be stored in another format (embedded kv?), on another disk (ssd), on another host (dht2). Change-Id: I392c8bd0c867a3237d144aea327323f700a2728d Updates: #816 Signed-Off-By: Sheena Artrip <sheenobu@fb.com> Tested-By: Amar Tumballi <amar@kadalu.io>
*	tests: provide an option to mark tests as 'flaky'	Amar Tumballi	2020-08-18	5	-799/+0
\| \| \| \| \| \| \| \| \| \| \| \| \|	* also add some time gap in other tests to see if we get things properly * create a directory 'tests/000/', which can host any tests, which are flaky. * move all the tests mentioned in the issue to above directory. * as the above dir gets tested first, all flaky tests would be reported quickly. * change `run-tests.sh` to continue tests even if flaky tests fail. Reference: gluster/project-infrastructure#72 Updates: #1000 Change-Id: Ifdafa38d083ebd80f7ae3cbbc9aa3b68b6d21d0e Signed-off-by: Amar Tumballi <amar@kadalu.io>
*	afr/split-brain: fix client side split-brain resolution when quorum is enabled	Mohammed Rafi KC	2020-07-29	1	-0/+124
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: If we set favourite child policy, then automatic split-brain resolution should work in all cases. This was failing when quorum count was set to a non-zero value. The initial lookup before the read txn was failing with ENOTCONN. Since we don't have a readable subvol, we were failing it. We were only looking to the split brain resolution choice set through the cli command. Fix: We will now consider the favourite child policy if split-brain choice has not been set via cli command. Change-Id: Id2016c3a90d0763ac6f1a0131571053f595576f0 Fixes: #1404 Signed-off-by: Mohammed Rafi KC <rafi.kavungal@iternity.com>
*	cluster/afr: Delay post-op for fsync	Pranith Kumar K	2020-05-29	3	-0/+175
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: AFR doesn't delay post-op for fsync fop. For fsync heavy workloads this leads to un-necessary fxattrop/finodelk for every fsync leading to bad performance. Fix: Have delayed post-op for fsync. Add special flag in xdata to indicate that afr shouldn't delay post-op in cases where either the process will terminate or graph-switch would happen. Otherwise it leads to un-necessary heals when the graph-switch/process-termination happens before delayed-post-op completes. Fixes: #1253 Change-Id: I531940d13269a111c49e0510d49514dc169f4577 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	open-behind: rewrite of internal logic	Xavi Hernandez	2020-05-12	4	-0/+871
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There was a critical flaw in the previous implementation of open-behind. When an open is done in the background, it's necessary to take a reference on the fd_t object because once we "fake" the open answer, the fd could be destroyed. However as long as there's a reference, the release function won't be called. So, if the application closes the file descriptor without having actually opened it, there will always remain at least 1 reference, causing a leak. To avoid this problem, the previous implementation didn't take a reference on the fd_t, so there were races where the fd could be destroyed while it was still in use. To fix this, I've implemented a new xlator cbk that gets called from fuse when the application closes a file descriptor. The whole logic of handling background opens have been simplified and it's more efficient now. Only if the fop needs to be delayed until an open completes, a stub is created. Otherwise no memory allocations are needed. Correctly handling the close request while the open is still pending has added a bit of complexity, but overall normal operation is simpler. Change-Id: I6376a5491368e0e1c283cc452849032636261592 Fixes: #1225 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
*	io-cache,quick-read: deprecate volume options with flawed semantics or naming	Csaba Henk	2020-05-14	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- performance.cache-size has a flawed semantics, as it's dispatched on two independent translators, io-cache and quick-read. - performance.qr-cache-timeout has a confusing name, as other options affecting quick-read have an unabbreviated "quick-read-..." prefix in their names. We keep these options with unchanged operation, but in the help output we indicate their deprecation. The following better alternatives are introduced: - performance.io-cache-size to tune cache-size option of io-cache - performance.quick-read-cache-size to tune cache-size option of quick-read - performance.quick-read-cache-timeout as a preferred synonym for performance.qr-cache-timeout Fixes: #952 Change-Id: Ibd04fb638de8cac450ba992ad8a415154f9f4281 Signed-off-by: Csaba Henk <csaba@redhat.com>
*	dht - sparse files rebalance enhancements	Barak Sason Rofman	2020-05-06	1	-0/+51
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently data migration in rebalance reads sparse file sequentially, disregarding which segments are holes and which are data. This can lead to extremely long migration time for large sparse file. Data migration mechanism needs to be enhanced so only data segments are read and migrated. This can be achieved using lseek to seek for holes and data in the file. This enhancement is a consequence of https://bugzilla.redhat.com/show_bug.cgi?id=1823703 fixes: #1222 Change-Id: If5f448a0c532926464e1f34f504c5c94749b08c3 Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
*	mgmt/glusterd: Stop old shd before increasing replica count	Pranith Kumar K	2020-05-13	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: In add-brick that increases replica count SHD was restarted after pending xattrs are set on the new bricks and adding bricks. But before restarting SHD there is a possibility that old SHD would do a scan on root-directory see no heal is needed and delete index for root-dir leading to no heals until lookup is executed on the mount Fix: Stop shd, perform pending-xattr setting/adding new bricks and then restart shd Fixes: #1240 Change-Id: I94fd7c6c909211b597185dfe097a559db6c0d00f Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	tests: Disable client-heals	Pranith Kumar K	2020-05-15	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: ok 32 [ 11/ 9] < 46> 'gf_rm_file_and_gfid_link /d/backends/patchy0 del-file' not ok 33 [ 13/ 131] < 48> '! dd if=/dev/zero of=/mnt/glusterfs/0/del-file bs=1M count=1 oflag=direct' -> '' The assumption in the test above is that the file wouldn't exist when dd happens. But heal can lead to creation of the file in some cases leading to spurious failures. Fix: Disable client side heal. Fixes: #1245 Change-Id: I96b2b45528f9dfb3199d503a467cafafba9b387f Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	Fix ./tests/basic/fencing/afr-lock-heal-basic.t failure	Pranith Kumar K	2020-05-11	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In brick-mux tests, all bricks of the volume have same pid. "generate_brick_statedump" cleans up the older statedumps with same brick pid. So successive calls to this function will delete previous brick's statedump as all bricks share same pid. So grep calls to the statedump were failing leading to failure of the .t To fix this, stored the result we need from statedump before calling next brick's statedump Fixes: #1234 Change-Id: I824ed4dff79e7242b3e980364836b9af0e87a6ee Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	tests: skip tests on absence of reflink in xfs	Pranith Kumar K	2020-05-06	1	-7/+9
\| \| \| \| \| \|	Fixes: #1223 Change-Id: I36cb72d920ffd77405051546615c5262c392daef Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	afr: event gen changes	Ravishankar N	2020-04-10	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The general idea of the changes is to prevent resetting event generation to zero in the inode ctx, since event gen is something that should follow 'causal order'. Change #1: For a read txn, in inode refresh cbk, if event_generation is found zero, we are failing the read fop. This is not needed because change in event gen is only a marker for the next inode refresh to happen and should not be taken into account by the current read txn. Change #2: The event gen being zero above can happen if there is a racing lookup, which resets even get (in afr_lookup_done) if there are non zero afr xattrs. The resetting is done only to trigger an inode refresh and a possible client side heal on the next lookup. That can be acheived by setting the need_refresh flag in the inode ctx. So replaced all occurences of resetting even gen to zero with a call to afr_inode_need_refresh_set(). Change #3: In both lookup and discover path, we are doing an inode refresh which is not required since all 3 essentially do the same thing- update the inode ctx with the good/bad copies from the brick replies. Inode refresh also triggers background heals, but I think it is okay to do it when we call refresh during the read and write txns and not in the lookup path. The .ts which relied on inode refresh in lookup path to trigger heals are now changed to do read txn so that inode refresh and the heal happens. Change-Id: Iebf39a9be6ffd7ffd6e4046c96b0fa78ade6c5ec Fixes: #1179 Signed-off-by: Ravishankar N <ravishankar@redhat.com> Reported-by: Erik Jacobson <erik.jacobson at hpe.com>
*	tests: Fix spurious failure of tests/basic/quick-read-with-upcall.t	Pranith Kumar K	2020-04-22	1	-10/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: The test is failing at 14:56:41 ok 13, LINENUM:38 14:56:41 not ok 14 Got "test-message0" instead of "test-message1", LINENUM:41 14:56:41 FAILED COMMAND: test-message1 cat /mnt/glusterfs/1/test.txt This happens because fuse sometimes doesn't send 'read' fop to glusterfs and is served from cache. Fix: Mount with direct-io-mode=yes so that read is always received by gluster Fixes: #1190 Change-Id: I369e2024a85dc492dc24c7579b161fb965f55d19 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	tests: do not truncate file offsets and sizes to 32-bit	Dmitry Antipov	2020-04-10	1	-3/+4
\| \| \| \| \| \| \| \| \| \|	Do not truncate file offsets and sizes to 32-bit to prevent tests from spurious failures on >2Gb files. Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Change-Id: I2a77ea5f9f415249b23035eecf07129f19194ac2 Fixes: #1161
*	Posix: Optimize posix code to improve file creation	Mohit Agrawal	2019-12-19	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: Before executing a fop in POSIX xlator it builds an internal path based on GFID.To validate the path it call's (l)stat system call and while .glusterfs is heavily loaded kernel takes time to lookup inode and due to that performance drops Solution: In this patch we followed two ways to improve the performance. 1) Keep open fd specific to first level directory(gfid[0]) in .glusterfs, it would force to kernel keep the inodes from all those files in cache. In case of memory pressure kernel won't uncache first level inodes. We need to open 256 fd's per brick to access the entry faster. 2) Use at based call's to access relative path to reduce path based lookup time. Note: To verify the patch we have executed kernel untar 100 times on 6 different clients after enabling metadata group-cache and some other option.We were getting more than 20 percent improvement in kenel untar after applying the patch. Credits: Xavi Hernandez <xhernandez@redhat.com> Change-Id: I1643e6b01ed669b2bb148d02f4e6a8e08da45343 updates: #891 Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
*	cluster/ec: Add test for reset-brick command	Ashish Pandey	2020-04-01	1	-0/+50
\| \| \| \| \| \| \| \| \| \| \|	Following tests are done - 1 - After finishing reset-brick all the bricks should be up. 2 - Heal should be completed. 3 - Check number of entries present on brick which was reset. Change-Id: I9314bed180293a99d400d94bb8cc7ece999da29e Updates: #1144
*	snap_scheduler: python3 compatibility and new test case	Sunny Kumar	2020-03-26	1	-0/+49
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: "snap_scheduler.py init" command failing with the below traceback: [root@dhcp43-104 ~]# snap_scheduler.py init Traceback (most recent call last): File "/usr/sbin/snap_scheduler.py", line 941, in <module> sys.exit(main(sys.argv[1:])) File "/usr/sbin/snap_scheduler.py", line 851, in main initLogger() File "/usr/sbin/snap_scheduler.py", line 153, in initLogger logfile = os.path.join(process.stdout.read()[:-1], SCRIPT_NAME + ".log") File "/usr/lib64/python3.6/posixpath.py", line 94, in join genericpath._check_arg_types('join', a, *p) File "/usr/lib64/python3.6/genericpath.py", line 151, in _check_arg_types raise TypeError("Can't mix strings and bytes in path components") from None TypeError: Can't mix strings and bytes in path components Solution: Added the 'universal_newlines' flag to Popen to support backward compatibility. Added a basic test for snapshot scheduler. Change-Id: I78e8fabd866fd96638747ecd21d292f5ca074a4e Fixes: #1134 Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
*	tests: fix afr-lock-heal-* failure	Ravishankar N	2020-02-21	2	-12/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When brick-mux is enabled: i)brick statedumps seem to be listing the same lock information multiple times. While that is getting fixed, make changes to the .ts to check for unique values. ii)detecting a brick as online via brick_up_status() seems to be taking longer time when delaygen is enabled. Hence bump up PROCESS_UP_TIMEOUT to 90 for afr-lock-heal-advanced.t Updates: #1042 Change-Id: Ife76008f7a99dd1f1fe5791a32577366baaab4b3 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	cluster/afr: Fixes for halo	Pranith Kumar K	2020-02-04	1	-0/+61
\| \| \| \| \| \| \| \| \| \| \|	Current implementation assumes that ping-event will come after connect event but that may not be the case in the cases where after socket connection fds need to be re-opened which would consume more time. So handle any order of the ping/child-up events. fixes: bz#1800583 Change-Id: I6bcdc0caa503bdc039ef2b4739fbf4afae121f05 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	mgmt/glusterd: Adding validation for statedump path	yatipadia	2019-12-31	1	-5/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Description of problem: server.statedump-path is the path where statedumps are stored, by default it is /var/run/gluster. And can be set to any valid directory path. It was observed that server.statedump-path was also accepting file, non-existent file and non-existent paths as well. And statedump command was successful even when statedumps with all the invalid paths. a. A file b. A non-existent path Solution: Added a validation function in gluster-volume-set.c which will allow volume set to success if it's a valid directory and in all other cases, volume set should fail. Fixes: bz#1787122 Change-Id: Ia66e2b3d35f23efc5444c829928779a79d827b42 Signed-off-by: yatipadia <ypadia@redhat.com> Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
*	Updating gluster manual.	Rishubh Jain	2019-08-18	1	-0/+4
\| \| \| \| \| \| \| \| \| \|	Adding disperse-data to gluster manual under volume create command Change-Id: Ic9eb47c9e71a1d7a11af9394c615c8e90f8d1d69 Fixes: bz#1668239 Signed-off-by: Rishubh Jain <risjain@redhat.com> Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
*	feature/changelog: Avoid thread creation if xlator is not enabled	Mohit Agrawal	2018-09-29	1	-4/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: Changelog creates threads even if the changelog is not enabled Background: Changelog xlator broadly does two things 1. Journalling - Cosumers are geo-rep and glusterfind 2. Event Notification for registered events like (open, release etc) - Consumers are bitrot, geo-rep The existing option "changelog.changelog" controls journalling and there is no option to control event notification and is enabled by default. So when bitrot/geo-rep is not enabled on the volume, threads and resources(rpc and rbuf) related to event notifications consumes resources and cpu cycle which is unnecessary. Solution: The solution is to have two different options as below. 1. changelog-notification : Event notifications 2. changelog : Journalling This patch introduces the option "changelog-notification" which is not exposed to user. When either bitrot or changelog (journalling) is enabled, it internally enbales 'changelog-notification'. But once the 'changelog-notification' is enabled, it will not be disabled for the life time of the brick process even after bitrot and changelog is disabled. As of now, rpc resource cleanup has lot of races and is difficult to cleanup cleanly. If allowed, it leads to memory leaks and crashes on enable/disable of bitrot or changelog (journal) in a loop. Hence to be safer, the event notification is not disabled within lifetime of process once enabled. Change-Id: Ifd00286e0966049e8eb9f21567fe407cf11bb02a Updates: #475 Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
*	ctime: Fix ctime inconsisteny with utimensat	Kotresh HR	2019-11-18	1	-0/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: When touch is used to create a file, the ctime is not matching atime and mtime which ideally should match. There is a difference in nano seconds. Cause: When touch is used modify atime or mtime to current time (UTIME_NOW), the current time is taken from kernel. The ctime gets updated to current time when atime or mtime is updated. But the current time to update ctime is taken from utime xlator. Hence the difference in nano seconds. Fix: When utimesat uses UTIME_NOW, use the current time from kernel. fixes: bz#1773530 Change-Id: I9ccfa47dcd39df23396852b4216f1773c49250ce Signed-off-by: Kotresh HR <khiremat@redhat.com>
*	glusterfsd-mgmt: unify read and write tests	Yaniv Kaul	2019-09-27	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	1. Both read and write tests required writing first. Either just writing (write test) or write and then read (read test). So the code is now unified. 2. There's no reason to read zeros from /dev/zero. Just use a CALLOC'ed buffer. I don't think we should read and write zeros, but I did not change the code yet (I think compression and/or dedup will offset results) It appears neither read-perf nor write-perf were tested, so added basic tests for them. Change-Id: I24b1f249fa0335ed652a8982e99c0687d940230e updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
*	afr: lock healing changes	Ravishankar N	2019-10-16	4	-0/+612
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Implements lock healing for gluster-block fencing use case. If mandatory lock is enabled: - Add domain lock/unlock to afr_lk fop. - Maintain a list of locks to be healed in afr_private_t. - Add lock to the list if afr_lk(F_SETLK or F_SETLKW) was sucessful. - Remove it from the list during afr_lk(F_UNLCK). - On child_down, mark lock as needing heal on that child. If lock is lost on quorum no. of bricks, remove it from the list and mark fd bad. - For fds marked as bad, fail the subsequent fd based fops. - On parent up, traverse the list and heal the locks IFF the client is the lk owner and has quorum. (shd does not heal any locks). updates: #613 Change-Id: I03c46ceaea30f5e6236d5ec13f71d843d827f1bc Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	tests: add tests for bz#1758878	Ravishankar N	2019-10-10	3	-34/+62
\| \| \| \| \| \| \| \|	updates: bz#1193929 Change-Id: I517fa29e57bde970c2c22ebc2de80fec1509cd2d Signed-off-by: Sanju Rakonde <srakonde@redhat.com> Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	tests: Remove hard-coding of lower bound	Pranith Kumar K	2019-10-15	1	-4/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: Total blocks of XFS partition are dependent on XFS parameters used for formatting. So hardcoded lower-bound may not be the lower bound for different default parameters. Used blocks are lesser on my machine for a freshly formatted XFS partition compared to where the test fails. On my machine: Filesystem 1024-blocks Used Available Capacity Mounted on /dev/loop0 98980 5472 93508 6% /d/backends/patchy1 /dev/loop1 98980 5472 93508 6% /d/backends/patchy2 On a machine where this test fails: Filesystem 1024-blocks Used Available Capacity Mounted on /dev/loop0 96928 6112 90816 7% /d/backends/patchy1 /dev/loop1 96928 6112 90816 7% /d/backends/patchy2 Fix: Make lower bound 2% less than the brick-blocks available fixes: bz#1761759 Change-Id: I974d5e75766f7ff44780a2e4c2a19cd5d1d14a79 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	cluster/afr: Add afr_seek to fops table	Pranith Kumar K	2019-10-10	3	-1/+57
\| \| \| \| \| \|	fixes: bz#1760189 Change-Id: Iffbf8d6f4c50b8e2de8364658697bdbe96549f5d Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	cluster/ec: Implement read-mask feature	Pranith Kumar K	2019-09-10	1	-0/+114
\| \| \| \| \| \|	fixes: #725 Change-Id: Iaaefe6f49c8193c476b987b92df6bab3e2f62601 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	read-ahead/io-cache: turn off by default	Raghavendra Gowdappa	2019-02-12	2	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We've found perf xlators io-cache and read-ahead not adding any performance improvement. At best read-ahead is redundant due to kernel read-ahead and at worst io-cache is degrading the performance for workloads that doesn't involve re-read. Given that VFS already have both these functionalities, this patch makes these two translators turned off by default for native fuse mounts. For non-native fuse mounts like gfapi (NFS-ganesha/samba) we can have these xlators on by having custom profiles. Change-Id: Ie7535788909d4c741844473696f001274dc0bb60 Signed-off-by: Raghavendra Gowdappa <rgowdapp@redhat.com> fixes: bz#1676479
*	gfapi: 'glfs_h_creat_open' - new API to create handle and open fd	Soumya Koduri	2019-09-18	2	-0/+145
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Right now we have two separate APIs, one - 'glfs_h_creat_handle' to create handle & another - 'glfs_h_open' to create a glfd to return to application Having two separate routines can result in access errors while trying to create and write into a read-only file. Since a fd is opened even during file/directory creation, introducing a new API to make these two operations atomic i.e, which can create both handle & fd and pass them to application Change-Id: Ibf513fcfcdad175f4d7eb6fa7a61b8feec6d33b5 Fixes: bz#1753569 Signed-off-by: Soumya Koduri <skoduri@redhat.com>
*	ctime/rebalance: Heal ctime xattr on directory during rebalance	Kotresh HR	2019-07-29	6	-0/+477
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	After add-brick and rebalance, the ctime xattr is not present on rebalanced directories on new brick. This patch fixes the same. Note that ctime still doesn't support consistent time across distribute sub-volume. This patch also fixes the in-memory inconsistency of time attributes when metadata is self healed. Change-Id: Ia20506f1839021bf61d4753191e7dc34b31bb2df fixes: bz#1734026 Signed-off-by: Kotresh HR <khiremat@redhat.com>
*	tests/shd: Mark "tests/basic/volume-scale-shd-mux.t" as bad	Mohammed Rafi KC	2019-09-13	1	-0/+2
\| \| \| \| \| \| \| \| \|	This test case is failing in upstream. Marking this test as bad for now. Change-Id: I014c67628c14683c32a3c1dd770b10aaf35ad4cc Updates: bz#1752331 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
*	libgfapi: return correct errno on invalid volume name	Sheetal Pamecha	2019-08-19	3	-8/+95
\| \| \| \| \| \| \| \| \| \| \| \| \|	glfs_init when called with volume name prefixed by '/' sets errno to 0. Setting errno to EINVAL to resolve the issue. Also volname is a parameter to glfs_new. Thus, validating volname in glfs_new itself and returning EINVAL from that function fixes: bz#1507896 Change-Id: I0d4d2423e26cc07644d50ec8cce788ecc639203d Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
*	tests: revive back volume-scale-shd-mux.t	Atin Mukherjee	2019-07-05	5	-30/+28
\| \| \| \| \| \| \|	Fixes: bz#1708929 Change-Id: I9cc81a9047ff874df752ca5552e00bf033485bd8 Signed-off-by: Atin Mukherjee <amukherj@redhat.com> Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
*	cluster/ec: quorum-count implementation	Pranith Kumar K	2019-09-05	2	-0/+215
\| \| \| \| \| \|	fixes: #721 Change-Id: I5333540e3c635ccf441cf1f4696e4c8986e38ea8 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	cluster/ec: Fail fsync/flush for files on update size/version failure	Pranith Kumar K	2019-09-04	2	-0/+150
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: If update size/version is not successful on the file, updates on the same stripe could lead to data corruptions if the earlier un-aligned write is not successful on all the bricks. Application won't have any knowledge of this because update size/version happens in the background. Fix: Fail fsync/flush on fds that are opened before update-size-version went bad. fixes: bz#1748836 Change-Id: I9d323eddcda703bd27d55f340c4079d76e06e492 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	graph/cleanup: Fix race in graph cleanup	Mohammed Rafi KC	2019-07-05	2	-3/+68
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We were unconditionally cleaning up the grap when we get child_down followed by parent_down. But this is prone to race condition when some of the bricks are already disconnected. In this case, even before the last child down is executed in the client xlator code,we might have freed the graph. Because the child_down event is alreadt recevied. To fix this race, we have introduced a check to see if all client xlator have cleared thier reconnect chain, and called the child_down for last time. Change-Id: I7d02813bc366dac733a836e0cd7b14a6fac52042 fixes: bz#1727329 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
*	tests/dht: Add a test file for file renames	N Balachandran	2018-09-07	1	-0/+1021
\| \| \| \| \| \| \| \| \|	Test the various combinations of hashed and cached subvols for the src and dst. Change-Id: I41416f9e5f2b7ea1c880d1913fdd6576da1ee868 fixes: bz#1626543 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	tests/shd: Break down shd mux tests into multiple .t file	Mohammed Rafi KC	2019-07-31	4	-149/+191
\| \| \| \| \| \| \| \| \| \|	Test file tests/basic/shd-mux.t was taking longer than 200 seconds in some iterations. So this patch is breaking the test case to three files Change-Id: I1430f58798f876edf6368d6f4b8b5a75f0114c31 Updates: bz#1708929 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>