glusterfs.git - GlusterFS is a distributed file-system capable of scaling to several petabytes. It aggregates various storage bricks over Infiniband RDMA or TCP/IP interconnect into one large parallel network file system.

	Commit message (Collapse)	Author	Age	Files	Lines
*	mount/fuse: Fix graph-switch when reader-thread-count is set	Pranith Kumar K	2020-10-05	1	-8/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: The current graph-switch code sets priv->handle_graph_switch to false even when graph-switch is in progress which leads to crashes in some cases Fix: priv->handle_graph_switch should be set to false only when graph-switch completes. fixes: #1539 Change-Id: I5b04f7220a0a6e65c5f5afa3e28d1afe9efcdc31 Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
*	xlators: prefer libglusterfs time API	Dmitry Antipov	2020-09-07	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	Prefer timespec_now_realtime() and gf_time() over clock_gettime() and time(), use gf_tvdiff() and gf_tsdiff() where appropriate, drop unused time_elapsed() and leftovers in 'struct posix_private'. Change-Id: Ie1f0229df5b03d0862193ce2b7fb91d27b0981b6 Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Updates: #1002
*	fuse: fetch arbitrary number of groups from /proc/[pid]/status	Csaba Henk	2020-08-21	1	-24/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Glusterfs so far constrained itself with an arbitrary limit (32) for the number of groups read from /proc/[pid]/status (this was the number of groups shown there prior to Linux commit v3.7-9553-g8d238027b87e (v3.8-rc1~74^2~59); since this commit, all groups are shown). With this change we'll read groups up to the number Glusterfs supports in general (64k). Note: the actual number of groups that are made use of in a regular Glusterfs setup shall still be capped at ~93 due to limitations of the RPC transport. To be able to handle more groups than that, brick side gid resolution (server.manage-gids option) can be used along with NIS, LDAP or other such networked directory service (see https://github.com/gluster/glusterdocs/blob/5ba15a2/docs/Administrator%20Guide/Handling-of-users-with-many-groups.md#limit-in-the-glusterfs-protocol ). Also adding some diagnostic messages to frame_fill_groups(). Change-Id: I271f3dc3e6d3c44d6d989c7a2073ea5f16c26ee0 fixes: #1075 Signed-off-by: Csaba Henk <csaba@redhat.com>
*	FreeBSD patches for fuse mount utility	Daniel Morante	2020-08-19	1	-1/+8
\| \| \| \| \|	Change-Id: Ib2bac85c28905bb8997fbb64db2308f2a6f31720 Fixes: #1376
*	fuse: change setlk interrupt strategy to 'sync'	Csaba Henk	2020-07-24	1	-14/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The setlk interrupt handler uses a 'fork' of the resolved fuse state from setlk (a copy with some edits) to initiate its own auxiliary fop. Thus the references stored in the fuse states of the setlk fop and of its interrupt handler are shared (apart from the ones edited by the interrupt handler -- but the bulk of them remain as is). The lifetimes of these references are tied to the setlk fop, which has established them by properly claiming their backing resources. To guarantee the validity of these references in the interrupt context, we need to make sure that the setlk fop did not reclaim the fuse state while the interrupt handler is running. In other words, the setlk fop needs to wait for the termination of the interrupt handler, which is accomplished by the 'sync' strategy of the interrupt API (passing true for the 'sync' argument of fuse_interrupt_finish_{fop,interrupt} functions). Change-Id: I9a6dc76972507be4b7ba8d023cc876e5fddf813f Updates: #1374 Signed-off-by: Csaba Henk <csaba@redhat.com>
*	fuse: fix waiting for interrupt handler	Csaba Henk	2020-07-24	2	-7/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With 'sync' strategy, a fop's cbk waits for the interrupt handler to finish by making a call to fuse_interrupt_finish_fop() with sync = true. The wait is implemented by monitoring an interrupt_state struct member via a condition variable. However, due to broken code logic, the pthread_cond_wait() call is never reached. This change introduces a new member to the fuse_interrupt_state_t enum (the type of aforementioned struct member), FUSE_INTERRUPT_WAITING_HANDLER, which is then used for indicating the state of waiting for the interrupt handler. Change-Id: I72ab06c37f45ff8f212a6a632bac1f647af05cbd Updates: #1374 Signed-off-by: Csaba Henk <csaba@redhat.com>
*	Make FUSE notification optional at configure time	Emmanuel Dreyfus	2020-07-23	1	-4/+9
\| \| \| \| \| \| \| \| \| \|	NetBSD FUSE does not implement FUSE notification yet. This changes makes this feature a configure time option so that it can be disabled. Fixes: #1381 Change-Id: I3d977d8d69b57e1ac6957be84a9ddbb69b100893 Type: Bug Signed-off-by: Emmanuel Dreyfus manu@netbsd.org
*	mount/fuse: use cookies to get fuse-interrupt-record instead of xdata	Pranith Kumar K	2020-06-18	1	-21/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: On executing tests/features/flock_interrupt.t the following error log appears [2020-06-16 11:51:54.631072 +0000] E [fuse-bridge.c:4791:fuse_setlk_interrupt_handler_cbk] 0-glusterfs-fuse: interrupt record not found This happens because fuse-interrupt-record is never sent on the wire by getxattr fop and there is no guarantee that in the cbk it will be available in case of failures. Fix: wind getxattr fop with fuse-interrupt-record as cookie and recover it in the cbk Fixes: #1310 Change-Id: I4cfff154321a449114fc26e9440db0f08e5c7daa Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	Indicate timezone offsets in timestamps	Csaba Henk	2020-06-15	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Logs and other output carrying timestamps will have now timezone offsets indicated, eg.: [2020-03-12 07:01:05.584482 +0000] I [MSGID: 106143] [glusterd-pmap.c:388:pmap_registry_remove] 0-pmap: removing brick (null) on port 49153 To this end, - gf_time_fmt() now inserts timezone offset via %z strftime(3) template. - A new utility function has been added, gf_time_fmt_tv(), that takes a struct timeval pointer (tv) instead of a time_t value to specify the time. If tv->tv_usec is negative, gf_time_fmt_tv(... tv ...) is equivalent to gf_time_fmt(... tv->tv_sec ...) Otherwise it also inserts tv->tv_usec to the formatted string. - Building timestamps of usec precision has been converted to gf_time_fmt_tv, which is necessary because the method of appending a period and the usec value to the end of the timestamp does not work if the timestamp has zone offset, but it's also beneficial in terms of eliminating repetition. - The buffer passed to gf_time_fmt/gf_time_fmt_tv has been unified to be of GF_TIMESTR_SIZE size (256). We need slightly larger buffer space to accommodate the zone offset and it's preferable to use a buffer which is undisputedly large enough. This change does not* do the following: - Retaining a method of timestamp creation without timezone offset. As to my understanding we don't need such backward compatibility as the code just emits timestamps to logs and other diagnostic texts, and doesn't do any later processing on them that would rely on their format. An exception to this, ie. a case where timestamp is built for internal use, is graph.c:fill_uuid(). As far as I can see, what matters in that case is the uniqueness of the produced string, not the format. - Implementing a single-token (space free) timestamp format. While some timestamp formats used to be single-token, now all of them will include a space preceding the offset indicator. Again, I did not see a use case where this could be significant in terms of representation. - Moving the codebase to a single unified timestamp format and dropping the fmt argument of gf_time_fmt/gf_time_fmt_tv. While the gf_timefmt_FT format is almost ubiquitous, there are a few cases where different formats are used. I'm not convinced there is any reason to not use gf_timefmt_FT in those cases too, but I did not want to make a decision in this regard. Change-Id: I0af73ab5d490cca7ed8d07a2ce7ac22a6df2920a Updates: #837 Signed-off-by: Csaba Henk <csaba@redhat.com>
*	cluster/afr: Delay post-op for fsync	Pranith Kumar K	2020-06-08	1	-3/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: AFR doesn't delay post-op for fsync fop. For fsync heavy workloads this leads to un-necessary fxattrop/finodelk for every fsync leading to bad performance. Fix: Have delayed post-op for fsync. Add special flag in xdata to indicate that afr shouldn't delay post-op in cases where either the process will terminate or graph-switch would happen. Otherwise it leads to un-necessary heals when the graph-switch/process-termination happens before delayed-post-op completes. Fixes: #1253 Change-Id: I531940d13269a111c49e0510d49514dc169f4577 Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	open-behind: rewrite of internal logic	Xavi Hernandez	2020-06-04	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There was a critical flaw in the previous implementation of open-behind. When an open is done in the background, it's necessary to take a reference on the fd_t object because once we "fake" the open answer, the fd could be destroyed. However as long as there's a reference, the release function won't be called. So, if the application closes the file descriptor without having actually opened it, there will always remain at least 1 reference, causing a leak. To avoid this problem, the previous implementation didn't take a reference on the fd_t, so there were races where the fd could be destroyed while it was still in use. To fix this, I've implemented a new xlator cbk that gets called from fuse when the application closes a file descriptor. The whole logic of handling background opens have been simplified and it's more efficient now. Only if the fop needs to be delayed until an open completes, a stub is created. Otherwise no memory allocations are needed. Correctly handling the close request while the open is still pending has added a bit of complexity, but overall normal operation is simpler. Change-Id: I6376a5491368e0e1c283cc452849032636261592 Fixes: #1225 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
*	fuse - remove unnecessary code block	Barak Sason Rofman	2020-06-02	2	-27/+0
\| \| \| \| \| \| \| \| \|	Per a "TODO" comment in the code, a code block can be removed as it's no longer required Change-Id: I60e064ece985ff2ea2a686bbd2f0e6cc850899e9 updates: #1000 Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
*	fuse: occasional logging for fuse device 'weird' write errors	Csaba Henk	2020-05-21	2	-1/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This change is a followup to I510158843e4b1d482bdc496c2e97b1860dc1ba93. In referred change we pushed log messages about 'weird' write errors to fuse device out of sight, by reporting them at Debug loglevel instead of Error (where 'weird' means errno is not POSIX compliant but having meaningful semantics for FUSE protocol). This solved the issue of spurious error reporting. And so far so good: these messages don't indicate an error condition by themselves. However, when they come in high repetitions, that indicates a suboptimal condition which should be reported.[1] Therefore now we shall emit a Warning if a certain errno occurs a certain number of times[2] as the outcome of a write to the fuse device. ___ [1] typically ENOENTs and ENOTDIRs accumulate when glusterfs' inode invalidation lags behind the kernel's internal inode garbage collection (in this case above errnos mean that the inode which we requested to be invalidated is not found in kernel). This can be mitigated with the invalidate-limit command line / mount option, cf. bz#1732717. [2] 256, as of the current implementation. Change-Id: I8cc7fe104da43a88875f93b0db49d5677cc16045 Updates: #1000 Signed-off-by: Csaba Henk <csaba@redhat.com>
*	mount/fuse: Wait for 'mount' child to exit before dying	Pranith Kumar K	2020-04-09	1	-0/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: tests/bugs/protocol/bug-1433815-auth-allow.t fails sometimes because of stale mount. This stale mount comes into picture when parent process dies without waiting for the child process which mounts fuse fs to die Fix: Wait for mounting child process to die before dying. Fixes: #1152 Change-Id: I8baee8720e88614fdb762ea822d5877973eef8dc Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	mount.glusterfs: don't strip / from subdir-mounts	Andreas Loibl	2020-02-20	1	-0/+1
\| \| \| \| \|	Change-Id: Ib07c31dc6669aa9d9f84a21e6160237290ed9afb Fixes: bz#1804786
*	fuse: degrade logging of write failure to fuse device	Csaba Henk	2020-01-09	2	-7/+80
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: FUSE uses failures of communicating with /dev/fuse with various errnos to indicate in-kernel conditions to userspace. Some of these shouldn't be handled as an application error. Also the standard POSIX errno description should not be shown as they are misleading in this context. Solution: When writing to the fuse device, the caller of the respective convenience routine can mask those errnos which don't qualify to be an error for the application in that context, so then those shall be reported at DEBUG level. The possible non-standard errnos are reported with their POSIX name instead of their description to avoid confusion. (Eg. for ENOENT we don't log "no such file or directory", we log indeed literal "ENOENT".) Change-Id: I510158843e4b1d482bdc496c2e97b1860dc1ba93 updates: bz#1193929 Signed-off-by: Csaba Henk <csaba@redhat.com>
*	Avoid buffer overwrite due to uuid_utoa() misuse	Dmitry Antipov	2019-12-27	1	-14/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Code like: f(..., uuid_utoa(x), uuid_utoa(y)); is not valid (causes undefined behaviour) because uuid_utoa() uses the only static thread-local buffer which will be overwritten by the subsequent call. All such cases should be converted to use uuid_utoa_r() with explicitly specified buffer. Change-Id: I5e72bab806d96a9dd1707c28ed69ca033b9c8d6c Updates: bz#1193929 Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
*	fuse: add missing unlock	Xie Changlong	2019-12-10	1	-1/+3
\| \| \| \| \| \|	updates: bz#1193929 Change-Id: I50f75f730ea6970e99347fcee661ce9dc8477725 Signed-off-by: Xie Changlong <xiechanglong@cmss.chinamobile.com>
*	ctime: Fix ctime inconsisteny with utimensat	Kotresh HR	2019-12-10	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: When touch is used to create a file, the ctime is not matching atime and mtime which ideally should match. There is a difference in nano seconds. Cause: When touch is used modify atime or mtime to current time (UTIME_NOW), the current time is taken from kernel. The ctime gets updated to current time when atime or mtime is updated. But the current time to update ctime is taken from utime xlator. Hence the difference in nano seconds. Fix: When utimesat uses UTIME_NOW, use the current time from kernel. fixes: bz#1773530 Change-Id: I9ccfa47dcd39df23396852b4216f1773c49250ce Signed-off-by: Kotresh HR <khiremat@redhat.com>
*	Improve logging of fusedump creation failure	Csaba Henk	2019-11-07	1	-4/+10
\| \| \| \| \| \|	updates bz#1193929 Signed-off-by: Csaba Henk <csaba@redhat.com> Change-Id: I12cbe1d87f60fb497654d0e13e12171940867f76
*	Multiple files: make root gfid a static variable	Yaniv Kaul	2019-10-14	1	-3/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	In many places we use it, compare to it, etc. It could be a static variable, as it really doesn't change. I think it's better than initializing to 0 and then doing gfid[15] = 1 or other tricks. I think there are additional oppportunuties to make more variables static. This is an attempt at an easy one. Change-Id: I7f23a30a94056d8f043645371ab841cbd0f90d19 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
*	glusterfs/fuse: Reduce the default lru-limit value	N Balachandran	2019-09-24	2	-2/+2
\| \| \| \| \| \| \| \| \| \|	The current lru-limit value still uses memory for upto 128K inodes. Reduce the default value of lru-limit to 64K. Change-Id: Ica2dd4f8f5fde45cb5180d8f02c3d86114ac52b3 Fixes: bz#1753880 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	mount/fuse - Fixing a coverity issue	Barak Sason	2019-08-20	1	-0/+1
\| \| \| \| \| \| \| \| \| \|	Fixed resource leak of dict_value and newkey variables CID: 1398630 Updates: bz#789278 Change-Id: I589fdc0aecaeb4f446cd29f95bad36ccd7a35beb Signed-off-by: Barak Sason <bsasonro@redhat.com>
*	mount.glusterfs: make fcache-keep-open option take a value	Philip Spencer	2019-08-16	1	-8/+14
\| \| \| \| \| \|	Fixes: bz#1158130 Change-Id: Ifdeaed7c9fbe85f7ce421f7c89cbe7265e45f77c Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	fuse: Set limit on invalidate queue size	N Balachandran	2019-08-14	3	-16/+55
\| \| \| \| \| \| \| \| \| \| \| \| \|	If the glusterfs fuse client process is unable to process the invalidate requests quickly enough, the number of such requests quickly grows large enough to use a significant amount of memory. We are now introducing another option to set an upper limit on these to prevent runaway memory usage. Change-Id: Iddfff1ee2de1466223e6717f7abd4b28ed947788 Fixes: bz#1732717 Signed-off-by: N Balachandran <nbalacha@redhat.com>
*	fuse: rate limit reading from fuse device upon receiving EPERM	Csaba Henk	2019-08-08	3	-0/+31
\| \| \| \| \| \|	Fixes: bz#1644322 Change-Id: I53e8fa362cd8c7d04fb1c4abb606a9abb642c592 Signed-off-by: Csaba Henk <csaba@redhat.com>
*	fuse: add missing GF_FREE to fuse_interrupt	Csaba Henk	2019-07-25	1	-1/+4
\| \| \| \| \| \|	Change-Id: Id7e003e4a53d0a0057c1c84e1cd704c80a6cb015 Fixes: bz#1728047 Signed-off-by: Csaba Henk <csaba@redhat.com>
*	ibverbs/rdma: remove from build	Amar Tumballi	2019-07-13	1	-10/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have proposed about this an year ago, and with recent smoke failures, it looks like the right time to take such call. ref: https://lists.gluster.org/pipermail/gluster-users/2018-July/034400.html With this, glusterfs-8.0 wouldn't have rdma feature, and would allow some modularity changes possible with rpc layer (as we would have just 1 transport) Updates: bz#1635688 Change-Id: Ia277dca4d4b1f0cffae20819024a52b075b775e5 Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	Remove hadoop related code from the codebase	Vijay Bellur	2019-07-09	2	-25/+12
\| \| \| \| \| \| \| \| \|	As Hadoop is no longer supported, dropping code for handling Hadoop access. Fixes: bz#1728417 Signed-off-by: Vijay Bellur <vbellur@redhat.com> Change-Id: I8fcf4faacb364f1c9a8abb0c48faec337087f845
*	core: use multiple servers while mounting a volume using ipv6	Sanju Rakonde	2019-07-01	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	According to man page: mount -t glusterfs [-o <options>] <server1>,<server2>, <server3>,..<serverN>:/<volname>[/<subdir>] <mount_point> When we try mount -t glusterfs 52:54:00:23:5a:b6,52:54:00:a0:be:ed:/ta-vol1 /mnt/ta we see the below msg in log: [2019-06-17 11:41:36.085809] I [MSGID: 100030] [glusterfsd.c:2867:main] 0-/usr/local/sbin/glusterfs: Started running /usr/local/sbin/glusterfs version 7dev (args: /usr/local/sbin/glusterfs --process-name fuse --volfile-server=52:54:00:23:5a --volfile-id=/ta-vol1 /mnt/ta) With this change, we'll able to give multiple volfile servers while mounting the volume using ipv6. After the change, I see the following in log: [2019-06-17 12:00:21.183658] I [MSGID: 100030] [glusterfsd.c:2867:main] 0-/usr/local/sbin/glusterfs: Started running /usr/local/sbin/glusterfs version 7dev (args: /usr/local/sbin/glusterfs --process-name fuse --volfile-server=52:54:00:23:5a:b6 --volfile-server=52:54:00:a0:be:ed --volfile-id=/ta-vol1 /mnt/ta) fixes: bz#1719290 credits: Aga <aga_1990@hotmail.com> Change-Id: Icf89bea3ba15d8374ef428aeb59f2ef55ad544ec Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
*	stack: Make sure to have unique call-stacks in all cases	Pranith Kumar K	2019-05-30	2	-8/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	At the moment new stack doesn't populate frame->root->unique in all cases. This makes it difficult to debug hung frames by examining successive state dumps. Fuse and server xlators populate it whenever they can, but other xlators won't be able to assign 'unique' when they need to create a new frame/stack because they don't know what 'unique' fuse/server xlators already used. What we need is for unique to be correct. If a stack with same unique is present in successive statedumps, that means the same operation is still in progress. This makes 'finding hung frames' part of debugging hung frames easier. fixes bz#1714098 Change-Id: I3e9a8f6b4111e260106c48a2ac3a41ef29361b9e Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	mount.glusterfs: change the error message	Amar Tumballi	2019-03-29	1	-2/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In scenarios where a mount fails before creating log file, doesn't make sense to give message to 'check log file'. See below: ``` ERROR: failed to create logfile "/var/log/glusterfs/mnt.log" (No space left on device) ERROR: failed to open logfile /var/log/glusterfs/mnt.log Mount failed. Please check the log file for more details. ``` Fixes: bz#1688068 Change-Id: I1d837caa4f9bc9f1a37780783e95007e01ae4e3f Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	fuse : fix high sev coverity issue	Sunny Kumar	2019-03-21	1	-1/+7
\| \| \| \| \| \| \| \| \| \| \| \|	This patch fixed coverity issue in fuse-bridge.c. CID : 1398630 : Resource leak CID : 1399757 : Uninitialized pointer read updates: bz#789278 Change-Id: I69f8591400ee56a5d215eeac443a8e3d7777db27 Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
*	mount/fuse: Fix spelling mistake	Pranith Kumar K	2019-03-15	1	-1/+2
\| \| \| \| \| \|	updates bz#1193929 Change-Id: I55ffa8f086ad9570f2526d91c196d7de9ffe6add Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	fuse : fix memory leak	Sunny Kumar	2019-02-25	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes memory leak reported by ASan. Tracebacks: ERROR: LeakSanitizer: detected memory leaks Direct leak of 712 byte(s) in 1 object(s) allocated from: #0 0x7f35139dc848 in __interceptor_malloc (/lib64/libasan.so.5+0xef848) #1 0x7f35136efb29 in __gf_malloc ../libglusterfs/src/mem-pool.c:136 #2 0x7f3510591ce9 in fuse_thread_proc ../xlators/mount/fuse/src/fuse-bridge.c:5929 #3 0x7f351336d58d in start_thread (/lib64/libpthread.so.0+0x858d) SUMMARY: AddressSanitizer: 712 byte(s) leaked in 1 allocation(s). updates: bz#1633930 Change-Id: Ie5b4da6b338d8e5fc770c5b2da1238e3462468ac Signed-off-by: Sunny Kumar <sunkumar@redhat.com>
*	core: implement a global thread pool	Xavi Hernandez	2019-02-18	2	-21/+65
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch implements a thread pool that is wait-free for adding jobs to the queue and uses a very small locked region to get jobs. This makes it possible to decrease contention drastically. It's based on wfcqueue structure provided by urcu library. It automatically enables more threads when load demands it, and stops them when not needed. There's a maximum number of threads that can be used. This value can be configured. Depending on the workload, the maximum number of threads plays an important role. So it needs to be configured for optimal performance. Currently the thread pool doesn't self adjust the maximum for the workload, so this configuration needs to be changed manually. For this reason, the global thread pool has been made optional, so that volumes can still use the thread pool provided by io-threads. To enable it for bricks, the following option needs to be set: config.global-threading = on This option has no effect if bricks are already running. A restart is required to activate it. It's recommended to also enable the following option when running bricks with the global thread pool: performance.iot-pass-through = on To enable it for a FUSE mount point, the option '--global-threading' must be added to the mount command. To change it, an umount and remount is needed. It's recommended to disable the following option when using global threading on a mount point: performance.client-io-threads = off To enable it for services managed by glusterd, glusterd needs to be started with option '--global-threading'. In this case all daemons, like self-heal, will be using the global thread pool. Currently it can only be enabled for bricks, FUSE mounts and glusterd services. The maximum number of threads for clients and bricks can be configured using the following options: config.client-threads config.brick-threads These options can be applied online and its effect is immediate most of the times. If one of them is set to 0, the maximum number of threads will be calcutated as #cores * 2. Some distributions use a very old userspace-rcu library (version 0.7) for this reason, some header files from version 0.10 have been copied into contrib/userspace-rcu and are used if the detected version is 0.7 or older. An additional change has been made to io-threads to prevent that threads are started when iot-pass-through is set. Change-Id: I09d19e246b9e6d53c6247b29dfca6af6ee00a24b updates: #532 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
*	mount/fuse: fix bug related to --auto-invalidation in mount script	Raghavendra Gowdappa	2019-02-09	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	When "auto-invalidation" option was not specified for mount script, glusterfs cmdline ended with "--auto-invalidation=" option. This patch fixes that bug in mount script. Thanks to Amar for reporting it. Change-Id: Ie5cd4c6ffb3ac644d9d2b032035f914a935d05a8 Signed-off-by: Raghavendra Gowdappa <rgowdapp@redhat.com> updates: bz#1664934
*	fuse: correctly handle setxattr values	Xavi Hernandez	2019-02-07	1	-4/+16
\| \| \| \| \| \| \| \| \| \|	The setxattr function receives a pointer to raw data, which may not be null-terminated. When this data needs to be interpreted as a string, an explicit null termination needs to be added before using the value. Change-Id: Id110f9b215b22786da5782adec9449ce38d0d563 updates: bz#1193929 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
*	mount/fuse: expose auto-invalidation as a mount option	Raghavendra Gowdappa	2019-02-02	3	-10/+33
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Auto invalidation is necessary when same (meta)data is shared/access across multiple mounts. However, if (meta)data is not shared, all relevant I/O goes through the cache of single mount and hence is coherent with (meta)data on bricks always. So, fuse-auto-invalidation can be disabled for this case which gives a huge performance boost for workloads that write data and then immediately read the data they just wrote. From glusterfs --help, <snip> --auto-invalidation[=BOOL] controls whether fuse-kernel can auto-invalidate attribute, dentry and page-cache. Disable this only if same files/directories are not accessed across two different mounts concurrently [default: "on"] </snip> Details on how disabling auto-invalidation helped to reduce pgbench init times can be found at [1]. Time taken for pgbench init of scale 8000 was 8340s. That will be an improvement of 86% (59280s vs 8340s) with auto-invalidations turned off along with other optimizations. Just disabling auto-invalidation contributed 56% improvement by reducing the total time taken by 33260s. [1] https://www.spinics.net/lists/gluster-devel/msg25907.html Change-Id: I0ed730dba9064bd9c576ad1800170a21e100e1ce Signed-off-by: Raghavendra Gowdappa <rgowdapp@redhat.com> updates: bz#1664934
*	fix 32-bit-build-smoke warnings	Iraj Jamali	2019-01-11	1	-1/+1
\| \| \| \| \| \| \|	fixes: bz#1622665 Change-Id: I777d67b1b62c284c62a02277238ad7538eef001e Signed-off-by: Iraj Jamali <ijamali@redhat.com>
*	Revert "iobuf: Get rid of pre allocated iobuf_pool and use per thread mem pool"	Amar Tumballi	2019-01-08	1	-1/+2
\| \| \| \| \| \| \| \| \|	This reverts commit b87c397091bac6a4a6dec4e45a7671fad4a11770. There seems to be some performance regression with the patch and hence recommended to have it reverted. Updates: #325 Change-Id: Id85d6203173a44fad6cf51d39b3e96f37afcec09
*	multiple-files: clang-scan fixes	Amar Tumballi	2018-12-31	1	-3/+12
\| \| \| \| \| \|	updates: bz#1622665 Change-Id: I9f3a75ed9be3d90f37843a140563c356830ef945 Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	iobuf: Get rid of pre allocated iobuf_pool and use per thread mem pool	Poornima G	2018-12-18	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The current implementation of iobuf_pool has two problems: - prealloc of 12.5MB memory, this limits the scale factor of the gluster processes due to RAM requirements - lock contention, as the current implementation has one global iobuf_pool lock. Credits for debugging and addressing the same goes to Krutika Dhananjay <kdhananj@redhat.com>. Issue: #410 Hence changing the iobuf implementation to use per thread mem pool. This may theoritically appear to cause perf dip as there is no preallocation. But per thread mem pool will not have significant perf impact as the last allocated memory is kept alive for subsequent allocs, for some time. The worst case would be if iobufs requested are of random sizes each time. The best case is, if we get iobuf request of the same size. From the perf tests, this patch did not seem to cause any perf decrease. Note that, with this patch, the rdma performance is going to degrade drastically. In one of the previous patchsets we had fixes to not degrade rdma perf, but rdma is not supported and also not tested [1]. Hence the decision was to not have code in rdma that is not tested and not supported. [1] https://lists.gluster.org/pipermail/gluster-users.old/2018-July/034400.html Updates: #325 Change-Id: Ic2ef3bd498f9250dea25f25ba0c01fde19584b27 Signed-off-by: Poornima G <pgurusid@redhat.com>
*	fuse: SETLKW interrupt	Csaba Henk	2018-12-14	1	-0/+130
\| \| \| \| \| \| \| \| \|	Use the (f)getxattr based clearlocks interface to interrupt a pending lock request. updates: #465 Change-Id: I4e91a4d8791fc688fed400a02de4c53487e61be2 Signed-off-by: Csaba Henk <csaba@redhat.com>
*	fuse: add --lru-limit option	Amar Tumballi	2018-12-14	3	-51/+86
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The inode LRU mechanism is moot in fuse xlator (ie. there is no limit for the LRU list), as fuse inodes are referenced from kernel context, and thus they can only be dropped on request of the kernel. This might results in a high number of passive inodes which are useless for the glusterfs client, causing a significant memory overhead. This change tries to remedy this by extending the LRU semantics and allowing to set a finite limit on the fuse inode LRU. A brief history of problem: When gluster's inode table was designed, fuse didn't have any 'invalidate' method, which means, userspace application could never ask kernel to send a 'forget()' fop, instead had to wait for kernel to send it based on kernel's parameters. Inode table remembers the number of times kernel has cached the inode based on the 'nlookup' parameter. And 'nlookup' field is not used by no other entry points (like server-protocol, gfapi etc). Hence the inode_table of fuse module always has to have lru-limit as '0', which means no limit. GlusterFS always had to keep all inodes in memory as kernel would have had a reference to it. Again, the reason for this is, kernel's glusterfs inode reference was pointer of 'inode_t' structure in glusterfs. As it is a pointer, we could never free it (to prevent segfault, or memory corruption). Solution: In the inode table, handle the prune case of inodes with 'nlookup' differently, and call a 'invalidator' method, which in this case is fuse_invalidate(), and it sends the request to kernel for getting the forget request. When the kernel sends the forget, it means, it has dropped all the reference to the inode, and it will send the forget with the 'nlookup' parameter too. We just need to make sure to reduce the 'nlookup' value we have when we get forget. That automatically cause the relevant prune to happen. Credits: Csaba Henk, Xavier Hernandez, Raghavendra Gowdappa, Nithya B fixes: bz#1560969 Change-Id: Ifee0737b23b12b1426c224ec5b8f591f487d83a2 Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	copy_file_range support in GlusterFS	Raghavendra Bhat	2018-12-12	2	-0/+150
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* libglusterfs changes to add new fop * Fuse changes: - Changes in fuse bridge xlator to receive and send responses * posix changes to perform the op on the backend filesystem * protocol and rpc changes for sending and receiving the fop * gfapi changes for performing the fop * tools: glfs-copy-file-range tool for testing copy_file_range fop - Although, copy_file_range support has been added to the upstream fuse kernel module, no release has been made yet of a kernel which contains the support. It is expected to come in the upcoming release of linux-4.20 So, as of now, executing copy_file_range fop on a fused based filesystem results in fuse kernel module sending read on the source fd and write on the destination fd. Therefore a small gfapi based tool has been written to be able test the copy_file_range fop. This tool is similar (in functionality) to the example program given in copy_file_range man page. So, running regular copy_file_range on a fuse mount point and running gfapi based glfs-copy-file-range tool gives some idea about how fast, the copy_file_range (or reflink) can be. On the local machine this was the result obtained. mount -t glusterfs workstation:new /mnt/glusterfs [root@workstation ~]# cd /mnt/glusterfs/ [root@workstation glusterfs]# ls file [root@workstation glusterfs]# cd [root@workstation ~]# time /tmp/a.out /mnt/glusterfs/file /mnt/glusterfs/new real 0m6.495s user 0m0.000s sys 0m1.439s [root@workstation ~]# time glfs-copy-file-range $(hostname) new /tmp/glfs.log /file /rrr OPEN_SRC: opening /file is success OPEN_DST: opening /rrr is success FSTAT_SRC: fstat on /rrr is success copy_file_range successful real 0m0.309s user 0m0.039s sys 0m0.017s This tool needs following arguments 1) hostname 2) volume name 3) log file path 4) source file path (relative to the gluster volume root) 5) destination file path (relative to the gluster volume root) "glfs-copy-file-range <hostname> <volume> <log file path> <source> <destination>" - Added a testcase as well to run glfs-copy-file-range tool * io-stats changes to capture the fop for profiling * NOTE: - Added conditional check to see whether the copy_file_range syscall is available or not. If not, then return ENOSYS. - Added conditional check for kernel minor version in fuse_kernel.h and fuse-bridge while referring to copy_file_range. And the kernel minor version is kept as it is. i.e. 24. Increment it in future when there is a kernel release which contains the support for copy_file_range fop in fuse kernel module. * The document which contains a writeup on this enhancement can be found at https://docs.google.com/document/d/1BSILbXr_knynNwxSyyu503JoTz5QFM_4suNIh2WwrSc/edit Change-Id: I280069c814dd21ce6ec3be00a884fc24ab692367 updates: #536 Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
*	all: add xlator_api to many translators	Amar Tumballi	2018-12-06	1	-0/+14
\| \| \| \| \| \|	Fixes: #164 Change-Id: I93ad6f0232a1dc534df099059f69951e1339086f Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	libglusterfs: Move devel headers under glusterfs directory	ShyamsundarR	2018-12-05	3	-17/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	libglusterfs devel package headers are referenced in code using include semantics for a program, this while it works can be better especially when dealing with out of tree xlator builds or in general out of tree devel package usage. Towards this, the following changes are done, - moved all devel headers under a glusterfs directory - Included these headers using system header notation <> in all code outside of libglusterfs - Included these headers using own program notation "" within libglusterfs This change although big, is just moving around the headers and making it correct when including these headers from other sources. This helps us correctly include libglusterfs includes without namespace conflicts. Change-Id: Id2a98854e671a7ee5d73be44da5ba1a74252423b Updates: bz#1193929 Signed-off-by: ShyamsundarR <srangana@redhat.com>
*	core: fix strncpy warnings	Kaleb S. KEITHLE	2018-11-15	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since gcc-8.2.x (fedora-28 or so) gcc has been emitting warnings about buggy use of strncpy. e.g. warning: ‘strncpy’ output truncated before terminating nul copying as many bytes from a string as its length and warning: ‘strncpy’ specified bound depends on the length of the source argument Since we're copying string fragments and explicitly null terminating use memcpy to silence the warning Change-Id: I413d84b5f4157f15c99e9af3e154ce594d5bcdc1 updates: bz#1193929 Signed-off-by: Kaleb S. KEITHLEY <kkeithle@redhat.com>
*	fuse: diagnostic FLUSH interrupt	Csaba Henk	2018-11-06	3	-2/+69
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We add dummy interrupt handling for the FLUSH fuse message. It can be enabled by the "--fuse-flush-handle-interrupt" hidden command line option, or "-ofuse-flush-handle-interrupt=yes" mount option. It serves no other than diagnostic & demonstational purposes -- to exercise the interrupt handling framework a bit and to give an usage example. Documentation is also provided that showcases interrupt handling via FLUSH. Change-Id: I522f1e798501d06b74ac3592a5f73c1ab0590c60 updates: #465 Signed-off-by: Csaba Henk <csaba@redhat.com>