glusterfs.git - GlusterFS is a distributed file-system capable of scaling to several petabytes. It aggregates various storage bricks over Infiniband RDMA or TCP/IP interconnect into one large parallel network file system.

	Commit message (Collapse)	Author	Age	Files	Lines
*	rpcsvc: Add latency tracking for rpc programs	Pranith Kumar K	2020-09-07	1	-0/+5
\| \| \| \| \| \| \| \| \| \|	Added latency tracking of rpc-handling code. With this change we should be able to monitor the amount of time rpc-handling code is consuming for each of the rpc call. fixes: #1466 Change-Id: I04fc7f3b12bfa5053c0fc36885f271cb78f581cd Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
*	libglusterfs: fix dict leak	Ravishankar N	2020-09-04	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: gf_rev_dns_lookup_cached() allocated struct dnscache->dict if it was null but the freeing was left to the caller. Fix: Moved dict allocation and freeing into corresponding init and fini routines so that its easier for the caller to avoid such leaks. Updates: #1000 Change-Id: I90d6a6f85ca2dd4fe0ab461177aaa9ac9c1fbcf9 Signed-off-by: Ravishankar N <ravishankar@redhat.com>
*	Remove need for /proc on FreeBSD	Daniel Morante	2020-08-20	1	-1/+3
\| \| \| \| \|	Change-Id: Ieebd9a54307813954011ac8833824831dce6da10 Fixes: #1376
*	dht - fixing xattr inconsistency	Barak Sason Rofman	2020-07-22	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The scenario of setting an xattr to a dir, killing one of the bricks, removing the xattr, bringing back the brick results in xattr inconsistency - The downed brick will still have the xattr, but the rest won't. This patch add a mechanism that will remove the extra xattrs during lookup. This patch is a modification to a previous patch based on comments that were made after merge: https://review.gluster.org/#/c/glusterfs/+/24613/ fixes: #1324 Change-Id: Ifec0b7aea6cd40daa8b0319b881191cf83e031d1 Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
*	storage/posix, libglusterfs: library function to sync filesystem	Dmitry Antipov	2020-06-22	1	-0/+1
\| \| \| \| \| \| \| \|	Convert an ad-hoc hack to a regular library function gf_syncfs(). Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Change-Id: I3ed93e9f28f22c273df1466ba4a458eacb8df395 Fixes: #1329
*	open-behind: rewrite of internal logic	Xavi Hernandez	2020-06-04	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There was a critical flaw in the previous implementation of open-behind. When an open is done in the background, it's necessary to take a reference on the fd_t object because once we "fake" the open answer, the fd could be destroyed. However as long as there's a reference, the release function won't be called. So, if the application closes the file descriptor without having actually opened it, there will always remain at least 1 reference, causing a leak. To avoid this problem, the previous implementation didn't take a reference on the fd_t, so there were races where the fd could be destroyed while it was still in use. To fix this, I've implemented a new xlator cbk that gets called from fuse when the application closes a file descriptor. The whole logic of handling background opens have been simplified and it's more efficient now. Only if the fop needs to be delayed until an open completes, a stub is created. Otherwise no memory allocations are needed. Correctly handling the close request while the open is still pending has added a bit of complexity, but overall normal operation is simpler. Change-Id: I6376a5491368e0e1c283cc452849032636261592 Fixes: #1225 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
*	syncop: improve scaling and implement more tools	Xavi Hernandez	2020-05-13	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The current scaling of the syncop thread pool is not working properly and can leave some tasks in the run queue more time than necessary when the maximum number of threads is not reached. This patch provides a better scaling condition to react faster to pending work. Condition variables and sleep in the context of a synctask have also been implemented. Their purpose is to replace regular condition variables and sleeps that block synctask threads and prevent other tasks to be executed. The new features have been applied to several places in glusterd. Change-Id: Ic50b7c73c104f9e41f08101a357d30b95efccfbf Fixes: #1116 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
*	Posix: Optimize posix code to improve file creation	Mohit Agrawal	2020-04-06	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: Before executing a fop in POSIX xlator it builds an internal path based on GFID.To validate the path it call's (l)stat system call and while .glusterfs is heavily loaded kernel takes time to lookup inode and due to that performance drops Solution: In this patch we followed two ways to improve the performance. 1) Keep open fd specific to first level directory(gfid[0]) in .glusterfs, it would force to kernel keep the inodes from all those files in cache. In case of memory pressure kernel won't uncache first level inodes. We need to open 256 fd's per brick to access the entry faster. 2) Use at based call's to access relative path to reduce path based lookup time. Note: To verify the patch we have executed kernel untar 100 times on 6 different clients after enabling metadata group-cache and some other option.We were getting more than 20 percent improvement in kenel untar after applying the patch. Credits: Xavi Hernandez <xhernandez@redhat.com> Change-Id: I1643e6b01ed669b2bb148d02f4e6a8e08da45343 updates: #891 Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
*	graph/cleanup: Fix race in graph cleanup	Mohammed Rafi KC	2019-09-05	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We were unconditionally cleaning up the grap when we get child_down followed by parent_down. But this is prone to race condition when some of the bricks are already disconnected. In this case, even before the last child down is executed in the client xlator code,we might have freed the graph. Because the child_down event is alreadt recevied. To fix this race, we have introduced a check to see if all client xlator have cleared thier reconnect chain, and called the child_down for last time. Change-Id: I7d02813bc366dac733a836e0cd7b14a6fac52042 fixes: bz#1727329 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
*	ctime: Fix incorrect realtime passed to frame->root->ctime	Kotresh HR	2019-08-22	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \|	On systems that don't support "timespec_get"(e.g., centos6), it was using "clock_gettime" with "CLOCK_MONOTONIC" to get unix epoch time which is incorrect. This patch introduces "timespec_now_realtime" which uses "clock_gettime" with "CLOCK_REALTIME" which fixes the issue. Change-Id: I57be35ce442d7e05319e82112b687eb4f28d7612 Signed-off-by: Kotresh HR <khiremat@redhat.com> fixes: bz#1743652
*	libglusterfs: remove dependency of rpc	Amar Tumballi	2019-08-16	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Goal: 'libglusterfs' files shouldn't have any dependency outside of the tree, specially the header files, shouldn't have '#include' from outside the tree. Fixes: * Had to introduce libglusterd so, methods and structures required for only mgmt/glusterd, and cli/ are separated from 'libglusterfs/' * Remove rpc/xdr/gen from build, which was used mainly so dependency for libglusterfs could be properly satisfied. * Move rpcsvc_auth_data to client_t.h, so all dependencies could be handled. Updates: bz#1636297 Change-Id: I0e80243a5a3f4615e6fac6e1b947ad08a9363fce Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	event: rename event_XXX with gf_ prefixed	Xiubo Li	2019-07-29	1	-10/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I hit one crash issue when using the libgfapi. In the libgfapi it will call glfs_poller() --> event_dispatch() in file api/src/glfs.c:721, and the event_dispatch() is defined by libgluster locally, the problem is the name of event_dispatch() is the extremly the same with the one from libevent package form the OS. For example, if a executable program Foo, which will also use and link the libevent and the libgfapi at the same time, I can hit the crash, like: kernel: glfs_glfspoll[68486]: segfault at 1c0 ip 00007fef006fd2b8 sp 00007feeeaffce30 error 4 in libevent-2.0.so.5.1.9[7fef006ed000+46000] The link for Foo is: lib_foo_LADD = -levent $(GFAPI_LIBS) It will crash. This is because the glfs_poller() is calling the event_dispatch() from the libevent, not the libglsuter. The gfapi link info : GFAPI_LIBS = -lacl -lgfapi -lglusterfs -lgfrpc -lgfxdr -luuid If I link Foo like: lib_foo_LADD = $(GFAPI_LIBS) -levent It will works well without any problem. And if Foo call one private lib, such as handler_glfs.so, and the handler_glfs.so will link the GFAPI_LIBS directly, while the Foo won't and it will dlopen(handler_glfs.so), then the crash will be hit everytime. The link info will be: foo_LADD = -levent libhandler_glfs_LIBADD = $(GFAPI_LIBS) I can avoid the crash temporarily by linking the GFAPI_LIBS in Foo too like: foo_LADD = $(GFAPI_LIBS) -levent libhandler_glfs_LIBADD = $(GFAPI_LIBS) But this is ugly since the Foo won't use any APIs from the GFAPI_LIBS. And in some cases when the --as-needed link option is added(on many dists it is added as default), then the crash is back again, the above workaround won't work. Fixes: #699 Change-Id: I38f0200b941bd1cff4bf3066fca2fc1f9a5263aa Signed-off-by: Xiubo Li <xiubli@redhat.com>
*	ctime: Set mdata xattr on legacy files	Kotresh HR	2019-07-22	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: The files which were created before ctime enabled would not have "trusted.glusterfs.mdata"(stores time attributes) xattr. Upon fops which modifies either ctime or mtime, the xattr gets created with latest ctime, mtime and atime, which is incorrect. It should update only the corresponding time attribute and rest from backend Solution: Creating xattr with values from brick is not possible as each brick of replica set would have different times. So create the xattr upon successful lookup if the xattr is not created Note To Reviewers: The time attributes used to set xattr is got from successful lookup. Instead of sending the whole iatt over the wire via setxattr, a structure called mdata_iatt is sent. The mdata_iatt contains only time attributes. Change-Id: I5e535631ddef04195361ae0364336410a2895dd4 fixes: bz#1593542 Signed-off-by: Kotresh HR <khiremat@redhat.com>
*	Replace usleep() with nanosleep()	Vijay Bellur	2019-07-11	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As usleep has been obsoleted, changed all invocations of usleep to nanosleep. From man 3 usleep: "4.3BSD, POSIX.1-2001. POSIX.1-2001 declares this function obsolete; use nanosleep(2) instead. POSIX.1-2008 removes the specification of usleep()." Added a helper function gf_nanosleep() to have a single place for handling edge cases that might arise from the conversion of usleep to nanosleep and allow the sleep to resume with right remaining value upon being interrupted. Fixes: bz#1721686 Change-Id: Ia39ab82c9e0f4669d2c00d4cdf25e38d94ef9f62 Signed-off-by: Vijay Bellur <vbellur@redhat.com>
*	glusterd: Optimize code to copy dictionary in handshake code path	Mohit Agrawal	2019-05-31	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: While high no. of volumes are configured around 2000 glusterd has bottleneck during handshake at the time of copying dictionary Solution: To avoid the bottleneck serialize a dictionary instead of copying key-value pair one by one Change-Id: I9fb332f432e4f915bc3af8dcab38bed26bda2b9a fixes: bz#1711297 Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
*	ec/fini: Fix race with ec_fini and ec_notify	Mohammed Rafi KC	2019-05-21	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	During a graph cleanup, we first sent a PARENT_DOWN and wait for a child down to ultimately free the xlator and the graph. In the ec xlator, we cleanup the threads when we get a PARENT_DOWN event. But a racing event like CHILD_UP or event xl_op may trigger healing threads after threads cleanup. So there is a chance that the threads might access a freed private variabe Change-Id: I252d10181bb67b95900c903d479de707a8489532 fixes: bz#1703948 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
*	libglusterfs: Remove decompunder helper routines from symbol export	Anoop C S	2019-05-11	1	-4/+0
\| \| \| \| \| \| \| \| \| \| \| \| \|	decompounder and related sources were removed via the following commits: https://review.gluster.org/#/c/glusterfs/+/22627/ https://review.gluster.org/#/c/glusterfs/+/22629/ Therefore taking out symbol exports for those removed routines. Change-Id: I2ef99a318de1e4b512cabd2fa923225c5b79b1e5 updates: bz#1193929 Signed-off-by: Anoop C S <anoopcs@redhat.com>
*	glusterd/store: store all key-values in one shot	Yaniv Kaul	2019-05-08	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \|	Instead of saving each key-value separately, which is slow ( especially as we fflush() after each!), store them all as one string and write all together. Implements https://github.com/gluster/glusterfs/issues/629 Change-Id: Ie77a272446b0b6785584b710a4fdd9c613dd9578 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat,.com>
*	core: avoid dynamic TLS allocation when possible	Xavi Hernandez	2019-04-24	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Some interdependencies between logging and memory management functions make it impossible to use the logging framework before initializing memory subsystem because they both depend on Thread Local Storage allocated through pthread_key_create() during initialization. This causes a crash when we try to log something very early in the initialization phase. To prevent this, several dynamically allocated TLS structures have been replaced by static TLS reserved at compile time using '__thread' keyword. This also reduces the number of error sources, making initialization simpler. Updates: bz#1193929 Change-Id: I8ea2e072411e30790d50084b6b7e909c7bb01d50 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
*	core: handle memory accounting correctly	Xavi Hernandez	2019-04-22	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a translator stops, memory accounting for that translator is not destroyed (because there could remain memory allocated that references it), but mutexes that coordinate updates of memory accounting were destroyed. This caused incorrect memory accounting and even crashes in debug mode. This patch also fixes some other things: * Reduce the number of atomic operations needed to manage memory accounting. * Correctly account memory when realloc() is used. * Merge two critical sections into one. * Cleaned the code a bit. Change-Id: Id5eaee7338729b9bc52c931815ca3ff1e5a7dcc8 Updates: bz#1659334 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
*	mgmt/shd: Implement multiplexing in self heal daemon	Mohammed Rafi KC	2019-04-01	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: Shd daemon is per node, which means they create a graph with all volumes on it. While this is a great for utilizing resources, it is so good in terms of performance and managebility. Because self-heal daemons doesn't have capability to automatically reconfigure their graphs. So each time when any configurations changes happens to the volumes(replicate/disperse), we need to restart shd to bring the changes into the graph. Because of this all on going heal for all other volumes has to be stopped in the middle, and need to restart all over again. Solution: This changes makes shd as a per volume daemon, so that the graph will be generated for each volumes. When we want to start/reconfigure shd for a volume, we first search for an existing shd running on the node, if there is none, we will start a new process. If already a daemon is running for shd, then we will simply detach a graph for a volume and reatach the updated graph for the volume. This won't touch any of the on going operations for any other volumes on the shd daemon. Example of an shd graph when it is per volume graph ----------------------- \| debug-iostat \| ----------------------- / \| \ / \| \ --------- --------- ---------- \| AFR-1 \| \| AFR-2 \| \| AFR-3 \| -------- --------- ---------- A running shd daemon with 3 volumes will be like--> graph ----------------------- \| debug-iostat \| ----------------------- / \| \ / \| \ ------------ ------------ ------------ \| volume-1 \| \| volume-2 \| \| volume-3 \| ------------ ------------ ------------ Change-Id: Idcb2698be3eeb95beaac47125565c93370afbd99 fixes: bz#1659708 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
*	WORM-Xlator: Maybe integer overflow when computing new atime	David Spisla	2019-03-07	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The structs worm_reten_state_t and read_only_priv_t from read-only.h are using uint64_t values to store periods of retention and autocommmit. This seems to be dangerous since in worm-helper.c the function worm_set_state computes in line 97: stbuf->ia_atime = time(NULL) + retention_state->ret_period; stbuf->ia_atime is using int64_t because of the settings of struct iattr. So if there is a very very high retention period stored, there is maybe an integer overflow. What can be the solution? Using int64_t instead if uint64_t may reduce the probability of the occurance. Change-Id: Id1e86c6b20edd53f171c4cfcb528804ba7881f65 fixes: bz#1685944 Signed-off-by: David Spisla <david.spisla@iternity.com>
*	core: implement a global thread pool	Xavi Hernandez	2019-02-18	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch implements a thread pool that is wait-free for adding jobs to the queue and uses a very small locked region to get jobs. This makes it possible to decrease contention drastically. It's based on wfcqueue structure provided by urcu library. It automatically enables more threads when load demands it, and stops them when not needed. There's a maximum number of threads that can be used. This value can be configured. Depending on the workload, the maximum number of threads plays an important role. So it needs to be configured for optimal performance. Currently the thread pool doesn't self adjust the maximum for the workload, so this configuration needs to be changed manually. For this reason, the global thread pool has been made optional, so that volumes can still use the thread pool provided by io-threads. To enable it for bricks, the following option needs to be set: config.global-threading = on This option has no effect if bricks are already running. A restart is required to activate it. It's recommended to also enable the following option when running bricks with the global thread pool: performance.iot-pass-through = on To enable it for a FUSE mount point, the option '--global-threading' must be added to the mount command. To change it, an umount and remount is needed. It's recommended to disable the following option when using global threading on a mount point: performance.client-io-threads = off To enable it for services managed by glusterd, glusterd needs to be started with option '--global-threading'. In this case all daemons, like self-heal, will be using the global thread pool. Currently it can only be enabled for bricks, FUSE mounts and glusterd services. The maximum number of threads for clients and bricks can be configured using the following options: config.client-threads config.brick-threads These options can be applied online and its effect is immediate most of the times. If one of them is set to 0, the maximum number of threads will be calcutated as #cores * 2. Some distributions use a very old userspace-rcu library (version 0.7) for this reason, some header files from version 0.10 have been copied into contrib/userspace-rcu and are used if the detected version is 0.7 or older. An additional change has been made to io-threads to prevent that threads are started when iot-pass-through is set. Change-Id: I09d19e246b9e6d53c6247b29dfca6af6ee00a24b updates: #532 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
*	inode: make critical section smaller	Amar Tumballi	2019-02-13	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	do all the 'static' tasks outside of locked region. * hash_dentry() and hash_gfid() are now called outside locked region. * remove extra __dentry_hash exported in libglusterfs.sym * avoid checks in locked functions, if the check is done in calling function. * implement dentry_destroy(), which handles freeing of dentry separately, from that of dentry_unset (which takes care of separating dentry from inode, and table) Updates: bz#1670031 Change-Id: I584213e0748464bb427fbdef3c4ab6615d7d5eb0 Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	core: make gf_thread_create() easier to use	Xavi Hernandez	2019-02-01	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch creates a specific function to set the thread name using a string format and a variable argument list, like printf(). This function is used to set the thread name from gf_thread_create(), which now accepts a variable argument list to create the full name. It's not necessary anymore to use a local array to build the name of the thread. This is done automatically. Change-Id: Idd8d01fd462c227359b96e98699f8c6d962dc17c Updates: bz#1193929 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
*	afr/self-heal:Fix wrong type checking	Ravishankar N	2019-01-24	1	-0/+1
\| \| \| \| \| \| \| \| \| \|	gf_dirent struct has d_type variable which should check with DT_DIR istead of IA_IFDIR or IA_IFDIR has to compare with entry->d_stat.ia_type Change-Id: Idf1059ce2a590734bc5b6adaad73604d9a708804 updates: bz#1653359 Signed-off-by: Mohammed Rafi KC <rkavunga@redhat.com>
*	rpc: use address-family option from vol file	Milind Changire	2019-01-22	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch helps enable IPv6 connections in the cluster. The default address-family is IPv4 without using this option explicitly. When address-family is set to "inet6" in the /etc/glusterfs/glusterd.vol file, the mount command-line also needs to have -o xlator-option="transport.address-family=inet6" added to it. This option also gets added to the brick command-line. Snapshot and gfapi use-cases should also use this option to pass in the inet6 address-family. Change-Id: I97db91021af27bacb6d7578e33ea4817f66d7270 fixes: bz#1635863 Signed-off-by: Milind Changire <mchangir@redhat.com>
*	core: Resolve dict_leak at the time of destroying graph	Mohit Agrawal	2019-01-14	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \|	Problem: In gluster code some of the places it call's get_new_dict to create a dictionary without taking reference so at the time of dict_unref it has become a leak Solution: To resolve the same call dict_new instead of get_new_dict updates bz#1650403 Change-Id: I3ccbbf5af07079a4fa09aad2cd0458c8625b2f06 Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
*	fuse: add --lru-limit option	Amar Tumballi	2018-12-14	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The inode LRU mechanism is moot in fuse xlator (ie. there is no limit for the LRU list), as fuse inodes are referenced from kernel context, and thus they can only be dropped on request of the kernel. This might results in a high number of passive inodes which are useless for the glusterfs client, causing a significant memory overhead. This change tries to remedy this by extending the LRU semantics and allowing to set a finite limit on the fuse inode LRU. A brief history of problem: When gluster's inode table was designed, fuse didn't have any 'invalidate' method, which means, userspace application could never ask kernel to send a 'forget()' fop, instead had to wait for kernel to send it based on kernel's parameters. Inode table remembers the number of times kernel has cached the inode based on the 'nlookup' parameter. And 'nlookup' field is not used by no other entry points (like server-protocol, gfapi etc). Hence the inode_table of fuse module always has to have lru-limit as '0', which means no limit. GlusterFS always had to keep all inodes in memory as kernel would have had a reference to it. Again, the reason for this is, kernel's glusterfs inode reference was pointer of 'inode_t' structure in glusterfs. As it is a pointer, we could never free it (to prevent segfault, or memory corruption). Solution: In the inode table, handle the prune case of inodes with 'nlookup' differently, and call a 'invalidator' method, which in this case is fuse_invalidate(), and it sends the request to kernel for getting the forget request. When the kernel sends the forget, it means, it has dropped all the reference to the inode, and it will send the forget with the 'nlookup' parameter too. We just need to make sure to reduce the 'nlookup' value we have when we get forget. That automatically cause the relevant prune to happen. Credits: Csaba Henk, Xavier Hernandez, Raghavendra Gowdappa, Nithya B fixes: bz#1560969 Change-Id: Ifee0737b23b12b1426c224ec5b8f591f487d83a2 Signed-off-by: Amar Tumballi <amarts@redhat.com>
*	copy_file_range support in GlusterFS	Raghavendra Bhat	2018-12-12	1	-0/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* libglusterfs changes to add new fop * Fuse changes: - Changes in fuse bridge xlator to receive and send responses * posix changes to perform the op on the backend filesystem * protocol and rpc changes for sending and receiving the fop * gfapi changes for performing the fop * tools: glfs-copy-file-range tool for testing copy_file_range fop - Although, copy_file_range support has been added to the upstream fuse kernel module, no release has been made yet of a kernel which contains the support. It is expected to come in the upcoming release of linux-4.20 So, as of now, executing copy_file_range fop on a fused based filesystem results in fuse kernel module sending read on the source fd and write on the destination fd. Therefore a small gfapi based tool has been written to be able test the copy_file_range fop. This tool is similar (in functionality) to the example program given in copy_file_range man page. So, running regular copy_file_range on a fuse mount point and running gfapi based glfs-copy-file-range tool gives some idea about how fast, the copy_file_range (or reflink) can be. On the local machine this was the result obtained. mount -t glusterfs workstation:new /mnt/glusterfs [root@workstation ~]# cd /mnt/glusterfs/ [root@workstation glusterfs]# ls file [root@workstation glusterfs]# cd [root@workstation ~]# time /tmp/a.out /mnt/glusterfs/file /mnt/glusterfs/new real 0m6.495s user 0m0.000s sys 0m1.439s [root@workstation ~]# time glfs-copy-file-range $(hostname) new /tmp/glfs.log /file /rrr OPEN_SRC: opening /file is success OPEN_DST: opening /rrr is success FSTAT_SRC: fstat on /rrr is success copy_file_range successful real 0m0.309s user 0m0.039s sys 0m0.017s This tool needs following arguments 1) hostname 2) volume name 3) log file path 4) source file path (relative to the gluster volume root) 5) destination file path (relative to the gluster volume root) "glfs-copy-file-range <hostname> <volume> <log file path> <source> <destination>" - Added a testcase as well to run glfs-copy-file-range tool * io-stats changes to capture the fop for profiling * NOTE: - Added conditional check to see whether the copy_file_range syscall is available or not. If not, then return ENOSYS. - Added conditional check for kernel minor version in fuse_kernel.h and fuse-bridge while referring to copy_file_range. And the kernel minor version is kept as it is. i.e. 24. Increment it in future when there is a kernel release which contains the support for copy_file_range fop in fuse kernel module. * The document which contains a writeup on this enhancement can be found at https://docs.google.com/document/d/1BSILbXr_knynNwxSyyu503JoTz5QFM_4suNIh2WwrSc/edit Change-Id: I280069c814dd21ce6ec3be00a884fc24ab692367 updates: #536 Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
*	server: Resolve memory leak path in server_init	Mohit Agrawal	2018-12-03	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: 1) server_init does not cleanup allocate resources while it is failed before return error 2) dict leak at the time of graph destroying Solution: 1) free resources in case of server_init is failed 2) Take dict_ref of graph xlator before destroying the graph to avoid leak Change-Id: I9e31e156b9ed6bebe622745a8be0e470774e3d15 fixes: bz#1654917 Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
*	fuse: interrupt handling framework	Csaba Henk	2018-11-06	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- add sub-framework to send timed responses to kernel - add interrupt handler queue - implement INTERRUPT fuse_interrupt looks up handlers for interrupted messages in the queue. If found, it invokes the handler function. Else responds with EAGAIN with a delay. See spec at https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/fuse.txt?h=v4.17#n148 and explanation in comments. Change-Id: I1a79d3679b31f36e14b4ac8f60b7f2c1ea2badfb updates: #465 Signed-off-by: Csaba Henk <csaba@redhat.com>
*	socket: use accept4/paccept for nonblocking socket	Krishnan Parthasarathi	2018-10-12	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This reduces the no. of syscalls on Linux systems from 2, accept(2) and fcntl(2) for setting O_NONBLOCK, to a single accept4(2). On NetBSD, we have paccept(2) that does the same, if we leave signal masking aside. Added sys_accept which accepts an extra flags argument than accept(2). This would opportunistically use accept4/paccept as available. It would fallback to accept(2) and fcntl(2) otherwise. While at this, the patch sets FD_CLOEXEC flag on the accepted socket fd. BUG: 1236272 Change-Id: I41e43fd3e36d6dabb07e578a1cea7f45b7b4e37f fixes: bz#1236272 Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com>
*	socket: set FD_CLOEXEC on all sockets	Krishnan Parthasarathi	2018-10-11	1	-0/+1
\| \| \| \| \| \| \| \| \|	For more information, see http://udrepper.livejournal.com/20407.html BUG: 1236272 Change-Id: I25a645c10bdbe733a81d53cb714eb036251f8129 fixes: bz#1236272 Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com>
*	all: fix warnings on non 64-bits architectures	Xavi Hernandez	2018-10-10	1	-2/+1
\| \| \| \| \| \| \| \| \| \|	When compiling in other architectures there appear many warnings. Some of them are actual problems that prevent gluster to work correctly on those architectures. Change-Id: Icdc7107a2bc2da662903c51910beddb84bdf03c0 fixes: bz#1632717 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
*	Some (mgmt) xlators: use dict_{setn\|getn\|deln\|get_int32n\|set_int32n\|set_strn}	Yaniv Kaul	2018-09-09	1	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In a previous patch (https://review.gluster.org/20769) we've added the key length to be passed to dict_* funcs, to remove the need to strlen() it. This patch moves some xlators to use it. - It also adds dict_get_int32n which was missing. - It also reduces the size of some key variables. They were set to 1024b or PATH_MAX, where sometimes 64 bytes were really enough. Please review carefully: 1. That I did not reduce some the size of the key variables too much. 2. That I did not mix up some keys. Compile-tested only! Change-Id: Ic729baf179f40e8d02bc2350491d4bb9b6934266 updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
*	features/uss: Use xxh64 to generate gfid instead of md5sum	Raghavendra Manjunath	2018-09-05	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* This is to ensure FIPS support * Also changed the signature of svs_uuid_generate to get xlator argument * Added xxh64 wrapper functions in common-utils to generate gfid using xxh64 - Those wrapper functions can be used by other xlators as well to generate gfids using xxh64. But as of now snapview-server is going to be the only consumer. Change-Id: Ide66573125dd74122430cccc4c4dc2a376d642a2 Updates: #230 Signed-off-by: Raghavendra Manjunath <raghavendra@redhat.com> Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
*	libglusterfs/src/dict.{c,h}: Introduce dict_setn, dict_addn, dict_addn and ↵	Yaniv Kaul	2018-08-27	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	others. They all take as a parameter the key length, instead of strlen() it. In most cases, we know the key length, we just never bothered to save and pass it along. (We most likely sprintf'ed it earlier and the return value could have been used). A more interesting addition is dict_set_nstrn() [horrible name. Ideas are welcome]. It accepts both the string length and the key length and avoids strlen() both. Some of it can be calculated on compile-time, btw. For example: dict_set_str (dict, "key", "all"); Should become: dict_set_nstrn (dict, "key", sizeof ("key"), "all", sizeof ("all")); Compile-tested only! Change-Id: Ic2667f445f6c2e22e279505f5ad435788b4b668c updates: bz#1193929 Signed-off-by: Yaniv Kaul <ykaul@redhat.com>
*	md-cache: Do not invalidate cache post set/remove xattr	Poornima G	2018-07-11	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since setxattr and removexattr fops cbk do not carry poststat, the stat cache was being invalidated in setxatr/remoxattr cbk. Hence the further lookup wouldn't be served from cache. To prevent this invalidation, md-cache is modified to get the poststat in set/removexattr_cbk in dict. Co-authored with Xavi Hernandez. Change-Id: I6b946be2d20b807e2578825743c25ba5927a60b4 fixes: bz#1586018 Signed-off-by: Xavi Hernandez <xhernandez@redhat.com> Signed-off-by: Poornima G <pgurusid@redhat.com>
*	core: make glfs_iobuf_copy() consumable for general purpose.	Susant Palai	2018-05-24	1	-0/+1
\| \| \| \| \| \| \| \| \|	Currently plugins for cloudsync will be using it to write back data downloaded from remote store/cloud. Change-Id: I59f10bebed21b19568c94cbf29e3d536d5570749 Updates: #387 Signed-off-by: Susant Palai <spalai@redhat.com>
*	build: Disallow unresolved symbol references	Prashanth Pai	2018-05-18	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the past, it was often[1] forgotten for xlators to be linked against the symbols they refer to. This often caused glusterd2 to fail while loading xlator's shared object (.so) file. This change adds "--no-undefined" as a linker flag which causes the linker to treat unresolved symbol references as an error and hence fail linking. [1]: https://review.gluster.org/#/c/19912/ https://review.gluster.org/#/c/19664/ https://review.gluster.org/#/c/19056/ https://review.gluster.org/#/c/17659/ https://bugzilla.redhat.com/show_bug.cgi?id=1532238 Bonus: Added cloudsync and utime xlator's generated source files to .gitignore Updates: bz#1193929 Change-Id: I9604a4a87b7313a5fa43bda5fdb37dfa7ef8facd Signed-off-by: Prashanth Pai <ppai@redhat.com>
*	gfapi : RECALL_LEASE implementation	Soumya Koduri	2018-05-04	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Right now there are two types of upcalls * poll method * registering callback But callback can be registered per fs and same callback fn shall be used for any lease recall with object handle as argument as done for cache invalidation. TODO: RECALL LEASE for each glfd (for future reference) (may be needed fo Samba as they do not deal with object handles. In case of RECALL_LEASE, we could associate separate cbk function for each glfd either by - extending pub_glfs_lease to accept new args (recall_cbk_fn, cookie) - or by defining new API "glfs_register_recall_cbk_fn (glfd, recall_cbk_fn, cookie) . In such cases, flag it and instead of calling below upcall functions, define a new one to go through the glfd list and invoke each of theirs recall_cbk_fn. Plus added following as well * passed lease id to dict in required arguments * added flag check in pub_glfs_open Updates: #350 Change-Id: I07a971f0f26ec6aae0b9f9a5613504317dee153b Signed-off-by: Soumya Koduri <skoduri@redhat.com> Signed-off-by: Poornima G <pgurusid@redhat.com> Signed-off-by: Jiffin Tony Thottan <jthottan@redhat.com>
*	features/bitrot: print the path of the corrupted objects	Raghavendra Bhat	2018-05-04	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently "gluster volume bitrot <volume name> scrub status" gives the list of the corrupted objects (files as of now). But only the gfids of those corrupted objects are seen and one has to do getfattr, find etc operations to get the actual path of those objects for removal etc. This change makes an attempt to print the path of those files as much as possible. * Try to get the path using the on disk gfid2path xattr. * If the above operation fails, then go for in memory path (provided that the object has its dentry properly created and linked in the inode table of the brick where the corrupted object is present) So the gfid to path resolution is a soft resolution, i.e. based on the inode and dentry cache in the brick's memory. If the path cannot be obtained via inode table also, then only gfid is printed. Change-Id: Ie9a30307f43a49a2a9225821803c7d40d231de68 fixes: bz#1570962 Signed-off-by: Raghavendra Bhat <raghavendra@redhat.com>
*	server: fix unresolved symbols by moving them to libglusterfs	Mohit Agrawal	2018-04-20	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: glusterd2 build is failed due to undefined symbol (xlator_mem_cleanup , glusterfsd_ctx) in server.so Solution: To resolve the same done below two changes 1) Move xlator_mem_cleanup code from glusterfsd-mgmt.c to xlator.c to be part of libglusterfs.so 2) replace glusterfsd_ctx to this->ctx because symbol glusterfsd_ctx is not part of server.so BUG: 1544090 Change-Id: Ie5e6fba9ed458931d08eb0948d450aa962424ae5 fixes: bz#1544090 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
*	gluster: Sometimes Brick process is crashed at the time of stopping brick	Mohit Agrawal	2018-04-19	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: Sometimes brick process is getting crashed at the time of stop brick while brick mux is enabled. Solution: Brick process was getting crashed because of rpc connection was not cleaning properly while brick mux is enabled.In this patch after sending GF_EVENT_CLEANUP notification to xlator(server) waits for all rpc client connection destroy for specific xlator.Once rpc connections are destroyed in server_rpc_notify for all associated client for that brick then call xlator_mem_cleanup for for brick xlator as well as all child xlators.To avoid races at the time of cleanup introduce two new flags at each xlator cleanup_starting, call_cleanup. BUG: 1544090 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> Note: Run all test-cases in separate build (https://review.gluster.org/#/c/19700/) with same patch after enable brick mux forcefully, all test cases are passed. Change-Id: Ic4ab9c128df282d146cf1135640281fcb31997bf updates: bz#1544090
*	Quota: heal directory on newly added bricks when quota limit is reached	Sanoj Unnikrishnan	2018-03-28	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: if a lookup is done on a newly added brick for a path on which limit has been reached, the lookup fails to heal the directory tree due to quota. Solution: Tag the lookup as an internal fop and ignore it in quota. Since marking internal fop does not usually give enough contextual information. Introducing new flags to pass the contextual info. Adding dict_check_flag and dict_set_flag to aid flag operations. A flag is a single bit in a bit array (currently limited to 256 bits). Change-Id: Ifb6a68bcaffedd425dd0f01f7db24edd5394c095 fixes: bz#1505355 BUG: 1505355 Signed-off-by: Sanoj Unnikrishnan <sunnikri@redhat.com>
*	glusterd: TLS verification fails while using intermediate CA	Mohit Agrawal	2018-03-19	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Problem: TLS verification fails while using intermediate CA if mgmt SSL is enabled. Solution: There are two main issue of TLS verification failing 1) not calling ssl_api to set cert_depth 2) The current code does not allow to set certificate depth while MGMT SSL is enabled. After apply this patch to set certificate depth user need to set parameter option transport.socket.ssl-cert-depth <depth> in /var/lib/glusterd/secure_acccess instead to set in /etc/glusterfs/glusterd.vol. At the time of set secure_mgmt in ctx we will check the value of cert-depth and save the value of cert-depth in ctx.If user does not provide any value in cert-depth in that case it will consider default value is 1 BUG: 1555154 Change-Id: I89e9a9e1026e37efb5c20f9ec62b1989ef644f35 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
*	protocol: Fix 4.0 client, parsing older iatt in dict	ShyamsundarR	2018-03-10	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In a mixed mode cluster involving 4.0 and older 3.x bricks, if clients are newer, then the iatt encoded in the dictionary can be of the older iatt format, which a newer client will map incorrectly to the newer structure. This causes failures in FOPs that depend on this iatt for some functionality (seen in mkdir operations failing as EIO, when DHT hits its internal setxattr call). The fix provided is to convert the iatt in the dict, based on which RPC version is used to communicate with the server. IOW, this is the reverse of change in commit "b966c7790e" Tested using a mixed mode cluster (i.e bricks in 3.12 and 4.0 versions) and a mixed set of clients, 3.12 and 4.0 clients. There is no regression test provided, as this needs a mixed mode cluster to test and validate. Change-Id: I454e54651ca836b9f7c28f45f51d5956106aefa9 BUG: 1554053 Signed-off-by: ShyamsundarR <srangana@redhat.com>
*	xlators/features/namespace: Add namespace xlator and link into brick graph	Varsha Rao	2018-02-21	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The following release-3.8-fb branch patch is upstreamed: > features/namespace: Add namespace xlator and link into brick graph > Commit ID: dbd30776f26e > https://review.gluster.org/#/c/18041/ > By Michael Goulet <mgoulet@fb.com> Changes in this patch: Removes extra config.h and namespace.h file in namespace.c Adds default_getspec_cbk to libglusterfs.sym Rename dict_for_each to dict_foreach_inline Remove fd.h header file stack.h Add test case for truncate, open and symlink This patch is required to forward port io-threads namespace patch. Updates: #401 Change-Id: Ib88c95b89eecee9b8957df8a4c8712c899c761d1 Signed-off-by: Varsha Rao <varao@redhat.com>
*	posix/afr: handle backward compatibility for rchecksum fop	Ravishankar N	2018-02-19	1	-0/+1
\| \| \| \| \| \| \| \| \|	Added a volume option 'fips-mode-rchecksum' tied to op version 4. If not set, rchecksum fop will use MD5 instead of SHA256. updates: #230 Change-Id: Id8ea1303777e6450852c0bc25503cda341a6aec2 Signed-off-by: Ravishankar N <ravishankar@redhat.com>