glusterfs.git/libglusterfs/src/glusterfs.h, branch v3.11.0

glusterfsd: process attach and detach request inside lock

2017-05-29T13:56:08+00:00

With brick multiplexing, there is a high possibility that attach and
detach requests are processed in parallel. To avoid concurrent updates
to the same graph list, a mutex lock is required.

Please note that this backport defines the volfile_lock mutex, which was
introduced by a different patch (https://review.gluster.org/15036) in
mainline and is not available in the release-3.11 branch.
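
A minimal sketch of the locking pattern described above, not the actual
glusterfsd code. The demo_* structures and functions are hypothetical
stand-ins; only the idea of a volfile_lock mutex guarding the graph list
comes from this commit message.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real graph and context structures. */
struct demo_graph {
    int id;
    struct demo_graph *next;
};

struct demo_ctx {
    pthread_mutex_t volfile_lock; /* guards the graph list below */
    struct demo_graph *graphs;
};

void demo_ctx_init(struct demo_ctx *ctx)
{
    pthread_mutex_init(&ctx->volfile_lock, NULL);
    ctx->graphs = NULL;
}

/* Attach: link a graph at the head of the list while holding the lock. */
void demo_attach(struct demo_ctx *ctx, struct demo_graph *g)
{
    assert(g != NULL);
    pthread_mutex_lock(&ctx->volfile_lock);
    g->next = ctx->graphs;
    ctx->graphs = g;
    pthread_mutex_unlock(&ctx->volfile_lock);
}

/* Detach: unlink the graph with the given id; 0 on success, -1 if absent. */
int demo_detach(struct demo_ctx *ctx, int id)
{
    int ret = -1;

    pthread_mutex_lock(&ctx->volfile_lock);
    for (struct demo_graph **pp = &ctx->graphs; *pp; pp = &(*pp)->next) {
        if ((*pp)->id == id) {
            *pp = (*pp)->next;
            ret = 0;
            break;
        }
    }
    pthread_mutex_unlock(&ctx->volfile_lock);
    return ret;
}

/* Count the attached graphs, again under the same lock. */
size_t demo_graph_count(struct demo_ctx *ctx)
{
    size_t n = 0;

    pthread_mutex_lock(&ctx->volfile_lock);
    for (struct demo_graph *g = ctx->graphs; g; g = g->next)
        n++;
    pthread_mutex_unlock(&ctx->volfile_lock);
    return n;
}
```

Because every walk and update of the list happens between lock and
unlock, an attach and a detach arriving at the same time are serialized
instead of racing on the list pointers.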

Credits : Rafi (rkavunga@redhat.com) for the RCA of this issue

>Reviewed-on: https://review.gluster.org/17374
>Smoke: Gluster Build System 
>NetBSD-regression: NetBSD Build System 
>CentOS-regression: Gluster Build System 
>Reviewed-by: Jeff Darcy 
>(cherry picked from commit 3ca5ae2f3bff2371042b607b8e8a218bf316b48c)

Change-Id: Ic8e6d1708655c8a143c5a3690968dfa572a32a9c
BUG: 1455907
Signed-off-by: Atin Mukherjee 
Reviewed-on: https://review.gluster.org/17402
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

glusterfsd: send PARENT_UP on brick attach

2017-05-16T00:32:25+00:00

With brick multiplexing enabled, if a brick instance is attached to a
process then a PARENT_UP event is needed so that it reaches all the way
down to the posix layer, and then from posix a CHILD_UP event is sent
back to all the children.

>Reviewed-on: https://review.gluster.org/17225
>NetBSD-regression: NetBSD Build System 
>Smoke: Gluster Build System 
>CentOS-regression: Gluster Build System 
>Reviewed-by: Jeff Darcy 
>(cherry picked from commit 86ad032949cb80b6ba3df9dc8268243529d4eb84)

Change-Id: Ic341086adb3bbbde0342af518e1b273dd2f669b9
BUG: 1450729
Signed-off-by: Atin Mukherjee 
Reviewed-on: https://review.gluster.org/17289
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan

core: make the per glusterfs_ctx_t timer-wheel refcounted

2017-05-12T13:32:32+00:00

xlators can use a 'global' timer-wheel for scheduling events. This
timer-wheel is managed per glusterfs_ctx_t, but does not need to be
allocated for every graph. When an xlator wants to use the timer-wheel,
it will be instantiated on demand, and provided to xlators that request
it later on.

By adding a reference counter to the glusterfs_ctx_t for the
timer-wheel, the threads and structures can be cleaned up when the last
xlator does not have a need for it anymore. In general, the xlators
request the timer-wheel in init(), and they should return it in fini().

Because the timer-wheel is managed per glusterfs_ctx_t, the functions
can be added to ctx.c and do not need to live in their very minimal
tw.[ch] files.
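
A rough sketch of the get/put pattern described above, assuming a single
lock in the context protects both the pointer and the counter. The
demo_* names are hypothetical and the timer-wheel is reduced to an empty
placeholder structure; the real functions live in ctx.c.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Placeholder for the real timer-wheel state (threads, buckets, ...). */
struct demo_tw {
    int unused;
};

struct demo_ctx {
    pthread_mutex_t lock;
    struct demo_tw *timer_wheel; /* NULL until the first user asks for it */
    int tw_refcount;
};

void demo_ctx_init(struct demo_ctx *ctx)
{
    pthread_mutex_init(&ctx->lock, NULL);
    ctx->timer_wheel = NULL;
    ctx->tw_refcount = 0;
}

/* Called from an xlator's init(): allocate on first use, then share. */
struct demo_tw *demo_tw_get(struct demo_ctx *ctx)
{
    struct demo_tw *tw;

    pthread_mutex_lock(&ctx->lock);
    if (ctx->timer_wheel == NULL)
        ctx->timer_wheel = calloc(1, sizeof(*ctx->timer_wheel));
    if (ctx->timer_wheel != NULL)
        ctx->tw_refcount++;
    tw = ctx->timer_wheel;
    pthread_mutex_unlock(&ctx->lock);
    return tw;
}

/* Called from an xlator's fini(): the last put releases the wheel. */
void demo_tw_put(struct demo_ctx *ctx)
{
    pthread_mutex_lock(&ctx->lock);
    assert(ctx->tw_refcount > 0);
    if (--ctx->tw_refcount == 0) {
        free(ctx->timer_wheel);
        ctx->timer_wheel = NULL;
    }
    pthread_mutex_unlock(&ctx->lock);
}
```

Two xlators calling demo_tw_get() receive the same pointer, and the
structure is freed only after both have called demo_tw_put().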


>Reported-by: Poornima G 
>Signed-off-by: Niels de Vos 
>Reviewed-on: https://review.gluster.org/17068
>NetBSD-regression: NetBSD Build System 
>CentOS-regression: Gluster Build System 
>Smoke: Gluster Build System 
>Reviewed-by: Amar Tumballi 
>Reviewed-by: Zhou Zhengping 
>Reviewed-by: Kaleb KEITHLEY 
>(cherry picked from commit 73fcf3a874b2049da31d01b8363d1ac85c9488c2)

Change-Id: I19d225b39aaa272d9005ba7adc3104c3764f1572
BUG: 1450267
Reviewed-on: https://review.gluster.org/17262
Tested-by: Poornima G 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Niels de Vos

feature/dht: Directory synchronization

2017-04-26T09:00:34+00:00

Design doc: https://review.gluster.org/16876

Directory creation is now synchronized with blocking inodelk of the
parent on the hashed subvolume followed by the entrylk on the hashed
subvolume between dht_mkdir, dht_rmdir, dht_rename_dir and lookup
selfheal mkdir.

To maintain internal consistency of directories across all subvols of
dht, we need locks. Specifically we are interested in:

 1. Consistency of layout of a directory. Only one writer should modify
    the layout at a time. A writer (layout setting during directory heal
    as part of lookup) shouldn't modify the layout while there are
    readers (all other fops like create, mkdir etc., which consume
    layout) and readers shouldn't read the layout while a writer is in
    progress. Readers can read the layout simultaneously. Writer takes
    a WRITE inodelk on the directory (whose layout is being modified)
    across ALL subvols. Reader takes a READ inodelk on the directory
    (whose layout is being read) on ANY subvol.

 2. Consistency of directory namespace across subvols. The path and
    associated gfid should be same on all subvols. A gfid should not be
    associated with more than one path on any subvol. All fops that can
    change directory names (mkdir, rmdir, renamedir, directory creation
    phase in lookup-heal) take an entrylk on the hashed subvol of the
    directory.

 NOTE1: In point 2 above, since dht takes entrylk on hashed subvol of a
        directory, the transaction itself is a consumer of layout on
        parent directory. So, the transaction is a reader of parent
        layout and does an inodelk on parent directory just like any
        other layout reader. So a mkdir (dir/subdir) would:

     > Acquire a READ inodelk on "dir" on any subvol.
     > Acquire an entrylk (dir, "subdir") on hashed subvol of "subdir".
     > Create the directory on hashed subvol and possibly on non-hashed subvols.
     > UNLOCK (entrylk)
     > UNLOCK (inodelk)
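
The sequence above can be modelled with a small single-threaded sketch.
Everything here is hypothetical: the demo_* locks are plain counters that
fail instead of blocking, whereas the real inodelk in this patch is
blocking and is taken on a subvolume. The sketch only illustrates the
ordering (READ inodelk on the parent, then entrylk, then the work, then
unlock in reverse) and the reader/writer exclusion on the layout.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy layout lock: many readers or one writer, never both. */
struct demo_inodelk {
    int readers;
    bool writer;
};

/* Toy entry lock on (parent, name). */
struct demo_entrylk {
    bool held;
};

bool demo_read_lock(struct demo_inodelk *l)
{
    if (l->writer)
        return false; /* a layout writer (lookup heal) is in progress */
    l->readers++;
    return true;
}

void demo_read_unlock(struct demo_inodelk *l)
{
    assert(l->readers > 0);
    l->readers--;
}

bool demo_write_lock(struct demo_inodelk *l)
{
    if (l->writer || l->readers > 0)
        return false; /* writers need exclusive access to the layout */
    l->writer = true;
    return true;
}

void demo_write_unlock(struct demo_inodelk *l)
{
    assert(l->writer);
    l->writer = false;
}

/* mkdir(dir/subdir): returns 0 on success, -1 if a lock is unavailable. */
int demo_mkdir(struct demo_inodelk *parent_layout,
               struct demo_entrylk *name_lock, int *dirs_created)
{
    if (!demo_read_lock(parent_layout)) /* READ inodelk on "dir" */
        return -1;
    if (name_lock->held) {              /* entrylk (dir, "subdir") busy */
        demo_read_unlock(parent_layout);
        return -1;
    }
    name_lock->held = true;

    (*dirs_created)++;                  /* create the directory */

    name_lock->held = false;            /* UNLOCK (entrylk) */
    demo_read_unlock(parent_layout);    /* UNLOCK (inodelk) */
    return 0;
}
```

While a layout writer holds the parent, demo_mkdir() is refused; once the
writer releases, it succeeds and leaves both locks free again.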

 NOTE2: mkdir fop while setting the layout of the directory being created
        is considered as a reader, but NOT a writer. The reason is that,
        for a fop which can consume the layout of a directory to arrive,
        either of the following conditions has to be true:

     > mkdir syscall from application has to complete. In this case there
       is no need for synchronization.
     > A lookup issued on the directory racing with mkdir has to complete.
       Since layout setting by a lookup is considered as a writer, only
       one of either mkdir or lookup will set the layout.

Code re-organization:
   All the lock related routines are moved to the "dht-lock.c" file.
   A new wrapper function, 'dht_protect_namespace', is introduced to
   take a blocking inodelk followed by an entrylk.

Updates #191
Change-Id: I01569094dfbe1852de6f586475be79c1ba965a31
Signed-off-by: Kotresh HR 
BUG: 1443373
Reviewed-on: https://review.gluster.org/15472
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Raghavendra G 
Smoke: Gluster Build System

xlator: do not call dlclose() when debugging

2017-04-07T17:17:12+00:00

Valgrind cannot show the symbols of a .so after dlclose() has been
called. The unhelpful ??? in the output gets resolved properly with this
change:

  ==25170== 344 bytes in 1 blocks are definitely lost in loss record 233 of 324
  ==25170==    at 0x4C29975: calloc (vg_replace_malloc.c:711)
  ==25170==    by 0x52C7C0B: __gf_calloc (mem-pool.c:117)
  ==25170==    by 0x12B0638A: ???
  ==25170==    by 0x528FCE6: __xlator_init (xlator.c:472)
  ==25170==    by 0x528FE16: xlator_init (xlator.c:498)
  ==25170==    by 0x52DA8D6: glusterfs_graph_init (graph.c:321)
  ==25170==    by 0x52DB587: glusterfs_graph_activate (graph.c:695)
  ==25170==    by 0x5046407: glfs_process_volfp (glfs-mgmt.c:79)
  ==25170==    by 0x5043B9E: glfs_volumes_init (glfs.c:281)
  ==25170==    by 0x5044FEC: glfs_init_common (glfs.c:986)
  ==25170==    by 0x50451A7: glfs_init@@GFAPI_3.4.0 (glfs.c:1031)

By not calling dlclose(), the dynamically loaded .so is still available
upon program exit, and Valgrind is able to resolve the symbols. This
will add an additional leak, so dlclose() is called for normal builds,
but skipped when configuring with "./configure --enable-valgrind" or
passing the "run-with-valgrind" xlator option.
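
A minimal sketch of the idea, assuming a boolean that reflects either the
configure switch or the xlator option. demo_unload() and its parameter
are hypothetical names, not the code in xlator.c.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Release a handle obtained from dlopen(). When running under Valgrind
 * the handle is deliberately kept open so that symbols in the shared
 * object can still be resolved in the leak report at exit.
 */
int demo_unload(void *handle, bool run_with_valgrind)
{
    assert(handle != NULL);
    if (run_with_valgrind)
        return 0; /* intentional small leak, traded for readable traces */
    return dlclose(handle);
}
```

With run_with_valgrind set, the shared object stays mapped for the life
of the process; in a normal build the handle is closed as before.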

URL: http://valgrind.org/docs/manual/faq.html#faq.unhelpful
Change-Id: I2044e21b1b8fcce32ad1a817fdd795218f967731
BUG: 1425623
Signed-off-by: Niels de Vos 
Reviewed-on: https://review.gluster.org/16809
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Samikshan Bairagya 
Reviewed-by: Kaleb KEITHLEY

core: run many bricks within one glusterfsd process

2017-01-31T00:13:58+00:00

This patch adds support for multiple brick translator stacks running
in a single brick server process.  This reduces our per-brick memory usage by
approximately 3x, and our appetite for TCP ports even more.  It also creates
potential to avoid process/thread thrashing, and to improve QoS by scheduling
more carefully across the bricks, but realizing that potential will require
further work.

Multiplexing is controlled by the "cluster.brick-multiplex" global option.  By
default it's off, and bricks are started in separate processes as before.  If
multiplexing is enabled, then *compatible* bricks (mostly those with the same
transport options) will be started in the same process.

Change-Id: I45059454e51d6f4cbb29a4953359c09a408695cb
BUG: 1385758
Signed-off-by: Jeff Darcy 
Reviewed-on: https://review.gluster.org/14763
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System 
Reviewed-by: Vijay Bellur

tier : Tier as a service

2017-01-17T04:49:47+00:00

tierd is implemented by separating it from the rebalance process.

The commands affected:

1) attach tier will trigger this process instead of the old one.
2) tier start and tier start force will also trigger this process.
3) volume status [tier] will show the tier daemon as a process instead
of a task, and normal tier status and tier detach status work.
4) tier stop is implemented.
5) detach tier is implemented separately, along with a new detach tier
status.
6) volume tier volname status will work using the changes.
7) volume set works.
This patch has separated the tier translator from the legacy
DHT rebalance code. It now sends the RPCs from the CLI
to glusterd separately from the DHT rebalance code.
The daemon is now a service, similar to the snapshot daemon,
and can be viewed using the volume status command.

The code for the validation and commit phases is the same
as the earlier tier validation code in DHT rebalance.

The “brickop” phase has been changed so that the status
command can use this framework.

The service management framework is now used.
DHT rebalance does not use this framework.

This service framework takes care of:

*) spawning the daemon, killing it and other such processes.
*) volume set options, which are written to the volfile.
*) restart and reconfigure functions. Restart restarts
the daemon at two points:
        1) after gluster goes down and comes up.
        2) to stop detach tier.
*) reconfigure is used to make immediate volfile changes.
By doing this, we don’t restart the daemon.
It also has the code to rewrite the volfile for topological
changes (which come into play during add and remove brick).

With this patch the log, pid, and volfile are separated
and put into respective directories.

Change-Id: I3681d0d66894714b55aa02ca2a30ac000362a399
BUG: 1313838
Signed-off-by: hari gowtham 
Reviewed-on: http://review.gluster.org/13365
Smoke: Gluster Build System 
Tested-by: hari gowtham 
CentOS-regression: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Dan Lambright 
Reviewed-by: Atin Mukherjee

ec: Invalidations in disperse volume should not update the stat

2017-01-06T05:12:18+00:00

Issue:
In a disperse volume, the file is spread across bricks, hence the stat
from one brick doesn't carry the valid size of the file. Therefore
an upcall from one brick that updates md-cache results in the wrong
size being cached.

Fix:
If the notification is a cache invalidation, then indicate to md-cache
that the attributes are invalid.

BUG: 1410375
Change-Id: Id89d2283478e70b62b435a8891fffc86d2be8cb2
Signed-off-by: Poornima G 
Reviewed-on: http://review.gluster.org/16329
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
Reviewed-by: Xavier Hernandez 
CentOS-regression: Gluster Build System 
Reviewed-by: Pranith Kumar Karampuri

md-cache, afr: Reduce the window of stale read

2016-10-20T07:07:55+00:00

Problem:
Consider a replica setup where one mount writes data to a file and
another mount reads the file. In afr, read operations are not
transaction based: a brick (the read subvolume) is chosen as part of
lookup or other operations, and reads are always wound only to the read
subvolume, even if a write from a different client failed on this
brick. This stale read continues until there is a lookup or any write
operation from the mount point. Currently, this is not a major issue, as
a lookup is issued before every read and it will switch the read
subvolume to a correct one. But with the plan of increasing the md-cache
timeout to 600s, the stale read problem will be more pronounced, i.e.
stale reads can continue for 600s (or more if cascaded with readdirp),
as there will be no lookups.

Solution:
Afr doesn't have any built-in solution for stale reads (without affecting
performance). The solution that came up was to use upcall. When a file
on any brick is marked bad for the first time, upcall sends a notification
to all the clients that had recently accessed the file. The solution has
2 parts:
- Identifying when a file is marked bad, on any of the bricks,
  for the first time
- Client side actions on receiving the notifications

Identifying when a file is marked bad on any of the bricks for the first time:
-----------------------------------------------------------------------------
The idea is to track xattrop in upcall. xattrop currently comes with 2 afr
xattrs - afr dirty bit and afr pending xattrs.
   Dirty xattr is set to 1 before every write, and is unset if the write succeeds.
In certain scenarios, the dirty xattr can be 0 and the file could still be a bad
copy. Hence do not track the dirty xattr.
   Pending xattr is set on the good copy, indicating the other bricks that have
a bad copy. It is still not as simple as notifying when any of the pending xattrs
changes. That could lead to a flood of notifications in case the other brick is
completely down or consistently failing. Hence it is important to notify only
once, the first time a good copy is marked bad.
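
The notify-once rule can be sketched as a small state machine per file.
This is an illustration only: the demo_* names are hypothetical, and
clearing the flag once all pending counters return to zero is an
assumption of this sketch, not something stated in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-file state kept on the brick side. */
struct demo_file_state {
    bool marked_bad; /* a notification was already sent for this episode */
};

/*
 * Decide whether an xattrop carrying pending counters should trigger a
 * notification. Only the transition from "no pending changes" to "some
 * pending changes" notifies; repeated failures stay silent.
 */
bool demo_should_notify(struct demo_file_state *st,
                        const uint32_t *pending, size_t count)
{
    bool any_pending = false;

    assert(st != NULL);
    for (size_t i = 0; i < count; i++) {
        if (pending[i] != 0)
            any_pending = true;
    }

    if (!any_pending) {
        st->marked_bad = false; /* assumption: reset once nothing is pending */
        return false;
    }
    if (st->marked_bad)
        return false;           /* already notified, avoid a flood */

    st->marked_bad = true;
    return true;
}
```

A brick that stays down therefore produces one notification for the file
rather than one per failed write.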

Client side actions on receiving pending xattr change notification:
-------------------------------------------------------------------
md-cache will invalidate the cache of that file, so that a further lookup is
passed down to afr and hence updates the read subvolume. Invalidating only in
md-cache is not enough; consider the following order of operations:
- pending xattr invalidation - invalidate md-cache
- readdirp on the bad read subvolume - fill md-cache
- lookup (served from md-cache)
- read - wound to the old read subvol.
Hence, along with invalidating md-cache, it is very important to reset the
read subvolume for that file, in afr.

Design Credit: Anuradha Talur, Ravishankar N

1. xattrop doesn't carry info saying post op/pre op.
2. Pre xattrop will have 0 value for all pending xattrs,
   the cbk of pre xattrop carries the on-disk xattr value.
   A non-zero value indicates that healing is required.
3. Post xattrop will have non zero value for any of the
   pending xattrs, if the fop failed on any of the bricks.

Change-Id: I469cbc111714c433984fe1c922be2ef113c25804
BUG: 1211863
Signed-off-by: Poornima G 
Reviewed-on: http://review.gluster.org/15398
Reviewed-by: Pranith Kumar Karampuri 
Tested-by: Pranith Kumar Karampuri 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System

glusterfsd/main: fix OOM adjustment for older kernels

2016-10-11T12:18:05+00:00

Milind Changire reported that GlusterFS fails to build on RHEL5
because linux/oom.h is unavailable.

Milind's initial patch disables OOM adjustment completely
for those environments that do not have this header. However,
I'd take another approach that:

1) checks for linux/oom.h at compile time and defines the necessary
constants if the header is not present;
2) checks for the available OOM API in /proc at run time and uses it
accordingly.

This allows OOM to be adjusted properly on RHEL5 (the kernel is new enough
to present a /proc API for that) as well as RHEL6 (the kernel has many things
backported, including the new /proc API).
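
A rough sketch of the two checks, not the code from this patch. The
HAVE_LINUX_OOM_H macro is assumed to come from configure, the fallback
values mirror the constants in linux/oom.h, and the demo_* names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Compile time: use the kernel header if present, else define fallbacks. */
#ifdef HAVE_LINUX_OOM_H
#include <linux/oom.h>
#endif
#ifndef OOM_SCORE_ADJ_MIN
#define OOM_SCORE_ADJ_MIN (-1000)
#define OOM_SCORE_ADJ_MAX 1000
#endif
#ifndef OOM_ADJUST_MIN
#define OOM_ADJUST_MIN (-16)
#define OOM_ADJUST_MAX 15
#endif

struct demo_oom_api {
    const char *path; /* /proc file that accepts the adjustment */
    int min;
    int max;
};

/* Run time: prefer the newer interface, fall back to the older one. */
const struct demo_oom_api *demo_pick_oom_api(void)
{
    static const struct demo_oom_api apis[] = {
        { "/proc/self/oom_score_adj", OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX },
        { "/proc/self/oom_adj", OOM_ADJUST_MIN, OOM_ADJUST_MAX },
    };

    for (size_t i = 0; i < sizeof(apis) / sizeof(apis[0]); i++) {
        if (access(apis[i].path, F_OK) == 0)
            return &apis[i];
    }
    return NULL; /* no OOM interface available on this system */
}

/* Clamp a requested value into the range the chosen interface accepts. */
int demo_clamp_oom_value(const struct demo_oom_api *api, int value)
{
    assert(api != NULL);
    if (value < api->min)
        return api->min;
    if (value > api->max)
        return api->max;
    return value;
}
```

The caller would then write the clamped value to api->path; that step is
left out here because it changes the state of the running process.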

Change-Id: I1bc610586872d208430575c149a7d0c54bd82370
BUG: 1379769
Signed-off-by: Oleksandr Natalenko 
Reviewed-on: http://review.gluster.org/15587
Tested-by: Oleksandr Natalenko 
Reviewed-by: Niels de Vos 
Smoke: Gluster Build System 
NetBSD-regression: NetBSD Build System 
CentOS-regression: Gluster Build System