<feed xmlns='http://www.w3.org/2005/Atom'>
<title>glusterfs.git/libglusterfs/src/glusterfs, branch devel</title>
<subtitle>GlusterFS is a distributed file-system capable of scaling to several petabytes. It aggregates various storage bricks over Infiniband RDMA or TCP/IP interconnect into one large parallel network file system.</subtitle>
<link rel='alternate' type='text/html' href='https://fedorapeople.org/cgit/anoopcs/public_git/glusterfs.git/'/>
<entry>
<title>list.h: remove offensive language, introduce _list_del() (#2132)</title>
<updated>2021-04-02T05:52:07+00:00</updated>
<author>
<name>Yaniv Kaul</name>
<email>ykaul@redhat.com</email>
</author>
<published>2021-04-02T05:52:07+00:00</published>
<link rel='alternate' type='text/html' href='https://fedorapeople.org/cgit/anoopcs/public_git/glusterfs.git/commit/?id=34f854ad2f5d53578d7a4543f837cf0fb0f1a76a'/>
<id>34f854ad2f5d53578d7a4543f837cf0fb0f1a76a</id>
<content type='text'>

1. Replace the offensive variable names with the values the Linux kernel uses.
2. Introduce an internal function, _list_del(), that can be used
when list-&gt;next and list-&gt;prev are going to be assigned later on.

(As an aside, the code would benefit from more uses of list_move() and
list_move_tail(), which would also improve readability.)

* list.h: define LIST_POISON1 and LIST_POISON2, similar to the Linux
kernel defines
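
A minimal sketch of the pattern described above, following the Linux
kernel's naming; this is illustrative, not the exact glusterfs code:

```c
#include "assert.h"
#include "stdlib.h"

/* Poison values, as in the Linux kernel's list implementation. */
#define LIST_POISON1 ((void *)0x100)
#define LIST_POISON2 ((void *)0x122)

struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

static void list_add(struct list_head *item, struct list_head *head)
{
    item->prev = head;
    item->next = head->next;
    head->next->prev = item;
    head->next = item;
}

/* _list_del(): unlink the neighbours but leave item's own pointers
 * untouched, for callers that reassign next/prev right away. */
static void _list_del(struct list_head *item)
{
    item->prev->next = item->next;
    item->next->prev = item->prev;
}

/* list_del(): unlink and poison, so a stale reuse faults loudly. */
static void list_del(struct list_head *item)
{
    _list_del(item);
    item->next = LIST_POISON1;
    item->prev = LIST_POISON2;
}
```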

Fixes: #2025
Signed-off-by: Yaniv Kaul &lt;ykaul@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>

1. Replace the offensive variable names with the values the Linux kernel uses.
2. Introduce an internal function, _list_del(), that can be used
when list-&gt;next and list-&gt;prev are going to be assigned later on.

(As an aside, the code would benefit from more uses of list_move() and
list_move_tail(), which would also improve readability.)

* list.h: define LIST_POISON1 and LIST_POISON2, similar to the Linux
kernel defines

Fixes: #2025
Signed-off-by: Yaniv Kaul &lt;ykaul@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>cluster/dht: use readdir for fix-layout in rebalance (#2243)</title>
<updated>2021-03-22T04:49:27+00:00</updated>
<author>
<name>Pranith Kumar Karampuri</name>
<email>pranith.karampuri@phonepe.com</email>
</author>
<published>2021-03-22T04:49:27+00:00</published>
<link rel='alternate' type='text/html' href='https://fedorapeople.org/cgit/anoopcs/public_git/glusterfs.git/commit/?id=ec189a499d85c2aad1d54e55e47df6b95ba02922'/>
<id>ec189a499d85c2aad1d54e55e47df6b95ba02922</id>
<content type='text'>
Problem:
On a cluster with 15 million files, when fix-layout was started, it was
not progressing at all. So we tried an os.walk() + os.stat() on the
backend filesystem directly. It took 2.5 days. We removed os.stat() and
re-ran it on another brick with a similar dataset. It took 15 minutes.
We realized that readdirp is extremely costly compared to readdir when
the stat is not useful. The fix-layout operation only needs to know
that an entry is a directory so that fix-layout can be triggered on it.
Most modern filesystems provide this information in the readdir
operation, so readdirp (i.e. readdir+stat) is not needed.

Fix:
Use the readdir operation in fix-layout. Do readdir+stat/lookup for
filesystems that don't provide d_type in the readdir operation.

fixes: #2241
Change-Id: I5fe2ecea25a399ad58e31a2e322caf69fc7f49eb
Signed-off-by: Pranith Kumar K &lt;pranith.karampuri@phonepe.com&gt;</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Problem:
On a cluster with 15 million files, when fix-layout was started, it was
not progressing at all. So we tried an os.walk() + os.stat() on the
backend filesystem directly. It took 2.5 days. We removed os.stat() and
re-ran it on another brick with a similar dataset. It took 15 minutes.
We realized that readdirp is extremely costly compared to readdir when
the stat is not useful. The fix-layout operation only needs to know
that an entry is a directory so that fix-layout can be triggered on it.
Most modern filesystems provide this information in the readdir
operation, so readdirp (i.e. readdir+stat) is not needed.

Fix:
Use the readdir operation in fix-layout. Do readdir+stat/lookup for
filesystems that don't provide d_type in the readdir operation.

fixes: #2241
Change-Id: I5fe2ecea25a399ad58e31a2e322caf69fc7f49eb
Signed-off-by: Pranith Kumar K &lt;pranith.karampuri@phonepe.com&gt;</pre>
</div>
</content>
</entry>
<entry>
<title>fuse: add an option to specify the mount display name (#1989)</title>
<updated>2021-02-22T18:06:53+00:00</updated>
<author>
<name>Amar Tumballi</name>
<email>amar@kadalu.io</email>
</author>
<published>2021-02-22T18:06:53+00:00</published>
<link rel='alternate' type='text/html' href='https://fedorapeople.org/cgit/anoopcs/public_git/glusterfs.git/commit/?id=14c578e9f3be7afbb5d088f0fe4921b67a969ea7'/>
<id>14c578e9f3be7afbb5d088f0fe4921b67a969ea7</id>
<content type='text'>
* fuse: add an option to specify the mount display name

There are two things this PR fixes.
1. When a mount is specified with the volfile (-f) option, today you can't tell that it is from glusterfs, as only the volfile is used as the 'fsname'; we now set it to 'glusterfs:/&lt;volname&gt;'.
2. Provide an option for admins who want to show a mount source other than the default (useful when one is not using 'mount.glusterfs' but their own scripts).

Updates: #1000
Change-Id: I19e78f309a33807dc5f1d1608a300d93c9996a2f
Signed-off-by: Amar Tumballi &lt;amar@kadalu.io&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
* fuse: add an option to specify the mount display name

There are two things this PR fixes.
1. When a mount is specified with the volfile (-f) option, today you can't tell that it is from glusterfs, as only the volfile is used as the 'fsname'; we now set it to 'glusterfs:/&lt;volname&gt;'.
2. Provide an option for admins who want to show a mount source other than the default (useful when one is not using 'mount.glusterfs' but their own scripts).

Updates: #1000
Change-Id: I19e78f309a33807dc5f1d1608a300d93c9996a2f
Signed-off-by: Amar Tumballi &lt;amar@kadalu.io&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>iobuf: use lists instead of iobufs in iobuf_arena struct (#2097)</title>
<updated>2021-02-16T11:56:14+00:00</updated>
<author>
<name>Yaniv Kaul</name>
<email>ykaul@redhat.com</email>
</author>
<published>2021-02-16T11:56:14+00:00</published>
<link rel='alternate' type='text/html' href='https://fedorapeople.org/cgit/anoopcs/public_git/glusterfs.git/commit/?id=1f8247c54c47bb24b862786c80e5ce865683b8ec'/>
<id>1f8247c54c47bb24b862786c80e5ce865683b8ec</id>
<content type='text'>
We only need the passive and active lists; there's no need for a full
iobuf variable.

Also ensured that passive_list comes before active_list, as it's always
accessed first.

Note: this almost brings the structure down to just 2 cache lines. We
can easily make other variables smaller (page_size could be 4 bytes) to
fit exactly 2 cache lines.
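
An illustrative layout (not the real iobuf_arena definition) showing
the member ordering described above:

```c
#include "stddef.h"

/* Illustrative layout only (not the real iobuf_arena definition):
 * the list heads lead the struct, with passive_list placed before
 * active_list because the passive list is touched first. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

struct iobuf_arena_sketch {
    struct list_head passive_list;  /* accessed first */
    struct list_head active_list;
    size_t page_size;               /* could shrink to 4 bytes */
    size_t page_count;
};
```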

Fixes: #2096
Signed-off-by: Yaniv Kaul &lt;ykaul@redhat.com&gt;</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We only need the passive and active lists; there's no need for a full
iobuf variable.

Also ensured that passive_list comes before active_list, as it's always
accessed first.

Note: this almost brings the structure down to just 2 cache lines. We
can easily make other variables smaller (page_size could be 4 bytes) to
fit exactly 2 cache lines.

Fixes: #2096
Signed-off-by: Yaniv Kaul &lt;ykaul@redhat.com&gt;</pre>
</div>
</content>
</entry>
<entry>
<title>posix: fix chmod error on symlinks (#2155)</title>
<updated>2021-02-11T15:11:19+00:00</updated>
<author>
<name>Xavi Hernandez</name>
<email>xhernandez@users.noreply.github.com</email>
</author>
<published>2021-02-11T15:11:19+00:00</published>
<link rel='alternate' type='text/html' href='https://fedorapeople.org/cgit/anoopcs/public_git/glusterfs.git/commit/?id=af5c2cb320068bd154caf54f063fa1787ae19995'/>
<id>af5c2cb320068bd154caf54f063fa1787ae19995</id>
<content type='text'>
Since glibc 2.32, lchmod() returns EOPNOTSUPP instead of ENOSYS when
called on symlinks. The man page says that the returned code is
ENOTSUP. ENOTSUP and EOPNOTSUPP are the same value on Linux, but this
patch correctly handles all of these errors.
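
A sketch of the intended error handling; lchmod_unsupported() is a
hypothetical helper, not the actual posix xlator code:

```c
#include "errno.h"

/* Hypothetical helper sketching the intended handling: treat ENOSYS,
 * EOPNOTSUPP and ENOTSUP alike as "lchmod() unsupported here". On
 * Linux, ENOTSUP and EOPNOTSUPP share a value, but checking both
 * keeps the check portable. */
static int lchmod_unsupported(int err)
{
    return err == ENOSYS || err == EOPNOTSUPP || err == ENOTSUP;
}
```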

Fixes: #2154
Change-Id: Ib3bb3d86d421cba3d7ec8d66b6beb131ef6e0925
Signed-off-by: Xavi Hernandez &lt;xhernandez@redhat.com&gt;</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Since glibc 2.32, lchmod() returns EOPNOTSUPP instead of ENOSYS when
called on symlinks. The man page says that the returned code is
ENOTSUP. ENOTSUP and EOPNOTSUPP are the same value on Linux, but this
patch correctly handles all of these errors.

Fixes: #2154
Change-Id: Ib3bb3d86d421cba3d7ec8d66b6beb131ef6e0925
Signed-off-by: Xavi Hernandez &lt;xhernandez@redhat.com&gt;</pre>
</div>
</content>
</entry>
<entry>
<title>stack.h/c: remove unused variable and reorder struct</title>
<updated>2021-02-08T12:51:47+00:00</updated>
<author>
<name>Yaniv Kaul</name>
<email>ykaul@redhat.com</email>
</author>
<published>2021-02-05T11:15:23+00:00</published>
<link rel='alternate' type='text/html' href='https://fedorapeople.org/cgit/anoopcs/public_git/glusterfs.git/commit/?id=b818cdde534743fccca5cd91b4a7298e2d0d601e'/>
<id>b818cdde534743fccca5cd91b4a7298e2d0d601e</id>
<content type='text'>
- Removed the unused ref_count variable
- Reordered the struct to bring related variables closer together
- Changed 'complete' from a '_Bool' to an 'int32_t'

Before:
```
struct _call_frame {
        call_stack_t *             root;                 /*     0     8 */
        call_frame_t *             parent;               /*     8     8 */
        struct list_head           frames;               /*    16    16 */
        void *                     local;                /*    32     8 */
        xlator_t *                 this;                 /*    40     8 */
        ret_fn_t                   ret;                  /*    48     8 */
        int32_t                    ref_count;            /*    56     4 */

        /* XXX 4 bytes hole, try to pack */

        /* --- cacheline 1 boundary (64 bytes) --- */
        gf_lock_t                  lock;                 /*    64    40 */
        void *                     cookie;               /*   104     8 */
        _Bool                      complete;             /*   112     1 */

        /* XXX 3 bytes hole, try to pack */

        glusterfs_fop_t            op;                   /*   116     4 */
        struct timespec            begin;                /*   120    16 */
        /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
        struct timespec            end;                  /*   136    16 */
        const char  *              wind_from;            /*   152     8 */
        const char  *              wind_to;              /*   160     8 */
        const char  *              unwind_from;          /*   168     8 */
        const char  *              unwind_to;            /*   176     8 */

        /* size: 184, cachelines: 3, members: 17 */
        /* sum members: 177, holes: 2, sum holes: 7 */
        /* last cacheline: 56 bytes */
```

After:
```
struct _call_frame {
        call_stack_t *             root;                 /*     0     8 */
        call_frame_t *             parent;               /*     8     8 */
        struct list_head           frames;               /*    16    16 */
        struct timespec            begin;                /*    32    16 */
        struct timespec            end;                  /*    48    16 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        void *                     local;                /*    64     8 */
        gf_lock_t                  lock;                 /*    72    40 */
        void *                     cookie;               /*   112     8 */
        xlator_t *                 this;                 /*   120     8 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        ret_fn_t                   ret;                  /*   128     8 */
        glusterfs_fop_t            op;                   /*   136     4 */
        int32_t                    complete;             /*   140     4 */
        const char  *              wind_from;            /*   144     8 */
        const char  *              wind_to;              /*   152     8 */
        const char  *              unwind_from;          /*   160     8 */
        const char  *              unwind_to;            /*   168     8 */

        /* size: 176, cachelines: 3, members: 16 */
        /* last cacheline: 48 bytes */
```

Fixes: #2130
Signed-off-by: Yaniv Kaul &lt;ykaul@redhat.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
- Removed the unused ref_count variable
- Reordered the struct to bring related variables closer together
- Changed 'complete' from a '_Bool' to an 'int32_t'

Before:
```
struct _call_frame {
        call_stack_t *             root;                 /*     0     8 */
        call_frame_t *             parent;               /*     8     8 */
        struct list_head           frames;               /*    16    16 */
        void *                     local;                /*    32     8 */
        xlator_t *                 this;                 /*    40     8 */
        ret_fn_t                   ret;                  /*    48     8 */
        int32_t                    ref_count;            /*    56     4 */

        /* XXX 4 bytes hole, try to pack */

        /* --- cacheline 1 boundary (64 bytes) --- */
        gf_lock_t                  lock;                 /*    64    40 */
        void *                     cookie;               /*   104     8 */
        _Bool                      complete;             /*   112     1 */

        /* XXX 3 bytes hole, try to pack */

        glusterfs_fop_t            op;                   /*   116     4 */
        struct timespec            begin;                /*   120    16 */
        /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
        struct timespec            end;                  /*   136    16 */
        const char  *              wind_from;            /*   152     8 */
        const char  *              wind_to;              /*   160     8 */
        const char  *              unwind_from;          /*   168     8 */
        const char  *              unwind_to;            /*   176     8 */

        /* size: 184, cachelines: 3, members: 17 */
        /* sum members: 177, holes: 2, sum holes: 7 */
        /* last cacheline: 56 bytes */
```

After:
```
struct _call_frame {
        call_stack_t *             root;                 /*     0     8 */
        call_frame_t *             parent;               /*     8     8 */
        struct list_head           frames;               /*    16    16 */
        struct timespec            begin;                /*    32    16 */
        struct timespec            end;                  /*    48    16 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        void *                     local;                /*    64     8 */
        gf_lock_t                  lock;                 /*    72    40 */
        void *                     cookie;               /*   112     8 */
        xlator_t *                 this;                 /*   120     8 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        ret_fn_t                   ret;                  /*   128     8 */
        glusterfs_fop_t            op;                   /*   136     4 */
        int32_t                    complete;             /*   140     4 */
        const char  *              wind_from;            /*   144     8 */
        const char  *              wind_to;              /*   152     8 */
        const char  *              unwind_from;          /*   160     8 */
        const char  *              unwind_to;            /*   168     8 */

        /* size: 176, cachelines: 3, members: 16 */
        /* last cacheline: 48 bytes */
```

Fixes: #2130
Signed-off-by: Yaniv Kaul &lt;ykaul@redhat.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>introduce microsleep to improve sleep precision  (#2104)</title>
<updated>2021-02-06T00:50:14+00:00</updated>
<author>
<name>renlei4</name>
<email>70688659+renlei4@users.noreply.github.com</email>
</author>
<published>2021-02-06T00:50:14+00:00</published>
<link rel='alternate' type='text/html' href='https://fedorapeople.org/cgit/anoopcs/public_git/glusterfs.git/commit/?id=39f75094e604798a3c938e7801510d835f41b4e9'/>
<id>39f75094e604798a3c938e7801510d835f41b4e9</id>
<content type='text'>
* syncop: introduce microsecond sleep support

Introduce the microsecond sleep function synctask_usleep, which can be
used instead of synctask_sleep where more precision is needed.

Change-Id: Ie7a15dda4afc09828bfbee13cb8683713d7902de

* glusterd: use synctask_usleep in glusterd_proc_stop()

glusterd_proc_stop() sleeps 1s waiting for the process to stop before
force-killing it, but in most cases the process stops within 100ms.
This patch uses synctask_usleep to check the process's running state
every 100ms instead of sleeping 1s, which can cut up to 1s off the
stop time. In some cases, such as enabling quota on 100 volumes, the
average execution time dropped from 2500ms to 500ms.
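
A rough sketch of the 100ms polling loop described above; proc_running
is a hypothetical callback, and plain usleep() stands in for
synctask_usleep():

```c
#include "stdbool.h"
#include "unistd.h"

/* Sketch of the polling idea: instead of one sleep(1) before the
 * force kill, poll the process state every 100ms so a fast exit is
 * noticed early. proc_running is a hypothetical callback and plain
 * usleep() stands in for synctask_usleep(). */
static bool wait_for_stop(bool (*proc_running)(void), int max_wait_ms)
{
    int waited = 0;
    while (max_wait_ms > waited) {
        if (!proc_running())
            return true;        /* stopped early: skip the force kill */
        usleep(100 * 1000);     /* 100ms, not a full second */
        waited += 100;
    }
    return !proc_running();
}

/* Demo process state: "running" for the first two polls only. */
static int polls = 0;
static bool demo_proc_running(void)
{
    polls += 1;
    return 3 > polls;
}
```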

fixes: #2116
Change-Id: I645e083076c205aa23b219abd0de652f7d95dca7</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
* syncop: introduce microsecond sleep support

Introduce the microsecond sleep function synctask_usleep, which can be
used instead of synctask_sleep where more precision is needed.

Change-Id: Ie7a15dda4afc09828bfbee13cb8683713d7902de

* glusterd: use synctask_usleep in glusterd_proc_stop()

glusterd_proc_stop() sleeps 1s waiting for the process to stop before
force-killing it, but in most cases the process stops within 100ms.
This patch uses synctask_usleep to check the process's running state
every 100ms instead of sleeping 1s, which can cut up to 1s off the
stop time. In some cases, such as enabling quota on 100 volumes, the
average execution time dropped from 2500ms to 500ms.

fixes: #2116
Change-Id: I645e083076c205aa23b219abd0de652f7d95dca7</pre>
</div>
</content>
</entry>
<entry>
<title>features/shard: delay unlink of a file that has fd_count &gt; 0 (#1563)</title>
<updated>2021-02-03T11:34:25+00:00</updated>
<author>
<name>Vinayak hariharmath</name>
<email>65405035+VHariharmath-rh@users.noreply.github.com</email>
</author>
<published>2021-02-03T11:34:25+00:00</published>
<link rel='alternate' type='text/html' href='https://fedorapeople.org/cgit/anoopcs/public_git/glusterfs.git/commit/?id=8bebb100868fa018007c120474280aed0717312d'/>
<id>8bebb100868fa018007c120474280aed0717312d</id>
<content type='text'>
* features/shard: delay unlink of a file that has fd_count &gt; 0

When there are multiple processes working on a file and one of them
unlinks that file, the unlink operation shouldn't harm the other
processes working on it. This is POSIX-compliant behavior, and it
should be supported also when the shard feature is enabled.

Problem description:
Consider 2 clients, C1 and C2, working on a file F1 with 5 shards on a
gluster mount, where the gluster server has 4 bricks B1, B2, B3 and B4.

Assume that the base file/shard is present on B1, the 1st and 2nd
shards on B2, the 3rd and 4th shards on B3, and the 5th shard on B4.
C1 has opened F1 in append mode and is writing to it. The write FOP
goes to the 5th shard in this case, so inode-&gt;fd_count = 1 on B1
(base file) and B4 (5th shard).

C2 at the same time issues an unlink of F1. On the server, the base
file has fd_count = 1 (since C1 has the file open), so the base file
is renamed under .glusterfs/unlink and the call returns to C2. Then
the unlink is sent to the shards on all bricks, and the shards on B2
and B3, which have no open references yet, are deleted. C1 starts
getting errors while accessing the remaining shards even though it
has open references to the file.

This is one such undefined behavior. Likewise we would encounter many
such undefined behaviors, as we don't have one global lock to access
all shards as one. Of course, such a global lock would hurt
performance, as it reduces the window for parallel access to shards.

Solution:
The above undefined behavior can be addressed by delaying the unlink
of a file while there are open references on it.
File unlink happens in 2 steps:
step 1: the client creates a marker file under .shard/remove_me and
sends the unlink on the base file to the server
step 2: on return from the server, the associated shards are cleaned
up and finally the marker file is removed.

In step 2, the background deletion process does a nameless lookup
using the marker file name (the marker file is named after the gfid
of the base file) in the .glusterfs/unlink dir. If the nameless
lookup succeeds, the gfid still has open fds and deletion of the
shards has to be delayed. If the nameless lookup fails, the gfid has
been unlinked and there are no open fds on the file (the gfid path is
unlinked during the final close of the file). Shards whose deletion
was delayed are unlinked once all open fds are closed, via a thread
that wakes up every 10 minutes.

Also removed active_fd_count from the inode structure; fd_count is
now used wherever active_fd_count was used.
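
The step 2 decision can be modeled roughly as below; the struct and
helper are hypothetical, since the real shard translator performs an
actual nameless lookup on the brick:

```c
#include "stdbool.h"

/* Pure-logic model of the step 2 decision. The struct and helper are
 * hypothetical: the real shard xlator does an actual nameless lookup
 * on the brick rather than checking in-memory state. */
struct base_file_state {
    int  fd_count;     /* open fds across all clients */
    bool gfid_linked;  /* .glusterfs/unlink/(gfid) still present? */
};

/* Shards may be cleaned up only after the final close has unlinked
 * the gfid path; until then the reaper thread keeps deferring. */
static bool can_delete_shards(const struct base_file_state *f)
{
    if (f->gfid_linked)
        return false;  /* nameless lookup would succeed: open fds */
    return f->fd_count == 0;
}
```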

fixes: #1358
Change-Id: I8985093386e26215e0b0dce294c534a66f6ca11c
Signed-off-by: Vinayakswami Hariharmath &lt;vharihar@redhat.com&gt;

* features/shard: delay unlink of a file that has fd_count &gt; 0

When there are multiple processes working on a file and one of them
unlinks that file, the unlink operation shouldn't harm the other
processes working on it. This is POSIX-compliant behavior, and it
should be supported also when the shard feature is enabled.

Problem description:
Consider 2 clients, C1 and C2, working on a file F1 with 5 shards on a
gluster mount, where the gluster server has 4 bricks B1, B2, B3 and B4.

Assume that the base file/shard is present on B1, the 1st and 2nd
shards on B2, the 3rd and 4th shards on B3, and the 5th shard on B4.
C1 has opened F1 in append mode and is writing to it. The write FOP
goes to the 5th shard in this case, so inode-&gt;fd_count = 1 on B1
(base file) and B4 (5th shard).

C2 at the same time issues an unlink of F1. On the server, the base
file has fd_count = 1 (since C1 has the file open), so the base file
is renamed under .glusterfs/unlink and the call returns to C2. Then
the unlink is sent to the shards on all bricks, and the shards on B2
and B3, which have no open references yet, are deleted. C1 starts
getting errors while accessing the remaining shards even though it
has open references to the file.

This is one such undefined behavior. Likewise we would encounter many
such undefined behaviors, as we don't have one global lock to access
all shards as one. Of course, such a global lock would hurt
performance, as it reduces the window for parallel access to shards.

Solution:
The above undefined behavior can be addressed by delaying the unlink
of a file while there are open references on it.
File unlink happens in 2 steps:
step 1: the client creates a marker file under .shard/remove_me and
sends the unlink on the base file to the server
step 2: on return from the server, the associated shards are cleaned
up and finally the marker file is removed.

In step 2, the background deletion process does a nameless lookup
using the marker file name (the marker file is named after the gfid
of the base file) in the .glusterfs/unlink dir. If the nameless
lookup succeeds, the gfid still has open fds and deletion of the
shards has to be delayed. If the nameless lookup fails, the gfid has
been unlinked and there are no open fds on the file (the gfid path is
unlinked during the final close of the file). Shards whose deletion
was delayed are unlinked once all open fds are closed, via a thread
that wakes up every 10 minutes.

Also removed active_fd_count from the inode structure; fd_count is
now used wherever active_fd_count was used.

fixes: #1358

Change-Id: Iec16d7ff5e05f29255491a43fbb6270c72868999
Signed-off-by: Vinayakswami Hariharmath &lt;vharihar@redhat.com&gt;

* features/shard: delay unlink of a file that has fd_count &gt; 0

When there are multiple processes working on a file and one of them
unlinks that file, the unlink operation shouldn't harm the other
processes working on it. This is POSIX-compliant behavior, and it
should be supported also when the shard feature is enabled.

Problem description:
Consider 2 clients, C1 and C2, working on a file F1 with 5 shards on a
gluster mount, where the gluster server has 4 bricks B1, B2, B3 and B4.

Assume that the base file/shard is present on B1, the 1st and 2nd
shards on B2, the 3rd and 4th shards on B3, and the 5th shard on B4.
C1 has opened F1 in append mode and is writing to it. The write FOP
goes to the 5th shard in this case, so inode-&gt;fd_count = 1 on B1
(base file) and B4 (5th shard).

C2 at the same time issues an unlink of F1. On the server, the base
file has fd_count = 1 (since C1 has the file open), so the base file
is renamed under .glusterfs/unlink and the call returns to C2. Then
the unlink is sent to the shards on all bricks, and the shards on B2
and B3, which have no open references yet, are deleted. C1 starts
getting errors while accessing the remaining shards even though it
has open references to the file.

This is one such undefined behavior. Likewise we would encounter many
such undefined behaviors, as we don't have one global lock to access
all shards as one. Of course, such a global lock would hurt
performance, as it reduces the window for parallel access to shards.

Solution:
The above undefined behavior can be addressed by delaying the unlink
of a file while there are open references on it.
File unlink happens in 2 steps:
step 1: the client creates a marker file under .shard/remove_me and
sends the unlink on the base file to the server
step 2: on return from the server, the associated shards are cleaned
up and finally the marker file is removed.

In step 2, the background deletion process does a nameless lookup
using the marker file name (the marker file is named after the gfid
of the base file) in the .glusterfs/unlink dir. If the nameless
lookup succeeds, the gfid still has open fds and deletion of the
shards has to be delayed. If the nameless lookup fails, the gfid has
been unlinked and there are no open fds on the file (the gfid path is
unlinked during the final close of the file). Shards whose deletion
was delayed are unlinked once all open fds are closed, via a thread
that wakes up every 10 minutes.

Also removed active_fd_count from the inode structure; fd_count is
now used wherever active_fd_count was used.

fixes: #1358

Change-Id: I07e5a5bf9d33c24b63da72d4f3f59392c5421652
Signed-off-by: Vinayakswami Hariharmath &lt;vharihar@redhat.com&gt;

* features/shard: delay unlink of a file that has fd_count &gt; 0

When there are multiple processes working on a file and one of them
unlinks that file, the unlink operation shouldn't harm the other
processes working on it. This is POSIX-compliant behavior, and it
should be supported also when the shard feature is enabled.

Problem description:
Consider 2 clients, C1 and C2, working on a file F1 with 5 shards on a
gluster mount, where the gluster server has 4 bricks B1, B2, B3 and B4.

Assume that the base file/shard is present on B1, the 1st and 2nd
shards on B2, the 3rd and 4th shards on B3, and the 5th shard on B4.
C1 has opened F1 in append mode and is writing to it. The write FOP
goes to the 5th shard in this case, so inode-&gt;fd_count = 1 on B1
(base file) and B4 (5th shard).

C2 at the same time issues an unlink of F1. On the server, the base
file has fd_count = 1 (since C1 has the file open), so the base file
is renamed under .glusterfs/unlink and the call returns to C2. Then
the unlink is sent to the shards on all bricks, and the shards on B2
and B3, which have no open references yet, are deleted. C1 starts
getting errors while accessing the remaining shards even though it
has open references to the file.

This is one such undefined behavior. Likewise we would encounter many
such undefined behaviors, as we don't have one global lock to access
all shards as one. Of course, such a global lock would hurt
performance, as it reduces the window for parallel access to shards.

Solution:
The above undefined behavior can be addressed by delaying the unlink
of a file while there are open references on it.
File unlink happens in 2 steps:
step 1: the client creates a marker file under .shard/remove_me and
sends the unlink on the base file to the server
step 2: on return from the server, the associated shards are cleaned
up and finally the marker file is removed.

In step 2, the background deletion process does a nameless lookup
using the marker file name (the marker file is named after the gfid
of the base file) in the .glusterfs/unlink dir. If the nameless
lookup succeeds, the gfid still has open fds and deletion of the
shards has to be delayed. If the nameless lookup fails, the gfid has
been unlinked and there are no open fds on the file (the gfid path is
unlinked during the final close of the file). Shards whose deletion
was delayed are unlinked once all open fds are closed, via a thread
that wakes up every 10 minutes.

Also removed active_fd_count from the inode structure; fd_count is
now used wherever active_fd_count was used.

fixes: #1358

Change-Id: I3679de8545f2e5b8027c4d5a6fd0592092e8dfbd
Signed-off-by: Vinayakswami Hariharmath &lt;vharihar@redhat.com&gt;

* Update xlators/storage/posix/src/posix-entry-ops.c

Co-authored-by: Xavi Hernandez &lt;xhernandez@users.noreply.github.com&gt;

* features/shard: delay unlink of a file that has fd_count &gt; 0

When there are multiple processes working on a file and any
process unlinks that file, the unlink operation shouldn't harm
the other processes working on it. This is POSIX-compliant
behavior and it should also be supported when the shard feature
is enabled.

Problem description:
Let's consider 2 clients, C1 and C2, working on a file F1 with 5
shards on a gluster mount, where the gluster server has 4 bricks
B1, B2, B3, B4.

Assume that the base file/shard is present on B1, the 1st and 2nd
shards on B2, the 3rd and 4th shards on B3, and the 5th shard on
B4. C1 has opened F1 in append mode and is writing to it. The
write FOP goes to the 5th shard in this case, so
inode-&gt;fd_count = 1 on B1 (base file) and B4 (5th shard).

C2 at the same time issues an unlink on F1. On the server, the
base file has fd_count = 1 (since C1 has opened the file), so
the base file is renamed under .glusterfs/unlink and the call
returns to C2. Then the unlink is sent to the shards on all
bricks, and the shards on B2 and B3, which have no open
references yet, are deleted. C1 starts getting errors while
accessing the remaining shards even though it has open
references to the file.

This is one such undefined behavior. Likewise we will encounter
many such undefined behaviors as we don't have one global lock
to access all shards as one. Of course, having such a global
lock would lead to a performance hit, as it reduces the window
for parallel access to the shards.

Solution:
The above undefined behavior can be addressed by delaying the
unlink of a file when there are open references on it.
File unlink happens in 2 steps.
Step 1: the client creates a marker file under .shard/remove_me
and sends an unlink on the base file to the server.
Step 2: on return from the server, the associated shards are
cleaned up and finally the marker file is removed.

In step 2, the background deletion process does a nameless
lookup using the marker file name (the marker file is named
after the gfid of the base file) in the .glusterfs/unlink dir.
If the nameless lookup is successful, the gfid still has open
fds and the deletion of shards has to be delayed. If the
nameless lookup fails, the gfid is unlinked and there are no
open fds on that file (the gfid path is unlinked during the
final close on the file). The shards whose deletion is delayed
are unlinked once all the open fds are closed, and this is
done through a thread which wakes up every 10 mins.

Also removed active_fd_count from the inode structure; fd_count
is now used wherever active_fd_count was used.
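
The two-step scheme above can be sketched as follows. This is a
minimal in-memory Python model, not the real shard xlator C code;
all class, function, and variable names here are illustrative:

```python
# Minimal model of the delayed-unlink scheme described above.
# All names here are illustrative, not the real shard xlator API.

class Server:
    def __init__(self):
        self.open_fds = {}       # gfid: count of open fds
        self.unlink_dir = set()  # gfids parked under .glusterfs/unlink

    def unlink_base(self, gfid):
        # Step 1, server side: if the base file is still open,
        # park it under .glusterfs/unlink instead of deleting it.
        if self.open_fds.get(gfid, 0):
            self.unlink_dir.add(gfid)

    def close_fd(self, gfid):
        self.open_fds[gfid] -= 1
        if self.open_fds[gfid] == 0:
            # the final close removes the parked gfid path
            self.unlink_dir.discard(gfid)

    def nameless_lookup(self, gfid):
        # succeeds only while the parked gfid path still exists
        return gfid in self.unlink_dir

def reaper_pass(server, markers, shards):
    # Step 2: the background thread (woken every 10 mins) deletes
    # shards only for gfids whose nameless lookup fails.
    for gfid in list(markers):
        if not server.nameless_lookup(gfid):
            shards.pop(gfid, None)   # clean up the shards
            markers.discard(gfid)    # finally remove the marker
```

Here unlink_base parks the gfid while fds remain open, so a reaper
pass leaves its shards alone; once the last fd is closed, the next
pass deletes the shards and then the marker.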

fixes: #1358
Change-Id: I8985093386e26215e0b0dce294c534a66f6ca11c
Signed-off-by: Vinayakswami Hariharmath &lt;vharihar@redhat.com&gt;

* Update fd.c

Co-authored-by: Xavi Hernandez &lt;xhernandez@users.noreply.github.com&gt;</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
* features/shard: delay unlink of a file that has fd_count &gt; 0

When there are multiple processes working on a file and any
process unlinks that file, the unlink operation shouldn't harm
the other processes working on it. This is POSIX-compliant
behavior and it should also be supported when the shard feature
is enabled.

Problem description:
Let's consider 2 clients, C1 and C2, working on a file F1 with 5
shards on a gluster mount, where the gluster server has 4 bricks
B1, B2, B3, B4.

Assume that the base file/shard is present on B1, the 1st and 2nd
shards on B2, the 3rd and 4th shards on B3, and the 5th shard on
B4. C1 has opened F1 in append mode and is writing to it. The
write FOP goes to the 5th shard in this case, so
inode-&gt;fd_count = 1 on B1 (base file) and B4 (5th shard).

C2 at the same time issues an unlink on F1. On the server, the
base file has fd_count = 1 (since C1 has opened the file), so
the base file is renamed under .glusterfs/unlink and the call
returns to C2. Then the unlink is sent to the shards on all
bricks, and the shards on B2 and B3, which have no open
references yet, are deleted. C1 starts getting errors while
accessing the remaining shards even though it has open
references to the file.

This is one such undefined behavior. Likewise we will encounter
many such undefined behaviors as we don't have one global lock
to access all shards as one. Of course, having such a global
lock would lead to a performance hit, as it reduces the window
for parallel access to the shards.

Solution:
The above undefined behavior can be addressed by delaying the
unlink of a file when there are open references on it.
File unlink happens in 2 steps.
Step 1: the client creates a marker file under .shard/remove_me
and sends an unlink on the base file to the server.
Step 2: on return from the server, the associated shards are
cleaned up and finally the marker file is removed.

In step 2, the background deletion process does a nameless
lookup using the marker file name (the marker file is named
after the gfid of the base file) in the .glusterfs/unlink dir.
If the nameless lookup is successful, the gfid still has open
fds and the deletion of shards has to be delayed. If the
nameless lookup fails, the gfid is unlinked and there are no
open fds on that file (the gfid path is unlinked during the
final close on the file). The shards whose deletion is delayed
are unlinked once all the open fds are closed, and this is
done through a thread which wakes up every 10 mins.

Also removed active_fd_count from the inode structure; fd_count
is now used wherever active_fd_count was used.

fixes: #1358
Change-Id: I8985093386e26215e0b0dce294c534a66f6ca11c
Signed-off-by: Vinayakswami Hariharmath &lt;vharihar@redhat.com&gt;

* features/shard: delay unlink of a file that has fd_count &gt; 0

When there are multiple processes working on a file and any
process unlinks that file, the unlink operation shouldn't harm
the other processes working on it. This is POSIX-compliant
behavior and it should also be supported when the shard feature
is enabled.

Problem description:
Let's consider 2 clients, C1 and C2, working on a file F1 with 5
shards on a gluster mount, where the gluster server has 4 bricks
B1, B2, B3, B4.

Assume that the base file/shard is present on B1, the 1st and 2nd
shards on B2, the 3rd and 4th shards on B3, and the 5th shard on
B4. C1 has opened F1 in append mode and is writing to it. The
write FOP goes to the 5th shard in this case, so
inode-&gt;fd_count = 1 on B1 (base file) and B4 (5th shard).

C2 at the same time issues an unlink on F1. On the server, the
base file has fd_count = 1 (since C1 has opened the file), so
the base file is renamed under .glusterfs/unlink and the call
returns to C2. Then the unlink is sent to the shards on all
bricks, and the shards on B2 and B3, which have no open
references yet, are deleted. C1 starts getting errors while
accessing the remaining shards even though it has open
references to the file.

This is one such undefined behavior. Likewise we will encounter
many such undefined behaviors as we don't have one global lock
to access all shards as one. Of course, having such a global
lock would lead to a performance hit, as it reduces the window
for parallel access to the shards.

Solution:
The above undefined behavior can be addressed by delaying the
unlink of a file when there are open references on it.
File unlink happens in 2 steps.
Step 1: the client creates a marker file under .shard/remove_me
and sends an unlink on the base file to the server.
Step 2: on return from the server, the associated shards are
cleaned up and finally the marker file is removed.

In step 2, the background deletion process does a nameless
lookup using the marker file name (the marker file is named
after the gfid of the base file) in the .glusterfs/unlink dir.
If the nameless lookup is successful, the gfid still has open
fds and the deletion of shards has to be delayed. If the
nameless lookup fails, the gfid is unlinked and there are no
open fds on that file (the gfid path is unlinked during the
final close on the file). The shards whose deletion is delayed
are unlinked once all the open fds are closed, and this is
done through a thread which wakes up every 10 mins.

Also removed active_fd_count from the inode structure; fd_count
is now used wherever active_fd_count was used.

fixes: #1358

Change-Id: Iec16d7ff5e05f29255491a43fbb6270c72868999
Signed-off-by: Vinayakswami Hariharmath &lt;vharihar@redhat.com&gt;

* features/shard: delay unlink of a file that has fd_count &gt; 0

When there are multiple processes working on a file and any
process unlinks that file, the unlink operation shouldn't harm
the other processes working on it. This is POSIX-compliant
behavior and it should also be supported when the shard feature
is enabled.

Problem description:
Let's consider 2 clients, C1 and C2, working on a file F1 with 5
shards on a gluster mount, where the gluster server has 4 bricks
B1, B2, B3, B4.

Assume that the base file/shard is present on B1, the 1st and 2nd
shards on B2, the 3rd and 4th shards on B3, and the 5th shard on
B4. C1 has opened F1 in append mode and is writing to it. The
write FOP goes to the 5th shard in this case, so
inode-&gt;fd_count = 1 on B1 (base file) and B4 (5th shard).

C2 at the same time issues an unlink on F1. On the server, the
base file has fd_count = 1 (since C1 has opened the file), so
the base file is renamed under .glusterfs/unlink and the call
returns to C2. Then the unlink is sent to the shards on all
bricks, and the shards on B2 and B3, which have no open
references yet, are deleted. C1 starts getting errors while
accessing the remaining shards even though it has open
references to the file.

This is one such undefined behavior. Likewise we will encounter
many such undefined behaviors as we don't have one global lock
to access all shards as one. Of course, having such a global
lock would lead to a performance hit, as it reduces the window
for parallel access to the shards.

Solution:
The above undefined behavior can be addressed by delaying the
unlink of a file when there are open references on it.
File unlink happens in 2 steps.
Step 1: the client creates a marker file under .shard/remove_me
and sends an unlink on the base file to the server.
Step 2: on return from the server, the associated shards are
cleaned up and finally the marker file is removed.

In step 2, the background deletion process does a nameless
lookup using the marker file name (the marker file is named
after the gfid of the base file) in the .glusterfs/unlink dir.
If the nameless lookup is successful, the gfid still has open
fds and the deletion of shards has to be delayed. If the
nameless lookup fails, the gfid is unlinked and there are no
open fds on that file (the gfid path is unlinked during the
final close on the file). The shards whose deletion is delayed
are unlinked once all the open fds are closed, and this is
done through a thread which wakes up every 10 mins.

Also removed active_fd_count from the inode structure; fd_count
is now used wherever active_fd_count was used.

fixes: #1358

Change-Id: I07e5a5bf9d33c24b63da72d4f3f59392c5421652
Signed-off-by: Vinayakswami Hariharmath &lt;vharihar@redhat.com&gt;

* features/shard: delay unlink of a file that has fd_count &gt; 0

When there are multiple processes working on a file and any
process unlinks that file, the unlink operation shouldn't harm
the other processes working on it. This is POSIX-compliant
behavior and it should also be supported when the shard feature
is enabled.

Problem description:
Let's consider 2 clients, C1 and C2, working on a file F1 with 5
shards on a gluster mount, where the gluster server has 4 bricks
B1, B2, B3, B4.

Assume that the base file/shard is present on B1, the 1st and 2nd
shards on B2, the 3rd and 4th shards on B3, and the 5th shard on
B4. C1 has opened F1 in append mode and is writing to it. The
write FOP goes to the 5th shard in this case, so
inode-&gt;fd_count = 1 on B1 (base file) and B4 (5th shard).

C2 at the same time issues an unlink on F1. On the server, the
base file has fd_count = 1 (since C1 has opened the file), so
the base file is renamed under .glusterfs/unlink and the call
returns to C2. Then the unlink is sent to the shards on all
bricks, and the shards on B2 and B3, which have no open
references yet, are deleted. C1 starts getting errors while
accessing the remaining shards even though it has open
references to the file.

This is one such undefined behavior. Likewise we will encounter
many such undefined behaviors as we don't have one global lock
to access all shards as one. Of course, having such a global
lock would lead to a performance hit, as it reduces the window
for parallel access to the shards.

Solution:
The above undefined behavior can be addressed by delaying the
unlink of a file when there are open references on it.
File unlink happens in 2 steps.
Step 1: the client creates a marker file under .shard/remove_me
and sends an unlink on the base file to the server.
Step 2: on return from the server, the associated shards are
cleaned up and finally the marker file is removed.

In step 2, the background deletion process does a nameless
lookup using the marker file name (the marker file is named
after the gfid of the base file) in the .glusterfs/unlink dir.
If the nameless lookup is successful, the gfid still has open
fds and the deletion of shards has to be delayed. If the
nameless lookup fails, the gfid is unlinked and there are no
open fds on that file (the gfid path is unlinked during the
final close on the file). The shards whose deletion is delayed
are unlinked once all the open fds are closed, and this is
done through a thread which wakes up every 10 mins.

Also removed active_fd_count from the inode structure; fd_count
is now used wherever active_fd_count was used.

fixes: #1358

Change-Id: I3679de8545f2e5b8027c4d5a6fd0592092e8dfbd
Signed-off-by: Vinayakswami Hariharmath &lt;vharihar@redhat.com&gt;

* Update xlators/storage/posix/src/posix-entry-ops.c

Co-authored-by: Xavi Hernandez &lt;xhernandez@users.noreply.github.com&gt;

* Update fd.c

Co-authored-by: Xavi Hernandez &lt;xhernandez@users.noreply.github.com&gt;</pre>
</div>
</content>
</entry>
<entry>
<title>AFR - fixing coverity issue (Argument cannot be negative) (#2026)</title>
<updated>2021-01-22T12:19:31+00:00</updated>
<author>
<name>Barak Sason Rofman</name>
<email>bsasonro@redhat.com</email>
</author>
<published>2021-01-22T12:19:31+00:00</published>
<link rel='alternate' type='text/html' href='https://fedorapeople.org/cgit/anoopcs/public_git/glusterfs.git/commit/?id=de19b9a00ea35cf01076c9eb48067ad8a2e1c4aa'/>
<id>de19b9a00ea35cf01076c9eb48067ad8a2e1c4aa</id>
<content type='text'>
CID 1430124

A negative value was being passed to a parameter that cannot be negative.
Modified the value that is being passed.

Change-Id: I06dca105f7a78ae16145b0876910851fb631e366
Signed-off-by: Barak Sason Rofman &lt;bsasonro@redhat.com&gt;</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
CID 1430124

A negative value was being passed to a parameter that cannot be negative.
Modified the value that is being passed.

Change-Id: I06dca105f7a78ae16145b0876910851fb631e366
Signed-off-by: Barak Sason Rofman &lt;bsasonro@redhat.com&gt;</pre>
</div>
</content>
</entry>
<entry>
<title>locks: remove unused conditional switch to spin_lock code (#2007)</title>
<updated>2021-01-19T10:09:35+00:00</updated>
<author>
<name>Vinayak hariharmath</name>
<email>65405035+VHariharmath-rh@users.noreply.github.com</email>
</author>
<published>2021-01-19T10:09:35+00:00</published>
<link rel='alternate' type='text/html' href='https://fedorapeople.org/cgit/anoopcs/public_git/glusterfs.git/commit/?id=5b690a85baf7f6949174254127ef84fc3544902e'/>
<id>5b690a85baf7f6949174254127ef84fc3544902e</id>
<content type='text'>
* locks: remove unused conditional switch to spin_lock code

The use of spin_locks depends on the variable use_spinlocks,
but that variable has been commented out in the code base since
https://review.gluster.org/#/c/glusterfs/+/14763/, so there is
no point in conditionally switching between spin_lock and
mutex. Remove the dead code as part of this patch.
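
Schematically, the removed pattern was a runtime switch that could
only ever take the mutex branch. A Python stand-in for the C logic
(names and structure are illustrative, not the GlusterFS code):

```python
import threading

# Stand-in for the C-level switch that was removed. The variable
# use_spinlocks had been commented out upstream, so the condition
# below was effectively constant and the spinlock branch was dead.
USE_SPINLOCKS = False

def make_lock():
    if USE_SPINLOCKS:
        # unreachable: the spinlock path can never be selected
        raise NotImplementedError("spinlock path was never taken")
    return threading.Lock()

lock = make_lock()
with lock:
    pass  # the critical section is always guarded by the mutex
```

Since the first branch is unreachable, the switch can be deleted
and the mutex used directly, which is exactly what the patch does.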

Fixes: #1996
Change-Id: Ib005dd86969ce33d3409164ef3e1011bb3169129
Signed-off-by: Vinayakswami Hariharmath &lt;vharihar@redhat.com&gt;</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
* locks: remove unused conditional switch to spin_lock code

The use of spin_locks depends on the variable use_spinlocks,
but that variable has been commented out in the code base since
https://review.gluster.org/#/c/glusterfs/+/14763/, so there is
no point in conditionally switching between spin_lock and
mutex. Remove the dead code as part of this patch.

Fixes: #1996
Change-Id: Ib005dd86969ce33d3409164ef3e1011bb3169129
Signed-off-by: Vinayakswami Hariharmath &lt;vharihar@redhat.com&gt;</pre>
</div>
</content>
</entry>
</feed>
