| |
Problem-1:
When an overlapping lock is issued, the merged lock is not assigned an
owner. When flush is issued on the fd, this particular lock is not freed,
leading to a memory leak.
Fix-1:
Assign the owner while merging the locks.
Problem-2:
On fd-destroy, lock structs could still be present in fdctx. For some
reason, with the flock -x command and closing of the bash fd, this code
path is reached, which leaks the lock structs.
Fix-2:
When fdctx is being destroyed in the client, make sure to clean up any
lock structs.
fixes: #2337
Change-Id: I298124213ce5a1cf2b1f1756d5e8a9745d9c0a1c
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
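A minimal, standalone sketch of the Fix-1 idea (plain C with made-up struct and field names, not the actual locks-translator code): when two overlapping byte-range locks are merged, the merged lock must carry the owner so it can later be matched and freed on flush.

    #include <string.h>

    /* Hypothetical, simplified byte-range lock record for illustration. */
    struct br_lock {
        unsigned long start;
        unsigned long end;
        char owner[16]; /* identity of the lock owner */
    };

    /* Merge an overlapping new lock into an existing one. Without copying
     * the owner, the merged lock can never be matched on flush and leaks. */
    static void
    merge_locks(struct br_lock *existing, const struct br_lock *new_lk)
    {
        if (new_lk->start < existing->start)
            existing->start = new_lk->start;
        if (new_lk->end > existing->end)
            existing->end = new_lk->end;
        memcpy(existing->owner, new_lk->owner, sizeof(existing->owner));
    }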
| |
The current implementation of rebalance for sparse files has a bug that,
in some cases, causes a read of 0 bytes from the source subvolume.
The posix xlator doesn't allow 0-byte reads and fails them with EINVAL,
which causes rebalance to abort the migration.
This patch implements a more robust way of finding data segments in
a sparse file that avoids 0-byte reads, allowing the file to be
migrated successfully.
Fixes: #2317
Change-Id: Iff168dda2fb0f2edf716b21eb04cc2cc8ac3915c
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
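One standard way to find only the allocated regions of a sparse file, and thus never issue a 0-byte read, is lseek() with SEEK_DATA/SEEK_HOLE. The sketch below is a standalone illustration of that technique, not necessarily the exact mechanism used by this patch.

    #define _GNU_SOURCE /* SEEK_DATA / SEEK_HOLE */
    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Walk the data segments of an already-open sparse file. */
    static int
    walk_data_segments(int fd)
    {
        off_t data = 0, hole;

        for (;;) {
            data = lseek(fd, data, SEEK_DATA);
            if (data < 0)
                return (errno == ENXIO) ? 0 : -1; /* ENXIO: no data left */
            hole = lseek(fd, data, SEEK_HOLE);
            if (hole < 0)
                return -1;
            /* [data, hole) is a non-empty data segment, so a read of
             * (hole - data) bytes here can never be a 0-byte read. */
            printf("data segment: offset=%lld length=%lld\n",
                   (long long)data, (long long)(hole - data));
            data = hole;
        }
    }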
| |
The force option for the snapshot create command fails even when
quorum is satisfied, and the option is redundant.
This change deprecates the force option for the snapshot create command
and, instead of checking for quorum, checks that all bricks are online
before creating a snapshot.
Fixes: #2099
Change-Id: I45d866e67052fef982a60aebe8dec069e78015bd
Signed-off-by: Nishith Vihar Sakinala <nsakinal@redhat.com>
| |
When client.strict-locks is enabled on a volume and POSIX locks are
held on files, do not re-open such fds after a client disconnect and
reconnection, since re-opening them might lead to multiple clients
acquiring the locks and cause data corruption.
Change-Id: I8777ffbc2cc8d15ab57b58b72b56eb67521787c5
Fixes: #1977
Signed-off-by: karthik-us <ksubrahm@redhat.com>
| |
Problem:
On a cluster with 15 million files, when fix-layout was started, it was
not progressing at all. So we tried to do an os.walk() + os.stat() on the
backend filesystem directly. It took 2.5 days. We removed os.stat() and
re-ran it on another brick with a similar data set. It took 15 minutes. We
realized that readdirp is extremely costly compared to readdir if the
stat is not useful. The fix-layout operation only needs to know whether an
entry is a directory so that fix-layout can be triggered on
it. Most modern filesystems provide this information in the readdir
operation. We don't need readdirp, i.e. readdir+stat.
Fix:
Use the readdir operation in fix-layout. Fall back to readdir+stat/lookup
for filesystems that don't provide d_type in the readdir operation.
fixes: #2241
Change-Id: I5fe2ecea25a399ad58e31a2e322caf69fc7f49eb
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
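The readdir-plus-d_type approach, with a stat fallback for filesystems that don't fill d_type, looks roughly like this standalone sketch (plain POSIX C, not the DHT code itself):

    #define _DEFAULT_SOURCE /* for d_type / DT_* constants */
    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>

    /* Visit every sub-directory of 'path' without stat()-ing entries whose
     * type the filesystem already reports via d_type. */
    static int
    visit_subdirs(const char *path)
    {
        DIR *dir = opendir(path);
        struct dirent *entry;
        char child[4096];
        struct stat st;

        if (!dir)
            return -1;

        while ((entry = readdir(dir)) != NULL) {
            if (!strcmp(entry->d_name, ".") || !strcmp(entry->d_name, ".."))
                continue;

            int is_dir;
            if (entry->d_type != DT_UNKNOWN) {
                is_dir = (entry->d_type == DT_DIR); /* cheap: no stat needed */
            } else {
                /* Fallback for filesystems that don't provide d_type. */
                snprintf(child, sizeof(child), "%s/%s", path, entry->d_name);
                if (lstat(child, &st) < 0)
                    continue;
                is_dir = S_ISDIR(st.st_mode);
            }

            if (is_dir)
                printf("directory: %s/%s\n", path, entry->d_name);
        }
        closedir(dir);
        return 0;
    }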
| |
At the moment dht rebalance doesn't give any option to disable fsync
after data migration. Making this an option lets admins take
responsibility for their data in a way that is suitable for their cluster.
The default value is still 'on', so the behavior is unchanged for people
who don't care about this.
For example: if the data that is going to be migrated is already backed
up or snapshotted, there is no need for fsync to happen right after
migration, where it can affect active I/O on the volume from applications.
fixes: #2258
Change-Id: I7a50b8d3a2f270d79920ef306ceb6ba6451150c4
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
| |
* features/index: Optimize link-count fetching code path
Problem:
AFR requests 'link-count' in lookup to check if there are any pending
heals. Based on this information, afr sets dirent->inode to NULL in
readdirp while heals are ongoing, to prevent serving bad data. Once heals
are completed, fetching the link-count xattr leads to an opendir of the
xattrop directory followed by reading its contents, for every lookup,
just to figure out that no healing is needed. This was not detected until
this github issue because ZFS in some cases can lead to very slow readdir()
calls. Since Glusterfs does a lot of lookups, this was slowing down
all operations and increasing load on the system.
Code problem:
On any xattrop operation, the index xlator adds an index in the relevant
dirs and, after the xattrop operation is done, deletes or keeps the index
in that directory based on the value fetched in the xattrop from posix.
AFR sends an all-zero xattrop for changelog xattrs. This leads to
priv->pending_count manipulation which sets the count back to -1. The next
lookup operation then triggers opendir/readdir to find the actual link-count
because the in-memory priv->pending_count is negative.
Fix:
1) Don't add an index on an all-zero xattrop for a key.
2) Set pending-count to -1 when the first gfid is added into the xattrop
directory, so that the next lookup can compute the link-count.
fixes: #1764
Change-Id: I8a02c7e811a72c46d78ddb2d9d4fdc2222a444e9
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* addressed comments
Change-Id: Ide42bb1c1237b525d168bf1a9b82eb1bdc3bc283
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* tests: Handle base index absence
Change-Id: I3cf11a8644ccf23e01537228766f864b63c49556
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
* Addressed LOCK based comments, .t comments
Change-Id: I5f53e40820cade3a44259c1ac1a7f3c5f2f0f310
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
| |
AFR may hide some existing entries from a directory when reading it
because they are generated internally for private management. However,
the number of entries returned by the readdir() function is not updated
accordingly, so it may report a number higher than the real entries
present in the gf_dirent list.
This may cause unexpected behavior in clients, including gfapi, which
incorrectly assumes that there was an entry when the list was actually
empty.
This patch also makes the check in gfapi more robust to avoid similar
issues that could appear in the future.
Fixes: #2232
Change-Id: I81ba3699248a53ebb0ee4e6e6231a4301436f763
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
| |
When parallel-readdir is enabled, readdir(p) requests sent by DHT can be
immediately processed and answered in the same thread before the call to
STACK_WIND_COOKIE() completes.
This means that the readdir(p) cbk is processed synchronously. In some
cases it may decide to send another readdir(p) request, which causes a
recursive call.
When some special conditions happen and the directories are big, it's
possible that the number of nested calls is so high that the process
crashes because of a stack overflow.
This patch fixes the problem by not allowing nested readdir(p) calls. When
a nested call is detected, it is queued instead of being sent. The queued
request is processed by the top-level stack function when the current call
finishes.
Fixes: #2169
Change-Id: Id763a8a51fb3c3314588ec7c162f649babf33099
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
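The pattern used by the fix - detect a re-entrant call and queue it instead of recursing, then drain the queue at the top level - can be sketched generically as below (single-threaded plain C for clarity; a real implementation additionally needs locking around the queue; all names are illustrative, not the actual xlator code):

    #include <stdbool.h>
    #include <stddef.h>

    struct request {
        struct request *next;
        void (*handler)(struct request *);
    };

    static struct request *queue_head, *queue_tail;
    static bool in_progress; /* true while a request is being dispatched */

    /* Dispatch a request. If we are already inside a dispatch (i.e. a
     * callback ran synchronously and issued another request), queue it and
     * let the outermost caller drain the queue, keeping the stack flat. */
    static void
    dispatch(struct request *req)
    {
        if (in_progress) {
            /* nested call: enqueue instead of recursing */
            req->next = NULL;
            if (queue_tail)
                queue_tail->next = req;
            else
                queue_head = req;
            queue_tail = req;
            return;
        }

        in_progress = true;
        while (req) {
            req->handler(req); /* may call dispatch() again; that enqueues */
            req = queue_head;
            if (req) {
                queue_head = req->next;
                if (!queue_head)
                    queue_tail = NULL;
            }
        }
        in_progress = false;
    }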
| |
Problem:
volume profile info now prints durations in nanoseconds. The tests were
written when the duration was printed in microseconds. This leads to
spurious failures.
Fix:
Change the tests to handle nanosecond durations.
fixes: #2134
Change-Id: I94722be87000a485d98c8b0f6d8b7e1a526b07e7
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
| |
Problem:
The fix-layout operation assumes that the path passed is a directory, i.e.
layout->cnt == conf->subvolume_cnt. This leads to a crash when
fix-layout is attempted on a file.
Fix:
Disallow fix-layout on files.
fixes: #2107
Change-Id: I2116b8773059f67e3260e9207e20eab3de711417
Signed-off-by: Pranith Kumar K <pranith.karampuri@phonepe.com>
| |
* features/shard: delay unlink of a file that has fd_count > 0
When there are multiple processes working on a file and any one of them
unlinks that file, the unlink operation shouldn't harm the
other processes working on it. This is POSIX-compliant
behavior and it should also be supported when the shard feature is
enabled.
Problem description:
Let's consider 2 clients C1 and C2 working on a file F1 with 5
shards on a gluster mount, and a gluster server with 4 bricks
B1, B2, B3, B4.
Assume that the base file/shard is present on B1, the 1st and 2nd shards
on B2, the 3rd and 4th shards on B3 and the 5th shard on B4. C1
has opened F1 in append mode and is writing to it. The
write FOP goes to the 5th shard in this case, so
inode->fd_count = 1 on B1 (base file) and B4 (5th shard).
C2 at the same time issues an unlink of F1. On the server, the
base file has fd_count = 1 (since C1 has opened the file), so
the base file is renamed under .glusterfs/unlink and the call
returns to C2. Then the unlink is sent to the shards on all
bricks, and the shards on B2 and B3, which have no open
reference yet, are deleted. C1 starts getting errors while
accessing the remaining shards even though it has open references
for the file.
This is one such undefined behavior. Likewise we will
encounter many such undefined behaviors, as we don't have one
global lock to access all shards as one. Of course having such a
global lock would lead to a performance hit, as it reduces the window
for parallel access of shards.
Solution:
The above undefined behavior can be addressed by delaying the
unlink of a file when there are open references on it.
File unlink happens in 2 steps.
step 1: the client creates a marker file under .shard/remove_me and
sends the unlink of the base file to the server
step 2: on return from the server, the associated shards will
be cleaned up and finally the marker file will be removed.
In step 2, the background deletion process does a nameless
lookup using the marker file name (the marker file is named after the
gfid of the base file) in the .glusterfs/unlink dir. If the nameless
lookup is successful, that means the gfid still has open
fds and deletion of the shards has to be delayed. If the nameless
lookup fails, that indicates the gfid is unlinked and there are no
open fds on that file (the gfid path is unlinked during the final
close on the file). The shards whose deletion is delayed
are unlinked once all open fds are closed, and this is
done through a thread which wakes up every 10 mins.
Also removed active_fd_count from the inode structure and
refer to fd_count wherever active_fd_count was used.
fixes: #1358
Change-Id: I8985093386e26215e0b0dce294c534a66f6ca11c
Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
* features/shard: delay unlink of a file that has fd_count > 0
fixes: #1358
Change-Id: Iec16d7ff5e05f29255491a43fbb6270c72868999
Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
* features/shard: delay unlink of a file that has fd_count > 0
fixes: #1358
Change-Id: I07e5a5bf9d33c24b63da72d4f3f59392c5421652
Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
* features/shard: delay unlink of a file that has fd_count > 0
fixes: #1358
Change-Id: I3679de8545f2e5b8027c4d5a6fd0592092e8dfbd
Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
* Update xlators/storage/posix/src/posix-entry-ops.c
Co-authored-by: Xavi Hernandez <xhernandez@users.noreply.github.com>
* features/shard: delay unlink of a file that has fd_count > 0
fixes: #1358
Change-Id: I8985093386e26215e0b0dce294c534a66f6ca11c
Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
* Update fd.c
* features/shard: delay unlink of a file that has fd_count > 0
fixes: #1358
Change-Id: I8985093386e26215e0b0dce294c534a66f6ca11c
Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
* features/shard: delay unlink of a file that has fd_count > 0
fixes: #1358
Change-Id: I8985093386e26215e0b0dce294c534a66f6ca11c
Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
Co-authored-by: Xavi Hernandez <xhernandez@users.noreply.github.com>
| |
When we hit the max capacity of the storage space, shard_unlink()
starts failing if there is no space left on the brick to create a
marker file.
shard_unlink() happens in the following steps:
1. create a marker file named after the gfid of the base file under
BRICK_PATH/.shard/.remove_me
2. unlink the base file
3. shard_delete_shards() deletes the shards in the background by
picking up the entries in BRICK_PATH/.shard/.remove_me
If marker file creation fails, we can't really delete the
shards, which is eventually a problem for a user who is trying to make
space by deleting unwanted data.
Solution:
Create the marker file with xdata = GLUSTERFS_INTERNAL_FOP_KEY so that
it is treated as an internal op and is allowed to be created in the
reserved space.
Fixes: #2038
Change-Id: I7facebab940f9aeee81d489df429e00ef4fb7c5d
Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
| |
* tests: fix tests/bugs/nfs/bug-1053579.t
On NFS the number of groups associated with a user that can be passed
to the server is limited. This test created a user with 200 groups
and checked that a file owned by the latest created group couldn't
be accessed, under the assumption that the last group won't be passed
to the server.
However, there's no guarantee on how the list of groups is generated,
so the latest created group could be passed as one of the initial
groups, allowing access to the file and causing the test
to fail (because it expected access to be denied).
Given that there's no way to be sure which groups will be passed, this
patch changes the test so that a check is done for each group the user
belongs to. Then we check that there have been some successes and some
failures.
Once 'manage-gids' is set, we do the same, but this time the number of
failures must be 0.
Fixes: #2033
Change-Id: Ide06da2861fcade2166372d1f3e9eb4ff2dd5f58
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
| |
The test case (./tests/bugs/replicate/bug-921231.t)
is continuously failing. It fails because
inodelk_max_latency shows a wrong value in the profile.
The value is not correct because the profile
timestamp was recently changed from microseconds to nanoseconds
by patch #1833.
Fixes: #2005
Change-Id: Ieb683836938d986b56f70b2380103efe95657821
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
| |
The issue is that shard_make_block_abspath() calls gf_uuid_unparse()
every time while constructing a shard path. The gfid string can be
generated and saved once and passed in while constructing the path,
avoiding repeated gf_uuid_unparse() calls.
Fixes: #1423
Change-Id: Ia26fbd5f09e812bbad9e5715242f14143c013c9c
Signed-off-by: Vinayakswami Hariharmath vharihar@redhat.com
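The optimization boils down to producing the textual gfid once and reusing it for every shard path. A standalone sketch with libuuid (illustrative only, not the shard translator code; the path format is made up):

    #include <stdio.h>
    #include <uuid/uuid.h> /* link with -luuid */

    /* Build paths for shards 1..count of a file identified by 'gfid'.
     * The textual gfid is produced once, instead of calling uuid_unparse()
     * for every single shard path. */
    static void
    build_shard_paths(const uuid_t gfid, int count)
    {
        char gfid_str[37]; /* 36 chars + NUL */
        char path[256];

        uuid_unparse(gfid, gfid_str); /* done once, reused below */

        for (int i = 1; i <= count; i++) {
            snprintf(path, sizeof(path), "/.shard/%s.%d", gfid_str, i);
            printf("%s\n", path);
        }
    }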
| |
* stripe cleanup: Remove the option from create and add-brick cmds
This patch removes the code for the stripe option instead
of keeping default values of the stripe/stripe-count variables and
setting and getting dict options and similar redundant operations.
It also removes tests for stripe volumes that have already been
marked bad.
Updates: #1000
Change-Id: Ic2b3cabd671f0c8dc0521384b164c3078f7ca7c6
Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
* Fix regression error
tests/000-flaky/basic_changelog_changelog-snapshot.t
was failing due to 0 return value
Change-Id: I8ea0443669c63768760526db5aa1f205978e1dbb
Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
* add constant stripe_count value for upgrade scenarios
Change-Id: I49f3da4f106c55f9da20d0b0a299275a19daf4ba
* Fix clang-format warning
Change-Id: I83bae85d10c8c5b3c66f56c9f8de1ec81d0bbc95
| |
TODO:
Remove 'slave-timeout' and 'slave-gluster-command-dir'.
These variables are defined in geo-replication/gsyncd.conf.in.
So I will remove them when I change that folder.
Change-Id: Ib9167ca586d83e01f8ec755cdf58b3438184c9dd
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
| |
* core: Implement graceful shutdown for a brick process
glusterd sends a SIGTERM to the brick process when stopping a
volume if brick_mux is not enabled. In the brick_mux case, when
the terminate signal arrives for the last brick, the brick process
sends a SIGTERM to itself to stop the process. The current approach
does not clean up resources when either the last brick
is detached or brick_mux is not enabled.
Solution: glusterd sends a terminate notification to the
brick process when stopping a volume, for a graceful
shutdown
Change-Id: I49b729e1205e75760f6eff9bf6803ed0dbf876ae
Fixes: #1749
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* core: Implement graceful shutdown for a brick process
Resolve some reviewer comments
Fixes: #1749
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
Change-Id: I50e6a9e2ec86256b349aef5b127cc5bbf32d2561
* core: Implement graceful shutdown for a brick process
Implement a key cluster.brick-graceful-cleanup to enable graceful
shutdown for a brick process. If the key value is on, glusterd sends a
detach request to stop the brick.
Fixes: #1749
Change-Id: Iba8fb27ba15cc37ecd3eb48f0ea8f981633465c3
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* core: Implement graceful shutdown for a brick process
Resolve reviewer comments
Fixes: #1749
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
Change-Id: I2a8eb4cf25cd8fca98d099889e4cae3954c8579e
* core: Implement graceful shutdown for a brick process
Resolve a reviewer comment about avoiding a memory leak
Fixes: #1749
Change-Id: Ic2f09efe6190fd3776f712afc2d49b4e63de7d1f
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
* core: Implement graceful shutdown for a brick process
Resolve a reviewer comment about avoiding a memory leak
Fixes: #1749
Change-Id: I68fbbb39160a4595fb8b1b19836f44b356e89716
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
| |
* glusterd/cli: enhance rebalance-status after replace/reset-brick
Rebalance status is being reset during replace/reset-brick operations.
This causes 'volume status' to show rebalance as "not started".
Fix:
change rebalance-status to "reset due to (replace|reset)-brick"
Change-Id: I6e3372d67355eb76c5965984a23f073289d4ff23
Signed-off-by: Tamar Shacked <tshacked@redhat.com>
* glusterd/cli: enhance rebalance-status after replace/reset-brick
Rebalance status is being reset during replace/reset-brick operations.
This causes 'volume status' to show rebalance as "not started".
Fix: change rebalance-status to "reset due to (replace|reset)-brick"
Fixes: #1717
Signed-off-by: Tamar Shacked <tshacked@redhat.com>
Change-Id: I1e3e373ca3b2007b5b7005b6c757fb43801fde33
* cli: changing rebal task ID to "None" in case status is being reset
Rebalance status is being reset during replace/reset-brick operations.
This causes 'volume status' to show rebalance as "not started".
Fix:
change rebalance-status to "reset due to (replace|reset)-brick"
Fixes: #1717
Change-Id: Ia73a8bea3dcd8e51acf4faa6434c3cb0d09856d0
Signed-off-by: Tamar Shacked <tshacked@redhat.com>
| |
* glusterd: modify logic for checking hostname in add-brick
Problem: the add-brick command parses only the bricks provided
on the CLI for a subvolume. If bricks are added to the same subvolume,
they are not checked against the bricks already present in the volume.
Fixes: #1779
Change-Id: I768bcf7359a008f2d6baccef50e582536473a9dc
Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
* removed assignment of unused variable
Fixes: #1779
Change-Id: Id5ed776b28343e1225b9898e81502ce29fb480fa
Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
* few more changes
Change-Id: I7bacedb984f968939b214f9d13546f4bf92e9df7
Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
* correction in last commit
Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
Change-Id: I1fd0d941cf3f32aa6e8c7850def78e5af0d88782
| |
DHT/Rebalance - Ensure Rebalance reports status only once upon stopping
Upon issuing the rebalance stop command, the rebalance status is
logged twice to the log file, which can sometimes result in
inconsistent reports (one report states the status is stopped, while
the other may report something else).
This fix ensures rebalance reports its status only once, and that the
correct status is reported.
fixes: #1782
Change-Id: Id3206edfad33b3db60e9df8e95a519928dc7cb37
Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
| |
Call posix_io_uring_fini only if it was inited to begin with.
Fixes: #1794
Reported-by: Mohit Agrawal <moagrawa@redhat.com>
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
Change-Id: I0e840b6b1d1f26b104b30c8c4b88c14ce4aaac0d
| |
* tests: Fix issues in CentOS 8
Due to some configuration changes in CentOS 8/RHEL 8, ssl-ciphers.t
and bug-1053579.t were failing.
The first one was failing because TLS v1.0 is disabled by default. The
test has been updated to check that at least one of TLS v1.0, v1.1 or
v1.2 succeeds.
For the second case, the issue is that the test assumed that the
group most recently added to a user is always listed last, but
this is not always true because nsswitch.conf now uses 'sss' before
'files', which means the data comes from a db that may not be
sorted.
Updates: #1009
Change-Id: I4ca01a099854ec25926c3d76b3a98072175bab06
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
* tests: Fix TLS version detection
The old test didn't correctly determine which version of TLS should
be allowed by openssl.
Change-Id: Ic081c329d5ed1842fa9f5fd23742ae007738aec0
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
| |
1. The option has been enabled and tested for quite some time now in RHHI-V
downstream and I think it is safe to make it 'on' by default. Since it
is not possible to simply change it from 'off' to 'on' without breaking
rolling upgrades, old clients etc., I have made it the default only for new
volumes starting from op-version GD_OP_VERSION_9_0.
Note: If you do a volume reset, the option will be turned back off.
This is okay, as the dir's gfid will be captured in the 'xattrop' folder and
heals will proceed. There might be stale entries inside the 'entry-changes'
folder, which will be removed when we enable the option again.
2. I encountered a customer issue where entry heal was pending on a dir with
236436 files in it and the glustershd.log output was just stuck at
"performing entry selfheal", so I have added logs at DEBUG level to give us
more info about whether entry heal and data heal are
progressing (metadata heal doesn't take much time). That way, we have a
quick visual indication that things are not 'stuck' if we briefly
enable debug logs, instead of taking statedumps or checking profile info
etc.
Fixes: #1483
Change-Id: I4f116f8c92f8cd33f209b758ff14f3c7e1981422
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
| |
The test case tests/bugs/bug-1064147.t fails when
comparing the root permission with the permission changed
while one of the bricks was down. The permissions were not matching
because the layout did not exist on root at the time of healing
the permission, so the correct permission was not healed on the
newly started brick.
Fixes: #1661
Change-Id: If63ea47576dd14f4b91681dd390e2f84f8b6ac18
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
| |
The io-stats xlator allocates an ios_sample_buf_size buffer of 64k samples
(~10M) per xlator, but when sample_interval is 0 this big buffer is not
required, so allocate the default-sized buffer only while sample_interval is
not 0. The new change helps reduce the RSS size of brick and shd processes
when the number of volumes is huge.
Change-Id: I3e82cca92e40549355edfac32580169f3ce51af8
Fixes: #1542
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
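The change follows the usual lazy-allocation pattern: only pay for the big sample buffer when sampling is actually enabled. A small illustrative sketch with made-up names and a made-up record size (not the io-stats code):

    #include <stdlib.h>

    struct sample_state {
        int sample_interval;   /* 0 means sampling is disabled */
        size_t buf_count;      /* number of sample slots */
        void *sample_buf;      /* large buffer when sampling is on */
    };

    /* Allocate the sample buffer only when sample_interval is non-zero;
     * assume 64-byte sample records for this sketch. */
    static int
    sample_state_init(struct sample_state *s, int interval, size_t full_count)
    {
        s->sample_interval = interval;
        s->buf_count = (interval != 0) ? full_count : 0;
        s->sample_buf = s->buf_count ? calloc(s->buf_count, 64) : NULL;
        if (s->buf_count && !s->sample_buf)
            return -1; /* allocation failed */
        return 0;
    }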
| |
Problem 1:
When a directory is renamed while a brick
is down, entry-heal always did an rm -rf on that directory at
the old location on the sink and then a mkdir, recreating the directory
hierarchy at the new location. This is inefficient.
Problem 2:
The renamed-directory heal order may lead to a scenario where the
directory in the new location is created before it is deleted from the
old location, leading to 2 directories with the same gfid in posix.
Fix:
As part of heal, if the old location is healed first and is not present
on the source brick, always rename it into a hidden directory inside the
sink brick, so that when heal is triggered at the new location shd can
rename it from this hidden directory to the new location.
If the new-location heal is triggered first and it detects that the
directory already exists on the brick, it should skip healing the
directory until it appears in the hidden directory.
Credits: Ravi for rename-data-loss.t script
Fixes: #1211
Change-Id: I0cba2006f35cd03d314d18211ce0bd530e254843
Signed-off-by: Pranith Kumar K <pkarampu@redhat.com>
| |
Problem: the add-brick operation fails with a "multiple bricks on the
same server" error when the replica count is increased.
This was happening because of extra runs in a loop comparing hostnames:
if fewer bricks than the "replica" count were supplied, a brick would
get compared to itself, resulting in the above error.
Fixes: #1508
Change-Id: I8668e964340b7bf59728bb838525d2db062197ed
Signed-off-by: Sheetal Pamecha <spamecha@redhat.com>
| |
Glusterfs so far constrained itself with an arbitrary limit (32)
for the number of groups read from /proc/[pid]/status (this was
the number of groups shown there prior to Linux commit
v3.7-9553-g8d238027b87e (v3.8-rc1~74^2~59); since this commit, all
groups are shown).
With this change we'll read groups up to the number Glusterfs
supports in general (64k).
Note: the actual number of groups that are made use of in a
regular Glusterfs setup shall still be capped at ~93 due to limitations
of the RPC transport. To be able to handle more groups than that,
brick side gid resolution (server.manage-gids option) can be used along
with NIS, LDAP or other such networked directory service (see
https://github.com/gluster/glusterdocs/blob/5ba15a2/docs/Administrator%20Guide/Handling-of-users-with-many-groups.md#limit-in-the-glusterfs-protocol
).
Also adding some diagnostic messages to frame_fill_groups().
Change-Id: I271f3dc3e6d3c44d6d989c7a2073ea5f16c26ee0
fixes: #1075
Signed-off-by: Csaba Henk <csaba@redhat.com>
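Reading the (no longer 32-capped) group list amounts to parsing the "Groups:" line of /proc/[pid]/status. A standalone sketch, not the frame_fill_groups() code; a real implementation must also handle lines longer than a fixed buffer, since 64k groups will not fit in one fgets() call:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>

    /* Parse the "Groups:" line of /proc/<pid>/status into 'groups',
     * up to 'max' entries. Returns the number of groups found, or -1. */
    static int
    read_proc_groups(pid_t pid, gid_t *groups, int max)
    {
        char path[64], line[8192];
        FILE *fp;
        int count = -1;

        snprintf(path, sizeof(path), "/proc/%d/status", (int)pid);
        fp = fopen(path, "r");
        if (!fp)
            return -1;

        while (fgets(line, sizeof(line), fp)) {
            if (strncmp(line, "Groups:", 7) != 0)
                continue;
            count = 0;
            for (char *p = line + 7; *p && count < max;) {
                char *end;
                long gid = strtol(p, &end, 10);
                if (end == p)
                    break; /* no more numbers on the line */
                groups[count++] = (gid_t)gid;
                p = end;
            }
            break;
        }
        fclose(fp);
        return count;
    }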
| |
* also add some time gap in other tests to see if they behave properly
* create a directory 'tests/000/', which can host any tests that are flaky.
* move all the tests mentioned in the issue to the above directory.
* as the above dir gets tested first, all flaky tests would be reported quickly.
* change `run-tests.sh` to continue tests even if flaky tests fail.
Reference: gluster/project-infrastructure#72
Updates: #1000
Change-Id: Ifdafa38d083ebd80f7ae3cbbc9aa3b68b6d21d0e
Signed-off-by: Amar Tumballi <amar@kadalu.io>
| |
Assume that we are preallocating a VM of size 1TB with a shard
block size of 64MB; then there will be ~16k shards.
This creation happens in 2 steps in the shard_fallocate() path, i.e.
1. lookup of the shards, if any are already present, and
2. mknod of the shards that do not exist.
But in the case of fresh creation, we don't have to look up the
shards that are not present, as the file size will be 0.
Through this, we can save a lookup on every shard that is not
present. This optimization is quite useful when
preallocating a big VM.
Also, if the file is already present and the call is to
extend it to a bigger size, we need not look up the non-
existent shards. Just look up the pre-existing shards, populate
the inodes and issue mknod for the extended size.
Fixes: #1425
Change-Id: I60036fe8302c696e0ca80ff11ab0ef5bcdbd7880
Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
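The saving can be made concrete with a little arithmetic: only shard blocks below the file's current size can already exist and need a lookup, while the rest can be created directly. An illustrative sketch, not the shard translator's logic:

    #include <stdio.h>

    int
    main(void)
    {
        const long long block_size = 64LL << 20;     /* 64MB shard block size */
        const long long current_size = 0;            /* fresh file */
        const long long requested_size = 1LL << 40;  /* preallocate 1TB */

        /* number of shard blocks covered by each size (rounded up) */
        long long existing = (current_size + block_size - 1) / block_size;
        long long total = (requested_size + block_size - 1) / block_size;

        /* Only blocks that may already exist need a lookup; the rest can be
         * created directly. For a fresh 1TB file that skips ~16k lookups. */
        printf("lookups needed: %lld, lookups skipped: %lld\n",
               existing, total - existing);
        return 0;
    }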
| |
In a cluster environment getspec() detects that the volfile is not found,
but further on this return code is overwritten by another call,
so the error is lost and not handled.
As a result the server responds with an ambiguous message:
{op_ret = -1, op_errno = 0..} - which causes the client to get stuck.
Fix:
server side: don't override the failure error.
fixes: #1375
Change-Id: Id394954d4d0746570c1ee7d98969649c305c6b0d
Signed-off-by: Tamar Shacked <tshacked@redhat.com>
| |
The scenario of setting an xattr on a dir, killing one of the bricks,
removing the xattr and bringing the brick back results in an xattr
inconsistency - the downed brick will still have the xattr, but the rest
won't.
This patch adds a mechanism that removes the extra xattrs during
lookup.
This patch is a modification to a previous patch based on comments that
were made after merge:
https://review.gluster.org/#/c/glusterfs/+/24613/
fixes: #1324
Change-Id: Ifec0b7aea6cd40daa8b0319b881191cf83e031d1
Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
| |
Mark dir as missing in layout structure to be healed in
dht_selfheal_directory.
fixes: #1327
Change-Id: If2c69294bd8107c26624cfe220f008bc3b952a4e
Signed-off-by: Susant Palai <spalai@redhat.com>
| |
Added tests for volume options like
localtime-logging, fixed enable-shared-storage
to include function coverage, and added a few negative
tests for other volume options to increase the
code coverage of the glusterd component.
Change-Id: Ib1706c1fd5bc98a64dcb5c8b15a121d639a597d7
Updates: #1052
Signed-off-by: nik-redhat <nladha@redhat.com>
| |
This reverts commit 620158475f462251c996901a8e24306ef6cb4c42.
The patch to revert is https://review.gluster.org/#/c/glusterfs/+/24613/
Reverting is required because comments regarding a more efficient
implementation were posted after the patch was merged.
A new patch will be posted to address those comments.
updates: #1324
Change-Id: I59205baefe1cada033c736d41ce9c51b21727d3f
Signed-off-by: Barak Sason Rofman <redhat@gmail.com>
| |
The scenario of setting an xattr on a dir, killing one of the bricks,
removing the xattr and bringing the brick back results in an xattr
inconsistency - the downed brick will still have the xattr, but the rest
won't.
This patch adds a mechanism that removes the extra xattrs during
lookup.
fixes: #1324
Change-Id: Ibcc449bad6c7cb46bcae380e42e4496d733b453d
Signed-off-by: Barak Sason Rofman <bsasonro@redhat.com>
| |
Problem: the add-brick operation fails when the replica or disperse
count is not mentioned in the add-brick command.
Reason: with commit a113d93 we check the brick order while
doing an add-brick operation for replica and disperse volumes. If the
replica or disperse count is not mentioned in the command,
the dict get fails, resulting in add-brick operation failure.
fixes: #1306
Change-Id: Ie957540e303bfb5f2d69015661a60d7e72557353
Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
| |
tests/bugs/glusterd/mgmt-handshake-and-volume-sync-post-glusterd-restart.t
Test Summary Report
-------------------
tests/bugs/glusterd/mgmt-handshake-and-volume-sync-post-glusterd-restart.t
(Wstat: 0 Tests: 23 Failed: 3)
Failed tests: 21-23
After glusterd restart, volume start is failing. It looks like it needs
some time to sync the data, so a sleep is added for that.
Note: All other changes are made to avoid spurious failures in the future.
fixes: #1272
Change-Id: Ib184757fb936e03b5b6208465e44a8e790b71c1c
Signed-off-by: Sanju Rakonde <srakonde@redhat.com>
| |
Problem: See github issue for details.
Fix:
-In lookup, if the entry exists on 2 out of 3 bricks, don't fail the
lookup with ENOENT just because there is an entrylk on the parent.
Consider quorum before deciding.
-If an entry FOP does not succeed on a quorum number of bricks, do not
perform the new-entry mark.
Fixes: #1303
Change-Id: I56df8c89ad53b29fa450c7930a7b7ccec9f4a6c5
Signed-off-by: Ravishankar N <ravishankar@redhat.com>
| |
Logs and other output carrying timestamps
will now have timezone offsets indicated, e.g.:
[2020-03-12 07:01:05.584482 +0000] I [MSGID: 106143] [glusterd-pmap.c:388:pmap_registry_remove] 0-pmap: removing brick (null) on port 49153
To this end,
- gf_time_fmt() now inserts timezone offset via %z strftime(3) template.
- A new utility function has been added, gf_time_fmt_tv(), that
takes a struct timeval pointer (*tv) instead of a time_t value to
specify the time. If tv->tv_usec is negative,
gf_time_fmt_tv(... tv ...)
is equivalent to
gf_time_fmt(... tv->tv_sec ...)
Otherwise it also inserts tv->tv_usec to the formatted string.
- Building timestamps of usec precision has been converted to
gf_time_fmt_tv, which is necessary because the method of appending
a period and the usec value to the end of the timestamp does not work
if the timestamp has zone offset, but it's also beneficial in terms of
eliminating repetition.
- The buffer passed to gf_time_fmt/gf_time_fmt_tv has been unified to
be of GF_TIMESTR_SIZE size (256). We need slightly larger buffer space
to accommodate the zone offset and it's preferable to use a buffer
which is undisputedly large enough.
This change does *not* do the following:
- Retaining a method of timestamp creation without timezone offset.
As to my understanding we don't need such backward compatibility
as the code just emits timestamps to logs and other diagnostic
texts, and doesn't do any later processing on them that would rely
on their format. An exception to this, ie. a case where timestamp
is built for internal use, is graph.c:fill_uuid(). As far as I can
see, what matters in that case is the uniqueness of the produced
string, not the format.
- Implementing a single-token (space free) timestamp format.
While some timestamp formats used to be single-token, now all of
them will include a space preceding the offset indicator. Again,
I did not see a use case where this could be significant in terms
of representation.
- Moving the codebase to a single unified timestamp format and
dropping the fmt argument of gf_time_fmt/gf_time_fmt_tv.
While the gf_timefmt_FT format is almost ubiquitous, there are
a few cases where different formats are used. I'm not convinced
there is any reason to not use gf_timefmt_FT in those cases too,
but I did not want to make a decision in this regard.
Change-Id: I0af73ab5d490cca7ed8d07a2ce7ac22a6df2920a
Updates: #837
Signed-off-by: Csaba Henk <csaba@redhat.com>
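As a standalone illustration of the resulting timestamp shape (zone offset via strftime's %z plus optional microseconds, as in the log line quoted above) - plain libc, mirroring the described gf_time_fmt_tv() behavior but not the GlusterFS implementation:

    #include <stdio.h>
    #include <sys/time.h>
    #include <time.h>

    /* Format "YYYY-MM-DD HH:MM:SS.uuuuuu +ZZZZ" into buf.
     * If tv->tv_usec is negative, the fractional part is omitted. */
    static void
    format_timestamp(char *buf, size_t size, const struct timeval *tv)
    {
        struct tm tm;
        char base[64], zone[16];

        localtime_r(&tv->tv_sec, &tm);
        strftime(base, sizeof(base), "%Y-%m-%d %H:%M:%S", &tm);
        strftime(zone, sizeof(zone), "%z", &tm); /* e.g. "+0000" */

        if (tv->tv_usec >= 0)
            snprintf(buf, size, "%s.%06ld %s", base, (long)tv->tv_usec, zone);
        else
            snprintf(buf, size, "%s %s", base, zone);
    }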
| |
Issue:
When a process has an open fd on a file and the same file is
unlinked in the middle of its operations, path-based
lookup fails with ENOENT or a stale-file error.
Solution:
When the file is already open and an fd is available, use fstat
to get the file attributes.
Change-Id: I0e83aee9f11b616dcfe13769ebfcda6742e4e0f4
Fixes: #1281
Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
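This is the standard POSIX pattern of preferring fstat() on an open descriptor, which keeps working even after the path has been unlinked; a minimal sketch (not the shard translator code):

    #include <sys/stat.h>
    #include <unistd.h>

    /* Get attributes of a file that may already have been unlinked.
     * If a valid fd is available, fstat() still works after unlink;
     * a path-based stat() would fail with ENOENT. */
    static int
    get_attrs(int fd, const char *path, struct stat *st)
    {
        if (fd >= 0)
            return fstat(fd, st);  /* works on anonymous (unlinked) inodes */
        return stat(path, st);     /* only valid while the name still exists */
    }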
| |
Brick processes are not properly attached on a cluster node when
some volume options are changed on a peer node while glusterd is down on
that specific node.
Solution: When glusterd restarts, it gets a friend update request
from a peer node if the peer has some changes for a volume. If the brick
process is started before the friend update request is received,
the brick_mux behavior does not work properly: all bricks are attached to
the same process even though the volumes' options are not the same. To
avoid the issue, introduce an atomic flag volpeerupdate and update its
value when glusterd receives a friend update request from a peer for a
specific volume. If the volpeerupdate flag is 1, the volume is started by
the glusterd_import_friend_volume synctask.
Change-Id: I4c026f1e7807ded249153670e6967a2be8d22cb7
Credit: Sanju Rakaonde <srakonde@redhat.com>
fixes: #1290
Signed-off-by: Mohit Agrawal <moagrawal@redhat.com>
| |
Problem:
In a replicate/arbiter volume, if file creations or writes fail on a
quorum number of bricks, and on one brick the failure is due to ENOSPC
while on another brick it fails for a different reason, the operation
may fail with errors other than ENOSPC in some cases.
Fix:
Prioritize ENOSPC over other, lower-priority errors, and do not set
op_errno in posix_gfid_set if op_ret is 0, to avoid returning an
errno which can be misinterpreted by __afr_dir_write_finalize().
Also remove the function afr_has_arbiter_fop_cbk_quorum(), which
might consider a successful reply from a single brick as quorum
success in some cases, whereas we always need the fop to be successful
on a quorum number of bricks in an arbiter configuration.
Change-Id: I106e267f8b9451f681022f1cccb410d9bc824c08
Fixes: #1254
Signed-off-by: karthik-us <ksubrahm@redhat.com>
| |
There was a critical flaw in the previous implementation of open-behind.
When an open is done in the background, it's necessary to take a
reference on the fd_t object because once we "fake" the open answer,
the fd could be destroyed. However as long as there's a reference,
the release function won't be called. So, if the application closes
the file descriptor without having actually opened it, there will
always remain at least 1 reference, causing a leak.
To avoid this problem, the previous implementation didn't take a
reference on the fd_t, so there were races where the fd could be
destroyed while it was still in use.
To fix this, I've implemented a new xlator cbk that gets called from
fuse when the application closes a file descriptor.
The whole logic of handling background opens has been simplified and
it's more efficient now. Only if the fop needs to be delayed until an
open completes, a stub is created. Otherwise no memory allocations are
needed.
Correctly handling the close request while the open is still pending
has added a bit of complexity, but overall normal operation is simpler.
Change-Id: I6376a5491368e0e1c283cc452849032636261592
Fixes: #1225
Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>
| |
- performance.cache-size has a flawed semantics, as it's
dispatched on two independent translators, io-cache
and quick-read.
- performance.qr-cache-timeout has a confusing name, as
other options affecting quick-read have an unabbreviated
"quick-read-..." prefix in their names.
We keep these options with unchanged operation, but in the
help output we indicate their deprecation.
The following better alternatives are introduced:
- performance.io-cache-size to tune cache-size option of io-cache
- performance.quick-read-cache-size to tune cache-size option of
quick-read
- performance.quick-read-cache-timeout as a preferred synonym for
performance.qr-cache-timeout
Fixes: #952
Change-Id: Ibd04fb638de8cac450ba992ad8a415154f9f4281
Signed-off-by: Csaba Henk <csaba@redhat.com>
| |
Posix translator returns pre and postbufs in the dict in {F}REMOVEXATTR fops.
These iatts are further cached at layers like md-cache.
Shard translator, in its current state, simply returns these values without
updating the aggregated file size and block-count.
This patch fixes this problem.
Change-Id: I4b2dd41ede472c5829af80a67401ec5a6376d872
Fixes: #1243
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
| |
Posix translator returns pre and postbufs in the dict in {F}SETXATTR fops.
These iatts are further cached at layers like md-cache.
Shard translator, in its current state, simply returns these values without
updating the aggregated file size and block-count.
This patch fixes this problem.
Change-Id: I4da0eceb4235b91546df79270bcc0af8cd64e9ea
Fixes: #1243
Signed-off-by: Krutika Dhananjay <kdhananj@redhat.com>
| |
Problem: the tests/bugs/replicate/bug-1101647.t test case fails sporadically
in volume heal, since the connection of shd to the bricks was not being
checked before running the index heal.
Build link: https://build.gluster.org/job/regression-test-burn-in/5007/
Fix: Check for the connection status of the bricks with shd before
performing the index heal.
Change-Id: Ie7060f379b63bef39fd4f9804f6e22e0a25680c1
Updates: #1154
Signed-off-by: karthik-us <ksubrahm@redhat.com>