author     Vinayak hariharmath <65405035+VHariharmath-rh@users.noreply.github.com>  2021-02-03 17:04:25 +0530
committer  GitHub <noreply@github.com>  2021-02-03 17:04:25 +0530
commit     8bebb100868fa018007c120474280aed0717312d (patch)
tree       416ff20879a1c0c980f4eaff57fb2578d39f372b /xlators
parent     74b9d50461367e95b73235d71b4deb25a8dd5587 (diff)
features/shard: delay unlink of a file that has fd_count > 0 (#1563)
When multiple processes are working on a file and one of them unlinks it, the unlink operation should not harm the other processes still working on it. This is POSIX-compliant behavior and it should also be supported when the shard feature is enabled.

Problem description:

Consider two clients C1 and C2 working on a file F1 with 5 shards on a gluster mount, and a gluster server with 4 bricks B1, B2, B3 and B4. Assume that the base file/shard is on B1, the 1st and 2nd shards on B2, the 3rd and 4th shards on B3, and the 5th shard on B4.

C1 has opened F1 in append mode and is writing to it. The write FOP goes to the 5th shard in this case, so inode->fd_count = 1 on B1 (base file) and on B4 (5th shard). C2 at the same time issues an unlink on F1. On the server, the base file has fd_count = 1 (since C1 has it open), so the base file is renamed under .glusterfs/unlink and the call returns to C2. Unlink is then sent to the shards on all bricks, and the shards on B2 and B3, which have no open references yet, are deleted. C1 starts getting errors while accessing the remaining shards even though it holds open references on the file.

This is one such undefined behavior. We will encounter many more like it, because there is no single global lock to access all shards as one. Of course, such a global lock would hurt performance, since it reduces the window for parallel access to the shards.

Solution:

The undefined behavior above can be addressed by delaying the unlink of a file that still has open references on it. File unlink happens in 2 steps:

step 1: the client creates a marker file under .shard/remove_me and sends an unlink on the base file to the server.

step 2: on return from the server, the associated shards are cleaned up and finally the marker file is removed.

In step 2, the background deletion process does a nameless lookup, using the marker file name (the marker file is named after the gfid of the base file), in the .glusterfs/unlink dir. If the nameless lookup succeeds, the gfid still has open fds and deletion of the shards has to be delayed. If the nameless lookup fails, the gfid has been unlinked and there are no open fds on that file (the gfid path is unlinked during the final close on the file). The shards whose deletion was delayed are unlinked once all open fds are closed; this is done by a thread that wakes up every 10 minutes.

Also removed active_fd_count from the inode structure; fd_count is now used wherever active_fd_count was used.

fixes: #1358
Change-Id: I8985093386e26215e0b0dce294c534a66f6ca11c
Signed-off-by: Vinayakswami Hariharmath <vharihar@redhat.com>
Co-authored-by: Xavi Hernandez <xhernandez@users.noreply.github.com>
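To make the step-2 decision concrete, here is a minimal standalone sketch (plain C/POSIX, not part of this patch) of the check it performs: if an entry named after the base file's gfid is still present under the brick's .glusterfs/unlink directory, some client still holds an open fd and shard deletion must be delayed. The helper name, brick path and gfid below are hypothetical, and the real implementation does a nameless lookup carrying the GF_UNLINKED_LOOKUP key (see shard_nameless_lookup_base_file() and posix_lookup() in the diff) rather than a direct stat() on the brick.

#include <limits.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if shard deletion should be delayed,
 * 0 if it can proceed, mirroring the outcome of the nameless lookup in
 * shard_nameless_lookup_base_file(). */
static int
shard_deletion_must_be_delayed(const char *brick_path, const char *gfid_str)
{
    char unlink_path[PATH_MAX];
    struct stat stbuf;

    /* Simplified layout: <brick>/.glusterfs/unlink/<gfid> */
    snprintf(unlink_path, sizeof(unlink_path), "%s/.glusterfs/unlink/%s",
             brick_path, gfid_str);

    /* Entry still present under .glusterfs/unlink => open fds remain. */
    return stat(unlink_path, &stbuf) == 0;
}

int
main(void)
{
    const char *brick = "/bricks/brick1";                      /* illustrative */
    const char *gfid = "0a1b2c3d-4e5f-6071-8293-a4b5c6d7e8f9"; /* illustrative */

    if (shard_deletion_must_be_delayed(brick, gfid))
        printf("gfid %s still has open fds: delay shard deletion\n", gfid);
    else
        printf("gfid %s fully closed: shards can be deleted now\n", gfid);
    return 0;
}

In the patch itself, this check is retried by the "shard_unlink" thread (shard_unlink_wait() below waits roughly 10 minutes between runs) until the last fd is closed and posix_release()/posix_forget() remove the renamed gfid.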
Diffstat (limited to 'xlators')
-rw-r--r--  xlators/features/shard/src/shard.c              199
-rw-r--r--  xlators/features/shard/src/shard.h               11
-rw-r--r--  xlators/storage/posix/src/posix-entry-ops.c      36
-rw-r--r--  xlators/storage/posix/src/posix-inode-fd-ops.c   64
4 files changed, 285 insertions(+), 25 deletions(-)
diff --git a/xlators/features/shard/src/shard.c b/xlators/features/shard/src/shard.c
index 091f820caa..7a01f6d9c1 100644
--- a/xlators/features/shard/src/shard.c
+++ b/xlators/features/shard/src/shard.c
@@ -1252,7 +1252,8 @@ out:
static inode_t *
shard_link_internal_dir_inode(shard_local_t *local, inode_t *inode,
- struct iatt *buf, shard_internal_dir_type_t type)
+ xlator_t *this, struct iatt *buf,
+ shard_internal_dir_type_t type)
{
inode_t *linked_inode = NULL;
shard_priv_t *priv = NULL;
@@ -1260,7 +1261,7 @@ shard_link_internal_dir_inode(shard_local_t *local, inode_t *inode,
inode_t **priv_inode = NULL;
inode_t *parent = NULL;
- priv = THIS->private;
+ priv = this->private;
switch (type) {
case SHARD_INTERNAL_DIR_DOT_SHARD:
@@ -1304,7 +1305,7 @@ shard_refresh_internal_dir_cbk(call_frame_t *frame, void *cookie,
/* To-Do: Fix refcount increment per call to
* shard_link_internal_dir_inode().
*/
- linked_inode = shard_link_internal_dir_inode(local, inode, buf, type);
+ linked_inode = shard_link_internal_dir_inode(local, inode, this, buf, type);
shard_inode_ctx_mark_dir_refreshed(linked_inode, this);
out:
shard_common_resolve_shards(frame, this, local->post_res_handler);
@@ -1393,7 +1394,7 @@ shard_lookup_internal_dir_cbk(call_frame_t *frame, void *cookie, xlator_t *this,
goto unwind;
}
- link_inode = shard_link_internal_dir_inode(local, inode, buf, type);
+ link_inode = shard_link_internal_dir_inode(local, inode, this, buf, type);
if (link_inode != inode) {
shard_refresh_internal_dir(frame, this, type);
} else {
@@ -3595,7 +3596,8 @@ shard_resolve_internal_dir(xlator_t *this, shard_local_t *local,
"Lookup on %s failed, exiting", bname);
goto err;
} else {
- shard_link_internal_dir_inode(local, loc->inode, &stbuf, type);
+ shard_link_internal_dir_inode(local, loc->inode, this, &stbuf,
+ type);
}
}
ret = 0;
@@ -3642,6 +3644,45 @@ err:
return ret;
}
+static int
+shard_nameless_lookup_base_file(xlator_t *this, char *gfid)
+{
+ int ret = 0;
+ loc_t loc = {
+ 0,
+ };
+ dict_t *xattr_req = dict_new();
+ if (!xattr_req) {
+ ret = -1;
+ goto out;
+ }
+
+ loc.inode = inode_new(this->itable);
+ if (loc.inode == NULL) {
+ ret = -1;
+ goto out;
+ }
+
+ ret = gf_uuid_parse(gfid, loc.gfid);
+ if (ret < 0)
+ goto out;
+
+ ret = dict_set_uint32(xattr_req, GF_UNLINKED_LOOKUP, 1);
+ if (ret < 0)
+ goto out;
+
+ ret = syncop_lookup(FIRST_CHILD(this), &loc, NULL, NULL, xattr_req, NULL);
+ if (ret < 0)
+ goto out;
+
+out:
+ if (xattr_req)
+ dict_unref(xattr_req);
+ loc_wipe(&loc);
+
+ return ret;
+}
+
int
shard_delete_shards(void *opaque)
{
@@ -3743,6 +3784,11 @@ shard_delete_shards(void *opaque)
if (ret < 0)
continue;
}
+
+ ret = shard_nameless_lookup_base_file(this, entry->d_name);
+ if (!ret)
+ continue;
+
link_inode = inode_link(entry->inode, local->fd->inode,
entry->d_name, &entry->d_stat);
@@ -4114,6 +4160,9 @@ err:
int
shard_unlock_entrylk(call_frame_t *frame, xlator_t *this);
+static int
+shard_unlink_handler_spawn(xlator_t *this);
+
int
shard_unlink_base_file_cbk(call_frame_t *frame, void *cookie, xlator_t *this,
int32_t op_ret, int32_t op_errno,
@@ -4135,7 +4184,7 @@ shard_unlink_base_file_cbk(call_frame_t *frame, void *cookie, xlator_t *this,
if (xdata)
local->xattr_rsp = dict_ref(xdata);
if (local->cleanup_required)
- shard_start_background_deletion(this);
+ shard_unlink_handler_spawn(this);
}
if (local->entrylk_frame) {
@@ -5790,7 +5839,7 @@ shard_mkdir_internal_dir_cbk(call_frame_t *frame, void *cookie, xlator_t *this,
}
}
- link_inode = shard_link_internal_dir_inode(local, inode, buf, type);
+ link_inode = shard_link_internal_dir_inode(local, inode, this, buf, type);
if (link_inode != inode) {
shard_refresh_internal_dir(frame, this, type);
} else {
@@ -7104,6 +7153,132 @@ shard_seek(call_frame_t *frame, xlator_t *this, fd_t *fd, off_t offset,
return 0;
}
+static void
+shard_unlink_wait(shard_unlink_thread_t *ti)
+{
+ struct timespec wait_till = {
+ 0,
+ };
+
+ pthread_mutex_lock(&ti->mutex);
+ {
+ /* shard_unlink_handler() runs once every 10 mins */
+ wait_till.tv_sec = time(NULL) + 600;
+
+ while (!ti->rerun) {
+ if (pthread_cond_timedwait(&ti->cond, &ti->mutex, &wait_till) ==
+ ETIMEDOUT)
+ break;
+ }
+ ti->rerun = _gf_false;
+ }
+ pthread_mutex_unlock(&ti->mutex);
+}
+
+static void *
+shard_unlink_handler(void *data)
+{
+ shard_unlink_thread_t *ti = data;
+ xlator_t *this = ti->this;
+
+ THIS = this;
+
+ while (!ti->stop) {
+ shard_start_background_deletion(this);
+ shard_unlink_wait(ti);
+ }
+ return NULL;
+}
+
+static int
+shard_unlink_handler_spawn(xlator_t *this)
+{
+ int ret = 0;
+ shard_priv_t *priv = this->private;
+ shard_unlink_thread_t *ti = &priv->thread_info;
+
+ ti->this = this;
+
+ pthread_mutex_lock(&ti->mutex);
+ {
+ if (ti->running) {
+ pthread_cond_signal(&ti->cond);
+ } else {
+ ret = gf_thread_create(&ti->thread, NULL, shard_unlink_handler, ti,
+ "shard_unlink");
+ if (ret < 0) {
+ gf_log(this->name, GF_LOG_ERROR,
+ "Failed to create \"shard_unlink\" thread");
+ goto unlock;
+ }
+ ti->running = _gf_true;
+ }
+
+ ti->rerun = _gf_true;
+ }
+unlock:
+ pthread_mutex_unlock(&ti->mutex);
+ return ret;
+}
+
+static int
+shard_unlink_handler_init(shard_unlink_thread_t *ti)
+{
+ int ret = 0;
+ xlator_t *this = THIS;
+
+ ret = pthread_mutex_init(&ti->mutex, NULL);
+ if (ret) {
+ gf_log(this->name, GF_LOG_ERROR,
+ "Failed to init mutex for \"shard_unlink\" thread");
+ goto out;
+ }
+
+ ret = pthread_cond_init(&ti->cond, NULL);
+ if (ret) {
+ gf_log(this->name, GF_LOG_ERROR,
+ "Failed to init cond var for \"shard_unlink\" thread");
+ pthread_mutex_destroy(&ti->mutex);
+ goto out;
+ }
+
+ ti->running = _gf_false;
+ ti->rerun = _gf_false;
+ ti->stop = _gf_false;
+
+out:
+ return -ret;
+}
+
+static void
+shard_unlink_handler_fini(shard_unlink_thread_t *ti)
+{
+ int ret = 0;
+ xlator_t *this = THIS;
+ if (!ti)
+ return;
+
+ pthread_mutex_lock(&ti->mutex);
+ if (ti->running) {
+ ti->rerun = _gf_true;
+ ti->stop = _gf_true;
+ pthread_cond_signal(&ti->cond);
+ }
+ pthread_mutex_unlock(&ti->mutex);
+
+ if (ti->running) {
+ ret = pthread_join(ti->thread, NULL);
+ if (ret)
+ gf_msg(this->name, GF_LOG_WARNING, 0, 0,
+ "Failed to clean up shard unlink thread.");
+ ti->running = _gf_false;
+ }
+ ti->thread = 0;
+
+ pthread_cond_destroy(&ti->cond);
+ pthread_mutex_destroy(&ti->mutex);
+}
+
int32_t
mem_acct_init(xlator_t *this)
{
@@ -7170,6 +7345,14 @@ init(xlator_t *this)
this->private = priv;
LOCK_INIT(&priv->lock);
INIT_LIST_HEAD(&priv->ilist_head);
+
+ ret = shard_unlink_handler_init(&priv->thread_info);
+ if (ret) {
+ gf_log(this->name, GF_LOG_ERROR,
+ "Failed to initialize resources for \"shard_unlink\" thread");
+ goto out;
+ }
+
ret = 0;
out:
if (ret) {
@@ -7197,6 +7380,8 @@ fini(xlator_t *this)
if (!priv)
goto out;
+ shard_unlink_handler_fini(&priv->thread_info);
+
this->private = NULL;
LOCK_DESTROY(&priv->lock);
GF_FREE(priv);
diff --git a/xlators/features/shard/src/shard.h b/xlators/features/shard/src/shard.h
index 4fe181b64d..3dcb112bf7 100644
--- a/xlators/features/shard/src/shard.h
+++ b/xlators/features/shard/src/shard.h
@@ -207,6 +207,16 @@ typedef enum {
/* rm = "remove me" */
+typedef struct shard_unlink_thread {
+ pthread_mutex_t mutex;
+ pthread_cond_t cond;
+ pthread_t thread;
+ gf_boolean_t running;
+ gf_boolean_t rerun;
+ gf_boolean_t stop;
+ xlator_t *this;
+} shard_unlink_thread_t;
+
typedef struct shard_priv {
uint64_t block_size;
uuid_t dot_shard_gfid;
@@ -220,6 +230,7 @@ typedef struct shard_priv {
shard_bg_deletion_state_t bg_del_state;
gf_boolean_t first_lookup_done;
uint64_t lru_limit;
+ shard_unlink_thread_t thread_info;
} shard_priv_t;
typedef struct {
diff --git a/xlators/storage/posix/src/posix-entry-ops.c b/xlators/storage/posix/src/posix-entry-ops.c
index 8cc3ccf8c0..f0f34c1f4a 100644
--- a/xlators/storage/posix/src/posix-entry-ops.c
+++ b/xlators/storage/posix/src/posix-entry-ops.c
@@ -177,6 +177,11 @@ posix_lookup(call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xdata)
posix_inode_ctx_t *ctx = NULL;
int ret = 0;
int dfd = -1;
+ uint32_t lookup_unlink_dir = 0;
+ char *unlink_path = NULL;
+ struct stat lstatbuf = {
+ 0,
+ };
VALIDATE_OR_GOTO(frame, out);
VALIDATE_OR_GOTO(this, out);
@@ -215,7 +220,36 @@ posix_lookup(call_frame_t *frame, xlator_t *this, loc_t *loc, dict_t *xdata)
op_ret = -1;
if (gf_uuid_is_null(loc->pargfid) || (loc->name == NULL)) {
/* nameless lookup */
+ op_ret = op_errno = errno = 0;
MAKE_INODE_HANDLE(real_path, this, loc, &buf);
+
+ /* The gfid is renamed into ".glusterfs/unlink" in the posix_unlink
+ * path when there are open fds on the file. A client can therefore
+ * ask the server to do a nameless lookup with
+ * xdata = GF_UNLINKED_LOOKUP in the ".glusterfs/unlink" dir to learn
+ * the status of all open fds on the unlinked file. If the file is
+ * still present in the ".glusterfs/unlink" dir, there are still open
+ * fds on it and the file is still going through the unlink process. */
+ if (op_ret < 0 && errno == ENOENT) {
+ ret = dict_get_uint32(xdata, GF_UNLINKED_LOOKUP,
+ &lookup_unlink_dir);
+ if (!ret && lookup_unlink_dir) {
+ op_ret = op_errno = errno = 0;
+ POSIX_GET_FILE_UNLINK_PATH(priv->base_path, loc->gfid,
+ unlink_path);
+ ret = sys_lstat(unlink_path, &lstatbuf);
+ if (ret) {
+ op_ret = -1;
+ op_errno = errno;
+ } else {
+ iatt_from_stat(&buf, &lstatbuf);
+ buf.ia_nlink = 0;
+ }
+ goto nameless_lookup_unlink_dir_out;
+ }
+ }
} else {
MAKE_ENTRY_HANDLE(real_path, par_path, this, loc, &buf);
if (!real_path || !par_path) {
@@ -335,6 +369,8 @@ out:
if (op_ret == 0)
op_errno = 0;
+
+nameless_lookup_unlink_dir_out:
STACK_UNWIND_STRICT(lookup, frame, op_ret, op_errno,
(loc) ? loc->inode : NULL, &buf, xattr, &postparent);
diff --git a/xlators/storage/posix/src/posix-inode-fd-ops.c b/xlators/storage/posix/src/posix-inode-fd-ops.c
index e95ea7c0ed..3f51c49642 100644
--- a/xlators/storage/posix/src/posix-inode-fd-ops.c
+++ b/xlators/storage/posix/src/posix-inode-fd-ops.c
@@ -2524,6 +2524,39 @@ out:
return 0;
}
+static int
+posix_unlink_renamed_file(xlator_t *this, inode_t *inode)
+{
+ int ret = 0;
+ char *unlink_path = NULL;
+ uint64_t ctx_uint = 0;
+ posix_inode_ctx_t *ctx = NULL;
+ struct posix_private *priv = this->private;
+
+ ret = inode_ctx_get(inode, this, &ctx_uint);
+
+ if (ret < 0)
+ goto out;
+
+ ctx = (posix_inode_ctx_t *)(uintptr_t)ctx_uint;
+
+ if (ctx->unlink_flag == GF_UNLINK_TRUE) {
+ POSIX_GET_FILE_UNLINK_PATH(priv->base_path, inode->gfid, unlink_path);
+ if (!unlink_path) {
+ gf_msg(this->name, GF_LOG_ERROR, ENOMEM, P_MSG_UNLINK_FAILED,
+ "Failed to remove gfid :%s", uuid_utoa(inode->gfid));
+ ret = -1;
+ } else {
+ ret = sys_unlink(unlink_path);
+ if (!ret)
+ ctx->unlink_flag = GF_UNLINK_FALSE;
+ }
+ }
+
+out:
+ return ret;
+}
+
int32_t
posix_release(xlator_t *this, fd_t *fd)
{
@@ -2534,6 +2567,9 @@ posix_release(xlator_t *this, fd_t *fd)
VALIDATE_OR_GOTO(this, out);
VALIDATE_OR_GOTO(fd, out);
+ if (fd->inode->active_fd_count == 0)
+ posix_unlink_renamed_file(this, fd->inode);
+
ret = fd_ctx_del(fd, this, &tmp_pfd);
if (ret < 0) {
gf_msg(this->name, GF_LOG_WARNING, 0, P_MSG_PFD_NULL,
@@ -5979,41 +6015,33 @@ posix_forget(xlator_t *this, inode_t *inode)
uint64_t ctx_uint1 = 0;
uint64_t ctx_uint2 = 0;
posix_inode_ctx_t *ctx = NULL;
- posix_mdata_t *mdata = NULL;
- struct posix_private *priv_posix = NULL;
-
- priv_posix = (struct posix_private *)this->private;
- if (!priv_posix)
- return 0;
+ struct posix_private *priv = this->private;
ret = inode_ctx_del2(inode, this, &ctx_uint1, &ctx_uint2);
+
+ if (ctx_uint2)
+ GF_FREE((posix_mdata_t *)(uintptr_t)ctx_uint2);
+
if (!ctx_uint1)
- goto check_ctx2;
+ return 0;
ctx = (posix_inode_ctx_t *)(uintptr_t)ctx_uint1;
if (ctx->unlink_flag == GF_UNLINK_TRUE) {
- POSIX_GET_FILE_UNLINK_PATH(priv_posix->base_path, inode->gfid,
- unlink_path);
+ POSIX_GET_FILE_UNLINK_PATH(priv->base_path, inode->gfid, unlink_path);
if (!unlink_path) {
gf_msg(this->name, GF_LOG_ERROR, ENOMEM, P_MSG_UNLINK_FAILED,
"Failed to remove gfid :%s", uuid_utoa(inode->gfid));
ret = -1;
- goto ctx_free;
+ } else {
+ ret = sys_unlink(unlink_path);
}
- ret = sys_unlink(unlink_path);
}
-ctx_free:
+
pthread_mutex_destroy(&ctx->xattrop_lock);
pthread_mutex_destroy(&ctx->write_atomic_lock);
pthread_mutex_destroy(&ctx->pgfid_lock);
GF_FREE(ctx);
-check_ctx2:
- if (ctx_uint2) {
- mdata = (posix_mdata_t *)(uintptr_t)ctx_uint2;
- }
-
- GF_FREE(mdata);
return ret;
}