glusterfs.git/libglusterfs/src/syncop.c, branch v3.6.3beta1

synctask: add backtrace per waiting task

2014-09-26T10:20:07+00:00

The backtrace is 'saved' in a per-task buffer.
This would come in handy while debugging code using
synctasks.

Change-Id: I732b275f6d15b31f31361f5ecf2ba47cacde9b54
BUG: 1145093
Signed-off-by: Krishnan Parthasarathi 
Reviewed-on: http://review.gluster.org/8795
Tested-by: Gluster Build System 
Reviewed-by: Niels de Vos 
Reviewed-by: Vijay Bellur

syncop: Invoke dict_unref() in inodelk only if dictionary is not NULL

2014-09-20T08:54:55+00:00

In the absence of this check, logs can get flooded with messages like this when rebalance is run:

[2014-09-04 17:48:07.148262] W [dict.c:480:dict_unref] (-->/lib64/libc.so.6()
[0x30daa47a00] (-->/usr/local/lib/libglusterfs.so.0(synctask_wrap+0x12)
[0x7fa20b7c6ec2]
(-->/usr/local/lib/glusterfs/3.7dev/xlator/cluster/distribute.so(dht_migrate_file+0x23f)
[0x7fa200fdb58f]))) 0-dict: dict is NULL

Change-Id: I4c93e4485293b35d86ba07df4d583d2758ec3f49
BUG: 1138395
Signed-off-by: Vijay Bellur 
Reviewed-on-master: http://review.gluster.org/8601
Tested-by: Gluster Build System 
Reviewed-by: Shyamsundar Ranganathan 
Reviewed-on: http://review.gluster.org/8782

libglusterfs/syncop: implement inodelk

2014-09-12T08:10:43+00:00

Change-Id: Iea489157490b70cb2bb03576b0d4943c6d8f052d
BUG: 1138395
Signed-off-by: Raghavendra G 
Reviewed-on-master: http://review.gluster.org/8522
Reviewed-by: Pranith Kumar Karampuri 
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur 
Reviewed-on: http://review.gluster.org/8610

user serviceable snapshots

2014-05-29T16:25:46+00:00

Change-Id: Idbf27dbe088e646a8ab81cedc5818413795895ea
Signed-off-by: Raghavendra Bhat 
Signed-off-by: Anand Subramanian 
Signed-off-by: Raghavendra Bhat 
Reviewed-on: http://review.gluster.org/7700
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur

syncops: add support for custom PID

2014-02-13T19:19:33+00:00

AFR self-heal needs to issue syncops with a special PID. Extend
the custom UID/GID support to include custom PIDs.

Change-Id: I736c0e177f862b029f203acc87f9eb46c8cb839b
BUG: 1021686
Signed-off-by: Anand Avati 
Reviewed-on: http://review.gluster.org/6888
Tested-by: Gluster Build System 
Reviewed-by: Raghavendra Bhat

core: add @xdata parameter to syncop_[f]removexattr()

2014-02-13T19:17:05+00:00

To be used in afr metadata self-heal

Change-Id: I8dac4b19d61e331702427eeb5b606aab3d20b328
BUG: 1021686
Signed-off-by: Anand Avati 
Reviewed-on: http://review.gluster.org/6941
Tested-by: Gluster Build System 
Reviewed-by: Raghavendra Bhat

syncop: Change return value of syncop

2014-01-20T07:05:15+00:00

Problem:
We found a day-1 bug when syncop_xxx() infra is used inside a synctask with
compilation optimization (CFLAGS -O2).

Detailed explanation of the Root cause:
We found the bug in 'gf_defrag_migrate_data' in rebalance operation:

Let's look at the interesting parts of the function:

int
gf_defrag_migrate_data (xlator_t *this, gf_defrag_info_t *defrag, loc_t *loc,
                        dict_t *migrate_data)
{
.....
code section - [ Loop ]
        while ((ret = syncop_readdirp (this, fd, 131072, offset, NULL,
                                       &entries)) != 0) {
.....
code section - [ ERRNO-1 ] (errno of readdirp is stored in readdir_operrno by a
thread)
                /* Need to keep track of ENOENT errno, that means, there is no
                   need to send more readdirp() */
                readdir_operrno = errno;
.....
code section - [ SYNCOP-1 ] (syncop_getxattr is called by a thread)
                        ret = syncop_getxattr (this, &entry_loc, &dict,
                                               GF_XATTR_LINKINFO_KEY);
code section - [ ERRNO-2]   (checking for failures of syncop_getxattr(). This
may not always be executed in same thread which executed [SYNCOP-1])
                        if (ret < 0) {
                                if (errno != ENODATA) {
                                        loglevel = GF_LOG_ERROR;
                                        defrag->total_failures += 1;
.....
}

The function above can be executed by thread t1 up to [SYNCOP-1], and the code
from [ERRNO-2] onward can be executed by a different thread t2, because of the
way the syncop infra schedules tasks.

When the code is compiled with -O2 optimization, this is the assembly code that
is generated:
 [ERRNO-1]
1165                        readdir_operrno = errno; <<---- errno gets expanded
as *(__errno_location())
   0x00007fd149d48b60 <+496>:        callq  0x7fd149d410c0 
   0x00007fd149d48b72 <+514>:        mov    %rax,0x50(%rsp) <<------ Address
returned by __errno_location() is stored in a special location in stack for
later use.
   0x00007fd149d48b77 <+519>:        mov    (%rax),%eax
   0x00007fd149d48b79 <+521>:        mov    %eax,0x78(%rsp)
....
 [ERRNO-2]
1281                                        if (errno != ENODATA) {
   0x00007fd149d492ae <+2366>:        mov    0x50(%rsp),%rax <<-----  Because
it already stored the address returned by __errno_location(), it just
dereferences the address to get the errno value. BUT THIS CODE NEED NOT BE
EXECUTED BY SAME THREAD!!!
   0x00007fd149d492b3 <+2371>:        mov    $0x9,%ebp
   0x00007fd149d492b8 <+2376>:        mov    (%rax),%edi
   0x00007fd149d492ba <+2378>:        cmp    $0x3d,%edi

The problem is that __errno_location() returns different addresses for t1 and
t2. So [ERRNO-2] ends up reading the errno of t1 instead of the errno of t2,
even though t2 is executing the [ERRNO-2] code section.

When the code is compiled without any optimization, [ERRNO-2] becomes:
1281                                        if (errno != ENODATA) {
   0x00007fd58e7a326f <+2237>:        callq  0x7fd58e797300
<<--- As it is calling __errno_location() again it gets the
location from t2 so it works as intended.
   0x00007fd58e7a3274 <+2242>:        mov    (%rax),%eax
   0x00007fd58e7a3276 <+2244>:        cmp    $0x3d,%eax
   0x00007fd58e7a3279 <+2247>:        je     0x7fd58e7a32a1


Fix:
Make syncop_xxx() return (-errno) as its return value in case of errors. All
functions that call syncop_xxx() will need to use (-ret) to figure out the
reason for a syncop_xxx() failure.

Change-Id: I314d20dabe55d3e62ff66f3b4adb1cac2eaebb57
BUG: 1040356
Signed-off-by: Pranith Kumar K 
Reviewed-on: http://review.gluster.org/6475
Tested-by: Gluster Build System 
Reviewed-by: Anand Avati

syncops: expose @flags in syncop_rmdir()

2013-11-21T21:08:32+00:00

Change-Id: I9b73c1db728e4cb3948fc118cceb292b21d48b96
BUG: 1021686
Signed-off-by: Anand Avati 
Reviewed-on: http://review.gluster.org/6112
Reviewed-by: Amar Tumballi 
Tested-by: Gluster Build System

zerofill: Change the type of len argument of glfs_zerofill() to off_t

2013-11-15T07:29:48+00:00

glfs_zerofill() can potentially be called to zero out an entire file, so
allow a bigger value for the length parameter.

Change-Id: I75f1d11af298915049a3f3a7cb3890a2d72fca63
BUG: 1028673
Signed-off-by: Bharata B Rao 
Reviewed-on: http://review.gluster.org/6266
Tested-by: Gluster Build System 
Reviewed-by: M. Mohan Kumar 
Tested-by: M. Mohan Kumar 
Reviewed-by: Anand Avati

glusterfs: zerofill support

2013-11-11T05:25:49+00:00

Add support for a new ZEROFILL fop. Zerofill writes zeroes to a file in
the specified range. This fop will be useful when a whole file needs to
be initialized with zero (for example, when provisioning zero-filled VM
disk images or when scrubbing VM disk images).

A client/application can issue this FOP for zeroing out. The Gluster server
will zero out the required range of bytes, i.e. server-offloaded zeroing. In
the absence of this fop, the client/application has to repeatedly issue
write (zero) fops to the server, which is very inefficient because
of the overheads involved in RPC calls and acknowledgements.

WRITESAME is a SCSI T10 command that takes a block of data as input and
writes the same data to other blocks. This write is handled
completely within the storage and hence is known as offload. Linux now
has support for the SCSI WRITESAME command, which is exposed to the user in
the form of the BLKZEROOUT ioctl. BD Xlator can exploit the BLKZEROOUT ioctl
to implement this fop. Thus zeroing out operations can be completely
offloaded to the storage device, making them highly efficient.

The fop takes two arguments, offset and size. It zeroes out 'size' number
of bytes in an opened file, starting from the 'offset' position.

This patch adds zerofill support to the following areas:
	- libglusterfs
	- io-stats
	- performance/md-cache,open-behind
	- quota
	- cluster/afr,dht,stripe
	- rpc/xdr
	- protocol/client,server
	- io-threads
	- marker
	- storage/posix
	- libgfapi

Client applications can exploit this fop by using glfs_zerofill, introduced in
libgfapi. FUSE support for this fop has not been added, as there is no system
call for this fop.

Changes from previous version 3:
* Removed redundant memory failure log messages

Changes from previous version 2:
* Rebased and fixed build error

Changes from previous version 1:
* Rebased for latest master

TODO :
     * Add zerofill support to trace xlator
     * Expose zerofill capability as part of gluster volume info

Here is a performance comparison of server-offloaded zerofill vs. zeroing
out using repeated writes.

[root@llmvm02 remote]# time ./offloaded aakash-test log 20

real	3m34.155s
user	0m0.018s
sys	0m0.040s
[root@llmvm02 remote]# time ./manually aakash-test log 20

real	4m23.043s
user	0m2.197s
sys	0m14.457s
[root@llmvm02 remote]# time ./offloaded aakash-test log 25;

real	4m28.363s
user	0m0.021s
sys	0m0.025s
[root@llmvm02 remote]# time ./manually aakash-test log 25

real	5m34.278s
user	0m2.957s
sys	0m18.808s

The argument log is the file to use for logging, and the third argument
is the size in GB.

As we can see, wall-clock time drops by around 19-20% with this fop
(about 214 s vs 263 s for 20 GB, and 268 s vs 334 s for 25 GB).

Change-Id: I081159f5f7edde0ddb78169fb4c21c776ec91a18
BUG: 1028673
Signed-off-by: Aakash Lal Das 
Signed-off-by: M. Mohan Kumar 
Reviewed-on: http://review.gluster.org/5327
Tested-by: Gluster Build System 
Reviewed-by: Vijay Bellur