path: root/ctdb
Commit message | Author | Age | Files | Lines
...
* eventscripts: Load CTDB configuration settings in 70.iscsi | Amitay Isaacs | 2013-09-23 | 1 | -0/+2
    Signed-off-by: Amitay Isaacs <amitay@gmail.com>
    (This used to be ctdb commit ff41ce5ef202f8f6342e285d195bb5df61d848ce)
* recoverd: Disable takeover runs on other nodes for 5 minutes | Martin Schwenke | 2013-09-19 | 1 | -2/+2
    60 seconds might not be long enough to kill all connections and release IPs.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 71080676bb4acbd0d9b595a30cf7fe6dddbf426f)
* recoverd: Improve logging for takeover runs | Martin Schwenke | 2013-09-19 | 1 | -1/+5
    Takeover runs are currently silent when they succeed. However, they are important, so log something by default.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit b39aa2e401fbb581207d986bac93778e9c01acdc)
* tools/ctdb: Use the standard long timeout when disabling takeover runs | Martin Schwenke | 2013-09-19 | 1 | -2/+4
    This means that takeover runs will be disabled for about as long as the reloadips control can take to complete.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 6d44657a5e5b0df22bab2d487a503dd1c5ba79b4)
* tools/ctdb: Fix arguments/semantics of rebalance node | Martin Schwenke | 2013-09-19 | 1 | -6/+20
    There's no reason why specifying a node should be compulsory. This is a cluster-wide operation because it is implemented by the recovery master so multiple nodes should not be specified using -n. However, the command should be able to specify multiple nodes so let it have its own nodestring argument.

    This change should be backward compatible with the old requirement of specifying a single node via -n.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 0846c00597adb66bba8c9dbf63443d0c2f91a7d1)
* tools/ctdb: Make rebalancenode more robust | Martin Schwenke | 2013-09-19 | 1 | -8/+4
    Use a broadcast instead of trying to win the race of determining the recovery master and then sending the message before the recovery master changes.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit ac946ee4ad01b1e5cd1006930b9f8a190a0a58ba)
* tests/simple: Fix the reloadips test to cope with changes to reloadips | Martin Schwenke | 2013-09-19 | 1 | -3/+3
    Specifying nodes to reload no longer uses -n.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit d921b2756d5f1c4ad7a35fe120f6fda9f5bf5686)
* recoverd: Be careful about freeing the list of IP rebalance target nodes | Martin Schwenke | 2013-09-19 | 1 | -1/+7
    It can change during a takeover run. If it does then don't free it.

    There are potentially fancier solutions (e.g. check what PNNs are new to the list) to this issue but this is the simplest.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit e81589b7084c661adf617e166cc2c25b4939f841)
* recoverd: reloadips should rebalance target nodes for new IPs | Martin Schwenke | 2013-09-19 | 1 | -0/+20
    Otherwise, if existing IPs are added to extra nodes (that have, perhaps, been disconnected) then those IPs will not be rebalanced across the extra nodes.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit ceb30432a9a550778aed0b422a654fc5287b82a3)
* ctdbd: Make ctdb_reloadips_child send controls asynchronously | Martin Schwenke | 2013-09-19 | 1 | -42/+90
    Deleting IPs can take a while because IPs are released and connections are killed, so send the deletes in parallel. In fact, since the set of IPs being added and deleted will be disjoint, send all the adds/deletes at the same time and then wait.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 85a5b544ec032173e98c9cc3b5402a76b961aa3b)
* recoverd: Fix the implementation of CTDB_SRVID_REBALANCE_NODE | Martin Schwenke | 2013-09-19 | 4 | -66/+91
    The current implementation has a few flaws:

    * A takeover run is called unconditionally when the timer goes off even if the recovery master role has moved. This means a node other than the recovery master can incorrectly do a takeover run.

    * The rebalancing target nodes are cleared in the setup for a takeover run, regardless of whether the takeover run succeeds.

    * The timer to force a rebalance isn't cleared if another takeover run occurs before the deadline. Any forced rebalancing will happen in the first takeover run and when the timer expires some time later then an unnecessary takeover run will occur.

    * If the recovery master role moves then the rebalancing data will stay on the original node and affect the next takeover run to occur if the recovery master role should come back to the original node.

    Instead, store an array of rebalance target nodes in the recovery master context. This is passed as an extra argument to ctdb_takeover_run() each time it is called and is cleared when a takeover run succeeds. The timer hangs off the array of rebalance target nodes, which is cleared if the node isn't the recovery master.

    This means that it is possible to lose rebalance data if the recovery master role moves. However, that's a difficult problem to solve. The best way of approaching it is probably to try to stop the recovery master role from jumping around unnecessarily when inactive nodes join the cluster.

    The long term solution is to avoid this nonsense completely. The IP allocation algorithm needs to cache state between runs so that it knows which nodes have just become healthy. This also needs recovery master stability.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit c51c1efe5fc7fa668597f2acd435dee16e410fc9)
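The lifetime rules described in this message (targets kept until a takeover run succeeds, dropped when the node is not the recovery master) can be sketched roughly as below. This is only an illustration under stated assumptions: `recmaster_sketch`, `takeover_run_fn`, `add_rebalance_target()` and `takeover_with_targets()` are hypothetical names, the fixed-size array stands in for whatever allocation the real code uses, and the timer is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TARGETS 16   /* arbitrary limit for this sketch */

/* Hypothetical slice of the recovery master context. */
struct recmaster_sketch {
    uint32_t targets[MAX_TARGETS];   /* PNNs to force rebalancing onto */
    size_t num_targets;
};

/* Stand-in for ctdb_takeover_run(); the real function takes other arguments. */
typedef bool (*takeover_run_fn)(const uint32_t *targets, size_t num_targets);

/* Remember a node as a rebalance target; false if the sketch array is full. */
bool add_rebalance_target(struct recmaster_sketch *rec, uint32_t pnn)
{
    if (rec->num_targets == MAX_TARGETS) {
        return false;
    }
    rec->targets[rec->num_targets++] = pnn;
    return true;
}

/* Pass the targets to the takeover run as an extra argument. */
bool takeover_with_targets(struct recmaster_sketch *rec, bool is_recmaster,
                           takeover_run_fn run)
{
    if (!is_recmaster) {
        /* Not the recovery master: drop the data rather than keep it stale. */
        rec->num_targets = 0;
        return false;
    }
    if (!run(rec->targets, rec->num_targets)) {
        /* Failed run: keep the targets so the retry still rebalances. */
        return false;
    }
    rec->num_targets = 0;   /* cleared only on success */
    return true;
}

/* Trivial stand-ins used to exercise the sketch. */
bool run_succeeds(const uint32_t *targets, size_t num_targets)
{
    (void)targets;
    (void)num_targets;
    return true;
}

bool run_fails(const uint32_t *targets, size_t num_targets)
{
    (void)targets;
    (void)num_targets;
    return false;
}
```

Clearing on the non-recovery-master path is what makes the data loss mentioned in the message possible: if the role moves away and back, the targets are gone.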
* recoverd: Remove unused CTDB_SRVID_RELOAD_ALL_IPS and handler | Martin Schwenke | 2013-09-19 | 2 | -93/+0
    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 4cd727439a0824ebb8dbcf737d9888ffc3c41184)
* tools/ctdb: Reimplement reloadips | Martin Schwenke | 2013-09-19 | 1 | -70/+40
    This implementation disables takeover runs on all nodes before trying to reload IPs. It also takes "all" or the list of PNNs as an argument to the command instead of to -n. -n can still be specified with a single node indicating that node should be considered the current node - that might be confusing so could be removed.

    This implementation does not use CTDB_SRVID_RELOAD_ALL_IPS, so it can be removed.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit d66a072d9b120c78c47e726e9f29a3c1cfdd87ce)
* recoverd: Defer ipreallocated requests when takeover runs are disabled | Martin Schwenke | 2013-09-19 | 1 | -1/+2
    The takeover run will fail anyway but deferring seems like a cleaner option.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 428f800bcdf3dbfe19de8bb36099fbf01ebeaab4)
* recoverd: Reimplement CTDB_SRVID_DISABLE_IP_CHECK | Martin Schwenke | 2013-09-19 | 1 | -51/+37
    Use disable_takeover_runs_handler() instead of maintaining duplicate logic.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 0a51a85915486b2a8fded7ba6444b18c6c1ee8e8)
* recoverd: New SRVID message CTDB_SRVID_DISABLE_TAKEOVER_RUNS | Martin Schwenke | 2013-09-19 | 2 | -25/+153
    This implements a superset of CTDB_SRVID_DISABLE_IP_CHECK. It stops the IP checks but also causes any attempted takeover runs to fail and be rescheduled. This is meant to completely stop IP movements.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 00db4de53a0d86013e79e6577e7e6cf3ef864e56)
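A minimal sketch of the "fail and be rescheduled" behaviour, assuming the disable state is a simple deadline and that a timeout of 0 re-enables takeover runs. Both are assumptions made for illustration, and `takeover_gate`, `gate_disable()`, `gate_is_disabled()` and `try_takeover_run()` are hypothetical names rather than functions from the ctdb source.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical state: takeover runs are disabled until this time (0 = enabled). */
struct takeover_gate {
    time_t disabled_until;
};

/* Disable takeover runs for `timeout` seconds; 0 re-enables them (assumed). */
void gate_disable(struct takeover_gate *g, time_t now, unsigned int timeout)
{
    g->disabled_until = (timeout == 0) ? 0 : now + (time_t)timeout;
}

bool gate_is_disabled(const struct takeover_gate *g, time_t now)
{
    return g->disabled_until != 0 && now < g->disabled_until;
}

/*
 * Attempt a takeover run. While disabled, the attempt fails and the
 * need_takeover_run flag is left set so that a later pass retries it.
 */
bool try_takeover_run(const struct takeover_gate *g, time_t now,
                      bool *need_takeover_run)
{
    if (gate_is_disabled(g, now)) {
        *need_takeover_run = true;   /* reschedule */
        return false;
    }
    /* The actual IP reallocation would happen here. */
    *need_takeover_run = false;
    return true;
}
```

Using a deadline rather than a plain flag means a crashed or forgetful caller cannot leave IP movement disabled indefinitely.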
* tools/ctdb: Add a wait_for_all option to srvid_broadcast() | Martin Schwenke | 2013-09-19 | 1 | -13/+82
    This will be useful for other SRVIDs.

    The error checking in the handler depends on the SRVID responding with a uint32_t where <0 indicates an error and >=0 is a PNN that succeeded.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 52050e1c75b21961dafe2bc410268b44240ab24e)
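The reply convention can be illustrated with the sketch below, which records one reply per node and reports when every expected PNN has answered. The names `broadcast_wait`, `handle_reply()` and `all_replied()` are hypothetical. The sketch treats the result as a signed `int32_t`, since an unsigned value can never compare below zero; that choice is an assumption about the intent of the message, not a statement about the committed code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 8   /* arbitrary size for the sketch */

/* Hypothetical bookkeeping for a broadcast that waits for all nodes. */
struct broadcast_wait {
    uint32_t pnns[MAX_NODES];     /* nodes we expect a reply from */
    bool replied[MAX_NODES];      /* replied[i] is set once pnns[i] answers */
    size_t count;
    bool failed;                  /* set if any node reported an error */
};

/* Record one reply: negative means error, otherwise it is the replying PNN. */
void handle_reply(struct broadcast_wait *w, int32_t result)
{
    size_t i;

    if (result < 0) {
        w->failed = true;
        return;
    }
    for (i = 0; i < w->count; i++) {
        if (w->pnns[i] == (uint32_t)result) {
            w->replied[i] = true;
            return;
        }
    }
    /* A reply from a node we were not waiting for is ignored here. */
}

/* True once every expected node has sent a successful reply. */
bool all_replied(const struct broadcast_wait *w)
{
    size_t i;

    for (i = 0; i < w->count; i++) {
        if (!w->replied[i]) {
            return false;
        }
    }
    return true;
}
```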
* tools/ctdb: Factor out SRVID broadcast code from ipreallocate() | Martin Schwenke | 2013-09-19 | 1 | -26/+40
    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit a566fb5e70282c4e9f76654b1be4dc80829dced0)
* tools/ctdb: Change ipreallocate() to use a local done flag | Martin Schwenke | 2013-09-19 | 1 | -6/+10
    Instead of the current global variable. This is in anticipation of abstracting the code.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit c58ee0eddf7ae3283e3ca8bd25575e6e677e1b17)
* recoverd: Factor out the SRVID handling code | Martin Schwenke | 2013-09-19 | 1 | -54/+99
    The code that handles IP reallocate requests can be reused. This also changes the result back to a SRVID caller to the PNN on success or a negative error code on failure. None of the callers currently look at the result so this is harmless... but it will be useful later.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit e4eae6e3291baa299a1d0f733ab11b138ee699a3)
* recoverd: Make the SRVID request structure generic | Martin Schwenke | 2013-09-19 | 4 | -22/+16
    No need for a separate one for each SRVID.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit d9c22b04d5aa7938a3965bd3144568664eb772ce)
* recoverd: Move disabling of IP checks into do_takeover_run() | Martin Schwenke | 2013-09-19 | 2 | -14/+26
    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 48b603fbf16311daa47b01e7a33d477ed51da56d)
* recoverd: do_takeover_run() should mark when a takeover run is in progress | Martin Schwenke | 2013-09-19 | 1 | -0/+13
    Nested takeover runs should never happen so they should fail.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 8ed29c60c0a7dd29f2a6efdf694d38e94281e1c4)
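The in-progress marker amounts to a re-entrancy guard. The sketch below shows the idea under stated assumptions: `recoverd_sketch`, `takeover_work_fn` and `do_takeover_run_sketch()` are hypothetical stand-ins, not the real `struct ctdb_recoverd` or `do_takeover_run()`, and the flag handling is simplified.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, cut-down recovery daemon state. */
struct recoverd_sketch {
    bool takeover_run_in_progress;
    bool need_takeover_run;
};

/* Stand-in for the real IP reallocation work; returns true on success. */
typedef bool (*takeover_work_fn)(struct recoverd_sketch *rec);

/* Run one takeover, refusing to nest. Returns true only if the run succeeded. */
bool do_takeover_run_sketch(struct recoverd_sketch *rec, takeover_work_fn work)
{
    bool ok;

    if (rec->takeover_run_in_progress) {
        fprintf(stderr, "takeover run already in progress\n");
        return false;   /* the outer run is left to record the failure */
    }

    rec->takeover_run_in_progress = true;
    ok = work(rec);
    rec->takeover_run_in_progress = false;

    /* Set on every failure, cleared on every success. */
    rec->need_takeover_run = !ok;
    return ok;
}

/* Work that simply succeeds. */
bool work_ok(struct recoverd_sketch *rec)
{
    (void)rec;
    return true;
}

/* Work that (incorrectly) tries to start a nested takeover run. */
bool work_nested(struct recoverd_sketch *rec)
{
    return do_takeover_run_sketch(rec, work_ok);
}
```

Because the nested attempt returns false, the outer run also fails and leaves `need_takeover_run` set, so the work is retried later instead of being silently skipped.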
* recoverd: takeover_fail_callback() doesn't need to set rec->need_takeover_run | Martin Schwenke | 2013-09-19 | 1 | -1/+0
    It is set on every failure anyway.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit e5f94c7857405bdeac233069003c3769b3dc3616)
* recoverd: Fail takeover run if "ipreallocated" fails | Martin Schwenke | 2013-09-19 | 1 | -14/+15
    Previously flagging a failure was probably avoided because of attempts to run "ipreallocated" events on stopped and banned nodes, which would fail because they are in recovery. Given the change to a new control and that fallback only retries the old method on active nodes, this should never fail in reasonable circumstances.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 53722430ad35f80935aabd12fa07654126443b8b)
* recoverd: New function do_takeover_run() | Martin Schwenke | 2013-09-19 | 1 | -21/+31
    Factor the calling sequence for ctdb_takeover_run() into a new function and call it instead.

    This changes rec->need_takeover_run to false for each successful takeover run and that seems to be the right thing to do.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 9a3f0c0e61ca5c17e020c6e0463d73c7cf4f7c09)
* recoverd: Stabilise the recovery master role | Martin Schwenke | 2013-09-19 | 1 | -0/+8
    On rare occasions a node that has been inactive will trigger an election when it becomes active again. If that node has been up for the longest then it will win the election and the recovery master role will spuriously move.

    While a node remains inactive we reset the priority time to discourage it from winning elections. The priority time will now reflect roughly how long the node has been active rather than how long it has been up. That means the most stable node is more likely to win elections.

    Having a stable recovery master means that disabling takeover runs while reloading IPs is more likely to succeed. It also improves the chances of being able to cache information in the recovery master - for example, between takeover runs.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit f0f48f22f45e4c82eba2582efae307e25385de81)
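The effect of resetting the priority time can be sketched as follows. This is a simplified illustration: `node_sketch`, `refresh_priority_time()` and `wins_election()` are hypothetical names, and the comparison looks only at the priority time, whereas a real election may weigh other criteria first.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-node election state. */
struct node_sketch {
    bool inactive;          /* e.g. stopped, banned or otherwise inactive */
    time_t priority_time;   /* earlier means higher priority */
};

/* Called periodically: an inactive node keeps pushing its priority time forward. */
void refresh_priority_time(struct node_sketch *n, time_t now)
{
    if (n->inactive) {
        n->priority_time = now;
    }
}

/* Node a beats node b if it has the earlier priority time. */
bool wins_election(const struct node_sketch *a, const struct node_sketch *b)
{
    return a->priority_time < b->priority_time;
}
```

With the reset in place, a node that booted first but spent most of its life inactive no longer outranks a node that has been active continuously.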
* recoverd: Banned nodes should not be told to run "ipreallocated" event | Martin Schwenke | 2013-09-18 | 1 | -3/+3
    They will reject it because they are in recovery. This can result in extra banning credits being applied to banned nodes.

    This corresponds to commit 9132e6814ed927fa317f333f03dedb18f75d0e5b from the 1.2.40 branch.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 403938804caf1322f9773d63197e4303a7b2a788)
* common: Make parse_ip() valgrind-clean | Martin Schwenke | 2013-09-11 | 1 | -0/+2
    Signed-off-by: Martin Schwenke <martin@meltin.net>
    Pair-programmed-with: Amitay Isaacs <amitay@gmail.com>
    (This used to be ctdb commit c0bb147ca09e82019b05ec22995623cffc3184e2)
* recoverd: Remove an orphaned comment | Martin Schwenke | 2013-09-11 | 1 | -4/+0
    This should have been removed with the associated code in commit 14bd0b6961ef1294e9cba74ce875386b7dfbf446.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 36de63843de10a1f2a9ccdbbee24cc1d08542984)
* recoverd: Update a comment to use current terminology | Martin Schwenke | 2013-09-11 | 1 | -3/+4
    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit ea5576071b22e1877903ec0921d375626a23e13b)
* client: Remove unused function list_of_active_nodes_except_pnn() | Martin Schwenke | 2013-09-11 | 2 | -12/+0
    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit d8a76cf79f07dfb5a93c6c9a13f16e3268c7dd57)
* tools/ctdb: list_of_active_nodes_except_pnn() -> list_of_nodes() | Martin Schwenke | 2013-09-11 | 1 | -1/+1
    list_of_active_nodes_except_pnn() is only used here and can be removed if we remove this call. Less is more...

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit d4e206fb818048b7fab4797c877b854bdbb1ab70)
* tools/ctdb: Fix a memory leak in parse_nodestring() | Martin Schwenke | 2013-09-11 | 1 | -2/+3
    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 8753a094b97340deb26dd44f6ea345ca0a642a95)
* tests/eventscripts: Tests for memory checking in 00.ctdb | Martin Schwenke | 2013-09-11 | 10 | -2/+166
    ... plus updates to test infrastructure to support.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 4a388fc6bf54636b7e1f6da8e6aa451cddd574f7)
* eventscripts: Clean up monitoring of system memory in 00.ctdb | Martin Schwenke | 2013-09-11 | 1 | -30/+41
    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 16fcff0d1993b7a0479341862ea44d10bd5c6d6d)
* server: standardize formatting of comment block for ctdb_reply_dmaster() | Michael Adam | 2013-08-26 | 1 | -7/+7
    while I'm at it..

    This was the comment block I was touching and meant to adapt in commit 00d3bf092e2f72eda330978c75ec85f17e870553. My search was apparently not unique...

    Signed-off-by: Michael Adam <obnox@samba.org>
    (This used to be ctdb commit 09940255011b119dc6af3304f5d3e9568e6006fd)
* doc: Update NEWS | Martin Schwenke | 2013-08-22 | 1 | -0/+78
    Signed-off-by: Martin Schwenke <martin@meltin.net>
    Signed-off-by: Amitay Isaacs <amitay@gmail.com>
    (This used to be ctdb commit c446579fc442955ecc74f5566eaa0635c3171498)
* build: Fix build dependencies for ctdb_lock_tdb | Amitay Isaacs | 2013-08-22 | 1 | -1/+1
    Signed-off-by: Amitay Isaacs <amitay@gmail.com>
    (This used to be ctdb commit eb8575718400c45626cd1b2e0fd247bc3ebff655)
* tests/simple: Minimise the chance of a monitor event being cancelled | Martin Schwenke | 2013-08-22 | 1 | -0/+4
    A monitor event following a "ctdb delip" might reconfigure services. If the monitor event is cancelled then a service might be stopped but not yet restarted and this could result in the subsequent monitor events failing.

    This obviously needs to be fixed in CTDB itself. This will happen by making "ctdb reloadips" the supported way of reconfiguring IPs.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 618ea3660e36e7bd92b686e1ca8728cf63c3c068)
* packaging: Remove pushd/popd from maketarball.sh, don't need bash | Martin Schwenke | 2013-08-22 | 1 | -45/+34
    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 3ffca990a18cbd31c8bd3ae01c6671d60da58f58)
* tools/ctdb_diagnostics: Add output of "ctdb getdbmap" | Martin Schwenke | 2013-08-22 | 1 | -0/+1
    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit f0d69a9079b7aecc68f1d2d8510702046b618b19)
* tools/ctdb_diagnostics: Safer temporary file creation | Martin Schwenke | 2013-08-22 | 1 | -3/+10
    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 406e1cb1fdd17ddd239774d0228e3657b73ae68f)
* eventscripts: Avoid using a temporary file in 62.cnfs | Martin Schwenke | 2013-08-22 | 1 | -4/+1
    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 81833052d7ee8f76b1e98376a0273448640cfa8e)
* scripts: Remove gdb_backtrace | Martin Schwenke | 2013-08-22 | 1 | -87/+0
    This uses potentially insecure temporary files and is not referenced anywhere else.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 4b914d7e217202f3d11a8e95f9f74bc17869475b)
* tools/ctdb: Make most non-auto-all commands abort if run with -n all | Martin Schwenke | 2013-08-22 | 1 | -6/+42
    Or if run with -n A,B,...

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit b1d8732b5da18ae80aea1df0e66b0b5cdcd919bc)
* tools/ctdb: Remove more non-essential fetching of PNN from daemon | Martin Schwenke | 2013-08-22 | 1 | -25/+21
    The useful cases are either CTDB_CURRENT_NODE, in which case ctdb_get_pnn() does the job, or a PNN, which is... ummm... a PNN! :-)

    This works because parse_nodestring() validates PNNs.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 7b3f7eea2465efb099a2faf3e42174bc97b13a16)
* tools/ctdb: Improve auto-all settings for some commands | Martin Schwenke | 2013-08-22 | 1 | -8/+8
    * ipreallocate is cluster-wide so should not be auto-all

    * enablescript, disablescript, getreclock, setreclock, natgwlist can all be auto-all without issues

    * xpnn, ipiface are local-only so don't work with -n, so might as well not be auto-all

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit 123a4677528cb46bee1c6dad8a5162eba9880bc1)
* recoverd: Remove an unused temporary talloc context | Martin Schwenke | 2013-08-22 | 1 | -3/+0
    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit da22d5e60dc023009854025cc9e6bc4b0a84c60e)
* recoverd: Move struct ctdb_public_ip_list back into ctdb_takeover.c | Martin Schwenke | 2013-08-22 | 2 | -6/+6
    This is an internal structure. It was moved into ctdb_private.h a long time ago to allow unit testing. Unit test compilation was changed shortly afterwards to make this unnecessary.

    Signed-off-by: Martin Schwenke <martin@meltin.net>
    (This used to be ctdb commit db57261d7dc264e161659a8c547f44fbd9e88eeb)