Try to detect that the remote vm isn't responsive
ClosedPublic

Authored by mkrizek on Oct 1 2015, 12:21 PM.

Details

Summary

Currently there seems to be no way of telling that the remote vm is no
longer alive. This patch implements a counter that is reset each time we
get data from the remote. If we don't receive any data within a given
time, we fail gracefully. The problem is sftp, where we use paramiko
methods for putting/getting files to/from the remote and cannot use the
counter approach. Rather than spend more time on this, it'd be better to
wait until it is fixed in paramiko (issues are already filed) and watch
the logs to see how often the remote dies during sftp calls. This patch
covers the most time-consuming case of talking to the remote, which is
the actual task execution.
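
For readers who want the idea without opening the diff, here is a rough sketch of the counter approach described above. It is not the code from this patch: `read_with_alive_counter`, `RemoteUnresponsiveError` and `FakeChannel` are hypothetical names made up for illustration, and the fake channel only borrows the method names `recv_ready`, `recv` and `exit_status_ready` from paramiko's `Channel` so the loop can run without a real vm.

```python
import time


class RemoteUnresponsiveError(Exception):
    """Hypothetical exception: the remote sent no data for too long."""


def read_with_alive_counter(channel, timeout=5.0, poll_interval=0.1,
                            sleep=time.sleep):
    """Collect output from channel; fail if it is silent for timeout seconds."""
    chunks = []
    alive_counter = 0.0
    while True:
        if channel.recv_ready():
            chunks.append(channel.recv(1024))
            alive_counter = 0.0  # any data proves the remote is still alive
            continue
        if channel.exit_status_ready():
            return b"".join(chunks)  # command finished, nothing left to read
        sleep(poll_interval)
        alive_counter += poll_interval
        if alive_counter >= timeout:
            raise RemoteUnresponsiveError(
                "no data from remote for %.1f seconds" % timeout)


class FakeChannel(object):
    """Stand-in for an SSH channel that replays a scripted list of events.

    Each event is a bytes chunk, or None for a poll with nothing to read.
    With finishes=False the channel never reports an exit status, which
    imitates a vm that was killed in the middle of a task.
    """

    def __init__(self, events, finishes=True):
        self.events = list(events)
        self.finishes = finishes

    def recv_ready(self):
        if self.events and self.events[0] is None:
            self.events.pop(0)  # consume one silent poll
            return False
        return bool(self.events)

    def recv(self, nbytes):
        return self.events.pop(0)

    def exit_status_ready(self):
        return self.finishes and not self.events
```

Passing `sleep` as a parameter is only a convenience so the loop can be exercised without real waiting; a channel built as `FakeChannel([b"a"], finishes=False)` makes the function raise once the silent polls add up to `timeout`.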

Note that the vm is not properly torn down, which is the subject of a separate patch in D603.

Test Plan

Tried running rpmlint in direct ssh mode and killed the vm during the task execution. After the timeout, an exception was thrown.

Diff Detail

Repository
rLTRN libtaskotron
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
mkrizek retitled this revision to Try to detect that the remote vm isn't responsive. Oct 1 2015, 12:21 PM
mkrizek updated this object.
mkrizek edited the test plan for this revision.
mkrizek added reviewers: tflink, kparal, jskladan, lbrabec.
tflink added a comment. Oct 1 2015, 2:55 PM

Do you know if any progress is being made upstream with enabling the counter timeout method for SFTP? I suspect that we'll want to do some testing to make sure that buildbot's timeout/kill mechanism is going to work well for us in the case of timeout.

In D604#11495, @tflink wrote:

Do you know if any progress is being made upstream with enabling the counter timeout method for SFTP? I suspect that we'll want to do some testing to make sure that buildbot's timeout/kill mechanism is going to work well for us in the case of timeout.

I linked some issues/pull requests in the ticket and asked about progress. Either way, we don't know when a new version is coming.

jskladan accepted this revision. Oct 2 2015, 10:21 AM

Just a nit, looks good though

libtaskotron/remote_exec.py
158–161

how about:

alive_counter += 0.1

if alive_counter >= self.TIMEOUT:

Not a big deal though, just sayin' :)

Also, and this might be /me being unnecessarily cautious, but I'd use the >= comparator either way.
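
To illustrate why >= matters with the 0.1 increments suggested above, here is a small standalone sketch. `polls_until` and the 0.3 timeout are made-up for the demonstration and are not taken from remote_exec.py. Repeatedly adding 0.1 as a float steps over 0.3 without ever being exactly equal to it, so an equality check would never fire.

```python
def polls_until(predicate, step=0.1, max_polls=1000):
    """Return how many increments it takes for predicate(counter) to hold.

    Returns None if the predicate never holds within max_polls increments.
    """
    counter = 0.0
    for polls in range(1, max_polls + 1):
        counter += step
        if predicate(counter):
            return polls
    return None


TIMEOUT = 0.3  # hypothetical value, chosen only for the demonstration

fires_with_ge = polls_until(lambda c: c >= TIMEOUT)  # 3
fires_with_eq = polls_until(lambda c: c == TIMEOUT)  # None
```

The third increment yields 0.30000000000000004, which passes `>=` but not `==`, so the equality version would keep looping forever in a real polling loop.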

This revision is now accepted and ready to land. Oct 2 2015, 10:21 AM
kparal accepted this revision. Oct 2 2015, 11:42 AM

Please make the minor changes commented below and commit.

In D604#11495, @tflink wrote:

Do you know if any progress is being made upstream with enabling the counter timeout method for SFTP? I suspect that we'll want to do some testing to make sure that buildbot's timeout/kill mechanism is going to work well for us in the case of timeout.

The problem is with disposable clients: if the buildbot master kills buildbot clients, the disposable machines will not be torn down and will stay around forever. We will need to come up with some solution for this. This is not specific to paramiko, even though it certainly contributes to the problem. Let's file a separate ticket.

libtaskotron/remote_exec.py
143–144

Please add a link here to either our Phab ticket where we track this or a relevant upstream ticket.

158–161

I like @jskladan's solution, but both are fine. We should use >= in all cases; it's defensive programming.

The problem is with disposable clients: if the buildbot master kills buildbot clients, the disposable machines will not be torn down and will stay around forever. We will need to come up with some solution for this. This is not specific to paramiko, even though it certainly contributes to the problem. Let's file a separate ticket.

Filed T624.

This revision was automatically updated to reflect the committed changes.