Commit Graph

15 Commits

Author SHA1 Message Date
Tolga Ceylan
1cd5894f41 fn: LB agent: reduce 'Too Busy' error logs (#1033)
With this PR, runner client translates too busy errors
from gRPC session and runner itself into Fn error type.
Placers now ignore this error message to reduce unnecessary
logging.
2018-06-04 12:16:00 -07:00
Tolga Ceylan
7261ddedcc fn: LB agent: EOF from runner is normal in nack cases (#1032) 2018-06-04 12:10:00 -07:00
Tolga Ceylan
7f1d14d21f fn: slot hash id must be utf8 in gRPC (#1016) 2018-05-29 16:26:43 -07:00
Tolga Ceylan
74a5379dec fn: lb & pure-runner slot hash id communication (#1007)
* fn: lb & pure-runner slot hash id communication

With this change, LB can pre-calculate the slot hash
key and pass it to runners. If LB knows/calculates
the slot hash ids, then it can also make better
estimates on which runner can successfully execute
it especially when status messages from runner
include a small summary of idle slots for a given
slot hash id. (TODO)

* fn: fix mock test
2018-05-25 14:12:48 -07:00
Tolga Ceylan
77086ecc24 fn: lb-agent & runner gRPC updates (#1005)
Breaking changes:

*) Removed unused ACK/NACK definitions
*) Extended Finished messages with error code/str
2018-05-17 15:02:15 -07:00
Tolga Ceylan
7cf8e2a61d fn: pure-runner time out while waiting TryCall (#1006)
This should return a retriable error code 503.
2018-05-17 15:00:50 -07:00
Tolga Ceylan
4ccde8897e fn: lb and pure-runner with non-blocking agent (#989)
* fn: lb and pure-runner with non-blocking agent

*) Removed pure-runner capacity tracking code. This did
not play well with internal agent resource tracker.
*) In LB and runner gRPC comm, removed ACK. Now,
upon TryCall, pure-runner quickly proceeds to call
Submit. This is good since at this stage pure-runner
already has all relevant data to initiate the call.
*) Unless pure-runner emits a NACK, LB immediately
streams http body to runners.
*) For retriable requests added a CachedReader for
http.Request Body.
*) Idempotenty/retry is similar to previous code.
After initial success in Engament, after attempting
a TryCall, unless we receive NACK, we cannot retry
that call.
*) ch and naive places now wraps each TryExec with
a cancellable context to clean up gRPC contexts
quicker.

* fn: err for simpler one-time read GetBody approach

This allows for a more flexible approach since we let
users to define GetBody() to allow repetitive http body
read. In default LB case, LB executes a one-time io.ReadAll
and sets of GetBody, which is detected by RunnerCall.RequestBody().

* fn: additional check for non-nil req.body

* fn: attempt to override IO errors with ctx for TryExec

* fn: system-tests log dest

* fn: LB: EOF send handling

* fn: logging for partial IO

* fn: use buffer pool for IO storage in lb agent

* fn: pure runner should use chunks for data msgs

* fn: required config validations and pass APIErrors

* fn: additional tests and gRPC proto simplification

*) remove ACK/NACK messages as Finish message type works
OK for this purpose.
*) return resp in api tests for check for status code
*) empty body json test in api tests for lb & pure-runner

* fn: buffer adjustments

*) setRequestBody result handling correction
*) switch to bytes.Reader for read-only safety
*) io.EOF can be returned for non-nil Body in request.

* fn: clarify detection of 503 / Server Too Busy
2018-05-17 12:09:03 -07:00
Tolga Ceylan
c0ee3ce736 fn: locked mutex while blocked on I/O considered harmful (#935)
* fn: mutex while waiting I/O considered harmful

*) Removed hold mutex while wait I/O cases these
included possible disk I/O and network I/O.

*) Error/Context Close/Shutdown semantics changed since
the context timeout and comments were misleading. Close
always waits for pending gRPC session to complete.
Context usage here was merely 'wait up to x secs to
report an error' which only logs the error anyway.
Instead, the runner can log the error. And context
still can be passed around perhaps for future opencensus
instrumentation.
2018-04-13 11:23:29 -07:00
Tolga Ceylan
623aeb35b2 fn: common.WaitGroup improvements (#940)
* fn: common.WaitGroup improvements

*) Split the API into AddSession/DoneSession
*) Only wake up listeners when session count reaches zero.

* fn: WaitGroup go-routine blast test

* fn: test fix and rebase fixup
2018-04-12 16:21:13 -07:00
Tolga Ceylan
e53d23afc9 fn: sync.WaitGroup replacement common.WaitGroup (#937)
* fn: sync.WaitGroup replacement common.WaitGroup

agent/lb_agent/pure_runner has been incorrectly using
sync.WaitGroup semantics. Switching these components to
use the new common.WaitGroup() that provides a few handy
functionality for common graceful shutdown cases.

From https://golang.org/pkg/sync/#WaitGroup,
    "Note that calls with a positive delta that occur when the counter
     is zero must happen before a Wait. Calls with a negative delta,
     or calls with a positive delta that start when the counter is
     greater than zero, may happen at any time. Typically this means
     the calls to Add should execute before the statement creating
     the goroutine or other event to be waited for. If a WaitGroup
     is reused to wait for several independent sets of events,
     new Add calls must happen after all previous Wait calls have
     returned."

HandleCallEnd introduces some complexity to the shutdowns, but this
is currently handled by AddSession(2) initially and letting the
HandleCallEnd() when to decrement by -1 in addition to decrement -1 in
Submit().

lb_agent shutdown sequence and particularly timeouts with runner pool
needs another look/revision, but this is outside of the scope of this
commit.

* fn: lb-agent wg share

* fn: no need to +2 in Submit with defer.

Removed defer since handleCallEnd already has
this responsibility.
2018-04-12 11:33:01 -07:00
Tolga Ceylan
9b86e3626e fn: avoid go-routine leak (#934) 2018-04-11 12:11:08 -07:00
jan grant
88074a42c0 Bugfix/grpc consume eof (#912)
* GRPC streams end with an EOF

The client should ensure that the final packet is followed by a GRPC
EOF. This has the benefit of permitting the client code to clean up resources.

* Don't require an entire HTTP request in RunnerCall

TryExec needs a handle on an incoming ReadCloser containing the body
of a request; however, everything else will already have been extracted
from the HTTP request in the case of lbAgent use.

(The point of this change is to simplify the interface for other uses.)

* Return error from GRPC layer explicitly

As per review
2018-04-03 15:04:21 +01:00
Gerardo Viedma
348bbaf36b support runner TLS certificates with specified certificate Common Names (#900)
* support runner TLS certificates with specified certificate Common Names

* removes duplicate constant

* run in insecure mode by default but expose ability to create tls-secured runner pools programmatically

* fixes runner tests to use new tls interfaces
2018-03-28 13:57:15 +01:00
Gerardo Viedma
1cae6f988e Make PKI data and RunnerFactory public objects (#865)
* Make PKI data and RunnerFactory public objects

* removes unnecessary nullRunner object

* renames secure factory to point out MTLS
2018-03-16 15:40:58 +00:00
Gerardo Viedma
73ae77614c Moves out node pool manager behind an extension using runner pool abstraction (Part 2) (#862)
* Move out node-pool manager and replace it with RunnerPool extension

* adds extension points for runner pools in load-balanced mode

* adds error to return values in RunnerPool and Runner interfaces

* Implements runner pool contract with context-aware shutdown

* fixes issue with range

* fixes tests to use runner abstraction

* adds empty test file as a workaround for build requiring go source files in top-level package

* removes flappy timeout test

* update docs to reflect runner pool setup

* refactors system tests to use runner abstraction

* removes poolmanager

* moves runner interfaces from models to api/runnerpool package

* Adds a second runner to pool docs example

* explicitly check for request spillover to second runner in test

* moves runner pool package name for system tests

* renames runner pool pointer variable for consistency

* pass model json to runner

* automatically cast to http.ResponseWriter in load-balanced call case

* allow overriding of server RunnerPool via a programmatic ServerOption

* fixes return type of ResponseWriter in test

* move Placer interface to runnerpool package

* moves hash-based placer out of open source project

* removes siphash from Gopkg.lock
2018-03-16 13:46:21 +00:00