Commit Graph

12 Commits

Author SHA1 Message Date
Andrea Rosa
182db94fad Feature/acksync response writer (#1267)
This implements a "detached" mechanism to get an ack from the runner
once it actually starts to run a function. In this scenario the response
returned back is just a 202 if we placed the function in a specific
time-frame. If we hit some errors or we fail to place the fn in time we
return back different errors.
2018-11-09 10:25:43 -08:00
Owen Strain
21f77f837e Add APIErrorWrapper so that underlying errors can be logged (#1246) 2018-09-28 17:26:54 -07:00
Tolga Ceylan
9f29d824d6 fn: New timeout for LB Placer (#1137)
* fn: New timeout for LB Placer

Previously, LB Placers worked hard as long as
client contexts allowed for. Adding a Placer
config setting to bound this by 360 seconds by
default.

The new timeout is not accounted during actual
function execution and only applies to the amount
of wait time in Placers when the call is not
being executed.
2018-07-26 10:19:25 -07:00
Tolga Ceylan
f24172aa9d fn: introducing lb placer basic metrics (#1058)
* fn: introducing lb placer basic metrics

This change adds basic metrics to naive and consistent
hash LB placers. The stats show how many times we scanned
the full runner list, if runner pool failed to return a
runner list or if runner pool returned an empty list.

Placed and not placed status are also tracked along with
if TryExec returned an error or not. Most common error
code, Too-Busy is specifically tracked.

If client cancels/times out, this is also tracked as
a client cancel metric.

For placer latency, we would like to know how much time
the placer spent on searching for a runner until it
successfully places a call. This includes round-trip
times for NACK responses from the runners until a successful
TryExec() call. By excluding last successful TryExec() latency,
we try to exclude function execution & runner container
startup time from this metric in an attempt to isolate
Placer only latency.

* fn: latency and attempt tracker

Removing full scan metric. Tracking number of
runners attempted is a better metric for this
purpose.

Also, if rp.Runners() fail, this is an unrecoverable
error and we should bail out instead of retrying.

* fn: typo fix, ch placer finalize err return

* fn: enable LB placer metrics in WithAgentFromEnv if prometheus is enabled
2018-06-12 13:36:05 -07:00
Owen Cliffe
c6abc8bf64 Use context logging more to ensure context vars are present in log lines (#1039) 2018-06-06 15:14:29 +01:00
Tolga Ceylan
1cd5894f41 fn: LB agent: reduce 'Too Busy' error logs (#1033)
With this PR, runner client translates too busy errors
from gRPC session and runner itself into Fn error type.
Placers now ignore this error message to reduce unnecessary
logging.
2018-06-04 12:16:00 -07:00
Tolga Ceylan
a57907eed0 fn: user friendly timeout handling changes (#1021)
* fn: user friendly timeout handling changes

Timeout setting in routes now means "maximum amount
of time a function can run in a container".

Total wait time for a given http request is now expected
to be handled by the client. As long as the client waits,
the LB, runner or agents will search for resources to
schedule it.
2018-06-01 13:18:13 -07:00
Tolga Ceylan
e1b7e30e49 fn: cleanup of unused/global constants in lb agent (#1020)
Moved retry interval as placer member variable for time-being.
2018-05-31 13:04:06 -07:00
Tolga Ceylan
4ccde8897e fn: lb and pure-runner with non-blocking agent (#989)
* fn: lb and pure-runner with non-blocking agent

*) Removed pure-runner capacity tracking code. This did
not play well with internal agent resource tracker.
*) In LB and runner gRPC comm, removed ACK. Now,
upon TryCall, pure-runner quickly proceeds to call
Submit. This is good since at this stage pure-runner
already has all relevant data to initiate the call.
*) Unless pure-runner emits a NACK, LB immediately
streams http body to runners.
*) For retriable requests added a CachedReader for
http.Request Body.
*) Idempotenty/retry is similar to previous code.
After initial success in Engament, after attempting
a TryCall, unless we receive NACK, we cannot retry
that call.
*) ch and naive places now wraps each TryExec with
a cancellable context to clean up gRPC contexts
quicker.

* fn: err for simpler one-time read GetBody approach

This allows for a more flexible approach since we let
users to define GetBody() to allow repetitive http body
read. In default LB case, LB executes a one-time io.ReadAll
and sets of GetBody, which is detected by RunnerCall.RequestBody().

* fn: additional check for non-nil req.body

* fn: attempt to override IO errors with ctx for TryExec

* fn: system-tests log dest

* fn: LB: EOF send handling

* fn: logging for partial IO

* fn: use buffer pool for IO storage in lb agent

* fn: pure runner should use chunks for data msgs

* fn: required config validations and pass APIErrors

* fn: additional tests and gRPC proto simplification

*) remove ACK/NACK messages as Finish message type works
OK for this purpose.
*) return resp in api tests for check for status code
*) empty body json test in api tests for lb & pure-runner

* fn: buffer adjustments

*) setRequestBody result handling correction
*) switch to bytes.Reader for read-only safety
*) io.EOF can be returned for non-nil Body in request.

* fn: clarify detection of 503 / Server Too Busy
2018-05-17 12:09:03 -07:00
Tolga Ceylan
f0f9a6d945 fn: LB ch and naive fixes (#942)
* fn: LB ch and naive fixes

*) Naive is now a naive RR algorithm.
*) Both now checks for ctx/timeout in each attempt.

* fn: test fix
2018-05-07 11:50:16 -07:00
Tolga Ceylan
e47d55056a fn: reduce lbagent and agent dependency (#938)
* fn: reduce lbagent and agent dependency

lbagent and agent code is too dependent. This causes
any changed in agent to break lbagent. In reality, for
LB there should be no delegated agent. Splitting these
two will cause some code duplication, but it reduces
dependency and complexity (eg. agent without docker)

* fn: post rebase fixup

* fn: runner/runnercall should use lbDeadline

* fn: fixup ln agent test

* fn: remove agent create option for common.WaitGroup
2018-04-12 15:51:58 -07:00
Tolga Ceylan
e7658db822 Move ch ring placement back from old FnLB. (#930)
* fn: bring back CH ring placer into FN repo based on original FnLB
* fn: move placement code into runnerpool directory
2018-04-10 17:26:24 -07:00