* fn: stats view/distribution improvements
*) View latency distribution buckets are now an argument
to the view creation functions, which makes it easier to
override them with a custom set (see the sketch below).
This is simplistic and assumes all latency views use the
same bucket set, but in practice that is already the case.
*) Moved API view creation into main, since these views
should not be enabled for all node types. This is
consistent with the rest of the system.
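A minimal sketch of the bucket-argument shape, assuming
the OpenCensus Go stats API; identifier names here are
illustrative, not fn's actual ones:

    package main

    import (
        "go.opencensus.io/stats"
        "go.opencensus.io/stats/view"
    )

    // illustrative measure; fn's real measures live in its own stats code
    var latencyMs = stats.Float64("lb_placer_latency", "LB placer latency", stats.UnitMilliseconds)

    // createLatencyView takes bucket bounds as an argument so callers
    // can override the default latency distribution with a custom set.
    func createLatencyView(buckets []float64) *view.View {
        return &view.View{
            Name:        "lb_placer_latency",
            Description: "LB placer latency distribution",
            Measure:     latencyMs,
            Aggregation: view.Distribution(buckets...),
        }
    }

    func main() {
        // one shared bucket set for all latency views, matching current practice
        if err := view.Register(createLatencyView([]float64{1, 10, 100, 1000, 10000})); err != nil {
            panic(err)
        }
    }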
* fn: Docker samples of cpu/mem/disk with specific buckets
* fn: New timeout for LB Placer
Previously, LB placers kept working for as long as the
client context allowed. This adds a placer config setting
that bounds the wait, defaulting to 360 seconds.
The new timeout is not counted against actual function
execution; it applies only to the time a call waits in
the placer before it starts executing.
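A sketch of how the bound might be applied; the callback
shape and retry backoff here are hypothetical:

    package placer

    import (
        "context"
        "time"
    )

    // placeWithTimeout bounds the placer's wait independently of the
    // client context; the timeout would default to 360 seconds.
    func placeWithTimeout(ctx context.Context, timeout time.Duration, tryPlace func(context.Context) bool) error {
        ctx, cancel := context.WithTimeout(ctx, timeout)
        defer cancel()
        for {
            if tryPlace(ctx) {
                // placed: execution time from here on is not counted
                // against (or bounded by) the placer timeout
                return nil
            }
            select {
            case <-ctx.Done():
                // gave up waiting; the call never started executing
                return ctx.Err()
            case <-time.After(10 * time.Millisecond): // illustrative backoff
            }
        }
    }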
The LB agent reports LB placer latency. It should also
report how long the runner took to initiate the call, as
well as execution time inside the container, once the
runner has accepted (committed to) the call.
This is a small tweak to the placer latency stats. If values cluster
around the 1-2s mark, a single relatively broad bucket that captures
the (1s, 10s] range will obscure that. In particular, typical
Prometheus quantile estimates may be distorted by a bucket that wide.
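For example, splitting the (1s, 10s] range into narrower
buckets (bounds in milliseconds, values illustrative)
keeps a 1-2s cluster visible and gives Prometheus's
histogram_quantile tighter intervals to interpolate in:

    // illustrative bounds in ms: the former single (1000, 10000]
    // bucket is split so a 1-2s cluster lands in buckets of its own
    var placerLatencyBucketsMs = []float64{
        1, 10, 50, 100, 250, 500,
        1000, 1500, 2000, 3000, 5000, 10000,
        60000,
    }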
* fn: introducing lb placer basic metrics
This change adds basic metrics to the naive and consistent
hash LB placers. The stats show how many times we scanned
the full runner list, whether the runner pool failed to
return a runner list, and whether it returned an empty
list. Placed and not-placed statuses are also tracked,
along with whether TryExec returned an error. The most
common error code, Too-Busy, is tracked separately.
Client cancellations and timeouts are tracked as a
client-cancel metric.
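A sketch of what these counters could look like as
OpenCensus measures; the metric names are illustrative,
not fn's actual ones:

    import "go.opencensus.io/stats"

    var (
        placedCount       = stats.Int64("lb_placer_placed", "calls successfully placed", stats.UnitDimensionless)
        notPlacedCount    = stats.Int64("lb_placer_not_placed", "calls that could not be placed", stats.UnitDimensionless)
        tryExecErrCount   = stats.Int64("lb_placer_tryexec_errors", "TryExec calls returning an error", stats.UnitDimensionless)
        tooBusyCount      = stats.Int64("lb_placer_too_busy", "attempts rejected with Too-Busy", stats.UnitDimensionless)
        clientCancelCount = stats.Int64("lb_placer_client_cancel", "calls canceled or timed out by the client", stats.UnitDimensionless)
    )

    // recording a data point, e.g. on a Too-Busy rejection:
    //   stats.Record(ctx, tooBusyCount.M(1))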
For placer latency, we want to know how much time the
placer spends searching for a runner until it successfully
places a call. This includes round-trip times for NACK
responses from the runners up to the successful TryExec()
call. By excluding the latency of the final, successful
TryExec(), we keep function execution and runner container
startup time out of this metric, in an attempt to isolate
placer-only latency.
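A sketch of that accounting; Runner, Call, ErrNotPlaced,
and recordPlacerLatency are hypothetical stand-ins. The
essential detail is that the clock stops just before the
final, successful TryExec():

    func place(ctx context.Context, runners []Runner, call Call) error {
        start := time.Now()
        for _, r := range runners {
            attemptStart := time.Now()
            placed, err := r.TryExec(ctx, call)
            if placed {
                // the committed attempt includes container startup and
                // function execution, so stop the placer clock at its start
                recordPlacerLatency(ctx, attemptStart.Sub(start))
                return err
            }
            // NACK round-trips remain inside the placer latency window
        }
        return ErrNotPlaced // hypothetical sentinel
    }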
* fn: latency and attempt tracker
Removes the full-scan metric; tracking the number of
runners attempted serves this purpose better.
Also, if rp.Runners() fails, that is an unrecoverable
error and we should bail out instead of retrying.
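A sketch of the bail-out and the attempt counter, with
hypothetical identifiers:

    runners, err := rp.Runners(call)
    if err != nil {
        // unrecoverable: the runner pool itself failed, so bail
        // out rather than retrying the placement loop
        return err
    }
    attempted := 0
    for _, r := range runners {
        attempted++
        // ... TryExec attempt; break on success ...
    }
    stats.Record(ctx, attemptCount.M(int64(attempted))) // hypothetical measure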
* fn: typo fix, ch placer finalize err return
* fn: enable LB placer metrics in WithAgentFromEnv if Prometheus is enabled