Even for CLOCK_MONOTONIC, NTP adjustments can advance the clock
forward. When reporting metrics, handle this case as a callLatency
of zero (in other words, the execution latency is almost the same
as the overall latency).
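A minimal sketch of the clamping described above, assuming callLatency is computed as the difference between overall latency and execution latency:

    // Even a monotonic clock can be slewed by NTP, so the difference can
    // come out slightly negative; clamp it to zero before reporting.
    callLatency := overallLatency - execLatency
    if callLatency < 0 {
        callLatency = 0
    }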
Add a new function, NewgRPCRunnerWithTimeout, that allows the caller to
define the gRPC connection timeout. Modify NewgRPCRunner to be
implemented in terms of NewgRPCRunnerWithTimeout, passing its existing
default timeout to the new function.
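A sketch of the delegation pattern described above; the parameter list, return type, and default value are assumptions for illustration, not the actual signatures:

    import (
        "context"
        "crypto/tls"
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/credentials"
    )

    // defaultDialTimeout stands in for the existing hard-coded default.
    const defaultDialTimeout = 10 * time.Second

    // NewgRPCRunner keeps its existing behavior by delegating to the new function.
    func NewgRPCRunner(addr string, tlsConf *tls.Config) (*grpc.ClientConn, error) {
        return NewgRPCRunnerWithTimeout(addr, tlsConf, defaultDialTimeout)
    }

    // NewgRPCRunnerWithTimeout lets the caller choose the gRPC connection timeout.
    func NewgRPCRunnerWithTimeout(addr string, tlsConf *tls.Config, timeout time.Duration) (*grpc.ClientConn, error) {
        ctx, cancel := context.WithTimeout(context.Background(), timeout)
        defer cancel()
        return grpc.DialContext(ctx, addr,
            grpc.WithTransportCredentials(credentials.NewTLS(tlsConf)),
            grpc.WithBlock()) // block so the timeout bounds connection establishment
    }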
Addition of this function was necessary because the timeout here is
rather aggressive. A test tool that leverages the existing
NewgRPCRunner was moved into an environment where its test target was in
a different region from the origin, which required a higher timeout to
cope with the test endpoint being physically farther away.
If the naive placer is not instantiated per call/runner group (aka
LBG), then the rr index will not produce round-robin behavior, since
the index is initialized and stored in the placer configuration.
With this PR, the rr index is moved into the per-RunnerPool.Runners()
inner loop to ensure round-robin within that set. Each time we fetch
a set, since the set might be different, we reset the rr index. This
means we round-robin within that set once, then start from a random
node on the next RunnerPool.Runners() iteration. In busy systems, no
significant behavior change is expected (except for the removal of
atomic operations, performance-wise), but in idle systems the
round-robin behavior should be more observable and simpler to follow,
and can reduce repeated hits on the same runner for a given
RunnerPool.Runners() set. In addition, naive placer tests are
introduced to ensure we observe this behavior.
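A minimal sketch of the per-set round-robin described above; the placer and pool types are simplified stand-ins, not the actual Fn interfaces:

    import "math/rand"

    // placeOnce walks the current runner set exactly once, starting from a
    // random offset so successive Runners() fetches do not all begin at the
    // same node. It returns true once try succeeds on some runner.
    func placeOnce(runners []string, try func(string) bool) bool {
        if len(runners) == 0 {
            return false
        }
        // Reset the rr index for this set: random start, then one full pass.
        start := rand.Intn(len(runners))
        for i := 0; i < len(runners); i++ {
            if try(runners[(start+i)%len(runners)]) {
                return true
            }
        }
        return false
    }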
When the LB is processing HTTP headers from their flat array
representation in gRPC, it should use http.Header.Add() to grow the
HTTP headers so that header keys with multiple values are handled
correctly; Set() overwrites the previous entries.
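A small sketch of the difference; the flat pair slice stands in for the gRPC representation:

    h := http.Header{}
    pairs := [][2]string{
        {"Set-Cookie", "a=1"},
        {"Set-Cookie", "b=2"},
    }
    for _, p := range pairs {
        h.Add(p[0], p[1]) // keeps both values; h.Set(p[0], p[1]) would leave only "b=2"
    }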
* add fn fdk version to default tags in api response metrics and fields in api response logging
* add how to build the fn docker image, and how to specify it to the fn start command
Updates the RetryAllBackOff function to fail early when a request to obtain Runners cannot succeed due to a human-caused Fn misconfiguration error. This prevents a user from having to wait for the maximum timeout when the system is misconfigured.
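A minimal sketch of the fail-early behavior; the sentinel error and the backoff loop are illustrative assumptions, not the actual RetryAllBackOff implementation:

    import (
        "context"
        "errors"
        "time"
    )

    // errMisconfigured stands in for the error returned when Fn is misconfigured.
    var errMisconfigured = errors.New("fn: misconfigured runner pool")

    // retryWithBackOff retries op until it succeeds, the context expires, or the
    // error is one that retrying cannot fix.
    func retryWithBackOff(ctx context.Context, op func() error) error {
        delay := 100 * time.Millisecond
        for {
            err := op()
            if err == nil {
                return nil
            }
            if errors.Is(err, errMisconfigured) {
                return err // fail early instead of waiting out the maximum timeout
            }
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(delay):
                delay *= 2 // simple exponential backoff
            }
        }
    }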
* fn: agent slot error handling improvements
* fn: slot error retention is suboptimal
Current fn agent (aka runner) tries to communicate
all container start errors to clients. In order to
achieve this, the errors are retained and could be
delivered to clients that did not spawn that container.
This is OK as it tries to be transparent to callers
instead of suppressing or hiding errors. However,
these errors are retained without a time bound which
is not ideal. An old request could trigger an error
and this error could be sent to a client at a much later time.
With this change, the error retention semantics change.
If an error occurs during container start and the client
which triggered the container is no longer present, we
log and discard the result.
Broadly, any error that occurs when a request is not 1-1
bound to a container is logged and discarded.
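A minimal sketch of the log-and-discard behavior, assuming the container start result is handed back over a per-request channel (names are illustrative, not the agent's actual types):

    import "log"

    // deliverStartError hands err to the waiting caller if it is still present;
    // otherwise the error is logged and dropped rather than retained.
    func deliverStartError(resultCh chan<- error, err error) {
        select {
        case resultCh <- err:
            // the client that triggered this container receives the error
        default:
            // no client is waiting (it went away or timed out): log and discard
            log.Printf("discarding container start error: %v", err)
        }
    }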
For testing these scenarios, a debug option is added
to the agent to allow passing the request id that triggered
the spawn of the container as an environment variable.
Before this change, the agent used GetResourceToken() calls for
both non-blocking and blocking mode. However, for nested calls such
as in the agent's checkLaunch(), a token might already be queued via
a goroutine in getResourceTokenNBChan(). If the consuming code in
checkLaunch() runs faster, it could place another GetResourceToken()
call while the active token is not yet closed. This can momentarily
result in double cpu/mem allocation, causing a 503 or excess wait.
With this PR, the blocking and non-blocking resource allocation
interfaces are changed so that a ResourceToken is returned directly,
without a channel.
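A sketch of the shape of the new interface; the method names and parameters are simplified assumptions, not the agent's real API:

    import (
        "context"
        "io"
    )

    // ResourceToken releases its cpu/mem reservation when closed.
    type ResourceToken interface {
        io.Closer
    }

    // ResourceTracker hands out tokens directly instead of queueing them on a
    // channel, so a caller cannot accidentally hold two allocations at once.
    type ResourceTracker interface {
        // GetResourceToken blocks until resources are available or ctx is done.
        GetResourceToken(ctx context.Context, memMB, cpuMilli uint64) (ResourceToken, error)
        // GetResourceTokenNB returns immediately; ok is false if resources are
        // not currently available.
        GetResourceTokenNB(memMB, cpuMilli uint64) (ResourceToken, bool)
    }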
Duration units don't need msec conversion for consistency; for
example, we currently do not do any conversion for the execution
duration. The newer metrics, ctrPrepTime, imagePullWaitTime,
ctrCreateTime, and initStartTime, can follow the same convention.
Remove legacy runner-reported duration calculations.
* opencensus attribs for remote runner (#1538)
adding span attrib support for remote runner metrics
adding ochttp filter to group invoke calls in server.go
changing up the grpc callfinished for display on jaeger
changed call.go, including imgpull time (if applicable)
span annotation in runner_client
add imagepulltime in protobuf
Adding span.setstatus to some spans
Moving dockerwait to mcall
* adding ctr metrics to rpcs (#1538)
Adding container create metric to grpc
adding metrics for ctr preparation duration, creation duration, and init start, using atomic load and store to avoid data races
adding atomic store for stats generated in runhot function
excluding / path in webserver, removing ochttp for adminserver
adding spandata in runner client to handle new proto data
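A minimal sketch of the atomic store/load pattern mentioned above, with a stats struct local to this example:

    import (
        "sync/atomic"
        "time"
    )

    // ctrStats holds container timings written by the hot-container goroutine
    // and read concurrently by the metrics reporter.
    type ctrStats struct {
        ctrPrepTimeNs   int64
        ctrCreateTimeNs int64
    }

    func (s *ctrStats) setPrepTime(d time.Duration) {
        atomic.StoreInt64(&s.ctrPrepTimeNs, int64(d)) // avoid a data race with readers
    }

    func (s *ctrStats) prepTime() time.Duration {
        return time.Duration(atomic.LoadInt64(&s.ctrPrepTimeNs))
    }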
Catches and generates function errors for two new cases. The first
occurs due to a function/FDK error. If the function closes the read end
of the pipe that the hostagent uses to write data before the hostagent
has finished writing the data, generate an error. This ensures that
any premature close of the input stream is detected and handled by the
hostagent.
Second, catch any cases where the function attempts to respond before
reading all of the input data from the stream. This is safe because we
already enforce a maximum upper bound on the request body, so a function
or FDK will not have to read for an unbounded amount of time to consume
the outstanding data. Since the container contract enforces HTTP-like
semantics, and HTTP expects the server side to wait for the response
body to arrive before responding, this is not unreasonable. If the
function or FDK attempts to write before the end of the input stream has
been processed by the hostagent, return a different function error
indicating that a premature write has been detected.
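A minimal sketch of how the first case might be detected (EPIPE on the write side of the pipe); the sentinel error is a hypothetical placeholder, not the hostagent's actual error type:

    import (
        "errors"
        "io"
        "syscall"
    )

    // errPrematureClose is a hypothetical sentinel for the case where the
    // function closed its input before the request body was fully written.
    var errPrematureClose = errors.New("function closed input stream before request body was written")

    // writeRequestBody streams the request body into the container's stdin pipe.
    func writeRequestBody(stdin io.Writer, body io.Reader) error {
        if _, err := io.Copy(stdin, body); err != nil {
            if errors.Is(err, syscall.EPIPE) {
                // the read end was closed early: surface it as a function error
                return errPrematureClose
            }
            return err
        }
        return nil
    }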