* Don't log stacktrace when we get gRPC errors
In the top error handler we log the stacktrace if we get back a non-API error. gRPC errors fall under that category as well, and there is no point in logging a stack trace for them: they can originate on the server side or be caused by temporary network blips.
In this change we skip the stack trace log for gRPC errors, and we also lower some log levels from Error to Info for retryable errors.
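For illustration, a hedged sketch of the skip; the logger wiring is illustrative, status.FromError is from google.golang.org/grpc/status:
    package server

    import (
        "runtime/debug"

        "github.com/sirupsen/logrus"
        "google.golang.org/grpc/status"
    )

    func logHandlerError(log logrus.FieldLogger, err error) {
        if _, ok := status.FromError(err); ok {
            // gRPC errors originate on the server side or from transient
            // network blips; a local stack trace adds nothing useful.
            log.WithError(err).Info("grpc error from handler")
            return
        }
        // Other non-API errors are unexpected; keep the stack trace.
        log.WithError(err).WithField("stack", string(debug.Stack())).Error("unexpected error in handler")
    }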
* fn: docker pull retry / backoff options
Introducing SetPullImageRetryPolicy() for the docker driver to allow
customization of docker pull retry behavior.
Replacing the old backoff code with a more formal exponential backoff
driven by policy options.
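A minimal sketch of the kind of policy-driven backoff this introduces; the PullRetryPolicy fields and retryWithBackoff helper are illustrative, not the driver's actual API:
    package docker

    import (
        "context"
        "math/rand"
        "time"
    )

    // PullRetryPolicy is a hypothetical shape for the policy options.
    type PullRetryPolicy struct {
        MaxRetries  int
        InitialWait time.Duration
        MaxWait     time.Duration
        Multiplier  float64
    }

    // retryWithBackoff runs fn until it succeeds, the retries are exhausted,
    // or the context is done, sleeping an exponentially growing, jittered
    // interval between attempts.
    func retryWithBackoff(ctx context.Context, p PullRetryPolicy, fn func(context.Context) error) error {
        wait := p.InitialWait
        var err error
        for i := 0; i <= p.MaxRetries; i++ {
            if err = fn(ctx); err == nil {
                return nil
            }
            // Jitter avoids synchronized retries from many callers.
            sleep := wait + time.Duration(rand.Int63n(int64(wait)/2+1))
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(sleep):
            }
            wait = time.Duration(float64(wait) * p.Multiplier)
            if wait > p.MaxWait {
                wait = p.MaxWait
            }
        }
        return err
    }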
* fn: introducing a new image pull layer
Docker internally generates unnecessary HTTP traffic when the same image
is pulled concurrently, so this PR introduces a new image pull layer.
It allows us to serialize pulls of the same image using a simple model
of one active transfer with a list of listeners.
The timeout behavior is slightly different when multiple listeners are
waiting: the timeout of the first listener is propagated to all
listeners. However, since the docker-pull timeout is configured
globally, the overall timeout behavior is essentially the same.
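A minimal sketch of the serialization model, assuming one active transfer per image with a list of waiting listeners; the types and names are illustrative:
    package docker

    import (
        "context"
        "sync"
    )

    type pullResult struct{ err error }

    // imagePuller serializes docker pulls of the same image: the first caller
    // becomes the active transfer, later callers register as listeners.
    type imagePuller struct {
        mu     sync.Mutex
        active map[string][]chan pullResult // image -> waiting listeners
        pull   func(ctx context.Context, img string) error
    }

    func newImagePuller(pull func(context.Context, string) error) *imagePuller {
        return &imagePuller{active: make(map[string][]chan pullResult), pull: pull}
    }

    // PullImage issues at most one underlying pull per image at a time and
    // fans the result out to every caller that was waiting on it.
    func (p *imagePuller) PullImage(ctx context.Context, img string) error {
        ch := make(chan pullResult, 1)

        p.mu.Lock()
        listeners, inFlight := p.active[img]
        p.active[img] = append(listeners, ch)
        p.mu.Unlock()

        if !inFlight {
            // First caller starts the active transfer; its ctx (and therefore
            // its timeout) governs the pull for every listener.
            go func() {
                err := p.pull(ctx, img)
                p.mu.Lock()
                waiters := p.active[img]
                delete(p.active, img)
                p.mu.Unlock()
                for _, w := range waiters {
                    w <- pullResult{err: err} // buffered, never blocks
                }
            }()
        }

        select {
        case res := <-ch:
            return res.err
        case <-ctx.Done():
            // Stop waiting; the transfer keeps running for remaining listeners.
            return ctx.Err()
        }
    }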
We need the ability to configure a remote runner. The data we pass is implementation specific, so we keep it generic enough. The pure_runner implementation dumps the data into a file. The file path is configurable.
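A minimal sketch of the pure_runner side, assuming the path comes from configuration; the FN_RUNNER_CONFIG_FILE name is illustrative:
    package agent

    import "os"

    // persistRunnerConfig dumps the opaque, implementation-specific data to the
    // configured file path, if one is set.
    func persistRunnerConfig(data []byte) error {
        path := os.Getenv("FN_RUNNER_CONFIG_FILE") // illustrative config knob
        if path == "" {
            return nil // nothing configured, nothing to persist
        }
        return os.WriteFile(path, data, 0600)
    }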
With this change CallOverrider gets the request as its first parameter.
We know that we want to get rid of CallOverrider, but until we find a
solution for the problem reported in issue #1426 it is, in some cases,
the only place where a call can be customized before it gets executed.
Having access to the original request can be handy inside a
CallOverrider.
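A hedged sketch of a CallOverrider that uses the request; the exact signature is an assumption based on "the request as its first parameter", and the header/extension names are illustrative:
    package server

    import (
        "net/http"

        "github.com/fnproject/fn/api/models"
    )

    // myOverrider inspects the original request before the call executes,
    // e.g. to copy a header into the call's extensions.
    func myOverrider(req *http.Request, call *models.Call, exts map[string]string) (map[string]string, error) {
        if v := req.Header.Get("X-Custom-Routing"); v != "" { // illustrative header
            exts["routing-hint"] = v
        }
        return exts, nil
    }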
* fn: make evictor less aggressive
Load testing experiments revealed that evicting containers that are
still starting is detrimental: it increases churn and acts as a
feedback loop that drives up docker API rates. Starting and deleting a
container are among the most expensive docker API calls, and traffic
arriving at a busy/loaded server can trigger a flood of container
create actions that then get canceled by evictions. The previous code
tried to avoid this by tracking the original request that triggered the
container start, but based on the data this is not enough to avoid a
flurry of evictions. In other words, on a busy system it is not that
rare for the original request to be quickly serviced by another
container.
With this change, we restrict evictability of a container to the Idle
state exclusively. This makes backpressure (503) more likely, since it
allows starting containers to finish initializing.
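A minimal sketch of the Idle-only restriction; the state names and types are illustrative, not the agent's actual ones:
    package agent

    type containerState int

    const (
        stateStarting containerState = iota
        stateIdle
        stateBusy
    )

    // isEvictable reports whether a container may be evicted. Only Idle
    // containers qualify, so containers that are still starting get a chance
    // to initialize and callers see 503 backpressure instead of extra churn.
    func isEvictable(s containerState) bool {
        return s == stateIdle
    }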
This change also fixes a race condition that occasionally caused busy
containers to be evicted. The fix proposed here is to unblock the
listeners as if the busy container really had been evicted, while the
busy container simply refreshes its eviction token and discards the
old, evicted one.
In the future, based on empirical data, we may consider introducing
evictions for slow docker pulls.
In the lbAgent it is currently not possible to add call options beyond
the ones supplied to GetCall.
This change adds a new WithLBCallOptions function to add that
capability to the lbAgent.
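A hedged sketch of the pattern this enables; the type shapes below (CallOpt, Option, the lbAgent field) are illustrative stand-ins, not the package's exact definitions:
    package agent

    type call struct{ /* ... */ }

    // CallOpt customizes a call before it runs.
    type CallOpt func(c *call) error

    type lbAgent struct {
        // options applied to every call, in addition to those passed to GetCall
        callOpts []CallOpt
    }

    // Option configures the lbAgent at construction time.
    type Option func(a *lbAgent) error

    // WithLBCallOptions registers call options that the lbAgent applies to each
    // call it creates, on top of the options supplied to GetCall.
    func WithLBCallOptions(opts ...CallOpt) Option {
        return func(a *lbAgent) error {
            a.callOpts = append(a.callOpts, opts...)
            return nil
        }
    }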
* Send RunnerMsg_ResultStart message
This change sends the RunnerMsg_ResultStart message from the
enqueueCallResponse function.
The RunnerMsg_ResultStart contains any headers set by the function and
the status code.
This fixes the case where we don't send custom headers if a function
doesn't return a body.
Fixes #1413
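A hedged sketch of the ordering inside enqueueCallResponse; the resultStream interface and helper names below are hypothetical stand-ins for the generated gRPC types, shown only to illustrate that ResultStart (status + headers) goes out even when there is no body:
    package agent

    import "net/http"

    type resultStream interface {
        sendResultStart(status int, headers http.Header) error
        sendResultData(chunk []byte) error
    }

    func enqueueCallResponseSketch(s resultStream, status int, headers http.Header, body []byte) error {
        // Always send ResultStart first, even if the function returned no body;
        // this is what carries custom headers back to the caller.
        if err := s.sendResultStart(status, headers); err != nil {
            return err
        }
        if len(body) == 0 {
            return nil
        }
        return s.sendResultData(body)
    }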
* fn: clean docker retry logic and legacy code
Cleanup of the docker retry error-handling logic. This code likely
dates back to older, less stable docker versions. Instead, we would
like to expose any underlying issues and fix them as necessary.
Examination of logs/events from the last 3-4 weeks shows no occurrence
of these retries/errors; on the contrary, the logic causes more issues
(e.g. retrying a docker pull on a 5xx).
* plumb docker auther to the depths
this patch accomplishes a few things wrt the docker auth interface used for
configuring credentials for pulls:
* adds a context for timing out the docker auth call
* no longer calls docker auth before validating an image. we were using this to
validate that a user still has access to an image that we have locally. this
policy itself could be TBD, but we're being flagged for hitting registries too
hard, and runners have bounded lifetimes on the order of hours, so we decided
this is ok. now we only call the docker auth function when pulling an image, and
its time is included in the pull timeout we've so delicately plumbed
* adds the ability for agents to add call configurations to every call created
via GetCall. this ought to help fn extenders configure calls and puts us on the
path of getting rid of the call overrider (and is enough to be able to replace
our current one that is causing issues)
* adds an option to configure calls with a docker auth function they provide,
so that fn extenders can plumb this thing in without having to fork the driver
(which would be absurd!)
* some lints...
* fix tests?
* add the image to docker auth
this can come in handy for a provided docker auth function to pick credentials
differently based on registry version, registry host, etc. it makes sense in
all contexts of usage here, I think
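a hedged sketch of a provided auth function that picks credentials per registry host, assuming the hook now receives the image reference; the function signature is an assumption, docker.AuthConfiguration comes from github.com/fsouza/go-dockerclient:
    package drivers

    import (
        "context"
        "strings"

        docker "github.com/fsouza/go-dockerclient"
    )

    func dockerAuth(ctx context.Context, image string) (*docker.AuthConfiguration, error) {
        // Small heuristic: the part before the first "/" is the registry host
        // when it contains a "." or ":" (e.g. "myregistry.example.com/team/app:tag").
        host := "docker.io"
        if i := strings.Index(image, "/"); i > 0 {
            if p := image[:i]; strings.ContainsAny(p, ".:") {
                host = p
            }
        }
        switch host {
        case "myregistry.example.com": // illustrative private registry
            return &docker.AuthConfiguration{Username: "svc-user", Password: "****", ServerAddress: host}, nil
        default:
            return nil, nil // anonymous pull
        }
    }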
Removing the hello-world and fnproject/hello images from tests,
replacing them with busybox, which we already use in the project.
In drivers.ContainerTask, Timeout is obsolete code left over from cold
containers, which we no longer have.
During gRPC communication, we previously assumed that a failure to
Send() the Try message in TryExec() could always be retried. However,
this is not robust, since we cannot assume no data was written to the
wire. With this change, before concluding that the call can be retried,
we also check for the Unavailable error code.
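A minimal sketch of the retry decision; the helper name is illustrative:
    package agent

    import (
        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    // isRetriableSendError reports whether a failed Send of the Try message can
    // be retried without risking a duplicate execution: only Unavailable
    // indicates that nothing reached the server.
    func isRetriableSendError(err error) bool {
        return status.Code(err) == codes.Unavailable
    }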
The ListContainers call that is spawned asynchronously during docker
driver start needs a reasonable timeout and should be retried if the
timeout expires.
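A minimal sketch of bounding and retrying that call with github.com/fsouza/go-dockerclient; the timeout and retry count are illustrative:
    package docker

    import (
        "context"
        "time"

        docker "github.com/fsouza/go-dockerclient"
    )

    func listContainersWithRetry(ctx context.Context, client *docker.Client) ([]docker.APIContainers, error) {
        var lastErr error
        for i := 0; i < 3; i++ { // illustrative retry count
            cctx, cancel := context.WithTimeout(ctx, 30*time.Second) // illustrative timeout
            containers, err := client.ListContainers(docker.ListContainersOptions{All: true, Context: cctx})
            cancel()
            if err == nil {
                return containers, nil
            }
            lastErr = err
            if ctx.Err() != nil {
                break // parent context done, stop retrying
            }
        }
        return nil, lastErr
    }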
* fn: driver cookie interface changes for finer control
Users of the driver cookie, as seen in agent.go, require specific
timeouts in the contexts passed to the different operations. For
example, the context passed to cookie.PullImage() is intended to apply
a timeout/duration to the pull operation. To clarify this and avoid
tricky timeout-related issues, the implicit docker-inspect is removed
from cookie.PullImage(). In addition, the authentication step is
separated out, since the Auther interface calls unknown code.
Additional changes to the cookie implementation clarify the inspected
image, created container and authentication configuration: each of the
AuthImage(), ValidateImage() and CreateContainer() steps initializes
the corresponding pointer object in the cookie.
The changes now require a ValidateImage() call if a PullImage() is
attempted. In all cases, AuthImage() must be called before
ValidateImage().
An example usage is as follows:
    cookie, err := driver.CreateCookie(ctx, task)
    err = cookie.AuthImage(ctx)         // resolve registry credentials via the Auther
    pull, err := cookie.ValidateImage() // inspect local image; pull == true if missing
    if pull {
        err = cookie.PullImage(ctx)
        pull, err = cookie.ValidateImage() // re-validate after the pull
    }
    err = cookie.CreateContainer(ctx)
    waiter := cookie.Run(ctx)
    waiter.Wait(ctx)
Error handling is omitted for clarity in the above example.