gosec severity=medium passes; all severity=low errors are from unhandled
errors, and we have 107 of them. tbh fixing those doesn't look worth it to
me, but maybe somebody out there is itchier about it than I am. medium has
some good stuff in it, and of course high makes sense if we're gonna do
this at all.
this adds some #nosec annotations for things like sql Sprintfs that gosec
flags because we're constructing the strings with variables in them, but
where we know they're clean. fixed up other spots where we were Sprintf-ing
without need. also easy stuff like filepath.Clean when opening a file from
a variable path, and file permissions...
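roughly the shape of these fixes, with made-up names (a sketch, not the
actual diff):

```go
package dbutil

import (
	"fmt"
	"os"
	"path/filepath"
)

// buildQuery interpolates a table name we control (never user input),
// so gosec's G201 (SQL via format string) is safe to exempt here.
func buildQuery(table string) string {
	return fmt.Sprintf("SELECT id, name FROM %s WHERE id = ?", table) // #nosec G201
}

// openConfig cleans the variable-derived path first, which satisfies
// G304 (potential file inclusion via variable).
func openConfig(path string) (*os.File, error) {
	return os.Open(filepath.Clean(path))
}
```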
I can't get the CI build to shut up, but locally I can get it to be pretty
quiet about imports so that it just outputs the gosec results. fortunately,
it still works as expected even when it's noisy. I got it to shut up locally
by unsetting some of the go mod flags, but that doesn't seem to quite do it
in circle; I printed the env out and don't see them set, so idk... I give
up, this works.
closes #1303
* fn: logging/context improvements for runner status calls
Avoid blocking calls to runStatusCall() to make sure the gRPC
context can be cancelled/timed out. This is unlikely to be an
issue, but a runStatusCall() that blocks while the gRPC context
is cancelled is a hard case to follow mentally. The new flow is
a bit easier to follow.
Log all error cases in the Status() gRPC entry point, including
client-side cancellations.
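the new flow looks roughly like this (a sketch with stand-in types, not
the exact fn code):

```go
package statusflow

import (
	"context"
	"log"
)

// StatusResponse and runStatusCall stand in for the real fn types.
type StatusResponse struct{ Healthy bool }

func runStatusCall(ctx context.Context) (*StatusResponse, error) {
	// ...potentially slow health check...
	return &StatusResponse{Healthy: true}, nil
}

// Status returns as soon as the gRPC context is cancelled or the call
// completes, and logs every error path including client cancellations.
func Status(ctx context.Context) (*StatusResponse, error) {
	type result struct {
		resp *StatusResponse
		err  error
	}
	ch := make(chan result, 1) // buffered: the goroutine can't leak on cancel

	go func() {
		resp, err := runStatusCall(ctx) // may block; no longer blocks Status()
		ch <- result{resp, err}
	}()

	select {
	case <-ctx.Done():
		log.Printf("status: client cancelled/timed out: %v", ctx.Err())
		return nil, ctx.Err()
	case r := <-ch:
		if r.err != nil {
			log.Printf("status: call failed: %v", r.err)
		}
		return r.resp, r.err
	}
}
```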
There is no need to propagate CapacityFull to waitHot(). Instead,
waitHot() can receive a 503, which is easier to follow (and less
error-prone) in handleCallEnd().
* docker-pull timeout is now a 504, which classifies it as a
service error. Avoid using 503 to make sure the LB does not retry.
* Only applicable to detached mode: a timeout on the LB is
now an ErrServiceReservationFailure (500). In detached mode,
this is unlikely to make it back to a client, and it is mostly
for documentation/metrics purposes.
* For Triggers, avoid scrubbing service code.
* fn: uds init wait latency metric in prometheus
Adding a tracker for UDS initialization during container start. This
complements our existing container state latency trackers and
docker-api latency trackers.
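shape of the tracker (the metric name and wiring here are hypothetical;
the real ones in the codebase may differ):

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// udsInitWait tracks time spent waiting for the container's UDS to
// become ready after start.
var udsInitWait = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "fn_container_uds_init_wait_seconds",
	Help: "Time spent waiting for the container's UDS to become ready.",
})

func init() {
	prometheus.MustRegister(udsInitWait)
}

// observeUDSWait records the wait between container start and UDS readiness.
func observeUDSWait(start time.Time) {
	udsInitWait.Observe(time.Since(start).Seconds())
}
```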
* fn: latency metrics for various call states
This complements the API latency metrics available
on the LB agent. In this case, we would like to measure
calls that have finished with the following statuses:
"completed"
"canceled"
"timeouts"
"errors"
"server_busy"
and while measuring this latency, we subtract the
amount of time the actual function execution took. This
is not precise, but an approximation mostly suitable
for trending.
Going forward, we could also subtract UDS wait time and/or
docker pull latency from this latency as an enhancement
to this PR.
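the subtraction in sketch form (metric name and helper are made up, not
the real code):

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// callOverhead is overall call latency minus function execution time,
// bucketed by final state ("completed", "canceled", "timeouts", "errors",
// "server_busy").
var callOverhead = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name: "fn_call_overhead_seconds",
	Help: "Call latency excluding function execution time, by final state.",
}, []string{"state"})

func recordCallLatency(state string, callStart time.Time, execDuration time.Duration) {
	overhead := time.Since(callStart) - execDuration
	if overhead < 0 {
		overhead = 0 // approximation guard; this metric is for trending only
	}
	callOverhead.WithLabelValues(state).Observe(overhead.Seconds())
}
```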
For the sake of completeness, and also as defensive coding,
let's record client context cancel/timeout cases. For the retry
reason, if the error is the same as the client context error
(timeout/cancel), we should not report the retry as due to an
error. Similarly for placed calls, do not flag the placed call
as an error if the client canceled or timed out. We track client
context timeout/cancel separately, in addition to this.
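the classification check is basically this (sketch, name made up):

```go
package placer

import (
	"context"
	"errors"
)

// isClientContextErr reports whether err is just the client's own context
// ending; in that case we don't count the retry as due-to-error or flag
// the placed call as an error, since those cases are tracked separately.
func isClientContextErr(ctx context.Context, err error) bool {
	return ctx.Err() != nil && errors.Is(err, ctx.Err())
}
```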
* actually disable stdout/stderr. stdout>stderr
* for pure runner this turns it off for real this time.
* this also just makes the agent container type send stdout to stderr;
since we're not using stdout for function output anymore, this is pretty
straightforward, hopefully (see the sketch after these bullets).
* I added a panic and some type-checking printlns to ensure this is true
for pure_runner: both stdout and stderr are off. also added a unit test
from agent to ensure this behavior from its container type, which
pure_runner utilizes (no integration test, though)
* tests ensure that logs still work if we're not trying to disable them
(full agent)
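illustrative shape of the container change (the real fn types differ;
this is just the idea):

```go
package agent

import "io"

// Asking the container for stdout now hands back the stderr writer, so
// nothing is ever written to the stdout stream.
type container struct {
	stderr io.Writer
}

func (c *container) Stdout() io.Writer { return c.stderr } // stdout > stderr
func (c *container) Stderr() io.Writer { return c.stderr }
```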
* handle non-ghost swapping
* disable pure runner logging
there's a racy bug where the logger is being written to while it's
closing, but chasing that led to figuring out that we don't need the logger
at all in pure runner: the syslog thing isn't an in-process fn thing, and
we don't need the logs from attach for anything further in pure runner. so
this disables the logger at the docker level, to save sending the bytes
back over the wire; this could be a nice little performance bump too. of
course, with this, it means agents can be configured to not log debug or to
have no logs to store at all, and not a lot of guards have been put on this
for 'full' agent mode while it hangs on a cross feeling the breeze awaiting
its demise - the default configuration remains the same, and no behavior
changes in 'full' agent are here.
it was a lot smoother to make the noop than to try to plumb in 'nil' for
stdout/stderr; this has a much lower risk of nil panic issues for the same
effect. it's not perfect, relying on type casting, but plumbing in an
interface to check has the same issues (loss of interface adherence for any
decorator), so this seems ok. defaulting to not having a logger was
similarly painful and ended up with this. but open to ideas.
* replace usage of old null reader writer impl
* make Read return io.EOF for io.Copy usage
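the null impl looks roughly like this (sketch):

```go
package common

import "io"

// Writes are swallowed and Read reports io.EOF, so an io.Copy draining a
// stream through this terminates instead of looping on (0, nil) reads.
type noopRWC struct{}

func (noopRWC) Read(p []byte) (int, error)  { return 0, io.EOF }
func (noopRWC) Write(p []byte) (int, error) { return len(p), nil }
func (noopRWC) Close() error                { return nil }

// compile-time check that noopRWC satisfies io.ReadWriteCloser
var _ io.ReadWriteCloser = noopRWC{}
```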
In runHot(), it's safer to use a separate channel between the
monitoring goroutine and the processing goroutine to handle
cancellations triggered by the monitoring goroutine.
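sketch of the shape (names made up, not the actual runHot() body):

```go
package agent

import "context"

// The monitoring goroutine owns a dedicated shutdown channel, so the
// processing loop has a single explicit cancellation path instead of
// shared state between the two goroutines.
func runHotSketch(ctx context.Context, work <-chan func()) {
	shutdown := make(chan struct{})

	go func() { // monitoring goroutine: eviction, timeout, agent shutdown
		<-ctx.Done()
		close(shutdown)
	}()

	for { // processing loop: handle calls until told to stop
		select {
		case <-shutdown:
			return
		case fn := <-work:
			fn()
		}
	}
}
```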
The container initialization phase consumes resource tracker
resources (a token) during lengthy operations.
For agent stability/liveness, this phase has
to be evictable/cancelable and time-bounded.
With this change, we introduce a new system-wide environment setting
to bound the time spent in the container initialization phase. This phase
includes the docker-pull, docker-create, docker-attach, docker-start
and UDS wait operations. This initialization period is also now
considered evictable.
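sketch of the idea (the env var name here is made up, not the real
setting):

```go
package agent

import (
	"context"
	"os"
	"time"
)

// initTimeout reads the system-wide bound on the init phase. The whole
// phase (pull/create/attach/start/UDS wait) runs under one deadline while
// holding its (now evictable) resource token.
func initTimeout() time.Duration {
	if v := os.Getenv("FN_HOT_START_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return 5 * time.Minute // illustrative default
}

func withInitDeadline(ctx context.Context) (context.Context, context.CancelFunc) {
	return context.WithTimeout(ctx, initTimeout())
}
```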
The now-obsolete driver.PrepareCookie() call handled image and
container creation. In the agent, going forward we will need finer-
grained control over the timeouts implied by the contexts.
For this reason, with this change, we split PrepareCookie()
into Validate/Pull/Create calls under the Cookie interface.
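roughly the resulting interface (method names per the description above;
signatures are assumed):

```go
package drivers

import "context"

// Each phase takes its own context so the agent can set per-phase
// timeouts instead of the old all-in-one PrepareCookie() deadline.
type Cookie interface {
	Validate(ctx context.Context) error // cheap checks before any real work
	Pull(ctx context.Context) error     // image pull, under its own deadline
	Create(ctx context.Context) error   // container creation once pulled
}
```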
This implements a "detached" mechanism to get an ack from the runner
once it actually starts to run a function. In this scenario, the response
returned is just a 202 if we placed the function within a specific
time frame. If we hit errors or fail to place the fn in time, we
return different errors.
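the flow in sketch form (all names made up):

```go
package lb

import (
	"context"
	"errors"
	"net/http"
	"time"
)

var errPlacementTimeout = errors.New("failed to place call in time")

// placeDetached waits up to a placement window for the runner's "started"
// ack, answers 202 on success, and maps failure modes to distinct errors.
func placeDetached(ctx context.Context, started <-chan struct{}, window time.Duration) (int, error) {
	t := time.NewTimer(window)
	defer t.Stop()
	select {
	case <-started:
		return http.StatusAccepted, nil // 202: runner ack'd the function start
	case <-t.C:
		return http.StatusInternalServerError, errPlacementTimeout
	case <-ctx.Done():
		return 0, ctx.Err()
	}
}
```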