* fn: runner status and docker load images
Introducing a function run for pure runner Status
calls. Previously, Status gRPC calls returned active
inflight request counts with the purpose of a simple
health checker. However this is not sufficient since
it does not show if agent or docker is healthy. With
this change, if pure runner is configured with a status
image, that image is executed through docker. The
call uses zero memory/cpu/tmpsize settings to ensure
resource tracker does not block it.
However, operators might not always have a docker
repository accessible/available for status image. Or
operators might not want the status to go over the
network. To allow such cases, and in general possibly
caching docker images, added a new environment variable
FN_DOCKER_LOAD_FILE. If this is set, fn-agent during
startup will load these images that were previously
saved with 'docker save' into docker.
* fn: size restricted tmpfs /tmp and read-only / support
*) read-only Root Fs Support
*) removed CPUShares from docker API. This was unused.
*) docker.Prepare() refactoring
*) added docker.configureTmpFs() for size limited tmpfs on /tmp
*) tmpfs size support in routes and resource tracker
*) fix fn-test-utils to handle sparse files better in create file
* test typo fix
* fn: lb and pure-runner with non-blocking agent
*) Removed pure-runner capacity tracking code. This did
not play well with internal agent resource tracker.
*) In LB and runner gRPC comm, removed ACK. Now,
upon TryCall, pure-runner quickly proceeds to call
Submit. This is good since at this stage pure-runner
already has all relevant data to initiate the call.
*) Unless pure-runner emits a NACK, LB immediately
streams http body to runners.
*) For retriable requests added a CachedReader for
http.Request Body.
*) Idempotenty/retry is similar to previous code.
After initial success in Engament, after attempting
a TryCall, unless we receive NACK, we cannot retry
that call.
*) ch and naive places now wraps each TryExec with
a cancellable context to clean up gRPC contexts
quicker.
* fn: err for simpler one-time read GetBody approach
This allows for a more flexible approach since we let
users to define GetBody() to allow repetitive http body
read. In default LB case, LB executes a one-time io.ReadAll
and sets of GetBody, which is detected by RunnerCall.RequestBody().
* fn: additional check for non-nil req.body
* fn: attempt to override IO errors with ctx for TryExec
* fn: system-tests log dest
* fn: LB: EOF send handling
* fn: logging for partial IO
* fn: use buffer pool for IO storage in lb agent
* fn: pure runner should use chunks for data msgs
* fn: required config validations and pass APIErrors
* fn: additional tests and gRPC proto simplification
*) remove ACK/NACK messages as Finish message type works
OK for this purpose.
*) return resp in api tests for check for status code
*) empty body json test in api tests for lb & pure-runner
* fn: buffer adjustments
*) setRequestBody result handling correction
*) switch to bytes.Reader for read-only safety
*) io.EOF can be returned for non-nil Body in request.
* fn: clarify detection of 503 / Server Too Busy
* fn: agent MaxRequestSize limit
Currently, LimitRequestBody() exists to install a
http request body size in http/gin server. For production
enviroments, this is expected to be used. However, in agents
we may need to verify/enforce these size limits and to be
able to assert in case of missing limits is valuable.
With this change, operators can define an agent env variable
to limit this in addition to installing Gin/Http handler.
http.MaxBytesReader is superior in some cases as it sets
http headers (Connection: close) to guard against subsequent
requests.
However, NewClampReadCloser() is superior in other cases,
where it can cleanly return an API error for this case alone
(http.MaxBytesReader() does not return a clean error type
for overflow case, which makes it difficult to use it without
peeking into its implementation.)
For lb agent, upcoming changes rely on such limits enabled
and using gin/http handler (http.MaxBytesReader) makes such
checks/safety validations difficult.
* fn: read/write clamp code adjustment
In case of overflows, opt for simple implementation
of a partial write followed by return error.
* fn: allow specified docker networks in functions
If FN_DOCKER_NETWORK is specified with a list of
networks, then agent driver picks the least used
network to place functions on.
* add mutex comment
* fn: non-blocking resource tracker and notification
For some types of errors, we might want to notify
the actual caller if the error is directly 1-1 tied
to that request. If hotLauncher is triggered with
signaller, then here we send a back communication
error notification channel. This is passed to
checkLaunch to send back synchronous responses
to the caller that initiated this hot container
launch.
This is useful if we want to run the agent in
quick fail mode, where instead of waiting for
CPU/Mem to become available, we prefer to fail
quick in order not to hold up the caller.
To support this, non-blocking resource tracker
option/functions are now available.
* fn: test env var rename tweak
* fn: fixup merge
* fn: rebase test fix
* fn: merge fixup
* fn: test tweak down to 70MB for 128MB total
* fn: refactor token creation and use broadcast regardless
* fn: nb description
* fn: bugfix
* fn: experimental prefork recycle and other improvements
*) Recycle and do not use same pool container again option.
*) Two state processing: initializing versus ready (start-kill).
*) Ready state is exempt from rate limiter.
* fn: experimental prefork pool multiple network support
In order to exceed 1023 container (bridge port) limit, add
multiple networks:
for i in fn-net1 fn-net2 fn-net3 fn-net4
do
docker network create $i
done
to Docker startup, (eg. dind preentry.sh), then provide this
to prefork pool using:
export FN_EXPERIMENTAL_PREFORK_NETWORKS="fn-net1 fn-net2 fn-net3 fn-net4"
which should be able to spawn 1023 * 4 containers.
* fn: fixup tests for cfg move
* fn: add ipc and pid namespaces into prefork pooling
* fn: revert ipc and pid namespaces for now
Pid/Ipc opens up the function container to pause container.
* fn: perform call.End() after request is processed
call.End() performs several tasks in sequence; insert call,
insert log, (todo) remove mq entry, fireAfterCall callback, etc.
These currently add up to the request latency as return
from agent.Submit() is blocked on these. We also haven't been
able to apply any timeouts on these operations since they are
handled during request processing and it is hard to come up
with a strategy for it. Also the error cases
(couldn't insert call or log) are not propagated to the caller.
With this change, call.End() handling becomes asynchronous where
we perform these tasks after the request is done. This improves
latency and we no longer have to block the call on these operations.
The changes will also free up the agent slot token more quickly
and now we are no longer tied to hiccups in call.End().
Now, a timeout policy is also added to this which can
be adjusted with an env variable. (default 10 minutes)
This accentuates the fact that call/log/fireAfterCall are not
completed when request is done. So, there's a window there where
call is done, but call/log/fireAfterCall are not yet propagated.
This was already the case especially for error cases.
There's slight risk of accumulating call.End() operations in
case of hiccups in these log/call/callback systems.
* fn: address risk of overstacking of call.End() calls.
Oracle Linux 7.4 backported versions still having issues
with freezing/terminating containers. 17.10.0-ce seems like
a resonable lowest common denominator.
Added env FN_MAX_FS_SIZE_MB, which if defined and non-zero
is passed to docker as storage opt size. We do not validate
if this option is supported by docker currently. This is
because it's difficult to actually validate this since it
not only depends on storage driver and its backing filesystem,
but also the mount options used to mount that fs.
* fn: reorg agent config
*) Moving constants in agent to agent config, which helps
with testing, tuning.
*) Added max total cpu & memory for testing & clamping max
mem & cpu usage if needed.
* fn: adjust PipeIO time
* fn: for hot, cannot reliably test EndOfLogs in TestRouteRunnerExecution
* fn: enable failing test back
* fn: fortifying the stderr output
Modified limitWriter to discard excess data instead
of returning error, this is to allow stderr/stdout
pipes flowing to avoid head-of-line blocking or
data corruption in container stdout/stderr output stream.
*) I/O protocol parse issues should shutdown the container as the container
goes to inconsistent state between calls. (eg. next call may receive previous
calls left overs.)
*) Move ghost read/write code into io_utils in common.
*) Clean unused error from docker Wait()
*) We can catch one case in JSON, if there's remaining unparsed data in
decoder buffer, we can shut the container
*) stdout/stderr when container is not handling a request are now blocked if freezer is also enabled.
*) if a fatal err is set for slot, we do not requeue it and proceed to shutdown
*) added a test function for a few cases with freezer strict behavior
*) Limit response http body or json response size to FN_MAX_RESPONSE_SIZE (default unlimited)
*) If limits are exceeded 502 is returned with 'body too large' in the error message