Commit Graph

329 Commits

Author SHA1 Message Date
Tolga Ceylan
b6a5b02e45 fn: logging/context improvements for runner status calls (#1345)
* fn: logging/context improvements for runner status calls

Avoid blocking calls to runStatusCall() to make sure the gRPC
context can be cancelled or timed out. This is unlikely to be an
issue, but a blocked runStatusCall() while gRPC is cancelled
is a hard case to follow mentally. The new flow is a bit
easier to follow.

Log all error cases in Status() gRPC entry point including
client side cancellations.
2018-12-07 15:26:06 -08:00
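The non-blocking flow described in #1345 can be sketched roughly as follows. This is a minimal illustration, not the agent's code: callWithContext, statusResult and demoTimeout are hypothetical names, and the slow function merely stands in for a blocked runStatusCall().

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// statusResult is a hypothetical stand-in for whatever a status call returns.
type statusResult struct {
	status string
	err    error
}

// callWithContext runs fn in its own goroutine and returns as soon as either
// fn finishes or ctx is done, so the caller is never stuck behind a slow fn.
func callWithContext(ctx context.Context, fn func() (string, error)) (string, error) {
	// Buffered so the goroutine can always send and exit, even if the
	// caller already returned because ctx was cancelled.
	done := make(chan statusResult, 1)
	go func() {
		s, err := fn()
		done <- statusResult{status: s, err: err}
	}()

	select {
	case r := <-done:
		return r.status, r.err
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// demoTimeout abandons a slow call once a short deadline passes.
func demoTimeout() error {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()
	_, err := callWithContext(ctx, func() (string, error) {
		time.Sleep(500 * time.Millisecond) // simulated blocked status call
		return "ok", nil
	})
	return err
}

func main() {
	err := demoTimeout()
	fmt.Println(errors.Is(err, context.DeadlineExceeded))
}
```

The buffered result channel is the design point: the worker goroutine can finish and exit after the caller has given up, so a cancelled request does not leak it.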
Tolga Ceylan
1f17edf78a fn: minor error code simplification (#1341)
There is no need to propagate CapacityFull to waitHot(). Instead
waitHot() can receive 503 which is easier to follow (and less
error prone) in handleCallEnd().
2018-12-05 23:59:47 -08:00
Tolga Ceylan
a9a828cc40 fn: error handling updates (#1337)
* docker-pull timeout is now a 504 which classifies it as a
service error. Avoid using 503 to make sure LB does not retry.

* Only applicable to detached mode: a timeout on LB is
now an ErrServiceReservationFailure (500). In detached mode,
this is unlikely to make it back to a client and it is mostly
for documentation/metrics purposes.
  
* For Triggers, avoid scrubbing service code.
2018-12-03 14:58:48 -08:00
Tolga Ceylan
179b5e0e4c fn: uds init wait latency metric in prometheus (#1336)
* fn: uds init wait latency metric in prometheus

Adding tracker for UDS initialization during container start. This
complements our existing container state latency trackers and
docker-api latency trackers.
2018-11-30 14:09:01 -08:00
Tolga Ceylan
df53250958 fn: latency metrics for various call states (#1332)
* fn: latency metrics for various call states

This complements the API latency metrics available
on LB agent. In this case, we would like to measure
calls that have finished with the following status:
     "completed"
     "canceled"
     "timeouts"
     "errors"
     "server_busy"
and while measuring this latency, we subtract the
amount of time actual function execution took. This
is not precise, but an approximation mostly suitable
for trending.

Going forward, we could also subtract UDS wait time and/or
docker pull latency from this latency as an enhancement
to this PR.
2018-11-30 13:20:59 -08:00
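The approximation described in #1332 (total call latency minus function execution time) can be sketched as below. The function name overheadLatency and the clamping at zero are assumptions made for illustration; the commit message does not say how a negative difference is handled.

```go
package main

import (
	"fmt"
	"time"
)

// overheadLatency returns the part of a call's total latency that was not
// spent executing the function itself. Clamping at zero is an assumption of
// this sketch, guarding against clock skew between the two measurements.
func overheadLatency(total, exec time.Duration) time.Duration {
	if exec > total {
		return 0
	}
	return total - exec
}

func main() {
	total := 1200 * time.Millisecond // time from call arrival to completion
	exec := 900 * time.Millisecond   // time the function actually ran
	fmt.Println(overheadLatency(total, exec))
}
```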
Tolga Ceylan
f44a44921e fn: status call failures should be logged (#1331)
Added logging for status call failures.
2018-11-29 16:08:48 -08:00
Krister Johansen
af59d19d24 Add support for counting kdumps in pure-runner. (#1309) 2018-11-21 16:26:30 -08:00
Reed Allman
bf2f96cbeb actually disable stdout/stderr. stdout>stderr (#1321)
* actually disable stdout/stderr. stdout>stderr

* for pure runner this turns it off for real this time.
* this also just makes the agent container type send stdout to stderr, since
we're not using stdout for function output anymore this is pretty
straightforward hopefully.
* I added a panic and some type checking printlns to ensure this is true for
pure_runner, both stdout and stderr are off, also added a unit test from agent
to ensure this behavior from its container type, which pure_runner utilizes
(no integration test though)
* tests ensure that logs still work if not trying to disable them (full agent)

* handle non ghost swapping
2018-11-21 15:04:53 -06:00
Tolga Ceylan
69131420bf fn: enforce container/FDK contract in dispatch (#1314)
1) FDK returned 200/502/504 codes now handled.
2) Container init timeout now defaults to 5 seconds.
2018-11-20 13:50:53 -08:00
Shreya Garge
91f6ef3402 added context for runnerpool interface (#1320)
* added context for runnerpool interface

* added context for runnerpool interface
2018-11-20 17:02:47 +00:00
Tolga Ceylan
f797cb933f fn: remove error formatting in fireBeforeCall/fireAfterCall (#1317)
fmt.Errorf strips API errors in models, we should propagate
the error directly.
2018-11-19 12:23:59 -08:00
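Why reformatting loses the API error (#1317) can be shown with a small sketch. The apiError type and the codeOf helper are hypothetical and only mimic an error that carries an HTTP status code; they are not the types from the project's models package.

```go
package main

import "fmt"

// apiError is a hypothetical error type that carries an HTTP status code.
type apiError struct {
	code int
	msg  string
}

func (e apiError) Error() string { return e.msg }
func (e apiError) Code() int     { return e.code }

// codeOf returns the status code carried by err, or 500 if err carries none.
func codeOf(err error) int {
	if c, ok := err.(interface{ Code() int }); ok {
		return c.Code()
	}
	return 500
}

// reformat rebuilds the error from its text, as fmt.Errorf with %v does.
// The result is a plain error: the original type and its Code method are gone.
func reformat(err error) error {
	return fmt.Errorf("before call: %v", err)
}

func main() {
	orig := apiError{code: 404, msg: "not found"}
	fmt.Println(codeOf(orig), codeOf(reformat(orig)))
}
```

Returning the original error unchanged keeps the status code visible to whatever maps errors to HTTP responses.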
Reed Allman
29fdbc9b49 disable pure runner logging (#1313)
* disable pure runner logging

there's a racey bug where the logger is being written to when it's closing,
but this led to figuring out that we don't need the logger at all in pure
runner really, the syslog thing isn't an in process fn thing and we don't need
the logs from attach for anything further in pure runner. so this disables the
logger at the docker level, to save sending the bytes back over the wire, this
could be a nice little performance bump too. of course, with this, it means
agents can be configured to not log debug or have logs to store at all, and
not a lot of guards have been put on this for 'full' agent mode while it hangs
on a cross feeling the breeze awaiting its demise - the default configuration
remains the same, and no behavior changes in 'full' agent are here.

it was a lot smoother to make the noop than to try to plumb in 'nil' for
stdout/stderr, this has a lot lower risk of nil panic issues for the same
effect, though it's not perfect relying on type casting, plumbing in an
interface to check has the same issues (loss of interface adherence for any
decorator), so this seems ok. defaulting to not having a logger was similarly
painful, and ended up with this. but open to ideas.

* replace usage of old null reader writer impl

* make Read return io.EOF for io.Copy usage
2018-11-16 12:56:49 -06:00
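The last bullet of #1313 (Read returning io.EOF for io.Copy usage) can be illustrated with a minimal no-op reader/writer. The type name noopReadWriter and the drain helper are hypothetical; the sketch only shows why io.EOF matters, not how the agent wires its logger.

```go
package main

import (
	"fmt"
	"io"
)

// noopReadWriter discards everything written to it and reads as empty.
type noopReadWriter struct{}

// Read reports end of stream immediately. Returning (0, nil) forever would
// keep io.Copy looping; io.EOF lets it finish cleanly.
func (noopReadWriter) Read(p []byte) (int, error) { return 0, io.EOF }

// Write pretends to accept all bytes so writers never see a short write.
func (noopReadWriter) Write(p []byte) (int, error) { return len(p), nil }

// drain copies from the no-op reader until it reports EOF.
func drain() (int64, error) {
	return io.Copy(io.Discard, noopReadWriter{})
}

func main() {
	n, err := drain()
	fmt.Println(n, err)
}
```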
Tolga Ceylan
c89f1e5f9c fn: safer hand over between monitoring and main processing (#1316)
In runHot(), it's safer to use a separate channel between the
monitoring go-routine and the processing go-routine to handle
cancellations triggered by the monitoring go-routine.
2018-11-15 16:57:16 -08:00
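The hand-over idea in #1316 can be sketched as a dedicated cancellation channel that the monitoring go-routine writes to and the processing go-routine selects on. All names here (process, demo, errEvicted) are hypothetical, and the real runHot() is considerably more involved.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errEvicted is a hypothetical cancellation reason raised by the monitor.
var errEvicted = errors.New("cancelled by monitor")

// process handles work items until the work channel is closed or the
// monitor signals a cancellation on its own dedicated channel.
func process(work <-chan string, cancel <-chan error) (int, error) {
	handled := 0
	for {
		select {
		case err := <-cancel:
			return handled, err
		case _, ok := <-work:
			if !ok {
				return handled, nil
			}
			handled++
		}
	}
}

// demo starts a monitor that cancels processing shortly after start.
func demo() (int, error) {
	work := make(chan string)     // no work arrives in this demo
	cancel := make(chan error, 1) // buffered: the monitor never blocks on send
	go func() {
		time.Sleep(10 * time.Millisecond)
		cancel <- errEvicted
	}()
	return process(work, cancel)
}

func main() {
	n, err := demo()
	fmt.Println(n, err)
}
```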
Tolga Ceylan
6eaf1578e6 fn: container initialization monitoring (#1288)
The container initialization phase consumes resource tracker
resources (token) during lengthy operations.
For agent stability/liveness, this phase has
to be evictable/cancelable and time bounded.

This change introduces a new system-wide environment setting
to bound the time spent in the container initialization phase. This phase
includes docker-pull, docker-create, docker-attach, docker-start
and UDS wait operations. This initialization period is also now
considered evictable.
2018-11-15 13:37:43 -08:00
Tolga Ceylan
fe2b9fb53d fn: cookie and driver api changes (#1312)
The now-obsolete driver.PrepareCookie() call handled image and
container creation. Going forward, the agent will need finer-grained
control over the timeouts implied by the contexts.
For this reason, this change splits PrepareCookie()
into Validate/Pull/Create calls under the Cookie interface.
2018-11-14 16:51:05 -08:00
Tolga Ceylan
8ee4c1098b fn: correct typo in docker command tag (#1311) 2018-11-14 11:38:48 -08:00
Eric Fode
90e39c8fd3 initial addition of the diskfree op (#1308)
* initial addition of the diskfree op

fixing up some typos

last of fmt errors

* fixed up some feedbacks
2018-11-14 09:22:07 -08:00
Andrea Rosa
182db94fad Feature/acksync response writer (#1267)
This implements a "detached" mechanism to get an ack from the runner
once it actually starts to run a function. In this scenario the response
returned back is just a 202 if we placed the function in a specific
time-frame. If we hit some errors or we fail to place the fn in time we
return back different errors.
2018-11-09 10:25:43 -08:00
Tolga Ceylan
25afb2f478 fn: remove tini option & env variable (#1301) 2018-11-07 12:35:19 -08:00
Tolga Ceylan
975b780695 fn: tests for hung and bad docker repo during docker-pull (#1298)
* fn: tests for hung and bad docker repo during docker-pull
2018-11-05 16:01:42 -08:00
Tolga Ceylan
5415b2bc38 fn: move UDS client into container to keep runHot() simpler (#1297) 2018-11-02 14:03:09 -07:00
Tolga Ceylan
ac17825a36 fn: add container state to eviction stats (#1296) 2018-11-02 13:32:13 -07:00
Tolga Ceylan
de9c2cbb63 fn: cleanup of docker timeouts and docker health check (#1292)
This moves the timeout management of various docker operations
to the agent, which allows finer control over the timeout each
operation should use. For instance, for pause/unpause our tolerance
is very low to avoid resource issues. For docker remove,
a failure can lead to agent
failure, and therefore we wait up to 10 minutes.
For cookie create/prepare (which includes docker-pull)
we cap this at 10 minutes by default.

With the new UDS/FDK contract, the health check is now obsolete
as containers advertise health using UDS availability.
2018-11-01 14:22:47 -07:00
Tolga Ceylan
e227802512 fn: Remove error channel for container exits (#1287)
The channel is unnecessary and unreliable since exits
trigger I/O failure on UDS earlier than we detect
the exit.
2018-10-30 12:11:23 -07:00
Reed Allman
e13a6fd029 death to format (#1281)
* get rid of old format stuff, utils usage, fix up for fdk2.0 interface

* pure agent format removal, TODO remove format field, fix up all tests

* shitter's clogged

* fix agent tests

* start rolling through server tests

* tests compile, some failures

* remove json / content type detection on invoke/httptrigger, fix up tests

* remove hello, fixup system tests

the fucking status checker test just hangs and it's testing that it doesn't
work so the test passes but the test doesn't pass fuck life it's not worth it

* fix migration

* meh

* make dbhelper shut up about dbhelpers not being used

* move fail status at least into main thread, jfc

* fix status call to have FN_LISTENER

also turns off the stdout/stderr blocking between calls, because it's
impossible to debug without that (without syslog), now that stdout and stderr
go to the same place (either to host stderr or nowhere) and isn't used for
function output this shouldn't be a big fuss really

* remove stdin

* cleanup/remind: fixed bug where watcher would leak if container dies first

* silence system-test logs until fail, fix datastore tests

postgres does weird things with constraints when renaming tables, took the
easy way out

system-tests were loud as fuck and made you download a circleci text file of
the logs, made them only yell when they goof

* fix fdk-go dep for test image. fun

* fix swagger and remove test about format

* update all the gopkg files

* add back FN_FORMAT for fdks that assert things. pfft

* add useful error for functions that exit

this error is really confounding because containers can exit for all manner of
reason, we're just guessing that this is the most likely cause for now, and
this error message should very likely change or be removed from the client
path anyway (context.Canceled wasn't all that useful either, but anyway, I'd
been hunting for this... so found it). added a test to avoid being publicly
shamed for 1 line commits (beware...).
2018-10-26 10:43:04 -07:00
Tolga Ceylan
241d3fede1 fn: blocking mode should not emit 503 if can't evict (#1283) 2018-10-25 12:17:26 -07:00
Tolga Ceylan
bf41789af2 fn: eviction resource correction (#1282)
Previously evictor did not perform an eviction
if total cpu/mem of evictable containers was less
than requested cpu/mem. With this change, we
try to perform evictions based on actual needed cpu & mem
reported by resource tracker.
2018-10-25 11:10:19 +01:00
Tolga Ceylan
8fe1c9a07c fn: reduce logging for evicted containers (#1276)
Let's not log evicted containers which would be context
canceled.
2018-10-18 15:10:15 -07:00
Tolga Ceylan
44e366d195 fn: add details to runner finish logging (#1271)
Adding http-status/fn-http-status details in runner finish
logger.
2018-10-15 12:15:08 -07:00
Tolga Ceylan
f10fab21bc fn: fixup possible go-routine leak (#1265) 2018-10-05 17:02:18 -07:00
Reed Allman
e6eec186d0 small tweaks to dispatch (#1264)
* the dispatch span actually encloses dispatch and gives an accurate span now
* turning a call into an http request can't fail unless it's our fault, if
tests don't catch this, we don't deserve money
* moved http req creation inside of dispatch goroutine

there's further work to do cleaning up dispatch... removing the old formats
will make this slightly more clear, waiting for that. this was bugging me
anyway after seeing something else and was easy to fix up.
2018-10-05 16:32:01 -07:00
Tolga Ceylan
29dcf0a791 fn: adding docker events to stats (#1262)
Streaming docker events is useful as we can record/capture some
asynchronous containers events such as out-of-memory. For now,
we record these in opencensus/prometheus stats.
2018-10-04 18:54:09 -07:00
Tolga Ceylan
f132bba3fb fn: adding hot launcher eviction waiting (#1257)
If checkLaunch triggers evictions, it must wait
for these evictions to complete before returning.
Premature returning from checkLaunch will cause
checkLaunch to be called again by hot launcher.
This causes checkLaunch to receive an out of
capacity error and causes a 503.

The evictor is also improved with this PR and it
provides a slice of channels to wait on if evictions
are taking place.

Eviction token deletion is performed *after*
resource token close to ensure that once an
eviction is done, resource token is also free.
2018-10-01 16:16:29 -07:00
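Waiting on a slice of channels, as #1257 describes for in-flight evictions, might look like the sketch below. The function waitForEvictions and the convention that a closed channel means "eviction finished" are assumptions of this example, not the evictor's confirmed API.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForEvictions blocks until every channel in pending has been closed,
// or until ctx is done, whichever comes first.
func waitForEvictions(ctx context.Context, pending []<-chan struct{}) error {
	for _, ch := range pending {
		select {
		case <-ch: // this eviction has completed
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}

// demoWait simulates three evictions that complete at different times.
func demoWait() error {
	var pending []<-chan struct{}
	for i := 0; i < 3; i++ {
		ch := make(chan struct{})
		pending = append(pending, ch)
		go func(delay time.Duration) {
			time.Sleep(delay)
			close(ch) // signal that this eviction is done
		}(time.Duration(i+1) * 5 * time.Millisecond)
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return waitForEvictions(ctx, pending)
}

func main() {
	fmt.Println(demoWait())
}
```

Waiting on the channels in order is enough here because the caller needs all of them to finish; the total wait equals that of the slowest eviction regardless of order.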
Tolga Ceylan
2e610a264a fn: remove async+sync separation in resource tracker (#1254)
This simplifies the resource tracker. Originally, we had logically
split the cpu/mem into two pools, where 20%
was kept specifically for sync calls to avoid
async calls dominating the system. However, the resource
tracker should not handle such call prioritization.
Given the improvements to the evictor, I think
we can get rid of this code in the resource tracker
for the time being.
2018-10-01 10:46:32 -07:00
Dario Domizioli
4a862212a2 Limit connection pool size on UDS: we should only need one per container (#1252)
Hopefully this reduces FD usage even further.
2018-09-28 11:07:31 -07:00
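One way to cap the pool as #1252 describes is to limit idle connections on an http.Transport that dials a Unix domain socket. This is a sketch under assumptions: newUDSClient is a hypothetical constructor and the socket path is made up; the agent's actual client setup may differ.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
)

// newUDSClient returns an HTTP client that talks to a single Unix domain
// socket and keeps at most one idle connection around.
func newUDSClient(sockPath string) *http.Client {
	dialer := &net.Dialer{}
	return &http.Client{
		Transport: &http.Transport{
			// Ignore the network/address derived from the URL and always
			// dial the container's socket instead.
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return dialer.DialContext(ctx, "unix", sockPath)
			},
			MaxIdleConns:        1,
			MaxIdleConnsPerHost: 1,
		},
	}
}

func main() {
	c := newUDSClient("/tmp/example/lsnr.sock") // hypothetical path
	tr := c.Transport.(*http.Transport)
	fmt.Println(tr.MaxIdleConns, tr.MaxIdleConnsPerHost)
}
```

Each pooled connection holds a file descriptor, so with one client per container a pool of one keeps FD usage proportional to the number of containers.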
Tolga Ceylan
a256d96f1e fn: keepalives timeout for UDS http-stream client (#1253) 2018-09-28 10:59:22 -07:00
Dario Domizioli
5aabdae26a Fix missing context on request sent through UDS (#1251) 2018-09-28 14:54:05 +01:00
Richard Connon
8d8c7df569 Log failure to close fsnotify handle (#1250) 2018-09-28 12:06:43 +01:00
Owen Cliffe
53d4be00ca Add checks for unix socket destination to avoid FDK tricking agent into talking to non-relative dirs (#1247)
* Add checks for unix socket destination to avoid leaking access to host OS

* style, typos
2018-09-27 18:20:03 -07:00
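A destination check in the spirit of #1247 could be sketched as follows. The rule used here (the target must be a bare file name in the same directory) is an assumption made for illustration, and isSafeSocketTarget is a hypothetical helper; the commit itself only states that non-relative destinations must be rejected.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// isSafeSocketTarget reports whether target names a file in the current
// directory. Absolute paths and anything containing a directory component
// (including "..") are rejected. This policy is an assumption of the sketch.
func isSafeSocketTarget(target string) bool {
	if target == "" || filepath.IsAbs(target) {
		return false
	}
	clean := filepath.Clean(target)
	if clean == "." || clean == ".." {
		return false
	}
	// A bare file name is its own base; anything else has a directory part.
	return clean == filepath.Base(clean)
}

func main() {
	for _, t := range []string{"fn.sock", "./fn.sock", "../docker.sock", "/var/run/docker.sock"} {
		fmt.Println(t, isSafeSocketTarget(t))
	}
}
```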
Reed Allman
319e0af41c we shouldn't log tokens, this shouldn't have been info either and was noisy (#1249)
* we shouldn't log tokens, this shouldn't have been info either and was noisy

* simplify logic too
2018-09-27 23:37:35 +01:00
Reed Allman
01b8e8679d HTTP trigger http-stream tests (#1241) 2018-09-26 13:25:48 +01:00
Tom Coupland
d454ff9aa4 Initial Refactor (#1234)
* Initial Refactor

Removing the repeated logic exposed some problems with the response
writers.

Previously, the trigger writer was overlaid on part of the header
writing. The main invoke block was writing into the different levels of the
overlays at different points in the logic.

Instead, by extending the types and embedded structs, the writer is
more transparent. So, at the end of the flow it goes over all the
headers available and removes our prefixes. This lets the invoke logic
just write to the top level.

Going to continue after lunch to try and remove some of the layers and
param passing.

* Try and repeat concurrency failure

* Nested FromHTTPFnRequest inside FromHTTPTriggerRequest

* Consolidate buffer pooling logic

* go fmt yourself

* fix import
2018-09-24 12:20:30 +01:00
Tolga Ceylan
a994b57d9a fn: freezer/evictor adjustments (#1233)
*) removed faulty Idle state setter in runHot() since with
UDS wait, we need to wait until we can determine if a container
is idle. This is now moved to runHotReq().
*) evictor is now more aggressive and no longer tied to pause
timer/configuration.
*) removed unnecessary optimization on timer=0 case for immediate
pause.
2018-09-20 14:13:11 -07:00
Vijay Krishnan
b2f85b70ea Use registry auth token from Call extensions to pull images (#1228) 2018-09-20 13:57:41 -07:00
Owen Cliffe
d9b74cfd14 Gateway trigger support (#1225)
* initial gateway trigger support

* Pass Content-Type down to wrapped writer

* Move req header setting

* Adding call id to responses

* add dupe Fn-Call-Id headers
2018-09-20 11:30:28 -07:00
Reed Allman
87e2562db9 Http stream invoke tests (#1231)
* adds parity level of testing http-stream invoke

the other formats had a gamut of tests, now http-stream does too. this makes
obvious some of its behaviors. some things changed / can change now that we
don't have pipes to worry about, the main one being that when containers blow
up now the uds client will get an EOF/ECONNREFUSED instead of the pipe getting
wedged up (allowing us to get the container error easily, previously). I made
my best 50% effort to make a reasonable error for when this happens (similar
to when http/json received garbage errors), open to ideas on verbiage / policy
there.

should be pretty straightforward. one thing to notice is that
http/json/default don't return our fancy new Fn-Http-Status or Fn-Http-H
headers... it's relatively easy to go add this to fdk-go just to test this,
but for invoke I'm really not sure we care (?) and for the gateway, the output
will be identical with the old formats bypassing the header decap. if anybody
has any feelings, feel free to express them.

* fix oomer up for new error

* Adding http header stripping to agent

Adding the header stripping into the agent, this should be low enough
that all routes to fns get treated the same.
2018-09-20 18:52:20 +01:00
Reed Allman
485fa465a0 Stream test commence (#1224)
* initial invoke testing

this assures that Content-Type and Fn-Http-Status are set for an http-stream
function. it took some fixing up of the test utils code for the plumbing to
work, looking forward to deleting most stuff in fn-test-utils.go file around
each format -- had to update fdk-go to latest for http-stream support. this
only adds 1 test, since there's some machinery here, and would like to unblock
working on the http gateway simultaneously while adding a full suite of invoke
tests (this work can be parallelized)...

i added debug logs back to the debugging output. turns out this is useful, but
it can get noisy (only when things fail, hopefully).

* fix oom tests?
2018-09-19 08:48:48 -07:00
Tolga Ceylan
a9bba2c3a8 fn: remove eviction timer to simplify eviction logic (#1223)
We tie container pausing with evictions, where if a container
is paused, then it is also eligible for eviction.
2018-09-18 15:20:39 -07:00
Reed Allman
3a82790d99 clean up hardcoded lsnr.sock refs, move iofs to /tmp (#1221)
* clean up hardcoded lsnr.sock refs

because what drivers.ContainerTask needs is another method, and we all know it

atoning for my sins the first time around. and yes, i refuse to use a cross
package exported constant (just think of the dep graphs)

* fix tests
2018-09-18 08:12:44 -07:00
Tolga Ceylan
893ff1e6fc fn: add missing dequeue in agent Submit (#1220) 2018-09-17 17:58:12 -07:00