Commit Graph

150 Commits

Author SHA1 Message Date
Tolga Ceylan
c89f1e5f9c fn: safer hand over between monitoring and main processing (#1316)
In runHot(), it's safer to use a separate channel between
monitoring go-routine and processing go-routine to handle
cancellations triggered by monitorin go-routine.
2018-11-15 16:57:16 -08:00
Tolga Ceylan
6eaf1578e6 fn: container initialization monitoring (#1288)
Container initialization phase consumes resource tracker
resources (token), during lengthy operations.
In order for agent stability/liveness, this phase has
to be evictable/cancelable and time bounded.

With this change, introducing a new system wide environment setting
to bound the time spent in container initialization phase. This phase
includes docker-pull, docker-create, docker-attach, docker-start
and UDS wait operations. This initialization period is also now
considered evictable.
2018-11-15 13:37:43 -08:00
Tolga Ceylan
fe2b9fb53d fn: cookie and driver api changes (#1312)
Now obsoleted driver.PrepareCookie() call handled image and
container creation. In agent, going forward we will need finer
grained control over the timeouts implied by the contexts.
For this reason, with this change, we split PrepareCookie()
into Validate/Pull/Create calls under Cookie interface.
2018-11-14 16:51:05 -08:00
Tolga Ceylan
25afb2f478 fn: remove tini option & env variable (#1301) 2018-11-07 12:35:19 -08:00
Tolga Ceylan
975b780695 fn: tests for hung and bad docker repo during docker-pull (#1298)
* fn: tests for hung and bad docker repo during docker-pull
2018-11-05 16:01:42 -08:00
Tolga Ceylan
5415b2bc38 fn: move UDS client into container to keep runHot() simpler (#1297) 2018-11-02 14:03:09 -07:00
Tolga Ceylan
ac17825a36 fn: add container state to eviction stats (#1296) 2018-11-02 13:32:13 -07:00
Tolga Ceylan
de9c2cbb63 fn: cleanup of docker timeouts and docker health check (#1292)
Moving the timeout management of various docker operations
to agent. This allows for finer control over what operation
should use. For instance, for pause/unpause our tolerance
is very low to avoid resource issues. For docker remove,
the consequences of failure will lead to potential agent
failure and therefore we wait up to 10 minute.
For cookie create/prepare (which includes docker-pull)
we cap this at 10 minutes by default.

With new UDS/FDK contract, health check is now obsoleted
as container advertise health using UDS availibility.
2018-11-01 14:22:47 -07:00
Tolga Ceylan
e227802512 fn: Remove error channel for container exits (#1287)
The channel is unnecessary and unreliable since exits
trigger I/O failure on UDS earlier than we detect
the exit.
2018-10-30 12:11:23 -07:00
Reed Allman
e13a6fd029 death to format (#1281)
* get rid of old format stuff, utils usage, fix up for fdk2.0 interface

* pure agent format removal, TODO remove format field, fix up all tests

* shitter's clogged

* fix agent tests

* start rolling through server tests

* tests compile, some failures

* remove json / content type detection on invoke/httptrigger, fix up tests

* remove hello, fixup system tests

the fucking status checker test just hangs and it's testing that it doesn't
work so the test passes but the test doesn't pass fuck life it's not worth it

* fix migration

* meh

* make dbhelper shut up about dbhelpers not being used

* move fail status at least into main thread, jfc

* fix status call to have FN_LISTENER

also turns off the stdout/stderr blocking between calls, because it's
impossible to debug without that (without syslog), now that stdout and stderr
go to the same place (either to host stderr or nowhere) and isn't used for
function output this shouldn't be a big fuss really

* remove stdin

* cleanup/remind: fixed bug where watcher would leak if container dies first

* silence system-test logs until fail, fix datastore tests

postgres does weird things with constraints when renaming tables, took the
easy way out

system-tests were loud as fuck and made you download a circleci text file of
the logs, made them only yell when they goof

* fix fdk-go dep for test image. fun

* fix swagger and remove test about format

* update all the gopkg files

* add back FN_FORMAT for fdks that assert things. pfft

* add useful error for functions that exit

this error is really confounding because containers can exit for all manner of
reason, we're just guessing that this is the most likely cause for now, and
this error message should very likely change or be removed from the client
path anyway (context.Canceled wasn't all that useful either, but anyway, I'd
been hunting for this... so found it). added a test to avoid being publicly
shamed for 1 line commits (beware...).
2018-10-26 10:43:04 -07:00
Tolga Ceylan
241d3fede1 fn: blocking mode should not emit 503 if can't evict (#1283) 2018-10-25 12:17:26 -07:00
Tolga Ceylan
bf41789af2 fn: eviction resource correction (#1282)
Previously evictor did not perform an eviction
if total cpu/mem of evictable containers was less
than requested cpu/mem. With this change, we
try to perform evictions based on actual needed cpu & mem
reported by resource tracker.
2018-10-25 11:10:19 +01:00
Tolga Ceylan
8fe1c9a07c fn: reduce logging for evicted containers (#1276)
Let's not log evicted containers which would be context
canceled.
2018-10-18 15:10:15 -07:00
Tolga Ceylan
f10fab21bc fn: fixup possible go-routine leak (#1265) 2018-10-05 17:02:18 -07:00
Reed Allman
e6eec186d0 small tweaks to dispatch (#1264)
* the dispatch span actually encloses dispatch and gives an accurate span now
* turning a call into an http request can't fail unless it's our fault, if
tests don't catch this, we don't deserve money
* moved http req creation inside of dispatch goroutine

there's further work to do cleaning up dispatch... removing the old formats
will make this slightly more clear, waiting for that. this was bugging me
anyway after seeing something else and was easy to fix up.
2018-10-05 16:32:01 -07:00
Tolga Ceylan
f132bba3fb fn: adding hot launcher eviction waiting (#1257)
If checkLaunch triggers evictions, it must wait
for these eviction to complete before returning.
Premature returning from checkLaunch will cause
checkLaunch to be called again by hot launcher.
This causes checkLaunch to receive an out of
capacity error and causes a 503.

The evictor is also improved with this PR and it
provides a slice of channels to wait on if evictions
are taking place.

Eviction token deletion is performed *after*
resource token close to ensure that once an
eviction is done, resource token is also free.
2018-10-01 16:16:29 -07:00
Tolga Ceylan
2e610a264a fn: remove async+sync seperation in resource tracker (#1254)
This simplifies resource tracker. Originally, logically we had
split the cpu/mem into two pools where a 20%
was kept specifically for sync calls to avoid
async calls dominating the system. However, resource
tracker should not handle such call prioritization.
Given the improvements to the evictor, I think
we can get rid of this code in resource tracker
for time being.
2018-10-01 10:46:32 -07:00
Dario Domizioli
4a862212a2 Limit connection pool size on UDS: we should only need one per container (#1252)
Hopefully this reduces FD usage even further.
2018-09-28 11:07:31 -07:00
Tolga Ceylan
a256d96f1e fn: keepalives timeout for UDS http-stream client (#1253) 2018-09-28 10:59:22 -07:00
Dario Domizioli
5aabdae26a Fix missing context on request sent through UDS (#1251) 2018-09-28 14:54:05 +01:00
Richard Connon
8d8c7df569 Log failure to close fsnotify handle (#1250) 2018-09-28 12:06:43 +01:00
Owen Cliffe
53d4be00ca Add checks for unix socket destination to avoid FDK tricking agent into talking to non-relative dirs (#1247)
* Add checks for unix socket destination to avoid leaking access to host OS

* style, typos
2018-09-27 18:20:03 -07:00
Reed Allman
319e0af41c we shouldn't log tokens, this shouldn't have been info either and was noisy (#1249)
* we shouldn't log tokens, this shouldn't have been info either and was noisy

* simplify logic too
2018-09-27 23:37:35 +01:00
Reed Allman
01b8e8679d HTTP trigger http-stream tests (#1241) 2018-09-26 13:25:48 +01:00
Tolga Ceylan
a994b57d9a fn: freezer/evictor adjustments (#1233)
*) removed faulty Idle state setter in runHot() since with
UDS wait, we need to wait until we can determine if a container
is idle. This is now moved to runHotReq().
*) evictor now more aggresive and no longer tied to pause
timer/configuration.
*) removed unnecessary optimization on timer=0 case for immediate
pause.
2018-09-20 14:13:11 -07:00
Vijay Krishnan
b2f85b70ea Use registry auth token from Call extensions to pull images (#1228) 2018-09-20 13:57:41 -07:00
Reed Allman
87e2562db9 Http stream invoke tests (#1231)
* adds parity level of testing http-stream invoke

the other formats had a gamut of tests, now http-stream does too. this makes
obvious some of its behaviors. some things changed / can change now that we
don't have pipes to worry about, the main one being that when containers blow
up now the uds client will get an EOF/ECONNREFUSED instead of the pipe getting
wedged up (allowing us to get the container error easily, previously). I made
my best 50% effort to make a reasonable error for when this happens (similar
to when http/json received garbage errors), open to ideas on verbiage / policy
there.

should be pretty straightforward. one thing to notice is that
http/json/default don't return our fancy new Fn-Http-Status or Fn-Http-H
headers... it's relatively easy to go add this to fdk-go just to test this,
but for invoke I'm really not sure we care (?) and for the gateway, the output
will be identical with the old formats bypassing the header decap. if anybody
has any feelings, feel free to express them.

* fix oomer up for new error

* Adding http header stripping to agent

Adding the header stripping into the agent, this should be low enough
that all routes to fns get treated the same.
2018-09-20 18:52:20 +01:00
Tolga Ceylan
a9bba2c3a8 fn: remove eviction timer to simplify eviction logic (#1223)
We tie container pausing with evictions, where if a container
is paused, then it is also eligible for eviction.
2018-09-18 15:20:39 -07:00
Reed Allman
3a82790d99 clean up hardcoded lsnr.sock refs, move iofs to /tmp (#1221)
* clean up hardcoded lsnr.sock refs

because what drivers.ContainerTask needs is another method, and we all know it

atoning for my sins the first time around. and yes, i refuse to use a cross
package exported constant (just think of the dep graphs)

* fix tests
2018-09-18 08:12:44 -07:00
Tolga Ceylan
893ff1e6fc fn: add missing dequeue in agent Submit (#1220) 2018-09-17 17:58:12 -07:00
Richard Connon
493790dbd2 Add tmpfs IOFS (#1212)
* Define an interface for IOFS handling. Add no-op and temporary directory implementations.

* Move IOFS stuff out into separate file, add basic tmpfs implementation for linux only

* Switch between directory and tmpfs based on platform and config

* Respect FN_IOFS_OPTS

* Make directory iofs default on all platforms

* At least try to clean up a bit on failure

* Add backout if IOFS creation fails

* Add comment about iofs.Close
2018-09-17 11:50:43 -07:00
Tolga Ceylan
b0c93dbd82 fn: new agent resource tracker metrics (#1215)
New metrics for agent resource tracker: CpuUsed, CpuAvail,
MemUsed, MemAvail.
2018-09-17 10:31:17 -07:00
Tom Coupland
d56a49b321 Remove V1 endpoints and Routes (#1210)
Largely a removal job, however many tests, particularly system level
ones relied on Routes. These have been migrated to use Fns.

* Add 410 response to swagger
* No app names in log tags
* Adding constraint in GetCall for FnID
* Adding test to check FnID is required on call
* Add fn_id to call selector
* Fix text in docker mem warning
* Correct buildConfig func name
* Test fix up
* Removing CPU setting from Agent test

CPU setting has been deprecated, but the code base is still riddled
with it. This just removes it from this layer. Really we need to
remove it from Call.

* Remove fn id check on calls
* Reintroduce fn id required on call
* Adding fnID to calls for execute test
* Correct setting of app id in middleware
* Removes root middlewares ability to redirect fun invocations
* Add over sized test check
* Removing call fn id check
2018-09-17 16:44:51 +01:00
Owen Cliffe
6567f6e8ef support configuration-based relative dirs (host and agent) for iofs (#1213)
* support configuration-based relative dirs (host and agent) for iofs mounts
* Send UDS requests as POST to <UDS>/call
2018-09-17 11:59:16 +01:00
Tolga Ceylan
aa13a40168 fn: agent/lb/runner error handling adjustments (#1214)
1) Early call validation and return due to cpu/mem impossible
to meet (eg. request cpu/mem larger than max-mem or max-cpu
on server) now emits HTTP Bad Request (400) instead of 503.
This case is most likely due to client/service configuration
and/or validation issue.
2) 'failed' metric is now removed. 'failed' versus 'errors'
were too confusing. 'errors' is now a catch all error case.
3) new 'canceled' counter for client side cancels.
4) 'server_busy' now covers more cases than it previously did.
2018-09-14 16:50:14 -07:00
Reed Allman
2b797a556a update docs with pro tips for fdk http stream people (#1211)
* update docs with pro tips for fdk http stream people

* fix bug where container could die before uds wait

we used to hang out for an hour. oopsie, thanks Owen
2018-09-14 16:54:18 +01:00
Reed Allman
3a9c48b8a3 http-stream format (#1202)
* POC code for inotify UDS-io-socket

* http-stream format

introducing the `http-stream` format support in fn. there are many details for
this, none of which can be linked from github :( -- docs are coming (I could
even try to add some here?). this is kinda MVP-ish level, but does not
implement the remaining spec, ie 'headers' fixing up / invoke fixing up. the
thinking being we can land this to test fdks / cli with and start splitting
work up on top of this. all other formats work the same as previous (no
breakage, only new stuff)

with the cli you can set `format: http-stream` and deploy, and then invoke a
function via the `http-stream` format. this uses unix domain socket (uds) on
the container instead of previous stdin/stdout, and fdks will have to support
this in a new fashion (will see about getting docs on here). fdk-go works,
which is here: https://github.com/fnproject/fdk-go/pull/30 . the output looks
the same as an http format function when invoking a function. wahoo.

there's some amount of stuff we can clean up here, enumerated:

* the cleanup of the sock files is iffy, high pri here

* permissions are a pain in the ass and i punted on dealing with them. you can
run `sudo ./fnserver` if running locally, it may/may not work in dind(?) ootb

* no pipe usage at all (yay), still could reduce buffer usage around the pipe
behavior, we could clean this up potentially before removal (and tests)

* my brain can’t figure out if dispatchOldFormats changes pipe behavior, but
tests work

* i marked XXX to do some clean up which will follow soon… need this to test fdk
tho so meh, any thoughts on those marked would be appreciated however (1 less
decision for me). mostly happy w/ general shape/plumbing tho

* there are no tests atm, this is a tricky dance indeed. attempts were made.
need to futz with the permission stuff before committing to adding any tests
here, which I don't like either. also, need to get the fdk-go based test image
updated according to the fdk-go, and there's a dance there too. rumba time..

* delaying the big big cleanup until we have good enough fdk support to kill
all the other formats.

open to ideas on how to maneuver landing stuff...

* fix unmount

* see if the tests work on ci...

* add call id header

* fix up makefile

* add configurable iofs opts

* add format file describing http-stream contract

* rm some cruft

* default iofs to /tmp, remove mounting

out of the box fn we can't mount. /tmp will provide a memory backed fs for us
on most systems, this will be fine for local developing and this can be
configured to be wherever for anyone that wants to make things more difficult
for themselves.

also removes the mounting, this has to be done as root. we can't do this in
the oss fn (short of requesting root, but no). in the future, we may want to
have a knob here to have a function that can be configured in fn that allows
further configuration here. since we don't know what we need in this dept
really, not doing that yet (it may be the case that it could be done
operationally outside of fn, eg, but not if each directory needs to be
configured itself, which seems likely, anyway...)

* add WIP note just in case...
2018-09-14 10:59:12 +01:00
Tolga Ceylan
4dcdb7d982 fn: paused and evicted container stats (#1209)
* fn: paused and evicted container stats

With this change, now stats reports paused state
as well as incidents of container exit due to evictions.

* fn: update/document state transitions in state tracker

There's no case of a transition moving from done to waiting. This
must be deprecated behavior.
2018-09-13 16:24:26 -07:00
Tolga Ceylan
586d5c4735 fn: make call.End() to blocking to reduce complexity (#1208)
agent/lb-agent/runner roles execute call.End() in the background
in some cases to reduce latency. With this change, we simplify this
and switch to non-background execution of call.End(). This fixes
hard to detect issues such as non-deterministic calculation of
call.CompletedAt or incomplete Call.Stats in runners.

Downstream projects if impacted by the now blocking call.End()
latency should take steps to handle this according to their requirements.
2018-09-13 11:28:11 +01:00
Tolga Ceylan
6226af933a fn: slot metrics/stats should be in stats/metrics removing logging (#1200)
Slot stats are too noisy. These should be (or shortly will be) in
metrics/stats/tracing.
2018-09-10 16:30:25 -07:00
Reed Allman
7638b31e11 use tini to run every container (#1195)
fixes #1101

additional context:

* this was introduced in docker 1.13 (1/2017), we require docker 17.10
(10/2017), this should not have any issues dependency-wise, as `docker-init`
is in the docker install from that point in time. unless explicitly removed,
it should be in the dind container we use as well...
* the PR that introduced this to docker is
https://github.com/moby/moby/pull/26061 for additional context
* it may be wise to put this through some paces, if anybody has any...
interesting... function containers. the tests seem to work fine, however, and
this shouldn't be something users have to think about (?) at all, just
something that we are doing. this isn't the default in docker for
compatibility reasons, which is maybe a yellow flag but I am not sure tbh
2018-09-04 15:41:30 -07:00
Tolga Ceylan
ad011fde7f fn: introducing docker-syslog driver as default logger (#1189)
* fn: introducing docker-syslog driver as default logger

With this change, fn-agent prefers RFC2454 docker-syslog driver
for logging stdout/stderr from containers. The advantage
of this is to offload it to docker itself instead of
streaming stderr along with stdout, which gets multiplexed
through single connection via docker-API.

The change will need support from FDKs in order to log
correct call-id and supress '\n' that splits syslog lines.
2018-08-29 13:08:02 -07:00
Peter Jausovec
35408ac949 Change the syslog format to use app_name instead of app_id (#1166)
* Add AppName to the models.Call, so we can include it in the syslog

* Replace the app_id with app_name
2018-08-09 12:06:19 -07:00
Reed Allman
409c104df3 make agent options/config pass lint checks (#1144) 2018-07-30 16:04:27 -07:00
Tolga Ceylan
1258baeb7f fn: agent eviction revisited (#1131)
* fn: agent eviction revisited

Previously, the hot-container eviction logic used
number of waiters of cpu/mem resources to decide to
evict a container. An ejection ticker used to wake up
its associated container every 1 sec to reasses system
load based on waiter count. However, this does not work
for non-blocking agent since there are no waiters for
non-blocking mode.

Background on blocking versus non-blocking agent:
    *) Blocking agent holds a request until the
    the request is serviced or client times out. It assumes
    the request can be eventually serviced when idle
    containers eject themselves or busy containers finish
    their work.
    *) Non-blocking mode tries to limit this wait time.
    However non-blocking agent has never been truly
    non-blocking. This simply means that we only
    make a request wait if we take some action in
    the system. Non-blocking agents are configured with
    a much higher hot poll frequency to make the system
    more responsive as well as to handle cases where an
    too-busy event is missed by the request. This is because
    the communication between hot-launcher and waiting
    requests are not 1-1 and lossy if another request
    arrives for the same slot queue and receives a
    too-busy response before the original request.

Introducing an evictor where each hot container can
register itself, if it is idle for more than 1 seconds.
Upon registry, these idle containers become eligible
for eviction.

In hot container launcher, in non-blocking mode,
before we attempt to emit a too-busy response, now
we attempt an evict. If this is successful, then
we wait some more. This could result in requests
waiting for more than they used to only if a
container was evicted. For blocking-mode, the
hot launcher uses hot-poll period to assess if
a request has waited for too long, then eviction
is triggered.
2018-07-19 15:04:15 -07:00
Tolga Ceylan
5dc5740a54 fn: runner status and docker load images (#1116)
* fn: runner status and docker load images

Introducing a function run for pure runner Status
calls. Previously, Status gRPC calls returned active
inflight request counts with the purpose of a simple
health checker. However this is not sufficient since
it does not show if agent or docker is healthy. With
this change, if pure runner is configured with a status
image, that image is executed through docker. The
call uses zero memory/cpu/tmpsize settings to ensure
resource tracker does not block it.

However, operators might not always have a docker
repository accessible/available for status image. Or
operators might not want the status to go over the
network. To allow such cases, and in general possibly
caching docker images, added a new environment variable
FN_DOCKER_LOAD_FILE. If this is set, fn-agent during
startup will load these images that were previously
saved with 'docker save' into docker.
2018-07-12 13:58:38 -07:00
Owen Cliffe
fff95e7992 Clean up/make consistent the APIs for registering core components, make Docker an optional component at compile time (#1111) 2018-07-07 10:37:19 +01:00
Owen Cliffe
b8b544ed25 HTTP Triggers hookup (#1086)
* Initial suypport for invoking tiggers

* dupe method

* tighten server constraints

* runner tests not working yet

* basic route tests passing

* post rebase fixes

* add hybrid support for trigger invoke and tests

* consoloidate all hybrid evil into one place

* cleanup and make triggers unique by source

* fix oops with Agent

* linting

* review fixes
2018-07-05 12:56:07 -05:00
Reed Allman
51ff7caeb2 Bye bye openapi (#1081)
* add DateTime sans mgo

* change all uses of strfmt.DateTime to common.DateTime, remove test strfmt usage

* remove api tests, system-test dep on api test

multiple reasons to remove the api tests:

* awkward dependency with fn_go meant generating bindings on a branched fn to
vendor those to test new stuff. this is at a minimum not at all intuitive,
worth it, nor a fun way to spend the finite amount of time we have to live.
* api tests only tested a subset of functionality that the server/ api tests
already test, and we risk having tests where one tests some thing and the
other doesn't. let's not. we have too many test suites as it is, and these
pretty much only test that we updated the fn_go bindings, which is actually a
hassle as noted above and the cli will pretty quickly figure out anyway.
* fn_go relies on openapi, which relies on mgo, which is deprecated and we'd
like to remove as a dependency. openapi is a _huge_ dep built in a NIH
fashion, that cannot simply remove the mgo dep as users may be using it.
we've now stolen their date time and otherwise killed usage of it in fn core,
for fn_go it still exists but that's less of a problem.

* update deps

removals:

* easyjson
* mgo
* go-openapi
* mapstructure
* fn_go
* purell
* go-validator

also, had to lock docker. we shouldn't use docker on master anyway, they
strongly advise against that. had no luck with latest version rev, so i locked
it to what we were using before. until next time.

the rest is just playing dep roulette, those end up removing a ton tho

* fix exec test to work

* account for john le cache
2018-06-21 11:09:16 -07:00
Tolga Ceylan
881a0ba1db fn: agent call overrider (#1080)
Similar to LB Agent call overrider, this PR adds Agent overrider
for Agents to modify/analyze a Call/Extensions during GetCall().
2018-06-20 16:21:09 -07:00