Commit Graph

245 Commits

Author SHA1 Message Date
Tolga Ceylan
1258baeb7f fn: agent eviction revisited (#1131)
* fn: agent eviction revisited

Previously, the hot-container eviction logic used
number of waiters of cpu/mem resources to decide to
evict a container. An ejection ticker used to wake up
its associated container every 1 sec to reasses system
load based on waiter count. However, this does not work
for non-blocking agent since there are no waiters for
non-blocking mode.

Background on blocking versus non-blocking agent:
    *) Blocking agent holds a request until the
    the request is serviced or client times out. It assumes
    the request can be eventually serviced when idle
    containers eject themselves or busy containers finish
    their work.
    *) Non-blocking mode tries to limit this wait time.
    However non-blocking agent has never been truly
    non-blocking. This simply means that we only
    make a request wait if we take some action in
    the system. Non-blocking agents are configured with
    a much higher hot poll frequency to make the system
    more responsive as well as to handle cases where an
    too-busy event is missed by the request. This is because
    the communication between hot-launcher and waiting
    requests are not 1-1 and lossy if another request
    arrives for the same slot queue and receives a
    too-busy response before the original request.

Introducing an evictor where each hot container can
register itself, if it is idle for more than 1 seconds.
Upon registry, these idle containers become eligible
for eviction.

In hot container launcher, in non-blocking mode,
before we attempt to emit a too-busy response, now
we attempt an evict. If this is successful, then
we wait some more. This could result in requests
waiting for more than they used to only if a
container was evicted. For blocking-mode, the
hot launcher uses hot-poll period to assess if
a request has waited for too long, then eviction
is triggered.
2018-07-19 15:04:15 -07:00
Tolga Ceylan
e9d5221e15 fn: Status gRPC call timeout handling (#1125)
Status calls should not directly use client
gRPC context deadlines/timeouts during Status
execution. Status should allow plenty of time
for the scheduler agent and docker to run and
emit useful error information.

Setting this timeout to 60 seconds, which should
surface disk I/O, docker, etc. issues.
2018-07-16 18:33:23 -07:00
Tolga Ceylan
564db4e9d2 fn: Status should expose if data was served from cache. (#1123)
This is useful in scenarios where gRPC client might want
to reliably observe/report the status latency metrics
and remove any possible duplicates. If the status query
was served from cache, then these latencies show last
execution latency.
2018-07-13 17:35:00 -07:00
Tolga Ceylan
5dc5740a54 fn: runner status and docker load images (#1116)
* fn: runner status and docker load images

Introducing a function run for pure runner Status
calls. Previously, Status gRPC calls returned active
inflight request counts with the purpose of a simple
health checker. However this is not sufficient since
it does not show if agent or docker is healthy. With
this change, if pure runner is configured with a status
image, that image is executed through docker. The
call uses zero memory/cpu/tmpsize settings to ensure
resource tracker does not block it.

However, operators might not always have a docker
repository accessible/available for status image. Or
operators might not want the status to go over the
network. To allow such cases, and in general possibly
caching docker images, added a new environment variable
FN_DOCKER_LOAD_FILE. If this is set, fn-agent during
startup will load these images that were previously
saved with 'docker save' into docker.
2018-07-12 13:58:38 -07:00
Owen Cliffe
fff95e7992 Clean up/make consistent the APIs for registering core components, make Docker an optional component at compile time (#1111) 2018-07-07 10:37:19 +01:00
Owen Cliffe
b8b544ed25 HTTP Triggers hookup (#1086)
* Initial suypport for invoking tiggers

* dupe method

* tighten server constraints

* runner tests not working yet

* basic route tests passing

* post rebase fixes

* add hybrid support for trigger invoke and tests

* consoloidate all hybrid evil into one place

* cleanup and make triggers unique by source

* fix oops with Agent

* linting

* review fixes
2018-07-05 12:56:07 -05:00
Tolga Ceylan
300fcd7d92 fn: applications should be aware of reserved writable space (#1083)
Similar to FN_MEMORY, we pass FN_TMPSIZE to function config.
2018-07-03 16:04:48 -07:00
Tolga Ceylan
317de18e6b fn: lb-agent: Add Runner Scheduler/Execution Stats (#1107)
LB agent reports lb placer latency. It should also report
how long it took for the runner to initiate the call as
well as execution time inside the container if the runner
has accepted (committed) to the call.
2018-07-02 17:15:43 -07:00
Tom Coupland
3ebff051a4 Add support for Function and Trigger domain objects (#1060)
Vast commit, includes:

 * Introduces the Trigger domain entity.
 * Introduces the Fns domain entity.
 * V2 of the API for interacting with the new entities in swaggerv2.yml
 * Adds v2 end points for Apps to support PUT updates.
 * Rewrites the datastore level tests into a new pattern.
 * V2 routes use entity ID over name as the path parameter.
2018-06-25 15:37:06 +01:00
Reed Allman
51ff7caeb2 Bye bye openapi (#1081)
* add DateTime sans mgo

* change all uses of strfmt.DateTime to common.DateTime, remove test strfmt usage

* remove api tests, system-test dep on api test

multiple reasons to remove the api tests:

* awkward dependency with fn_go meant generating bindings on a branched fn to
vendor those to test new stuff. this is at a minimum not at all intuitive,
worth it, nor a fun way to spend the finite amount of time we have to live.
* api tests only tested a subset of functionality that the server/ api tests
already test, and we risk having tests where one tests some thing and the
other doesn't. let's not. we have too many test suites as it is, and these
pretty much only test that we updated the fn_go bindings, which is actually a
hassle as noted above and the cli will pretty quickly figure out anyway.
* fn_go relies on openapi, which relies on mgo, which is deprecated and we'd
like to remove as a dependency. openapi is a _huge_ dep built in a NIH
fashion, that cannot simply remove the mgo dep as users may be using it.
we've now stolen their date time and otherwise killed usage of it in fn core,
for fn_go it still exists but that's less of a problem.

* update deps

removals:

* easyjson
* mgo
* go-openapi
* mapstructure
* fn_go
* purell
* go-validator

also, had to lock docker. we shouldn't use docker on master anyway, they
strongly advise against that. had no luck with latest version rev, so i locked
it to what we were using before. until next time.

the rest is just playing dep roulette, those end up removing a ton tho

* fix exec test to work

* account for john le cache
2018-06-21 11:09:16 -07:00
Tolga Ceylan
881a0ba1db fn: agent call overrider (#1080)
Similar to LB Agent call overrider, this PR adds Agent overrider
for Agents to modify/analyze a Call/Extensions during GetCall().
2018-06-20 16:21:09 -07:00
Tolga Ceylan
e67d0e5f3f fn: Call extensions/overriding and more customization friendly docker driver (#1065)
In pure-runner and LB agent, service providers might want to set specific driver options.

For example, to add cpu-shares to functions, LB can add the information as extensions
to the Call and pass this via gRPC to runners. Runners then pick these extensions from
gRPC call and pass it to driver. Using a custom driver implementation, pure-runners can
process these extensions to modify docker.CreateContainerOptions.

To achieve this, LB agents can now be configured using a call overrider.

Pure-runners can be configured using a custom docker driver.

RunnerCall and Call interfaces both expose call extensions.

An example to demonstrate this is implemented in test/fn-system-tests/system_test.go
which registers a call overrider for LB agent as well as a simple custom docker driver.
In this example, LB agent adds a key-value to extensions and runners add this key-value
as an environment variable to the container.
2018-06-18 14:42:28 -07:00
Andrea Rosa
e637661ea2 Adding a way to inject a request ID (#1046)
* Adding a way to inject a request ID

It is very useful to associate a request ID to each incoming request,
this change allows to provide a function to do that via Server Option.
The change comes with a default function which will generate a new
request ID. The request ID is put in the request context along with a
common logger which always logs the request-id

We add gRPC interceptors to the server so it can get the request ID out
of the gRPC metadata and put it in the common logger stored in the
context so as all the log lines using the common logger from the context
will have the request ID logged
2018-06-14 10:40:55 +01:00
Peter Jausovec
bd5150f1ac Extract register view functionality (#1056)
* WIP

* Create separate Register*Views functions that are called from main.
2018-06-12 17:24:21 +01:00
Owen Cliffe
1ad27f4f0d Inverting deps on SQL, Log and MQ plugins to make them optional dependencies of extended servers, Removing some dead code that brought in unused dependencies Filtering out some non-linux transitive deps. (#1057)
* initial Db helper split - make SQL and datastore packages optional

* abstracting log store

* break out DB, MQ and log drivers as extensions

* cleanup

* fewer deps

* fixing docker test

* hmm dbness

* updating db startup

* Consolidate all your extensions into one convenient package

* cleanup

* clean up dep constraints
2018-06-11 18:23:28 +01:00
Tolga Ceylan
fce1e54746 fn: remove dead code in static pool (#1052)
Static pool is oriented for testing/basic usage and
as it's name implies it is a static pool. Therefore,
removing unnecessary/dead code.
2018-06-08 15:57:06 -07:00
Tolga Ceylan
8f969918bd fn: removing unused/dead code (#1051) 2018-06-08 15:51:19 -07:00
Tolga Ceylan
4fcb52f69d fn: MaxTotalCPU and MaxTotalMemory in non-Linux systems (#1043)
Non-Linux systems skip some of memory/cpu determination
code in resource tracker. But config settings to cap
these are used in tests, so they must not be ignored.

With this change, we apply these config settings even
on non-Linux systems.

Memory allocation code is also now same in non-Linux
systems, but default is raised to 2GB from 1.5GB.
2018-06-06 14:50:21 -07:00
Owen Cliffe
c6abc8bf64 Use context logging more to ensure context vars are present in log lines (#1039) 2018-06-06 15:14:29 +01:00
Tolga Ceylan
4af53025d8 fn: lb-agent: Initial TryCall result can be retriable. (#1035)
Before this change, we assumed data may end up in a container
once we placed a TryCall() and if gRPC send failed, we did not
retry. However, a send failure cannot result in data in a
container, since only upon successful receipt of a TryCall can
pure-runner schedule a call into a container. Here we trust
gRPC and if gRPC layer says it could not send a msg, then
the receiver did not receive it.
2018-06-05 14:41:13 -07:00
Andrea Rosa
c2c295ffb3 Add a LBAgent constructor which accept AgentConfig (#1037)
In some cases could be useful to pass Agent configurations to the
LnAgent constuctor, this small change adds a new constructor which
accepts an agent configuration as additional parameter.
2018-06-05 13:59:43 -07:00
Tolga Ceylan
1cd5894f41 fn: LB agent: reduce 'Too Busy' error logs (#1033)
With this PR, runner client translates too busy errors
from gRPC session and runner itself into Fn error type.
Placers now ignore this error message to reduce unnecessary
logging.
2018-06-04 12:16:00 -07:00
Tolga Ceylan
7261ddedcc fn: LB agent: EOF from runner is normal in nack cases (#1032) 2018-06-04 12:10:00 -07:00
Reed Allman
00c29b8bf3 datastore no longer implements logstore (#1013)
* datastore no longer implements logstore

the underlying implementation of our sql store implements both the datastore
and the logstore interface, however going forward we are likely to encounter
datastore implementers that would mock out the logstore interface and not use
its methods - signalling a poor interface. this remedies that, now they are 2
completely separate things, which our sqlstore happens to implement both of.

related to some recent changes around wrapping, this keeps the imposed metrics
and validation wrapping of a servers logstore and datastore, just moving it
into New instead of in the opts - this is so that a user can have the
underlying datastore in order to set the logstore to it, since wrapping it in
a validator/metrics would render it no longer a logstore implementer (i.e.
validate datastore doesn't implement the logstore interface), we need to do
this after setting the logstore to the datastore if one wasn't provided
explicitly.

* splits logstore and datastore metrics & validation logic
* `make test` should be `make full-test` always. got rid of that so that
nobody else has to wait for CI to blow up on them after the tests pass locally
ever again.

* fix new tests
2018-06-04 00:08:16 -07:00
Tolga Ceylan
a57907eed0 fn: user friendly timeout handling changes (#1021)
* fn: user friendly timeout handling changes

Timeout setting in routes now means "maximum amount
of time a function can run in a container".

Total wait time for a given http request is now expected
to be handled by the client. As long as the client waits,
the LB, runner or agents will search for resources to
schedule it.
2018-06-01 13:18:13 -07:00
Tolga Ceylan
f97b63f878 fn: fixup temp dir read/write permissions if tmp fs size is not set. (#1024)
When TmpFsSize is not set in a route, docker fails to create a /tmp
mount that is writable. Forcing docker to explicitly to this if
read-only root directory is enabled (default).
2018-06-01 10:49:07 -07:00
Tolga Ceylan
e1b7e30e49 fn: cleanup of unused/global constants in lb agent (#1020)
Moved retry interval as placer member variable for time-being.
2018-05-31 13:04:06 -07:00
Tolga Ceylan
d190167580 fn: read-only root fs becomes default (#1019)
* fn: read-only root fs becomes default

Set root fs as read-only by default.

* fn: update doc for FN_DISABLE_READONLY_ROOTFS
2018-05-30 18:17:28 -07:00
Tolga Ceylan
7f1d14d21f fn: slot hash id must be utf8 in gRPC (#1016) 2018-05-29 16:26:43 -07:00
Tolga Ceylan
74a5379dec fn: lb & pure-runner slot hash id communication (#1007)
* fn: lb & pure-runner slot hash id communication

With this change, LB can pre-calculate the slot hash
key and pass it to runners. If LB knows/calculates
the slot hash ids, then it can also make better
estimates on which runner can successfully execute
it especially when status messages from runner
include a small summary of idle slots for a given
slot hash id. (TODO)

* fn: fix mock test
2018-05-25 14:12:48 -07:00
Tolga Ceylan
9584643142 fn: size restricted tmpfs /tmp and read-only / support (#1012)
* fn: size restricted tmpfs /tmp and read-only / support

*) read-only Root Fs Support
*) removed CPUShares from docker API. This was unused.
*) docker.Prepare() refactoring
*) added docker.configureTmpFs() for size limited tmpfs on /tmp
*) tmpfs size support in routes and resource tracker
*) fix fn-test-utils to handle sparse files better in create file

* test typo fix
2018-05-25 14:12:29 -07:00
Gerardo Viedma
ea1f94253f Implement graceful shutdown of agent.DataAccess (#1008)
* Implements graceful shutdown of agent.DataAccess and underlying Datastore/Logstore/MessageQueue

* adds tests for closing agent.DataAccess and Datastore
2018-05-21 11:28:21 +01:00
Tolga Ceylan
77086ecc24 fn: lb-agent & runner gRPC updates (#1005)
Breaking changes:

*) Removed unused ACK/NACK definitions
*) Extended Finished messages with error code/str
2018-05-17 15:02:15 -07:00
Tolga Ceylan
7cf8e2a61d fn: pure-runner time out while waiting TryCall (#1006)
This should return a retriable error code 503.
2018-05-17 15:00:50 -07:00
Tolga Ceylan
4ccde8897e fn: lb and pure-runner with non-blocking agent (#989)
* fn: lb and pure-runner with non-blocking agent

*) Removed pure-runner capacity tracking code. This did
not play well with internal agent resource tracker.
*) In LB and runner gRPC comm, removed ACK. Now,
upon TryCall, pure-runner quickly proceeds to call
Submit. This is good since at this stage pure-runner
already has all relevant data to initiate the call.
*) Unless pure-runner emits a NACK, LB immediately
streams http body to runners.
*) For retriable requests added a CachedReader for
http.Request Body.
*) Idempotenty/retry is similar to previous code.
After initial success in Engament, after attempting
a TryCall, unless we receive NACK, we cannot retry
that call.
*) ch and naive places now wraps each TryExec with
a cancellable context to clean up gRPC contexts
quicker.

* fn: err for simpler one-time read GetBody approach

This allows for a more flexible approach since we let
users to define GetBody() to allow repetitive http body
read. In default LB case, LB executes a one-time io.ReadAll
and sets of GetBody, which is detected by RunnerCall.RequestBody().

* fn: additional check for non-nil req.body

* fn: attempt to override IO errors with ctx for TryExec

* fn: system-tests log dest

* fn: LB: EOF send handling

* fn: logging for partial IO

* fn: use buffer pool for IO storage in lb agent

* fn: pure runner should use chunks for data msgs

* fn: required config validations and pass APIErrors

* fn: additional tests and gRPC proto simplification

*) remove ACK/NACK messages as Finish message type works
OK for this purpose.
*) return resp in api tests for check for status code
*) empty body json test in api tests for lb & pure-runner

* fn: buffer adjustments

*) setRequestBody result handling correction
*) switch to bytes.Reader for read-only safety
*) io.EOF can be returned for non-nil Body in request.

* fn: clarify detection of 503 / Server Too Busy
2018-05-17 12:09:03 -07:00
Mark Godfrey
ac4e6c5a03 Replace panic on enqueue for LB agent with error. (#1004) 2018-05-17 13:05:47 +01:00
Tolga Ceylan
eab85dfab0 fn: agent MaxRequestSize limit (#998)
* fn: agent MaxRequestSize limit

Currently, LimitRequestBody() exists to install a
http request body size in http/gin server. For production
enviroments, this is expected to be used. However, in agents
we may need to verify/enforce these size limits and to be
able to assert in case of missing limits is valuable.
With this change, operators can define an agent env variable
to limit this in addition to installing Gin/Http handler.

http.MaxBytesReader is superior in some cases as it sets
http headers (Connection: close) to guard against subsequent
requests.

However, NewClampReadCloser() is superior in other cases,
where it can cleanly return an API error for this case alone
(http.MaxBytesReader() does not return a clean error type
for overflow case, which makes it difficult to use it without
peeking into its implementation.)

For lb agent, upcoming changes rely on such limits enabled
and using gin/http handler (http.MaxBytesReader) makes such
checks/safety validations difficult.

* fn: read/write clamp code adjustment

In case of overflows, opt for simple implementation
of a partial write followed by return error.
2018-05-16 11:45:57 -07:00
Reed Allman
cbe0d5e9ac add user syslog writers to app (#970)
* add user syslog writers to app

users may specify a syslog url[s] on apps now and all functions under that app
will spew their logs out to it. the docs have more information around details
there, please review those (swagger and operating/logging.md), tried to
implement to spec in some parts and improve others, open to feedback on
format though, lots of liberty there.

design decision wise, I am looking to the future and ignoring cold containers.
the overhead of the connections there will not be worth it, so this feature
only works for hot functions, since we're killing cold anyway (even if a user
can just straight up exit a hot container).

syslog connections will be opened against a container when it starts up, and
then the call id that is logged gets swapped out for each call that goes
through the container, this cuts down on the cost of opening/closing
connections significantly. there are buffers to accumulate logs until we get a
`\n` to actually write a syslog line, and a buffer to save some bytes when
we're writing the syslog formatting as well. underneath writers re-use the
line writer in certain scenarios (swapper). we could likely improve the ease
of setting this up, but opening the syslog conns against a container seems
worth it, and is a different path than the other func loggers that we create
when we make a call object. the Close() stuff is a little tricky, not sure how
to make it easier and have the ^ benefits, open to idears.

this does add another vector of 'limits' to consider for more strict service
operators. one being how many syslog urls can a user add to an app (infinite,
atm) and the other being on the order of number of containers per host we
could run out of connections in certain scenarios. there may be some utility
in having multiple syslog sinks to send to, it could help with debugging at
times to send to another destination or if a user is a client w/ someone and
both want the function logs, e.g. (have used this for that in the past,
specifically).

this also doesn't work behind a proxy, which is something i'm open to fixing,
but afaict will require a 3rd party dependency (we can pretty much steal what
docker does). this is mostly of utility for those of us that work behind a
proxy all the time, not really for end users.

there are some unit tests. integration tests for this don't sound very fun to
maintain. I did test against papertrail with each protocol and it works (and
even times out if you're behind a proxy!).

closes #337

* add trace to syslog dial
2018-05-15 11:00:26 -07:00
Tolga Ceylan
8e440c835e fn: fixup undeterministic test (#986) 2018-05-10 08:08:10 -07:00
Tolga Ceylan
508d9e18c7 fn: nonblocking resource manager tests (#987) 2018-05-09 19:23:10 -07:00
Tolga Ceylan
0f50537150 fn: allow specified docker networks in functions (#982)
* fn: allow specified docker networks in functions

If FN_DOCKER_NETWORK is specified with a list of
networks, then agent driver picks the least used
network to place functions on.

* add mutex comment
2018-05-09 12:24:15 -07:00
Tolga Ceylan
676c87f9a5 fn: pure-runner violates io Writer contract (#981)
We must copy the data slice.
2018-05-09 16:48:55 +01:00
jan grant
91e58afa55 The opencensus API changes between 0.6.0 and 0.9.0 (#980)
We get some useful features in later versions; update so as to not
pin downstream consumers (extensions) to an older version.
2018-05-09 14:55:00 +01:00
Reed Allman
1f1624782b related: https://github.com/fnproject/fdk-go/pull/26 (#968)
adds a test for the protocol dumping of a request to the containert stdin.
there are a number of vectors to test for a cloud event, but since we're going
to change that behavior soon it's probably a waste of time to go about doing
so. in any event, this was pretty broken. my understanding of the cloud event
spec is deepening and the json stuff overall seems a little weird.

* fixes content type issue around json checking (since a string is also a json
value, we can just decode it, even though it's wasteful it's more easily
correct)
* doesn't force all json values to be map[string]interface{} and lets them be
whoever they want to be. maybe their dads are still proud.

closes #966
2018-05-07 22:22:53 -07:00
Tolga Ceylan
f0f9a6d945 fn: LB ch and naive fixes (#942)
* fn: LB ch and naive fixes

*) Naive is now a naive RR algorithm.
*) Both now checks for ctx/timeout in each attempt.

* fn: test fix
2018-05-07 11:50:16 -07:00
Tolga Ceylan
54ba49be65 fn: non-blocking resource tracker and notification (#841)
* fn: non-blocking resource tracker and notification

For some types of errors, we might want to notify
the actual caller if the error is directly 1-1 tied
to that request. If hotLauncher is triggered with
signaller, then here we send a back communication
error notification channel. This is passed to
checkLaunch to send back synchronous responses
to the caller that initiated this hot container
launch.

This is useful if we want to run the agent in
quick fail mode, where instead of waiting for
CPU/Mem to become available, we prefer to fail
quick in order not to hold up the caller.
To support this, non-blocking resource tracker
option/functions are now available.

* fn: test env var rename tweak

* fn: fixup merge

* fn: rebase test fix

* fn: merge fixup

* fn: test tweak down to 70MB for 128MB total

* fn: refactor token creation and use broadcast regardless

* fn: nb description

* fn: bugfix
2018-04-24 21:59:33 -07:00
Travis Reeder
3eb60e2028 CloudEvents I/O format support. (#948)
* CloudEvents I/O format support.

* Updated format doc.

* Remove log lines

* This adds support for CloudEvent ingestion at the http router layer.

* Updated per comments.

* Responds with full CloudEvent message.

* Fixed up per comments

* Fix tests

* Checks for cloudevent content-type

* doesn't error on missing content-type.
2018-04-23 16:05:13 -07:00
Tolga Ceylan
c0ee3ce736 fn: locked mutex while blocked on I/O considered harmful (#935)
* fn: mutex while waiting I/O considered harmful

*) Removed hold mutex while wait I/O cases these
included possible disk I/O and network I/O.

*) Error/Context Close/Shutdown semantics changed since
the context timeout and comments were misleading. Close
always waits for pending gRPC session to complete.
Context usage here was merely 'wait up to x secs to
report an error' which only logs the error anyway.
Instead, the runner can log the error. And context
still can be passed around perhaps for future opencensus
instrumentation.
2018-04-13 11:23:29 -07:00
Tolga Ceylan
623aeb35b2 fn: common.WaitGroup improvements (#940)
* fn: common.WaitGroup improvements

*) Split the API into AddSession/DoneSession
*) Only wake up listeners when session count reaches zero.

* fn: WaitGroup go-routine blast test

* fn: test fix and rebase fixup
2018-04-12 16:21:13 -07:00
Tolga Ceylan
e47d55056a fn: reduce lbagent and agent dependency (#938)
* fn: reduce lbagent and agent dependency

lbagent and agent code is too dependent. This causes
any changed in agent to break lbagent. In reality, for
LB there should be no delegated agent. Splitting these
two will cause some code duplication, but it reduces
dependency and complexity (eg. agent without docker)

* fn: post rebase fixup

* fn: runner/runnercall should use lbDeadline

* fn: fixup ln agent test

* fn: remove agent create option for common.WaitGroup
2018-04-12 15:51:58 -07:00