* Move delegated agent creation within NewLBAgent so we can hide the fact we disable docker
* Move delegated agent creation within NewPureRunner for better encapsulation
* Move out node-pool manager and replace it with RunnerPool extension
* adds extension points for runner pools in load-balanced mode
* adds error to return values in RunnerPool and Runner interfaces
* Implements runner pool contract with context-aware shutdown
* fixes issue with range
* fixes tests to use runner abstraction
* adds empty test file as a workaround for build requiring go source files in top-level package
* removes flappy timeout test
* update docs to reflect runner pool setup
* refactors system tests to use runner abstraction
* removes poolmanager
* moves runner interfaces from models to api/runnerpool package
* Adds a second runner to pool docs example
* explicitly check for request spillover to second runner in test
* moves runner pool package name for system tests
* renames runner pool pointer variable for consistency
* pass model json to runner
* automatically cast to http.ResponseWriter in load-balanced call case
* allow overriding of server RunnerPool via a programmatic ServerOption
* fixes return type of ResponseWriter in test
* move Placer interface to runnerpool package
* moves hash-based placer out of open source project
* removes siphash from Gopkg.lock
Added env FN_MAX_FS_SIZE_MB which, if defined and non-zero,
is passed to docker as the storage-opt size. We do not currently
validate whether this option is supported by docker. This is
because it's difficult to actually validate: it depends
not only on the storage driver and its backing filesystem,
but also on the mount options used to mount that fs.
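Roughly, this amounts to threading the env var into the container host config's storage options. A minimal sketch, assuming the Docker Engine API's `HostConfig.StorageOpt` map (fn's own docker client wiring may differ):

```go
// Sketch: pass FN_MAX_FS_SIZE_MB through as a docker storage-opt, if set.
// Assumes the official Docker SDK types; fn's own docker client wiring may differ.
package main

import (
	"fmt"
	"os"
	"strconv"

	"github.com/docker/docker/api/types/container"
)

func hostConfigWithFsLimit() (*container.HostConfig, error) {
	hc := &container.HostConfig{}
	if v := os.Getenv("FN_MAX_FS_SIZE_MB"); v != "" {
		mb, err := strconv.Atoi(v)
		if err != nil {
			return nil, fmt.Errorf("invalid FN_MAX_FS_SIZE_MB: %v", err)
		}
		if mb > 0 {
			// equivalent to `docker run --storage-opt size=<N>M`; whether this is
			// honoured depends on the storage driver, its backing fs and mount opts.
			hc.StorageOpt = map[string]string{"size": fmt.Sprintf("%dM", mb)}
		}
	}
	return hc, nil
}
```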
* fn: reorg agent config
*) Move constants in agent to agent config, which helps
with testing and tuning.
*) Added max total cpu & memory for testing & clamping max
mem & cpu usage if needed.
* fn: adjust PipeIO time
* fn: for hot, cannot reliably test EndOfLogs in TestRouteRunnerExecution
* add jaeger support, link hot container & req span
* adds jaeger support, enabled via FN_JAEGER_URL; there's a simple tutorial in the
operating/metrics.md file now and it's pretty easy to get up and running.
* links a hot request span to a hot container span. when we change this to
sample at a lower ratio we'll need to finagle the hot container span to always
sample or something, otherwise we'll hide that info. at least, since we're
sampling at 100% for now if this is flipped on, we can see freeze/unfreeze etc.
if they hit, which is useful for debugging. note that zipkin's exporter does
not follow the link at all, hence jaeger... and jaeger is backed by the Cloud
Empire (CNCF) now, so we'll probably use it anyway.
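For reference, a minimal sketch of wiring FN_JAEGER_URL into an OpenCensus Jaeger exporter with 100% sampling; the exporter import path and option names are assumptions and have moved around between versions:

```go
// Sketch: register a Jaeger exporter with OpenCensus, driven by FN_JAEGER_URL.
package main

import (
	"log"
	"os"

	"contrib.go.opencensus.io/exporter/jaeger" // assumed path; older trees vendored go.opencensus.io/exporter/jaeger
	"go.opencensus.io/trace"
)

func registerJaeger() {
	url := os.Getenv("FN_JAEGER_URL") // e.g. http://localhost:14268/api/traces
	if url == "" {
		return
	}
	exp, err := jaeger.NewExporter(jaeger.Options{
		CollectorEndpoint: url,
		Process:           jaeger.Process{ServiceName: "fnserver"},
	})
	if err != nil {
		log.Fatal(err)
	}
	trace.RegisterExporter(exp)
	// 100% sampling for now, so freeze/unfreeze spans on hot containers stay visible.
	trace.ApplyConfig(trace.Config{DefaultSampler: trace.AlwaysSample()})
}
```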
* vendor: add thrift for jaeger
* Refactor PureRunner as an Agent so that it encapsulates its grpc server
* Maintain a list of extra contexts for the server to select on to handle errors and cancellations
1) in theory it may be possible for an exited container to
requeue a slot; close this gap by always setting a fatal error
for a slot if its container has exited.
2) when a client request times out or is cancelled (client
disconnect, etc.), the slot should not be allowed to be
requeued and the container should terminate, to avoid accidentally
mixing a previous response into the next one.
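A hypothetical sketch of the invariant (type and field names invented for illustration, not the actual agent code): once a fatal error is set on a slot, it can never be requeued.

```go
// Hypothetical sketch of the requeue guard; type and field names are illustrative.
package main

type slot struct {
	fatalErr error // set when the container exits or the request is cancelled/timed out
}

func (s *slot) requeue(q chan<- *slot) bool {
	if s.fatalErr != nil {
		// a poisoned slot is never handed back out: the container terminates so a
		// previous response can't bleed into the next call.
		return false
	}
	q <- s
	return true
}
```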
This change adds the option to set a timeout for the dialer used when
making gRPC connections; with that we remove the check on the state of
the connections and therefore remove any potential race conditions.
If a runner disconnects ungracefully, the connection can get stuck in
the connecting state. This change verifies the state of the connection
before starting to execute a call; if the client connection is not
ready, we fail fast to give the next runner (if any) a chance to
execute the call.
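A minimal sketch of both pieces, assuming plain google.golang.org/grpc (fn's wrappers and TLS options omitted): a bounded, blocking dial, and a fail-fast readiness check before handing a call to a runner.

```go
// Sketch: bounded dial plus a fail-fast readiness check before using a runner connection.
package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

func dialRunner(addr string, dialTimeout time.Duration) (*grpc.ClientConn, error) {
	ctx, cancel := context.WithTimeout(context.Background(), dialTimeout)
	defer cancel()
	// WithBlock makes DialContext honour the timeout instead of returning immediately.
	return grpc.DialContext(ctx, addr, grpc.WithInsecure(), grpc.WithBlock())
}

func ready(conn *grpc.ClientConn) bool {
	// if the runner went away ungracefully the conn can sit in CONNECTING;
	// fail fast so the placer can try the next runner instead of hanging.
	return conn.GetState() == connectivity.Ready
}
```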
* fn: re-enable previously failing test
* fn: fortifying the stderr output
Modified limitWriter to discard excess data instead
of returning an error; this keeps the stderr/stdout
pipes flowing and avoids head-of-line blocking or
data corruption in the container's stdout/stderr stream.
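A sketch of the discard-on-overflow behaviour described above (a simplified stand-in for the real limitWriter):

```go
// Sketch of a limit writer that silently discards overflow instead of erroring,
// so the container's stdout/stderr pipes keep draining.
package main

import "io"

type limitDiscardWriter struct {
	w io.Writer
	n int64 // bytes remaining before we start discarding
}

func (l *limitDiscardWriter) Write(p []byte) (int, error) {
	full := len(p)
	if l.n <= 0 {
		return full, nil // over the cap: swallow rather than error
	}
	if int64(full) > l.n {
		p = p[:l.n]
	}
	n, _ := l.w.Write(p)
	l.n -= int64(n)
	// always report the full length (and no error) so the container's
	// stdout/stderr pipes never block or corrupt the stream mid-call.
	return full, nil
}
```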
* Initial stab at the protocol
* initial protocol sketch for node pool manager
* Added http header frame as a message
* Force the use of WithAgent variants when creating a server
* adds grpc models for node pool manager plus go deps
* Naming things is really hard
* Merge (and optionally purge) details received by the NPM
* WIP: starting to add the runner-side functionality of the new data plane
* WIP: Basic startup of grpc server for pure runner. Needs proper certs.
* Go fmt
* Initial agent for LB nodes.
* Agent implementation for LB nodes.
* Pass keys and certs to LB node agent.
* Remove accidentally left reference to env var.
* Add env variables for certificate files
* stub out the capacity and group membership server channels
* implement server-side runner manager service
* removes unused variable
* fixes build error
* splits up GetCall and GetLBGroupId
* Change LB node agent to use TLS connection.
* Encode call model as JSON to send to runner node.
* Use hybrid client in LB node agent.
This should provide access to get app and route information for the call
from an API node.
* More error handling on the pure runner side
* Tentative fix for GetCall problem: set deadlines correctly when reserving slot
* Connect loop for LB agent to runner nodes.
* Extract runner connection function in LB agent.
* drops committed capacity counts
* Bugfix - end state tracker only in submit
* Do logs properly
* adds first pass of tracking capacity metrics in agent
* made memory capacity metric uint64
* removes use of old capacity field
* adds remove capacity call
* merges overwritten reconnect logic
* First pass of a NPM
Provide a service that talks to a (simulated) CP.
- Receive incoming capacity assertions from LBs for LBGs
- expire LB requests after a short period
- ask the CP to add runners to a LBG
- note runner set changes and readvertise
- scale down by marking runners as "draining"
- shut off draining runners after some cool-down period
* add capacity update on schedule
* Send periodic capacity metrics
Sending capacity metrics to the node pool manager
* splits grpc and api interfaces for capacity manager
* failure to advertise capacity shouldn't panic
* Add some instructions for starting DP/CP parts.
* Create the poolmanager server with TLS
* Use logrus
* Get npm compiling with cert fixups.
* Fix: pure runner should not start async processing
* brings runner, nulb and npm together
* Add field to acknowledgment to record slot allocation latency; fix a bug too
* iterating on pool manager locking issue
* raises timeout of placement retry loop
* Fix up NPM
Improve logging
Ensure that channels etc. are actually initialised in the structure
creation!
* Update the docs - runners GRPC port is 9120
* Bugfix: return runner pool accurately.
* Double locking
* Note purges as LBs stop talking to us
* Get the purging of old LBs working.
* Tweak: on restart, load runner set before making scaling decisions.
* more agent synchronization improvements
* Deal with the CP pulling out active hosts from under us.
* lock at lbgroup level
* Send request and receive response from runner.
* Add capacity check right before slot reservation
* Pass the full Call into the receive loop.
* Wait for the data from the runner before finishing
* force runner list refresh every time
* Don't init db and mq for pure runners
* adds shutdown of npm
* fixes broken log line
* Extract an interface for the Predictor used by the NPM
* purge drained connections from npm
* Refactor of the LB agent into the agent package
* removes capacitytest wip
* Fix undefined err issue
* updating README for poolmanager set up
* use retrying dial for lb to npm connections
* Rename lb_calls to lb_agent now that all functionality is there
* Use the right deadline and errors in LBAgent
* Make stream error flag per-call rather than global, otherwise the whole runner is damaged by one call dropping
* abstracting gRPCNodePool
* Add some init checks for LB and pure runner nodes
* adding some useful debug
* Fix default db and mq for lb node
* removes unreachable code, fixes typo
* Use datastore as logstore in API nodes.
This fixes a bug caused by trying to insert logs into a nil logstore. It
was nil because it wasn't being set for API nodes.
* creates placement abstraction and moves capacity APIs to NodePool
* removed TODO, added logging
* Dial reconnections for LB <-> runners
LB gRPC connections to runners are established using a backoff strategy
on reconnection; this keeps the LB up even if one of the runners goes
away, and reconnects to it as soon as it is back.
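A rough sketch of the reconnect behaviour, written as a manual exponential-backoff dial loop rather than the exact options used in the LB agent:

```go
// Sketch: LB-side dial with exponential backoff so a runner that restarts is
// picked back up; real code would also plumb TLS credentials through opts.
package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

func dialWithBackoff(ctx context.Context, addr string, opts ...grpc.DialOption) (*grpc.ClientConn, error) {
	delay := 100 * time.Millisecond
	for {
		conn, err := grpc.DialContext(ctx, addr, opts...)
		if err == nil {
			return conn, nil
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(delay):
		}
		if delay *= 2; delay > 10*time.Second {
			delay = 10 * time.Second // cap the backoff
		}
	}
}
```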
* Add a status call to the Runner protocol
Stub at the moment. To be used for things like draindown, health checks.
* Remove comment.
* makes assign/release capacity lockless
* Fix hanging issue in lb agent when connections drop
* Add the CH hash from fnlb
Select this with FN_PLACER=ch when launching the LB.
* small improvement for locking on reloadLBGmembership
* Stabilise the list of Runners returned by NodePool
The NodePoolManager makes some attempt to keep the list of runner nodes it advertises as
stable as possible. Let's preserve this effort on the client side. The main point is
to attempt to keep the same runner at the same index in the []Runner returned by
NodePool.Runners(lbgid); the ch algorithm likes it when this is the case.
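As a toy illustration (not the actual fnlb CH placer), index-stable placement looks something like hashing the call key onto the slice and walking forward, which only behaves well if each runner keeps its index:

```go
// Toy sketch of index-based placement; the real CH placer imported from fnlb is
// more involved, but it keys off the position of each runner in the slice.
package main

import "hash/fnv"

func place(runners []string, key string) []string {
	if len(runners) == 0 {
		return nil
	}
	h := fnv.New32a()
	h.Write([]byte(key))
	start := int(h.Sum32()) % len(runners)
	// preferred order: start at the hashed index and wrap around on failure,
	// which is only consistent if a runner keeps the same index between calls.
	ordered := make([]string, 0, len(runners))
	for i := 0; i < len(runners); i++ {
		ordered = append(ordered, runners[(start+i)%len(runners)])
	}
	return ordered
}
```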
* Factor out a generator function for the Runners so that mocks can be injected
* temporarily allow lbgroup to be specified in HTTP header, while we sort out changes to the model
* fixes bug with nil runners
* Initial work for mocking things in tests
* fix for anonymous goroutine error
* fixing lb_test to compile
* Refactor: internal objects for gRPCNodePool are now injectable, with defaults for the real world case
* Make GRPC port configurable, fix weird handling of web port too
* unit test reload Members
* check on runner creation failure
* adding nullRunner in case of failure during runner creation
* Refactored capacity advertisements/aggregations. Made grpc advertisement post asynchronous and non-blocking.
* make capacityEntry private
* Change the runner gRPC bind address.
This uses the existing `whoAmI` function, so that the gRPC server works
when the runner is running on a different host.
* Add support for multiple fixed runners to pool mgr
* Added harness for dataplane system tests, minor refactors
* Add Dockerfiles for components, along with docs.
* Doc fix: second runner needs a different name.
* Let us have three runners in system tests, why not
* The first system test running a function in API/LB/PureRunner mode
* Add unit test for Advertiser logic
* Fix issue with Pure Runner not sending the last data frame
* use config in models.Call as a temporary mechanism to override lb group ID
* make gofmt happy
* Updates documentation for how to configure lb groups for an app/route
* small refactor unit test
* Factor NodePool into its own package
* Lots of fixes to Pure Runner - concurrency woes with errors and cancellations
* New dataplane with static runnerpool (#813)
Added static node pool as default implementation
* moved nullRunner to grpc package
* remove duplication in README
* fix go vet issues
* Fix server initialisation in api tests
* Tiny logging changes in pool manager.
Using `WithError` instead of `Errorf` when appropriate.
* Change some log levels in the pure runner
* fixing readme
* moves multitenant compute documentation
* adds introduction to multitenant readme
* Proper triggering of system tests in makefile
* Fix instructions about starting up the components
* Change db file for system tests to avoid contention in parallel tests
* fixes revisions from merge
* Fix merge issue with handling of reserved slot
* renaming nulb to lb in the doc and images folder
* better TryExec sleep logic, clean shutdown
In this change we implement a better way to deal with the sleep inside
the for loop while attempting to place a call.
We also added a clean way to shut down the connections to external
components when we shut down the server.
* System_test mysql port
set the mysql port for system tests to a different value from the one used by
the api tests, to avoid conflicts since they can run in parallel.
* change the container name for system-test
* removes flaky test TestRouteRunnerExecution pending resolution by issue #796
* amend remove_containers to remove newly added containers
* Rework capacity reservation logic at a higher level for now
* LB agent implements Submit rather than delegating.
* Fix go vet linting errors
* Changed a couple of error levels
* Fix formatting
* removes commented out test
* adds snappy to vendor directory
* updates Gopkg and vendor directories, removing snappy and adding siphash
* wait for db containers to come up before starting the tests
* make system tests start API node on 8085 to avoid port conflict with api_tests
* avoid port conflicts with api_test.sh which are run in parallel
* fixes postgres port conflict and issue with removal of old containers
* Remove spurious println
*) I/O protocol parse issues should shut down the container, as the container
goes into an inconsistent state between calls (e.g. the next call may receive the
previous call's leftovers).
*) Move ghost read/write code into io_utils in common.
*) Clean unused error from docker Wait()
*) We can catch one case in JSON: if there's remaining unparsed data in the
decoder buffer, we can shut down the container
*) stdout/stderr are now blocked when the container is not handling a request, if the freezer is also enabled.
*) if a fatal err is set for a slot, we do not requeue it and proceed to shutdown
*) added a test function for a few cases with freezer strict behavior
* update vendor directory, add go.opencensus.io
* update imports
* oops
* s/opentracing/opencensus/ & remove prometheus / zipkin stuff & remove old stats
* the dep train rides again
* fix gin build
* deps from last guy
* start in on the agent metrics
* she builds
* remove tags for now, cardinality error is fussing. subscribe instead of register
* update to patched version of opencensus to proceed for now TODO switch to a release
* meh
fix imports
* println debug the bad boys
* lace it with the tags
* update deps again
* fix all inconsistent cardinality errors
* add our own logger
* fix init
* fix oom measure
* remove bugged removal code
* fix s3 measures
* fix prom handler nil
*) Limit http body or json response size to FN_MAX_RESPONSE_SIZE (default unlimited)
*) If the limit is exceeded, a 502 is returned with 'body too large' in the error message
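A rough sketch of how the cap could surface as a 502; the clamping writer and error value here are hypothetical stand-ins for the real implementation:

```go
// Hypothetical sketch: clamp the function's response body and map overflow to a 502.
package main

import (
	"errors"
	"net/http"
)

var errBodyTooLarge = errors.New("body too large")

type cappedWriter struct {
	buf []byte
	max int
}

func (c *cappedWriter) Write(p []byte) (int, error) {
	if len(c.buf)+len(p) > c.max {
		return 0, errBodyTooLarge // caller maps this to a 502
	}
	c.buf = append(c.buf, p...)
	return len(p), nil
}

func writeResponse(w http.ResponseWriter, c *cappedWriter, err error) {
	if errors.Is(err, errBodyTooLarge) {
		http.Error(w, errBodyTooLarge.Error(), http.StatusBadGateway) // 502
		return
	}
	w.Write(c.buf)
}
```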
while escape analysis didn't lie that the bytes underlying this string escaped
to the heap, the reference to them died and led to us getting an undefined
byte array underlying the string.
sadly, this makes 4 allocs here (still down from 31), but only adds 100ns per
op. I still don't get why 'buf' and 'byts' escape to the heap, blaming faulty
escape analysis code.
this one is kind of impossible to write a test for.
found this from doing benchmarking stuff and was getting weird behavior at the
end of runs where calls didn't find a slot, ran bisect on a known-good commit
from a couple weeks ago and found that it was this. voila. this could explain
the variance from the slack dude's benchmarks, too. anyway, confirmed that
this fixes the issue.
* fix up response headers
* stops defaulting to application/json. this was something awful; the go stdlib has
a func to detect content type. sadly, it doesn't detect json, but we can do a
pretty good job by checking for an opening '{'... there are other fish in the
sea, and now we handle them nicely instead of saying it's json [when it's
not]. a test confirms this; there should be no breakage for any routes
returning a json blob that were relying on us defaulting to this format
(granted that they start with a '{'). see the detection sketch after this list.
* buffers output now to a buffer for all protocol types (default is no longer
left out in the cold). use a little response writer so that we can still let
users write headers from their functions. this is useful for content type
detection instead of having to do it in multiple places.
* plumbs the little content type bit into fn-test-util just so we can test it,
we don't want to put this in the fdk since it's redundant.
I am totally in favor of getting rid of content type from the top level json
blurb. it's redundant, at best, and can have confusing behaviors if a user
uses both the headers and the content_type field (we override with the latter,
now). it's client protocol specific to http to a certain degree, other
protocols may use this concept but have their own way to set it (like http
does in headers..). I realize that it mostly exists because it's somewhat gross
to have to index a list from the headers in certain languages more than
others, but with the ^ behavior, is it really worth it?
closes #782
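The detection sketch referenced above, assuming the stdlib's http.DetectContentType plus the '{' special case (helper name is illustrative):

```go
// Sketch of the detection order: honour a user-set header, otherwise sniff with
// the stdlib, special-casing a leading '{' as JSON since DetectContentType won't.
package main

import (
	"bytes"
	"net/http"
)

func detectContentType(userSet string, body []byte) string {
	if userSet != "" {
		return userSet // the function already wrote a Content-Type header
	}
	trimmed := bytes.TrimLeft(body, " \t\r\n")
	if len(trimmed) > 0 && trimmed[0] == '{' {
		return "application/json"
	}
	return http.DetectContentType(body)
}
```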
* reset idle timeouts back
* move json prefix to stack / next to use
this somewhat minimally comes up in profiling, but it was an itch i needed to
scratch. this does 10x fewer allocations and is 3x faster (with 3x fewer bytes),
and they're the small painful kind of allocation. we're only reading these
strings so the uses of unsafe are fine (I think; audit me). the byte array
we're casting to a string at the end is also heap allocated and does
escape. I only count 2 allocations, but there's 3 (`hash.Sum` and
`make([]string)`); using a pool of sha1 hash.Hash shaves 120 bytes and an alloc
off so seems worth it (it's minimal). if we set a max size of config vals with
a constant we could avoid that allocation, and we could probably find a
checksum package that doesn't use `hash.Hash` that would speed things up a
little (no dynamic dispatch, doesn't allocate in Sum) but there's not one I
know of in stdlib.
master:
```
✗: go test -run=yodawg -bench . -benchmem -benchtime 1s -cpuprofile cpu.out
goos: linux
goarch: amd64
pkg: github.com/fnproject/fn/api/agent
BenchmarkSlotKey 200000 6068 ns/op 696 B/op 31 allocs/op
PASS
ok github.com/fnproject/fn/api/agent 1.454s
```
now:
```
✗: go test -run=yodawg -bench . -benchmem -benchtime 1s -cpuprofile cpu.out
goos: linux
goarch: amd64
pkg: github.com/fnproject/fn/api/agent
BenchmarkSlotKey 1000000 1901 ns/op 168 B/op 3 allocs/op
PASS
ok github.com/fnproject/fn/api/agent 2.092s
```
once we have versioned apps/routes we don't need to build a sha or sort
configs so this will get a lot faster.
anyway, mostly funsies here... my life is that sad now.
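A sketch of the pooled-hasher trick discussed above (function shape is illustrative, not the actual slot key code):

```go
// Sketch of the alloc-shaving trick: reuse sha1 hashers from a sync.Pool when
// building the hot slot key, instead of allocating one per call.
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"hash"
	"sync"
)

var shaPool = sync.Pool{New: func() interface{} { return sha1.New() }}

func slotKey(parts ...string) string {
	h := shaPool.Get().(hash.Hash)
	defer func() { h.Reset(); shaPool.Put(h) }()
	for _, p := range parts {
		h.Write([]byte(p)) // inputs are only read, never retained
	}
	return hex.EncodeToString(h.Sum(nil)) // Sum still allocates, as noted above
}
```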
* http now buffers the entire request body from the container before copying
it to the response writer (and sets content length). this is a level of sad i
don't feel comfortable talking about but it is what it is.
* json protocol was buffering the entire body, so there wasn't any reason for
us to try to write this directly to the container stdin manually; we needed to
add a bufio.Writer around it anyway, since it was making too many write(fd)
syscalls the way it was. this is just easier overall and has the same performance
as http now in my tests, whereas previously this was 50% slower [than http].
* add a buffer pool for http & json to share/use, so json doesn't create a new
buffer every stinkin' request (see the buffer pool sketch below). we need to plumb
down content length so that we can properly size the buffer for json; we'd have to
add header size and everything together, but it's probably faster than malloc();
punting on properly sizing.
* json now sets the content length to the length of the body from the returned json
blurb from the container
this does not handle imposing a maximum size on the response returned from a
container, which we need to add, but that has been open for some time
(specifically, on json). we can impose this by wrapping the pipes, but there's
some discussion to be had; for json specifically we won't be able to just cut
off the output stream and use that (for http we can). anyway, filing a
ticket...
closes #326 :(((((((
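The buffer pool sketch referenced above: a shared sync.Pool of bytes.Buffer that both hot protocols could use to buffer the body and set Content-Length before copying out (a simplified stand-in, not the actual protocol code):

```go
// Sketch: both hot protocols buffer the container's response into a pooled
// bytes.Buffer, then set Content-Length before copying it to the writer.
package main

import (
	"bytes"
	"io"
	"net/http"
	"strconv"
	"sync"
)

var bufPool = sync.Pool{New: func() interface{} { return new(bytes.Buffer) }}

func copyBuffered(w http.ResponseWriter, src io.Reader) error {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)
	if _, err := io.Copy(buf, src); err != nil {
		return err
	}
	w.Header().Set("Content-Length", strconv.Itoa(buf.Len()))
	_, err := io.Copy(w, buf)
	return err
}
```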
i would split this commit in two if i were a good dev.
the pprof stuff is really useful and this only samples when called. this is
pretty standard go service stuff. expvar is cool, too.
the additional spannos have turned up some interesting tidbits... gonna slide
em in
we have been getting these from attach all this time and never needed these
anyway.
I ran cpu profiles of dockerd and this was 90% of docker cpu usage (json
logs). woot. this will reduce i/o quite a bit, and we don't have to worry
about them taking up any disk space either.
from tests i get about 50% speedup with these off. the hunt continues...
previously we would retry infinitely up to the context with some backoff in
between. for hot functions, since we don't set any deadline on pulling or
creating the image, this means it would retry forever without making any
progress if e.g. the registry is inaccessible or any other 'temporary' error
that isn't actually temporary. this adds a hard cap of 10 retries, which
gives approximately 13s if the ops take no time, still respecting the enclosing
context deadline (see the sketch below).
the case where this was coming up is now tested for and was otherwise
confusing for users to debug; now it spits out an ECONNREFUSED with the
address of the registry, which should help users debug without having to poke
around fn logs (though I don't like this as an excuse; not all users will be
operators in the near future, and this one makes sense)
closes #727
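A sketch of the bounded retry described above (helper shape is illustrative):

```go
// Sketch: cap docker pull/create retries at 10 attempts with backoff, while
// still respecting the enclosing context deadline.
package main

import (
	"context"
	"time"
)

func withRetries(ctx context.Context, op func(context.Context) error) error {
	const maxRetries = 10
	delay := 100 * time.Millisecond
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = op(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		delay *= 2
	}
	return err // e.g. an ECONNREFUSED naming the registry, surfaced to the user
}
```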
* return bad function http resp error
this was being thrown into the fn server logs, but it's relatively easy to get
this to crop up if a function user forgets that they left a `println` lying
around that gets written to stdout: it garbles the http (or json, in its case)
output and they just see 'internal server error'. for certain clients i could
see that we really do want to keep this as 'internal server error', but for
things like e.g. docker image not authorized we're showing that in the
response, so this seems apt.
json likely needs the same treatment, will file a bug.
as always, my error messages are rarely helpful enough, help me please :)
closes #355
* add formatting directive
* fix up http error
* output bad jasons to user
closes #729
woo
* fn: hot container timer improvements
With this change, we now allocate the timers
when the container starts and manage them via
stop/clear as needed, which should not only be more
efficient but also easier to follow.
For example, previously, if the eject timeout was
set to 10 secs, this could have delayed the idle timeout
by up to 10 secs as well. It is also no longer necessary to do
any math for elapsed time.
Now consumers avoid any requeuing when startDequeuer() is cancelled.
This was triggering additional dequeue/requeue, causing
containers to wake up spuriously. Also, in startDequeuer()
we no longer remove the item from the actual queue and
leave this to acquire/eject, which sidesteps issues related
to the item landing in the channel, not being consumed, etc.
closes #317
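A sketch of the timer lifecycle described above, using the stop/drain/reset idiom; names are illustrative:

```go
// Sketch of the timer lifecycle: allocate idle/eject timers once when the
// container starts, then stop/reset them instead of recomputing elapsed time.
package main

import "time"

type hotTimers struct {
	idle  *time.Timer
	eject *time.Timer
}

func newHotTimers(idle, eject time.Duration) *hotTimers {
	return &hotTimers{idle: time.NewTimer(idle), eject: time.NewTimer(eject)}
}

// resetIdle uses the stop-drain-reset idiom so a stale tick can't fire later.
func (t *hotTimers) resetIdle(d time.Duration) {
	if !t.idle.Stop() {
		select {
		case <-t.idle.C:
		default:
		}
	}
	t.idle.Reset(d)
}
```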
we could fiddle with this, but we need to at least bound these. this
accomplishes that. 1m is picked since that's our default max log size for the
time being per call, it also takes a little time to generate that many bytes
through logs, typically (i.e. without trying to). I tested with 0, which
spiked the i/o rate on my machine because it's constantly deleting the json
log file. I also tested with 1k and it was similar (for a task that generated
about 1k in logs quickly) -- in testing, this halved my throughput, whereas
using 1m did not change the throughput at all. trying the 'none' driver and
'syslog' driver weren't great: 'none' turns off all stderr and 'syslog' blocks
every log line (boo). anyway, this option seems to have no effect on the
output we get in 'attach', which is what we really care about (i.e. docker is
not logically capping this, just swapping out the log file).
using 1m for this, e.g. if we have 500 hot containers on a machine we have
potentially half a gig of worthless logs lying around. we don't need the
docker logs lying around at all really, but short of writing a storage driver
ourselves there don't seem to be too many better options. open to idears, but
this is likely to hold us over for some time.
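Concretely, this is roughly equivalent to capping the json-file log driver per container; a sketch assuming the Docker SDK's container.LogConfig (fn's docker client wiring may differ):

```go
// Sketch: bound the json-file log driver per hot container, roughly equivalent
// to `docker run --log-opt max-size=1m --log-opt max-file=1`.
package main

import "github.com/docker/docker/api/types/container"

func boundedLogConfig() container.LogConfig {
	return container.LogConfig{
		Type: "json-file",
		Config: map[string]string{
			"max-size": "1m", // roughly one call's max log size; keeps dockerd i/o down
			"max-file": "1",
		},
	}
}
```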
* pipe swapparoo each slot
previously, we made a pair of pipes for stdin and stdout for each container,
and then handed them out to each call (slot) to use. this meant that multiple
calls could have a handle on the same stdin pipe and stdout pipe to read/write
to/from, from fn's perspective, and could mix input/output and get garbage. this
also meant that each was blocked on the previous call's reads.
now we make a new pipe every time we get a slot, and swap it out with the
previous ones. calls are no longer blocked from fn's perspective, and we don't
have to worry about timing out dispatch for any hot format. there is still the
issue that if a function does not finish reading the input from the previous
task, from its perspective, and reads the next call's input, it can error out the
second call. with fn deadline we provide the necessary tools to skirt this,
but without some additional coordination I'm not sure this is a closable hole
with our current protocols, since terminating a previous call's input requires
some protocol specific bytes to go in (json in particular is tricky). anyway,
from fn's side fixing pipes was definitely a hole, but this client hole is
still hanging out. there was an attempt to send an io.EOF but the issue is
that will shut down docker's read on the stdin pipe (and the container). poop.
this adds a test for this behavior, and makes sure 2 containers don't get
launched (see the per-slot pipe sketch below).
this also closes the response writer header race a little, but not entirely, I
think there's still a chance that we read a full response from a function and
get a timeout while we're changing the headers. I guess we need a thread safe
header bucket, otherwise we have to rely on timings (racy). thinking on it.
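A hypothetical sketch of the per-slot pipe swap (names invented): fresh pipes per call, handed to the container's attach loop, so calls can't read each other's bytes or block on each other.

```go
// Hypothetical sketch (names invented): each slot gets fresh pipes; the container
// loop is handed the new read/write ends so previous calls can't pollute the next.
package main

import "io"

type pipePair struct {
	stdinR  *io.PipeReader // container side: reads the call's input
	stdinW  *io.PipeWriter // fn side: writes the call's input
	stdoutR *io.PipeReader // fn side: reads the call's output
	stdoutW *io.PipeWriter // container side: writes the call's output
}

func newPipePair() *pipePair {
	p := &pipePair{}
	p.stdinR, p.stdinW = io.Pipe()
	p.stdoutR, p.stdoutW = io.Pipe()
	return p
}

// per slot: hand newPipePair() to the container's attach loop (e.g. over a
// channel it selects on) so it switches its stdio over before the next call runs.
```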
* fix stats mu race
* Improve deadline handling in streaming protocols
* Move special headers handling down to the protocols
* Adding function format documentation for JSON changes
* Add tests for request url and method in JSON protocol
* Fix protocol missing fn-specific info
* Fix import
* Add panic for something that should never happen
* add spans to async
* clean up / add spans to agent
* there were a few methods that had multiple contexts in the same
scope (this doesn't end well, usually); flattened those out.
* loop bound context cancels now rely on defer (also was brittle)
* runHot had a lot of ctx shuffling, flattened that.
* added some additional spans in certain paths for added granularity
* linked up the hot launcher / run hot / wait hot to _a_ root span. the first
2 are follows-from spans, but at least we can see the source of these and also
can see containers launched over a hot launcher's lifetime
I left a TODO around the FollowsFrom because OpenCensus doesn't, at least at the
moment, appear to have any idea of FollowsFrom; it was an extra OpenTracing
method (we have to get the span out, start a new span with the option, then
add it to the context... some shuffling required). anyway, I was on the fence
about adding it at all.
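A sketch of the span link, using OpenCensus AddLink as a stand-in for FollowsFrom (function name is illustrative):

```go
// Sketch: link the hot container's span back to the request's span with
// OpenCensus, since a FollowsFrom option isn't available there.
package main

import (
	"context"

	"go.opencensus.io/trace"
)

func startHotContainerSpan(reqCtx context.Context) (context.Context, *trace.Span) {
	ctx, span := trace.StartSpan(context.Background(), "hot_container")
	if req := trace.FromContext(reqCtx); req != nil {
		sc := req.SpanContext()
		span.AddLink(trace.Link{
			TraceID: sc.TraceID,
			SpanID:  sc.SpanID,
			Type:    trace.LinkTypeParent, // stands in for opentracing's FollowsFrom
		})
	}
	return ctx, span
}
```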
* resource waiters need to manage their own goroutine lifecycle
* if we get an impossible memory request, bail instead of infinite loop
* handle timeout slippery case
* still sucks, but hotLauncher doesn't leak anything. even the time.After timer goroutines
* simplify GetResourceToken
GetCall can guard against the impossible to allocate resource tasks entering
the system by erroring instead of doling them out. this makes GetResourceToken
logic more straightforward for callers, who now simply have the contract that
they won't ever get a token if they let tasks into the agent that can't run
(but GetCall guards this, and there's a test for it).
sorry, I was going to make this only do that, but when I went to fix up the
tests, my last patch went haywire so I fixed that too. this also at least
tries to simplify the hotLaunch loop, which will now no longer leak time.After
timers (which were long, and with the signaller, they were many -- I got a stack
trace :) -- this breaks the bottom half of the logic, which checks whether we
need to launch, out into its own function, and handles the cleaning duties only
in the caller instead of in 2 different select statements. played with this a
bit, no doubt further cleaning could be done, but this _seems_ better.
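A sketch of the time.After fix in loop form: one reusable timer, stopped and reset per iteration, instead of a new timer goroutine each pass (loop shape is illustrative, not the actual hotLauncher):

```go
// Sketch: a single reusable timer in the launcher loop instead of a fresh
// time.After (and its backing timer) per iteration, which is what leaked.
package main

import "time"

func hotLauncherLoop(work <-chan struct{}, idle time.Duration) {
	timer := time.NewTimer(idle)
	defer timer.Stop()
	for {
		select {
		case _, ok := <-work:
			if !ok {
				return
			}
			// ... decide whether to launch a container ...
			if !timer.Stop() {
				<-timer.C // drain a fired-but-unread tick before resetting
			}
			timer.Reset(idle)
		case <-timer.C:
			return // idle for a while; exit instead of leaking timers
		}
	}
}
```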
* fix vet
* add units to exported method contract docs
* oops