* fn: runner status and docker load images
Introducing a function run for pure runner Status
calls. Previously, Status gRPC calls returned active
inflight request counts with the purpose of a simple
health checker. However this is not sufficient since
it does not show if agent or docker is healthy. With
this change, if pure runner is configured with a status
image, that image is executed through docker. The
call uses zero memory/cpu/tmpsize settings to ensure
resource tracker does not block it.
However, operators might not always have a docker
repository accessible/available for status image. Or
operators might not want the status to go over the
network. To allow such cases, and in general possibly
caching docker images, added a new environment variable
FN_DOCKER_LOAD_FILE. If this is set, fn-agent during
startup will load these images that were previously
saved with 'docker save' into docker.
* Initial suypport for invoking tiggers
* dupe method
* tighten server constraints
* runner tests not working yet
* basic route tests passing
* post rebase fixes
* add hybrid support for trigger invoke and tests
* consoloidate all hybrid evil into one place
* cleanup and make triggers unique by source
* fix oops with Agent
* linting
* review fixes
* add DateTime sans mgo
* change all uses of strfmt.DateTime to common.DateTime, remove test strfmt usage
* remove api tests, system-test dep on api test
multiple reasons to remove the api tests:
* awkward dependency with fn_go meant generating bindings on a branched fn to
vendor those to test new stuff. this is at a minimum not at all intuitive,
worth it, nor a fun way to spend the finite amount of time we have to live.
* api tests only tested a subset of functionality that the server/ api tests
already test, and we risk having tests where one tests some thing and the
other doesn't. let's not. we have too many test suites as it is, and these
pretty much only test that we updated the fn_go bindings, which is actually a
hassle as noted above and the cli will pretty quickly figure out anyway.
* fn_go relies on openapi, which relies on mgo, which is deprecated and we'd
like to remove as a dependency. openapi is a _huge_ dep built in a NIH
fashion, that cannot simply remove the mgo dep as users may be using it.
we've now stolen their date time and otherwise killed usage of it in fn core,
for fn_go it still exists but that's less of a problem.
* update deps
removals:
* easyjson
* mgo
* go-openapi
* mapstructure
* fn_go
* purell
* go-validator
also, had to lock docker. we shouldn't use docker on master anyway, they
strongly advise against that. had no luck with latest version rev, so i locked
it to what we were using before. until next time.
the rest is just playing dep roulette, those end up removing a ton tho
* fix exec test to work
* account for john le cache
In pure-runner and LB agent, service providers might want to set specific driver options.
For example, to add cpu-shares to functions, LB can add the information as extensions
to the Call and pass this via gRPC to runners. Runners then pick these extensions from
gRPC call and pass it to driver. Using a custom driver implementation, pure-runners can
process these extensions to modify docker.CreateContainerOptions.
To achieve this, LB agents can now be configured using a call overrider.
Pure-runners can be configured using a custom docker driver.
RunnerCall and Call interfaces both expose call extensions.
An example to demonstrate this is implemented in test/fn-system-tests/system_test.go
which registers a call overrider for LB agent as well as a simple custom docker driver.
In this example, LB agent adds a key-value to extensions and runners add this key-value
as an environment variable to the container.
* fn: user friendly timeout handling changes
Timeout setting in routes now means "maximum amount
of time a function can run in a container".
Total wait time for a given http request is now expected
to be handled by the client. As long as the client waits,
the LB, runner or agents will search for resources to
schedule it.
* fn: size restricted tmpfs /tmp and read-only / support
*) read-only Root Fs Support
*) removed CPUShares from docker API. This was unused.
*) docker.Prepare() refactoring
*) added docker.configureTmpFs() for size limited tmpfs on /tmp
*) tmpfs size support in routes and resource tracker
*) fix fn-test-utils to handle sparse files better in create file
* test typo fix
* Implements graceful shutdown of agent.DataAccess and underlying Datastore/Logstore/MessageQueue
* adds tests for closing agent.DataAccess and Datastore
* add user syslog writers to app
users may specify a syslog url[s] on apps now and all functions under that app
will spew their logs out to it. the docs have more information around details
there, please review those (swagger and operating/logging.md), tried to
implement to spec in some parts and improve others, open to feedback on
format though, lots of liberty there.
design decision wise, I am looking to the future and ignoring cold containers.
the overhead of the connections there will not be worth it, so this feature
only works for hot functions, since we're killing cold anyway (even if a user
can just straight up exit a hot container).
syslog connections will be opened against a container when it starts up, and
then the call id that is logged gets swapped out for each call that goes
through the container, this cuts down on the cost of opening/closing
connections significantly. there are buffers to accumulate logs until we get a
`\n` to actually write a syslog line, and a buffer to save some bytes when
we're writing the syslog formatting as well. underneath writers re-use the
line writer in certain scenarios (swapper). we could likely improve the ease
of setting this up, but opening the syslog conns against a container seems
worth it, and is a different path than the other func loggers that we create
when we make a call object. the Close() stuff is a little tricky, not sure how
to make it easier and have the ^ benefits, open to idears.
this does add another vector of 'limits' to consider for more strict service
operators. one being how many syslog urls can a user add to an app (infinite,
atm) and the other being on the order of number of containers per host we
could run out of connections in certain scenarios. there may be some utility
in having multiple syslog sinks to send to, it could help with debugging at
times to send to another destination or if a user is a client w/ someone and
both want the function logs, e.g. (have used this for that in the past,
specifically).
this also doesn't work behind a proxy, which is something i'm open to fixing,
but afaict will require a 3rd party dependency (we can pretty much steal what
docker does). this is mostly of utility for those of us that work behind a
proxy all the time, not really for end users.
there are some unit tests. integration tests for this don't sound very fun to
maintain. I did test against papertrail with each protocol and it works (and
even times out if you're behind a proxy!).
closes#337
* add trace to syslog dial
* fn: allow specified docker networks in functions
If FN_DOCKER_NETWORK is specified with a list of
networks, then agent driver picks the least used
network to place functions on.
* add mutex comment
* fn: non-blocking resource tracker and notification
For some types of errors, we might want to notify
the actual caller if the error is directly 1-1 tied
to that request. If hotLauncher is triggered with
signaller, then here we send a back communication
error notification channel. This is passed to
checkLaunch to send back synchronous responses
to the caller that initiated this hot container
launch.
This is useful if we want to run the agent in
quick fail mode, where instead of waiting for
CPU/Mem to become available, we prefer to fail
quick in order not to hold up the caller.
To support this, non-blocking resource tracker
option/functions are now available.
* fn: test env var rename tweak
* fn: fixup merge
* fn: rebase test fix
* fn: merge fixup
* fn: test tweak down to 70MB for 128MB total
* fn: refactor token creation and use broadcast regardless
* fn: nb description
* fn: bugfix
* CloudEvents I/O format support.
* Updated format doc.
* Remove log lines
* This adds support for CloudEvent ingestion at the http router layer.
* Updated per comments.
* Responds with full CloudEvent message.
* Fixed up per comments
* Fix tests
* Checks for cloudevent content-type
* doesn't error on missing content-type.
* fn: common.WaitGroup improvements
*) Split the API into AddSession/DoneSession
*) Only wake up listeners when session count reaches zero.
* fn: WaitGroup go-routine blast test
* fn: test fix and rebase fixup
* fn: reduce lbagent and agent dependency
lbagent and agent code is too dependent. This causes
any changed in agent to break lbagent. In reality, for
LB there should be no delegated agent. Splitting these
two will cause some code duplication, but it reduces
dependency and complexity (eg. agent without docker)
* fn: post rebase fixup
* fn: runner/runnercall should use lbDeadline
* fn: fixup ln agent test
* fn: remove agent create option for common.WaitGroup
* fn: sync.WaitGroup replacement common.WaitGroup
agent/lb_agent/pure_runner has been incorrectly using
sync.WaitGroup semantics. Switching these components to
use the new common.WaitGroup() that provides a few handy
functionality for common graceful shutdown cases.
From https://golang.org/pkg/sync/#WaitGroup,
"Note that calls with a positive delta that occur when the counter
is zero must happen before a Wait. Calls with a negative delta,
or calls with a positive delta that start when the counter is
greater than zero, may happen at any time. Typically this means
the calls to Add should execute before the statement creating
the goroutine or other event to be waited for. If a WaitGroup
is reused to wait for several independent sets of events,
new Add calls must happen after all previous Wait calls have
returned."
HandleCallEnd introduces some complexity to the shutdowns, but this
is currently handled by AddSession(2) initially and letting the
HandleCallEnd() when to decrement by -1 in addition to decrement -1 in
Submit().
lb_agent shutdown sequence and particularly timeouts with runner pool
needs another look/revision, but this is outside of the scope of this
commit.
* fn: lb-agent wg share
* fn: no need to +2 in Submit with defer.
Removed defer since handleCallEnd already has
this responsibility.
* fn: move call error/end handling to handleCallEnd
This simplifies submit() function but moves the burden
of retriable-versus-committed request handling and slot.Close()
responsibility to handleCallEnd().
* fn: experimental prefork recycle and other improvements
*) Recycle and do not use same pool container again option.
*) Two state processing: initializing versus ready (start-kill).
*) Ready state is exempt from rate limiter.
* fn: experimental prefork pool multiple network support
In order to exceed 1023 container (bridge port) limit, add
multiple networks:
for i in fn-net1 fn-net2 fn-net3 fn-net4
do
docker network create $i
done
to Docker startup, (eg. dind preentry.sh), then provide this
to prefork pool using:
export FN_EXPERIMENTAL_PREFORK_NETWORKS="fn-net1 fn-net2 fn-net3 fn-net4"
which should be able to spawn 1023 * 4 containers.
* fn: fixup tests for cfg move
* fn: add ipc and pid namespaces into prefork pooling
* fn: revert ipc and pid namespaces for now
Pid/Ipc opens up the function container to pause container.
* fn: perform call.End() after request is processed
call.End() performs several tasks in sequence; insert call,
insert log, (todo) remove mq entry, fireAfterCall callback, etc.
These currently add up to the request latency as return
from agent.Submit() is blocked on these. We also haven't been
able to apply any timeouts on these operations since they are
handled during request processing and it is hard to come up
with a strategy for it. Also the error cases
(couldn't insert call or log) are not propagated to the caller.
With this change, call.End() handling becomes asynchronous where
we perform these tasks after the request is done. This improves
latency and we no longer have to block the call on these operations.
The changes will also free up the agent slot token more quickly
and now we are no longer tied to hiccups in call.End().
Now, a timeout policy is also added to this which can
be adjusted with an env variable. (default 10 minutes)
This accentuates the fact that call/log/fireAfterCall are not
completed when request is done. So, there's a window there where
call is done, but call/log/fireAfterCall are not yet propagated.
This was already the case especially for error cases.
There's slight risk of accumulating call.End() operations in
case of hiccups in these log/call/callback systems.
* fn: address risk of overstacking of call.End() calls.
* App ID
* Clean-up
* Use ID or name to reference apps
* Can use app by name or ID
* Get rid of AppName for routes API and model
routes API is completely backwards-compatible
routes API accepts both app ID and name
* Get rid of AppName from calls API and model
* Fixing tests
* Get rid of AppName from logs API and model
* Restrict API to work with app names only
* Addressing review comments
* Fix for hybrid mode
* Fix rebase problems
* Addressing review comments
* Addressing review comments pt.2
* Fixing test issue
* Addressing review comments pt.3
* Updated docstring
* Adjust UpdateApp SQL implementation to work with app IDs instead of names
* Fixing tests
* fmt after rebase
* Make tests green again!
* Use GetAppByID wherever it is necessary
- adding new v2 endpoints to keep hybrid api/runner mode working
- extract CallBase from Call object to expose that to a user
(it doesn't include any app reference, as we do for all other API objects)
* Get rid of GetAppByName
* Adjusting server router setup
* Make hybrid work again
* Fix datastore tests
* Fixing tests
* Do not ignore app_id
* Resolve issues after rebase
* Updating test to make it work as it was
* Tabula rasa for migrations
* Adding calls API test
- we need to ensure we give "App not found" for the missing app and missing call in first place
- making previous test work (request missing call for the existing app)
* Make datastore tests work fine with correctly applied migrations
* Make CallFunction middleware work again
had to adjust its implementation to set app ID before proceeding
* The biggest rebase ever made
* Fix 8's migration
* Fix tests
* Fix hybrid client
* Fix tests problem
* Increment app ID migration version
* Fixing TestAppUpdate
* Fix rebase issues
* Addressing review comments
* Renew vendor
* Updated swagger doc per recommendations
* Move delegated agent creation within NewLBAgent so we can hide the fact we disable docker
* Move delegated agent creation within NewPureRunner for better encapsulation
Added env FN_MAX_FS_SIZE_MB, which if defined and non-zero
is passed to docker as storage opt size. We do not validate
if this option is supported by docker currently. This is
because it's difficult to actually validate this since it
not only depends on storage driver and its backing filesystem,
but also the mount options used to mount that fs.
* fn: reorg agent config
*) Moving constants in agent to agent config, which helps
with testing, tuning.
*) Added max total cpu & memory for testing & clamping max
mem & cpu usage if needed.
* fn: adjust PipeIO time
* fn: for hot, cannot reliably test EndOfLogs in TestRouteRunnerExecution
* add jaeger support, link hot container & req span
* adds jaeger support now with FN_JAEGER_URL, there's a simple tutorial in the
operating/metrics.md file now and it's pretty easy to get up and running.
* links a hot request span to a hot container span. when we change this to
sample at a lower ratio we'll need to finagle the hot container span to always
sample or something, otherwise we'll hide that info. at least, since we're
sampling at 100% for now if this is flipped on, can see freeze/unfreeze etc.
if they hit. this is useful for debugging. note that zipkin's exporter does
not follow the link at all, hence jaeger... and they're backed by the Cloud
Empire now (CNCF) so we'll probably use it anyway.
* vendor: add thrift for jaeger
1) in theory it may be possible for an exited container to
requeue a slot, close this gap by always setting fatal error
for a slot if a container has exited.
2) when a client request times out or cancelled (client
disconnect, etc.) the slot should not be allowed to be
requeued and container should terminate to avoid accidental
mixing of previous response into next.
* fn: enable failing test back
* fn: fortifying the stderr output
Modified limitWriter to discard excess data instead
of returning error, this is to allow stderr/stdout
pipes flowing to avoid head-of-line blocking or
data corruption in container stdout/stderr output stream.
* Initial stab at the protocol
* initial protocol sketch for node pool manager
* Added http header frame as a message
* Force the use of WithAgent variants when creating a server
* adds grpc models for node pool manager plus go deps
* Naming things is really hard
* Merge (and optionally purge) details received by the NPM
* WIP: starting to add the runner-side functionality of the new data plane
* WIP: Basic startup of grpc server for pure runner. Needs proper certs.
* Go fmt
* Initial agent for LB nodes.
* Agent implementation for LB nodes.
* Pass keys and certs to LB node agent.
* Remove accidentally left reference to env var.
* Add env variables for certificate files
* stub out the capacity and group membership server channels
* implement server-side runner manager service
* removes unused variable
* fixes build error
* splits up GetCall and GetLBGroupId
* Change LB node agent to use TLS connection.
* Encode call model as JSON to send to runner node.
* Use hybrid client in LB node agent.
This should provide access to get app and route information for the call
from an API node.
* More error handling on the pure runner side
* Tentative fix for GetCall problem: set deadlines correctly when reserving slot
* Connect loop for LB agent to runner nodes.
* Extract runner connection function in LB agent.
* drops committed capacity counts
* Bugfix - end state tracker only in submit
* Do logs properly
* adds first pass of tracking capacity metrics in agent
* maked memory capacity metric uint64
* maked memory capacity metric uint64
* removes use of old capacity field
* adds remove capacity call
* merges overwritten reconnect logic
* First pass of a NPM
Provide a service that talks to a (simulated) CP.
- Receive incoming capacity assertions from LBs for LBGs
- expire LB requests after a short period
- ask the CP to add runners to a LBG
- note runner set changes and readvertise
- scale down by marking runners as "draining"
- shut off draining runners after some cool-down period
* add capacity update on schedule
* Send periodic capcacity metrics
Sending capcacity metrics to node pool manager
* splits grpc and api interfaces for capacity manager
* failure to advertise capacity shouldn't panic
* Add some instructions for starting DP/CP parts.
* Create the poolmanager server with TLS
* Use logrus
* Get npm compiling with cert fixups.
* Fix: pure runner should not start async processing
* brings runner, nulb and npm together
* Add field to acknowledgment to record slot allocation latency; fix a bug too
* iterating on pool manager locking issue
* raises timeout of placement retry loop
* Fix up NPM
Improve logging
Ensure that channels etc. are actually initialised in the structure
creation!
* Update the docs - runners GRPC port is 9120
* Bugfix: return runner pool accurately.
* Double locking
* Note purges as LBs stop talking to us
* Get the purging of old LBs working.
* Tweak: on restart, load runner set before making scaling decisions.
* more agent synchronization improvements
* Deal with teh CP pulling out active hosts from under us.
* lock at lbgroup level
* Send request and receive response from runner.
* Add capacity check right before slot reservation
* Pass the full Call into the receive loop.
* Wait for the data from the runner before finishing
* force runner list refresh every time
* Don't init db and mq for pure runners
* adds shutdown of npm
* fixes broken log line
* Extract an interface for the Predictor used by the NPM
* purge drained connections from npm
* Refactor of the LB agent into the agent package
* removes capacitytest wip
* Fix undefined err issue
* updating README for poolmanager set up
* ues retrying dial for lb to npm connections
* Rename lb_calls to lb_agent now that all functionality is there
* Use the right deadline and errors in LBAgent
* Make stream error flag per-call rather than global otherwise the whole runner is damaged by one call dropping
* abstracting gRPCNodePool
* Make stream error flag per-call rather than global otherwise the whole runner is damaged by one call dropping
* Add some init checks for LB and pure runner nodes
* adding some useful debug
* Fix default db and mq for lb node
* removes unreachable code, fixes typo
* Use datastore as logstore in API nodes.
This fixes a bug caused by trying to insert logs into a nil logstore. It
was nil because it wasn't being set for API nodes.
* creates placement abstraction and moves capacity APIs to NodePool
* removed TODO, added logging
* Dial reconnections for LB <-> runners
LB grpc connections to runners are established using a backoff stategy
in event of reconnections, this allows to let the LB up even in case one
of the runners go away and reconnect to it as soon as it is back.
* Add a status call to the Runner protocol
Stub at the moment. To be used for things like draindown, health checks.
* Remove comment.
* makes assign/release capacity lockless
* Fix hanging issue in lb agent when connections drop
* Add the CH hash from fnlb
Select this with FN_PLACER=ch when launching the LB.
* small improvement for locking on reloadLBGmembership
* Stabilise the list of Runenrs returned by NodePool
The NodePoolManager makes some attempt to keep the list of runner nodes advertised as
stable as possible. Let's preserve this effort in the client side. The main point of this
is to attempt to keep the same runner at the same inxed in the []Runner returned by
NodePool.Runners(lbgid); the ch algorithm likes it when this is the case.
* Factor out a generator function for the Runners so that mocks can be injected
* temporarily allow lbgroup to be specified in HTTP header, while we sort out changes to the model
* fixes bug with nil runners
* Initial work for mocking things in tests
* fix for anonymouse go routine error
* fixing lb_test to compile
* Refactor: internal objects for gRPCNodePool are now injectable, with defaults for the real world case
* Make GRPC port configurable, fix weird handling of web port too
* unit test reload Members
* check on runner creation failure
* adding nullRunner in case of failure during runner creation
* Refactored capacity advertisements/aggregations. Made grpc advertisement post asynchronous and non-blocking.
* make capacityEntry private
* Change the runner gRPC bind address.
This uses the existing `whoAmI` function, so that the gRPC server works
when the runner is running on a different host.
* Add support for multiple fixed runners to pool mgr
* Added harness for dataplane system tests, minor refactors
* Add Dockerfiles for components, along with docs.
* Doc fix: second runner needs a different name.
* Let us have three runners in system tests, why not
* The first system test running a function in API/LB/PureRunner mode
* Add unit test for Advertiser logic
* Fix issue with Pure Runner not sending the last data frame
* use config in models.Call as a temporary mechanism to override lb group ID
* make gofmt happy
* Updates documentation for how to configure lb groups for an app/route
* small refactor unit test
* Factor NodePool into its own package
* Lots of fixes to Pure Runner - concurrency woes with errors and cancellations
* New dataplane with static runnerpool (#813)
Added static node pool as default implementation
* moved nullRunner to grpc package
* remove duplication in README
* fix go vet issues
* Fix server initialisation in api tests
* Tiny logging changes in pool manager.
Using `WithError` instead of `Errorf` when appropriate.
* Change some log levels in the pure runner
* fixing readme
* moves multitenant compute documentation
* adds introduction to multitenant readme
* Proper triggering of system tests in makefile
* Fix insructions about starting up the components
* Change db file for system tests to avoid contention in parallel tests
* fixes revisions from merge
* Fix merge issue with handling of reserved slot
* renaming nulb to lb in the doc and images folder
* better TryExec sleep logic clean shutdown
In this change we implement a better way to deal with the sleep inside
the for loop during the attempt for placing a call.
Plus we added a clean way to shutdown the connections with external
component when we shut down the server.
* System_test mysql port
set mysql port for system test to a different value to the one set for
the api tests to avoid conflicts as they can run in parallel.
* change the container name for system-test
* removes flaky test TestRouteRunnerExecution pending resolution by issue #796
* amend remove_containers to remove new added containers
* Rework capacity reservation logic at a higher level for now
* LB agent implements Submit rather than delegating.
* Fix go vet linting errors
* Changed a couple of error levels
* Fix formatting
* removes commmented out test
* adds snappy to vendor directory
* updates Gopkg and vendor directories, removing snappy and addhing siphash
* wait for db containers to come up before starting the tests
* make system tests start API node on 8085 to avoid port conflict with api_tests
* avoid port conflicts with api_test.sh which are run in parallel
* fixes postgres port conflict and issue with removal of old containers
* Remove spurious println
*) I/O protocol parse issues should shutdown the container as the container
goes to inconsistent state between calls. (eg. next call may receive previous
calls left overs.)
*) Move ghost read/write code into io_utils in common.
*) Clean unused error from docker Wait()
*) We can catch one case in JSON, if there's remaining unparsed data in
decoder buffer, we can shut the container
*) stdout/stderr when container is not handling a request are now blocked if freezer is also enabled.
*) if a fatal err is set for slot, we do not requeue it and proceed to shutdown
*) added a test function for a few cases with freezer strict behavior
* update vendor directory, add go.opencensus.io
* update imports
* oops
* s/opentracing/opencensus/ & remove prometheus / zipkin stuff & remove old stats
* the dep train rides again
* fix gin build
* deps from last guy
* start in on the agent metrics
* she builds
* remove tags for now, cardinality error is fussing. subscribe instead of register
* update to patched version of opencensus to proceed for now TODO switch to a release
* meh
fix imports
* println debug the bad boys
* lace it with the tags
* update deps again
* fix all inconsistent cardinality errors
* add our own logger
* fix init
* fix oom measure
* remove bugged removal code
* fix s3 measures
* fix prom handler nil
*) Limit response http body or json response size to FN_MAX_RESPONSE_SIZE (default unlimited)
*) If limits are exceeded 502 is returned with 'body too large' in the error message
* fn: hot container timer improvements
With this change, now we are allocating the timers
when the container starts and managing them via
stop/clear as needed, which should not only be more
efficient, but also easier to follow.
For example, previously, if eject time out was
set to 10 secs, this could have delayed idle timeout
up to 10 secs as well. It is also not necessary to do
any math for elapsed time.
Now consumers avoid any requeuing when startDequeuer() is cancelled.
This was triggering additional dequeue/requeue causing
containers to wake up spuriously. Also in startDequeuer(),
we no longer remove the item from the actual queue and
leave this to acquire/eject, which side steps issues related
with item landing in the channel, not consumed, etc.
* pipe swapparoo each slot
previously, we made a pair of pipes for stdin and stdout for each container,
and then handed them out to each call (slot) to use. this meant that multiple
calls could have a handle on the same stdin pipe and stdout pipe to read/write
to/from from fn's perspective and could mix input/output and get garbage. this
also meant that each was blocked on the previous' reads.
now we make a new pipe every time we get a slot, and swap it out with the
previous ones. calls are no longer blocked from fn's perspective, and we don't
have to worry about timing out dispatch for any hot format. there is still the
issue that if a function does not finish reading the input from the previous
task, from its perspective, and reads the next call's it can error out the
second call. with fn deadline we provide the necessary tools to skirt this,
but without some additional coordination am not sure this is a closable hole
with our current protocols since terminating a previous calls input requires
some protocol specific bytes to go in (json in particular is tricky). anyway,
from fn's side fixing pipes was definitely a hole, but this client hole is
still hanging out. there was an attempt to send an io.EOF but the issue is
that will shut down docker's read on the stdin pipe (and the container). poop.
this adds a test for this behavior, and makes sure 2 containers don't get
launched.
this also closes the response writer header race a little, but not entirely, I
think there's still a chance that we read a full response from a function and
get a timeout while we're changing the headers. I guess we need a thread safe
header bucket, otherwise we have to rely on timings (racy). thinking on it.
* fix stats mu race
* Improve deadline handling in streaming protocols
* Move special headers handling down to the protocols
* Adding function format documentation for JSON changes
* Add tests for request url and method in JSON protocol
* Fix protocol missing fn-specific info
* Fix import
* Add panic for something that should never happen
* add spans to async
* clean up / add spans to agent
* there were a few methods which had multiple contexts which existed in the same
scope (this doesn't end well, usually), flattened those out.
* loop bound context cancels now rely on defer (also was brittle)
* runHot had a lot of ctx shuffling, flattened that.
* added some additional spans in certain paths for added granularity
* linked up the hot launcher / run hot / wait hot to _a_ root span, the first
2 are follows from spans, but at least we can see the source of these and also
can see containers launched over a hot launcher's lifetime
I left TODO around the FollowsFrom because OpenCensus doesn't, at least at the
moment, appear to have any idea of FollowsFrom and it was an extra OpenTracing
method (we have to get the span out, start a new span with the option, then
add it to the context... some shuffling required). anyway, was on the fence
about adding at least.
* resource waiters need to manage their own goroutine lifecycle
* if we get an impossible memory request, bail instead of infinite loop
* handle timeout slippery case
* still sucks, but hotLauncher doesn't leak anything. even the time.After timer goroutines
* simplify GetResourceToken
GetCall can guard against the impossible to allocate resource tasks entering
the system by erroring instead of doling them out. this makes GetResourceToken
logic more straightforward for callers, who now simply have the contract that
they won't ever get a token if they let tasks into the agent that can't run
(but GetCall guards this, and there's a test for it).
sorry, I was going to make this only do that, but when I went to fix up the
tests, my last patch went haywire so I fixed that too. this also at least
tries to simplify the hotLaunch loop, which will now no longer leak time.After
timers (which were long, and with signaller, they were many -- I got a stack
trace :) -- this breaks out the bottom half of the logic to check to see if we
need to launch into its own function, and handles the cleaning duties only in
the caller instead of in 2 different select statements. played with this a
bit, no doubt further cleaning could be done, but this _seems_ better.
* fix vet
* add units to exported method contract docs
* oops