fn-serverless

mirror of https://github.com/fnproject/fn.git synced 2022-10-28 21:29:17 +03:00

Author	SHA1	Message	Date
Owen Cliffe	fff95e7992	Clean up/make consistent the APIs for registering core components, make Docker an optional component at compile time (#1111 )	2018-07-07 10:37:19 +01:00
Owen Cliffe	b8b544ed25	HTTP Triggers hookup (#1086 ) * Initial support for invoking triggers * dupe method * tighten server constraints * runner tests not working yet * basic route tests passing * post rebase fixes * add hybrid support for trigger invoke and tests * consolidate all hybrid evil into one place * cleanup and make triggers unique by source * fix oops with Agent * linting * review fixes	2018-07-05 12:56:07 -05:00
Reed Allman	51ff7caeb2	Bye bye openapi (#1081 ) * add DateTime sans mgo * change all uses of strfmt.DateTime to common.DateTime, remove test strfmt usage * remove api tests, system-test dep on api test multiple reasons to remove the api tests: * awkward dependency with fn_go meant generating bindings on a branched fn to vendor those to test new stuff. this is at a minimum not at all intuitive, worth it, nor a fun way to spend the finite amount of time we have to live. * api tests only tested a subset of functionality that the server/ api tests already test, and we risk having tests where one tests some thing and the other doesn't. let's not. we have too many test suites as it is, and these pretty much only test that we updated the fn_go bindings, which is actually a hassle as noted above and the cli will pretty quickly figure out anyway. * fn_go relies on openapi, which relies on mgo, which is deprecated and we'd like to remove as a dependency. openapi is a _huge_ dep built in a NIH fashion, that cannot simply remove the mgo dep as users may be using it. we've now stolen their date time and otherwise killed usage of it in fn core, for fn_go it still exists but that's less of a problem. * update deps removals: * easyjson * mgo * go-openapi * mapstructure * fn_go * purell * go-validator also, had to lock docker. we shouldn't use docker on master anyway, they strongly advise against that. had no luck with latest version rev, so i locked it to what we were using before. until next time. the rest is just playing dep roulette, those end up removing a ton tho * fix exec test to work * account for john le cache	2018-06-21 11:09:16 -07:00
Tolga Ceylan	881a0ba1db	fn: agent call overrider (#1080 ) Similar to LB Agent call overrider, this PR adds Agent overrider for Agents to modify/analyze a Call/Extensions during GetCall().	2018-06-20 16:21:09 -07:00
Tolga Ceylan	e67d0e5f3f	fn: Call extensions/overriding and more customization friendly docker driver (#1065 ) In pure-runner and LB agent, service providers might want to set specific driver options. For example, to add cpu-shares to functions, LB can add the information as extensions to the Call and pass this via gRPC to runners. Runners then pick these extensions from gRPC call and pass it to driver. Using a custom driver implementation, pure-runners can process these extensions to modify docker.CreateContainerOptions. To achieve this, LB agents can now be configured using a call overrider. Pure-runners can be configured using a custom docker driver. RunnerCall and Call interfaces both expose call extensions. An example to demonstrate this is implemented in test/fn-system-tests/system_test.go which registers a call overrider for LB agent as well as a simple custom docker driver. In this example, LB agent adds a key-value to extensions and runners add this key-value as an environment variable to the container.	2018-06-18 14:42:28 -07:00
Peter Jausovec	bd5150f1ac	Extract register view functionality (#1056 ) * WIP * Create separate Register*Views functions that are called from main.	2018-06-12 17:24:21 +01:00
Owen Cliffe	c6abc8bf64	Use context logging more to ensure context vars are present in log lines (#1039 )	2018-06-06 15:14:29 +01:00
Tolga Ceylan	a57907eed0	fn: user friendly timeout handling changes (#1021 ) * fn: user friendly timeout handling changes Timeout setting in routes now means "maximum amount of time a function can run in a container". Total wait time for a given http request is now expected to be handled by the client. As long as the client waits, the LB, runner or agents will search for resources to schedule it.	2018-06-01 13:18:13 -07:00
Tolga Ceylan	d190167580	fn: read-only root fs becomes default (#1019 ) * fn: read-only root fs becomes default Set root fs as read-only by default. * fn: update doc for FN_DISABLE_READONLY_ROOTFS	2018-05-30 18:17:28 -07:00
Tolga Ceylan	9584643142	fn: size restricted tmpfs /tmp and read-only / support (#1012 ) * fn: size restricted tmpfs /tmp and read-only / support ) read-only Root Fs Support ) removed CPUShares from docker API. This was unused. ) docker.Prepare() refactoring ) added docker.configureTmpFs() for size limited tmpfs on /tmp ) tmpfs size support in routes and resource tracker ) fix fn-test-utils to handle sparse files better in create file * test typo fix	2018-05-25 14:12:29 -07:00
Gerardo Viedma	ea1f94253f	Implement graceful shutdown of agent.DataAccess (#1008 ) * Implements graceful shutdown of agent.DataAccess and underlying Datastore/Logstore/MessageQueue * adds tests for closing agent.DataAccess and Datastore	2018-05-21 11:28:21 +01:00
Reed Allman	cbe0d5e9ac	add user syslog writers to app (#970 ) * add user syslog writers to app users may specify a syslog url[s] on apps now and all functions under that app will spew their logs out to it. the docs have more information around details there, please review those (swagger and operating/logging.md), tried to implement to spec in some parts and improve others, open to feedback on format though, lots of liberty there. design decision wise, I am looking to the future and ignoring cold containers. the overhead of the connections there will not be worth it, so this feature only works for hot functions, since we're killing cold anyway (even if a user can just straight up exit a hot container). syslog connections will be opened against a container when it starts up, and then the call id that is logged gets swapped out for each call that goes through the container, this cuts down on the cost of opening/closing connections significantly. there are buffers to accumulate logs until we get a `\n` to actually write a syslog line, and a buffer to save some bytes when we're writing the syslog formatting as well. underneath writers re-use the line writer in certain scenarios (swapper). we could likely improve the ease of setting this up, but opening the syslog conns against a container seems worth it, and is a different path than the other func loggers that we create when we make a call object. the Close() stuff is a little tricky, not sure how to make it easier and have the ^ benefits, open to idears. this does add another vector of 'limits' to consider for more strict service operators. one being how many syslog urls can a user add to an app (infinite, atm) and the other being on the order of number of containers per host we could run out of connections in certain scenarios. 
there may be some utility in having multiple syslog sinks to send to, it could help with debugging at times to send to another destination or if a user is a client w/ someone and both want the function logs, e.g. (have used this for that in the past, specifically). this also doesn't work behind a proxy, which is something i'm open to fixing, but afaict will require a 3rd party dependency (we can pretty much steal what docker does). this is mostly of utility for those of us that work behind a proxy all the time, not really for end users. there are some unit tests. integration tests for this don't sound very fun to maintain. I did test against papertrail with each protocol and it works (and even times out if you're behind a proxy!). closes #337 * add trace to syslog dial	2018-05-15 11:00:26 -07:00
Tolga Ceylan	508d9e18c7	fn: nonblocking resource manager tests (#987 )	2018-05-09 19:23:10 -07:00
Tolga Ceylan	0f50537150	fn: allow specified docker networks in functions (#982 ) * fn: allow specified docker networks in functions If FN_DOCKER_NETWORK is specified with a list of networks, then agent driver picks the least used network to place functions on. * add mutex comment	2018-05-09 12:24:15 -07:00
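The network selection described in commit 0f50537150 can be sketched roughly as follows. This is a minimal illustration, not fn's driver code: the `networkPicker` type and its `acquire`/`release` methods are hypothetical names, and the sketch assumes the list of networks has already been parsed from FN_DOCKER_NETWORK.

```go
package main

import (
	"fmt"
	"sync"
)

// networkPicker is a hypothetical helper (not fn's actual type) that tracks
// how many containers are placed on each configured Docker network.
type networkPicker struct {
	mu    sync.Mutex     // guards usage; containers start and stop concurrently
	names []string       // configured networks, kept in a stable order
	usage map[string]int // containers currently placed on each network
}

func newNetworkPicker(names []string) *networkPicker {
	return &networkPicker{names: names, usage: make(map[string]int, len(names))}
}

// acquire returns the least used network and counts the new placement.
// Ties go to the network listed first. It returns "" if none are configured.
func (p *networkPicker) acquire() string {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.names) == 0 {
		return ""
	}
	best := p.names[0]
	for _, name := range p.names[1:] {
		if p.usage[name] < p.usage[best] {
			best = name
		}
	}
	p.usage[best]++
	return best
}

// release records that a container on the given network has gone away.
func (p *networkPicker) release(name string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.usage[name] > 0 {
		p.usage[name]--
	}
}

func main() {
	p := newNetworkPicker([]string{"fn-net1", "fn-net2"})
	fmt.Println(p.acquire(), p.acquire(), p.acquire()) // fn-net1 fn-net2 fn-net1
	p.release("fn-net1")
	p.release("fn-net1")
	fmt.Println(p.acquire()) // fn-net1
}
```

A single mutex around a small map is enough here because placement happens once per container start, which is far less frequent than function invocation.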
jan grant	91e58afa55	The opencensus API changes between 0.6.0 and 0.9.0 (#980 ) We get some useful features in later versions; update so as to not pin downstream consumers (extensions) to an older version.	2018-05-09 14:55:00 +01:00
Tolga Ceylan	54ba49be65	fn: non-blocking resource tracker and notification (#841 ) * fn: non-blocking resource tracker and notification For some types of errors, we might want to notify the actual caller if the error is directly 1-1 tied to that request. If hotLauncher is triggered with signaller, then here we send a back communication error notification channel. This is passed to checkLaunch to send back synchronous responses to the caller that initiated this hot container launch. This is useful if we want to run the agent in quick fail mode, where instead of waiting for CPU/Mem to become available, we prefer to fail quick in order not to hold up the caller. To support this, non-blocking resource tracker option/functions are now available. * fn: test env var rename tweak * fn: fixup merge * fn: rebase test fix * fn: merge fixup * fn: test tweak down to 70MB for 128MB total * fn: refactor token creation and use broadcast regardless * fn: nb description * fn: bugfix	2018-04-24 21:59:33 -07:00
Travis Reeder	3eb60e2028	CloudEvents I/O format support. (#948 ) * CloudEvents I/O format support. * Updated format doc. * Remove log lines * This adds support for CloudEvent ingestion at the http router layer. * Updated per comments. * Responds with full CloudEvent message. * Fixed up per comments * Fix tests * Checks for cloudevent content-type * doesn't error on missing content-type.	2018-04-23 16:05:13 -07:00
Tolga Ceylan	623aeb35b2	fn: common.WaitGroup improvements (#940 ) * fn: common.WaitGroup improvements ) Split the API into AddSession/DoneSession ) Only wake up listeners when session count reaches zero. * fn: WaitGroup go-routine blast test * fn: test fix and rebase fixup	2018-04-12 16:21:13 -07:00
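The session-counting idea behind commits 623aeb35b2 and e53d23afc9 can be illustrated with a small sketch. It is not the real common.WaitGroup: the `sessionGroup` type and `CloseAndWait` method are hypothetical, and only the AddSession/DoneSession names and the "wake listeners only at zero" rule come from the commit messages. Unlike sync.WaitGroup, adding a session after shutdown has begun is rejected instead of being a misuse.

```go
package main

import (
	"fmt"
	"sync"
)

// sessionGroup is a hypothetical stand-in for a shutdown-aware wait group.
type sessionGroup struct {
	mu       sync.Mutex
	cond     *sync.Cond
	sessions uint64 // sessions currently in flight
	closed   bool   // set once shutdown has started
}

func newSessionGroup() *sessionGroup {
	g := &sessionGroup{}
	g.cond = sync.NewCond(&g.mu)
	return g
}

// AddSession registers delta new sessions. It returns false, and registers
// nothing, if shutdown has already begun.
func (g *sessionGroup) AddSession(delta uint64) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.closed {
		return false
	}
	g.sessions += delta
	return true
}

// DoneSession ends one session and wakes waiters only when the count hits zero.
func (g *sessionGroup) DoneSession() {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.sessions == 0 {
		panic("DoneSession called without a matching AddSession")
	}
	g.sessions--
	if g.sessions == 0 {
		g.cond.Broadcast()
	}
}

// CloseAndWait stops new sessions and blocks until in-flight ones finish.
func (g *sessionGroup) CloseAndWait() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.closed = true
	for g.sessions > 0 {
		g.cond.Wait()
	}
}

func main() {
	g := newSessionGroup()
	for i := 0; i < 3; i++ {
		if g.AddSession(1) {
			go func() {
				defer g.DoneSession()
				// handle one request here
			}()
		}
	}
	g.CloseAndWait()
	fmt.Println("accepting new sessions:", g.AddSession(1)) // false
}
```

Because the closed flag and the counter share one lock, a late AddSession can never race with the waiter, which is the ordering constraint the quoted sync.WaitGroup documentation warns about.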
Tolga Ceylan	e47d55056a	fn: reduce lbagent and agent dependency (#938 ) * fn: reduce lbagent and agent dependency lbagent and agent code is too dependent. This causes any change in agent to break lbagent. In reality, for LB there should be no delegated agent. Splitting these two will cause some code duplication, but it reduces dependency and complexity (eg. agent without docker) * fn: post rebase fixup * fn: runner/runnercall should use lbDeadline * fn: fixup ln agent test * fn: remove agent create option for common.WaitGroup	2018-04-12 15:51:58 -07:00
Tolga Ceylan	e53d23afc9	fn: sync.WaitGroup replacement common.WaitGroup (#937 ) * fn: sync.WaitGroup replacement common.WaitGroup agent/lb_agent/pure_runner has been incorrectly using sync.WaitGroup semantics. Switching these components to use the new common.WaitGroup() that provides a few handy functionality for common graceful shutdown cases. From https://golang.org/pkg/sync/#WaitGroup, "Note that calls with a positive delta that occur when the counter is zero must happen before a Wait. Calls with a negative delta, or calls with a positive delta that start when the counter is greater than zero, may happen at any time. Typically this means the calls to Add should execute before the statement creating the goroutine or other event to be waited for. If a WaitGroup is reused to wait for several independent sets of events, new Add calls must happen after all previous Wait calls have returned." HandleCallEnd introduces some complexity to the shutdowns, but this is currently handled by AddSession(2) initially and letting the HandleCallEnd() when to decrement by -1 in addition to decrement -1 in Submit(). lb_agent shutdown sequence and particularly timeouts with runner pool needs another look/revision, but this is outside of the scope of this commit. * fn: lb-agent wg share * fn: no need to +2 in Submit with defer. Removed defer since handleCallEnd already has this responsibility.	2018-04-12 11:33:01 -07:00
Tolga Ceylan	ee262901a2	fn: handleCallEnd and submit improvements (#919 ) * fn: move call error/end handling to handleCallEnd This simplifies submit() function but moves the burden of retriable-versus-committed request handling and slot.Close() responsibility to handleCallEnd().	2018-04-10 10:48:12 -07:00
Tolga Ceylan	584e4e75eb	Experimental Pre-fork Pool: Recycle net ns (#890 ) * fn: experimental prefork recycle and other improvements ) Recycle and do not use same pool container again option. ) Two state processing: initializing versus ready (start-kill). ) Ready state is exempt from rate limiter. fn: experimental prefork pool multiple network support In order to exceed 1023 container (bridge port) limit, add multiple networks: for i in fn-net1 fn-net2 fn-net3 fn-net4 do docker network create $i done to Docker startup, (eg. dind preentry.sh), then provide this to prefork pool using: export FN_EXPERIMENTAL_PREFORK_NETWORKS="fn-net1 fn-net2 fn-net3 fn-net4" which should be able to spawn 1023 * 4 containers. * fn: fixup tests for cfg move * fn: add ipc and pid namespaces into prefork pooling * fn: revert ipc and pid namespaces for now Pid/Ipc opens up the function container to pause container.	2018-04-05 15:07:30 -07:00
Tolga Ceylan	81954bcf53	fn: perform call.End() after request is processed (#918 ) * fn: perform call.End() after request is processed call.End() performs several tasks in sequence; insert call, insert log, (todo) remove mq entry, fireAfterCall callback, etc. These currently add up to the request latency as return from agent.Submit() is blocked on these. We also haven't been able to apply any timeouts on these operations since they are handled during request processing and it is hard to come up with a strategy for it. Also the error cases (couldn't insert call or log) are not propagated to the caller. With this change, call.End() handling becomes asynchronous where we perform these tasks after the request is done. This improves latency and we no longer have to block the call on these operations. The changes will also free up the agent slot token more quickly and now we are no longer tied to hiccups in call.End(). Now, a timeout policy is also added to this which can be adjusted with an env variable. (default 10 minutes) This accentuates the fact that call/log/fireAfterCall are not completed when request is done. So, there's a window there where call is done, but call/log/fireAfterCall are not yet propagated. This was already the case especially for error cases. There's slight risk of accumulating call.End() operations in case of hiccups in these log/call/callback systems. * fn: address risk of overstacking of call.End() calls.	2018-04-05 14:42:12 -07:00
Justin Ko	9cb883ca68	Godoc fixes (#898 ) Add some godoc comments for the api/agent package and some of its subpackages.	2018-03-28 10:16:40 -07:00
Reed Allman	8af605cf3d	update thrift, opencensus, others (#893 ) * update thrift, opencensus, others * stats: update to opencensus 0.6.0 view api	2018-03-26 15:43:49 -07:00
Denis Makogon	3c15ca6ea6	App ID (#641 ) * App ID * Clean-up * Use ID or name to reference apps * Can use app by name or ID * Get rid of AppName for routes API and model routes API is completely backwards-compatible routes API accepts both app ID and name * Get rid of AppName from calls API and model * Fixing tests * Get rid of AppName from logs API and model * Restrict API to work with app names only * Addressing review comments * Fix for hybrid mode * Fix rebase problems * Addressing review comments * Addressing review comments pt.2 * Fixing test issue * Addressing review comments pt.3 * Updated docstring * Adjust UpdateApp SQL implementation to work with app IDs instead of names * Fixing tests * fmt after rebase * Make tests green again! * Use GetAppByID wherever it is necessary - adding new v2 endpoints to keep hybrid api/runner mode working - extract CallBase from Call object to expose that to a user (it doesn't include any app reference, as we do for all other API objects) * Get rid of GetAppByName * Adjusting server router setup * Make hybrid work again * Fix datastore tests * Fixing tests * Do not ignore app_id * Resolve issues after rebase * Updating test to make it work as it was * Tabula rasa for migrations * Adding calls API test - we need to ensure we give "App not found" for the missing app and missing call in first place - making previous test work (request missing call for the existing app) * Make datastore tests work fine with correctly applied migrations * Make CallFunction middleware work again had to adjust its implementation to set app ID before proceeding * The biggest rebase ever made * Fix 8's migration * Fix tests * Fix hybrid client * Fix tests problem * Increment app ID migration version * Fixing TestAppUpdate * Fix rebase issues * Addressing review comments * Renew vendor * Updated swagger doc per recommendations	2018-03-26 11:19:36 -07:00
Tolga Ceylan	0addcb8911	fn: pre-fork pool for namespace/network speedup (#874 ) * fn: pre-fork pool experimental implementation	2018-03-23 16:35:35 -07:00
Dario Domizioli	27ffb561e8	Hide details of delegated agents for PR and LB, to disable docker for LB (#872 ) * Move delegated agent creation within NewLBAgent so we can hide the fact we disable docker * Move delegated agent creation within NewPureRunner for better encapsulation	2018-03-20 13:45:45 +00:00
Tolga Ceylan	cb61a678d9	fn: add storage opt size support (#860 ) Added env FN_MAX_FS_SIZE_MB, which if defined and non-zero is passed to docker as storage opt size. We do not validate if this option is supported by docker currently. This is because it's difficult to actually validate this since it not only depends on storage driver and its backing filesystem, but also the mount options used to mount that fs.	2018-03-14 15:47:34 -07:00
Tolga Ceylan	74a51f3f88	fn: reorg agent config (#853 ) * fn: reorg agent config ) Moving constants in agent to agent config, which helps with testing, tuning. ) Added max total cpu & memory for testing & clamping max mem & cpu usage if needed. * fn: adjust PipeIO time * fn: for hot, cannot reliably test EndOfLogs in TestRouteRunnerExecution	2018-03-13 18:38:47 -07:00
Reed Allman	9eaf824398	add jaeger support, link hot container & req span (#840 ) * add jaeger support, link hot container & req span * adds jaeger support now with FN_JAEGER_URL, there's a simple tutorial in the operating/metrics.md file now and it's pretty easy to get up and running. * links a hot request span to a hot container span. when we change this to sample at a lower ratio we'll need to finagle the hot container span to always sample or something, otherwise we'll hide that info. at least, since we're sampling at 100% for now if this is flipped on, can see freeze/unfreeze etc. if they hit. this is useful for debugging. note that zipkin's exporter does not follow the link at all, hence jaeger... and they're backed by the Cloud Empire now (CNCF) so we'll probably use it anyway. * vendor: add thrift for jaeger	2018-03-13 15:57:12 -07:00
Tolga Ceylan	e80a06937b	fn: timeouts and container exits should stop slot queuing (#843 ) 1) in theory it may be possible for an exited container to requeue a slot, close this gap by always setting fatal error for a slot if a container has exited. 2) when a client request times out or is cancelled (client disconnect, etc.) the slot should not be allowed to be requeued and container should terminate to avoid accidental mixing of previous response into next.	2018-03-12 11:18:55 -07:00
Tolga Ceylan	7177bf3923	fn: enable failing test back (#826 ) * fn: enable failing test back * fn: fortifying the stderr output Modified limitWriter to discard excess data instead of returning error, this is to allow stderr/stdout pipes flowing to avoid head-of-line blocking or data corruption in container stdout/stderr output stream.	2018-03-09 09:57:28 -08:00
Tolga Ceylan	f85294b0fc	fn: log agent cfg with field names (#829 )	2018-03-09 09:53:16 -08:00
Gerardo Viedma	8af57da7b2	Support load-balanced runner groups for multitenant compute isolation (#814 ) * Initial stab at the protocol * initial protocol sketch for node pool manager * Added http header frame as a message * Force the use of WithAgent variants when creating a server * adds grpc models for node pool manager plus go deps * Naming things is really hard * Merge (and optionally purge) details received by the NPM * WIP: starting to add the runner-side functionality of the new data plane * WIP: Basic startup of grpc server for pure runner. Needs proper certs. * Go fmt * Initial agent for LB nodes. * Agent implementation for LB nodes. * Pass keys and certs to LB node agent. * Remove accidentally left reference to env var. * Add env variables for certificate files * stub out the capacity and group membership server channels * implement server-side runner manager service * removes unused variable * fixes build error * splits up GetCall and GetLBGroupId * Change LB node agent to use TLS connection. * Encode call model as JSON to send to runner node. * Use hybrid client in LB node agent. This should provide access to get app and route information for the call from an API node. * More error handling on the pure runner side * Tentative fix for GetCall problem: set deadlines correctly when reserving slot * Connect loop for LB agent to runner nodes. * Extract runner connection function in LB agent. * drops committed capacity counts * Bugfix - end state tracker only in submit * Do logs properly * adds first pass of tracking capacity metrics in agent * maked memory capacity metric uint64 * maked memory capacity metric uint64 * removes use of old capacity field * adds remove capacity call * merges overwritten reconnect logic * First pass of a NPM Provide a service that talks to a (simulated) CP. 
- Receive incoming capacity assertions from LBs for LBGs - expire LB requests after a short period - ask the CP to add runners to a LBG - note runner set changes and readvertise - scale down by marking runners as "draining" - shut off draining runners after some cool-down period * add capacity update on schedule * Send periodic capacity metrics Sending capacity metrics to node pool manager * splits grpc and api interfaces for capacity manager * failure to advertise capacity shouldn't panic * Add some instructions for starting DP/CP parts. * Create the poolmanager server with TLS * Use logrus * Get npm compiling with cert fixups. * Fix: pure runner should not start async processing * brings runner, nulb and npm together * Add field to acknowledgment to record slot allocation latency; fix a bug too * iterating on pool manager locking issue * raises timeout of placement retry loop * Fix up NPM Improve logging Ensure that channels etc. are actually initialised in the structure creation! * Update the docs - runners GRPC port is 9120 * Bugfix: return runner pool accurately. * Double locking * Note purges as LBs stop talking to us * Get the purging of old LBs working. * Tweak: on restart, load runner set before making scaling decisions. * more agent synchronization improvements * Deal with the CP pulling out active hosts from under us. * lock at lbgroup level * Send request and receive response from runner. * Add capacity check right before slot reservation * Pass the full Call into the receive loop. 
* Wait for the data from the runner before finishing * force runner list refresh every time * Don't init db and mq for pure runners * adds shutdown of npm * fixes broken log line * Extract an interface for the Predictor used by the NPM * purge drained connections from npm * Refactor of the LB agent into the agent package * removes capacitytest wip * Fix undefined err issue * updating README for poolmanager set up * use retrying dial for lb to npm connections * Rename lb_calls to lb_agent now that all functionality is there * Use the right deadline and errors in LBAgent * Make stream error flag per-call rather than global otherwise the whole runner is damaged by one call dropping * abstracting gRPCNodePool * Make stream error flag per-call rather than global otherwise the whole runner is damaged by one call dropping * Add some init checks for LB and pure runner nodes * adding some useful debug * Fix default db and mq for lb node * removes unreachable code, fixes typo * Use datastore as logstore in API nodes. This fixes a bug caused by trying to insert logs into a nil logstore. It was nil because it wasn't being set for API nodes. * creates placement abstraction and moves capacity APIs to NodePool * removed TODO, added logging * Dial reconnections for LB <-> runners LB grpc connections to runners are established using a backoff strategy in the event of reconnections; this lets the LB stay up even if one of the runners goes away, and reconnect to it as soon as it is back. * Add a status call to the Runner protocol Stub at the moment. To be used for things like draindown, health checks. * Remove comment. * makes assign/release capacity lockless * Fix hanging issue in lb agent when connections drop * Add the CH hash from fnlb Select this with FN_PLACER=ch when launching the LB. 
* small improvement for locking on reloadLBGmembership * Stabilise the list of Runners returned by NodePool The NodePoolManager makes some attempt to keep the list of runner nodes advertised as stable as possible. Let's preserve this effort in the client side. The main point of this is to attempt to keep the same runner at the same index in the []Runner returned by NodePool.Runners(lbgid); the ch algorithm likes it when this is the case. * Factor out a generator function for the Runners so that mocks can be injected * temporarily allow lbgroup to be specified in HTTP header, while we sort out changes to the model * fixes bug with nil runners * Initial work for mocking things in tests * fix for anonymous go routine error * fixing lb_test to compile * Refactor: internal objects for gRPCNodePool are now injectable, with defaults for the real world case * Make GRPC port configurable, fix weird handling of web port too * unit test reload Members * check on runner creation failure * adding nullRunner in case of failure during runner creation * Refactored capacity advertisements/aggregations. Made grpc advertisement post asynchronous and non-blocking. * make capacityEntry private * Change the runner gRPC bind address. This uses the existing `whoAmI` function, so that the gRPC server works when the runner is running on a different host. * Add support for multiple fixed runners to pool mgr * Added harness for dataplane system tests, minor refactors * Add Dockerfiles for components, along with docs. * Doc fix: second runner needs a different name. 
* Let us have three runners in system tests, why not * The first system test running a function in API/LB/PureRunner mode * Add unit test for Advertiser logic * Fix issue with Pure Runner not sending the last data frame * use config in models.Call as a temporary mechanism to override lb group ID * make gofmt happy * Updates documentation for how to configure lb groups for an app/route * small refactor unit test * Factor NodePool into its own package * Lots of fixes to Pure Runner - concurrency woes with errors and cancellations * New dataplane with static runnerpool (#813) Added static node pool as default implementation * moved nullRunner to grpc package * remove duplication in README * fix go vet issues * Fix server initialisation in api tests * Tiny logging changes in pool manager. Using `WithError` instead of `Errorf` when appropriate. * Change some log levels in the pure runner * fixing readme * moves multitenant compute documentation * adds introduction to multitenant readme * Proper triggering of system tests in makefile * Fix insructions about starting up the components * Change db file for system tests to avoid contention in parallel tests * fixes revisions from merge * Fix merge issue with handling of reserved slot * renaming nulb to lb in the doc and images folder * better TryExec sleep logic clean shutdown In this change we implement a better way to deal with the sleep inside the for loop during the attempt for placing a call. Plus we added a clean way to shutdown the connections with external component when we shut down the server. * System_test mysql port set mysql port for system test to a different value to the one set for the api tests to avoid conflicts as they can run in parallel. 
* change the container name for system-test * removes flaky test TestRouteRunnerExecution pending resolution by issue #796 * amend remove_containers to remove newly added containers * Rework capacity reservation logic at a higher level for now * LB agent implements Submit rather than delegating. * Fix go vet linting errors * Changed a couple of error levels * Fix formatting * removes commented out test * adds snappy to vendor directory * updates Gopkg and vendor directories, removing snappy and adding siphash * wait for db containers to come up before starting the tests * make system tests start API node on 8085 to avoid port conflict with api_tests * avoid port conflicts with api_test.sh which are run in parallel * fixes postgres port conflict and issue with removal of old containers * Remove spurious println	2018-03-08 14:45:19 -08:00
Tolga Ceylan	7677aad450	fn: I/O related improvements (#809 ) ) I/O protocol parse issues should shut down the container, as the container goes into an inconsistent state between calls (e.g. the next call may receive the previous call's leftovers). ) Move ghost read/write code into io_utils in common. ) Clean unused error from docker Wait() ) We can catch one case in JSON: if there's remaining unparsed data in the decoder buffer, we can shut down the container ) stdout/stderr when container is not handling a request are now blocked if freezer is also enabled. ) if a fatal err is set for slot, we do not requeue it and proceed to shutdown *) added a test function for a few cases with freezer strict behavior	2018-03-07 15:09:24 -08:00
Reed Allman	206aa3c203	opentracing -> opencensus (#802 ) * update vendor directory, add go.opencensus.io * update imports * oops * s/opentracing/opencensus/ & remove prometheus / zipkin stuff & remove old stats * the dep train rides again * fix gin build * deps from last guy * start in on the agent metrics * she builds * remove tags for now, cardinality error is fussing. subscribe instead of register * update to patched version of opencensus to proceed for now TODO switch to a release * meh fix imports * println debug the bad boys * lace it with the tags * update deps again * fix all inconsistent cardinality errors * add our own logger * fix init * fix oom measure * remove bugged removal code * fix s3 measures * fix prom handler nil	2018-03-05 09:35:28 -08:00
Tolga Ceylan	89a1fc7c72	Response size clamp (#786 ) ) Limit response http body or json response size to FN_MAX_RESPONSE_SIZE (default unlimited) ) If limits are exceeded 502 is returned with 'body too large' in the error message	2018-03-01 17:14:50 -08:00
Tolga Ceylan	320b766a6d	fn: introduce agent config and minor ghostreader tweak (#797 ) * fn: introduce agent config and minor ghostreader tweak TODO: move all constants/tweaks in agent to agent config. * fn: json convention	2018-02-27 12:17:13 -08:00
Tolga Ceylan	af1ea0fa95	fn: ui no longer uses /stats (#776 ) Decommission /stats related code.	2018-02-15 16:05:59 -08:00
Tolga Ceylan	567136cb5e	fn: required docker version fix (#759 )	2018-02-12 15:53:05 -08:00
Tolga Ceylan	c848fc6181	fn: hot container timer improvements (#751 ) * fn: hot container timer improvements With this change, now we are allocating the timers when the container starts and managing them via stop/clear as needed, which should not only be more efficient, but also easier to follow. For example, previously, if eject time out was set to 10 secs, this could have delayed idle timeout up to 10 secs as well. It is also not necessary to do any math for elapsed time. Now consumers avoid any requeuing when startDequeuer() is cancelled. This was triggering additional dequeue/requeue causing containers to wake up spuriously. Also in startDequeuer(), we no longer remove the item from the actual queue and leave this to acquire/eject, which side steps issues related with item landing in the channel, not consumed, etc.	2018-02-12 14:12:03 -08:00
Reed Allman	27179ddf54	plumb ctx for container removal spanno (#750 ) these were just dangling off on the side, took some plumbing work but not so bad	2018-02-08 22:48:23 -08:00
Tolga Ceylan	f27d47f2dd	Idle Hot Container Freeze/Preempt Support (#733 ) * fn: freeze/unfreeze and eject idle under resource contention	2018-02-07 17:21:53 -08:00
Tolga Ceylan	ebc6657071	fn: docker version check2 (#744 ) 1) now required docker version is 17.06 2) enable circle ci latest docker install 3) docker driver & agent check minimum version before start	2018-02-06 16:16:40 -08:00
Reed Allman	3b261fc144	pipe swapparoo each slot (#721 ) * pipe swapparoo each slot previously, we made a pair of pipes for stdin and stdout for each container, and then handed them out to each call (slot) to use. this meant that multiple calls could have a handle on the same stdin pipe and stdout pipe to read/write to/from from fn's perspective and could mix input/output and get garbage. this also meant that each was blocked on the previous' reads. now we make a new pipe every time we get a slot, and swap it out with the previous ones. calls are no longer blocked from fn's perspective, and we don't have to worry about timing out dispatch for any hot format. there is still the issue that if a function does not finish reading the input from the previous task, from its perspective, and reads the next call's it can error out the second call. with fn deadline we provide the necessary tools to skirt this, but without some additional coordination am not sure this is a closable hole with our current protocols since terminating a previous calls input requires some protocol specific bytes to go in (json in particular is tricky). anyway, from fn's side fixing pipes was definitely a hole, but this client hole is still hanging out. there was an attempt to send an io.EOF but the issue is that will shut down docker's read on the stdin pipe (and the container). poop. this adds a test for this behavior, and makes sure 2 containers don't get launched. this also closes the response writer header race a little, but not entirely, I think there's still a chance that we read a full response from a function and get a timeout while we're changing the headers. I guess we need a thread safe header bucket, otherwise we have to rely on timings (racy). thinking on it. * fix stats mu race	2018-01-31 17:25:24 -08:00
Dario Domizioli	e753732bd8	Hot protocols improvements (for 662) (#724 ) * Improve deadline handling in streaming protocols * Move special headers handling down to the protocols * Adding function format documentation for JSON changes * Add tests for request url and method in JSON protocol * Fix protocol missing fn-specific info * Fix import * Add panic for something that should never happen	2018-01-31 12:26:43 +00:00
Tolga Ceylan	97d78c584b	fn: better slot/container/request state tracking (#719 ) * fn: better slot/container/request state tracking	2018-01-26 12:21:11 -08:00
Reed Allman	bbd50a0e02	additional ctx spans / maid service (#716 ) * add spans to async * clean up / add spans to agent * there were a few methods which had multiple contexts which existed in the same scope (this doesn't end well, usually), flattened those out. * loop bound context cancels now rely on defer (also was brittle) * runHot had a lot of ctx shuffling, flattened that. * added some additional spans in certain paths for added granularity * linked up the hot launcher / run hot / wait hot to _a_ root span, the first 2 are follows from spans, but at least we can see the source of these and also can see containers launched over a hot launcher's lifetime I left TODO around the FollowsFrom because OpenCensus doesn't, at least at the moment, appear to have any idea of FollowsFrom and it was an extra OpenTracing method (we have to get the span out, start a new span with the option, then add it to the context... some shuffling required). anyway, was on the fence about adding at least. * resource waiters need to manage their own goroutine lifecycle * if we get an impossible memory request, bail instead of infinite loop * handle timeout slippery case * still sucks, but hotLauncher doesn't leak anything. even the time.After timer goroutines * simplify GetResourceToken GetCall can guard against the impossible to allocate resource tasks entering the system by erroring instead of doling them out. this makes GetResourceToken logic more straightforward for callers, who now simply have the contract that they won't ever get a token if they let tasks into the agent that can't run (but GetCall guards this, and there's a test for it). sorry, I was going to make this only do that, but when I went to fix up the tests, my last patch went haywire so I fixed that too. this also at least tries to simplify the hotLaunch loop, which will now no longer leak time.After timers (which were long, and with signaller, they were many -- I got a stack trace :) -- this breaks out the bottom half of the logic to check to see if we need to launch into its own function, and handles the cleaning duties only in the caller instead of in 2 different select statements. played with this a bit, no doubt further cleaning could be done, but this _seems_ better. * fix vet * add units to exported method contract docs * oops	2018-01-23 19:52:22 -08:00
Tolga Ceylan	ee59361bda	fn: added server too busy stats (#717 )	2018-01-23 19:30:01 -08:00

104 Commits