fn-serverless

mirror of https://github.com/fnproject/fn.git synced 2022-10-28 21:29:17 +03:00

Author	SHA1	Message	Date
Tolga Ceylan	0f50537150	fn: allow specified docker networks in functions (#982 ) * fn: allow specified docker networks in functions If FN_DOCKER_NETWORK is specified with a list of networks, then agent driver picks the least used network to place functions on. * add mutex comment	2018-05-09 12:24:15 -07:00
Tolga Ceylan	676c87f9a5	fn: pure-runner violates io Writer contract (#981 ) We must copy the data slice.	2018-05-09 16:48:55 +01:00
jan grant	91e58afa55	The opencensus API changes between 0.6.0 and 0.9.0 (#980 ) We get some useful features in later versions; update so as to not pin downstream consumers (extensions) to an older version.	2018-05-09 14:55:00 +01:00
Reed Allman	1f1624782b	related: https://github.com/fnproject/fdk-go/pull/26 (#968) adds a test for the protocol dumping of a request to the container's stdin. there are a number of vectors to test for a cloud event, but since we're going to change that behavior soon it's probably a waste of time to go about doing so. in any event, this was pretty broken. my understanding of the cloud event spec is deepening and the json stuff overall seems a little weird. * fixes content type issue around json checking (since a string is also a json value, we can just decode it, even though it's wasteful it's more easily correct) * doesn't force all json values to be map[string]interface{} and lets them be whoever they want to be. maybe their dads are still proud. closes #966	2018-05-07 22:22:53 -07:00
Tolga Ceylan	f0f9a6d945	fn: LB ch and naive fixes (#942) * fn: LB ch and naive fixes ) Naive is now a naive round-robin (RR) algorithm. ) Both now check for ctx/timeout in each attempt. * fn: test fix	2018-05-07 11:50:16 -07:00
Tolga Ceylan	54ba49be65	fn: non-blocking resource tracker and notification (#841 ) * fn: non-blocking resource tracker and notification For some types of errors, we might want to notify the actual caller if the error is directly 1-1 tied to that request. If hotLauncher is triggered with signaller, then here we send a back communication error notification channel. This is passed to checkLaunch to send back synchronous responses to the caller that initiated this hot container launch. This is useful if we want to run the agent in quick fail mode, where instead of waiting for CPU/Mem to become available, we prefer to fail quick in order not to hold up the caller. To support this, non-blocking resource tracker option/functions are now available. * fn: test env var rename tweak * fn: fixup merge * fn: rebase test fix * fn: merge fixup * fn: test tweak down to 70MB for 128MB total * fn: refactor token creation and use broadcast regardless * fn: nb description * fn: bugfix	2018-04-24 21:59:33 -07:00
Travis Reeder	3eb60e2028	CloudEvents I/O format support. (#948 ) * CloudEvents I/O format support. * Updated format doc. * Remove log lines * This adds support for CloudEvent ingestion at the http router layer. * Updated per comments. * Responds with full CloudEvent message. * Fixed up per comments * Fix tests * Checks for cloudevent content-type * doesn't error on missing content-type.	2018-04-23 16:05:13 -07:00
Tolga Ceylan	c0ee3ce736	fn: locked mutex while blocked on I/O considered harmful (#935 ) * fn: mutex while waiting I/O considered harmful ) Removed hold mutex while wait I/O cases these included possible disk I/O and network I/O. ) Error/Context Close/Shutdown semantics changed since the context timeout and comments were misleading. Close always waits for pending gRPC session to complete. Context usage here was merely 'wait up to x secs to report an error' which only logs the error anyway. Instead, the runner can log the error. And context still can be passed around perhaps for future opencensus instrumentation.	2018-04-13 11:23:29 -07:00
Tolga Ceylan	623aeb35b2	fn: common.WaitGroup improvements (#940 ) * fn: common.WaitGroup improvements ) Split the API into AddSession/DoneSession ) Only wake up listeners when session count reaches zero. * fn: WaitGroup go-routine blast test * fn: test fix and rebase fixup	2018-04-12 16:21:13 -07:00
Tolga Ceylan	e47d55056a	fn: reduce lbagent and agent dependency (#938) * fn: reduce lbagent and agent dependency lbagent and agent code is too dependent. This causes any change in agent to break lbagent. In reality, for LB there should be no delegated agent. Splitting these two will cause some code duplication, but it reduces dependency and complexity (eg. agent without docker) * fn: post rebase fixup * fn: runner/runnercall should use lbDeadline * fn: fixup ln agent test * fn: remove agent create option for common.WaitGroup	2018-04-12 15:51:58 -07:00
Tolga Ceylan	e53d23afc9	fn: sync.WaitGroup replacement common.WaitGroup (#937 ) * fn: sync.WaitGroup replacement common.WaitGroup agent/lb_agent/pure_runner has been incorrectly using sync.WaitGroup semantics. Switching these components to use the new common.WaitGroup() that provides a few handy functionality for common graceful shutdown cases. From https://golang.org/pkg/sync/#WaitGroup, "Note that calls with a positive delta that occur when the counter is zero must happen before a Wait. Calls with a negative delta, or calls with a positive delta that start when the counter is greater than zero, may happen at any time. Typically this means the calls to Add should execute before the statement creating the goroutine or other event to be waited for. If a WaitGroup is reused to wait for several independent sets of events, new Add calls must happen after all previous Wait calls have returned." HandleCallEnd introduces some complexity to the shutdowns, but this is currently handled by AddSession(2) initially and letting the HandleCallEnd() when to decrement by -1 in addition to decrement -1 in Submit(). lb_agent shutdown sequence and particularly timeouts with runner pool needs another look/revision, but this is outside of the scope of this commit. * fn: lb-agent wg share * fn: no need to +2 in Submit with defer. Removed defer since handleCallEnd already has this responsibility.	2018-04-12 11:33:01 -07:00
Tolga Ceylan	9b86e3626e	fn: avoid go-routine leak (#934 )	2018-04-11 12:11:08 -07:00
Tolga Ceylan	600dce5b2c	fn: lb_agent_test tweak (#931 ) Time.Sleep() blocking fixes in placers (naive and ch) improves some of the timing in processing, therefore reducing the max calls settings in mock runner pool.	2018-04-10 18:28:47 -07:00
Tolga Ceylan	e7658db822	Move ch ring placement back from old FnLB. (#930 ) * fn: bring back CH ring placer into FN repo based on original FnLB * fn: move placement code into runnerpool directory	2018-04-10 17:26:24 -07:00
Tolga Ceylan	95d7379a81	fn: pure runner concurrency fixes (#928 ) * fn: experimental pure_runner rewrite * fn: minor refactor * fn: added comments and moved pipe initialization to NewCallHandle * fn: EOF gRPC and EOF DataFrame handling	2018-04-10 15:50:32 -07:00
Tolga Ceylan	ee262901a2	fn: handleCallEnd and submit improvements (#919 ) * fn: move call error/end handling to handleCallEnd This simplifies submit() function but moves the burden of retriable-versus-committed request handling and slot.Close() responsibility to handleCallEnd().	2018-04-10 10:48:12 -07:00
Tolga Ceylan	c1f0707b60	fn: lb_agent Close should close shutdown channel (#925 )	2018-04-09 10:56:15 -07:00
Tolga Ceylan	34b386b944	fn: lb_agent wait group unused, added to Close (#924 )	2018-04-06 18:02:51 -07:00
Tolga Ceylan	1170645af2	fn: lb_agent state trackers only apply to runners (#923 ) State trackers do not apply to LB agent.	2018-04-06 17:59:10 -07:00
Tolga Ceylan	584e4e75eb	Experimental Pre-fork Pool: Recycle net ns (#890) * fn: experimental prefork recycle and other improvements ) Recycle and do not use same pool container again option. ) Two state processing: initializing versus ready (start-kill). ) Ready state is exempt from rate limiter. fn: experimental prefork pool multiple network support In order to exceed 1023 container (bridge port) limit, add multiple networks: `for i in fn-net1 fn-net2 fn-net3 fn-net4; do docker network create $i; done` to Docker startup, (eg. dind preentry.sh), then provide this to prefork pool using: `export FN_EXPERIMENTAL_PREFORK_NETWORKS="fn-net1 fn-net2 fn-net3 fn-net4"` which should be able to spawn 1023 * 4 containers. * fn: fixup tests for cfg move * fn: add ipc and pid namespaces into prefork pooling * fn: revert ipc and pid namespaces for now Pid/Ipc opens up the function container to pause container.	2018-04-05 15:07:30 -07:00
Tolga Ceylan	81954bcf53	fn: perform call.End() after request is processed (#918 ) * fn: perform call.End() after request is processed call.End() performs several tasks in sequence; insert call, insert log, (todo) remove mq entry, fireAfterCall callback, etc. These currently add up to the request latency as return from agent.Submit() is blocked on these. We also haven't been able to apply any timeouts on these operations since they are handled during request processing and it is hard to come up with a strategy for it. Also the error cases (couldn't insert call or log) are not propagated to the caller. With this change, call.End() handling becomes asynchronous where we perform these tasks after the request is done. This improves latency and we no longer have to block the call on these operations. The changes will also free up the agent slot token more quickly and now we are no longer tied to hiccups in call.End(). Now, a timeout policy is also added to this which can be adjusted with an env variable. (default 10 minutes) This accentuates the fact that call/log/fireAfterCall are not completed when request is done. So, there's a window there where call is done, but call/log/fireAfterCall are not yet propagated. This was already the case especially for error cases. There's slight risk of accumulating call.End() operations in case of hiccups in these log/call/callback systems. * fn: address risk of overstacking of call.End() calls.	2018-04-05 14:42:12 -07:00
Reed Allman	56a2861748	move calls to logstore, implement s3 (#911 ) * move calls to logstore, implement s3 closes #482 the basic motivation is that logs and calls will be stored with a very high write rate, while apps and routes will be relatively infrequently updated; it follows that we should likely split up their storage location, to back them with appropriate storage facilities. s3 is a good candidate for ingesting higher write rate data than a sql database, and will make it easier to manage that data set. can read #482 for more detailed justification. summary: * calls api moved from datastore to logstore * logstore used in front-end to serve calls endpoints * agent now throws calls into logstore instead of datastore * s3 implementation of calls api for logstore * s3 logs key changed (nobody using / nbd?) * removed UpdateCall api (not in use) * moved call tests from datastore to logstore tests * mock logstore now tested (prev. sqlite3 only) * logstore tests run against every datastore (mysql, pg; prev. only sqlite3) * simplify NewMock in tests commentary: brunt of the work is implementing the listing of calls in GetCalls for the s3 logstore implementation. the GetCalls API requires returning items in the newest to oldest order, and the s3 api lists items in lexicographic order based on created_at. An easy thing to do here seemed to be to reverse the encoding of our id format to return a lexicographically descending order, since ids are time based, reasonably encoded to be lexicographically sortable, and de-duped (unlike created_at). This seems to work pretty well, it's not perfect around the boundaries of to_time and from_time and a tiny amount of results may be omitted, but to me this doesn't seem like a deal breaker to get 6999 results instead of 7000 when trying to get calls between 3:00pm and 4:00pm Monday 3 weeks ago. Of course, without to_time and from_time, there are no issues in listing results. 
We could use created at and encode it, but it would be an additional marker for point lookup (GetCall) since we would have to search for a created_at stamp, search for ids around that until we find the matching one, just to do a point lookup. So, the tradeoff here seems worth it. There is additional optimization around to_time to seek over newer results (since we have descending order). The other complication in GetCalls is returning a list of calls for a given path. Since the keys to do point lookups are only app_id + call_id, and we need listing across an app as well, this leads us to the 'marker' collection which is sorted by app_id + path + call_id, to allow quick listing by path. All in all, it should be pretty straightforward to follow the implementation and I tried to be lavish with the comments, please let me know if anything needs further clarification in the code. The implementation itself has some glaring inefficiencies, but they're relatively minute: json encoding is kinda lazy, but workable; s3 doesn't offer batch retrieval, so we point look up each call one by one in get call; not re-using buffers -- but the seeking around the keys should all be relatively fast, not too worried about performance really and this isn't a hot path for reads (need to make a cut point and turn this in!). Interestingly, in testing, minio performs significantly worse than pg for storing both logs and calls (or just logs, I tested that too). minio seems to have really high cpu consumption, but in any event, we won't be using minio, we'll be using a cloud object store that implements the s3 api. Anyway, mostly a knock on using minio for high performance, not really anything to do with this, just thought it was interesting. I think it's safe to remove UpdateCall, admittedly this made implementing the s3 api a lot easier. 
This operation may also be something we never need, it was unused at present and was only in the cards for a previous hybrid implementation, which we've now abandoned. If we need, we can always resurrect from git. Also not worried about changing the log key, we need to put a prefix on this thing anyway, but I don't think anybody is using this anyway. in any event, it simply means old logs won't show up through the API, but aside from nobody using this yet, that doesn't seem a big deal breaker really -- new logs will appear fine. future: TODO make logstore implementation optional for datastore, check in front-end at runtime and offer a nil logstore that errors appropriately TODO low hanging fruit optimizations of json encoding, re-using buffers for download, get multiple calls at a time, id reverse encoding could be optimized like normal encoding to not be n^2 TODO api for range removal of logs and calls * address review comments * push id to_time magic into id package * add note about s3 key sizes * fix validation check	2018-04-05 10:49:25 -07:00
Tolga Ceylan	c58caee78d	fn: update minimum docker version required. (#916) Oracle Linux 7.4 backported versions still have issues with freezing/terminating containers. 17.10.0-ce seems like a reasonable lowest common denominator.	2018-04-04 16:43:30 -07:00
Reed Allman	3a2550d042	change slots when annotations change (#914 ) when the annotations change, we need to get a different slot key to launch new containers with these annotations and let the old containers die off unused. I started on a test for this, but changing all combinations of each field in isolation to change is not very fun without reflection, and there's still a subset of fields we're managing, so it put us in about the same spot as we are now.	2018-04-03 11:21:56 -07:00
jan grant	9633cf022b	Bugfix: unsafeBytes slices were getting GCed (#913 ) There are alternative formulations of this, for instance see https://www.reddit.com/r/golang/comments/5zctpf/unsafe_conversion_between_strings_and_byte_slices/ The problem manifested in the returned values from unsafeBytes occasionally being broken. It's possible that by keeping a reference to the `a` parameter alive, the original code would still work - however, this definitely seems like a fix. (A cast to `[]byte(a)` looks increasingly attractive, for all that it'll perform small allocations and copies.)	2018-04-03 16:53:02 +01:00
jan grant	88074a42c0	Bugfix/grpc consume eof (#912 ) * GRPC streams end with an EOF The client should ensure that the final packet is followed by a GRPC EOF. This has the benefit of permitting the client code to clean up resources. * Don't require an entire HTTP request in RunnerCall TryExec needs a handle on an incoming ReadCloser containing the body of a request; however, everything else will already have been extracted from the HTTP request in the case of lbAgent use. (The point of this change is to simplify the interface for other uses.) * Return error from GRPC layer explicitly As per review	2018-04-03 15:04:21 +01:00
Andrea Rosa	72a2eb933f	Returning Agent on exported func for pureRunner (#905) pureRunner is an unexported struct, yet it was the return type of a few exported methods. This change returns Agent, the interface implemented by pureRunner, to avoid leaking an unexported type.	2018-03-30 09:15:55 -07:00
Tolga Ceylan	369f2ea17c	fn: experimental prefork tests should skip non Linux OSs (#904 )	2018-03-28 14:40:51 -07:00
Justin Ko	9cb883ca68	Godoc fixes (#898 ) Add some godoc comments for the api/agent package and some of its subpackages.	2018-03-28 10:16:40 -07:00
Gerardo Viedma	348bbaf36b	support runner TLS certificates with specified certificate Common Names (#900 ) * support runner TLS certificates with specified certificate Common Names * removes duplicate constant * run in insecure mode by default but expose ability to create tls-secured runner pools programmatically * fixes runner tests to use new tls interfaces	2018-03-28 13:57:15 +01:00
Reed Allman	8af605cf3d	update thrift, opencensus, others (#893 ) * update thrift, opencensus, others * stats: update to opencensus 0.6.0 view api	2018-03-26 15:43:49 -07:00
Denis Makogon	3c15ca6ea6	App ID (#641 ) * App ID * Clean-up * Use ID or name to reference apps * Can use app by name or ID * Get rid of AppName for routes API and model routes API is completely backwards-compatible routes API accepts both app ID and name * Get rid of AppName from calls API and model * Fixing tests * Get rid of AppName from logs API and model * Restrict API to work with app names only * Addressing review comments * Fix for hybrid mode * Fix rebase problems * Addressing review comments * Addressing review comments pt.2 * Fixing test issue * Addressing review comments pt.3 * Updated docstring * Adjust UpdateApp SQL implementation to work with app IDs instead of names * Fixing tests * fmt after rebase * Make tests green again! * Use GetAppByID wherever it is necessary - adding new v2 endpoints to keep hybrid api/runner mode working - extract CallBase from Call object to expose that to a user (it doesn't include any app reference, as we do for all other API objects) * Get rid of GetAppByName * Adjusting server router setup * Make hybrid work again * Fix datastore tests * Fixing tests * Do not ignore app_id * Resolve issues after rebase * Updating test to make it work as it was * Tabula rasa for migrations * Adding calls API test - we need to ensure we give "App not found" for the missing app and missing call in first place - making previous test work (request missing call for the existing app) * Make datastore tests work fine with correctly applied migrations * Make CallFunction middleware work again had to adjust its implementation to set app ID before proceeding * The biggest rebase ever made * Fix 8's migration * Fix tests * Fix hybrid client * Fix tests problem * Increment app ID migration version * Fixing TestAppUpdate * Fix rebase issues * Addressing review comments * Renew vendor * Updated swagger doc per recommendations	2018-03-26 11:19:36 -07:00
Tolga Ceylan	0addcb8911	fn: pre-fork pool for namespace/network speedup (#874 ) * fn: pre-fork pool experimental implementation	2018-03-23 16:35:35 -07:00
Gerardo Viedma	101236f7d8	Remove npm remnants (#882 ) * create an Annotation map of the right size to avoid resizing * removes all references to deprecated nodepool manager	2018-03-23 10:29:32 +00:00
Gerardo Viedma	0c47dbf26d	create an Annotation map of the right size to avoid resizing (#881 )	2018-03-23 10:29:07 +00:00
Dario Domizioli	8df8ed6360	Expose route and app models to RunnerCall for extensions (alternative 2) (#880 )	2018-03-22 20:07:39 +00:00
Dario Domizioli	27ffb561e8	Hide details of delegated agents for PR and LB, to disable docker for LB (#872 ) * Move delegated agent creation within NewLBAgent so we can hide the fact we disable docker * Move delegated agent creation within NewPureRunner for better encapsulation	2018-03-20 13:45:45 +00:00
Gerardo Viedma	1cae6f988e	Make PKI data and RunnerFactory public objects (#865 ) * Make PKI data and RunnerFactory public objects * removes unnecessary nullRunner object * renames secure factory to point out MTLS	2018-03-16 15:40:58 +00:00
Gerardo Viedma	73ae77614c	Moves out node pool manager behind an extension using runner pool abstraction (Part 2) (#862 ) * Move out node-pool manager and replace it with RunnerPool extension * adds extension points for runner pools in load-balanced mode * adds error to return values in RunnerPool and Runner interfaces * Implements runner pool contract with context-aware shutdown * fixes issue with range * fixes tests to use runner abstraction * adds empty test file as a workaround for build requiring go source files in top-level package * removes flappy timeout test * update docs to reflect runner pool setup * refactors system tests to use runner abstraction * removes poolmanager * moves runner interfaces from models to api/runnerpool package * Adds a second runner to pool docs example * explicitly check for request spillover to second runner in test * moves runner pool package name for system tests * renames runner pool pointer variable for consistency * pass model json to runner * automatically cast to http.ResponseWriter in load-balanced call case * allow overriding of server RunnerPool via a programmatic ServerOption * fixes return type of ResponseWriter in test * move Placer interface to runnerpool package * moves hash-based placer out of open source project * removes siphash from Gopkg.lock	2018-03-16 13:46:21 +00:00
Dario Domizioli	362e910d9d	Make dataplane system test behave deterministically (#849 ) Make dataplane system test deterministic by injecting capacity constraints	2018-03-16 11:50:44 +00:00
Tolga Ceylan	cb61a678d9	fn: add storage opt size support (#860 ) Added env FN_MAX_FS_SIZE_MB, which if defined and non-zero is passed to docker as storage opt size. We do not validate if this option is supported by docker currently. This is because it's difficult to actually validate this since it not only depends on storage driver and its backing filesystem, but also the mount options used to mount that fs.	2018-03-14 15:47:34 -07:00
Tolga Ceylan	74a51f3f88	fn: reorg agent config (#853 ) * fn: reorg agent config ) Moving constants in agent to agent config, which helps with testing, tuning. ) Added max total cpu & memory for testing & clamping max mem & cpu usage if needed. * fn: adjust PipeIO time * fn: for hot, cannot reliably test EndOfLogs in TestRouteRunnerExecution	2018-03-13 18:38:47 -07:00
Reed Allman	9eaf824398	add jaeger support, link hot container & req span (#840 ) * add jaeger support, link hot container & req span * adds jaeger support now with FN_JAEGER_URL, there's a simple tutorial in the operating/metrics.md file now and it's pretty easy to get up and running. * links a hot request span to a hot container span. when we change this to sample at a lower ratio we'll need to finagle the hot container span to always sample or something, otherwise we'll hide that info. at least, since we're sampling at 100% for now if this is flipped on, can see freeze/unfreeze etc. if they hit. this is useful for debugging. note that zipkin's exporter does not follow the link at all, hence jaeger... and they're backed by the Cloud Empire now (CNCF) so we'll probably use it anyway. * vendor: add thrift for jaeger	2018-03-13 15:57:12 -07:00
Dario Domizioli	2c8b02c845	Make PureRunner an Agent so that it encapsulates its grpc server (#834 ) * Refactor PureRunner as an Agent so that it encapsulates its grpc server * Maintain a list of extra contexts for the server to select on to handle errors and cancellations	2018-03-13 15:51:32 +00:00
Tolga Ceylan	e80a06937b	fn: timeouts and container exists should stop slot queuing (#843 ) 1) in theory it may be possible for an exited container to requeue a slot, close this gap by always setting fatal error for a slot if a container has exited. 2) when a client request times out or cancelled (client disconnect, etc.) the slot should not be allowed to be requeued and container should terminate to avoid accidental mixing of previous response into next.	2018-03-12 11:18:55 -07:00
Andrea Rosa	3261e48843	Add a timeout to the net dialer (#844) This change adds the option to set a timeout for the dialer used in making gRPC connections; with that we remove the check on the state of the connections and therefore remove any potential race conditions.	2018-03-12 13:36:53 +00:00
Andrea Rosa	43547a572f	Check runner connection before sending requests (#831) If a runner disconnects ungracefully, the connection can get stuck in connecting mode. This change verifies the state of the connection before starting to execute a call; if the client connection is not ready we fail fast to give the next runner (if any) a chance to execute the call.	2018-03-12 09:38:27 +00:00
Dario Domizioli	9b28497cff	Add a basic concurrency test for the dataplane system tests (#832 ) Add a basic concurrency test for the dataplane system tests. Also remove some spurious logging.	2018-03-10 00:51:02 +00:00
Tolga Ceylan	afeb8e6f6a	fn: json excess data check should ignore whitespace (#830 ) * fn: json excess data check should ignore whitespace * fn: adjustments and test case	2018-03-09 11:59:30 -08:00
Tolga Ceylan	7177bf3923	fn: enable failing test back (#826 ) * fn: enable failing test back * fn: fortifying the stderr output Modified limitWriter to discard excess data instead of returning error, this is to allow stderr/stdout pipes flowing to avoid head-of-line blocking or data corruption in container stdout/stderr output stream.	2018-03-09 09:57:28 -08:00
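
Commit 7177bf3923 above describes a limitWriter that discards excess data instead of returning an error, so the container's stdout/stderr pipes keep flowing. The following is a minimal sketch of that idea, not the code from the repository: the type name `discardingLimitWriter`, its fields, and the `capture` helper are hypothetical and exist only for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// discardingLimitWriter (hypothetical name) forwards at most max bytes
// to w and silently drops everything after that.
type discardingLimitWriter struct {
	w   io.Writer
	max int // total number of bytes allowed through
	n   int // bytes forwarded so far
}

// Write forwards what still fits and reports the full length of p, so the
// producer never sees a short write or an error once the limit is reached.
func (l *discardingLimitWriter) Write(p []byte) (int, error) {
	remaining := l.max - l.n
	if remaining > 0 {
		chunk := p
		if len(chunk) > remaining {
			chunk = chunk[:remaining]
		}
		written, err := l.w.Write(chunk)
		l.n += written
		if err != nil {
			return written, err
		}
	}
	return len(p), nil
}

// capture (hypothetical helper) pushes chunks through a limited writer
// and returns whatever was kept.
func capture(max int, chunks ...string) string {
	var buf bytes.Buffer
	lw := &discardingLimitWriter{w: &buf, max: max}
	for _, c := range chunks {
		if _, err := io.WriteString(lw, c); err != nil {
			return "error: " + err.Error()
		}
	}
	return buf.String()
}

func main() {
	fmt.Printf("%q\n", capture(8, "hello ", "world", "!")) // prints "hello wo"
}
```

Reporting `len(p)` even when bytes are dropped is the design choice the commit message points at: the writer on the other side of the pipe is never blocked or failed by the limit.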

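Commit 676c87f9a5 in the list above ("pure-runner violates io Writer contract ... We must copy the data slice") refers to the `io.Writer` rule that an implementation must not retain the slice passed to `Write`. A minimal sketch of why the copy matters, assuming a writer that queues data for later use; `queueWriter` and `demo` are hypothetical names, not taken from the repository.

```go
package main

import "fmt"

// queueWriter (hypothetical) keeps written data around for later sending.
type queueWriter struct {
	frames [][]byte
}

// Write copies p before queueing it. The caller is free to reuse p as
// soon as Write returns, so storing p itself would alias the caller's buffer.
func (q *queueWriter) Write(p []byte) (int, error) {
	buf := make([]byte, len(p))
	copy(buf, p)
	q.frames = append(q.frames, buf)
	return len(p), nil
}

// demo (hypothetical helper) reuses one scratch buffer for two writes,
// as callers such as io.Copy do, and returns the queued frames.
func demo() string {
	q := &queueWriter{}
	scratch := []byte("first")
	q.Write(scratch)
	copy(scratch, "XXXXX") // caller overwrites its buffer
	q.Write(scratch)
	return string(q.frames[0]) + "," + string(q.frames[1])
}

func main() {
	fmt.Println(demo()) // prints first,XXXXX
}
```

Without the copy in `Write`, both queued frames would share the caller's buffer and the first frame would be silently rewritten.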

205 Commits
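
Commit 0f50537150 in the list above says that when FN_DOCKER_NETWORK holds a list of networks, the agent driver picks the least used network for each function, and mentions a mutex. The sketch below illustrates one way such a selection could work; it is an assumption-laden illustration, not the driver's code. The names `networkPicker`, `acquire`, and `release` are hypothetical, and the network names are borrowed from the prefork example in commit 584e4e75eb.

```go
package main

import (
	"fmt"
	"sync"
)

// networkPicker (hypothetical) tracks how many containers use each network.
type networkPicker struct {
	mu    sync.Mutex // guards usage; containers start and stop concurrently
	names []string   // configured networks, in configuration order
	usage map[string]int
}

func newNetworkPicker(names ...string) *networkPicker {
	return &networkPicker{names: names, usage: make(map[string]int, len(names))}
}

// acquire returns the network with the fewest users and counts the new
// user. Ties go to the network listed first. It returns "" if no
// networks are configured.
func (p *networkPicker) acquire() string {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.names) == 0 {
		return ""
	}
	best := p.names[0]
	for _, n := range p.names[1:] {
		if p.usage[n] < p.usage[best] {
			best = n
		}
	}
	p.usage[best]++
	return best
}

// release gives a slot back when a container goes away.
func (p *networkPicker) release(name string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.usage[name] > 0 {
		p.usage[name]--
	}
}

func main() {
	p := newNetworkPicker("fn-net1", "fn-net2")
	a := p.acquire()
	b := p.acquire()
	p.release(a)
	c := p.acquire()
	fmt.Println(a, b, c) // prints fn-net1 fn-net2 fn-net1
}
```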