fn-serverless

mirror of https://github.com/fnproject/fn.git synced 2022-10-28 21:29:17 +03:00

Author	SHA1	Message	Date
Tolga Ceylan	df53250958	fn: latency metrics for various call states (#1332) * fn: latency metrics for various call states This complements the API latency metrics available on LB agent. In this case, we would like to measure calls that have finished with the following status: "completed", "canceled", "timeouts", "errors", "server_busy", and while measuring this latency, we subtract the amount of time actual function execution took. This is not precise, but an approximation mostly suitable for trending. Going forward, we could also subtract UDS wait time and/or docker pull latency from this latency as an enhancement to this PR.	2018-11-30 13:20:59 -08:00
Tolga Ceylan	549ba65dea	fn: placer stats and client context (#1322) For the sake of completeness and also as defensive coding, let's record client context cancel/timeout cases. In retry reason, if error is same as client context (timeout/cancel), we should not report as retry due to error. Similarly in placed calls, do not flag the placed call as error if client canceled or timed out. We track client context timeout/cancel separately in addition to this.	2018-11-28 10:32:55 -08:00
Krister Johansen	af59d19d24	Add support for counting kdumps in pure-runner. (#1309)	2018-11-21 16:26:30 -08:00
Shreya Garge	91f6ef3402	added context for runnerpool interface (#1320) * added context for runnerpool interface * added context for runnerpool interface	2018-11-20 17:02:47 +00:00
Andrea Rosa	182db94fad	Feature/acksync response writer (#1267) This implements a "detached" mechanism to get an ack from the runner once it actually starts to run a function. In this scenario the response returned back is just a 202 if we placed the function in a specific time-frame. If we hit some errors or we fail to place the fn in time we return back different errors.	2018-11-09 10:25:43 -08:00
Owen Strain	21f77f837e	Add APIErrorWrapper so that underlying errors can be logged (#1246)	2018-09-28 17:26:54 -07:00
Tom Coupland	d56a49b321	Remove V1 endpoints and Routes (#1210) Largely a removal job, however many tests, particularly system level ones relied on Routes. These have been migrated to use Fns. * Add 410 response to swagger * No app names in log tags * Adding constraint in GetCall for FnID * Adding test to check FnID is required on call * Add fn_id to call selector * Fix text in docker mem warning * Correct buildConfig func name * Test fix up * Removing CPU setting from Agent test CPU setting has been deprecated, but the code base is still riddled with it. This just removes it from this layer. Really we need to remove it from Call. * Remove fn id check on calls * Reintroduce fn id required on call * Adding fnID to calls for execute test * Correct setting of app id in middleware * Removes root middlewares ability to redirect fun invocations * Add oversized test check * Removing call fn id check	2018-09-17 16:44:51 +01:00
Tolga Ceylan	f57571fb3a	fn: SSL config adjustments (#1160) SSL related FN_NODE_CERT (and related) settings are not very clear today. Removing this in favor of a simple map of tls.Config objects. Three keys are provided for this map: TLSGRPCServer TLSAdminServer TLSWebServer which correspond to server TLS settings for the associated services. Operators/implementers can further add more keys to the map and add their own TLS config.	2018-08-06 20:57:03 -07:00
Tolga Ceylan	0105f8321e	fn: stats view/distribution improvements (#1154) * fn: stats view/distribution improvements ) View latency distribution is now an argument in view creation functions. This allows easier override to set custom buckets. It is simplistic and assumes all latency views would use the same set, but in practice this is already the case. ) Removed API view creation to main, this should not be enabled for all node types. This is consistent with the rest of the system. * fn: Docker samples of cpu/mem/disk with specific buckets	2018-08-03 11:06:54 -07:00
Tolga Ceylan	07d59247ec	fn: adjusting LB retry view buckets (#1139) [0, 2, 3, 4, 8, 16, 32, 64, 128, 256] gives us: s >= 0 s >= 2 s >= 3 and so on for better observability.	2018-07-26 15:54:56 -07:00
Tolga Ceylan	9f29d824d6	fn: New timeout for LB Placer (#1137) * fn: New timeout for LB Placer Previously, LB Placers worked hard as long as client contexts allowed. Adding a Placer config setting to bound this by 360 seconds by default. The new timeout is not accounted during actual function execution and only applies to the amount of wait time in Placers when the call is not being executed.	2018-07-26 10:19:25 -07:00
Tolga Ceylan	db7cbf73e2	fn: add requests received/handled in Status responses (#1132) This is useful as additional data to inflight requests. Callers can determine request arrival and processing rate.	2018-07-20 16:00:02 -07:00
Tolga Ceylan	564db4e9d2	fn: Status should expose if data was served from cache. (#1123) This is useful in scenarios where gRPC client might want to reliably observe/report the status latency metrics and remove any possible duplicates. If the status query was served from cache, then these latencies show last execution latency.	2018-07-13 17:35:00 -07:00
Tolga Ceylan	5dc5740a54	fn: runner status and docker load images (#1116) * fn: runner status and docker load images Introducing a function run for pure runner Status calls. Previously, Status gRPC calls returned active inflight request counts with the purpose of a simple health checker. However this is not sufficient since it does not show if agent or docker is healthy. With this change, if pure runner is configured with a status image, that image is executed through docker. The call uses zero memory/cpu/tmpsize settings to ensure resource tracker does not block it. However, operators might not always have a docker repository accessible/available for status image. Or operators might not want the status to go over the network. To allow such cases, and in general possibly caching docker images, added a new environment variable FN_DOCKER_LOAD_FILE. If this is set, fn-agent during startup will load these images that were previously saved with 'docker save' into docker.	2018-07-12 13:58:38 -07:00
Tolga Ceylan	317de18e6b	fn: lb-agent: Add Runner Scheduler/Execution Stats (#1107) LB agent reports lb placer latency. It should also report how long it took for the runner to initiate the call as well as execution time inside the container if the runner has accepted (committed) to the call.	2018-07-02 17:15:43 -07:00
jan grant	edf2fc8831	Add a finer-grained view for placer latency metrics (#1085) This is a small tweak to the placer latency stats. If we have a cluster of values around the 1-2s mark, then having a single relatively broad bucket that captures the (1s, 10s] range will obscure that. In particular, typical Prometheus quartile estimates may be distorted by this bucket size.	2018-06-25 10:36:46 +01:00
Tolga Ceylan	e67d0e5f3f	fn: Call extensions/overriding and more customization friendly docker driver (#1065) In pure-runner and LB agent, service providers might want to set specific driver options. For example, to add cpu-shares to functions, LB can add the information as extensions to the Call and pass this via gRPC to runners. Runners then pick these extensions from gRPC call and pass it to driver. Using a custom driver implementation, pure-runners can process these extensions to modify docker.CreateContainerOptions. To achieve this, LB agents can now be configured using a call overrider. Pure-runners can be configured using a custom docker driver. RunnerCall and Call interfaces both expose call extensions. An example to demonstrate this is implemented in test/fn-system-tests/system_test.go which registers a call overrider for LB agent as well as a simple custom docker driver. In this example, LB agent adds a key-value to extensions and runners add this key-value as an environment variable to the container.	2018-06-18 14:42:28 -07:00
Tolga Ceylan	f24172aa9d	fn: introducing lb placer basic metrics (#1058) * fn: introducing lb placer basic metrics This change adds basic metrics to naive and consistent hash LB placers. The stats show how many times we scanned the full runner list, if runner pool failed to return a runner list or if runner pool returned an empty list. Placed and not placed status are also tracked along with if TryExec returned an error or not. Most common error code, Too-Busy is specifically tracked. If client cancels/times out, this is also tracked as a client cancel metric. For placer latency, we would like to know how much time the placer spent on searching for a runner until it successfully places a call. This includes round-trip times for NACK responses from the runners until a successful TryExec() call. By excluding last successful TryExec() latency, we try to exclude function execution & runner container startup time from this metric in an attempt to isolate Placer only latency. * fn: latency and attempt tracker Removing full scan metric. Tracking number of runners attempted is a better metric for this purpose. Also, if rp.Runners() fails, this is an unrecoverable error and we should bail out instead of retrying. * fn: typo fix, ch placer finalize err return * fn: enable LB placer metrics in WithAgentFromEnv if prometheus is enabled	2018-06-12 13:36:05 -07:00
Owen Cliffe	c6abc8bf64	Use context logging more to ensure context vars are present in log lines (#1039)	2018-06-06 15:14:29 +01:00
Tolga Ceylan	1cd5894f41	fn: LB agent: reduce 'Too Busy' error logs (#1033) With this PR, runner client translates too busy errors from gRPC session and runner itself into Fn error type. Placers now ignore this error message to reduce unnecessary logging.	2018-06-04 12:16:00 -07:00
Tolga Ceylan	a57907eed0	fn: user friendly timeout handling changes (#1021) * fn: user friendly timeout handling changes Timeout setting in routes now means "maximum amount of time a function can run in a container". Total wait time for a given http request is now expected to be handled by the client. As long as the client waits, the LB, runner or agents will search for resources to schedule it.	2018-06-01 13:18:13 -07:00
Tolga Ceylan	e1b7e30e49	fn: cleanup of unused/global constants in lb agent (#1020) Moved retry interval as placer member variable for the time being.	2018-05-31 13:04:06 -07:00
Tolga Ceylan	74a5379dec	fn: lb & pure-runner slot hash id communication (#1007) * fn: lb & pure-runner slot hash id communication With this change, LB can pre-calculate the slot hash key and pass it to runners. If LB knows/calculates the slot hash ids, then it can also make better estimates on which runner can successfully execute it especially when status messages from runner include a small summary of idle slots for a given slot hash id. (TODO) * fn: fix mock test	2018-05-25 14:12:48 -07:00
Tolga Ceylan	4ccde8897e	fn: lb and pure-runner with non-blocking agent (#989) * fn: lb and pure-runner with non-blocking agent ) Removed pure-runner capacity tracking code. This did not play well with internal agent resource tracker. ) In LB and runner gRPC comm, removed ACK. Now, upon TryCall, pure-runner quickly proceeds to call Submit. This is good since at this stage pure-runner already has all relevant data to initiate the call. ) Unless pure-runner emits a NACK, LB immediately streams http body to runners. ) For retriable requests added a CachedReader for http.Request Body. ) Idempotency/retry is similar to previous code. After initial success in Engagement, after attempting a TryCall, unless we receive NACK, we cannot retry that call. ) ch and naive placers now wrap each TryExec with a cancellable context to clean up gRPC contexts quicker. * fn: err for simpler one-time read GetBody approach This allows for a more flexible approach since we let users define GetBody() to allow repetitive http body read. In default LB case, LB executes a one-time io.ReadAll and sets GetBody, which is detected by RunnerCall.RequestBody(). * fn: additional check for non-nil req.body * fn: attempt to override IO errors with ctx for TryExec * fn: system-tests log dest * fn: LB: EOF send handling * fn: logging for partial IO * fn: use buffer pool for IO storage in lb agent * fn: pure runner should use chunks for data msgs * fn: required config validations and pass APIErrors * fn: additional tests and gRPC proto simplification ) remove ACK/NACK messages as Finish message type works OK for this purpose. ) return resp in api tests to check for status code ) empty body json test in api tests for lb & pure-runner fn: buffer adjustments ) setRequestBody result handling correction ) switch to bytes.Reader for read-only safety ) io.EOF can be returned for non-nil Body in request. fn: clarify detection of 503 / Server Too Busy	2018-05-17 12:09:03 -07:00
Tolga Ceylan	f0f9a6d945	fn: LB ch and naive fixes (#942) * fn: LB ch and naive fixes ) Naive is now a naive RR algorithm. ) Both now check for ctx/timeout in each attempt. * fn: test fix	2018-05-07 11:50:16 -07:00
Tolga Ceylan	c0ee3ce736	fn: locked mutex while blocked on I/O considered harmful (#935) * fn: mutex while waiting I/O considered harmful ) Removed cases of holding a mutex while waiting on I/O; these included possible disk I/O and network I/O. ) Error/Context Close/Shutdown semantics changed since the context timeout and comments were misleading. Close always waits for pending gRPC session to complete. Context usage here was merely 'wait up to x secs to report an error' which only logs the error anyway. Instead, the runner can log the error. And context still can be passed around perhaps for future opencensus instrumentation.	2018-04-13 11:23:29 -07:00
Tolga Ceylan	e47d55056a	fn: reduce lbagent and agent dependency (#938) * fn: reduce lbagent and agent dependency lbagent and agent code is too dependent. This causes any change in agent to break lbagent. In reality, for LB there should be no delegated agent. Splitting these two will cause some code duplication, but it reduces dependency and complexity (e.g. agent without docker) * fn: post rebase fixup * fn: runner/runnercall should use lbDeadline * fn: fixup ln agent test * fn: remove agent create option for common.WaitGroup	2018-04-12 15:51:58 -07:00
Tolga Ceylan	e7658db822	Move ch ring placement back from old FnLB. (#930) * fn: bring back CH ring placer into FN repo based on original FnLB * fn: move placement code into runnerpool directory	2018-04-10 17:26:24 -07:00
jan grant	88074a42c0	Bugfix/grpc consume eof (#912) * GRPC streams end with an EOF The client should ensure that the final packet is followed by a GRPC EOF. This has the benefit of permitting the client code to clean up resources. * Don't require an entire HTTP request in RunnerCall TryExec needs a handle on an incoming ReadCloser containing the body of a request; however, everything else will already have been extracted from the HTTP request in the case of lbAgent use. (The point of this change is to simplify the interface for other uses.) * Return error from GRPC layer explicitly As per review	2018-04-03 15:04:21 +01:00
Gerardo Viedma	348bbaf36b	support runner TLS certificates with specified certificate Common Names (#900) * support runner TLS certificates with specified certificate Common Names * removes duplicate constant * run in insecure mode by default but expose ability to create tls-secured runner pools programmatically * fixes runner tests to use new tls interfaces	2018-03-28 13:57:15 +01:00
Gerardo Viedma	1cae6f988e	Make PKI data and RunnerFactory public objects (#865) * Make PKI data and RunnerFactory public objects * removes unnecessary nullRunner object * renames secure factory to point out MTLS	2018-03-16 15:40:58 +00:00
Gerardo Viedma	73ae77614c	Moves out node pool manager behind an extension using runner pool abstraction (Part 2) (#862) * Move out node-pool manager and replace it with RunnerPool extension * adds extension points for runner pools in load-balanced mode * adds error to return values in RunnerPool and Runner interfaces * Implements runner pool contract with context-aware shutdown * fixes issue with range * fixes tests to use runner abstraction * adds empty test file as a workaround for build requiring go source files in top-level package * removes flappy timeout test * update docs to reflect runner pool setup * refactors system tests to use runner abstraction * removes poolmanager * moves runner interfaces from models to api/runnerpool package * Adds a second runner to pool docs example * explicitly check for request spillover to second runner in test * moves runner pool package name for system tests * renames runner pool pointer variable for consistency * pass model json to runner * automatically cast to http.ResponseWriter in load-balanced call case * allow overriding of server RunnerPool via a programmatic ServerOption * fixes return type of ResponseWriter in test * move Placer interface to runnerpool package * moves hash-based placer out of open source project * removes siphash from Gopkg.lock	2018-03-16 13:46:21 +00:00

32 Commits