fn-serverless

mirror of https://github.com/fnproject/fn.git synced 2022-10-28 21:29:17 +03:00

Author	SHA1	Message	Date
Reed Allman	3a2550d042	change slots when annotations change (#914 ) when the annotations change, we need to get a different slot key to launch new containers with these annotations and let the old containers die off unused. I started on a test for this, but changing each field in isolation to cover every combination is not very fun without reflection, and we would still only be covering the subset of fields we're managing, so it would leave us in about the same spot as we are now.	2018-04-03 11:21:56 -07:00
jan grant	9633cf022b	Bugfix: unsafeBytes slices were getting GCed (#913 ) There are alternative formulations of this, for instance see https://www.reddit.com/r/golang/comments/5zctpf/unsafe_conversion_between_strings_and_byte_slices/ The problem manifested in the returned values from unsafeBytes occasionally being broken. It's possible that by keeping a reference to the `a` parameter alive, the original code would still work - however, this definitely seems like a fix. (A cast to `[]byte(a)` looks increasingly attractive, for all that it'll perform small allocations and copies.)	2018-04-03 16:53:02 +01:00
jan grant	88074a42c0	Bugfix/grpc consume eof (#912 ) * GRPC streams end with an EOF. The client should ensure that the final packet is followed by a GRPC EOF. This has the benefit of permitting the client code to clean up resources. * Don't require an entire HTTP request in RunnerCall. TryExec needs a handle on an incoming ReadCloser containing the body of a request; however, everything else will already have been extracted from the HTTP request in the case of lbAgent use. (The point of this change is to simplify the interface for other uses.) * Return error from GRPC layer explicitly. As per review.	2018-04-03 15:04:21 +01:00
Andrea Rosa	72a2eb933f	Returning Agent on exported func for pureRunner (#905 ) pureRunner is an unexported struct, and it was used as the return type of a few exported methods. In this change we return Agent, the interface implemented by pureRunner, to avoid leaking an unexported type.	2018-03-30 09:15:55 -07:00
Tolga Ceylan	369f2ea17c	fn: experimental prefork tests should skip non Linux OSs (#904 )	2018-03-28 14:40:51 -07:00
Justin Ko	9cb883ca68	Godoc fixes (#898 ) Add some godoc comments for the api/agent package and some of its subpackages.	2018-03-28 10:16:40 -07:00
Gerardo Viedma	348bbaf36b	support runner TLS certificates with specified certificate Common Names (#900 ) * support runner TLS certificates with specified certificate Common Names * removes duplicate constant * run in insecure mode by default but expose ability to create tls-secured runner pools programmatically * fixes runner tests to use new tls interfaces	2018-03-28 13:57:15 +01:00
Reed Allman	8af605cf3d	update thrift, opencensus, others (#893 ) * update thrift, opencensus, others * stats: update to opencensus 0.6.0 view api	2018-03-26 15:43:49 -07:00
Denis Makogon	3c15ca6ea6	App ID (#641 ) * App ID * Clean-up * Use ID or name to reference apps * Can use app by name or ID * Get rid of AppName for routes API and model. Routes API is completely backwards-compatible; routes API accepts both app ID and name * Get rid of AppName from calls API and model * Fixing tests * Get rid of AppName from logs API and model * Restrict API to work with app names only * Addressing review comments * Fix for hybrid mode * Fix rebase problems * Addressing review comments * Addressing review comments pt.2 * Fixing test issue * Addressing review comments pt.3 * Updated docstring * Adjust UpdateApp SQL implementation to work with app IDs instead of names * Fixing tests * fmt after rebase * Make tests green again! * Use GetAppByID wherever it is necessary - adding new v2 endpoints to keep hybrid api/runner mode working - extract CallBase from Call object to expose that to a user (it doesn't include any app reference, as we do for all other API objects) * Get rid of GetAppByName * Adjusting server router setup * Make hybrid work again * Fix datastore tests * Fixing tests * Do not ignore app_id * Resolve issues after rebase * Updating test to make it work as it was * Tabula rasa for migrations * Adding calls API test - we need to ensure we give "App not found" for the missing app and missing call in the first place - making previous test work (request missing call for the existing app) * Make datastore tests work fine with correctly applied migrations * Make CallFunction middleware work again. Had to adjust its implementation to set app ID before proceeding * The biggest rebase ever made * Fix 8's migration * Fix tests * Fix hybrid client * Fix tests problem * Increment app ID migration version * Fixing TestAppUpdate * Fix rebase issues * Addressing review comments * Renew vendor * Updated swagger doc per recommendations	2018-03-26 11:19:36 -07:00
Tolga Ceylan	0addcb8911	fn: pre-fork pool for namespace/network speedup (#874 ) * fn: pre-fork pool experimental implementation	2018-03-23 16:35:35 -07:00
Gerardo Viedma	101236f7d8	Remove npm remnants (#882 ) * create an Annotation map of the right size to avoid resizing * removes all references to deprecated nodepool manager	2018-03-23 10:29:32 +00:00
Gerardo Viedma	0c47dbf26d	create an Annotation map of the right size to avoid resizing (#881 )	2018-03-23 10:29:07 +00:00
Dario Domizioli	8df8ed6360	Expose route and app models to RunnerCall for extensions (alternative 2) (#880 )	2018-03-22 20:07:39 +00:00
Dario Domizioli	27ffb561e8	Hide details of delegated agents for PR and LB, to disable docker for LB (#872 ) * Move delegated agent creation within NewLBAgent so we can hide the fact we disable docker * Move delegated agent creation within NewPureRunner for better encapsulation	2018-03-20 13:45:45 +00:00
Gerardo Viedma	1cae6f988e	Make PKI data and RunnerFactory public objects (#865 ) * Make PKI data and RunnerFactory public objects * removes unnecessary nullRunner object * renames secure factory to point out MTLS	2018-03-16 15:40:58 +00:00
Gerardo Viedma	73ae77614c	Moves out node pool manager behind an extension using runner pool abstraction (Part 2) (#862 ) * Move out node-pool manager and replace it with RunnerPool extension * adds extension points for runner pools in load-balanced mode * adds error to return values in RunnerPool and Runner interfaces * Implements runner pool contract with context-aware shutdown * fixes issue with range * fixes tests to use runner abstraction * adds empty test file as a workaround for build requiring go source files in top-level package * removes flappy timeout test * update docs to reflect runner pool setup * refactors system tests to use runner abstraction * removes poolmanager * moves runner interfaces from models to api/runnerpool package * Adds a second runner to pool docs example * explicitly check for request spillover to second runner in test * moves runner pool package name for system tests * renames runner pool pointer variable for consistency * pass model json to runner * automatically cast to http.ResponseWriter in load-balanced call case * allow overriding of server RunnerPool via a programmatic ServerOption * fixes return type of ResponseWriter in test * move Placer interface to runnerpool package * moves hash-based placer out of open source project * removes siphash from Gopkg.lock	2018-03-16 13:46:21 +00:00
Dario Domizioli	362e910d9d	Make dataplane system test behave deterministically (#849 ) Make dataplane system test deterministic by injecting capacity constraints	2018-03-16 11:50:44 +00:00
Tolga Ceylan	cb61a678d9	fn: add storage opt size support (#860 ) Added env FN_MAX_FS_SIZE_MB, which, if defined and non-zero, is passed to docker as the storage opt size. We do not currently validate whether docker supports this option, because that is difficult to do reliably: support depends not only on the storage driver and its backing filesystem, but also on the mount options used to mount that filesystem.	2018-03-14 15:47:34 -07:00
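A minimal sketch of turning FN_MAX_FS_SIZE_MB into a Docker storage option, assuming the value is a whole number of megabytes and that it ends up as a `size` entry, in the style of `docker run --storage-opt size=...`. The helper name and the `M`-suffix formatting are hypothetical, not fn's actual code.

```go
package main

import (
	"fmt"
	"strconv"
)

// storageOptsFromEnv builds a Docker-style storage-opt map from the value of
// FN_MAX_FS_SIZE_MB. An unset or zero value disables the option. The "size"
// key mirrors `--storage-opt size=...`; the "M" suffix is an assumption.
func storageOptsFromEnv(getenv func(string) string) (map[string]string, error) {
	raw := getenv("FN_MAX_FS_SIZE_MB")
	if raw == "" {
		return nil, nil
	}
	mb, err := strconv.ParseUint(raw, 10, 64)
	if err != nil {
		return nil, fmt.Errorf("invalid FN_MAX_FS_SIZE_MB %q: %v", raw, err)
	}
	if mb == 0 {
		return nil, nil
	}
	// No check that the storage driver actually supports this option; per the
	// commit, that depends on the driver, its filesystem and mount options.
	return map[string]string{"size": fmt.Sprintf("%dM", mb)}, nil
}
```

A real agent would pass `os.Getenv`; injecting the lookup function just keeps the sketch testable.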
Tolga Ceylan	74a51f3f88	fn: reorg agent config (#853 ) * fn: reorg agent config ) Moving constants in agent to agent config, which helps with testing, tuning. ) Added max total cpu & memory for testing & clamping max mem & cpu usage if needed. * fn: adjust PipeIO time * fn: for hot, cannot reliably test EndOfLogs in TestRouteRunnerExecution	2018-03-13 18:38:47 -07:00
Reed Allman	9eaf824398	add jaeger support, link hot container & req span (#840 ) * add jaeger support, link hot container & req span * adds jaeger support now with FN_JAEGER_URL, there's a simple tutorial in the operating/metrics.md file now and it's pretty easy to get up and running. * links a hot request span to a hot container span. when we change this to sample at a lower ratio we'll need to finagle the hot container span to always sample or something, otherwise we'll hide that info. at least, since we're sampling at 100% for now if this is flipped on, can see freeze/unfreeze etc. if they hit. this is useful for debugging. note that zipkin's exporter does not follow the link at all, hence jaeger... and they're backed by the Cloud Empire now (CNCF) so we'll probably use it anyway. * vendor: add thrift for jaeger	2018-03-13 15:57:12 -07:00
Dario Domizioli	2c8b02c845	Make PureRunner an Agent so that it encapsulates its grpc server (#834 ) * Refactor PureRunner as an Agent so that it encapsulates its grpc server * Maintain a list of extra contexts for the server to select on to handle errors and cancellations	2018-03-13 15:51:32 +00:00
Tolga Ceylan	e80a06937b	fn: timeouts and container exits should stop slot queuing (#843 ) 1) in theory it may be possible for an exited container to requeue a slot; close this gap by always setting a fatal error for a slot if a container has exited. 2) when a client request times out or is cancelled (client disconnect, etc.) the slot should not be allowed to be requeued, and the container should terminate to avoid accidentally mixing a previous response into the next.	2018-03-12 11:18:55 -07:00
Andrea Rosa	3261e48843	Add a timeout to the net dialer (#844 ) This change adds the option to set a timeout for the dialer used to make gRPC connections. With that, we remove the check on the state of the connections and therefore any potential race conditions.	2018-03-12 13:36:53 +00:00
Andrea Rosa	43547a572f	Check runner connection before sending requests (#831 ) If a runner disconnects ungracefully, the connection can get stuck in connecting mode. This change verifies the state of the connection before starting to execute a call; if the client connection is not ready, we fail fast to give the next runner (if any) a chance to execute the call.	2018-03-12 09:38:27 +00:00
Dario Domizioli	9b28497cff	Add a basic concurrency test for the dataplane system tests (#832 ) Add a basic concurrency test for the dataplane system tests. Also remove some spurious logging.	2018-03-10 00:51:02 +00:00
Tolga Ceylan	afeb8e6f6a	fn: json excess data check should ignore whitespace (#830 ) * fn: json excess data check should ignore whitespace * fn: adjustments and test case	2018-03-09 11:59:30 -08:00
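A minimal sketch of a whitespace-tolerant excess-data check, assuming it only inspects what `json.Decoder` has already read ahead (`dec.Buffered()`): a hot container's stdout stays open between calls, so reading further would block. `isJSONSpace`, `hasExcessData` and `checkResponse` are hypothetical names, not fn's code.

```go
package main

import (
	"encoding/json"
	"io"
	"strings"
)

// isJSONSpace reports whether b is one of the four insignificant-whitespace
// bytes allowed between JSON tokens (RFC 8259, section 2).
func isJSONSpace(b byte) bool {
	return b == ' ' || b == '\t' || b == '\n' || b == '\r'
}

// hasExcessData reports whether the decoder's read-ahead buffer still holds
// anything other than whitespace after a successful Decode.
func hasExcessData(dec *json.Decoder) (bool, error) {
	rest, err := io.ReadAll(dec.Buffered())
	if err != nil {
		return false, err
	}
	for _, b := range rest {
		if !isJSONSpace(b) {
			return true, nil
		}
	}
	return false, nil
}

// checkResponse decodes one JSON object from s and reports whether
// non-whitespace bytes trail it. Illustrative wrapper for the example only.
func checkResponse(s string) (bool, error) {
	dec := json.NewDecoder(strings.NewReader(s))
	var v map[string]interface{}
	if err := dec.Decode(&v); err != nil {
		return false, err
	}
	return hasExcessData(dec)
}
```

A trailing newline from the function's writer is then harmless, while a stray `println` after the JSON still flags the container as inconsistent.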
Tolga Ceylan	7177bf3923	fn: enable failing test back (#826 ) * fn: enable failing test back * fn: fortifying the stderr output Modified limitWriter to discard excess data instead of returning error, this is to allow stderr/stdout pipes flowing to avoid head-of-line blocking or data corruption in container stdout/stderr output stream.	2018-03-09 09:57:28 -08:00
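A minimal sketch of a limit writer that discards excess data instead of returning an error, as the commit describes. The type name and fields are hypothetical; the key property is that `Write` always reports `len(p)`, so the pipe feeding it keeps draining.

```go
package main

import (
	"bytes"
	"io"
)

// discardingLimitWriter forwards at most limit bytes to w and silently drops
// anything beyond that, so the producer (e.g. a container's stderr pipe)
// keeps flowing instead of stalling on an error. Hypothetical type.
type discardingLimitWriter struct {
	w     io.Writer
	limit int64
	n     int64 // bytes forwarded to w so far
}

func (l *discardingLimitWriter) Write(p []byte) (int, error) {
	if remaining := l.limit - l.n; remaining > 0 {
		chunk := p
		if int64(len(chunk)) > remaining {
			chunk = chunk[:remaining]
		}
		written, err := l.w.Write(chunk)
		l.n += int64(written)
		if err != nil {
			// Errors from the underlying writer are still surfaced.
			return written, err
		}
	}
	// Claim the whole slice as consumed; the excess is discarded.
	return len(p), nil
}

// capture writes each chunk through a limit writer into a buffer and returns
// what was kept plus the total byte count the writer reported.
func capture(limit int64, chunks ...string) (string, int) {
	var buf bytes.Buffer
	lw := &discardingLimitWriter{w: &buf, limit: limit}
	total := 0
	for _, c := range chunks {
		n, _ := lw.Write([]byte(c))
		total += n
	}
	return buf.String(), total
}
```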
Tolga Ceylan	f85294b0fc	fn: log agent cfg with field names (#829 )	2018-03-09 09:53:16 -08:00
Tolga Ceylan	0ef0118150	fn: wait for async attach with success channel (#810 ) * fn: wait for async attach with success channel * fn: debug logs in test.sh * fn: circleci test output as artifact * fn: docker attach non-blocking adjustments * fn: remove retry from risky NB attach	2018-03-08 15:46:32 -08:00
Gerardo Viedma	8af57da7b2	Support load-balanced runner groups for multitenant compute isolation (#814 ) * Initial stab at the protocol * initial protocol sketch for node pool manager * Added http header frame as a message * Force the use of WithAgent variants when creating a server * adds grpc models for node pool manager plus go deps * Naming things is really hard * Merge (and optionally purge) details received by the NPM * WIP: starting to add the runner-side functionality of the new data plane * WIP: Basic startup of grpc server for pure runner. Needs proper certs. * Go fmt * Initial agent for LB nodes. * Agent implementation for LB nodes. * Pass keys and certs to LB node agent. * Remove accidentally left reference to env var. * Add env variables for certificate files * stub out the capacity and group membership server channels * implement server-side runner manager service * removes unused variable * fixes build error * splits up GetCall and GetLBGroupId * Change LB node agent to use TLS connection. * Encode call model as JSON to send to runner node. * Use hybrid client in LB node agent. This should provide access to get app and route information for the call from an API node. * More error handling on the pure runner side * Tentative fix for GetCall problem: set deadlines correctly when reserving slot * Connect loop for LB agent to runner nodes. * Extract runner connection function in LB agent. * drops committed capacity counts * Bugfix - end state tracker only in submit * Do logs properly * adds first pass of tracking capacity metrics in agent * makes memory capacity metric uint64 * makes memory capacity metric uint64 * removes use of old capacity field * adds remove capacity call * merges overwritten reconnect logic * First pass of an NPM. Provide a service that talks to a (simulated) CP. 
- Receive incoming capacity assertions from LBs for LBGs - expire LB requests after a short period - ask the CP to add runners to an LBG - note runner set changes and readvertise - scale down by marking runners as "draining" - shut off draining runners after some cool-down period * add capacity update on schedule * Send periodic capacity metrics. Sending capacity metrics to node pool manager * splits grpc and api interfaces for capacity manager * failure to advertise capacity shouldn't panic * Add some instructions for starting DP/CP parts. * Create the poolmanager server with TLS * Use logrus * Get npm compiling with cert fixups. * Fix: pure runner should not start async processing * brings runner, nulb and npm together * Add field to acknowledgment to record slot allocation latency; fix a bug too * iterating on pool manager locking issue * raises timeout of placement retry loop * Fix up NPM. Improve logging. Ensure that channels etc. are actually initialised in the structure creation! * Update the docs - runners GRPC port is 9120 * Bugfix: return runner pool accurately. * Double locking * Note purges as LBs stop talking to us * Get the purging of old LBs working. * Tweak: on restart, load runner set before making scaling decisions. * more agent synchronization improvements * Deal with the CP pulling out active hosts from under us. * lock at lbgroup level * Send request and receive response from runner. * Add capacity check right before slot reservation * Pass the full Call into the receive loop. 
* Wait for the data from the runner before finishing * force runner list refresh every time * Don't init db and mq for pure runners * adds shutdown of npm * fixes broken log line * Extract an interface for the Predictor used by the NPM * purge drained connections from npm * Refactor of the LB agent into the agent package * removes capacitytest wip * Fix undefined err issue * updating README for poolmanager set up * uses retrying dial for lb to npm connections * Rename lb_calls to lb_agent now that all functionality is there * Use the right deadline and errors in LBAgent * Make stream error flag per-call rather than global; otherwise the whole runner is damaged by one call dropping * abstracting gRPCNodePool * Make stream error flag per-call rather than global; otherwise the whole runner is damaged by one call dropping * Add some init checks for LB and pure runner nodes * adding some useful debug * Fix default db and mq for lb node * removes unreachable code, fixes typo * Use datastore as logstore in API nodes. This fixes a bug caused by trying to insert logs into a nil logstore. It was nil because it wasn't being set for API nodes. * creates placement abstraction and moves capacity APIs to NodePool * removed TODO, added logging * Dial reconnections for LB <-> runners. LB grpc connections to runners are established using a backoff strategy in the event of reconnections; this keeps the LB up even if one of the runners goes away, and reconnects to that runner as soon as it is back. * Add a status call to the Runner protocol. Stub at the moment. To be used for things like draindown, health checks. * Remove comment. * makes assign/release capacity lockless * Fix hanging issue in lb agent when connections drop * Add the CH hash from fnlb. Select this with FN_PLACER=ch when launching the LB. 
* small improvement for locking on reloadLBGmembership * Stabilise the list of Runners returned by NodePool. The NodePoolManager makes some attempt to keep the list of runner nodes advertised as stable as possible. Let's preserve this effort on the client side. The main point of this is to attempt to keep the same runner at the same index in the []Runner returned by NodePool.Runners(lbgid); the ch algorithm likes it when this is the case. * Factor out a generator function for the Runners so that mocks can be injected * temporarily allow lbgroup to be specified in HTTP header, while we sort out changes to the model * fixes bug with nil runners * Initial work for mocking things in tests * fix for anonymous goroutine error * fixing lb_test to compile * Refactor: internal objects for gRPCNodePool are now injectable, with defaults for the real-world case * Make GRPC port configurable, fix weird handling of web port too * unit test reload Members * check on runner creation failure * adding nullRunner in case of failure during runner creation * Refactored capacity advertisements/aggregations. Made grpc advertisement post asynchronous and non-blocking. * make capacityEntry private * Change the runner gRPC bind address. This uses the existing `whoAmI` function, so that the gRPC server works when the runner is running on a different host. * Add support for multiple fixed runners to pool mgr * Added harness for dataplane system tests, minor refactors * Add Dockerfiles for components, along with docs. * Doc fix: second runner needs a different name. 
* Let us have three runners in system tests, why not * The first system test running a function in API/LB/PureRunner mode * Add unit test for Advertiser logic * Fix issue with Pure Runner not sending the last data frame * use config in models.Call as a temporary mechanism to override lb group ID * make gofmt happy * Updates documentation for how to configure lb groups for an app/route * small refactor unit test * Factor NodePool into its own package * Lots of fixes to Pure Runner: concurrency woes with errors and cancellations * New dataplane with static runnerpool (#813). Added static node pool as default implementation * moved nullRunner to grpc package * remove duplication in README * fix go vet issues * Fix server initialisation in api tests * Tiny logging changes in pool manager. Using `WithError` instead of `Errorf` when appropriate. * Change some log levels in the pure runner * fixing readme * moves multitenant compute documentation * adds introduction to multitenant readme * Proper triggering of system tests in makefile * Fix instructions about starting up the components * Change db file for system tests to avoid contention in parallel tests * fixes revisions from merge * Fix merge issue with handling of reserved slot * renaming nulb to lb in the doc and images folder * better TryExec sleep logic, clean shutdown. In this change we implement a better way to deal with the sleep inside the for loop during the attempt to place a call. Plus we added a clean way to shut down the connections with external components when we shut down the server. * System_test mysql port. Set mysql port for system test to a different value from the one set for the api tests to avoid conflicts, as they can run in parallel. 
* change the container name for system-test * removes flaky test TestRouteRunnerExecution pending resolution by issue #796 * amend remove_containers to remove newly added containers * Rework capacity reservation logic at a higher level for now * LB agent implements Submit rather than delegating. * Fix go vet linting errors * Changed a couple of error levels * Fix formatting * removes commented-out test * adds snappy to vendor directory * updates Gopkg and vendor directories, removing snappy and adding siphash * wait for db containers to come up before starting the tests * make system tests start API node on 8085 to avoid port conflict with api_tests * avoid port conflicts with api_test.sh which are run in parallel * fixes postgres port conflict and issue with removal of old containers * Remove spurious println	2018-03-08 14:45:19 -08:00
Tolga Ceylan	7677aad450	fn: I/O related improvements (#809 ) ) I/O protocol parse issues should shut down the container, as the container is left in an inconsistent state between calls (e.g. the next call may receive the previous call's leftovers). ) Move ghost read/write code into io_utils in common. ) Clean unused error from docker Wait() ) We can catch one case in JSON: if there is remaining unparsed data in the decoder buffer, we can shut down the container. ) stdout/stderr when the container is not handling a request are now blocked if freezer is also enabled. ) if a fatal err is set for a slot, we do not requeue it and proceed to shut down *) added a test function for a few cases with freezer strict behavior	2018-03-07 15:09:24 -08:00
Reed Allman	206aa3c203	opentracing -> opencensus (#802 ) * update vendor directory, add go.opencensus.io * update imports * oops * s/opentracing/opencensus/ & remove prometheus / zipkin stuff & remove old stats * the dep train rides again * fix gin build * deps from last guy * start in on the agent metrics * she builds * remove tags for now, cardinality error is fussing. subscribe instead of register * update to patched version of opencensus to proceed for now TODO switch to a release * meh fix imports * println debug the bad boys * lace it with the tags * update deps again * fix all inconsistent cardinality errors * add our own logger * fix init * fix oom measure * remove bugged removal code * fix s3 measures * fix prom handler nil	2018-03-05 09:35:28 -08:00
Tolga Ceylan	89a1fc7c72	Response size clamp (#786 ) ) Limit response http body or json response size to FN_MAX_RESPONSE_SIZE (default unlimited) ) If the limit is exceeded, a 502 is returned with 'body too large' in the error message	2018-03-01 17:14:50 -08:00
Reed Allman	997c7fce89	fix undefined string slot key (#806 ) while escape analysis didn't lie that the bytes underlying this string escaped to the heap, the reference to them died and led to us getting an undefined byte array underlying the string. sadly, this makes 4 allocs here (still down from 31), but only adds 100ns per op. I still don't get why 'buf' and 'byts' escape to the heap, blaming faulty escape analysis code. this one is kind of impossible to write a test for. found this from doing benchmarking stuff and was getting weird behavior at the end of runs where calls didn't find a slot, ran bisect on a known-good commit from a couple weeks ago and found that it was this. voila. this could explain the variance from the slack dude's benchmarks, too. anyway, confirmed that this fixes the issue.	2018-02-28 18:35:07 -08:00
Tolga Ceylan	a83f2cfbe8	fn: favor fn-test-utils over hello (to be decommissioned) (#761 )	2018-02-28 17:44:13 -08:00
Tolga Ceylan	320b766a6d	fn: introduce agent config and minor ghostreader tweak (#797 ) * fn: introduce agent config and minor ghostreader tweak TODO: move all constants/tweaks in agent to agent config. * fn: json convention	2018-02-27 12:17:13 -08:00
Tolga Ceylan	46fad7ef80	fn: plumb up I/O errors from docker wait (#798 ) Reed Allman <rdallman10@gmail.com>'s I/O error fix.	2018-02-27 12:17:02 -08:00
Reed Allman	a56d204450	fix up response headers (#788 ) * fix up response headers * stops defaulting to application/json. this was something awful, go stdlib has a func to detect content type. sadly, it doesn't contain json, but we can do a pretty good job by checking for an opening '{'... there are other fish in the sea, and now we handle them nicely instead of saying it's a json [when it's not]. a test confirms this, there should be no breakage for any routes returning a json blob that were relying on us defaulting to this format (granted that they start with a '{'). * buffers output now to a buffer for all protocol types (default is no longer left out in the cold). use a little response writer so that we can still let users write headers from their functions. this is useful for content type detection instead of having to do it in multiple places. * plumbs the little content type bit into fn-test-util just so we can test it, we don't want to put this in the fdk since it's redundant. I am totally in favor of getting rid of content type from the top level json blurb. it's redundant, at best, and can have confusing behaviors if a user uses both the headers and the content_type field (we override with the latter, now). it's client protocol specific to http to a certain degree, other protocols may use this concept but have their own way to set it (like http does in headers..). I realize that it mostly exists because it's somewhat gross to have to index a list from the headers in certain languages more than others, but with the ^ behavior, is it really worth it? closes #782 * reset idle timeouts back * move json prefix to stack / next to use	2018-02-27 10:30:33 -08:00
Tolga Ceylan	8b65ae8f9a	fn: add docker command info to retry when logging errors (#795 )	2018-02-27 01:10:07 -08:00
Travis Reeder	575e1d3d0c	Removes "type" from json format. Was pointless. (#783 )	2018-02-20 12:04:08 -08:00
Reed Allman	c0df9496a7	reduce allocs in getSlotQueueKey (#778 ) this somewhat minimally comes up in profiling, but it was an itch i needed to scratch. this does 10x fewer allocations and is 3x faster (with 3x fewer bytes), and they're the small painful kind of allocation. we're only reading these strings so the uses of unsafe are fine (I think; audit me). the byte array we're casting to a string at the end is also heap allocated and does escape. I only count 2 allocations, but there are 3 (`hash.Sum` and `make([]string)`); using a pool of sha1 hash.Hash shaves 120 bytes and an alloc off, so it seems worth it (it's minimal). if we set a max size of config vals with a constant we could avoid that allocation, and we could probably find a checksum package that doesn't use the `hash.Hash` that would speed things up a little (no dynamic dispatch, doesn't allocate in Sum), but there's not one I know of in stdlib. master:
```
✗: go test -run=yodawg -bench . -benchmem -benchtime 1s -cpuprofile cpu.out
goos: linux
goarch: amd64
pkg: github.com/fnproject/fn/api/agent
BenchmarkSlotKey 200000 6068 ns/op 696 B/op 31 allocs/op
PASS
ok github.com/fnproject/fn/api/agent 1.454s
```
now:
```
✗: go test -run=yodawg -bench . -benchmem -benchtime 1s -cpuprofile cpu.out
goos: linux
goarch: amd64
pkg: github.com/fnproject/fn/api/agent
BenchmarkSlotKey 1000000 1901 ns/op 168 B/op 3 allocs/op
PASS
ok github.com/fnproject/fn/api/agent 2.092s
```
once we have versioned apps/routes we don't need to build a sha or sort configs so this will get a lot faster. anyway, mostly funsies here... my life is that sad now.	2018-02-16 11:39:10 -08:00
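A sketch of the pooled-hash idea, assuming a key built from a few illustrative fields (`app`, `path`, `image` and a config map). It reuses `hash.Hash` values through `sync.Pool` and sorts config keys so the result is deterministic, but it hex-encodes the digest instead of casting bytes with `unsafe`, which avoids the class of bug fixed later in #806. Names and the field list are hypothetical; fn's real key may cover other fields (annotations, for example, per #914).

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"hash"
	"sort"
	"sync"
)

// shaPool reuses sha1 hashers across calls, trimming one allocation per key.
var shaPool = sync.Pool{New: func() interface{} { return sha1.New() }}

// slotKey derives a stable key from the fields that decide which hot
// containers a call may reuse. Config keys are sorted first because Go map
// iteration order is random. Hypothetical field list.
func slotKey(app, path, image string, config map[string]string) string {
	h := shaPool.Get().(hash.Hash)
	h.Reset()
	defer shaPool.Put(h)

	// A zero byte after each field keeps ("ab","c") and ("a","bc") distinct.
	write := func(s string) {
		h.Write([]byte(s))
		h.Write([]byte{0})
	}
	write(app)
	write(path)
	write(image)

	keys := make([]string, 0, len(config))
	for k := range config {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		write(k)
		write(config[k])
	}

	// Sum appends into a fixed-size array; hex encoding then yields an
	// ordinary string rather than an unsafe view over those bytes.
	var sum [sha1.Size]byte
	return hex.EncodeToString(h.Sum(sum[:0]))
}
```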
Tolga Ceylan	af1ea0fa95	fn: ui no longer uses /stats (#776 ) Decommission /stats related code.	2018-02-15 16:05:59 -08:00
Reed Allman	04ae223a5d	fixup json,http protocols (#772 ) * http now buffers the entire response body from the container before copying it to the response writer (and sets content length). this is a level of sad i don't feel comfortable talking about but it is what it is. * json protocol was buffering the entire body so there wasn't any reason for us to try to write this directly to the container stdin manually; we needed to add a bufio.Writer around it anyway, since it was making too many write(fd) syscalls with the way it was. this is just easier overall and has the same performance as http now in my tests, whereas previously this was 50% slower [than http]. * add buffer pool for http & json to share/use. json doesn't create a new buffer every stinkin request. we need to plumb down content length so that we can properly size the buffer for json, have to add header size and everything together but it's probably faster than malloc(); punting on properly sizing. * json now sets content length to the length of the body from the returned json blurb from the container. this does not handle imposing a maximum size of the response returned from a container, which we need to add, but this has been open for some time (specifically, on json). we can impose this by wrapping the pipes, but there's some discussion to be had for json specifically: we won't be able to just cut off the output stream and use that (http we can do this). anyway, filing a ticket... closes #326 :(((((((	2018-02-14 14:06:36 -08:00
Reed Allman	9cbe4ea536	add pprof endpoints, additional spans (#770 ) i would split this commit in two if i were a good dev. the pprof stuff is really useful and this only samples when called. this is pretty standard go service stuff. expvar is cool, too. the additional spannos have turned up some interesting tidbits... gonna slide em in	2018-02-13 20:01:41 -08:00
Reed Allman	1a1250e5ea	disable fail whale logs (#768 ) we have been getting these from attach all this time and never needed these anyway. I ran cpu profiles of dockerd and this was 90% of docker cpu usage (json logs). woot. this will reduce i/o quite a bit, and we don't have to worry about them taking up any disk space either. from tests i get about 50% speedup with these off. the hunt continues...	2018-02-13 17:45:11 -08:00
Reed Allman	f287ad274e	support deeper / nesting of image names (#765 ) closes #764	2018-02-13 11:26:28 -08:00
Reed Allman	cbfd659e7e	cap docker retries to fixed number (#762 ) previously we would retry infinitely up to the context with some backoff in between. for hot functions, since we don't set any deadline on pulling or creating the image, this means it would retry forever without making any progress if e.g. the registry is inaccessible, or on any other "temporary" error that isn't actually temporary. this adds a hard cap of 10 retries, which gives approximately 13s if the ops take no time, still respecting the enclosing context deadline. the case where this was coming up is now tested for and was otherwise confusing for users to debug; now it spits out an ECONNREFUSED with the address of the registry, which should help users debug without having to poke around fn logs (though I don't like this as an excuse, not all users will be operators at some point in the near future, and this one makes sense) closes #727	2018-02-12 18:45:30 -08:00
Reed Allman	97194b3d8b	return bad function http resp error (#728 ) * return bad function http resp error this was being thrown into the fn server logs but it's relatively easy to get this to crop up if a function user forgets that they left a `println` laying around that gets written to stdout, it garbles the http (or json, in its case) output and they just see 'internal server error'. for certain clients i could see that we really do want to keep this as 'internal server error' but for things like e.g. docker image not authorized we're showing that in the response, so this seems apt. json likely needs the same treatment, will file a bug. as always, my error messages are rarely helpful enough, help me please :) closes #355 * add formatting directive * fix up http error * output bad jasons to user closes #729 woo	2018-02-12 17:51:45 -08:00
Tolga Ceylan	567136cb5e	fn: required docker version fix (#759 )	2018-02-12 15:53:05 -08:00
Tolga Ceylan	c848fc6181	fn: hot container timer improvements (#751 ) * fn: hot container timer improvements. With this change, we now allocate the timers when the container starts and manage them via stop/clear as needed, which should be not only more efficient but also easier to follow. For example, previously, if the eject timeout was set to 10 secs, this could have delayed the idle timeout by up to 10 secs as well. It is also no longer necessary to do any math for elapsed time. Now consumers avoid any requeuing when startDequeuer() is cancelled. This was triggering additional dequeue/requeue, causing containers to wake up spuriously. Also, in startDequeuer(), we no longer remove the item from the actual queue and leave this to acquire/eject, which sidesteps issues related to an item landing in the channel and not being consumed, etc.	2018-02-12 14:12:03 -08:00


182 Commits