prior to this patch we were allowing 256MB for every function run, just
because that was the default for the docker driver and we were not using the
memory field on any given route configuration. this fixes that: docker
containers now get the correct memory limit passed in from the route. the
default is still 128MB.
there is also an env var now, `MEMORY_MB`, that is set on each function call;
see the linked issue below for rationale.
closes #186
ran the given function code from #186, and now i only see allocations up to
32MB before the function is killed. yay.
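for illustration, a minimal sketch of the idea, assuming the
fsouza/go-dockerclient driver and a route memory field in MB (the names and
route shape here are hypothetical, not the actual fn code):

```go
package drivers

import (
	"fmt"

	docker "github.com/fsouza/go-dockerclient"
)

const defaultMemoryMB = 128 // fall back when the route doesn't set one

// containerOpts maps a route's memory setting onto the docker container: a
// hard limit on the container itself, plus a MEMORY_MB env var so the
// function can see its own budget.
func containerOpts(image string, routeMemoryMB uint64) docker.CreateContainerOptions {
	if routeMemoryMB == 0 {
		routeMemoryMB = defaultMemoryMB
	}
	return docker.CreateContainerOptions{
		Config: &docker.Config{
			Image: image,
			Env:   []string{fmt.Sprintf("MEMORY_MB=%d", routeMemoryMB)},
		},
		HostConfig: &docker.HostConfig{
			Memory: int64(routeMemoryMB) * 1024 * 1024, // docker wants bytes
		},
	}
}
```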
notes:
there is no max for memory. for open source fn i'm not sure we want to
cap it, really. in the services repo we probably should add a cap before prod.
since we don't know any given fn server's ram, we can't verify that the
setting on any given route is something that can even be run.
remove envconfig & bytefmt
this updates the glide.yaml file to remove the unused deps, but a fresh
install is broken atm so i couldn't remove them from vendor/; going to fix
that separately (next update we just won't pull these in). also changed the
skip dir to the cli dir now that its name has changed (related to the
brokenness).
fix how ram slots were being allocated. integer division is significantly
slower than subtraction.
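for illustration, a minimal sketch of subtraction-based ram accounting under
a lock; the types and names are hypothetical, not the actual runner code:

```go
package runner

import "sync"

type ramPool struct {
	mu    sync.Mutex
	avail uint64 // MB left in the pool
}

// reserve tries to carve mb out of the pool: a single compare-and-subtract
// per request, instead of dividing total RAM into fixed-size slots up front.
func (p *ramPool) reserve(mb uint64) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if mb > p.avail {
		return false
	}
	p.avail -= mb
	return true
}

// release adds the reservation back once the function run finishes.
func (p *ramPool) release(mb uint64) {
	p.mu.Lock()
	p.avail += mb
	p.mu.Unlock()
}
```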
Each time the MQ became unreachable, HTTP GET /tasks returned HTTP 500, and
the code was not handling this case; it only expected networking errors. It
then tried to unmarshal the empty response body, which caused another,
misleading error. This patch raises an error based on the HTTP response code,
explicitly checking whether the code is something unexpected (not HTTP 200 OK).
The response status code for /tasks was also changed from 202 Accepted to 200 OK,
per the swagger doc.
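A minimal sketch of the check, with a hypothetical task shape and task server
URL (not the actual client code):

```go
package runner

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type task struct {
	ID string `json:"id"`
}

// fetchTask fails fast on any non-200 response instead of trying to
// unmarshal an empty (or error) body, which is what produced the second,
// misleading error before this patch.
func fetchTask(taskSrvURL string) (*task, error) {
	resp, err := http.Get(taskSrvURL + "/tasks")
	if err != nil {
		return nil, err // networking errors were already handled
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK { // 200, not 202, per the swagger doc
		return nil, fmt.Errorf("GET /tasks: unexpected status %s", resp.Status)
	}
	var t task
	if err := json.NewDecoder(resp.Body).Decode(&t); err != nil {
		return nil, err
	}
	return &t, nil
}
```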
this works by having the functions server kick back a FXLB-WAIT header on
every request with the wait time for that function to start. the lb then
keeps, on a per node+function basis, an ewma of the last 10 requests' wait
times (to reduce jitter). now that we don't have max concurrency it's
actually pretty challenging to get the wait time to tick up. i expect in the
near future we will be throttling functions on a given node in order to
induce this, but that is for another day as that code needs a lot of
reworking. i tested this by introducing some arbitrary throttling (not
checked in) and load spreads over nodes correctly (see images). we will also
need to play with the intervals we want to use: if you have a func with a
50ms run time then basically 10 of those will rev up another node (this was
before removing max_c, with max_c=1), but in any event this wires in the
basic plumbing.
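for illustration, a minimal sketch of the per node+function ewma, with the
decay tuned to roughly a 10-sample horizon (the real lb bookkeeping and key
types differ):

```go
package lb

import (
	"sync"
	"time"
)

const alpha = 2.0 / 11.0 // ~10-sample horizon, smooths out jitter

type waitStats struct {
	mu   sync.Mutex
	ewma map[string]float64 // keyed by node + "\x00" + function path
}

func newWaitStats() *waitStats {
	return &waitStats{ewma: make(map[string]float64)}
}

// observe folds one wait time (reported via the FXLB-WAIT header) into the
// running average for that node+function pair.
func (s *waitStats) observe(node, fn string, wait time.Duration) {
	key := node + "\x00" + fn
	s.mu.Lock()
	defer s.mu.Unlock()
	if prev, ok := s.ewma[key]; ok {
		s.ewma[key] = alpha*float64(wait) + (1-alpha)*prev
	} else {
		s.ewma[key] = float64(wait) // seed with the first sample
	}
}
```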
* make docs great again. renamed lb dir to fnlb
* added wait time to dashboard
* wires in a ready channel to await the first image pull for hot functions so
it counts in the wait time (should be otherwise useful)
future:
TODO rework lb code api to be pluggable + wire in data store
TODO toss out first data point containing pull to not jump onto another node
immediately (maybe this is actually a good thing?)
this patch gets rid of max concurrency for functions altogether, as discussed,
since it will be challenging to support across function nodes. as a result of
doing so, the previous version of functions would fall over when offered 1000
functions, so there was some work needed in order to push this through.
further work is necessary as docker basically falls over when trying to start
enough containers at the same time, and with this patch essentially every
function can scale infinitely. it seems like we could add some kind of
adaptive restrictions based on task run length and configured wait time so
that fast running functions will line up to run in a hot container instead of
them all creating new hot containers.
this patch takes a first cut at whacking out some of the insanity that was
the previous concurrency model, which was problematic in that it limited
concurrency significantly across all functions: every task went through the
same unbuffered channel, which could create blocking issues for all functions
if the channel wasn't picked off fast enough (it's not apparent that this was
impossible in the previous implementation). in any event, each request has a
goroutine already; there's no reason not to use it. it's not too hard to wrap
a map in a lock, and it's not clear what the benefits of the old approach
were (added insanity?). in effect this is marginally easier to understand and
less insane (marginally). after getting rid of max c this adds a blocking
mechanism for the first invocation of any function, so that all other hot
functions wait on the first one to finish, to avoid a herd issue (it was
making docker die...) -- this could be slightly improved, but works in a
pinch. reduced some memory usage by removing redundant maps of htfnsvr's and
task.Requests (by a factor of 2!). cleaned up some of the protocol stuff;
need to clean this up further. anyway, it's a first cut. i have another patch
that rewrites all of it, but it was getting into rabbit hole territory; would
be happy to oblige if anybody else has problems understanding this rat's nest
of channels. there is a good bit of work left to make this prod ready
(regardless of removing max c).
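for illustration, a minimal sketch of the first-invocation gate described
above; later callers block until the first hot container for a function is up
instead of all launching their own (names are hypothetical, not the actual
code):

```go
package runner

import "sync"

type launchGate struct {
	mu    sync.Mutex
	first map[string]chan struct{} // function path -> "first launch done" signal
}

func newLaunchGate() *launchGate {
	return &launchGate{first: make(map[string]chan struct{})}
}

// enter returns launch=true for the very first caller of a function, along
// with a done func to unblock everyone else; later callers block until the
// first invocation signals done.
func (g *launchGate) enter(fn string) (launch bool, done func()) {
	g.mu.Lock()
	ch, ok := g.first[fn]
	if !ok {
		ch = make(chan struct{})
		g.first[fn] = ch
		g.mu.Unlock()
		return true, func() { close(ch) }
	}
	g.mu.Unlock()
	<-ch // wait for the first hot container to come up
	return false, func() {}
}
```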
a warning that this will break the db schemas, didn't put the effort in to add
migration stuff since this isn't deployed anywhere in prod...
TODO need to clean out the htfnmgr bucket with LRU
TODO need to clean up runner interface
TODO need to unify the task running paths across protocols
TODO need to move the ram checking stuff into worker for noted reasons
TODO need better elasticity of hot f(x) containers
* functions: modify datastore to accommodate hot containers support
* functions: protocol between functions and hot containers
* functions: add hot containers clockwork
* fn: add hot containers support
* functions: add bounded concurrency
* functions: plug runners to sync and async interfaces
* functions: update documentation about the new env var
* functions: fix test flakiness
* functions: the runner is self-regulated, no need to set a number of runners
* functions: push the execution to the background on incoming requests
* functions: ensure async tasks are always on
* functions: add prioritization to tasks consumption
Ensure that Sync tasks are consumed before Async tasks (see the sketch after
this list). Also, this fixes termination race problems for free.
* functions: remove stale comments
* functions: improve mem availability calculation
* functions: parallel run for async tasks
* functions: check for memory availability before pulling async task
* functions: comment about rnr.hasAvailableMemory and sync.Cond
* functions: implement memory check for async runners using Cond vars
* functions: code grooming
- remove unnecessary goroutines
- fix stale docs
- reorganize import group
* Revert "functions: implement memory check for async runners using Cond vars"
This reverts commit 922e64032201a177c03ce6a46240925e3d35430d.
* Revert "functions: comment about rnr.hasAvailableMemory and sync.Cond"
This reverts commit 49ad7d52d341f12da9603b1a1df9d145871f0e0a.
* functions: set a minimum memory availability for sync
* functions: simplify the implementation by removing the priority queue
* functions: code grooming
- code deduplication
- review waitgroups Waits
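A minimal sketch of the sync-before-async consumption mentioned above, using
a biased select over two hypothetical task channels (not the actual runner
code):

```go
package runner

type task struct{ id string }

func run(t task) { /* execute the task */ }

// consume drains syncCh with priority over asyncCh: the first select is
// non-blocking and only falls through when no sync task is pending.
func consume(syncCh, asyncCh <-chan task) {
	for {
		select {
		case t := <-syncCh:
			run(t)
			continue
		default:
		}
		select {
		case t := <-syncCh:
			run(t)
		case t := <-asyncCh:
			run(t)
		}
	}
}
```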
Currently, async workers are started before the HTTP interface is available
to serve their requests. This patch fixes that by ensuring async workers are
started only after the HTTP interface is up.
Essentially, we are getting rid of an error message during bootstrap:
ERRO[0000] Could not fetch task error=Get http://127.0.0.1:8080/tasks: dial tcp 127.0.0.1:8080: getsockopt: connection refused
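A minimal sketch of the ordering fix, assuming the async runner polls the
server's own /tasks endpoint; binding the listener before starting workers
guarantees the socket accepts connections first (names are illustrative):

```go
package main

import (
	"net"
	"net/http"
)

func runAsyncWorkers(tasksrv string) { /* poll tasksrv for async tasks */ }

func main() {
	// bind the socket first; once Listen returns, connections are accepted
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		panic(err)
	}
	srv := &http.Server{Handler: http.DefaultServeMux}
	go srv.Serve(ln)

	// workers start strictly after the listener is up, so their first
	// GET /tasks can no longer hit "connection refused"
	go runAsyncWorkers("http://127.0.0.1:8080")

	select {} // block forever (stand-in for real lifecycle management)
}
```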
By default, BoltDB will hang while waiting to acquire the lock on the
datafile, so users might find themselves waiting without knowing for what.
The added timeout aims to inform the user about what's happening.
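A minimal sketch using BoltDB's Options.Timeout, which makes Open return an
error instead of blocking indefinitely on the file lock (the 10s value is
illustrative):

```go
package main

import (
	"log"
	"time"

	bolt "github.com/boltdb/bolt"
)

func main() {
	// without Timeout, Open blocks forever if another process holds the lock
	db, err := bolt.Open("bolt.db", 0600, &bolt.Options{Timeout: 10 * time.Second})
	if err != nil {
		// a held file lock now surfaces here as a clear error
		log.Fatalf("could not open bolt.db (is another instance running?): %v", err)
	}
	defer db.Close()
}
```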
Also, this renames MQADR to TASKSRV and refactors the configuration to read
environment variables. RunAsyncRunner now fills in the gaps when parsing
TASKSRV.
Fixes #119