* get rid of old format stuff, utils usage, fix up for fdk2.0 interface
* pure agent format removal, TODO remove format field, fix up all tests
* WIP: everything is currently broken
* fix agent tests
* start rolling through server tests
* tests compile, some failures
* remove json / content type detection on invoke/httptrigger, fix up tests
* remove hello, fixup system tests
the status checker test just hangs: it's testing that something doesn't work,
so the assertion holds but the test itself never completes. Not worth fighting right now.
* fix migration
* meh
* silence dbhelper warnings about unused dbhelpers
* move fail status into the main thread, at least
* fix status call to have FN_LISTENER
also turns off the stdout/stderr blocking between calls, because it's
impossible to debug without that (absent syslog). Now that stdout and stderr
go to the same place (either host stderr or nowhere) and are no longer used
for function output, this shouldn't be a big fuss.
* remove stdin
* cleanup/remind: fixed a bug where the watcher would leak if the container died first
* silence system-test logs until fail, fix datastore tests
postgres does weird things with constraints when renaming tables, so I took
the easy way out.
system tests were very noisy and made you download a CircleCI text file of
the logs; now they only yell when they goof.
* fix fdk-go dep for test image. fun
* fix swagger and remove test about format
* update all the gopkg files
* add back FN_FORMAT for FDKs that assert on it
* add useful error for functions that exit
this error is really confounding because containers can exit for all manner of
reasons; we're just guessing that this is the most likely cause for now, and
this error message should very likely change or be removed from the client
path anyway (context.Canceled wasn't all that useful either, but anyway, I'd
been hunting for this... so, found it). Added a test to avoid being publicly
shamed for one-line commits (beware...).
Largely a removal job; however, many tests, particularly system-level ones,
relied on Routes. These have been migrated to use Fns.
* Add 410 response to swagger
* No app names in log tags
* Adding constraint in GetCall for FnID
* Adding test to check FnID is required on call
* Add fn_id to call selector
* Fix text in docker mem warning
* Correct buildConfig func name
* Test fix up
* Removing CPU setting from Agent test
CPU setting has been deprecated, but the code base is still riddled
with it. This just removes it from this layer. Really we need to
remove it from Call.
* Remove fn id check on calls
* Reintroduce fn id required on call
* Adding fnID to calls for execute test
* Correct setting of app id in middleware
* Removes root middleware's ability to redirect fn invocations
* Add over sized test check
* Removing call fn id check
Copies the log endpoints up to the V2 endpoints, in a similar way to
the call endpoints.
The main change is to how logs are inserted into S3. The signature of the
function has been changed to take the whole call object, rather than just the
app and call IDs. This allows the function to switch between calls for Routes
and those for Fns; obviously this switching can be removed when v1 is removed.
In the sql implementation it inserts with both appID and fnID, which allows
both gets to work, as well as the downgrade of the migration (sketched
below). When the v1 logs are removed, the appID can be dropped.
The log fetch test and error messages have been changed to be FnID specific.
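A minimal sketch of the reshaped insert, assuming table, column and field
names (the real signature and schema live in the sql package):

```go
// InsertLog now receives the whole call rather than just (appID, callID), so
// the sql implementation can record both app_id and fn_id. Writing both keeps
// the v1 (app-scoped) and v2 (fn-scoped) gets working and lets the migration
// downgrade cleanly.
func (ds *sqlStore) InsertLog(ctx context.Context, call *models.Call, callLog io.Reader) error {
	b, err := ioutil.ReadAll(callLog) // assumes logs land in a text/blob column
	if err != nil {
		return err
	}
	query := ds.db.Rebind(`INSERT INTO logs (id, app_id, fn_id, log) VALUES (?, ?, ?, ?)`)
	_, err = ds.db.ExecContext(ctx, query, call.ID, call.AppID, call.FnID, b)
	return err
}
```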
/fns/{fnID}/calls
/fns/{fnID}/calls/{callID}
The S3 implementation forces our hand: if we want to list Calls under a Fn,
we have to use the FnID as a prefix on the object names, which means we need
it to look up any Call. It also makes sense in terms of resource hierarchy.
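Roughly the key layout this implies, as a sketch with assumed names rather
than the exact scheme in the s3 store:

```go
// Prefixing object keys with the FnID makes "list calls under a Fn" a cheap
// prefix scan, at the cost of requiring the FnID for any point lookup.
func callKey(fnID, callID string) string {
	return "calls/" + fnID + "/" + callID
}

// Listing is then a prefix scan, e.g. with aws-sdk-go:
//   s3svc.ListObjectsWithContext(ctx, &s3.ListObjectsInput{
//       Bucket: aws.String(bucket),
//       Prefix: aws.String("calls/" + fnID + "/"),
//   })
```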
These endpoints can optionally be disabled (like other endpoints) if a
service provider needs to provide this functionality via other means.
The 'calls' test has been fully migrated to fn calls. This was done to reduce
the copy-pasta a bit, and is OK on balance, as the routes calls will be
removed soon.
* fn: stats view/distribution improvements
*) The view latency distribution is now an argument to the view-creation
functions. This allows easier overrides to set custom buckets. It is
simplistic and assumes all latency views use the same set, but in practice
this is already the case.
*) Moved API view creation out to main; it should not be enabled for all node
types. This is consistent with the rest of the system.
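A minimal sketch of the pattern, with assumed measure and view names; the
point is only that the distribution is now a parameter:

```go
package metrics // hypothetical location

import (
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

var callLatency = stats.Float64("fn/call_latency", "call latency", "ms")

// latencyView takes the distribution as an argument so callers can override
// the buckets; all latency views currently share one set anyway.
func latencyView(latencyDist *view.Aggregation) *view.View {
	return &view.View{
		Name:        "call_latency",
		Description: "distribution of call latencies",
		Measure:     callLatency,
		Aggregation: latencyDist,
	}
}

// callers can do: view.Register(latencyView(view.Distribution(1, 10, 100, 1000)))
```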
* fn: Docker samples of cpu/mem/disk with specific buckets
Vast commit, includes:
* Introduces the Trigger domain entity.
* Introduces the Fns domain entity.
* V2 of the API for interacting with the new entities in swaggerv2.yml
* Adds v2 end points for Apps to support PUT updates.
* Rewrites the datastore level tests into a new pattern.
* V2 routes use entity ID over name as the path parameter.
* add DateTime sans mgo
* change all uses of strfmt.DateTime to common.DateTime, remove test strfmt usage
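A self-contained sketch of what a strfmt-free DateTime can look like,
assuming an RFC3339 wire format (the real common.DateTime also needs sql
scan/value support, etc.):

```go
package common

import "time"

// DateTime is a thin wrapper over time.Time that marshals as RFC3339,
// standing in for strfmt.DateTime and dropping the transitive mgo dep.
type DateTime time.Time

func (dt DateTime) MarshalJSON() ([]byte, error) {
	return []byte(time.Time(dt).Format(`"` + time.RFC3339Nano + `"`)), nil
}

func (dt *DateTime) UnmarshalJSON(b []byte) error {
	t, err := time.Parse(`"`+time.RFC3339Nano+`"`, string(b))
	if err != nil {
		return err
	}
	*dt = DateTime(t)
	return nil
}
```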
* remove api tests, system-test dep on api test
multiple reasons to remove the api tests:
* the awkward dependency on fn_go meant generating bindings on a branched fn
and vendoring those to test new stuff. This is, at a minimum, neither
intuitive, worth it, nor a fun way to spend the finite amount of time we have
to live.
* the api tests only tested a subset of the functionality that the server/ api
tests already cover, and we risk having tests where one suite tests some thing
and the other doesn't. Let's not. We have too many test suites as it is, and
these pretty much only test that we updated the fn_go bindings, which is a
hassle as noted above and which the cli will pretty quickly figure out anyway.
* fn_go relies on openapi, which relies on mgo, which is deprecated and which
we'd like to remove as a dependency. openapi is a _huge_ dep built in a NIH
fashion that cannot simply remove the mgo dep, as users may rely on it.
We've now stolen their DateTime and otherwise killed usage of it in fn core;
it still exists for fn_go, but that's less of a problem.
* update deps
removals:
* easyjson
* mgo
* go-openapi
* mapstructure
* fn_go
* purell
* go-validator
also, we had to lock docker. We shouldn't build against docker's master
anyway; they strongly advise against that. Had no luck with the latest version
rev, so I locked it to what we were using before. Until next time.
The rest is just playing dep roulette, though those changes end up removing a ton.
* fix exec test to work
* account for john le cache
* fn: test scripts should use well defined ports
Moved allocation of listener ports for mysql/minio/postgres into a helper
script driven by a list of service names.
* fn: the mysql version docker-pulled in the makefile must match the tests
* initial Db helper split - make SQL and datastore packages optional
* abstracting log store
* break out DB, MQ and log drivers as extensions
* cleanup
* fewer deps
* fixing docker test
* hmm dbness
* updating db startup
* Consolidate all your extensions into one convenient package
* cleanup
* clean up dep constraints
* datastore no longer implements logstore
the underlying implementation of our sql store implements both the datastore
and the logstore interfaces. Going forward, however, we are likely to
encounter datastore implementers that would mock out the logstore interface
and never use its methods, which signals a poor interface. This remedies
that: they are now two completely separate things, which our sqlstore happens
to implement both of.
Related to some recent changes around wrapping, this keeps the imposed metrics
and validation wrapping of a server's logstore and datastore, just moving it
into New instead of into the opts. This is so that a user can get at the
underlying datastore in order to set the logstore to it: wrapping it in a
validator/metrics layer would render it no longer a logstore implementer
(i.e. the validating datastore doesn't implement the logstore interface), so
we have to do the wrapping after defaulting the logstore to the datastore
when one wasn't provided explicitly.
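The resulting shape, sketched with abbreviated method sets (signatures approximate):

```go
// Two separate interfaces now; the sql store happens to satisfy both, but a
// custom Datastore no longer has to stub out log methods it never uses.
type Datastore interface {
	GetAppByID(ctx context.Context, appID string) (*App, error)
	// ... the rest of app/fn/trigger CRUD
}

type LogStore interface {
	InsertLog(ctx context.Context, call *Call, callLog io.Reader) error
	GetLog(ctx context.Context, appID, callID string) (io.Reader, error)
	// ... call insert/get/list
}
```

New wraps both with validation/metrics only after the logstore has (possibly)
been defaulted to the raw datastore, per the ordering constraint described above.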
* splits logstore and datastore metrics & validation logic
* `make test` should always be `make full-test`. Got rid of the distinction so
that nobody else ever again has to wait for CI to blow up on them after the
tests pass locally.
* fix new tests
* Implements graceful shutdown of agent.DataAccess and underlying Datastore/Logstore/MessageQueue
* adds tests for closing agent.DataAccess and Datastore
* move calls to logstore, implement s3
closes #482
the basic motivation is that logs and calls will be stored at a very high
write rate, while apps and routes are updated relatively infrequently; it
follows that we should split up their storage locations to back them with
appropriate storage facilities. s3 is a good candidate for ingesting data at
a higher write rate than a sql database, and will make it easier to manage
that data set. Read #482 for more detailed justification.
summary:
* calls api moved from datastore to logstore
* logstore used in front-end to serve calls endpoints
* agent now throws calls into logstore instead of datastore
* s3 implementation of calls api for logstore
* s3 logs key changed (nobody using / nbd?)
* removed UpdateCall api (not in use)
* moved call tests from datastore to logstore tests
* mock logstore now tested (prev. sqlite3 only)
* logstore tests run against every datastore (mysql, pg; prev. only sqlite3)
* simplify NewMock in tests
commentary:
the brunt of the work is implementing the listing of calls in GetCalls for
the s3 logstore implementation. The GetCalls API requires returning items
newest to oldest, while the s3 api lists items in lexicographic order. An
easy thing to do here seemed to be to reverse the encoding of our id format
so that it sorts lexicographically descending, since ids are time based,
reasonably encoded to be lexicographically sortable, and de-duped (unlike
created_at). This seems to work pretty well. It's not perfect around the
boundaries of to_time and from_time, and a tiny number of results may be
omitted, but that doesn't seem like a deal breaker: getting 6999 results
instead of 7000 when listing calls between 3:00pm and 4:00pm Monday three
weeks ago is tolerable. Without to_time and from_time, there are no issues in
listing results. We could encode created_at and use that instead, but then a
point lookup (GetCall) needs an additional marker: we would have to search
for a created_at stamp, then scan the ids around it until we find the
matching one, just to do a point lookup. So the tradeoff here seems worth it.
There is an additional optimization around to_time to seek over newer results
(since we have descending order).
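One simple way to get a descending encoding, as a self-contained sketch; the
actual id package works over its own base32 alphabet rather than raw byte
complements:

```go
package main

import "fmt"

// reverseID complements each byte of a lexicographically-ascending id so that
// an ascending key listing yields ids newest-first. Applying it twice restores
// the original, so point lookups can round-trip through the same function.
func reverseID(id string) string {
	b := []byte(id)
	for i := range b {
		b[i] = 0xFF - b[i]
	}
	return string(b)
}

func main() {
	older, newer := reverseID("01ABC"), reverseID("01ABD")
	fmt.Println(newer < older) // true: the newer id now sorts first
}
```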
The other complication in GetCalls is returning a list of calls for a given
path. Since the keys for point lookups are only app_id + call_id, and we need
listing across an app as well, this leads to the 'marker' collection, which is
sorted by app_id + path + call_id to allow quick listing by path.
All in all, it should be pretty straightforward to follow the implementation,
and I tried to be lavish with the comments; please let me know if anything
needs further clarification in the code.
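The two key spaces, sketched with hypothetical names:

```go
// Point lookups are keyed by app_id + call_id; the 'marker' collection adds
// the path so calls under a given path can be listed quickly. Both embed the
// reversed id so listings come back newest-first.
func callKey(appID, callID string) string {
	return "calls/" + appID + "/" + reverseID(callID)
}

func markerKey(appID, path, callID string) string {
	return "markers/" + appID + "/" + path + "/" + reverseID(callID)
}
```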
The implementation itself has some glaring inefficiencies, but they're
relatively minor: the json encoding is kinda lazy, but workable; s3 doesn't
offer batch retrieval, so we look each call up one by one in GetCalls; and
buffers aren't re-used. But the seeking around the keys should all be
relatively fast. I'm not too worried about performance, and this isn't a hot
path for reads (need to make a cut point and turn this in!).
Interestingly, in testing, minio performs significantly worse than pg for
storing both logs and calls (or just logs; I tested that too). minio seems to
have really high cpu consumption. In any event, we won't be using minio in
production; we'll be using a cloud object store that implements the s3 api.
This is mostly a knock on using minio for high performance, not really
anything to do with this change; I just thought it was interesting.
I think it's safe to remove UpdateCall; admittedly, this made implementing
the s3 api a lot easier. It may also be an operation we never need: it was
unused at present and was only in the cards for a previous hybrid
implementation, which we've now abandoned. If we need it, we can always
resurrect it from git.
Also not worried about changing the log key; we need to put a prefix on this
thing anyway, and I don't think anybody is using it yet. It simply means old
logs won't show up through the API, which doesn't seem like a deal breaker;
new logs will appear fine.
future:
TODO make logstore implementation optional for datastore, check in front-end
at runtime and offer a nil logstore that errors appropriately
TODO low-hanging-fruit optimizations: json encoding, re-using buffers for
download, getting multiple calls at a time; the id reverse encoding could be
optimized like the normal encoding to not be n^2
TODO api for range removal of logs and calls
* address review comments
* push id to_time magic into id package
* add note about s3 key sizes
* fix validation check
* App ID
* Clean-up
* Use ID or name to reference apps
* Can use app by name or ID
* Get rid of AppName for routes API and model
routes API is completely backwards-compatible
routes API accepts both app ID and name
* Get rid of AppName from calls API and model
* Fixing tests
* Get rid of AppName from logs API and model
* Restrict API to work with app names only
* Addressing review comments
* Fix for hybrid mode
* Fix rebase problems
* Addressing review comments
* Addressing review comments pt.2
* Fixing test issue
* Addressing review comments pt.3
* Updated docstring
* Adjust UpdateApp SQL implementation to work with app IDs instead of names
* Fixing tests
* fmt after rebase
* Make tests green again!
* Use GetAppByID wherever it is necessary
- adding new v2 endpoints to keep hybrid api/runner mode working
- extract CallBase from Call object to expose that to a user
(it doesn't include any app reference, as we do for all other API objects)
* Get rid of GetAppByName
* Adjusting server router setup
* Make hybrid work again
* Fix datastore tests
* Fixing tests
* Do not ignore app_id
* Resolve issues after rebase
* Updating test to make it work as it was
* Tabula rasa for migrations
* Adding calls API test
- we need to ensure we return "App not found" for a missing app and a missing call in the first place
- making the previous test work (request a missing call for an existing app)
* Make datastore tests work fine with correctly applied migrations
* Make CallFunction middleware work again
had to adjust its implementation to set app ID before proceeding
* The biggest rebase ever made
* Fix 8's migration
* Fix tests
* Fix hybrid client
* Fix tests problem
* Increment app ID migration version
* Fixing TestAppUpdate
* Fix rebase issues
* Addressing review comments
* Renew vendor
* Updated swagger doc per recommendations
* update vendor directory, add go.opencensus.io
* update imports
* oops
* s/opentracing/opencensus/ & remove prometheus / zipkin stuff & remove old stats
* the dep train rides again
* fix gin build
* deps from last guy
* start in on the agent metrics
* she builds
* remove tags for now, cardinality error is fussing. subscribe instead of register
* update to patched version of opencensus to proceed for now TODO switch to a release
* meh
fix imports
* println debug the bad boys
* lace it with the tags
* update deps again
* fix all inconsistent cardinality errors
* add our own logger
* fix init
* fix oom measure
* remove bugged removal code
* fix s3 measures
* fix prom handler nil
* Logs should support specifying region when using S3-compatible object store
* Use aws-sdk-go client for s3 backed logstore
* fixes vendor with aws-sdk-go dependencies
* Use retry func while trying to ping SQL datastore
- implements a retry func specifically for the SQL datastore ping (sketch below)
- fmt fixes
- uses sqlx.DB.PingContext instead of sqlx.DB.Ping
- propagates context to the SQL datastore
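A sketch of the retry, assuming a fixed interval (the real helper and backoff
may differ):

```go
package sqlds // hypothetical location

import (
	"context"
	"time"

	"github.com/jmoiron/sqlx"
)

// pingWithRetry pings until the database answers or the context is done, so
// the server tolerates a database that is still starting up.
func pingWithRetry(ctx context.Context, db *sqlx.DB) error {
	var err error
	for {
		if err = db.PingContext(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return err // hand back the last ping error
		case <-time.After(time.Second): // assumed retry interval
		}
	}
}
```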
* Rely on context from ServerOpt
* Consolidate log instances
* Cleanup
* Fix server usage in API tests
* add minio-go dep, update deps
* add minio s3 client
minio has an s3-compatible api, is an open source project and, notably, is
not amazon, so it seems best to use their client (fwiw, the aws-sdk-go is a
giant hairball of things we don't need, too). It was pretty easy and seems to
work, so rolling with it. Also, minio is a totally feasible option for fn
installs in prod / for demos / for local use.
* adds an 's3' package for an s3-compatible log storage api, for storing
logs from calls and retrieving them.
* removes the DELETE /v1/apps/:app/calls/:call/log endpoint
* removes the internal log deletion api
* changes the GetLog API to use an io.Reader, which is a backwards step atm
due to the json api for logs; I have another branch lined up to make a
plain-text log API, and this will be much more efficient (also want to gzip)
* hooked up minio to the test suite and fixed up the test suite
* added docs on how to run minio and how to point fn at it
some notes: notably, we aren't cleaning up these logs. There is already a
ticket to make a Mr. Clean who wakes up periodically and nukes old stuff, so
I'm punting on any api design around TTL deletion of logs. There are a lot of
options for Mr. Clean; notably, we can defer to him when apps are deleted,
too, so that app deletion is fast and Mr. Clean just cleans the logs up later
(seems like a good option).
We have not tested against the BMC object store, which has an s3-compatible
API, but in theory it 'just works' (the reason for doing this). In any event,
that's for service land to figure out.
closes #481, closes #473
* add log not found error to minio land
something still feels off with this, but I tinkered with it for a day-ish and
didn't come up with anything a whole lot better. Doing a lot of the
maneuvering in the caller seemed better, but it just bloated up GetCall, so I
went back to having it basically as it was, but returning the limited
underlying buffer to read from so we can ship it to the db.
some small changes to the LogStore interface: swapped it to take an io.Reader
instead of a string, for more flexibility in the future while essentially
maintaining the level of performance we have now (rough shape below). I'm
guessing that in the not-so-distant future we'll ship these to some s3-like
service, and it would be better to stream them in than to carry around a
giant string. Also, carrying around up-to-1MB buffers in memory isn't great;
we may want to switch to file-backed logs for calls, too. Using io.Reader for
logs should make #279 more reasonable if/once we move to an s3-like thing,
since we could stream from the log storage service directly to clients.
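The rough shape of the swap (method names assumed):

```go
// Before, the write side took the whole log as a string:
//   InsertLog(ctx context.Context, appID, callID, callLog string) error
// Now it takes a reader, so the capped per-call buffer can be handed over
// directly and a future s3-like backend can stream instead of copying:
type LogStore interface {
	InsertLog(ctx context.Context, appID, callID string, callLog io.Reader) error
	GetLog(ctx context.Context, appID, callID string) (io.Reader, error)
}
```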
this fixes the span being out of whack and allows the 'right' context to be
used to upload logs (next to inserting the call). Deletes the dbWriter we had;
we just do this in call.End now (which makes sense to me, at least). Removes
the duplicated code for making an stderr for hot / cold and simplifies the
way to get a func logger (no more 7-param methods, yay).
closes #298
* fix docker build
this was trivially incorrect, since glide doesn't actually provide
reproducible builds. The idea is to build with the deps that we have checked
into git, so that we actually know what code is executing and can debug it...
I'm all for the multi-stage build instead of what we had, but adding the
glide step is wrong. I added a loud warning so as to discourage this behavior
in the future.
* hang the runner, agent=new sheriff
tl;dr: the agent is now the runner, with a hopefully saner api.
the general idea is to get rid of all the various 'task' structs, change our
terminology to just 'calls', push a lot of the http construction of a call
into the agent, allow calls to easily mutate their state around their
execution, and to simplify the number of code paths, channels and context
timeouts into something [hopefully] easy to understand.
this introduces the idea of 'slots', which are either hot or cold and are
separate from reserving memory (memory is denominated in 'tokens' now).
a 'slot' is essentially a container that is ready to execute a call, be it
hot or cold (it just means different things based on hotness). Taking a look
at Submit should make these relatively easy to grok.
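The shape of the two ideas, sketched with hypothetical types:

```go
// A token is a reservation of memory; a slot is a container that is ready to
// execute one call. Submit acquires a slot (hot: an idle warm container,
// cold: a fresh one launched against a token's memory) and runs the call in it.
type token interface {
	Close() // return the reserved memory to the pool
}

type slot interface {
	exec(ctx context.Context, call *call) error
	Close() error // hot: hand the container back; cold: remove it
}
```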
sorry, things were pretty broken, especially wrt timings. I tried to keep
good notes (maybe too good) to highlight stuff, so that we don't make the
same mistakes again (history repeating itself, blah blah quote). Even now,
there is lots of work to do :)
I encourage just reading the agent.go code; Submit is really simple, and
there's a description of how the whole thing works at the head of the file
(after the TODOs). call.go contains the code for constructing calls, as well
as Start / End (small atm). I did some amount of code massaging to try to
make things simple / straightforward / fit a reasonable mental model, but as
always I'm open to critique (the more negative the better), as I'm just one
guy and wth do I know...
-----------------------------------------------------------------------------
below enumerates a number of changes as briefly as possible (heh..):
models.Call all the things:
removes models.Task, as models.Call is now what models.Task previously was.
models.FnCall is gone in favor of models.Call, despite the datastore only
storing a few of its fields [for now]. We should probably store entire calls
in the db: since app & route configurations can change at any moment, it
would be nice to see the parameters of each call (costs db space, obviously).
this removes the endpoints for getting & deleting messages; we were just
looping back to localhost to hit the MQ (this was for the iron integration, I
think), and now we call the MQ directly.
renames FnLog to LogStore; confusing, because there's also a `FuncLogger`
which uses the LogStore (punting on that). Removes other `Fn`-prefixed
structs (a redundant naming convention).
removes some unused and/or weird structs (IDStatus, CompleteTime)
updates the swagger
makes the db methods consistently use the 'Call' nomenclature.
remove runner nuisances:
* push down registry stuff to docker driver
* remove Environment / Stats stuff of yore
* remove unused writers (now in FuncLogger)
* remove 2 of the task types, old hot stuff, runner, etc
fixes the available-ram calculation on startup to not always be 300GB (helps
a lot on a laptop!)
the format of the DOCKER_AUTH env var is now a map rather than a list (there
are no docs; I'd prefer to get rid of this altogether anyway). The expected
~/.docker/cfg format is unchanged.
removes the arbitrary task queue: if a machine is out of ram, we can probably
just time out without queueing... (can open a separate discussion). In any
case, the old one didn't account well for hot tasks; it just lined everyone
up in the task queue if there wasn't a place to run hot, and then timed them
out [even if a slot became free].
removes the HEADER_ prefixing of headers in a request to invoke a call (this
was inconsistent with the cli's test anyway).
removes the TASK_ID header sent in to hot only (it duplicated FN_CALL_ID,
which has not been removed).
now user functions can reply directly to the client. This means that for cold
containers, writing to stdout will send a 200 + headers; for hot containers,
the user can reply to the client directly from the container, i.e. with their
preferred status code / headers (vs. always getting a 200).
the dispatch itself is a little http-specific atm; I think we can add an
interchange format, but the current version is easily extended to add json
for now (separate discussion). This eliminates a lot of the request/response
rewriting and buffering we were doing (yey). Now Dispatch ONLY does input and
output, vs. managing the call timeout and having access to a call's fields.
the cache is pushed down into the agent now instead of living in the front
end; I'd like to push it down to the datastore actually, but it's here for
now. Cache delete functions are removed (b/c fn is distributed anyway?).
Added app caching, which should help with latency.
in general, a lot of server/runner.go got pushed down into the agent. I think
it will be useful in testing to be able to construct calls without having to
invoke http handlers, and async also needs to construct calls without a handler.
safe shutdown actually works now for everything (we leaked / didn't wait on
certain things before).
we now wait for hot slots to open up while simultaneously attempting to get
ram to launch a container, if no hot slot was free to run the call
immediately. We can change this policy really easily now (no more channel
jungle; still some channels). We also look for somewhere else to go while a
container is launching. Slots now get sent _out_ of a container, vs. a
container receiving calls, which makes this kind of policy easier to
implement. This fixes a number of bugs around things like trying to execute
calls against containers that have not and may never start, and trying to
launch a bazillion containers when there are no free ones. The driver api
underwent some changes to make this possible (relatively minimal; added
Wait). The easiest way to think about it is that allocating ram has moved
'up', instead of just wrapping container launch, so that we can select on a
channel while trying to find ram.
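A sketch of that select, with hypothetical channel and helper names:

```go
// getSlot waits on a hot slot and on a memory token at once, taking whichever
// arrives first; the call's own context bounds the whole wait.
func getSlot(ctx context.Context, hotSlots <-chan slot, ramTokens <-chan token) (slot, error) {
	select {
	case s := <-hotSlots: // a warm container freed up first
		return s, nil
	case tok := <-ramTokens: // got memory first: launch a cold container
		return launchCold(ctx, tok) // hypothetical helper
	case <-ctx.Done(): // the call timed out while waiting
		return nil, ctx.Err()
	}
}
```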
we also no longer dispatch hot calls to containers that have died...
the timeout now starts at the beginning of Submit, rather than Dispatch or
the container itself having to manage the call timeout, which was inaccurate
since finding a slot / allocating ram / pulling an image can all take a
non-trivial (timeout-sized, even!) amount of time. This makes for much more
reasonable response times from fn under load; there's still a little TODO
about handling cold+timeout container removal response times, but it's much
improved.
if call.Start is called with less than call.timeout/2 time left, the call
will not be executed and returns a timeout. We can discuss; this makes async
play _a lot_ nicer, specifically. For large timeouts, /2 makes less sense.
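The guard, roughly (names hypothetical):

```go
// enoughTime reports whether at least half the call's timeout remains; Start
// refuses to execute otherwise, which keeps async from burning a container
// start on a call that would die mid-flight anyway.
func enoughTime(ctx context.Context, timeout time.Duration) bool {
	deadline, ok := ctx.Deadline()
	return !ok || time.Until(deadline) >= timeout/2
}
```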
env is no longer getting upper-cased (admittedly, this can look a little
weird now). Our whole route.Config / app.Config / env / headers story
probably deserves a discussion of its own...
sync output no longer includes the call id in the json if there's an error /
timeout. We could add it back to signify that it's _us_ writing these, but it
was out of place. FN_CALL_ID is still shipped out to convey the id for sync
calls, and async [server] output remains unchanged.
async logs are now an entire raw http request (so that a user can write a 400
or something from their hot async container).
async hot now 'just works'.
cold sync calls can now reply to the client before container removal, which
shaves a lot of latency off of those (we still eat the start). Still need to
figure out async removal on timeout or the like.
-----------------------------------------------------------------------------
I've located a number of bugs that were generally inherited, and added a
number of TODOs at the head of the agent.go file for robustness we probably
need to add. This is at least at parity with the previous implementation, to
my knowledge (hopefully/likely a good bit ahead). I can memorialize these to
github quickly enough; not that anybody searches before adding bugs anyway
(sigh).
the big thing to work on next imo is async being a lot more robust,
specifically to survive fn server failures / network issues.
thanks for review (gulp)
* Renamed a bunch of images to use fnproject org.
* Multi-stage build for Docker.
* Added tmp vendor dirs to gitignore.
* Run docker-build at beginning of test.
replace the default bolt option with a sqlite3 option. The story here is that
we just need a working out-of-the-box solution, and sqlite3 is just fine for
that (actually, likely better than bolt).
with sqlite3 supplanting bolt, we mostly have sql databases. So remove redis,
and then we have one package with a `sql` implementation of
`models.Datastore`, leaning on sqlx to do query rewriting (see the sketch at
the end of this entry). This does mean queries have to be formed a certain
way and likely have to be ANSI SQL (no special features), but we weren't
using those anyway; our base api is basically done, and we can easily extend
it as needed to only implement certain methods in certain backends if we need
to get cute.
* remove bolt & redis datastores (can still use as mqs)
* make sql queries work on all 3 (maybe?)
* remove bolt log store and use sqlite3
* fold the FnLog into the datastore for now (free pg/mysql logs... just for
demos, etc, not prod)
* fix up the docs to remove bolt references
* add sqlite3, sqlx dep
* fix up tests & mock stuff, make validator less insane
* remove put & get in the datastore layer, as nobody is using them.
this passes the tests, which at least seem to exercise all the different
backends. If we trust our tests, then this works great. (the `make
docker-test-run-with-*` targets work now too)
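A sketch of the sqlx pattern this leans on (table/column names assumed):
write one ANSI query with `?` placeholders and let sqlx rebind it for each
driver.

```go
// getAppID runs the same query text on sqlite3, mysql and postgres; Rebind
// rewrites the `?` placeholders into the driver's bindvar style (e.g. $1).
func getAppID(ctx context.Context, db *sqlx.DB, name string) (string, error) {
	query := db.Rebind(`SELECT id FROM apps WHERE name = ?`)
	var id string
	err := db.QueryRowxContext(ctx, query, name).Scan(&id)
	return id, err
}
```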