* WIP: add k8s grouper
- This shares a great deal of behaviour with allGrouper. Once it's
tested, refactor that to share as much as possible
- Glide hell. Checked in the yaml and lock files, but a `glide i -v`
will be required to bring vendor/ up to date. Will address once this
is ready.
* Update README. Make the watch tracking work.
(To follow: add the junk that was pulled in via the glide update.)
* Vendor updates.
* go fmt
* Use the allGrouper with a k8s-backed DBStore instead.
This is much tidier :-)
* Fix up go vet
* fn: fnlb: default health state for new nodes
*) Any new node now defaults to the unknown state.
*) One health check is required for a node in unknown state to move
into the healthy/unhealthy states; after that, the regular interval
and thresholds apply.
*) add() no longer runs a health check, as this is now handled by
the new logic.
This means that during restarts fnlb will run one health check
immediately so nodes can switch to healthy/unhealthy state, which
ensures a speedy start while guarding against routing traffic to
unhealthy servers. After this initial check, nodes are subject to
the regular interval and thresholds.
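the transition above can be sketched roughly like this (state names, fields,
and threshold handling here are illustrative, not the actual fnlb types):

```go
package main

import "fmt"

type healthState int

const (
	stateUnknown healthState = iota
	stateHealthy
	stateUnhealthy
)

type node struct {
	state     healthState
	successes int
	failures  int
}

// recordCheck applies one health-check result. a node in unknown state moves
// to healthy/unhealthy after a single check; after that the configured
// thresholds apply (healthy threshold defaults to 1 per the notes).
func (n *node) recordCheck(ok bool, healthyThreshold, unhealthyThreshold int) {
	if n.state == stateUnknown {
		if ok {
			n.state = stateHealthy
		} else {
			n.state = stateUnhealthy
		}
		return
	}
	if ok {
		n.successes++
		n.failures = 0
		if n.successes >= healthyThreshold {
			n.state = stateHealthy
		}
	} else {
		n.failures++
		n.successes = 0
		if n.failures >= unhealthyThreshold {
			n.state = stateUnhealthy
		}
	}
}

func main() {
	// add() no longer health-checks, so new nodes start out unknown;
	// the first check flips unknown -> healthy immediately.
	n := &node{state: stateUnknown}
	n.recordCheck(true, 1, 3)
	fmt.Println(n.state == stateHealthy)
}
```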
* style fixes
* fn: fnlb: enhancements and new grouper tests
*) added healthy threshold (default: 1)
*) grouper is now using configured hcEndpoint for version checks
*) grouper now logs when servers switch between healthy/unhealthy status
*) moved DB code out of grouper
*) run health check immediately at start (don't wait until hcInterval)
*) optional shutdown timeout (default: 0) & mgmt port (default: 8081)
*) hot path List() in grouper now uses atomic ptr Load
*) consistent router: moved closure to a new function
*) bugfix: version parsing from fn servers should not panic fnlb
*) bugfix: servers removed from DB, stayed in healthy list
*) bugfix: if DB is down, health checker stopped monitoring
*) basic new tests for the grouper (add/remove/unhealthy/healthy server)
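the hot-path List() change can be sketched like this (using sync/atomic's
Value as the atomic ptr; names are hypothetical, not the real grouper):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// readers do a single atomic load instead of taking a lock; the health
// checker swaps in a freshly built slice whenever the healthy set changes.
type grouper struct {
	healthy atomic.Value // holds []string
}

// List is the hot path: one atomic pointer load, no locks.
func (g *grouper) List() []string {
	if v := g.healthy.Load(); v != nil {
		return v.([]string)
	}
	return nil
}

// setHealthy stores a new snapshot; in-flight readers keep the old slice.
func (g *grouper) setHealthy(nodes []string) {
	g.healthy.Store(nodes)
}

func main() {
	g := &grouper{}
	g.setHealthy([]string{"10.0.0.1:8080", "10.0.0.2:8080"})
	fmt.Println(g.List())
}
```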
go vet caught some nifty bugs, so those are fixed here, and we now vet
everything from now on, since the robots seem to do a better job of
vetting than we have managed to.
also adds a gofmt check to circle. this could move into the test.sh script
(didn't want a script calling a script, because $reasons), and it's nicely
isolated in its own little land as it is. side note: changed the script so it
runs in 100ms instead of 3s; find is a lot faster than go list.
attempted some minor cleanup of various scripts
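a minimal sketch of what such a find-based gofmt check can look like (the
vendor/ filter and messages are assumptions, not the actual script):

```shell
# check formatting of all .go files outside vendor/, using find rather than
# the much slower `go list`; exit non-zero if any file needs gofmt.
files=$(find . -name '*.go' -not -path './vendor/*')
if command -v gofmt >/dev/null 2>&1 && [ -n "$files" ]; then
    bad=$(gofmt -l $files)
    if [ -n "$bad" ]; then
        echo "gofmt needed on:"
        echo "$bad"
        exit 1
    fi
fi
echo "gofmt check passed"
```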
now we can run multiple lbs in the same 'cluster' and they will all point to
the same nodes. lb nodes are not guaranteed to have the same set of
functions nodes to route to at any point in time, since each lb node
performs its own health checks independently, but they will all at least be
backed by the same list from the db to health check. if there will ever be
more than a few lbs we can rethink this strategy; we mostly need to back
the lbs with a db so that they persist nodes and remain fault tolerant in that
sense. the strategy of independent health checks is useful to reduce thrashing
the db during network partitions between lb and fn pairs. it would be nice to
have gossip health checking to reduce network traffic, but this works too, and
we'll need to seed any gossip protocol with a list from a db anyway.
db_url is the same format as what functions takes. i don't have env vars set
up for fnlb right now (low-hanging fruit); the flag is `-db`, and it defaults
to in-memory sqlite3, so nodes will be forgotten between reboots. used sqlx,
and decided not to put the lb tables in the datastore, as it was easy enough
to just add them here to get the sugar and avoid bloating the datastore
interface. the tables won't collide, so we can even use the same pg/mysql
that the fn servers are running in prod; db load from the lb is low (1 call
every 1s per lb).
i still need to add some tests; manual touch testing worked as expected.
this structure should allow us to keep the consistent hash code and just use
consistent hashing on a subset of nodes; then, to satisfy the oracle
service stuff in functions-service, we can implement a different "Grouper"
that does vm allocation and whatever other magic we need to manage nodes and
poop out sets of nodes based on tenant id / func.
for the sugar, see main.go and proxy.go; the rest is basically renaming /
moving stuff (not easy to follow the changes, nature of the beast).
the only 'issues' i can think of: down in the ch stuff (or Router) we will
need a back channel to tell the 'Grouper' to add a node (i.e. when all nodes
for that shard are currently loaded), which isn't great, and the grouper also
has no way of knowing that a node in a given set may no longer be in use.
still thinking about how to couple those two. basically i don't want to have
to copy that consistent hash code, but after munging with stuff i'm almost at
'fuck it' level, and maybe it's worth it to just copy and hack it up in
functions-service for what we need. we'll also need different key funcs for
groupers and routers eventually (the grouper wants tenant id, the router
needs tenant id + route). anyway, open to any ideas; i haven't come up with
anything great. feedback on the interface would be great.
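for discussion, here's a rough sketch of the shape i have in mind; the
interface names and signatures are hypothetical (inferred from the notes),
not the real fnlb code:

```go
package main

import "fmt"

// Grouper produces the candidate node set for a request key
// (a tenant id, say); a functions-service Grouper could do vm
// allocation and hand back per-tenant sets instead.
type Grouper interface {
	List(key string) []string
}

// Router picks one node from the group, e.g. via consistent
// hashing; its key would eventually be tenant id + route.
type Router interface {
	Route(nodes []string, key string) (string, error)
}

// allGrouper mirrors the current behaviour: ignore the key and
// return every known node.
type allGrouper struct{ nodes []string }

func (g allGrouper) List(string) []string { return g.nodes }

func main() {
	var g Grouper = allGrouper{nodes: []string{"n1", "n2"}}
	fmt.Println(g.List("tenant-42"))
}
```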
after this we can plumb the datastore stuff into the allGrouper pretty easily