* fn: non-blocking resource tracker and notification
For some types of errors, we might want to notify
the actual caller if the error is directly 1-1 tied
to that request. If hotLauncher is triggered with
signaller, then here we send a back communication
error notification channel. This is passed to
checkLaunch to send back synchronous responses
to the caller that initiated this hot container
launch.
This is useful if we want to run the agent in
quick fail mode, where instead of waiting for
CPU/Mem to become available, we prefer to fail
quick in order not to hold up the caller.
To support this, non-blocking resource tracker
option/functions are now available.
* fn: test env var rename tweak
* fn: fixup merge
* fn: rebase test fix
* fn: merge fixup
* fn: test tweak down to 70MB for 128MB total
* fn: refactor token creation and use broadcast regardless
* fn: nb description
* fn: bugfix
* fn: reorg agent config
*) Moving constants in agent to agent config, which helps
with testing, tuning.
*) Added max total cpu & memory for testing & clamping max
mem & cpu usage if needed.
* fn: adjust PipeIO time
* fn: for hot, cannot reliably test EndOfLogs in TestRouteRunnerExecution
* update vendor directory, add go.opencensus.io
* update imports
* oops
* s/opentracing/opencensus/ & remove prometheus / zipkin stuff & remove old stats
* the dep train rides again
* fix gin build
* deps from last guy
* start in on the agent metrics
* she builds
* remove tags for now, cardinality error is fussing. subscribe instead of register
* update to patched version of opencensus to proceed for now TODO switch to a release
* meh
fix imports
* println debug the bad boys
* lace it with the tags
* update deps again
* fix all inconsistent cardinality errors
* add our own logger
* fix init
* fix oom measure
* remove bugged removal code
* fix s3 measures
* fix prom handler nil
* add spans to async
* clean up / add spans to agent
* there were a few methods which had multiple contexts which existed in the same
scope (this doesn't end well, usually), flattened those out.
* loop bound context cancels now rely on defer (also was brittle)
* runHot had a lot of ctx shuffling, flattened that.
* added some additional spans in certain paths for added granularity
* linked up the hot launcher / run hot / wait hot to _a_ root span, the first
2 are follows from spans, but at least we can see the source of these and also
can see containers launched over a hot launcher's lifetime
I left TODO around the FollowsFrom because OpenCensus doesn't, at least at the
moment, appear to have any idea of FollowsFrom and it was an extra OpenTracing
method (we have to get the span out, start a new span with the option, then
add it to the context... some shuffling required). anyway, was on the fence
about adding at least.
* resource waiters need to manage their own goroutine lifecycle
* if we get an impossible memory request, bail instead of infinite loop
* handle timeout slippery case
* still sucks, but hotLauncher doesn't leak anything. even the time.After timer goroutines
* simplify GetResourceToken
GetCall can guard against the impossible to allocate resource tasks entering
the system by erroring instead of doling them out. this makes GetResourceToken
logic more straightforward for callers, who now simply have the contract that
they won't ever get a token if they let tasks into the agent that can't run
(but GetCall guards this, and there's a test for it).
sorry, I was going to make this only do that, but when I went to fix up the
tests, my last patch went haywire so I fixed that too. this also at least
tries to simplify the hotLaunch loop, which will now no longer leak time.After
timers (which were long, and with signaller, they were many -- I got a stack
trace :) -- this breaks out the bottom half of the logic to check to see if we
need to launch into its own function, and handles the cleaning duties only in
the caller instead of in 2 different select statements. played with this a
bit, no doubt further cleaning could be done, but this _seems_ better.
* fix vet
* add units to exported method contract docs
* oops
* fn: resource and slot cancel and broadcast improvements
*) Context argument does not wake up the waiters correctly upon
cancellation/timeout.
*) Avoid unnecessary broadcasts in slot and resource.
* fn: limit scope of context in resource/slot calls in agent
* fn: cancellations in WaitAsyncResource
Added go context with cancel to wait async resource. Although
today, the only case for cancellation is shutdown, this cleans
up agent shutdown a little bit.
* fn: locked broadcast to avoid missed wake-ups
* fn: removed ctx arg to WaitAsyncResource and startDequeuer
This is confusing and unnecessary.
* squash# This is a combination of 10 commits2
fn: get available memory related changes
*) getAvailableMemory() improvements
*) early fail if requested memory too large to meet
*) tracking async and sync pools individually. Sync pool
is reserved for sync jobs only, while async pool can be
used by all jobs.
*) head room estimation for available memory in Linux.