Commit Graph

229 Commits

Author SHA1 Message Date
Tolga Ceylan
8f969918bd fn: removing unused/dead code (#1051) 2018-06-08 15:51:19 -07:00
Tolga Ceylan
4fcb52f69d fn: MaxTotalCPU and MaxTotalMemory in non-Linux systems (#1043)
Non-Linux systems skip some of memory/cpu determination
code in resource tracker. But config settings to cap
these are used in tests, so they must not be ignored.

With this change, we apply these config settings even
on non-Linux systems.

Memory allocation code is also now same in non-Linux
systems, but default is raised to 2GB from 1.5GB.
2018-06-06 14:50:21 -07:00
Owen Cliffe
c6abc8bf64 Use context logging more to ensure context vars are present in log lines (#1039) 2018-06-06 15:14:29 +01:00
Tolga Ceylan
4af53025d8 fn: lb-agent: Initial TryCall result can be retriable. (#1035)
Before this change, we assumed data may end up in a container
once we placed a TryCall() and if gRPC send failed, we did not
retry. However, a send failure cannot result in data in a
container, since only upon successful receipt of a TryCall can
pure-runner schedule a call into a container. Here we trust
gRPC and if gRPC layer says it could not send a msg, then
the receiver did not receive it.
2018-06-05 14:41:13 -07:00
Andrea Rosa
c2c295ffb3 Add a LBAgent constructor which accept AgentConfig (#1037)
In some cases could be useful to pass Agent configurations to the
LnAgent constuctor, this small change adds a new constructor which
accepts an agent configuration as additional parameter.
2018-06-05 13:59:43 -07:00
Tolga Ceylan
1cd5894f41 fn: LB agent: reduce 'Too Busy' error logs (#1033)
With this PR, runner client translates too busy errors
from gRPC session and runner itself into Fn error type.
Placers now ignore this error message to reduce unnecessary
logging.
2018-06-04 12:16:00 -07:00
Tolga Ceylan
7261ddedcc fn: LB agent: EOF from runner is normal in nack cases (#1032) 2018-06-04 12:10:00 -07:00
Reed Allman
00c29b8bf3 datastore no longer implements logstore (#1013)
* datastore no longer implements logstore

the underlying implementation of our sql store implements both the datastore
and the logstore interface, however going forward we are likely to encounter
datastore implementers that would mock out the logstore interface and not use
its methods - signalling a poor interface. this remedies that, now they are 2
completely separate things, which our sqlstore happens to implement both of.

related to some recent changes around wrapping, this keeps the imposed metrics
and validation wrapping of a servers logstore and datastore, just moving it
into New instead of in the opts - this is so that a user can have the
underlying datastore in order to set the logstore to it, since wrapping it in
a validator/metrics would render it no longer a logstore implementer (i.e.
validate datastore doesn't implement the logstore interface), we need to do
this after setting the logstore to the datastore if one wasn't provided
explicitly.

* splits logstore and datastore metrics & validation logic
* `make test` should be `make full-test` always. got rid of that so that
nobody else has to wait for CI to blow up on them after the tests pass locally
ever again.

* fix new tests
2018-06-04 00:08:16 -07:00
Tolga Ceylan
a57907eed0 fn: user friendly timeout handling changes (#1021)
* fn: user friendly timeout handling changes

Timeout setting in routes now means "maximum amount
of time a function can run in a container".

Total wait time for a given http request is now expected
to be handled by the client. As long as the client waits,
the LB, runner or agents will search for resources to
schedule it.
2018-06-01 13:18:13 -07:00
Tolga Ceylan
f97b63f878 fn: fixup temp dir read/write permissions if tmp fs size is not set. (#1024)
When TmpFsSize is not set in a route, docker fails to create a /tmp
mount that is writable. Forcing docker to explicitly to this if
read-only root directory is enabled (default).
2018-06-01 10:49:07 -07:00
Tolga Ceylan
e1b7e30e49 fn: cleanup of unused/global constants in lb agent (#1020)
Moved retry interval as placer member variable for time-being.
2018-05-31 13:04:06 -07:00
Tolga Ceylan
d190167580 fn: read-only root fs becomes default (#1019)
* fn: read-only root fs becomes default

Set root fs as read-only by default.

* fn: update doc for FN_DISABLE_READONLY_ROOTFS
2018-05-30 18:17:28 -07:00
Tolga Ceylan
7f1d14d21f fn: slot hash id must be utf8 in gRPC (#1016) 2018-05-29 16:26:43 -07:00
Tolga Ceylan
74a5379dec fn: lb & pure-runner slot hash id communication (#1007)
* fn: lb & pure-runner slot hash id communication

With this change, LB can pre-calculate the slot hash
key and pass it to runners. If LB knows/calculates
the slot hash ids, then it can also make better
estimates on which runner can successfully execute
it especially when status messages from runner
include a small summary of idle slots for a given
slot hash id. (TODO)

* fn: fix mock test
2018-05-25 14:12:48 -07:00
Tolga Ceylan
9584643142 fn: size restricted tmpfs /tmp and read-only / support (#1012)
* fn: size restricted tmpfs /tmp and read-only / support

*) read-only Root Fs Support
*) removed CPUShares from docker API. This was unused.
*) docker.Prepare() refactoring
*) added docker.configureTmpFs() for size limited tmpfs on /tmp
*) tmpfs size support in routes and resource tracker
*) fix fn-test-utils to handle sparse files better in create file

* test typo fix
2018-05-25 14:12:29 -07:00
Gerardo Viedma
ea1f94253f Implement graceful shutdown of agent.DataAccess (#1008)
* Implements graceful shutdown of agent.DataAccess and underlying Datastore/Logstore/MessageQueue

* adds tests for closing agent.DataAccess and Datastore
2018-05-21 11:28:21 +01:00
Tolga Ceylan
77086ecc24 fn: lb-agent & runner gRPC updates (#1005)
Breaking changes:

*) Removed unused ACK/NACK definitions
*) Extended Finished messages with error code/str
2018-05-17 15:02:15 -07:00
Tolga Ceylan
7cf8e2a61d fn: pure-runner time out while waiting TryCall (#1006)
This should return a retriable error code 503.
2018-05-17 15:00:50 -07:00
Tolga Ceylan
4ccde8897e fn: lb and pure-runner with non-blocking agent (#989)
* fn: lb and pure-runner with non-blocking agent

*) Removed pure-runner capacity tracking code. This did
not play well with internal agent resource tracker.
*) In LB and runner gRPC comm, removed ACK. Now,
upon TryCall, pure-runner quickly proceeds to call
Submit. This is good since at this stage pure-runner
already has all relevant data to initiate the call.
*) Unless pure-runner emits a NACK, LB immediately
streams http body to runners.
*) For retriable requests added a CachedReader for
http.Request Body.
*) Idempotenty/retry is similar to previous code.
After initial success in Engament, after attempting
a TryCall, unless we receive NACK, we cannot retry
that call.
*) ch and naive places now wraps each TryExec with
a cancellable context to clean up gRPC contexts
quicker.

* fn: err for simpler one-time read GetBody approach

This allows for a more flexible approach since we let
users to define GetBody() to allow repetitive http body
read. In default LB case, LB executes a one-time io.ReadAll
and sets of GetBody, which is detected by RunnerCall.RequestBody().

* fn: additional check for non-nil req.body

* fn: attempt to override IO errors with ctx for TryExec

* fn: system-tests log dest

* fn: LB: EOF send handling

* fn: logging for partial IO

* fn: use buffer pool for IO storage in lb agent

* fn: pure runner should use chunks for data msgs

* fn: required config validations and pass APIErrors

* fn: additional tests and gRPC proto simplification

*) remove ACK/NACK messages as Finish message type works
OK for this purpose.
*) return resp in api tests for check for status code
*) empty body json test in api tests for lb & pure-runner

* fn: buffer adjustments

*) setRequestBody result handling correction
*) switch to bytes.Reader for read-only safety
*) io.EOF can be returned for non-nil Body in request.

* fn: clarify detection of 503 / Server Too Busy
2018-05-17 12:09:03 -07:00
Mark Godfrey
ac4e6c5a03 Replace panic on enqueue for LB agent with error. (#1004) 2018-05-17 13:05:47 +01:00
Tolga Ceylan
eab85dfab0 fn: agent MaxRequestSize limit (#998)
* fn: agent MaxRequestSize limit

Currently, LimitRequestBody() exists to install a
http request body size in http/gin server. For production
enviroments, this is expected to be used. However, in agents
we may need to verify/enforce these size limits and to be
able to assert in case of missing limits is valuable.
With this change, operators can define an agent env variable
to limit this in addition to installing Gin/Http handler.

http.MaxBytesReader is superior in some cases as it sets
http headers (Connection: close) to guard against subsequent
requests.

However, NewClampReadCloser() is superior in other cases,
where it can cleanly return an API error for this case alone
(http.MaxBytesReader() does not return a clean error type
for overflow case, which makes it difficult to use it without
peeking into its implementation.)

For lb agent, upcoming changes rely on such limits enabled
and using gin/http handler (http.MaxBytesReader) makes such
checks/safety validations difficult.

* fn: read/write clamp code adjustment

In case of overflows, opt for simple implementation
of a partial write followed by return error.
2018-05-16 11:45:57 -07:00
Reed Allman
cbe0d5e9ac add user syslog writers to app (#970)
* add user syslog writers to app

users may specify a syslog url[s] on apps now and all functions under that app
will spew their logs out to it. the docs have more information around details
there, please review those (swagger and operating/logging.md), tried to
implement to spec in some parts and improve others, open to feedback on
format though, lots of liberty there.

design decision wise, I am looking to the future and ignoring cold containers.
the overhead of the connections there will not be worth it, so this feature
only works for hot functions, since we're killing cold anyway (even if a user
can just straight up exit a hot container).

syslog connections will be opened against a container when it starts up, and
then the call id that is logged gets swapped out for each call that goes
through the container, this cuts down on the cost of opening/closing
connections significantly. there are buffers to accumulate logs until we get a
`\n` to actually write a syslog line, and a buffer to save some bytes when
we're writing the syslog formatting as well. underneath writers re-use the
line writer in certain scenarios (swapper). we could likely improve the ease
of setting this up, but opening the syslog conns against a container seems
worth it, and is a different path than the other func loggers that we create
when we make a call object. the Close() stuff is a little tricky, not sure how
to make it easier and have the ^ benefits, open to idears.

this does add another vector of 'limits' to consider for more strict service
operators. one being how many syslog urls can a user add to an app (infinite,
atm) and the other being on the order of number of containers per host we
could run out of connections in certain scenarios. there may be some utility
in having multiple syslog sinks to send to, it could help with debugging at
times to send to another destination or if a user is a client w/ someone and
both want the function logs, e.g. (have used this for that in the past,
specifically).

this also doesn't work behind a proxy, which is something i'm open to fixing,
but afaict will require a 3rd party dependency (we can pretty much steal what
docker does). this is mostly of utility for those of us that work behind a
proxy all the time, not really for end users.

there are some unit tests. integration tests for this don't sound very fun to
maintain. I did test against papertrail with each protocol and it works (and
even times out if you're behind a proxy!).

closes #337

* add trace to syslog dial
2018-05-15 11:00:26 -07:00
Tolga Ceylan
8e440c835e fn: fixup undeterministic test (#986) 2018-05-10 08:08:10 -07:00
Tolga Ceylan
508d9e18c7 fn: nonblocking resource manager tests (#987) 2018-05-09 19:23:10 -07:00
Tolga Ceylan
0f50537150 fn: allow specified docker networks in functions (#982)
* fn: allow specified docker networks in functions

If FN_DOCKER_NETWORK is specified with a list of
networks, then agent driver picks the least used
network to place functions on.

* add mutex comment
2018-05-09 12:24:15 -07:00
Tolga Ceylan
676c87f9a5 fn: pure-runner violates io Writer contract (#981)
We must copy the data slice.
2018-05-09 16:48:55 +01:00
jan grant
91e58afa55 The opencensus API changes between 0.6.0 and 0.9.0 (#980)
We get some useful features in later versions; update so as to not
pin downstream consumers (extensions) to an older version.
2018-05-09 14:55:00 +01:00
Reed Allman
1f1624782b related: https://github.com/fnproject/fdk-go/pull/26 (#968)
adds a test for the protocol dumping of a request to the containert stdin.
there are a number of vectors to test for a cloud event, but since we're going
to change that behavior soon it's probably a waste of time to go about doing
so. in any event, this was pretty broken. my understanding of the cloud event
spec is deepening and the json stuff overall seems a little weird.

* fixes content type issue around json checking (since a string is also a json
value, we can just decode it, even though it's wasteful it's more easily
correct)
* doesn't force all json values to be map[string]interface{} and lets them be
whoever they want to be. maybe their dads are still proud.

closes #966
2018-05-07 22:22:53 -07:00
Tolga Ceylan
f0f9a6d945 fn: LB ch and naive fixes (#942)
* fn: LB ch and naive fixes

*) Naive is now a naive RR algorithm.
*) Both now checks for ctx/timeout in each attempt.

* fn: test fix
2018-05-07 11:50:16 -07:00
Tolga Ceylan
54ba49be65 fn: non-blocking resource tracker and notification (#841)
* fn: non-blocking resource tracker and notification

For some types of errors, we might want to notify
the actual caller if the error is directly 1-1 tied
to that request. If hotLauncher is triggered with
signaller, then here we send a back communication
error notification channel. This is passed to
checkLaunch to send back synchronous responses
to the caller that initiated this hot container
launch.

This is useful if we want to run the agent in
quick fail mode, where instead of waiting for
CPU/Mem to become available, we prefer to fail
quick in order not to hold up the caller.
To support this, non-blocking resource tracker
option/functions are now available.

* fn: test env var rename tweak

* fn: fixup merge

* fn: rebase test fix

* fn: merge fixup

* fn: test tweak down to 70MB for 128MB total

* fn: refactor token creation and use broadcast regardless

* fn: nb description

* fn: bugfix
2018-04-24 21:59:33 -07:00
Travis Reeder
3eb60e2028 CloudEvents I/O format support. (#948)
* CloudEvents I/O format support.

* Updated format doc.

* Remove log lines

* This adds support for CloudEvent ingestion at the http router layer.

* Updated per comments.

* Responds with full CloudEvent message.

* Fixed up per comments

* Fix tests

* Checks for cloudevent content-type

* doesn't error on missing content-type.
2018-04-23 16:05:13 -07:00
Tolga Ceylan
c0ee3ce736 fn: locked mutex while blocked on I/O considered harmful (#935)
* fn: mutex while waiting I/O considered harmful

*) Removed hold mutex while wait I/O cases these
included possible disk I/O and network I/O.

*) Error/Context Close/Shutdown semantics changed since
the context timeout and comments were misleading. Close
always waits for pending gRPC session to complete.
Context usage here was merely 'wait up to x secs to
report an error' which only logs the error anyway.
Instead, the runner can log the error. And context
still can be passed around perhaps for future opencensus
instrumentation.
2018-04-13 11:23:29 -07:00
Tolga Ceylan
623aeb35b2 fn: common.WaitGroup improvements (#940)
* fn: common.WaitGroup improvements

*) Split the API into AddSession/DoneSession
*) Only wake up listeners when session count reaches zero.

* fn: WaitGroup go-routine blast test

* fn: test fix and rebase fixup
2018-04-12 16:21:13 -07:00
Tolga Ceylan
e47d55056a fn: reduce lbagent and agent dependency (#938)
* fn: reduce lbagent and agent dependency

lbagent and agent code is too dependent. This causes
any changed in agent to break lbagent. In reality, for
LB there should be no delegated agent. Splitting these
two will cause some code duplication, but it reduces
dependency and complexity (eg. agent without docker)

* fn: post rebase fixup

* fn: runner/runnercall should use lbDeadline

* fn: fixup ln agent test

* fn: remove agent create option for common.WaitGroup
2018-04-12 15:51:58 -07:00
Tolga Ceylan
e53d23afc9 fn: sync.WaitGroup replacement common.WaitGroup (#937)
* fn: sync.WaitGroup replacement common.WaitGroup

agent/lb_agent/pure_runner has been incorrectly using
sync.WaitGroup semantics. Switching these components to
use the new common.WaitGroup() that provides a few handy
functionality for common graceful shutdown cases.

From https://golang.org/pkg/sync/#WaitGroup,
    "Note that calls with a positive delta that occur when the counter
     is zero must happen before a Wait. Calls with a negative delta,
     or calls with a positive delta that start when the counter is
     greater than zero, may happen at any time. Typically this means
     the calls to Add should execute before the statement creating
     the goroutine or other event to be waited for. If a WaitGroup
     is reused to wait for several independent sets of events,
     new Add calls must happen after all previous Wait calls have
     returned."

HandleCallEnd introduces some complexity to the shutdowns, but this
is currently handled by AddSession(2) initially and letting the
HandleCallEnd() when to decrement by -1 in addition to decrement -1 in
Submit().

lb_agent shutdown sequence and particularly timeouts with runner pool
needs another look/revision, but this is outside of the scope of this
commit.

* fn: lb-agent wg share

* fn: no need to +2 in Submit with defer.

Removed defer since handleCallEnd already has
this responsibility.
2018-04-12 11:33:01 -07:00
Tolga Ceylan
9b86e3626e fn: avoid go-routine leak (#934) 2018-04-11 12:11:08 -07:00
Tolga Ceylan
600dce5b2c fn: lb_agent_test tweak (#931)
Time.Sleep() blocking fixes in placers (naive and ch) improves
some of the timing in processing, therefore reducing the max
calls settings in mock runner pool.
2018-04-10 18:28:47 -07:00
Tolga Ceylan
e7658db822 Move ch ring placement back from old FnLB. (#930)
* fn: bring back CH ring placer into FN repo based on original FnLB
* fn: move placement code into runnerpool directory
2018-04-10 17:26:24 -07:00
Tolga Ceylan
95d7379a81 fn: pure runner concurrency fixes (#928)
* fn: experimental pure_runner rewrite

* fn: minor refactor

* fn: added comments and moved pipe initialization to NewCallHandle

* fn: EOF gRPC and EOF DataFrame handling
2018-04-10 15:50:32 -07:00
Tolga Ceylan
ee262901a2 fn: handleCallEnd and submit improvements (#919)
* fn: move call error/end handling to handleCallEnd

This simplifies submit() function but moves the burden
of retriable-versus-committed request handling and slot.Close()
responsibility to handleCallEnd().
2018-04-10 10:48:12 -07:00
Tolga Ceylan
c1f0707b60 fn: lb_agent Close should close shutdown channel (#925) 2018-04-09 10:56:15 -07:00
Tolga Ceylan
34b386b944 fn: lb_agent wait group unused, added to Close (#924) 2018-04-06 18:02:51 -07:00
Tolga Ceylan
1170645af2 fn: lb_agent state trackers only apply to runners (#923)
State trackers do not apply to LB agent.
2018-04-06 17:59:10 -07:00
Tolga Ceylan
584e4e75eb Experimental Pre-fork Pool: Recycle net ns (#890)
* fn: experimental prefork recycle and other improvements

*) Recycle and do not use same pool container again option.
*) Two state processing: initializing versus ready (start-kill).
*) Ready state is exempt from rate limiter.

* fn: experimental prefork pool multiple network support

In order to exceed 1023 container (bridge port) limit, add
multiple networks:

    for i in fn-net1 fn-net2 fn-net3 fn-net4
    do
            docker network create $i
    done

to Docker startup, (eg. dind preentry.sh), then provide this
to prefork pool using:

    export FN_EXPERIMENTAL_PREFORK_NETWORKS="fn-net1 fn-net2 fn-net3 fn-net4"

which should be able to spawn 1023 * 4 containers.

* fn: fixup tests for cfg move

* fn: add ipc and pid namespaces into prefork pooling

* fn: revert ipc and pid namespaces for now

Pid/Ipc opens up the function container to pause container.
2018-04-05 15:07:30 -07:00
Tolga Ceylan
81954bcf53 fn: perform call.End() after request is processed (#918)
* fn: perform call.End() after request is processed

call.End() performs several tasks in sequence; insert call,
insert log, (todo) remove mq entry, fireAfterCall callback, etc.
These currently add up to the request latency as return
from agent.Submit() is blocked on these. We also haven't been
able to apply any timeouts on these operations since they are
handled during request processing and it is hard to come up
with a strategy for it. Also the error cases
(couldn't insert call or log) are not propagated to the caller.

With this change, call.End() handling becomes asynchronous where
we perform these tasks after the request is done. This improves
latency and we no longer have to block the call on these operations.
The changes will also free up the agent slot token more quickly
and now we are no longer tied to hiccups in call.End().

Now, a timeout policy is also added to this which can
be adjusted with an env variable. (default 10 minutes)

This accentuates the fact that call/log/fireAfterCall are not
completed when request is done. So, there's a window there where
call is done, but call/log/fireAfterCall are not yet propagated.
This was already the case especially for error cases.

There's slight risk of accumulating call.End() operations in
case of hiccups in these log/call/callback systems.

* fn: address risk of overstacking of call.End() calls.
2018-04-05 14:42:12 -07:00
Reed Allman
56a2861748 move calls to logstore, implement s3 (#911)
* move calls to logstore, implement s3

closes #482

the basic motivation is that logs and calls will be stored with a very high
write rate, while apps and routes will be relatively infrequently updated; it
follows that we should likely split up their storage location, to back them
with appropriate storage facilities. s3 is a good candidate for ingesting
higher write rate data than a sql database, and will make it easier to manage
that data set. can read #482 for more detailed justification.

summary:

* calls api moved from datastore to logstore
* logstore used in front-end to serve calls endpoints
* agent now throws calls into logstore instead of datastore
* s3 implementation of calls api for logstore
* s3 logs key changed (nobody using / nbd?)
* removed UpdateCall api (not in use)
* moved call tests from datastore to logstore tests
* mock logstore now tested (prev. sqlite3 only)
* logstore tests run against every datastore (mysql, pg; prev. only sqlite3)
* simplify NewMock in tests

commentary:

brunt of the work is implementing the listing of calls in GetCalls for the s3
logstore implementation. the GetCalls API requires returning items in the
newest to oldest order, and the s3 api lists items in lexicographic order
based on created_at. An easy thing to do here seemed to be to reverse the
encoding of our id format to return a lexicographically descending order,
since ids are time based, reasonably encoded to be lexicographically
sortable, and de-duped (unlike created_at). This seems to work pretty well,
it's not perfect around the boundaries of to_time and from_time and a tiny
amount of results may be omitted, but to me this doesn't seem like a deal
breaker to get 6999 results instead of 7000 when trying to get calls between
3:00pm and 4:00pm Monday 3 weeks ago. Of course, without to_time and
from_time, there are no issues in listing results. We could use created at and
encode it, but it would be an additional marker for point lookup (GetCall)
since we would have to search for a created_at stamp, search for ids around
that until we find the matching one, just to do a point lookup. So, the
tradeoff here seems worth it. There is additional optimization around to_time
to seek over newer results (since we have descending order).

The other complication in GetCalls is returning a list of calls for a given
path. Since the keys to do point lookups are only app_id + call_id, and we
need listing across an app as well, this leads us to the 'marker' collection
which is sorted by app_id + path + call_id, to allow quick listing by path.
All in all, it should be pretty straightforward to follow the implementation
and I tried to be lavish with the comments, please let me know if anything
needs further clarification in the code.

The implementation itself has some glaring inefficiencies, but they're
relatively minute: json encoding is kinda lazy, but workable; s3 doesn't offer
batch retrieval, so we point look up each call one by one in get call; not
re-using buffers -- but the seeking around the keys should all be relatively
fast, not too worried about performance really and this isn't a hot path for
reads (need to make a cut point and turn this in!).

Interestingly, in testing, minio performs significantly worse than pg for
storing both logs and calls (or just logs, I tested that too). minio seems to
have really high cpu consumption, but in any event, we won't be using minio,
we'll be using a cloud object store that implements the s3 api. Anyway, mostly
a knock on using minio for high performance, not really anything to do with
this, just thought it was interesting.

I think it's safe to remove UpdateCall, admittedly this made implementing the
s3 api a lot easier. This operation may also be something we never need, it
was unused at present and was only in the cards for a previous hybrid
implementation, which we've now abandoned. If we need, we can always resurrect
from git.

Also not worried about changing the log key, we need to put a prefix on this
thing anyway, but I don't think anybody is using this anyway. in any event, it
simply means old logs won't show up through the API, but aside from nobody
using this yet, that doesn't seem a big deal breaker really -- new logs will
appear fine.

future:

TODO make logstore implementation optional for datastore, check in front-end
at runtime and offer a nil logstore that errors appropriately

TODO low hanging fruit optimizations of json encoding, re-using buffers for
download, get multiple calls at a time, id reverse encoding could be optimized
like normal encoding to not be n^2

TODO api for range removal of logs and calls

* address review comments

* push id to_time magic into id package
* add note about s3 key sizes
* fix validation check
2018-04-05 10:49:25 -07:00
Tolga Ceylan
c58caee78d fn: update minimum docker version required. (#916)
Oracle Linux 7.4 backported versions still having issues
with freezing/terminating containers. 17.10.0-ce seems like
a resonable lowest common denominator.
2018-04-04 16:43:30 -07:00
Reed Allman
3a2550d042 change slots when annotations change (#914)
when the annotations change, we need to get a different slot key to launch new
containers with these annotations and let the old containers die off unused.

I started on a test for this, but changing all combinations of each field in
isolation to change is not very fun without reflection, and there's still a
subset of fields we're managing, so it put us in about the same spot as we are
now.
2018-04-03 11:21:56 -07:00
jan grant
9633cf022b Bugfix: unsafeBytes slices were getting GCed (#913)
There are alternative formulations of this, for instance see
	https://www.reddit.com/r/golang/comments/5zctpf/unsafe_conversion_between_strings_and_byte_slices/

The problem manifested in the returned values from unsafeBytes occasionally
being broken. It's possible that by keeping a reference to the `a` parameter
alive, the original code would still work - however, this definitely seems like
a fix.

(A cast to `[]byte(a)` looks increasingly attractive, for all that it'll
perform small allocations and copies.)
2018-04-03 16:53:02 +01:00
jan grant
88074a42c0 Bugfix/grpc consume eof (#912)
* GRPC streams end with an EOF

The client should ensure that the final packet is followed by a GRPC
EOF. This has the benefit of permitting the client code to clean up resources.

* Don't require an entire HTTP request in RunnerCall

TryExec needs a handle on an incoming ReadCloser containing the body
of a request; however, everything else will already have been extracted
from the HTTP request in the case of lbAgent use.

(The point of this change is to simplify the interface for other uses.)

* Return error from GRPC layer explicitly

As per review
2018-04-03 15:04:21 +01:00