2675 Commits

Author SHA1 Message Date
Dario Domizioli
0a72cb3ef4 Fix missing values in context created with common.BackgroundContext (#950)
* Fix missing values in context when created through common.BackgroundContext

* pin to mysql 5.7.22
2018-04-20 17:29:27 +01:00
CI
07388774db fnserver: 0.3.421 release [skip ci] 0.3.421 2018-04-17 10:55:43 +00:00
Dario Domizioli
d23f4e7b39 Add a way to tweak node capacities in system tests to enable more tests (#869)
* Add a way to tweak node capacities in system tests to enable more tests
* Add test for saturated system
2018-04-17 11:46:17 +01:00
CI
eb6630f22c fnserver: 0.3.420 release [skip ci] 0.3.420 2018-04-13 22:30:51 +00:00
Reed Allman
a481191db2 migratex api uses tx now instead of db (#939)
* migratex api uses tx now instead of db

we want to be able to do external queries outside of the migration itself
inside of the same transaction for version checking. if we don't do this, we
risk the case where we set the version to the latest but we don't run the
table creates at all, so we have a db that thinks it's up to date but doesn't
even have any tables, and on subsequent boots if a migration slides in then
the migrations will run when there are no tables. it was unlikely, but now
it's dead.

* tx friendly table exists check

the previous existence checker for dbs was relying on getting back errors
about the db not existing. if we use this in a tx, it makes the whole tx
invalid for postgres. so, now we have table count queries which return a 1
or a 0 instead of a 1 or an error so that we can check existence inside of a
transaction. voila.
2018-04-13 15:21:54 -07:00
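A minimal sketch of the count-based existence check described in a481191db2, assuming Postgres-style `$1` placeholders and the standard `information_schema.tables` view. The names `countQuery`, `txCount`, and `tableExists` are hypothetical and are not taken from migratex.

```go
package main

import (
	"database/sql"
	"fmt"
)

// tableCountSQL yields 1 or 0 rather than raising an error for a missing
// table, so running it does not abort an open Postgres transaction.
const tableCountSQL = `SELECT count(*) FROM information_schema.tables WHERE table_name = $1`

// countQuery runs a query that returns a single integer.
// Hypothetical abstraction so the check can be exercised without a database.
type countQuery func(query string, args ...any) (int, error)

// txCount adapts a *sql.Tx to countQuery, keeping the check inside the
// same transaction as the migration itself.
func txCount(tx *sql.Tx) countQuery {
	return func(query string, args ...any) (int, error) {
		var n int
		err := tx.QueryRow(query, args...).Scan(&n)
		return n, err
	}
}

// tableExists reports whether the named table is present.
func tableExists(count countQuery, table string) (bool, error) {
	n, err := count(tableCountSQL, table)
	if err != nil {
		return false, err
	}
	return n > 0, nil
}

func main() {
	// Stand-in for a real transaction: pretends only "apps" exists.
	fake := func(_ string, args ...any) (int, error) {
		if args[0] == "apps" {
			return 1, nil
		}
		return 0, nil
	}
	for _, table := range []string{"apps", "routes"} {
		ok, err := tableExists(fake, table)
		fmt.Println(table, ok, err)
	}
}
```

With a real database, `tableExists(txCount(tx), "apps")` would run the same query on the open transaction.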
CI
312fd8ec12 fnserver: 0.3.419 release [skip ci] 0.3.419 2018-04-13 18:31:10 +00:00
Tolga Ceylan
c0ee3ce736 fn: locked mutex while blocked on I/O considered harmful (#935)
* fn: mutex while waiting I/O considered harmful

*) Removed cases of holding a mutex while waiting on I/O; these
included possible disk I/O and network I/O.

*) Error/Context Close/Shutdown semantics changed since
the context timeout and comments were misleading. Close
always waits for pending gRPC session to complete.
Context usage here was merely 'wait up to x secs to
report an error' which only logs the error anyway.
Instead, the runner can log the error. And context
still can be passed around perhaps for future opencensus
instrumentation.
2018-04-13 11:23:29 -07:00
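The general pattern behind c0ee3ce736 can be sketched as follows: copy the shared state while holding the lock, release the lock, and only then perform the slow write. This is an illustration of the technique, not the fn code; `registry`, `report`, and `memWriter` are hypothetical names.

```go
package main

import (
	"fmt"
	"io"
	"sync"
)

// registry guards a list of addresses with a mutex.
type registry struct {
	mu    sync.Mutex
	addrs []string
}

// add appends an address under the lock; no I/O happens here.
func (r *registry) add(addr string) {
	r.mu.Lock()
	r.addrs = append(r.addrs, addr)
	r.mu.Unlock()
}

// report writes every address to w. The lock is held only long enough to
// copy the slice, so a slow writer cannot block concurrent add calls.
func (r *registry) report(w io.Writer) error {
	r.mu.Lock()
	snapshot := append([]string(nil), r.addrs...)
	r.mu.Unlock()

	for _, addr := range snapshot {
		if _, err := io.WriteString(w, addr+"\n"); err != nil {
			return err
		}
	}
	return nil
}

// memWriter is an in-memory io.Writer standing in for a network or disk sink.
type memWriter struct{ data []byte }

func (m *memWriter) Write(p []byte) (int, error) {
	m.data = append(m.data, p...)
	return len(p), nil
}

func main() {
	r := &registry{}
	r.add("runner-1:9190")
	r.add("runner-2:9190")

	w := &memWriter{}
	if err := r.report(w); err != nil {
		fmt.Println("report failed:", err)
		return
	}
	fmt.Print(string(w.data))
}
```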
CI
8a35c9876a fnserver: 0.3.418 release [skip ci] 0.3.418 2018-04-13 17:43:52 +00:00
Tolga Ceylan
00bb4d1257 fn: empty body tests for cold and hot (json/http) (#941) 2018-04-13 10:35:57 -07:00
CI
38c332f5f4 fnserver: 0.3.417 release [skip ci] 0.3.417 2018-04-12 23:29:13 +00:00
Tolga Ceylan
623aeb35b2 fn: common.WaitGroup improvements (#940)
* fn: common.WaitGroup improvements

*) Split the API into AddSession/DoneSession
*) Only wake up listeners when session count reaches zero.

* fn: WaitGroup go-routine blast test

* fn: test fix and rebase fixup
2018-04-12 16:21:13 -07:00
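A rough sketch of a session-counting wait group in the spirit of 623aeb35b2. Only the names AddSession and DoneSession come from the commit message; the type `sessionGroup`, the method `CloseAndWait`, and all signatures are assumptions and may differ from common.WaitGroup in fn.

```go
package main

import (
	"fmt"
	"sync"
)

// sessionGroup counts in-flight sessions and supports a graceful shutdown.
type sessionGroup struct {
	mu     sync.Mutex
	cond   *sync.Cond
	count  int
	closed bool
}

func newSessionGroup() *sessionGroup {
	g := &sessionGroup{}
	g.cond = sync.NewCond(&g.mu)
	return g
}

// AddSession registers delta new sessions. It returns false once shutdown
// has started, so callers can refuse new work instead of racing with Wait.
func (g *sessionGroup) AddSession(delta int) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.closed {
		return false
	}
	g.count += delta
	return true
}

// DoneSession marks one session as finished. Waiters are woken only when
// the session count reaches zero.
func (g *sessionGroup) DoneSession() {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.count == 0 {
		panic("DoneSession called without a matching AddSession")
	}
	g.count--
	if g.count == 0 {
		g.cond.Broadcast()
	}
}

// CloseAndWait rejects new sessions and blocks until existing ones finish.
func (g *sessionGroup) CloseAndWait() {
	g.mu.Lock()
	g.closed = true
	for g.count > 0 {
		g.cond.Wait()
	}
	g.mu.Unlock()
}

func main() {
	g := newSessionGroup()
	for i := 0; i < 3; i++ {
		if !g.AddSession(1) {
			continue
		}
		go func(id int) {
			defer g.DoneSession()
			fmt.Println("session", id, "finished")
		}(i)
	}
	g.CloseAndWait()
	fmt.Println("accepting new sessions:", g.AddSession(1))
}
```

Unlike sync.WaitGroup, adding after shutdown has begun is rejected rather than being a misuse the caller has to avoid.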
CI
e7dd095b92 fnserver: 0.3.416 release [skip ci] 0.3.416 2018-04-12 22:59:31 +00:00
Tolga Ceylan
e47d55056a fn: reduce lbagent and agent dependency (#938)
* fn: reduce lbagent and agent dependency

lbagent and agent code are too tightly coupled. This causes
any change in agent to break lbagent. In reality, for
LB there should be no delegated agent. Splitting these
two will cause some code duplication, but it reduces
dependency and complexity (eg. agent without docker)

* fn: post rebase fixup

* fn: runner/runnercall should use lbDeadline

* fn: fixup lb agent test

* fn: remove agent create option for common.WaitGroup
2018-04-12 15:51:58 -07:00
Tolga Ceylan
e53d23afc9 fn: sync.WaitGroup replacement common.WaitGroup (#937)
* fn: sync.WaitGroup replacement common.WaitGroup

agent/lb_agent/pure_runner have been using sync.WaitGroup
semantics incorrectly. Switching these components to
use the new common.WaitGroup(), which provides some handy
functionality for common graceful shutdown cases.

From https://golang.org/pkg/sync/#WaitGroup,
    "Note that calls with a positive delta that occur when the counter
     is zero must happen before a Wait. Calls with a negative delta,
     or calls with a positive delta that start when the counter is
     greater than zero, may happen at any time. Typically this means
     the calls to Add should execute before the statement creating
     the goroutine or other event to be waited for. If a WaitGroup
     is reused to wait for several independent sets of events,
     new Add calls must happen after all previous Wait calls have
     returned."

HandleCallEnd introduces some complexity to the shutdowns, but this
is currently handled by calling AddSession(2) initially and letting
HandleCallEnd() decide when to decrement by 1, in addition to the
decrement by 1 in Submit().

lb_agent shutdown sequence and particularly timeouts with runner pool
needs another look/revision, but this is outside of the scope of this
commit.

* fn: lb-agent wg share

* fn: no need to +2 in Submit with defer.

Removed defer since handleCallEnd already has
this responsibility.
2018-04-12 11:33:01 -07:00
CI
f350b2ca48 fnserver: 0.3.415 release [skip ci] 0.3.415 2018-04-11 23:51:05 +00:00
Tolga Ceylan
fc0d3d49d2 fn: introducing WaitGroup suitable for shutdown/session mgmt (#936)
* fn: introducing WaitGroup suitable for shutdown/session mgmt
2018-04-11 16:42:59 -07:00
Tolga Ceylan
9b86e3626e fn: avoid go-routine leak (#934) 2018-04-11 12:11:08 -07:00
CI
b5a732ff1b fnserver: 0.3.414 release [skip ci] 0.3.414 2018-04-11 12:34:42 +00:00
jan grant
2387d070bf Fix docker login syntax (#933)
With docker 18.04 the behaviour of a documented interface has changed from 18.03 -
to wit, you need to use a specific noninteractive mode of `docker login` to avoid
being prompted about insecure credential storage.
2018-04-11 13:25:37 +01:00
Tolga Ceylan
600dce5b2c fn: lb_agent_test tweak (#931)
Time.Sleep() blocking fixes in placers (naive and ch) improve
some of the timing in processing, therefore reducing the max
calls settings in mock runner pool.
2018-04-10 18:28:47 -07:00
Tolga Ceylan
e7658db822 Move ch ring placement back from old FnLB. (#930)
* fn: bring back CH ring placer into FN repo based on original FnLB
* fn: move placement code into runnerpool directory
2018-04-10 17:26:24 -07:00
Tolga Ceylan
95d7379a81 fn: pure runner concurrency fixes (#928)
* fn: experimental pure_runner rewrite

* fn: minor refactor

* fn: added comments and moved pipe initialization to NewCallHandle

* fn: EOF gRPC and EOF DataFrame handling
2018-04-10 15:50:32 -07:00
CI
bd53ee9a2d fnserver: 0.3.413 release [skip ci] 0.3.413 2018-04-10 20:36:00 +00:00
Tolga Ceylan
ee262901a2 fn: handleCallEnd and submit improvements (#919)
* fn: move call error/end handling to handleCallEnd

This simplifies submit() function but moves the burden
of retriable-versus-committed request handling and slot.Close()
responsibility to handleCallEnd().
2018-04-10 10:48:12 -07:00
CI
f705fc8d8f fnserver: 0.3.412 release [skip ci] 0.3.412 2018-04-09 18:25:41 +00:00
Tolga Ceylan
e36e25150c fn: api and systems tests port cleanup (#926)
*) removed unused cancel in api-test harness/server
*) removed hard coded port in getServerWithCancel along
with faulty health check code.
*) in SetupHarness() fixed code that skipped server start.
2018-04-09 11:16:08 -07:00
CI
e5952a7843 fnserver: 0.3.411 release [skip ci] 0.3.411 2018-04-09 18:05:42 +00:00
Tolga Ceylan
c1f0707b60 fn: lb_agent Close should close shutdown channel (#925) 2018-04-09 10:56:15 -07:00
Tolga Ceylan
dc6a3305eb fn: datastore dial retry should be configurable (#927)
* fn: datastore dial retry should be configurable

Setting this to high (60) in api and system tests.

* fn: netcat connect is not meaningful in wait for DB.
2018-04-09 10:47:00 -07:00
CI
b6caf50f7d fnserver: 0.3.410 release [skip ci] 0.3.410 2018-04-07 22:55:15 +00:00
Tolga Ceylan
34b386b944 fn: lb_agent wait group unused, added to Close (#924) 2018-04-06 18:02:51 -07:00
Tolga Ceylan
1170645af2 fn: lb_agent state trackers only apply to runners (#923)
State trackers do not apply to LB agent.
2018-04-06 17:59:10 -07:00
Chad Arimura
d008d11c94 Update README.md (#921) 2018-04-06 14:38:02 -04:00
CI
d7188c792a fnserver: 0.3.409 release [skip ci] 0.3.409 2018-04-05 22:15:51 +00:00
Tolga Ceylan
584e4e75eb Experimental Pre-fork Pool: Recycle net ns (#890)
* fn: experimental prefork recycle and other improvements

*) Recycle and do not use same pool container again option.
*) Two state processing: initializing versus ready (start-kill).
*) Ready state is exempt from rate limiter.

* fn: experimental prefork pool multiple network support

In order to exceed 1023 container (bridge port) limit, add
multiple networks:

    for i in fn-net1 fn-net2 fn-net3 fn-net4
    do
            docker network create $i
    done

to Docker startup, (eg. dind preentry.sh), then provide this
to prefork pool using:

    export FN_EXPERIMENTAL_PREFORK_NETWORKS="fn-net1 fn-net2 fn-net3 fn-net4"

which should be able to spawn 1023 * 4 containers.

* fn: fixup tests for cfg move

* fn: add ipc and pid namespaces into prefork pooling

* fn: revert ipc and pid namespaces for now

Pid/Ipc opens up the function container to pause container.
2018-04-05 15:07:30 -07:00
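The capacity arithmetic from 584e4e75eb (1023 bridge ports per network, multiplied by the number of networks) can be sketched as below. The whitespace splitting of the FN_EXPERIMENTAL_PREFORK_NETWORKS value is an assumption based on the example in the commit message, and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// bridgePortLimit is the per-network container limit quoted in the commit message.
const bridgePortLimit = 1023

// preforkNetworks splits a space-separated network list such as
// "fn-net1 fn-net2" into individual network names.
func preforkNetworks(value string) []string {
	return strings.Fields(value)
}

// maxContainers is the upper bound on containers across all networks.
func maxContainers(networks []string) int {
	return len(networks) * bridgePortLimit
}

func main() {
	nets := preforkNetworks(os.Getenv("FN_EXPERIMENTAL_PREFORK_NETWORKS"))
	fmt.Printf("%d networks, up to %d containers\n", len(nets), maxContainers(nets))
}
```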
CI
629559ecc8 fnserver: 0.3.408 release [skip ci] 0.3.408 2018-04-05 21:51:59 +00:00
Tolga Ceylan
81954bcf53 fn: perform call.End() after request is processed (#918)
* fn: perform call.End() after request is processed

call.End() performs several tasks in sequence; insert call,
insert log, (todo) remove mq entry, fireAfterCall callback, etc.
These currently add up to the request latency as return
from agent.Submit() is blocked on these. We also haven't been
able to apply any timeouts on these operations since they are
handled during request processing and it is hard to come up
with a strategy for it. Also the error cases
(couldn't insert call or log) are not propagated to the caller.

With this change, call.End() handling becomes asynchronous where
we perform these tasks after the request is done. This improves
latency and we no longer have to block the call on these operations.
The changes will also free up the agent slot token more quickly
and now we are no longer tied to hiccups in call.End().

Now, a timeout policy is also added to this which can
be adjusted with an env variable. (default 10 minutes)

This accentuates the fact that call/log/fireAfterCall are not
completed when request is done. So, there's a window there where
call is done, but call/log/fireAfterCall are not yet propagated.
This was already the case especially for error cases.

There's a slight risk of accumulating call.End() operations in
case of hiccups in these log/call/callback systems.

* fn: address risk of overstacking of call.End() calls.
2018-04-05 14:42:12 -07:00
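A small sketch of the idea in 81954bcf53: run the end-of-call bookkeeping in its own goroutine, bounded by a timeout, so the caller is not blocked on it. The function names are hypothetical, the timeouts are shortened for illustration (the commit mentions a 10 minute default), and the sketch does not limit how many such goroutines may pile up.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// endAsync runs end in the background with its own deadline and delivers
// the result on done. The caller returns immediately.
func endAsync(timeout time.Duration, end func(context.Context) error, done chan<- error) {
	go func() {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		defer cancel()
		done <- end(ctx)
	}()
}

// quickEnd simulates bookkeeping (insert call, insert log) that completes at once.
func quickEnd(ctx context.Context) error {
	return ctx.Err()
}

// stalledEnd simulates a store that never answers; it gives up when the
// context deadline expires.
func stalledEnd(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	done := make(chan error, 2)
	endAsync(time.Second, quickEnd, done)
	endAsync(10*time.Millisecond, stalledEnd, done)

	for i := 0; i < 2; i++ {
		err := <-done
		fmt.Println("timed out:", errors.Is(err, context.DeadlineExceeded))
	}
}
```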
CI
82bf532fa7 fnserver: 0.3.407 release [skip ci] 0.3.407 2018-04-05 17:56:56 +00:00
Reed Allman
56a2861748 move calls to logstore, implement s3 (#911)
* move calls to logstore, implement s3

closes #482

the basic motivation is that logs and calls will be stored with a very high
write rate, while apps and routes will be relatively infrequently updated; it
follows that we should likely split up their storage location, to back them
with appropriate storage facilities. s3 is a good candidate for ingesting
higher write rate data than a sql database, and will make it easier to manage
that data set. can read #482 for more detailed justification.

summary:

* calls api moved from datastore to logstore
* logstore used in front-end to serve calls endpoints
* agent now throws calls into logstore instead of datastore
* s3 implementation of calls api for logstore
* s3 logs key changed (nobody using / nbd?)
* removed UpdateCall api (not in use)
* moved call tests from datastore to logstore tests
* mock logstore now tested (prev. sqlite3 only)
* logstore tests run against every datastore (mysql, pg; prev. only sqlite3)
* simplify NewMock in tests

commentary:

brunt of the work is implementing the listing of calls in GetCalls for the s3
logstore implementation. the GetCalls API requires returning items in the
newest to oldest order, and the s3 api lists items in lexicographic order
based on created_at. An easy thing to do here seemed to be to reverse the
encoding of our id format to return a lexicographically descending order,
since ids are time based, reasonably encoded to be lexicographically
sortable, and de-duped (unlike created_at). This seems to work pretty well,
it's not perfect around the boundaries of to_time and from_time and a tiny
amount of results may be omitted, but to me this doesn't seem like a deal
breaker to get 6999 results instead of 7000 when trying to get calls between
3:00pm and 4:00pm Monday 3 weeks ago. Of course, without to_time and
from_time, there are no issues in listing results. We could use created at and
encode it, but it would be an additional marker for point lookup (GetCall)
since we would have to search for a created_at stamp, search for ids around
that until we find the matching one, just to do a point lookup. So, the
tradeoff here seems worth it. There is additional optimization around to_time
to seek over newer results (since we have descending order).

The other complication in GetCalls is returning a list of calls for a given
path. Since the keys to do point lookups are only app_id + call_id, and we
need listing across an app as well, this leads us to the 'marker' collection
which is sorted by app_id + path + call_id, to allow quick listing by path.
All in all, it should be pretty straightforward to follow the implementation
and I tried to be lavish with the comments, please let me know if anything
needs further clarification in the code.

The implementation itself has some glaring inefficiencies, but they're
relatively minute: json encoding is kinda lazy, but workable; s3 doesn't offer
batch retrieval, so we point look up each call one by one in get call; not
re-using buffers -- but the seeking around the keys should all be relatively
fast, not too worried about performance really and this isn't a hot path for
reads (need to make a cut point and turn this in!).

Interestingly, in testing, minio performs significantly worse than pg for
storing both logs and calls (or just logs, I tested that too). minio seems to
have really high cpu consumption, but in any event, we won't be using minio,
we'll be using a cloud object store that implements the s3 api. Anyway, mostly
a knock on using minio for high performance, not really anything to do with
this, just thought it was interesting.

I think it's safe to remove UpdateCall, admittedly this made implementing the
s3 api a lot easier. This operation may also be something we never need, it
was unused at present and was only in the cards for a previous hybrid
implementation, which we've now abandoned. If we need, we can always resurrect
from git.

Also not worried about changing the log key, we need to put a prefix on this
thing anyway, but I don't think anybody is using this anyway. in any event, it
simply means old logs won't show up through the API, but aside from nobody
using this yet, that doesn't seem a big deal breaker really -- new logs will
appear fine.

future:

TODO make logstore implementation optional for datastore, check in front-end
at runtime and offer a nil logstore that errors appropriately

TODO low hanging fruit optimizations of json encoding, re-using buffers for
download, get multiple calls at a time, id reverse encoding could be optimized
like normal encoding to not be n^2

TODO api for range removal of logs and calls

* address review comments

* push id to_time magic into id package
* add note about s3 key sizes
* fix validation check
2018-04-05 10:49:25 -07:00
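The reversed id encoding discussed in 56a2861748 can be illustrated with a character-complement mapping: each character is replaced by its mirror in a sorted alphabet, so ids that sort ascending by time produce keys that sort descending. This is a sketch under assumptions, not the fn id package; the Crockford base32 alphabet and the name `reverseID` are chosen for illustration, and ids are assumed to have equal length.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// alphabet is in ascending byte order, so position in the string matches
// lexicographic rank. Assumed here; the real id alphabet may differ.
const alphabet = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

// reverseID maps every character to its mirror in the alphabet. For ids of
// equal length this inverts the sort order, and applying it twice returns
// the original id.
func reverseID(id string) (string, error) {
	out := make([]byte, len(id))
	for i := 0; i < len(id); i++ {
		idx := strings.IndexByte(alphabet, id[i])
		if idx < 0 {
			return "", fmt.Errorf("invalid id character %q", id[i])
		}
		out[i] = alphabet[len(alphabet)-1-idx]
	}
	return string(out), nil
}

func main() {
	ids := []string{"01C9", "01CA", "01CB"} // oldest to newest
	keys := make([]string, 0, len(ids))
	for _, id := range ids {
		key, err := reverseID(id)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		keys = append(keys, key)
	}
	// An ascending listing of the keys, as an object store would return
	// them, now visits the newest id first.
	sort.Strings(keys)
	for _, key := range keys {
		id, _ := reverseID(key)
		fmt.Println(key, "->", id)
	}
}
```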
CI
6ca002973d fnserver: 0.3.406 release [skip ci] 0.3.406 2018-04-04 23:51:02 +00:00
Tolga Ceylan
c58caee78d fn: update minimum docker version required. (#916)
Oracle Linux 7.4 backported versions still have issues
with freezing/terminating containers. 17.10.0-ce seems like
a reasonable lowest common denominator.
2018-04-04 16:43:30 -07:00
Reed Allman
f635359ff8 loosen perms on bolt mq (#910)
the bolt mq file should only be used for local dev and isn't necessarily
sensitive, don't think 0655 restriction was necessary and the data isn't
likely all that sensitive anyway.

see https://github.com/fnproject/fn/issues/404#issuecomment-377570626
2018-04-04 15:41:37 -07:00
CI
c96f4c5a5d fnserver: 0.3.405 release [skip ci] 0.3.405 2018-04-03 18:31:37 +00:00
Reed Allman
3a2550d042 change slots when annotations change (#914)
when the annotations change, we need to get a different slot key to launch new
containers with these annotations and let the old containers die off unused.

I started on a test for this, but changing all combinations of each field in
isolation to change is not very fun without reflection, and there's still a
subset of fields we're managing, so it put us in about the same spot as we are
now.
2018-04-03 11:21:56 -07:00
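One way to picture the change in 3a2550d042 is a slot key derived from a hash over the call fields, including annotations in a stable order, so that any annotation change yields a different key. The struct fields, the `slotKey` name, and the use of SHA-256 are all assumptions for illustration and may not match the agent code.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strconv"
)

// call holds a hypothetical subset of the fields that influence which
// container can serve a request.
type call struct {
	AppID       string
	Path        string
	Image       string
	Memory      uint64
	Annotations map[string]string
}

// slotKey hashes every field. Annotation keys are sorted first because Go
// map iteration order is random and the key must be deterministic.
func slotKey(c call) string {
	h := sha256.New()
	write := func(s string) {
		h.Write([]byte(s))
		h.Write([]byte{0}) // separator so "ab"+"c" differs from "a"+"bc"
	}
	write(c.AppID)
	write(c.Path)
	write(c.Image)
	write(strconv.FormatUint(c.Memory, 10))

	keys := make([]string, 0, len(c.Annotations))
	for k := range c.Annotations {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		write(k)
		write(c.Annotations[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	before := call{AppID: "app1", Path: "/hello", Image: "hello:1", Memory: 128,
		Annotations: map[string]string{"team": "a"}}
	after := before
	after.Annotations = map[string]string{"team": "b"}

	fmt.Println("key changed:", slotKey(before) != slotKey(after))
}
```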
CI
1b0a9e2276 fnserver: 0.3.404 release [skip ci] 0.3.404 2018-04-03 16:00:55 +00:00
jan grant
9633cf022b Bugfix: unsafeBytes slices were getting GCed (#913)
There are alternative formulations of this, for instance see
	https://www.reddit.com/r/golang/comments/5zctpf/unsafe_conversion_between_strings_and_byte_slices/

The problem manifested in the returned values from unsafeBytes occasionally
being broken. It's possible that by keeping a reference to the `a` parameter
alive, the original code would still work - however, this definitely seems like
a fix.

(A cast to `[]byte(a)` looks increasingly attractive, for all that it'll
perform small allocations and copies.)
2018-04-03 16:53:02 +01:00
CI
28edd1779f fnserver: 0.3.403 release [skip ci] 0.3.403 2018-04-03 14:12:57 +00:00
jan grant
88074a42c0 Bugfix/grpc consume eof (#912)
* GRPC streams end with an EOF

The client should ensure that the final packet is followed by a GRPC
EOF. This has the benefit of permitting the client code to clean up resources.

* Don't require an entire HTTP request in RunnerCall

TryExec needs a handle on an incoming ReadCloser containing the body
of a request; however, everything else will already have been extracted
from the HTTP request in the case of lbAgent use.

(The point of this change is to simplify the interface for other uses.)

* Return error from GRPC layer explicitly

As per review
2018-04-03 15:04:21 +01:00
CI
3c536d1e01 fnserver: 0.3.402 release [skip ci] 0.3.402 2018-03-30 16:50:16 +00:00
Andrea Rosa
72a2eb933f Returning Agent on exported func for pureRunner (#905)
pureRunner is an unexported struct, but it was used as the return value of a
few exported methods. In this change we return Agent, which is the
interface implemented by pureRunner, to avoid leaking an unexported type.
2018-03-30 09:15:55 -07:00
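The pattern in 72a2eb933f, returning an exported interface from a constructor instead of an unexported concrete type, looks roughly like this. The `Agent` interface below is a simplified, hypothetical stand-in with two methods; the real interface in fn has a different method set.

```go
package main

import (
	"errors"
	"fmt"
)

// Agent is the exported interface that callers program against.
type Agent interface {
	Submit(call string) error
	Close() error
}

// pureRunner is unexported; code outside the package cannot name this type.
type pureRunner struct {
	closed bool
}

func (p *pureRunner) Submit(call string) error {
	if p.closed {
		return errors.New("agent is closed")
	}
	fmt.Println("running", call)
	return nil
}

func (p *pureRunner) Close() error {
	p.closed = true
	return nil
}

// NewPureRunner returns the interface rather than *pureRunner, so the
// exported signature never mentions an unexported type.
func NewPureRunner() Agent {
	return &pureRunner{}
}

func main() {
	agent := NewPureRunner()
	fmt.Println("submit:", agent.Submit("hello"))
	fmt.Println("close:", agent.Close())
	fmt.Println("submit after close:", agent.Submit("again"))
}
```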