We increase available_storage_gib in the vm_host update query, but we delete
the VM's storage volumes before this query runs, so `vm.storage_size_gib`
returns 0.
We also need to fix the value for production hosts.
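As a rough sketch of the direction (the exact query is illustrative, not the
code in this commit), the size has to be read while the volumes still exist:
```ruby
# Illustrative only: capture the size before deleting the volumes, then
# return it to the host in the update.
storage_to_return = vm.storage_size_gib   # still non-zero at this point
vm.vm_storage_volumes_dataset.destroy
VmHost.dataset.where(id: vm.vm_host_id).update(
  available_storage_gib: Sequel[:available_storage_gib] + storage_to_return
)
```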
Recently we realized that when we added an additional IPv6 block (by mistake),
VmHost.create_addresses did not create the IPv6 block entity in the database.
Looking into pull_ips, I realized that IPv6 blocks come with active_server_ip
set to the server's main IPv6 address, not its IPv4 address. Since the
find_matching_ips function filters IP addresses by the host address in
sshable, this was expected. It's harmless for now, but we should tackle it
once we start implementing support for extra IPv6 blocks. That's why a
comment is added as guidance for the future.
With this commit, we introduce a mechanism to replace the encryption
keys of existing ipsec tunnels. Here is a detailed explanation of the
work:
1. The private subnet nexus hops to the refresh_keys state when needed.
This is a preparation phase: we produce new encryption keys, SPIs and
reqids for the new state and policy objects, and write them to the nic
entities. So, for every nic in the system, we prepare the new
parameters at the private subnet level.
2. Trigger NicNexus to start rekeying.
3. NicNexus buds RekeyNicTunnel to create the SA objects for inbound
packets. This part is important; for a detailed explanation of how the
tunnelling currently works, see the "Ipsec Tunneling" section at the
end of the commit message.
4. In the setup_inbound state, the RekeyNic prog first finds all the
state objects that need to be created at the destination end. This way,
no matter when the source policy/state is updated, the receiving end
will already have the encryption key.
5. Once everyone has created their receiving-end state objects, we
switch to updating the sender (outbound) side. This ping-pong is
managed by SubnetNexus, which checks the state of the NicNexus strands
and uses semaphores to progress the state machine. From this point on,
any newly created packet is encrypted with the new keys.
6. Once everyone in the mesh has created/updated their sender policies
and state objects, we drop the state objects that are no longer
referenced by any policy.
Ipsec Tunneling
Each tunnel has two ends, <source_nic> and <destination_nic>. For a
tunnel to work, both ends need to know the encryption key so that they
can encrypt or decrypt packets. However, the encryption key is not sent
along with the packet, so the key must exist locally on both systems.
How does this work? First, let me describe the two objects we use to
implement ipsec tunnels.
1. Policy:
Policies are used to capture packets and apply encryption. The way it
works is simple: all packets are observed, and if a packet matches one
of the policies we created, we take the reqid from the policy. The
reqid is the unique identifier of a state object that holds the
encryption key. So, when a policy matches, we get the reqid from the
policy, find the matching state object, encrypt the packet with that
key, and hand it back to the networking stack.
2. State:
States store the encryption key. They also have fields such as spi and
reqid. As mentioned in the policy part, for outgoing connections the
reqid is used to identify the encryption key. For incoming connections
it is the spi, because the spi is carried in the packet header: when
the receiving end gets a packet and finds a matching policy, it reads
the spi from the header and uses it to identify the encryption key.
This gives us the flexibility of using multiple state objects for one
tunnel, which is also handy for the rekeying process.
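To make the policy/state split concrete, the sketch below shows roughly what
the two objects look like when created with ip xfrm; all addresses, keys and
identifiers are placeholders, not the exact commands our progs run.
```ruby
# Placeholder endpoints and traffic selectors, for illustration only.
src_ip,  dst_ip  = "fd00:1::1", "fd00:2::1"
src_net, dst_net = "10.0.1.0/24", "10.0.2.0/24"
spi, reqid, key  = "0x12345678", "0x61c4", "0x" + "ab" * 36

# State: holds the key, addressed by spi (inbound) and reqid (outbound).
state_cmd = "ip xfrm state add src #{src_ip} dst #{dst_ip} proto esp " \
            "spi #{spi} reqid #{reqid} mode tunnel " \
            "aead 'rfc4106(gcm(aes))' #{key} 128"

# Policy: no key, only a template whose reqid points at the state above.
policy_cmd = "ip xfrm policy add src #{src_net} dst #{dst_net} dir out " \
             "tmpl src #{src_ip} dst #{dst_ip} proto esp " \
             "reqid #{reqid} mode tunnel"
```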
In development, we can now reset the host server programmatically. For this
to work properly, make sure all VMs are destroyed first. Call the
VmHost.reset function and then set the strand label to "start"; this gives
you a fresh host.
To decide whether to show a sidebar item to the user, we need to check
the user's permissions on the project. We used to check them one by one,
which runs a query for each check.
I made the `actions` argument optional in our authorization query. If it's
not provided, the query returns all matched policies the user has on the
given object, so we can fetch this list once and check permissions against
it.
We also don't show create buttons if the user doesn't have create permission
for Vm or Private Subnet.
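A rough sketch of the pattern (helper and action names are illustrative, not
the exact ones in the codebase):
```ruby
# Hypothetical helper names; the point is a single query without `actions`,
# whose result is reused for every check on the page.
policies  = Authorization.matched_policies(current_user.id, project.id)
permitted = policies.flat_map { |policy| policy[:actions] }

show_vm_create     = permitted.include?("Vm:create")
show_subnet_create = permitted.include?("PrivateSubnet:create")
```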
Fixes #436
Readme edits based on feedback.
- Include link to managed platform
- Update steps to use the open and free version
- Update benefits section and include use-cases
- Refresh list of available services
- Add FAQ
Different operating systems have different conventions. Below are reasonable
restrictions that work for most (all?) systems.
- Max length 32
- Only lowercase letters, numbers, hyphens and underscores
- Must not start with a hyphen or a number
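A possible validation sketch for these rules (the exact pattern in the code
may differ):
```ruby
def valid_name?(name)
  # 1-32 chars; lowercase letters, digits, hyphens, underscores;
  # first character must not be a hyphen or a digit.
  !!(name =~ /\A[a-z_][a-z0-9_-]{0,31}\z/)
end

valid_name?("my-vm-01")  # => true
valid_name?("1st-vm")    # => false (starts with a number)
valid_name?("-vm")       # => false (starts with a hyphen)
```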
We introduce a new state to verify that the VM is up and running and that its
ssh agent is up. We perform a quick, doomed-to-fail ssh command and simply
check the output. If it returns "Host key verification failed.", the ssh
agent is also up and responsive.
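Roughly, the check looks like the sketch below (the command flags and the
address are illustrative, not the actual prog code):
```ruby
vm_address = "fdaa::2"  # placeholder for the VM's address

# A throwaway ssh with strict host key checking cannot succeed, but if the
# VM's ssh side answers with a host key, ssh exits with this known message.
cmd = "ssh -o StrictHostKeyChecking=yes -o BatchMode=yes -o ConnectTimeout=5 ubi@#{vm_address} -- true 2>&1"
out = `#{cmd}`
ssh_up = out.include?("Host key verification failed.")
```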
This PR doesn't simply set display_state to "deleting" at the "destroy"
label, because our operations are async. When the user clicks the "delete"
button, the web service increments the "destroy" semaphore, and respirate
moves the nexus to the destroy label. Until respirate fetches and runs the
strand, the state isn't changed. In addition, destroying a VM is quick, so
the VM has probably already disappeared by the time the user reloads the
page.
To fix this delayed display_state issue, I override the method that decides
the display state for some special semaphores. In this PR, if the "destroy"
semaphore is increased, vm.display_state returns "deleting" immediately.
I added a `one_to_many :semaphores` association to the model to allow eager
loading. Otherwise, fetching semaphores separately for each VM on the VM
listing page causes an N+1 query issue.
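An illustrative sketch of both pieces (the association key and the exact
check are assumptions, not the literal model code):
```ruby
class Vm < Sequel::Model
  # Allows Vm.eager(:semaphores) on the listing page, avoiding N+1 queries.
  one_to_many :semaphores, key: :strand_id

  # Report "deleting" as soon as the destroy semaphore exists, without
  # waiting for respirate to run the strand.
  def display_state
    return "deleting" if semaphores.any? { |s| s.name == "destroy" }
    super
  end
end

# Listing page usage:
vms = Vm.eager(:semaphores).all
```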
Instead of all this mumbo jumbo, I could have just updated display_state to
"deleting" at the place where we increase the destroy semaphore, which is
the route file. But updating the VM model outside of its nexus felt like an
antipattern, so I didn't do that.
Fixes #438
Users need to refresh the page manually to see whether VM creation has
completed.
I added auto-refresh to the VM creation and VM listing pages. They reload
every 10 seconds.
Fixes #437
The `freeze` override is in CloverBase, so CloverWeb and CloverApi have this
new version, but we freeze Clover, not the nested ones. I moved this version
of freeze to Clover.
We should also not call `DB.freeze` in tests, because it blocks
CloverWeb/CloverApi from enabling new plugins that use new pg extensions.
We need to decide dynamically which private subnets to show depending on the
selected location, so it requires JavaScript code. We need to pass the list
of subnets for that location to the frontend code. There are two ways: pass
it statically when the page is rendered, or fetch it dynamically via an AJAX
HTTP request. The latter requires an extra backend endpoint to return the
subnets. I implemented the first option. If the number of subnets per
location grows too large, we can consider fetching them dynamically for the
selected location.
Safari doesn't allow hiding select options, so we disable them instead.
SPI is not really necessary in policies. For outgoing connections, once the
template is matched, the correct SA is identified locally via reqid. For
incoming connections, if the template matches, the spi is already in the
packet header and is used to match the correct SA. Therefore, we do not need
the spi in the policy.
We also set the reqid to 0 for the destination. That way, the incoming
packet can be matched flexibly with the correct SA; reqid 0 means no reqid
preference.
When we implement rekeying, this change will be handy since we won't have to
deal with destination policies separately.
With this commit, we start storing the encryption key in the database in
encrypted form instead of plain text. We also start clearing the encryption
key from the database at the end of the mesh refresh.
Without a proc, the default value is shared across all instances of Strand,
and when one of the callers of Strand.new updates the value of stack, it
also modifies the default value.
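A minimal illustration of the difference in plain Ruby (not the actual
Strand definition):
```ruby
# Without a proc: one shared array is handed out as the default.
shared_default = [{}]
stack = shared_default
stack.first[:subject_id] = "x"   # a caller mutates "its" stack...
shared_default                   # => [{:subject_id=>"x"}]  ...and the default changed

# With a proc: every call builds a fresh object.
fresh_default = proc { [{}] }
fresh_default.call.equal?(fresh_default.call)  # => false, distinct arrays
```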
CloverBase is required twice in the CloverWeb/CloverApi files, probably by
mistake. We could also move it to the loader, but the loader expects the
module name and filename to match, so I renamed the file to `clover_base.rb`.
I made the location-based pricing helper in the UI more generic at 247c987.
When you add the `.location-based-price` class and the `data-amount`,
`data-resource-type`, `data-resource-family` data attributes to an input,
its price is updated based on the location when the selection changes.
I refactored the checkbox component a little bit to add these attributes to
options.
I also added a description to the IPv4 input.
We need to get billing information from users so we can charge them at the
end of the month.
We decided to use Stripe as our payment provider. It has lots of features:
we can use its low-level APIs or some of its predefined UI, and we can keep
some of the information in our database or keep all data at Stripe and store
only the Stripe identifier. I tried to find a balance between these options
and decided to keep all personal data at Stripe, while we keep only the
Stripe resource identifiers. This helps us avoid GDPR issues and also keeps
the service secure, since it prevents confidential data from leaking into
dumps or logs.
Stripe has two resources: customer and payment method. We need to save
users' credit card information before they create any resource; then we
will charge them using this information.
We have two new models: BillingInfo and PaymentMethod. BillingInfo
corresponds to the customer on the Stripe side. We keep a stripe_id for both
objects.
Each project has a BillingInfo. A BillingInfo can be shared by projects, so
a user could associate the same BillingInfo with multiple projects. But
currently I designed the UI and models so that each BillingInfo is attached
to a single project. We can relax this based on customer feedback.
BillingInfo holds invoice details such as name, address, country etc. at
Stripe. PaymentMethod holds the credit card metadata at Stripe.
PaymentMethod has an order column; users can set a priority for their
payment methods, and we can try to charge them in that order.
We collect the initial BillingInfo and PaymentMethod details on Stripe's UI
to decrease friction for new users. We create a Checkout session in setup
mode, which is Stripe's hosted UI for saving customer and credit card
details. Stripe returns the success URL with a session id.
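A hedged sketch of that call with the stripe gem (URLs and key lookup are
placeholders, not the project's actual code):
```ruby
require "stripe"
Stripe.api_key = ENV["STRIPE_SECRET_KEY"]

session = Stripe::Checkout::Session.create(
  mode: "setup",                  # save a card, do not charge anything
  payment_method_types: ["card"],
  success_url: "https://example.com/billing/success?session_id={CHECKOUT_SESSION_ID}",
  cancel_url: "https://example.com/billing"
)
# On return, the session id lets us retrieve the created customer and
# payment method; we persist only their Stripe ids in BillingInfo and
# PaymentMethod.
```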
After creation, we use our own UI to update customer details, since Stripe's
Checkout session doesn't support updating them.
If STRIPE_SECRET_KEY isn't provided, billing is disabled.
Previously we expected the host memory/core ratio to be >= the VMs'
memory/core ratio. Now that we use an 8GiB memory/core ratio, allocating
VMs on an empty AX161 host failed, because it has a 4GiB memory/core ratio
(32 cores and 128GiB memory).
But we could actually allocate VMs in that case by limiting the number of
cores that can be assigned to VMs instead. For example, in the AX161 case,
instead of failing the allocation we can limit ourselves to 16 cores so we
can still maintain the 8GiB memory/core ratio.
This PR does that.
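In other words, a small sketch of the rule (names are illustrative):
```ruby
MEMORY_GIB_PER_CORE = 8

# Cores we can hand out while keeping 8 GiB of memory per allocated core.
def allocatable_cores(total_cores, total_mem_gib)
  [total_cores, total_mem_gib / MEMORY_GIB_PER_CORE].min
end

allocatable_cores(32, 128)  # => 16 on an AX161 (32 cores, 128 GiB)
```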
Before this commit, billing rates were backed by a database table, because
we anticipated that the number of billing rates would be on the order of 10k
as we add new products and regions. However, keeping this information in the
DB makes it difficult to see the current list of billing rates, and adding
new billing rates also requires a database migration.
10k is high but still manageable as an in-memory array, and we have quite a
long way to go until we reach that number. So, we are changing billing rates
to be backed by a YAML file.
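A sketch of the new shape (file path and field names are assumptions):
```ruby
require "yaml"

# Loaded once at boot; ~10k entries is still fine as an in-memory array.
BILLING_RATES = YAML.load_file("config/billing_rates.yml").freeze

def billing_rate(resource_type, resource_family, location)
  BILLING_RATES.find do |rate|
    rate["resource_type"] == resource_type &&
      rate["resource_family"] == resource_family &&
      rate["location"] == location
  end
end
```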
IPv4 is a very scarce resource, and it is starting to cost us. Other cloud
providers charge extra when users enable IPv4. We decided to charge $3
monthly for IPv4.
We decrease the VmHost's available storage size when allocating a VM, and a
VM will fail to allocate if there is not enough available storage.
Previously we didn't update the available storage size when destroying a VM,
which eventually made it impossible to allocate new VMs.
This PR fixes that by updating the available storage size when destroying a
VM.
Currently we allow the user to select a storage size with a slider between
10-100GB. 100GB is low, but what the suitable maximum storage size should be
is another good question.
Allowing a custom storage size introduces another allocation problem: a user
can create a virtual machine with high storage but a low core count, leaving
a mostly idle VmHost on which we can't allocate new virtual machines because
there isn't enough storage left.
We decided to link storage size to core count. Some other cloud providers do
the same thing, giving less storage to small machines: DigitalOcean gives
25GB per core, and Hetzner Cloud gives 40GB per core.
Our current AX161 machines have 1843 GB of total storage and 32 cores.
Currently I set 25GB of storage per core, which means storage will be
32x25=800GB at maximum for now.
We can revisit this strategy when we introduce network-based storage or
after more customer feedback.
We don't charge our customers for storage, because it's local storage and
it's already included in the monthly per-core cost calculation.
We increase the number of used hugepages when allocating a VM, and a VM will
fail to allocate if there are not enough hugepages.
Previously we didn't update the number of used hugepages when destroying a
VM, which eventually made it impossible to allocate new VMs.
This PR fixes that by updating the number of used hugepages when destroying
a VM.