SPEKTRA Edge Limits Service Design
Understanding the SPEKTRA Edge resource limit design.
The Limits service is one of the services with a special design.
The service itself should already be familiar from the user guide and,
in part, from the developer guide. Knowledge of plans is assumed. Limits
is also one of the standard services, and its code structure in the
SPEKTRA Edge repository (limits
directory) should be familiar.
What needs explanation is how Limits ensures that “values” don’t get
corrupted, lost, or over-allocated. First, resources are allocated
in each service, but the resource limits.edgelq.com/Limit
belongs to
the Limits service. Therefore, we can’t easily guarantee counter
integrity if a resource is created in one service and counted elsewhere.
Next, we know that limit values are passed from services to organizations,
then to potential child organizations, and eventually to projects. From
the MultiRegion design, we know that each organization and project may
point to a main region where its resources are kept. Therefore, we know
that
organizations/{organization}/acceptedPlans/{acceptedPlan}
is in the
organization’s region, and
projects/{project}/planAssignments/{planAssignment}
is in the project’s
region, which may be different. This document describes how Limits works
in this case.
We will also provide code pointers showing where things can be found.
Along the way, you will find out why parallel creations/deletions
are not actually parallel!
1 - Service Limit Initialization
Understanding how SPEKTRA Edge service limits are initialized.
When a Service boots up, it creates limits.edgelq.com/Plan
instances.
The Limits controller, defined in limits/controller/v1
, has a LimitsAssigner
processor, defined in
limits/controller/v1/limits_assigner/limits_assigner.go
. One is
created per possible assigner, therefore per Service
and per Organization. LimitsAssigner is typically responsible for creating
AcceptedPlan
instances for child entities, but for Services it makes
an exception: it creates an AcceptedPlan for itself! See the file
limits/controller/v1/limits_assigner/default_plan_acceptor.go
, where the function
calculateSnapshot
computes plans for child entities and for
the Service itself. This bootstraps the system: the Service can assign
any values it likes to itself.
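The bootstrapping exception can be illustrated with a minimal sketch. Note this is not the real implementation: the types, field names, and the function `calculateSnapshotSketch` are all hypothetical simplifications of the logic in `default_plan_acceptor.go`.

```go
package main

import "fmt"

// Hypothetical, simplified types; the real ones in the SPEKTRA Edge
// limits service are far richer.
type Plan struct{ Name string }

type AcceptedPlan struct {
	Assigner string // Service or Organization granting the plan
	Assignee string // entity receiving the plan (may equal the assigner)
	Plan     string
}

// calculateSnapshotSketch mimics the idea described above: an assigner
// computes AcceptedPlan instances for its child entities, and a Service
// assigner additionally accepts a plan for itself (bootstrapping).
func calculateSnapshotSketch(assigner string, isService bool, defaultPlan Plan, children []string) []AcceptedPlan {
	var out []AcceptedPlan
	if isService {
		// Bootstrapping exception: the Service accepts a plan for
		// itself, so it can assign any values it likes to itself.
		out = append(out, AcceptedPlan{Assigner: assigner, Assignee: assigner, Plan: defaultPlan.Name})
	}
	for _, child := range children {
		out = append(out, AcceptedPlan{Assigner: assigner, Assignee: child, Plan: defaultPlan.Name})
	}
	return out
}

func main() {
	plans := calculateSnapshotSketch("services/limits.edgelq.com", true,
		Plan{Name: "plans/default"}, []string{"projects/p1"})
	for _, p := range plans {
		fmt.Printf("%s -> %s (%s)\n", p.Assigner, p.Assignee, p.Plan)
	}
}
```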
2 - Project and Organization Limit Initialization
Understanding how project and organization limits are initialized.
If a project has a parent organization, then this parent organization
is the assigner for the project. If the project is root-level, then its
enabled services are the assigners; each service can assign an individual
plan to the project. The same applies to organizations. When a
project/organization is created, the Limits controller puts the newly
created entity in the “assigner” box (or boxes, for root-level entities).
Then it creates one or more AcceptedPlan
instances. The implementation, again, is in
limits/controller/v1/limits_assigner/default_plan_acceptor.go
. It is
worth mentioning, however, that DefaultPlanAcceptor
uses the assigner’s LimitPools
to check whether it will be able to create an AcceptedPlan
resource. If not, it instead annotates the Project/Organization
for which plan creation failed. This is why in limits_assigner.go
you can see a Syncer not only for AcceptedPlan but also for Project and
Organization.
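The assigner-selection rule above can be sketched as a small function. This is a hypothetical simplification; `assignersForProject` and its arguments do not exist in the real codebase.

```go
package main

import "fmt"

// assignersForProject sketches the rule described above, assuming
// resource-name strings for organizations and services.
func assignersForProject(parentOrg string, enabledServices []string) []string {
	if parentOrg != "" {
		// A parented project has exactly one assigner:
		// its parent organization.
		return []string{parentOrg}
	}
	// A root-level project gets one assigner per enabled service;
	// each service may assign an individual plan.
	return append([]string{}, enabledServices...)
}

func main() {
	fmt.Println(assignersForProject("organizations/acme", nil))
	fmt.Println(assignersForProject("",
		[]string{"services/iam.edgelq.com", "services/devices.edgelq.com"}))
}
```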
3 - AcceptedPlan Update Process
Understanding the AcceptedPlan update process.
The implementation can naturally be found in the server:
limits/server/v1/accepted_plan/accepted_plan_service.go
. We pass
the “actual” creation to the core server, of course, but this is just
a small step; the bulk of the logic executes before any CUD operation.
When the server processes an AcceptedPlan resource (Create or Update),
we are guaranteed to be in the Limits service region where
the assigner resides. Because LimitPools are children of a Service or
Organization, we can guarantee that they reside in the same regional
database as the AcceptedPlan. Thanks to this, we can verify, within
the SNAPSHOT transaction, that the caller does not attempt to
create/update any AcceptedPlan that would exceed the limit pools of
the assigner! This is the primary guarantee here: an assigner cannot
exceed the values allocated in its LimitPools. We only need to
check cases where an AcceptedPlan increases reservations on the
assigner’s LimitPools; for decreases (some updates, deletions),
no such check is needed.
However, there is some risk with decreasing accepted plans
(some updates and deletions): doing so could
decrease assignee limit values below current usage. To prevent this,
the function validateAssigneeLimitsAndGetLimitPoolUpdates
in the
server implementation checks assignee limit values. This
works in 99.99% of cases, unless new resources are
allocated while we confirm that we can decrease limits. Therefore,
we don’t have hard guarantees here.
As a result, when we create/update an AcceptedPlan, we only increase
the reservation values of the assigner’s LimitPools. When an operation
would decrease LimitPool values, we simply don’t decrease them yet.
Decreasing values is done by the Limits controller; we have a task for
this in limits/controller/v1/limits_assigner/limit_pool_state_syncer.go
.
It takes into account all child Limit and LimitPool instances (for assignees),
which are synchronized with PlanAssignment instances. It then sends
UpdateLimitPool requests once it confirms that the decreased values from
an AcceptedPlan action (update or deletion) have taken effect.
Reservation is immediate; release is asynchronous and delayed.
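The asymmetry above (increases validated and applied in the transaction, decreases deferred to the controller) can be sketched as follows. The `LimitPool` struct and `applyPlanDelta` function are hypothetical simplifications, not the real server code.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical LimitPool holding only the two fields discussed above:
// a configured size and a reserved counter consumed by AcceptedPlans.
type LimitPool struct {
	ConfigSize int64
	Reserved   int64
}

var errExceeded = errors.New("accepted plans would exceed limit pool size")

// applyPlanDelta sketches the server-side rule: increases to reservations
// are validated and applied immediately (inside the SNAPSHOT transaction);
// decreases are deliberately ignored here — the controller releases them
// asynchronously once it confirms the change took effect.
func applyPlanDelta(pool *LimitPool, delta int64) error {
	if delta <= 0 {
		return nil // release is delayed; handled by limit_pool_state_syncer
	}
	if pool.Reserved+delta > pool.ConfigSize {
		return errExceeded
	}
	pool.Reserved += delta
	return nil
}

func main() {
	pool := &LimitPool{ConfigSize: 100, Reserved: 90}
	fmt.Println(applyPlanDelta(pool, 5), pool.Reserved)   // fits: reserved rises to 95
	fmt.Println(applyPlanDelta(pool, 10), pool.Reserved)  // rejected: would exceed 100
	fmt.Println(applyPlanDelta(pool, -20), pool.Reserved) // no-op: release is deferred
}
```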
Some cheating is potentially possible, however: an org admin could send
UpdateLimitPool trying to minimize the “Reserved” field, and then
attempt to create a new accepted plan quickly, before
the controller fixes the values again. Securing this may be a bit
tricky, but such an update would leave the LimitPool with a Reserved
value well above the configured size, which is detectable via
ActivityLogs and, failing that, ResourceChangeLogs. It is unlikely to
be attempted this way. A potential way to secure this would be to disable
AcceptedPlan updates if the Reserved value of a LimitPool decreased
recently, with some timeout like 30 seconds. Alternatively, we could
put custom code in the API Server for UpdateLimitPool and validate
that only service admins update it (checking the principal from the
context). This is not covered by the IAM Authorization code-gen
middleware, but custom code can do it easily.
4 - Resource Limit Assignment
Understanding the resource limit assignment.
It is assumed that organization admins can see and manage AcceptedPlan
instances, while their tenants can only see them. Furthermore, parent
and child organizations, as well as other organizations and final
projects, are separate IAM scopes. Child entities may also reside in
different primary regions than their parent organization (or service).
For these reasons, we have the resource type PlanAssignment
, which is read-only; see its proto
definition. This allows admins to see the plan assigned to them, but
without any modifications, even if they are owners of their scope.
Because PlanAssignment is located in the region pointed to by the
project/organization, we can guarantee synchronization with
LimitPool/Limit resources!
When an AcceptedPlan is made, the Limits controller is responsible for
asynchronously creating a PlanAssignment, which may be in a different
region than the source AcceptedPlan. The code for this is in
limits/controller/v1/limits_assigner/assigned_plans_copier.go
.
It creates an instance of PlanAssignment and sends a request to the API
Server. The server implementation is, naturally, in the file
limits/server/v1/plan_assignment/plan_assignment_service.go
. Note
that the controller sets output-only fields, but this is fine: when
the server creates an instance, it will have these fields too. This
only ensures that, if there is any mismatch, the controller will
be forced to make another update.
When processing writes to PlanAssignment, the API Server fetches the
AcceptedPlan from the database. We require the child organization or
project to be in a subset of the regions available to its parent;
therefore, we know that at least a synced read-only copy of the
AcceptedPlan will be in the database. This is where we obtain the
desired configuration.
PlanAssignment is synchronized with Limit and LimitPool instances; all
of these belong to the same assignee, so we know our database owns these
resources. Therefore, we can provide some guarantees based on SNAPSHOT
transactions: configured limit values in Limit/LimitPool resources are
guaranteed to match those in PlanAssignment, users get no chance to make
a mistake, and the system will not go out of sync here.
Note that we only change the configured limit; there are also
so-called active limits, maintained by the controller. There is
some chance the configured limit is set below current usage; if this
happens, the active limit stays at a higher value, as large as the usage.
This affects the source limit pool’s reserved value, which stays
elevated! It is assumed, however, that PlanAssignment and configured
limits must stay in sync with AcceptedPlan values, regardless of whether
resources are currently being allocated/deallocated on the final API
Server side.
Note that the limits controller tracks the active size and reserved
value for LimitPool instances; Limits are on the next level.
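The configured-versus-active relationship described above reduces to a simple rule, sketched below with a hypothetical `activeLimit` helper (not a real function in the codebase):

```go
package main

import "fmt"

// activeLimit sketches the rule above: the active limit follows the
// configured limit, but if the configured value drops below current
// usage, the active limit stays as large as the usage (and the source
// LimitPool's reserved value stays elevated accordingly).
func activeLimit(configured, usage int64) int64 {
	if usage > configured {
		return usage
	}
	return configured
}

func main() {
	fmt.Println(activeLimit(100, 40)) // usage below configured: active = configured
	fmt.Println(activeLimit(50, 80))  // configured dropped below usage: active = usage
}
```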
5 - Resource Limit Tracking
Understanding how the resource limit is tracked.
We need to guarantee that the usage tracker stays in sync
with the actual resource counter. The best way to do that is to count
during local transactions. However, the resource Limit belongs to the
Limits service, not the actual service. This is why we have the Limits
Mixin Service in the SPEKTRA Edge repository, mixins/limits
.
It injects one resource type: LocalLimitTracker. Note that it is a
regional resource, but not a child of a Project. This means that no
project admin, nor any parent organization, will ever be able to see
this resource. This resource type is hidden; only service admins can
see it. This also prevents any chance of end-user mismanagement. Because
this resource type is mixed in along with the final service’s resources,
we can achieve SNAPSHOT transactions between actual resources and
trackers. We can even prevent bugs that could result in the usage
tracker having invalid values.
When we create/update a LocalLimitTracker resource, we can extract
the true counter from the local database; see the file
mixins/limits/server/v1/local_limit_tracker/local_limit_tracker_service.go
.
To check how LocalLimitTracker usage is tracked during transactions,
check two files:
mixins/limits/resource_allocator/v1/resource_allocator.go
common/store_plugins/resource_allocator.go
This is how the store plugin tracks creations/deletions: at the end of
the transaction, it tries to push extra updates to LocalLimitTracker
instances for all resource types where the number of instances changed.
This guarantees complete synchronization with the database. Note,
however, that this does not create LocalLimitTrackers yet.
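The per-transaction counting idea can be sketched as below. The `txnCounter` type and its methods are hypothetical; the real store plugin in `common/store_plugins/resource_allocator.go` hooks into actual store events.

```go
package main

import "fmt"

// txnCounter sketches the store-plugin idea: during a transaction we
// count creations/deletions per resource type, and at commit time we
// emit one LocalLimitTracker usage update per type whose count changed.
type txnCounter struct {
	deltas map[string]int64 // resource type -> net created minus deleted
}

func newTxnCounter() *txnCounter { return &txnCounter{deltas: map[string]int64{}} }

func (c *txnCounter) OnCreate(resourceType string) { c.deltas[resourceType]++ }
func (c *txnCounter) OnDelete(resourceType string) { c.deltas[resourceType]-- }

// TrackerUpdates returns the usage deltas to push into LocalLimitTracker
// instances in the same transaction, keeping trackers in sync with the DB.
// Types whose net change is zero need no update at all.
func (c *txnCounter) TrackerUpdates() map[string]int64 {
	out := map[string]int64{}
	for rt, d := range c.deltas {
		if d != 0 {
			out[rt] = d
		}
	}
	return out
}

func main() {
	c := newTxnCounter()
	c.OnCreate("devices.edgelq.com/Device")
	c.OnCreate("devices.edgelq.com/Device")
	c.OnDelete("iam.edgelq.com/ServiceAccount")
	c.OnCreate("iam.edgelq.com/Group")
	c.OnDelete("iam.edgelq.com/Group") // nets to zero, no update emitted
	fmt.Println(c.TrackerUpdates())
}
```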
This is why the Limits Mixin comes not only with an API Server
(so LocalLimitTrackers can be accessed) but also with a controller; see
the mixins/limits/controller/v1
directory. Inside the Limits processor we have the
LocalLimitTrackersManager
instance, which:
- Creates/Updates/Deletes instances of LocalLimitTracker for every
Limit instance in the Limits service.
- Synchronizes Limit instances in the Limits service using
LocalLimitTrackers from its region. This means there is no actual point
in meddling with Limit fields: the controller will fix them anyway, and
they don’t participate in actual usage checking anyway.
- Maintains PhantomTimeSeries, so we have special store usage metrics
showing how resource counters changed historically.
Note that the Limits processor in this controller has built-in
multi-region features: the primary region for a project creates/deletes
LocalLimitTrackers, but the final regions maintain Limit instances and
PhantomTimeSeries.
6 - Project and Organization Deletion Process
Understanding the project and organization deletion process.
When a Project/Organization is deleted, we need to ensure that limit
values return to the assigner. This is why AcceptedPlan instances
have assignee reference fields with the ASYNC_CASCADE_DELETE
option.
When assignees are deleted, plans follow. This deletes PlanAssignments,
but as was said, LimitPools are not given back reserved values yet.
Instead, db-controllers should delete all child resources of the
assignee, such as a Project. This decreases Limit usage values until we
hit 0.
To prevent deletion of Limit/LimitPool instances before they reach zero
values, we utilize the metadata.lifecycle.block_deletion
field, as below:
- limits/server/v1/limit/limit_service.go
Take a look at the update function, UpdateMetadataDeletionBlockFlag.
- limits/server/v1/limit/limit_pool_service.go
Take a look at the update function, UpdateMetadataDeletionBlockFlag.
This way, LimitPool and Limit resources disappear only last. We achieve
some ordering of deletions, so it is not chaotic. The controller for
the assignee will confirm that the reserved value of a LimitPool has
decreased only after whole resource collections are truly deleted.
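The block-deletion rule can be illustrated with a minimal sketch. The `Limit` struct and `updateMetadataDeletionBlockFlag` below are hypothetical simplifications of the real server functions mentioned above.

```go
package main

import "fmt"

// Hypothetical Limit carrying only the fields needed to show the rule:
// current usage and the metadata.lifecycle.block_deletion flag.
type Limit struct {
	Usage         int64
	BlockDeletion bool
}

// updateMetadataDeletionBlockFlag mirrors the idea of the
// UpdateMetadataDeletionBlockFlag functions: keep deletion blocked
// while usage is non-zero, so Limit/LimitPool resources disappear only
// after every resource they count is truly gone.
func updateMetadataDeletionBlockFlag(l *Limit) {
	l.BlockDeletion = l.Usage > 0
}

func main() {
	l := &Limit{Usage: 3}
	updateMetadataDeletionBlockFlag(l)
	fmt.Println(l.BlockDeletion) // children still exist: deletion blocked

	l.Usage = 0
	updateMetadataDeletionBlockFlag(l)
	fmt.Println(l.BlockDeletion) // usage hit zero: safe to cascade-delete
}
```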