Developing your Service

How to develop your SPEKTRA Edge service.

1 - Developing your Service

How to develop your service.

A full example of a sample service: https://github.com/cloudwan/inventory-manager-example

Service development preparation steps (as described in the introduction):

  • Reserving the service name (domain format) using the IAM API.
  • Creating a repository for the service.
  • Installing the Go SDK, at the minimal supported version or, better, the latest.
  • Setting up development - cloning edgelq and goten, setting env variables.

In your repository, you should first:

  • Create a proto directory.
  • Create a proto/api-skeleton-$VERSION.yaml file.
  • Get familiar with the API Skeleton doc and write some minimal skeleton. You can always come back to it later.
  • Generate protobuf files using the goten-bootstrap tool, which is described in the api-skeleton doc.

After this, the service can be worked on, using this document and the examples.

Further initialization

With the api-skeleton and protobuf files you may model your service, but at this point you need to start preparing some other common files. The first is the go.mod file, which should start with something like this:

module github.com/your_organization/your_repository # REPLACE ME!

go 1.22

require (
	github.com/cloudwan/edgelq v1.X.X # PUT CURRENT VERSIONS
	github.com/cloudwan/goten v1.X.X  # PUT CURRENT VERSIONS
)

replace (
	cloud.google.com/go/firestore => github.com/cloudwan/goten-firestore v1.9.0
	google.golang.org/protobuf => github.com/cloudwan/goten-protobuf v1.26.1
)

Note that two special forks are required in SPEKTRA Edge-based services.

The next crucial file is regenerate.sh, which we typically put at the top of the code repository. Refer to the InventoryManager example application.

It includes steps:

  • Setting up the PROTOINCLUDE variable (also via a script in the SPEKTRA Edge repo)
  • Calling goten-bootstrap with the clang formatter to create protobuf files
  • Generating server/client libraries (a set of protoc calls)
  • Generating the descriptor for REST API transcoding
  • Generating controller code (if a business logic controller is needed)
  • Generating code for config files

For the startup part, you should skip business logic controller generation, as you may not have it (or need it). For config files (in the config directory), you should start by copying from the example here: https://github.com/cloudwan/inventory-manager-example/tree/master/config.

You should copy all *.proto files and config.go. You may need to remove the business logic controller config part if you don’t need it.

Another note about config files concerns resource sharding: for the API server config, you must specify the following sharding:

  • byName (always)
  • byProjectId (if you have any resources where the parent contains Project)
  • byServiceId (if you have any resources where the parent contains Service)
  • byOrgId (if you have any resources where the parent contains Organization)
  • byIamScope (if you have resources where the parent contains either Project, Service, or Organization - de facto always).

Once this is done, you should execute regenerate.sh, and you will have almost all the code for the server, controllers, and CLI utility ready.

Whenever you modify Golang code, or after a regenerate.sh call, you may need to run:

go mod tidy # Ensures dependencies are all good

This will update the go.mod and go.sum files; you need to ensure all dependencies are in sync.

At this point, you are ready to start implementing your service. In the next parts, we will describe what you can find in generated code, and provide various advice on how to write code for your apps yourself.

Generated code

All Golang-generated files have .pb. in their file names. Developers can, and in some cases should, extend the generated code (structs) with handwritten files using non-pb extensions; these will not be deleted on regeneration.

We will briefly describe the generated code packages and mention where manually written files have to be added.

Resource packages

The first directory of generated code to explore is the resources directory; it contains one package per resource per API version. Within a single resource package, we can find:

  • <resource_name>.pb.access.go

    Contains access interface for a resource collection (CRUD). It may be implemented by a database handle or API client.

  • <resource_name>.pb.collections.go

    Generated collections, developed before generics were introduced into Golang. We have standard maps/lists.

  • <resource_name>.pb.descriptor.go

    This file is crucial for the development of generic components. It contains the definition of a descriptor that is tied to a specific resource. It was inspired by the protobuf library, where each proto message has its descriptor. Here we do the same, but this descriptor focuses more on creating resource-specific objects without knowing their type. Descriptors are also registered globally, see github.com/cloudwan/goten/runtime/resource/registry.go.

  • <resource_name>.pb.fieldmask.go

    Contains generated type-safe field mask for a specific resource. Paths should be built with a builder, see below.

  • <resource_name>.pb.fieldpath.go

    Contains the generated type-safe field path for a specific resource. Users don’t necessarily need to know its inner workings, apart from the interfaces. Each path should be built with the builder, and the IDE should help show what is possible to do with field paths.

  • <resource_name>.pb.fieldpathbuilder.go

    Developers are recommended to use this file and its builder. It allows the construction of field paths, also with value variants.

  • <resource_name>.pb.filter.go

    Contains the generated type-safe filter for a specific resource. Developers should not attempt to build filters directly from this, but rather use the builder.

  • <resource_name>.pb.filterbuilder.go

    Developers are recommended to use the filter builder in these files. It allows simple concatenation of conditions using functions like Where().Path.Eq(value).

  • <resource_name>.pb.go

    Contains generated resource model in Golang, with getters and setters.

  • <resource_name>.pb.name.go

    Contains the Name and Reference objects generated for a specific resource. Note that those types are structs in Go, but strings in protobuf. This allows much easier manipulation of names/references compared to plain strings.

  • <resource_name>.pb.namebuilder.go

    Contains an easy-to-use builder for name/reference/parent name types.

  • <resource_name>.pb.object_ext.go

    Contains additional generated utility functions for copying, merging, and diffing.

  • <resource_name>.pb.pagination.go

    Contains types used by pagination components. Usually, developers don’t need to worry about them, but the function MakePagerQuery is often helpful for constructing an initial pager.

  • <resource_name>.pb.parentname.go

    It is like its name equivalent but contains a name object for the parent. This file exists for resources with possible parents.

  • <resource_name>.pb.query.go

    Contains query objects for CRUD operations. Should be used with the Access interface.

  • <resource_name>.pb.validate.go

    Generated validation functions (based on goten annotations). They are automatically called by the generated server code.

  • <resource_name>.pb.view.go

    Contains a function to generate the default field mask from a view object.

  • <resource_name>_change.pb.change.go

    Contains additional utility functions for the ResourceChange object.

  • <resource_name>_change.pb.go

    Contains model of change object in Golang.

  • <resource_name>_change.pb.validate.go

    Generated validation functions (based on goten annotations) but for Change object.

Generated types often implement common interfaces as defined in the package github.com/cloudwan/goten/runtime/resource. Notable interfaces: Access, Descriptor, Filter, Name, Reference, PagerQuery, Query, Resource, Registry (global registry of descriptors).

Field mask/Field path base interfaces can be found in module github.com/cloudwan/goten/runtime/object.

While resource packages are by default considered complete and usable out of the box, additional methods extending resource structs are often implemented in separate files.
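
As a quick illustration, here is a hedged sketch of how these generated helpers typically compose; the device package, field names, and exact signatures below are hypothetical, so check the generated files of your own resource for the real API:

package example

import (
	// Hypothetical generated resource package; the real import path
	// depends on your module and resource names.
	device "github.com/your_organization/your_repository/resources/v1/device"
)

func buildListQuery() {
	// Type-safe filter in the Where().Path.Eq(value) style described above.
	fltr := device.NewFilterBuilder().
		Where().DisplayName().Eq("edge-gateway").
		Filter()

	// MakePagerQuery builds an initial pager; the arguments here
	// (order by, cursor, page size) are illustrative.
	pagerQuery := device.MakePagerQuery(nil, nil, 100)

	_ = fltr
	_ = pagerQuery
}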

Client packages

Modules at a higher level than resources can be found in client; this is typically the second directory to explore. It contains one package per API group, plus one final glue package for the whole service (in a specific version).

An API group package directory contains:

  • <api_name>_service.pb.go

    Contains definitions of request/response objects, but excluding those from _custom.proto files.

  • <api_name>_custom.pb.go

    Contains definitions of request/response objects from _custom.proto files.

  • <api_name>_service.pb.validate.go

    Contains validation utilities of request/response objects, excluding those from _custom.proto files.

  • <api_name>_custom.pb.validate.go

    Contains validation utilities of request/response objects from _custom.proto files.

  • <api_name>_service.pb.client.go

    Contains a wrapper around the gRPC connection object. The wrapper contains all actions offered by an API group in a type-safe manner.

  • <api_name>_service.pb.descriptors.go

    Contains descriptor per each method and one per whole API group.

Usually, developers will need to use just the client wrapper and request/response objects.

Descriptors in this case are more useful for maintainers building generic modules; modules responsible for things like Auditing and usage tracking use method descriptors. Those often use annotations derived from the API skeleton, like requestPaths.

Client modules contain one final “all” package, under a directory named after the short service name. It typically contains two files:

  • <service_short_name>.pb.client.go

    Combines API wrappers from all API groups together as one bundle.

  • <service_short_name>.pb.descriptor.go

    Contains descriptor for the whole service in a specific version, with all metadata. Used for generic modules.

When developing applications, developers are encouraged to maintain a single gRPC connection object and use only those wrappers (clients for API groups) that are needed. This should reduce compiled binary sizes.
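
For instance, a hedged sketch of this pattern (the import path and wrapper names are illustrative, not the real generated identifiers):

package main

import (
	"google.golang.org/grpc"

	// Hypothetical generated client package for one API group.
	devicesclient "github.com/your_organization/your_repository/client/v1/devices_service"
)

// newDevicesClient wraps a single shared gRPC connection; instantiate
// only the API group wrappers you actually need.
func newDevicesClient(conn *grpc.ClientConn) devicesclient.DevicesServiceClient {
	return devicesclient.NewDevicesServiceClient(conn)
}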

Client packages can usually be considered complete - developers don’t need to provide anything there.

Store packages

The store directory contains packages building on top of resources, used by the server binary. There is one package per resource, plus one final wrapper for the whole service.

Within the resource store package we can find:

  • <resource_name>.pb.cache.go

    It contains code generated specifically for the cache, based on the cache protobuf annotations.

  • <resource_name>.pb.store_access.go

    It is a wrapper that takes the store handle in the constructor and provides convenient CRUD access to resources. Note that it implements the Access interface defined in the <resource_name>.pb.access.go files (in the resources directory).

The common package for the whole service has a name equal to the short service name. It contains the files:

  • <service_short_name>.pb.cache.go

    It wraps up all cache descriptors from all resources.

  • <service_short_name>.pb.go

    It takes a generic store handle in the constructor and wraps it to provide an interface with CRUD for all resources within the service.

There are no known cases where a custom implementation ever had to be provided within those packages; they can be considered complete on their own.
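
A hedged sketch of the intended usage; invstore, the store handle type, and the getter names below are hypothetical placeholders for the generated code:

// The service-wide wrapper takes a generic store handle in the
// constructor and exposes typed CRUD for every resource.
func readAgent(ctx context.Context, db store.Store, q *readeragent.GetQuery) {
	svcStore := invstore.NewInventoryManagerStore(db)
	agent, err := svcStore.GetReaderAgent(ctx, q)
	if err != nil {
		return // handle not-found / transport errors here
	}
	_ = agent
}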

Server packages

Goten/SPEKTRA Edge strives to provide as much ready-to-use code as possible, and this includes almost-complete server code in the server directory. Each API group has a separate package, and there is one additional overarching package gluing all API groups together.

For each API group, we have for the server side:

  • <api_name>_service.pb.grpc.go

    Server handler interfaces (per each API group)

  • <api_name>_service.pb.middleware.routing.go

    MultiRegion middleware layer.

  • <api_name>_service.pb.middleware.authorization.go

    Authorization middleware layer (but see more in IAM integration)

  • <api_name>_service.pb.middleware.tx.go

    Transaction middleware layer, regulating access to the store for the call.

  • <api_name>_service.pb.middleware.outer.go

    Outer middleware layer - with validation, Compare And Swap checks, etc.

  • <api_name>_service.pb.server.core.go

    Core server that handles all CRUD functions already.

Note that for CRUD, everything is provided fully out of the box, but often there are custom actions, or some extra steps are required for basic CRUD; in that case, it is recommended to write custom middleware between the outer middleware and the core.
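
A hedged sketch of such a custom middleware layer; DevicesServiceServer, CreateDeviceRequest, and Device below are illustrative placeholders for the interfaces and types that the generated <api_name>_service.pb.grpc.go file defines:

// customDevicesMiddleware sits between the generated outer middleware
// and the core server.
type customDevicesMiddleware struct {
	next DevicesServiceServer // the next layer, eventually the core server
}

func NewCustomDevicesMiddleware(next DevicesServiceServer) DevicesServiceServer {
	return &customDevicesMiddleware{next: next}
}

func (m *customDevicesMiddleware) CreateDevice(
	ctx context.Context, req *CreateDeviceRequest,
) (*Device, error) {
	// Extra business steps go here, before the core CRUD handler runs.
	// Keep them idempotent: as noted below, this code may run more than
	// once when a SNAPSHOT transaction is retried.
	return m.next.CreateDevice(ctx, req)
}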

The server directory also contains a glue package for the whole service in a specific version, with the files:

  • <service_short_name>.pb.grpc.go

    Constructs server interface by gluing interfaces from all API groups

  • <service_short_name>.pb.middleware.routing.go

    Glue for multiRegion middleware layer.

  • <service_short_name>.pb.middleware.authorization.go

    Glue for the authorization middleware layer

  • <service_short_name>.pb.middleware.tx.go

    Glue for the transaction middleware layer

  • <service_short_name>.pb.middleware.outer.go

    Glue for the outer middleware layer

  • <service_short_name>.pb.server.core.go

    Glue for a core server that handles all CRUD functions.

This last directory with glue will also need manually written code files, like https://github.com/cloudwan/inventory-manager-example/blob/master/server/v1/inventory_manager/inventory_manager.go.

Note that this example server constructor shows the order of middleware execution. It corresponds to the process described in the prerequisites.

Be aware that the transaction middleware MAY be executed more than once for SNAPSHOT transaction types, in case we get an ABORTED error. The transaction is retried a couple of (typically 10) times. This also means that all middleware after TX must contain code that can be executed more than once. The database is guaranteed to reverse any write changes, BUT it is important to keep an eye on other state (for example, if we send requests to other services and the transaction fails, those won’t be reversed!). If we change the request body, the changes will be present in the request object on the second run too!

Apart from this, developers need to provide files only if there is a need for custom middleware (which, frankly, is almost always needed to some extent).

cli packages

The cli packages are used to create the simplest CLI utility, based on Cuttle. They are complete; only a main.go file will be needed later on (to be explained later).

audit handlers packages

The audithandlers packages contain, per version, generated handlers for all audited methods and resources of the whole service. They are complete, and only some minor customizations are needed; see the Audit integration document. These packages need only inclusion in the main file, during server initialization. It is not strictly necessary to understand the internal workings here.

access packages

Packages in the access directory contain modules built around the client ones. There are two differences.

First, while client contains basic objects for client-side code, access delivers higher-level modules that are not necessarily needed by all clients. Splitting them into separate packages lets clients pick smaller dependencies.

Second, while client packages are built one-per-API-group, access packages are built one-per-resource-type and focus on CRUD functionality only.

In access, each resource has its own package, and finally, there is one glue package for the whole service.

Files generated for each resource:

  • <resource_name>.pb.api_access.go

    This implements the Access interface defined in the <resource_name>.pb.access.go files (in the resources directory). In the constructor, it takes the client interface defined in the <api_name>_service.pb.client.go file (in the client directory), the one containing CRUD methods.

  • <resource_name>.pb.query_watcher.go

    A lower-level watcher built around the Watch<CollectionName> method. It takes the client interface and a channel to which it will supply events in real time. It simplifies the handling of Watch calls for collections, hiding some of the complexity associated with stateless watch calls, like soft resets or partial changes.

  • <resource_name>.pb.watcher.go

    A high-level watcher component built around the Watch<CollectionName> method. It can support multiple queries and hides all the complexity associated with stateless watch calls (resets, snapshot checks, partial snapshots, partial changes, etc.).

Files generated for a glue package for the whole service in a specific version:

  • <service_short_name>.pb.api_access.go

    Glues all access interfaces for each resource.

Watcher components deserve special attention and are best used for observing database updates in real time. They are used heavily in our applications to provide system reactions in real time. They can be used by web browsers to provide dynamic view changes, by client applications to react swiftly to configuration updates, or by controllers to keep data in sync. We will cover this more in the real-time updates topic.
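
A hedged sketch of consuming such a watcher; the DeviceWatcher type, its methods, and the event shape are illustrative placeholders for what the generated <resource_name>.pb.watcher.go provides:

// watchDevices runs the watcher and consumes its event channel.
func watchDevices(ctx context.Context, watcher *DeviceWatcher) {
	go func() {
		// Run maintains the underlying Watch streams, hiding resets,
		// snapshot checks, and partial changes from the consumer.
		_ = watcher.Run(ctx)
	}()
	for {
		select {
		case <-ctx.Done():
			return
		case evt := <-watcher.Events():
			// Each event delivers a consistent set of changes.
			applyChanges(evt.Changes) // your handler, illustrative
		}
	}
}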

Fixtures

Inside the fixtures directory, you will find some base files containing definitions of various resources that have to be bootstrapped for your service. Usually, fixtures are created:

  • for the service itself.
  • per each project that enables a given service (dynamic creation).

Those files are not “code” in any form, but some of those fixtures are still generated, and it is worth mentioning them here for completeness. We will come back to them in the SPEKTRA Edge migration document.

Main files (runtime entry points)

SPEKTRA Edge-based service backend consists of:

  • Server runtime, which handles all incoming gRPC, webGRPC, and REST API calls.
  • DbController runtime, which executes all asynchronous database tasks (like garbage collecting, multi-region syncing, etc.).
  • Controller runtime, which executes all asynchronous tasks related to business logic, keeping the system working. It also handles various bootstrapping tasks, like those for IAM integration.

For each runtime, it is necessary to write one main.go file.

Apart from the backend, it is very advisable to create a CLI tool that, at the least, allows developers to quickly play with the backend. It should use the generated cli packages.

Clients for web browsers and agents are not covered by this document, but the examples provide some insight into how to create a client agent application running on the edge.

All main file examples can be found here: https://github.com/cloudwan/inventory-manager-example/tree/master/cmd.

Service developers should create a cmd directory with relevant runtimes.

Server

In the main file for the server, we need:

  • Initialize the EnvRegistry component, responsible for interacting with the wider SPEKTRA Edge platform (includes the discovery of endpoints, real-time changes, etc.).

  • Initialize observability components

    SPEKTRA Edge provides Audit for recording many API calls, Monitoring for usage tracking (it can also be used to monitor error counters). It is also possible to initialize tracing.

  • Run the server in the selected version (as detected by envRegistry).

In the function running a server in a specific version:

  • We are initializing the database access handle. Note that it needs to support collections for your resources, but also for mixins.

  • We need to initialize a multi-region policy store

    It will observe all resources that are multi-region policy-holders for your service. If you use policy-holders from imported services, you may need to add a filter to guarantee you are not trying to access resources unavailable to your service.

  • We need to initialize AuthInfoProvider, which is common for Authenticator and then Authorization.

  • We need to initialize the Authenticator module.

  • Finally, we initialize the gRPC server object. It does not contain any registered handlers on its own yet

    Only common interceptors like Authentication.

For the gRPC server instance, we need to create and register handlers: first server handlers for your particular service, then those for the mandatory mixins:

  • The schema mixin is mandatory; it provides all methods related to database/schema consistency across multi-service, multi-region, and multi-version environments.
  • The limits mixin is mandatory if you want to use the Limits integration for your service.
  • The diagnostics mixin is optional for now, but this may change once EnvRegistry gets proper gRPC-based health checks. It should be included.

Mixins provide their own API methods - they are separate “services” with their own API skeletons and protobuf files.

Refer to the example in the instructions on how to provide your main file for the server.
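
An abbreviated, hedged sketch of such a main file; all helper names below are illustrative placeholders, and the linked example contains the real initialization code:

func main() {
	ctx := context.Background()
	// EnvRegistry: endpoint discovery, real-time environment changes.
	envRegistry := initEnvRegistry(ctx)
	// Observability: Audit, Monitoring usage/error tracking, optional tracing.
	initObservability(ctx, envRegistry)
	// Run the server in the version detected by envRegistry.
	runV1Server(ctx, envRegistry)
}

func runV1Server(ctx context.Context, envRegistry EnvRegistry) {
	db := initStoreHandle(ctx) // must cover your collections plus mixins
	policyStore := initMultiRegionPolicyStore(ctx, db)
	authInfoProvider := initAuthInfoProvider(ctx)
	authenticator := initAuthenticator(ctx, authInfoProvider)
	// The gRPC server starts with common interceptors (Authentication)
	// only; handlers are registered explicitly below.
	srv := initGrpcServer(ctx, authenticator)
	registerMyServiceHandlers(srv, db, policyStore)
	registerSchemaMixinHandlers(srv, db)  // mandatory
	registerLimitsMixinHandlers(srv, db)  // if Limits are used
	registerDiagnosticsMixinHandlers(srv) // optional, recommended
	srv.Serve()
}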

Note: As of now, webGRPC and REST API are handled not by the server runtime, but by the envoyproxy component. The examples include a configuration example file (and a Kubernetes deployment declaration).

Controller

In the main file for the controller, we need:

  • Initialize the EnvRegistry component, responsible for interacting with the wider SPEKTRA Edge platform (includes the discovery of endpoints, real-time changes, etc.).
  • Initialize observability components.
  • Run the controller in the selected version (as detected by envRegistry).

For a selected version of the controller, we need to:

  • Create the business logic controller virtual nodes manager

    This step is necessary if you have business logic nodes; otherwise, you can skip it. You can refer to the business logic controller document for more information on what it is and how to use it.

  • The limits-mixin controller virtual nodes manager is mandatory if you include the Limits feature in your service. You can skip this module if you don’t need Limits; otherwise, it is needed to execute the common Limits logic.

  • Fixtures controller nodes are necessary to:

    • Bootstrap resources related to the service itself (like IAM permissions).
    • Bootstrap resources related to the projects enabling the service - like metric descriptors. Note that this means the controller needs to dynamically create new resources and watch project resources appearing in the service (tenants).

The fixtures controller is described more in the document about SPEKTRA Edge integration. However, since some fixtures are mandatory, it is a practically mandatory component to include.

Refer to the example of how to provide your main file for the controller.
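
A hedged sketch of the version-specific part of a controller main; the node manager constructors below are illustrative placeholders:

func runV1Controller(ctx context.Context) {
	// Business logic controller nodes (skip if you have none).
	bizNodes := newBusinessLogicNodesManager(ctx)
	// Limits-mixin nodes: mandatory when the Limits feature is used.
	limitsNodes := newLimitsMixinNodesManager(ctx)
	// Fixtures nodes: bootstrap service-wide resources, and per-project
	// resources as new tenants appear.
	fixtureNodes := newFixturesNodesManager(ctx)

	go bizNodes.Run(ctx)
	go limitsNodes.Run(ctx)
	go fixtureNodes.Run(ctx)
	<-ctx.Done()
}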

DbController

Db-Controller is a set of modules executing tasks related to the database:

  • MultiRegion syncing.

  • Search database syncing

    If the search is enabled and uses a separate database.

  • Schema consistency (like asynchronous cascade unsets/deletions when some resources are deleted).

In the main file for the db-controller, we need to:

  • Initialize the EnvRegistry component, responsible for interacting with the wider SPEKTRA Edge platform (includes the discovery of endpoints, real-time changes, etc.).
  • Initialize observability components.
  • Configure the database and search DB indices, as described in the proto files.
  • Run the db-syncing controller for all syncing-related tasks.
  • Run the db-constraint controller for all schema consistency tasks.

All db-controller modules are provided by the Goten framework, so developers need to provide just the main file.

Refer to the example in the instructions on how to provide your main file for db-controller.

CLI

If you have Cuttle installed, you can use the core SPEKTRA Edge services with it. However, it is useful, especially when developing a service, to have a similar tool for your own service too. Goten generates a CLI module in the cli directory; developers need only provide their main file for the CLI. Refer to the inventory-manager example.

In that example, we include the example service and add some mixins: schema-mixin and limits-mixin. These CLI objects can access the mixin APIs exposed by your service; they can be skipped to reduce code size if you prefer. They contain calls that are relevant for service developers or maintainers. Mixins contain internal APIs and, if there are no bugs, even service developers don’t have to know their internals (and if there is a bug, they can submit an issue). Mixins operate on their own APIs and should do all the job themselves.

Inclusion of Audit is recommended: the default Cuttle provided by SPEKTRA Edge will not be able to decode Audit messages for custom services, but a CLI utility with all service types registered will be.

Refer to the example in the instructions on how to provide your main file for CLI.

Note: The compiled CLI will only work if Cuttle is locally installed and initialized. Apart from that, you need to add the endpoint for your service separately to the environment, if your Cuttle environment points to the core SPEKTRA Edge platform.

For example, suppose that the domain for SPEKTRA Edge is beta.apis.edgelq.com:

cuttle config environment get staging-env
Environment:  staging-env
Domain:  beta.apis.edgelq.com
Auth data:
    ...
Endpoint specific configs:
+--------------+----------+--------------+-----------------+------------------+---------------+
| SERVICE NAME | ENDPOINT | TLS DISABLED | TLS SKIP VERIFY | TLS SERVICE NAME | TLS CERT FILE |
+--------------+----------+--------------+-----------------+------------------+---------------+
+--------------+----------+--------------+-----------------+------------------+---------------+

Therefore, the connection to the IAM service will be iam.beta.apis.edgelq.com, because this is the default domain. Since 3rd party services use different domains, you will need to add endpoint-specific settings like:

cuttle config environment set-endpoint \
  staging-env $SERVICE_SHORT_NAME --endpoint $SERVICE_ENDPOINT

The variable $SERVICE_SHORT_NAME should be snake_cased; it is derived from the short name of the service in the api-skeleton. For the inventory manager example, it is inventory_manager (in the api-skeleton, the short name is InventoryManager). See https://github.com/cloudwan/inventory-manager-example/blob/master/proto/api-skeleton-v1.yaml, field proto.service.name.

The variable $SERVICE_ENDPOINT must point to your service, like inventory-manager.examples.custom.domain.com:443. Note that you must include the port number, but not the scheme (like https://).

2 - Developing your Business Logic in Controller

How to develop your business logic in controller.

The API Server can execute very little actual work: all read requests are limited in size - they can fetch one page. Write requests will stop working if you start saving/deleting too many resources in a single transaction, and multiple transactions will also make users wonder if something is stuck. Some actions are intense; for example, when a user creates a Distribution resource in applications.edgelq.com that matches thousands of edge devices, the system needs to create thousands of Pod resources. A transaction of this size is practically impossible; pods must be created asynchronously for the service to operate correctly.

Since we are using NoSQL databases, which don’t have cross-collection joins, we sometimes need to denormalize data and make copies, so that all necessary data can be read from a single collection.

Service development very often requires developing a business logic controller - it is designed to execute all additional write tasks asynchronously.

We also need to acknowledge that:

  • Some write requests may fail, and some parts of the system may be unavailable. We need reasonable retries.

  • The system may be in constant motion

    Actions changing the desired state may arrive asynchronously. Tasks may change dynamically even before they are completed.

  • For various reasons (a mistake?) users may delete objects that need to exist. We need to handle interruptions and correct errors.

The business logic controller was designed to react in real time, handle failures, cancel or amend actions when necessary, and heal the system back to the desired state.

The desired state and the observed state are the key concepts here. Controllers are optimized first for Create/Update/Delete operations, trying to match the desired state with the observed one. The pattern is the following: the Controller uses Watchers to know the current system state; ideally, it watches only the subset it needs. The observed state of some resources is used to compute the desired state. Then the desired state is compared with the relevant part of the observed state again, and any mismatch is handled by a Create/Update/Delete operation. Although this is not the only way a controller can operate, it is the most common.
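
In pseudocode-like Go, the pattern boils down to the following sketch (the types and helpers are illustrative stand-ins, not the framework API):

package example

// Resource is a stand-in for any generated resource type.
type Resource struct{ name string }

func createResource(r *Resource)         {}
func updateResource(r *Resource)         {}
func deleteResource(r *Resource)         {}
func equalResources(a, b *Resource) bool { return a.name == b.name }

// reconcile diffs a computed desired collection against the observed
// one and issues Create/Update/Delete operations for any mismatch.
func reconcile(desired, observed map[string]*Resource) {
	for id, want := range desired {
		got, ok := observed[id]
		switch {
		case !ok:
			createResource(want) // desired, but not observed yet
		case !equalResources(got, want):
			updateResource(want) // observed, but drifted from desired
		}
	}
	for id, got := range observed {
		if _, ok := desired[id]; !ok {
			deleteResource(got) // observed, but no longer desired
		}
	}
}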

Since the exact tasks of the business logic controller are service-specific, SPEKTRA Edge/Goten provides a framework for building it. This is different from the db-controller, where we have ready modules to use.

Of course, there are some ready controller node managers, like the limits mixin, which needs to be included in each controller runtime if the limits feature is used. This document, however, explains how to create your own.

Some example tasks

Going back to Distributions, Devices, and Pods: the Controller should create a Pod resource for any matching combination of Device + Distribution. If a Device is deleted, all its pods must be deleted. If a Distribution is deleted, its pods similarly need to be deleted from all devices.

The observed state consists of Devices, Pods, and Distributions. The observed state of Distributions and Devices is used to compute the desired Pod set. This is then compared with the observed Pods to derive action points.

Another example: imagine we have a collection of Orders and Products, where one Order points to one Product, but a Product can be pointed to by many Orders. Imagine that we want to display a view of Orders, where each item also shows short product info. Since we have NoSQL without joins, we will need to copy short info from a Product into each Order. We can do this when the Order is created/updated: get the Product resource and copy its info to the Order record. It may be questionable whether we want to update existing Orders when the Product is updated; for the sake of this example, suppose we need to support this case. Here, we observe Products and Orders as the observed state. For each observed Order, we compute the desired one by checking the current product info. If there is any mismatch, we issue an Update to the server. Note we have both observed and desired states of Orders here.

Architecture Overview

When it comes to controllers, we define a thing called a “Processor”. The Processor is a module that accepts the observed state as input. Inside, it computes the desired state. Note that observed and desired states can potentially consist of many resource types and many collections. Still, a Processor should concentrate on an isolated business logic task; for example, management of Pods based on Distributions and Devices is such a task. Inside a Processor, the desired and observed states are provided to internal syncers that ensure the valid state of the system. The Processor does not have any output; it is a rather high-level and large object. However, processors are scoped.

Resource models in SPEKTRA Edge are concentrated around tenants: Services, Organizations, and Projects - usually the last one. This is what each processor is scoped around: a selected Service, Organization, or Project. We have as many processor instances as there are tenant resources in total. This is for safety reasons, to ensure that tenants are separated. It would not be good if, by mistake, we matched a Distribution with Devices from different projects; then one tenant could schedule pods in another one…

Therefore, we need to remember that a Processor in the Business Logic Controller is a unit scoped by a tenant (usually a Project) and focused on executing a single, developer-defined business logic task. This business logic task may produce as many desired states (one per collection) as the developer deems necessary.

Above a Processor, we have a “Node”. A Node contains:

  • Set of processors, one per tenant it sees.
  • Set of watchers, one per each input (observed) collection.

Node is responsible for:

  • Management of processors; for example, if a new project is created, it should create a new processor object. If the project is deleted, then the processor must also be deleted.
  • Running the set of watchers. By using common watchers across processors, we ensure that we do not have too many streams to the servers (multiple small projects are a thing here).
  • Distributing observed state changes to the processors; each change set should be split if necessary and provided to the relevant processors.

A Node should be considered self-contained and generally the highest-level object, although we have things like “Node Managers”, which manage a fixed set of nodes, typically one to four of them. We will come back to this in the scaling considerations topic.

We can go back to the Processor instance: each has a “heart”, one primary goroutine that runs all the internal computations and events - only one, to avoid multi-threading issues as much as possible. Those “events” include all observed state changes provided by a Node to the Processor. This “heart” is called the processor runner here. Its responsibility includes computing the desired state.

Modules-wise, each processor consists of (typically):

  • A set of input objects. They are queues, where the Node pushes observed state changes on the producer side. On the consumer side, the processor runner extracts those updates and pushes them into “stores”.
  • A set of stores, one per observed collection; stores are stateful and contain full snapshots of the observed collection. When the processor runner gets an update from the input object, it applies the change to the “store”. This is where we decide whether there was an update/deletion/creation.
  • A set of transformers. They observe one or many stores, which propagate changes to them in real time. Transformers contain the code responsible for computing the desired state based on the observed one.
  • A set of syncers. Each has two inputs: one is some store with the observed state, the other is the transformer producing the desired state. In some cases, though, it is possible to provide more than one transformer to the desired state input of a syncer.

All of these components are run by the processor runner goroutine, with one exception: Syncers internally have additional goroutines that execute the actual updates (create, delete, and update operations). Those are IO operations, so it is necessary to delegate these tasks away from the processor runner.

An important note: the processor runner MUST NOT execute any IO work; it should always be fast. If necessary, the framework allows running additional goroutines in the processor, which can execute longer operations (or those that can return errors).

One final thing to discuss about processors is the initial synchronization. When a Node boots up, its number of processors is 0, and the watchers for the observed state are empty. First, the watchers need to start observing the relevant input collections, as instructed. When they start, before getting real-time updates, they get a current snapshot of the data. Only then do we start getting real-time updates that happened after the snapshot point in time. The Node is responsible for creating as many processors as there are tenants it observes. Events from different collections may be out of sync; sometimes we may get tenants after other collections, sometimes before, often both. It is also possible for a watcher to lose connectivity with a server. If the disconnection is long enough, it may opt for requesting a snapshot again after a successful reconnection. A full snapshot for each tenant is delivered to each corresponding processor. Therefore, when a Node provides an “event” to a processor, it must include “Sync” or “LostSync” flags too. In the case of “Sync”, the processor is responsible for generating its own diff using its internal Store with the previous snapshot.

Note that each observed input of the processor will get its own “sync” event, and we can’t control the order here. The rules are:

  • Sync/LostSync events must be propagated from inputs/stores to syncers.
  • Transformer must send a “Sync” signal when all of its inputs (or stores it uses) are in sync. If at least one gets a LostSync event, then it must propagate LostSync to the Syncer’s desired state.
  • Syncer’s desired state is in sync/non-sync depending on events from the transformer(s).
  • Syncer’s observed state is in sync/non-sync depending on the sync/lostSync event from the store it observes.
  • Syncer’s updater executes updates only when both desired and observed states are in sync. When they both gain a sync event, the syncer computes a fresh snapshot of Create/Update/Delete operations, and all previous operations are discarded.
  • Syncer’s updater must stop actions when either observed or the desired state loses sync.
  • Transformers may postpone desired state calculation till all inputs achieve sync state (developer decides).

Prototyping controllers with proto annotations

In Goten, we first define the structure of the business logic controller (or what is possible) in protobuf files; we define the structure of Nodes, Processors, and their components.

A full reference can be found here: https://github.com/cloudwan/goten/blob/main/annotations/controller.proto. We will discuss some examples here to provide some more clarity.

By convention, in proto/$VERSION we create a controller subdirectory for proto files. In regenerate.sh we add the relevant protoc compiler call, like in https://github.com/cloudwan/inventory-manager-example/blob/master/regenerate.sh (find --goten-controller_out).

When going through examples, we will explore some common patterns and techniques.

Inventory manager example - Processor and Node definitions.

Let’s review some examples, starting with the Inventory Manager definition of a Processor: https://github.com/cloudwan/inventory-manager-example/blob/master/proto/v1/controller/agent_phantom_processor.proto

We can start from the top of the file (imports and top options):

  • See the go_package annotation - this is the location where generated files will be put. The directory controller/$version/$processor_module is a convention we use and recommend for Processors.
  • Imports of goten.proto and controller.proto from goten/annotations are required.
  • We need to import the main service packages’ files for the versions we intend to use. For this example, we want to use monitoring from v4 and Inventory Manager from v1. The relevant imports were added.
  • We also import “common components”, but we will return to them later.

In this file, we define a processor called “AgentPhantomProcessor”. We MAY then optionally specify the types we want to use. This one (CommonInventoryManagerControllerTypes) is specified in the imported “common components” file mentioned above. Let’s skip explaining it for now.

The next important part is the definitions. In Goten, resource-type names are fully qualified with the format $SERVICE_DOMAIN/$RESOURCE_NAME. This is how we need to specify resources. Definitions can be used to shorten long names. With the next example, we will also demonstrate another use case.

In AgentPhantomProcessor, we would like to generate a single PhantomTimeSerie resource per each ReaderAgent in existence. So this is a very simple business logic task: make one additional resource for each member of another collection.

Since both ReaderAgent and PhantomTimeSerie are project-scoped resources, we want processors to operate per project. Therefore, we declare that “Project” is the scope object in the processor. Then we define two inputs: ReaderAgent and PhantomTimeSerie. Here in protobuf, each “input” will code-wise consist of two components: an Input and a Store object (as described in the architecture overview).

We define a single transformer object: AgentPhantomTransformer. There, we declare that this transformer should produce the desired collection of PhantomTimeSerie instances, where each is owned by some ReaderAgent. It simplifies cleanup: if a ReaderAgent is deleted, the transformer will delete its PhantomTimeSerie from the desired collection. The best transformer type in such a situation is owner_ownee, where each output resource belongs to a separate parent.

After transformers, we define syncers; we have one instance, PhantomTimeSerieSyncer. It takes PhantomTimeSerie from the input list as the observed input. The desired collection then comes from AgentPhantomTransformer.

This Processor definition shows us what connects with what; we constructed the structure in a declarative way.

Now let’s come back to types. As we said in the architecture overview, the Processor consists of input, store, transformer, and syncer objects. While transformers can be specified only in Processor definitions, the rest of those little elements can be delegated to type sets (here, CommonInventoryManagerControllerTypes). This is optional; type_sets are not needed very often, including here. If they were not defined, the compiler would generate all necessary components implicitly in the same directory indicated by go_package, along with the processor. If type_sets are defined, it will try to find types elsewhere before deciding to generate some on its own.

Separate type_sets can be used, for example, to reduce unnecessary generated code, especially if we have multiple processors using similar underlying types. In the Inventory Manager, it was done for demonstration purposes only. Let’s see this file though: https://github.com/cloudwan/inventory-manager-example/blob/master/proto/v1/controller/common_components.proto.

We define here the input, store, and syncer components. Note that go_package is different from the one in the processor file. It means that the generated components will reside in a different directory than the processor. The only benefit here is this separation; it’s not strictly required.

Finally, note that in the processor we only indicated what the controller is doing, and the connections. The implementation is not here yet; it will be in Golang. For now, let’s jump to the Node declaration, which can be found here: https://github.com/cloudwan/inventory-manager-example/blob/master/proto/v1/controller/inventory_manager_ctrl_node.proto

A Node is a component managing Processor instances and is responsible for dispatching real-time updates from all watchers to processors, which in this example are scoped by an Inventory Manager Project (inventory-manager.edgelq.com/Project).

In this example file, we declare a Node called “InventoryManagerCtrl”. The processors we want to attach form a one-element array containing AgentPhantomProcessor. We could potentially attach more processors, under one condition though: all must be scoped by exactly the same object. Since AgentPhantomProcessor is scoped by Project (inventory-manager.edgelq.com/Project), other processors would need the same.

A compiler parsing such a Node definition will automatically detect the Scope and all Input resources. What we need to define is:

  • The sharding method: since the scope is a Project, the standard sharding for it is “byProjectId”. For an organization, it would be “byOrgId”; for a service, “byServiceId”. All three can optionally be replaced with “byIamScope”. We will return to this when talking about scaling.
  • Dispatchment: when the Node gets a snapshot + real-time updates from the watchers for all resources (Scope + Input), it needs to also know how the resources should be grouped.
    • Param scope_grouping tells us how the Project is identified. Normally, we want to define the Project ID by using its name; if you are unsure, just pass method: NAME for scope_grouping. As a result, the Node will extract the name field from a Project and use it as the Processor identifier.
    • Param input_groupings is defined per each detected input resource. In the processor, we defined monitoring.edgelq.com/PhantomTimeSerie and inventory-manager.edgelq.com/ReaderAgent (which were shortened to PhantomTimeSerie and ReaderAgent). Input groupings instruct the Node how each resource instance of a given type should be classified - that is, how to extract the ID of the corresponding processor instance. Resource ReaderAgent is a child of an inventory-manager.edgelq.com/Project instance according to the api-skeleton. Therefore, we indicate that the grouping method is of the “NAME” type; the Node can figure out the rest. Resource PhantomTimeSerie is a bit more tricky, because its parent resource is not inventory-manager.edgelq.com/Project, but monitoring.edgelq.com/Project. Still, the Node will need a method to extract the name of inventory-manager.edgelq.com/Project from the monitoring.edgelq.com/PhantomTimeSerie instance. Because this can’t be done declaratively (as of now, the compiler does not figure out things by string value as the IAM Authorizer does), we must pass the CUSTOM method. It means that in Golang we provide our own function for getting the processor ID.

When deciding on the dispatchment annotation, we need to know that a Node has a customizable way of defining the Processor identifier. We need to provide a way for a <Resource Instance> to be mapped to a <Processor Identifier>, and we need to do this for the Scope AND all Input resources. The method NAME passed to either the scope or an input resource means that the Node should just call the GetName() function on the resource instance to get an identifier. This works for same-service resources, but not for others like PhantomTimeSerie: the name returned by it would eventually point to a Project in the monitoring service.

Although the GetName() method on the ReaderAgent instance returns the name of the ReaderAgent rather than a Project, the Node is able to notice that the name of the ReaderAgent also contains the name of the project.
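
For the CUSTOM method, the hand-written Golang function could look roughly like this hedged sketch (the accessor and the name format are illustrative):

// Maps a monitoring PhantomTimeSerie to its processor ID: the name of
// the inventory-manager Project sharing the same project ID.
func phantomTimeSerieToProcessorID(pts *phantom_time_serie.PhantomTimeSerie) string {
	projectID := pts.GetName().GetProjectId() // illustrative accessor
	return "projects/" + projectID
}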

Applications example - Processor and Node definitions.

We have a more interesting example of a Controller in applications.edgelq.com. We have a controller processor responsible for Pods management; we say that “Pod management” is a business logic task. There are two things we want from such a processor:

  • Create a Pod instance per each matched Device and Distribution.
  • Whenever the Device goes offline, we want to mark all its Pods as offline.

Business notes:

  • One Device can host multiple Pods, and a Distribution can create Pods across many devices. Still, a Device+Distribution pair can have one Pod instance at most.
  • A Pod can be deployed manually, not via Distribution. Not all pods are of the distribution type.
  • When a Device gets online, it will update pod statuses itself. But when it goes offline, the controller will need to do it. Note that this means the controller will basically need to track the pods of offline devices.
  • If the device is offline, the pod status should be “Unknown”.
  • The resources Pod, Distribution, and Device are project-scoped; however, Device belongs to the devices.edgelq.com service, and the other resources to applications.edgelq.com. Still, the Project is our primary scope.

With this knowledge, we can draft the following Processor declaration: https://github.com/cloudwan/edgelq/blob/main/applications/proto/v1/controller/pods_processor.proto

Compared to the previous example, goten.controller.type_set is declared in the same file, but for now let’s skip this part and talk about the processor first. There, we have the PodsProcessor type defined. As we can deduce from the business notes, the Scope resource should be “Project”, and the inputs should be clear too. Then we have two transformers, one per business task defined. You should also note that we have two additional definitions of the applications.edgelq.com/Pod instance: one is DistributionPod, the other is UnknownStatePod. As mentioned in the business notes, not all pods belong to a distribution, and pods with unknown states are also a subset of all pods. Those extra definitions can be used to differentiate between types and help write a proper controller.

Transformer DistributionController is of the already-known type, owner/ownee. But in this case, each Pod instance is owned by a unique combination of Distribution and Device. Also, when either of the parents is deleted, all associated pods will be automatically deleted.

The other transformer, UnknownStateTracker, is of a different kind: generic. This type of transformer just takes some number of inputs and produces some number of outputs. In this particular case, we take Devices and Pods, where each Pod belongs to a specific Device only. For each offline Device, we want to mark its Pods as being in the Unknown state. The generic type requires more code, and developers need to handle all input events: additions, updates, and deletions too. For each change in the input resources, a new snapshot of the output (or a DIFF against the snapshot) is required.

One alternative we could have used is a Decorator:

{
  name : "UnknownStateStatusDecorator"
  decorator : {
    resource : "Pod"
    additional_inputs : [ "Device" ]
  }
}

The decorator keeps the same resource on the output; in this case, when a Pod changes, the decorator function is called to decorate the Pod resource. There, we could get the Device record owning the pod, check the Device status, and then mark the Pod status. If the Device changes, it triggers a re-compute of all the Pods belonging to it (the decorator is called again). We did not use a decorator here, because the Controller should only mark the Pod status as UNKNOWN when the Device is offline; when the Device is online, it manages its Pod statuses itself. This “shared” ownership means that the decorator was not exactly suitable. Instead, we use the “generic” type and output the pods that have UNKNOWN status. The controller needs to run UpdatePod only for the pods of offline devices. If the device gets online, the controller should “forget” about those pods. What we mean is: UnknownStateTracker DELETES pods from the output collection if the device becomes online (which is not the same as actually deleting the pods!). This is why the output from UnknownStateTracker is UnknownStatePod, not Pod. We want to show that the output contains pods with unknown status, not all pods. We will come back to this when commenting on the implementation in Go.
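
To make those semantics concrete, here is a hedged sketch of the generic transformer’s reaction to a Device change (the names and helper methods are illustrative):

// onDeviceChanged recomputes the UnknownStatePod output for one Device.
func (t *unknownStateTracker) onDeviceChanged(dev *Device) {
	pods := t.podsStore.ListByDevice(dev.GetName()) // the custom index helps here
	if dev.IsOnline() {
		// The device manages its own pod statuses again: drop its pods
		// from the desired UnknownStatePod collection. This is NOT a Pod
		// deletion; the syncer simply stops enforcing the UNKNOWN status.
		t.output.DeleteAll(pods)
		return
	}
	for _, pod := range pods {
		desired := pod.Clone() // hypothetical deep-copy helper
		desired.Status.State = "UNKNOWN"
		t.output.Set(desired)
	}
}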

We will also be re-checking offline pods periodically, producing snapshots (per scope project) after each period. By default, the transformer would be triggered only when some Pod/Device changes (create, update, delete).

Now going back to goten.controller.type_set: there, we defined only the Store component for the Pod resource, with one custom index, even though we have multiple resource definitions in the processor. As we mentioned, this type set is an optional annotation, and the compiler can generate the missing bits on its own. In this particular case, we wanted a custom index for pods; the field path spec.node points to the Device resource. This index just gives us some convenience in the code later on. Anyway, this is another use case for type sets: the ability to enhance the default types we would get from the code-generation compiler.

Node definition can be found here: https://github.com/cloudwan/edgelq/blob/main/applications/proto/v1/controller/applications_ctrl_node.proto

However, in this case, it is pretty much the same as in Inventory Manager.

Overview of generated code and implementing missing bits

The best way to discuss controller code is by examples again; we will check the Inventory Manager and Applications examples.

Inventory manager

In InventoryManager, we want the following feature: a time series showing the history of online/offline changes per agent. Each agent runtime should send an online signal within an interval (1 minute), using a CreateTimeSeries call from monitoring. When an agent goes offline, it cannot send an “offline” signal itself - instead, we need to generate a PhantomTimeSerie object per agent, which generates data when the original metrics are missing. This is how we obtain the online/offline history: zeroes fill the offline periods, “ones” the online parts. This is the task we did for the Inventory Manager.

The controller code can be found here: https://github.com/cloudwan/inventory-manager-example/tree/master/controller/v1.

As with the rest of the packages, file names with .pb. are generated; the others are handwritten. The common directory there contains only generated types, as pointed out by the proto file for type_sets. More interesting is the agent phantom processor, to be found here: https://github.com/cloudwan/inventory-manager-example/tree/master/controller/v1/agent_phantom.

We should start examining examples from there.

The first file is https://github.com/cloudwan/inventory-manager-example/blob/master/controller/v1/agent_phantom/agent_phantom_processor.pb.go.

It contains the Processor object and all its methods. We can notice the following:

  • In the constructor NewAgentPhantomProcessor, we create all processor components as described by the protobuf file for the processor. Connections are made automatically.
  • The constructor gets an instance of AgentPhantomProcessorCustomizer, which we will need to implement.
  • The processor has a “runner” object; this is the “heart” of the processor, handling all the events.
  • The processor has a set of getters for all components, including the runner and scope object.
  • The processor has an AddExtraRunner function, where we can add extra procedures running on separate goroutines, doing extra tasks not predicted by the processor proto definition.
  • The interface AgentPhantomProcessorCustomizer has a default partial implementation.

In the customizer, we can:

  • Add PreInit and PostInit handlers

    PreInit is called with a processor whose internal components are not yet initialized. PostInit is called after the initial construction is completed (but not after it runs).

  • We have StoreConfig calls, which can be used to additionally customize Store objects. You can check the code to see the options; one option is to provide an additional filter applied to the store, so we don’t see all resources.

  • Functions ending with ConfigAndHandlers are for Syncer objects. We will have to implement them. This is for the final configuration & tuning of Syncers.

  • Functions ending with ConfigAndImpl must be used to customize transformers.

  • We can also hook a handler in case the Scope object itself changes (like some fields in the Project). Usually, it is left empty, but we may still hit some use cases for it.

After reviewing the processor file, you should see the processor customizer implementation. This is a handwritten file; here is the example for InventoryManager: https://github.com/cloudwan/inventory-manager-example/blob/master/controller/v1/agent_phantom/agent_phantom_processor.go.

The constructor we can define however we want. Then, some implementation notes:

  • For PhantomTimeSerieStoreConfig, we want to filter out PhantomTimeSeries that are not of the specific metric type, or for which we don’t have the specific meta owner types. This may often be redundant, because we can define the proper filter for the PhantomTimeSerie objects themselves (in a different file; we will come back to it).

  • In the function AgentPhantomTransformerConfigAndImpl, we need to return an implementation handler that satisfies the specific interface required by the transformer. In the config object, we usually provide a reconciliation mask. These masks are used to prevent triggering the transformer function for non-interesting updates. In this example, we are checking the field paths online, activation.status, and location. It means that if any of those fields change in a ReaderAgent, we trigger the transformer to recompute the PhantomTimeSerie objects (for this agent only). The reconciliation mask helps reduce unnecessary work. If someone changed, let’s say, the display name of the agent, no work would be triggered.

  • In the function PhantomTimeSerieSyncerConfigAndHandlers we customize the Syncer for PhantomTimeSerie objects. In the config part, we almost always need to provide the update mask: the fields that are maintained by the controller. We may also specify what to do in case of duplicated resource detection - by default we delete duplicates, but it may be OK to provide this value explicitly (AllowDuplicates is false). Apart from that, there is a quirk about PhantomTimeSerie instances:

    The fields resource and metric are non-updatable. Because of that, we need to disable updates (UpdatesDisabled). It is recommended to review all options in the code itself to see what else can be changed. The handlers for the syncer are a bit tricky here: we could have just returned NewDefaultPhantomTimeSerieSyncerHandlers, but we need some special casing, which is common for PhantomTimeSerie instances. We will come back to it later.

  • In the function PostInit we are providing an extra goroutine, ConnectionTracker. It does work not covered by the controller framework for now and needs some IO. For those reasons, it is highly recommended to delegate this work to a separate goroutine. This component will also get updates from the ReaderAgent store (create, update, delete).

Let’s first discuss the phantomTimeSerieSyncerHandlers object. It extends the generated common.PhantomTimeSerieSyncerHandlers. Custom handlers are quite powerful tools: we can customize even how the object is created/updated/deleted. By default, the standard Create/Update/Delete methods are used, but it does not need to be this way. In this particular case, we want to customize identifier extraction from the PhantomTimeSerie resource instance. We created a key type for this, defined here: https://github.com/cloudwan/inventory-manager-example/blob/master/controller/v1/agent_phantom/agent_phantom_key.go.

By default, the identifier of a resource is simply extracted from the name field. However, PhantomTimeSerie is very special in this manner: this resource has a non-predictable name! CreatePhantomTimeSerie requests must not specify a name; it is assigned by the server during creation. This has nothing to do with the controller; it is part of the PhantomTimeSerie spec in monitoring. For this reason, we are extracting fields that we know will be unique. Since we know that for a given ReaderAgent we will generate only one “Online” metric, we use just the agent name extracted from metadata, along with the metric type value. This customized syncer will then match desired and observed PhantomTimeSerie resources using those custom IDs.
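
As a rough illustration, such a key could look like the sketch below. This is not the actual code from agent_phantom_key.go; the type and example values are hypothetical stand-ins.

// phantomTimeSerieKey stands in for the custom identifier described above:
// it combines the owning agent name (taken from metadata.ownerReferences)
// with metric.type, which together uniquely identify a desired series.
type phantomTimeSerieKey struct {
    agentName  string // hypothetical, e.g. extracted from metadata.ownerReferences
    metricType string // hypothetical "online" metric type
}

// String renders the key so desired and observed resources can be matched.
func (k phantomTimeSerieKey) String() string {
    return k.agentName + "#" + k.metricType
}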

The connection tracker defined here: https://github.com/cloudwan/inventory-manager-example/blob/master/controller/v1/agent_phantom/connection_tracker.go shows an example of a controller task that was not anticipated by the controller framework. It runs on a separate goroutine; however, OnReaderAgentSet and OnReaderAgentDeleted are called by the processor runner, the main goroutine of the processor. This mandates some protection. Go channels could perhaps have been used, but note that they have limited capacity; if they fill up, the processing thread stalls. Maps with traditional locks are safer in this respect and are often used in SPEKTRA Edge; they solved some issues when there were sudden floods of updates. The benefit of maps is that they can merge multiple updates at once (later updates override earlier ones). With channels, we would need to process every individual element.
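
The pattern looks roughly like this minimal sketch, assuming stand-in types (the real tracker in connection_tracker.go is more involved):

package tracker

import "sync"

// ReaderAgent is a stand-in for the generated resource type.
type ReaderAgent struct {
    Name string
}

// pendingUpdates merges bursts of events: repeated updates for the same
// agent overwrite one pending entry instead of queueing up as in a channel.
type pendingUpdates struct {
    mu     sync.Mutex
    agents map[string]*ReaderAgent
}

// OnReaderAgentSet is called from the processor runner goroutine.
func (p *pendingUpdates) OnReaderAgentSet(agent *ReaderAgent) {
    p.mu.Lock()
    defer p.mu.Unlock()
    p.agents[agent.Name] = agent // a newer update simply overrides the older one
}

// drain is called from the tracker goroutine to take the merged batch.
func (p *pendingUpdates) drain() map[string]*ReaderAgent {
    p.mu.Lock()
    defer p.mu.Unlock()
    batch := p.agents
    p.agents = map[string]*ReaderAgent{}
    return batch
}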

Going back to the implementation: as we said, we are ensuring monitoring has a time series per agent, showing whether the agent was online or offline at a given time point. However, to synchronize the “online” flag, we periodically query monitoring for the time series of all agents, then flip the flags that mismatch the desired value.

Let’s move forward, to files for the transformer. The generated one can be found here: https://github.com/cloudwan/inventory-manager-example/blob/master/controller/v1/agent_phantom/agent_phantom_transformer.pb.go.

Notes:

  • The interface we need to notice is AgentPhantomTransformerImpl; this one needs an implementation from us.
  • The config structure AgentPhantomTransformerConfig needs to be provided by us.
  • The transformer code already handles all events related to input resources, including deletions. This reduces the required interface of AgentPhantomTransformerImpl to a minimum: we just need to compute the desired resources for a given input.
  • Note that this config and impl are provided by your customizer implementation for the processor.

The file with the implementation for the transformer is here: https://github.com/cloudwan/inventory-manager-example/blob/master/controller/v1/agent_phantom/agent_phantom_transformer.go.

Notes:

  • For a single ReaderAgent, we may potentially have N output resources (PhantomTimeSerie here).
  • When we create a DESIRED PhantomTimeSerie, note that we provide only the parent part of the name field: when we call NewNameBuilder, we are NOT calling SetId. As part of the PhantomTimeSerie spec, we can only provide parent names, never our own ID; it must be generated by the server. Note how this combines with the custom PhantomTimeSerie Syncer handlers, where we extract the ID from metadata.ownerReferences and metric.type.
  • PhantomTimeSerie is constructed with service ownership info and the ReaderAgent owner. This ensures that we, not another service, will own this resource. Metadata also ensures PhantomTimeSeries will be cleaned up (this is an additional cleanup guarantee, as a transformer of the owner/ownee type provides the same functionality).

To summarize, when implementing a Processor, it is necessary to (at a minimum):

  • Provide all transformer implementations and define their configs.
  • Provide an implementation of the processor customizer; at a minimum, it needs to provide objects for syncers and transformers.

We need to provide missing implementations not just for processors, but for nodes too. You can typically find three code-generated files for nodes (as in the InventoryManager example).

The main file to review is the one ending with pb.node.go. In the constructor, it creates watcher instances for all scope and input resources of all processors. It manages a set of processors per project (in this case we have one processor, but more could be available). All updates from watchers are distributed to the relevant processors. It is quite a big file; initially, you may just remember that this component watches all collections in real time and pushes updates to Processors, so they can react. However, at the top of the file, there are four types you need to see:

  • InventoryManagerCtrlFieldMasks

    Its generic name is <NodeName>FieldMasks.

  • InventoryManagerCtrlFilters

    Its generic name is <NodeName>Filters.

  • InventoryManagerCtrlCleaner

    Its generic name is <NodeName>Cleaner.

  • InventoryManagerCtrlNodeCustomizer

    Its generic name is <NodeName>NodeCustomizer.

Of these types, the most important for developers is NodeCustomizer. Developers should implement its functions:

  • Filters() needs to return filters for all input resources (from all processors) and scope resources. This is important: the controller should only know about the resources it needs to know!
  • FieldMasks() needs to return field masks for all input resources (from all processors) and scope resources. It is very beneficial to return only the fields the controller needs, especially considering that the controller keeps those objects in RAM! However, be careful to include all needed fields: those needed by dispatchment (typically name), those needed by transformers and reconciliation masks, and all fields required by syncers (update masks!).
  • The function GetScopeIdentifierForPhantomTimeSerie (or GetScopeIdentifierFor<Resource>) was generated because, in the protobuf dispatchment annotation for PhantomTimeSerie, we declared that the identifier uses the CUSTOM method!
  • The function CustomizedCleaner should return a cleaner that handles orphaned resources in case the Scope resource (Project here) is deleted while some child resources still exist. However, in 99.99% of cases this functionality is not needed: when the Project is deleted, all child resources are cleaned up asynchronously by the db-controller.
  • The function AgentPhantomProcessorCustomizer must return a customizer for each Processor and scope object.

Developers need to implement the customizer; for the inventory manager, we have the file: https://github.com/cloudwan/inventory-manager-example/blob/master/controller/v1/inventory_manager_ctrl_node.go.

Notes:

  • For GetScopeIdentifierForPhantomTimeSerie, we need to return the name object of the Inventory Manager project. Deriving it from the name of a PhantomTimeSerie is very easy though. We may add some autodetection in the future: if the name pattern matches across resources, the developer won’t need to provide those simple functions.
  • In the FieldMasks call, the mask for ReaderAgent needs to be checked against all fields used in the processor: the reconciliation mask and the connection tracker. The name should always be included.
  • In the Filters call, we need to consider a couple of things:
    • We may have a multi-region env, and each region has its own controllers. Typically, for regional resources, we should get those belonging to our region (ReaderAgents or PhantomTimeSeries). For Projects, we should get those that can be in our region, so we filter by enabled regions.
    • Resources from core SPEKTRA Edge services should be filtered by our service, ideally by owned ones. Otherwise, we would get PermissionDenied.
  • The customizer construction for each processor should be straightforward. Any extra params were provided upfront and passed to the node customizer.

To see a final top-level implementation bit for the business logic controller for InventoryManager, see https://github.com/cloudwan/inventory-manager-example/blob/master/cmd/inventorymanagercontroller/main.go.

Find the NewInventoryManagerCtrlNodeManager call; this is how we construct the node manager and pass our node customizer to it. This concludes the example.

Applications controller implementation

Controller implementation for applications can be found here: https://github.com/cloudwan/edgelq/tree/main/applications/controller/v1.

It contains additional elements compared to the Inventory Manager, so let’s go through it, skipping the parts it has in common with the previous example.

Starting from the processor: as described in the protobuf part, we have essentially two transformers and two syncers, for two different sub-tasks of general pod processing.

Let’s start with the transformer called DistributionController. For a quick recap: this transformer produces Pods based on combined Device and Distribution resources; each matched Device + Distribution pair should produce a Pod, called DistributionPod in protobuf. Not all pods belong to a Distribution though! Some may be deployed manually by clients.

You should start examining code from the processor customizer (link above).

In the function DistributionControllerConfigAndImpl of the customizer, we are creating a config that reacts to specific field path changes in Distributions and Devices. At least for now, a distribution is matched with a device based solely on the metadata.labels field path in Device, so this is what we check in Device. For Distribution, we want to recompute pods if the selector or pod template changes; other updates to Distributions should not trigger Pod re-computation! Also, note that the implementation object can hold Store instances, so we can access the current state. This will be necessary.

In the transformer, https://github.com/cloudwan/edgelq/blob/main/applications/controller/v1/pods/distribution_controller.go, there are additional things to learn. Since this transformer is of the meta-ownee type BUT has two owners, we must implement two functions: one computing Pods for a Distribution across all matched devices, the other computing Pods for a Device across Distributions. Note that each DESIRED generated pod does not have a clear static name value; GenerateResourceIdFromElements may have non-deterministic elements. We will need to reflect this when configuring the Syncer for the DistributionPod type.

This is how pods are generated. To continue analyzing the behavior of Distribution pods, go back to the customizer for the processor and find the function PodFactoryConfigAndHandlers. The config with an update mask seems ordinary, but there is an example of limits integration. First, we construct the default handlers. Then, we attach the limits guard. Note that Pods are subject to limits! There is a possibility that we fit within the plan with Devices/Distributions, but would exceed it with Pods.

In such a situation, the Syncer must:

  • Stop executing CreatePod if we hit a limit.
  • UpdatePod should continue being executed as normal.
  • DeletePod should be executed as normal.

Once we have free limit again (as a result of a plan change or other pods being deleted), creation should resume. The limits guard is a component that must be used whenever we may be creating resources covered by a limits plan! Note also that in the PostInit call, we must additionally configure the limits guard.

For the pod factory syncer, we also provided some other syncer customizations:

  • Identifiers of pods have a custom implementation, since the pod name may be non-deterministic.
  • We must be very careful about what we delete! Note that in the protobuf section for the PodFactory Syncer, the desired state takes pods from the DistributionController transformer, but the observed state contains ALL pods! To prevent wrong deletions, we must provide an additional CanProceedWithDeletion handler.

Let’s move on to the next transformer and syncer, handling unknown-state pods. As a recap: the controller must mark pods whose device went offline with the UNKNOWN status. The set of unknown pods is a collection of its own (UnknownStatePod). When the Device gets back online, we will need to remove the pods belonging to it. We want to recompute snapshots of unknown-state pods periodically - this is what we declared in protobuf.

Starting with the transformer: the function UnknownStateTrackerConfigAndImpl, found in the Customizer implementation for PodsProcessor, is used to customize it. Note that the config object now has a SnapshotTimeout variable. This timeout decides how often the desired collection is re-computed (in this case!). Recall that we declared this transformer as the periodic-snapshot generic type.

See the generated transformer file and the handwritten customization.

From the PB file, note that the minimum required implementation is CalculateSnapshot, which is called periodically as instructed.

However, if you examine it carefully, you can notice code like this:

onDeviceSetExec, ok := t.Implementation.(interface {
    OnDeviceSet(
        ctx context.Context,
        current, previous *device.Device,
    ) *UnknownStateTrackerDiffActions
})
if ok {
    t.ProcessActions(
        ctx,
        onDeviceSetExec.OnDeviceSet(ctx, current, previous),
    )
}

Basically, all generic transformers allow additional optional interfaces for implementations: generally, On<Resource>Set and On<Resource>Deleted calls for each input resource. Those allow us to update desired collections much faster!

There is also an additional benefit of implementing those optional methods:

  • For generic transformers without periodic snapshots, this avoids the CalculateSnapshot call entirely. In regular generic transformers, if the implementation does not implement the On<Resource><SetOrDeleted> call, the snapshot is triggered with a delay specified by the SnapshotTimeout variable (different behavior than the periodic-snapshot type!). To avoid extra CPU work, it is recommended to implement the optional methods.

For this particular transformer, in the file https://github.com/cloudwan/edgelq/blob/main/applications/controller/v1/pods/unknown_state_tracker.go, we implemented basic snapshot computation, where we get all pods with unknown statuses based on the last heartbeat from devices. However, we also implemented OnDeviceSet and OnDeviceDeleted. The set variant is especially important: when the Device comes back online, we want to remove its pods from the desired unknown-state collection ASAP. If we waited for the timeout (more than 10 seconds), there is a possibility the Device would mark pods online while our controller still marks them unknown until the timeout happens. This mechanism may be improved in the future; even now we risk two to three unnecessary additional updates.
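
Conceptually, the fast path behaves like the sketch below (stand-in types only, not the actual edgelq code):

// Device is a stand-in for the generated devices.edgelq.com resource.
type Device struct {
    Name   string
    Online bool
}

// unknownStatePods is a stand-in desired collection: device name -> pod names.
type unknownStatePods map[string][]string

// OnDeviceSet removes the pods of a recovered device from the desired
// unknown-state collection immediately, instead of waiting for the next
// periodic CalculateSnapshot.
func (u unknownStatePods) OnDeviceSet(current, previous *Device) {
    if current.Online && (previous == nil || !previous.Online) {
        delete(u, current.Name)
    }
}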

Going back to the customizer (file pods_processor.go), finally see UnknownStateMaintainerConfigAndHandlers. We are again using a Syncer for pods, but it is a separate instance with a different update mask. We want to control only specific fields, related to the status. Note that, as with Distribution pods, the observed state contains ALL pods, but the desired state contains only those with unknown status. To avoid bad deletions, we are disabling deletions entirely - and creations too, as we don’t need them.

We can now leave the processor and examine the node customizer, which can be seen here: https://github.com/cloudwan/edgelq/blob/main/applications/controller/v1/applications_ctrl_node.go.

It is very similar to the customizer for the Inventory Manager, with some additional points:

  • Note we are passing a limits observer instance for the limits guard integration. We will return to it shortly.
  • For the Filters() call, note that the Distribution resource is non-regional; in fact, its instances are copied to all regions where the project is present in a multi-region environment. In those situations, we should filter by the metadata.syncing.regions field. This returns all distributions for all projects enabled in our region, which is exactly what we need.

For the limits guard integration, also see the controller main.go file: https://github.com/cloudwan/edgelq/blob/main/applications/cmd/applicationscontroller/main.go.

Note that we construct the limits observer there:

limitsObserver := v1ctrllg.NewLimitTrackersObserver(ctx, envRegistry)

We also need to run it:

g.Go(func() error {
    return limitsObserver.Run(gctx)
})

The limits observer instance should be global for the whole NodesManager and be declared before it, in main!

Scaling considerations

Note that processors, to work properly, need to have:

  • Scope object (Project typically)
  • All input collections (snapshot of each of them)
  • All desired collections

In the case of multiple processors, input collections may be shared, but that’s not the point here. The point is that the controller needs sufficient RAM, at least for now. This may be improved in the future, for example with disk usage. It won’t change the fact that a Controller node needs to handle all assigned scope objects and their collections.

First, to minimize the memory footprint, provide field masks for all collections, but be careful to include all necessary paths; there have been bugs caused by missing values! Then we need some horizontal scaling. For this, we use sharding.

Sharding divides resources into groups. Note that in both examples we used byProjectId, declared explicitly in the protobuf files for the Inventory Manager and Applications controllers. This project ID sharding means that each Node instance gets a share of projects, not all of them. If we have only one Node, it contains data for all projects. But if we have multiple Nodes, projects are spread across them. Sharding by project also guarantees that resources belonging to one project always belong to the same shard; this is why it is called byProjectId. For each resource, we take its name, extract the project ID part from it, and hash it. The hash is some large integer value, like int64. We need to know how big the ring is: 16, 256, 4096… For each ring, we take the hash modulo the ring size and get the shard number. For example, the byProjectId hash mod 16 gives us the byProjectIdMod16 shard key. Those values are saved in metadata.shards for each resource. This is done by sharding store plugins on the server side. Note that the field metadata.shards is a map<string, int64>. See https://github.com/cloudwan/goten/blob/main/types/meta.proto.
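
The computation is roughly like the runnable sketch below; the project ID extraction and the FNV hash here are illustrative assumptions, as the actual store plugin may use a different hash function:

package main

import (
    "fmt"
    "hash/fnv"
    "strings"
)

// projectIdOf extracts the project ID from a resource name such as
// "projects/tenant1/readerAgents/agent-7" (illustrative helper).
func projectIdOf(name string) string {
    parts := strings.Split(name, "/")
    for i := 0; i+1 < len(parts); i += 2 {
        if parts[i] == "projects" {
            return parts[i+1]
        }
    }
    return ""
}

func main() {
    const ringSize = 16
    h := fnv.New64a()
    h.Write([]byte(projectIdOf("projects/tenant1/readerAgents/agent-7")))
    // Every resource of project "tenant1" hashes to the same shard, stored
    // as metadata.shards["byProjectIdMod16"] on the server side.
    shard := h.Sum64() % ringSize
    fmt.Printf("byProjectIdMod%d = %d\n", ringSize, shard)
}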

The ring size we use everywhere is 16 now, meaning we could potentially divide work across 16 nodes for all controller nodes.

When the first node starts, it gets assigned all 16 shards, values from 0 to 15. If a second node starts, it gets some random starting point, say shards 10-15, while the first node keeps 0-9. When a third node starts, it grabs another random range, like 0-3, and the remaining nodes are left with 4-9 and 10-15. This can continue until we are blocked by the ring size and scaling is no longer effective.

Note that when a node starts, it can lower the pressure on at most two nodes, not all of them. For this reason, we have Node Managers in all controllers, in all examples; we build node managers in the main.go files in the first place. Node managers start with one to four virtual node instances, the most common being two. This way, when a new runtime starts, we have a good chance of taking pressure off more instances.

Node managers are responsible for communicating with each other and assigning shards to their nodes. As of now, we use a Redis instance for this purpose. If you examined the generated files for nodes, you could see that each Node has a method for updating the shard range. Shard ranges add additional filter conditions to the filters passed from the node customizer instance.

With the Kubernetes Horizontal Pod Autoscaler we solve some scaling issues by splitting projects across more instances. This gives us some room to breathe. But two issues remain:

  • A super large project could potentially outgrow the controller.
  • Super large shards (lots of projects assigned to the same value) can be too massive.

For the first issue, we can leverage a multi-region env, as we already did in the examples: we get resources mostly from our own region, so large projects can be further split across regions. Still, we may get hit with a large project-region.

For the second issue, we could switch to a larger ring size, like 256. However, it means we would have lots of controller instances, like 20 or more. Controllers also induce their own overhead, meaning that we would waste plenty of resources just on the large number of instances.

The presented techniques still provide us with some flexibility and horizontal scaling. To scale further, we can:

  • Introduce improvements in the framework, so it can compress data, use disk, or even “forget” data and retrieve it on demand.
  • Use diagonal scaling: use horizontal autoscaling first (like in Kubernetes); then, if the number of instances hits some alert (like 4 pods), we can increase the assigned memory in the YAML declaration and redeploy.

Diagonal autoscaling with automation in one axis may be the most efficient, even though it requires a little reaction from the operator: handling the alert and increasing values in the YAML. Note, however, that this simple action also has potential for automation.

3 - Registering your Service to the SPEKTRA Edge platform

How to register your service to the SPEKTRA Edge platform.

While Goten provides a framework for building services, SPEKTRA Edge provides a ready environment with a set of common, pre-defined services. This document describes a selected set of specific registrations needed by the developer; other services can and should typically be used with the standard API approach.

Integration with SPEKTRA Edge is practically enforced/recommended on multiple levels:

  • Your service needs to register itself in meta.goten.com; otherwise it simply can’t work.
  • Your resources model must be organized around the following top resources:
    • meta.goten.com/Service
    • iam.edgelq.com/Organization
    • iam.edgelq.com/Project
      • For multi-tenant services, you need to have your own Project resource in the api-skeleton.
  • You need to follow the authentication & authorization model of iam.edgelq.com.
  • Although you may skip it, it is highly recommended to use audit.edgelq.com to audit the usage of your service, and monitoring.edgelq.com to track its usage. Your service activities in core SPEKTRA Edge are monitored by those services.
  • If a service needs to control the number of resources, limits.edgelq.com is highly recommended.

The above list contains mandatory or highly recommended registrations, but practically all services are at your disposal. SPEKTRA Edge also provides edge devices with their own OS, where you can deploy your agent applications. Hardware and containers are managed using the devices.edgelq.com and applications.edgelq.com services.

An example service with a high level of registration: https://github.com/cloudwan/inventory-manager-example

This provides more insights into how custom services can be integrated with core SPEKTRA Edge services.

Fixtures controller

Before jumping into SPEKTRA Edge registration, let’s cover one element common to all registrations: the fixtures controller.

The fixtures controller is responsible for creating & updating resources in various services that are needed for:

  • Correct operation of a SPEKTRA Edge Service. Example: Service iam.edgelq.com needs a list of permissions from each service, describing what users can do in the given service. If permissions are not maintained in IAM, then SPEKTRA Edge will have trouble helping with Authorization, which would render the Service non-operable. As part of bootstrapping a Service, Permission fixtures must be submitted by the interested Service.

  • Correct operation of a Project or Organization.

    Example: The user who created a given Project/Organization automatically gets an administrator RoleBinding resource in the created Project or Organization. Without it, the creator of a Project/Organization would not be able to access their entity. It would render it non-operable.

Some fixtures are a bit more dynamic. For example, when an existing Project enables some particular service, the Service automatically gets a RoleBinding in the project, which allows the Service to manage its resources associated with it. Without it, the Service would not be able to provide services to the project, rendering it non-operable.

Those cases are handled by the fixtures controller; by convention, the fixtures controller is part of the controller runtime.

Be aware that the fixtures controller does not only keep resources in sync by creating/updating them. It also detects UNNEEDED fixtures: if a resource exists but is not defined, it is deleted. This is necessary to clean up garbage which, in the right conditions, also has the potential to make the Service/Project/Organization non-operable and full of errors.

The fixtures controller works this way: it computes a DESIRED set of resources. Then it uses CRUD to get the observed state, compares it with the desired one, and finally executes a set of Create/Update/Delete calls as necessary. If there is a dynamic change in the desired state, the controller computes & executes a new set of commands. If there is a dynamic change in the observed state, the fixtures controller will attempt to fix it.
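
In pseudo-Go, the reconciliation amounts to something like this sketch (stand-in types; the real framework additionally scopes the comparison by groupName and parent, as described below):

// Fixture is a stand-in for any fixture resource (Role, Permission, ...).
type Fixture struct {
    Name string
    Body string
}

// reconcile compares desired and observed fixtures keyed by resource name
// and returns the CRUD commands to execute.
func reconcile(desired, observed map[string]Fixture) (creates, updates, deletes []Fixture) {
    for name, want := range desired {
        got, ok := observed[name]
        switch {
        case !ok:
            creates = append(creates, want)
        case got != want:
            updates = append(updates, want)
        }
    }
    for name, got := range observed {
        if _, ok := desired[name]; !ok {
            deletes = append(deletes, got) // existing but undefined fixtures are removed
        }
    }
    return creates, updates, deletes
}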

Fixtures are a set of YAML files in the fixtures directory. They are either completely static or templated (they contain <VARIABLE> elements). Templated fixtures are created FOR EACH Project, Organization, or Service (typically, but not limited to these). Those “for each” fixtures provide a source of dynamic updates to the desired state.

Fixtures are built into the controller image during compilation. Then the config file decides the rest, like how variables are resolved. See the basic fixtures controller config in: https://github.com/cloudwan/inventory-manager-example/blob/master/config/controller.proto.

For every fixture resource type, it is necessary to include an access package for the related resources. For example, see https://github.com/cloudwan/inventory-manager-example/blob/master/cmd/inventorymanagercontroller/main.go, and find the imports of the types needed by the fixtures controller! This list must include all resource types the fixtures controller can create OR requires (via the forEach directive in the config file).

During each registration discussion, we will explain various fixtures:

  • For IAM registration, we will define some static fixtures
  • For adding projects, we will show an example of “synchronized collections”: a dynamic fixture example.
  • For Monitoring registration, we will show some static and dynamic (per project) fixtures.
  • For Logging registration, we will show again some per-project fixtures.
  • For Limits, we use plans as static fixtures.

What your service is authorized to do

Your service uses an IAM ServiceAccount which has its own assigned RoleBindings. For any SPEKTRA Edge-based service, you will be allowed to:

  • Do anything in your service namespace: services/{service}/. Your ServiceAccount will be marked as the owner, so you will be able to do anything there. This applies to Service resources in all services, including core SPEKTRA Edge.
  • Do anything in the root scope (/), AS LONG AS the permissions are related to your service. For example, if your ServiceAccount wants to execute some Create for a resource belonging to your service, it will be able to; but not in other services, and especially not in core SPEKTRA Edge services.
  • Do some things in projects/$SERVICE_PROJECT_ID, depending on the role you were assigned when making the initial service reservation, as described in the preparation section.
  • For core SPEKTRA Edge services, have read access to all resources in the root scope (/), as long as they satisfy the following filter condition: metadata.services.allowedServices CONTAINS $YOUR_SERVICE.
  • For core SPEKTRA Edge services, have write access to all resources in the root scope (/), as long as they satisfy the following filter condition: metadata.services.owningService == $YOUR_SERVICE.
  • Create resources in projects that enable your service in their enabled_services field. But for core SPEKTRA Edge services’ resources, you will have to specify metadata.services.owningService = $YOUR_SERVICE.

These rough permissions must be remembered when you start making requests from your service. The limitations are reflected in various examples (for example, when you create a ServiceAccount for a project, you need to specify the proper metadata.services).

IAM registration

Introduction

Service iam.edgelq.com handles all actor collections (Users, ServiceAccounts, Groups), tenants (Organizations, Projects), permission-related resources (Permissions, Roles), and finally binds actors with permissions within tenant scopes (RoleBindings).

The only tenant-type resource not in iam.edgelq.com is Service, which resides in meta.goten.com. It is still treated as a kind of tenant from the IAM point of view.

The primary point of registration between IAM and any SPEKTRA Edge-based service is permission-related. Permissions are generated for all services, for each API method (in any API group). The typical format of a permission name is: services/$SERVICE_NAME/permissions/$COLLECTION_NAME.$ACTION_VERB. If a method has no resource configured in the API skeleton (no opResourceInfo value!), then the permission has a name in this format: services/$SERVICE_NAME/permissions/$ACTION_VERB.

The variable $SERVICE_NAME is naturally a service name in domain format, $COLLECTION_NAME is the lowerPluralCamelJson form of the resource collection (examples: roleBindings, devices…), and finally $ACTION_VERB is the value of the method’s verb in the api-skeleton file. For example, the action CreateRoleBinding operates on the roleBindings collection, the verb is create, and the service where the action is defined is iam.edgelq.com. Therefore, the permission name is services/iam.edgelq.com/permissions/roleBindings.create.
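
As a plain illustration of these naming rules (this helper is not part of any SDK):

package example

import "fmt"

// permissionName composes a permission name according to the format above.
// An empty collection models methods without opResourceInfo.
func permissionName(service, collection, verb string) string {
    if collection == "" {
        return fmt.Sprintf("services/%s/permissions/%s", service, verb)
    }
    return fmt.Sprintf("services/%s/permissions/%s.%s", service, collection, verb)
}

// permissionName("iam.edgelq.com", "roleBindings", "create") returns
// "services/iam.edgelq.com/permissions/roleBindings.create".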

Another popular permission type is the “attach” kind. Even if the permission holder can create/update a resource, if that resource has references to other resources, then authorization must also validate that the actor can create the reference relationship. For example, the caller can create a RoleBinding thanks to the services/iam.edgelq.com/permissions/roleBindings.create permission, but a reference to a Role requires that the holder also has the permission services/iam.edgelq.com/roles!attach.

You should already be familiar with the IAM model from its README.

What is provided during generation

During Service code generation, the IAM protoc plugin analyzes a service and collects all permissions that need to exist. It creates a file in the fixtures directory with the name <service_short_name>.pb.permissions.yaml. Apart from that, it also generates Authorization middleware specifically for your server.

Authorization middleware extracts WHAT for each call:

  • The collection (typically the parent field) for collection-type methods (isCollection = true in the api-skeleton).
  • The resource name (typically the name field) for single-resource, non-collection methods (isCollection and isPlural = false).
  • The resource names (typically the names field) for plural non-collection methods (BatchGet, for example; isCollection is false, isPlural is true).

To get this WHAT, it uses by default the values provided in the API skeleton: the param opResourceInfo.requestPaths in an Action declaration. Note that CRUD methods have implicit built-ins. The middleware gets the authenticated principal from the current context object (associated with the call) and attaches the permission related to the current call. It uses the generic Authorizer component to verify whether the request should pass or be denied.

Minimal registration required from developers

This whole registration is almost out of the box. The minimal elements to provide are:

  • Developers need to create an appropriate main.go file for the server, with Auth-related modules. In the constructor for the main service server handlers, the Authorization middleware must be added to the chain, all according to the InventoryManager example.
  • Developers are highly recommended to write their own role fixtures for their service (a static fixture). Roles are necessary to bind users with permissions and should be well thought out. The Inventory manager has basic roles for users and examples of specifically limited roles for the agent application, with access to clearly defined resources within the tenant project. Although a fixture called <service_short_name>.pb.default.roles.yaml is provided, its roles are very limited and usually a “bad guess”. Usually, we create a file called <service_short_name>_roles.yaml for the manually written ones.
  • Developers must configure at a minimum two fixture files: <service_short_name>_roles.yaml (or <service_short_name>.pb.default.roles.yaml), and <service_short_name>.pb.permissions.yaml.

Fixtures controller registration requires two parts. First, in the main.go file for the controller, it is required to import github.com/cloudwan/edgelq/iam/access/v1/permission and github.com/cloudwan/edgelq/iam/access/v1/role. Those packages contain modules that are imported by the fixtures controller framework provided by Goten/SPEKTRA Edge. The fixtures controller analyzes the YAML files and tries to find the associated types in the global registry; without them, the program will crash.

Second, in the config file of the controller, you need to define the fixture file paths. You can copy-paste them from the inventory manager example, like:

fixtureNodes:
  global:
    manifests:
    - file: "/etc/lqd/fixtures/v1/inventory_manager.pb.permissions.yaml"
      groupName: "inventory-manager.edgelq.com/Permissions/CodeGen"
      parent: "services/inventory-manager.edgelq.com"
    - file: "/etc/lqd/fixtures/v1/inventory_manager_roles.yaml"
      groupName: "inventory-manager.edgelq.com/Roles"
      parent: "services/inventory-manager.edgelq.com"

It will be mentioned in the deployment document, but by convention, the fixtures directory is placed in the /etc/lqd path.

Two notes:

  1. groupName is mandatory and generally should be unique. This helps in case there is more than one fixture file for the same resource type, ensuring they don’t clash. Still, resource names must also be unique.
  2. The parent field is mandatory in this particular case too. Here, the fixtures controller gets a guarantee that all Roles and Permissions have the same parent resource, called exactly services/inventory-manager.edgelq.com (in this case). Note that a Service only has access to scopes it owns. Without this parent value specified, we would get a PermissionDenied error. We would also get a PermissionDenied error if, in the fixture file, we attempted to create a Role or Permission with a different parent.

Using this example, we should clarify yet another thing: the fixtures controller not only creates/updates resources that are defined in the fixtures. It also DELETES those that are not defined within the fixtures. This is why we have groupName and parent. For example, if there was a Role whose groupName is equal to inventory-manager.edgelq.com/Roles, and whose parent is equal to services/inventory-manager.edgelq.com, and it did not exist within the fixture file defined by /etc/lqd/fixtures/v1/inventory_manager_roles.yaml, it WOULD BE DELETED. This is why the groupName and parent params play an important role here, and why we would get PermissionDenied without the parent. The fixtures controller always gets the observed state to compare against the desired one. This observed state is obtained using regular CRUD, and this is why we need to specify a parent for Roles/Permissions; the service will not be authorized if it tries to get resources from ALL services.

So far, we have explained the mandatory part of IAM registration. The first common additional registration, although a very small one, is to declare some actions of a Service public. An example is here: https://github.com/cloudwan/inventory-manager-example/blob/master/fixtures/v1/inventory_manager_role_bindings.yaml

We are granting some public role to all authenticated users, regardless of who they are (as long as they are users of our service). This requires a separate entry in fixtures and an import of the RoleBinding access packages in main.go.

More advanced IAM registration

In this topic, there are two things extra that are offered:

  1. IAM provides a way to OVERRIDE the generated Authorization middleware. Developers can define additional protobuf files with special annotations in their proto/$VERSION directory, which are merged onto the generated/assumed defaults.
  2. Some fields in resources can be considered sensitive from a reading or writing perspective. Developers can define custom IAM permissions that must be owned in order to write to or read from them. Permissions and protected fields can be defined in protobuf files.

Starting with the first part, overriding Authorization defaults: by convention, we create an authorization.proto file along with the others. Some simple examples:

The example service provides a first basic case: to disable Authorization altogether for a given action, you just need to provide the skip_authorization annotation flag for a specific method, in a specific API group. Since this example is a little too simple, the more interesting examples below come from Audit and IAM.

For example, take the ListActivityLogs method:

{
  name : "ListActivityLogs"
  action_checks : [ {
    field_paths : [ "parents" ]
    permission : "activityLogs.list"
  } ]
}

There is an important problem with this particular method: SPEKTRA Edge code-generation supports collection, single-resource, or multi-resource request types. However, in ListActivityLogsRequest we have a plural parents field, because we are enabling users to query multiple collections at once. This is a kind of isPluralCollection type, but such an annotation does not exist in the api-skeleton. However, there is some level of enhancement: we can explicitly tell IAM to use the “parents” field path, and it will authorize all individual paths from this field. If the user lacks access to any one of the parents, they will receive a PermissionDenied error.

There is also the possibility to provide multiple field paths (but only one will be used).

Another interesting example is CreateProject:

{
  name : "CreateProject"
  action_checks : [ {
    field_paths : [ "project.parent_organization" ]
    permission : "projects.create"
  } ]
}

In the api-skeleton, Project and Organization are both “top” resources. Their name patterns are projects/{project} and organizations/{organization}. Judging by these, project creation should require permission on the system level, and the same for organizations. In practice, however, we want projects to be the final tenants and organizations to be intermediaries. Note that the Organization and Project resources have a parent_organization field. Especially for organization resources, it is not possible to specify that the parent of an Organization is “Organization”; the name pattern cannot be like organizations/{organization}/organizations/{organization}/.... Therefore, from a naming perspective, both projects and organizations are considered “top” resources. However, when it comes to creation, the IAM Authorization middleware should make an exception and take the authorization scope object (WHERE) from a different field path; in the case of CreateProject, it must be project.parent_organization. This changes the generated Authorization code for CreateProject, and the permission is required in the parent organization scope instead.

To declare sensitive fields in resources, it is necessary to use the annotations.iam.auth_checks annotation. There are no current examples in InventoryManager, but there are some in secrets.edgelq.com:

As of now, there is:

option (annotations.iam.auth_checks) = {
  read_checks : [
    {permission : "mask_encrypted_data" paths : "enc_data"},
    {permission : "secrets.sensitiveData" paths : "data"}
  ]
};

Note that you also need to include the edgelq/iam/annotations/iam.proto import in the resource proto file.

When a secret is being read, additional permissions may be checked:

  • services/secrets.edgelq.com/permissions/mask_encrypted_data: if denied, the field path enc_data will be cleared from the response object.
  • services/secrets.edgelq.com/permissions/secrets.sensitiveData: if denied, the field path data will be cleared from the response object.

Those read checks apply to all methods whose responses contain resource bodies; therefore, even UpdateSecret or CreateSecret responses would have the fields cleared. However, this is mostly used to clear values from List/Search/Get/BatchGet responses.
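
The effective behavior resembles this stand-in sketch (the Secret type and helper below are illustrative, not the actual secrets.edgelq.com code):

// Secret is a stand-in carrying the two sensitive fields from the example.
type Secret struct {
    EncData map[string][]byte
    Data    map[string][]byte
}

// applyReadChecks clears the field paths whose guarding permission was
// denied, before the resource is returned in any response.
func applyReadChecks(denied map[string]bool, secret *Secret) {
    if denied["mask_encrypted_data"] {
        secret.EncData = nil
    }
    if denied["secrets.sensitiveData"] {
        secret.Data = nil
    }
}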

The set_checks param is just like read_checks, but works in reverse: it guards writing to the listed field paths instead of reading them.

Note that you can specify multiple paths.

Users are generally free to pick any permission name for set/read checks, but it is recommended to follow the secrets.sensitiveData style rather than mask_encrypted_data.

The full document about IAM-related protobuf annotations can be accessed here: https://github.com/cloudwan/edgelq/blob/main/iam/annotations/iam.proto.

Adding projects (tenants) to the service

For multi-tenant cases, it is recommended to copy Project resources from iam.edgelq.com into the 3rd party service. You need a Project resource declared in your api-skeleton. This copying, or syncing, was already mentioned in a few places in the developer guide, as collection synchronization.

A service based on SPEKTRA Edge should copy only those projects which enable that particular service (in the enabled_services list). Note that services based on SPEKTRA Edge can only filter projects/organizations that use those particular services themselves.

Once the project instance copy is in the service database, the project is assumed to be able to use that service. If a project removes the service from its enabled list, its copy is removed from the service database (garbage collection).

An example of this registration is in InventoryManager. Let’s copy and paste part of the config and discuss it:

fixtureNodes:
  global:
    manifests:
    - file: "/etc/lqd/fixtures/v1/inventory_manager_project.yaml"
      groupName: "inventory-manager.edgelq.com/Projects"
      createForEach:
      - kind: iam.edgelq.com/Project
        version: v1
        filter: "enabledServices CONTAINS \"services/inventory-manager.edgelq.com\""
        varRef: project
      withVarReplacements:
      - placeholder: <project>
        value: $project.Name.ProjectId
      - placeholder: <multiRegionPolicy>
        value: $project.MultiRegionPolicy
      - placeholder: <metadataLabels>
        value: $project.Metadata.Labels
      - placeholder: <metadataServices>
        value: $project.Metadata.Services
      - placeholder: <title>
        value: $project.Title

As always, we need to provide the file and groupName variables. Note that the resource we are creating in this fixture belongs to our service: inventory-manager.edgelq.com/Project. Because it is ours, the service does not need an additional parent or filter to be authorized correctly, so those parameters are not necessary here.

We have some new elements though; the first is the createForEach directive. It instructs the controller to create the fixtures defined in the mentioned file for each combination of input resources. In this case, we have one input resource, of type iam.edgelq.com/Project, in version v1. Our service cannot list all IAM projects, but it can list those that enable our service; therefore, we are passing the proper filter param. Besides, we should create project copies only for projects interested in our service anyway. Each instance of iam.edgelq.com/Project is remembered as the project variable (as indicated by varRef).

When fixtures are evaluated from the file /etc/lqd/fixtures/v1/inventory_manager_project.yaml for each IAM project, all variables are replaced, producing the final YAML. The example above should be relatively self-explanatory. You may note, however, that you can extract IDs from names, and take full objects (fixture variables are not limited to primitives), maps, or slices.

There is, however, one more important aspect: project admins cannot by default add your service to their enabled list. This prevents the attachment of a private service to a project, which may be against the service maintainer’s wishes. To allow someone to create/update a project/organization using your service, you will need to create a RoleBinding:

cuttle iam create role-binding \
  --service $YOUR_SERVICE \
  --role 'services/iam.edgelq.com/service-user' \
  --member $ADMIN_OF_ORGS_AND_PROJECTS

From now on, the provided user can create new organizational entities that use your service.

Audit registration

Overview

SPEKTRA Edge provides a LogsExporter component, which is part of observability. It records selected API calls (unary and streams) and submits them to audit.edgelq.com. All activity or resource change logs are classified as service, organization, or project scoped. Out of these three, service scope is the default, used if the method call was classified as neither project nor organization.

Scope classification is relatively simple: when a unary request arrives, the logs exporter analyzes the request, extracts the resource name(s) and collection, and decides what the scope of the request is (project, organization, or service). Resource change logs are submitted just before the transaction is concluded; if the logs could not be sent, the transaction fails. This ensures that we always track resource change logs at least. Activity logs are submitted within seconds after the request finishes, which allows for some degree of message loss. In practice, it does not happen often.
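
Conceptually, the scope decision boils down to something like this (illustrative only, not the exporter’s actual code):

package example

import "strings"

// auditScope derives the log scope from the resource name in the request,
// falling back to service scope as described above.
func auditScope(resourceName string) string {
    switch {
    case strings.HasPrefix(resourceName, "projects/"):
        return "project"
    case strings.HasPrefix(resourceName, "organizations/"):
        return "organization"
    default:
        return "service"
    }
}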

For streams, Audit examines the client and server messages before deciding what the activity logs should look like.

Resource change logs are submitted based on the transaction lifespan, regardless of the gRPC method streaming kind.

Minimal registration

Audit requires minimal effort from developers to include in its default form. You just need to put a little initialization in the main.go file for the server runtime, as in the example InventoryManager service. You can see it in https://github.com/cloudwan/inventory-manager-example/blob/master/cmd/inventorymanagerserver/main.go.

Find the following strings:

  • NewAuditStorePlugin is necessary to add to the store handle. It is a plugin that observes changes in the DB.
  • InitAuditing is necessary to initialize the Audit Logs exporter that your server will use. You need to pass all the relevant (code-generated) handlers.

Audit handlers are code-generated based on method annotations (therefore, the api-skeleton normally decides). There are the following defaults:

  • The api-skeleton annotations opResourceInfo.requestPaths and opResourceInfo.responsePaths are used to determine which field paths in request/response objects contain values that are interesting from an Audit point of view.
  • Audit by default focuses on auditing all writing calls. It checks the api-skeleton annotation withStoreHandle in each action. If the transaction type is SNAPSHOT or MANUAL, then the call is logged; otherwise it is not.
  • By default, activity log types are always classified as some kind of write. Other kinds require manual configuration.

From this point on, Audit will work, and service developers will be able to query for logs from their service. Let’s discuss the list of possible customizations.

Customizations on the proto files level

The full proto customizations can be found here: https://github.com/cloudwan/edgelq/blob/main/audit/annotations/audit.proto. You will need to include the edgelq/audit/annotations/audit.proto import to use any audit annotations.

The most common customization is the categorization of write activities for a resource. Activity logs have the following categories: Operations, Creations, Deletions, Spec Updates, State Updates, Meta Updates, Internal, Rejected, Client and Server errors, and Reads.

Note that there are quite a few write categories: creations, deletions, and three different update kinds. Creations and deletions are easy to classify, but updates are not so much. When a resource is updated, the Audit Logs exporter compares the old and new objects and determines which fields changed and which did not. To determine the update kind, it needs to know which fields are related to spec, which to state, and which to meta. This has to be defined within the resource protobuf definition.

It is like the following:

message ResourceName {
  option (ntt.annotations.audit.fields) = {
    spec_fields : [
      "name",
      "spec_field",
      "other_spec_field"
    ]
    state_fields : [ "some_state_field" ]
    meta_fields : [ "metadata", "other_meta" ]
    hidden_fields : [ "sensitive_field", "too_big_field" ]
  };
}

We must classify all fields. Normally, we put “name” as a spec field and “metadata” as a meta field. Other choices are up to the developer. On top of spec/state/meta, we can also hide some fields from Audit entirely (especially if they are sensitive, or big, and we want to minimize log sizes).

Note that hidden_fields can also be defined for any message, including request/response objects. An example from SPEKTRA Edge: https://github.com/cloudwan/edgelq/blob/main/common/api/credentials.proto. See the annotations for ServiceAccount; we are hiding private key objects, for example, as they would be too sensitive to include in Audit logs. Be aware of what is being logged!

You can define field specifications on the resource level, or any nested object too.

Going back to update requests: spec updates take precedence over state, and state over meta. Therefore, if we detect an update that modifies one meta, two state, and one spec field, the update is classified as a spec update.
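
As a plain illustration of this precedence (stand-in code, not the exporter’s actual API):

// classifyUpdate applies the spec > state > meta precedence to the sets of
// changed fields determined by comparing the old and new objects.
func classifyUpdate(changedSpec, changedState, changedMeta bool) string {
    switch {
    case changedSpec:
        return "SpecUpdate"
    case changedState:
        return "StateUpdate"
    case changedMeta:
        return "MetaUpdate"
    default:
        return "NoChange" // illustrative: nothing relevant changed
    }
}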

Another customization developers may find useful is the ability to attach labels to activity/resource change logs. Those logs can be queried (filtered) by service, method name, API version, resource name/type on which the method operates (or which changed), category, request ID… However, you may notice that resource change and activity logs also have a “labels” field, which is a generic map of strings. This can hold any labels defined by developers. The most common way is to define labels in request/response objects:

message ActionNameRequest {
  option (ntt.annotations.audit.fields) = {
    labels : [
      { path : "field_a", key: "label_a" },
      { path : "field_b", key: "label_b" }
    ]
    
    promoted_labels : [
      { label_keys : [ "label_a" ] }
    ]
  };
  
  string field_a = 1;
  
  string field_b = 2;
}

With this, you can start querying Activity logs like:

{parents: ["projects/tenant1"], filter: "service.name = \"custom.edgelq.com\" AND labels.field_a = \"V1\""}

The query above will also be optimized (an index will be created, according to the promoted_labels value).

Note that each promoted label set also requires the service name and parent to be indexed!

Apart from field customization, developers can customize how the Audit Logs Exporter handles method calls. We typically create the file auditing.proto in the proto/$VERSION directory for a given service. There, we declare the file-level annotation ntt.annotations.audit.service_audit_customizations.

Examples in SPEKTRA Edge:

Starting with the devices service: take, for example, the ProvisioningPolicyService method ProvisionDeviceViaPolicy. As of now, we have annotations like:

{
  name : "ProvisionDeviceViaPolicy"
  activity_type : WriteType
  response_resource_field_paths : [ "device.name" ]
}

Method ProvisionDeviceViaPolicy has in api-skeleton:

actions:
- name: ProvisionDeviceViaPolicy
  verb: provision_device_via_policy
  withStoreHandle:
    readOnly: false
    transaction: SNAPSHOT

By default, opResourceInfo has these values for the action:

opResourceInfo:
  name: ProvisioningPolicy # Because this action is defined for this resource!
  isCollection: false      # Default is false
  isPlural: false          # Default is false
  # For single resource non-collection requests, defaults for paths are determined like below:
  requestPaths:
    resourceName: [ "name" ]
  responsePaths: {}

You can find request/response object definitions in: https://github.com/cloudwan/edgelq/blob/main/devices/proto/v1/provisioning_policy_custom.proto

This method primarily operates on the ProvisioningPolicy resource, and the exact resource can be extracted from the “name” field in the request. By default, Audit would decide that the primary resource for Activity logs for these calls is ProvisioningPolicy. The following Audit specification would be implicitly assumed:

{
  name : "ProvisionDeviceViaPolicy"
  activity_type : WriteType                 # Because withStoreHandle api-skeleton annotation tells it is a SNAPSHOT
  request_resource_field_paths : [ "name" ] # Because this is what requestPaths api-skeleton annotation tells us.
}

However, we know that this method takes the ProvisioningPolicy object but creates a Device resource, and the response object contains the Device instance. To ensure that the field resource.name in Activity logs points to a Device, not a ProvisioningPolicy, we specify that response_resource_field_paths should point to device.name.

To still be able to query Activity logs by ProvisioningPolicy, we also attach an annotation to the request object:

option (annotations.audit.fields) = {
  labels : [ {key : "provisioning_policy_name" path : "name"} ]
};

This is one example of modifying the default behavior.

We can also disable auditing for particular methods entirely. Again, in auditing.proto for the Devices service you may see:

{
  name : "DeviceService"
  methods : [ {name : "UpdateDevice" disable_logging : true} ]
},

The reason, in this case, is that, as of now, all devices send UpdateDevice every minute. To avoid too many requests to Audit, we have disabled this for now, until a solution is found (perhaps you already don’t see this part in the auditing file for devices).

In the auditing.proto file for the Proxies service ( https://github.com/cloudwan/edgelq/blob/main/proxies/proto/v1/auditing.proto ), you may see something different too:

{
  name : "BrokerService"
  methods : [
    {name : "Connect" activity_type : OperationType},
    {name : "Listen" activity_type : OperationType}
  ]
}

In the Broker API in the api-skeleton, you can see that Connect and Listen are streaming calls; Listen is used by an Edge agent to provide access to other actors, and Connect is used by an actor to connect to an Edge agent. Those calls are non-writing and therefore would not be audited by default. To force auditing, and to classify them as the Operation kind, we specify this directly in the auditing file.

A final example worth seeing is the auditing file for monitoring: https://github.com/cloudwan/edgelq/blob/main/monitoring/proto/v4/auditing.proto.

First, you can see that we classify some resources as INTERNAL types, like RecoveryStoreShardingInfo. It means that any writes to these resources are not classified as writes, but as “internal”. This changes the category in Activity logs, making them easier to filter out. Finally, we enable read auditing for the ListTimeSeries call:

{
  name : "TimeSerieService"
  methods : [ {
    name : "ListTimeSeries"
    scope_field_paths : [ "parent" ]
    activity_type : ReadType
    disable_logging : false
  } ]
}

Before finishing, it is worth mentioning that we have some extra customizations in the code for ListTimeSeries calls.

Customizations of Audit in Golang code

There is a package github.com/cloudwan/edgelq/common/serverenv/auditing with some functions that can be used.

Most common examples can be summarized like this:

package some_server

import (
    "context"

    "google.golang.org/grpc/status"

    "github.com/cloudwan/edgelq/common/serverenv/auditing"
)

func (srv *CustomMiddleware) SomeStreamingCall(
    stream StreamName,
) error {
    ctx := stream.Context()

    firstRequestObject, err := stream.Recv()
    if err != nil {
        return status.Errorf(status.Code(err), "Error receiving first client msg: %s", status.Convert(err).Message())
    }

    // Let's assume the request contains a project ID, but it is somehow encoded AND not available
    // from any field in a straightforward way. Because of this, we cannot provide a protobuf
    // annotation. We can, however, do this from code:
    projectId := firstRequestObject.ExtractProjectId()
    auditing.SetCustomScope(ctx, "projects/"+projectId) // Now we ensure this is where the log parent is.

    // We can also set some custom labels, because these were not available as any direct fields.
    // However, to have it working, we still need to declare the labels in protobuf:
    //
    // message StreamNameRequest {
    //   option (ntt.annotations.audit.fields) = {
    //     labels : [ { key: "custom_label" } ]
    //   };
    // }
    //
    // Note we specify only the key, not the path! But if we do this, we can then call:
    auditing.SetCustomLabel(ctx, "custom_label", firstRequestObject.ComputeSomething())

    // Now, we want to inform the Audit Logs Exporter that this stream is exportable. If we did not
    // do this, then Audit would export Activity logs only AFTER THE STREAM FINISHES (this function
    // exits!). If this stream is long-running (several minutes, or maybe hours), that may not be
    // the best option; it would be better to send Activity logs NOW. However, be aware that you
    // should not call SetCustomLabel or SetCustomScope after exporting the stream - the activity
    // logs are "concluded" and labels can no longer be modified. New activity log events may still
    // be appended for each client and server message though!
    auditing.MarkStreamAsExportable(ctx)

    firstServerMsg := srv.makeFirstResp(stream, firstRequestObject)
    if err = stream.Send(firstServerMsg); err != nil {
        return status.Errorf(status.Code(err), "Error sending first server msg: %s", status.Convert(err).Message())
    }

    // There may be multiple Recv/Send calls here ...

    return nil
}

By default, Activity logs record all client/server messages; each is represented by an Activity Log Event object appended to the existing Activity Log. This may not always be the best choice if objects are large. For example, for ListTimeSeries, which is audited, we don’t need responses: the request object contains elements like filter or parent, so we can predict/check what data was returned from monitoring. In such a case, we can disable recording (also because ListTimeSeriesResponse can be very large!):

func (r *ListTimeSeriesResponse) AuditShouldRecord() bool {
	return false
}

The function AuditShouldRecord can be defined for any request/response object. Audit Logs Exporter will examine if they implement this method to act accordingly.
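For illustration, the same opt-out can be applied to a request object as well; the request type below is hypothetical:

// Hypothetical request type carrying a large payload: the Activity log itself
// is still created, but this message is not recorded as an Activity Log Event.
func (r *UploadImageChunkRequest) AuditShouldRecord() bool {
    return false
}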

We can also sample logs; we do this for ListTimeSeries. Since those methods are executed quite often, we don’t want too many activity logs for them. We implemented the following functions for the request objects:

func (r *ListTimeSeriesRequest) ShouldSample(
	ctx context.Context,
	sampler handlers.Sampler,
) bool {
	return sampler.ShouldSample(ctx, r)
}

func (r *ListTimeSeriesRequest) SamplingKey() string {
	// ... Compute value and return
}

First, we need to implement ShouldSample, which gets the default sampler. If ShouldSample returns true, then the activity is logged. The default sampler requires the object to implement SamplingKey() string. It ensures that “new” requests are logged, rather than requests similar to previous ones (at least until the TTL expires or the cache entry is lost).
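For your own request types, a minimal sketch could look as follows; the request type and the key scheme are purely illustrative (the real ListTimeSeries implementation computes its own value):

func (r *QueryStatsRequest) ShouldSample(
    ctx context.Context,
    sampler handlers.Sampler,
) bool {
    // Delegate the decision to the default sampler provided by Audit.
    return sampler.ShouldSample(ctx, r)
}

func (r *QueryStatsRequest) SamplingKey() string {
    // Requests sharing the same parent and filter are treated as "similar"
    // and share one sampling bucket.
    return r.GetParent() + "|" + r.GetFilter()
}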

Also, if some streaming calls are heavy (like downloading a multi-GB image), make sure these requests/responses are not logged at all! Otherwise, Audit storage may grow excessively.

Monitoring registration (and usage notes)

Monitoring is a bit simpler than IAM or Audit. Unlike them, it does not integrate on the protobuf level and does not inject any code. The common registration is via metric/resource descriptors, followed by periodic time series submission.

It is up to the service to decide if there is a need for numeric time series data with aggregations. If there is, then service developers need to:

  • Declare MonitoredResourceDescriptor instances via a fixtures file. Those resources are defined for the whole service.
  • Declare MetricDescriptor instances via a fixtures file. Those resources must be created for each project using the service.

With descriptors created by the fixtures controller, clients can start submitting time series via CreateTimeSeries calls. It is recommended to use the cached client from Monitoring: https://github.com/cloudwan/edgelq/blob/main/monitoring/metrics_client/v4/tsh_cached_client.go

This is typically used by agents running on edge devices; it is the responsibility of service developers to create the relevant code. The InventoryManager example is a good reference.

Fixture files for this example service can be found in the fixtures directory of the example repository.

Notes:

  • For MetricDescriptors, it is mandatory to provide a value for metadata.services. The reason is that a project is a separate entity from a Service and can enable/disable the services it uses. Given limited access, the service should declare ownership of the metric descriptors it creates in a project.
  • As of now, in this example, the fixtures controller will forbid modifications of MetricDescriptors by project admins; for example, if they add some label or index, the changes will be reverted to reflect the fixtures. However, in the future, we plan to give some flexibility to mix user changes with fixtures. This can enable use cases like additional indices that are usable for specific projects only, allowing per-tenant customizations. This is a good reason to keep MetricDescriptors defined per project rather than per service.
  • Because metric descriptors are created for each project, we call them dynamic fixtures.

The main.go file for a controller will need to import the relevant Go packages from Monitoring. An example is in https://github.com/cloudwan/inventory-manager-example/blob/master/cmd/inventorymanagercontroller/main.go.

Packages needed are:

  • github.com/cloudwan/edgelq/monitoring/access/v4/metric_descriptor
  • github.com/cloudwan/edgelq/monitoring/access/v4/monitored_resource_descriptor

In this config file ( https://github.com/cloudwan/inventory-manager-example/blob/master/deployment/controller-config.yaml ) we can find the usage of these two fixture files. Note that MonitoredResourceDescriptor instances are declared with a parent. This, again like in IAM registration, ensures that the fixtures controller only gets the observed state from this particular sub-collection. MetricDescriptor resources don’t specify the parent field (we have multiple projects!). Therefore, we must provide a different mechanism to scope the metric descriptors we observe. We do this with the filter param: we filter by the metadata.services.owningService value. This way we are guaranteed to see only the resources we have write access to.

Another notable element for MetricDescriptors is how we filter input projects:

createForEach:
- kind: inventory-manager.edgelq.com/Project
  version: v1
  filter: multiRegionPolicy.defaultControlRegion="$myRegionId"
  varRef: project

First, we use inventory-manager.edgelq.com/Project instances, not iam.edgelq.com/Project. This way we can be sure we don’t get PermissionDenied (it is our service, after all), and we can skip the enabledServices CONTAINS filter.

Another notable element is the filter: we get projects from our region only. It is recommended to create per-project fixtures this way in a multi-region env. If our service is in many regions, then each region will take its share of projects.

The last element is where the variable $myRegionId comes from. It is defined in the main.go file for the controller; take a look at the example: https://github.com/cloudwan/inventory-manager-example/blob/master/cmd/inventorymanagercontroller/main.go.

In the versioned constructor, you can find the following:

vars := map[string]interface{}{
    "myRegionId": envRegistry.MyRegionId(),
}

This is an example of passing some custom variables to the fixture controller.

Some simplified examples of a client submitting metrics can be found in the function keepSendingConnectivityMetrics: https://github.com/cloudwan/inventory-manager-example/blob/master/cmd/simple-agent-simulator/agent.go

Usage registration

Service monitoring.edgelq.com, apart from offering the optional registration described above, has some specific built-in registration already. We talk here about usage metrics:

  • Number of open calls being currently processed and not concluded (more useful for long-running streams!)
  • Request and response byte sizes (uncompressed protobufs)
  • Call durations, in the form of Distributions, to catch all individual values.
  • Database read and write counts.
  • Database resource counters (but these are limited only to those tracked by Limits service).

SPEKTRA Edge platform creates metric descriptors for each service separately in a dedicated fixture file. Resource descriptors are also defined per service. This way, we can have separate resource types like:

  • custom.edgelq.com/server
  • another.edgelq.com/server
  • etc.

From these fixtures, you can learn what metrics your backend service will be submitting to monitoring.edgelq.com.

Notable things:

  • All usage metrics go to your service project, where the service belongs (along with its ServiceAccount).
  • To track usage by each tenant project, all metric descriptors have a user_project_id label. This will contain the project ID (without the projects/ prefix) for which a call is accounted for.
  • The user_project_id label for a call is computed based on the requestPaths fields in the request objects!

To ensure the backend sends usage metrics, it is necessary to include this in the main.go file. For example, for Inventory Manager, in the server main.go we have an InitServerUsageReporter call; find it in https://github.com/cloudwan/inventory-manager-example/blob/master/cmd/inventorymanagerserver/main.go. When constructing a store, you also need to add a store and cache plugin, NewUsageStorePlugin. You can grep for this string in the main.go file as well.

This describes the minimum registration needed from the developer.

There is some coding customization available though: it is possible to customize how user_project_id is extracted. By default, the usage component uses auto-generated method descriptors (in client packages), which are generated based on requestPaths in API skeletons. It is possible to customize this by implementing additional functions on the generated objects. An example can be found here: https://github.com/cloudwan/edgelq/blob/main/monitoring/client/v4/time_serie/time_serie_service_descriptors.go.

For a client msg handle, we can define the UsageOverrideExtractUserProjectIds function, which extracts from a request object the project ID to which usage is attributed. When possible, however, it is better to stick to the defaults from the api-skeleton.

Logging registration

Logging registration is another optional one and is even simpler than monitoring. It is recommended to use logging.edgelq.com if there is a need for non-numerical, time-series-like data (logs).

The service developer needs to:

  • Define fixtures with LogDescriptor instances to be created for each project (optionally for a service or organization). Defining them per project may enable some per-project customizations in the future.
  • The main.go file for the controller will need, traditionally, the relevant Go package (now it is github.com/cloudwan/edgelq/logging/access/v1/log_descriptor).
  • Complete configuration of fixtures in controller config.
  • Use logging API from Edge agent runtime (or even any runtime if they want/need it, edge agents are just the most typical).

In InventoryManager we have an example; it is similar to the monitoring one, but simpler.

Limits registration

Service limits.edgelq.com allows limiting the number of resources that can be created in a Project, to avoid system overload, or because of contractual agreements.

Limitations:

  • Only resources under projects can be limited
  • Limit object is created per unique combination of Project, Region, and Resource type.

Therefore, when integrating with limits, it is highly recommended (again) to work primarily with Projects, and to model resources keeping in mind that only their total count (in a region) is limited. For example, we can’t limit the number of “items in an array in a resource”. If we need to, we should create a child resource type and limit how many of these can be created in a project/region.

With those pre-conditions, the remaining steps are rather simple to follow, we will go one by one.

First, we need to define service plans. It is necessary to provide default plans for organizations and projects too. This should be done again with fixtures, as we have in this example: https://github.com/cloudwan/inventory-manager-example/blob/master/fixtures/v1/inventory_manager_plans.yaml.

As always, this requires importing the relevant package in main.go, and an entry in the config file, as in https://github.com/cloudwan/inventory-manager-example/blob/master/deployment/controller-config.yaml.

The service plan will be assigned automatically to the service during initial bootstrapping by limits.edgelq.com. Organization plans will be used at least by “top” organizations (those without parent organizations); they will have one of the organization plans assigned. From this point, organizations can either define their own plans or continue using the defaults provided by the service via fixtures.

When someone creates a resource under a project, the server needs to check whether it exceeds the limit; if it does, the server must reject the call with a ResourceExhausted error. Similarly, when a resource is deleted, limit usage should decrease. This must happen on the Store level, not in the API server: resources can often be created or deleted not via the standard Create/Delete calls, but via custom methods, so we need to track each Save/Delete call on the store level. SPEKTRA Edge provides the relevant modules already though. If you look at the file here: https://github.com/cloudwan/inventory-manager-example/blob/master/cmd/inventorymanagerserver/main.go, you should notice that, when we construct a store (via NewStoreBuilder), we add a relevant plugin (find NewV1ResourceAllocatorStorePlugin). It injects the necessary behavior: it checks the local limit tracker and ensures its value stays in sync. The version “v1” corresponds to the limits service version, not the 3rd party service version.

There is also a need to maintain synchronization between the SPEKTRA Edge-based service using Limits and limits.edgelq.com itself. Ultimately, it is in limits.edgelq.com where limit configuration happens. For this reason, a service using Limits is required to expose an API that Limits can understand. This is why, in the main.go file for the server runtime, you can find the mixin limits server instantiation (find NewLimitsMixinServer). It needs to be included.

Also, for limit synchronization, we need a controller module provided by the SPEKTRA Edge framework. By convention, this is a part of the business logic controller. You can find an example here: https://github.com/cloudwan/inventory-manager-example/blob/master/cmd/inventorymanagercontroller/main.go

Find the NewLimitsMixinNodeManager call; it must be included, and the created manager must be run along with the others.

The Limits mixin node manager needs its entry in the controller config, as in https://github.com/cloudwan/inventory-manager-example/blob/master/config/controller.proto.

There is one very common customization required for limits registration. By default, if the limits service is enabled, then ALL resources under projects are tracked. Sometimes this is not intended, and some resources should not be limited. As of now, we can handle this via code: we need to provide a function for the resource allocator.

We have an example in InventoryManager again: https://github.com/cloudwan/inventory-manager-example/blob/master/resource_allocator/resource_allocator.go.

In this example, we are creating an allocator that does not count usage if the resource type is ReaderAgent. It is also possible to filter out specific fields, and so on. This function is called for any creation, update (if, for some reason, a resource switches from counted to non-counted or vice versa!), or deletion.
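To give a feel for the shape of such a filter, here is a minimal illustrative predicate; the real interface is defined in the linked resource_allocator.go, and the collection segment name used below is hypothetical:

import "strings"

// Illustrative only: decide whether a resource, identified by its name path,
// counts against project limits. ReaderAgent instances are skipped;
// everything else is counted.
func countsAgainstLimits(resourceName string) bool {
    return !strings.Contains(resourceName, "/readerAgents/")
}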

This ResourceAllocator is used in the main.go file of the server runtime; we pass it to the store plugin.

4 - Developing the Sample Service

Let’s develop the sample service.

When writing code for your service, it is important to know some Goten/SPEKTRA Edge-specific components and how to use them. This part contains notable examples and advice.

Some examples here apply to edge runtimes too, as they often describe methods of accessing service backends.

Basic CRUD functionality

Unit tests are often a good way to show the possibilities of Goten/SPEKTRA Edge. While the example service implementation shows something more “real” and “full”, various use cases are better represented in shortened form with tests. In Goten, we have CRUD tests: https://github.com/cloudwan/goten/blob/main/example/library/integration_tests/crud_test.go and pagination tests: https://github.com/cloudwan/goten/blob/main/example/library/integration_tests/pagination_test.go

Client modules will always be used by edge applications, and often by servers too, since the backend, on top of storage access, will always need some access to other services/regions.

Using field paths/masks generated by Goten

Goten generates plenty of code related to field masks and paths. Those can be used for various techniques.

import (
    // Imaginary resource; the pattern applies to any generated resource.
    resmodel "github.com/cloudwan/some-repo/resources/v1/some_resource"
)

func DemoExampleFieldPathsUsage() {
    // Construction of some field mask.
    fieldMaskObject := &resmodel.SomeResource_FieldMask{Paths: []resmodel.SomeResource_FieldPath{
        resmodel.NewSomeResourceFieldPathBuilder().SomeField().FieldPath(),
        resmodel.NewSomeResourceFieldPathBuilder().OtherField().NestedField().FieldPath(),
    }}
    _ = fieldMaskObject // Use it, for example, in update or list requests.

    // We can also set a value in an object... if an item on the path is NIL,
    // it is allocated on the way.
    res := &resmodel.SomeResource{}
    resmodel.NewSomeResourceFieldPathBuilder().OtherField().NestedField().WithValue("SomeValue").SetTo(&res)
    resmodel.NewSomeResourceFieldPathBuilder().IntArrayField().WithValue([]int32{4, 3, 2, 1}).SetTo(&res)

    // You can access items via a field path... this also works if there is an
    // array on the path, but then we need to cast.
    for _, iitem := range resmodel.NewSomeResourceFieldPathBuilder().ObjectField().ArrayOfObjectsField().ItemFieldOfStringType().Get(res) {
        item := iitem.(string) // If we know "item_field_of_string_type" is a string, we can safely do that!
        _ = item               // Do something with item here...
    }
}

It is worth examining the FieldMask and FieldPath interfaces in the object runtime package of Goten (github.com/cloudwan/goten/runtime/object). Those interfaces are implemented for all resource-related objects. Many of these methods have strongly-typed equivalents.

With field path objects, you can:

  • Set a value in a resource
  • Extract a value (or values) from a resource
  • Compare a value with the one in a resource
  • Clear a value in a resource
  • Get the default value for a field path (you may need reflection though)

With field masks, you can:

  • Project a resource (shallow copy for selected paths)
  • Merge resources with a field mask
  • Copy selected field paths from one resource to another

You can explore some examples also in unit tests: https://github.com/cloudwan/goten/blob/main/runtime/object/fieldmask_test.go https://github.com/cloudwan/goten/blob/main/runtime/object/object_test.go

Tests for objects also show more possibilities related to field paths: we can use those modules for general deep cloning, diffing, or merging.

Creating resources with meta-owner references

In inventory-manager there is a particular example of creating a Secret resource; see the CreateDeviceOrder custom implementation. Before the DeviceOrder resource is created, we connect to the secrets.edgelq.com service and create a Secret resource. We create it with a populated metadata.ownerReferences value; as an argument, we pass a meta OwnerReference object, which contains the name of the DeviceOrder being created, along with the region ID where it is being created.

This is the file with the code we describe: https://github.com/cloudwan/inventory-manager-example/blob/master/server/v1/device_order/device_order_service.go.

Find the implementation of the CreateDeviceOrder method there.

Meta-owner references are a different kind of reference compared to those defined in the schema. Mainly:

  • They are considered “soft” and can never block the pointed resources.
  • You cannot, unfortunately, filter by them.
  • During creation (or when making an Update request with a new meta owner), a meta owner reference does not need to point to an existing resource (yet - see below).
  • They have specific deletion behavior (see below).

The resource pointed at by a meta owner reference is called the “meta owner”; the pointing one is the “meta ownee”.

Meta owner refs have, however, the following deletion properties:

  • When the meta owner resource is being deleted, then the meta owner reference is unset in an asynchronous manner.
  • If the meta owner resource does not exist, then after some time (minutes), the meta owner reference is removed from the meta ownee.
  • If the field metadata.ownerReferences becomes an empty array due to the removal of the last meta owner, the meta ownee resource is automatically deleted!

Therefore, you may consider that the meta ownee has a specific ASYNC_CASCADE_DELETE behavior, except that it needs all of its meta owners to be deleted.

When it is possible, it is much better to use schema references, declared in the protobuf files. However, it is not always possible, like here: the InventoryManager service imports secrets.edgelq.com, not the other way around. The Secrets service cannot possibly know about the existence of the InventoryManager resource model, therefore the Secret resource cannot have any reference to a DeviceOrder. Instead, when we want to create a Secret resource and associate it with the lifecycle of a DeviceOrder (we want the Secret to be garbage collected), we should use precisely the meta ownership.

This way, we can ensure that “child” resources in lower-level services like Secrets are automatically cleaned up. This also happens if, after a successful Secret creation, we fail to create the DeviceOrder (let’s say something happened and the database rejected our transaction without a retry option). It is because meta owner references time out when the meta owner fails to exist within a couple of minutes of the reference attachment.

There is one super corner case though: it is possible that the Secret resource will be successfully created, BUT the transaction saving the DeviceOrder will fail with an Aborted code, and this error type can be retried. As a result, the whole transaction will be repeated, including another CreateSecret call. After the second attempt, we will have two Secrets pointing to the same DeviceOrder, but the DeviceOrder will have a reference to only one of those Secrets; the other is stale. This particular case is handled by the option WithRequiresOwnerReference passed to the meta owner reference: it means the reference is removed from the meta ownee also when the owner resource has no “hard” reference pointing at the meta ownee. In this case, one of the Secrets would not be pointed at by the DeviceOrder and would be automatically cleaned up asynchronously.

It is advised to always use a meta owner reference with the WithRequiresOwnerReference option if the owner resource can have a schema reference to the meta ownee, like in this case, where DeviceOrder has a reference to a Secret. It follows the principle where the owner holds a reference to the ownee. Note that in this case we are creating a kind of reference loop, but it is allowed here.
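Conceptually, the attachment looks like the sketch below; the constructor and option names are approximations only, see the linked CreateDeviceOrder implementation for the exact calls:

// Approximate sketch: the OwnerReference carries the DeviceOrder name and
// region, and WithRequiresOwnerReference demands that the owner keeps a
// schema reference back to this Secret.
secret.Metadata = &meta.Meta{
    OwnerReferences: []*meta.OwnerReference{
        meta.NewOwnerReference(orderName, regionId, meta.WithRequiresOwnerReference()),
    },
}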

Creating resources from a 3rd party service

Any 3rd party service can create resources in SPEKTRA Edge core services; however, there is a condition attached to it: they must mark resources with service ownership information.

In the method CreateDeviceOrder from https://github.com/cloudwan/inventory-manager-example/blob/master/server/v1/device_order/device_order_service.go, look again at the CreateSecret call and see the metadata.services field of the Secret being created. We need to pass on the following information:

  • Which service owns this particular resource

    and we must point to our service.

  • List of allowed services that can read this resource

    we should point to our service, but we may optionally include other services too if this is needed.

Setting this field is a common requirement when a 3rd party service needs to create a resource owned by it.

It is assumed that a Service should not have full access to the project. Users, however, can create resources without this restriction.
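As a sketch, assuming the goten meta types behind metadata.services (field names may differ slightly), the ownership declaration looks roughly like this:

// Declare that our service owns the Secret and that only the listed
// services may read it.
secret.Metadata = &meta.Meta{
    Services: &meta.ServicesInfo{
        OwningService:   "inventory-manager.edgelq.com",
        AllowedServices: []string{"inventory-manager.edgelq.com"},
    },
}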

Accessing service from the client

Services on SPEKTRA Edge typically have Edge clients: devices/applications running with a ServiceAccount registered in IAM, connecting to the SPEKTRA Edge or third-party service via its API.

An example is provided with inventory-manager here: https://github.com/cloudwan/inventory-manager-example/blob/master/cmd/simple-agent-simulator/dialer.go

Note that you can skip WithPerRPCCredentials to have anonymous access. The Authenticator will classify the principal as Anonymous, and the Authorizer will then likely reject the request with a PermissionDenied code. It may still be useful, for example during activation: when a service account is being created and credential keys are allocated, the backend will need to allow anonymous access, and custom security needs to be provided. See Edge agent activation in this doc.

You can wrap the created gRPC connection with the client interfaces generated in the client packages for your service (or any SPEKTRA Edge-based service).
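For instance, assuming the dialer from the linked example returns a ready *grpc.ClientConn, wrapping it could look like this (the client package path and constructor name follow the usual generated layout and are illustrative):

import (
    "google.golang.org/grpc"

    // Illustrative import path; use the generated client package of your service.
    dopb "github.com/cloudwan/inventory-manager-example/client/v1/device_order"
)

// Wrap an established connection (dialed with or without per-RPC credentials)
// with a generated, typed client interface.
func newDeviceOrderClient(conn *grpc.ClientConn) dopb.DeviceOrderServiceClient {
    return dopb.NewDeviceOrderServiceClient(conn)
}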

Edge agent activation

A SPEKTRA Edge-based service typically has human users (represented by the User resource in iam.edgelq.com) or agents running on the edge (represented by the ServiceAccount resource in iam.edgelq.com). Users typically access SPEKTRA Edge via a web browser or CLI, and get access to the service via invitation.

A common problem with Edge devices is that, during the first startup, they don’t have credentials yet (typically).

If you have an agent runtime running on the edge, and it needs to self-activate by connecting to the backend and requesting credentials, this part is for you to read.

Activation can be done with a token: the client needs to establish a connection without RPC credentials in gRPC. Then it can connect to a special API method for activation. During activation, it sends a token for identification. In this exchange, credentials are created and returned by the server. There is a plan to have a generic Activation module in the SPEKTRA Edge framework, but it is not ready yet.

For the inventory manager, we have an example implementation in the repository. It is a fairly complex example though; therefore, a generic Activation module is planned for the future.

The token for activation is created by the DeviceOrderService when an order for edge devices is created. We store the token value using the secrets service, to ensure its value is not stored in any database, just in case. This token is then needed during the Activation stream.

The activation method is bidi-streaming, as seen in the api-skeleton. The client initializes activation with a first request containing the token value. The server responds with credentials, but to activate, the client needs to send an additional confirmation. Because of the multiple requests exchanged by the client and server, it was necessary to make this call a streaming type.

When implementing activation, there is another issue: the ActivationRequest sent by the client carries no region ID information. If there are multiple regions for a given service and the agent connects to the wrong region, the backend will have issues during execution. The RegionID is, however, encoded in the token itself. As of now, code-generated multi-region routing does not support methods where the region ID is encoded in some field of the request. For now, it is necessary to disable multi-region routing here and implement a custom method, as shown in the example file.

During the proper implementation of Activation (examine the example file activation_service.go), we are:

  • Using secrets service to validate token first

  • We are opening a transaction to create an initial record for the agent object. This part may be more service-specific

    in this case, we are associating an agent with a device from a different project, which is not typical here! More likely, you would need to associate the agent with a device from the same project.

  • We are creating several resources for our agent: a logging bucket, a metrics bucket, and finally a service account with a key and role binding.

  • We then ask the client to confirm activation; if all is fine, we save the agent in another transaction to associate it with the created objects (buckets and service account).

This activation example is, however, good at showing how to implement custom middleware, interact with other services, and create resources there.

Notable elements:

  • When creating a ServiceAccount, it is not usable at the beginning: you also need to create a ServiceAccountKey, along with a RoleBinding, so this ServiceAccount can do anything useful. We will discuss this example more in the IAM integration document.
  • Note that the ServiceAccount object has a meta owner reference set, pointing at the agent resource, and it also gets the attribute WithRequiresOwnerReference(). It is highly advisable to create resources here in this way. The ServiceAccount is thus bound to the agent resource: when the agent is deleted, the ServiceAccount is also deleted. Also, if Activation failed after the ServiceAccount was created, the ServiceAccount will be cleaned up, along with the ServiceAccountKey and RoleBinding. Note we talked about this when describing meta-owner references.
  • Logging and metrics buckets are also created using meta owner references, if an agent record is deleted, they will be cleaned automatically. The usage of buckets specified per agent is required to ensure that agents cannot read data owned by others. This topic will be covered more in a document describing SPEKTRA Edge integration. If logging and/or metrics are not needed by the agent, they can be skipped.
  • All resources in SPEKTRA Edge created by Activation require the metadata.services field populated.

EnvRegistry usage and accessing other services/regions from the server backend

The envRegistry component is used for connecting the current runtime with other services/regions. It can also provide real-time updates about changes (like a dynamic deployment of a service in a new region). Although such events are rare, dynamic updates help in those cases: we should not need to redeploy clusters in existing regions when we add a new deployment in a new region.

EnvRegistry can be used to find regional deployments and services.

It is worth recalling the difference between Deployment and Service: while a Service represents the service as a whole, with a public domain, a Deployment is a regional instance of a Service (a specific cluster).

The interface of EnvRegistry can be found here: https://github.com/cloudwan/goten/blob/main/runtime/env_registry/env_registry.go

You will encounter EnvRegistry usage throughout the examples; it is always constructed in the main file.

The notable thing about EnvRegistry is that all dial functions also have “FCtx” equivalents (like DialServiceInRegion and DialServiceInRegionFCtx). FCtx stands for Forward Context: various headers from the previous call, like authorization or call ID, are passed over to the next one. Usually, it is used from MultiRegion middleware, when headers need to be passed to the new call (especially Authorization). It has some restrictions though: since services do not necessarily trust each other, forwarding authorization to another service may be rejected. MultiRegion routing is a different case, because a request is routed between different regions of the same service, meaning that the service being called stays the same.

As of now, envRegistry is available only for backend services. It may be enhanced in the future, so that clients can just pass the bootstrap endpoint (the meta.goten.com service) and have all other endpoints discovered.
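A rough usage sketch; the argument shapes are approximate, consult the linked env_registry.go for the exact signatures:

// Dial the secrets service in a specific region, forwarding headers of the
// current call (like Authorization) via the FCtx variant.
conn, err := envRegistry.DialServiceInRegionFCtx(ctx, "secrets.edgelq.com", regionId)
if err != nil {
    return err
}
_ = conn // wrap it with a generated client interface, as shown earlier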

Store usage (database)

In the main.go files for servers, you will see a call to NewStoreBuilder. We typically add a cache and a constraints layer. Then we must add plugins (this list is for server runtimes):

  • Mandatory: MetaStorePlugin, plus the various sharding plugins (for all used sharding)
  • Highly recommended: AuditStorePlugin and UsageStorePlugin.
  • Mandatory if multi-region features are used: SyncingDecoratorStorePlugin
  • Mandatory if you use Limits service integration: V1ResourceAllocatorStorePlugin for the v1 limits version.

Such a constructed store handle already has all the functionality: Get, Search, Query, Save, Delete, List, Watch… However, it does not have type-safe equivalents for individual resources, like SaveRoleBinding, DeleteRoleBinding, etc. To have a nice wrapper, we have a set of As<ServiceShortName>Store functions that decorate a given store handle. Note that all collections must exist within a specified namespace.
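For example, for the InventoryManager service, the decoration could look roughly like this (exact generated names depend on your service short name):

// Decorate the generic handle with typed per-resource accessors.
invStore := store.AsInventoryManagerStore(genericStore)
// Typed equivalent of an untyped Get on the store handle:
order, err := invStore.GetDeviceOrder(ctx, getQuery)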

You need to call the WithStoreHandleOpts function on the Store interface before you can access the database. Typically, you should use one of the following: a snapshot transaction, or a cache-enabled no-transaction session:

import (
    "context"

    gotenstore "github.com/cloudwan/goten/runtime/store"
)

func withSnapshotTransaction(
    ctx context.Context,
    sh gotenstore.Store,
) error {
    return sh.WithStoreHandleOpts(ctx, func(ctx context.Context) error {
        var err error
        //
        // Here we use all Get, List, Save, Delete etc.
        //
        return err
    }, gotenstore.WithTransactionLevel(gotenstore.TransactionSnapshot))
}

func withNoTransaction(
    ctx context.Context,
    sh gotenstore.Store,
) error {
    return sh.WithStoreHandleOpts(ctx, func(ctx context.Context) error {
        var err error
        //
        // Here we use all Get, List etc.
        //
        return err
    }, gotenstore.WithReadOnly(), gotenstore.WithTransactionLevel(gotenstore.NoTransaction), gotenstore.WithCacheEnabled(true))
}

If you look at any transaction middleware, like here: https://github.com/cloudwan/inventory-manager-example/blob/master/server/v1/site/site_service.pb.middleware.tx.go, you should note that typically the transaction is already set for each call. It is different if, in the api-skeleton file, you set the MANUAL type:

actions:
- name: SomeActionName
  withStoreHandle:
    transaction: MANUAL

In this case, transaction middleware does not set anything, and you need to call WithStoreHandleOpts yourself. The MANUAL type is useful if you plan to have multiple micro transactions.
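A sketch of a MANUAL-type handler running two micro transactions, using the same options as shown above (the action and middleware names are illustrative):

func (srv *SomeActionMiddleware) SomeActionName(
    ctx context.Context,
    req *SomeActionNameRequest,
) (*SomeActionNameResponse, error) {
    // First micro transaction: cheap, cached, read-only lookups.
    err := srv.store.WithStoreHandleOpts(ctx, func(ctx context.Context) error {
        // ... reads only (Get, List, BatchGet) ...
        return nil
    }, gotenstore.WithReadOnly(), gotenstore.WithTransactionLevel(gotenstore.NoTransaction), gotenstore.WithCacheEnabled(true))
    if err != nil {
        return nil, err
    }

    // Second micro transaction: snapshot-isolated reads and writes.
    err = srv.store.WithStoreHandleOpts(ctx, func(ctx context.Context) error {
        // ... reads first, then Save/Delete ...
        return nil
    }, gotenstore.WithTransactionLevel(gotenstore.TransactionSnapshot))
    if err != nil {
        return nil, err
    }
    return &SomeActionNameResponse{}, nil
}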

Notes:

  • All Watch calls (singular and for collection) do NOT require WithStoreHandleOpts calls. They do not provide any transaction properties at all.
  • All read calls (Get, List, BatchGet, Search) must NOT be executed after ANY write (Save or Delete). You need to always collect all reads before making any writes.

Example usages can be found in https://github.com/cloudwan/inventory-manager-example/blob/master/server/v1/activation/activation_service.go

Note that the Activation service uses the MANUAL type, so middleware does not set the transaction.

Watching real-time updates

SPEKTRA Edge-based services heavily utilize the real-time watch functionality offered by Goten. There are 3 types of watches:

  • Single resource watch

    The client picks a specific resource by name and subscribes for real-time updates of it. Initially, it gets the current data object, then it gets an update whenever there is a change to it.

  • Stateful watch

    Stateful watch is used to watch a specific PAGE of resources in a given collection (ORDER BY + PAGE SIZE + CURSOR), where CURSOR typically means an offset from the beginning (but is more performant). This is useful for web applications that need to show real-time updates of the page the user is currently viewing. It is possible to specify a filter object.

  • Stateless watch

    It is used to watch ALL resources matching an optional filter object. It is not possible to specify order or paging. Note this may overload the client with a large changeset if the filter is not set carefully.

For each resource, if you look at the <resource_name>_service.proto files, the API offers Watch<Single> or Watch<Collection>. The first one is for a single resource watch and is relatively simple to use. The collection watch type requires you to specify a param: STATELESS or STATEFUL. We recommend STATEFUL for web-type applications because of its paging features. STATELESS is recommended for edge applications that need to watch some sub-collection of resources. However, we do not recommend using the direct API in this particular case: the STATELESS watch, while powerful, may require clients to handle cases like resets or snapshot size checks. To hide this level of complexity, it is recommended to use the Watcher modules in the access packages; each resource has a type-safe generated class.

This is reflected in tests from https://github.com/cloudwan/goten/blob/main/example/library/integration_tests/crud_test.go

There are 3 unit tests for various watches, and TestStatelessWatchAuthorsWithWatcher shows usage with the watcher.
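For illustration, a single resource watch, the simplest of the three, looks roughly like this; the names follow the generated API pattern for a hypothetical resource:

// Subscribe to one resource by name and consume updates: the initial state
// arrives first, then one message per change.
stream, err := client.WatchSomeResource(ctx, &WatchSomeResourceRequest{Name: name})
if err != nil {
    return err
}
for {
    resp, err := stream.Recv()
    if err != nil {
        return err // stream broken or context cancelled
    }
    handleChange(resp.GetChange()) // hypothetical handler for the change object
}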

Multi-Region development advice (for server AND clients)

Most of the multi-region features and instructions were discussed in the api-skeleton documentation. If you stick to the cases mentioned there, the code-generated multi-region routing will typically handle all the quirks. Similarly, the db-controller and MultiRegionPolicy objects will handle all cross-region synchronization.

Common advice for servers:

  • Easiest for multi-region routing are actions where isCollection and isPlural are both false.
  • Cases where isPlural is true and isCollection is false are not supported; we have built-in support for BatchGet, but custom methods will not fit. It is advised to avoid them if possible.
  • Plural collection requests are somewhat supported: we support Watch, List, and Search requests, and customizations based on them are the easiest to support. You can look at an example like the ListPublicDevices method in the devices.edgelq.com service. There are certain conditions though: the request object needs standard fields like parent and filter, which the code-generation tool looks for to implement multi-region routing. Pagination fields are optional. In the response, it is necessary to include an array of returned resources, and in the api-skeleton it is necessary to provide responsePaths pointing at where this list of resources is. If those conditions are met, you can implement various List variations yourself.
  • For streaming calls, you must allow multi-region routing using the first request from the client.


Common advice for clients:

It is also advisable to avoid queries that will be routed to, or worse, split & merged across multiple regions. Those queries should be exceptional, not the rule. One easy way to avoid split & merge is to query for resources within a single policy-holder resource (Service, Organization, or Project). For example, if you query for Distributions in a specific project, they will likely be synced across all project regions; if not, they will at least reside in the primary region of the project. This way, one or more regions will be able to execute the request fully.

If you query (with filter) across projects/organizations/services, you can:

  • For resources attached to regions (like the Device resource in the devices.edgelq.com service), you can query just a specific region across projects: ListDevices WHERE parent = "projects/-/regions/us-west2/devices/-". Note that the project is a wildcard, but the region is specific.
  • Each resource carries a syncing object within its metadata. You can find it here: https://github.com/cloudwan/goten/blob/main/types/meta.proto. See the SyncingMeta object and its description. If you filter by metadata.syncing.owningRegion, regardless of resource type and regardless of whether it is regional or not, the request will be routed to that specific region. Similarly, if you query with a metadata.syncing.regions CONTAINS condition, you can also ensure the request will be routed to a specific region. A query with the CONTAINS condition ensures that the client sees resources that the region can see anyway. A filter on owningRegion takes precedence over the regions CONTAINS condition.