runtime/observability

Understanding the observability module in the Goten runtime.

In the Goten repo, there is a observability module located at runtime/observability. This module is for:

  • Store tracing spans (Jaeger and Google Tracing supported)
  • Audit (Service audit.edgelq.com).
  • Monitoring usage (Metrics are stored in monitoring.edgelq.com).

In the Goten repo, this module is rather small, in observer.go we have Observer for spans. Goten also stores in context for the current gRPC call object called CallTracker (call_tracker.go). This generic tracker is used by the Audit and Monitoring usage reporter.

Goten also provides a global registry, where listeners can tap in to monitor all calls.

The mentioned module is however just a small base, more proper code is in SPEKTRA Edge repository, directory common/serverenv/observability:

  • InitCloudTracing

    It initializes span tracing. It registers a global instance, but is picked when we register the proper module, in file common/serverenv/grpc/server.go. See function NewGrpcServer, option grpc.StatsHandler. This is where tracing is added to the server. TODO: We will need to migrate Audit and Monitoring related, too. Also, I believe we should move the CallTracker initialization there altogether!

  • InitServerUsageReporter

    It initializes usage tracking. First, it stores a global usage reporter, that periodically sends usage time series data. It also stores standard observers, usage for store and API. They are registered in Goten observability module, to catch all calls.

  • InitAuditing

    It creates a logs exporter, as defined in the audit/logs_exporter module, then registers within the goten observability module.

Usage tracking

In the file common/serverenv/observability/observability.go, inside function InitServerUsageReporter, we initialize two modules:

  1. Usage reporter

    It is a periodic job, that checks all recorders from time to time, and exports usage as time series. It’s defined in common/serverenv/usage/reporter.go.

  2. usageCallObserver object

    With the RegisterStdUsageReporters call, it is registered within Goten observability (gotenobservability.RegisterCallObserver)

    (defined in common/serverenv/usage/std_recorders.

Reporter is supposed to be generic, there is a possibility to add more recorders. In the std_recorders directory, we just add a standard observer for API calls, we track usage on the API Server AND local store usage.

If you look at Reporter implementation, note that we are using always the same Project ID. This is Service Project ID, global for the whole Service, shared across all Deployments for this service. Each Service maintains usage metrics in its project. By convention, if we want to distinguish usage across user projects, we have a label for it, user_project_id. This is a common convention. See files std_recorders/api_recorder.go and std_recorders/storage_recorder.go, find RetrieveResults calls. We are providing a user_project_id label for all time series.

Let’s describe standard usage trackers. For this, the central point is usageCallObserver, defined in the call_observer.go file. If you look at it, it catches all unnecessary requests/responses plus streams (new/closed streams, new client or server messages). Its responsibilities are:

  • Insert store usage tracker in the context (via CallTracker).
  • Extract usage project IDs from requests or responses (where possible).
  • Notify API and storage recorders when necessary, storage recorder needs periodic flushing for streaming especially.

To track actual store usage, there is a dedicated store plugin, SPEKTRA Edge repository, file common/store_plugins/usage_observer.go. It gets a store usage tracker and increments values when necessary!

In summary, this implementation serves to provide metrics for fixtures defined in monitoring/fixtures/v4/per_service_metric_descriptor.yaml.

Audit

The audit is initialized in general in the common/serverenv/observability/observability.go file, inside the function InitAuditing. It calls NewLogsExporter from the audit/logs_exporter package.

Then, inside RegisterExporter, defined in file common/serverenv/auditing/exporter.go, we are hooking up two objects into Goten observability modules:

  • auditMsgVersioningObserver

    It’s responsible for catching all request/response versioning transformations.

  • auditCallObserver

    It’s responsible for catching all unary and streaming calls.

Of course, tracking API and versioning is not enough, we also need to export ResourceChangeLogs somehow. For this, we have also an additional store plugin in the file common/store_plugins/audit_observer.go file! It tracks changes happening in the store and pings Exporter when necessary. When the transaction is about to be committed, we call MarkTransactionAsReady. It may look a bit innocent, but it is not, see implementation. We are calling OnPreCommit, which is creating ResourceChangeLog resources! If we do not succeed, then we return an error, it will break the entire transaction in result. This is to ensure that ResourceChangeLogs are always present, even if we fail to commit ActivityLogs later on, so something is still there in audit.

The reason why we have a separate common/serverenv/auditing directory from the audit service, was some kind of idea that we should have an interface in the “common” part, but implementation should be elsewhere. This was an unnecessary abstraction, especially since we don’t expect other exporters here (and we want to maintain functionality and be able to break it). But for now, it is still there and probably will stay due to low harm.

Implementation of the audit log exporter should be fairly simple, see the audit/logs_exporter/exporter.go file in the SPEKTRA Edge repository.

Basically:

  • IsStreamReqAuditable and IsUnaryReqAuditable are used to determine whether we want to track this call. If not, no further calls will be made.

  • OnPreCommit and OnCommitResult are called to send ResourceChangeLog. Those are synchronous calls, they don’t exist until the Audit finishes processing. Note that it will extend a bit duration of store transactions!

  • OnUnaryReqStarted and OnUnaryReqFinished are called for unary requests and responses.

  • OnRequestVersioning and OnResponseVersioning are called for unary requests when their bodies are transformed between API versions. The function of it is to extract potential labels from updated requests or responses. Activity logs recorded will still be done for the old version.

  • OnStreamStarted and OnStreamFinished should be self-explanatory.

  • OnStreamExportable notifies when ActivityLog can be generated. It is used to send ActivityLogs before the call finishes.

  • OnStreamClientMessage and OnStreamServerMessage add client/server messages to ActivityLogs.

  • OnStreamClientMsgVersioning and OnStreamServerMsgVersioning notify the exporter when client or server messages are transformed to different API versions.

Notable elements:

  • Audit log exporter can sample unary requests when deciding whether to audit or not.
  • While ResourceChangeLog is sent synchronously and extends call duration, ActivityLogs does not. The audit exporter maintains a set of workers for streaming and unary calls, they have a bit of a different implementation. They work asynchronously.
  • Stream and unary log workers will try to accumulate a small batch of activity logs before sending, them to save on IO work. They have timeouts based on log size and time.
  • Stream and unary log workers will retry failed logs, but if they accumulate too much, they will start dropping.
  • Unary log workers send ActivityLogs for finished calls only.
  • Streaming log workers can send ActivityLogs for ongoing calls. If this happens, many Activity log fields like labels are no longer updateable. But request/responses and exit codes will be appended as Activity Log Events.