RollupConnector Service APIv2
Understanding the RollupConnector service APIv2, also known as ntt.monitoring.rollup_connector.v2.
The monitoring service provides a massive-scale, multi-tenant time-series data store with multi-key indexing and automatic rollups.
It is wire-compatible with the Google monitoring API for time series (version v3).
Version v4 is, however, the recommended one, as it is the one we are expanding.
Full API Specifications (with resources):
Highlights:
The Monitoring Service predefines a schema for time-series data and uses that schema to populate data and execute queries.
The schema is defined in two stages, with the upper level definition being the resource definition and the lower level definition being the metrics definition.
The higher-level resource definitions are units of entities that generate metrics and are defined as Monitored Resource Descriptors. For example, an entity that generates metrics, such as a SPEKTRA Edge device or SEI agent, exists as a resource. A resource definition contains the name of the resource as well as additional information such as labels.
Resource definitions are defined at the service level and are usually available only for reference by the user. Predefined resources can be viewed with the following command (example for the devices service):
cuttle monitoring list monitored-resource-descriptors \
--service devices.edgelq.com
Lower-level metric definitions define individual values and are defined as Metric Descriptor resources. For example, a metric definition could be the CPU utilization of a SPEKTRA Edge device or the latency measured by the SEI agent. In addition to the name of the metric, the metric definition defines labels as additional information, and a data type.
On top of that, the metric descriptor defines indices (per resource type). In the Monitoring service, metric descriptors are designed to “know” resource descriptors; therefore, time-series storage indices are defined at the metric descriptor level.
The following command can be used to list the Metric Descriptors in a project:
cuttle monitoring list metric-descriptors --project $PROJECT
Each metric has a defined data type and metric kind.
Metric kinds include general GAUGE (e.g., CPU utilization), cumulative CUMULATIVE (e.g., total network interface transfer volume), and differential DELTA.
Data types include INT64 for general integer values, DOUBLE for real numbers, and DISTRIBUTION for histograms.
Note that the list of labels defined for each resource and metric can be obtained by adding the --view FULL option when executing the list command. The labels can be used as filter conditions when executing the queries described below.
Please take a look at the following fixtures for the metric as well as the monitored resource descriptors for your reference:
Some metric descriptor resources are automatically created in the Project scope each time a Project resource is created. As the scope is Project, the resource can be managed both by the system and ordinary users - to some extent.
By default, each project gets Applications/Devices metric descriptors defined for pods and devices. A project may enable more optional services, which can create additional metric descriptors.
Service-provided Metric Descriptors provide:
Users of the projects can:
Time Series is identifiable by a unique set of:
An example of an individual TimeSerie is the number of incoming packets on a specific interface on a specific virtual machine.
Time series data points are expected to arrive in order and from a single client at once.
Example TimeSerie object:
{
"key": "BQHPAQoCGrEEHf8geAECGXc=",
"project": "your-project",
"region": "us-west2",
"metric": {
"type": "watchdog.edgelq.com/probe/session/delivery"
},
"resource": {
"type": "watchdog.edgelq.com/probe",
"labels": {
"probe_id": "p1"
}
},
"metricKind": "GAUGE",
"valueType": "DOUBLE",
"unit": "0-1%",
"points": [
{
"interval": {
"endTime": "2023-07-12T19:01:00Z"
},
"value": {
"doubleValue": 1
}
},
{
"interval": {
"endTime": "2023-07-12T19:00:00Z"
},
"value": {
"doubleValue": 1
}
}
]
}
Highlights:
- key: there is a 1-1 relation between a binary key (encoded as base64 in the example above) and a string label set.
- metricKind, valueType and unit: note that those values may actually differ between queries. If the user asks for COUNT aggregation, valueType will always be INT64; if the user is interested in MEAN, it will be DOUBLE. The unit is fetched from the MetricDescriptor.

Metric kinds:
- GAUGE: the most common metric kind, representing an “in-moment” measurement, like used memory, or statistics over a small period, like CPU usage.
- CUMULATIVE: a monotonically increasing value, like an interface packet count. It may be reset from time to time, but that should not affect the aggregated total sum. Useful for billing purposes.
- DELTA: may correspond to changes in resource usage, like a newly added Device (+1) or a removed Device (-1). Useful for quota control, request counting, or any case where maintaining a cumulative value isn’t an option.

Value types:
- INT64: useful for metrics that represent counters, like packet or request counts, or gauges like used memory.
- DOUBLE: a floating point value, useful for gauge values like CPU utilization.
- DISTRIBUTION: a histogram, useful for retaining more information about data-point value distribution when performing multiple aggregation steps (Align + Reduce), like request latency where we’re interested in the 99th percentile, not just the mean.
Also, see
https://github.com/cloudwan/edgelq-sdk/blob/main/monitoring/proto/v4/common.proto
and find the Distribution
message.
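To see why DISTRIBUTION preserves percentile information through aggregation, here is a minimal sketch in Go. It is not the actual ntt.monitoring.v4.Distribution message; explicit bucket bounds and counts are an assumption made for illustration:

```go
package main

import "fmt"

// Approximate a percentile from a histogram with explicit bucket bounds.
// Bucket semantics assumed here: counts[i] covers [bounds[i-1], bounds[i]),
// with a final overflow bucket past the last bound.
func approxPercentile(bounds []float64, counts []int64, pct float64) float64 {
	var total int64
	for _, c := range counts {
		total += c
	}
	target := int64(pct / 100.0 * float64(total))
	var cum int64
	for i, c := range counts {
		cum += c
		if cum >= target {
			// Return the upper bound of the bucket holding the target rank.
			if i < len(bounds) {
				return bounds[i]
			}
			return bounds[len(bounds)-1] // overflow bucket: clamp to last bound
		}
	}
	return bounds[len(bounds)-1]
}

func main() {
	// Latency buckets in ms: [0,10), [10,50), [50,100), [100,+inf)
	bounds := []float64{10, 50, 100}
	counts := []int64{900, 80, 15, 5}
	fmt.Println(approxPercentile(bounds, counts, 50)) // median falls in the first bucket
	fmt.Println(approxPercentile(bounds, counts, 99)) // tail preserved by upper buckets
}
```

A plain DOUBLE mean of these samples would hide the tail entirely; keeping the bucket counts lets a later Reduce step still answer percentile questions.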
A phantom time series ensures that a time series with a specific value exists when the responsible agent is down (or disconnected) and does not emit anything. Phantom time series are usually orchestrated by controllers to provide device or agent down-detection.
You can also think of phantom time series as generators, as they emit the specified points each minute.
Unlike other resources, TimeSerie data is obtained through an operation called Query.
The following example queries the SEI agent (watchdog.edgelq.com/probe) as the resource and packet loss (watchdog.edgelq.com/probe/session/delivery) as the metric.
It asks for a specific probe in a specific region.
cuttle monitoring query time-serie --parent projects/your-project \
--filter 'resource.type="watchdog.edgelq.com/probe" AND \
metric.type="watchdog.edgelq.com/probe/session/delivery" AND \
resource.labels.probe_id="p1" AND region="us-west2"' \
--aggregation '{"alignmentPeriod": "60s", \
"perSeriesAligner": "ALIGN_SUMMARY", \
"crossSeriesReducer": "REDUCE_MEAN", \
"groupByFields":["resource.labels.probe_id"]}' \
--interval '{"startTime": "2023-07-12T19:00:00Z", \
"endTime": "2023-07-12T19:01:00Z"}' -o json
Example output:
{
"timeSeries": [
{
"key": "BQHPAQoCGrEEHf8geAECGXc=",
"project": "your-project",
"region": "us-west2",
"metric": {
"type": "watchdog.edgelq.com/probe/session/delivery"
},
"resource": {
"type": "watchdog.edgelq.com/probe",
"labels": {
"probe_id": "p1"
}
},
"metricKind": "GAUGE",
"valueType": "DOUBLE",
"points": [
{
"interval": {
"endTime": "2023-07-12T19:01:00Z"
},
"value": {
"doubleValue": 1
}
},
{
"interval": {
"endTime": "2023-07-12T19:00:00Z"
},
"value": {
"doubleValue": 1
}
}
]
}
]
}
Standard list time series queries are those that contain at least one metric type (filter condition metric.type).
--parent
All time series belong to some project, and it is necessary to specify the project from which we query time series. Each project has its own dedicated database indices and is protected by authorization.
--filter
Describes filter conditions for the query, combined with the AND operator.
The filter is a string that should satisfy:
metric.type <EqualityOp> <MetricTypes> [AND <FilterPath> <Operator> <Values>]
Expressions within [] may be repeated many times (or not at all).
- <EqualityOp> must be = or IN.
- <MetricTypes> may be an array for the IN operator, or just a quoted string for a single value.
- <FilterPath> must be a valid path in the TimeSerie object. It must be one of:
  - resource.type: points to a MonitoredResourceDescriptor. If the user does not provide it, the system deduces possible values based on metric.type.
  - metric.labels.<Key>, where <Key> must be a valid label key present in the MetricDescriptor resource (field labels). If the user specified more than one metric type, the label must be present in all of them!
  - resource.labels.<Key>, where <Key> must be a valid label key present in the MonitoredResourceDescriptor resource (field labels). If the user specified multiple resource types, the label must be present in all of them! If the user did not specify a resource type, the label must match whatever the system determines is the actual resource type.
  - region: all time series belong not only to a project, but also to specific regions. A project must be enabled in at least one region, but may belong to multiple regions. If the region filter condition is not specified, the query is redirected to all regions where the project is enabled. It is advisable to provide the region in the filter if we know it; this way the query is not broadcast to multiple regions, which saves on latency.
- <Operator> must be =, !=, IN or NOT IN.
- <Values> may be a single string value (quoted) or an array of quoted strings between [] characters.
--aggregation
You can specify an alignment interval (alignmentPeriod), an aligner for each time series (perSeriesAligner), and a reducer across time series (crossSeriesReducer).
The alignment interval specifies the granularity of data required. The minimum granularity is 1 minute (60s or 1m); the other available granularities are 3m, 5m, 15m, 30m, 1h, 3h, 6h, 12h, and 1d (see the Data Storage Period section below).
Note that unaligned data cannot be acquired.
The alignment interval must be set appropriately for the query period (--interval) value described below. Specifying a small alignment interval with a large period value may cause the query to fail due to the large amount of data to be processed.
The Aligner per Time Series defines the process used to merge data within an alignment period. For example, if the alignment interval is 5 minutes and the original data is stored every minute, there will be 5 data points in one alignment period. In this case, the data representative of that alignment period must be calculated, and the calculation method must be specified according to the application.
Below is a list of typical aligners that are commonly used:
- ALIGN_MIN, ALIGN_MAX, ALIGN_MEAN, ALIGN_STDDEV: use the minimum, maximum, average, or standard deviation value within the period.
- ALIGN_COUNT: uses the number of data points in the period.
- ALIGN_SUM: computes the sum of values within the specified alignment period.
- ALIGN_SUMMARY: uses the composite value of the histogram within the period (Distribution type only).
- ALIGN_DELTA: extracts the difference between the current value at the current end timestamp and the previous value at the previous end timestamp. This aligner works only for CUMULATIVE and DELTA metric types (field metric_type in MetricDescriptor must be either of those values).
- ALIGN_RATE: works like ALIGN_DELTA, but also divides the result by the number of seconds within the period (alignment period).
- ALIGN_PERCENTILE_99, ALIGN_PERCENTILE_95, ALIGN_PERCENTILE_50, ALIGN_PERCENTILE_05: use the value of the given percentile in the period.
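The relationship between ALIGN_DELTA and ALIGN_RATE can be shown with a small Go sketch. This is a simplification: samples are assumed to be already aligned to period ends, and counter resets (which the real aligners must handle) are ignored:

```go
package main

import "fmt"

// alignDelta emulates ALIGN_DELTA over a cumulative counter: each output is
// the difference between consecutive period-end samples.
func alignDelta(samples []float64) []float64 {
	deltas := make([]float64, 0, len(samples)-1)
	for i := 1; i < len(samples); i++ {
		deltas = append(deltas, samples[i]-samples[i-1])
	}
	return deltas
}

// alignRate emulates ALIGN_RATE: the same deltas, divided by the number of
// seconds in the alignment period, giving a per-second rate.
func alignRate(samples []float64, periodSeconds float64) []float64 {
	rates := alignDelta(samples)
	for i := range rates {
		rates[i] /= periodSeconds
	}
	return rates
}

func main() {
	// Cumulative packet counter sampled at consecutive 60s period ends.
	samples := []float64{1000, 1600, 1900, 3100}
	fmt.Println(alignDelta(samples))    // packets per period
	fmt.Println(alignRate(samples, 60)) // packets per second
}
```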
Reducer is used to group multiple time series and combine them to produce
a single time series. This option is used in conjunction with
the groupByFields
option, which specifies the criteria for grouping.
For example, to obtain a time series of the average CPU usage of all devices in a specific project, specify resource.labels.project_id in the groupByFields option, then specify REDUCE_MEAN as the reducer.
## If the above example is used as a parameter:
{... "crossSeriesReducer": "REDUCE_MEAN", "groupByFields": ["resource.labels.project_id"]}
Below is a list of typical reducers that are commonly used:
- REDUCE_NONE: no grouping, no composite time series. Field groupByFields is not relevant.
- REDUCE_COUNT: computes the number of merged time series. Note that this is not a sum of the number of points within the merged time series! To get the sum of data points, use ALIGN_COUNT combined with REDUCE_SUM.
- REDUCE_SUM: uses the sum of values within a group.
- REDUCE_MIN, REDUCE_MAX, REDUCE_MEAN, REDUCE_STDDEV: use the minimum, maximum, average, or standard deviation value within a group.
- REDUCE_PERCENTILE_99, REDUCE_PERCENTILE_95, REDUCE_PERCENTILE_50, REDUCE_PERCENTILE_05: use the value of the given percentile in the group.
It is important to note that certain reducers (MEAN, STDDEV, any PERCENTILE) are typically best used with perSeriesAligner ALIGN_SUMMARY. This eliminates imbalances between individual time series and yields a proper mean/percentile. Rate/delta values (ALIGN_DELTA and ALIGN_RATE) are often used with REDUCE_SUM; similarly, counters (ALIGN_COUNT) are best used with REDUCE_SUM as well.
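The imbalance problem can be demonstrated numerically. The sketch below (my own illustration, not the service's implementation) contrasts a plain mean-of-per-series-means with a count-weighted merge, which is the kind of imbalance ALIGN_SUMMARY-style merging avoids:

```go
package main

import "fmt"

type serie struct {
	values []float64 // raw points within one alignment period
}

func mean(vs []float64) float64 {
	var s float64
	for _, v := range vs {
		s += v
	}
	return s / float64(len(vs))
}

// meanOfMeans: REDUCE_MEAN over per-series means. Every serie weighs the
// same, regardless of how many points it contributed.
func meanOfMeans(series []serie) float64 {
	var s float64
	for _, ts := range series {
		s += mean(ts.values)
	}
	return s / float64(len(series))
}

// pooledMean: count-weighted merge, as if all points were pooled together.
func pooledMean(series []serie) float64 {
	var s, n float64
	for _, ts := range series {
		for _, v := range ts.values {
			s += v
			n++
		}
	}
	return s / n
}

func main() {
	series := []serie{
		{values: []float64{0.9}},                     // one point
		{values: []float64{0.1, 0.1, 0.1, 0.1, 0.1}}, // five points
	}
	fmt.Println(meanOfMeans(series)) // ≈0.5: skewed toward the single-point serie
	fmt.Println(pooledMean(series))  // ≈0.233: true mean of all 6 points
}
```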
See API Specifications for details on available aligners and reducers and the conditions under which they can be used.
--interval
Specify the time period for which time series data is to be retrieved. Specify the start and end times in RFC3339 format for startTime and endTime. Please refer to “Data Storage Period” for the available time periods.
Like standard queries, pagination queries must contain at least one metric type (filter condition metric.type). One of the differences between regular and paginated queries is that the latter have the aggregation function built-in. Pagination views must be defined within MetricDescriptors (more on this later); functions define the aligner, reducer, and sorting order.
Pagination queries help to traverse/iterate over a large number of individual time series.
Pagination view describes:
Function describes:
Time series across projects/regions will never be sorted together. The function (aligner + reducer) must extract either a double or an integer value.
The mechanism is as follows: for the specified pagination view, each TimeSerie is sorted into a specific ranking (project, region, metric and resource types are the minimal shared properties of an individual ranking, plus filterable/promoted labels). Within a single ranking, the function (aligner + reducer) extracts a double/integer value from each TimeSerie for each timestamp. Finally, for each timestamp individually, a sorted ranking of TimeSeries is created in the specified order.
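The per-timestamp ranking step can be sketched as follows. The names (topN, entry, deviceID) are mine, for illustration only:

```go
package main

import (
	"fmt"
	"sort"
)

// entry pairs the label that identifies a TimeSerie within a ranking with the
// scalar the view's function extracted for one timestamp.
type entry struct {
	deviceID string
	value    float64
}

// topN sorts the extracted values for a single timestamp in descending order
// and applies offset+limit, as a paginated query does per ranking.
func topN(values map[string]float64, offset, limit int) []entry {
	ranking := make([]entry, 0, len(values))
	for id, v := range values {
		ranking = append(ranking, entry{id, v})
	}
	sort.Slice(ranking, func(i, j int) bool { return ranking[i].value > ranking[j].value })
	if offset > len(ranking) {
		return nil
	}
	end := offset + limit
	if end > len(ranking) {
		end = len(ranking)
	}
	return ranking[offset:end]
}

func main() {
	// disk-used bytes per device, one timestamp, one partition ranking
	values := map[string]float64{"dev-a": 11e9, "dev-b": 42e9, "dev-c": 7e9}
	fmt.Println(topN(values, 0, 2)) // top 2: dev-b, then dev-a
}
```

The real service repeats this for every timestamp in the interval, which is why result sets contain a full ranking per timestamp.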
For example, imagine we have metric.type equal to devices.edgelq.com/device/disk/used. This metric describes the number of bytes used on a disk. The matching resource.type is devices.edgelq.com/device.
Imagine we have a project and region with 10000 devices, and each has 3 partitions (name and mount). We can use a paginated query to find the top 20 devices for each partition, within a single project and region. The query can look like the following example:
cuttle monitoring query time-serie --parent projects/your-project \
--filter 'metric.type="devices.edgelq.com/device/disk/used" AND \
resource.type="devices.edgelq.com/device" AND \
region="us-west2"' \
--pagination '{"alignmentPeriod": "3600s", \
"view": "ByDevice", \
"function": "Mean", \
"limit": 20, \
"offset": 0}' \
--interval '{"startTime": "2023-07-12T19:00:00Z", \
"endTime": "2023-07-12T22:00:00Z"}' -o json
This query, in order to work, must have pagination view ByDevice and function Mean defined in the MetricDescriptor. We will describe this later in Indices Management. For now, assume that:
- View ByDevice is the one where the partition (name + mount) labels separate TimeSeries into different rankings. The resource label device_id is the one label we keep in each individual ranking.
- Function Mean uses ALIGN_SUMMARY + REDUCE_MEAN in DESCENDING order.

Because we know there are 3 common partitions across all devices, we will receive, for each timestamp (19:00:00, 20:00:00, 21:00:00, 22:00:00), 60 time series data points (3 partitions by 20 devices) belonging to different TimeSerie objects.
Notes for caution:
Parameters other than --pagination are the same as for standard queries.
It is also possible to receive time series as they appear in the service.
$ cuttle monitoring watch time-serie \
--parent projects/your-project \
--filter 'metric.type="devices.edgelq.com/device/connected" AND \
resource.type="devices.edgelq.com/device"' \
--aggregation '{"alignmentPeriod": "300s", \
"crossSeriesReducer": "REDUCE_MEAN", \
"perSeriesAligner": "ALIGN_MEAN", \
"groupByFields":["resource.labels.device_id"]}' \
--starting-time "2025-01-01T12:00:00Z" \
-o json | jq .
Arguments are the same as in standard queries, except --interval is not supported. It is replaced by --starting-time. The starting time should not be too far into the past (as of now, it can be at most one week old).
This is a streaming, long-running command. When a TimeSerie object is seen for the first time, monitoring will retrieve all data points from the starting time (inclusive). It will also fetch the full headers (metric & resource type and labels, project, region, unit).
Here is an example initial time series retrieved by the command above, assuming
that current time is past 2025-01-01T12:05:00Z
.
{
"timeSeries": [
{
"key": "BQHPAQoCGrEEHf8geAEA",
"project": "your-project",
"region": "us-west2",
"metric": {
"type": "devices.edgelq.com/device/connected"
},
"resource": {
"type": "devices.edgelq.com/device",
"labels": {
"device_id": "raspberry-pi-5"
}
},
"unit": "1",
"metricKind": "GAUGE",
"valueType": "DOUBLE",
"points": [
{
"interval": {
"endTime": "2025-01-01T12:05:00Z"
},
"value": {
"doubleValue": 1
}
},
{
"interval": {
"endTime": "2025-01-01T12:00:00Z"
},
"value": {
"doubleValue": 1
}
}
]
}
]
}
After 2025-01-01T12:10:00Z, the caller may receive the next data point for this time series:
{
"timeSeries": [
{
"key": "BQHPAQoCGrEEHf8geAEA",
"metricKind": "GAUGE",
"valueType": "DOUBLE",
"points": [
{
"interval": {
"endTime": "2025-01-01T12:10:00Z"
},
"value": {
"doubleValue": 1
}
}
]
}
]
}
This time, the fields project, region, unit, metric and resource are not returned. Instead, the caller should find the matching TimeSerie object using the key value seen in the past. This mechanism ensures the system does not need to fetch all TimeSerie metadata again, saving on latency, bandwidth, and resource consumption.
A data point is unique for given key and interval.endTime field values.
The watch time-series call works using the at-least-once principle. It is possible to receive the same data points (and headers) more than once: for the same key field, at one moment you may receive data points for timestamps (2025-01-01T12:00:15Z, 2025-01-01T12:00:20Z, 2025-01-01T12:00:25Z), then (2025-01-01T12:00:20Z, 2025-01-01T12:00:25Z, 2025-01-01T12:00:30Z). The user should prefer the last value received for a given key and timestamp.
Those received-again data points/headers are possible due to:
Finally, a TimeSerie object with a unique key may not be exactly synchronized with other TimeSerie objects regarding timestamps: it is possible to receive a data point with timestamp 2025-01-01T12:00:15Z for TimeSerie A, then 2025-01-01T12:00:05Z (earlier) for TimeSerie B. Typically, timestamps for all TimeSerie objects should grow equally, but if some edge agent goes offline, it may submit LATE data when it comes back online. Watch time-series may return multiple late data points for such an edge object.
If an edge agent comes back online way too late, though, the system may refuse data points over a certain age. As of now, the maximum allowed age is one hour. Therefore, if a device comes online after a break of more than 1 hour, the older data points will not be accepted. TimeSeries received over watch (and actually also over List) will have a gap in timestamps.
Data can only be acquired for a limited time, depending on the alignment interval. For example, data with a resolution of 1m (1 minute) can only be acquired up to 14 days in the past. If you want data up to 90 days old, you must use an alignment interval of 30 minutes.
This table represents retention periods:
Alignment Interval | Storage Period
---|---
1m | 14 days
3m | 28 days
5m | 42 days
15m | 62 days
30m | 92 days
1h | 184 days
3h | 366 days
6h | 732 days
12h | 1098 days
1d | 1464 days
By default, all time series have all of those alignment periods enabled. It is, however, possible to opt out of them using storage options in MetricDescriptor instances. For example, if we only need up to the 1h alignment period of the devices.edgelq.com/device/connected metric type in project your-project, we can do that by making an update:
$ cuttle monitoring update metric-descriptor \
'projects/your-project/metricDescriptors/devices.edgelq.com/device/connected' \
--storage-config '{"maxAp":"3600s"}' --update-mask 'storageConfig.maxAp' -o json
This can significantly reduce the number of data points created and stored, lowering total costs (especially for long-term storage).
It is not possible to disable individual alignment periods “in the middle”: higher alignments require the lower ones.
It is possible to restrict time series creation/queries to a specific subset within the project scope.
For example, suppose we have a device agent, and we want to ensure it can read/write only from/to specific owned time series. We can create the following bucket:
cuttle monitoring create bucket <bucketId> --project <projectId> \
--region <regionId> \
--resources '{
"types":["devices.edgelq.com/device", "applications.edgelq.com/pod"],
"labels": {"project_id":{"strings": ["<projectId>"]}, "region_id":{"strings": ["<regionId>"]}, "device_id":{"strings": ["<deviceId>"]}}
}'
We can now create a Role for Device (Yaml):
- name: services/devices.edgelq.com/roles/restricted-device-agent
scopeParams:
- name: region
type: STRING
- name: bucket
type: STRING
grants:
- subScope: regions/{region}/buckets/{bucket}
permissions:
- services/monitoring.edgelq.com/permissions/timeSeries.create
- services/monitoring.edgelq.com/permissions/timeSeries.query
The project can be specified in the RoleBinding. When we assign the Role to the Device, the device agent will only be able to create/query time series for the specific bucket, and this bucket will guarantee that:
- time series are limited to the devices.edgelq.com/device or applications.edgelq.com/pod resource types.

Buckets also ensure correctness even if the client is submitting binary time series keys (a key in TimeSerie is provided, which allows skipping metric and resource types and labels).
The example above is provided for information only - the devices.edgelq.com service already provides Buckets for all Devices!
The Monitoring service can observe specified time series to spot issues and trigger alerts.
The top monitoring resource is AlertingPolicy. It is characterized by:
Creation of an example policy in a specific project/region, without notification:
$ cuttle monitoring create alerting-policy policyName --project your-project --region us-west2 \
--display-name "display name" --spec '{"enabled":true}' -o json
We can enable/disable it ($ENABLED must be either true or false):
$ cuttle monitoring update alerting-policy projects/your-project/regions/us-west2/alertingPolicies/policyName \
--spec '{"enabled":$ENABLED}' --update-mask spec.enabled -o json
You can check policies in a project:
$ cuttle monitoring list alerting-policies --project your-project -o json
Once you have an alerting policy, you can create a condition. A condition has 3 spec components:
Be aware that an alerting condition must belong to some policy, which is region-scoped. Therefore, a condition will have an implicitly added region filter condition. If you want the same condition across multiple regions, you will need to copy it as many times as there are regions in the project.
Suppose we want to trigger an alert if the average CPU utilization on ANY device exceeds 90% for at least 15 consecutive minutes, using a granularity of 5 minutes. Assume that time series values are reported in the range from 0.0 to 1.0. We can do that with cuttle:
$ cuttle monitoring create alerting-condition cndName \
--parent 'projects/your-project/regions/us-west2/alertingPolicies/policyName' \
--display-name 'Display name' \
--spec '{"timeSeries":{\
"query":{\
"filter": "metric.type=\"devices.edgelq.com/device/cpu/utilization\" AND resource.type=\"devices.edgelq.com/device\"",\
"aggregation": {"alignmentPeriod":"300s", "perSeriesAligner":"ALIGN_SUMMARY","crossSeriesReducer":"REDUCE_MEAN","groupByFields":["resource.labels.device_id"]}\
},\
"threshold":{"compare":"GT", "value":0.9},\
"duration":"900s"\
}}'
Note that in the query we specify the filter and aggregation fields. In other words, pagination queries are not possible for alerting.
When an alerting condition is created, monitoring checks non-aggregated and pre-aggregated indices. It is worth mentioning that alerting conditions utilize watch time-series queries internally. Therefore, labels in aggregation.groupByFields are taken into account when looking at partitionLabelSets.
The number of TimeSeries objects monitored by the service for a single condition depends on the cardinality of the labels in aggregation.groupByFields. Also, at any given time, there can be as many firing alerts as there are unique TimeSeries objects within the fields defined by aggregation.groupByFields. Each Alert instance is associated with a unique label set within aggregation.groupByFields. Alerts for the same TimeSeries will not overlap time-wise.
The duration field in the condition specification has two meanings:
The duration should be a multiple of aggregation.alignmentPeriod.
Non-firing alerts are garbage-collected after 3 months as of now. This is not currently configurable.
In order to enable notifications about alerts, it is necessary to create a NotificationChannel resource. This can be done using cuttle. As of now, there are 3 types of notifications:
$ cuttle monitoring create notification-channel --project your-project email-example \
--spec '{"enabled":true, "type":"EMAIL", "addresses":["admin@example.com"]}'
$ cuttle monitoring create notification-channel --project your-project slack-example \
--spec '{"enabled":true, "type":"SLACK", "incomingWebhook": "https://some.url"}'
$ cuttle monitoring create notification-channel --project your-project slack-example \
--spec '{"enabled":true, "type":"WEBHOOK", "webhook": {"url": "https://some.url", "maxMessageSizeMb": 0.25}}'
Created channels may be attached to policies from any region in a project:
$ cuttle monitoring update alerting-policy 'projects/your-project/regions/us-west2/alertingPolicies/policyName' \
--spec '{"notification":{"enabled":true, "channels": ["projects/your-project/notificationChannels/email-example"]}}'
Naturally, a policy can be created with the notification channel attached from the start.
Apart from using notifications, users can also use the API to watch alert changes directly in their projects.
Webhooks are the more customizable notification type, with a more guaranteed structure. A full webhook notification has the following format:
{
"project": {/** monitoring.edgelq.com/Project object here **/},
"organization": {/** iam.edgelq.com/Organization object here **/},
"alertingPolicy": {/** monitoring.edgelq.com/AlertingPolicy object here **/},
"notification": {/** monitoring.edgelq.com/Notification object here **/},
"events": [{
"alertingCondition": {/** monitoring.edgelq.com/AlertingCondition object here **/},
"metricDescriptor": {/** monitoring.edgelq.com/MetricDescriptor object here **/},
"monitoredResourceDescriptor": {/** monitoring.edgelq.com/MonitoredResourceDescriptor object here **/},
"alerts": [{
/** monitoring.edgelq.com/Alert object here **/
}/** More alerts **/]
}/** More events **/]
}
Refer to the specifications of the listed resources for more details on what fields are available.
Note that this is the full message. Many fields are hidden to reduce notification size. It is possible to define which field paths should be included in each message using the notificationMask field:
$ cuttle monitoring create notification-channel --project your-project slack-example \
--spec '{..., "webhook": {..., "notificationMask": ["path1", "path2"...]}}'
Refer to the NotificationChannel specification to see the default notificationMask used if the user does not specify one (it is not empty). A smaller message fits into a webhook more easily.
It is also possible to limit the maximum message size using the maxMessageSizeMb param:
$ cuttle monitoring create notification-channel --project your-project slack-example \
--spec '{..., "webhook": {..., "maxMessageSizeMb": 0.25}}'
It should be used if there is a maximum size the webhook can accept. By default, there is no limit.
TimeSeries objects can be queried or received in a real-time manner using watch. However, it is also possible to forward them to external systems. We can do that using two resources:
- TimeSeriesForwarderSink: provides the target where time series can be stored in protobuf format (with optional compression).
- TimeSeriesCollectionRule: provides a persistent query that executes in the service background. It can (and should) be attached to a sink resource.

Collection rules do not have to be in the same project as the sink. Admins may create one sink in a major project and connect collection rules from minor ones.
Creating a sink requires providing some external endpoint. As of now, only Azure Event Hub is supported. The endpoint to Azure Event Hub must contain an auth key; therefore, it is necessary to create a Secret resource first:
$ cuttle secrets create secret secretName --region us-west2 --project your-project \
--data '{"EndpointString": "Endpoint=sb://<name>.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=<SECRET>;EntityPath=<topicName>"}'
Replace all variables within <> with proper values. Note that you should also provide the topic name (after creating the Azure Event Hub, create a topic as well).
Then you can create the sink, referencing the Azure endpoint with the secret:
$ cuttle monitoring create time-series-forwarder-sink sink-name --project your-project \
--display-name "some name" \
--spec '{"compression": "SNAPPY", "azureEventHub": {"endpoint": "projects/your-project/regions/us-west2/secrets/secretName"}}' -o json
The monitoring service automatically detects the number of partitions on start.
With a sink, you can create collection rules. They are very similar to watch queries: you need to provide a filter and aggregation. The only new argument is the sink.
Remember that the fields filter, aggregation and sink cannot be changed; you will need to recreate the rule. Under the hood, watch queries are executed by the system to forward time series to the sink.
$ cuttle monitoring create time-series-collection-rule r1 --project your-project \
--display-name 'Some name' \
--filter 'metric.type="devices.edgelq.com/device/connected" AND \
resource.type="devices.edgelq.com/device"' \
--aggregation '{"alignmentPeriod": "300s", \
"crossSeriesReducer": "REDUCE_MEAN", \
"perSeriesAligner": "ALIGN_MEAN", \
"groupByFields":["resource.labels.device_id"]}' \
--sink 'projects/your-project/timeSeriesForwarderSinks/sink-name' -o json
With the described example setup, the Azure Event Hub should be receiving protobuf objects of type ntt.monitoring.v4.BulkTimeSeries. If compression is used, remember to decompress the Azure event bytes first, before attempting to unmarshal BulkTimeSeries. Example simple Go code to parse events:
package main

import (
	"context"
	"fmt"
	"os"

	eventhub "github.com/Azure/azure-event-hubs-go/v3"
	"github.com/golang/snappy"
	"google.golang.org/protobuf/proto"

	rts "github.com/cloudwan/edgelq-sdk/monitoring/resources/v4/time_serie"
)

func main() {
	ctx := context.Background()
	connStr := "Endpoint=sb://demo.servicebus.windows.net/;SharedAccessKeyName=shared;SharedAccessKey=SecretKey;EntityPath=test"
	hub, err := eventhub.NewHubFromConnectionString(connStr)
	if err != nil {
		panic(fmt.Errorf("failed to create hub client: %s", err))
	}
	// Subscribe to partition "0"
	handle, err := hub.Receive(ctx, "0", func(ctx context.Context, event *eventhub.Event) error {
		// The sink compressed the payload with snappy; decompress first.
		decompressedData, err := snappy.Decode(nil, event.Data)
		if err != nil {
			panic(err)
		}
		metrics := &rts.BulkTimeSeries{}
		if err := proto.Unmarshal(decompressedData, metrics); err != nil {
			panic(err)
		}
		os.Stderr.WriteString(fmt.Sprintf("GOT METRICS: %s\n", metrics))
		return nil
	})
	if err != nil {
		panic(fmt.Errorf("failed to receive: %s", err))
	}
	<-handle.Done()
}
If data is not being delivered, it is worth checking:
- the sink status:
cuttle monitoring get time-series-forwarder-sink projects/your-project/timeSeriesForwarderSinks/name --field-mask status -o json
- whether the equivalent watch query delivers data:
cuttle monitoring watch time-serie --parent 'projects/your-project' --filter '...' --aggregation '...' -o json
- whether the collection rule is attached to the sink (field sink in TimeSeriesCollectionRule).

If the sink status does not indicate an error, the watch query is delivering data, and the collection rule is attached to the sink, it indicates a system issue.
Indices are important for two reasons:

- It is required to balance between them: more indices will require more writes and larger storage.
- It is necessary to understand the process of writing and querying data.

Essentially, there are 3 categories of indices: non-aggregated, pre-aggregated, and paginated.
The common characteristic for all indices is partitions. A single partition stores data points of TimeSerie objects with different keys in timestamp-increasing order. A TimeSerie object identified by its key may be stored in one or more partitions, depending on the index specification.

Efficient queries require that the partition we read from does not store too many data points (from different TimeSerie) in a single timestamp. This rule is common for all index types.

Each partition at minimum specifies the following fields of TimeSerie: project, region, metric.type and resource.type. It means TimeSerie across regions, projects and metric types will not be mixed with others. An index specification may optionally contain more metric/resource label keys.

For simplicity, when examining time series indices, we will assume that there is only one region ID used by an example project.
Non-aggregated indices use partitions to group TimeSerie objects, but nothing more. Data points are aligned according to the alignment period (one minute, 3 minutes, etc.). To ensure efficiency, high cardinality labels should be part of the partition key.

For example, let's take the following MonitoredResourceDescriptor:
- name: services/devices.edgelq.com/monitoredResourceDescriptors/device
  type: devices.edgelq.com/device
  displayName: Device
  labels:
  - key: device_id
    description: Device ID
    valueType: STRING
  - key: device_display_name
    description: Device Display Name
    valueType: STRING
    defaultValue: <undefined>
  - key: device_serial_number
    description: Device Serial Number
    valueType: STRING
    defaultValue: <undefined>
Then, let's take the following MetricDescriptor:
- name: projects/your-project/metricDescriptors/devices.edgelq.com/device/disk/used
  type: devices.edgelq.com/device/disk/used
  displayName: Disk usage in bytes
  metricKind: GAUGE
  valueType: INT64
  unit: By
  labels:
  - key: mount_point
    description: Mount Point
    valueType: STRING
  - key: partition_name
    description: Partition Name
    valueType: STRING
We may have a fleet of devices, each characterized by a unique device_id and a small set of disk partitions. If we have a large fleet of similar devices, we can assume that:

- device_id, device_display_name and device_serial_number are high cardinality. If we have 20K devices, then we will have 20K label values for each of these. However, it can be safely assumed that for a specific device_id we will have one value of device_display_name and device_serial_number.
- mount_point and partition_name will typically have low cardinality, as devices within a specific project/region should be similar. And within a single device_id, we can be even more confident that the number of disk partitions will not grow large.

Based on the knowledge above, we may define the following non-aggregated indices:
- name: projects/your-project/metricDescriptors/devices.edgelq.com/device/disk/used
  type: devices.edgelq.com/device/disk/used
  displayName: # ... SKIP
  labels: # ... SKIP
  metricKind: GAUGE
  valueType: INT64
  unit: By
  indices:
    builtIn:
      nonAggregatedIndices:
      - name: "device-nonaggregated"
        resourceTypes: [ devices.edgelq.com/device ]
        partitionLabelSets:
        - name: "DeviceScope"
          resourceKeys: [ device_id ]
          metricKeys: [ ]
        - name: "SerialNumberScope"
          resourceKeys: [ device_serial_number ]
          metricKeys: [ ]
As a result, for this metric descriptor (which has project, region, metric and resource type scope), we will have as many partitions as the number of devices multiplied by 2. We will have two indices:

- device-nonaggregated:DeviceScope: with partitions separated by resource label key device_id.
- device-nonaggregated:SerialNumberScope: with partitions separated by resource label key device_serial_number.

Note that each TimeSerie data point will be saved twice.

In terms of query efficiency, we satisfy the requirement that a single partition should not have too many data points for a single timestamp, because a single partition is guaranteed to contain data from only one device. If the number of disk partitions (labels mount_point and partition_name) on a single device is low (like 3 at most), a single partition for a single timestamp will contain 3 data points.
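The mapping from a TimeSerie to its partitions can be sketched as follows. This is a minimal sketch with a hypothetical key layout; the real storage encoding is internal to the service, but it illustrates why each data point lands in two partitions here.

```go
package main

import "fmt"

// TimeSerie fields relevant for partitioning (simplified sketch).
type TimeSerie struct {
	Project, Region, MetricType, ResourceType string
	ResourceLabels                            map[string]string
}

// partitionKey builds a hypothetical partition key: every partition is
// always scoped by project, region, metric.type and resource.type, plus
// the label keys of one set from partitionLabelSets.
func partitionKey(ts TimeSerie, setName string, resourceKeys []string) string {
	key := fmt.Sprintf("%s/%s/%s/%s/%s",
		ts.Project, ts.Region, ts.MetricType, ts.ResourceType, setName)
	for _, k := range resourceKeys {
		key += "/" + k + "=" + ts.ResourceLabels[k]
	}
	return key
}

func main() {
	ts := TimeSerie{
		Project:      "your-project",
		Region:       "us-west2",
		MetricType:   "devices.edgelq.com/device/disk/used",
		ResourceType: "devices.edgelq.com/device",
		ResourceLabels: map[string]string{
			"device_id":            "dev-1",
			"device_serial_number": "SN-001",
		},
	}
	// One write per partition label set, so each data point is saved twice.
	fmt.Println(partitionKey(ts, "DeviceScope", []string{"device_id"}))
	fmt.Println(partitionKey(ts, "SerialNumberScope", []string{"device_serial_number"}))
}
```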
Let’s examine some queries:
cuttle monitoring query time-serie --parent 'projects/your-project' \
--filter 'metric.type="devices.edgelq.com/device/disk/used" AND \
resource.type="devices.edgelq.com/device" \
AND resource.labels.device_id="some_id"' \
--aggregation '{"perSeriesAligner":"ALIGN_SUMMARY",
"alignmentPeriod":"300s",
"crossSeriesReducer":"REDUCE_MEAN",
"groupByFields":["metric.labels.mount_point"]}' \
--interval '{"startTime":"$START_TIME","endTime":"$END_TIME"}' -o json | jq .
The above query specifies resource.labels.device_id in the filter condition, but not resource.labels.device_serial_number. In that case, monitoring will use the device-nonaggregated:DeviceScope index to retrieve data. If the number of unique metric.labels.mount_point values is 3, we will receive 3 separate TimeSerie objects for the specified interval. We will read a small number of points from the partition belonging to the device.

Note that the aggregation param may specify crossSeriesReducer and groupByFields fields for non-aggregated indices. Non-aggregated only means that the stored data is in non-aggregated (across time series) format. Aggregation can be executed on the fly, during query execution.
Let's take a look at another example query:
cuttle monitoring query time-serie --parent 'projects/your-project' \
--filter 'metric.type="devices.edgelq.com/device/disk/used" AND \
resource.type="devices.edgelq.com/device" \
AND resource.labels.device_serial_number="some_number"' \
--aggregation '{"perSeriesAligner":"ALIGN_SUMMARY",
"alignmentPeriod":"300s",
"crossSeriesReducer":"REDUCE_MEAN",
"groupByFields":["metric.labels.mount_point"]}' \
--interval '{"startTime":"$START_TIME","endTime":"$END_TIME"}' -o json | jq .
The query above will use the other index, device-nonaggregated:SerialNumberScope, but otherwise we will get a similar response as before.
Now examine the following query:
cuttle monitoring query time-serie --parent 'projects/your-project' \
--filter 'metric.type="devices.edgelq.com/device/disk/used" AND \
resource.type="devices.edgelq.com/device" \
AND resource.labels.device_display_name="Some name"' \
--aggregation '{"perSeriesAligner":"ALIGN_SUMMARY",
"alignmentPeriod":"300s",
"crossSeriesReducer":"REDUCE_MEAN",
"groupByFields":["metric.labels.mount_point"]}' \
--interval '{"startTime":"$START_TIME","endTime":"$END_TIME"}' -o json | jq .
The monitoring service will return an InvalidArgument error, indicating that the filter does not match any of the defined indices.

If the query contained BOTH resource.labels.device_serial_number and resource.labels.device_id, monitoring will pick one of those indices, but not both. The index is picked on a case-by-case basis, after computing which one is more optimal.
Finally, let’s take a look at this query:
cuttle monitoring query time-serie --parent 'projects/your-project' \
--filter 'metric.type="devices.edgelq.com/device/disk/used" AND \
resource.type="devices.edgelq.com/device"' \
--aggregation '{"perSeriesAligner":"ALIGN_SUMMARY",
"alignmentPeriod":"300s",
"crossSeriesReducer":"REDUCE_MEAN",
"groupByFields":["resource.labels.device_id", "metric.labels.mount_point"]}' \
--interval '{"startTime":"$START_TIME","endTime":"$END_TIME"}' -o json | jq .
This query will also fail due to the lack of an index that could match it. When monitoring receives a list query, it takes into account only the filter field when matching against partitionLabelSets. The presence of resource.labels.device_id in aggregation.groupByFields does not change this calculation. The monitoring service does not scan all partitions within a project in search of unique device IDs. Single queries are required to read from a limited number of partitions.

Scanning potentially tens of thousands of partitions may also be impractical. If we had 3 disk partitions and 10K devices, we would receive 30K TimeSerie objects - and if the interval is one day (while the alignment period is 5 minutes), the query would return 288 data points per TimeSerie, totaling 8,640,000. The uncompressed response may take hundreds of megabytes. It is heavy for the system and the receiving client. If the number of devices grows even more, it will scale worse and worse.
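The arithmetic above can be double-checked with a few lines of Go:

```go
package main

import "fmt"

func main() {
	devices := 10000
	diskPartitionsPerDevice := 3
	series := devices * diskPartitionsPerDevice // 30K TimeSerie objects

	// One day of data at a 5-minute alignment period.
	pointsPerSerie := 24 * 60 / 5 // 288
	totalPoints := series * pointsPerSerie

	fmt.Println(series, pointsPerSerie, totalPoints) // 30000 288 8640000
}
```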
Potentially, we can enable this query by adding a non-aggregated index like this:
nonAggregatedIndices:
- name: "device-nonaggregated"
  resourceTypes: [ devices.edgelq.com/device ]
  partitionLabelSets:
  - name: "ProjectScope" # Not recommended
    resourceKeys: [ ]
    metricKeys: [ ]
  - name: "DeviceScope"
    resourceKeys: [ device_id ]
    metricKeys: [ ]
  - name: "SerialNumberScope"
    resourceKeys: [ device_serial_number ]
    metricKeys: [ ]
Index device-nonaggregated:ProjectScope would create one big partition though, which goes against the rule that a single partition must contain a limited number of data points for each timestamp. While monitoring does not prevent creating bad indices (it cannot know the cardinality of labels in advance), subsequent queries may start being rejected with timeout/out-of-resources errors.
Non-aggregated indices are also used for watch queries. However, watch queries have more relaxed requirements regarding indices, as labels provided via aggregation.groupByFields are also matched against partitionLabelSets. In other words, the following query would be supported:
cuttle monitoring watch time-serie --parent 'projects/your-project' \
--filter 'metric.type="devices.edgelq.com/device/disk/used" AND \
resource.type="devices.edgelq.com/device"' \
--aggregation '{"perSeriesAligner":"ALIGN_SUMMARY",
"alignmentPeriod":"300s",
"crossSeriesReducer":"REDUCE_MEAN",
"groupByFields":["resource.labels.device_id", "metric.labels.mount_point"]}' \
--starting-time '$START_TIME' -o json | jq .
While resource.labels.device_id is not provided via the filter, it is provided via group by. This is not supported for regular queries, but works well for watch. The reason lies within internal implementation details. Plus, watch time series is capable of chunking large responses into multiple messages - it is designed to run longer, in a streaming fashion.
Finally, it is worth reducing the number of indices when they are not needed. Non-aggregated indices do not re-use underlying storage; they are full replicas. In the case of devices.edgelq.com/device/disk/used, we were able to notice that the caller always provides device_id, rendering the other index using the serial number redundant.

In summary:

- partitionLabelSets should contain all high cardinality labels.
- Queries must specify, in the filter, all labels from at least one set in partitionLabelSets.

Pre-aggregated indices are the next evolution from non-aggregated ones. Like the latter, pre-aggregated indices are used by regular and watch queries: the user must specify a parent project, filter and aggregation. However, while non-aggregated indices store original TimeSerie objects as reported by time series writers, pre-aggregated indices merge those aligned-only time series with each other to create new ones - so they are one step after non-aggregated.

Pre-aggregated means that aggregation happens at the storage level. Because storage already contains this data, it makes retrieval relatively cheap. Even if a query requires merging tens of thousands of TimeSerie with each other, monitoring has little work to do at query time.
Let's come back to the devices.edgelq.com/device monitored resource descriptor we already know, as we will use it in these examples. Now, let's define the following metric descriptor:
- name: projects/your-project/metricDescriptors/devices.edgelq.com/device/connected
  type: devices.edgelq.com/device/connected
  displayName: Device connected
  metricKind: GAUGE
  valueType: INT64
  unit: "1"
  labels: [] # empty labels
Each device sends “1” when it is online. When it is offline, data points are populated with the “0” value. Because the connected metric is a direct single property of the device, it does not need any additional labels.
High cardinality labels are the same as those discussed for non-aggregated indices. To be able to check the connectivity history of each individual device, we need some non-aggregated index:
- name: projects/<project>/metricDescriptors/devices.edgelq.com/device/connected
  type: devices.edgelq.com/device/connected
  displayName: Device connected
  metricKind: GAUGE
  valueType: INT64
  unit: "1"
  labels: []
  indices:
    builtIn:
      nonAggregatedIndices:
      - name: "device-nonaggregated"
        resourceTypes: [ devices.edgelq.com/device ]
        partitionLabelSets:
        - name: "DeviceScope"
          resourceKeys: [ device_id ]
          metricKeys: [ ]
However, we may also want to know the connectivity history of devices across a project and region in general. In other words, we would like to execute a query like this:
cuttle monitoring query time-serie --parent 'projects/your-project' \
--filter 'metric.type="devices.edgelq.com/device/connected" AND resource.type="devices.edgelq.com/device"' \
--aggregation '{ \
"perSeriesAligner":"ALIGN_MEAN", \
"alignmentPeriod":"300s", \
"crossSeriesReducer":"REDUCE_SUM"}' \
--interval '{"startTime":"$START_TIME","endTime":"$END_TIME"}' -o json | jq .
Note we do not group by device ID.

Execution of this query is done in two steps. First, for each individual TimeSerie (which matches a single device), we get the fraction of each 5-minute interval when the device was online (see alignment period and per series aligner). For example, if a device was online for 4 minutes during a 5-minute interval (within the larger interval specified in the query), ALIGN_MEAN will produce the value 0.8: (1 + 1 + 1 + 1 + 0) / 5 = 0.8. This assumes we have one data point per minute. Each 5-minute interval is computed individually, until we cover the whole period specified by the interval argument.
After extracting the aligned value for each individual time series and timestamp, we look at the cross series reducer. REDUCE_SUM means we will add up all values sharing the same timestamp. If the result for some timestamp is 5635.7, it means that, within the 5-minute interval ending at that timestamp, on average 5635.7 devices were online. If there were 10K devices in total, the maximum result we can ever have for a single timestamp is 10000.
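Assuming one raw data point per minute, the two steps can be illustrated with a minimal sketch; alignMean and reduceSum are illustrative stand-ins for ALIGN_MEAN and REDUCE_SUM:

```go
package main

import "fmt"

// alignMean implements the ALIGN_MEAN step: it collapses each window of
// `period` raw points (here: one point per minute) into their average.
func alignMean(points []float64, period int) []float64 {
	var out []float64
	for i := 0; i+period <= len(points); i += period {
		sum := 0.0
		for _, v := range points[i : i+period] {
			sum += v
		}
		out = append(out, sum/float64(period))
	}
	return out
}

// reduceSum implements the REDUCE_SUM step: aligned series are added up
// timestamp by timestamp.
func reduceSum(series [][]float64) []float64 {
	out := make([]float64, len(series[0]))
	for _, s := range series {
		for i, v := range s {
			out[i] += v
		}
	}
	return out
}

func main() {
	// Device A: online 4 of 5 minutes; device B: online the whole window.
	a := alignMean([]float64{1, 1, 1, 1, 0}, 5) // [0.8]
	b := alignMean([]float64{1, 1, 1, 1, 1}, 5) // [1]
	fmt.Println(reduceSum([][]float64{a, b}))   // [1.8]
}
```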
Knowing average number of online devices across time may be useful. In that case, we can define the following pre-aggregated index:
- name: projects/<project>/metricDescriptors/devices.edgelq.com/device/connected
  type: devices.edgelq.com/device/connected
  displayName: Device connected
  metricKind: GAUGE
  valueType: INT64
  unit: "1"
  labels: []
  indices:
    builtIn:
      preAggregatedIndices:
      - name: "device-aggregated"
        resourceTypes: [ devices.edgelq.com/device ]
        partitionLabelSets:
        - name: "ProjectScope"
          metricKeys: []
          resourceKeys: []
        filterAndGroupLabelSets:
        - name: "AllReduced"
          metricKeys: []
          resourceKeys: []
        supportedAggregations:
        - name: "OnlineDevAvgCounts"
          # These arrays may contain multiple entries if needed.
          perSeriesAligners: [ ALIGN_MEAN ]
          crossSeriesReducers: [ REDUCE_SUM ]
In comparison with non-aggregated indices, pre-aggregated ones have additional properties:

- filterAndGroupLabelSets: label keys specified in these sets may be used in the filter or aggregation.groupByFields parameters. Label keys that are not specified in these sets must not be used in filter or aggregation.groupByFields (except those mentioned in partitionLabelSets).
- supportedAggregations: declares which aligner/reducer combinations the index stores.
- partitionLabelSets works in a similar way as in non-aggregated indices: in the filter field, time-series queries must specify all labels required by at least one set in partitionLabelSets. In the case of watch queries, it is also sufficient to provide partition labels via aggregation.groupByFields.
The number of actual pre-aggregated indices is a cartesian product of 3 arrays: partitionLabelSets, filterAndGroupLabelSets and supportedAggregations. In the presented example, all the arrays have a length of just one, therefore we will have only one pre-aggregated index.
If you go back to the example query above, where the groupByFields set is empty and the filter provides only metric/resource type, you can see that it matches the pre-aggregated index. Partition label set ProjectScope does not require ANY extra labels in the filter. Then, the AllReduced set forbids all other labels to be used, but we don't use any. Finally, the aligner and reducer are specified in the supported list, so we can use them.
Based on each preAggregatedIndices group, monitoring generates a number of indices based on:

- partitionLabelSets
- filterAndGroupLabelSets
- supportedAggregations

In the example, we will have one index: device-aggregated:ProjectScope/AllReduced/ALIGN_MEAN
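The generation rule can be sketched as a simple cartesian product over names; the group:Partition/Set/Aligner naming below mirrors the convention used in this document, while the function itself is only illustrative:

```go
package main

import "fmt"

// indexNames expands one preAggregatedIndices group into the names of the
// concrete indices it generates: one per combination of partition label
// set, filter-and-group label set, and storage aligner.
func indexNames(group string, partitionSets, filterGroupSets, storageAligners []string) []string {
	var out []string
	for _, p := range partitionSets {
		for _, f := range filterGroupSets {
			for _, a := range storageAligners {
				out = append(out, fmt.Sprintf("%s:%s/%s/%s", group, p, f, a))
			}
		}
	}
	return out
}

func main() {
	// All three arrays have length one, so exactly one index is generated.
	names := indexNames("device-aggregated",
		[]string{"ProjectScope"}, []string{"AllReduced"}, []string{"ALIGN_MEAN"})
	fmt.Println(names) // [device-aggregated:ProjectScope/AllReduced/ALIGN_MEAN]
}
```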
Monitoring always multiplies the number of label sets when generating indices, but supported aggregations are simplified where possible. Monitoring may:

- merge multiple aligners from perSeriesAligners into a common storage aligner;
- merge different supportedAggregations entries into the same storage aligner, as long as they belong to the same partitionLabelSets and filterAndGroupLabelSets.

It is important to note that each final aligner represents separate index data. To find out what final aligners (storage aligners) were determined by monitoring, users can make the following query:
$ cuttle monitoring list metric-descriptors --project $PROJECT --view NAME \
--field-mask indices.builtIn.preAggregatedIndices.supportedAggregations.storageAligners \
--field-mask indices.userDefined.preAggregatedIndices.supportedAggregations.storageAligners \
-o json
If there are two different supportedAggregations functions sharing some storageAligners, monitoring will reuse the same data to save on index storage.
When we define pre-aggregated indices, it is important to make sure the resulting partitions will not be very large: we still should guarantee that the number of data points for a given timestamp and partition will be limited. This is driven by the cardinality of labels in each set from filterAndGroupLabelSets. Putting high cardinality labels there (like device ID) is strongly advised against. Most device metric labels are fine, like disk partition name: within a single project and region we can safely assume the number of unique disk partitions is limited. The maximum cardinality is ultimately decided by the widest set in partitionLabelSets (least amount of labels) and the largest set in filterAndGroupLabelSets (more labels decrease performance).
The particular case described above is very easy: our AllReduced set has an empty list of keys, therefore total cardinality is exactly 1. A pre-aggregated index using this group is guaranteed to produce at most 1 data point per timestamp, making it very efficient. Furthermore, looking at partitionLabelSets, we can say that for a single project/region we will have just one TimeSerie object describing this pre-aggregated connectivity metric history.
In summary:

- partitionLabelSets may be empty, or contain any low/high cardinality labels.
- filterAndGroupLabelSets should not contain high cardinality labels.

Pre-aggregated queries allow us to retrieve new TimeSerie objects that are based on thousands of other ones (by aggregation). However, they do not allow traversing those thousands of TimeSerie objects in a cheap way.

As a general rule, requests asking for thousands of TimeSerie objects are not good. If we have that many of them, there are two ways to manage:

- aggregate them into fewer TimeSerie objects, or
- page through a sorted ranking, retrieving TOP N / OFFSET K results.

Paginated indices address the second case.
Let's define an example: we want to monitor CPU usage across a very large fleet of devices. Specifically, we want to keep an eye on devices with the highest CPU. This is an example MetricDescriptor:
- name: projects/your-project/metricDescriptors/devices.edgelq.com/device/cpu/utilization
  type: devices.edgelq.com/device/cpu/utilization
  displayName: CPU utilization in percentage
  metricKind: GAUGE
  valueType: DOUBLE
  unit: "%"
  labels:
  - key: cpu_number
    description: CPU Number
    valueType: STRING
  - key: state
    description: CPU state one of user, system, idle, nice, iowait, irq, softirq and steal
    valueType: STRING
  indices:
    builtIn:
      paginationIndices:
      - name: "usage-ranking"
        resourceTypes: [ devices.edgelq.com/device ]
        partitionLabelSets:
        - name: "ProjectScope"
          resourceKeys: [ ]
          metricKeys: [ ]
        views:
        - name: "ByDevice"
          filterableMetricKeys: [ state ]
          filterableResourceKeys: [ ]
          paginatedMetricKeys: [ ]
          paginatedResourceKeys: [ device_id ]
        functions:
        - name: "Mean"
          aligner: ALIGN_SUMMARY
          reducer: REDUCE_MEAN
          sorting: DESCENDING
This metric introduces 2 labels: cpu_number and state. Of these 2 labels, we can safely assume that both are low cardinality. The only high cardinality label that remains is device ID.
Like non-aggregated and pre-aggregated indices, paginated indices also have partitionLabelSets. They work in the same manner: the filter must specify all labels from at least one set in partitionLabelSets. If we can be certain that some labels will always be used in a filter, it is highly recommended to put them in partition label sets.

In this example case, we would like to retrieve the top devices with the highest CPU, so we cannot tell the device ID in the filter. Since we may want to make queries with a filter specifying only metric type and region, it is best to have a single empty set in partitionLabelSets. This way no labels are required to be specified.
The new important properties of paginated indices are:

- views: A view is very important for deciding which labels may be used in the filter field, and which labels are “paginated”. Filterable labels define separate sorted rankings. Paginated labels are linked to double/integer values that are sorted according to a defined function. Note that each view is combined with each set in partitionLabelSets. Therefore, one index may contain multiple rankings in a linear memory layout. It is important to ensure that filterable label keys are not high cardinality labels. Paginated labels are the ones that can be of high cardinality.
- functions: They combine an aligner and a reducer to extract double/integer values for sorting purposes. The aligner specifies what to extract from individual time series before they are (optionally) merged with each other. In the case we presented, ALIGN_SUMMARY tells us we will merge distribution values. The reducer then extracts the final value from the merged TimeSerie object. In our case, it means we will extract the AVERAGE value from the final distributions.

Monitoring merges time series according to the label keys that are not present in either partitionLabelSets or views. Let's examine the current example in this light.
Paginated indices are generated based on the cartesian product of partitionLabelSets, views and functions. Since we have one item in each, we will have one index that combines partition set ProjectScope and view ByDevice. Now, imagine we have 10K devices, each with 4 CPU cores and 8 CPU states. Therefore, we have 320K TimeSerie objects for each individual timestamp. We will be processing each timestamp separately.
Since the labels in ProjectScope are empty, we will put all 320K TimeSerie objects in one partition. Next, we take a look into view ByDevice. We have one filterable metric label: state. Since we have established there are 8 states, the 320K TimeSerie objects are grouped into 8 different rankings: each ranking now has 40K TimeSerie objects. We will iterate each separately. Inside a ranking, we iterate all TimeSerie and apply the aligner on each of them. The aligner tells us to extract a Distribution of CPU measurements per core. Therefore, we now have 40K Distributions, each with a pair of labels: resource device ID and metric CPU number. Monitoring looks at paginatedResourceKeys and notices we should have only one resource label: device ID. It takes the 40K distributions, then merges those that share the same device ID. The metric label CPU number is effectively reduced (eliminated). Since we have 4 CPU numbers, we will be left with a final 10K Distributions - each assigned to a specific device. Since the value of the reducer is REDUCE_MEAN, we will extract the average value from each distribution. Finally, we will have 10K pairs: device ID + average CPU value across cores. This array is sorted in descending order.

The process above is executed for each sorting ranking. Therefore, for each timestamp and each of the 8 CPU states, we will extract 10K device + AVG CPU pairs. Rankings within the same partitionLabelSets are stored in the same index, but in a sorted manner. Sorting neutralizes cardinality issues of all paginated labels.
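The per-timestamp processing for one ranking can be sketched in Go; a plain per-core value stands in for an ALIGN_SUMMARY distribution, and the names here are illustrative:

```go
package main

import (
	"fmt"
	"sort"
)

// sample is one aligned value with its labels: the paginated label
// (device ID) and the value for one CPU core (cpu_number is eliminated
// during merging).
type sample struct {
	deviceID string
	cpuUsage float64
}

type rankedEntry struct {
	deviceID string
	value    float64
}

// rankByDevice merges samples sharing the same device ID (eliminating the
// cpu_number label), applies REDUCE_MEAN over the merged values, and sorts
// the result in descending order.
func rankByDevice(samples []sample) []rankedEntry {
	sums := map[string]float64{}
	counts := map[string]int{}
	for _, s := range samples {
		sums[s.deviceID] += s.cpuUsage
		counts[s.deviceID]++
	}
	var out []rankedEntry
	for id, sum := range sums {
		out = append(out, rankedEntry{id, sum / float64(counts[id])})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].value > out[j].value })
	return out
}

func main() {
	// Two devices, two cores each, at one timestamp.
	ranking := rankByDevice([]sample{
		{"dev-a", 10}, {"dev-a", 30}, // mean 20
		{"dev-b", 80}, {"dev-b", 90}, // mean 85
	})
	fmt.Println(ranking[0].deviceID, ranking[0].value) // dev-b 85
}
```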
While CPU number may be a useful label, it was decided that in this ranking we will drop it to reduce the number of rankings in general. If labels are indeed not needed, it is recommended to reduce them.
View/function names used in queries must be pre-defined in the metric descriptor:
cuttle monitoring query time-serie --parent projects/your-project \
--filter 'metric.type="devices.edgelq.com/device/cpu/utilization" AND \
resource.type="devices.edgelq.com/device" AND \
region="us-west2" AND metric.labels.state IN ["user","system"]' \
--pagination '{"alignmentPeriod": "3600s", \
"view": "ByDevice", \
"function": "Mean",
"limit": 20,
"offset": 0}' \
--interval '{"startTime": "2023-07-12T19:00:00Z", \
"endTime": "2023-07-12T22:00:00Z"}' -o json
You may have noticed that this CPU ranking may hide devices which have very high CPU usage on a single core - and low usage on the remaining ones. For example, if we have 3 CPUs with 5% load, and 1 CPU with 95% load, the average will be 27.5%.

To address this issue, we may add another function:
functions:
- name: "Mean"
  # ... other fields
- name: "MaxCoreOfAvg"
  aligner: ALIGN_MEAN
  reducer: REDUCE_MAX
  sorting: DESCENDING
Aligner ALIGN_MEAN will extract the average CPU usage per core within each alignment period. Reducer REDUCE_MAX then picks the CPU core with the highest value. If CPUs 0/1/2 had an average CPU usage of 5%, and CPU 3 had 95%, the final value will be not 27.5%, but 95%.
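Using the numbers from the example, the two functions can be compared with a small sketch; reduceMean and reduceMax are illustrative stand-ins for REDUCE_MEAN and REDUCE_MAX applied to per-core values already produced by the aligner:

```go
package main

import "fmt"

// reduceMean mimics REDUCE_MEAN over per-core values.
func reduceMean(values []float64) float64 {
	sum := 0.0
	for _, v := range values {
		sum += v
	}
	return sum / float64(len(values))
}

// reduceMax mimics REDUCE_MAX over per-core values.
func reduceMax(values []float64) float64 {
	max := values[0]
	for _, v := range values[1:] {
		if v > max {
			max = v
		}
	}
	return max
}

func main() {
	cores := []float64{5, 5, 5, 95} // per-core average CPU usage, in %
	fmt.Println(reduceMean(cores))  // 27.5 - the "Mean" ranking value
	fmt.Println(reduceMax(cores))   // 95 - the "MaxCoreOfAvg" ranking value
}
```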
It is worth checking what functions are needed by users though - each function requires additional computation/storage resources when writing.
A paginated indices group produces as many indices as the size of the cartesian product of partitionLabelSets, views and functions. In this example, we will have one index: usage-ranking:ProjectScope/ByDevice/Mean.
In summary:

- partitionLabelSets may be empty, or contain any low/high cardinality labels.
- views.filterableMetricKeys and views.filterableResourceKeys should not contain high cardinality labels.
- views.paginatedMetricKeys and views.paginatedResourceKeys are designed to handle high cardinality labels.

Each index has 3 lifecycle states: ACTIVE, SUSPENDED and CLOSED.
By default, an index is in the ACTIVE state. From this state, it can be moved into the SUSPENDED or CLOSED state. From the SUSPENDED state, an index can come back to the ACTIVE state or go to the CLOSED state. The CLOSED state is terminal - the index only exists to provide historical data up to the time when the index was closed.

Potentially, an index can be completely removed from the MetricDescriptor resource. In that case, it is forgotten and the associated data will eventually expire.

Be aware that time series are computed/written continuously over time. Adding a new index does not cause old data to be recomputed. Nor will deleting/closing an index delete old data. When monitoring gets a query, it analyzes the requested interval to find the best index for each sub-period within.

It is highly recommended to move an index into the SUSPENDED state before CLOSED. This way we can test if anyone was actually using this index. While it is possible to use the Audit service (sample reads) to analyze usage, or check monitoring index usage, it is advisable to err on the side of safety.
Non-aggregated indices can be closed by changing the partition label set status individually:
nonAggregatedIndices:
- name: "..."
  resourceTypes: [ ... ]
  partitionLabelSets:
  - name: "..."
    closingStatus: SUSPENDED # or CLOSED
For pre-aggregated/paginated indices, each cartesian component has its own closing status field:
paginationIndices:
- name: "..."
  resourceTypes: [ ... ]
  partitionLabelSets:
  - name: "..."
    closingStatus: ...
  views:
  - name: "..."
    closingStatus: ...
  functions:
  - name: "..."
    closingStatus: ...
preAggregatedIndices:
- name: "..."
  resourceTypes: [ ... ]
  partitionLabelSets:
  - name: "..."
    closingStatus: ...
  filterAndGroupLabelSets:
  - name: "..."
    closingStatus: ...
  supportedAggregations:
  - name: "..."
    closingStatus: ...
An index is considered CLOSED if at least one of its inputs is in the CLOSED state. Otherwise, the index is SUSPENDED if at least one of its inputs is in the SUSPENDED state.
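The rule for deriving the effective index status can be sketched as follows; the function name and string representation are illustrative:

```go
package main

import "fmt"

// effectiveStatus derives the index state from the closing statuses of its
// cartesian components ("" or "ACTIVE" means the component is active).
func effectiveStatus(componentStatuses []string) string {
	suspended := false
	for _, s := range componentStatuses {
		switch s {
		case "CLOSED":
			return "CLOSED" // one closed component closes the whole index
		case "SUSPENDED":
			suspended = true
		}
	}
	if suspended {
		return "SUSPENDED"
	}
	return "ACTIVE"
}

func main() {
	fmt.Println(effectiveStatus([]string{"ACTIVE", "SUSPENDED", "ACTIVE"})) // SUSPENDED
	fmt.Println(effectiveStatus([]string{"SUSPENDED", "CLOSED"}))           // CLOSED
}
```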
Apart from lifecycle, it is important to categorize each index into 2 groups: built-in indices (defined by the service) and user-defined indices (defined by users). Both groups are reflected in the MetricDescriptor schema:
- name: projects/.../metricDescriptors/...
  type: ...
  displayName: ...
  metricKind: ...
  valueType: ...
  unit: ...
  labels: ...
  indices:
    builtIn:
      nonAggregatedIndices: ...
      preAggregatedIndices: ...
      paginationIndices: ...
    userDefined:
      nonAggregatedIndices: ...
      preAggregatedIndices: ...
      paginationIndices: ...