SPEKTRA Edge Alerting Service API

Understanding the Alerting service API.

The Alerting service allows users to:

  • Set up automatic alerting based on time series and logs (although log-based alerting is not implemented yet).
  • Configure notifications about alert state changes.
  • Use AI to diagnose and remediate alerts (when possible).

Full API Specifications (with resources):

  • Version v1

It uses data from the Monitoring and Logging services.

Policy management and alerts

The Alerting service is agnostic when it comes to observed resource types. SPEKTRA provides pre-defined templates for devices and pods, but alerting can be extended to custom services. Alerting typically relies on the monitoring/logging services, where users can specify the relevant descriptors for time series and logs.

The most important resource in the Alerting service is Policy: it is a container for conditions focused on a specific resource type (like Device or Pod). A Policy also provides a common specification for alert handling: whether the AI agent should be enabled automatically, a list of additional supporting queries used to fetch correlated data, and so on.

There are two types of alerts generated for time series:

  • THRESHOLD alerts are triggered when metric values exceed the allowed range. For example, we can set up an alert when CPU usage exceeds 85%. Threshold values can be global (per project), or customized per reporting resource (adaptive, based on historic values).
  • ANOMALY alerts use AI algorithms to detect “unusual” patterns in time series metrics. Note that each resource reporting metrics has its own patterns: each device may run different software, on different hardware, with a different workload. Anomaly detectors can be powerful in detecting abnormal situations, but they have limitations.

It should be noted that THRESHOLD alerts are de facto mandatory. ANOMALY alerts can only be generated after a sufficient amount of time passes, because the AI algorithms detecting anomalies require training data first. Depending on configuration, this may take a week, two, four, or more. While training data is being collected, anomalies cannot be detected. On top of that, the characteristics of a time series can change: if software is upgraded, for example, the old patterns may no longer be valid. In that case ANOMALY alerts cease to be meaningful, and we need to wait again until new training data is collected. Anomaly alerts may therefore not always be available, while threshold alerts always are. This is why time series conditions must always specify a threshold configuration, while the anomaly configuration is optional: it ensures that we always raise an alert when values leave the allowed range.

Policies can be divided into two types, based on where alert detection runs:

  • BACKEND: Alert detection for this policy type runs in the backend itself. It is especially suitable for metrics like device connectivity: when a device goes offline, it is not able to generate an “offline” alert on its own. The drawback of a BACKEND policy is poor support for ANOMALY alerts. AI algorithms detecting anomalies can be too heavy computationally when many devices report metrics: remember that each device needs to learn its own patterns. If we have 25 observed metrics per device and 10000 devices, that is 250000 unique time series that must be learned separately, and 10000 devices is not even very large when considering all projects.
  • EDGE: Alert detection for this policy type runs at the edge. For core SPEKTRA services, the droplet is capable of monitoring its own metrics for the devices.edgelq.com/device and applications.edgelq.com/pod resource types. A policy of this type can handle ANOMALY alerts: since each device uses its own hardware, there is no risk of an “AI workload backlog” growing from thousands of devices, and the droplet can use spare local resources to run anomaly detection. The drawback of an EDGE policy is that it cannot monitor certain metrics, most notably the connectivity metric: if a device goes offline, it cannot send an alert saying that it is offline.

In SPEKTRA, we have two primary resource types that can raise alerts:

  • Device (from devices.edgelq.com service)
  • Pod (from applications.edgelq.com service)

It should be remembered that each Pod has an assigned Device, therefore actual reporting always happens on the Device.

Project administrators need to decide whether they want both ANOMALY and THRESHOLD alerts, or only THRESHOLD. If they opt for the former, it is advisable to have four policies for SPEKTRA core services:

  • BACKEND policy for Device resource (alerting resource is Device, the only resource label is device_id)
  • BACKEND policy for Pod resource (alerting resource is Device, resource labels are device_id and pod_id)
  • EDGE policy for Device resource (alerting resource is Device, the only resource label is device_id)
  • EDGE policy for Pod resource (alerting resource is Device, resource labels are device_id and pod_id)

If anomalies are not desired, the administrator can create just two BACKEND policies: one for Device, one for Pod.

Policies can be complex, therefore pre-defined templates are provided. You can list them with the following command:

$ cuttle alerting list policy-templates --project public-alerting-templates --filter \
  'specTemplate.resourceIdentity.alertingResource="services/devices.edgelq.com/resources/Device"' \
  --field-mask supportingDocs --field-mask specTemplate -o json

The command above displays policy templates for core SPEKTRA services (applications and devices). Project public-alerting-templates is a public project containing all alerting templates that can be used in other projects. To see templates for core services, we filter specifically by the Device resource type. The command returns templates for both the Device and Pod resource types. While this may not look intuitive at first, it has a specific reason: all Pod resources are attached to some Device, therefore it is always the Device that is alerting.

To create a Policy in your project, first copy the Documents (highly recommended), then create the policy. Check the field supportingDocs in the PolicyTemplate, get those documents, and copy them:

$ cuttle alerting create document <document id> --project <my project ID> --title "<copied title>" \
  --mime-type "<copied type>" --content "<copied content>"

With documents prepared, create policy:

$ cuttle alerting create policy <policy id> --project <my project ID> --display-name "<display name of your choice>" \
  --spec '<copy paste JSON here from specTemplate>' \
  --template-source '{"template": "projects/public-alerting-templates/policyTemplates/<templateId>"}' \
  --supporting-docs '<first document...>' --supporting-docs '<second document...>' ... --supporting-docs '<last document>'

Each document name (supporting docs) must use the full name format: projects/<project ID>/documents/<document ID>.
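
For illustration only: assuming the backend-device-base-alerts template referenced later in this document, and a project, policy ID, display name, and document ID that are all placeholders, the call could look like this:

$ cuttle alerting create policy my-backend-device-alerts --project my-project --display-name "Backend Device Alerts" \
  --spec '<JSON copied from the specTemplate of backend-device-base-alerts>' \
  --template-source '{"template": "projects/public-alerting-templates/policyTemplates/backend-device-base-alerts"}' \
  --supporting-docs 'projects/my-project/documents/device-alerts-runbook'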

You can refer to PolicySpec for more details about spec.

Time Series Conditions

Once you have a Policy resource, you can create a TsCondition. It specifies:

  • List of observed time series queries (filters and aggregations, common alignment period and group by fields)
  • Threshold alerting configuration
  • Optional anomaly alerting configuration

Each TsCondition must belong to a specific Policy. Group by fields in the TsCondition spec must conform to the alerting resource specified in the Policy. For example, if a Policy is for the Device resource, then each TsCondition must specify the device_id label in its group by list. If a Policy is for the Pod resource (from applications.edgelq.com), then each TsCondition must specify the device_id and pod_id labels in its group by list. A TsCondition can optionally specify extra labels if needed. Remember that each unique combination of group by values across the project is monitored separately, and each can generate an alert.
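
For illustration, the group by lists could look as below (a sketch only: the device_id label path follows the TsCondition example later in this document, while the pod_id path is an assumption):

  "queryGroupBy": ["resource.labels.device_id"]                             (Device policy)
  "queryGroupBy": ["resource.labels.device_id", "resource.labels.pod_id"]   (Pod policy)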

Writing a good TsCondition requires knowledge of the available monitoring metrics, labels, and typical values. If we want to use anomaly detection, we also need to specify training params. To simplify this process, there is a pre-defined list of TsCondition templates as well. Each policy template has its own recommended conditions.

In SPEKTRA core services, we have four policy templates in the public project (BACKEND and EDGE, for Device and Pod). We can get their list with this query:

$ cuttle alerting list policy-templates --project public-alerting-templates --filter \
  'specTemplate.resourceIdentity.alertingResource="services/devices.edgelq.com/resources/Device"' --view NAME

Now, to see the list of recommended conditions for a specific policy template, we can execute this query:

$ cuttle alerting list ts-condition-templates --project public-alerting-templates --policy-template backend-device-base-alerts \
  --field-mask supportingDocs --field-mask specTemplate -o json

You should see that:

  • We have standard conditions for temperature, CPU, memory, disk, connectivity…
  • BACKEND based policies contain more condition types - for example, the connectivity metric is exclusive to them. It does not make sense to monitor connectivity at the EDGE.
  • EDGE based policies contain anomaly alerting configurations. This is because AI algorithms are much more feasible at the EDGE than in the backend, due to the sheer number of devices, each potentially presenting unique patterns.

Regarding alerting for the SPEKTRA platform (core services): as a project administrator, you need to decide whether you want THRESHOLD alerting only, or THRESHOLD combined with ANOMALY. In the former case, it is recommended to set up just two BACKEND policies and deploy all or some TsConditions from the underlying templates. If ANOMALY alerting is desired alongside THRESHOLD, it is recommended to deploy all four policy templates (BACKEND and EDGE). Then, deploy TsConditions for the EDGE policies - use all or some conditions of your choice (CPU, memory, disk, temperature, etc.). Then, deploy the TsConditions for BACKEND policies that are unique to BACKEND (for example, the connectivity TsCondition for detecting offline devices). It does not make sense to have one TsCondition for CPU on the backend and another one at the EDGE: while it would work, you would have two conditions observing the same metrics and raising duplicated alerts. Using templates allows you to avoid specifying queries, thresholds, or training params for anomaly detectors. Those can be complex, although you may learn from the examples.

Creating a TsCondition based on a template is similar to a Policy. First, copy the Documents (highly recommended). From the TsConditionTemplate, check the field supportingDocs, get those documents, and copy them:

$ cuttle alerting create document <document id> --project <my project ID> --title "<copied title>" \
  --mime-type "<copied type>" --content "<copied content>"

With documents prepared, create TsCondition:

$ cuttle alerting create ts-condition <cnd id> --policy <policy ID> --project <my project ID> --display-name "<display name of your choice>" \
  --spec '<copy paste JSON here from specTemplate>' \
  --template-source '{"template": "projects/public-alerting-templates/policyTemplates/<templateId>/tsConditionTemplates/<templateId>"}' \
  --supporting-docs '<first document...>' --supporting-docs '<second document...>' ... --supporting-docs '<last document>'

Each document name (supporting docs) must, of course, use the full name format: projects/<project ID>/documents/<document ID>.

You can refer to TsCndSpec for more details about spec.

Alert management

With Policies and TsConditions set, the system monitors for abnormal situations and raises Alert resources. You can check them on the dashboard, or with cuttle:

# To list currently firing alerts in a specific condition and region ID (you can replace regionId with "-" as a wildcard):
$ cuttle alerting list alerts --parent 'projects/<projectId>/policies/<policyId>/tsConditions/<tsConditionId>/regions/<regionId>' \
  --filter 'state.isFiring=true' --order-by 'state.startTime ASC' --view DETAIL -o json
  
# To list firing alerts for a specific resource (for example, a specific Device):
$ cuttle alerting list alerts --project <projectId> --region <regionId> \
  --filter 'state.isFiring=true AND alerting_resource.name="projects/<projectId>/regions/<regionId>/devices/<deviceId>"' \
  --order-by 'state.startTime ASC' --view DETAIL -o json
  
# To see alerts for a Pod resource, the query may be a bit more specific. Remember, each Pod has an assigned Device,
# and the main alerting resource for a Pod is still the Device. We need to add an extra filter condition using the pod_id label:
$ cuttle alerting list alerts --project <projectId> --region <regionId> \
  --filter 'state.isFiring=true AND alerting_resource.name="projects/<projectId>/regions/<regionId>/devices/<deviceId>" AND tsInfo.commonResourceLabels.pod_id="<podId>"' \
  --order-by 'state.startTime ASC' --view DETAIL -o json

Alerting, on top of detecting alerts, also provides a way to handle them automatically using GenAI. This is enabled at the Policy level (field spec.ai_agent.enabled in the Policy resource). If the value is true, Alerting will employ an LLM to investigate the alert. It executes the following steps:

  • It fetches the Alert data, with the violating time series (or logs in the future).
  • It fetches the Document resources defined in the Policy and TsCondition resources. This is why it is important to keep those documents in good shape: they can contain an alert runbook that the LLM can utilize.
  • It fetches additional data using the field spec.supporting_queries from the Policy object. This field contains a list of potentially helpful correlated time series/logs/resources. When an alert is raised for a device, it is helpful to have extra data originating from that device. For example, if there is a CPU spike on a device, the LLM will be able to see whether pod CPU metrics are spiking too, or whether logs indicate elevated usage.
  • If the field spec.ai_agent.enabled_connectivity is enabled in the Policy resource, the LLM may also SSH into the Device to get more information.
  • Considering all the mentioned context, the LLM agent decides what to do with the alert. It stores its findings in the field state.ai_agent_diagnosis_notes in the Alert resource.

The LLM agent may:

  • Assume that the alert is a false positive and ignore it (for example, if high CPU usage is temporary due to an upgrade operation). The alert is ignored and no further action is taken.
  • Assume that the alert is a false positive due to an incorrect threshold or an incorrect anomaly detector. For example, if a new workload is added to a device, CPU usage may increase permanently. This can trigger an ANOMALY alert, but it is not an actual anomaly: it is a false positive, and the anomaly detector needs retraining based on the new reality.
  • Escalate to a human operator, if it is not clear how to handle the alert.
  • Try to propose a remediation (in the form of an SSH script, for example) to fix the issue. If the field spec.ai_agent.auto_accept_remediation in the Policy resource is true, the LLM can execute the remediation without involving an operator. Otherwise, the operator will need to approve it first.

This way, the Alerting service can be semi-autonomous, covering the full alert lifecycle.

If the AI agent is not enabled automatically (the alert goes to the operator), or if an alert has been escalated to the operator, the operator can take similar actions:

  • Mark the alert as a false positive (to be ignored).
  • Mark the alert as a false positive whose thresholds need to be adapted.
  • Send it to the AI agent for handling (this is possible even if the AI agent is disabled in the Policy).
  • Mark it as resolved (it is assumed the operator executed some remediation).

There are a couple of important fields in the Alert resource indicating its current status:

  • Field state.is_firing is a boolean indicating whether the Alert is currently firing.
  • Field state.escalation_level indicates whether the Alert is being handled by the AI Agent or the Operator.
  • Field state.ai_agent_handling_state is relevant if state.escalation_level points to the AI Agent, and tells the last available state (or decision) made by the AI. The initial state for a new alert is AI_AWAITING_HANDLING.
  • Field state.operator_handling_state is relevant if state.escalation_level points to the Operator, and tells the last available state (or decision) made by the Operator. The initial state for a new alert is OP_AWAITING_HANDLING.
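
For example, to list firing alerts that currently await an operator decision, the fields above can be combined with the list command shown earlier (the exact filter expression is an assumption based on the filters used elsewhere in this document):

$ cuttle alerting list alerts --parent 'projects/<projectId>/policies/<policyId>/tsConditions/<tsConditionId>/regions/<regionId>' \
  --filter 'state.isFiring=true AND state.operatorHandlingState="OP_AWAITING_HANDLING"' \
  --order-by 'state.startTime ASC' --view DETAIL -o json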

The operator can use cuttle or the dashboard to execute operations on the Alert resource:

# Ignore the alert. Be aware that the field state.operator_handling_state will switch back to OP_AWAITING_HANDLING
# on its own, if the alert does not stop firing after some time:
$ cuttle alerting update alert 'projects/<>/policies/<>/tsConditions/<>/regions/<>/alerts/<>' \
  --state '{"operatorHandlingState": "OP_IGNORE_AS_TEMPORARY"}' --update-mask 'state.operatorHandlingState'
  
# If the operator thinks that the alert was raised due to too sensitive thresholds, it is possible to execute this
# command. The system will try to adjust the entries on its own:
$ cuttle alerting update alert 'projects/<>/policies/<>/tsConditions/<>/regions/<>/alerts/<>' \
  --state '{"operatorHandlingState": "OP_ADJUST_CND_ENTRY"}' --update-mask 'state.operatorHandlingState'
  
# If the operator executed some remediation, they can mark the alert as resolved (this is optional). If a successful
# remediation was executed, the alert should switch its firing status to false anyway. This state can be used for
# filtering purposes.
$ cuttle alerting update alert 'projects/<>/policies/<>/tsConditions/<>/regions/<>/alerts/<>' \
  --state '{"operatorHandlingState": "OP_REMEDIATION_APPLIED"}' --update-mask 'state.operatorHandlingState'

# The operator can mark the alert as acknowledged. From the system's point of view, this state has no consequence. It can
# be used by the operator to remember which alerts are new and which have already been looked at.
$ cuttle alerting update alert 'projects/<>/policies/<>/tsConditions/<>/regions/<>/alerts/<>' \
  --state '{"operatorHandlingState": "OP_ACKNOWLEDGED"}' --update-mask 'state.operatorHandlingState'

# The operator can manually hand the alert over to the AI Agent using this update command:
$ cuttle alerting update alert 'projects/<>/policies/<>/tsConditions/<>/regions/<>/alerts/<>' \
  --state '{"operatorHandlingState": "OP_NOT_INVOLVED", "aiAgentHandlingState": "AI_AWAITING_HANDLING", "escalationLevel": "AI_AGENT"}' \
  --update-mask 'state.operatorHandlingState' --update-mask 'state.aiAgentHandlingState' --update-mask 'state.escalationLevel'

There is also a field state.operator_notes that the operator can use to store additional notes (if they want).
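
For example, assuming the notes field is serialized as operatorNotes (an assumption based on the camelCase convention used in the other update commands), notes could be stored like this:

$ cuttle alerting update alert 'projects/<>/policies/<>/tsConditions/<>/regions/<>/alerts/<>' \
  --state '{"operatorNotes": "<free-form operator notes>"}' --update-mask 'state.operatorNotes'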

If the AI Agent is handling alerts but automatic remediation approvals are not enabled, it may be necessary to check whether there are alerts with a pending approval (for example, in a specific condition and region):

$ cuttle alerting list alerts --parent 'projects/<projectId>/policies/<policyId>/tsConditions/<tsConditionId>/regions/<regionId>' \
  --filter 'state.isFiring=true AND state.aiAgentHandlingState="AI_REMEDIATION_PROPOSED"' --order-by 'state.startTime ASC' \
  --view DETAIL --field-mask state -o json

The state AI_REMEDIATION_PROPOSED indicates that the fields state.ai_remediation and state.ai_remediation_arg are populated and contain the proposed remediation. For example, a disk alert can contain a log rotation command.

The operator must approve the remediation using the following update:

$ cuttle alerting update alert 'projects/<>/policies/<>/tsConditions/<>/regions/<>/alerts/<>' \
  --state '{"aiAgentHandlingState": "AI_REMEDIATION_APPROVED"}' --update-mask 'state.aiAgentHandlingState'

After this, the system will execute the remediation.

Notifications

To enable notifications about alerts, it is necessary to create a NotificationChannel resource. It can be done using cuttle. As of now, there are three types of notification channels:

$ cuttle alerting create notification-channel --project <projectId> email-example \
  --spec '{"enabled":true, "enabled_kinds": ["NEW_FIRING","STOPPED_FIRING"], "type":"EMAIL", "addresses":["admin@example.com"]}'

$ cuttle alerting create notification-channel --project <projectId> slack-example \
  --spec '{"enabled":true, "enabled_kinds": ["NEW_FIRING","STOPPED_FIRING"], "type":"SLACK", "incomingWebhook": "https://some.url"}'
  
$ cuttle alerting create notification-channel --project <projectId> webhook-example \
  --spec '{"enabled":true, "enabled_kinds": ["NEW_FIRING","STOPPED_FIRING"], "type":"WEBHOOK", "webhook": {"url": "https://some.url", "maxMessageSizeMb": 0.25}}'

Note that using the field enabledKinds we can choose which state changes generate notifications. We may also opt to hide some alerts, for example if they can be fully handled by AI.
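
As a sketch, a channel that notifies only about newly firing alerts (and stays silent when they stop firing) could be created like this; the channel ID and address are placeholders:

$ cuttle alerting create notification-channel --project <projectId> email-new-firing-only \
  --spec '{"enabled":true, "enabled_kinds": ["NEW_FIRING"], "type":"EMAIL", "addresses":["admin@example.com"]}'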

Created channels may be attached to policies:

$ cuttle alerting update policy 'projects/<projectId>/policies/<policyId>' \
  --notification-channels 'projects/<projectId>/notificationChannels/<channelId>' --update-mask notificationChannels

Naturally, a policy can also be created with notification channels attached from the start.
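
For illustration, the create policy flags shown earlier can be combined with --notification-channels (assuming the create command accepts the same flag as update; all IDs are placeholders):

$ cuttle alerting create policy <policy id> --project <my project ID> --display-name "<display name of your choice>" \
  --spec '<copy paste JSON here from specTemplate>' \
  --template-source '{"template": "projects/public-alerting-templates/policyTemplates/<templateId>"}' \
  --notification-channels 'projects/<my project ID>/notificationChannels/<channelId>'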

Apart from using notifications, users can also access the API to watch alert changes directly in their projects.

Advanced topics - anomaly detection in TsCondition

Testing anomaly detection training

Writing a TsCondition can be tricky, especially determining good anomaly detectors. There are helper cuttle commands to facilitate this.

Suppose we want to write a TsCondition for the temperature metric type. We want to monitor all chips on all devices in a project, but first we want to test an anomaly detector on an example device/chip that we already have. We can create a JSON file like the one below:

{
  "projectId": "device-dev",
  "regionId": "us-west2",
  "tsEntryInfo": {
    "commonResourceLabels": {"device_id": "pp-rak-device-ek4445mkllkk3"},
    "commonMetricLabels": {"chip": "CPU"}
  },
  "queries": [
    {
      "name": "Temperature in celcius",
      "filter": "metric.type = \"devices.edgelq.com/device/hardware/temperature\" AND resource.type=\"devices.edgelq.com/device\"",
      "aligner": "ALIGN_MEAN",
      "reducer": "REDUCE_MAX",
      "maxValue": 120
    }
  ],
  "anomalyAlerting": {
    "analysisWindow": "3600s",
    "stepInterval": "300s",
    "trainStepInterval": "900s",
    "alignmentPeriod": "300s",
    "raiseAfter": "300s",
    "silenceAfter": "300s",
    "lstmAutoencoder": {
      "hiddenSize": 8,
      "learnRate": 0.005,
      "maxTrainingEpochs": 256,
      "minTrainingEpochs": 32,
      "acceptableTrainingError": 0.0005,
      "trainingPeriod": "604800s",
      "checkPeriodFraction": 0.05
    }
  },
  "trainEndTime": "2025-06-18T00:00:00Z",
  "checkInterval": {
    "startTime": "2025-06-18T00:00:00Z",
    "endTime": "2025-06-18T06:00:00Z"
  }
}

We will analyze this JSON file step by step.

Our project is device-dev, and we have some devices in the us-west2 region. We have a device with the ID pp-rak-device-ek4445mkllkk3, which has a CPU chip. We can first use the monitoring service to verify data presence:

$ cuttle monitoring query time-serie --parent 'projects/device-dev' \
  --filter 'metric.type="devices.edgelq.com/device/hardware/temperature" AND resource.type="devices.edgelq.com/device" \
    AND region="us-west2" AND resource.labels.device_id="pp-rak-device-ek4445mkllkk3" AND metric.labels.chip="CPU"' \
  --aggregation '{"alignmentPeriod":"300s", "perSeriesAligner":"ALIGN_MEAN","crossSeriesReducer":"REDUCE_MAX"}' \
  --interval '{"startTime":"2025-06-11T00:05:00Z", "endTime":"2025-06-18T06:00:00Z"}' -o json

If we have data, we can use this example to write the previously mentioned JSON file. The project ID, region ID, and tsEntryInfo must identify our example resource. In the field queries we provide all relevant queries (temperature in our case).

Configuration in anomalyAlerting can be tricky, but it is possible to master it. The first thing to know is that anomaly detection runs on a sliding window, which is described by four parameters:

  • analysisWindow: Describes the window size. In our case, it is one hour. The anomaly detector always runs on a window of this fixed size.
  • stepInterval: Describes the sliding step, after which a new window is prepared. This is 5 minutes in our case. This param is used during normal operation.
  • trainStepInterval: Like stepInterval, but used during training. It should be the same as stepInterval or larger, to reduce the number of generated training samples.
  • alignmentPeriod: Defines how large a period is covered by a single data point (a floating point number). This is the data granularity.

Consider this example:

    "analysisWindow": "3600s",
    "stepInterval": "300s",
    "trainStepInterval": "900s",
    "alignmentPeriod": "300s",

Once training finishes, the system will run anomaly detection every 5 minutes (stepInterval). Example windows:

  • Start time (inclusive): 12:05:00, end time (inclusive): 13:00:00. Number of data points: 12 (60 minutes divided by the 5 minute alignment period).
  • Start time (inclusive): 12:10:00, end time (inclusive): 13:05:00. Number of data points: 12.
  • Start time (inclusive): 12:15:00, end time (inclusive): 13:10:00. Number of data points: 12.
  • Start time (inclusive): 12:20:00, end time (inclusive): 13:15:00. Number of data points: 12.
  • … and so on

The param trainStepInterval decides the step during training only. Here, it is 15 minutes. Therefore, if we have one week of training data, with one data point representing 5 minutes (the alignment period), we have 2016 continuous data points: 24 * 7 * 12 (7 days, 24 hours, 12 data points per hour). The system will generate the following training samples:

  • Start time (inclusive): 00:05:00, end time (inclusive): 01:00:00 (first day). Number of data points: 12 (60 minutes divided by the 5 minute alignment period).
  • Start time (inclusive): 00:20:00, end time (inclusive): 01:15:00 (first day). Number of data points: 12.
  • Start time (inclusive): 00:35:00, end time (inclusive): 01:30:00 (first day). Number of data points: 12.
  • Start time (inclusive): 00:50:00, end time (inclusive): 01:45:00 (first day). Number of data points: 12.
  • … and so on

The number of training samples can be calculated as (2016 - 12) / 3 + 1 = 669 samples with 12 data points each. The number 2016 is just the total number of data points. We subtract 12 (the window size in data points) because the first window is not produced by sliding. We divide by 3 because trainStepInterval is 3 times larger than alignmentPeriod. Finally, we add 1 for the first window.
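
In general, using the parameters from this example, the same calculation can be written as:

  window_points = analysisWindow / alignmentPeriod     = 3600s / 300s   = 12
  step_points   = trainStepInterval / alignmentPeriod  = 900s / 300s    = 3
  total_points  = trainingPeriod / alignmentPeriod     = 604800s / 300s = 2016
  samples       = (total_points - window_points) / step_points + 1 = 669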

The training data set will therefore contain 8028 floating point numbers (669 * 12), consuming around 64 KiB. During training itself, this amount will of course be proportionally multiplied.

As of now, we support only the LSTM Autoencoder, as it is the most flexible and capable of combining multiple ts queries if needed. The trained model size depends on the hidden size (HS) and the number of time series queries (TSC). The number of floating point values is: TSC * (HS + 2) + 2 * (4*HS*HS + 4*HS + 4*TSC*HS). If HS = 8 and TSC = 1, we have 1 * (8 + 2) + 2 * (4*64 + 32 + 32) = 650 numbers, which is a bit above 5 KiB.

Training requires a considerable amount of RAM too: each training step (number of data points in a sequence) generates temporary matrices for back propagation. The approximate value is STEPS_COUNT * (BatchSize + 23 * BatchSize * HiddenSize) numbers. If the batch size is 669 (the samples count), the hidden size is 8, and the steps count is 12, then it is around 11 MiB of data. However, it can grow with a larger hidden size, steps count, or batch count: the number of points in a sequence can reach 60 or even 120, and if on top of that the hidden size is raised to 24 and the batch count increases 4 times, we may need up to 1 GB of temporary space during training, not including potential overheads. This is why trainStepInterval may be larger than stepInterval: we may want to run checks every 5 minutes, but less granular steps during training may be good enough. It is advisable to observe actual usage during test training.
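
For the 11 MiB example above, the arithmetic works out as follows (assuming 8 bytes per number):

  12 * (669 + 23 * 669 * 8) = 12 * 123765 = 1485180 numbers ≈ 11.3 MiB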

Having decided on the sliding window configuration, we need to consider the params raiseAfter and silenceAfter. They should be multiples of the alignment period; if they are not, the system will in practice round them up to the alignment period. They indicate the number of consecutive data points in which an anomaly must be detected for an alert to be raised (or silenced). In our case, a single anomalous point will cause the alert to be raised or silenced.

Next, we have the configuration for the LSTM: lstmAutoencoder. We decide on the hidden size, which influences the trained model size. The hidden size should be lower than the sequence length (the number of data points in a window). Usually we use values of 8, 16, or 24, but different ones can be tried. The learn rate is another important param, and it is tricky to pick the right value: a lower value usually extends the training time and may reach better accuracy, but with a risk of overtraining. Training stops when both conditions are met:

  • The number of passed epochs reaches minTrainingEpochs.
  • The number of passed epochs reaches maxTrainingEpochs OR the maximum error drops to acceptableTrainingError.

The param trainingPeriod indicates how long the training period must be. If we do not have sufficient data, training will not take place. Finally, the param checkPeriodFraction indicates the fraction (it must be lower than 1.0) of training samples that will not be incorporated into the training set. Those extra samples are used to detect how large anomaly errors are in practice - not on training data. It is important to set this param to something larger than 0, because errors measured on the training data itself are usually not reliable.

It is important to mention at this point that model training will be considered failed if the anomaly error values are above a certain maximum threshold. As of now, this value is 0.025. It means that if the check samples (from checkPeriodFraction) give a maximum error of 0.015, the model will be accepted. If the maximum error is, for example, 0.03, the model will be rejected. If a model is rejected during actual training (not our cuttle test training), the system will reschedule training in the future, on different data.

It should also be noted that, in case of failed training, the system may try different training params than those indicated in the specification. The specification is only a default suggestion. This means that, if we have some unusual devices, models may still be trained successfully. However, as of now, this is a backup option.

Let us come back to the original JSON file and clarify the last params:

  • trainEndTime indicates the end time up to which training data is fetched. The begin time depends on the param trainingPeriod.
  • checkInterval indicates an extra time range on which the trained model will be checked. This param must not be confused with checkPeriodFraction: the latter defines how many training samples are used for determining actual anomaly error values, which determines anomaly sensitivity later on. The time period in checkInterval is used to generate sliding windows on which anomaly detection is tested. The cuttle tool will run the trained model on all samples within the specified interval and display the alerts that would have been raised.

Once we have JSON file and we understand it, we can test train & run anomaly detection:

# We can add optional train/check debug flags to have more verbose info during training or checking
$ cuttle alerting test-train-anomaly-detector-model -f <path-to-json-file> \
  --train-debug --check-debug

If we are satisfied with the result, we can craft the TsCondition resource, for example:

{
  "name": "projects/device-dev/policies/some-policy/tsConditions/temperature",
  "displayName": "Temperature alerting",
  "spec": {
    "queries": [
      {
        "name": "Temperature in celcius",
        "filter": "metric.type = \"devices.edgelq.com/device/hardware/temperature\" AND resource.type=\"devices.edgelq.com/device\"",
        "aligner": "ALIGN_MEAN",
        "reducer": "REDUCE_MAX",
        "maxValue": 120
      }
    ],
    "queryGroupBy": [
      "resource.labels.device_id",
      "metric.labels.chip"
    ],
    "thresholdAlerting": {
      "operator": "OR",
      "alignmentPeriod": "60s",
      "raiseAfter": "120s",
      "silenceAfter": "60s",
      "perQueryThresholds": [{
        "maxUpper": 70.0
      }]
    },
    "anomalyAlerting": [
      {
        "analysisWindow": "3600s",
        "stepInterval": "300s",
        "trainStepInterval": "900s",
        "alignmentPeriod": "300s",
        "raiseAfter": "300s",
        "silenceAfter": "300s",
        "lstmAutoencoder": {
          "hiddenSize": 8,
          "learnRate": 0.005,
          "maxTrainingEpochs": 256,
          "minTrainingEpochs": 32,
          "acceptableTrainingError": 0.0005,
          "trainingPeriod": "604800s",
          "checkPeriodFraction": 0.05
        }
      }
    ]
  }
}

We took the tested anomaly detector and incorporated it into the TsCondition. Note that in the field queryGroupBy we provided the labels by which unique monitored entries are generated. They will include our sample device/chip, but will not be limited to it. This TsCondition is naturally also distributed to all regions where the project is enabled.

We need to remember to configure threshold alerting as well, in case anomaly detection is not available or was trained on bad data. Here we want to raise an alert when the temperature rises above 70 degrees Celsius.

Finally, note that in the field anomalyAlerting we can provide an array of configurations. This is recommended if we want two models: one for short term windows (1 hour), and one for long term tendencies (like 1 day). For longer windows, it is recommended to use a larger alignment period: for example, for an analysis window of 1 day, we often use an alignment period of 30 minutes, which gives 48 data points in one window. Both stepInterval and trainStepInterval can be set to 30 minutes too, as in the sketch below.
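
A possible second anomalyAlerting entry for a 1-day window, following the values suggested above (the raiseAfter/silenceAfter values, the LSTM params, and the 4-week trainingPeriod are illustrative assumptions, not recommendations):

      {
        "analysisWindow": "86400s",
        "stepInterval": "1800s",
        "trainStepInterval": "1800s",
        "alignmentPeriod": "1800s",
        "raiseAfter": "1800s",
        "silenceAfter": "1800s",
        "lstmAutoencoder": {
          "hiddenSize": 8,
          "learnRate": 0.005,
          "maxTrainingEpochs": 256,
          "minTrainingEpochs": 32,
          "acceptableTrainingError": 0.0005,
          "trainingPeriod": "2419200s",
          "checkPeriodFraction": 0.05
        }
      }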

Checking existing trained anomaly detectors

We may encounter bad anomaly alerts from some device and TsCondition. There is a chance that the model is wrong. We can use cuttle to get the model and test it on some sample data.

The first thing to know is that trained models are stored in resources called TsEntry. A TsEntry is a resource under TsCondition, and represents a unique combination of values of the labels specified in the spec.queryGroupBy field of the TsCondition.

Imagine we want to find the TsEntry for a specific TsCondition and alerting resource. We can use cuttle:

$ cuttle alerting list ts-entries --parent projects/device-dev/policies/device-alerts/tsConditions/temperature/regions/us-west2 \
  --filter 'info.alerting_resource.name="projects/device-dev/regions/us-west2/devices/pp-rak-device-ek4445mkllkk3"' \
  --view DETAIL --field-mask state -o json

If we have extra labels in spec.queryGroupBy in the TsCondition, for example metric.labels.chip, then we may receive multiple TsEntries, one per chip. We can narrow the cuttle query to a specific TsEntry:

$ cuttle alerting list ts-entries --parent projects/device-dev/policies/device-alerts/tsConditions/temperature/regions/us-west2 \
  --filter 'info.alerting_resource.name="projects/device-dev/regions/us-west2/devices/pp-rak-device-ek4445mkllkk3" AND info.commonMetricLabels.chip="CPU"' \
  --view DETAIL -o json

See TsEntry resource for more details.

A TsEntry contains adaptive alerting thresholds and anomaly models. We can fetch this resource and test anomaly detection on a specified period.

# add check-debug flag for more verbose logging
$ cuttle alerting check-anomaly-detector-model --analysis-window '3600s' \
  --ts-entry 'projects/device-dev/policies/device-alerts/tsConditions/temperature/regions/us-west2/tsEntries/<base64-blob>' \
  --check-start-time '2025-07-21T00:01:00Z' --check-end-time '2025-07-22T00:00:00Z' \
  --check-debug

We need to provide the TsEntry resource, from which the condition and exact time series queries are deduced. Then, we must provide the anomaly window size (a TsEntry may contain different models per anomaly window size). Finally, we need to provide the time range to be checked.

Cuttle will print the alerts that would be generated for the specified period.

We can update the field state.models using cuttle and set a value in the train_after field of the problematic model. This will enforce a new re-training after the specified time.

$ cuttle alerting update ts-entry '<full name>' --state '{"models":[<ALL JSONs>]}' --update-mask 'state.models'

Note, however, that you must provide the full state.models content, with all anomaly models: any unchanged fields must be repeated. As of now, the server does not support updating sub-fields in array elements, and state.models is an array field.

You can also delete the TsEntry altogether. This will trigger re-creation and re-training using current (latest) data.
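
Assuming the CLI exposes a delete verb symmetric to create and update (an assumption; it is not shown elsewhere in this document), that could look like:

$ cuttle alerting delete ts-entry '<full name>'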


Understanding the alerting.edgelq.com service API v1, in proto package ntt.alerting.v1.