Wide-Column Data Models
User Data Model
The input and output of the widecolumn store is KeyedValues from the store package (file common/widecolumn/store/keyed_values.go). It can be thought of as a resource. It consists of:
- A single Key object (uid package). It is used to uniquely identify resources.
- An array of column values (dataentry package, object ColumnValue). Each column value has:
  - Column Family name (string)
  - Timestamp
  - Optional binary key
  - Data (binary data)
KeyedValues can hold multiple column values with timestamps spanning hours, days, weeks, months, and potentially years. Therefore, the Write method saves only a range of values for a single key (normally it appends!), and a query fetches only the part of the data that falls within a given range.
The column family and the optional binary key identify column values that share the same timestamp, so they don’t overwrite each other.
Note that this makes “a resource called KeyedValues” a time-series-like resource. In this context, Key represents some header and column values (with timestamps) contain a series of values across an always-increasing time range.
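The shape described above can be sketched in a few lines of Go. This is only an illustration: the real types live in the store and dataentry packages, and the field names, the string-map key, and the InRange helper below are assumptions made for this sketch, not the actual API.

```go
package main

import (
	"fmt"
	"time"
)

// ColumnValue is a hypothetical mirror of the column value described above.
type ColumnValue struct {
	Family string    // column family name
	Time   time.Time // timestamp of the value
	Key    []byte    // optional binary key, nil when unused
	Data   []byte    // binary payload
}

// KeyedValues pairs one key with its column values.
// The key is simplified to a string map; the real Key comes from the uid package.
type KeyedValues struct {
	Key    map[string]string
	Values []ColumnValue
}

// InRange returns the values whose timestamp falls in [from, to),
// which is the "partial data for a given range" idea behind a query.
func (kv KeyedValues) InRange(from, to time.Time) []ColumnValue {
	var out []ColumnValue
	for _, v := range kv.Values {
		if !v.Time.Before(from) && v.Time.Before(to) {
			out = append(out, v)
		}
	}
	return out
}

// minute returns a fixed start time shifted by n minutes.
func minute(n int) time.Time {
	start := time.Date(2020, 1, 1, 0, 0, 0, 0, time.UTC)
	return start.Add(time.Duration(n) * time.Minute)
}

// sample builds a resource with five values, one per minute.
func sample() KeyedValues {
	kv := KeyedValues{Key: map[string]string{"name": "example"}}
	for i := 0; i < 5; i++ {
		kv.Values = append(kv.Values, ColumnValue{
			Family: "cf",
			Time:   minute(i),
			Data:   []byte{byte(i)},
		})
	}
	return kv
}

func main() {
	// Only the values at minute 1 and minute 2 fall into the range.
	for _, v := range sample().InRange(minute(1), minute(3)) {
		fmt.Println(v.Family, v.Time.Format(time.RFC3339), v.Data)
	}
}
```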
Imagine there is a temperature sensor with some serial number. It sends the temperature every minute. In that case, we can define the following Key: serial_number=1234, product=temp_sensor and Column Family: temperatures. Then each temperature can be represented as a double value and associated with a specific timestamp. A double can be encoded as binary data. We don’t need an optional binary key in this case.
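To make "a double can be encoded as binary data" concrete, here is a minimal sketch using only the Go standard library. The 8-byte big-endian layout and the function names encodeTemp and decodeTemp are assumptions for illustration; the store does not prescribe how the Data bytes are laid out.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// encodeTemp packs a float64 into 8 big-endian bytes (layout is our choice).
func encodeTemp(v float64) []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, math.Float64bits(v))
	return buf
}

// decodeTemp reverses encodeTemp and rejects payloads of the wrong size.
func decodeTemp(b []byte) (float64, error) {
	if len(b) != 8 {
		return 0, fmt.Errorf("expected 8 bytes, got %d", len(b))
	}
	return math.Float64frombits(binary.BigEndian.Uint64(b)), nil
}

func main() {
	data := encodeTemp(123.4) // this would go into the Data field
	temp, err := decodeTemp(data)
	fmt.Println(len(data), temp, err)
}
```

Whatever layout is chosen, the writer and every reader of the temperatures column family have to agree on it, since the store only sees opaque bytes.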
Keys
A key is a unique identifier of a column value series. Key definitions and manipulation are provided by the uid package. It is a set of key-value pairs (KV type). A KV without the Value part is just a “KId” object, a key identifier. It may be better thought of as a key field identifier, because “the real” Key is an array of these objects.
Each KV pair has an equivalent UID object. A UID allows mapping of Keys and KeyValue pairs between string and binary representations. UIDs are represented as paths in a tree of depth up to three, where each node may be allocated using the parent counter and atomic operations. The hierarchy is:
- Root UID len 1; (always 1), str: _
- Key UID len 2; (1, kid), example str: _#serial_number
- Value UID len 3; (1, kid, vid), example str: _#serial_number#1234
It is worth mentioning that a UID with depth 1 is the same as the empty root value. A UID with length 2 is just a KId, and a full-length UID is equivalent to a full KV.
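The string form of the three levels can be illustrated with a short sketch. The helpers formatStrUID and describe are hypothetical and exist only for this example; the real conversions are provided by the uid package, and this sketch ignores details such as names that themselves contain the separator.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	rootName  = "_" // string form of the root UID
	separator = "#" // joins the levels of the path
)

// formatStrUID builds the string form of a root, key, or value UID.
// An empty key yields the root; an empty value yields a key UID.
func formatStrUID(key, value string) string {
	parts := []string{rootName}
	if key != "" {
		parts = append(parts, key)
		if value != "" {
			parts = append(parts, value)
		}
	}
	return strings.Join(parts, separator)
}

// describe reports which level of the tree a string UID addresses.
func describe(s string) (string, error) {
	parts := strings.Split(s, separator)
	if parts[0] != rootName {
		return "", fmt.Errorf("uid %q does not start at the root", s)
	}
	switch len(parts) {
	case 1:
		return "root", nil
	case 2:
		return "key", nil // equivalent to a KId
	case 3:
		return "value", nil // equivalent to a full KV
	}
	return "", fmt.Errorf("uid %q is deeper than three levels", s)
}

func main() {
	for _, s := range []string{
		formatStrUID("", ""),
		formatStrUID("serial_number", ""),
		formatStrUID("serial_number", "1234"),
	} {
		level, err := describe(s)
		fmt.Println(s, level, err)
	}
}
```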
It is important to say that UID is more focused on how data is internally stored in the underlying backend (bigtable or scylla).
Going back to the temperature sensor example, we can notice our Key object has two KV pairs:
serial_number=1234
product=temp_sensor
But if you look at the structure of KV you may notice that it is a pair of integer values. This is because storing integers is more efficient than storing strings, especially if the given key value repeats many times across time and different keys. Therefore, each type in the uid package has a “String” equivalent:
- KV has SKV
- KId has SKId
- UID has StrUID
Note that the main store interface, apart from Query/Write, also provides functions for:
- Allocating string key values.
- Resolving strings to integers back and forth.
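To show what allocating and resolving means in practice, here is a minimal in-memory sketch. It is not the store's implementation: the Allocator type and its methods are hypothetical, SKV and KV are simplified local stand-ins for the uid types, and a mutex replaces the backend counters and atomic operations mentioned earlier.

```go
package main

import (
	"fmt"
	"sync"
)

// SKV and KV are simplified stand-ins for the uid package types of the same names.
type SKV struct{ Key, Value string }
type KV struct{ KId, VId uint64 }

// level holds one counter plus both lookup directions for a single tree node.
type level struct {
	next   uint64
	byName map[string]uint64
	byID   map[uint64]string
}

func newLevel() *level {
	return &level{byName: map[string]uint64{}, byID: map[uint64]string{}}
}

// id returns the existing id for name, or assigns the next counter value.
func (l *level) id(name string) uint64 {
	if id, ok := l.byName[name]; ok {
		return id
	}
	l.next++
	l.byName[name] = l.next
	l.byID[l.next] = name
	return l.next
}

// Allocator keeps key ids under the root and value ids under each key.
type Allocator struct {
	mu     sync.Mutex
	keys   *level
	values map[uint64]*level // one counter per parent key id
}

func NewAllocator() *Allocator {
	return &Allocator{keys: newLevel(), values: map[uint64]*level{}}
}

// Allocate maps a string pair to an integer pair, creating ids on first use.
func (a *Allocator) Allocate(s SKV) KV {
	a.mu.Lock()
	defer a.mu.Unlock()
	kid := a.keys.id(s.Key)
	if a.values[kid] == nil {
		a.values[kid] = newLevel()
	}
	return KV{KId: kid, VId: a.values[kid].id(s.Value)}
}

// Resolve maps an integer pair back to its string pair.
func (a *Allocator) Resolve(kv KV) (SKV, bool) {
	a.mu.Lock()
	defer a.mu.Unlock()
	key, ok := a.keys.byID[kv.KId]
	if !ok {
		return SKV{}, false
	}
	value, ok := a.values[kv.KId].byID[kv.VId]
	if !ok {
		return SKV{}, false
	}
	return SKV{Key: key, Value: value}, true
}

func main() {
	a := NewAllocator()
	kv := a.Allocate(SKV{Key: "serial_number", Value: "1234"})
	back, ok := a.Resolve(kv)
	fmt.Println(kv, back, ok)
}
```

Allocating the same string pair twice returns the same integers, which is what makes the integer form safe to store and compare.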
The KeyedValues structure provides Key and SKey as fields. They are meant to be equivalent. However, it is worth noting:
- When executing the Query method, all KeyedValues will only have a Key value set. It is assumed that SKey may not always be needed. Besides, it is more efficient to resolve all SKeys in bulk once the query is completed.
- When executing the Write method, the store will check if the Key is defined. If not, it will default to SKey and allocate/resolve at runtime. However, it is recommended to use Key whenever possible for performance reasons.
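The Write-side fallback can be sketched as follows. The function keyForWrite, the allocate callback, and the simplified types are all assumptions made for this example; they only mirror the behavior described above, not the store's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the uid types and for KeyedValues.
type KV struct{ KId, VId uint64 }
type SKV struct{ Key, Value string }

type KeyedValues struct {
	Key  []KV  // integer form, preferred
	SKey []SKV // string form, used only when Key is empty
}

// keyForWrite returns Key when it is set. Otherwise it falls back to SKey
// and allocates/resolves each pair through the supplied callback.
func keyForWrite(kv KeyedValues, allocate func(SKV) KV) ([]KV, error) {
	if len(kv.Key) > 0 {
		return kv.Key, nil // fast path: no string lookups needed
	}
	if len(kv.SKey) == 0 {
		return nil, errors.New("neither Key nor SKey is set")
	}
	out := make([]KV, 0, len(kv.SKey))
	for _, s := range kv.SKey {
		out = append(out, allocate(s))
	}
	return out, nil
}

func main() {
	next := uint64(0)
	// Toy allocator: hands out increasing value ids under key id 1.
	allocate := func(SKV) KV { next++; return KV{KId: 1, VId: next} }

	key, err := keyForWrite(KeyedValues{SKey: []SKV{{"serial_number", "1234"}}}, allocate)
	fmt.Println(key, err)
}
```

The fast path makes the performance recommendation visible: when Key is already set, no allocation or lookup happens at all.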
Storage Data Model
Data is stored in a slightly different format than it is presented in KeyedValues. A KeyedValues object is transformed into a set of DataEntry objects from the dataentry package. DataEntry is a combination of:
- Row object (see dataentry/row.go)
- Array of ColumnValue objects (see dataentry/column_value.go)
It may look like Row is equivalent to Key from KeyedValues, but it is not. There is a transformation going on:
- Key from KeyedValues is split into two keys, a promoted key and a tail key. This split is defined by the TableIndex object from the uid package. As you may have figured out, this helps to query data quickly and efficiently: when the filter defines a set of keys, the store will try to pick the index whose promoted key set most closely matches the filter. We are indexing!
- Column value timestamps are transformed into RowSequence objects (see the dataentry/row.go file). Column values that have the same sequence are grouped. For each unique sequence, a new Row is created, containing the promoted key, tail key, and sequence number. It is then assigned the column values that have the same sequence number.
- Note that KeyedValues are created from IndexedKeyedValues when writing; each TableIndex will create a DataEntry object, and those indices are full replicas!
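One possible reading of "most closely matches the filter" is sketched below: among the indices whose promoted fields are all present in the filter, prefer the one that promotes the most fields. This scoring rule, the pickIndex function, and the simplified TableIndex type are assumptions for illustration; the actual selection logic in the store may differ.

```go
package main

import "fmt"

// TableIndex is a simplified stand-in: a name plus the key fields it promotes.
type TableIndex struct {
	Name     string
	Promoted []string
}

// pickIndex returns the index with the most promoted fields that are all
// present in the filter. Ties go to the earlier index in the slice.
func pickIndex(indices []TableIndex, filterFields []string) (TableIndex, bool) {
	inFilter := map[string]bool{}
	for _, f := range filterFields {
		inFilter[f] = true
	}
	best, found := TableIndex{}, false
	for _, idx := range indices {
		covered := len(idx.Promoted) > 0
		for _, p := range idx.Promoted {
			if !inFilter[p] {
				covered = false
				break
			}
		}
		if covered && (!found || len(idx.Promoted) > len(best.Promoted)) {
			best, found = idx, true
		}
	}
	return best, found
}

// exampleIndices returns three hypothetical indices for the sensor example.
func exampleIndices() []TableIndex {
	return []TableIndex{
		{Name: "by_serial", Promoted: []string{"serial_number"}},
		{Name: "by_product", Promoted: []string{"product"}},
		{Name: "by_both", Promoted: []string{"product", "serial_number"}},
	}
}

func main() {
	idx, ok := pickIndex(exampleIndices(), []string{"product"})
	fmt.Println(idx.Name, ok)
}
```

Because every index is a full replica, any of them can answer the query; the choice only affects how much data has to be scanned.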
Example: Imagine we have the following KeyedValue (single):
- Key: serial_number=1234, product=temp_sensor
- Values:
  - temperatures: 123.4, 2020-01-01T00:00:00Z
  - temperatures: 124.4, 2020-01-01T00:01:00Z
Then it will be transformed into the following DataEntry objects, provided that serial_number is used as an index:
- DataEntry 1:
  - Row:
    - Promoted Key: serial_number=1234
    - Tail key: product=temp_sensor
    - Sequence: fromTimestamp(2020-01-01T00:00:00Z)
  - Values:
    - temperatures: 123.4, 2020-01-01T00:00:00Z
- DataEntry 2:
  - Row:
    - Promoted Key: serial_number=1234
    - Tail key: product=temp_sensor
    - Sequence: fromTimestamp(2020-01-01T00:01:00Z)
  - Values:
    - temperatures: 124.4, 2020-01-01T00:01:00Z
When data is saved to the underlying DB, repeated fields from “Values” like timestamp may be dropped, as we already have them in the Row object. Understanding promoted and tail keys is important for writing good indices! In this example, we assumed a single promoted index, but if we had more, we would have more replicas.
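The transformation from this example can be reproduced with a small sketch. All types here are simplified stand-ins for the dataentry ones, keys are plain string maps, Data is kept as a float64 for readability, and fromTimestamp simply returns Unix seconds because the real RowSequence derivation is not described here.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

type ColumnValue struct {
	Family string
	Time   time.Time
	Data   float64 // kept as a float for readability; the store holds bytes
}

type Row struct {
	Promoted map[string]string
	Tail     map[string]string
	Sequence int64
}

type DataEntry struct {
	Row    Row
	Values []ColumnValue
}

// fromTimestamp is a placeholder: Unix seconds stand in for RowSequence.
func fromTimestamp(t time.Time) int64 { return t.Unix() }

// splitKey separates the fields named by the index (promoted) from the rest (tail).
func splitKey(key map[string]string, promotedFields []string) (promoted, tail map[string]string) {
	promoted, tail = map[string]string{}, map[string]string{}
	for k, v := range key {
		tail[k] = v
	}
	for _, f := range promotedFields {
		if v, ok := key[f]; ok {
			promoted[f] = v
			delete(tail, f)
		}
	}
	return promoted, tail
}

// toDataEntries groups values by sequence and emits one DataEntry per
// unique sequence, ordered by sequence.
func toDataEntries(key map[string]string, values []ColumnValue, promotedFields []string) []DataEntry {
	promoted, tail := splitKey(key, promotedFields)
	bySeq := map[int64][]ColumnValue{}
	for _, v := range values {
		seq := fromTimestamp(v.Time)
		bySeq[seq] = append(bySeq[seq], v)
	}
	seqs := make([]int64, 0, len(bySeq))
	for s := range bySeq {
		seqs = append(seqs, s)
	}
	sort.Slice(seqs, func(i, j int) bool { return seqs[i] < seqs[j] })
	entries := make([]DataEntry, 0, len(seqs))
	for _, s := range seqs {
		row := Row{Promoted: promoted, Tail: tail, Sequence: s}
		entries = append(entries, DataEntry{Row: row, Values: bySeq[s]})
	}
	return entries
}

func exampleKey() map[string]string {
	return map[string]string{"serial_number": "1234", "product": "temp_sensor"}
}

func exampleValues() []ColumnValue {
	t0 := time.Date(2020, 1, 1, 0, 0, 0, 0, time.UTC)
	return []ColumnValue{
		{Family: "temperatures", Time: t0, Data: 123.4},
		{Family: "temperatures", Time: t0.Add(time.Minute), Data: 124.4},
	}
}

func main() {
	for i, e := range toDataEntries(exampleKey(), exampleValues(), []string{"serial_number"}) {
		fmt.Println(i+1, e.Row.Promoted, e.Row.Tail, e.Row.Sequence, e.Values[0].Data)
	}
}
```

With a second index, for example one promoting product, the same function would be called once more with different promoted fields, producing the extra replica mentioned above.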