Capsule Bundle Structure
This document applies to the following versions:
Bundle: v3
Capsule: v2
Introduction
This document describes the detailed structure of encapsulated data: both individual capsules and how capsules are bundled together to make up the final output stream. The definitions below explain the symbols used within diagrams and should be referenced as needed.
Definitions
B - byte
b - boolean
i# - signed integer of length # bits
u# - unsigned integer of length # bits
S - string
Arr(#) of T - Array of length # of type T
AAD - Additional Authenticated Data
DEK - Data Encryption Key
DR - Data Recovery
CBOR Encoding
The Antimatter capsule format makes use of CBOR encoding for structured information storage within parts of the capsule, bundle, and file. Note up front that CBOR encoded structures use CBOR Array Encoding: all field values in the source structure are represented as an ordered array of their values (ordered by field order in the structure), and only that array of values is CBOR encoded. Parsing the CBOR structures detailed here with any other scheme can lead to errors during operation. Furthermore, diagrams in this document that refer to CBOR encoded structures present their fields in the order of the fields in the source structure.
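The array-encoding idea can be sketched as follows. This is a minimal illustration only: `ExampleHeader`, its field names, and its field types are hypothetical and are not taken from this specification. The sketch shows just the flattening step; the resulting list would then be handed to a CBOR encoder of your choice.

```python
from dataclasses import dataclass, fields


@dataclass
class ExampleHeader:
    # Hypothetical structure; names and types are illustrative only.
    magic: bytes
    version: int


def to_cbor_array(obj) -> list:
    """Return the field values in declaration order, without field names."""
    return [getattr(obj, f.name) for f in fields(obj)]


def from_cbor_array(cls, values: list):
    """Rebuild a structure from an ordered array of field values."""
    return cls(*values)


header = ExampleHeader(magic=b"EXMP", version=3)
flat = to_cbor_array(header)  # [b"EXMP", 3]
```

Because only positions identify the fields, a decoder has to know the field order of each structure in advance.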
Capsule File
In its most abstract form, a Capsule File can be represented simply as a File Header followed by a Capsule Bundle containing all encapsulated data within. This structure is represented below.
File Header
A File Header is CBOR encoded, and contains the File Magic bytes and which capsule bundling version was used to store the capsules within this file.
Capsule Bundle
The Capsule Bundle begins with a Bundle Header and is then followed by all Capsules contained within this bundle.
Bundle Header
A Bundle Header is CBOR encoded, and contains information about which domain created the bundle, when it was created, and whether this Capsule Bundle contains one or many Capsules.
If Is Bundle is false, then there is only a single capsule stored within this Capsule File. However, if Is Bundle is true, this indicates that the Capsule File contains multiple capsules. Furthermore, if Is Bundle is true, an additional metadata Capsule is added as the first Capsule stored in the Capsule Bundle. This metadata Capsule is used to store information about the format of the data stored in all Capsules within the Capsule Bundle.
The Domain ID stored here is the domain that created this Capsule Bundle. This Domain ID is stored by first having the leading "dm-" stripped from it, and the remaining characters packed. Please refer to ID Packing for a detailed description of this packing method.
Capsule
A Capsule is made up of a plaintext part and an encrypted part. The plaintext part of the Capsule is the Capsule Public Header. This is followed by Encrypted Capsule Data, and finally the End of Capsule Delimiter bytes 0xFF00.
Capsule Public Header
The Capsule Public Header is CBOR encoded and its purpose is to contain all information required to decrypt the Encrypted Capsule Data part of a Capsule.
The key within the Encrypted DEK here is used to decrypt the Encrypted Capsule Data. If Antimatter's services can be reached, this Encrypted DEK can be submitted with the Key ID to obtain the decrypted key for use in decrypting the Encrypted Capsule Data. If not, one will have to resort to using the DR Token, if it exists. If there is no DR Token for the given capsule (that is, DR was disabled for the domain when this capsule was created), the field is omitted entirely from the output CBOR.
The Domain ID stored here is the domain that owns the Capsule. As such, the Capsule ID can be found in this domain's capsule list. The Capsule and Domain IDs are stored by first having the leading "ca-" or "dm-" stripped from them, respectively, and the remaining characters are packed. Please refer to ID Packing for a detailed description of this packing method.
Encrypted Capsule Data
Encrypted Capsule Data is encrypted in chunks to allow for streaming. Each chunk contains a sealed AES-256-GCM with AAD ciphertext that can be decrypted standalone. The chunks are laid out in the following form:
The End of Encrypted Chunks Delimiter is actually just a chunk of 0 length, and so has no associated chunk data. However, encountering this indicates that there are no more Encrypted Chunks to process.
Encrypted Chunk
An Encrypted Chunk has the structure:
The Chunk Length is calculated for the length of the Encrypted Bytes and the AES-256-GCM-TAG together and does not include the Nonce's or the Chunk Length's byte length. Please note that the Chunk Length is stored in Little Endian byte order.
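The Chunk Length rule above can be sketched as below. The width of the Chunk Length field is an assumption here (the `width` parameter defaults to a hypothetical 4 bytes), as is the 16-byte tag used in the example; consult the Encrypted Chunk layout for the actual sizes.

```python
def chunk_length(encrypted_bytes: bytes, tag: bytes) -> int:
    """Chunk Length covers Encrypted Bytes plus the GCM tag, nothing else."""
    return len(encrypted_bytes) + len(tag)


def encode_chunk_length(length: int, width: int = 4) -> bytes:
    # width is an assumption, not taken from the specification.
    return length.to_bytes(width, "little")


def decode_chunk_length(raw: bytes) -> int:
    """Interpret the stored Chunk Length bytes as a little-endian integer."""
    return int.from_bytes(raw, "little")


# 100 bytes of ciphertext plus an assumed 16-byte tag; the Nonce is not counted.
length = chunk_length(b"\x00" * 100, b"\x00" * 16)
encoded = encode_chunk_length(length)
```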
In order to use AES-256-GCM to decrypt this data you will need the following:
- The Nonce (provided in Encrypted Chunk).
- The AES-256-GCM-TAG (provided in Encrypted Chunk).
- The Encrypted Bytes (provided in Encrypted Chunk).
- The decrypted DEK.
- The Encrypted Chunk AAD for the current Encrypted Chunk.
Encrypted Chunk AAD
The AAD required for decrypting the Encrypted Chunk's Encrypted Bytes is the bytes of the following CBOR encoded structure based on which chunk is currently being processed.
Chunk Number is simply the zero-indexed position of the Encrypted Chunk, i.e., if you are processing your third chunk then the Chunk Number would be 2. Final Chunk is a boolean value stating whether the current chunk being processed is the last one for the current Capsule.
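As a rough sketch, the AAD bytes for a chunk could be built as shown below. Two things are assumed rather than taken from the specification: that the structure's field order is Chunk Number followed by Final Chunk (the order used in the text above), and that Chunk Number is encoded as a CBOR unsigned integer. The tiny encoder is hand-written to keep the example free of dependencies and covers only the two types needed here.

```python
def cbor_uint(n: int) -> bytes:
    """Encode a non-negative integer as a CBOR unsigned integer (major type 0)."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n < 24:
        return bytes([n])
    for additional_info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < 1 << (8 * size):
            return bytes([additional_info]) + n.to_bytes(size, "big")
    raise ValueError("integer too large for a CBOR unsigned integer")


def chunk_aad(chunk_number: int, final_chunk: bool) -> bytes:
    # 0x82 is a CBOR array of two items; 0xF5 and 0xF4 are true and false.
    # The field order is an assumption (see the lead-in).
    flag = b"\xf5" if final_chunk else b"\xf4"
    return b"\x82" + cbor_uint(chunk_number) + flag


# Third chunk (index 2), not the last chunk of the capsule.
aad = chunk_aad(2, False)
```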
Decrypted Capsule Data
Once the Encrypted Capsule Data has been decrypted, the resulting Decrypted Capsule Data follows this structure.
Capsule Secret Header
The Capsule Secret Header is CBOR encoded and serves to contain information about the structure of the Capsule's data.
Specifically, as a Capsule stores data in a grid, the Capsule Secret Header contains the details of all columns stored within the Capsule. Additionally, Extras is used to store additional information about the Capsule's data. Typically this additional information records the data's source format, allowing the capsule reader to present the data in its original form if read by a client that supports the original format.
Column Definition
A Column Definition makes up part of the internal structure of the Capsule Secret Header's CBOR encoded bytes. As such, implementation of a CBOR decoder for the Capsule Secret Header should include the following structure.
The order in which Column Definitions are presented is the same order in which the values of each Capsule row are stored. The Name here is simply the name given to the column in the original data. A column can have a list of classification information associated with it. In a Capsule, this classification information is referred to as a Capsule Tag. Ultimately, these Capsule Tags are used by Antimatter to apply policy to data stored in a Capsule.
As a read context can request that a Capsule reruns classification of the data within it, we need to know which columns should not be classified in the original data. This flag is stored in Skip Classification.
Capsule Tag
A Capsule Tag makes up part of the internal structure of the Column Definition's CBOR encoded bytes. As such, implementation of a CBOR decoder for the Column Definition should include the following structure.
The Name presented here is a human-readable name given to the data that represents what that data was classified as. To improve classification accuracy, one can additionally add a Value to a Capsule Tag that gives more meaning to it. The type of the Capsule Tag's Value is stored in Tag Type, and this should be used to properly represent the Value. The type stored in Tag Type is represented by a single integer value, and the meaning of these values is presented here:
0 - Unary
1 - String
2 - Number
3 - Boolean
4 - Date
Source in this structure is the name of the classifier (also called a hook) that generated this Capsule Tag, and Hook Version is the version of the Source classifier that was used to generate this Capsule Tag.
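The Tag Type values listed above map naturally onto an enumeration. The sketch below is one possible in-memory representation; the names `TagType` and `tag_type_name` are illustrative and not part of the format.

```python
from enum import IntEnum


class TagType(IntEnum):
    # Integer values as listed for Tag Type in this document.
    UNARY = 0
    STRING = 1
    NUMBER = 2
    BOOLEAN = 3
    DATE = 4


def tag_type_name(raw: int) -> str:
    """Map the stored integer to a readable name; raises ValueError if unknown."""
    return TagType(raw).name.capitalize()


print(tag_type_name(4))
```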
Capsule Footer
A Capsule Footer is CBOR encoded and contains a complete list of Classifiers (hooks) that were run in classification of the data stored in the given Capsule.
Hook Info
A Hook Info makes up part of the internal structure of the Capsule Footer's CBOR encoded bytes. As such, implementation of a CBOR decoder for the Capsule Footer should include the following structure.
Name in this structure is the name of the classifier that was used during classification of data in this Capsule. Version is the version of the named classifier that was used.
Chunked Capsule Data
Capsule data is chunked in order to allow for streaming. As capsules store their data in a grid structure, the Chunked Capsule Data stores additional information to rebuild this grid. Additionally, Span Tags for a cell are stored in the Chunked Capsule Data as well. A capsule's chunks are laid out, in order, in the following form.
Data Chunk
All data stored in a Capsule is chunked into Data Chunks. The form of a single Data Chunk is given here.
These Data Chunks contain all information required in order to rebuild the grid structure of the original data the Capsule was created from. This is done through use of Flags. A Flag conveys when a grid's cell ends, when a row ends, and when the capsule's data ends; that is the end of the grid. A Flag is read by checking which bits are set within the Flag. The meaning for each set bit is given below.
XXXX XCRD
└─┬──┘││└── End of cell data
  │   │└─── End of row
  │   └──── End of capsule
  └──────── Reserved
A Data Chunk's Size counts only the number of bytes in Chunk Bytes, and does not include the bytes of the Size or Flag fields.
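Reading the Flag bits described above can be sketched as follows, assuming the Flag is a single byte as the bit layout suggests. The names `ChunkFlag` and `decode_flag` are illustrative only.

```python
from enum import IntFlag


class ChunkFlag(IntFlag):
    END_OF_CELL = 0b001     # D bit
    END_OF_ROW = 0b010      # R bit
    END_OF_CAPSULE = 0b100  # C bit


def decode_flag(flag_byte: int) -> ChunkFlag:
    """Return the set D/R/C bits, ignoring the five reserved bits."""
    return ChunkFlag(flag_byte & 0b111)


flags = decode_flag(0b0000_0011)
print(ChunkFlag.END_OF_CELL in flags, ChunkFlag.END_OF_ROW in flags)
```

Masking off the reserved bits before interpreting the Flag keeps a reader tolerant of future use of those bits; whether that tolerance is desirable is a design choice for the implementer.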
Reading a Cell's Data
To read the data for a single cell in the grid structure stored in a Capsule, you must concatenate all Chunk Bytes in each read Data Chunk until you reach a Data Chunk where the Data Chunk's Flag's D bit is set. These concatenated bytes will be named Cell Data.
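The concatenation step can be sketched as below. The sketch assumes the Data Chunks have already been parsed into `(flag, chunk_bytes)` pairs; that pair representation and the function name are illustrative, not part of the format.

```python
from typing import Iterator, Tuple

END_OF_CELL = 0b001  # the D bit of a Data Chunk's Flag


def read_cell_data(chunks: Iterator[Tuple[int, bytes]]) -> bytes:
    """Consume (flag, chunk_bytes) pairs up to and including the chunk with D set."""
    cell = bytearray()
    for flag, chunk_bytes in chunks:
        cell += chunk_bytes
        if flag & END_OF_CELL:
            return bytes(cell)
    raise ValueError("stream ended before the end-of-cell flag was seen")


stream = iter([(0, b"he"), (0, b"ll"), (END_OF_CELL, b"o"), (0, b"next")])
cell_data = read_cell_data(stream)
```

Passing an iterator rather than a list means the next call continues from the chunk after the one that closed the previous cell.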
Cell Data
A Cell Data takes the structure given below.
As shown, a Cell Data is further broken down into Cell Parts where each part contains both the classification information and the data bytes for that cell's part. That is, Cell Parts are assembled to form a single cell's data and classification information.
Please keep in mind that the Cell Parts are stored in order and so are order sensitive. All Cell Parts need to be read and assembled in order to produce the cell's raw data bytes and classification information.
Cell Part
A single Cell Part is shown here.
The Cell Part Info contains classification information to allow application of policy to this Cell Part's Part Raw Data. Part Raw Data is the raw data bytes for the current Cell Part. It should also be noted here that the number of bytes in Part Raw Data is stored in Cell Part Info.
Cell Part Info
A Cell Part Info is CBOR encoded and has the following structure.
As alluded to prior, the Length here stores the number of bytes in the Cell Part's Part Raw Data. Classifying information about specific parts of the Cell Part's Part Raw Data is stored in the List of Span Tags.
Span Tag
A Span Tag is an extended Capsule Tag with indexes that state which part of the Cell Part's Part Raw Data bytes the Capsule Tag applies to. That is, it records what a part of the cell's data has been classified as. A Span Tag makes up part of the internal structure of the Cell Part Info's CBOR encoded bytes. As such, implementation of a CBOR decoder for the Cell Part Info should include the following structure.
Indexes for a Span Tag are found in the Start and End fields. As the names suggest, these fields indicate where the Capsule Tag applies within the Cell Part's Part Raw Data. Keep in mind that Start and End are zero indexed, and that the zero for all indexes in every Cell Part's Span Tags is the zero byte of the entire cell's assembled raw byte data, not the zero byte of the individual Cell Part. To help clarify this, a worked example is given below.
Worked example
Consider original cell data of 20 bytes in length with a single Span Tag that has a Start value of 15 and an End value of 18. If the cell's data stored in the capsule is broken into 2 Cell Parts, let us say 10 bytes in length each, then the first Cell Part's Cell Part Info would record a Length of 10 and list no Span Tags. The second Cell Part would also record a Length of 10 in its Cell Part Info, but it would additionally record the single Span Tag with Start value 15 and End value 18.
Assembling a Cell's Raw Data Bytes
In order to assemble just the raw data bytes of a cell, you only need to perform an ordered concatenation of the Part Raw Data from each Cell Part. If you wish to also assemble classification information for the cell, you can concatenate all Lists of Span Tags and use them as is, since they are already indexed from the zero byte of the assembled raw data bytes.
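The assembly described above can be sketched using the numbers from the worked example. The `SpanTag` and `CellPart` classes are hypothetical in-memory types that carry only the fields needed for illustration; a real Span Tag holds more information.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SpanTag:
    # Illustrative subset of a Span Tag's fields.
    start: int
    end: int
    name: str


@dataclass
class CellPart:
    raw: bytes  # Part Raw Data
    span_tags: List[SpanTag] = field(default_factory=list)


def assemble_cell(parts: List[CellPart]) -> Tuple[bytes, List[SpanTag]]:
    """Concatenate Part Raw Data and Span Tags in Cell Part order."""
    data = b"".join(part.raw for part in parts)
    tags = [tag for part in parts for tag in part.span_tags]
    return data, tags


parts = [
    CellPart(raw=b"0123456789"),
    CellPart(raw=b"abcdefghij", span_tags=[SpanTag(15, 18, "example")]),
]
data, tags = assemble_cell(parts)
```

No index adjustment is needed during assembly: the Span Tag in the second part already uses Start 15, which is relative to the full 20-byte cell.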
A Note On Data Chunk Flags
You can expect multiple flags to be set in a Data Chunk. For example, if the D and R Flag bits are set together, this indicates the end of a cell's data and the end of a row. That is, this is the last cell in the row. To continue this example, if the D, R, and C Flag bits are set, this indicates the last cell in the capsule.
It is possible for a Data Chunk to be of 0 size and only have Flag bits set. In this case no Chunk Bytes need to be concatenated to the current cell being assembled and you only need to follow the meaning of the set flags, if any.
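Putting the flag rules together, rebuilding the grid could look like the sketch below. It assumes the Data Chunks have already been parsed into `(flag, chunk_bytes)` pairs, and each grid entry is the Cell Data for that cell (still to be split into Cell Parts). The function name and the pair representation are illustrative only.

```python
from typing import Iterable, List, Tuple

D, R, C = 0b001, 0b010, 0b100  # end of cell, end of row, end of capsule


def rebuild_grid(chunks: Iterable[Tuple[int, bytes]]) -> List[List[bytes]]:
    """Group Cell Data into rows by following the D, R, and C flag bits."""
    grid: List[List[bytes]] = []
    row: List[bytes] = []
    cell = bytearray()
    for flag, chunk_bytes in chunks:
        cell += chunk_bytes  # a zero-size chunk simply adds nothing
        if flag & D:
            row.append(bytes(cell))
            cell = bytearray()
        if flag & R:
            grid.append(row)
            row = []
        if flag & C:
            break
    if row:  # defensive: keep a row that was not closed by an R bit
        grid.append(row)
    return grid


chunks = [(0, b"a"), (D, b"b"), (D | R, b"c"), (0, b"d"), (D | R | C, b"")]
grid = rebuild_grid(chunks)
```

The final chunk in the usage example has a Size of 0 and only carries flags, matching the case described above.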
ID Packing
IDs are limited to using only the characters in the Base58 character set. Please keep in mind that the order of these characters presented below is important in the packing process:
123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz
For every character that is packed in the ID, we first find the index (zero indexed) of that character in the Base58 character set presented above. The binary of this index is always small enough to fit in 6 bits, and so we store the 6 bit binary representation of this index. We repeat this process for every character in the ID.
Finally, we contiguously pack the 6 bit representations of each index together. If the number of bits packed does not result in enough bits to form the final byte, the final byte is padded using 0 bits. A worked example is presented below.
Worked example
For the ID iJiah, we would first look up the index of each character and represent these values as 6 bits. These are shown below:
i -> 41 -> 0b101001
J -> 17 -> 0b010001
i -> 41 -> 0b101001
a -> 33 -> 0b100001
h -> 40 -> 0b101000
Next we pack the bits together. This example places a separator after every 8th bit so that the start and end of each byte are easy to read:
10100101 00011010 01100001 101000
As expected, there are not enough bits to fill up the final byte, so we pad it with 0 bits to complete the final byte. This yields the result:
10100101 00011010 01100001 10100000
Finally, store these bytes as the packed ID. The hexadecimal form of these bits is given below as an alternate representation of the packed ID:
A5 1A 61 A0
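The packing procedure can be sketched in a few lines. The function name `pack_id` and its optional `prefix` parameter are illustrative; the prefix handling follows the "dm-" and "ca-" stripping described earlier in this document.

```python
BASE58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def pack_id(identifier: str, prefix: str = "") -> bytes:
    """Pack an ID at 6 bits per character, zero-padding the final byte."""
    if prefix and identifier.startswith(prefix):
        identifier = identifier[len(prefix):]
    # str.index raises ValueError for characters outside the Base58 set.
    bits = "".join(format(BASE58.index(ch), "06b") for ch in identifier)
    bits += "0" * (-len(bits) % 8)  # pad up to a whole number of bytes
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


packed = pack_id("dm-iJiah", prefix="dm-")
print(packed.hex(" ").upper())
```

A bit string is used here for readability; an implementation concerned with speed would more likely accumulate the bits in an integer.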
Escaping
The Capsule Public Header and Encrypted Capsule Data escape the byte 0xFF as 0xFFFF due to the End of Capsule Delimiter.
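A minimal sketch of this escaping follows. It assumes the escaping is applied to the serialized bytes of the Capsule Public Header and Encrypted Capsule Data as one stream, so that an unescaped 0xFF followed by 0x00 can only be the End of Capsule Delimiter. All function names are illustrative.

```python
from typing import Tuple


def escape(data: bytes) -> bytes:
    """Double every 0xFF byte."""
    return data.replace(b"\xff", b"\xff\xff")


def unescape(data: bytes) -> bytes:
    """Collapse every escaped 0xFFFF pair back into a single 0xFF."""
    return data.replace(b"\xff\xff", b"\xff")


def split_at_delimiter(stream: bytes) -> Tuple[bytes, bytes]:
    """Return (unescaped capsule bytes, bytes following the 0xFF00 delimiter)."""
    i = 0
    while i < len(stream) - 1:
        if stream[i] == 0xFF:
            if stream[i + 1] == 0x00:
                return unescape(stream[:i]), stream[i + 2:]
            i += 2  # escaped 0xFF pair: skip both bytes
        else:
            i += 1
    raise ValueError("no End of Capsule Delimiter found")


payload = b"\x01\xff\x00\xff"
stream = escape(payload) + b"\xff\x00" + b"rest"
capsule, remainder = split_at_delimiter(stream)
```

The usage example deliberately ends the payload with 0xFF and includes the bytes 0xFF 0x00 inside it, the two cases where a naive search for the delimiter would go wrong.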