What is Antimatter?
Antimatter is a powerful tool for managing data consistently and securely, no matter where it’s stored or how it’s accessed. This can be useful if you need a particular type of data control that isn’t feasible given the storage or clients you are using, or if you have multiple different clients and storage systems and the complexity of building and maintaining multiple different mechanisms is too high. We think of this technique as having a data control plane (a system that decides where and in what form data can be accessed) that is separate from your data plane (where data is stored and the systems that move those bits around).
Concretely, Antimatter consists of three components that together comprise a decentralized data control plane:
- A set of services that provide a data control plane, managing encryption keys, policy, and other settings for your data domains. By default, domains will use the Antimatter SaaS control plane, but you can also transparently migrate your domains to a data control plane in your own infrastructure. See Architecture for more information.
- An encrypted object format that allows you to store both data and its metadata within a single, secure object. This object is called a capsule and it is encrypted and controlled using key material and policy configured in a domain. See Capsules for more information about this object format.
- A set of libraries for common languages (see Supported Languages) and plugins for common tools (see Supported Tools) that let you work with capsules and domains.
Put together, Antimatter gives you data control in a novel, secure, and system-agnostic way.
What is data control?
We use the term "data control" to mean any of a family of features that ensure that something always happens when data is written or read. That "something" could be access control, analysis, transformation of the data, observability hooks, you name it. Historically many of these have been thought of as distinct concerns, but Antimatter believes that a unified approach to managing these aspects is a powerful and inevitable shift in how data is handled.
The main types of data control that Antimatter provides are:
- Data classification: often, the first step in data control is understanding what data is. This involves capturing both the existing metadata (tags) associated with your data in its current context, and using Antimatter's sophisticated AI classifiers to detect and tag elements like personally identifiable information (PII) embedded within the data. See Write-Context Policy for more information about the classification and tagging features of Antimatter.
- Access control: the foundation of data control is ensuring that data is only accessible by the right parties. Antimatter lets you control access to data regardless of where that data is stored or how it is accessed. See Policy for more information about Antimatter's access control primitives.
- Data transformation: even if data is permitted to be accessed, you often need to transform the data before it can be used. This transformation includes actions like restricting the set of records that are read (e.g. restricting the customer data being used for model training to just those customers that have opted in), but it also includes redacting fields or phrases within larger pieces of otherwise-permitted data. See Read-Context Policy for more information about redaction and transformation in Antimatter.
- Encryption: an important part of data control is knowing that your policies are not being circumvented. Antimatter does this by placing all the data in encrypted capsules. This ensures that anyone who accesses the stored data directly, bypassing policy, sees only ciphertext. Antimatter handles key management for you, but also provides an easy way of bringing your own key (BYOK) or holding your own key (HYOK) in an external KMS. See Encryption for more information about the encryption capabilities of Antimatter.
- Audit logging: when working with sensitive data, it is often necessary to maintain a record of who has accessed the data and for what purpose. Antimatter provides both a data plane audit log (who has accessed what data) and a control plane audit log (who has changed settings or policy). See Audit logs for more information about these features.
- Data inventory: one of the most challenging tasks in the data ecosystem is simply knowing what you have and where it is. Antimatter provides a manifest of all of your encapsulated data, along with the tags that describe what that data is. See Capsule Manifest for more information.
More broadly, Antimatter provides a generalized mechanism to add hooks that process data when it's written, store additional metadata with that data in an encrypted capsule, and run operations on data before it is returned from a read. The types of data control listed above are some of the forms this mechanism affords, but the mechanism itself is powerful and extensible. If you have data control concerns not addressed above, please Contact Us. We'd love to learn more about your use case and add capabilities to Antimatter.
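To make the write-hook and read-hook idea concrete, here is a minimal Python sketch. It is not the Antimatter API: `Capsule`, `Span`, `write`, `read`, `classify_emails`, and `redact_unless_allowed` are hypothetical names invented for illustration, the regex stands in for a real classifier, and encryption is omitted entirely.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Span:
    """A tagged region of the data (hypothetical structure)."""
    start: int
    end: int
    tag: str


@dataclass
class Capsule:
    """Stand-in for a capsule: the data plus metadata stored with it.
    A real capsule is encrypted; that step is left out of this sketch."""
    data: str
    spans: list[Span] = field(default_factory=list)


# Deliberately simple pattern, standing in for a real classifier.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def classify_emails(data: str) -> list[Span]:
    """Write hook: tag anything that looks like an email address."""
    return [Span(m.start(), m.end(), "email") for m in EMAIL_RE.finditer(data)]


def write(data: str, write_hooks: list[Callable[[str], list[Span]]]) -> Capsule:
    """Run every write hook and keep the resulting tags with the data."""
    capsule = Capsule(data)
    for hook in write_hooks:
        capsule.spans.extend(hook(data))
    return capsule


def redact_unless_allowed(capsule: Capsule, allowed_tags: set[str]) -> str:
    """Read hook: replace tagged spans the reader may not see."""
    out = capsule.data
    # Work from the end of the string so earlier offsets stay valid.
    for span in sorted(capsule.spans, key=lambda s: s.start, reverse=True):
        if span.tag not in allowed_tags:
            out = out[:span.start] + "[REDACTED]" + out[span.end:]
    return out


def read(capsule: Capsule, allowed_tags: set[str],
         read_hook: Callable[[Capsule, set[str]], str] = redact_unless_allowed) -> str:
    """Apply the read hook before any data is returned to the caller."""
    return read_hook(capsule, allowed_tags)


capsule = write("Contact jane@example.com about the refund.", [classify_emails])
print(read(capsule, allowed_tags=set()))      # Contact [REDACTED] about the refund.
print(read(capsule, allowed_tags={"email"}))  # Contact jane@example.com about the refund.
```

The point of the sketch is the shape of the flow: classification happens once at write time, the tags travel with the data, and what a reader sees is decided at read time rather than baked into the stored copy.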
Policy
All data control relies upon some configuration that decides how data is handled when it is written (e.g. classification) and when it is read (e.g. redaction). In some workflows today (e.g. data redaction as an ETL stage) this policy is implicit, embedded in the code. Often tasks like audit logging or encryption are also buried inside application code. Access control is usually tightly coupled with the storage system: you often have policy scattered across cloud IAM, Google Drive, Active Directory, and numerous SaaS applications, all with their own way of expressing that policy.
Antimatter allows you to move some or all of this configuration out of the systems storing and accessing the data, and place it into a data control plane, letting you view and configure all of this policy in one place, regardless of what the application/client or the storage looks like.
Right now, Antimatter handles the storage of all of this configuration, and gives you powerful yet simple ways of expressing the policy that is at the core of data control, in a universal way that works across multiple systems. See the Policy section for more information. This is immensely useful, but our vision is actually to do something even bolder: to offer seamlessly federated data control policy. Often, when working with sensitive data, this data has come from someplace where there was already policy protecting it. In enterprises this policy may be the result of years of investment into data governance. We believe that not only should you be able to control your data everywhere, but that you should, wherever possible, be able to do so by leveraging your existing policies. It's less complex, less prone to error, and faster to adopt.
To that end, we are developing integrations that let you link policy in Antimatter to external sources like Active Directory, Google Workspace, Microsoft SharePoint, Confluence, etc. Thus, rather than having to duplicate your policy in Antimatter, you can instead import that policy from an existing source and have it linked in real time. Not only does this solve several perennial security challenges (such as ensuring all permissions are changed when employees change teams), but it lets you gain significantly more value from your existing data governance investments. Contact Us if you'd like early access to our federated data control plane features.
Example use cases
We believe that Antimatter is a novel and powerful solution to dozens of challenges across engineering, data science, artificial intelligence and security. We're still learning where best to focus our efforts, so please Contact Us if you have cool problems or interesting ideas. Right now some of the top applications are:
Retrieval Augmented Generation (RAG) AI permissions management
When building RAG AI applications, such as chatbots, you typically need to copy data from primary systems such as Confluence, Google Drive and Slack (to name a few) into a location that is more amenable to rapid retrieval. Often this is a vector database, but we've also seen hybrid designs that use object stores like S3. A big challenge when building such a system is ensuring that the permissions originally carried by the data when in the primary system are still enforced when accessed through the RAG chatbot. It would be a severe data breach if a user talking to the chatbot could ask questions about HR documents or see other information that they would not otherwise have had access to in Google Drive or whatever the primary source was.
Engineers facing this problem usually have three options:
- Capture and store the original permissions alongside the data, reimplementing the enforcement inside the RAG application.
- Restrict the data you use to only "public" data that has no nuanced access control.
- Give up on having any secondary stores (like vector databases) and restrict yourself to just data available through an API that enforces permissions.
None of these options are good. The first takes a huge amount of effort and slows down the pace at which AI features can be developed and shipped to production. The second drastically reduces the value of the application because users cannot get answers tailored to their roles. The third usually has a massive impact on the latency and quality of answers the RAG application can generate.
Antimatter gives you a much better option: you can capture the original permissions of the data when creating the Antimatter capsule (using our integrations), then wherever that capsule is stored (e.g. in a vector DB) those permissions will always apply when the data is read.
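As a rough illustration of permissions travelling with the data, the Python sketch below captures a source system's access list when a chunk is ingested and applies it whenever the chunk is read. Everything here is hypothetical: `ProtectedChunk`, `capture`, and `readable_chunks` are made-up names, the group names are invented, and in Antimatter the check would be enforced when the capsule is opened rather than by application code like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectedChunk:
    """A retrieved chunk plus the permissions it carried in its source system."""
    text: str
    source: str
    allowed_groups: frozenset[str]


def capture(text: str, source: str, source_acl: set[str]) -> ProtectedChunk:
    """Ingestion step: record the source system's access list with the data."""
    return ProtectedChunk(text, source, frozenset(source_acl))


def readable_chunks(chunks: list[ProtectedChunk], user_groups: set[str]) -> list[str]:
    """Read step: return only chunks whose captured permissions admit this user."""
    return [c.text for c in chunks if c.allowed_groups & user_groups]


# The secondary store (for example a vector DB) holds the protected chunks.
store = [
    capture("Holiday schedule for the office", "confluence", {"all-staff"}),
    capture("Salary bands by level", "google-drive", {"hr"}),
]

print(readable_chunks(store, {"all-staff"}))        # ['Holiday schedule for the office']
print(readable_chunks(store, {"all-staff", "hr"}))  # both chunks
```

Because the access list is attached at capture time, the RAG application does not need its own copy of the permission model; it only needs to present the reader's identity when data is read.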
Dataset preparation
Today, we believe that data transformation for the sake of cleaning and augmentation has become conflated with data transformation for the purposes of security. A data engineer will often be tasked with materializing a "clean" dataset for use in analytics, model training or any number of use cases. As part of this they will often need to find and remove personally identifiable information (or other sensitive data). The challenge here is that while traditional cleaning (removing bad data) is universal, the transformation of data for security reasons is highly tied to who is going to be using the data and for what reason. We've found that it's common for companies to materialize multiple "data mart" copies of a dataset with different subsets of the data removed or redacted, because different use cases or teams are permitted to see different portions of the data.
Antimatter solves many of the problems that arise with today's approach to the problem:
- Identifying sensitive data to redact is challenging, especially if it's embedded in unstructured data (e.g. it came from a comment field that customers can type anything into). Antimatter has built-in classification.
- Storing metadata that identifies sensitive information within a mixed (semi-structured) dataset doesn't have a good solution today. Antimatter's capsule object format stores this with the data.
- The decisions of what data needs to be redacted, and when, are actually about security policy rather than data science, and often need to be made by completely different stakeholders. Antimatter lets you configure these policies separately.
- The complexity of data marts increases when you need to respect customer opt-in. It's easy to accidentally use customer data for a purpose that the customer has recently opted out of because your dataset was materialized before they changed their preferences. Antimatter gives you tools for customer opt-in tracking with real-time effect.
- The number of copies grows when you start respecting data residency. You often have R × U copies, where R is the number of regions and U is the number of use cases or teams interacting with that data. Data residency policies are simple to express in Antimatter without copying data.
- If the data is going to be used in a customer-facing feature, you often need to respect the partitioning of data that the customer controls (e.g. ensuring a customer does not see a report that covers data they are not permitted to see). This requires expression and enforcement of policy that exists in the original operational systems, but in a data science workflow. Antimatter lets you enforce nuanced and dynamic policies without copying the data or writing any enforcement mechanisms.
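The copy-count and opt-in points above can be illustrated with a small Python sketch. The regions, use cases, records, and the `read_view` function are all invented for illustration and are not Antimatter's policy language. The sketch contrasts materializing one copy per (region, use case) pair with keeping a single dataset and deciding what is visible at read time.

```python
from itertools import product

REGIONS = ["us", "eu", "apac"]
USE_CASES = ["analytics", "training", "support", "billing"]

# Materialized approach: one copy of the dataset per (region, use case) pair.
materialized_copies = list(product(REGIONS, USE_CASES))
print(len(materialized_copies))  # 12, i.e. 3 regions x 4 use cases

# Read-time approach: one dataset, with residency and opt-in kept as metadata.
RECORDS = [
    {"customer": "a", "region": "eu", "opted_in": {"analytics", "training"}},
    {"customer": "b", "region": "eu", "opted_in": {"analytics"}},
    {"customer": "c", "region": "us", "opted_in": {"training"}},
]


def read_view(records, region, use_case):
    """Return the customers visible to a reader in `region` for `use_case`."""
    return [
        r["customer"]
        for r in records
        if r["region"] == region and use_case in r["opted_in"]
    ]


print(read_view(RECORDS, "eu", "training"))  # ['a']

# Customer "a" opts out of training; the next read reflects it immediately,
# with no stale materialized copy to regenerate.
RECORDS[0]["opted_in"].discard("training")
print(read_view(RECORDS, "eu", "training"))  # []
```

With materialized copies, the opt-out would only take effect once every affected copy had been rebuilt; evaluating the rule at read time avoids that window.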
In summary, Antimatter lets you classify data once, store it with that classification metadata in an encrypted object format, and then read different subsets and transformations of that data from multiple different contexts. It also lets you control the policy around which data can be read when in a manner completely detached from the actual data processing pipeline. It integrates with existing identity providers and customer identity context to allow your policy to easily express topics such as data residency restrictions or customer-controlled data compartmentalization without custom code.