Data Availability Sampling
Problem Statement
In a blockchain, light clients store block headers, which contain things like a hash of the block contents. When a new block is published, light clients can verify that the header is valid (e.g. the hash is correct) but they do not have the full block contents.
The problem is that a malicious block producer could create a valid header but then not actually publish the full block contents referenced by that header. So nodes would see a valid header but the actual block data would not be available on the network.
This could allow attacks where the header commits to one thing (e.g. a certain transaction) but the block contains something else. Light clients would believe the transaction is included based on the header, but full nodes would see it's not actually in the published block.
DAS prevents this attack by allowing light clients to still verify the full block contents are available, without downloading everything. The light clients sample random parts of the encoded block data to check it matches the header.
If enough of these probes accept, DAS guarantees the full data is available somewhere on the network: an adversary who withheld the data could not have answered all the probes consistently.
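As a rough sanity check on why sampling works, here is a toy calculation (the parameters k = 128, n = 256 are illustrative, assuming a rate-1/2 code where any k of the n symbols suffice to decode): to make the data unrecoverable, an adversary must withhold at least n − k + 1 symbols, so each uniformly random probe accepts with probability at most (k − 1)/n, and the chance that all of s independent probes accept shrinks exponentially in s.

```python
# Toy parameters (assumed): a rate-1/2 erasure code with k = 128 data
# symbols expanded to n = 256 codeword symbols; any k symbols decode.
k, n = 128, 256

# For reconstruction to fail, at least n - k + 1 positions must be
# withheld, so at most k - 1 positions can answer a probe correctly.
p_single = (k - 1) / n              # accept probability of one probe

for s in (10, 20, 30):
    p_all = p_single ** s           # all s independent probes accept
    print(f"{s} probes: adversary escapes detection w.p. {p_all:.2e}")
```

With 30 probes, the adversary's chance of escaping detection is already below one in a billion, which is why a small constant number of samples per client suffices.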
So in summary, DAS lets light clients verify that a valid header really does have the corresponding block contents published and available, rather than being a header with no data behind it. This verifies data availability without downloading everything.
How It Works
The key idea behind how data availability sampling (DAS) works is the use of erasure codes and associated cryptographic commitments:
- The data is first encoded using an erasure code. This introduces redundancy and spreads the data across a longer codeword.
- The codeword symbols are then committed to using a commitment scheme tailored to that erasure code. This is what the paper calls an "erasure code commitment".
- Clients can now randomly sample and probe positions of the encoded data. Each probe returns a codeword symbol and a commitment opening at that position.
- Clients verify the opening is valid against the overall commitment. Enough accepting probes ensure the full data can be reconstructed from the codeword symbols.
- The consistency property of the commitment scheme ensures all valid openings are for the same data.
Some key mathematical ingredients:
- Erasure codes allow reconstruction of data from a subset of symbols. This enables sampling.
- Commitment schemes ensure openings at different positions cannot be manipulated independently. This prevents selective responding to probes.
- Properties of the commitment scheme tied to the code ensure consistency of openings and codeword symbols.
So in summary, the erasure code provides the sampling capability, while the cryptographic commitments ensure soundness and consistency of probes. The math ensures it all works out properly.
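The steps above can be sketched end-to-end. This is a deliberately simplified sketch: the "erasure code" is 2x repetition and the "commitment" is a plain vector of per-symbol hashes that the verifier is assumed to hold (a real scheme would combine e.g. Reed-Solomon with a succinct commitment such as a Merkle root or a polynomial commitment); all names and data are illustrative.

```python
import hashlib
import random

def h(b: bytes) -> bytes:
    """SHA-256, used as the hash for the toy commitment."""
    return hashlib.sha256(b).digest()

# Toy "erasure code": 2x repetition. A real deployment would use e.g.
# Reed-Solomon; repetition keeps the commit/probe/verify shape visible.
data = [b"tx-0", b"tx-1", b"tx-2", b"tx-3"]       # k = 4 symbols
codeword = data + data                            # n = 2k symbols

# Toy "commitment": one hash per codeword position, held by the verifier.
# (A Merkle tree or polynomial commitment would compress this.)
commitment = [h(sym) for sym in codeword]

def probe(i: int) -> bytes:
    """The prover answers a probe at position i with the symbol there."""
    return codeword[i]

def verify(i: int, symbol: bytes) -> bool:
    """The client checks the returned symbol against the commitment."""
    return h(symbol) == commitment[i]

# Sample a few random positions, as a light client would.
for i in random.sample(range(len(codeword)), 3):
    assert verify(i, probe(i))
```

The key structural point survives the simplification: the prover cannot tailor an answer per probe, because every position is pinned down by the commitment before any probe is made.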
What is an erasure code?
An erasure code is a technique used in computer science and information theory to encode data in a redundant way such that the original data can be reconstructed even if some of the encoded data is lost or corrupted.
The main components of an erasure code are:
- The original data or message (e.g. a file). This is split into k equal-sized blocks.
- An encoding function that takes the k message blocks and encodes them into n encoded blocks, where n > k.
- The n encoded blocks have built-in redundancy: even if some of the n blocks are lost or corrupted, the original k message blocks can still be recovered.
- A decoding function that can reconstruct the original k message blocks from any subset of the n encoded blocks of size at least t, where k ≤ t ≤ n. This parameter t is called the reception efficiency.
Some examples of erasure codes:
- Replication: each message block is simply copied identically to create redundancy. Not very space-efficient.
- Reed-Solomon codes: message blocks are interpreted as a polynomial, which is evaluated at n points to get the encoded blocks. Can recover from the loss of any n − k blocks, since any k surviving blocks suffice to decode.
- LDPC codes: message bits are parity-checked in different ways to build redundancy. Encoding and decoding use sparse binary matrices.
So in summary, erasure codes allow for efficient and redundant encoding such that data can be reliably reconstructed from a subset of the encoded data. This makes them useful for communications and storage when data may be lost or corrupted.
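Reed-Solomon encoding and recovery can be made concrete with a small sketch over a prime field (the parameters p = 257, k = 3, n = 6 are toy choices, and a systematic layout is assumed, with the message symbols sitting at positions 0..k−1):

```python
# Minimal systematic Reed-Solomon sketch over GF(p).
p = 257                     # small prime field (toy choice)
k, n = 3, 6                 # 3 message symbols -> 6 codeword symbols

def lagrange_eval(pts, x):
    """Evaluate the unique degree < len(pts) polynomial through pts at x."""
    total = 0
    for i, (xi, yi) in enumerate(pts):
        num, den = 1, 1
        for j, (xj, _) in enumerate(pts):
            if i != j:
                num = num * (x - xj) % p
                den = den * (xi - xj) % p
        total = (total + yi * num * pow(den, -1, p)) % p
    return total

def encode(msg):
    """Message symbols at x = 0..k-1, parity symbols at x = k..n-1."""
    pts = list(enumerate(msg))
    return [lagrange_eval(pts, x) for x in range(n)]

def decode(known):
    """Recover the message from ANY k known (position, symbol) pairs."""
    return [lagrange_eval(known, x) for x in range(k)]

msg = [7, 42, 199]
cw = encode(msg)
# Lose any n - k = 3 symbols, e.g. positions 0, 2, 4; the rest suffice:
survivors = [(1, cw[1]), (3, cw[3]), (5, cw[5])]
assert decode(survivors) == msg
```

Because a degree < k polynomial is uniquely determined by any k of its evaluations, any k surviving symbols recover the message, which is the reception-efficiency property described above.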
What is the Codeword?
In the context of erasure codes and data availability sampling, a "codeword symbol" refers to an element of the encoded data. Let me explain in a bit more detail:
- The original data is encoded using an erasure code into a longer codeword.
- This codeword consists of n symbols from some alphabet. Each symbol is a fixed-size element of the encoded data.
- For example, if using a Reed-Solomon code over 8-bit bytes, each symbol would be one byte. The codeword would consist of n bytes.
- In data availability sampling, the codeword is committed to using a cryptographic commitment scheme.
- Each position i in the codeword has a corresponding opening τi that is bound to the codeword symbol at that position.
- When a client makes a probe at position i, it receives back the codeword symbol at i along with the opening τi.
- The client can verify τi opens the overall commitment to the specific codeword symbol returned.
So in summary, a "codeword symbol" refers to the individual elements of the encoded data that make up the full codeword. The openings to the commitment scheme are tied to specific codeword symbols.
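A Merkle tree is one concrete way to instantiate the commitment and the per-position openings τi described above (an illustrative sketch, not necessarily the specific scheme from the paper; it assumes the number of symbols is a power of two). Here the commitment is the root hash, and τi is the path of sibling hashes from leaf i up to the root.

```python
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_levels(symbols):
    """Hash each symbol, then pairwise-hash up to a single root."""
    level = [h(s) for s in symbols]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels                    # levels[-1][0] is the root commitment

def open_at(levels, i):
    """The opening tau_i: sibling hashes from leaf i up to the root."""
    path = []
    for level in levels[:-1]:
        path.append(level[i ^ 1])    # sibling of the current node
        i //= 2
    return path

def verify(root, i, symbol, path):
    """Check that `symbol` is committed at position i under `root`."""
    node = h(symbol)
    for sib in path:
        node = h(node + sib) if i % 2 == 0 else h(sib + node)
        i //= 2
    return node == root

codeword = [bytes([b]) for b in range(8)]    # 8 one-byte symbols (toy)
levels = merkle_levels(codeword)
root = levels[-1][0]
tau_3 = open_at(levels, 3)
assert verify(root, 3, codeword[3], tau_3)
assert not verify(root, 3, b"\xff", tau_3)   # wrong symbol is rejected
```

The binding of τi to position i is what prevents a prover from answering a probe at one position with the symbol from another.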