Implementations MUST have publicly discoverable Batch Publications.
Implementations MUST be able to validate the Parquet file contents. Validity MUST be immutable.
Implementations MUST retain proof of existence of a Batch Publication.
Every Announcement in a Batch file MUST be provably from the publisher of the Batch, or provably covered by a chain of delegation to that publisher.
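The delegation requirement can be sketched as a chain walk from each Announcement's author up to the Batch publisher. The `Delegation` record and the helper below are illustrative only; the spec does not define this data structure or function.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Delegation:
    delegate: str   # hypothetical: party allowed to act for `delegator`
    delegator: str

def chains_to_publisher(author: str, publisher: str,
                        delegations: list[Delegation]) -> bool:
    """Return True if `author` is the publisher, or a chain of
    delegations links `author` to `publisher`."""
    # Map each delegator to the party acting on their behalf.
    delegated_to = {d.delegator: d.delegate for d in delegations}
    current, seen = author, set()
    while current not in seen:      # guard against delegation cycles
        if current == publisher:
            return True
        seen.add(current)
        if current not in delegated_to:
            return False
        current = delegated_to[current]
    return False
```

A real implementation would additionally verify the cryptographic proof behind each delegation link; this sketch only shows the chain structure.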
Batch files are stored and transferred in Apache Parquet format.
- Batch files MUST match the spec for a single Announcement Type.
- Batch files MUST have Bloom filters set in accordance with the Announcement Type Spec.
- Batch files MUST have NO MORE THAN 128*1024 rows.
- A Bloom filter MUST be a Split Block Bloom filter.
- The false-positive rate MUST be 0.001.
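The file-level requirements above can be checked mechanically once the Parquet footer metadata has been read. The sketch below assumes the caller has already extracted the row count and per-column Bloom filter settings into plain Python values; the parameter names are illustrative, not from the spec.

```python
MAX_ROWS = 128 * 1024       # per the Batch file requirement above
REQUIRED_FPP = 0.001        # required false-positive rate

def validate_batch_metadata(num_rows: int,
                            bloom_filters: dict,
                            required_columns: set) -> list:
    """Return a list of violations; an empty list means the checks pass.

    `bloom_filters` maps column name -> configured false-positive rate.
    `required_columns` lists the columns the Announcement Type Spec
    says must carry a Bloom filter.
    """
    problems = []
    if num_rows > MAX_ROWS:
        problems.append(f"too many rows: {num_rows} > {MAX_ROWS}")
    for col in sorted(required_columns):
        if col not in bloom_filters:
            problems.append(f"missing Bloom filter on column {col!r}")
        elif bloom_filters[col] != REQUIRED_FPP:
            problems.append(f"wrong false-positive rate on column {col!r}")
    return problems
```

Matching the schema against the Announcement Type spec would be an additional check of the same shape.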
The sizing calculation for a Split Block Bloom filter differs slightly from that of a classic Bloom filter: at a 0.001 false-positive rate, Parquet's implementation allocates roughly 14.6 bits per distinct value (versus about 14.4 for a classic filter), so a filter over 128*1024 distinct values comes to roughly 1.9 million bits (about 234 KiB). Note that the size is driven by the number of distinct values in the column, not the row count.
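The two sizing formulas can be compared directly. `split_block_bloom_bits` below mirrors the sizing function used by Parquet's split-block Bloom filter implementation (8 hash functions per 256-bit block); the classic formula is shown alongside for reference.

```python
import math

def classic_bloom_bits(n: int, p: float) -> float:
    """Optimal size in bits of a classic Bloom filter
    for n items at false-positive rate p."""
    return -n * math.log(p) / (math.log(2) ** 2)

def split_block_bloom_bits(ndv: int, fpp: float) -> float:
    """Sizing formula used for Parquet split-block Bloom filters,
    given the number of distinct values (ndv)."""
    return -8 * ndv / math.log(1 - fpp ** (1 / 8))

n = 128 * 1024
print(round(classic_bloom_bits(n, 0.001)))      # ~1.88 million bits
print(round(split_block_bloom_bits(n, 0.001)))  # ~1.91 million bits
```

Both work out to roughly 14-15 bits per distinct value at this false-positive rate.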
Bloom filters are added only to certain fields; see the Announcement Types for which fields.
Applications need to know if a given Batch file has any information they are interested in without downloading the file first.
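That workflow can be illustrated with a membership check against a filter fetched ahead of the file itself. For simplicity this sketch uses a classic Bloom filter with double hashing, not Parquet's split-block variant, and all names are hypothetical.

```python
import hashlib

def bloom_positions(item: bytes, m: int, k: int) -> list:
    """Derive k bit positions in an m-bit filter for `item`,
    using double hashing over a SHA-256 digest."""
    digest = hashlib.sha256(item).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big")
    return [(h1 + i * h2) % m for i in range(k)]

def add(bits: bytearray, item: bytes, m: int, k: int) -> None:
    for pos in bloom_positions(item, m, k):
        bits[pos // 8] |= 1 << (pos % 8)

def might_contain(bits: bytearray, item: bytes, m: int, k: int) -> bool:
    """False means the item is definitely absent, so the application
    can skip downloading the Batch file entirely."""
    return all(bits[pos // 8] & (1 << (pos % 8))
               for pos in bloom_positions(item, m, k))
```

A negative answer is definitive; a positive answer is only probabilistic (hence the fixed 0.001 false-positive rate above), so a download may still occasionally find nothing of interest.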
- Parquet is a column-oriented format. Since DSNP Batch Message data will have a very small column-to-row ratio compared to a typical web application database, it makes sense to prefer a column-oriented format.
- Parquet format has been field-tested under extreme network conditions. It has broad support in cloud storage solutions, with libraries in multiple languages.
- Bloom filters are already supported in the Parquet specification, which allows for fast and accurate searching (with caveats for proper configuration).
- Amazon S3 support: We anticipate that some Batch Announcers, and possibly Archivists, will store Batch files on Amazon S3. Amazon Athena also supports storage in Parquet, and its API supports SQL-like queries.
- Parquet also allows references to the same column across files, which could enable multi-file querying in the future.
- Parquet supports compression formats such as Brotli, which is already a browser standard and has demonstrated improvements in compression speed and file size over older formats.
- Parquet files can be transferred directly to clients, which can parse the files in the app or browser. No conversion to some serialization format is necessary. This eliminates an entire class of bugs and makes both fetching and querying faster.
- Parquet uses schemas, which additionally reduces file size.
- Cassandra, RocksDB, CouchDB, MongoDB, and HBase were rejected because DSNP data needs neither a database for storage nor the overhead of one. Each was designed for use cases that differ, somewhat to drastically, from those of the DSNP network.
- JSON and BSON, while sometimes used for storage, are intended for serialization. They are schemaless, which results in redundant information and therefore larger files than schema-based formats, and they do not support Bloom filters; searching would require separate indexing or downloading new batches in their entirety. SQLite, by contrast, is an embedded relational database and does support more advanced queries, but it likewise lacks Bloom filter support and a column-oriented layout.
Batch validity is immutable and is usually based in part on validating the delegation of each author listed inside the batch to the publisher. Due to the nature of distributed systems, a race condition can occur in which a user's delegation revocation arrives before a Batch that contains a message from that user via the revoked delegate. While those individual messages should be considered invalid, testing against a historical window of some length is suggested before considering the entire batch invalid. This is analogous to the idea of a confirmation time, but it applies to the past instead of the future.
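One way to implement the suggested historical window is sketched below. The function names and the grace-period length are illustrative; the spec does not fix a window.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical grace period; the spec suggests a window but no value.
REVOCATION_GRACE = timedelta(hours=1)

def message_valid(delegation_revoked_at: Optional[datetime],
                  batch_published_at: datetime) -> bool:
    """A message is invalid if its author's delegation was revoked
    before the batch was published."""
    return (delegation_revoked_at is None
            or delegation_revoked_at >= batch_published_at)

def batch_valid(revocations: list,
                batch_published_at: datetime) -> bool:
    """Reject the whole batch only when some delegation was revoked
    more than REVOCATION_GRACE before publication. Revocations inside
    the grace window invalidate individual messages, not the batch,
    to tolerate the race condition described above."""
    cutoff = batch_published_at - REVOCATION_GRACE
    return all(r is None or r >= cutoff for r in revocations)
```

With this split, a last-minute revocation drops only the affected messages, while a batch built from long-revoked delegations fails entirely.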