Neuroglancer Precomputed Sharded Format
The precomputed sharded format is logically a key-value store that maps 8-byte (uint64) keys to arbitrary byte sequence values.
It packs any number of key/value pairs (called chunks) into a fixed number of larger shard files. Compared to storing each key in a separate file, it can reduce storage space overhead and improve write efficiency on storage systems with high per-file overhead, as is common in many distributed storage systems including cloud object stores. There are several downsides to the sharded format, however:
It requires greater complexity in the generation pipeline.
It is not possible to re-write the data for individual chunks; the entire shard must be re-written.
There is somewhat higher read latency due to the need to retrieve additional index information before retrieving the actual chunk data, although this latency is partially mitigated by client-side caching of the index data in Neuroglancer.
The sharded format uses a two-level index hierarchy:
There are a fixed number of shards, and a fixed number of minishards within each shard.
Each chunk, identified by a uint64 identifier, is mapped via a hash function to a particular shard and minishard. In the case of meshes and skeletons, the chunk identifier is simply the segment ID. In the case of volumetric and annotation data, the chunk identifier is the compressed Morton code.
A fixed-size shard index stored at the start of each shard file specifies, for each minishard, the start and end offsets within the shard file of the corresponding minishard index.
The variable-size minishard index specifies the list of chunk IDs present in the minishard and the corresponding start and end offsets of the data within the shard file.
Note
The sharded format requires that the underlying key-value store supports byte range reads.
The sharded format consists of the sharding metadata parameters, which are embedded in the parent format info metadata file, and a directory containing the shard data files.
Sharding metadata
json PrecomputedSharding : object
Precomputed sharded format parameters.
Required members:
- @type : "neuroglancer_uint64_sharded_v1"
- preshift_bits : integer[0, 64]
  Number of low-order bits of the chunk ID that do not contribute to the hashed chunk ID.
- hash : "identity" | "murmurhash3_x86_128"
  Specifies the hash function used to map chunk IDs to shards.
- minishard_bits : integer[0, 64]
  Number of bits of the hashed chunk ID that determine the minishard number. The number of minishards within each shard is equal to \(2^{\mathrm{minishard\_bits}}\). The minishard number is equal to bits [0, minishard_bits) of the hashed chunk ID.
- shard_bits : integer[0, 64]
  Number of bits of the hashed chunk ID that determine the shard number. The number of shards is equal to \(2^{\mathrm{shard\_bits}}\). The shard number is equal to bits [minishard_bits, minishard_bits+shard_bits) of the hashed chunk ID.
- minishard_index_encoding : "raw" | "gzip"
  Specifies the encoding of the minishard index. Normally "gzip" is a good choice.
- data_encoding : "raw" | "gzip"
  Specifies the encoding of the data chunks. Normally "gzip" is a good choice, unless the data is expected to already be fully compressed.
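As a rough illustration of how the bit fields above combine, the following Python sketch computes the shard and minishard numbers for a chunk ID. It covers only the "identity" hash, and it assumes that preshift_bits is applied as a right shift of the chunk ID before hashing. The function name shard_location is hypothetical, and "murmurhash3_x86_128" is not implemented here.

```python
def shard_location(chunk_id, preshift_bits, minishard_bits, shard_bits):
    """Return (shard, minishard) for a chunk ID, assuming hash == "identity"."""
    # Assumption: the low preshift_bits bits are dropped with a right shift.
    hashed = chunk_id >> preshift_bits
    # Minishard number: bits [0, minishard_bits) of the hashed chunk ID.
    minishard = hashed & ((1 << minishard_bits) - 1)
    # Shard number: bits [minishard_bits, minishard_bits + shard_bits).
    shard = (hashed >> minishard_bits) & ((1 << shard_bits) - 1)
    return shard, minishard


if __name__ == "__main__":
    print(shard_location(0b11011010, preshift_bits=1, minishard_bits=3, shard_bits=2))
```

With preshift_bits=1, the chunk ID 0b11011010 becomes 0b1101101; its low three bits (0b101) select the minishard and the next two bits (0b01) select the shard.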
Shard data files
For each shard number in the range [0, 2**shard_bits), there is a <shard>.shard file, where <shard> is the lowercase base-16 shard number zero padded to ceil(shard_bits/4) digits.
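The naming rule above can be sketched in Python as follows. The helper name shard_filename is hypothetical and chosen only for illustration.

```python
def shard_filename(shard, shard_bits):
    """Return the <shard>.shard file name for a shard number."""
    # ceil(shard_bits / 4) hexadecimal digits, using integer arithmetic.
    digits = (shard_bits + 3) // 4
    # Lowercase base-16, left-padded with zeros to the required width.
    return format(shard, "x").zfill(digits) + ".shard"


if __name__ == "__main__":
    print(shard_filename(10, shard_bits=9))
```

For example, with shard_bits=9 the shard number 10 is padded to three hexadecimal digits.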
Note
There was an earlier (obsolete) version of the sharded format, which also used the same "neuroglancer_uint64_sharded_v1" identifier. The earlier format differed only in that there was a separate <shard>.index file (containing the shard index) and a <shard>.data file (containing the remaining data) in place of the single <shard>.shard file of the current format; the <shard>.shard file is equivalent to the concatenation of the <shard>.index and <shard>.data files of the earlier version.
Shard index format
The first 2**minishard_bits * 16 bytes of each shard file are the shard index, consisting of 2**minishard_bits 16-byte entries of the form:
start_offset: uint64le, specifies the inclusive start byte offset of the minishard index in the shard file.
end_offset: uint64le, specifies the exclusive end byte offset of the minishard index in the shard file.
Both the start_offset and end_offset are relative to the end of the shard index, i.e. shard_index_end = 2**minishard_bits * 16 bytes. That is, the encoded minishard index for a given minishard is stored in the byte range [shard_index_end + start_offset, shard_index_end + end_offset) of the shard file. A zero-length byte range indicates that there are no chunk IDs in the minishard.
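A minimal Python sketch of reading one shard index entry is shown below. It assumes the caller has already fetched the first 2**minishard_bits * 16 bytes of the shard file (for example with a byte range read); the function name minishard_index_range is hypothetical.

```python
import struct


def minishard_index_range(shard_index, minishard, minishard_bits):
    """Return the absolute [start, end) byte range of a minishard index.

    shard_index holds the first 2**minishard_bits * 16 bytes of the shard file.
    Returns None when the minishard contains no chunks.
    """
    shard_index_end = (1 << minishard_bits) * 16
    # Each entry is two little-endian uint64 values: start_offset, end_offset.
    start_offset, end_offset = struct.unpack_from("<QQ", shard_index, minishard * 16)
    if start_offset == end_offset:
        return None  # zero-length range: no chunk IDs in this minishard
    # Offsets are relative to the end of the shard index.
    return shard_index_end + start_offset, shard_index_end + end_offset


if __name__ == "__main__":
    # Two minishards: the first is empty, the second has a 40-byte index.
    index = struct.pack("<QQQQ", 0, 0, 0, 40)
    print(minishard_index_range(index, 1, minishard_bits=1))
```

The returned range can then be used for a second byte range read that retrieves the encoded minishard index.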
Minishard index format
The minishard index stored in the shard file is encoded according to the minishard_index_encoding metadata value.
The decoded minishard index is a binary string of 24*n bytes, specifying a contiguous C-order array of [3, n] uint64le values.
Values array[0, 0], ..., array[0, n-1] specify the chunk IDs in the minishard, and are delta encoded, such that array[0, 0] is equal to the ID of the first chunk, and the ID of chunk i is equal to the sum of array[0, 0], ..., array[0, i].
The size of the data for chunk i is stored as array[2, i].
Values array[1, 0], ..., array[1, n-1] specify the starting offsets in the shard file of the data corresponding to each chunk, and are also delta encoded relative to the end of the prior chunk, such that the starting offset of the first chunk is equal to shard_index_end + array[1, 0], and the starting offset of chunk i is the sum of shard_index_end + array[1, 0], ..., array[1, i] and array[2, 0], ..., array[2, i-1].
The start and size values in the minishard index specify the location in the shard file of the chunk data, which is encoded according to the data_encoding metadata value.
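The delta encoding of the minishard index described above can be unpacked as in the following Python sketch. It uses only the standard library; the function name decode_minishard_index and the returned dictionary layout are hypothetical choices for illustration, not part of the format.

```python
import gzip
import struct
from itertools import accumulate


def decode_minishard_index(encoded, encoding, shard_index_end):
    """Map each chunk ID to the (start, size) of its data in the shard file."""
    raw = gzip.decompress(encoded) if encoding == "gzip" else encoded
    n = len(raw) // 24  # three uint64le values per chunk
    values = struct.unpack(f"<{3 * n}Q", raw)
    # C-order [3, n] array: row 0 = ID deltas, row 1 = offset deltas, row 2 = sizes.
    id_deltas, offset_deltas, sizes = values[:n], values[n:2 * n], values[2 * n:]
    entries = {}
    position = shard_index_end  # end of the previous chunk, initially the index end
    for chunk_id, gap, size in zip(accumulate(id_deltas), offset_deltas, sizes):
        start = position + gap  # offsets are relative to the end of the prior chunk
        entries[chunk_id] = (start, size)
        position = start + size
    return entries


if __name__ == "__main__":
    # Two chunks: ID deltas [5, 3], offset deltas [0, 10], sizes [100, 50].
    raw = struct.pack("<6Q", 5, 3, 0, 10, 100, 50)
    print(decode_minishard_index(raw, "raw", shard_index_end=32))
```

Each resulting (start, size) pair identifies the bytes to fetch for a chunk; those bytes still need to be decoded according to data_encoding.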