Data Flows
Data Flow Mechanisms
ZebClient processes data using various mechanisms to ensure consistency, performance, security, and high availability. Three mechanisms in particular provide these properties.
Firstly, files are split up at multiple levels to improve performance and find the right balance between keeping data flowing efficiently through the system, reducing the overhead associated with various signaling mechanisms, and making the most effective use of the resources available.
The second mechanism is tiering, combined with the ability to elastically expand the storage that backs each tier. Tiering ensures that data is offloaded to ZebClient as close as possible to the application, minimizing the time the application waits for the offload to complete. The tiers are elastic: they can be expanded as necessary so that fast data offload remains available for typical application use cases, and they can be reduced again to remain cost-effective.
The third mechanism is erasure coding, which ZebClient performs using a standard erasure code in a K+M configuration. Whilst the main reason for coding in this way is to remain available through up to M simultaneous failures, a side effect of the striping is that write operations can be performed in parallel across the cluster.
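To make the K+M idea concrete, the following minimal sketch uses a single XOR parity shard, which is the simplest possible case (M = 1). It is an illustration only: the source does not state which erasure code ZebClient uses, and production K+M codes that tolerate M > 1 failures are more involved. All function names here are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def split_into_shards(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal-length data shards, zero-padding the tail."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\x00")
    return [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Return k data shards plus one XOR parity shard (K+M with M = 1)."""
    shards = split_into_shards(data, k)
    parity = reduce(xor_bytes, shards)
    return shards + [parity]


def recover(shards: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one missing shard (None) by XOR-ing the survivors."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("XOR parity tolerates only one lost shard")
    repaired = list(shards)
    if missing:
        survivors = [s for s in shards if s is not None]
        repaired[missing[0]] = reduce(xor_bytes, survivors)
    return repaired


def storage_overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data in a K+M layout."""
    return (k + m) / k
```

Because each of the K+M shards is an independent unit, the shards can be written to different nodes at the same time, which is the parallelism referred to above. The overhead helper shows the cost side: a 4+2 layout, for example, stores 1.5 bytes for every byte of user data.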
Writing Data to ZebClient
The file is initially received as blocks via the particular intake protocol and then handled as logical units. These units are referred to as chunks and slices.
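The exact relationship between blocks, chunks, and slices is not spelled out here, so the sketch below rests on an assumption: a file is divided into fixed-size chunks, and a slice is a contiguous range written within one chunk. The chunk size and the `Slice` type are hypothetical and are not documented ZebClient values.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical chunk size, chosen only for illustration.
CHUNK_SIZE = 64 * 1024 * 1024


@dataclass(frozen=True)
class Slice:
    chunk_index: int  # which chunk of the file the slice belongs to
    offset: int       # start of the slice within that chunk
    length: int       # number of bytes covered by the slice


def slices_for_write(file_offset: int, length: int,
                     chunk_size: int = CHUNK_SIZE) -> List[Slice]:
    """Map one write (file offset + length) onto per-chunk slices."""
    slices: List[Slice] = []
    pos = file_offset
    remaining = length
    while remaining > 0:
        chunk_index, offset = divmod(pos, chunk_size)
        # A slice never crosses a chunk boundary.
        take = min(remaining, chunk_size - offset)
        slices.append(Slice(chunk_index, offset, take))
        pos += take
        remaining -= take
    return slices
```

With a chunk size of 100 bytes, a 50-byte write at offset 90 yields two slices: 10 bytes at the end of chunk 0 and 40 bytes at the start of chunk 1.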
When the updated slice is ready, it will be further sharded according to the K+M configuration. These shards are then passed to the acceleration engine nodes, where they are persisted to the fast storage located in these nodes. The metadata associated with these shards is also updated at this time.
This logical unit of work ensures that the data received as blocks is now persisted across a redundant set of nodes, according to the K+M redundancy configuration.
At a later point, these shards are uploaded to object storage. Metadata associated with these shards is also transferred to object storage.
During the writing phase, ZebClient acts as a high-performance, durable, and highly-available buffer for slower object storage.
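The buffering behaviour described above can be sketched as a write-back buffer: a write is acknowledged once it reaches the fast tier, and the upload to object storage happens later. This is a simplified model, not ZebClient's implementation; the class and its in-memory dictionaries are stand-ins for the acceleration nodes and the object store.

```python
from typing import Dict, List


class WriteBackBuffer:
    """Toy model: fast tier absorbs writes, object storage is filled later."""

    def __init__(self) -> None:
        self.fast_tier: Dict[str, bytes] = {}     # stands in for acceleration nodes
        self.object_store: Dict[str, bytes] = {}  # stands in for slower object storage
        self.pending: List[str] = []              # keys not yet uploaded

    def write(self, key: str, data: bytes) -> None:
        """Persist to the fast tier and return without waiting for upload."""
        self.fast_tier[key] = data
        if key not in self.pending:
            self.pending.append(key)

    def flush(self) -> int:
        """Upload everything still pending; returns the number of uploads."""
        uploaded = 0
        while self.pending:
            key = self.pending.pop(0)
            self.object_store[key] = self.fast_tier[key]
            uploaded += 1
        return uploaded

    def read(self, key: str) -> bytes:
        """Serve from the fast tier when possible, else from object storage."""
        if key in self.fast_tier:
            return self.fast_tier[key]
        return self.object_store[key]
```

In this model, a read issued straight after a write is served from the fast tier even though nothing has reached object storage yet.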
Reading Data from ZebClient
ZebClient reads data from object storage shard by shard. The shards contain the blocks that will eventually be passed up through the tiers. The block size is configurable and can be tuned to the particular situation and use case. Data that is read is promoted through the internal data tiers, and frequently read data is placed close to the application to ensure the fastest access. During the read phase, ZebClient therefore acts as a distributed caching engine, able to retrieve data from S3-compatible storage and make it available for rapid retrieval.
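Promotion through tiers can be illustrated with a small model in which each read moves a block one tier closer to the application. The promotion policy shown (one tier per read, no eviction) is an assumption made for the sake of a short example; the source does not describe ZebClient's actual policy, and `TieredReader` is a hypothetical name.

```python
from typing import Dict, List


class TieredReader:
    """Toy read path: tiers[0] is closest to the application."""

    def __init__(self, object_store: Dict[str, bytes], tier_count: int = 2) -> None:
        self.object_store = object_store
        self.tiers: List[Dict[str, bytes]] = [{} for _ in range(tier_count)]

    def read(self, key: str) -> bytes:
        for level, tier in enumerate(self.tiers):
            if key in tier:
                data = tier[key]
                if level > 0:
                    # Cache hit below the top tier: promote one level up.
                    del tier[key]
                    self.tiers[level - 1][key] = data
                return data
        # Miss in every tier: fetch from object storage, enter at the lowest tier.
        data = self.object_store[key]
        self.tiers[-1][key] = data
        return data

    def location(self, key: str) -> int:
        """Tier index currently holding the key, or -1 if it is not cached."""
        for level, tier in enumerate(self.tiers):
            if key in tier:
                return level
        return -1
```

Reading the same key repeatedly moves it from object storage into the lowest tier, then upwards until it sits in the tier nearest the application.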
Read performance tuning is achieved by ensuring the highest cache hit rate for the given situation. Factors that affect the hit rate include cache size, prefetch strategy, eviction strategy, application access patterns, file sizes, block sizes, and read/write ratios.
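The interaction between cache size, eviction strategy, and access pattern can be explored with a small simulation. The sketch below assumes a least-recently-used (LRU) eviction strategy purely as an example; the source does not say which eviction strategy ZebClient applies, and `lru_hit_rate` is a hypothetical helper.

```python
from collections import OrderedDict
from typing import Iterable


def lru_hit_rate(accesses: Iterable[int], capacity: int) -> float:
    """Replay a sequence of block ids against an LRU cache; return hits / reads."""
    cache: "OrderedDict[int, None]" = OrderedDict()
    hits = 0
    total = 0
    for block in accesses:
        total += 1
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / total if total else 0.0


# Five passes over the same four blocks.
pattern = [0, 1, 2, 3] * 5
print(lru_hit_rate(pattern, capacity=4))  # 0.8: only the first pass misses
print(lru_hit_rate(pattern, capacity=2))  # 0.0: each block is evicted before reuse
```

The same access pattern gives very different hit rates depending on whether the working set fits in the cache, which is why cache size and access pattern have to be considered together.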
The more often cached data is accessed by subsequent reads, the higher the cache hit rate and the more ZebClient can improve on the read performance of the object storage. The data flow through each component in this case is as follows:
ZebClient can be configured to cater for scenarios with widely differing read patterns. Typically, we increase the size of the internal data tiers to provide a larger cache surface, thereby maintaining a high cache hit ratio. Larger internal data tiers are called for where there is a large amount of data and the application requires random access to small blocks in larger files.
Sequentially reading small files is much easier. The whole file is often read in a single request, and this can be ensured by increasing the block size. Because small files are cached before they are written to object storage, a file that is read soon after it is written will be served from the local cache, and high read performance can be expected.