Deploy a ZebClient Advanced Cluster

To deploy a cluster for advanced analytics, we should keep in mind that the flow of information will follow a pattern like this:

  1. Get raw data from one or several source systems.

  2. Process the raw data to transform it into meaningful, structured data.

  3. Collect the structured data, then validate and update the information that will be used by downstream consumer systems such as BI tools or LLMs.

Under the hood, a pipeline implemented with Elyra will fetch the data from a source system and place it, as it comes and in raw format, into what we call the "bronze layer" of information. This is a path inside a shared Kubernetes volume (PVC) backed by ZebClient.

The next step in the pipeline will process the raw data, transform it into structured data in Parquet format, and validate the quality of the information. This is the "silver layer", another path inside the shared volume managed by ZebClient. Transforming raw data into structured data may have specific hardware requirements, such as GPU/CPU usage or a certain amount of memory. These requirements are specified in Elyra, and Kubeflow executes each pipeline step in an isolated Kubernetes pod with the hardware specifications defined in Elyra.

Finally, the last step of the data pipeline will take the structured data and update the files that are actually being used by the consumer systems. This is the "gold layer".
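
As an illustration of a bronze-to-silver step, here is a minimal sketch in Python, assuming the shared ZebClient-backed volume is mounted at /data inside the step's pod, that the raw bronze files are CSVs, and that "amount" is a column worth validating; these paths, column names, and validation rules are hypothetical and will differ per pipeline.

```python
# Minimal sketch of a bronze -> silver pipeline step.
# Assumptions (not from the original text): the shared ZebClient-backed PVC is
# mounted at /data, bronze files are CSV, and "amount" is a column to validate.
from pathlib import Path

import pandas as pd

BRONZE_DIR = Path("/data/bronze")   # raw data as it arrived from the source system
SILVER_DIR = Path("/data/silver")   # validated, structured data in Parquet format
SILVER_DIR.mkdir(parents=True, exist_ok=True)

for raw_file in BRONZE_DIR.glob("*.csv"):
    df = pd.read_csv(raw_file)

    # Basic quality checks before promoting the data to the silver layer.
    df = df.dropna(how="all")          # drop completely empty rows
    df = df.drop_duplicates()          # drop exact duplicates
    if "amount" in df.columns:
        df = df[df["amount"] >= 0]     # hypothetical validation rule

    # Write the structured result as Parquet alongside the other silver files.
    out_file = SILVER_DIR / f"{raw_file.stem}.parquet"
    df.to_parquet(out_file, index=False)
```

A later gold-layer step would read these Parquet files and update the datasets actually served to the consumer systems.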

To support this pipeline, the required infrastructure would look like this (this is a suggested deployment using open-source tools, but it can vary depending on each business case):

For this infrastructure, we will need a Kubernetes cluster with four node pools:

  • default: Runs the Kubernetes system services (kube-system).

  • general: Runs the open-source services required by the suggested advanced analytics infrastructure with ZebClient:

    • Kubeflow

    • JupyterHub

    • MinIO

    • MySQL

  • executor: Runs the pods that execute each pipeline step (see the scheduling sketch after this list).

  • executor-gpu: Runs the pods that execute pipeline steps that require GPU processing.
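
To show how a pipeline step ends up on the executor or executor-gpu pool, here is a minimal sketch using the Kubernetes Python client. The node pool label key ("agentpool"), image name, and namespace are assumptions and will differ per environment; in practice Elyra and Kubeflow build an equivalent pod spec from the resources configured in the pipeline editor.

```python
# Minimal sketch of pinning a GPU pipeline step to the executor-gpu node pool.
# The label key "agentpool", the image, and the namespace are assumptions;
# Elyra/Kubeflow normally generate an equivalent spec for each step.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-step", namespace="pipelines"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        node_selector={"agentpool": "executor-gpu"},  # target the GPU node pool
        containers=[
            client.V1Container(
                name="train",
                image="example.registry/train-step:latest",  # hypothetical image
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "4", "memory": "16Gi"},
                    limits={"nvidia.com/gpu": "1"},  # request one GPU
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="pipelines", body=pod)
```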

Every node pool runs a pod with the CSI driver for the ZebClient storage class. This pod also runs a ZebClient agent that interacts with ZebClient's cold storage and, if enabled, ZebClient's acceleration nodes.
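
For completeness, the shared volume used for the bronze/silver/gold layers could be requested with a PVC like the one sketched below, again using the Kubernetes Python client. The storage class name ("zebclient"), namespace, and size are assumptions, so check the actual ZebClient installation for the correct values.

```python
# Minimal sketch of requesting the shared volume for the bronze/silver/gold
# layers. The storage class name "zebclient", the namespace, and the size are
# assumptions; adjust them to match the actual ZebClient installation.
from kubernetes import client, config

config.load_kube_config()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="analytics-data", namespace="pipelines"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],   # shared by all pipeline step pods
        storage_class_name="zebclient",   # backed by the ZebClient CSI driver
        resources=client.V1ResourceRequirements(requests={"storage": "500Gi"}),
    ),
)

client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="pipelines", body=pvc
)
```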
