Module Development
This page describes what must be provided when contributing a module.
Steps for creating a module
- Implement the logic of the module you are contributing. The implementation can either be directly in the Module Workload or in an external component. If the logic is in an external component, then the module workload should act as a client, i.e. receive parameters from the control plane and pass them to the external component.
- Create and publish the Module Helm Chart that will be used by the control plane to deploy the module workload, update it, and delete it as necessary.
- Create the FybrikModule YAML which describes the capabilities of the module workload, in which flows it should be considered for inclusion, its supported interfaces, and the link to the module helm chart.
- Test the new module
These steps are described in the following sections in more detail, so that you can create your own modules for use by Fybrik. Note that a new module is maintained in its own git repository, separate from the fybrik repository.
Module Workload
The module workload is associated with a specific user workload and is deployed by the control plane. It may implement the required logic itself, or it may be a client interface to an external component. The former will have module type "service" and the latter "config".
There is also a third type of module workload known as a plugin. It provides a standard interface by which another module may invoke its capabilities. For example, you may have a module that reads data but doesn't know how to do data transforms. Rather than implementing transforms in the module workload code, it can call the plugin to do the transforms. The control plane deploys the relevant transform plugin as well as the read module.
Credential management
Modules that read or write data need credentials in order to access the data store. The credentials are retrieved from HashiCorp Vault. The parameters needed to log in to Vault and to read the secret are passed as part of the arguments to the module Helm chart.
An example for Vault Login API call which uses the Vault parameters is as follows:
$ curl -v -X POST <address>/<authPath> -H "Content-Type: application/json" --data '{"jwt": <module service account token>, "role": <role>}'
An example for Vault Read Secret API call which uses the Vault parameters is as follows:
$ curl --header "X-Vault-Token: ..." -X GET <address>/<secretPath>
The Fybrik repository contains a Python Vault package that modules can use to retrieve the credentials.
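As a rough illustration of the two calls above, the following Python sketch builds the login and read-secret requests from the Vault parameters using only the standard library. It is not the Python Vault package from the Fybrik repository; the function names are hypothetical, and the requests are constructed but not sent.

```python
import json
import urllib.request


def build_login_request(address, auth_path, role, jwt):
    """Build (but do not send) the Vault login request from the curl example."""
    body = json.dumps({"jwt": jwt, "role": role}).encode("utf-8")
    return urllib.request.Request(
        url=address.rstrip("/") + "/" + auth_path.lstrip("/"),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_read_request(address, secret_path, token):
    """Build (but do not send) the Vault read-secret request."""
    return urllib.request.Request(
        url=address.rstrip("/") + "/" + secret_path.lstrip("/"),
        headers={"X-Vault-Token": token},
        method="GET",
    )
```

Passing either request to urllib.request.urlopen would perform the call. Vault returns the client token of a successful login under auth.client_token in the JSON response, and that token is what the read request sends in the X-Vault-Token header.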
Module Helm Chart
For any module chosen by the control plane to be part of the data path, the control plane needs to be able to install/remove/upgrade an instance of the module. Fybrik uses Helm to provide this functionality. Follow the Helm getting started guide if you are unfamiliar with Helm. Note that Helm 3.7 or above is required.
The names of the Kubernetes resources deployed by the module helm chart must contain the release name to avoid resource conflicts. A Kubernetes service resource which is used to access the module must have a name equal to the release name (this service name is also used in the optional spec.capabilities.api.endpoint.hostname field).
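The naming rules above can be checked mechanically against the resources a chart renders. The sketch below is only an illustration: the function name is hypothetical, and it assumes the rendered resources have already been loaded into Python dictionaries with the usual kind and metadata.name fields.

```python
def naming_problems(release_name, resources):
    """Return a list of problems; an empty list means the naming rules hold."""
    problems = []
    for res in resources:
        name = res["metadata"]["name"]
        # Every resource name must contain the release name.
        if release_name not in name:
            problems.append(f'{res["kind"]} "{name}" does not contain the release name')
    # Simplifying assumption: if the chart defines any Service, one of them
    # is the access point and must be named exactly like the release.
    service_names = [r["metadata"]["name"] for r in resources if r["kind"] == "Service"]
    if service_names and release_name not in service_names:
        problems.append("no Service is named exactly like the release")
    return problems
```

For a release called my-release, a Deployment named my-release-worker together with a Service named my-release passes the check, while a Service named module-svc is reported twice: once for not containing the release name and once for not matching it.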
Because the chart is installed by the control plane, the input values to the chart will contain the following information:
- .Values.assets - a list of asset arguments such as datastores, transformations, etc.
- .Values.selector - application selector
- .Values.context - application context
- .Values.labels - labels specified in FybrikApplication
- .Values.uuid - a unique id of FybrikApplication
An example of values passed to a module (values.sample.yaml):
labels:
  app.fybrik.io/app-name: my-notebook-read
  namespace: fybrik-notebook-sample
uuid: 12345678
context:
  intent: "Fraud Detection"
selector:
  matchLabels:
    app: my-notebook
assets:
- args:
  - connection:
      name: s3
      s3:
        bucket: fybrik-test-bucket
        endpoint: s3.eu-gb.cloud-object-storage.appdomain.cloud
        object_key: test1.parquet
    format: parquet
    vault:
      read:
        address: http://vault.fybrik-system:8200
        authPath: /v1/auth/kubernetes/login
        role: module
        secretPath: /v1/kubernetes-secrets/data-creds?namespace=fybrik-notebook-sample
  assetID: "test1"
  capability: read
  transformations:
  - name: "RedactAction"
    RedactAction:
      columns:
      - col1
      - col2
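To show how a module workload might consume these values, here is a minimal Python sketch that walks the assets list and collects, for each asset argument, the connection, format, Vault address, and requested transformations. The function name and the summary layout are hypothetical, and the nesting of the dictionary is an assumption based on the sample values: args is taken to be a list under each asset, with connection, format and vault inside each argument.

```python
def asset_summaries(values):
    """Flatten the chart values into one summary per asset argument."""
    summaries = []
    for asset in values.get("assets", []):
        actions = [t["name"] for t in asset.get("transformations", [])]
        for arg in asset.get("args", []):
            vault_read = arg.get("vault", {}).get("read", {})
            summaries.append({
                "assetID": asset.get("assetID"),
                "capability": asset.get("capability"),
                "protocol": arg.get("connection", {}).get("name"),
                "format": arg.get("format"),
                "vault_address": vault_read.get("address"),
                "actions": actions,
            })
    return summaries


# The sample values from this page, trimmed to the fields used above.
sample_values = {
    "uuid": 12345678,
    "assets": [{
        "args": [{
            "connection": {"name": "s3", "s3": {"bucket": "fybrik-test-bucket"}},
            "format": "parquet",
            "vault": {"read": {"address": "http://vault.fybrik-system:8200"}},
        }],
        "assetID": "test1",
        "capability": "read",
        "transformations": [
            {"name": "RedactAction", "RedactAction": {"columns": ["col1", "col2"]}},
        ],
    }],
}
```

Calling asset_summaries(sample_values) yields a single entry for asset test1 with protocol s3, format parquet and the RedactAction transformation.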
If the module workload needs to return information to the user, that information should be written to the NOTES.txt of the helm chart.
For a full example see the Arrow Flight Module chart.
Publishing the Helm Chart
Once your Helm chart is ready, you need to push it to an OCI-based registry such as ghcr.io. This allows the control plane of Fybrik to later pull the chart whenever it needs to be installed.
You can use the hack/make-rules/helm.mk Makefile, or push the chart manually with the following commands:
helm registry login -u <username> <registry>
helm package <chart folder> -d <local-chart-path>
helm push <local-chart-path> oci://<registry>/<path>
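When publishing is scripted, the same three commands can be assembled programmatically. The Python sketch below only builds the argument lists and runs nothing. The function and parameter names are hypothetical, and it assumes that the local chart path handed to helm push is the archive written by helm package, which Helm names <chart name>-<chart version>.tgz.

```python
def helm_publish_commands(username, registry, path, chart_dir,
                          chart_name, chart_version, out_dir):
    """Return the login, package and push commands as argument lists."""
    # helm package writes <name>-<version>.tgz into the -d directory.
    archive = f"{out_dir.rstrip('/')}/{chart_name}-{chart_version}.tgz"
    return [
        ["helm", "registry", "login", "-u", username, registry],
        ["helm", "package", chart_dir, "-d", out_dir],
        ["helm", "push", archive, f"oci://{registry}/{path}"],
    ]
```

Each list can be passed to subprocess.run(cmd, check=True). Note that helm registry login prompts for a password unless one is supplied by other means.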
FybrikModule YAML
FybrikModule is a Kubernetes custom resource, defined by a Custom Resource Definition (CRD), which describes to the control plane the functionality provided by the module. The FybrikModule custom resource has no controller. The specification of the FybrikModule custom resource is available in the API documentation.
The YAML file begins with standard Kubernetes metadata followed by the FybrikModule specification:
apiVersion: app.fybrik.io/v1beta1
kind: FybrikModule # always this value
metadata:
  name: "<module name>" # the name of your new module
  labels:
    name: "<module name>" # the name of your new module
    version: "<semantic version>"
  namespace: fybrik-system # control plane namespace. Always fybrik-system
spec:
  ...
The child fields of spec are described next.
spec.chart
This is a link to the Helm chart stored in the image registry. This is similar to how a Kubernetes Pod references a container image. See Module Helm chart for more details.
spec:
  chart:
    name: "<helm chart link>" # e.g.: ghcr.io/username/chartname:chartversion
    values:
      image.tag: v0.0.1
spec.statusIndicators
Used for tracking the status of the module in terms of success or failure. In many cases this can be omitted and the status will be detected automatically.
If the Helm chart includes only standard Kubernetes resources such as Deployment and Service, then the status is detected automatically. If, however, Custom Resource Definitions are used, then the status may not be detected automatically and statusIndicators should be specified.
statusIndicators:
- kind: "<module name>"
  successCondition: "<condition>" # ex: status.status == SUCCEEDED
  failureCondition: "<condition>" # ex: status.status == FAILED
  errorMessage: "<field path>" # ex: status.error
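To make the role of the three fields concrete, the following Python sketch evaluates a status indicator against a custom resource held as a dictionary. It does not reproduce how the Fybrik control plane evaluates conditions; it is a hypothetical evaluator that understands only the "<field path> == <value>" form used in the comments above, and all function names are illustrative.

```python
def get_path(obj, path):
    """Follow a dotted field path such as 'status.error'; None if absent."""
    current = obj
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def condition_holds(resource, condition):
    """Evaluate a condition of the form '<field path> == <value>' only."""
    path, sep, expected = condition.partition("==")
    if not sep:
        raise ValueError(f"unsupported condition: {condition!r}")
    return str(get_path(resource, path.strip())) == expected.strip()


def module_status(resource, indicator):
    """Return (state, error message) for one status indicator."""
    if condition_holds(resource, indicator["failureCondition"]):
        return "failed", get_path(resource, indicator["errorMessage"])
    if condition_holds(resource, indicator["successCondition"]):
        return "succeeded", None
    return "pending", None
```

With the example conditions above, a resource whose status.status is FAILED and whose status.error holds a message is reported as failed together with that message, and a resource with no status yet is reported as pending.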
spec.dependencies
A dependency has a type and a name. Currently dependencies of type module are supported, indicating that another module must also be installed for this module to work.
dependencies:
- type: module # currently the only option is a dependency on another module deployed by the control plane
  name: <dependent module name>
spec.type
The type field may be one of the following values:
1) service - Indicates that the module workload implements the module's logic and is deployed by the fybrik control plane.
2) config - In this case the logic is performed by a component deployed externally, i.e. not by the fybrik control plane. Such components can be assumed to support multiple workloads.
3) plugin (FUTURE) - This type of module enables a subset of often-used capabilities to be implemented once and reused by any module that supports plugins of the declared type.
spec.pluginType
(Future Functionality) The types of plugins supported by this module. Example: vault, fybrik-wasm ...
spec.capabilities
Each module may support one or more capabilities. Currently there are four capabilities: read, for enabling an application to read data or prepare data for being read; write, for enabling an application to write data; copy, for performing an implicit data copy on behalf of the application; and transform, for altering data based on governance policies.
capabilities.capability
Indicates which of the types of capabilities this instance describes.
capability: # Indicate the capabilities for which the control plane should consider using this module
- read # optional
- write # optional
- copy # optional
- transform # optional
capabilities.scope
The capability provided by the module may work on one of several different scopes:
- workload - deployed once by fybrik and available for use by the data planes of all the datasets
- asset - deployed by fybrik for each dataset
- cluster - deployed outside of fybrik and can be used by multiple fybrik workloads in a given cluster
scope: <scope of the capability> # cluster, workload, asset
capabilities.supportedInterfaces
Lists the supported data services from which the module can read data (sources) and to which it can write (sinks). There can be multiple sources and sinks. For each, a protocol and format are provided.
- The protocol field can take a value such as kafka, s3, db2, fybrik-arrow-flight, etc.
- The dataformat field can take a value such as avro, parquet, json, or csv.
Note that a module that targets copy flows will omit the api field and contain just source and sink, while a module that only supports reading data assets will omit the sink field and contain only api and source.
capabilities.api
Describes the API exposed by the module to the user's workload for the particular capability.
- The protocol field can take a value such as kafka, s3, db2, fybrik-arrow-flight, etc.
- The dataformat field can take a value such as parquet, csv, avro, etc.
- The endpoint field describes the endpoint exposed by the module.
capabilities.api.endpoint
Describes the endpoint from a networking perspective:
- The hostname field is the hostname to be used when accessing the module. It equals the release name and can be omitted.
- The port field is the port of the service exposed by the module.
- The scheme field can take a value such as http, https, grpc, grpc+tls, jdbc:oracle:thin:@, etc.
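The following Python sketch shows one way the three endpoint fields could be combined into an address, falling back to the release name when hostname is omitted. The function name is hypothetical, and the sketch assumes a URL-style scheme such as http or grpc; a scheme like jdbc:oracle:thin:@ would need a different layout.

```python
def endpoint_address(endpoint, release_name):
    """Combine scheme, hostname and port into an address string."""
    # hostname is optional and defaults to the Helm release name.
    host = endpoint.get("hostname") or release_name
    return f'{endpoint["scheme"]}://{host}:{endpoint["port"]}'
```

For an endpoint with port 80 and scheme grpc installed under the release name my-release, the function returns grpc://my-release:80.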
An example for a module that copies data from a db2 database table to an s3 bucket in parquet format.
capabilities:
- capability: copy
  supportedInterfaces:
  - source:
      protocol: db2
    sink:
      protocol: s3
      dataformat: parquet
An example for a module that has an API for reading data, and supports reading both parquet and csv formats from s3.
capabilities:
- capability: read
  api:
    protocol: fybrik-arrow-flight
    endpoint:
      port: 80
      scheme: grpc
  supportedInterfaces:
  - source:
      protocol: s3
      dataformat: parquet
  - source:
      protocol: s3
      dataformat: csv
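To illustrate how such a declaration can be used, the Python sketch below checks whether a list of capabilities declares a given source protocol and data format for a given capability. It is a simplified stand-in and not the control plane's selection logic, which also takes policies and other constraints into account; the function name is hypothetical and the capabilities are assumed to be loaded into dictionaries.

```python
def supports_source(capabilities, capability, protocol, dataformat):
    """True if some entry for `capability` lists the given source interface."""
    for cap in capabilities:
        if cap.get("capability") != capability:
            continue
        for iface in cap.get("supportedInterfaces", []):
            source = iface.get("source", {})
            if (source.get("protocol") == protocol
                    and source.get("dataformat") == dataformat):
                return True
    return False


# The read example from this page, expressed as Python data.
read_module_capabilities = [{
    "capability": "read",
    "api": {"protocol": "fybrik-arrow-flight",
            "endpoint": {"port": 80, "scheme": "grpc"}},
    "supportedInterfaces": [
        {"source": {"protocol": "s3", "dataformat": "parquet"}},
        {"source": {"protocol": "s3", "dataformat": "csv"}},
    ],
}]
```

With this data, reading parquet or csv from s3 is reported as supported, while json from s3 is not.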
capabilities.actions
Actions are taken from a defined Enforcement Actions Taxonomy. A module that does not perform any transformation on the data may omit the capabilities.actions field.
The following is an example of how a module would declare that it knows how to redact, remove or encrypt data. Additional properties may be associated with each action.
capabilities:
- capability: read
  actions:
  - name: "RedactAction"
  - name: "RemoveAction"
  - name: "EncryptAction"
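A declaration like this lets a caller determine which required actions a module cannot perform. The Python sketch below is a hypothetical helper, not part of Fybrik, and assumes that a capability entry is a dictionary with an optional actions list of named actions.

```python
def unsupported_actions(capability_entry, required):
    """Return the required action names the capability does not declare."""
    declared = {action["name"] for action in capability_entry.get("actions", [])}
    return [name for name in required if name not in declared]


# The example above as Python data.
read_capability = {
    "capability": "read",
    "actions": [
        {"name": "RedactAction"},
        {"name": "RemoveAction"},
        {"name": "EncryptAction"},
    ],
}
```

An empty result means every required action is declared. A capability entry with no actions field reports every required action as unsupported.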
Full Examples
The following are examples of YAMLs from fully implemented modules:
- An example YAML for a module that copies from db2 to s3 and includes transformation actions
- And an example arrow flight read module YAML, also with transformation support
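Alongside the linked examples, the fields described on this page can be assembled programmatically. The Python sketch below builds a FybrikModule manifest as a dictionary using only fields named on this page; the function name and the argument values in the usage note are placeholders, and optional fields such as statusIndicators and dependencies are left out.

```python
import json


def fybrik_module_manifest(name, version, chart, module_type, capabilities):
    """Assemble a FybrikModule manifest from the fields described above."""
    return {
        "apiVersion": "app.fybrik.io/v1beta1",
        "kind": "FybrikModule",
        "metadata": {
            "name": name,
            "labels": {"name": name, "version": version},
            "namespace": "fybrik-system",  # control plane namespace
        },
        "spec": {
            "type": module_type,  # service, config or plugin
            "chart": {"name": chart},
            "capabilities": capabilities,
        },
    }
```

For example, json.dumps(fybrik_module_manifest("my-module", "0.1.0", "ghcr.io/username/chartname:chartversion", "service", []), indent=2) produces a JSON document; Kubernetes tooling such as kubectl accepts manifests in JSON as well as YAML.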
Getting Started
To help module developers get started, there are two example "hello world" modules:
- Hello world module
- Hello world read module
An example of a fully functional module is the arrow flight module (https://github.com/fybrik/arrow-flight-module).
Test
- Register the module to make the control plane aware of it.
- Create a FybrikApplication YAML for a user workload, ensuring that the data set and other parameters included in it, together with the governance policies defined in the policy manager, will result in your module being chosen based on the control plane logic.
- Apply the FybrikApplication YAML.
- View the FybrikApplication status.
- Run the user workload and review the results to check that they are what is expected.