
Module Development

This page describes what must be provided when contributing a module.

Steps for creating a module

  1. Implement the logic of the module you are contributing. The implementation can either be directly in the Module Workload or in an external component. If the logic is in an external component, then the module workload should act as a client - i.e. receiving parameters from the control plane and passing them to the external component.
  2. Create and publish the Module Helm Chart that will be used by the control plane to deploy the module workload, update it, and delete it as necessary.
  3. Create the FybrikModule YAML which describes the capabilities of the module workload, in which flows it should be considered for inclusion, its supported interfaces, and the link to the module helm chart.
  4. Test the new module

These steps are described in the following sections in more detail, so that you can create your own modules for use by Fybrik. Note that a new module is maintained in its own git repository, separate from the fybrik repository.

Module Workload

The module workload is associated with a specific user workload and is deployed by the control plane. It may implement the logic required itself, or it may be a client interface to an external component.

Credential management

Modules that access or write data need credentials in order to access the data store. The credentials are retrieved from HashiCorp Vault. The parameters to log in to Vault and to read a secret are passed as part of the arguments to the module Helm chart.

An example of a Vault login API call that uses these Vault parameters:

$ curl -v -X POST <address>/<authPath> -H "Content-Type: application/json" --data '{"jwt": <module service account token>, "role": <role>}'

An example of a Vault read-secret API call that uses these Vault parameters:

$ curl --header "X-Vault-Token: ..." -X GET https://<address>/<secretPath>
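Putting the two calls together, a module-side sketch might look as follows. This is only an illustration: it assumes the service account token is mounted at the standard Kubernetes path, that `jq` is available in the module image, and that the `ADDRESS`, `AUTH_PATH`, `ROLE`, and `SECRET_PATH` variables hold the Vault parameters passed in via the chart values.

```shell
#!/bin/sh
# Sketch only: ADDRESS, AUTH_PATH, ROLE, SECRET_PATH come from the chart values.
SA_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)

# Log in with the Kubernetes auth method; the client token is returned
# under .auth.client_token in the JSON response.
VAULT_TOKEN=$(curl -s -X POST "${ADDRESS}/${AUTH_PATH}" \
  -H "Content-Type: application/json" \
  --data "{\"jwt\": \"${SA_TOKEN}\", \"role\": \"${ROLE}\"}" \
  | jq -r '.auth.client_token')

# Read the data-store credentials using the obtained token.
curl -s --header "X-Vault-Token: ${VAULT_TOKEN}" "${ADDRESS}/${SECRET_PATH}"
```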

Module Helm Chart

For any module chosen by the control plane to be part of the data path, the control plane needs to be able to install/remove/upgrade an instance of the module. Fybrik uses Helm to provide this functionality. Follow the Helm getting started guide if you are unfamiliar with Helm. Note that Helm 3.3 or above is required.

The names of the Kubernetes resources deployed by the module helm chart must contain the release name to avoid resource conflicts. A Kubernetes service resource which is used to access the module must have a name equal to the release name (this service name is also used in the optional spec.capabilities.api.endpoint.hostname field).
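For instance, a minimal Service template that follows this naming rule might look like the following sketch; the port numbers and selector labels here are placeholders, not required values.

```yaml
# templates/service.yaml -- sketch; ports and selector labels are placeholders
apiVersion: v1
kind: Service
metadata:
  name: {{ .Release.Name }}   # service name must equal the release name
  labels:
    app.kubernetes.io/instance: {{ .Release.Name }}
spec:
  selector:
    app.kubernetes.io/instance: {{ .Release.Name }}
  ports:
    - port: 80
      targetPort: 8080
```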

Because the chart is installed by the control plane, the input values to the chart must match the types of the arguments that the control plane passes.

If the module workload needs to return information to the user, that information should be written to the NOTES.txt of the helm chart.

For a full example see the Arrow Flight Module chart.

Publishing the Helm Chart

Once your Helm chart is ready, you need to push it to an OCI-based registry such as ghcr.io. This allows the control plane of Fybrik to later pull the chart whenever it needs to be installed.

You can use the hack/make-rules/helm.mk Makefile, or manually push the chart:

export HELM_EXPERIMENTAL_OCI=1
helm registry login -u <username> <registry>
helm chart save <chart folder> <registry>/<path>:<version>
helm chart push <registry>/<path>:<version>
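Note that the `helm chart save` and `helm chart push` subcommands were removed in Helm 3.7, where OCI support changed shape (it became stable in Helm 3.8). On newer Helm versions the equivalent flow is a sketch like the following, with the registry and chart names as placeholders:

```shell
# Helm >= 3.8: OCI support is stable, no experimental flag is needed.
helm registry login -u <username> <registry>
helm package <chart folder>                # produces <chartname>-<version>.tgz
helm push <chartname>-<version>.tgz oci://<registry>/<path>
```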

FybrikModule YAML

FybrikModule is a Kubernetes Custom Resource Definition (CRD) which describes to the control plane the functionality provided by the module. The FybrikModule CRD has no controller. The specification of the FybrikModule Kubernetes CRD is available in the API documentation.

The YAML file begins with standard Kubernetes metadata followed by the FybrikModule specification:

apiVersion: app.fybrik.io/v1alpha1 # always this value
kind: FybrikModule # always this value
metadata:
  name: "<module name>" # the name of your new module
  namespace: fybrik-system  # control plane namespace. Always fybrik-system
spec:
   ...

The child fields of spec are described next.

spec.chart

This is a link to the Helm chart stored in the image registry. This is similar to how a Kubernetes Pod references a container image. See Module Helm chart for more details.

spec:
  chart: "<helm chart link>" # e.g.: ghcr.io/username/chartname:chartversion

spec.statusIndicators

Used for tracking the status of the module in terms of success or failure. In many cases this can be omitted and the status will be detected automatically.

If the Helm chart includes standard Kubernetes resources such as Deployment and Service, then the status is automatically detected. If, however, Custom Resource Definitions are used, then the status may not be automatically detected and statusIndicators should be specified.

statusIndicators:
    - kind: "<module name>"
      successCondition: "<condition>" # ex: status.status == SUCCEEDED
      failureCondition: "<condition>" # ex: status.status == FAILED
      errorMessage: "<field path>" # ex: status.error

spec.dependencies

A dependency has a type and a name. Currently dependencies of type module are supported, indicating that another module must also be installed for this module to work.

dependencies:
    - type: module #currently the only option is a dependency on another module deployed by the control plane
      name: <dependent module name>

spec.flows

The flows field indicates the types of capabilities supported by the module. Three data flows are currently supported: read for enabling an application to read data or prepare data for being read, write for enabling an application to write data, and copy for performing an implicit data copy on behalf of the application. A module is associated with one or more data flows based on its functionality.

flows: # Indicate the data flow(s) in which the control plane should consider using this module 
- read  # optional
- write # optional
- copy  # optional

spec.capabilities

capabilities.supportedInterfaces lists the supported data services from which the module can read data and to which it can write:

  * flow field can be read, write or copy
  * protocol field can take a value such as kafka, s3, jdbc-db2, fybrik-arrow-flight, etc.
  * dataformat field can take a value such as avro, parquet, json, or csv

Note that a module that targets copy flows omits the api field and contains just source and sink, while a module that only supports reading data assets omits the sink field and contains only api and source.

capabilities.api describes the API exposed by the module for reading or writing data from the user's workload:

  * protocol field can take a value such as kafka, s3, jdbc-db2, fybrik-arrow-flight, etc.
  * dataformat field can take a value such as parquet, csv, arrow, etc.
  * endpoint field describes the endpoint exposed by the module

capabilities.api.endpoint describes the endpoint from a networking perspective:

  * hostname field is the hostname to be used when accessing the module. It equals the release name and can be omitted.
  * port field is the port of the service exposed by the module
  * scheme field can take a value such as http, https, grpc, grpc+tls, jdbc:oracle:thin:@, etc.

An example of a module that copies data from a db2 database table to an s3 bucket in parquet format:

capabilities:
    supportedInterfaces:
    - flow: copy  
      source:
        protocol: jdbc-db2
        dataformat: table
      sink:
        protocol: s3
        dataformat: parquet

An example of a module that exposes an API for reading data and supports reading both parquet and csv formats from s3:

capabilities:
    api:
      protocol: fybrik-arrow-flight
      dataformat: arrow
      endpoint:
        port: 80
        scheme: grpc
    supportedInterfaces:
    - flow: read
      source:
        protocol: s3
        dataformat: parquet
    - flow: read
      source:
        protocol: s3
        dataformat: csv

capabilities.actions are taken from a defined Enforcement Actions Taxonomy. A module that does not perform any transformation on the data may omit the capabilities.actions field.

The following is an example of how a module would declare that it knows how to redact, remove or encrypt data. Each action has a level indication, which can be data set level, column level, or row level. In the example shown, column level is indicated, and the action's args indicate the columns on which the transformation should be performed.

capabilities:
    actions:
    - id: "redact-ID"
      level: 2 # column
      args:
        column_name: column_value
    - id: "removed-ID"
      level: 2 # column
      args:
        column_name: column_value
    - id: "encrypt-ID"
      level: 2 # column

Full Examples

The following are examples of YAMLs from fully implemented modules:

Test

  1. Register the module to make the control plane aware of it.
  2. Create a FybrikApplication YAML for a user workload, ensuring that the data set and other parameters included in it, together with the governance policies defined in the policy manager, result in your module being chosen by the control plane logic.
  3. Apply the FybrikApplication YAML.
  4. View the FybrikApplication status.
  5. Run the user workload and review the results to check if they are what is expected.
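The test steps above might look as follows on the command line. This is a sketch: the file names, namespace, and resource names are placeholders for your own.

```shell
# 1. Register the module in the control plane namespace.
kubectl apply -f module.yaml -n fybrik-system

# 2-3. Create and apply the FybrikApplication for the user workload.
kubectl apply -f fybrikapplication.yaml -n <workload-namespace>

# 4. View the FybrikApplication status to verify the module was chosen.
kubectl get fybrikapplication <application-name> -n <workload-namespace> -o yaml
```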