Notebook sample

This sample shows how Fybrik enables a Jupyter notebook workload to access a dataset. It demonstrates how data governance policies are applied automatically when the notebook reads a dataset classified as financial data.

In this sample you play multiple roles:

As a data owner, you upload a dataset and register it in a data catalog.
As a data steward, you set up data governance policies.
As a data user, you specify your data usage requirements and use a notebook to consume the data.

Before you begin

Install Fybrik using the Quick Start guide. This sample assumes the use of the built-in catalog, Open Policy Agent (OPA) and flight module.
A web browser.

Create a namespace for the sample

Create a new Kubernetes namespace and set it as the active namespace:

kubectl create namespace fybrik-notebook-sample
kubectl config set-context --current --namespace=fybrik-notebook-sample

This enables easy cleanup once you're done experimenting with the sample.

Prepare a dataset to be accessed by the notebook

This sample uses the Synthetic Financial Datasets For Fraud Detection dataset¹ as the data that the notebook needs to read. Download and extract the file to your machine. You should now see a file named PS_20174392719_1491204439457_log.csv. Alternatively, use a sample of 100 lines of the same dataset by downloading PS_20174392719_1491204439457_log.csv from GitHub.

Upload the CSV file to an object storage of your choice such as AWS S3, IBM Cloud Object Storage or Ceph. Make a note of the service endpoint, bucket name, and access credentials. You will need them later.

Set up and upload to localstack

For experimentation you can install localstack to your cluster instead of using a cloud service.

Define variables for the access key and secret key:

export ACCESS_KEY="myaccesskey"
export SECRET_KEY="mysecretkey"

Install localstack to the currently active namespace and wait for it to be ready:

helm repo add localstack-charts https://localstack.github.io/helm-charts
helm install localstack localstack-charts/localstack --set startServices="s3" --set service.type=ClusterIP
kubectl wait --for=condition=ready --all pod -n fybrik-notebook-sample --timeout=120s

Create a port-forward to communicate with localstack server:
```
kubectl port-forward svc/localstack 4566:4566 &
```

Use the AWS CLI to upload the dataset to a newly created bucket in the localstack server:

export ENDPOINT="http://127.0.0.1:4566"
export BUCKET="demo"
export OBJECT_KEY="PS_20174392719_1491204439457_log.csv"
export FILEPATH="/path/to/PS_20174392719_1491204439457_log.csv"
aws configure set aws_access_key_id "${ACCESS_KEY}" && \
aws configure set aws_secret_access_key "${SECRET_KEY}" && \
aws --endpoint-url="${ENDPOINT}" s3api create-bucket --bucket "${BUCKET}" && \
aws --endpoint-url="${ENDPOINT}" s3api put-object --bucket "${BUCKET}" --key "${OBJECT_KEY}" --body "${FILEPATH}"

Register the dataset in a data catalog

Register the credentials required for accessing the dataset. Replace the values for access_key and secret_key with the values from the object storage service that you used and run:

cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: paysim-csv
type: Opaque
stringData:
  access_key: "${ACCESS_KEY}"
  secret_key: "${SECRET_KEY}"
EOF

Then, register the data asset itself in the catalog. Replace the values for endpoint, bucket and objectKey with values from the object storage service that you used and run:

cat << EOF | kubectl apply -f -
apiVersion: katalog.fybrik.io/v1alpha1
kind: Asset
metadata:
  name: paysim-csv
spec:
  secretRef: 
    name: paysim-csv
  assetDetails:
    dataFormat: csv
    connection:
      type: s3
      s3:
        endpoint: "http://localstack.fybrik-notebook-sample.svc.cluster.local:4566"
        bucket: "demo"
        objectKey: "PS_20174392719_1491204439457_log.csv"
  assetMetadata:
    geography: theshire
    tags:
    - finance
    componentsMetadata:
      nameOrig: 
        tags:
        - PII
      oldbalanceOrg:
        tags:
        - sensitive
      newbalanceOrig:
        tags:
        - sensitive
EOF

The asset is now registered in the catalog. The identifier of the asset is fybrik-notebook-sample/paysim-csv (i.e., <namespace>/<name>). You will use that identifier in the FybrikApplication later.

Notice the assetMetadata field above. It specifies the dataset geography and tags. These attributes can later be used in policies.
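
If you used your own object storage, the connection values in the manifest change while the rest stays the same. As a minimal sketch (not part of Fybrik), the manifest can be generated from Python and printed as JSON, which kubectl apply -f - accepts as well as YAML. The helper names in_cluster_endpoint, asset_id and build_asset are hypothetical; the field names are copied from the manifest above, and the endpoint helper assumes plain HTTP and the default cluster domain cluster.local.

```python
import json


def in_cluster_endpoint(service: str, namespace: str, port: int) -> str:
    """URL of a Service as seen from inside the cluster.

    Assumes plain HTTP and the default cluster domain `cluster.local`.
    """
    return f"http://{service}.{namespace}.svc.cluster.local:{port}"


def asset_id(namespace: str, name: str) -> str:
    """Catalog identifier in the <namespace>/<name> form used by this sample."""
    return f"{namespace}/{name}"


def build_asset(name: str, secret_name: str, endpoint: str,
                bucket: str, object_key: str) -> dict:
    """Return the Asset manifest from this sample as a plain dictionary."""
    return {
        "apiVersion": "katalog.fybrik.io/v1alpha1",
        "kind": "Asset",
        "metadata": {"name": name},
        "spec": {
            "secretRef": {"name": secret_name},
            "assetDetails": {
                "dataFormat": "csv",
                "connection": {
                    "type": "s3",
                    "s3": {
                        "endpoint": endpoint,
                        "bucket": bucket,
                        "objectKey": object_key,
                    },
                },
            },
            "assetMetadata": {
                "geography": "theshire",
                "tags": ["finance"],
                "componentsMetadata": {
                    "nameOrig": {"tags": ["PII"]},
                    "oldbalanceOrg": {"tags": ["sensitive"]},
                    "newbalanceOrig": {"tags": ["sensitive"]},
                },
            },
        },
    }


# Rebuild the manifest of this sample and print it as JSON.
manifest = build_asset(
    name="paysim-csv",
    secret_name="paysim-csv",
    endpoint=in_cluster_endpoint("localstack", "fybrik-notebook-sample", 4566),
    bucket="demo",
    object_key="PS_20174392719_1491204439457_log.csv",
)
print(json.dumps(manifest, indent=2))
```

Saved under a file name of your choice, the output of this script could be piped to kubectl apply -f - in place of the heredoc.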

Define data access policies

Define an Open Policy Agent (OPA) policy to redact the nameOrig column for datasets tagged as finance. Below is the policy, written in the Rego language:

package dataapi.authz

import data.data_policies as dp

transform[action] {
  description := "Redact sensitive columns in finance datasets"
  dp.AccessType() == "READ"
  dp.dataset_has_tag("finance")
  column_names := dp.column_with_any_name({"nameOrig"})
  action = dp.build_redact_column_action(column_names[_], dp.build_policy_from_description(description))
}

In this sample only the policy above is applied. Copy the policy to a file named sample-policy.rego and then run:

kubectl -n fybrik-system create configmap sample-policy --from-file=sample-policy.rego
kubectl -n fybrik-system label configmap sample-policy openpolicyagent.org/policy=rego
while [[ $(kubectl get cm sample-policy -n fybrik-system -o 'jsonpath={.metadata.annotations.openpolicyagent\.org/policy-status}') != '{"status":"ok"}' ]]; do echo "waiting for policy to be applied" && sleep 5; done

You can similarly apply a directory holding multiple Rego files by passing the directory path to --from-file.
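
To make the effect of the rule concrete, the following Python sketch mirrors its decision logic outside OPA: on a READ of a dataset tagged finance, every column named nameOrig yields a redact action. This is only an illustration, not how OPA or Fybrik evaluates Rego. The function name redact_actions and the shape of the returned dictionaries are assumptions made for this example.

```python
def redact_actions(access_type, dataset_tags, columns,
                   target_names=frozenset({"nameOrig"})):
    """Return one illustrative redact action per matching column.

    Mirrors the three conditions of the Rego rule: the access type is
    READ, the dataset carries the "finance" tag, and the column name is
    in the target set.
    """
    if access_type != "READ":
        return []
    if "finance" not in dataset_tags:
        return []
    return [{"action": "redact", "column": name}
            for name in columns if name in target_names]


# Columns that carry tags in the catalog entry of this sample.
columns = ["nameOrig", "oldbalanceOrg", "newbalanceOrig"]
print(redact_actions("READ", ["finance"], columns))
```

Only nameOrig is selected. The columns tagged sensitive in the catalog are left untouched, because the rule matches on column name rather than on column tags.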

Deploy a Jupyter notebook

In this sample, a Jupyter notebook is the user workload. Its business logic requires reading the asset that we registered (e.g., for creating a fraud detection model). Deploy a notebook to your cluster using either JupyterLab or Kubeflow:

JupyterLab

Deploy JupyterLab:

kubectl create deployment my-notebook --image=jupyter/base-notebook --port=8888 -- start.sh jupyter lab --LabApp.token=''
kubectl set env deployment my-notebook JUPYTER_ENABLE_LAB=yes
kubectl label deployment my-notebook app.kubernetes.io/name=my-notebook
kubectl wait --for=condition=available --timeout=120s deployment/my-notebook
kubectl expose deployment my-notebook --port=80 --target-port=8888

Create a port-forward to communicate with JupyterLab:
```
kubectl port-forward svc/my-notebook 8080:80 &
```
Open your browser and go to http://localhost:8080/.
Create a new notebook in the server.

Kubeflow

Ensure that Kubeflow is installed in your cluster.

Create a port-forward to communicate with Kubeflow:

kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80 &

Open your browser and go to http://localhost:8080/.
Click Start Setup and then Finish (use the anonymous namespace).
Click Notebook Servers (on the left).
On the notebooks page, select the anonymous namespace in the top left and then click New Server.
On the notebook server creation page, set my-notebook in the Name box and then click Launch. Wait for the server to become ready.
Click Connect and create a new notebook in the server.

Create a `FybrikApplication` resource for the notebook

Create a FybrikApplication resource to register the notebook workload to the control plane of Fybrik:

cat <<EOF | kubectl apply -f -
apiVersion: app.fybrik.io/v1alpha1
kind: FybrikApplication
metadata:
  name: my-notebook
  labels:
    app: my-notebook
spec:
  selector:
    workloadSelector:
      matchLabels:
        app: my-notebook
  appInfo:
    intent: fraud-detection
  data:
    - dataSetID: "fybrik-notebook-sample/paysim-csv"
      requirements:
        interface: 
          protocol: fybrik-arrow-flight
          dataformat: arrow
EOF

Notice that:

The selector field matches the labels of our Jupyter notebook workload.
The data field includes a dataSetID that matches the asset identifier in the catalog.
The protocol and dataformat indicate that the developer wants to consume the data using Apache Arrow Flight.
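
The selector uses standard Kubernetes matchLabels semantics: a workload matches when every key/value pair in the selector is present in its labels, and extra labels on the workload are ignored. A minimal sketch of that check follows. The function name selector_matches is hypothetical, and the example label set assumes the notebook workload carries app=my-notebook (the label kubectl create deployment adds by default) next to the app.kubernetes.io/name label set earlier.

```python
def selector_matches(match_labels: dict, workload_labels: dict) -> bool:
    """True when every selector pair appears in the workload labels."""
    return all(workload_labels.get(key) == value
               for key, value in match_labels.items())


# matchLabels from the FybrikApplication in this sample.
selector = {"app": "my-notebook"}

# Assumed labels of the notebook workload (see the lead-in).
notebook_labels = {
    "app": "my-notebook",
    "app.kubernetes.io/name": "my-notebook",
}

print(selector_matches(selector, notebook_labels))
```

If the check fails for your own workload, compare the output of kubectl get pods --show-labels with the matchLabels block.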

Run the following command to wait until the FybrikApplication is ready:

while [[ $(kubectl get fybrikapplication my-notebook -o 'jsonpath={.status.ready}') != "true" ]]; do echo "waiting for FybrikApplication" && sleep 5; done
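
The shell loop above polls forever if the application never becomes ready. As a hedged alternative, the sketch below polls with a deadline. The names wait_until and fybrikapplication_ready are hypothetical helpers, the 300-second default is an arbitrary choice, and the kubectl invocation is the same one used in the loop above.

```python
import subprocess
import time


def wait_until(check, timeout_s=300.0, interval_s=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns True or timeout_s elapses.

    Returns True on success and False on timeout. The sleep and clock
    parameters exist only to make the function easy to test.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def fybrikapplication_ready(name="my-notebook"):
    """Return True when .status.ready of the FybrikApplication is "true"."""
    result = subprocess.run(
        ["kubectl", "get", "fybrikapplication", name,
         "-o", "jsonpath={.status.ready}"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip() == "true"


# Usage (requires kubectl and a cluster, so it is left commented out):
# if not wait_until(fybrikapplication_ready):
#     raise SystemExit("FybrikApplication did not become ready in time")
```

On timeout, kubectl describe fybrikapplication my-notebook is a reasonable next step for finding out what is blocking the application.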

Read the dataset from the notebook

In your terminal, run the following commands to print the endpoint to use for reading the data. They read the endpoint details from the status of the FybrikApplication resource:

ENDPOINT_SCHEME=$(kubectl get fybrikapplication my-notebook -o 'jsonpath={.status.readEndpointsMap.fybrik-notebook-sample/paysim-csv.scheme}')
ENDPOINT_HOSTNAME=$(kubectl get fybrikapplication my-notebook -o 'jsonpath={.status.readEndpointsMap.fybrik-notebook-sample/paysim-csv.hostname}')
ENDPOINT_PORT=$(kubectl get fybrikapplication my-notebook -o 'jsonpath={.status.readEndpointsMap.fybrik-notebook-sample/paysim-csv.port}')
printf '%s://%s:%s\n' "${ENDPOINT_SCHEME}" "${ENDPOINT_HOSTNAME}" "${ENDPOINT_PORT}"
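
The same endpoint can be assembled from a single kubectl get fybrikapplication my-notebook -o json call instead of three separate queries. The sketch below assumes the status layout implied by the JSONPath expressions above (a readEndpointsMap keyed by asset identifier, with scheme, hostname and port fields). The function name read_endpoint is hypothetical, and the values in the example dictionary are placeholders rather than real output.

```python
import json


def read_endpoint(app: dict, asset: str) -> str:
    """Build <scheme>://<hostname>:<port> for one asset of a FybrikApplication."""
    endpoint = app["status"]["readEndpointsMap"][asset]
    return f'{endpoint["scheme"]}://{endpoint["hostname"]}:{endpoint["port"]}'


# Placeholder document standing in for the output of
# `kubectl get fybrikapplication my-notebook -o json`.
example = json.loads("""
{
  "status": {
    "readEndpointsMap": {
      "fybrik-notebook-sample/paysim-csv": {
        "scheme": "grpc",
        "hostname": "example-host",
        "port": 80
      }
    }
  }
}
""")

print(read_endpoint(example, "fybrik-notebook-sample/paysim-csv"))
```

Because the lookup uses the asset identifier as a plain dictionary key, the slash in the identifier needs no escaping.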

The next steps use the endpoint to read the data in a Python notebook.

Insert a new notebook cell to install pandas and pyarrow packages:
```
%pip install pandas pyarrow
```

Insert a new notebook cell to read the data using the endpoint value extracted from the FybrikApplication in the previous step:

import json
import pyarrow.flight as fl
import pandas as pd

# Create a Flight client; replace <ENDPOINT> with the value printed in the previous step
client = fl.connect('<ENDPOINT>')

# Prepare the request
request = {
    "asset": "fybrik-notebook-sample/paysim-csv",
    # To request specific columns add to the request a "columns" key with a list of column names
    # "columns": [...]
}

# Send request and fetch result as a pandas DataFrame
info = client.get_flight_info(fl.FlightDescriptor.for_command(json.dumps(request)))
reader: fl.FlightStreamReader = client.do_get(info.endpoints[0].ticket)
df: pd.DataFrame = reader.read_pandas()

Insert a new notebook cell with the following command to visualize the result:
```
df
```
Execute all notebook cells and notice that the nameOrig column appears redacted.
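
The request comment in the read cell mentions an optional "columns" key. A small helper can build either form of the request. This is a sketch: the function name build_request is hypothetical, and only the "asset" and "columns" keys shown in the cell are used.

```python
import json


def build_request(asset, columns=None):
    """Serialize a read request, optionally restricted to some columns."""
    request = {"asset": asset}
    if columns:
        request["columns"] = list(columns)
    return json.dumps(request)


# Request the whole dataset, then only two of its columns.
print(build_request("fybrik-notebook-sample/paysim-csv"))
print(build_request("fybrik-notebook-sample/paysim-csv",
                    columns=["nameOrig", "oldbalanceOrg"]))
```

In the notebook, the returned string would take the place of json.dumps(request) in the call to fl.FlightDescriptor.for_command. Including nameOrig in the column list is a quick way to confirm that the column still comes back redacted.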

Cleanup

When you’re finished experimenting with the notebook sample, clean it up:

Stop the kubectl port-forward processes (e.g., using pkill kubectl).

Delete the namespace created for this sample:

kubectl delete namespace fybrik-notebook-sample

¹ Created by NTNU and shared under the CC BY-SA 4.0 license.