Notebook sample

This sample shows how Mesh for Data enables a Jupyter notebook workload to access a dataset. It demonstrates how policies are seamlessly applied when accessing the dataset classified as financial data.

In this sample you play multiple roles:

  1. As a data owner you upload a dataset and register it in a data catalog
  2. As a data steward you set up data governance policies
  3. As a data user you specify your data usage requirements and use a notebook to consume the data

Before you begin

  • Install Mesh for Data using the Quick Start guide. This sample assumes the use of the built-in catalog, Open Policy Agent (OPA), and the flight module.
  • A web browser.

Create a namespace for the sample

Create a new Kubernetes namespace and set it as the active namespace:

kubectl create namespace m4d-notebook-sample
kubectl config set-context --current --namespace=m4d-notebook-sample

This enables easy cleanup once you're done experimenting with the sample.

Prepare a dataset to be accessed by the notebook

This sample uses the Synthetic Financial Datasets For Fraud Detection dataset¹ as the data that the notebook needs to read. Download and extract the file to your machine. You should now see a file named PS_20174392719_1491204439457_log.csv. Alternatively, use a sample of 100 lines of the same dataset by downloading PS_20174392719_1491204439457_log.csv from GitHub.

Upload the CSV file to an object storage of your choice such as AWS S3, IBM Cloud Object Storage or Ceph. Make a note of the service endpoint, bucket name, and access credentials. You will need them later.

Set up and upload to MinIO

For experimentation, you can install MinIO in your cluster instead of using a cloud service.

  1. Define variables for access key and secret key
    export ACCESS_KEY="myaccesskey"
    export SECRET_KEY="mysecretkey"
    
  2. Install MinIO to the currently active namespace:
    kubectl create deployment minio --image=minio/minio:RELEASE.2021-02-14T04-01-33Z -- /bin/sh -ce "/usr/bin/docker-entrypoint.sh minio -S /etc/minio/certs/ server /export"
    kubectl set env deployment/minio MINIO_ACCESS_KEY=${ACCESS_KEY} MINIO_SECRET_KEY=${SECRET_KEY}
    kubectl wait --for=condition=available --timeout=120s deployment/minio
    
  3. Create a service to expose MinIO:
    kubectl expose deployment minio --port 9000
    
  4. Create a port-forward to connect to the MinIO UI:
    kubectl port-forward svc/minio 9000 &
    
  5. Open http://localhost:9000 and log in with the access key and secret key defined in step 1.
  6. Click the button in the bottom right corner and then Create bucket to create a bucket (e.g. "demo").
  7. Click the button again and then Upload files to upload the CSV file to the newly created bucket (or use the Python sketch after this list).
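
If you prefer to script steps 6 and 7 instead of using the web UI, the following sketch does the same with the MinIO Python SDK (pip install minio). It assumes the port-forward from step 4 is active and reuses the credentials and bucket name from the steps above; adjust them to match your setup.

from minio import Minio

# Connect through the port-forward from step 4 (plain HTTP, no TLS).
client = Minio("localhost:9000",
               access_key="myaccesskey",
               secret_key="mysecretkey",
               secure=False)

# Step 6: create the bucket if it does not exist yet.
if not client.bucket_exists("demo"):
    client.make_bucket("demo")

# Step 7: upload the CSV file from the current directory.
client.fput_object("demo",
                   "PS_20174392719_1491204439457_log.csv",
                   "PS_20174392719_1491204439457_log.csv")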

Register the dataset in a data catalog

Register the credentials required for accessing the dataset. Replace the values for access_key and secret_key with the values from the object storage service that you used and run:

cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: paysim-csv
type: Opaque
stringData:
  access_key: "${ACCESS_KEY}"
  secret_key: "${SECRET_KEY}"
EOF

Then, register the data asset itself in the catalog. Replace the values for endpoint, bucket and objectKey with values from the object storage service that you used and run:

cat << EOF | kubectl apply -f -
apiVersion: katalog.m4d.ibm.com/v1alpha1
kind: Asset
metadata:
  name: paysim-csv
spec:
  secretRef: 
    name: paysim-csv
  assetDetails:
    dataFormat: csv
    connection:
      type: s3
      s3:
        endpoint: "http://minio.m4d-notebook-sample.svc.cluster.local:9000"
        bucket: "demo"
        objectKey: "PS_20174392719_1491204439457_log.csv"
  assetMetadata:
    geography: theshire
    tags:
    - finance
    componentsMetadata:
      nameOrig: 
        tags:
        - PII
      oldbalanceOrg:
        tags:
        - sensitive
      newbalanceOrig:
        tags:
        - sensitive
EOF

The asset is now registered in the catalog. The identifier of the asset is m4d-notebook-sample/paysim-csv (i.e. <namespace>/<name>). You will use that name in the M4DApplication later.

Notice the assetMetadata field above. It specifies the dataset geography and tags. These attributes can later be used in policies.

Define data access policies

Define an Open Policy Agent (OPA) policy to redact the nameOrig column for datasets tagged as finance. Below is the policy (written in the Rego language):

package dataapi.authz

import data.data_policies as dp

transform[action] {
  description := "Redact sensitive columns in finance datasets"
  dp.AccessType() == "READ"
  dp.dataset_has_tag("finance")
  column_names := dp.column_with_any_name({"nameOrig"})
  action = dp.build_redact_column_action(column_names[_], dp.build_policy_from_description(description))
}
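
Conceptually, this transform causes the read module to replace every value in the matched nameOrig column with a redaction token before the data reaches the workload. A minimal pandas illustration of the effect; the token "XXXXX" and the sample values are assumptions, not the module's actual output:

import pandas as pd

# Two illustrative rows shaped like the PaySim dataset (values are examples).
df = pd.DataFrame({"nameOrig": ["C1231006815", "C1666544295"],
                   "amount": [9839.64, 1864.28]})

# Equivalent effect of the redact action on the matched column.
df["nameOrig"] = "XXXXX"
print(df)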

In this sample only the policy above is applied. Copy the policy to a file named sample-policy.rego and then run:

kubectl -n m4d-system create configmap sample-policy --from-file=sample-policy.rego
kubectl -n m4d-system label configmap sample-policy openpolicyagent.org/policy=rego
while [[ $(kubectl get cm sample-policy -n m4d-system -o 'jsonpath={.metadata.annotations.openpolicyagent\.org/policy-status}') != '{"status":"ok"}' ]]; do echo "waiting for policy to be applied" && sleep 5; done

You can similarly apply a directory holding multiple rego files.
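
To confirm that OPA actually loaded the policy (beyond the annotation check above), you can query OPA's REST API through a port-forward. A minimal sketch, assuming the OPA service is named opa in the m4d-system namespace (names may differ in your deployment):

import json
import urllib.request

# Assumes a port-forward is active:
#   kubectl -n m4d-system port-forward svc/opa 8181 &
with urllib.request.urlopen("http://localhost:8181/v1/policies") as resp:
    policy_ids = [p["id"] for p in json.load(resp)["result"]]

# The sample policy should appear among the loaded policy ids.
print(policy_ids)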

Deploy a Jupyter notebook

In this sample a Jupyter notebook is used as the user workload, and its business logic requires reading the asset that we registered (e.g., for creating a fraud detection model). Deploy a notebook to your cluster in one of two ways:

Option 1: Deploy JupyterLab directly

  1. Deploy JupyterLab:
    kubectl create deployment my-notebook --image=jupyter/base-notebook --port=8888 -- start.sh jupyter lab --LabApp.token=''
    kubectl set env deployment my-notebook JUPYTER_ENABLE_LAB=yes
    kubectl label deployment my-notebook app.kubernetes.io/name=my-notebook
    kubectl wait --for=condition=available --timeout=120s deployment/my-notebook
    kubectl expose deployment my-notebook --port=80 --target-port=8888
    
  2. Create a port-forward to communicate with JupyterLab:
    kubectl port-forward svc/my-notebook 8080:80 &
    
  3. Open your browser and go to http://localhost:8080/.
  4. Create a new notebook in the server.

Option 2: Use a Kubeflow notebook server

  1. Ensure that Kubeflow is installed in your cluster.
  2. Create a port-forward to communicate with Kubeflow:
    kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80 &
    
  3. Open your browser and go to http://localhost:8080/.
  4. Click Start Setup and then Finish (use the anonymous namespace).
  5. Click Notebook Servers (in the left sidebar).
  6. On the notebooks page, select the anonymous namespace in the top left and then click New Server.
  7. In the notebook server creation page, set my-notebook in the Name box and then click Launch. Wait for the server to become ready.
  8. Click Connect and create a new notebook in the server.

Create an M4DApplication resource for the notebook

Create an M4DApplication resource to register the notebook workload with the control plane of Mesh for Data:

cat <<EOF | kubectl apply -f -
apiVersion: app.m4d.ibm.com/v1alpha1
kind: M4DApplication
metadata:
  name: my-notebook
  labels:
    app: my-notebook
spec:
  selector:
    workloadSelector:
      matchLabels:
        app: my-notebook
  appInfo:
    intent: fraud-detection
  data:
    - dataSetID: "m4d-notebook-sample/paysim-csv"
      requirements:
        interface: 
          protocol: m4d-arrow-flight
          dataformat: arrow
EOF

Notice that:

  • The selector field matches the labels of our Jupyter notebook workload.
  • The data field includes a dataSetID that matches the asset identifier in the catalog.
  • The protocol and dataformat indicate that the developer wants to consume the data using Apache Arrow Flight.

Run the following command to wait until the M4DApplication is ready:

while [[ $(kubectl get m4dapplication my-notebook -o 'jsonpath={.status.ready}') != "true" ]]; do echo "waiting for M4DApplication" && sleep 5; done

Read the dataset from the notebook

  1. Insert a new notebook cell to install pandas and pyarrow packages:
    %pip install pandas pyarrow
    
  2. In your terminal, run the following command to print the code to use for reading the data. It fetches the code from the M4DApplication resource:
    printf "$(kubectl get m4dapplication my-notebook -o jsonpath={.status.dataAccessInstructions})"
    
  3. Insert a new notebook cell and paste into it the code for reading the data, as printed in the previous step (a sketch of typical generated code follows this list).
  4. Insert a new notebook cell with the following command to visualize the result:
    df
    
  5. Execute all notebook cells and notice that the nameOrig column appears redacted.
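
For reference, the code printed in step 2 typically resembles the Arrow Flight read below. This is only a sketch: the endpoint address and the request layout here are assumptions, so rely on the exact instructions printed by your own deployment.

import json
import pyarrow.flight as fl

# The endpoint address below is a hypothetical example; the real one is
# part of the dataAccessInstructions printed in step 2.
client = fl.connect("grpc://my-notebook-flight.m4d-system.svc.cluster.local:80")

# Ask the Flight server for the dataset; the request layout is an assumption.
request = {"asset": "m4d-notebook-sample/paysim-csv"}
info = client.get_flight_info(fl.FlightDescriptor.for_command(json.dumps(request)))

# Stream all record batches from the first endpoint into a pandas DataFrame.
df = client.do_get(info.endpoints[0].ticket).read_pandas()

# The policy applied earlier should leave the nameOrig column redacted.
df["nameOrig"].head()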

Cleanup

When you’re finished experimenting with the notebook sample, clean it up:

  1. Stop kubectl port-forward processes (e.g., using pkill kubectl)
  2. Delete the namespace created for this sample:
    kubectl delete namespace m4d-notebook-sample
    

¹ Created by NTNU and shared under the CC BY-SA 4.0 license.