# Notebook sample
This sample shows how Mesh for Data enables a Jupyter notebook workload to access a dataset. It demonstrates how policies are seamlessly applied when accessing the dataset classified as financial data.
In this sample you play multiple roles:
- As a data owner you upload a dataset and register it in a data catalog
- As a data steward you set up data governance policies
- As a data user you specify your data usage requirements and use a notebook to consume the data
## Before you begin
- Install Mesh for Data using the Quick Start guide. This sample assumes the use of the built-in catalog, Open Policy Agent (OPA), and the flight module.
- A web browser.
## Create a namespace for the sample
Create a new Kubernetes namespace and set it as the active namespace:
```bash
kubectl create namespace m4d-notebook-sample
kubectl config set-context --current --namespace=m4d-notebook-sample
```
This enables easy cleanup once you're done experimenting with the sample.
## Prepare a dataset to be accessed by the notebook
This sample uses the Synthetic Financial Datasets For Fraud Detection dataset[^1] as the data that the notebook needs to read. Download and extract the file to your machine. You should now see a file named `PS_20174392719_1491204439457_log.csv`. Alternatively, use a sample of 100 lines of the same dataset by downloading `PS_20174392719_1491204439457_log.csv` from GitHub.
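If you prefer to produce the 100-line sample yourself, one way is to keep the header plus the first 100 data rows. A minimal sketch using only Python's standard library (the file names in the commented example are placeholders):

```python
import csv

def sample_csv(src_path, dst_path, n_rows=100):
    """Copy the header row plus the first n_rows data rows of a CSV file."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))  # keep the header row
        for i, row in enumerate(reader):
            if i >= n_rows:
                break
            writer.writerow(row)

# Example (file names are placeholders):
# sample_csv("PS_20174392719_1491204439457_log.csv", "sample.csv")
```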
Upload the CSV file to an object storage of your choice such as AWS S3, IBM Cloud Object Storage or Ceph. Make a note of the service endpoint, bucket name, and access credentials. You will need them later.
### Set up and upload to MinIO
For experimentation you can install MinIO to your cluster instead of using a cloud service.
- Define variables for the access key and secret key:

  ```bash
  export ACCESS_KEY="myaccesskey"
  export SECRET_KEY="mysecretkey"
  ```
- Install MinIO to the currently active namespace:

  ```bash
  kubectl create deployment minio --image=minio/minio:RELEASE.2021-02-14T04-01-33Z -- /bin/sh -ce "/usr/bin/docker-entrypoint.sh minio -S /etc/minio/certs/ server /export"
  kubectl set env deployment/minio MINIO_ACCESS_KEY=${ACCESS_KEY} MINIO_SECRET_KEY=${SECRET_KEY}
  kubectl wait --for=condition=available --timeout=120s deployment/minio
  ```
- Create a service to expose MinIO:

  ```bash
  kubectl expose deployment minio --port 9000
  ```
- Create a port-forward to connect to the MinIO UI:

  ```bash
  kubectl port-forward svc/minio 9000 &
  ```
- Open http://localhost:9000 and login with the access key and secret key defined in step 1
- Click the button in the bottom right corner and then Create bucket to create a bucket (e.g. "demo").
- Click the button again and then Upload files to upload a file to the newly created bucket.
## Register the dataset in a data catalog
Register the credentials required for accessing the dataset. Replace the values for `access_key` and `secret_key` with the values from the object storage service that you used and run:
```bash
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: paysim-csv
type: Opaque
stringData:
  access_key: "${ACCESS_KEY}"
  secret_key: "${SECRET_KEY}"
EOF
```
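Note that `kubectl apply` also accepts JSON, so if you are scripting the registration you can build the same Secret programmatically instead of templating YAML. A minimal sketch (the name and key values are the placeholders from this sample):

```python
import json

def make_secret(name, access_key, secret_key):
    """Build a Kubernetes Secret manifest equivalent to the YAML above."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        # stringData accepts plain strings; Kubernetes base64-encodes them on write.
        "stringData": {"access_key": access_key, "secret_key": secret_key},
    }

manifest = json.dumps(make_secret("paysim-csv", "myaccesskey", "mysecretkey"))
# Pipe `manifest` to `kubectl apply -f -` to create the Secret.
```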
Then, register the data asset itself in the catalog. Replace the values for `endpoint`, `bucket`, and `objectKey` with values from the object storage service that you used and run:
```bash
cat << EOF | kubectl apply -f -
apiVersion: katalog.m4d.ibm.com/v1alpha1
kind: Asset
metadata:
  name: paysim-csv
spec:
  secretRef:
    name: paysim-csv
  assetDetails:
    dataFormat: csv
    connection:
      type: s3
      s3:
        endpoint: "http://minio.m4d-notebook-sample.svc.cluster.local:9000"
        bucket: "demo"
        objectKey: "PS_20174392719_1491204439457_log.csv"
  assetMetadata:
    geography: theshire
    tags:
    - finance
    componentsMetadata:
      nameOrig:
        tags:
        - PII
      oldbalanceOrg:
        tags:
        - sensitive
      newbalanceOrig:
        tags:
        - sensitive
EOF
```
The asset is now registered in the catalog. The identifier of the asset is `m4d-notebook-sample/paysim-csv` (i.e., `<namespace>/<name>`). You will use that name in the `M4DApplication` later.

Notice the `assetMetadata` field above. It specifies the dataset geography and tags. These attributes can later be used in policies.
## Define data access policies
Define an Open Policy Agent policy to redact the `nameOrig` column for datasets tagged as `finance`. Below is the policy (written in the Rego language):
```rego
package dataapi.authz

import data.data_policies as dp

transform[action] {
  description := "Redact sensitive columns in finance datasets"
  dp.AccessType() == "READ"
  dp.dataset_has_tag("finance")
  column_names := dp.column_with_any_name({"nameOrig"})
  action = dp.build_redact_column_action(column_names[_], dp.build_policy_from_description(description))
}
```
In this sample only the policy above is applied. Copy the policy to a file named `sample-policy.rego` and then run:
```bash
kubectl -n m4d-system create configmap sample-policy --from-file=sample-policy.rego
kubectl -n m4d-system label configmap sample-policy openpolicyagent.org/policy=rego
while [[ $(kubectl get cm sample-policy -n m4d-system -o 'jsonpath={.metadata.annotations.openpolicyagent\.org/policy-status}') != '{"status":"ok"}' ]]; do echo "waiting for policy to be applied" && sleep 5; done
```
You can similarly apply a directory holding multiple rego files.
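To build intuition for what the policy evaluates, here is a rough Python simulation of the rule's logic. The helper and field names mirror the Rego above but are assumptions for illustration, not the actual `data_policies` API:

```python
def evaluate_transform(access_type, dataset_tags, columns):
    """Mimic the Rego rule: on READ of a finance-tagged dataset,
    emit a redact action for every column named 'nameOrig'."""
    actions = []
    if access_type == "READ" and "finance" in dataset_tags:
        for col in columns:
            if col == "nameOrig":
                actions.append({
                    "action": "redact column",
                    "column": col,
                    "policy": "Redact sensitive columns in finance datasets",
                })
    return actions

# A READ of the finance-tagged dataset yields one redact action:
actions = evaluate_transform("READ", {"finance"}, ["step", "nameOrig", "amount"])
```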
## Deploy a Jupyter notebook
In this sample a Jupyter notebook is used as the user workload and its business logic requires reading the asset that we registered (e.g., for creating a fraud detection model). Deploy a notebook to your cluster:
- Deploy JupyterLab:

  ```bash
  kubectl create deployment my-notebook --image=jupyter/base-notebook --port=8888 -- start.sh jupyter lab --LabApp.token=''
  kubectl set env deployment my-notebook JUPYTER_ENABLE_LAB=yes
  kubectl label deployment my-notebook app.kubernetes.io/name=my-notebook
  kubectl wait --for=condition=available --timeout=120s deployment/my-notebook
  kubectl expose deployment my-notebook --port=80 --target-port=8888
  ```
- Create a port-forward to communicate with JupyterLab:

  ```bash
  kubectl port-forward svc/my-notebook 8080:80 &
  ```
- Open your browser and go to http://localhost:8080/.
- Create a new notebook in the server
Alternatively, deploy a notebook server using Kubeflow:

- Ensure that Kubeflow is installed in your cluster
- Create a port-forward to communicate with Kubeflow:

  ```bash
  kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80 &
  ```
- Open your browser and go to http://localhost:8080/.
- Click Start Setup and then Finish (use the `anonymous` namespace).
- Click Notebook Servers (on the left).
- In the notebooks page, select the `anonymous` namespace in the top left and then click New Server.
- In the notebook server creation page, set `my-notebook` in the Name box and then click Launch. Wait for the server to become ready.
- Click Connect and create a new notebook in the server.
## Create a M4DApplication resource for the notebook
Create a `M4DApplication` resource to register the notebook workload to the control plane of Mesh for Data:
```bash
cat <<EOF | kubectl apply -f -
apiVersion: app.m4d.ibm.com/v1alpha1
kind: M4DApplication
metadata:
  name: my-notebook
  labels:
    app: my-notebook
spec:
  selector:
    workloadSelector:
      matchLabels:
        app: my-notebook
  appInfo:
    intent: fraud-detection
  data:
    - dataSetID: "m4d-notebook-sample/paysim-csv"
      requirements:
        interface:
          protocol: m4d-arrow-flight
          dataformat: arrow
EOF
```
Notice that:

- The `selector` field matches the labels of our Jupyter notebook workload.
- The `data` field includes a `dataSetID` that matches the asset identifier in the catalog.
- The `protocol` and `dataformat` indicate that the developer wants to consume the data using Apache Arrow Flight.
Run the following command to wait until the `M4DApplication` is ready:

```bash
while [[ $(kubectl get m4dapplication my-notebook -o 'jsonpath={.status.ready}') != "true" ]]; do echo "waiting for M4DApplication" && sleep 5; done
```
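The shell loop above simply polls the resource's `status.ready` field. If you script against the control plane from Python instead, the same pattern can be sketched as a generic polling helper (the `get_status` callable in the commented example is a stand-in for however you query the resource, e.g. via a Kubernetes client):

```python
import time

def wait_until(check, timeout=120, interval=5):
    """Poll check() until it returns True or the timeout (seconds) expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False

# Example with a stand-in status source:
# ready = wait_until(lambda: get_status() == "true")
```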
## Read the dataset from the notebook
- Insert a new notebook cell to install the pandas and pyarrow packages:

  ```python
  %pip install pandas pyarrow
  ```
- In your terminal, run the following command to print the code to use for reading the data. It fetches the code from the `M4DApplication` resource:

  ```bash
  printf "$(kubectl get m4dapplication my-notebook -o jsonpath={.status.dataAccessInstructions})"
  ```
- Insert a new notebook cell and paste into it the code for reading data as printed in the previous step.
- Insert a new notebook cell with the following command to visualize the result:

  ```python
  df
  ```
- Execute all notebook cells and notice that the `nameOrig` column appears redacted.
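"Redacted" here means the values in the `nameOrig` column are replaced with a masked placeholder before the data reaches the notebook. A stdlib-only sketch of the effect (the `XXXXX` mask value is an assumption for illustration; the actual module may use a different placeholder):

```python
import csv
import io

def redact_column(csv_text, column, mask="XXXXX"):
    """Return CSV text with every value in `column` replaced by `mask`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row[column] = mask  # overwrite the sensitive value
        writer.writerow(row)
    return out.getvalue()

raw = "step,nameOrig,amount\n1,C1231006815,9839.64\n2,C1666544295,1864.28\n"
redacted = redact_column(raw, "nameOrig")
```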
## Cleanup
When you’re finished experimenting with the notebook sample, clean it up:
- Stop `kubectl port-forward` processes (e.g., using `pkill kubectl`)
- Delete the namespace created for this sample:

  ```bash
  kubectl delete namespace m4d-notebook-sample
  ```
[^1]: Created by NTNU and shared under the CC BY-SA 4.0 license.