Dagster

Why Dagster + Noteable?

Dagster provides native support for notebook execution and asset materialization. Normally this runs your notebook within the Dagster environment and provides links to the resulting files. By introducing Noteable into the equation, you gain several powerful tools using the existing dagstermill interface, including:
  • commentable and collaborative notebook links
  • improved visualizations
  • native data connectors
  • notebook commenting and notifications
  • the ability to fix issues on live jobs.
If your notebook execution fails, you get a live link to the notebook for a period of time to try and correct the underlying issue.
This saves hours or even days when a long-running execution hits an issue such as data drift or column renames that cause query analysis code to start failing.
Below is an example link to a failed Notebook run by Dagster. You can edit this session to correct the problem.
This allows you to fix an issue, then resume execution of the notebook from the last successful asset materialization. This is a major productivity boost (not to mention a less stressful development experience).
The following guide walks you through setting up a scheduled Noteable notebook using Dagster:
Build your Noteable workflow:
  • Create a notebook within Noteable, or select an existing notebook to use.
  • If you would like to override existing parameters used within the notebook, mark the cell that contains them with the parameters tag.
Marking a cell as the parameter cell allows the scheduler to override the notebook's default parameters at execution time. For more details about how to parametrize a notebook, see the Papermill documentation.
  • Copy the URL to the notebook and extract the file_id
    • [Optionally] If you’d like to schedule a specific version of a notebook, copy the Version ID from the side panel and set the input_path to noteable://{version_id}
  • Get your API token from Noteable
    • Within user settings, go to the API Token page and generate a new token. Copy the value and store it in your Dagster secrets as NOTEABLE_TOKEN.
  • If you use only ids with the noteable://{id} pattern, you also need to supply the NOTEABLE_DOMAIN key in Dagster secrets.
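The file_id extraction step above can be sketched in a few lines of Python. This is a minimal sketch, not part of the integration: the helper names extract_file_id and to_noteable_path are hypothetical, and it assumes the id is the UUID path segment that follows /f/, as in the example notebook URL used in this guide.

```python
import re
from urllib.parse import urlparse

# Matches a standard 8-4-4-4-12 hexadecimal UUID.
_UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def extract_file_id(notebook_url: str) -> str:
    """Return the UUID that follows the /f/ segment of a Noteable notebook URL."""
    segments = [s for s in urlparse(notebook_url).path.split("/") if s]
    # Assumption: the id is the path segment immediately after "f".
    for marker, candidate in zip(segments, segments[1:]):
        if marker == "f" and _UUID_RE.fullmatch(candidate):
            return candidate
    raise ValueError(f"No file id found in URL: {notebook_url!r}")


def to_noteable_path(file_or_version_id: str) -> str:
    """Build a path following the noteable://{id} pattern."""
    return f"noteable://{file_or_version_id}"


url = (
    "https://app.noteable.io/f/9b92ef52-29af-497a-bbd1-d14c18b27e5d/"
    "What-can-you-do-in-a-Noteable-notebook.ipynb"
)
file_id = extract_file_id(url)
print(to_noteable_path(file_id))
```

The same to_noteable_path helper works for a Version ID copied from the side panel, since both follow the noteable://{id} pattern.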

Dagster workflow:

  • Follow the guidance from Dagster to generate a new project, or reuse an existing one
  • Add the papermill_origami requirement to your pyproject.toml
    • requires = ["setuptools", "papermill_origami"]
  • In your repository.py file, create a notebook op following the dagstermill docs
    • notebook_path="noteable://{id}"
      • Fetch the notebook id from URL
        • https://app.noteable.io/f/9b92ef52-29af-497a-bbd1-d14c18b27e5d/What-can-you-do-in-a-Noteable-notebook.ipynb
from papermill_origami.noteable_dagstermill import define_noteable_dagstermill_op
from dagster import job, In, Field, fs_io_manager
import dagstermill as dm

notebook_id = MY_FILE_OR_VERSION_ID_HERE  # OR …
notebook_url = MY_FILE_OR_VERSION_URL_HERE

noteable_op = define_noteable_dagstermill_op(
    "my_noteable_op",
    notebook_path=f"noteable://{notebook_id}",  # OR notebook_path=notebook_url
    output_notebook_name="local_output",  # May also be None, in which case it is set automatically to the Noteable link
    config_schema=Field(...),
    ins={...},
)

@job(
    resource_defs={
        "output_notebook_io_manager": dm.local_output_notebook_io_manager,
        "fs_io_manager": fs_io_manager,
    }
)
def run_noteable_notebook():
    noteable_op()

Add domain and token secrets to Dagster

  • NOTEABLE_DOMAIN
  • NOTEABLE_TOKEN
  • Serverless:
    • https://docs.dagster.io/dagster-cloud/deployment/serverless#adding-secrets
  • Hybrid:
    • ECS
      • https://docs.dagster.io/deployment/guides/aws#secrets-management-in-ecs
    • Kubernetes
      • https://kubernetes.io/docs/concepts/configuration/secret/
  • Local (via dagit):
    • Command Line
      • https://www3.ntu.edu.sg/home/ehchua/programming/howto/Environment_Variables.html
        • NOTEABLE_TOKEN = abc123…

Running the notebook

  • We recommend using Dagster’s UI (Dagit) to test connecting to and running Noteable.
  • Push to production via a branch, following https://docs.dagster.io/guides/dagster/transitioning-data-pipelines-from-development-to-production
  • You should see an Asset Graph (Dagster’s visual representation of the DAG) using your defined operations and assets, including the new Noteable node.
  • Next, launch a materialization of your notebook asset using the Launchpad.
  • View the execution results, which include a link to the modified run of the original notebook.
The original notebook is left unmodified and remains a template for future executions. Noteable uses linear versioning on your notebook file, so live links always reference a point-in-time version.
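Before launching a run locally, it can help to confirm that both secrets listed above are visible to the process that will execute the notebook. The following is a minimal sketch, assuming the secrets are exposed as environment variables (as in the local command-line setup); the function name missing_noteable_secrets is hypothetical and not part of Dagster or papermill_origami.

```python
import os
from typing import List, Mapping, Optional

# The two secret names this guide asks you to configure.
REQUIRED_SECRETS = ("NOTEABLE_DOMAIN", "NOTEABLE_TOKEN")


def missing_noteable_secrets(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required secrets that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name, "").strip()]


missing = missing_noteable_secrets()
if missing:
    print("Missing secrets:", ", ".join(missing))
else:
    print("All Noteable secrets are set.")
```

Passing a plain dictionary instead of relying on os.environ makes the check easy to exercise in a unit test without touching the real environment.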

Passing Data to Noteable from Dagster

Passing data through Papermill follows the same principles in Dagstermill and Noteable, with one exception: parameters are serialized into the notebook cell as a rehydratable context object holding all of the data you want to use. This is similar to a Dagstermill notebook execution, but the integration serializes your data itself instead of placing a data loader object in the injected notebook cell.
For example, say you have a small dataframe produced by an op and want to reuse it as an input to a Noteable notebook. First, define your dataframe; in this case, we load the iris dataset into pandas.
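import numpy as np
import pandas as pd
from dagster import Out, define_asset_job, op
from dagster_pandas import DataFrame  # assumed source of the DataFrame Dagster type
from sklearn import datasets

@op(out={"iris": Out(dagster_type=DataFrame, io_manager_key="fs_io_manager")})
def iris():
    # Load the iris dataset and combine features and target into one dataframe
    sk_iris = datasets.load_iris()
    return pd.DataFrame(
        data=np.c_[sk_iris['data'], sk_iris['target']],
        columns=sk_iris['feature_names'] + ['target']
    )

iris_asset_job = define_asset_job(name="iris_job", selection="iris")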
Then define the ins argument to define_noteable_dagstermill_op as you would for any other op:
demo = define_noteable_dagstermill_op(
    "demo",
    notebook_path=f"noteable://{file_id}",
    output_notebook_name="demo_output",
    ins={"iris": In(dagster_type=DataFrame, input_manager_key="fs_io_manager")},
)

@job(
    resource_defs={
        "output_notebook_io_manager": dm.local_output_notebook_io_manager,
        "fs_io_manager": fs_io_manager,
    }
)
def run_demo():
    demo(iris())
This produces an asset graph in Dagster. Our dataframe is now an asset generated by the Iris job and acts as an input to the notebook.
When materialized, the notebook will have an iris variable loaded after the parameter cell is executed. You can now reference this anywhere in your notebook as a local variable.

Restrictions on Parameters

Parameterization today serializes the content from the Dagster node to the Noteable notebook. This means that:
A) only parameters that can be serialized with cloudpickle can be passed, and
B) very large parameters (above a couple of MB) will be rejected.
In the case where you have more complicated parameters you wish to pass, consider putting them into a database or blob storage and referencing the data path to be loaded at runtime. Noteable supports Data Connections with native SQL cells as well so loading from a shared database is easy.
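The two restrictions above can be checked before a run is launched. The following is a minimal sketch under stated assumptions: the helper check_parameter is hypothetical, the 2 MB threshold is only an illustrative stand-in for "a couple MB" (the actual limit is not stated here), and the standard library's pickle is used as a fallback when cloudpickle is not installed, so the result approximates what the integration would accept.

```python
import pickle

try:
    # Prefer cloudpickle when available, since the restriction is stated in its terms.
    import cloudpickle as serializer
except ImportError:
    # Fallback for illustration only: pickle accepts fewer objects than cloudpickle.
    serializer = pickle

# Hypothetical threshold standing in for "a couple MB"; not a documented limit.
MAX_PARAMETER_BYTES = 2 * 1024 * 1024


def check_parameter(value, max_bytes=MAX_PARAMETER_BYTES):
    """Return (ok, reason) describing whether a value looks safe to pass."""
    try:
        payload = serializer.dumps(value)
    except Exception as exc:  # serialization errors vary by object type
        return False, f"not serializable: {exc}"
    if len(payload) > max_bytes:
        return False, f"too large: {len(payload)} bytes > {max_bytes}"
    return True, "ok"


print(check_parameter({"threshold": 0.5}))
print(check_parameter(b"x" * (3 * 1024 * 1024)))
```

When a value fails either check, passing a database table name or blob storage path as the parameter and loading the data inside the notebook keeps the serialized payload small.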