Versioning and memoization
Experimental
#

You can find the code for this example on Github

This example describes how to use Dagster's versioning and memoization features.

Dagster can use versions to determine whether or not it is necessary to re-execute a particular step. Given versions of the code from each op in a job, the system can infer whether an upcoming execution of a step will differ from previous executions by tagging op outputs with a version. This allows for the outputs of previous runs to be re-used. We call this process memoized execution.


Quick start#

You can enable memoization functionality by providing a VersionStrategy to your job. Dagster provides the SourceHashVersionStrategy as a top-level export.

from dagster import SourceHashVersionStrategy, job


@job(version_strategy=SourceHashVersionStrategy())
def the_job():
    ...

When memoization is enabled, the outputs of ops will be cached. Ops will only be re-run if:

  • An upstream output's version changes
  • The config to the op changes
  • The version of a required resource changes
  • The value returned by your VersionStrategy for that particular op changes. In the case of SourceHashVersionStrategy, this only occurs when the code within your ops and resources changes.

How versioning works#

The following diagram shows how an op output version is computed.

Diagram describing how an op output version is computed

Notice how the version of an output depends on all upstream output versions. Because of this, output versions are computed in topological order.

This diagram describes the computation of the version of a resource.

Diagram describing how a resource version is computed

Resource versions are also computed in topological order, as resources can depend on other resources.


How memoization works#

Memoization is enabled by using a version strategy on your job, in tandem with MemoizableIOManager. In addition to the handle_output and load_input methods from the traditional IOManager, MemoizableIOManagers also implement a has_output method. This is intended to check whether an output already exists that matches specifications.

Before execution occurs, the Dagster system will determine a set of which steps actually need to run. If using memoization, Dagster will check whether all the outputs of a given step have already been memoized by calling the has_output method on the io manager for each output. If has_output returns True for all outputs, then the step will not run again.

Several of the persistent IO managers provided by Dagster are memoizable. This includes Dagster's default io manager, the s3_pickle_io_manager and the fs_io_manager.


Writing a custom VersionStrategy#

There will likely be cases where the default SourceHashVersionStrategy will not suffice. In these cases, it is advantageous to implement your own VersionStrategy to match your requirements.

Check out the implementation of SourceHashVersionStrategy as an example.


Writing a custom MemoizableIOManager#

If you are using a custom IO manager and want to make use of memoization functionality, then your custom IO manager must be memoizable. This means they must implement the has_output function. The OutputContext.get_output_identifier will provide a path which includes version information that you can both store and check outputs to. Check out our implementations of io managers for inspiration.


Disabling memoization#

Sometimes, you may want to run the whole job from scratch. Memoization can be disabled by setting the MEMOIZED_RUN_TAG to false on your job.

Adding custom tag to disable memoization in Dagit