AutoFlow¶
AutoFlow is a tool that automates the event-driven execution of workflows consisting of Jupyter notebooks that interact with FlowKit via FlowAPI. Workflows can consist of multiple inter-dependent notebooks, and can run automatically for each new date of CDR data available in FlowDB. After execution, notebooks can optionally be converted to PDF reports.
AutoFlow consists of a sensor that, on a user-defined schedule, checks for new available dates of CDR data in a FlowKit instance, and runs a set of notebook-based workflows for each date for which they haven't previously run successfully. Each notebook-based workflow is a user-defined workflow that executes one or more Jupyter notebooks, and optionally converts the executed notebooks to PDF.
Running AutoFlow¶
AutoFlow is designed to run in a Docker container (defined by the flowminder/autoflow
image) with two bind-mounted directories:
/mounts/inputs
: Inputs directory (read-only), which contains the notebooks to be executed, along with a YAML file workflows.yml that defines the workflows to execute the notebooks and configures the available dates sensor to run those workflows. See the inputs section for details of what this directory should contain, and see the example section for an example.

/mounts/outputs
: Outputs directory, which will contain two sub-directories, notebooks/ and reports/, containing the executed notebooks and PDF reports, respectively. The sub-directories will be created if they do not already exist.
When running the container, the following environment variables should be set:
AUTOFLOW_LOG_LEVEL
: Logging level (e.g. 'DEBUG' or 'ERROR'). Must be upper-case.

AUTOFLOW_DB_URI
: URI of a SQL database in which AutoFlow will store metadata for workflow runs. The password can be replaced by '{}', e.g. postgresql://user:{}@localhost:5432/autoflow. If no database URI is provided, a temporary SQLite database will be created inside the container.

AUTOFLOW_DB_PASSWORD
: Password for the database.

FLOWAPI_URL
: FlowAPI URL.

FLOWAPI_TOKEN
: FlowAPI access token. This should allow access to available_dates, along with any query kinds used in the notebooks to be executed.
Alternatively, AUTOFLOW_DB_PASSWORD and FLOWAPI_TOKEN can be set as Docker secrets instead of environment variables. If connecting to FlowAPI over a secure connection (i.e. if FLOWAPI_URL starts with 'https://'), then the SSL_CERTIFICATE_FILE environment variable should be set to the path (within the container) of the SSL certificate for FlowAPI. For example, if the SSL certificate file is provided as the Docker secret cert.pem, then SSL_CERTIFICATE_FILE should be set to "/run/secrets/cert.pem".
A sample Docker stack file for deploying AutoFlow along with a Postgres database can be found here. See the deployment documentation for details on using this file.
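For orientation only, a stack file that wires up the secrets described above might look something like the following sketch (service names, image tag, host paths, and the assumption that the secrets are named after the corresponding variables are illustrative; this is not the shipped sample file):

```yaml
version: "3.7"
services:
  autoflow:
    image: flowminder/autoflow:latest
    environment:
      AUTOFLOW_LOG_LEVEL: ERROR
      # Password placeholder '{}' is filled from the AUTOFLOW_DB_PASSWORD secret.
      # (A Postgres service backing this URI is omitted here for brevity.)
      AUTOFLOW_DB_URI: postgresql://autoflow:{}@autoflow_db:5432/autoflow
      FLOWAPI_URL: https://flowapi.example.com:9090
      # Path inside the container at which Docker mounts the cert.pem secret.
      SSL_CERTIFICATE_FILE: /run/secrets/cert.pem
    secrets:
      - AUTOFLOW_DB_PASSWORD
      - FLOWAPI_TOKEN
      - cert.pem
    volumes:
      - /path/to/inputs:/mounts/inputs:ro
      - /path/to/outputs:/mounts/outputs:rw

secrets:
  AUTOFLOW_DB_PASSWORD:
    external: true
  FLOWAPI_TOKEN:
    external: true
  cert.pem:
    external: true
```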
When the container runs, it will parse workflows.yml, construct all of the workflows defined there, and run a sensor that triggers those workflows with the provided parameters each time new data is found in FlowDB. All executed notebooks will be available in <bind-mounted_outputs_directory>/notebooks/, and PDF reports will be in <bind-mounted_outputs_directory>/reports/.
Inputs¶
Defining workflows¶
Workflows should be defined in a YAML file workflows.yml in the inputs directory that will be bind-mounted into the AutoFlow container. This file should have the following structure:
workflows
: A sequence of workflow specifications. Each workflow specification has the following keys:

    name
    : A unique name for this workflow.

    notebooks
    : Specifications for one or more notebook execution tasks, defined as a mapping from labels to notebook task specifications. See the section on defining notebook tasks for more details.

available_dates_sensor
: Configuration parameters for the available dates sensor. This should have the following parameters:

    schedule
    : A cron string describing the schedule on which the sensor will check for new available dates. Set schedule: null to run the sensor just once, without a schedule.

    cdr_types (optional)
    : A list of CDR types for which available dates should be checked (can be any subset of ["calls", "sms", "mds", "topups"]). Omit this parameter to check available dates for all available CDR types.

    workflows
    : A list of sets of configuration parameters for workflows that the available dates sensor should trigger. Each element of this list should have the following parameters:

        workflow_name
        : The name of a workflow defined in the workflows section at the top of workflows.yml.

        parameters
        : Values of any parameters used by notebooks in the workflow (except reference_date, date_ranges and flowapi_url, which will be provided automatically).

        earliest_date (optional)
        : The earliest date of available data for which this workflow should run.

        date_stencil (optional)
        : A date stencil that defines a set of dates that must be available for this workflow to run. See the section on date stencils for more details.
See the example section for an example.
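For orientation, a minimal workflows.yml following this structure might look like the sketch below (the workflow name, notebook filenames, and the aggregation_unit parameter are illustrative placeholders):

```yaml
workflows:
  - name: daily_flows
    notebooks:
      run_queries:
        filename: run_queries.ipynb
        parameters:
          reference_date: reference_date   # 'automatic' parameter
          flowapi_url: flowapi_url         # 'automatic' parameter
          aggregation_unit: aggregation_unit
      report:
        filename: report.ipynb
        parameters:
          previous_notebook: run_queries   # dependency on the first notebook task
        output:
          format: pdf

available_dates_sensor:
  schedule: "0 0 * * *"          # check for new dates every day at midnight
  cdr_types: [calls, sms]
  workflows:
    - workflow_name: daily_flows
      parameters:
        aggregation_unit: admin2
      earliest_date: 2016-01-01
      date_stencil: [[-7, 0]]
```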
Defining notebook tasks¶
Jupyter notebooks in the inputs directory can be executed as tasks within a workflow. The value of the notebooks
parameter within a workflow specification in workflows.yml
should be a mapping from notebook task keys (which notebook tasks can use to refer to each other) to notebook task specifications. Each notebook task specification has the following keys:
filename
: Filename of the Jupyter notebook to be executed. This should refer to a file in the same directory as workflows.yml.

parameters
: A mapping that defines parameters for this notebook. Keys are the names of these parameters inside the notebook, and each value refers to one of:

    - The name of one of the parameters of this workflow.
    - The key of another notebook task in this workflow (i.e. the key corresponding to a notebook task specification). Dependencies between notebook tasks can be defined in this way, and these dependencies will be respected by the order in which notebooks are executed. The value of this parameter within the notebook will be the filename of the other notebook after execution.
    - The name of an 'automatic' workflow parameter (i.e. a parameter that will automatically be passed by the available dates sensor). There are three 'automatic' parameters:

        reference_date
        : The date of available CDR data for which the workflow is running (as an ISO-format date string).

        date_ranges
        : A list of pairs of ISO-format date strings, describing the date ranges defined by the date_stencil (see the section on date stencils below) for reference_date.

        flowapi_url
        : The URL at which FlowAPI can be accessed. The value of this parameter will be the URL set as the environment variable FLOWAPI_URL inside the container.

output (optional)
: Set output: {format: pdf} to convert this notebook to PDF after execution. Omit the output key to skip converting this notebook to PDF. Optionally, a custom template can be used when converting the notebook to asciidoc by setting output: {format: pdf, template: custom_asciidoc_template.tpl} (where custom_asciidoc_template.tpl should be a file in the same directory as workflows.yml).
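To make the parameter-mapping rules concrete, here is an illustrative notebooks block (the filenames, parameter names and template file are placeholders); each right-hand-side value is either a workflow parameter, the key of another notebook task, or an 'automatic' parameter:

```yaml
notebooks:
  run_queries:
    filename: run_queries.ipynb
    parameters:
      date: reference_date            # 'automatic': date being processed
      api_url: flowapi_url            # 'automatic': FlowAPI URL
      unit: aggregation_unit          # a parameter of this workflow
  make_report:
    filename: make_report.ipynb
    parameters:
      previous_notebook: run_queries  # key of another notebook task (dependency)
    output:
      format: pdf
      template: custom_asciidoc_template.tpl
```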
Date stencils¶
The default behaviour is to run a workflow for every date of available CDR data. If the notebooks refer to date periods other than the reference date for which the workflow is running, further filtering is required to ensure that all required dates are available. This can be done by providing a date stencil.
Each element of the list of workflows defined in the available_dates_sensor
block in workflows.yml
accepts a date_stencil
parameter, which defines a pattern of dates relative to each available CDR date. The workflow will only run for a particular date if all dates contained in the date stencil for that date are available. The date stencil is a list of elements, each of which is one of:
- an absolute date (e.g. 2016-01-01);
- an integer representing an offset (in days) from a reference date. For example, -1 would refer to the day before a reference date, so a workflow containing -1 would only run for date 2016-01-03 if date 2016-01-02 was available;
- a pair of dates/offsets representing a half-open date interval (inclusive of the lower limit) for which all dates must be available. Absolute dates and offsets can be mixed (e.g. [2016-01-01, -3]).
If the date_stencil
parameter is not provided, the default value is [0]
(i.e. only the reference date itself needs to be available).
For each date for which the workflow runs, the corresponding date ranges described by the date stencil are available to notebooks as the parameter date_ranges
, which will contain a list of pairs of absolute dates. Note that date_ranges
will always contain pairs of dates, even when the date stencil contains individual dates/offsets.
As an example, if the date stencil is

```yaml
date_stencil:
  - [2016-01-01, 2016-01-03]
  - -3
  - [-2, 0]
```
then the date_ranges parameter for reference date 2016-01-07 would have the value

```
[
    ["2016-01-01", "2016-01-03"],
    ["2016-01-04", "2016-01-05"],
    ["2016-01-05", "2016-01-07"],
]
```

and the workflow will only run for reference date 2016-01-07 if all of the dates 2016-01-01, 2016-01-02, 2016-01-04, 2016-01-05 and 2016-01-06 are available.
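The date ranges above can be derived mechanically from the stencil. The following Python sketch illustrates the semantics described in this section (it is not AutoFlow's own implementation):

```python
from datetime import date, timedelta


def stencil_to_date_ranges(stencil, reference_date):
    """Expand a date stencil into the half-open [start, end) date pairs
    passed to notebooks as date_ranges (illustrative sketch only)."""

    def resolve(element):
        # Integers are offsets (in days) from the reference date;
        # anything else is treated as an absolute date.
        if isinstance(element, int):
            return reference_date + timedelta(days=element)
        return element

    date_ranges = []
    for element in stencil:
        if isinstance(element, (list, tuple)):
            start, end = (resolve(part) for part in element)
        else:
            # A single date or offset becomes a one-day interval.
            start = resolve(element)
            end = start + timedelta(days=1)
        date_ranges.append([start.isoformat(), end.isoformat()])
    return date_ranges


# Reproduces the worked example above:
print(
    stencil_to_date_ranges(
        [[date(2016, 1, 1), date(2016, 1, 3)], -3, [-2, 0]],
        reference_date=date(2016, 1, 7),
    )
)
# [['2016-01-01', '2016-01-03'], ['2016-01-04', '2016-01-05'], ['2016-01-05', '2016-01-07']]
```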
Preparing notebooks for automated execution¶
Parameters¶
Notebooks are executed within workflows using papermill, which enables passing parameters into a notebook. Within a notebook, indicate the cell containing parameters that should be replaced at execution time by adding the tag parameters
to that cell (see papermill's documentation for more details).
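For example, a parameters cell might look like the following (the parameter names and default values shown here are illustrative; the defaults are overwritten by the values passed in at execution time):

```python
# This cell is tagged "parameters"; papermill replaces these defaults at runtime.
reference_date = "2016-01-01"
flowapi_url = "http://localhost:9090"
aggregation_unit = "admin2"
```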
Passing data between notebooks¶
Scrapbook can be used to attach data to a notebook during execution, so that it can later be retrieved in another notebook.
To 'glue' the value of a variable x
in a notebook:
```python
import scrapbook as sb

sb.glue("scrap_name", x)
```

To later read this value in another notebook (given the path to the first executed notebook):

```python
import scrapbook as sb

nb = sb.read_notebook(path_to_first_notebook)
x = nb.scraps["scrap_name"].data
```
See scrapbook's documentation for more details.
As an example, if the notebooks
parameter of a workflow specification is
```yaml
notebooks:
  notebook1:
    filename: ...
    parameters: ...
  notebook2:
    filename: ...
    parameters:
      path_to_first_notebook: notebook1
      ...
```
then the value of the path_to_first_notebook parameter in notebook2 will be the path to the executed notebook1, so values 'glued' in notebook1 can be accessed in notebook2 using sb.read_notebook(path_to_first_notebook).
Connecting to FlowAPI¶
Currently, the best way to create a FlowClient connection in a notebook is to pass the parameter flowapi_url
to the notebook, and then run
```python
import flowclient
from get_secret_or_env_var import environ, getenv

conn = flowclient.connect(
    url=flowapi_url,
    token=environ["FLOWAPI_TOKEN"],
    ssl_certificate=getenv("SSL_CERTIFICATE_FILE"),
)
```
Note
This method for accessing the FlowAPI access token is likely to change in the future.
The example notebooks in the example section contain examples of parameterisation with papermill, data-sharing with scrapbook, and creating a FlowClient connection.
Example¶
This directory contains example inputs for an AutoFlow workflow that runs two Jupyter notebooks and produces a PDF report showing in/outflows above normal for each day of CDR data. It contains two Jupyter notebooks:
run_flows.ipynb
: Notebook that runs two flows queries, so that they are stored in cache. The query IDs for the flows queries are glued in this notebook.

flows_report.ipynb
: Notebook that reads the query IDs from run_flows.ipynb, gets the query results using FlowClient, combines them with geography data, and displays the flows above normal in plots and tables. This notebook is designed to be converted to a PDF report after execution.
and an input file workflows.yml that defines a workflow that runs run_flows.ipynb followed by flows_report.ipynb and creates a PDF report from flows_report.ipynb, and that configures the available dates sensor to check for new dates every day at midnight and to run the workflow with two different sets of parameters.
To run the AutoFlow example:
- Start a FlowKit instance, by following the quick-start instructions.
- Go to FlowAuth, log in with username TEST_USER and password DUMMY_PASSWORD, and create an API token following the instructions here.
- Create two empty directories, one for inputs and the other for outputs.
- Copy the example input files into your inputs directory:

```bash
cd <your_inputs_directory>
wget https://raw.githubusercontent.com/Flowminder/FlowKit/master/autoflow/examples/inputs/run_flows.ipynb
wget https://raw.githubusercontent.com/Flowminder/FlowKit/master/autoflow/examples/inputs/flows_report.ipynb
wget https://raw.githubusercontent.com/Flowminder/FlowKit/master/autoflow/examples/inputs/workflows.yml
```

- Start AutoFlow:

```bash
docker run -d \
    -v <your_inputs_directory>:/mounts/inputs:ro \
    -v <your_outputs_directory>:/mounts/outputs:rw \
    -e "AUTOFLOW_LOG_LEVEL=ERROR" \
    -e "FLOWAPI_URL=http://localhost:9090" \
    -e "FLOWAPI_TOKEN=<api_token>" \
    flowminder/autoflow:latest
```

(where <your_inputs_directory> and <your_outputs_directory> are the paths to the input and output directories you created, and <api_token> is the token you created in FlowAuth).
To run the sensor immediately without a schedule, rather than waiting until midnight, change the line schedule: "0 0 * * *"
to schedule: null
in workflows.yml
.
Outputs will appear in the outputs directory you created.