Deploying FlowKit

A complete FlowKit deployment consists of FlowDB, FlowMachine, FlowETL, FlowAPI, FlowAuth, and redis. FlowDB, FlowMachine, FlowETL, FlowAPI and redis are deployed inside your firewall and work together to provide a complete running system. FlowAuth can be installed outside your firewall, and does not require a direct connection to the rest of the system.

We strongly recommend using docker swarm to deploy all the components, because its secrets management helps you keep the system secure. Ensure you understand how to create a swarm, manage secrets, and deploy stacks using compose files before continuing.

Deployment scenarios

FlowDB and FlowETL only

FlowDB can be used with FlowETL independently of the other components, to provide a system which allows access to individual level data via a harmonised schema and SQL access. Because FlowDB is built on PostgreSQL, standard SQL based tools and workflows will work just fine.

FlowDB

FlowDB is distributed as a docker container. To run it, you will need to provide several secrets:

Secret name	Secret purpose	Notes
FLOWAPI_FLOWDB_USER	Database user used by FlowAPI	Role with read access to tables under the cache and geography schemas
FLOWAPI_FLOWDB_PASSWORD	Password for the FlowAPI database user
FLOWMACHINE_FLOWDB_USER	Database user for FlowMachine	Role with write access to tables under the cache schema, and read access to events, infrastructure, cache and geography schemas
FLOWMACHINE_FLOWDB_PASSWORD	Password for flowmachine user
FLOWDB_POSTGRES_PASSWORD	Postgres superuser password for flowdb	Username `flowdb`, a user with superuser access to the flowdb database

You may also provide the following environment variables:

Variable name	Purpose	Default value
CACHE_SIZE	Maximum size of the cache schema	One tenth of the available space in the pgdata directory
CACHE_PROTECTED_PERIOD	Amount of time to protect cache tables from being cleaned up	86400 (24 hours)
CACHE_HALF_LIFE	Speed at which cache tables expire when not used	1000
MAX_CPUS	Maximum number of CPUs that may be used for parallelising queries	The greater of 1 or 1 less than all CPUs
SHARED_BUFFERS_SIZE	Size of shared buffers	16GB
MAX_WORKERS	Maximum number of CPUs that may be used for parallelising one query	MAX_CPUS/2
MAX_WORKERS_PER_GATHER	Maximum number of CPUs that may be used for parallelising part of one query	MAX_CPUS/2
EFFECTIVE_CACHE_SIZE	Postgres cache size	25% of total RAM
FLOWDB_ENABLE_POSTGRES_DEBUG_MODE	When set to TRUE, enables use of the pgadmin debugger	FALSE

In most cases, however, the defaults will be adequate.
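As a rough illustration of how the CPU-related defaults in the table relate to one another, the sketch below computes them for a given CPU count. This is not FlowDB's startup code: the function name `default_parallelism` is hypothetical, and rounding MAX_CPUS/2 down is an assumption, since the table does not say how odd values are handled.

```python
import os


def default_parallelism(total_cpus):
    """Sketch of the documented defaults for the CPU-related variables.

    Assumption: MAX_CPUS/2 is rounded down (integer division).
    """
    # "The greater of 1 or 1 less than all CPUs"
    max_cpus = max(1, total_cpus - 1)
    return {
        "MAX_CPUS": max_cpus,
        "MAX_WORKERS": max_cpus // 2,
        "MAX_WORKERS_PER_GATHER": max_cpus // 2,
    }


if __name__ == "__main__":
    # os.cpu_count() can return None, so fall back to a single CPU.
    print(default_parallelism(os.cpu_count() or 1))
```

Comparing these values with the hardware you plan to deploy on is a quick way to decide whether any of the variables needs overriding.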

Shared memory

You will typically need to increase the default shared memory available to docker containers when running FlowDB. You can do this either by setting shm_size for the FlowDB container in your compose or stack file, or by passing the --shm-size argument to the docker run command.

Bind Mounts and user permissions

By default, FlowDB will create and attach a docker volume that contains all data. In some cases, this will be sufficient for use.

However, you will often wish to set up bind mounts to hold the data, allow access to the postgres logs, and allow FlowDB to consume new data. To avoid permission problems, specify a uid and gid for FlowDB to run as that match an existing user on the host system.

Adding a bind mount using docker-compose is simple:

services:
    flowdb:
        # ...
        user: <HOST_USER_ID>:<HOST_GROUP_ID>
        volumes:
          - /path/to/store/data/on/host:/var/lib/postgresql/data
          - /path/to/consume/data/from/host:/etl:ro

This creates two bind mounts: the first holds FlowDB's internal storage, and the second is a read-only mount for loading new data. FlowDB will also run inside the container as the uid and gid specified.

Warning

If the bind mounted directories do not exist, docker will create them and you will need to chown them to the correct user.

And similarly when using docker run:

docker run --name flowdb_testdata -e FLOWMACHINE_FLOWDB_PASSWORD=foo -e FLOWAPI_FLOWDB_PASSWORD=foo \
 --publish 9000:5432 \
 --user HOST_USER_ID:HOST_GROUP_ID \
 -v /path/to/store/data/on/host:/var/lib/postgresql/data \
 -v /path/to/consume/data/from/host:/etl:ro \
 --detach flowminder/flowdb-testdata:latest

Tip

To run as the current user, you can simply replace HOST_USER_ID:HOST_GROUP_ID with $(id -u):$(id -g).

Warning

Using the --user flag without a bind mount specified will not work, and you will see an error like this: initdb: could not change permissions of directory "/var/lib/postgresql/data": Operation not permitted.

When using docker volumes, docker will manage the permissions for you.
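If you do use bind mounts, one way to sidestep the ownership problem mentioned in the warning above is to create the host directories yourself before starting the container. The following is a minimal Python sketch. `prepare_bind_mount` is a hypothetical helper, not part of FlowKit, and it defaults to the current user's uid and gid (the equivalent of `$(id -u):$(id -g)`). Setting a different owner will normally require root privileges.

```python
import os


def prepare_bind_mount(path, uid=None, gid=None):
    """Create a host directory for a bind mount and set its ownership.

    Hypothetical helper: uid and gid default to the current user.
    """
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    os.makedirs(path, exist_ok=True)
    # Changing ownership to another user usually requires root.
    os.chown(path, uid, gid)
    # Value suitable for docker's --user flag or the compose "user" key.
    return f"{uid}:{gid}"
```

The returned string can be passed directly as the `--user` value, so the directory owner and the container user cannot drift apart.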

FlowETL

To run FlowETL, you will need to provide the following secrets:

Secret name	Secret purpose	Notes
FLOWETL_AIRFLOW_ADMIN_USERNAME	Default administrative user logon name for the FlowETL web interface
FLOWETL_AIRFLOW_ADMIN_PASSWORD	Password for the administrative user
AIRFLOW__CORE__SQL_ALCHEMY_CONN	Connection string for the backing database	Should take the form `postgres://flowetl:<FLOWETL_POSTGRES_PASSWORD>@flowetl_db:5432/flowetl`
AIRFLOW__CORE__FERNET_KEY	Fernet key used to encrypt (at rest) database credentials
AIRFLOW_CONN_FLOWDB	Connection string for the FlowDB database	Should take the form `postgres://flowdb:<FLOWDB_POSTGRES_PASSWORD>@flowdb:5432/flowdb`
FLOWETL_POSTGRES_PASSWORD	Superuser password for FlowETL's backing database
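The two connection strings in the table follow the same pattern. As an illustration only, the sketch below assembles them in Python. `connection_string` is a hypothetical helper, and percent-encoding the password with `urllib.parse.quote` is an assumption added here so that characters such as `@` or `/` in a password do not break the URI.

```python
from urllib.parse import quote


def connection_string(user, password, host, database, port=5432):
    """Build a postgres:// URI of the form shown in the table above."""
    # safe="" makes quote() escape every reserved character, including "/".
    return f"postgres://{user}:{quote(password, safe='')}@{host}:{port}/{database}"


# Example with placeholder passwords:
airflow_conn = connection_string("flowetl", "example-password", "flowetl_db", "flowetl")
flowdb_conn = connection_string("flowdb", "example-password", "flowdb", "flowdb")
```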

Note

Generating Fernet keys

A convenient way to generate Fernet keys is to use the python cryptography package. After installing, you can generate a new key by running python -c "from cryptography.fernet import Fernet;print(Fernet.generate_key().decode())".
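If the cryptography package is not available on the machine where you prepare secrets, an equivalent key can be produced with the standard library alone, because a Fernet key is 32 random bytes encoded as URL-safe base64. The helper name `generate_fernet_key` below is hypothetical.

```python
import base64
import os


def generate_fernet_key():
    """Return a new Fernet-compatible key as a string."""
    # 32 bytes from the OS random source, URL-safe base64 encoded.
    return base64.urlsafe_b64encode(os.urandom(32)).decode()


if __name__ == "__main__":
    print(generate_fernet_key())
```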

See also the airflow documentation for other configuration options which you can provide as environment variables.

The ETL documentation gives detail on how to use FlowETL to load data into FlowDB.

Sample stack files

FlowDB

You can find a sample FlowDB stack file here. To use it, you should first create the secrets, and additionally set the following environment variables:

Variable name	Purpose
FLOWDB_HOST_PORT	Localhost port where FlowDB will be accessible
FLOWDB_HOST_USER_ID	uid of the host user for FlowDB
FLOWDB_HOST_GROUP_ID	gid of the host user for FlowDB
FLOWDB_DATA_DIR	Path on the host to a directory owned by FLOWDB_HOST_USER_ID where FlowDB will store data and write logs
FLOWDB_ETL_DIR	Path on the host to a directory readable by FLOWDB_HOST_USER_ID from which data may be loaded, mounted inside the container at /etl

Once the FlowDB service has started, you will be able to access it using psql as with any standard PostgreSQL database.

FlowETL

You can find a sample FlowETL stack file here which should be used with the FlowDB stack file. To use it, you should first create the required secrets, and additionally set the following environment variables:

Variable name	Purpose
FLOWETL_HOST_PORT	Localhost port on which the FlowETL airflow web interface will be available
FLOWETL_HOST_USER_ID	uid of the host user for FlowETL
FLOWETL_HOST_GROUP_ID	gid of the host user for FlowETL
FLOWETL_HOST_DAG_DIR	Path on the host to a directory where dag files will be stored

Once your stack has come up, you will be able to access FlowETL's web user interface which allows you to monitor the progress of ETL tasks.

FlowDB, FlowETL and FlowMachine

For cases where your users require individual level data access, you can support the use of FlowMachine as a library. In this mode, users connect directly to FlowDB via the FlowMachine Python module. Many of the benefits of a complete FlowKit deployment are available in this scenario, including query caching.

You will need to host a redis service, to allow the FlowMachine users to coordinate processing. See the FlowMachine stack file for an example of deploying redis using docker.

You will need to create database users for each user who needs access, and provide them with the password to the redis instance. Users should install FlowMachine individually, using pip (pip install flowmachine).

Note

A FlowMachine Docker service is not required for using FlowMachine as a library - users can install the FlowMachine module individually. A redis service is required, and all users should connect to the same redis instance.

FlowKit

A complete FlowKit deployment makes aggregated insights easily available to end users via a web API and FlowClient, while allowing administrators granular control over who can access what data, and for how long.

To deploy a complete FlowKit system, you will first need to generate a key pair which will be used to connect FlowAuth and FlowAPI. (If you have an existing FlowAuth deployment, you do not need to generate a new key pair - you can use the same public key with all the FlowAPI servers managed by that FlowAuth server).

Generating a key pair

FlowAuth uses a private key to sign the tokens it generates, which ensures that they can be trusted by FlowAPI. When deploying instances of FlowAPI, you will need to supply them with the corresponding public key, to allow them to verify the tokens were produced by the right instance of FlowAuth.

You can use openssl to generate a private key:

openssl genrsa -out flowauth-private-key.key 4096

Then derive the public key from the private key file (openssl rsa -pubout -in flowauth-private-key.key -out flowapi-public-key.pub). Should you need to supply a key using environment variables rather than secrets (not recommended), you should base64 encode it (e.g. base64 -i flowauth-private-key.key). FlowAuth and FlowAPI will automatically decode base64 encoded keys for use.
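The base64 step can also be done from Python, for example when a script prepares environment files. This is a sketch only. `encode_key_file` is a hypothetical helper, and the file name in the usage example is the one produced by the openssl command above.

```python
import base64
from pathlib import Path


def encode_key_file(path):
    """Return the contents of a key file as a single-line base64 string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


if __name__ == "__main__":
    print(encode_key_file("flowauth-private-key.key"))
```

The result contains no line breaks, which makes it easier to place in an environment variable than the multi-line PEM file.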

Warning

Always keep your private key secure

If your key is destroyed, you will need to generate a new one and redeploy any instances of FlowAuth and FlowAPI. If your key is leaked, unauthorised parties will be able to sign tokens for your instances of FlowAPI.

FlowAuth

FlowAuth is designed to be deployed as a single Docker container working in cooperation with a database and, typically, an ssl reverse proxy (e.g. nginx-proxy combined with letsencrypt-nginx-proxy-companion).

To run FlowAuth, you should set the following secrets:

Secret name	Secret purpose
FLOWAUTH_ADMIN_USER	Admin user name
FLOWAUTH_ADMIN_PASSWORD	Admin user password
FLOWAUTH_DB_PASSWORD	Password for FlowAuth's backing database
FLOWAUTH_FERNET_KEY	Reversible encryption key for storing tokens at rest
SECRET_KEY	Secures session and CSRF protection cookies
PRIVATE_JWT_SIGNING_KEY	Used to sign the tokens generated by FlowAuth, which ensures that they can be trusted by FlowAPI

You may also set the following environment variables:

Variable name	Purpose	Notes
DB_URI	URI for the backing database	Should be of the form `postgresql://flowauth:{}@flowauth_postgres:5432/flowauth`. If not set, a temporary sqlite database will be created. The `{}` will be populated using the value of the `FLOWAUTH_DB_PASSWORD` secret.
RESET_FLOWAUTH_DB	Set to true to reset the database
FLOWAUTH_CACHE_BACKEND	Backend to use for the cache of last used two-factor authentication codes	Defaults to 'file'. May be set to 'memory' if deploying a single instance on only one CPU, or to 'redis' for larger deployments
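Note that the `{}` in DB_URI is meant to be left in place literally, so that the password itself never appears in the environment variable. The sketch below illustrates the idea. It assumes the substitution behaves like Python's `str.format`, which this document does not confirm, and `resolve_db_uri` is a hypothetical name rather than a FlowAuth function.

```python
def resolve_db_uri(db_uri_template, db_password):
    """Fill the `{}` placeholder in a DB_URI template with the password.

    Assumption: the placeholder follows str.format semantics.
    """
    return db_uri_template.format(db_password)


# Example with a placeholder password:
uri = resolve_db_uri(
    "postgresql://flowauth:{}@flowauth_postgres:5432/flowauth",
    "example-password",
)
```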

Two-factor authentication

FlowAuth supports optional two-factor authentication for user accounts, using the Google Authenticator app or similar. This can be enabled either by an administrator, or by individual users.

To safeguard two-factor codes, FlowAuth prevents users from authenticating more than once with the same code within a short window. When deploying to production, you may wish to deploy a redis backend to support this feature - for example if you are deploying multiple instances of the FlowAuth container which need to be able to record the last used codes for users in a common place.

To configure FlowAuth for use with redis, set the FLOWAUTH_CACHE_BACKEND environment variable to redis. You will also need to set the following secrets:

Secret name	Purpose	Default
FLOWAUTH_REDIS_HOST	The hostname to connect to redis on
FLOWAUTH_REDIS_PORT	The port to use to connect to redis	6379
FLOWAUTH_REDIS_PASSWORD	The password for the redis database
FLOWAUTH_REDIS_DB	The database number to connect to	0

By default, FlowAuth will use a dbm file backend to track last used two-factor codes. This file will be created at /dev/shm/flowauth_last_used_cache inside the container (i.e. in Docker's shared memory area), and can be mounted to a volume or pointed to an alternative location by setting the FLOWAUTH_CACHE_FILE environment variable.

Sample stack files

You can find an example docker stack file for FlowAuth here. This will bring up instances of FlowAuth, redis, and postgres. You can combine this with the letsencrypt stack file to automatically acquire an SSL certificate.

FlowMachine and FlowAPI

Once you have FlowAuth, FlowDB, and FlowETL running, you are ready to add FlowMachine and FlowAPI.

FlowMachine

The FlowMachine server requires one additional secret: REDIS_PASSWORD, the password for an accompanying redis database. This secret should also be provided to redis. FlowMachine also uses the FLOWMACHINE_FLOWDB_USER and FLOWMACHINE_FLOWDB_PASSWORD secrets defined for FlowDB.

You may also set the following environment variables:

Variable name	Purpose	Default
FLOWMACHINE_PORT	Port FlowAPI should communicate on	5555
FLOWMACHINE_SERVER_DEBUG_MODE	Set to True to enable debug mode for asyncio	False
FLOWMACHINE_SERVER_DISABLE_DEPENDENCY_CACHING	Set to True to disable automatically pre-caching dependencies of running queries	False
FLOWMACHINE_CACHE_PRUNING_FREQUENCY	How often to automatically clean up the cache	86400 (24 hours)
FLOWMACHINE_CACHE_PRUNING_TIMEOUT	Number of seconds to wait before halting a cache prune	600
FLOWMACHINE_LOG_LEVEL	Verbosity of logging (critical, error, info, or debug)	error
FLOWMACHINE_SERVER_THREADPOOL_SIZE	Number of threads the server will use to manage running queries	5*n_cpus
DB_CONNECTION_POOL_SIZE	Number of connections to keep open to FlowDB - the server can actively run this many queries at once. You may wish to increase this if the FlowDB instance is running on a powerful server with multiple CPUs	5
DB_CONNECTION_POOL_OVERFLOW	Number of connections in addition to `DB_CONNECTION_POOL_SIZE` to open if needed	1

FlowAPI

FlowAPI requires additional secrets:

Secret name	Purpose	Notes
cert-flowkit.pem	SSL Certificate used to serve FlowAPI over https	Optional, but strongly recommended. If you are using a self-signed certificate, you will need to make the file available to FlowClient users.
key-flowkit.pem	Private key for the SSL Certificate	Optional, but strongly recommended. This part of the certificate does not need to be made available to FlowClient users.
PUBLIC_JWT_SIGNING_KEY	Public key to verify api tokens	The public key corresponding to the `PRIVATE_JWT_SIGNING_KEY` used by FlowAuth
FLOWAPI_IDENTIFIER	Secret used in combination with secret key for decoding JWTs	Should be unique per FlowAPI server; this will also be the name of the server in the FlowAuth user interface

FlowAPI also makes use of the FLOWAPI_FLOWDB_USER and FLOWAPI_FLOWDB_PASSWORD secrets provided to FlowDB.

Adding the new server to FlowAuth

Once FlowAPI has started, it can be added to FlowAuth so that users can generate tokens for it. You should be able to download the API specification from https://<flowapi_host>:<flowapi_port>/api/0/spec/openapi.json. You can then use the spec file to add the server to FlowAuth by navigating to Servers, and clicking the new server button.
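If you would rather fetch the specification from a script than from a browser, something like the following sketch could be used. The function names and the `cafile` parameter are illustrative assumptions. `cafile` would point at your certificate file (for example a self-signed cert-flowkit.pem) so that the https connection can be verified.

```python
import json
import ssl
import urllib.request


def spec_url(host, port):
    """URL of the OpenAPI specification served by FlowAPI."""
    return f"https://{host}:{port}/api/0/spec/openapi.json"


def download_spec(host, port, out_path="openapi.json", cafile=None):
    """Download the specification and save it to out_path."""
    # cafile=None uses the system's trusted certificates.
    context = ssl.create_default_context(cafile=cafile)
    with urllib.request.urlopen(spec_url(host, port), context=context) as response:
        spec = json.load(response)
    with open(out_path, "w") as f:
        json.dump(spec, f, indent=2)
    return out_path
```

The saved file is the one you then upload through the FlowAuth interface.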

After uploading the specification, you can configure the maximum token lifetime settings, and use the dropdown box to enable or disable access to the available FlowAPI scopes. If you have updated either the FlowAPI or FlowMachine servers, you should upload the newly generated specification to ensure that the correct API actions are available when assigning users and generating tokens.

Sample stack files

FlowMachine

A sample stack file suitable for use with the FlowDB and FlowETL stacks can be found here. This adds two additional services: FlowMachine, and a redis instance used to coordinate the running state of queries. If you are supporting additional users with FlowMachine as a library, they should also use this redis instance. This stack file requires one additional environment variable: REDIS_HOST_PORT, the localhost port where Redis will be accessible.

FlowAPI

The sample stack file for FlowAPI can be found here, and requires one additional environment variable: FLOWAPI_HOST_PORT, the local port to make the API accessible on.

Secrets Quickstart

A full example deployment script which brings up all components is available here.

This will bring up a single node swarm, create random 16 character passwords for the database users, generate a fresh RSA key pair which links FlowAuth and FlowAPI, generate a certificate valid for the flowkit.api domain (and point that to localhost using /etc/hosts), pull all necessary containers, and bring up FlowAuth and FlowAPI.

For convenience, you can also run pipenv run secrets_quickstart from the secrets_quickstart directory.

Note that if you wish to deploy a branch other than master, you should set the CONTAINER_TAG environment variable before running, to ensure that Docker pulls the correct tags.

You can then provide the certificate to flowclient, and finally connect via https:

import flowclient
conn = flowclient.Connection(url="https://localhost:9090", token="JWT_STRING", ssl_certificate="<path_to_cert.pem>")

(This generates a certificate valid for the flow.api domain as well, which you can use by adding a corresponding entry to your /etc/hosts file.)
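If you prefer to create the secrets yourself rather than rely on the quickstart script, random passwords of the same length can be generated with Python's `secrets` module. This is a sketch under stated assumptions: `random_password` is a hypothetical helper, the alphanumeric alphabet is a choice made here, and the quickstart script's own method may differ.

```python
import secrets
import string


def random_password(length=16):
    """Return a random alphanumeric password of the given length."""
    # Assumption: letters and digits only, which avoids characters that
    # would need escaping in connection strings.
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


if __name__ == "__main__":
    print(random_password())
```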

AutoFlow production deployment

Analysts with permission to run docker containers may choose to run their own AutoFlow instances. Instructions for doing so can be found in the AutoFlow documentation. A sample stack file for deploying AutoFlow along with the rest of the FlowKit stack can be found here, which adds an AutoFlow service, and an additional Postgres database used by AutoFlow to record workflow runs. This makes use of the cert-flowkit.pem secret provided to FlowAPI, and also requires two other secrets:

Secret name	Secret purpose
AUTOFLOW_DB_PASSWORD	Password for AutoFlow's database
FLOWAPI_TOKEN	API token AutoFlow will use to connect to FlowAPI

You should also set the following environment variables:

Variable name	Purpose
AUTOFLOW_INPUTS_DIR	Path on the host to the directory where input files to AutoFlow are stored
AUTOFLOW_OUTPUTS_DIR	Path on the host to a directory where AutoFlow should store output files

and optionally set the AUTOFLOW_LOG_LEVEL environment variable (default 'ERROR').

Note

AutoFlow input files (Jupyter notebooks and workflows.yml) should be in the inputs directory before starting the AutoFlow container. Files added later will not be picked up by AutoFlow.

Demonstrating successful deployment

Once FlowKit installation is complete, you can verify that the system has been successfully set up by visiting http<s>://<flowapi_url>:<flowapi_port>/api/0/spec/redoc. Once all the services have come up, you will be able to view the interactive API specification. We also recommend running the provided worked examples against the deployed FlowKit to check that everything is working correctly.
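Because the services can take a little while to start, it may be convenient to poll the specification page rather than refresh a browser. The sketch below is one way to do that. The function names, retry count, and delay are illustrative assumptions, and `cafile` would point at your certificate file when FlowAPI is served over https with a self-signed certificate.

```python
import ssl
import time
import urllib.error
import urllib.request


def redoc_url(base_url):
    """URL of the interactive API specification for a FlowAPI base URL."""
    return base_url.rstrip("/") + "/api/0/spec/redoc"


def wait_for_flowapi(base_url, cafile=None, attempts=10, delay=5.0):
    """Return True once the specification page responds with HTTP 200."""
    context = None
    if base_url.startswith("https"):
        context = ssl.create_default_context(cafile=cafile)
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(
                redoc_url(base_url), context=context, timeout=10
            ) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Service not reachable yet; retry after a pause.
        time.sleep(delay)
    return False
```

A `False` result after all attempts suggests checking the service logs before moving on to the worked examples.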

Additional support

If you require more assistance to get up and running, please reach out to us by email and we will try to assist.