A complete FlowKit deployment consists of FlowDB, FlowMachine, FlowETL, FlowAPI, FlowAuth, and redis. FlowDB, FlowMachine, FlowETL, FlowAPI and redis are deployed inside your firewall and work together to provide a complete running system. FlowAuth can be installed outside your firewall, and does not require a direct connection to the rest of the system.
We strongly recommend using docker swarm to deploy all the components, because it lets you manage secrets safely and so helps keep the system secure. Ensure you understand how to create a swarm, manage secrets, and deploy stacks using compose files before continuing.
FlowDB and FlowETL only¶
FlowDB can be used with FlowETL independently of the other components, to provide a system which allows access to individual level data via a harmonised schema and SQL access. Because FlowDB is built on PostgreSQL, standard SQL based tools and workflows will work just fine.
FlowDB is distributed as a docker container. To run it, you will need to provide several secrets:
|Secret name||Secret purpose||Notes|
|FLOWAPI_FLOWDB_USER||Database user used by FlowAPI||Role with read access to tables under the cache and geography schemas|
|FLOWAPI_FLOWDB_PASSWORD||Password for the FlowAPI database user|
|FLOWMACHINE_FLOWDB_USER||Database user for FlowMachine||Role with write access to tables under the cache schema, and read access to events, infrastructure, cache and geography schemas|
|FLOWMACHINE_FLOWDB_PASSWORD||Password for flowmachine user|
|FLOWDB_POSTGRES_PASSWORD||Postgres superuser password for flowdb||Username
You may also provide the following environment variables:
|Variable name||Purpose||Default value|
|CACHE_SIZE||Maximum size of the cache schema||One tenth of available space in pgdata directory|
|CACHE_PROTECTED_PERIOD||Amount of time to protect cache tables from being cleaned up||86400 (24 hours)|
|CACHE_HALF_LIFE||Speed at which cache tables expire when not used||1000|
|MAX_CPUS||Maximum number of CPUs that may be used for parallelising queries||The greater of 1 or 1 less than all CPUs|
|SHARED_BUFFERS_SIZE||Size of shared buffers||16GB|
|MAX_WORKERS||Maximum number of CPUs that may be used for parallelising one query||MAX_CPUS/2|
|MAX_WORKERS_PER_GATHER||Maximum number of CPUs that may be used for parallelising part of one query||MAX_CPUS/2|
|EFFECTIVE_CACHE_SIZE||Postgres cache size||25% of total RAM|
|FLOWDB_ENABLE_POSTGRES_DEBUG_MODE||When set to TRUE, enables use of the pgadmin debugger||FALSE|
In most cases, however, the defaults will be adequate.
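As a rough illustration of how two of the documented defaults could be computed, the sketch below follows the wording in the table above. The function names are hypothetical, and it assumes that "available space" means free space on the volume holding the pgdata directory, measured in bytes. It is not FlowDB's own start-up code.

```python
import os
import shutil


def default_max_cpus(n_cpus: int) -> int:
    """The greater of 1 or one less than the number of CPUs (per the table)."""
    return max(1, n_cpus - 1)


def default_cache_size(pgdata_dir: str) -> int:
    """One tenth of the free space, in bytes, on the volume holding pgdata_dir.

    Assumption: "available space" is read here as free bytes on that volume.
    """
    return shutil.disk_usage(pgdata_dir).free // 10


if __name__ == "__main__":
    n_cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    print("MAX_CPUS default:", default_max_cpus(n_cpus))
    print("CACHE_SIZE default (bytes):", default_cache_size("."))
```

Running this on the host gives a feel for whether the defaults suit your hardware before you decide to override them.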
You will typically need to increase the default shared memory available to docker containers when running FlowDB. You can do this either by setting shm_size for the FlowDB container in your compose or stack file, or by passing the --shm-size argument to the docker run command.
Bind Mounts and user permissions¶
By default, FlowDB will create and attach a docker volume that contains all data. In some cases, this will be sufficient for use.
However, you will often wish to set up bind mounts to hold the data, allow access to the postgres logs, and allow FlowDB to consume new data. To avoid sticky situations with permissions, you will want to specify the uid and gid that FlowDB runs with to match an existing user on the host system.
Adding a bind mount using docker-compose is simple:

    services:
      flowdb:
        ...
        user: <HOST_USER_ID>:<HOST_GROUP_ID>
        volumes:
          - /path/to/store/data/on/host:/var/lib/postgresql/data
          - /path/to/consume/data/from/host:/etl:ro
This creates two bind mounts: the first is FlowDB's internal storage, and the second is a read-only mount for loading new data. The user FlowDB runs as inside the container will also be changed to the uid specified.
If the bind mounted directories do not exist, docker will create them and you will need to chown them to the correct user.
And similarly when using docker run:

    docker run --name flowdb_testdata -e FLOWMACHINE_FLOWDB_PASSWORD=foo -e FLOWAPI_FLOWDB_PASSWORD=foo \
        --publish 9000:5432 \
        --user HOST_USER_ID:HOST_GROUP_ID \
        -v /path/to/store/data/on/host:/var/lib/postgresql/data \
        -v /path/to/consume/data/from/host:/etl:ro \
        --detach flowminder/flowdb-testdata:latest
To run as the current user, you can simply replace HOST_USER_ID:HOST_GROUP_ID with $(id -u):$(id -g).
Using the --user flag without a bind mount specified will not work, and you will see an error such as: initdb: could not change permissions of directory "/var/lib/postgresql/data": Operation not permitted.
When using docker volumes, docker will manage the permissions for you.
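One way to avoid the ownership problem described above is to create the bind mount directories yourself, owned by the intended user, before starting the container. The following is a minimal sketch for a POSIX host. The helper names and the directory name are hypothetical and are not part of FlowKit.

```python
import os
import tempfile


def prepare_bind_mount(path: str, uid: int, gid: int) -> None:
    """Create path if it is missing and make sure it is owned by uid:gid."""
    os.makedirs(path, exist_ok=True)
    info = os.stat(path)
    if (info.st_uid, info.st_gid) != (uid, gid):
        # Handing a directory to a different user normally requires root.
        os.chown(path, uid, gid)


def docker_user_flag(uid: int, gid: int) -> str:
    """Format the value passed to docker's --user option."""
    return f"{uid}:{gid}"


if __name__ == "__main__":
    uid, gid = os.getuid(), os.getgid()
    # A temporary directory stands in for the real data location here.
    with tempfile.TemporaryDirectory() as base:
        data_dir = os.path.join(base, "flowdb_data")
        prepare_bind_mount(data_dir, uid, gid)
        print("--user", docker_user_flag(uid, gid))
```

Because the directory already exists with the right owner, docker does not need to create it, and no chown is needed afterwards.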
To run FlowETL, you will need to provide the following secrets:
|Secret name||Secret purpose||Notes|
|FLOWETL_AIRFLOW_ADMIN_USERNAME||Default administrative user logon name for the FlowETL web interface|
|FLOWETL_AIRFLOW_ADMIN_PASSWORD||Password for the administrative user|
|AIRFLOW__CORE__SQL_ALCHEMY_CONN||Connection string for the backing database||Should take the form
|AIRFLOW__CORE__FERNET_KEY||Fernet key used to encrypt (at rest) database credentials|
|AIRFLOW_CONN_FLOWDB||Connection string for the FlowDB database||Should take the form
|FLOWETL_POSTGRES_PASSWORD||Superuser password for FlowETL's backing database|
Generating Fernet keys
A convenient way to generate Fernet keys is to use the python cryptography package. After installing, you can generate a new key by running:

    python -c "from cryptography.fernet import Fernet;print(Fernet.generate_key().decode())"
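For reference, a Fernet key is 32 random bytes encoded as URL-safe base64. The sketch below produces a key in that format using only the standard library, which may be handy on a machine where the cryptography package is not installed. The function name is hypothetical, and the cryptography package remains the recommended route.

```python
import base64
import os


def generate_fernet_style_key() -> str:
    """Return 32 random bytes, URL-safe base64 encoded (the Fernet key format)."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode()


if __name__ == "__main__":
    print(generate_fernet_style_key())
```

The resulting string is 44 characters long and can be stored as a docker secret in the same way as a key produced by Fernet.generate_key().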
See also the airflow documentation for other configuration options which you can provide as environment variables.
The ETL documentation gives detail on how to use FlowETL to load data into FlowDB.
Sample stack files¶
You can find a sample FlowDB stack file here. To use it, you should first create the secrets, and additionally set the following environment variables:
|FLOWDB_HOST_PORT||Localhost port where FlowDB will be accessible|
|FLOWDB_HOST_USER_ID||uid of the host user for FlowDB|
|FLOWDB_HOST_GROUP_ID||gid of the host user for FlowDB|
|FLOWDB_DATA_DIR||Path on the host to a directory owned by FLOWDB_HOST_USER_ID where FlowDB will store data and write logs|
|FLOWDB_ETL_DIR||Path on the host to a directory readable by FLOWDB_HOST_USER_ID from which data may be loaded, mounted inside the container at /etl|
Once the FlowDB service has started, you will be able to access it using psql as with any standard PostgreSQL database.
You can find a sample FlowETL stack file here which should be used with the FlowDB stack file. To use it, you should first create the required secrets, and additionally set the following environment variables:
|FLOWETL_HOST_PORT||Localhost port on which the FlowETL airflow web interface will be available|
|FLOWETL_HOST_USER_ID||uid of the host user for FlowETL|
|FLOWETL_HOST_GROUP_ID||gid of the host user for FlowETL|
|FLOWETL_HOST_DAG_DIR||Path on the host to a directory where dag files will be stored|
Once your stack has come up, you will be able to access FlowETL's web user interface which allows you to monitor the progress of ETL tasks.
FlowDB, FlowETL and FlowMachine¶
For cases where your users require individual level data access, you can support the use of FlowMachine as a library. In this mode, users connect directly to FlowDB via the FlowMachine Python module. Many of the benefits of a complete FlowKit deployment are available in this scenario, including query caching.
You will need to host a redis service, to allow the FlowMachine users to coordinate processing. See the FlowMachine stack file for an example of deploying redis using docker.
You will need to create database users for each user who needs access, and provide them with the password to the redis instance. Users should install FlowMachine individually, using pip (pip install flowmachine).
A FlowMachine Docker service is not required for using FlowMachine as a library - users can install the FlowMachine module individually. A redis service is required, and all users should connect to the same redis instance.
A complete FlowKit deployment makes aggregated insights easily available to end users via a web API and FlowClient, while allowing administrators granular control over who can access what data, and for how long.
To deploy a complete FlowKit system, you will first need to generate a key pair which will be used to connect FlowAuth and FlowAPI. (If you have an existing FlowAuth deployment, you do not need to generate a new key pair - you can use the same public key with all the FlowAPI servers managed by that FlowAuth server).
Generating a key pair¶
FlowAuth uses a private key to sign the tokens it generates, which ensures that they can be trusted by FlowAPI. When deploying instances of FlowAPI, you will need to supply them with the corresponding public key, to allow them to verify the tokens were produced by the right instance of FlowAuth.
You can use openssl to generate a private key:

    openssl genrsa -out flowauth-private-key.key 4096

And then create a public key from the key file:

    openssl rsa -pubout -in flowauth-private-key.key -out flowapi-public-key.pub

Should you need to supply the key using environment variables, rather than secrets (not recommended), you should base64 encode the key (e.g. base64 -i flowauth-private-key.key). FlowAuth and FlowAPI will automatically decode base64 encoded keys for use.
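If the base64 command line tool is not available, the same encoding can be done in Python. This is a minimal sketch. The helper names are hypothetical, and the file name in the usage comment is the one used in the openssl commands above.

```python
import base64
from pathlib import Path


def encode_key_for_env(key_bytes: bytes) -> str:
    """Base64 encode key material so it fits in a single-line environment variable."""
    return base64.b64encode(key_bytes).decode("ascii")


def encode_key_file(path: str) -> str:
    """Read a key file from disk and return its base64 encoding."""
    return encode_key_for_env(Path(path).read_bytes())


# Example usage (assumes the key file exists in the working directory):
# print(encode_key_file("flowauth-private-key.key"))
```

The output contains no line breaks, which makes it easier to pass through an environment variable than the multi-line PEM file.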
Always keep your private key secure
If your key is destroyed, you will need to generate a new one and redeploy any instances of FlowAuth and FlowAPI. If your key is leaked, unauthorised parties will be able to sign tokens for your instances of FlowAPI.
FlowAuth is designed to be deployed as a single Docker container working in cooperation with a database and, typically, an SSL reverse proxy (e.g. nginx-proxy combined with letsencrypt-nginx-proxy-companion).
To run FlowAuth, you should set the following secrets:
|Secret name||Secret purpose|
|FLOWAUTH_ADMIN_USER||Admin user name|
|FLOWAUTH_ADMIN_PASSWORD||Admin user password|
|FLOWAUTH_DB_PASSWORD||Password for FlowAuth's backing database|
|FLOWAUTH_FERNET_KEY||Reversible encryption key for storing tokens at rest|
|SECRET_KEY||Secures session and CSRF protection cookies|
|PRIVATE_JWT_SIGNING_KEY||Used to sign the tokens generated by FlowAuth, which ensures that they can be trusted by FlowAPI|
You may also set the following environment variables:
|DB_URI||URI for the backing database||Should be of the form
|RESET_FLOWAUTH_DB||Set to true to reset the database|
|FLOWAUTH_CACHE_BACKEND||Backend to use for two factor auth last used key cache||Defaults to 'file'. May be set to 'memory' if deploying a single instance on only one CPU, or to 'redis' for larger deployments|
FlowAuth supports optional two-factor authentication for user accounts, using the Google Authenticator app or similar. This can be enabled either by an administrator, or by individual users.
To safeguard two-factor codes, FlowAuth prevents users from authenticating more than once with the same code within a short window. When deploying to production, you may wish to deploy a redis backend to support this feature - for example if you are deploying multiple instances of the FlowAuth container which need to be able to record the last used codes for users in a common place.
To configure FlowAuth for use with redis, set the FLOWAUTH_CACHE_BACKEND environment variable to redis. You will also need to set the following secrets:
|FLOWAUTH_REDIS_HOST||The hostname to connect to redis on.|
|FLOWAUTH_REDIS_PORT||The port to use to connect to redis||6379|
|FLOWAUTH_REDIS_PASSWORD||The password for the redis database|
|FLOWAUTH_REDIS_DB||The database number to connect to||0|
By default, FlowAuth will use a dbm file backend to track last used two-factor codes. This file will be created at /dev/shm/flowauth_last_used_cache inside the container (i.e. in Docker's shared memory area), and can be mounted to a volume or pointed to an alternative location by setting the FLOWAUTH_CACHE_FILE environment variable.
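To illustrate why multiple FlowAuth instances need a shared backend, the sketch below models a last used code cache as a simple in-memory mapping. It is an illustration of the idea only, with hypothetical names, and is not FlowAuth's implementation.

```python
class LastUsedCodeCache:
    """Remember the most recent two-factor code accepted for each user."""

    def __init__(self):
        self._last_used = {}

    def accept(self, user_id, code):
        """Record the code and return True, or return False if it was just used."""
        if self._last_used.get(user_id) == code:
            return False
        self._last_used[user_id] = code
        return True


if __name__ == "__main__":
    shared = LastUsedCodeCache()
    print(shared.accept("alice", "123456"))  # first use of the code
    print(shared.accept("alice", "123456"))  # replay against the same cache

    # Two instances with separate caches cannot see each other's records,
    # so a replayed code is not detected by the second instance.
    instance_a, instance_b = LastUsedCodeCache(), LastUsedCodeCache()
    instance_a.accept("alice", "123456")
    print(instance_b.accept("alice", "123456"))
```

A common store such as redis plays the role of the single shared cache in this sketch.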
Sample stack files¶
You can find an example docker stack file for FlowAuth here. This will bring up instances of FlowAuth, redis, and postgres. You can combine this with the letsencrypt stack file to automatically acquire an SSL certificate.
FlowMachine and FlowAPI¶
Once you have FlowAuth, FlowDB, and FlowETL running, you are ready to add FlowMachine and FlowAPI.
The FlowMachine server requires one additional secret: REDIS_PASSWORD, the password for an accompanying redis database. This secret should also be provided to redis. FlowMachine also uses the FLOWMACHINE_FLOWDB_PASSWORD secrets defined for FlowDB.
You may also set the following environment variables:
|FLOWMACHINE_PORT||Port FlowAPI should communicate on||5555|
|FLOWMACHINE_SERVER_DEBUG_MODE||Set to True to enable debug mode for asyncio||False|
|FLOWMACHINE_SERVER_DISABLE_DEPENDENCY_CACHING||Set to True to disable automatically pre-caching dependencies of running queries||False|
|FLOWMACHINE_CACHE_PRUNING_FREQUENCY||How often to automatically clean up the cache||86400 (24 hours)|
|FLOWMACHINE_CACHE_PRUNING_TIMEOUT||Number of seconds to wait before halting a cache prune||600|
|FLOWMACHINE_LOG_LEVEL||Verbosity of logging (critical, error, info, or debug)||error|
|FLOWMACHINE_SERVER_THREADPOOL_SIZE||Number of threads the server will use to manage running queries||5*n_cpus|
|DB_CONNECTION_POOL_SIZE||Number of connections to keep open to FlowDB - the server can actively run this many queries at once. You may wish to increase this if the FlowDB instance is running on a powerful server with multiple CPUs||5|
|DB_CONNECTION_POOL_OVERFLOW||Number of connections in addition to
FlowAPI requires additional secrets:
|cert-flowkit.pem||SSL Certificate used to serve FlowAPI over https||Optional, but strongly recommended. If you are using a self-signed certificate, you will need to make the file available to FlowClient users.|
|key-flowkit.pem||Private key for the SSL Certificate||Optional, but strongly recommended. This part of the certificate does not need to be made available to FlowClient users.|
|PUBLIC_JWT_SIGNING_KEY||Public key to verify api tokens||The public key corresponding to the
|FLOWAPI_IDENTIFIER||Secret used in combination with secret key for decoding JWTs||Should be unique per FlowAPI server; this will also be the name of the server in the FlowAuth user interface|
FlowAPI also makes use of the FLOWAPI_FLOWDB_PASSWORD secrets provided to FlowDB.
Adding the new server to FlowAuth¶
Once FlowAPI has started, it can be added to FlowAuth so that users can generate tokens for it. You should be able to download the API specification from https://<flowapi_host>:<flowapi_port>/api/0/spec/openapi.json. You can then use the spec file to add the server to FlowAuth by navigating to Servers, and clicking the new server button.
After uploading the specification, you can configure the maximum token lifetime settings, and use the dropdown box to enable or disable access to the available FlowAPI scopes. If you have updated either the FlowAPI or FlowMachine servers, you should upload the newly generated specification to ensure that the correct API actions are available when assigning users and generating tokens.
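Downloading the specification can be scripted. The sketch below uses only the Python standard library. The function names are hypothetical, and the optional ca_file argument is intended for the case where FlowAPI is served with a self-signed certificate such as cert-flowkit.pem.

```python
import json
import ssl
import urllib.request
from typing import Optional


def spec_url(host: str, port: int) -> str:
    """Build the URL of the OpenAPI specification served by FlowAPI."""
    return f"https://{host}:{port}/api/0/spec/openapi.json"


def download_spec(host: str, port: int, ca_file: Optional[str] = None) -> dict:
    """Fetch the specification and parse it as JSON.

    ca_file may point at a self-signed certificate to trust for this request.
    """
    context = ssl.create_default_context(cafile=ca_file)
    with urllib.request.urlopen(spec_url(host, port), context=context) as response:
        return json.load(response)


# Example usage (host, port and certificate path are placeholders):
# spec = download_spec("localhost", 9090, ca_file="cert-flowkit.pem")
# with open("openapi.json", "w") as f:
#     json.dump(spec, f)
```

The saved file can then be uploaded through the new server form in FlowAuth.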
Sample stack files¶
A sample stack file suitable for use with the FlowDB and FlowETL stacks can be found here. This adds an additional two services: FlowMachine, and a redis instance used to coordinate the running state of queries. If you are supporting additional users with FlowMachine as a library, they should also use this redis instance. This stack file requires one additional environment variable: REDIS_HOST_PORT, the localhost port where Redis will be accessible.
The sample stack file for FlowAPI can be found here, and requires one additional environment variable: FLOWAPI_HOST_PORT, the local port to make the API accessible on.
A full example deployment script which brings up all components is available here.
This will bring up a single node swarm, create random 16 character passwords for the database users, generate a fresh RSA key pair which links FlowAuth and FlowAPI, generate a certificate valid for the flowkit.api domain (adding an entry for that domain to /etc/hosts), pull all necessary containers, and bring up FlowAuth and FlowAPI.
For convenience, you can also do
pipenv run secrets_quickstart from the
Note that if you wish to deploy a branch other than master, you should set the CONTAINER_TAG environment variable before running, to ensure that Docker pulls the correct tags.
You can then provide the certificate to flowclient, and finally connect via https:

    import flowclient

    conn = flowclient.Connection(url="https://localhost:9090", token="JWT_STRING", ssl_certificate="<path_to_cert.pem>")
(This generates a certificate valid for the
flow.api domain as well, which you can use by adding a corresponding entry to your
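The deployment script creates random 16 character passwords for the database users. If you prefer to create the secrets by hand, comparable passwords can be generated as in the sketch below. This is an illustration with a hypothetical function name and an assumed alphanumeric alphabet, not the method the script itself uses.

```python
import secrets
import string


def random_password(length: int = 16) -> str:
    """Return a random alphanumeric password from a cryptographically secure source."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


if __name__ == "__main__":
    print(random_password())
```

Restricting the alphabet to letters and digits avoids characters that would need escaping inside database connection strings.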
AutoFlow production deployment¶
Analysts with permission to run docker containers may choose to run their own AutoFlow instances. Instructions for doing so can be found in the AutoFlow documentation. A sample stack file for deploying AutoFlow along with the rest of the FlowKit stack can be found here, which adds an AutoFlow service, and an additional Postgres database used by AutoFlow to record workflow runs. This makes use of the cert-flowkit.pem secret provided to FlowAPI, and also requires two other secrets:
|Secret name||Secret purpose|
|AUTOFLOW_DB_PASSWORD||Password for AutoFlow's database|
|FLOWAPI_TOKEN||API token AutoFlow will use to connect to FlowAPI|
You should also set the following environment variables:
|AUTOFLOW_INPUTS_DIR||Path on the host to the directory where input files to AutoFlow are stored|
|AUTOFLOW_OUTPUTS_DIR||Path on the host to a directory where AutoFlow should store output files|
You may also optionally set the AUTOFLOW_LOG_LEVEL environment variable (default 'ERROR').
AutoFlow input files (Jupyter notebooks and workflows.yml) should be in the inputs directory before starting the AutoFlow container. Files added later will not be picked up by AutoFlow.
Demonstrating successful deployment¶
Once FlowKit installation is complete, you can verify that the system has been successfully set up by visiting http<s>://<flowapi_url>:<flowapi_port>/api/0/spec/redoc. Once all the services have come up, you will be able to view the interactive API specification.
We also recommend running the provided worked examples against the deployed FlowKit to check that everything is working correctly.
If you require more assistance to get up and running, please reach out to us by email and we will try to assist.