Deploying FlowKit
A complete FlowKit deployment consists of FlowDB, FlowMachine, FlowETL, FlowAPI, FlowAuth, and redis. FlowDB, FlowMachine, FlowETL, FlowAPI and redis are deployed inside your firewall and work together to provide a complete running system. FlowAuth can be installed outside your firewall, and does not require a direct connection to the rest of the system.
We strongly recommend deploying all the components with Docker Swarm, which lets you manage secrets safely and keep the system secure. Make sure you understand how to create a swarm, manage secrets, and deploy stacks using compose files before continuing.
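If you are new to swarm mode, a minimal sketch of the workflow looks like the following; the secret name `example_secret`, its value, and the stack file name `flowkit.yml` are placeholders rather than part of FlowKit.

```bash
# Initialise a single-node swarm (multi-node swarms work the same way)
docker swarm init

# Add a secret to the swarm; services read it at runtime from /run/secrets/<name>
echo "s3cret-value" | docker secret create example_secret -

# Deploy a stack defined in a compose file (flowkit.yml is a placeholder name)
docker stack deploy -c flowkit.yml flowkit
```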
Deployment scenarios¶
FlowDB and FlowETL only¶
FlowDB can be used with FlowETL independently of the other components, providing access to individual-level data through a harmonised schema and standard SQL. Because FlowDB is built on PostgreSQL, standard SQL-based tools and workflows will work just fine.
FlowDB¶
FlowDB is distributed as a docker container. To run it, you will need to provide several secrets:
Secret name | Secret purpose | Notes |
---|---|---|
FLOWAPI_FLOWDB_USER | Database user used by FlowAPI | Role with read access to tables under the cache and geography schemas |
FLOWAPI_FLOWDB_PASSWORD | Password for the FlowAPI database user | |
FLOWMACHINE_FLOWDB_USER | Database user for FlowMachine | Role with write access to tables under the cache schema, and read access to events, infrastructure, cache and geography schemas |
FLOWMACHINE_FLOWDB_PASSWORD | Password for flowmachine user | |
FLOWDB_POSTGRES_PASSWORD | Postgres superuser password for FlowDB | Username flowdb, a user with superuser access to the flowdb database |
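As a sketch, these secrets could be created in the swarm as follows; the usernames and randomly generated passwords are placeholders, and you should choose values appropriate to your deployment.

```bash
# Create the FlowDB secrets in the swarm; values shown here are examples only
echo "flowapi" | docker secret create FLOWAPI_FLOWDB_USER -
openssl rand -base64 16 | docker secret create FLOWAPI_FLOWDB_PASSWORD -
echo "flowmachine" | docker secret create FLOWMACHINE_FLOWDB_USER -
openssl rand -base64 16 | docker secret create FLOWMACHINE_FLOWDB_PASSWORD -
openssl rand -base64 16 | docker secret create FLOWDB_POSTGRES_PASSWORD -
```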
You may also provide the following environment variables:
Variable name | Purpose | Default value |
---|---|---|
CACHE_SIZE | Maximum size of the cache schema | One tenth of the available space in the pgdata directory |
CACHE_PROTECTED_PERIOD | Length of time (in seconds) to protect cache tables from being cleaned up | 86400 (24 hours) |
CACHE_HALF_LIFE | Speed at which cache tables expire when not used | 1000 |
MAX_CPUS | Maximum number of CPUs that may be used for parallelising queries | The greater of 1 and one less than the total number of CPUs |
SHARED_BUFFERS_SIZE | Size of shared buffers | 16GB |
MAX_WORKERS | Maximum number of CPUs that may be used for parallelising one query | MAX_CPUS/2 |
MAX_WORKERS_PER_GATHER | Maximum number of CPUs that may be used for parallelising part of one query | MAX_CPUS/2 |
EFFECTIVE_CACHE_SIZE | Postgres cache size | 25% of total RAM |
FLOWDB_ENABLE_POSTGRES_DEBUG_MODE | When set to TRUE, enables use of the pgadmin debugger | FALSE |
However, in most cases the defaults will be adequate.
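Should you need to override a default, the variables are passed to the container in the usual way. A hypothetical `docker run` invocation (all values here are purely illustrative):

```bash
# Illustrative overrides: limit query parallelism and protect cached tables for 48 hours
docker run -e FLOWMACHINE_FLOWDB_PASSWORD=foo -e FLOWAPI_FLOWDB_PASSWORD=foo \
    -e MAX_CPUS=8 -e CACHE_PROTECTED_PERIOD=172800 \
    --publish 9000:5432 \
    --detach flowminder/flowdb-testdata:latest
```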
Shared memory¶
You will typically need to increase the default shared memory available to Docker containers when running FlowDB. You can do this either by setting `shm_size` for the FlowDB container in your compose or stack file, or by passing the `--shm-size` argument to the `docker run` command.
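For example, when launching with `docker run` (1GB is an illustrative figure; size it to suit your workload):

```bash
# Give the FlowDB container 1GB of shared memory rather than Docker's 64MB default
docker run --shm-size=1g --detach flowminder/flowdb-testdata:latest
```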
Bind Mounts and user permissions¶
By default, FlowDB will create and attach a docker volume that contains all data. In some cases, this will be sufficient for use.
However, you will often wish to set up bind mounts to hold the data, allow access to the Postgres logs, and allow FlowDB to consume new data. To avoid permissions problems, you will want to specify the uid and gid that FlowDB runs with to match an existing user on the host system.
Adding a bind mount using `docker-compose` is simple:
```yaml
services:
  flowdb:
    ...
    user: <HOST_USER_ID>:<HOST_GROUP_ID>
    volumes:
      - /path/to/store/data/on/host:/var/lib/postgresql/data
      - /path/to/consume/data/from/host:/etl:ro
```
This creates two bind mounts: the first is FlowDB's internal storage, and the second is a read-only mount for loading new data. The user FlowDB runs as inside the container will also be changed to the uid specified.
Warning
If the bind-mounted directories do not exist, Docker will create them and you will need to `chown` them to the correct user.
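To avoid this, you can create and chown the directories before starting the container, for example (the paths and ids are the same placeholders used above):

```bash
# Create the data and ETL directories and hand them to the user FlowDB will run as
sudo mkdir -p /path/to/store/data/on/host /path/to/consume/data/from/host
sudo chown -R HOST_USER_ID:HOST_GROUP_ID /path/to/store/data/on/host /path/to/consume/data/from/host
```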
And similarly when using `docker run`:
```bash
docker run --name flowdb_testdata -e FLOWMACHINE_FLOWDB_PASSWORD=foo -e FLOWAPI_FLOWDB_PASSWORD=foo \
    --publish 9000:5432 \
    --user HOST_USER_ID:HOST_GROUP_ID \
    -v /path/to/store/data/on/host:/var/lib/postgresql/data \
    -v /path/to/consume/data/from/host:/etl:ro \
    --detach flowminder/flowdb-testdata:latest
```
Tip
To run as the current user, you can simply replace `HOST_USER_ID:HOST_GROUP_ID` with `$(id -u):$(id -g)`.
Warning
Using the `--user` flag without a bind mount specified will not work, and you will see an error like this: `initdb: could not change permissions of directory "/var/lib/postgresql/data": Operation not permitted`.
When using docker volumes, docker will manage the permissions for you.
FlowETL¶
To run FlowETL, you will need to provide the following secrets:
Secret name | Secret purpose | Notes |
---|---|---|
FLOWETL_AIRFLOW_ADMIN_USERNAME | Default administrative user logon name for the FlowETL web interface | |
FLOWETL_AIRFLOW_ADMIN_PASSWORD | Password for the administrative user | |
AIRFLOW__CORE__SQL_ALCHEMY_CONN | Connection string for the backing database | Should take the form postgres://flowetl:<FLOWETL_POSTGRES_PASSWORD>@flowetl_db:5432/flowetl |
AIRFLOW__CORE__FERNET_KEY | Fernet key used to encrypt database credentials at rest | |
AIRFLOW_CONN_FLOWDB | Connection string for the FlowDB database | Should take the form postgres://flowdb:<FLOWDB_POSTGRES_PASSWORD>@flowdb:5432/flowdb |
FLOWETL_POSTGRES_PASSWORD | Superuser password for FlowETL's backing database | |
Note
Generating Fernet keys
A convenient way to generate Fernet keys is to use the Python `cryptography` package. After installing it, you can generate a new key by running `python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"`.
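One way to do this, assuming the `cryptography` package is installed on the host, is to pipe the generated key straight into a swarm secret:

```bash
# Generate a Fernet key and register it as the AIRFLOW__CORE__FERNET_KEY secret
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())" \
    | docker secret create AIRFLOW__CORE__FERNET_KEY -
```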
See also the Airflow documentation for other configuration options, which you can provide as environment variables.
The ETL documentation gives detail on how to use FlowETL to load data into FlowDB.
Sample stack files¶
FlowDB¶
You can find a sample FlowDB stack file here. To use it, you should first create the secrets, and additionally set the following environment variables:
Variable name | Purpose |
---|---|
FLOWDB_HOST_PORT | Localhost port where FlowDB will be accessible |
FLOWDB_HOST_USER_ID | uid of the host user for FlowDB |
FLOWDB_HOST_GROUP_ID | gid of the host user for FlowDB |
FLOWDB_DATA_DIR | Path on the host to a directory owned by FLOWDB_HOST_USER_ID where FlowDB will store data and write logs |
FLOWDB_ETL_DIR | Path on the host to a directory readable by FLOWDB_HOST_USER_ID from which data may be loaded, mounted inside the container at /etl |
Once the FlowDB service has started, you will be able to access it using `psql`, as with any standard PostgreSQL database.
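For example, substituting the user, password, and host port chosen for your deployment:

```bash
# Connect to FlowDB on its published host port using the FlowAPI read-only role
psql "postgresql://<FLOWAPI_FLOWDB_USER>:<FLOWAPI_FLOWDB_PASSWORD>@localhost:<FLOWDB_HOST_PORT>/flowdb"
```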
FlowETL¶
You can find a sample FlowETL stack file here which should be used with the FlowDB stack file. To use it, you should first create the required secrets, and additionally set the following environment variables:
Variable name | Purpose |
---|---|
FLOWETL_HOST_PORT | Localhost port on which the FlowETL airflow web interface will be available |
FLOWETL_HOST_USER_ID | uid of the host user for FlowETL |
FLOWETL_HOST_GROUP_ID | gid of the host user for FlowETL |
FLOWETL_HOST_DAG_DIR | Path on the host to a directory where dag files will be stored |
Once your stack has come up, you will be able to access FlowETL's web user interface, which allows you to monitor the progress of ETL tasks.
FlowDB, FlowETL and FlowMachine¶
For cases where your users require individual level data access, you can support the use of FlowMachine as a library. In this mode, users connect directly to FlowDB via the FlowMachine Python module. Many of the benefits of a complete FlowKit deployment are available in this scenario, including query caching.
You will need to host a redis service, to allow the FlowMachine users to coordinate processing. See the FlowMachine stack file for an example of deploying redis using docker.
You will need to create database users for each user who needs access, and provide them with the password to the redis instance. Users should install FlowMachine individually, using pip (`pip install flowmachine`).
Note
A FlowMachine Docker service is not required for using FlowMachine as a library - users can install the FlowMachine module individually. A redis service is required, and all users should connect to the same redis instance.
FlowKit¶
A complete FlowKit deployment makes aggregated insights easily available to end users via a web API and FlowClient, while allowing administrators granular control over who can access what data, and for how long.
To deploy a complete FlowKit system, you will first need to generate a key pair which will be used to connect FlowAuth and FlowAPI. (If you have an existing FlowAuth deployment, you do not need to generate a new key pair - you can use the same public key with all the FlowAPI servers managed by that FlowAuth server).
Generating a key pair¶
FlowAuth uses a private key to sign the tokens it generates, which ensures that they can be trusted by FlowAPI. When deploying instances of FlowAPI, you will need to supply them with the corresponding public key, to allow them to verify the tokens were produced by the right instance of FlowAuth.
You can use `openssl` to generate a private key:

```bash
openssl genrsa -out flowauth-private-key.key 4096
```

And then create a public key from the key file (`openssl rsa -pubout -in flowauth-private-key.key -out flowapi-public-key.pub`). Should you need to supply the key using environment variables rather than secrets (not recommended), you should base64 encode the key (e.g. `base64 -i flowauth-private-key.key`). FlowAuth and FlowAPI will automatically decode base64-encoded keys.
Warning
Always keep your private key secure
If your key is destroyed, you will need to generate a new one and redeploy any instances of FlowAuth and FlowAPI. If your key is leaked, unauthorised parties will be able to sign tokens for your instances of FlowAPI.
FlowAuth¶
FlowAuth is designed to be deployed as a single Docker container working in cooperation with a database and, typically, an SSL reverse proxy (e.g. nginx-proxy combined with letsencrypt-nginx-proxy-companion).
To run FlowAuth, you should set the following secrets:
Secret name | Secret purpose |
---|---|
FLOWAUTH_ADMIN_USER | Admin user name |
FLOWAUTH_ADMIN_PASSWORD | Admin user password |
FLOWAUTH_DB_PASSWORD | Password for FlowAuth's backing database |
FLOWAUTH_FERNET_KEY | Reversible encryption key for storing tokens at rest |
SECRET_KEY | Secures session and CSRF protection cookies |
PRIVATE_JWT_SIGNING_KEY | Used to sign the tokens generated by FlowAuth, which ensures that they can be trusted by FlowAPI |
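As a sketch, the private signing key generated earlier can be registered alongside freshly generated values for the encryption keys; the remaining FlowAuth secrets follow the same pattern.

```bash
# Register the JWT signing key and encryption keys as swarm secrets
docker secret create PRIVATE_JWT_SIGNING_KEY flowauth-private-key.key
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())" \
    | docker secret create FLOWAUTH_FERNET_KEY -
openssl rand -base64 16 | docker secret create SECRET_KEY -
```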
You may also set the following environment variables:
Variable name | Purpose | Notes |
---|---|---|
DB_URI | URI for the backing database | Should be of the form postgresql://flowauth:{}@flowauth_postgres:5432/flowauth. If not set, a temporary SQLite database will be created. The {} will be populated using the value of the FLOWAUTH_DB_PASSWORD secret. |
RESET_FLOWAUTH_DB | Set to true to reset the database | |
FLOWAUTH_CACHE_BACKEND | Backend to use for two factor auth last used key cache | Defaults to 'file'. May be set to 'memory' if deploying a single instance on only one CPU, or to 'redis' for larger deployments |
Two-factor authentication¶
FlowAuth supports optional two-factor authentication for user accounts, using the Google Authenticator app or similar. This can be enabled either by an administrator, or by individual users.
To safeguard two-factor codes, FlowAuth prevents users from authenticating more than once with the same code within a short window. When deploying to production, you may wish to deploy a redis backend to support this feature - for example if you are deploying multiple instances of the FlowAuth container which need to be able to record the last used codes for users in a common place.
To configure FlowAuth for use with redis, set the `FLOWAUTH_CACHE_BACKEND` environment variable to `redis`. You will also need to set the following secrets:
Secret name | Purpose | Default |
---|---|---|
FLOWAUTH_REDIS_HOST | The hostname to connect to redis on. | |
FLOWAUTH_REDIS_PORT | The port to use to connect to redis | 6379 |
FLOWAUTH_REDIS_PASSWORD | The password for the redis database | |
FLOWAUTH_REDIS_DB | The database number to connect to | 0 |
By default, FlowAuth will use a dbm file backend to track last used two-factor codes. This file will be created at `/dev/shm/flowauth_last_used_cache` inside the container (i.e. in Docker's shared memory area), and can be mounted to a volume or pointed to an alternative location by setting the `FLOWAUTH_CACHE_FILE` environment variable.
Sample stack files¶
You can find an example docker stack file for FlowAuth here. This will bring up instances of FlowAuth, redis, and postgres. You can combine this with the letsencrypt stack file to automatically acquire an SSL certificate.
FlowMachine and FlowAPI¶
Once you have FlowAuth, FlowDB, and FlowETL running, you are ready to add FlowMachine and FlowAPI.
FlowMachine¶
The FlowMachine server requires one additional secret: `REDIS_PASSWORD`, the password for an accompanying redis database. This secret should also be provided to redis. FlowMachine also uses the `FLOWMACHINE_FLOWDB_USER` and `FLOWMACHINE_FLOWDB_PASSWORD` secrets defined for FlowDB.
You may also set the following environment variables:
Variable name | Purpose | Default |
---|---|---|
FLOWMACHINE_PORT | Port FlowAPI should communicate on | 5555 |
FLOWMACHINE_SERVER_DEBUG_MODE | Set to True to enable debug mode for asyncio | False |
FLOWMACHINE_SERVER_DISABLE_DEPENDENCY_CACHING | Set to True to disable automatically pre-caching dependencies of running queries | False |
FLOWMACHINE_CACHE_PRUNING_FREQUENCY | How often to automatically clean up the cache | 86400 (24 hours) |
FLOWMACHINE_CACHE_PRUNING_TIMEOUT | Number of seconds to wait before halting a cache prune | 600 |
FLOWMACHINE_LOG_LEVEL | Verbosity of logging (critical, error, info, or debug) | error |
FLOWMACHINE_SERVER_THREADPOOL_SIZE | Number of threads the server will use to manage running queries | 5*n_cpus |
DB_CONNECTION_POOL_SIZE | Number of connections keep open to FlowDB - the server can actively run this many queries at once. You may wish to increase this if the FlowDB instance is running on a powerful server with multiple CPUs | 5 |
DB_CONNECTION_POOL_OVERFLOW | Number of connections in addition to DB_CONNECTION_POOL_SIZE to open if needed | 1 |
FlowAPI¶
FlowAPI requires additional secrets:
Secret name | Purpose | Notes |
---|---|---|
cert-flowkit.pem | SSL Certificate used to serve FlowAPI over https | Optional, but strongly recommended. If you are using a self-signed certificate, you will need to make the file available to FlowClient users. |
key-flowkit.pem | Private key for the SSL Certificate | Optional, but strongly recommended. This part of the certificate does not need to be made available to FlowClient users. |
PUBLIC_JWT_SIGNING_KEY | Public key to verify api tokens | The public key corresponding to the PRIVATE_JWT_SIGNING_KEY used by FlowAuth |
FLOWAPI_IDENTIFIER | Secret used in combination with secret key for decoding JWTs | Should be unique per FlowAPI server; this will also be the name of the server in the FlowAuth user interface |
FlowAPI also makes use of the `FLOWAPI_FLOWDB_USER` and `FLOWAPI_FLOWDB_PASSWORD` secrets provided to FlowDB.
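As a sketch, assuming the certificate and key files and the public signing key generated earlier are in the current directory (the file names and server identifier are illustrative):

```bash
# Register FlowAPI's certificate, key, JWT verification key, and server identifier
docker secret create cert-flowkit.pem cert-flowkit.pem
docker secret create key-flowkit.pem key-flowkit.pem
docker secret create PUBLIC_JWT_SIGNING_KEY flowapi-public-key.pub
echo "production_flowapi" | docker secret create FLOWAPI_IDENTIFIER -
```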
Adding the new server to FlowAuth¶
Once FlowAPI has started, it can be added to FlowAuth so that users can generate tokens for it. You should be able to download the API specification from `https://<flowapi_host>:<flowapi_port>/api/0/spec/openapi.json`. You can then use the spec file to add the server to FlowAuth by navigating to Servers, and clicking the new server button.
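For example, you can fetch the specification with `curl` (the `-k` flag skips certificate verification and should only be used with a self-signed certificate):

```bash
# Download the API specification to upload to FlowAuth
curl -k "https://<flowapi_host>:<flowapi_port>/api/0/spec/openapi.json" -o flowapi-spec.json
```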
After uploading the specification, you can configure the maximum token lifetime settings, and use the dropdown box to enable or disable access to the available FlowAPI scopes. If you have updated either the FlowAPI or FlowMachine servers, you should upload the newly generated specification to ensure that the correct API actions are available when assigning users and generating tokens.
Sample stack files¶
FlowMachine¶
A sample stack file suitable for use with the FlowDB and FlowETL stacks can be found here. This adds two additional services: FlowMachine, and a redis instance used to coordinate the running state of queries. If you are supporting additional users with FlowMachine as a library, they should also use this redis instance. This stack file requires one additional environment variable: `REDIS_HOST_PORT`, the localhost port where redis will be accessible.
FlowAPI¶
The sample stack file for FlowAPI can be found here, and requires one additional environment variable: `FLOWAPI_HOST_PORT`, the local port to make the API accessible on.
Secrets Quickstart¶
A full example deployment script which brings up all components is available here.
This will bring up a single-node swarm, create random 16 character passwords for the database users, generate a fresh RSA key pair which links FlowAuth and FlowAPI, generate a certificate valid for the `flowkit.api` domain (and point that to `localhost` using `/etc/hosts`), pull all necessary containers, and bring up FlowAuth and FlowAPI.
For convenience, you can also do `pipenv run secrets_quickstart` from the `secrets_quickstart` directory.
Note that if you wish to deploy a branch other than `master`, you should set the `CONTAINER_TAG` environment variable before running, to ensure that Docker pulls the correct tags.
You can then provide the certificate to `flowclient`, and finally connect via https:
```python
import flowclient

conn = flowclient.Connection(
    url="https://localhost:9090", token="JWT_STRING", ssl_certificate="<path_to_cert.pem>"
)
```
(This generates a certificate valid for the `flow.api` domain as well, which you can use by adding a corresponding entry to your `/etc/hosts` file.)
AutoFlow production deployment¶
Analysts with permission to run docker containers may choose to run their own AutoFlow instances. Instructions for doing so can be found in the AutoFlow documentation. A sample stack file for deploying AutoFlow along with the rest of the FlowKit stack can be found here; it adds an AutoFlow service and an additional Postgres database used by AutoFlow to record workflow runs. This makes use of the `cert-flowkit.pem` secret provided to FlowAPI, and also requires two other secrets:
Secret name | Secret purpose |
---|---|
AUTOFLOW_DB_PASSWORD | Password for AutoFlow's database |
FLOWAPI_TOKEN | API token AutoFlow will use to connect to FlowAPI |
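For illustration (the token value is a placeholder; generate a real token for AutoFlow in FlowAuth):

```bash
# Register AutoFlow's secrets with the swarm
openssl rand -base64 16 | docker secret create AUTOFLOW_DB_PASSWORD -
echo "<token generated in FlowAuth>" | docker secret create FLOWAPI_TOKEN -
```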
You should also set the following environment variables:
Variable name | Purpose |
---|---|
AUTOFLOW_INPUTS_DIR | Path on the host to the directory where input files to AutoFlow are stored |
AUTOFLOW_OUTPUTS_DIR | Path on the host to a directory where AutoFlow should store output files |
You may also optionally set the `AUTOFLOW_LOG_LEVEL` environment variable (default 'ERROR').
Note
AutoFlow input files (Jupyter notebooks and `workflows.yml`) should be in the inputs directory before starting the AutoFlow container. Files added later will not be picked up by AutoFlow.
Demonstrating successful deployment¶
Once FlowKit installation is complete, you can verify that the system has been successfully set up by visiting `http<s>://<flowapi_url>:<flowapi_port>/api/0/spec/redoc`. Once all the services have come up, you will be able to view the interactive API specification.
We also recommend running the provided worked examples against the deployed FlowKit to check that everything is working correctly.
Additional support¶
If you require more assistance to get up and running, please reach out to us by email and we will try to assist.