Caching in FlowKit
FlowKit implements a caching system to enhance performance. Queries requested via FlowAPI are cached in FlowDB, under the
Once cached, a query is not recalculated; the cached version is returned instead, which can save significant computation time. In addition to queries which are returned directly, FlowKit may cache queries which are used in calculating other queries. For example, a modal location aggregate and a daily location aggregate will both use the same underlying query when the dates (and other parameters) overlap. Hence, caching the underlying query allows both the daily location aggregate and the modal location aggregate to be produced faster.
This performance boost is achieved at the cost of disk space. FlowMachine automatically manages the size of the on-disk cache, and periodically removes seldom-used cache entries. The frequency of this check can be configured using the FLOWMACHINE_CACHE_PRUNING_FREQUENCY environment variable. By default, this is set to 86400 seconds (24 hours). For heavily used servers, it may be desirable to set this to a lower value. Automatic cache clearance follows the procedure described in the following section.
When a query is requested via the API, the query itself will be cached along with all of the other queries on which its calculation depends. For complex queries this can result in a large number of tables being added to the cache. This default behaviour can be changed by setting the environment variable
FLOWMACHINE_SERVER_DISABLE_DEPENDENCY_CACHING=true when starting the FlowMachine server, which will result in only the specific queries requested being cached. Computation times may be significantly longer when dependency caching is turned off.
FlowMachine and FlowDB provide tools to inspect and manage the content of FlowKit's cache. FlowDB also contains metadata about the content of cache, in the
Administrators can inspect this table directly by connecting to FlowDB, but in many scenarios the better option is to make use of FlowMachine's cache management module.
The cache submodule provides functions to assess the disk usage of the cache tables, and to reduce the disk usage below a desired threshold.
To identify which tables should be discarded from cache, FlowKit keeps track of how expensive they were to calculate initially, how much disk space they occupy, and how often and recently they have been used. These factors are combined into a cache score, based on the cachey algorithm.
Each cache table has a cache score, with a higher score indicating that the table has more cache value.
FlowMachine provides two functions which make use of this cache score to reduce the size of the cache:
shrink_one flushes the table with the lowest cache score.
shrink_below_size flushes tables until the disk space used by the cache falls below a threshold, by calling shrink_one repeatedly.
By default, queries which have been recently calculated are excluded from removal. To configure the global default for the exclusion period, set the CACHE_PROTECTED_PERIOD environment variable for FlowDB, or update the cache_protected_period key in the cache.cache_config table. The default exclusion period is 86400s (24 hours). This can also be overridden when calling the cache management functions directly.
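The shrinking procedure described above can be sketched with a toy in-memory model. This is an illustration only, not FlowMachine's implementation: the CacheEntry class, its fields, and the function signatures are hypothetical stand-ins, and the real functions operate on tables in FlowDB.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    """Hypothetical stand-in for the metadata of one cached table."""
    query_id: str
    size_bytes: int
    cache_score: float
    age_seconds: float  # time since the table was written to cache


def shrink_one(entries, protected_period=86400):
    """Remove and return the unprotected entry with the lowest score, or None."""
    candidates = [e for e in entries if e.age_seconds >= protected_period]
    if not candidates:
        return None
    victim = min(candidates, key=lambda e: e.cache_score)
    entries.remove(victim)
    return victim


def shrink_below_size(entries, size_threshold, protected_period=86400):
    """Call shrink_one until the total size falls below the threshold."""
    removed = []
    while sum(e.size_bytes for e in entries) >= size_threshold:
        victim = shrink_one(entries, protected_period)
        if victim is None:  # only protected entries remain
            break
        removed.append(victim)
    return removed


cache = [
    CacheEntry("a", 500, 0.2, 200_000),
    CacheEntry("b", 300, 5.0, 200_000),
    CacheEntry("c", 400, 0.1, 3_600),  # written an hour ago, so protected
]
removed = shrink_below_size(cache, size_threshold=800)
removed_ids = [e.query_id for e in removed]
remaining_ids = [e.query_id for e in cache]
```

In this sketch, entry "c" has the lowest score but survives because it is still inside the protected period, so "a" is removed instead.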
If necessary, the cache can also be completely reset using the
You can always access FlowMachine library functions from inside a FlowMachine container.
To bring up a Python REPL, use:
docker exec -it <container_name> pipenv run python
You can then import the flowmachine library. Its connect function will read the values the currently running server is using to connect to FlowDB and redis.
Removing a Specific Query from Cache
If a specific query must be removed from the cache, then an administrator can use the
invalidate_cache_by_id function of the
By default, this function only removes that specific query from cache. However, setting the
cascade argument to
True will also flush from the cache any cached queries which used that query in their calculation. This will also cascade to any queries which used those queries, and so on.
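The effect of cascading removal can be illustrated with a small dependency graph. This is a sketch under stated assumptions rather than the library's code: the invalidate function, the set of cached ids, and the dependents mapping are all hypothetical, and the query names are made up for the example.

```python
def invalidate(cached, dependents, query_id, cascade=False):
    """Remove query_id from the cached set and return the ids removed.

    cached:     set of query ids currently in cache (mutated in place)
    dependents: dict mapping a query id to the ids of the queries which
                used it directly in their calculation
    cascade:    if True, also remove every query built on query_id,
                directly or through other queries
    """
    visited = set()
    removed = set()
    to_visit = [query_id]
    while to_visit:
        current = to_visit.pop()
        if current in visited:
            continue
        visited.add(current)
        if current in cached:
            cached.remove(current)
            removed.add(current)
        if cascade:
            to_visit.extend(dependents.get(current, ()))
    return removed


dependents = {
    "daily_location": ["modal_location"],
    "modal_location": ["modal_location_aggregate"],
}

cached = {"daily_location", "modal_location", "modal_location_aggregate", "unrelated"}
removed_plain = invalidate(cached, dependents, "daily_location")

cached = {"daily_location", "modal_location", "modal_location_aggregate", "unrelated"}
removed_cascade = invalidate(cached, dependents, "daily_location", cascade=True)
remaining = cached
```

Without cascade only the named query is dropped. With cascade the removal follows the dependency chain and leaves only the unrelated query in cache.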
Configuring the Cache
There are three parameters which control FlowKit's cache, all of which are in the
half_life controls how much weight is given to recency of access when updating the cache score.
half_life is in units of number of cache retrievals, so a larger value for
half_life will give less weight to recency and frequency of access.
For example, suppose big_query and small_query both took 100 seconds to calculate.
big_query takes 100 bytes to store, and
small_query takes 10 bytes.
Their costs are compute_time/storage_size, or 1 for big_query and 10 for small_query.
small_query is stored first and has an initial cache score of 10. If big_query is stored next, with a half_life of 2.0, it will get an initial cache score of 1.35.
Just in terms of the balance between compute time and storage cost,
small_query is more valuable in cache because it is relatively cheaper to store. However, after only four retrievals of
big_query from cache,
big_query will have a cache score of 13.3, meaning it is more valuable in cache because it is so frequently used.
If half_life were instead set to 10.0, big_query would need to be retrieved seven times to exceed the cache score of small_query.
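The figures in this example can be reproduced with a short calculation. The sketch below assumes a cachey-style rule in which every store or retrieval advances a global counter by one and adds cost * (1 + ln(2) / half_life) ** counter to the query's score. That rule matches the numbers quoted here, but it is an assumption about the scoring, not a copy of FlowKit's implementation, and the function names are hypothetical. It also assumes that no other cache activity happens between the retrievals of big_query.

```python
import math


def score_after_retrievals(cost, stored_at_tick, retrievals, half_life):
    """Score of a query stored at a given tick and then retrieved `retrievals` times.

    Each store or retrieval is assumed to advance the tick by one and to
    add cost * multiplier**tick to the score.
    """
    multiplier = 1 + math.log(2) / half_life
    ticks = range(stored_at_tick, stored_at_tick + retrievals + 1)
    return sum(cost * multiplier**tick for tick in ticks)


def retrievals_to_exceed(target, cost, stored_at_tick, half_life):
    """Smallest number of consecutive retrievals after which the score exceeds target."""
    retrievals = 0
    while score_after_retrievals(cost, stored_at_tick, retrievals, half_life) <= target:
        retrievals += 1
    return retrievals


small_cost = 100 / 10   # compute_time / storage_size for small_query
big_cost = 100 / 100    # compute_time / storage_size for big_query

# small_query is stored first (tick 0), big_query second (tick 1).
small_score = score_after_retrievals(small_cost, 0, retrievals=0, half_life=2.0)
big_initial = score_after_retrievals(big_cost, 1, retrievals=0, half_life=2.0)
big_after_four = score_after_retrievals(big_cost, 1, retrievals=4, half_life=2.0)
needed_half_life_2 = retrievals_to_exceed(small_score, big_cost, 1, half_life=2.0)
needed_half_life_10 = retrievals_to_exceed(small_score, big_cost, 1, half_life=10.0)

print(round(big_initial, 2), round(big_after_four, 1))
print(needed_half_life_2, needed_half_life_10)
```

Under these assumptions a longer half_life makes each retrieval add less to the score, which is why more retrievals are needed before big_query overtakes small_query.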
cache_size is the maximum size in bytes that the cache tables should occupy on disk. By default, half_life is 1000.0, and cache_size is 10% of the available space on the drive where /var/lib/postgresql/data is located.
cache_protected_period is the length of time in seconds that a cache table is, by default, immune from being removed by a cache shrinkage operation. This defaults to
86400s, or 24 hours. During this time, cache tables will not be removed by automatic cache shrinking, and will by default be excluded from the cache management functions.
These values can be overridden when creating a new FlowDB container by setting the
CACHE_PROTECTED_PERIOD environment variables for the container, set by updating the
cache.cache_config table after connecting directly to FlowDB, or modified using the cache submodule.
Redis and the Query Cache
FlowMachine also tracks the execution state of queries using redis. In some cases, it is possible for redis and the cache metadata table to get out of sync with one another (for example, if either redis or FlowDB has been manually edited). To deal with this, you can forcibly resync redis with FlowDB's cache table, using the
resync_redis_with_cache function. This will reset redis, and repopulate it based only on the contents of
You must ensure that no queries are currently running before using this function. Any queries that are currently running will become out of sync.
Note on shrink_below_size: by default, the threshold uses the value set for