flowmachine.core.query¶

Source: flowmachine/core/query.py

This is the base class that defines any query on our database. It simply defines methods that returns the query as a string and as a pandas dataframe.

Class Query¶

Query(cache=True)

Source: flowmachine/core/query.py

The core base class of the flowmachine module. This should handle all input and output methods for our sql queries, so that inherited classes should only need to deal with the logic of actually making the sql statement itself.

Parameters¶

cache: bool, default True

Will store the resultant dataframes in memory. One can turn this off with turn_off_caching, and back on with turn_on_caching.

Methods¶

explain¶

explain(self, format="text", analyse=False)

Source: flowmachine/core/query.py

get_dataframe¶

get_dataframe(self)

Source: flowmachine/core/query.py

Executes the query and return the result as a pandas dataframe. This should be executed with care, as the results may consume large amounts of memory.

Returns¶

pandas.DataFrame

DataFrame containing results of the query.

get_dataframe_async¶

get_dataframe_async(self)

Source: flowmachine/core/query.py

Execute the query in a worker thread and return a future object which will contain the result as a pandas dataframe when complete.

Returns¶

Future

Future object which can be used to get the resulting dataframe

Note

This should be executed with care, as the results may consume large amounts of memory

get_query¶

get_query(self)

Source: flowmachine/core/query.py

Returns a string representing an SQL query. The string will point to the database cache of this query if it exists.

Returns¶

str

SQL query string.

get_table¶

get_table(self)

Source: flowmachine/core/query.py

If this Query is stored, return a Table object referencing the stored version. If it is not stored, raise an exception.

Returns¶

flowmachine.core.Table

The stored version of this Query as a Table object

head¶

head(self, n=5)

Source: flowmachine/core/query.py

Return the first n results of the query

Parameters¶

n: int

Number of results to return

Returns¶

pandas.DataFrame

A DataFrame containing n results

invalidate_db_cache¶

invalidate_db_cache(self, name=None, schema=None, cascade=True, drop=True)

Source: flowmachine/core/query.py

Drops this table, and (by default) any that depend on it, as well as removing them from the cache metadata table. If the table is currently being dropped from elsewhere, this method will block and return when the table has been removed.

Parameters¶

name: str

Name of the table
schema: str

Schema of the table
cascade: bool

Set to false to remove only this table from cache
drop: bool

Set to false to remove the cache record without dropping the table

join¶

join(
    self,
    other,
    on_left,
    on_right=None,
    how="inner",
    left_append="",
    right_append="",
)

Source: flowmachine/core/query.py

Parameters¶

other: Query

Query to join to
on_left: str

Field of this query to join on
on_right: str

Field of this query to join on
how: {'left', 'outer', 'inner', 'full'}

Method of joining to the other
left_append: str
right_append: str

Returns¶

Join

Query object representing the two queries joined together

numeric_subset¶

numeric_subset(self, col, low, high)

Source: flowmachine/core/query.py

Subsets one of the columns to a specified range of numerical values.

Parameters¶

col: str

Name of the column to subset, e.g. subscriber, cell etc.
low: float

Lower bound of interval to subset on
high: float

Upper bound of interval to subset on

Returns¶

Numeric subset object

random_sample¶

random_sample(self, sampling_method="random_ids", **params)

Source: flowmachine/core/query.py

Draws a random sample from this query.

Parameters¶

sampling_method: {'system', 'system_rows', 'bernoulli', 'random_ids'}, default 'random_ids'

Specifies the method used to select the random sample. 'system_rows': performs block-level sampling by randomly sampling each physical storage page of the underlying relation. This sampling method is guaranteed to provide a sample of the specified size 'system': performs block-level sampling by randomly sampling each physical storage page for the underlying relation. This sampling method is not guaranteed to generate a sample of the specified size, but an approximation. This method may not produce a sample at all, so it might be worth running it again if it returns an empty dataframe. 'bernoulli': samples directly on each row of the underlying relation. This sampling method is slower and is not guaranteed to generate a sample of the specified size, but an approximation 'random_ids': samples rows by randomly sampling the row number.
size: optional, int

The number of rows to draw. Exactly one of the 'size' or 'fraction' arguments must be provided.
fraction: optional, float

Fraction of rows to draw. Exactly one of the 'size' or 'fraction' arguments must be provided.
estimate_count: bool, default False

Whether to estimate the number of rows in the table using information contained in the pg_class or whether to perform an actual count in the number of rows.
seed: optional, float

Optionally provide a seed for repeatable random samples. If using random_ids method, seed must be between -/+1. Not available in combination with the system_rows method.

Returns¶

Random

A special query object which contains a random sample from this one

Note

Random samples may only be stored if a seed is supplied.

store¶

store(self, store_dependencies: bool = False) -> concurrent.futures._base.Future

Source: flowmachine/core/query.py

Store the results of this computation with the correct table name using a background thread.

Parameters¶

store_dependencies: bool, default False

If True, store the dependencies of this query.

Returns¶

concurrent.futures._base.Future

Future object which can be queried to check the query is stored.

subset¶

subset(self, col, subset)

Source: flowmachine/core/query.py

Subsets one of the columns to a specified subset of values

Parameters¶

col: str

Name of the column to subset, e.g. subscriber, cell etc.
subset: list

List of values to subset to

Returns¶

Subset object

to_sql¶

to_sql(self, name: str, schema: Union[str, NoneType] = None, store_dependencies: bool = False) -> concurrent.futures._base.Future

Source: flowmachine/core/query.py

Store the result of the calculation back into the database.

Parameters¶

name: str

name of the table
schema: typing.Union[str, NoneType], default None

Name of an existing schema. If none will use the postgres default, see postgres docs for more info.
store_dependencies: bool, default False

If True, store the dependencies of this query.

Returns¶

concurrent.futures._base.Future

Future object, containing this query and any result information.

Note

This method will return a Future immediately.

turn_off_caching¶

turn_off_caching(self)

Source: flowmachine/core/query.py

Turn the caching off, the object forgets previously calculated dataframes, and won't store further calculations

turn_on_caching¶

turn_on_caching(self)

Source: flowmachine/core/query.py

Turn on the caching, so that a computed dataframe is retained.

union¶

union(self, *other: 'Query', all: bool = True)

Source: flowmachine/core/query.py

Returns a Query representing a the union of queries. This is simply the tables concatenated. By passing the argument all as False the duplicates are also removed.

Parameters¶

other: Query

An instance of a query object.
all: bool, default True

If true returns sql UNION ALL else returns UNION

Returns¶

Union

Query representing the concatenation of the two queries

Examples¶

dl1 = daily_location('2016-01-01', spatial_unit=CellSpatialUnit())
dl2 = daily_location('2016-01-02', spatial_unit=CellSpatialUnit())
dl1.union(dl2).get_query()
'cell_msisdn_20160101 UNION ALL cell_msisdn_20160102'

dl1.union(dl2,all=False).get_query()
'cell_msisdn_20160101 UNION cell_msisdn_20160102'

cache¶

cache

Source: flowmachine/core/query.py

Returns¶

bool

True is caching is switched on.

column_names¶

column_names

Source: flowmachine/core/query.py

Returns the column names.

Returns¶

typing.List[str]

List of the column names of this query.

column_names_as_string_list¶

column_names_as_string_list

Source: flowmachine/core/query.py

Get the column names as a comma separated list

Returns¶

str

Comma separated list of column names

dependencies¶

dependencies

Source: flowmachine/core/query.py

Returns¶

set

The set of queries which this one is directly dependent on.

fully_qualified_table_name¶

fully_qualified_table_name

Source: flowmachine/core/query.py

Returns a unique fully qualified name for the query to be stored as under the cache schema, based on a hash of the parameters, class, and subqueries.

Returns¶

str

String form of the table's fqn

index_cols¶

index_cols

Source: flowmachine/core/query.py

A list of columns to use as indexes when storing this query.

Returns¶

ixen: list

By default, returns the location columns if they are present and self.spatial_unit is defined, and the subscriber column.

Examples¶

daily_location("2016-01-01").index_cols
[['name'], '"subscriber"']

is_stored¶

is_stored

Source: flowmachine/core/query.py

Returns¶

bool

True if the table is stored, and False otherwise.

query_id¶

query_id

Source: flowmachine/core/query.py

Generate a uniquely identifying hash of this query, based on the parameters of it and the subqueries it is composed of.

Returns¶

str

query_id hash string

query_state¶

query_state

Source: flowmachine/core/query.py

Return the current query state.

Returns¶

QueryState

The current query state

query_state_str¶

query_state_str

Source: flowmachine/core/query.py

Return the current query state as a string

Returns¶

str

The current query state. The possible values are the ones defined in flowmachine.core.query_state.QueryState.

table_name¶

table_name

Source: flowmachine/core/query.py

Returns a uniquename for the query to be stored as, based on a hash of the parameters, class, and subqueries.

Returns¶

str

String form of the table's fqn