Skip to content

flowmachine.core.query

Source: flowmachine/core/query.py

This is the base class that defines any query on our database. It simply defines methods that returns the query as a string and as a pandas dataframe.

Class Query

Query(cache=True)
Source: flowmachine/core/query.py

The core base class of the flowmachine module. This should handle all input and output methods for our sql queries, so that inherited classes should only need to deal with the logic of actually making the sql statement itself.

Attributes

Parameters

  • cache: bool, default True

    Will store the resultant dataframes in memory. One can turn this off with turn_off_caching, and back on with turn_on_caching.

Methods

explain

explain(self, format="text", analyse=False)
Source: flowmachine/core/query.py

get_dataframe

get_dataframe(self)
Source: flowmachine/core/query.py

Executes the query and return the result as a pandas dataframe. This should be executed with care, as the results may consume large amounts of memory.

Returns
  • pandas.DataFrame

    DataFrame containing results of the query.

get_dataframe_async

get_dataframe_async(self)
Source: flowmachine/core/query.py

Execute the query in a worker thread and return a future object which will contain the result as a pandas dataframe when complete.

Returns
  • Future

    Future object which can be used to get the resulting dataframe

Note

This should be executed with care, as the results may consume large amounts of memory

get_query

get_query(self)
Source: flowmachine/core/query.py

Returns a string representing an SQL query. The string will point to the database cache of this query if it exists.

Returns
  • str

    SQL query string.

get_table

get_table(self)
Source: flowmachine/core/query.py

If this Query is stored, return a Table object referencing the stored version. If it is not stored, raise an exception.

Returns
  • flowmachine.core.Table

    The stored version of this Query as a Table object

head(self, n=5)
Source: flowmachine/core/query.py

Return the first n results of the query

Parameters
  • n: int

    Number of results to return

Returns
  • pandas.DataFrame

    A DataFrame containing n results

invalidate_db_cache

invalidate_db_cache(self, name=None, schema=None, cascade=True, drop=True)
Source: flowmachine/core/query.py

Drops this table, and (by default) any that depend on it, as well as removing them from the cache metadata table. If the table is currently being dropped from elsewhere, this method will block and return when the table has been removed.

Parameters
  • name: str

    Name of the table

  • schema: str

    Schema of the table

  • cascade: bool

    Set to false to remove only this table from cache

  • drop: bool

    Set to false to remove the cache record without dropping the table

join

join(
    self,
    other,
    on_left,
    on_right=None,
    how="inner",
    left_append="",
    right_append="",
)
Source: flowmachine/core/query.py

Parameters
  • other: Query

    Query to join to

  • on_left: str

    Field of this query to join on

  • on_right: str

    Field of this query to join on

  • how: {'left', 'outer', 'inner', 'full'}

    Method of joining to the other

  • left_append: str

  • right_append: str

Returns
  • Join

    Query object representing the two queries joined together

numeric_subset

numeric_subset(self, col, low, high)
Source: flowmachine/core/query.py

Subsets one of the columns to a specified range of numerical values.

Parameters
  • col: str

    Name of the column to subset, e.g. subscriber, cell etc.

  • low: float

    Lower bound of interval to subset on

  • high: float

    Upper bound of interval to subset on

Returns
  • Numeric subset object

random_sample

random_sample(self, sampling_method="random_ids", **params)
Source: flowmachine/core/query.py

Draws a random sample from this query.

Parameters
  • sampling_method: {'system', 'system_rows', 'bernoulli', 'random_ids'}, default 'random_ids'

    Specifies the method used to select the random sample. 'system_rows': performs block-level sampling by randomly sampling each physical storage page of the underlying relation. This sampling method is guaranteed to provide a sample of the specified size 'system': performs block-level sampling by randomly sampling each physical storage page for the underlying relation. This sampling method is not guaranteed to generate a sample of the specified size, but an approximation. This method may not produce a sample at all, so it might be worth running it again if it returns an empty dataframe. 'bernoulli': samples directly on each row of the underlying relation. This sampling method is slower and is not guaranteed to generate a sample of the specified size, but an approximation 'random_ids': samples rows by randomly sampling the row number.

  • size: optional, int

    The number of rows to draw. Exactly one of the 'size' or 'fraction' arguments must be provided.

  • fraction: optional, float

    Fraction of rows to draw. Exactly one of the 'size' or 'fraction' arguments must be provided.

  • estimate_count: bool, default False

    Whether to estimate the number of rows in the table using information contained in the pg_class or whether to perform an actual count in the number of rows.

  • seed: optional, float

    Optionally provide a seed for repeatable random samples. If using random_ids method, seed must be between -/+1. Not available in combination with the system_rows method.

Returns
  • Random

    A special query object which contains a random sample from this one

Note

Random samples may only be stored if a seed is supplied.

store

store(self, store_dependencies: bool = False) -> concurrent.futures._base.Future
Source: flowmachine/core/query.py

Store the results of this computation with the correct table name using a background thread.

Parameters
  • store_dependencies: bool, default False

    If True, store the dependencies of this query.

Returns
  • concurrent.futures._base.Future

    Future object which can be queried to check the query is stored.

subset

subset(self, col, subset)
Source: flowmachine/core/query.py

Subsets one of the columns to a specified subset of values

Parameters
  • col: str

    Name of the column to subset, e.g. subscriber, cell etc.

  • subset: list

    List of values to subset to

Returns
  • Subset object

to_sql

to_sql(self, name: str, schema: Union[str, NoneType] = None, store_dependencies: bool = False) -> concurrent.futures._base.Future
Source: flowmachine/core/query.py

Store the result of the calculation back into the database.

Parameters
  • name: str

    name of the table

  • schema: typing.Union[str, NoneType], default None

    Name of an existing schema. If none will use the postgres default, see postgres docs for more info.

  • store_dependencies: bool, default False

    If True, store the dependencies of this query.

Returns
  • concurrent.futures._base.Future

    Future object, containing this query and any result information.

Note

This method will return a Future immediately.

turn_off_caching

turn_off_caching(self)
Source: flowmachine/core/query.py

Turn the caching off, the object forgets previously calculated dataframes, and won't store further calculations

turn_on_caching

turn_on_caching(self)
Source: flowmachine/core/query.py

Turn on the caching, so that a computed dataframe is retained.

union

union(self, *other: 'Query', all: bool = True)
Source: flowmachine/core/query.py

Returns a Query representing a the union of queries. This is simply the tables concatenated. By passing the argument all as False the duplicates are also removed.

Parameters
  • other: Query

    An instance of a query object.

  • all: bool, default True

    If true returns sql UNION ALL else returns UNION

Returns
  • Union

    Query representing the concatenation of the two queries

Examples
dl1 = daily_location('2016-01-01', spatial_unit=CellSpatialUnit())
dl2 = daily_location('2016-01-02', spatial_unit=CellSpatialUnit())
dl1.union(dl2).get_query()
'cell_msisdn_20160101 UNION ALL cell_msisdn_20160102'
dl1.union(dl2,all=False).get_query()
'cell_msisdn_20160101 UNION cell_msisdn_20160102'

cache

cache
Source: flowmachine/core/query.py

Returns
  • bool

    True is caching is switched on.

column_names

column_names
Source: flowmachine/core/query.py

Returns the column names.

Returns
  • typing.List[str]

    List of the column names of this query.

column_names_as_string_list

column_names_as_string_list
Source: flowmachine/core/query.py

Get the column names as a comma separated list

Returns
  • str

    Comma separated list of column names

dependencies

dependencies
Source: flowmachine/core/query.py

Returns
  • set

    The set of queries which this one is directly dependent on.

fully_qualified_table_name

fully_qualified_table_name
Source: flowmachine/core/query.py

Returns a unique fully qualified name for the query to be stored as under the cache schema, based on a hash of the parameters, class, and subqueries.

Returns
  • str

    String form of the table's fqn

index_cols

index_cols
Source: flowmachine/core/query.py

A list of columns to use as indexes when storing this query.

Returns
  • ixen: list

    By default, returns the location columns if they are present and self.spatial_unit is defined, and the subscriber column.

Examples
daily_location("2016-01-01").index_cols
[['name'], '"subscriber"']

is_stored

is_stored
Source: flowmachine/core/query.py

Returns
  • bool

    True if the table is stored, and False otherwise.

query_id

query_id
Source: flowmachine/core/query.py

Generate a uniquely identifying hash of this query, based on the parameters of it and the subqueries it is composed of.

Returns
  • str

    query_id hash string

query_state

query_state
Source: flowmachine/core/query.py

Return the current query state.

Returns
  • QueryState

    The current query state

query_state_str

query_state_str
Source: flowmachine/core/query.py

Return the current query state as a string

Returns
  • str

    The current query state. The possible values are the ones defined in flowmachine.core.query_state.QueryState.

table_name

table_name
Source: flowmachine/core/query.py

Returns a uniquename for the query to be stored as, based on a hash of the parameters, class, and subqueries.

Returns
  • str

    String form of the table's fqn