flowmachine.core.query¶
Source: flowmachine/core/query.py
This is the base class that defines any query on our database. It simply defines methods that returns the query as a string and as a pandas dataframe.
Class Query¶
Query(cache=True)
The core base class of the flowmachine module. This should handle all input and output methods for our sql queries, so that inherited classes should only need to deal with the logic of actually making the sql statement itself.
Attributes¶
Parameters¶
-
cache:bool, defaultTrueWill store the resultant dataframes in memory. One can turn this off with turn_off_caching, and back on with turn_on_caching.
Methods¶
explain¶
explain(self, format="text", analyse=False)
get_dataframe¶
get_dataframe(self)
Executes the query and return the result as a pandas dataframe. This should be executed with care, as the results may consume large amounts of memory.
Returns¶
-
pandas.DataFrameDataFrame containing results of the query.
get_dataframe_async¶
get_dataframe_async(self)
Execute the query in a worker thread and return a future object which will contain the result as a pandas dataframe when complete.
Returns¶
-
FutureFuture object which can be used to get the resulting dataframe
Note
This should be executed with care, as the results may consume large amounts of memory
get_query¶
get_query(self)
Returns a string representing an SQL query. The string will point to the database cache of this query if it exists.
Returns¶
-
strSQL query string.
get_table¶
get_table(self)
If this Query is stored, return a Table object referencing the stored version. If it is not stored, raise an exception.
Returns¶
-
flowmachine.core.TableThe stored version of this Query as a Table object
head¶
head(self, n=5)
Return the first n results of the query
Parameters¶
-
n:intNumber of results to return
Returns¶
-
pandas.DataFrameA DataFrame containing n results
invalidate_db_cache¶
invalidate_db_cache(self, name=None, schema=None, cascade=True, drop=True)
Drops this table, and (by default) any that depend on it, as well as removing them from the cache metadata table. If the table is currently being dropped from elsewhere, this method will block and return when the table has been removed.
Parameters¶
-
name:strName of the table
-
schema:strSchema of the table
-
cascade:boolSet to false to remove only this table from cache
-
drop:boolSet to false to remove the cache record without dropping the table
join¶
join(
self,
other,
on_left,
on_right=None,
how="inner",
left_append="",
right_append="",
)
Parameters¶
-
other:QueryQuery to join to
-
on_left:strField of this query to join on
-
on_right:strField of this query to join on
-
how:{'left', 'outer', 'inner', 'full'}Method of joining to the other
-
left_append:str -
right_append:str
Returns¶
-
JoinQuery object representing the two queries joined together
numeric_subset¶
numeric_subset(self, col, low, high)
Subsets one of the columns to a specified range of numerical values.
Parameters¶
-
col:strName of the column to subset, e.g. subscriber, cell etc.
-
low:floatLower bound of interval to subset on
-
high:floatUpper bound of interval to subset on
Returns¶
Numeric subset object
random_sample¶
random_sample(self, sampling_method="random_ids", **params)
Draws a random sample from this query.
Parameters¶
-
sampling_method:{'system', 'system_rows', 'bernoulli', 'random_ids'}, default'random_ids'Specifies the method used to select the random sample. 'system_rows': performs block-level sampling by randomly sampling each physical storage page of the underlying relation. This sampling method is guaranteed to provide a sample of the specified size 'system': performs block-level sampling by randomly sampling each physical storage page for the underlying relation. This sampling method is not guaranteed to generate a sample of the specified size, but an approximation. This method may not produce a sample at all, so it might be worth running it again if it returns an empty dataframe. 'bernoulli': samples directly on each row of the underlying relation. This sampling method is slower and is not guaranteed to generate a sample of the specified size, but an approximation 'random_ids': samples rows by randomly sampling the row number.
-
size:optional,intThe number of rows to draw. Exactly one of the 'size' or 'fraction' arguments must be provided.
-
fraction:optional,floatFraction of rows to draw. Exactly one of the 'size' or 'fraction' arguments must be provided.
-
estimate_count:bool, defaultFalseWhether to estimate the number of rows in the table using information contained in the
pg_classor whether to perform an actual count in the number of rows. -
seed:optional,floatOptionally provide a seed for repeatable random samples. If using random_ids method, seed must be between -/+1. Not available in combination with the system_rows method.
Returns¶
-
RandomA special query object which contains a random sample from this one
Note
Random samples may only be stored if a seed is supplied.
store¶
store(self, store_dependencies: bool = False) -> concurrent.futures._base.Future
Store the results of this computation with the correct table name using a background thread.
Parameters¶
-
store_dependencies:bool, defaultFalseIf True, store the dependencies of this query.
Returns¶
-
concurrent.futures._base.FutureFuture object which can be queried to check the query is stored.
subset¶
subset(self, col, subset)
Subsets one of the columns to a specified subset of values
Parameters¶
-
col:strName of the column to subset, e.g. subscriber, cell etc.
-
subset:listList of values to subset to
Returns¶
Subset object
to_sql¶
to_sql(self, name: str, schema: Union[str, NoneType] = None, store_dependencies: bool = False) -> concurrent.futures._base.Future
Store the result of the calculation back into the database.
Parameters¶
-
name:strname of the table
-
schema:typing.Union[str, NoneType], defaultNoneName of an existing schema. If none will use the postgres default, see postgres docs for more info.
-
store_dependencies:bool, defaultFalseIf True, store the dependencies of this query.
Returns¶
-
concurrent.futures._base.FutureFuture object, containing this query and any result information.
Note
This method will return a Future immediately.
turn_off_caching¶
turn_off_caching(self)
Turn the caching off, the object forgets previously calculated dataframes, and won't store further calculations
turn_on_caching¶
turn_on_caching(self)
Turn on the caching, so that a computed dataframe is retained.
union¶
union(self, *other: 'Query', all: bool = True)
Returns a Query representing a the union of queries. This is simply the tables concatenated. By passing the argument all as False the duplicates are also removed.
Parameters¶
-
other:QueryAn instance of a query object.
-
all:bool, defaultTrueIf true returns sql UNION ALL else returns UNION
Returns¶
-
UnionQuery representing the concatenation of the two queries
Examples¶
dl1 = daily_location('2016-01-01', spatial_unit=CellSpatialUnit())
dl2 = daily_location('2016-01-02', spatial_unit=CellSpatialUnit())
dl1.union(dl2).get_query()
'cell_msisdn_20160101 UNION ALL cell_msisdn_20160102'
dl1.union(dl2,all=False).get_query()
'cell_msisdn_20160101 UNION cell_msisdn_20160102'
cache¶
cache
Returns¶
-
boolTrue is caching is switched on.
column_names¶
column_names
Returns the column names.
Returns¶
-
typing.List[str]List of the column names of this query.
column_names_as_string_list¶
column_names_as_string_list
Get the column names as a comma separated list
Returns¶
-
strComma separated list of column names
dependencies¶
dependencies
Returns¶
-
setThe set of queries which this one is directly dependent on.
fully_qualified_table_name¶
fully_qualified_table_name
Returns a unique fully qualified name for the query to be stored as under the cache schema, based on a hash of the parameters, class, and subqueries.
Returns¶
-
strString form of the table's fqn
index_cols¶
index_cols
A list of columns to use as indexes when storing this query.
Returns¶
-
ixen:listBy default, returns the location columns if they are present and self.spatial_unit is defined, and the subscriber column.
Examples¶
daily_location("2016-01-01").index_cols
[['name'], '"subscriber"']
is_stored¶
is_stored
Returns¶
-
boolTrue if the table is stored, and False otherwise.
query_id¶
query_id
Generate a uniquely identifying hash of this query, based on the parameters of it and the subqueries it is composed of.
Returns¶
-
strquery_id hash string
query_state¶
query_state
Return the current query state.
Returns¶
-
QueryStateThe current query state
query_state_str¶
query_state_str
Return the current query state as a string
Returns¶
-
strThe current query state. The possible values are the ones defined in
flowmachine.core.query_state.QueryState.
table_name¶
table_name
Returns a uniquename for the query to be stored as, based on a hash of the parameters, class, and subqueries.
Returns¶
-
strString form of the table's fqn