flowmachine.core.query¶
Source: flowmachine/core/query.py
This is the base class that defines any query on our database. It defines methods that return the query as a string and as a pandas dataframe.
Class Query¶
Query(cache=True)
The core base class of the flowmachine module. It handles all input and output for our SQL queries, so that inheriting classes only need to deal with the logic of actually constructing the SQL statement itself.
Parameters¶
- cache : bool, default True
  Will store the resultant dataframes in memory. One can turn this off with turn_off_caching, and back on with turn_on_caching.
Methods¶
explain¶
explain(self, format="text", analyse=False)
Return the Postgres query plan (EXPLAIN) for this query in the requested format. If analyse is True, the query is run so that actual run times are included in the plan.
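An illustrative sketch of using explain (assuming flowmachine.connect() has been called and daily_location has been imported from flowmachine.features, as in the examples elsewhere on this page):
dl = daily_location("2016-01-01")
plan = dl.explain()               # query plan as text, without running the query
timed = dl.explain(analyse=True)  # runs the query so actual run times are included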
get_dataframe¶
get_dataframe(self)
Executes the query and returns the result as a pandas dataframe. This should be executed with care, as the results may consume large amounts of memory.
Returns¶
- pandas.DataFrame
  DataFrame containing results of the query.
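A minimal usage sketch, under the same assumptions as above (a connected flowmachine session and daily_location):
dl = daily_location("2016-01-01")
df = dl.get_dataframe()   # pandas.DataFrame of the results; may be large
df.head()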
get_dataframe_async¶
get_dataframe_async(self)
Execute the query in a worker thread and return a future object which will contain the result as a pandas dataframe when complete.
Returns¶
- Future
  Future object which can be used to get the resulting dataframe.
Note
This should be executed with care, as the results may consume large amounts of memory
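For example (illustrative only):
dl = daily_location("2016-01-02")
future = dl.get_dataframe_async()   # returns immediately
# ... do other work while the query runs in a worker thread ...
df = future.result()                # blocks until the dataframe is ready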
get_query¶
get_query(self)
Returns a string representing an SQL query. The string will point to the database cache of this query if it exists.
Returns¶
- str
  SQL query string.
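For example (illustrative):
dl = daily_location("2016-01-01")
sql = dl.get_query()   # plain SQL string; points at the cache table if this query is stored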
get_table¶
get_table(self)
If this Query is stored, return a Table object referencing the stored version. If it is not stored, raise an exception.
Returns¶
- flowmachine.core.Table
  The stored version of this Query as a Table object.
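A sketch of the intended usage, which assumes the query has been stored first:
dl = daily_location("2016-01-01")
dl.store().result()    # ensure the query is cached in the database
tbl = dl.get_table()   # flowmachine.core.Table referencing the cached result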
head¶
head(self, n=5)
Return the first n results of the query
Parameters¶
- n : int
  Number of results to return.
Returns¶
- pandas.DataFrame
  A DataFrame containing n results.
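For example (illustrative):
daily_location("2016-01-01").head(10)   # DataFrame with the first 10 rows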
invalidate_db_cache¶
invalidate_db_cache(self, name=None, schema=None, cascade=True, drop=True)
Drops this table, and (by default) any that depend on it, as well as removing them from the cache metadata table. If the table is currently being dropped from elsewhere, this method will block and return when the table has been removed.
Parameters¶
- name : str
  Name of the table.
- schema : str
  Schema of the table.
- cascade : bool
  Set to False to remove only this table from the cache.
- drop : bool
  Set to False to remove the cache record without dropping the table.
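An illustrative sketch of removing a cached query:
dl = daily_location("2016-01-01")
dl.store().result()        # cache the result
dl.invalidate_db_cache()   # drop the cached table and anything that depends on it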
join¶
join(self, other, on_left, on_right=None, how="inner", left_append="", right_append="")
Join the results of this query to another query.
Parameters¶
- other : Query
  Query to join to.
- on_left : str
  Field of this query to join on.
- on_right : str
  Field of the other query to join on.
- how : {'left', 'outer', 'inner', 'full'}
  Method of joining to the other query.
- left_append : str
- right_append : str
Returns¶
- Join
  Query object representing the two queries joined together.
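A sketch of joining two queries on their shared subscriber column; the "_day1" and "_day2" suffixes are hypothetical values passed as left_append and right_append:
dl1 = daily_location("2016-01-01")
dl2 = daily_location("2016-01-02")
joined = dl1.join(dl2, on_left="subscriber", how="left", left_append="_day1", right_append="_day2")
joined.get_dataframe()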
numeric_subset¶
numeric_subset(self, col, low, high)
Subsets one of the columns to a specified range of numerical values.
Parameters¶
- col : str
  Name of the column to subset, e.g. subscriber, cell etc.
- low : float
  Lower bound of the interval to subset on.
- high : float
  Upper bound of the interval to subset on.
Returns¶
Numeric subset object
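A minimal sketch, where q stands for any Query and the numeric column name "value" is hypothetical:
in_range = q.numeric_subset("value", low=0, high=100)   # keep rows whose value falls between 0 and 100
in_range.get_dataframe()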
random_sample¶
random_sample(self, sampling_method="random_ids", **params)
Draws a random sample from this query.
Parameters¶
- sampling_method : {'system', 'system_rows', 'bernoulli', 'random_ids'}, default 'random_ids'
  Specifies the method used to select the random sample.
  'system_rows': performs block-level sampling by randomly sampling each physical storage page of the underlying relation. This sampling method is guaranteed to provide a sample of the specified size.
  'system': performs block-level sampling by randomly sampling each physical storage page of the underlying relation. This sampling method is not guaranteed to generate a sample of the specified size, only an approximation. It may not produce a sample at all, so it might be worth running it again if it returns an empty dataframe.
  'bernoulli': samples directly on each row of the underlying relation. This sampling method is slower and is not guaranteed to generate a sample of the specified size, only an approximation.
  'random_ids': samples rows by randomly sampling the row number.
- size : int, optional
  The number of rows to draw. Exactly one of the 'size' or 'fraction' arguments must be provided.
- fraction : float, optional
  Fraction of rows to draw. Exactly one of the 'size' or 'fraction' arguments must be provided.
- estimate_count : bool, default False
  Whether to estimate the number of rows in the table using information contained in pg_class, or whether to perform an actual count of the number of rows.
- seed : float, optional
  Optionally provide a seed for repeatable random samples. If using the random_ids method, seed must be between -1 and +1. Not available in combination with the system_rows method.
Returns¶
- Random
  A special query object which contains a random sample from this one.
Note
Random samples may only be stored if a seed is supplied.
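An illustrative sketch, drawing a seeded sample with the default random_ids method so that it can be stored:
dl = daily_location("2016-01-01")
sample = dl.random_sample(size=1000, seed=0.5)   # for random_ids the seed must be between -1 and +1
sample.get_dataframe()
sample.store()   # allowed because a seed was supplied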
store¶
store(self, store_dependencies: bool = False) -> concurrent.futures._base.Future
Store the results of this computation with the correct table name using a background thread.
Parameters¶
- store_dependencies : bool, default False
  If True, store the dependencies of this query.
Returns¶
- concurrent.futures._base.Future
  Future object which can be queried to check whether the query is stored.
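For example (illustrative):
dl = daily_location("2016-01-01")
fut = dl.store(store_dependencies=True)   # also cache the queries this one depends on
fut.result()                              # block until the table has been written
dl.is_stored                              # True once storage has completed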
subset¶
subset(self, col, subset)
Subsets one of the columns to a specified subset of values
Parameters¶
- col : str
  Name of the column to subset, e.g. subscriber, cell etc.
- subset : list
  List of values to subset to.
Returns¶
Subset object
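A minimal sketch; the subscriber identifiers are placeholder values:
dl = daily_location("2016-01-01")
some_subs = dl.subset("subscriber", ["subscriber_a", "subscriber_b"])
some_subs.get_dataframe()   # only rows for the listed subscribers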
to_sql¶
to_sql(self, name: str, schema: Union[str, NoneType] = None, store_dependencies: bool = False) -> concurrent.futures._base.Future
Store the result of the calculation back into the database.
Parameters¶
- name : str
  Name of the table.
- schema : str, default None
  Name of an existing schema. If None, the postgres default will be used; see the postgres docs for more info.
- store_dependencies : bool, default False
  If True, store the dependencies of this query.
Returns¶
- concurrent.futures._base.Future
  Future object, containing this query and any result information.
Note
This method will return a Future immediately.
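An illustrative sketch; the table name and the "analysis" schema are hypothetical, and the schema must already exist:
dl = daily_location("2016-01-01")
fut = dl.to_sql("daily_locations_20160101", schema="analysis")
fut.result()   # wait for the write to finish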
turn_off_caching¶
turn_off_caching(self)
Turn caching off; the object forgets previously calculated dataframes and won't store further calculations.
turn_on_caching¶
turn_on_caching(self)
Turn caching on, so that computed dataframes are retained.
union¶
union(self, *other: 'Query', all: bool = True)
Returns a Query representing the union of queries, i.e. the tables concatenated. By passing the argument all as False, duplicates are also removed.
Parameters¶
- other : Query
  An instance of a query object.
- all : bool, default True
  If True, returns SQL UNION ALL; otherwise returns UNION, which removes duplicate rows.
Returns¶
- Union
  Query representing the concatenation of the two queries.
Examples¶
dl1 = daily_location('2016-01-01', spatial_unit=CellSpatialUnit())
dl2 = daily_location('2016-01-02', spatial_unit=CellSpatialUnit())
dl1.union(dl2).get_query()
'cell_msisdn_20160101 UNION ALL cell_msisdn_20160102'
dl1.union(dl2,all=False).get_query()
'cell_msisdn_20160101 UNION cell_msisdn_20160102'
cache¶
cache
Returns¶
- bool
  True if caching is switched on.
column_names¶
column_names
Returns the column names.
Returns¶
- typing.List[str]
  List of the column names of this query.
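For example (the exact names depend on the query and its spatial unit):
dl = daily_location("2016-01-01")
dl.column_names    # e.g. ['subscriber', 'name']; a list of strings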
column_names_as_string_list¶
column_names_as_string_list
Get the column names as a comma separated list
Returns¶
- str
  Comma separated list of column names.
dependencies¶
dependencies
Returns¶
- set
  The set of queries which this one is directly dependent on.
fully_qualified_table_name¶
fully_qualified_table_name
Returns a unique fully qualified name for the query to be stored as under the cache schema, based on a hash of the parameters, class, and subqueries.
Returns¶
- str
  String form of the table's fqn.
index_cols¶
index_cols
A list of columns to use as indexes when storing this query.
Returns¶
- ixen : list
  By default, returns the location columns if they are present and self.spatial_unit is defined, and the subscriber column.
Examples¶
daily_location("2016-01-01").index_cols
[['name'], '"subscriber"']
is_stored¶
is_stored
Returns¶
- bool
  True if the table is stored, and False otherwise.
query_id¶
query_id
Generate a uniquely identifying hash of this query, based on the parameters of it and the subqueries it is composed of.
Returns¶
- str
  query_id hash string.
query_state¶
query_state
Return the current query state.
Returns¶
- QueryState
  The current query state.
query_state_str¶
query_state_str
Return the current query state as a string
Returns¶
- str
  The current query state. The possible values are the ones defined in flowmachine.core.query_state.QueryState.
table_name¶
table_name
Returns a unique name for the query to be stored as, based on a hash of the parameters, class, and subqueries.
Returns¶
- str
  String form of the table's fqn.