flowmachine.core.random
Source: flowmachine/core/random.py
Classes to select random samples from queries or tables.
Class RandomBase
RandomBase(query: flowmachine.core.query.Query, *, size: Optional[int] = None, fraction: Optional[float] = None, estimate_count: bool = False)
Base class for queries used to obtain a random sample from a table.
Methods
_sample_params
Parameters passed when initialising this query.
Returns
typing.Dict
column_names
table_name
Class RandomIDs
RandomIDs(query: flowmachine.core.query.Query, *, size: Optional[int] = None, fraction: Optional[float] = None, estimate_count: bool = False, seed: Optional[float] = None)
Gets a random sample from the result of a query, using the 'random_ids' sampling method. This method samples rows by randomly sampling the row number.
Parameters
- query : flowmachine.core.query.Query
  A query specifying a table from which a random sample will be drawn.
- size : typing.Optional[int], default None
  The number of rows to be selected from the table. Exactly one of the 'size' or 'fraction' arguments must be provided.
- fraction : typing.Optional[float], default None
  The fraction of rows to be selected from the table. Exactly one of the 'size' or 'fraction' arguments must be provided.
- estimate_count : bool, default False
  Whether to estimate the number of rows in the table using information contained in the pg_class catalog, or to perform an actual count of the rows.
- seed : typing.Optional[float], default None
  Optionally provide a seed for repeatable random samples. For the 'random_ids' method, the seed must be between -1 and 1.
Note
Random samples may only be stored if a seed is supplied.
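The argument rules above can be sketched as a small check. This is only an illustration and not flowmachine's own validation code: the helper name `check_sample_args` is hypothetical, and treating the seed bounds as inclusive is an assumption.

```python
from typing import Optional


def check_sample_args(
    size: Optional[int] = None,
    fraction: Optional[float] = None,
    seed: Optional[float] = None,
) -> None:
    """Hypothetical helper: raise ValueError if the documented rules are broken."""
    # Exactly one of 'size' or 'fraction' must be provided.
    if (size is None) == (fraction is None):
        raise ValueError("Exactly one of 'size' or 'fraction' must be provided.")
    # For 'random_ids', the seed must lie between -1 and 1
    # (inclusive bounds are an assumption of this sketch).
    if seed is not None and not -1 <= seed <= 1:
        raise ValueError("Seed must be between -1 and 1.")


check_sample_args(size=10, seed=0.5)  # valid: passes silently
try:
    check_sample_args(size=10, fraction=0.1)  # invalid: both given
except ValueError as err:
    print(err)
```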
Methods
_sample_params
Parameters passed when initialising this query.
Returns
typing.Dict
column_names
seed
table_name
Class RandomSystemRows
RandomSystemRows(query: flowmachine.core.query.Query, *, size: Optional[int] = None, fraction: Optional[float] = None, estimate_count: bool = False)
Gets a random sample from the result of a query, using a PostgreSQL TABLESAMPLE clause with the 'system_rows' method. This method performs block-level sampling by randomly sampling each physical storage page of the underlying relation. This sampling method is guaranteed to provide a sample of the specified size.
Parameters
- query : flowmachine.core.query.Query
  A query specifying a table from which a random sample will be drawn.
- size : typing.Optional[int], default None
  The number of rows to be selected from the table. Exactly one of the 'size' or 'fraction' arguments must be provided.
- fraction : typing.Optional[float], default None
  The fraction of rows to be selected from the table. Exactly one of the 'size' or 'fraction' arguments must be provided.
- estimate_count : bool, default False
  Whether to estimate the number of rows in the table using information contained in the pg_class catalog, or to perform an actual count of the rows.
Note
The 'system_rows' sampling method does not support parent tables which have child inheritance. It also does not support supplying a seed for reproducible samples, so random samples cannot be stored.
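To make the 'estimate_count' option and the 'system_rows' clause more concrete, the sketch below builds the kind of SQL involved. The helper names and the table name are hypothetical, and the statements flowmachine actually generates may differ. What is standard PostgreSQL is that `pg_class.reltuples` holds the planner's row estimate and that `SYSTEM_ROWS` comes from the `tsm_system_rows` extension.

```python
def row_count_sql(table: str, estimate_count: bool = False) -> str:
    """Hypothetical helper: SQL returning the number of rows in `table`."""
    if estimate_count:
        # Planner estimate from the pg_class catalog: fast but approximate.
        return f"SELECT reltuples::bigint FROM pg_class WHERE oid = '{table}'::regclass"
    # Exact count: requires scanning the table.
    return f"SELECT count(*) FROM {table}"


def system_rows_sql(table: str, size: int) -> str:
    """Hypothetical helper: sample `size` rows using tsm_system_rows."""
    # SYSTEM_ROWS takes a row count and has no REPEATABLE (seed) clause,
    # which is why this method is not reproducible.
    return f"SELECT * FROM {table} TABLESAMPLE SYSTEM_ROWS({int(size)})"


# Illustration only: never interpolate untrusted names into SQL like this.
print(row_count_sql("my_schema.my_table", estimate_count=True))
print(system_rows_sql("my_schema.my_table", 10))
```

A row count is presumably what lets the class translate between 'fraction' and 'size', and an estimate avoids a full scan at the cost of accuracy.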
Methods
_sample_params
Parameters passed when initialising this query.
Returns
typing.Dict
column_names
table_name
Class RandomTablesample
RandomTablesample(query: flowmachine.core.query.Query, *, size: Optional[int] = None, fraction: Optional[float] = None, estimate_count: bool = False, seed: Optional[float] = None)
Gets a random sample from the result of a query, using a PostgreSQL TABLESAMPLE clause with one of the following sampling methods:
- 'system': performs block-level sampling by randomly sampling each physical storage page of the underlying relation. This method is not guaranteed to return a sample of exactly the specified size, only an approximation of it. It may return no rows at all, so it can be worth running it again if it returns an empty dataframe.
- 'bernoulli': samples directly on each row of the underlying relation. This method is slower, and is likewise not guaranteed to return a sample of exactly the specified size, only an approximation of it.
The choice of method is determined by the _sampling_method attribute.
Parameters
- query : flowmachine.core.query.Query
  A query specifying a table from which a random sample will be drawn.
- size : typing.Optional[int], default None
  The number of rows to be selected from the table. Exactly one of the 'size' or 'fraction' arguments must be provided.
- fraction : typing.Optional[float], default None
  The fraction of rows to be selected from the table. Exactly one of the 'size' or 'fraction' arguments must be provided.
- estimate_count : bool, default False
  Whether to estimate the number of rows in the table using information contained in the pg_class catalog, or to perform an actual count of the rows.
- seed : typing.Optional[float], default None
  Optionally provide a seed for repeatable random samples.
Note
Random samples may only be stored if a seed is supplied.
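The behaviour described for 'bernoulli' sampling (an approximate sample size, repeatable only with a seed) can be imitated in plain Python. This is a sketch of the idea only: it uses Python's `random` module rather than PostgreSQL's sampler, and `bernoulli_sample` is a hypothetical name.

```python
import random
from typing import List, Optional, Sequence


def bernoulli_sample(
    rows: Sequence, fraction: float, seed: Optional[float] = None
) -> List:
    """Keep each row independently with probability `fraction`."""
    # With seed=None the generator is seeded from the OS, so runs differ.
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(1000))
first = bernoulli_sample(rows, fraction=0.01, seed=0.5)
second = bernoulli_sample(rows, fraction=0.01, seed=0.5)
print(len(first))       # near 10 on average, but not guaranteed to be 10
print(first == second)  # True: the same seed reproduces the same sample
```

Because every row is tested independently, a small fraction on a small table can yield an empty sample, which matches the advice to rerun when an empty dataframe comes back.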
Methods
_sample_params
Parameters passed when initialising this query.
Returns
typing.Dict
column_names
seed
table_name
Class SeedableRandom
SeedableRandom(query: flowmachine.core.query.Query, *, size: Optional[int] = None, fraction: Optional[float] = None, estimate_count: bool = False, seed: Optional[float] = None)
Base class for random samples that accept a seed parameter for reproducibility.
Methods
_sample_params
Parameters passed when initialising this query.
Returns
typing.Dict
column_names
seed
table_name
random_factory
random_factory(parent_class: Type[flowmachine.core.query.Query], sampling_method: str = 'random_ids')
Dynamically creates a random class as a descendant of parent_class. The resulting object will query the underlying object for attributes and methods.
Parameters
- parent_class : typing.Type
  Class from which to derive random class
- sampling_method : str, default 'random_ids'
  One of 'system_rows', 'system', 'bernoulli', 'random_ids'. Specifies the method used to select the random sample.
  - 'system_rows': performs block-level sampling by randomly sampling each physical storage page of the underlying relation. This sampling method is guaranteed to provide a sample of the specified size. This method does not support parent tables which have child inheritance, and is not reproducible.
  - 'system': performs block-level sampling by randomly sampling each physical storage page for the underlying relation. This sampling method is not guaranteed to generate a sample of the specified size, but an approximation. This method may not produce a sample at all, so it might be worth running it again if it returns an empty dataframe.
  - 'bernoulli': samples directly on each row of the underlying relation. This sampling method is slower and is not guaranteed to generate a sample of the specified size, but an approximation.
  - 'random_ids': samples rows by randomly sampling the row number.
Returns
- class
  A class which gets a random sample from the result of a query.
Examples
>>> query = UniqueSubscribers("2016-01-01", "2016-01-31")
>>> Random = random_factory(query.__class__)
>>> Random(query=query, size=10).get_dataframe()
             msisdn
0  AgvE8pa3Bvqezmo6
1  3XKdxqvyNxO2vLD1
2  5Kgwy8Gp6DlN3Eq9
3  L4V537alj321eWz6
4  GJP3DWdGyb4QBnyo
5  DAlqeZENbeOn2vBw
6  By4j6PKdB4NGMpxr
7  mkqQ4NPBPQLapbeg
8  YNv2EgDJxxAoy0Gr
9  2vmOlAENnxpPM1xX

>>> query = VersionedInfrastructure("2016-01-01")
>>> Random = random_factory(query.__class__)
>>> Random(query=query, size=10).get_dataframe()
       id  version
0  o9yyxY        0
1  B8OaG5        0
2  DbWg4K        0
3  0xqNDj        0
4  pqg7ZE        0
5  nWM8R3        0
6  LVnDQL        0
7  pdVVV4        0
8  wzrXjw        0
9  RZgwVz        0

# The 'system_rows' method does not support parent tables which have child inheritance,
# as is the case with 'events.calls', so we choose another method here.
>>> Random = random_factory(flowmachine.core.Query, sampling_method='bernoulli')
>>> Random(query=Table('events.calls', columns=['id', 'duration']), size=10).get_dataframe()
                        id  duration
0  mQjOy-5eVrm-Ll5eE-P4V27     422.0
1  mQjOy-5eVrm-Ll5eE-P4V27     422.0
2  0r4KG-Rb4Lm-VK1bB-LZQxg     762.0
3  BDXMV-yb8Kl-zkmav-AZEJ2     318.0
4  vm9gW-4QbYm-OrKbz-qM5Yx    1407.0
5  WYxk8-mepk9-W3pdM-yJNjQ    1062.0
6  mQjOy-5eVn3-wK5eE-P4V27    1033.0
7  M7Vl4-zbqom-oPDep-rOZqE     879.0
8  58DKg-l9av9-NE8eG-1vzAp    3129.0
9  m9gW4-QbY62-WLYdz-qM5Yx    1117.0
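For readers wondering how a class can be derived from parent_class at call time and still defer to the wrapped query, the following is a minimal, self-contained sketch of one common pattern: `type()` to build the class and `__getattr__` to delegate. It is not flowmachine's implementation, and `Query`, `UniqueThings`, `RandomMixin` and `make_random_class` are all stand-in names invented for the illustration.

```python
class Query:
    """Stand-in for flowmachine.core.query.Query (illustration only)."""


class UniqueThings(Query):
    """Stand-in for a concrete query class with its own attributes."""

    def __init__(self, start):
        self.start = start


class RandomMixin:
    """Stand-in for a sampling class that wraps an existing query."""

    def __init__(self, query, *, size=None):
        self.query = query
        self.size = size

    def __getattr__(self, name):
        # Only called when normal lookup fails; guard against recursion
        # if 'query' itself has not been set yet.
        if name == "query":
            raise AttributeError(name)
        # Defer to the wrapped query for anything not found on the sample.
        return getattr(self.query, name)


def make_random_class(parent_class):
    # type(name, bases, namespace) creates a new class at runtime.
    return type("Random", (RandomMixin, parent_class), {})


Random = make_random_class(UniqueThings)
sample = Random(query=UniqueThings("2016-01-01"), size=10)
print(isinstance(sample, UniqueThings))  # True: descends from parent_class
print(sample.start)                      # 2016-01-01, read from the wrapped query
```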