flowmachine.core.random

Source: flowmachine/core/random.py

Classes to select random samples from queries or tables.

Class RandomBase

RandomBase(query: flowmachine.core.query.Query, *, size: Union[int, NoneType] = None, fraction: Union[float, NoneType] = None, estimate_count: bool = False)
Source: flowmachine/core/random.py

Base class for queries used to obtain a random sample from a table.

Attributes

Methods

_sample_params

_sample_params
Source: flowmachine/core/random.py

Parameters passed when initialising this query.

Returns
  • typing.Dict[str, typing.Any]

column_names

column_names
Source: flowmachine/core/random.py

table_name

table_name
Source: flowmachine/core/random.py

Class RandomIDs

RandomIDs(query: flowmachine.core.query.Query, *, size: Union[int, NoneType] = None, fraction: Union[float, NoneType] = None, estimate_count: bool = False, seed: Union[float, NoneType] = None)
Source: flowmachine/core/random.py

Gets a random sample from the result of a query, using the 'random_ids' sampling method. This method samples rows by randomly sampling the row number.
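The idea of sampling by row number can be illustrated outside the database. The following is a minimal pure-Python sketch, not flowmachine's implementation (which performs the sampling in SQL); `sample_row_numbers` is a hypothetical helper introduced only for illustration.

```python
import random
from typing import List, Optional


def sample_row_numbers(n_rows: int, size: int, seed: Optional[float] = None) -> List[int]:
    """Pick `size` distinct 1-based row numbers out of `n_rows` (hypothetical helper)."""
    # With seed=None the generator is seeded from the OS, so results vary per call.
    rng = random.Random(seed)
    # sample() draws without replacement, so exactly `size` distinct rows come back.
    return sorted(rng.sample(range(1, n_rows + 1), size))


picked = sample_row_numbers(n_rows=1000, size=10, seed=0.5)
print(len(picked), picked)
```

Because row numbers are drawn without replacement, this style of sampling returns exactly the requested number of rows, and repeating the call with the same seed returns the same rows.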

Attributes

Parameters

  • query: flowmachine.core.query.Query

    A query specifying a table from which a random sample will be drawn.

  • size: typing.Union[int, NoneType], default None

    The number of rows to be selected from the table. Exactly one of the 'size' or 'fraction' arguments must be provided.

  • fraction: typing.Union[float, NoneType], default None

    The fraction of rows to be selected from the table. Exactly one of the 'size' or 'fraction' arguments must be provided.

  • estimate_count: bool, default False

    If True, estimate the number of rows in the table from the statistics held in pg_class; if False, perform an actual count of the rows.

  • seed: typing.Union[float, NoneType], default None

    Optionally provide a seed for repeatable random samples. For the 'random_ids' method, the seed must be between -1 and +1.

Note

Random samples may only be stored if a seed is supplied.
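The argument rules described above can be summarised as a small validation routine. This is a sketch under stated assumptions rather than the library's own code: `check_sample_args` is a hypothetical name, and the 0 to 1 bound on `fraction` is an assumption that the parameter descriptions do not state explicitly.

```python
from typing import Optional


def check_sample_args(
    size: Optional[int] = None,
    fraction: Optional[float] = None,
    seed: Optional[float] = None,
) -> None:
    """Raise ValueError if the arguments break the documented rules (hypothetical helper)."""
    # Exactly one of 'size' or 'fraction' must be provided.
    if (size is None) == (fraction is None):
        raise ValueError("Exactly one of 'size' or 'fraction' must be provided.")
    # Assumption: a fraction only makes sense between 0 and 1.
    if fraction is not None and not 0 <= fraction <= 1:
        raise ValueError("'fraction' must be between 0 and 1.")
    # Documented rule for the 'random_ids' method.
    if seed is not None and not -1 <= seed <= 1:
        raise ValueError("For 'random_ids', 'seed' must be between -1 and +1.")


check_sample_args(size=10, seed=0.3)  # valid: size only, seed within range
check_sample_args(fraction=0.1)       # valid: fraction only, no seed
```

The seed range matches the range accepted by PostgreSQL's setseed() function; whether flowmachine relies on setseed() for this method is not stated here.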

Methods

_sample_params

_sample_params
Source: flowmachine/core/random.py

Parameters passed when initialising this query.

Returns
  • typing.Dict[str, typing.Any]

column_names

column_names
Source: flowmachine/core/random.py

seed

seed
Source: flowmachine/core/random.py

table_name

table_name
Source: flowmachine/core/random.py

Class RandomSystemRows

RandomSystemRows(query: flowmachine.core.query.Query, *, size: Union[int, NoneType] = None, fraction: Union[float, NoneType] = None, estimate_count: bool = False)
Source: flowmachine/core/random.py

Gets a random sample from the result of a query, using a PostgreSQL TABLESAMPLE clause with the 'system_rows' method. This method performs block-level sampling by randomly sampling each physical storage page of the underlying relation. This sampling method is guaranteed to provide a sample of the specified size.

Attributes

Parameters

  • query: flowmachine.core.query.Query

    A query specifying a table from which a random sample will be drawn.

  • size: typing.Union[int, NoneType], default None

    The number of rows to be selected from the table. Exactly one of the 'size' or 'fraction' arguments must be provided.

  • fraction: typing.Union[float, NoneType], default None

    The fraction of rows to be selected from the table. Exactly one of the 'size' or 'fraction' arguments must be provided.

  • estimate_count: bool, default False

    If True, estimate the number of rows in the table from the statistics held in pg_class; if False, perform an actual count of the rows.
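To make the trade-off behind `estimate_count` concrete, the sketch below builds the two kinds of row-count statement one might send to PostgreSQL. It is illustrative only: `row_count_sql` is a hypothetical helper, the SQL flowmachine actually generates may differ, and the string formatting is for readability rather than safe query construction.

```python
def row_count_sql(table: str, estimate_count: bool) -> str:
    """Return an SQL statement that counts rows in `table` (hypothetical helper)."""
    if estimate_count:
        # pg_class.reltuples holds the planner's row estimate. It is cheap to read
        # but only as fresh as the last VACUUM or ANALYZE.
        return f"SELECT reltuples::bigint FROM pg_class WHERE oid = '{table}'::regclass"
    # An exact count has to scan the table (or an index), which can be slow.
    return f"SELECT count(*) FROM {table}"


print(row_count_sql("events.calls", estimate_count=True))
print(row_count_sql("events.calls", estimate_count=False))
```

An estimate is usually adequate when a fraction only needs to be converted into an approximate number of rows; an exact count is the safer choice on tables whose statistics may be stale.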

Note

The 'system_rows' sampling method does not support parent tables that have inheriting child tables. It also does not support supplying a seed for reproducible samples, so random samples cannot be stored.

Methods

_sample_params

_sample_params
Source: flowmachine/core/random.py

Parameters passed when initialising this query.

Returns
  • typing.Dict[str, typing.Any]

column_names

column_names
Source: flowmachine/core/random.py

table_name

table_name
Source: flowmachine/core/random.py

Class RandomTablesample

RandomTablesample(query: flowmachine.core.query.Query, *, size: Union[int, NoneType] = None, fraction: Union[float, NoneType] = None, estimate_count: bool = False, seed: Union[float, NoneType] = None)
Source: flowmachine/core/random.py

Gets a random sample from the result of a query, using a PostgreSQL TABLESAMPLE clause with one of the following sampling methods:

  • 'system': performs block-level sampling by randomly sampling each physical storage page of the underlying relation. This sampling method is not guaranteed to generate a sample of the specified size, only an approximation. It may not produce a sample at all, so it might be worth running it again if it returns an empty dataframe.

  • 'bernoulli': samples directly on each row of the underlying relation. This sampling method is slower and is not guaranteed to generate a sample of the specified size, only an approximation.

The choice of method is determined from the _sampling_method attribute.
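The difference between the two methods can be seen in a small simulation. The sketch below is pure Python and only mimics the behaviour described above; it models a table as a list of fixed-size pages and does not reproduce PostgreSQL's actual sampling code. All function names are hypothetical.

```python
import random
from typing import List, Optional

Page = List[int]


def make_table(n_rows: int, rows_per_page: int) -> List[Page]:
    """Split row ids 0..n_rows-1 into consecutive 'storage pages'."""
    rows = list(range(n_rows))
    return [rows[i:i + rows_per_page] for i in range(0, n_rows, rows_per_page)]


def bernoulli_sample(pages: List[Page], fraction: float, seed: Optional[float] = None) -> List[int]:
    """Row-level sampling: every row is kept independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for page in pages for row in page if rng.random() < fraction]


def system_sample(pages: List[Page], fraction: float, seed: Optional[float] = None) -> List[int]:
    """Block-level sampling: every page is kept whole with probability `fraction`."""
    rng = random.Random(seed)
    return [row for page in pages if rng.random() < fraction for row in page]


table = make_table(n_rows=1000, rows_per_page=100)
print(len(bernoulli_sample(table, 0.1)))  # close to 100, rarely exactly 100
print(len(system_sample(table, 0.1)))     # a multiple of 100, and quite possibly 0
```

With only ten pages, block-level sampling at a fraction of 0.1 has a sizeable chance of selecting no page at all, which is why an empty result from 'system' is worth retrying. Row-level sampling makes one random decision per row, which is slower but gives sample sizes that stay much closer to the requested fraction.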

Attributes

Parameters

  • query: flowmachine.core.query.Query

    A query specifying a table from which a random sample will be drawn.

  • size: typing.Union[int, NoneType], default None

    The number of rows to be selected from the table. Exactly one of the 'size' or 'fraction' arguments must be provided.

  • fraction: typing.Union[float, NoneType], default None

    The fraction of rows to be selected from the table. Exactly one of the 'size' or 'fraction' arguments must be provided.

  • estimate_count: bool, default False

    If True, estimate the number of rows in the table from the statistics held in pg_class; if False, perform an actual count of the rows.

  • seed: typing.Union[float, NoneType], default None

    Optionally provide a seed for repeatable random samples.

Note

Random samples may only be stored if a seed is supplied.
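In PostgreSQL, a seed is attached to a TABLESAMPLE clause with REPEATABLE, and the built-in SYSTEM and BERNOULLI methods take a percentage between 0 and 100. The sketch below shows what such a statement could look like; `tablesample_sql` is a hypothetical helper, and the SQL that flowmachine actually generates may be structured differently.

```python
from typing import Optional


def tablesample_sql(table: str, method: str, fraction: float, seed: Optional[float] = None) -> str:
    """Build a SELECT with a TABLESAMPLE clause (hypothetical helper)."""
    if method not in ("system", "bernoulli"):
        raise ValueError(f"Unsupported sampling method: {method!r}")
    # SYSTEM and BERNOULLI expect a percentage, not a fraction.
    percent = 100 * fraction
    sql = f"SELECT * FROM {table} TABLESAMPLE {method.upper()} ({percent:g})"
    if seed is not None:
        # REPEATABLE makes the sample reproducible while the table is unchanged.
        sql += f" REPEATABLE ({seed:g})"
    return sql


print(tablesample_sql("events.calls", "bernoulli", 0.25, seed=42))
print(tablesample_sql("events.calls", "system", 0.25))
```

Without a seed, two runs of the same statement generally return different rows, so a stored result could not be regenerated; this is consistent with the note that samples may only be stored when a seed is supplied.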

Methods

_sample_params

_sample_params
Source: flowmachine/core/random.py

Parameters passed when initialising this query.

Returns
  • typing.Dict[str, typing.Any]

column_names

column_names
Source: flowmachine/core/random.py

seed

seed
Source: flowmachine/core/random.py

table_name

table_name
Source: flowmachine/core/random.py

Class SeedableRandom

SeedableRandom(query: flowmachine.core.query.Query, *, size: Union[int, NoneType] = None, fraction: Union[float, NoneType] = None, estimate_count: bool = False, seed: Union[float, NoneType] = None)
Source: flowmachine/core/random.py

Base class for random samples that accept a seed parameter for reproducibility.

Attributes

Methods

_sample_params

_sample_params
Source: flowmachine/core/random.py

Parameters passed when initialising this query.

Returns
  • typing.Dict[str, typing.Any]

column_names

column_names
Source: flowmachine/core/random.py

seed

seed
Source: flowmachine/core/random.py

table_name

table_name
Source: flowmachine/core/random.py

random_factory

random_factory(parent_class: Type[flowmachine.core.query.Query], sampling_method: str = 'random_ids')
Source: flowmachine/core/random.py

Dynamically creates a random class as a descendant of parent_class. The resulting object will query the underlying object for attributes and methods.

Parameters

  • parent_class: typing.Type[flowmachine.core.query.Query]

    Class from which to derive random class

  • sampling_method: str, default 'random_ids'

    One of 'system_rows', 'system', 'bernoulli', 'random_ids'. Specifies the method used to select the random sample.

      • 'system_rows': performs block-level sampling by randomly sampling each physical storage page of the underlying relation. This sampling method is guaranteed to provide a sample of the specified size. It does not support parent tables that have inheriting child tables, and is not reproducible.

      • 'system': performs block-level sampling by randomly sampling each physical storage page of the underlying relation. This sampling method is not guaranteed to generate a sample of the specified size, only an approximation. It may not produce a sample at all, so it might be worth running it again if it returns an empty dataframe.

      • 'bernoulli': samples directly on each row of the underlying relation. This sampling method is slower and is not guaranteed to generate a sample of the specified size, only an approximation.

      • 'random_ids': samples rows by randomly sampling the row number.

Returns

  • class

    A class which gets a random sample from the result of a query.
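The general technique of building a class at runtime from a chosen sampler and a parent class can be sketched with Python's built-in `type()`. Everything below is a stand-in written for illustration: `Query`, the two mixins, `SAMPLERS` and `make_random_class` are hypothetical and are not flowmachine's classes or its implementation of random_factory.

```python
from typing import Dict, Type


class Query:
    """Stand-in for a query class; not flowmachine.core.query.Query."""

    def __init__(self, name: str) -> None:
        self.name = name

    def describe(self) -> str:
        return f"query {self.name}"


class RandomIDsMixin:
    sampling_method = "random_ids"


class BernoulliMixin:
    sampling_method = "bernoulli"


# Hypothetical lookup from method name to the class providing the sampling behaviour.
SAMPLERS: Dict[str, type] = {"random_ids": RandomIDsMixin, "bernoulli": BernoulliMixin}


def make_random_class(parent_class: Type[Query], sampling_method: str = "random_ids") -> type:
    """Create a class that inherits from both the sampler and `parent_class`."""
    try:
        sampler = SAMPLERS[sampling_method]
    except KeyError:
        raise ValueError(f"Unknown sampling method: {sampling_method!r}") from None
    # type(name, bases, namespace) builds a new class at runtime.
    return type("Random", (sampler, parent_class), {})


Random = make_random_class(Query, sampling_method="bernoulli")
sample = Random("subscribers")
print(sample.describe(), sample.sampling_method)
```

Because the generated class lists the parent class among its bases, instances expose the parent's attributes and methods alongside the sampling behaviour.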

Examples

>>> query = UniqueSubscribers("2016-01-01", "2016-01-31")
>>> Random = random_factory(query.__class__)
>>> Random(query=query, size=10).get_dataframe()
             msisdn
0  AgvE8pa3Bvqezmo6
1  3XKdxqvyNxO2vLD1
2  5Kgwy8Gp6DlN3Eq9
3  L4V537alj321eWz6
4  GJP3DWdGyb4QBnyo
5  DAlqeZENbeOn2vBw
6  By4j6PKdB4NGMpxr
7  mkqQ4NPBPQLapbeg
8  YNv2EgDJxxAoy0Gr
9  2vmOlAENnxpPM1xX

>>> query = VersionedInfrastructure("2016-01-01")
>>> Random = random_factory(query.__class__)
>>> Random(query=query, size=10).get_dataframe()
       id  version
0  o9yyxY        0
1  B8OaG5        0
2  DbWg4K        0
3  0xqNDj        0
4  pqg7ZE        0
5  nWM8R3        0
6  LVnDQL        0
7  pdVVV4        0
8  wzrXjw        0
9  RZgwVz        0

# The 'system_rows' method does not support parent tables that have inheriting
# child tables, as is the case with 'events.calls', so we choose 'bernoulli' here.
>>> Random = random_factory(flowmachine.core.Query, sampling_method='bernoulli')
>>> Random(query=Table('events.calls', columns=['id', 'duration']), size=10).get_dataframe()
                        id  duration
0  mQjOy-5eVrm-Ll5eE-P4V27     422.0
1  mQjOy-5eVrm-Ll5eE-P4V27     422.0
2  0r4KG-Rb4Lm-VK1bB-LZQxg     762.0
3  BDXMV-yb8Kl-zkmav-AZEJ2     318.0
4  vm9gW-4QbYm-OrKbz-qM5Yx    1407.0
5  WYxk8-mepk9-W3pdM-yJNjQ    1062.0
6  mQjOy-5eVn3-wK5eE-P4V27    1033.0
7  M7Vl4-zbqom-oPDep-rOZqE     879.0
8  58DKg-l9av9-NE8eG-1vzAp    3129.0
9  m9gW4-QbY62-WLYdz-qM5Yx    1117.0