flowmachine.features.subscriber.hartigan_cluster

Source: flowmachine/features/subscriber/hartigan_cluster.py

Classes for clustering locations using different methods. These classes reduce the number of locations for each subscriber by clustering a set of locations according to a specified methodology, which reduces the dimensionality of the problem at hand.

Class BaseCluster

BaseCluster(cache=True)
Source: flowmachine/features/subscriber/hartigan_cluster.py

Base query for cluster methods, providing a geo-augmented query method.

Attributes

Methods

cache

cache
Source: flowmachine/core/query.py

Returns
  • bool

    True if caching is switched on.

column_names

column_names
Source: flowmachine/core/query.py

Returns the column names.

Returns
  • typing.List[str]

    List of the column names of this query.

column_names_as_string_list

column_names_as_string_list
Source: flowmachine/core/query.py

Get the column names as a comma-separated list.

Returns
  • str

    Comma separated list of column names

dependencies

dependencies
Source: flowmachine/core/query.py

Returns
  • set

    The set of queries which this one is directly dependent on.

fully_qualified_table_name

fully_qualified_table_name
Source: flowmachine/core/query.py

Returns a unique fully qualified name for the query to be stored as under the cache schema, based on a hash of the parameters, class, and subqueries.

Returns
  • str

    String form of the table's fqn

index_cols

index_cols
Source: flowmachine/core/query.py

A list of columns to use as indexes when storing this query.

Returns
  • ixen: list

    By default, returns the location columns if they are present and self.spatial_unit is defined, and the subscriber column.

Examples
daily_location("2016-01-01").index_cols
[['name'], '"subscriber"']

is_stored

is_stored
Source: flowmachine/core/query.py

Returns
  • bool

    True if the table is stored, and False otherwise.

query_id

query_id
Source: flowmachine/core/query.py

Generate a uniquely identifying hash of this query, based on its parameters and the subqueries it is composed of.

Returns
  • str

    query_id hash string

query_state

query_state
Source: flowmachine/core/query.py

Return the current query state.

Returns
  • QueryState

    The current query state

query_state_str

query_state_str
Source: flowmachine/core/query.py

Return the current query state as a string

Returns
  • str

    The current query state. The possible values are the ones defined in flowmachine.core.query_state.QueryState.

table_name

table_name
Source: flowmachine/core/query.py

Returns a unique name for the query to be stored as, based on a hash of the parameters, class, and subqueries.

Returns
  • str

    String form of the table's fqn

Class HartiganCluster

HartiganCluster(*, calldays: flowmachine.features.subscriber.call_days.CallDays, radius: Union[float, str], buffer: float = 0, call_threshold: int = 0)
Source: flowmachine/features/subscriber/hartigan_cluster.py

Implements the Hartigan clustering algorithm, which clusters locations based on a ranked list of call days. [1]

The algorithm picks the site with the highest number of call days and forms a cluster on that site. It then descends the ranked list of call days from the top. The site with the second-highest number of call days is incorporated into the first cluster if it lies within a given radius of it. In that case, a new cluster centroid is calculated as the average of all the sites in the cluster, weighted by call days. When the algorithm reaches a site that is not within the radius of any existing cluster, it creates a new cluster on that site.

The class produces a table in which each row lists a cluster for a given subscriber, the rank of that cluster, the total number of call days in that cluster (the sum of the call days of all sites constituting the cluster, not the actual call days of the cluster), and all the sites that constitute the cluster.
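
The clustering steps described above can be illustrated with a minimal pure-Python sketch for a single subscriber. This is not the flowmachine implementation (which runs as a database query); the names `haversine_km` and `hartigan_sketch`, the tuple layout of `sites`, the use of a plain weighted mean of longitude/latitude for the centroid, joining the first (highest-ranked) cluster within the radius, and ranking clusters by creation order are all assumptions made for illustration.

```python
import math


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two lon/lat points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def hartigan_sketch(sites, radius_km, call_threshold=0):
    """
    Cluster one subscriber's sites (hypothetical helper, illustration only).

    sites: list of (site_id, lon, lat, calldays) tuples, with calldays > 0.
    Returns a list of cluster dicts, in order of creation.
    """
    clusters = []
    # Descend the list of sites ranked by call days, highest first.
    ranked = sorted(sites, key=lambda s: s[3], reverse=True)
    for site_id, lon, lat, days in ranked:
        for c in clusters:
            # Assumption: join the first existing cluster whose centroid is in range.
            if haversine_km(lon, lat, c["lon"], c["lat"]) <= radius_km:
                total = c["calldays"] + days
                # Incremental update of the call-day-weighted mean centroid.
                c["lon"] = (c["lon"] * c["calldays"] + lon * days) / total
                c["lat"] = (c["lat"] * c["calldays"] + lat * days) / total
                c["calldays"] = total
                c["site_ids"].append(site_id)
                break
        else:
            # No cluster within the radius: start a new one on this site.
            clusters.append(
                {
                    "rank": len(clusters) + 1,  # assumption: rank = creation order
                    "lon": lon,
                    "lat": lat,
                    "calldays": days,
                    "site_ids": [site_id],
                }
            )
    # Drop clusters whose summed call days fall below the threshold.
    return [c for c in clusters if c["calldays"] >= call_threshold]


# Made-up sites: "a" and "b" are about 1 km apart, "c" is far away.
sites = [("a", 0.0, 0.0, 3), ("b", 0.0, 0.01, 1), ("c", 10.0, 10.0, 2)]
clusters = hartigan_sketch(sites, radius_km=5.0)
```

With these made-up inputs, sites "a" and "b" end up in one cluster whose call days are summed, and "c" forms its own cluster because it lies outside the radius.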

Attributes

Parameters

  • calldays: flowmachine.features.subscriber.call_days.CallDays

    The call days table, which contains the call day data per subscriber. This table should follow the same format as the table produced by the CallDays class.

  • radius: typing.Union[float, str]

    The threshold value in km to be used for clustering towers. If a string is passed, it is assumed to be the name of a column in the call days table.

  • buffer: float, default 0

    The buffer radius size in km to be used for buffering the cluster centroid. If the cluster is formed by only one site, then the buffer radius size has no effect and the cluster centroid is buffered to the polygon representing the given site. If buffer is 0, only the cluster centroids are returned.

  • call_threshold: int, default 0

    The minimum number of calls that a cluster must have. Any cluster with fewer calls than this will be eliminated.

Examples

cd = CallDays('2016-01-01', '2016-01-04', spatial_unit=make_spatial_unit('versioned-site'))
har = HartiganCluster(calldays=cd, radius=2.5)
har.head()
      subscriber                                cluster  rank  calldays   site_id  version
038OVABN11Ak4W5P        POINT (82.60170958 29.81591927)     1         2  [m9jL23]      [0]
038OVABN11Ak4W5P    POINT (82.91428457000001 29.358975)     2         2  [QeBRM8]      [0]
038OVABN11Ak4W5P  POINT (81.63916106000001 28.21192983)     3         1  [nWM8R3]      [0]
038OVABN11Ak4W5P        POINT (87.26522455 27.58509554)     4         1  [zdNQx2]      [0]
038OVABN11Ak4W5P  POINT (80.86633861999999 28.70767038)     5         1  [pqg7ZE]      [0]
...

Methods

join_to_cluster_components

join_to_cluster_components(self, query)
Source: flowmachine/features/subscriber/hartigan_cluster.py

Join the versioned-sites composing the Hartigan cluster table with another table containing versioned-sites.

Parameters
  • query: flowmachine.Query

    A flowmachine.Query object. This represents a table that can be joined to the versioned-sites composing the Hartigan cluster table. It must have columns named 'site_id', 'version' and 'subscriber'. The remaining columns will be averaged over.

Examples
es = EventScore(start='2016-01-01', stop='2016-01-05',
                spatial_unit=make_spatial_unit('versioned-site'))
cd = CallDays(start='2016-01-01', stop='2016-01-04',
              spatial_unit=make_spatial_unit('versioned-site'))
har = HartiganCluster(calldays=cd, radius=50, call_threshold=1)
har.join_to_cluster_components(es).head(geom=['cluster'])
      subscriber                                      cluster  rank  calldays  score_hour  score_dow
038OVABN11Ak4W5P              POINT (87.26522455 27.58509554)     4         1   -1.000000   0.000000
038OVABN11Ak4W5P               POINT (86.00007467 27.2713931)     7         1    1.000000   0.000000
038OVABN11Ak4W5P              POINT (83.51373348 28.14524211)     8         1   -1.000000  -1.000000
038OVABN11Ak4W5P  POINT (82.97508908333334 29.28452965333333)     2         3   -0.666667  -0.666667
038OVABN11Ak4W5P        POINT (83.02805528499999 28.42765618)     6         2    0.000000  -0.500000
...
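
The join-and-average behaviour described for this method can be sketched in plain Python on in-memory rows. This is an illustration only, not how flowmachine performs the join; the function name `join_and_average`, the dict-based row layout, the use of an unweighted mean, and dropping clusters that have no matching rows are all assumptions.

```python
from collections import defaultdict
from statistics import mean

KEY_COLS = {"subscriber", "site_id", "version"}


def join_and_average(clusters, rows):
    """
    clusters: list of dicts with 'subscriber', 'rank', and parallel lists
              'site_id' and 'version' naming the sites in each cluster.
    rows:     list of dicts with 'subscriber', 'site_id', 'version' and any
              number of further numeric columns.
    Returns one dict per cluster with each remaining column averaged over
    the cluster's sites (hypothetical helper, unweighted mean assumed).
    """
    # Index the other table by its three join columns.
    lookup = {(r["subscriber"], r["site_id"], r["version"]): r for r in rows}
    result = []
    for c in clusters:
        collected = defaultdict(list)
        for site_id, version in zip(c["site_id"], c["version"]):
            row = lookup.get((c["subscriber"], site_id, version))
            if row is None:
                continue  # site has no matching row in the other table
            for col, value in row.items():
                if col not in KEY_COLS:
                    collected[col].append(value)
        if collected:
            averaged = {col: mean(values) for col, values in collected.items()}
            result.append(
                {"subscriber": c["subscriber"], "rank": c["rank"], **averaged}
            )
    return result


# Made-up data: one two-site cluster and one single-site cluster.
clusters = [
    {"subscriber": "s1", "rank": 1, "site_id": ["a", "b"], "version": [0, 0]},
    {"subscriber": "s1", "rank": 2, "site_id": ["c"], "version": [0]},
]
rows = [
    {"subscriber": "s1", "site_id": "a", "version": 0, "score_hour": -1.0, "score_dow": 0.0},
    {"subscriber": "s1", "site_id": "b", "version": 0, "score_hour": 1.0, "score_dow": -1.0},
    {"subscriber": "s1", "site_id": "c", "version": 0, "score_hour": 0.5, "score_dow": 0.5},
]
joined = join_and_average(clusters, rows)
```

Here the two-site cluster receives the mean of its two sites' scores, while the single-site cluster keeps that site's scores unchanged.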

cache

cache
Source: flowmachine/core/query.py

Returns
  • bool

    True if caching is switched on.

column_names

column_names
Source: flowmachine/features/subscriber/hartigan_cluster.py

Returns the column names.

Returns
  • typing.List[str]

    List of the column names of this query.

column_names_as_string_list

column_names_as_string_list
Source: flowmachine/core/query.py

Get the column names as a comma-separated list.

Returns
  • str

    Comma separated list of column names

dependencies

dependencies
Source: flowmachine/core/query.py

Returns
  • set

    The set of queries which this one is directly dependent on.

fully_qualified_table_name

fully_qualified_table_name
Source: flowmachine/core/query.py

Returns a unique fully qualified name for the query to be stored as under the cache schema, based on a hash of the parameters, class, and subqueries.

Returns
  • str

    String form of the table's fqn

index_cols

index_cols
Source: flowmachine/core/query.py

A list of columns to use as indexes when storing this query.

Returns
  • ixen: list

    By default, returns the location columns if they are present and self.spatial_unit is defined, and the subscriber column.

Examples
daily_location("2016-01-01").index_cols
[['name'], '"subscriber"']

is_stored

is_stored
Source: flowmachine/core/query.py

Returns
  • bool

    True if the table is stored, and False otherwise.

query_id

query_id
Source: flowmachine/core/query.py

Generate a uniquely identifying hash of this query, based on its parameters and the subqueries it is composed of.

Returns
  • str

    query_id hash string

query_state

query_state
Source: flowmachine/core/query.py

Return the current query state.

Returns
  • QueryState

    The current query state

query_state_str

query_state_str
Source: flowmachine/core/query.py

Return the current query state as a string

Returns
  • str

    The current query state. The possible values are the ones defined in flowmachine.core.query_state.QueryState.

table_name

table_name
Source: flowmachine/core/query.py

Returns a unique name for the query to be stored as, based on a hash of the parameters, class, and subqueries.

Returns
  • str

    String form of the table's fqn


  1. S. Isaacman et al., "Identifying Important Places in People's Lives from Cellular Network Data", International Conference on Pervasive Computing (2011), pp 133-151.