flowmachine.features.subscriber.hartigan_cluster¶

Source: flowmachine/features/subscriber/hartigan_cluster.py

Classes that deal with location clustering using different methods. These classes reduce the number of locations for each subscriber by clustering a set of locations according to a specified methodology, which reduces the dimensionality of the problem at hand.

Class BaseCluster¶

BaseCluster(cache=True)

Source: flowmachine/features/subscriber/hartigan_cluster.py

Base query for cluster methods, providing a geo augmented query method.

Attributes¶

cache
column_names
column_names_as_string_list
dependencies
fully_qualified_table_name
index_cols
is_stored
query_id
query_state
query_state_str
table_name

Methods¶

cache¶

cache

Source: flowmachine/core/query.py

Returns¶

bool

True if caching is switched on.

column_names¶

column_names

Source: flowmachine/core/query.py

Returns the column names.

Returns¶

typing.List[str]

List of the column names of this query.

column_names_as_string_list¶

column_names_as_string_list

Source: flowmachine/core/query.py

Get the column names as a comma separated list

Returns¶

str

Comma separated list of column names

dependencies¶

dependencies

Source: flowmachine/core/query.py

Returns¶

set

The set of queries which this one is directly dependent on.

fully_qualified_table_name¶

fully_qualified_table_name

Source: flowmachine/core/query.py

Returns a unique fully qualified name for the query to be stored as under the cache schema, based on a hash of the parameters, class, and subqueries.

Returns¶

str

String form of the table's fqn

index_cols¶

index_cols

Source: flowmachine/core/query.py

A list of columns to use as indexes when storing this query.

Returns¶

ixen: list

By default, returns the location columns if they are present and self.spatial_unit is defined, and the subscriber column.

Examples¶

daily_location("2016-01-01").index_cols
[['name'], '"subscriber"']

is_stored¶

is_stored

Source: flowmachine/core/query.py

Returns¶

bool

True if the table is stored, and False otherwise.

query_id¶

query_id

Source: flowmachine/core/query.py

Generate a uniquely identifying hash of this query, based on the parameters of it and the subqueries it is composed of.

Returns¶

str

query_id hash string

query_state¶

query_state

Source: flowmachine/core/query.py

Return the current query state.

Returns¶

QueryState

The current query state

query_state_str¶

query_state_str

Source: flowmachine/core/query.py

Return the current query state as a string

Returns¶

str

The current query state. The possible values are the ones defined in flowmachine.core.query_state.QueryState.

table_name¶

table_name

Source: flowmachine/core/query.py

Returns a unique name for the query to be stored as, based on a hash of the parameters, class, and subqueries.

Returns¶

str

String form of the table's fqn

Class HartiganCluster¶

HartiganCluster(*, calldays: flowmachine.features.subscriber.call_days.CallDays, radius: Union[float, str], buffer: float = 0, call_threshold: int = 0)

Source: flowmachine/features/subscriber/hartigan_cluster.py

Implements the Hartigan clustering algorithm, which clusters locations based on a ranked list of call days. ¹

The algorithm starts with the site that has the most call days and forms a cluster on that site. It then descends the ranked list of call days from the top. The site with the second-highest number of call days is incorporated into the first cluster if it lies within a certain radius of it. In that case, a new cluster centroid is calculated as the average of all the sites in the cluster, weighted by call days. When the algorithm reaches a site that is not within the radius of any existing cluster, it creates a new cluster on that site.

The class produces a table in which each row lists a cluster for a given subscriber, the rank of that cluster, the total number of call days in that cluster (the sum of the call days of the sites constituting the cluster, rather than the actual number of call days for the cluster as a whole), and all the sites that constitute that cluster.
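The procedure described above can be sketched in plain Python for a single subscriber. This is a minimal illustration, not the flowmachine implementation (which runs in the database): the function names, the tuple layout, the use of haversine distance, the rule that a site joins the first cluster it falls within, and ranking clusters by creation order are all assumptions made for the example.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical input layout: (site_id, lon, lat, call_days) for one subscriber.
Site = Tuple[str, float, float, int]


def haversine_km(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in km between two lon/lat points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def hartigan_sketch(sites: List[Site], radius_km: float) -> List[Dict]:
    """Cluster one subscriber's sites, visiting them from most to fewest call days."""
    clusters: List[Dict] = []
    for site_id, lon, lat, days in sorted(sites, key=lambda s: s[3], reverse=True):
        for c in clusters:
            # Assumption: the site joins the first existing cluster within the radius.
            if haversine_km(lon, lat, c["lon"], c["lat"]) <= radius_km:
                total = c["calldays"] + days
                # Centroid becomes the call-day-weighted average of member sites
                # (a simple average of coordinates, adequate for small radii).
                c["lon"] = (c["lon"] * c["calldays"] + lon * days) / total
                c["lat"] = (c["lat"] * c["calldays"] + lat * days) / total
                c["calldays"] = total
                c["site_ids"].append(site_id)
                break
        else:
            # Not within the radius of any existing cluster: start a new one here.
            clusters.append({"lon": lon, "lat": lat, "calldays": days, "site_ids": [site_id]})
    # Assumption: rank reflects the order in which clusters were created.
    for rank, c in enumerate(clusters, start=1):
        c["rank"] = rank
    return clusters


example_sites = [("a", 0.0, 0.0, 3), ("b", 0.01, 0.0, 1), ("c", 10.0, 10.0, 2)]
example_clusters = hartigan_sketch(example_sites, radius_km=5.0)
```

In this example, site "b" lies roughly 1 km from site "a" and is merged into its cluster, while site "c" is far away and starts a second cluster.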

Attributes¶

cache
column_names
column_names_as_string_list
dependencies
fully_qualified_table_name
index_cols
is_stored
query_id
query_state
query_state_str
table_name

Parameters¶

calldays: flowmachine.features.subscriber.call_days.CallDays

The call days table, which contains the call day data per subscriber. This table should follow the same format as the table produced with the CallDays class.
radius: typing.Union[float, str]

The threshold value in km to be used for clustering towers. If a string is passed, it is taken to be the name of a column in the call days table.
buffer: float, default 0

The buffer radius size in km to be used for buffering the cluster centroid. If the cluster is formed by only one site, then the buffer radius size has no effect and the cluster centroid is buffered to the polygon representing the given site. If buffer is 0, only the cluster centroids are returned.
call_threshold: int, default 0

The minimum number of calls that a cluster must have. Any cluster with fewer calls than this will be eliminated.

Examples¶

cd = CallDays('2016-01-01', '2016-01-04', spatial_unit=make_spatial_unit('versioned-site'))

har = HartiganCluster(calldays=cd, radius=2.5)

har.head()
            subscriber                                cluster  rank  calldays
038OVABN11Ak4W5P        POINT (82.60170958 29.81591927)     1         2
038OVABN11Ak4W5P    POINT (82.91428457000001 29.358975)     2         2
038OVABN11Ak4W5P  POINT (81.63916106000001 28.21192983)     3         1
038OVABN11Ak4W5P        POINT (87.26522455 27.58509554)     4         1
038OVABN11Ak4W5P  POINT (80.86633861999999 28.70767038)     5         1
...
site_id version
   [m9jL23]     [0]
   [QeBRM8]     [0]
   [nWM8R3]     [0]
   [zdNQx2]     [0]
   [pqg7ZE]     [0]
...

Methods¶

join_to_cluster_components¶

join_to_cluster_components(self, query)

Source: flowmachine/features/subscriber/hartigan_cluster.py

Join the versioned-sites composing the Hartigan cluster table with another table containing versioned-sites.

Parameters¶

query: flowmachine.Query

A flowmachine.Query object. This represents a table that can be joined to the versioned-sites composing the Hartigan cluster table. This must have a column called 'site_id', another column called 'version' and another called 'subscriber'. The remaining columns will be averaged over.

Examples¶

es = EventScore(start='2016-01-01', stop='2016-01-05',
                spatial_unit=make_spatial_unit('versioned-site'))

cd = CallDays(start='2016-01-01', stop='2016-01-04',
              spatial_unit=make_spatial_unit('versioned-site'))

har = HartiganCluster(calldays=cd, radius=50, call_threshold=1)

har.join_to_cluster_components(es).head(geom=['cluster'])
            subscriber                                      cluster  rank
038OVABN11Ak4W5P              POINT (87.26522455 27.58509554)     4
038OVABN11Ak4W5P               POINT (86.00007467 27.2713931)     7
038OVABN11Ak4W5P              POINT (83.51373348 28.14524211)     8
038OVABN11Ak4W5P  POINT (82.97508908333334 29.28452965333333)     2
038OVABN11Ak4W5P        POINT (83.02805528499999 28.42765618)     6
...
calldays  score_hour  score_dow
       1   -1.000000   0.000000
       1    1.000000   0.000000
       1   -1.000000  -1.000000
       3   -0.666667  -0.666667
       2    0.000000  -0.500000
...
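The join described above can be illustrated with a small, self-contained Python sketch. It is not the flowmachine implementation: the function name and the dictionary layout are hypothetical, and it assumes an inner join (clusters with no matching component are dropped) and an unweighted mean over the remaining columns.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def join_to_components_sketch(clusters: List[Dict], rows: List[Dict]) -> List[Dict]:
    """Average the non-key columns of `rows` over the sites making up each cluster."""
    key_cols = {"subscriber", "site_id", "version"}
    # Index the other table by (subscriber, site_id, version).
    lookup = {(r["subscriber"], r["site_id"], r["version"]): r for r in rows}
    result = []
    for c in clusters:
        collected = defaultdict(list)
        # Each cluster carries parallel lists of the site ids and versions it contains.
        for site_id, version in zip(c["site_id"], c["version"]):
            row = lookup.get((c["subscriber"], site_id, version))
            if row is None:
                continue
            for col, value in row.items():
                if col not in key_cols:
                    collected[col].append(value)
        if collected:
            # Keep the cluster-level columns and add one averaged value per joined column.
            out = {k: v for k, v in c.items() if k not in ("site_id", "version")}
            out.update({col: mean(values) for col, values in collected.items()})
            result.append(out)
    return result


example_clusters = [
    {"subscriber": "s1", "rank": 1, "calldays": 3, "site_id": ["a", "b"], "version": [0, 0]},
    {"subscriber": "s1", "rank": 2, "calldays": 1, "site_id": ["c"], "version": [0]},
]
example_scores = [
    {"subscriber": "s1", "site_id": "a", "version": 0, "score_hour": 1.0, "score_dow": 0.0},
    {"subscriber": "s1", "site_id": "b", "version": 0, "score_hour": -1.0, "score_dow": -1.0},
]
joined = join_to_components_sketch(example_clusters, example_scores)
```

Here the first cluster receives the mean of the scores of sites "a" and "b", and the second cluster is dropped because site "c" has no row in the scores table.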

cache¶

cache

Source: flowmachine/core/query.py

Returns¶

bool

True if caching is switched on.

column_names¶

column_names

Source: flowmachine/features/subscriber/hartigan_cluster.py

Returns the column names.

Returns¶

typing.List[str]

List of the column names of this query.

column_names_as_string_list¶

column_names_as_string_list

Source: flowmachine/core/query.py

Get the column names as a comma separated list

Returns¶

str

Comma separated list of column names

dependencies¶

dependencies

Source: flowmachine/core/query.py

Returns¶

set

The set of queries which this one is directly dependent on.

fully_qualified_table_name¶

fully_qualified_table_name

Source: flowmachine/core/query.py

Returns a unique fully qualified name for the query to be stored as under the cache schema, based on a hash of the parameters, class, and subqueries.

Returns¶

str

String form of the table's fqn

index_cols¶

index_cols

Source: flowmachine/core/query.py

A list of columns to use as indexes when storing this query.

Returns¶

ixen: list

By default, returns the location columns if they are present and self.spatial_unit is defined, and the subscriber column.

Examples¶

daily_location("2016-01-01").index_cols
[['name'], '"subscriber"']

is_stored¶

is_stored

Source: flowmachine/core/query.py

Returns¶

bool

True if the table is stored, and False otherwise.

query_id¶

query_id

Source: flowmachine/core/query.py

Generate a uniquely identifying hash of this query, based on the parameters of it and the subqueries it is composed of.

Returns¶

str

query_id hash string

query_state¶

query_state

Source: flowmachine/core/query.py

Return the current query state.

Returns¶

QueryState

The current query state

query_state_str¶

query_state_str

Source: flowmachine/core/query.py

Return the current query state as a string

Returns¶

str

The current query state. The possible values are the ones defined in flowmachine.core.query_state.QueryState.

table_name¶

table_name

Source: flowmachine/core/query.py

Returns a unique name for the query to be stored as, based on a hash of the parameters, class, and subqueries.

Returns¶

str

String form of the table's fqn

¹ S. Isaacman et al., "Identifying Important Places in People's Lives from Cellular Network Data", International Conference on Pervasive Computing (2011), pp. 133-151.