flowmachine.features.subscriber.hartigan_cluster¶
Source: flowmachine/features/subscriber/hartigan_cluster.py
Classes for clustering locations using different methods. These classes attempt to reduce the number of locations for each subscriber by clustering a set of locations according to a specified methodology, which reduces the dimensionality of the problem at hand.
Class BaseCluster¶
BaseCluster(cache=True)
Base query for cluster methods, providing a geo augmented query method.
Attributes¶
Methods¶
cache¶
cache
Returns¶
-
bool
True if caching is switched on.
column_names¶
column_names
Returns the column names.
Returns¶
-
typing.List[str]
List of the column names of this query.
column_names_as_string_list¶
column_names_as_string_list
Get the column names as a comma separated list
Returns¶
-
str
Comma separated list of column names
dependencies¶
dependencies
Returns¶
-
set
The set of queries which this one is directly dependent on.
fully_qualified_table_name¶
fully_qualified_table_name
Returns a unique fully qualified name for the query to be stored as under the cache schema, based on a hash of the parameters, class, and subqueries.
Returns¶
-
str
String form of the table's fqn
index_cols¶
index_cols
A list of columns to use as indexes when storing this query.
Returns¶
-
ixen
:list
By default, returns the location columns if they are present and self.spatial_unit is defined, and the subscriber column.
Examples¶
daily_location("2016-01-01").index_cols
[['name'], '"subscriber"']
is_stored¶
is_stored
Returns¶
-
bool
True if the table is stored, and False otherwise.
query_id¶
query_id
Generate a uniquely identifying hash of this query, based on the parameters of it and the subqueries it is composed of.
Returns¶
-
str
query_id hash string
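As an illustration of how a hash can identify a query from its parameters and subqueries, here is a minimal sketch. It is not flowmachine's actual hashing scheme, which this page does not describe. The function name query_id_sketch, its arguments, and the choice of MD5 over a JSON payload are all assumptions made for the example.

```python
import hashlib
import json


def query_id_sketch(class_name, params, dependency_ids):
    """Hypothetical helper: derive a stable id from a query's class name,
    its parameters, and the ids of the subqueries it depends on."""
    payload = json.dumps(
        {
            "class": class_name,
            "params": params,
            # Sort so the id does not depend on the order dependencies are listed in.
            "deps": sorted(dependency_ids),
        },
        sort_keys=True,
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


id_a = query_id_sketch("HartiganCluster", {"radius": 2.5, "buffer": 0}, ["abc", "def"])
id_b = query_id_sketch("HartiganCluster", {"buffer": 0, "radius": 2.5}, ["def", "abc"])
id_c = query_id_sketch("HartiganCluster", {"radius": 50, "buffer": 0}, ["abc", "def"])
```

In this sketch, id_a and id_b are equal because only the ordering of the inputs differs, while id_c differs because a parameter value changed.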
query_state¶
query_state
Return the current query state.
Returns¶
-
QueryState
The current query state
query_state_str¶
query_state_str
Return the current query state as a string
Returns¶
-
str
The current query state. The possible values are the ones defined in flowmachine.core.query_state.QueryState.
table_name¶
table_name
Returns a unique name for the query to be stored as, based on a hash of the parameters, class, and subqueries.
Returns¶
-
str
String form of the table's fqn
Class HartiganCluster¶
HartiganCluster(*, calldays: flowmachine.features.subscriber.call_days.CallDays, radius: Union[float, str], buffer: float = 0, call_threshold: int = 0)
Implements the Hartigan clustering algorithm, which clusters locations based on a ranked list of call days. [1] The algorithm first picks the site with the most call days and forms a cluster on that site. It then descends the ranked list of call days from the top. The site with the second-highest number of call days is incorporated into the first cluster if it lies within a certain radius of it. In that case, a new cluster centroid is calculated as the average of all the sites in the cluster, weighted by their call days. When the algorithm reaches a site that is not within the radius of any existing cluster, it creates a new cluster on that site.
The class produces a table where each row lists a cluster for a given subscriber, the rank of that cluster, the total number of call days in that cluster (the sum of the call days of the sites constituting the cluster, rather than the actual call days of the cluster as a whole), and all the sites that constitute the cluster.
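The clustering procedure described above can be sketched in plain Python for a single subscriber. This is an illustrative sketch only, not the flowmachine implementation. The names haversine_km and hartigan_sketch are hypothetical. Assigning a site to the nearest cluster when several are within the radius is an assumption, as is averaging longitude and latitude directly to obtain the centroid.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def hartigan_sketch(sites, radius_km):
    """Greedy clustering of one subscriber's sites.

    sites: list of (site_id, (lon, lat), calldays) tuples.
    Returns clusters in creation order, each a dict with
    'centroid', 'calldays' and 'site_ids'.
    """
    clusters = []
    # Walk the sites from most to fewest call days.
    for site_id, point, days in sorted(sites, key=lambda s: -s[2]):
        # Assumption: when several clusters are in range, use the nearest one.
        nearest = min(
            clusters, key=lambda c: haversine_km(c["centroid"], point), default=None
        )
        if nearest is not None and haversine_km(nearest["centroid"], point) <= radius_km:
            total = nearest["calldays"] + days
            # New centroid: average of the sites in the cluster, weighted by call days.
            nearest["centroid"] = tuple(
                (c * nearest["calldays"] + p * days) / total
                for c, p in zip(nearest["centroid"], point)
            )
            nearest["calldays"] = total
            nearest["site_ids"].append(site_id)
        else:
            # No cluster within the radius: start a new one on this site.
            clusters.append({"centroid": point, "calldays": days, "site_ids": [site_id]})
    return clusters


# Made-up sites: A and B are roughly 1 km apart, C is far away.
sites = [
    ("A", (85.00, 27.00), 5),
    ("B", (85.01, 27.00), 3),
    ("C", (86.00, 28.00), 2),
]
clusters = hartigan_sketch(sites, radius_km=2.5)
```

With these made-up inputs, sites A and B merge into one cluster with 8 call days in total, and site C forms a second cluster on its own.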
Attributes¶
Parameters¶
-
calldays
:flowmachine.features.subscriber.call_days.CallDays
The call days table, which contains the call day data per subscriber. This table should follow the same format as the table produced by the CallDays class.
-
radius
:typing.Union[float, str]
The threshold value in km to be used for clustering towers. If a string is passed, it is assumed that it is the name of the column in the call day table.
-
buffer
:float
, default 0
The buffer radius size in km to be used for buffering the cluster centroid. If the cluster is formed by only one site, then the buffer radius size has no effect and the cluster centroid is buffered to the polygon representing the given site. If buffer is 0, only the cluster centroids are returned.
-
call_threshold
:int
, default 0
The minimum number of calls that a cluster must have. Any cluster with fewer calls than this will be eliminated.
Examples¶
cd = CallDays( '2016-01-01', '2016-01-04', spatial_unit=make_spatial_unit('versioned-site'))
har = HartiganCluster(calldays=cd, radius=2.5)
har.head()
subscriber cluster rank calldays
038OVABN11Ak4W5P POINT (82.60170958 29.81591927) 1 2
038OVABN11Ak4W5P POINT (82.91428457000001 29.358975) 2 2
038OVABN11Ak4W5P POINT (81.63916106000001 28.21192983) 3 1
038OVABN11Ak4W5P POINT (87.26522455 27.58509554) 4 1
038OVABN11Ak4W5P POINT (80.86633861999999 28.70767038) 5 1
...
site_id version
[m9jL23] [0]
[QeBRM8] [0]
[nWM8R3] [0]
[zdNQx2] [0]
[pqg7ZE] [0]
...
Methods¶
join_to_cluster_components¶
join_to_cluster_components(self, query)
Join the versioned-sites composing the Hartigan cluster table with another table containing versioned-sites.
Parameters¶
-
query
:flowmachine.Query
A flowmachine.Query object. This represents a table that can be joined to the versioned-sites composing the Hartigan cluster table. This must have a column called 'site_id', another column called 'version' and another called 'subscriber'. The remaining columns will be averaged over.
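To make the "averaged over" behaviour concrete, here is a minimal plain-Python sketch of joining per-site rows to the sites of each cluster and averaging the remaining columns. It is not the flowmachine implementation. The function name join_to_components_sketch, the dict-based rows, the unweighted mean, and dropping clusters with no matching sites are all assumptions made for illustration.

```python
from statistics import mean


def join_to_components_sketch(clusters, site_rows):
    """Average the extra columns of site_rows over the sites in each cluster.

    clusters: list of dicts with 'subscriber', 'rank', 'site_id' (list)
        and 'version' (list).
    site_rows: list of dicts with 'subscriber', 'site_id', 'version'
        plus numeric columns.
    """
    key_cols = ("subscriber", "site_id", "version")
    lookup = {tuple(r[k] for k in key_cols): r for r in site_rows}
    out = []
    for c in clusters:
        # Collect the rows matching each versioned site in this cluster.
        matched = [
            lookup[(c["subscriber"], s, v)]
            for s, v in zip(c["site_id"], c["version"])
            if (c["subscriber"], s, v) in lookup
        ]
        if not matched:
            continue  # Assumption: behave like an inner join.
        value_cols = [k for k in matched[0] if k not in key_cols]
        row = {"subscriber": c["subscriber"], "rank": c["rank"]}
        # Assumption: a plain unweighted mean over the cluster's sites.
        row.update({col: mean(m[col] for m in matched) for col in value_cols})
        out.append(row)
    return out


# Made-up data: one cluster of two sites, each with two score columns.
clusters = [{"subscriber": "s1", "rank": 1, "site_id": ["A", "B"], "version": [0, 0]}]
scores = [
    {"subscriber": "s1", "site_id": "A", "version": 0, "score_hour": -1.0, "score_dow": 0.0},
    {"subscriber": "s1", "site_id": "B", "version": 0, "score_hour": 0.0, "score_dow": -1.0},
]
joined = join_to_components_sketch(clusters, scores)
```

Here the single cluster ends up with score_hour and score_dow both equal to -0.5, the mean of the values at its two sites.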
Examples¶
es = EventScore(start='2016-01-01', stop='2016-01-05',
spatial_unit=make_spatial_unit('versioned-site'))
cd = CallDays(start='2016-01-01', stop='2016-01-04',
spatial_unit=make_spatial_unit('versioned-site'))
har = HartiganCluster(calldays=cd, radius=50, call_threshold=1)
har.join_to_cluster_components(es).head(geom=['cluster'])
subscriber cluster rank
038OVABN11Ak4W5P POINT (87.26522455 27.58509554) 4
038OVABN11Ak4W5P POINT (86.00007467 27.2713931) 7
038OVABN11Ak4W5P POINT (83.51373348 28.14524211) 8
038OVABN11Ak4W5P POINT (82.97508908333334 29.28452965333333) 2
038OVABN11Ak4W5P POINT (83.02805528499999 28.42765618) 6
...
calldays score_hour score_dow
1 -1.000000 0.000000
1 1.000000 0.000000
1 -1.000000 -1.000000
3 -0.666667 -0.666667
2 0.000000 -0.500000
...
cache¶
cache
Returns¶
-
bool
True if caching is switched on.
column_names¶
column_names
Returns the column names.
Returns¶
-
typing.List[str]
List of the column names of this query.
column_names_as_string_list¶
column_names_as_string_list
Get the column names as a comma separated list
Returns¶
-
str
Comma separated list of column names
dependencies¶
dependencies
Returns¶
-
set
The set of queries which this one is directly dependent on.
fully_qualified_table_name¶
fully_qualified_table_name
Returns a unique fully qualified name for the query to be stored as under the cache schema, based on a hash of the parameters, class, and subqueries.
Returns¶
-
str
String form of the table's fqn
index_cols¶
index_cols
A list of columns to use as indexes when storing this query.
Returns¶
-
ixen
:list
By default, returns the location columns if they are present and self.spatial_unit is defined, and the subscriber column.
Examples¶
daily_location("2016-01-01").index_cols
[['name'], '"subscriber"']
is_stored¶
is_stored
Returns¶
-
bool
True if the table is stored, and False otherwise.
query_id¶
query_id
Generate a uniquely identifying hash of this query, based on the parameters of it and the subqueries it is composed of.
Returns¶
-
str
query_id hash string
query_state¶
query_state
Return the current query state.
Returns¶
-
QueryState
The current query state
query_state_str¶
query_state_str
Return the current query state as a string
Returns¶
-
str
The current query state. The possible values are the ones defined in flowmachine.core.query_state.QueryState.
table_name¶
table_name
Returns a unique name for the query to be stored as, based on a hash of the parameters, class, and subqueries.
Returns¶
-
str
String form of the table's fqn
[1] S. Isaacman et al., "Identifying Important Places in People's Lives from Cellular Network Data", International Conference on Pervasive Computing (2011), pp. 133-151.