flowmachine.features.spatial.location_cluster¶
Source: flowmachine/features/spatial/location_cluster.py
Methods for clustering point collections. The methods available are designed to work with infrastructure elements, but can be applied to any other point collection.
Class LocationCluster¶
LocationCluster(
point_collection="sites",
location_identifier="id",
geometry_identifier="geom_point",
method="kmeans",
distance_tolerance=1,
density_tolerance=5,
number_of_clusters=5,
date=None,
aggregate=False,
return_no_cluster=True,
)
Class for computing clusters of points using different algorithms. This class was designed to work with infrastructure elements (i.e. towers/sites), but can also be used with any other point collection, as long as that collection is a table in the database. This class currently implements three methods: K-means, DBSCAN, and Area.
K-means is a clustering algorithm that groups points based on each point's distance to a point representing the centroid of its cluster. The algorithm has two steps: (a) point allocation and (b) centroid recalculation. In (a) it allocates each point to the centroid it is closest to. In (b) it moves each centroid to the mean location of its members. The process repeats until (b) no longer moves the centroids; the resulting partition corresponds to a Voronoi tessellation around the final centroids. For more information, refer to the Wikipedia entry on K-means clustering:
* https://en.wikipedia.org/wiki/K-means_clustering
The following resource is also very informative:
* https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that uses the maximum distance between points (denoted by ε) as the inclusion criterion for a cluster. A cluster must contain a minimum number of members (denoted by density) to be considered valid; otherwise no cluster is assigned to those members. If any member of a cluster lies within distance ε of an outside point, that point is added to the cluster. This process continues until all points have been evaluated. The scientific reference for this algorithm is:
Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Xu, Xiaowei (1996). Simoudis, Evangelos; Han, Jiawei; Fayyad, Usama M., eds. "A density-based algorithm for discovering clusters in large spatial databases with noise". Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. pp. 226–231.
Available at: http://www.lsi.upc.edu/~bejar/amlt/material_art/DM%20clustring%20DBSCAN%20kdd-96.pdf
This reference is also useful for understanding how the algorithm works:
* https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
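The two K-means steps described above can be illustrated with a small, self-contained sketch. This is not the SQL that the class runs in the database; it is a plain-Python illustration on planar coordinates, and the function name `kmeans`, the `max_iterations` cap, and the random seeding are assumptions made for the example.

```python
import math
import random


def kmeans(points, number_of_clusters, max_iterations=100, seed=0):
    """Illustrative K-means on 2D tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: start from randomly chosen input points.
    centroids = rng.sample(points, number_of_clusters)
    labels = [None] * len(points)
    for _ in range(max_iterations):
        # (a) Point allocation: each point joins its closest centroid.
        labels = [
            min(range(number_of_clusters),
                key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # (b) Centroid recalculation: move to the mean of the members.
        new_centroids = []
        for k in range(number_of_clusters):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[k])
        if new_centroids == centroids:
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, number_of_clusters=2)
```

With two well-separated groups as in the usage lines, the three points near the origin end up sharing one label and the three points near (10, 10) share the other.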
Attributes¶
Parameters¶
-
point_collection
:str
, default'sites'
Table to take the point collection from. This parameter may accept dataframes with geographic data in the future.
-
location_identifier
:str
, default'id'
Location identifier from the point table to use. This identifier must be unique to each location.
-
geometry_identifier
:str
, default'geom_point'
Geometry column to use in computations.
-
method
:str
, default'kmeans'
Method to use in clustering. See each method's reference for details of how the algorithm works. Current implementations are:
* kmeans: Uses a K-means algorithm to select clusters. This method requires the parameter number_of_clusters.
* dbscan: Uses the DBSCAN algorithm to select clusters. This method requires the parameters distance_tolerance and density_tolerance.
* area: Clusters points that are within a certain area. Similar to DBSCAN, but without a density value; that is, clusters can be of any size. This method requires the parameter distance_tolerance.
-
distance_tolerance
:int
,float
, default1
Radius of the area in km. The radius is approximated in degrees using a WGS84 meter-to-degree conversion of (distance_tolerance * 1000) / 111195. This can include a maximum error of ~0.1%.
-
density_tolerance
:int
, default5
Minimum number of members that a cluster must have in order to exist. If the members of a candidate cluster do not meet this criterion, they will not be assigned a cluster. See the
return_no_cluster
parameter.
-
number_of_clusters
:int
, default5
Number of clusters to create with the K-means algorithm.
-
date
:str
, defaultNone
If the
point_collection
is either 'sites' or 'cells', use this parameter to determine which version of those infrastructure elements to use. If the default None is used, the current date will be used.
-
aggregate
:bool
, defaultFalse
If True, a dataframe will be returned with a convex hull geometry per cluster id alongside its centroid. This can be used in conjunction with LocationArea() to create new area representations.
-
return_no_cluster
:bool
, defaultTrue
If True, results will include members that have not been assigned to a cluster. If this parameter is used in conjunction with the
aggregate
parameter, elements with no cluster will be ignored. We do not recommend using this in conjunction with that parameter.
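The kilometre-to-degree conversion quoted for distance_tolerance can be written out as a short sketch. The constant 111195 and the formula come from the parameter description above; the helper name `km_to_degrees` is hypothetical and is not part of flowmachine.

```python
# Meters per degree, as quoted in the distance_tolerance description.
METERS_PER_DEGREE = 111195


def km_to_degrees(distance_tolerance):
    """Convert a radius in km to an approximate radius in WGS84 degrees."""
    return (distance_tolerance * 1000) / METERS_PER_DEGREE


default_radius = km_to_degrees(1)  # the default distance_tolerance of 1 km
```

For the default tolerance of 1 km this gives a radius of just under 0.009 degrees.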
Note
The DBSCAN implementation was originally sourced from the website of Dan Baston (implementer of the method in PostGIS); the K-means implementation is derived from the DBSCAN implementation:
* http://www.danbaston.com/posts/2016/06/02/dbscan-clustering-in-postgis.html
The Area method code was originally drawn from this GIS Stack Exchange page:
* https://gis.stackexchange.com/questions/11567/spatial-clustering-with-postgis
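To make the roles of distance_tolerance (ε) and density_tolerance concrete, here is a minimal plain-Python sketch of the textbook core-point formulation of DBSCAN. It is not the PostGIS-based code referenced in the note, it works on planar coordinates rather than geographic ones, and the function name `dbscan` and its return convention (None for unclustered points) are assumptions made for the example.

```python
import math


def dbscan(points, distance_tolerance, density_tolerance):
    """Return one cluster id per point, or None if the point is unclustered."""
    labels = [None] * len(points)
    visited = set()

    def region(i):
        # Indices of all points within distance_tolerance of point i
        # (the point itself is included).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= distance_tolerance]

    cluster_id = 0
    for i in range(len(points)):
        if i in visited:
            continue
        visited.add(i)
        seeds = region(i)
        if len(seeds) < density_tolerance:
            continue  # too sparse to start a cluster; may be claimed later
        labels[i] = cluster_id
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] is None:
                labels[j] = cluster_id  # point joins the growing cluster
            if j in visited:
                continue
            visited.add(j)
            neighbours = region(j)
            if len(neighbours) >= density_tolerance:
                queue.extend(neighbours)  # dense enough to keep expanding
        cluster_id += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (50, 50)]
labels = dbscan(points, distance_tolerance=1.5, density_tolerance=3)
```

In the usage lines, the four points around the origin form one cluster, while the pair near (10, 10) and the isolated point at (50, 50) fall short of the density requirement and stay unassigned, which is the situation the return_no_cluster parameter addresses.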
Methods¶
cache¶
cache
Returns¶
-
bool
True if caching is switched on.
column_names¶
column_names
Returns the column names.
Returns¶
-
typing.List
List of the column names of this query.
column_names_as_string_list¶
column_names_as_string_list
Get the column names as a comma-separated list.
Returns¶
-
str
Comma separated list of column names
dependencies¶
dependencies
Returns¶
-
set
The set of queries which this one is directly dependent on.
fully_qualified_table_name¶
fully_qualified_table_name
Returns a unique fully qualified name for the query to be stored as under the cache schema, based on a hash of the parameters, class, and subqueries.
Returns¶
-
str
String form of the table's fqn
index_cols¶
index_cols
A list of columns to use as indexes when storing this query.
Returns¶
-
ixen
:list
By default, returns the location columns if they are present and self.spatial_unit is defined, and the subscriber column.
Examples¶
daily_location("2016-01-01").index_cols
[['name'], '"subscriber"']
is_stored¶
is_stored
Returns¶
-
bool
True if the table is stored, and False otherwise.
query_id¶
query_id
Generate a uniquely identifying hash of this query, based on the parameters of it and the subqueries it is composed of.
Returns¶
-
str
query_id hash string
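The idea of deriving a stable identifier from a query's class, parameters, and subqueries can be sketched as follows. This is an illustration only and not flowmachine's actual hashing scheme: the function names, the use of JSON serialisation with MD5, the `x` prefix, and the default schema name `cache` are all assumptions made for the example.

```python
import hashlib
import json


def make_query_id(class_name, parameters, subquery_ids=()):
    """Hash the class name, parameters, and subquery ids into a hex string."""
    payload = json.dumps(
        {
            "class": class_name,
            "parameters": parameters,
            "subqueries": sorted(subquery_ids),
        },
        sort_keys=True,  # make the hash independent of key order
        default=str,     # fall back to str() for non-JSON values
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def make_fully_qualified_table_name(query_id, schema="cache"):
    # Hypothetical naming: a letter prefix keeps the name from starting
    # with a digit.
    return f"{schema}.x{query_id}"


qid = make_query_id(
    "LocationCluster", {"method": "kmeans", "number_of_clusters": 5}
)
table = make_fully_qualified_table_name(qid)
```

Because the payload is serialised with sorted keys, two queries built with the same parameters in a different order map to the same identifier, while any change to a parameter produces a different one.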
query_state¶
query_state
Return the current query state.
Returns¶
-
QueryState
The current query state
query_state_str¶
query_state_str
Return the current query state as a string
Returns¶
-
str
The current query state. The possible values are the ones defined in
flowmachine.core.query_state.QueryState
.
table_name¶
table_name
Returns a unique name for the query to be stored as, based on a hash of the parameters, class, and subqueries.
Returns¶
-
str
String form of the table's fqn