flowmachine.features.spatial.location_cluster

Source: flowmachine/features/spatial/location_cluster.py

Methods for clustering point collections. The methods available are designed to work with infrastructure elements, but can be applied to any other point collection.

Class LocationCluster

LocationCluster(
    point_collection="sites",
    location_identifier="id",
    geometry_identifier="geom_point",
    method="kmeans",
    distance_tolerance=1,
    density_tolerance=5,
    number_of_clusters=5,
    date=None,
    aggregate=False,
    return_no_cluster=True,
)
Source: flowmachine/features/spatial/location_cluster.py

Class for computing clusters of points using different algorithms. This class was designed to work with infrastructure elements (i.e. towers/sites), but can also be used with any other point collection as long as that is a table in the database. This class currently implements three methods: K-means, DBSCAN, and Area.

K-means is a clustering algorithm that groups points based on each point's distance to a point representing the centroid of the cluster. The algorithm has two steps: (a) point allocation and (b) centroid re-calculation. In (a) it allocates each point to the centroid it is closest to. In (b) it moves each centroid to the mean location of all its members. The process continues until (b) no longer moves the centroids, resulting in a Voronoi tessellation. For more information, refer to the Wikipedia entry on K-means clustering:

* https://en.wikipedia.org/wiki/K-means_clustering

The following resource is also very informative:

* https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that uses the maximum distance between points (denoted by ε) as the inclusion criterion for a cluster. A cluster must contain a minimum number of members (denoted by density) to be considered valid; otherwise no cluster is assigned to a given member. If any member of a given cluster is within distance ε of an outside point, that point is subsequently included in the cluster. This process repeats until all points have been evaluated. The scientific reference for this algorithm is:

Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Xu, Xiaowei (1996). Simoudis, Evangelos; Han, Jiawei; Fayyad, Usama M., eds. "A density-based algorithm for discovering clusters in large spatial databases with noise". Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. pp. 226–231. Available at: http://www.lsi.upc.edu/~bejar/amlt/material_art/DM%20clustring%20DBSCAN%20kdd-96.pdf

This reference is also useful for understanding how the algorithm works:

* https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
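The two K-means steps described above can be sketched in plain Python. This is an illustration of the general algorithm only, not the code this class runs in the database: the function name `kmeans`, the `max_iterations` cap, the use of Euclidean distance, and the choice of the first `number_of_clusters` points as initial centroids are all assumptions made for the example.

```python
import math


def kmeans(points, number_of_clusters, max_iterations=100):
    """Cluster 2-D points with the two-step K-means loop; returns (labels, centroids)."""
    # Assumption for this sketch: seed the centroids with the first k points.
    centroids = list(points[:number_of_clusters])
    labels = [0] * len(points)
    for _ in range(max_iterations):
        # (a) Point allocation: assign each point to its nearest centroid.
        labels = [
            min(range(number_of_clusters), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # (b) Centroid re-calculation: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(number_of_clusters):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                new_centroids.append((mean_x, mean_y))
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[k])
        # Stop once step (b) no longer moves any centroid.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, number_of_clusters=2)
```

With the six sample points, the loop settles after a few iterations with the three points near the origin in one cluster and the three points near (10, 10) in the other.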

Attributes

Parameters

  • point_collection: str, default 'sites'

    Table to take the point collection from. This parameter may accept dataframes with geographic data in the future.

  • location_identifier: str, default 'id'

    Location identifier from the point table to use. This identifier must be unique to each location.

  • geometry_identifier: str, default 'geom_point'

    Geometry column to use in computations.

  • method: str, default 'kmeans'

    Method to use in clustering. Please refer to each method's reference for algorithmic information on how it works. Current implementations are:

    * kmeans: Uses a K-means algorithm to select clusters. This method requires the parameter number_of_clusters.
    * dbscan: Uses the DBSCAN algorithm to select clusters. This method requires the parameters distance_tolerance and density_tolerance.
    * area: Clusters points that are within a certain area. Similar to DBSCAN, but without a density value; that is, clusters can be of any size. This method requires the parameter distance_tolerance.

  • distance_tolerance: int, float, default 1

    Radius of the area, in km. The area is approximated using a WGS84 degree to meter conversion of (distance_tolerance * 1000) / 111195. This can introduce an error of up to ~0.1%.

  • density_tolerance: int, default 5

    Minimum number of members that a cluster must have in order to exist. If the members of a possible cluster do not meet this criterion, they will not be assigned a cluster. See the return_no_cluster parameter.

  • number_of_clusters: int, default 5

    Number of clusters to create with the K-means algorithm.

  • date: str, default None

    If the point_collection is either 'sites' or 'cells', use this parameter to determine which version of those infrastructure elements to use. If the default None is used, the current date will be used.

  • aggregate: bool, default False

    If used, a dataframe will be returned with a convex hull geometry per cluster id alongside its centroid. This can be used in conjunction with LocationArea() to create new area representations.

  • return_no_cluster: bool, default True

    If used, results will include members that have not been assigned to a cluster. If this parameter is used in conjunction with the aggregate parameter, elements with no cluster will be ignored. We do not recommend using this in conjunction with that parameter.
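To show how distance_tolerance and density_tolerance interact under the dbscan method, here is a minimal in-memory sketch of the general DBSCAN procedure. It is not the PostGIS-backed query this class issues: the names `km_to_degrees` and `dbscan` are hypothetical, distances are taken as plain Euclidean distances in degrees, and a point counts itself among its neighbours. Only the conversion factor comes from the parameter description above.

```python
import math

METERS_PER_DEGREE = 111195  # conversion factor quoted for distance_tolerance


def km_to_degrees(distance_tolerance):
    """Approximate a distance in km as WGS84 degrees."""
    return (distance_tolerance * 1000) / METERS_PER_DEGREE


def dbscan(points, distance_tolerance=1, density_tolerance=5):
    """Return one label per (lon, lat) point: a cluster id, or None for no cluster."""
    eps = km_to_degrees(distance_tolerance)
    # Neighbourhood of each point: all points within eps degrees, itself included.
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        # Skip points that are already clustered or too sparse to seed a cluster.
        if labels[i] is not None or len(neighbours[i]) < density_tolerance:
            continue
        labels[i] = cluster_id
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            # Only sufficiently dense points keep expanding the cluster.
            if len(neighbours[j]) >= density_tolerance:
                queue.extend(neighbours[j])
        cluster_id += 1
    return labels


points = [(0.0, 0.0), (0.005, 0.0), (0.01, 0.0), (1.0, 1.0)]
labels = dbscan(points, distance_tolerance=1, density_tolerance=3)
```

In this sample the first three points form one cluster and the isolated fourth point is labelled None, which corresponds to a member with no cluster. Filtering out the None labels would mimic, under these assumptions, leaving return_no_cluster unset.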

Note

The DBSCAN implementation code was originally sourced from the website of Dan Baston (implementer of the method in PostGIS); the K-means implementation is a derivation of the DBSCAN implementation:

* http://www.danbaston.com/posts/2016/06/02/dbscan-clustering-in-postgis.html

The Area method code was originally drawn from the GIS Stack Exchange page:

* https://gis.stackexchange.com/questions/11567/spatial-clustering-with-postgis

Methods

cache

cache
Source: flowmachine/core/query.py

Returns
  • bool

    True if caching is switched on.

column_names

column_names
Source: flowmachine/features/spatial/location_cluster.py

Returns the column names.

Returns
  • typing.List[str]

    List of the column names of this query.

column_names_as_string_list

column_names_as_string_list
Source: flowmachine/core/query.py

Get the column names as a comma-separated list.

Returns
  • str

    Comma-separated list of column names

dependencies

dependencies
Source: flowmachine/core/query.py

Returns
  • set

    The set of queries which this one is directly dependent on.

fully_qualified_table_name

fully_qualified_table_name
Source: flowmachine/core/query.py

Returns a unique fully qualified name under which the query is stored in the cache schema, based on a hash of the parameters, class, and subqueries.

Returns
  • str

    String form of the table's fqn

index_cols

index_cols
Source: flowmachine/core/query.py

A list of columns to use as indexes when storing this query.

Returns
  • ixen: list

    By default, returns the location columns if they are present and self.spatial_unit is defined, and the subscriber column.

Examples
daily_location("2016-01-01").index_cols
[['name'], '"subscriber"']

is_stored

is_stored
Source: flowmachine/core/query.py

Returns
  • bool

    True if the table is stored, and False otherwise.

query_id

query_id
Source: flowmachine/core/query.py

Generate a uniquely identifying hash of this query, based on its parameters and the subqueries it is composed of.

Returns
  • str

    query_id hash string

query_state

query_state
Source: flowmachine/core/query.py

Return the current query state.

Returns
  • QueryState

    The current query state

query_state_str

query_state_str
Source: flowmachine/core/query.py

Return the current query state as a string

Returns
  • str

    The current query state. The possible values are the ones defined in flowmachine.core.query_state.QueryState.

table_name

table_name
Source: flowmachine/core/query.py

Returns a unique name for the query to be stored as, based on a hash of the parameters, class, and subqueries.

Returns
  • str

    String form of the table's fqn