Skip to content

Joined spatial aggregate

Comparing mobility with handset types

This worked example demonstrates using joined spatial aggregate queries in FlowKit. A joined spatial aggregate query calculates a metric for each subscriber, and joins the metric values to subscribers' locations before aggregating the metric by region.

Suppose we want to investigate whether there is a link between subscribers' mobility and their handset types. We can calculate two joined spatial aggregates:

  • Average radius of gyration per region. Radius of gyration is a measure of the spread of a subscriber's event locations - a subscriber with a large radius of gyration will have events spread over a larger area (for example, somebody who commutes over a long distance or regularly travels to regions away from their home).
  • Distribution of handset types (basic/feature/smartphone) per region. Once we have the results of these queries, we can investigate whether there is any correlation between the metrics.

The Jupyter notebook for this worked example can be downloaded here, or can be run using the quick start setup.

Load FlowClient and connect to FlowAPI

We start by importing FlowClient. We also import pandas, geopandas and matplotlib, which we will use for performing further analysis and visualisation of the outputs from FlowKit.

import os

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

import flowclient as fc

%matplotlib inline

We must next generate a FlowAPI access token using FlowAuth. If you are running this notebook using the quick start setup, generating a token requires the following steps:

  1. Visit the FlowAuth login page at http://localhost:9091.
  2. Log in with username TEST_USER and password DUMMY_PASSWORD.
  3. Under "My Servers", select TEST_SERVER.
  4. Click the + button to create a new token.
  5. Give the new token a name, and click SAVE.
  6. Copy the token string using the COPY button.
  7. Paste the token in this notebook as TOKEN.

The steps are the same in a production setup, but the FlowAuth URL, login details and server name will differ.

Once we have a token, we can start a connection to the FlowAPI system. If you are connecting to FlowAPI over https (recommended) and the system administrator has provided you with an SSL certificate file, you should provide the path to this file as the ssl_certificate argument to flowclient.connect() (in this example, you can set the path in the environment variable SSL_CERTIFICATE_FILE). If you are connecting over http, this argument is not required.

conn = fc.connect(
    url=os.getenv("FLOWAPI_URL", "http://localhost:9090"),
    token=TOKEN,
    ssl_certificate=os.getenv("SSL_CERTIFICATE_FILE"),
)

Specify FlowKit queries

Next, we create query specifications for the FlowKit queries we will run.

Locations

Joined spatial aggregate queries join a per-subscriber metric to the subscribers' locations, so we need a query that assigns a single location to each subscriber. Here we use a modal location query to estimate subscribers' home locations over the period of interest (first seven days of 2016), at administrative level 3.

date_range = ("2016-01-01", "2016-01-08")
modal_loc = fc.modal_location_from_dates_spec(
    start_date=date_range[0],
    end_date=date_range[1],
    aggregation_unit="admin3",
    method="last",
)

Radius of gyration

We create a joined spatial aggregate query specification to calculate the average radius of gyration per region, using locations from the modal location query above.

rog_query = fc.radius_of_gyration_spec(
    start_date=date_range[0],
    end_date=date_range[1],
)

jsa_rog_query = fc.aggregates.joined_spatial_aggregate_spec(
    locations=modal_loc,
    metric=rog_query,
    method="avg",
)

Handset type

Finally, we specify a second joined spatial aggregate query, this time using a handset metric with the "hnd_type" (handset type) characteristic. The handset query will return a categorical variable (handset type is one of "Basic", "Feature" or "Smart"), so we must choose the "distr" method for the joined spatial aggregate, to get the distribution of handset types per region.

handset_query = fc.handset_spec(
    start_date=date_range[0],
    end_date=date_range[1],
    characteristic="hnd_type",
    method="most-common",
)

jsa_handset_query = fc.aggregates.joined_spatial_aggregate_spec(
    locations=modal_loc,
    metric=handset_query,
    method="distr",
)

Radius of gyration results

Now that we have a FlowClient connection and we have specified our queries, we can run the queries and get the results using FlowClient's get_result function. First, we get the results of the radius of gyration query.

rog_results = fc.get_result(connection=conn, query_spec=jsa_rog_query)
rog_results.head()
pcod value
0 NPL.2.3.3_1 63.6611
1 NPL.3.2.5_1 90.6393
2 NPL.1.3.3_1 43.8355
3 NPL.4.1.1_1 72.7846
4 NPL.1.1.3_1 47.1262

The resulting dataframe has two columns: pcod (the P-code of each region) and value (the average radius of gyration, in km). Later, we will want to join with other results for the same regions, so let's set the pcod column as the index. We also rename the value column, to give it the more descriptive name radius_of_gyration.

rog_results = rog_results.rename(columns={"value": "radius_of_gyration"}).set_index(
    "pcod"
)
rog_results.head()


pcod
radius_of_gyration

 
NPL.2.3.3_1 63.6611
NPL.3.2.5_1 90.6393
NPL.1.3.3_1 43.8355
NPL.4.1.1_1 72.7846
NPL.1.1.3_1 47.1262

We can get a summary of the data using the dataframe's describe method, and quickly see the distribution of values using plot.hist.

rog_results.describe()
radius_of_gyration
count 61
mean 64.9108
std 17.2098
min 43.8355
25% 50.3544
50% 60.5139
75% 76.3581
max 103.238
rog_results.plot.hist();
No description has been provided for this image

To see the spatial distribution of radius of gyration, we need the geographic boundaries of the administrative regions. We can get these as GeoJSON using the FlowClient get_geography function, and load into a GeoPandas GeoDataFrame using the GeoDataFrame.from_features method.

geog = gpd.GeoDataFrame.from_features(
    fc.get_geography(connection=conn, aggregation_unit="admin3")
).set_index("pcod")

The plot method is a quick way to see the regions described by this GeoDataFrame.

geog.plot();
No description has been provided for this image

Now we can join the geography data to the radius of gyration results, and create a choropleth map by plotting again using the radius_of_gyration column to colour the regions.
(Note: the order in which we join here is important - rog_results.join(geog) would produce a pandas DataFrame, not a geopandas GeoDataFrame, and the plot method would produce a different plot).

geog.join(rog_results).plot(column="radius_of_gyration", legend=True);
No description has been provided for this image

There are gaps in this choropleth map. Looking at the content of the 'radius_of_gyration' column in the joined dataframe, we can see why:

geog.join(rog_results)[["radius_of_gyration"]].head()


pcod
radius_of_gyration

 
NPL.1.1.1_1 47.585
NPL.1.1.2_1 nan
NPL.1.1.5_1 46.9523
NPL.1.1.3_1 47.1262
NPL.1.1.4_1 46.7406

There are some regions for which we have no radius_of_gyration data, because these regions contained fewer than 15 subscribers so FlowKit redacted the results to preserve privacy. We can drop these rows using the dropna method. So that we can still see the shapes of the regions with no data, let's plot the geog data as a light grey background behind our choropleth map.

background = geog.plot(color="lightgrey")
geog.join(rog_results).dropna().plot(
    ax=background, column="radius_of_gyration", legend=True
);
No description has been provided for this image

Handset type results

Next we get the results of our handset-type query.

handset_results = fc.get_result(connection=conn, query_spec=jsa_handset_query)
handset_results.head()
pcod metric key value
0 NPL.1.1.1_1 value Feature 0.347262
1 NPL.1.1.1_1 value Smart 0.325648
2 NPL.1.1.1_1 value Basic 0.327089
3 NPL.1.1.3_1 value Smart 0.345817
4 NPL.1.1.3_1 value Basic 0.326038

This dataframe has four columns: pcod (the P-code, as above), metric, key (the handset type) and value (the proportion of subscribers in that region who have that handset type). The metric column just contains the value "value" in every row, which is not useful to us, so let's drop it. We'll also rename the key and value columns to handset_type and proportion, respectively.

This time we have three rows per P-code, one for each handset type, so let's set a MultiIndex using the pcod and handset_type columns.

handset_results = (
    handset_results.rename(columns={"key": "handset_type", "value": "proportion"})
    .drop(columns="metric")
    .set_index(["pcod", "handset_type"])
)
handset_results.head()


pcod, handset_type
proportion

 
('NPL.1.1.1_1', 'Feature') 0.347262
('NPL.1.1.1_1', 'Smart') 0.325648
('NPL.1.1.1_1', 'Basic') 0.327089
('NPL.1.1.3_1', 'Smart') 0.345817
('NPL.1.1.3_1', 'Basic') 0.326038

We can get a summary of the data using the describe method, but let's first group by handset_type to get a summary per handset type.

handset_results.groupby("handset_type").describe()



handset_type
proportion
count

 
proportion
mean

 
proportion
std

 
proportion
min

 
proportion
25%

 
proportion
50%

 
proportion
75%

 
proportion
max

 
Basic 61 0.326162 0.0216468 0.253247 0.317614 0.326633 0.337289 0.3875
Feature 61 0.336245 0.0213356 0.287054 0.32448 0.336691 0.348718 0.38961
Smart 61 0.337593 0.0226547 0.284615 0.324706 0.335958 0.349515 0.41358

To plot histograms for the three handset types, it would be convenient if the proportion data for each handset type were in separate columns. Since we set a hierarchical index on the handset_results dataframe, we can use the unstack method to unpack the innermost level of the index (handset_type, in this case) into separate columns.

handset_results.unstack().head()

handset_type

pcod
proportion
Basic

 
proportion
Feature

 
proportion
Smart

 
NPL.1.1.1_1 0.327089 0.347262 0.325648
NPL.1.1.3_1 0.326038 0.328145 0.345817
NPL.1.1.4_1 0.313433 0.308458 0.378109
NPL.1.1.5_1 0.326633 0.32448 0.348887
NPL.1.1.6_1 0.351899 0.340506 0.307595

This produces a dataframe with a single-level index, and a MultiIndex as column headers. Now the plot.hist method plots all three histograms on the same axes. We set alpha=0.5 to make them slightly transparent. In this dataset we can see that there is little difference between the distributions of the different handset types - the proportions of all handset types are approximately equal in all regions.

handset_results.unstack().plot.hist(alpha=0.5);
No description has been provided for this image

As with the radius of gyration results, we can also join to the geography data and plot choropleth maps of the handset type distributions. We use the query method to select the subset of rows for a given handset type.

handset_results_with_geo = geog.join(handset_results).dropna()

for htype in ["Basic", "Feature", "Smart"]:
    ax = geog.plot(color="lightgrey")
    ax.set_title(f"Proportion ({htype})")
    handset_results_with_geo.query(f"handset_type == '{htype}'").plot(
        ax=ax, column="proportion", legend=True
    );
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Comparing radius of gyration and handset types

Join the dataframes

Now that we have the results from both of our joined spatial aggregate queries, we can join them and compare the metrics.

handset_results_unstacked = handset_results.unstack()
handset_results_unstacked.columns = handset_results_unstacked.columns.to_flat_index()
combined = rog_results.join(handset_results_unstacked)
combined.head()


pcod
radius_of_gyration

 
('proportion', 'Basic')

 
('proportion', 'Feature')

 
('proportion', 'Smart')

 
NPL.2.3.3_1 63.6611 0.300292 0.370262 0.329446
NPL.3.2.5_1 90.6393 0.328244 0.357506 0.314249
NPL.1.3.3_1 43.8355 0.310631 0.327243 0.362126
NPL.4.1.1_1 72.7846 0.317757 0.333333 0.34891
NPL.1.1.3_1 47.1262 0.326038 0.328145 0.345817

Correlation

The corr method provides a quick way to calculate the correlation between all pairs of columns. We're interested in the correlation between radius of gyration and each of the proportion columns. From these results, there appears to be no significant correlation between handset type and radius of gyration.

combined.corr()
radius_of_gyration ('proportion', 'Basic') ('proportion', 'Feature') ('proportion', 'Smart')
radius_of_gyration 1 0.00775751 -0.125188 0.110486
('proportion', 'Basic') 0.00775751 1 -0.444473 -0.536918
('proportion', 'Feature') -0.125188 -0.444473 1 -0.517076
('proportion', 'Smart') 0.110486 -0.536918 -0.517076 1

Scatter plot

To see whether there is any relationship that's not apparent from the correlation, let's make a scatter plot of radius of gyration against proportion of subscribers who use a smartphone.

combined.plot.scatter(x="radius_of_gyration", y=("proportion", "Smart"));
No description has been provided for this image

We can use matplotlib to plot radius of gyration against proportion for each of the handset types on the same plot.

fig, ax = plt.subplots()

for htype in ["Basic", "Feature", "Smart"]:
    ax.scatter(
        combined["radius_of_gyration"],
        combined[("proportion", htype)],
        label=htype,
    )
ax.set_xlabel("Radius of gyration (km)")
ax.set_ylabel("Handset type proportion")
ax.legend();
No description has been provided for this image

There is no obvious pattern visible in these plots, so for this test dataset we have found no relationship between subscribers' handset types and their radius of gyration.