Joined spatial aggregate¶
Comparing mobility with handset types¶
This worked example demonstrates using joined spatial aggregate queries in FlowKit. A joined spatial aggregate query calculates a metric for each subscriber, and joins the metric values to subscribers' locations before aggregating the metric by region.
Suppose we want to investigate whether there is a link between subscribers' mobility and their handset types. We can calculate two joined spatial aggregates:
- Average radius of gyration per region. Radius of gyration is a measure of the spread of a subscriber's event locations - a subscriber with a large radius of gyration will have events spread over a larger area (for example, somebody who commutes over a long distance or regularly travels to regions away from their home).
- Distribution of handset types (basic/feature/smartphone) per region. Once we have the results of these queries, we can investigate whether there is any correlation between the metrics.
The Jupyter notebook for this worked example can be downloaded here, or can be run using the quick start setup.
Load FlowClient and connect to FlowAPI¶
We start by importing FlowClient. We also import pandas, geopandas and matplotlib, which we will use for performing further analysis and visualisation of the outputs from FlowKit.
import os
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import flowclient as fc
%matplotlib inline
We must next generate a FlowAPI access token using FlowAuth. If you are running this notebook using the quick start setup, generating a token requires the following steps:
- Visit the FlowAuth login page at http://localhost:9091.
- Log in with username
TEST_USER
and passwordDUMMY_PASSWORD
. - Under "My Servers", select
TEST_SERVER
. - Click the
+
button to create a new token. - Give the new token a name, and click
SAVE
. - Copy the token string using the
COPY
button. - Paste the token in this notebook as
TOKEN
.
The steps are the same in a production setup, but the FlowAuth URL, login details and server name will differ.
Once we have a token, we can start a connection to the FlowAPI system. If you are connecting to FlowAPI over https (recommended) and the system administrator has provided you with an SSL certificate file, you should provide the path to this file as the ssl_certificate
argument to flowclient.connect()
(in this example, you can set the path in the environment variable SSL_CERTIFICATE_FILE
). If you are connecting over http, this argument is not required.
conn = fc.connect(
url=os.getenv("FLOWAPI_URL", "http://localhost:9090"),
token=TOKEN,
ssl_certificate=os.getenv("SSL_CERTIFICATE_FILE"),
)
Specify FlowKit queries¶
Next, we create query specifications for the FlowKit queries we will run.
Locations¶
Joined spatial aggregate queries join a per-subscriber metric to the subscribers' locations, so we need a query that assigns a single location to each subscriber. Here we use a modal location query to estimate subscribers' home locations over the period of interest (first seven days of 2016), at administrative level 3.
date_range = ("2016-01-01", "2016-01-08")
modal_loc = fc.modal_location_from_dates_spec(
start_date=date_range[0],
end_date=date_range[1],
aggregation_unit="admin3",
method="last",
)
Radius of gyration¶
We create a joined spatial aggregate query specification to calculate the average radius of gyration per region, using locations from the modal location query above.
rog_query = fc.radius_of_gyration_spec(
start_date=date_range[0],
end_date=date_range[1],
)
jsa_rog_query = fc.aggregates.joined_spatial_aggregate_spec(
locations=modal_loc,
metric=rog_query,
method="avg",
)
Handset type¶
Finally, we specify a second joined spatial aggregate query, this time using a handset
metric with the "hnd_type"
(handset type) characteristic. The handset
query will return a categorical variable (handset type is one of "Basic", "Feature" or "Smart"), so we must choose the "distr"
method for the joined spatial aggregate, to get the distribution of handset types per region.
handset_query = fc.handset_spec(
start_date=date_range[0],
end_date=date_range[1],
characteristic="hnd_type",
method="most-common",
)
jsa_handset_query = fc.aggregates.joined_spatial_aggregate_spec(
locations=modal_loc,
metric=handset_query,
method="distr",
)
Radius of gyration results¶
Now that we have a FlowClient connection and we have specified our queries, we can run the queries and get the results using FlowClient's get_result
function. First, we get the results of the radius of gyration query.
rog_results = fc.get_result(connection=conn, query_spec=jsa_rog_query)
rog_results.head()
pcod | value | |
---|---|---|
0 | NPL.2.3.3_1 | 63.6611 |
1 | NPL.3.2.5_1 | 90.6393 |
2 | NPL.1.3.3_1 | 43.8355 |
3 | NPL.4.1.1_1 | 72.7846 |
4 | NPL.1.1.3_1 | 47.1262 |
The resulting dataframe has two columns: pcod
(the P-code of each region) and value
(the average radius of gyration, in km). Later, we will want to join with other results for the same regions, so let's set the pcod
column as the index. We also rename the value
column, to give it the more descriptive name radius_of_gyration
.
rog_results = rog_results.rename(columns={"value": "radius_of_gyration"}).set_index(
"pcod"
)
rog_results.head()
pcod |
radius_of_gyration |
---|---|
NPL.2.3.3_1 | 63.6611 |
NPL.3.2.5_1 | 90.6393 |
NPL.1.3.3_1 | 43.8355 |
NPL.4.1.1_1 | 72.7846 |
NPL.1.1.3_1 | 47.1262 |
We can get a summary of the data using the dataframe's describe
method, and quickly see the distribution of values using plot.hist
.
rog_results.describe()
radius_of_gyration | |
---|---|
count | 61 |
mean | 64.9108 |
std | 17.2098 |
min | 43.8355 |
25% | 50.3544 |
50% | 60.5139 |
75% | 76.3581 |
max | 103.238 |
rog_results.plot.hist();
To see the spatial distribution of radius of gyration, we need the geographic boundaries of the administrative regions. We can get these as GeoJSON using the FlowClient get_geography
function, and load into a GeoPandas GeoDataFrame
using the GeoDataFrame.from_features
method.
geog = gpd.GeoDataFrame.from_features(
fc.get_geography(connection=conn, aggregation_unit="admin3")
).set_index("pcod")
The plot
method is a quick way to see the regions described by this GeoDataFrame.
geog.plot();
Now we can join the geography data to the radius of gyration results, and create a choropleth map by plotting again using the radius_of_gyration
column to colour the regions.
(Note: the order in which we join here is important - rog_results.join(geog)
would produce a pandas DataFrame
, not a geopandas GeoDataFrame
, and the plot
method would produce a different plot).
geog.join(rog_results).plot(column="radius_of_gyration", legend=True);
There are gaps in this choropleth map. Looking at the content of the 'radius_of_gyration' column in the joined dataframe, we can see why:
geog.join(rog_results)[["radius_of_gyration"]].head()
pcod |
radius_of_gyration |
---|---|
NPL.1.1.1_1 | 47.585 |
NPL.1.1.2_1 | nan |
NPL.1.1.5_1 | 46.9523 |
NPL.1.1.3_1 | 47.1262 |
NPL.1.1.4_1 | 46.7406 |
There are some regions for which we have no radius_of_gyration
data, because these regions contained fewer than 15 subscribers so FlowKit redacted the results to preserve privacy. We can drop these rows using the dropna
method. So that we can still see the shapes of the regions with no data, let's plot the geog
data as a light grey background behind our choropleth map.
background = geog.plot(color="lightgrey")
geog.join(rog_results).dropna().plot(
ax=background, column="radius_of_gyration", legend=True
);
Handset type results¶
Next we get the results of our handset-type query.
handset_results = fc.get_result(connection=conn, query_spec=jsa_handset_query)
handset_results.head()
pcod | metric | key | value | |
---|---|---|---|---|
0 | NPL.1.1.1_1 | value | Feature | 0.347262 |
1 | NPL.1.1.1_1 | value | Smart | 0.325648 |
2 | NPL.1.1.1_1 | value | Basic | 0.327089 |
3 | NPL.1.1.3_1 | value | Smart | 0.345817 |
4 | NPL.1.1.3_1 | value | Basic | 0.326038 |
This dataframe has four columns: pcod
(the P-code, as above), metric
, key
(the handset type) and value
(the proportion of subscribers in that region who have that handset type). The metric
column just contains the value "value"
in every row, which is not useful to us, so let's drop it. We'll also rename the key
and value
columns to handset_type
and proportion
, respectively.
This time we have three rows per P-code, one for each handset type, so let's set a MultiIndex
using the pcod
and handset_type
columns.
handset_results = (
handset_results.rename(columns={"key": "handset_type", "value": "proportion"})
.drop(columns="metric")
.set_index(["pcod", "handset_type"])
)
handset_results.head()
pcod, handset_type |
proportion |
---|---|
('NPL.1.1.1_1', 'Feature') | 0.347262 |
('NPL.1.1.1_1', 'Smart') | 0.325648 |
('NPL.1.1.1_1', 'Basic') | 0.327089 |
('NPL.1.1.3_1', 'Smart') | 0.345817 |
('NPL.1.1.3_1', 'Basic') | 0.326038 |
We can get a summary of the data using the describe
method, but let's first group by handset_type
to get a summary per handset type.
handset_results.groupby("handset_type").describe()
handset_type |
proportion count |
proportion mean |
proportion std |
proportion min |
proportion 25% |
proportion 50% |
proportion 75% |
proportion max |
---|---|---|---|---|---|---|---|---|
Basic | 61 | 0.326162 | 0.0216468 | 0.253247 | 0.317614 | 0.326633 | 0.337289 | 0.3875 |
Feature | 61 | 0.336245 | 0.0213356 | 0.287054 | 0.32448 | 0.336691 | 0.348718 | 0.38961 |
Smart | 61 | 0.337593 | 0.0226547 | 0.284615 | 0.324706 | 0.335958 | 0.349515 | 0.41358 |
To plot histograms for the three handset types, it would be convenient if the proportion data for each handset type were in separate columns. Since we set a hierarchical index on the handset_results
dataframe, we can use the unstack
method to unpack the innermost level of the index (handset_type
, in this case) into separate columns.
handset_results.unstack().head()
handset_type pcod |
proportion Basic |
proportion Feature |
proportion Smart |
---|---|---|---|
NPL.1.1.1_1 | 0.327089 | 0.347262 | 0.325648 |
NPL.1.1.3_1 | 0.326038 | 0.328145 | 0.345817 |
NPL.1.1.4_1 | 0.313433 | 0.308458 | 0.378109 |
NPL.1.1.5_1 | 0.326633 | 0.32448 | 0.348887 |
NPL.1.1.6_1 | 0.351899 | 0.340506 | 0.307595 |
This produces a dataframe with a single-level index, and a MultiIndex
as column headers. Now the plot.hist
method plots all three histograms on the same axes. We set alpha=0.5
to make them slightly transparent. In this dataset we can see that there is little difference between the distributions of the different handset types - the proportions of all handset types are approximately equal in all regions.
handset_results.unstack().plot.hist(alpha=0.5);
As with the radius of gyration results, we can also join to the geography data and plot choropleth maps of the handset type distributions. We use the query
method to select the subset of rows for a given handset type.
handset_results_with_geo = geog.join(handset_results).dropna()
for htype in ["Basic", "Feature", "Smart"]:
ax = geog.plot(color="lightgrey")
ax.set_title(f"Proportion ({htype})")
handset_results_with_geo.query(f"handset_type == '{htype}'").plot(
ax=ax, column="proportion", legend=True
);
handset_results_unstacked = handset_results.unstack()
handset_results_unstacked.columns = handset_results_unstacked.columns.to_flat_index()
combined = rog_results.join(handset_results_unstacked)
combined.head()
pcod |
radius_of_gyration |
('proportion', 'Basic') |
('proportion', 'Feature') |
('proportion', 'Smart') |
---|---|---|---|---|
NPL.2.3.3_1 | 63.6611 | 0.300292 | 0.370262 | 0.329446 |
NPL.3.2.5_1 | 90.6393 | 0.328244 | 0.357506 | 0.314249 |
NPL.1.3.3_1 | 43.8355 | 0.310631 | 0.327243 | 0.362126 |
NPL.4.1.1_1 | 72.7846 | 0.317757 | 0.333333 | 0.34891 |
NPL.1.1.3_1 | 47.1262 | 0.326038 | 0.328145 | 0.345817 |
Correlation¶
The corr
method provides a quick way to calculate the correlation between all pairs of columns. We're interested in the correlation between radius of gyration and each of the proportion columns. From these results, there appears to be no significant correlation between handset type and radius of gyration.
combined.corr()
radius_of_gyration | ('proportion', 'Basic') | ('proportion', 'Feature') | ('proportion', 'Smart') | |
---|---|---|---|---|
radius_of_gyration | 1 | 0.00775751 | -0.125188 | 0.110486 |
('proportion', 'Basic') | 0.00775751 | 1 | -0.444473 | -0.536918 |
('proportion', 'Feature') | -0.125188 | -0.444473 | 1 | -0.517076 |
('proportion', 'Smart') | 0.110486 | -0.536918 | -0.517076 | 1 |
Scatter plot¶
To see whether there is any relationship that's not apparent from the correlation, let's make a scatter plot of radius of gyration against proportion of subscribers who use a smartphone.
combined.plot.scatter(x="radius_of_gyration", y=("proportion", "Smart"));
We can use matplotlib
to plot radius of gyration against proportion for each of the handset types on the same plot.
fig, ax = plt.subplots()
for htype in ["Basic", "Feature", "Smart"]:
ax.scatter(
combined["radius_of_gyration"],
combined[("proportion", htype)],
label=htype,
)
ax.set_xlabel("Radius of gyration (km)")
ax.set_ylabel("Handset type proportion")
ax.legend();
There is no obvious pattern visible in these plots, so for this test dataset we have found no relationship between subscribers' handset types and their radius of gyration.