Flows Above Normal¶
Post-crisis analysis¶
In this worked example we assume the role of an analyst working in the aftermath of a significant crisis in Nepal. Our aim is to use FlowKit to investigate which administrative regions people have been displaced from/to during the crisis, following the methodology used in this paper.
The Jupyter notebook for this worked example can be downloaded here, or can be run using the quick start setup.
Introduction¶
We want to investigate changes in people's home locations caused by the crisis. To do this we will calculate an origin-destination matrix, containing the number of subscribers who lived in region A during a 'benchmark' period before the crisis, but had moved to region B during a 'focal' period after the start of the crisis, for each pair of regions A and B. We call this the 'flows' from the benchmark period to the focal period.
Some of these movements would have been made under normal circumstances, and were not caused by the crisis. To separate the crisis-driven movements from normal mobility patterns, we will also calculate a second origin-destination matrix, containing the flows from the same benchmark period to a 'comparison' period shortly before the crisis began. We can then subtract these 'normal' flows from the 'crisis' flows to get the flows above normal, which we assume to have been caused by the crisis. The diagram below shows the two flows that we will calculate.
We will finish by using the mapboxgl
library to display the total flows above normal in or out of each region on an interactive map, which will look like this:
Load FlowClient and connect to FlowAPI¶
We start by importing FlowClient. We also import geopandas and mapboxgl, which we will use later to to visualise the data.
import flowclient as fc
import os
import numpy as np
import geopandas as gpd
import mapboxgl
from mapboxgl.utils import create_color_stops
%matplotlib inline
We must next generate a FlowAPI access token using FlowAuth. If you are running this notebook using the quick start setup, generating a token requires the following steps:
- Visit the FlowAuth login page at http://localhost:9091.
- Log in with username
TEST_USER
and passwordDUMMY_PASSWORD
. - Under "My Servers", select
TEST_SERVER
. - Click the
+
button to create a new token. - Give the new token a name, and click
SAVE
. - Copy the token string using the
COPY
button. - Paste the token in this notebook as
TOKEN
.
The steps are the same in a production setup, but the FlowAuth URL, login details and server name will differ.
Tokens are only valid for a fixed time, so once a token has expired you will need to create a new one.
Once we have a token, we can start a connection to the FlowAPI system. If you are connecting to FlowAPI over https and the system administrator has provided you with an SSL certificate file, you should provide the path to this file as the ssl_certificate
argument toflowclient.connect()
(in this example, you can set the path in the environment variable SSL_CERTIFICATE_FILE
). If you are connecting over http, this argument is not required.
conn = fc.connect(
url=os.getenv(
"FLOWAPI_URL", "http://localhost:9090"
), # This is the URL for the interface flowclient uses to communicate with the FlowKit server
token=TOKEN, # Replace TOKEN with the token copied from FlowAuth
ssl_certificate=os.getenv(
"SSL_CERTIFICATE_FILE"
), # Path to the SSL certificate file (required if FlowAPI uses an https connection)
)
Calculate origin-destination matrices¶
1. Create query specifications¶
We first need to choose the three time periods (benchmark, comparison and focal) between which we will calculate flows. The simulated crisis in our fake dataset begins on 10th February 2016, so both the benchmark and comparison periods should fall entirely before this date, and the focal period should be after this date.
We can use the flowclient get_available_dates
function to check which dates have data available in FlowKit. We need to provide the connection we have just created as the 'connection' argument - flowclient uses this to communicate with the FlowKit server.
fc.get_available_dates(connection=conn)
{'calls': ['2016-01-01',
'2016-01-02',
'2016-01-03',
'2016-01-04',
'2016-01-05',
'2016-01-06',
'2016-01-07',
'2016-01-08',
'2016-01-09',
'2016-01-10',
'2016-01-11',
'2016-01-12',
'2016-01-13',
'2016-01-14',
'2016-01-15',
'2016-01-16',
'2016-01-17',
'2016-01-18',
'2016-01-19',
'2016-01-20',
'2016-01-21',
'2016-01-22',
'2016-01-23',
'2016-01-24',
'2016-01-25',
'2016-01-26',
'2016-01-27',
'2016-01-28',
'2016-01-29',
'2016-01-30',
'2016-01-31',
'2016-02-01',
'2016-02-02',
'2016-02-03',
'2016-02-04',
'2016-02-05',
'2016-02-06',
'2016-02-07',
'2016-02-08',
'2016-02-09',
'2016-02-10',
'2016-02-11',
'2016-02-12',
'2016-02-13',
'2016-02-14',
'2016-02-15',
'2016-02-16',
'2016-02-17',
'2016-02-18',
'2016-02-19',
'2016-02-20',
'2016-02-21',
'2016-02-22',
'2016-02-23',
'2016-02-24',
'2016-02-25',
'2016-02-26',
'2016-02-27',
'2016-02-28',
'2016-02-29'],
'mds': ['2016-01-01',
'2016-01-02',
'2016-01-03',
'2016-01-04',
'2016-01-05',
'2016-01-06',
'2016-01-07',
'2016-01-08',
'2016-01-09',
'2016-01-10',
'2016-01-11',
'2016-01-12',
'2016-01-13',
'2016-01-14',
'2016-01-15',
'2016-01-16',
'2016-01-17',
'2016-01-18',
'2016-01-19',
'2016-01-20',
'2016-01-21',
'2016-01-22',
'2016-01-23',
'2016-01-24',
'2016-01-25',
'2016-01-26',
'2016-01-27',
'2016-01-28',
'2016-01-29',
'2016-01-30',
'2016-01-31',
'2016-02-01',
'2016-02-02',
'2016-02-03',
'2016-02-04',
'2016-02-05',
'2016-02-06',
'2016-02-07',
'2016-02-08',
'2016-02-09',
'2016-02-10',
'2016-02-11',
'2016-02-12',
'2016-02-13',
'2016-02-14',
'2016-02-15',
'2016-02-16',
'2016-02-17',
'2016-02-18',
'2016-02-19',
'2016-02-20',
'2016-02-21',
'2016-02-22',
'2016-02-23',
'2016-02-24',
'2016-02-25',
'2016-02-26',
'2016-02-27',
'2016-02-28',
'2016-02-29'],
'sms': ['2016-01-01',
'2016-01-02',
'2016-01-03',
'2016-01-04',
'2016-01-05',
'2016-01-06',
'2016-01-07',
'2016-01-08',
'2016-01-09',
'2016-01-10',
'2016-01-11',
'2016-01-12',
'2016-01-13',
'2016-01-14',
'2016-01-15',
'2016-01-16',
'2016-01-17',
'2016-01-18',
'2016-01-19',
'2016-01-20',
'2016-01-21',
'2016-01-22',
'2016-01-23',
'2016-01-24',
'2016-01-25',
'2016-01-26',
'2016-01-27',
'2016-01-28',
'2016-01-29',
'2016-01-30',
'2016-01-31',
'2016-02-01',
'2016-02-02',
'2016-02-03',
'2016-02-04',
'2016-02-05',
'2016-02-06',
'2016-02-07',
'2016-02-08',
'2016-02-09',
'2016-02-10',
'2016-02-11',
'2016-02-12',
'2016-02-13',
'2016-02-14',
'2016-02-15',
'2016-02-16',
'2016-02-17',
'2016-02-18',
'2016-02-19',
'2016-02-20',
'2016-02-21',
'2016-02-22',
'2016-02-23',
'2016-02-24',
'2016-02-25',
'2016-02-26',
'2016-02-27',
'2016-02-28',
'2016-02-29']}
We can see that we have data for all days from 1st January to 29th February 2016, for each of three event types: 'calls', 'sms' and 'mds' (mobile data sessions). Let's define our three reference periods as:
- 'benchmark' period: the first 21 days of available data,
- 'comparison' period: the last 14 days before the crisis begins,
- 'focal' period: the first 14 days from the start of the crisis.
To count changes in subscribers' home locations between these periods, we need to define what we mean by 'home' locations. Here we will calculate 'modal locations'; a subscriber's modal location is calculated by assigning them a single location each day, and then finding their most common daily location over the period of interest. FlowKit provides two methods for determining the daily locations: 'last' (i.e. the location from which they made the last call of the day) or 'most-common' (i.e. the location from which they most commonly made/received calls during the day). Here we will use the 'last' method.
We can use the flowclient modal_location_from_dates
function to create a modal location query specification for the benchmark period. (A query specification is a dictionary containing the parameters of a FlowKit query). This takes four arguments:
start_date
: the first date in the benchmark periodend_date
: the day after the last date in the benchmark periodmethod
: the method to use for calculating daily locations ("last" or "most-common")aggregation_unit
: the spatial aggregation level (here we will use "admin3", to use administrative level 3 regions as the subscriber locations)
Note: when specifying FlowKit queries, the end date is not included in the interval.
# Create modal location query specification for benchmark period
# (Note: this is not the query result, just the parameters we will use to run the query)
benchmark_locations_spec = fc.modal_location_from_dates_spec(
start_date="2016-01-01", # First date in the period
end_date="2016-01-22", # Day after the last date in the period
method="last", # Use last event of the day to define daily locations
aggregation_unit="admin3", # Use administrative level 3 regions as the locations
)
We repeat this process to create two more modal location query specifications, for the comparison and focal periods:
# Create modal location query specifications for comparison and focal periods
# (Note: these are not the query results, just the parameters we will use to run the queries)
comparison_locations_spec = fc.modal_location_from_dates_spec(
start_date="2016-01-27", # First date in the period
end_date="2016-02-10", # Day after the last date in the period
method="last", # Use last event of the day to define daily locations
aggregation_unit="admin3", # Use administrative level 3 regions as the locations
)
focal_locations_spec = fc.modal_location_from_dates_spec(
start_date="2016-02-10", # First date in the period
end_date="2016-02-24", # Day after the last date in the period
method="last", # Use last event of the day to define daily locations
aggregation_unit="admin3", # Use administrative level 3 regions as the locations
)
Let's put these three query specifications into a dictionary, so we can more easily loop over them later (when running the queries and getting results):
# Put the three modal location query specifications in a dictionary
home_locations_specs = {
"benchmark": benchmark_locations_spec,
"comparison": comparison_locations_spec,
"focal": focal_locations_spec,
}
Next, we use the flows
function to create specifications for the two flows queries. Each flows query takes two modal location query specifications as parameters from_location
and to_location
, and an aggregation_unit
parameter that must match the aggregation units of the two location queries.
# Create query specifications for the two flows
normal_flows_spec = fc.aggregates.flows_spec(
from_location=home_locations_specs[
"benchmark"
], # 'from' location is the benchmark modal locations
to_location=home_locations_specs[
"comparison"
], # 'to' location is the comparison modal locations
)
crisis_flows_spec = fc.aggregates.flows_spec(
from_location=home_locations_specs[
"benchmark"
], # 'from' location is the benchmark modal locations
to_location=home_locations_specs[
"focal"
], # 'to' location is the focal modal locations
)
2. Run the queries¶
Modal location queries return subscriber-level results, which cannot be accessed directly through FlowAPI. We can create spatially-aggregated query specifications by passing each modal location query specification to the spatial_aggregate
function, and pass these query specifications to the run_query
function to start running them.
Because some FlowKit queries may take a long time to run, the run_query
function doesn't return the result of the query. Instead, it returns a query ID that can be used to check the status of the query run, and get the result when it is finished.
# Run spatial aggregates of the home locations queries
home_locations_ids = {
period: fc.run_query(
connection=conn,
query_spec=fc.aggregates.spatial_aggregate_spec(
locations=query_spec
), # Wrap the query spec in a spatial aggregate query, so that we are allowed to get the result
)
for period, query_spec in home_locations_specs.items() # Loop over the three query specifications and run them all
}
home_locations_ids
{'benchmark': '6eba3dedc7f7c18b70018adc3a7b10dc',
'comparison': 'ca2dc96bc1b1a90148fab927be3e0154',
'focal': '12359ea2867bce812525846b76517d8f'}
Unlike the modal_location
queries, flows
queries return aggregated results, so we can directly run the two flows
queries.
# Run the flows queries
normal_flows_id = fc.run_query(connection=conn, query_spec=normal_flows_spec)
crisis_flows_id = fc.run_query(connection=conn, query_spec=crisis_flows_spec)
print(normal_flows_id, crisis_flows_id)
We can periodically check the status of the queries using the get_status
function.
# Get status of flows queries
for flows_id in [normal_flows_id, crisis_flows_id]:
print(fc.get_status(connection=conn, query_id=flows_id))
Visualise the distributions of locations¶
The flows queries may take some time to calculate. While we wait, let's visualise the results of the spatially-aggregated home locations queries, to see the distributions of home locations during the three periods.
1. Download geography data¶
If we want to display our query results on a map, we will need to combine them with geography data for the level 3 administrative regions. We can download the shapes of the spatial regions as a GeoJSON dictionary using the get_geography
function.
# Download geography data as GeoJSON
# (aggregation_unit here needs to match the one we used for the modal locations and flows query specifications)
regions = fc.get_geography(connection=conn, aggregation_unit="admin3")
To help with joining the geography data to query results, it is helpful to load the GeoJSON into a geopandas GeoDataFrame. The dataframe has three columns:
- 'pcod': the P-code that identifies each admin3 region. We set this as the index.
- 'geometry': the polygons that describe the shapes of the regions.
- 'centroid': a point location in the centre of each region. We won't use the centroid data in this example.
# Create a geopandas GeoDataFrame from the GeoJSON
regions_geodataframe = gpd.GeoDataFrame.from_features(regions).set_index("pcod")
regions_geodataframe.head()
pcod |
geometry  |
centroid  |
---|---|---|
NPL.1.1.1_1 | MULTIPOLYGON (((85.430763 27.624517, 85.425789 ... | {'type': 'Point', 'coordinates': [85.442905911,... |
NPL.1.1.2_1 | MULTIPOLYGON (((84.802307 27.671244, 84.79982 2... | {'type': 'Point', 'coordinates': [84.960119929,... |
NPL.1.1.5_1 | MULTIPOLYGON (((85.446045 27.481852, 85.448784 ... | {'type': 'Point', 'coordinates': [85.344826394,... |
NPL.1.1.3_1 | MULTIPOLYGON (((85.241989 27.580898, 85.240997 ... | {'type': 'Point', 'coordinates': [85.349621081,... |
NPL.1.1.4_1 | MULTIPOLYGON (((85.710251 27.394238, 85.702797 ... | {'type': 'Point', 'coordinates': [85.621936812,... |
The GeoDataFrame has a 'plot' method that allows us to quickly see the shapes of the regions:
regions_geodataframe.plot();
2. Get query results¶
Hopefully our home locations queries have finished running by now. We can use the get_result
function to get the results of the spatial aggregate queries as pandas DataFrames.
# Get results for the spatially-aggregated home locations queries
home_locations_results = {
period: fc.get_result(
connection=conn,
query_spec=fc.aggregates.spatial_aggregate_spec(locations=query_spec),
)
for period, query_spec in home_locations_specs.items()
}
Let's look at the "benchmark" dataframe. It has two columns:
- 'pcod' is the P-code that identifies the region. We-ll use this to join to the geography data.
- 'value' is the number of subscribers whose modal location is in each region.
home_locations_results["benchmark"].head()
pcod | value | |
---|---|---|
0 | NPL.2.3.3_1 | 2454 |
1 | NPL.3.2.5_1 | 1403 |
2 | NPL.1.3.3_1 | 272 |
3 | NPL.4.1.1_1 | 1265 |
4 | NPL.1.1.3_1 | 4728 |
3. Visualise the results¶
Now that we have the home locations query results and geography data for the spatial regions, we can use these to create a choropleth map. In this example we will use the Mapbox GL
library for the visualisation, which requires us to first join the datasets.
Since we have all datasets in pandas/geopandas dataframes, we can use the 'join' method to join them, using the 'pcod' column.
# First, rename the 'value' column in each of the result dataframes, so they all have different names
for period in home_locations_results.keys():
home_locations_results[period] = home_locations_results[period].rename(
columns={"value": f"Total ({period} period)"}
)
# Join the result dataframes to the geography data, using the 'pcod' column
# (we provide a list of dataframes here, to join them all at once)
home_locations_geodataframe = regions_geodataframe.join(
[result.set_index("pcod") for result in home_locations_results.values()],
how="left",
)
home_locations_geodataframe.head()
pcod |
geometry  |
centroid  |
Total (benchmark period) Â |
Total (comparison period) Â |
Total (focal period) Â |
---|---|---|---|---|---|
NPL.1.1.1_1 | MULTIPOLYGON (((85.430763 27.624517, 85.425789 ... | {'type': 'Point', 'coordinates': [85.442905911,... | 503 | 532 | 525 |
NPL.1.1.2_1 | MULTIPOLYGON (((84.802307 27.671244, 84.79982 2... | {'type': 'Point', 'coordinates': [84.960119929,... | 417 | 407 | 422 |
NPL.1.1.5_1 | MULTIPOLYGON (((85.446045 27.481852, 85.448784 ... | {'type': 'Point', 'coordinates': [85.344826394,... | 1354 | 1359 | 1367 |
NPL.1.1.3_1 | MULTIPOLYGON (((85.241989 27.580898, 85.240997 ... | {'type': 'Point', 'coordinates': [85.349621081,... | 4728 | 4796 | 4703 |
NPL.1.1.4_1 | MULTIPOLYGON (((85.710251 27.394238, 85.702797 ... | {'type': 'Point', 'coordinates': [85.621936812,... | nan | nan | nan |
There are some entries with missing values in the joined dataframe. This is because FlowKit redacts any values representing fewer than 15 subscribers, to protect those subscribers' privacy, so we are missing the results for regions where fewer than 15 people have their modal locations. We can assume that there are 0 subscribers in these regions, so let's fill the missing values with 0.
# Use pandas 'fillna' method to fill missing values with 0
home_locations_geodataframe = home_locations_geodataframe.fillna(0)
We will be able to hover over the map and see the values in each column, so let's drop columns we don't need and ensure the remaining columns have sensible names.
# Drop the 'centroid' column, since we're not using it
home_locations_geodataframe = home_locations_geodataframe.drop(columns="centroid")
# Rename the 'pcod' column to 'P-code'
home_locations_geodataframe = home_locations_geodataframe.reset_index().rename(
columns={"pcod": "P-code"}
)
home_locations_geodataframe.head()
P-code | geometry | Total (benchmark period) | Total (comparison period) | Total (focal period) | |
---|---|---|---|---|---|
0 | NPL.1.1.1_1 | MULTIPOLYGON (((85.430763 27.624517, 85.425789 ... | 503 | 532 | 525 |
1 | NPL.1.1.2_1 | MULTIPOLYGON (((84.802307 27.671244, 84.79982 2... | 417 | 407 | 422 |
2 | NPL.1.1.5_1 | MULTIPOLYGON (((85.446045 27.481852, 85.448784 ... | 1354 | 1359 | 1367 |
3 | NPL.1.1.3_1 | MULTIPOLYGON (((85.241989 27.580898, 85.240997 ... | 4728 | 4796 | 4703 |
4 | NPL.1.1.4_1 | MULTIPOLYGON (((85.710251 27.394238, 85.702797 ... | 0 | 0 | 0 |
Now we are ready to create a choropleth showing the distribution of modal locations, using mapboxgl.ChoroplethViz
.
Note: Mapbox requires an access token, which should be set as the environment variable MAPBOX_ACCESS_TOKEN
. Note that this is only required for producing the Mapbox visualisations, which is completely separate from FlowKit.
period_to_show = "benchmark" # Change this to "comparison" or "focal" to show results for the other periods
mapbox_token = os.environ["MAPBOX_ACCESS_TOKEN"]
# Colour scale for legend
max_total = max(
[
home_locations_geodataframe[f"Total ({period} period)"].max()
for period in home_locations_specs.keys()
]
)
color_stops = create_color_stops(np.linspace(0, max_total, 9), colors="YlGn")
# Create a map using mapboxgl.ChoroplethViz
home_locations_viz = mapboxgl.ChoroplethViz(
home_locations_geodataframe.__geo_interface__,
access_token=mapbox_token,
color_property=f"Total ({period_to_show} period)",
color_stops=color_stops,
opacity=0.8,
line_color="black",
line_width=0.5,
legend_gradient=True,
legend_layout="horizontal",
legend_text_numeric_precision=0,
below_layer="waterway-label",
center=(84.1, 28.4),
zoom=5.5,
)
# Show the visualisation
home_locations_viz.show()