
Flows Above Normal

Post-crisis analysis

In this worked example we assume the role of an analyst working in the aftermath of a significant crisis in Nepal. Our aim is to use FlowKit to investigate which administrative regions people have been displaced from and to during the crisis, following the methodology used in this paper.

The Jupyter notebook for this worked example can be downloaded here, or can be run using the quick start setup.

Introduction

We want to investigate changes in people's home locations caused by the crisis. To do this we will calculate an origin-destination matrix, containing the number of subscribers who lived in region A during a 'benchmark' period before the crisis, but had moved to region B during a 'focal' period after the start of the crisis, for each pair of regions A and B. We call this the 'flows' from the benchmark period to the focal period.

Some of these movements would have been made under normal circumstances, and were not caused by the crisis. To separate the crisis-driven movements from normal mobility patterns, we will also calculate a second origin-destination matrix, containing the flows from the same benchmark period to a 'comparison' period shortly before the crisis began. We can then subtract these 'normal' flows from the 'crisis' flows to get the flows above normal, which we assume to have been caused by the crisis. The diagram below shows the two flows that we will calculate.
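The subtraction step amounts to aligning the two origin-destination matrices on their region pairs and differencing the counts. A minimal sketch of that idea with pandas, using made-up region codes and counts (the column names pcod_from, pcod_to and value mirror the shape of FlowKit flows results, but are assumptions here):

```python
import pandas as pd

# Hypothetical flows results: one row per (origin, destination) pair.
# The data below is invented purely for illustration.
normal_flows = pd.DataFrame(
    {"pcod_from": ["A", "A", "B"], "pcod_to": ["A", "B", "B"], "value": [100, 10, 80]}
)
crisis_flows = pd.DataFrame(
    {"pcod_from": ["A", "A", "B"], "pcod_to": ["A", "B", "A"], "value": [60, 45, 20]}
)

# Align the two matrices on the (origin, destination) pair, treating pairs
# absent from one matrix as zero flow, then subtract to get flows above normal.
flows_above_normal = (
    crisis_flows.set_index(["pcod_from", "pcod_to"])["value"]
    .sub(normal_flows.set_index(["pcod_from", "pcod_to"])["value"], fill_value=0)
    .reset_index(name="flow_above_normal")
)
print(flows_above_normal)
```

Note the fill_value=0 in the subtraction: a region pair that appears in only one of the two matrices is treated as having zero flow in the other, so no pairs are silently dropped.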

Flows above normal

We will finish by using the mapboxgl library to display the total flows above normal in or out of each region on an interactive map, which will look like this:

Screenshot of inflows visualisation

Load FlowClient and connect to FlowAPI

We start by importing FlowClient. We also import geopandas and mapboxgl, which we will use later to visualise the data.

import flowclient as fc
import os
import numpy as np
import geopandas as gpd
import mapboxgl
from mapboxgl.utils import create_color_stops

%matplotlib inline

We must next generate a FlowAPI access token using FlowAuth. If you are running this notebook using the quick start setup, generating a token requires the following steps:

  1. Visit the FlowAuth login page at http://localhost:9091.
  2. Log in with username TEST_USER and password DUMMY_PASSWORD.
  3. Under "My Servers", select TEST_SERVER.
  4. Click the + button to create a new token.
  5. Give the new token a name, and click SAVE.
  6. Copy the token string using the COPY button.
  7. Paste the token in this notebook as TOKEN.

The steps are the same in a production setup, but the FlowAuth URL, login details and server name will differ.

Tokens are only valid for a fixed time, so once a token has expired you will need to create a new one.

Once we have a token, we can start a connection to the FlowAPI system. If you are connecting to FlowAPI over https and the system administrator has provided you with an SSL certificate file, you should provide the path to this file as the ssl_certificate argument to flowclient.connect() (in this example, you can set the path in the environment variable SSL_CERTIFICATE_FILE). If you are connecting over http, this argument is not required.

conn = fc.connect(
    url=os.getenv("FLOWAPI_URL", "http://localhost:9090"), # This is the URL for the interface flowclient uses to communicate with the FlowKit server
    token=TOKEN,                                           # Replace TOKEN with the token copied from FlowAuth
    ssl_certificate=os.getenv("SSL_CERTIFICATE_FILE"),     # Path to the SSL certificate file (required if FlowAPI uses an https connection)
)

Calculate origin-destination matrices

1. Create query specifications

We first need to choose the three time periods (benchmark, comparison and focal) between which we will calculate flows. The simulated crisis in our fake dataset begins on 10th February 2016, so both the benchmark and comparison periods should fall entirely before this date, and the focal period should be after this date.

We can use the flowclient get_available_dates function to check which dates have data available in FlowKit. We need to provide the connection we have just created as the connection argument; flowclient uses this to communicate with the FlowKit server.

fc.get_available_dates(connection=conn)
{'calls': ['2016-01-01',
           '2016-01-02',
           '2016-01-03',
           '2016-01-04',
           '2016-01-05',
           '2016-01-06',
           '2016-01-07',
           '2016-01-08',
           '2016-01-09',
           '2016-01-10',
           '2016-01-11',
           '2016-01-12',
           '2016-01-13',
           '2016-01-14',
           '2016-01-15',
           '2016-01-16',
           '2016-01-17',
           '2016-01-18',
           '2016-01-19',
           '2016-01-20',
           '2016-01-21',
           '2016-01-22',
           '2016-01-23',
           '2016-01-24',
           '2016-01-25',
           '2016-01-26',
           '2016-01-27',
           '2016-01-28',
           '2016-01-29',
           '2016-01-30',
           '2016-01-31',
           '2016-02-01',
           '2016-02-02',
           '2016-02-03',
           '2016-02-04',
           '2016-02-05',
           '2016-02-06',
           '2016-02-07',
           '2016-02-08',
           '2016-02-09',
           '2016-02-10',
           '2016-02-11',
           '2016-02-12',
           '2016-02-13',
           '2016-02-14',
           '2016-02-15',
           '2016-02-16',
           '2016-02-17',
           '2016-02-18',
           '2016-02-19',
           '2016-02-20',
           '2016-02-21',
           '2016-02-22',
           '2016-02-23',
           '2016-02-24',
           '2016-02-25',
           '2016-02-26',
           '2016-02-27',
           '2016-02-28',
           '2016-02-29'],
 'mds': ['2016-01-01',
         '2016-01-02',
         '2016-01-03',
         '2016-01-04',
         '2016-01-05',
         '2016-01-06',
         '2016-01-07',
         '2016-01-08',
         '2016-01-09',
         '2016-01-10',
         '2016-01-11',
         '2016-01-12',
         '2016-01-13',
         '2016-01-14',
         '2016-01-15',
         '2016-01-16',
         '2016-01-17',
         '2016-01-18',
         '2016-01-19',
         '2016-01-20',
         '2016-01-21',
         '2016-01-22',
         '2016-01-23',
         '2016-01-24',
         '2016-01-25',
         '2016-01-26',
         '2016-01-27',
         '2016-01-28',
         '2016-01-29',
         '2016-01-30',
         '2016-01-31',
         '2016-02-01',
         '2016-02-02',
         '2016-02-03',
         '2016-02-04',
         '2016-02-05',
         '2016-02-06',
         '2016-02-07',
         '2016-02-08',
         '2016-02-09',
         '2016-02-10',
         '2016-02-11',
         '2016-02-12',
         '2016-02-13',
         '2016-02-14',
         '2016-02-15',
         '2016-02-16',
         '2016-02-17',
         '2016-02-18',
         '2016-02-19',
         '2016-02-20',
         '2016-02-21',
         '2016-02-22',
         '2016-02-23',
         '2016-02-24',
         '2016-02-25',
         '2016-02-26',
         '2016-02-27',
         '2016-02-28',
         '2016-02-29'],
 'sms': ['2016-01-01',
         '2016-01-02',
         '2016-01-03',
         '2016-01-04',
         '2016-01-05',
         '2016-01-06',
         '2016-01-07',
         '2016-01-08',
         '2016-01-09',
         '2016-01-10',
         '2016-01-11',
         '2016-01-12',
         '2016-01-13',
         '2016-01-14',
         '2016-01-15',
         '2016-01-16',
         '2016-01-17',
         '2016-01-18',
         '2016-01-19',
         '2016-01-20',
         '2016-01-21',
         '2016-01-22',
         '2016-01-23',
         '2016-01-24',
         '2016-01-25',
         '2016-01-26',
         '2016-01-27',
         '2016-01-28',
         '2016-01-29',
         '2016-01-30',
         '2016-01-31',
         '2016-02-01',
         '2016-02-02',
         '2016-02-03',
         '2016-02-04',
         '2016-02-05',
         '2016-02-06',
         '2016-02-07',
         '2016-02-08',
         '2016-02-09',
         '2016-02-10',
         '2016-02-11',
         '2016-02-12',
         '2016-02-13',
         '2016-02-14',
         '2016-02-15',
         '2016-02-16',
         '2016-02-17',
         '2016-02-18',
         '2016-02-19',
         '2016-02-20',
         '2016-02-21',
         '2016-02-22',
         '2016-02-23',
         '2016-02-24',
         '2016-02-25',
         '2016-02-26',
         '2016-02-27',
         '2016-02-28',
         '2016-02-29']}

We can see that we have data for all days from 1st January to 29th February 2016, for each of three event types: 'calls', 'sms' and 'mds' (mobile data sessions). Let's define our three reference periods as:

  • 'benchmark' period: the first 21 days of available data,
  • 'comparison' period: the last 14 days before the crisis begins,
  • 'focal' period: the first 14 days from the start of the crisis.

To count changes in subscribers' home locations between these periods, we need to define what we mean by 'home' locations. Here we will calculate 'modal locations'; a subscriber's modal location is calculated by assigning them a single location each day, and then finding their most common daily location over the period of interest. FlowKit provides two methods for determining the daily locations: 'last' (i.e. the location from which they made the last call of the day) or 'most-common' (i.e. the location from which they most commonly made/received calls during the day). Here we will use the 'last' method.

We can use the flowclient modal_location_from_dates_spec function to create a modal location query specification for the benchmark period. (A query specification is a dictionary containing the parameters of a FlowKit query.) This takes four arguments:

  • start_date: the first date in the benchmark period
  • end_date: the day after the last date in the benchmark period
  • method: the method to use for calculating daily locations ("last" or "most-common")
  • aggregation_unit: the spatial aggregation level (here we will use "admin3", to use administrative level 3 regions as the subscriber locations)

Note: when specifying FlowKit queries, the end date is not included in the interval.
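Because end dates are exclusive, the period boundaries used in the query specifications below can be derived with simple date arithmetic: a period of N days starting on a given date ends on that date plus N days. A sketch (the crisis start date comes from the fake dataset described above):

```python
from datetime import date, timedelta

crisis_start = date(2016, 2, 10)  # Start of the simulated crisis in the fake dataset

# End dates are exclusive: a period of N days starting on `start` has
# end date `start + N days`.
benchmark_start = date(2016, 1, 1)
benchmark_end = benchmark_start + timedelta(days=21)  # First 21 days of data

comparison_end = crisis_start                          # Last 14 days before the crisis
comparison_start = comparison_end - timedelta(days=14)

focal_start = crisis_start                             # First 14 days of the crisis
focal_end = focal_start + timedelta(days=14)

print(benchmark_end.isoformat())    # End of benchmark period (exclusive)
print(comparison_start.isoformat()) # Start of comparison period
print(focal_end.isoformat())        # End of focal period (exclusive)
```

These are exactly the date strings passed to the specifications below ("2016-01-22", "2016-01-27" and "2016-02-24").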

# Create modal location query specification for benchmark period
# (Note: this is not the query result, just the parameters we will use to run the query)

benchmark_locations_spec = fc.modal_location_from_dates_spec(
    start_date="2016-01-01",    # First date in the period
    end_date="2016-01-22",      # Day after the last date in the period
    method="last",              # Use last event of the day to define daily locations
    aggregation_unit="admin3"   # Use administrative level 3 regions as the locations
)

We repeat this process to create two more modal location query specifications, for the comparison and focal periods:

# Create modal location query specifications for comparison and focal periods
# (Note: these are not the query results, just the parameters we will use to run the queries)

comparison_locations_spec = fc.modal_location_from_dates_spec(
    start_date="2016-01-27",    # First date in the period
    end_date="2016-02-10",      # Day after the last date in the period
    method="last",              # Use last event of the day to define daily locations
    aggregation_unit="admin3"   # Use administrative level 3 regions as the locations
)

focal_locations_spec = fc.modal_location_from_dates_spec(
    start_date="2016-02-10",    # First date in the period
    end_date="2016-02-24",      # Day after the last date in the period
    method="last",              # Use last event of the day to define daily locations
    aggregation_unit="admin3"   # Use administrative level 3 regions as the locations
)

Let's put these three query specifications into a dictionary, so we can more easily loop over them later (when running the queries and getting results):

# Put the three modal location query specifications in a dictionary
home_locations_specs = {
    "benchmark": benchmark_locations_spec,
    "comparison": comparison_locations_spec,
    "focal": focal_locations_spec,
}

Next, we use the flows_spec function to create specifications for the two flows queries. Each flows query takes two modal location query specifications as its from_location and to_location parameters; the two location queries must use the same aggregation unit.

# Create query specifications for the two flows

normal_flows_spec = fc.aggregates.flows_spec(
    from_location=home_locations_specs["benchmark"], # 'from' location is the benchmark modal locations
    to_location=home_locations_specs["comparison"],  # 'to' location is the comparison modal locations
)

crisis_flows_spec = fc.aggregates.flows_spec(
    from_location=home_locations_specs["benchmark"], # 'from' location is the benchmark modal locations
    to_location=home_locations_specs["focal"],       # 'to' location is the focal modal locations
)

2. Run the queries

Modal location queries return subscriber-level results, which cannot be accessed directly through FlowAPI. We therefore wrap each modal location query specification in a spatially-aggregated query specification, using the spatial_aggregate_spec function, and pass the resulting specifications to the run_query function to start running them.

Because some FlowKit queries may take a long time to run, the run_query function doesn't return the result of the query. Instead, it returns a query ID that can be used to check the status of the query run, and get the result when it is finished.

# Run spatial aggregates of the home locations queries

home_locations_ids = {
    period: fc.run_query(
        connection=conn,
        query_spec=fc.aggregates.spatial_aggregate_spec(locations=query_spec), # Wrap the query spec in a spatial aggregate query, so that we are allowed to get the result
    )
    for period, query_spec in home_locations_specs.items() # Loop over the three query specifications and run them all
}

home_locations_ids
{'benchmark': 'b3aa6bf17c298c6b3f98d61b70ec90e6',
 'comparison': '9b62b3d59817cddc765dac49ed10f9dc',
 'focal': 'f0994a138db73b08dca62cd3f2f340dc'}

Unlike the modal_location queries, flows queries return aggregated results, so we can directly run the two flows queries.

# Run the flows queries
normal_flows_id = fc.run_query(connection=conn, query_spec=normal_flows_spec)
crisis_flows_id = fc.run_query(connection=conn, query_spec=crisis_flows_spec)

print(normal_flows_id, crisis_flows_id)
6339197a5e88ed6be6c915455af14c4b d031b1fb9720215fab9985e37653fdbf

We can periodically check the status of the queries using the get_status function.

# Get status of flows queries
for flows_id in [normal_flows_id, crisis_flows_id]:
    print(fc.get_status(connection=conn, query_id=flows_id))
completed
completed
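If a query is still running, it can be convenient to poll get_status in a loop until it reports 'completed' before fetching results. A minimal sketch (the wait_until_complete helper and the simulated status sequence are assumptions for illustration; in a real session, get_status would be a wrapper around fc.get_status(connection=conn, query_id=query_id), and the exact set of status strings may vary between FlowKit versions):

```python
import time

def wait_until_complete(get_status, poll_interval=1.0, max_wait=600.0):
    """Poll a zero-argument status callable until it reports 'completed'."""
    waited = 0.0
    while waited <= max_wait:
        status = get_status()
        if status == "completed":
            return status
        if status in ("errored", "cancelled"):  # Assumed failure statuses
            raise RuntimeError(f"Query finished with status '{status}'")
        time.sleep(poll_interval)
        waited += poll_interval
    raise TimeoutError("Query did not complete within max_wait seconds")

# Simulated status sequence for illustration; a real run would poll FlowAPI
statuses = iter(["queued", "executing", "completed"])
result = wait_until_complete(lambda: next(statuses), poll_interval=0.01)
print(result)
```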

Visualise the distributions of locations

The flows queries may take some time to calculate. While we wait, let's visualise the results of the spatially-aggregated home locations queries, to see the distributions of home locations during the three periods.

1. Download geography data

If we want to display our query results on a map, we will need to combine them with geography data for the level 3 administrative regions. We can download the shapes of the spatial regions as a GeoJSON dictionary using the get_geography function.

# Download geography data as GeoJSON
# (aggregation_unit here needs to match the one we used for the modal locations and flows query specifications)
regions = fc.get_geography(connection=conn, aggregation_unit="admin3")

To make it easier to join the geography data to query results, we load the GeoJSON into a geopandas GeoDataFrame. The dataframe has three columns:

  • 'pcod': the P-code that identifies each admin3 region. We set this as the index.
  • 'geometry': the polygons that describe the shapes of the regions.
  • 'centroid': a point location in the centre of each region. We won't use the centroid data in this example.
# Create a geopandas GeoDataFrame from the GeoJSON
regions_geodataframe = gpd.GeoDataFrame.from_features(regions).set_index("pcod")

regions_geodataframe.head()


             geometry                                            centroid
pcod
NPL.1.1.1_1  MULTIPOLYGON (((85.43076324 27.62451744, 85.425...  {'type': 'Point', 'coordinates': [85.442905911,...
NPL.1.1.2_1  MULTIPOLYGON (((84.80230713 27.67124367, 84.799...  {'type': 'Point', 'coordinates': [84.960119929,...
NPL.1.1.3_1  MULTIPOLYGON (((85.24198914 27.58089828, 85.240...  {'type': 'Point', 'coordinates': [85.349621081,...
NPL.1.1.4_1  MULTIPOLYGON (((85.71025084999999 27.39423752, ...  {'type': 'Point', 'coordinates': [85.621936812,...
NPL.1.1.5_1  MULTIPOLYGON (((85.44604492000001 27.48185158, ...  {'type': 'Point', 'coordinates': [85.344826394,...

The GeoDataFrame has a 'plot' method that allows us to quickly see the shapes of the regions:

regions_geodataframe.plot();

2. Get query results

Hopefully our home locations queries have finished running by now. We can use the get_result function to get the results of the spatial aggregate queries as pandas DataFrames.

# Get results for the spatially-aggregated home locations queries
home_locations_results = {
    period: fc.get_result(
        connection=conn,
        query_spec=fc.aggregates.spatial_aggregate_spec(locations=query_spec),
    )
    for period, query_spec in home_locations_specs.items()
}

Let's look at the "benchmark" dataframe. It has two columns:

  • 'pcod' is the P-code that identifies the region. We'll use this to join to the geography data.
  • 'value' is the number of subscribers whose modal location is in each region.
home_locations_results["benchmark"].head()
pcod value
0 NPL.4.1.1_1 616
1 NPL.1.3.3_1 378
2 NPL.1.1.3_1 3503
3 NPL.3.1.4_1 1602
4 NPL.1.2.1_1 655

3. Visualise the results

Now that we have the home locations query results and geography data for the spatial regions, we can use these to create a choropleth map. In this example we will use the Mapbox GL library for the visualisation, which requires us to first join the datasets.

Since we have all datasets in pandas/geopandas dataframes, we can use the 'join' method to join them, using the 'pcod' column.

# First, rename the 'value' column in each of the result dataframes, so they all have different names
for period in home_locations_results.keys():
    home_locations_results[period] = home_locations_results[period].rename(
        columns={"value": f"Total ({period} period)"}
    )

# Join the result dataframes to the geography data, using the 'pcod' column
# (we provide a list of dataframes here, to join them all at once)
home_locations_geodataframe = regions_geodataframe.join(
    [result.set_index("pcod") for result in home_locations_results.values()],
    how="left",
)

home_locations_geodataframe.head()


             geometry                                            centroid                                            Total (benchmark period)  Total (comparison period)  Total (focal period)
pcod
NPL.1.1.1_1  MULTIPOLYGON (((85.43076324 27.62451744, 85.425...  {'type': 'Point', 'coordinates': [85.442905911,...  776   775   nan
NPL.1.1.2_1  MULTIPOLYGON (((84.80230713 27.67124367, 84.799...  {'type': 'Point', 'coordinates': [84.960119929,...  789   786   nan
NPL.1.1.3_1  MULTIPOLYGON (((85.24198914 27.58089828, 85.240...  {'type': 'Point', 'coordinates': [85.349621081,...  3503  3500  nan
NPL.1.1.4_1  MULTIPOLYGON (((85.71025084999999 27.39423752, ...  {'type': 'Point', 'coordinates': [85.621936812,...  149   135   nan
NPL.1.1.5_1  MULTIPOLYGON (((85.44604492000001 27.48185158, ...  {'type': 'Point', 'coordinates': [85.344826394,...  234   246   nan

There are some entries with missing values in the joined dataframe. This is because FlowKit redacts any values representing fewer than 15 subscribers, to protect those subscribers' privacy, so we are missing the results for regions where fewer than 15 people have their modal locations. We can assume that there are 0 subscribers in these regions, so let's fill the missing values with 0.

# Use pandas 'fillna' method to fill missing values with 0
home_locations_geodataframe = home_locations_geodataframe.fillna(0)

We will be able to hover over the map and see the values in each column, so let's drop columns we don't need and ensure the remaining columns have sensible names.

# Drop the 'centroid' column, since we're not using it
home_locations_geodataframe = home_locations_geodataframe.drop(columns="centroid")
# Rename the 'pcod' column to 'P-code'
home_locations_geodataframe = home_locations_geodataframe.reset_index().rename(
    columns={"pcod": "P-code"}
)

home_locations_geodataframe.head()
P-code geometry Total (benchmark period) Total (comparison period) Total (focal period)
0 NPL.1.1.1_1 MULTIPOLYGON (((85.43076324 27.62451744, 85.425... 776 775 0
1 NPL.1.1.2_1 MULTIPOLYGON (((84.80230713 27.67124367, 84.799... 789 786 0
2 NPL.1.1.3_1 MULTIPOLYGON (((85.24198914 27.58089828, 85.240... 3503 3500 0
3 NPL.1.1.4_1 MULTIPOLYGON (((85.71025084999999 27.39423752, ... 149 135 0
4 NPL.1.1.5_1 MULTIPOLYGON (((85.44604492000001 27.48185158, ... 234 246 0

Now we are ready to create a choropleth showing the distribution of modal locations, using mapboxgl.ChoroplethViz.

Note: Mapbox requires an access token, which should be set as the environment variable MAPBOX_ACCESS_TOKEN. This token is only needed for the Mapbox visualisations, which are entirely separate from FlowKit.

period_to_show = "benchmark"  # Change this to "comparison" or "focal" to show results for the other periods

mapbox_token = os.environ["MAPBOX_ACCESS_TOKEN"]

# Colour scale for legend
max_total = max(
    [
        home_locations_geodataframe[f"Total ({period} period)"].max()
        for period in home_locations_specs.keys()
    ]
)
color_stops = create_color_stops(np.linspace(0, max_total, 9), colors="YlGn")

# Create a map using mapboxgl.ChoroplethViz
home_locations_viz = mapboxgl.ChoroplethViz(
    home_locations_geodataframe.__geo_interface__,
    access_token=mapbox_token,
    color_property=f"Total ({period_to_show} period)",
    color_stops=color_stops,
    opacity=0.8,
    line_color="black",
    line_width=0.5,
    legend_gradient=True,
    legend_layout="horizontal",
    legend_text_numeric_precision=0,
    below_layer="waterway-label",
    center=(84.1, 28.4),
    zoom=5.5,
)

# Show the visualisation
home_locations_viz.show()