Preprocessing Examples

Examine the example dataset format and preprocess example data. Explore properties of the PreprocessedData object and masking operations available.

[93]:
from epimodel.preprocessing.data_preprocessor import preprocess_data
import numpy as np
import pandas as pd

Example Dataset

Data format:

[79]:
pd.read_csv('../notebooks/double-entry-data/double_entry_final.csv').head()
[79]:
Country Code Date Region Name Confirmed Active Deaths Mask Wearing Symptomatic Testing Gatherings <1000 Gatherings <100 ... Some Businesses Suspended Most Businesses Suspended School Closure University Closure Stay Home Order Travel Screen/Quarantine Travel Bans Public Transport Limited Internal Movement Limited Public Information Campaigns
0 AL 2020-01-22 00:00:00+00:00 Albania 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 AL 2020-01-23 00:00:00+00:00 Albania 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 AL 2020-01-24 00:00:00+00:00 Albania 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 AL 2020-01-25 00:00:00+00:00 Albania 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 AL 2020-01-26 00:00:00+00:00 Albania 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 21 columns

Loading example data:

[80]:
data = preprocess_data('../notebooks/double-entry-data/double_entry_final.csv')
Dropping NPI Travel Screen/Quarantine
Dropping NPI Travel Bans
Dropping NPI Public Transport Limited
Dropping NPI Internal Movement Limited
Dropping NPI Public Information Campaigns
Dropping NPI Symptomatic Testing
Masking invalid values

Exploring Preprocessed Data

NPI names:

[81]:
data.CMs
[81]:
['Mask Wearing',
 'Gatherings <1000',
 'Gatherings <100',
 'Gatherings <10',
 'Some Businesses Suspended',
 'Most Businesses Suspended',
 'School Closure',
 'University Closure',
 'Stay Home Order']

Region codes and names:

[82]:
data.RNames.groupby(level=0).first()
[82]:
Country Code
AD                   Andorra
AL                   Albania
AT                   Austria
BA    Bosnia and Herzegovina
BE                   Belgium
BG                  Bulgaria
CH               Switzerland
CZ            Czech Republic
DE                   Germany
DK                   Denmark
EE                   Estonia
ES                     Spain
FI                   Finland
FR                    France
GB            United Kingdom
GE                   Georgia
GR                    Greece
HR                   Croatia
HU                   Hungary
IE                   Ireland
IL                    Israel
IS                   Iceland
IT                     Italy
LT                 Lithuania
LV                    Latvia
MA                   Morocco
MT                     Malta
MX                    Mexico
MY                  Malaysia
NL               Netherlands
NO                    Norway
NZ               New Zealand
PL                    Poland
PT                  Portugal
RO                   Romania
RS                    Serbia
SE                    Sweden
SG                 Singapore
SI                  Slovenia
SK                  Slovakia
ZA              South Africa
Name: Region Name, dtype: object

All preprocessed data parameters

[111]:
data_details = {'Dates': data.Ds,
                'Region Abbreviations':data.Rs,
                'Region Names':data.RNames,
                'NPI names':data.CMs,
                'Active cases': data.Active,
                'Confirmed cases': data.Confirmed,
                'Deaths':data.Deaths,
                'New Deaths':data.NewDeaths,
                'New Cases':data.NewCases,
                'Active NPIs':data.ActiveCMs}

for k,v in data_details.items():
    try:
        dtype = v.dtype
    except AttributeError:
        dtype = type(v[0])
    print(f'{k} | Type:{type(v)} ({dtype}), Shape:{np.array(v).shape}')
Dates | Type:<class 'list'> (<class 'pandas._libs.tslibs.timestamps.Timestamp'>), Shape:(130,)
Region Abbreviations | Type:<class 'list'> (<class 'str'>), Shape:(30,)
Region Names | Type:<class 'pandas.core.series.Series'> (object), Shape:(5330,)
NPI names | Type:<class 'list'> (<class 'str'>), Shape:(9,)
Active cases | Type:<class 'numpy.ma.core.MaskedArray'> (float64), Shape:(30, 130)
Confirmed cases | Type:<class 'numpy.ma.core.MaskedArray'> (float64), Shape:(30, 130)
Deaths | Type:<class 'numpy.ma.core.MaskedArray'> (float64), Shape:(30, 130)
New Deaths | Type:<class 'numpy.ma.core.MaskedArray'> (float64), Shape:(30, 130)
New Cases | Type:<class 'numpy.ma.core.MaskedArray'> (float64), Shape:(30, 130)
Active NPIs | Type:<class 'numpy.ndarray'> (float64), Shape:(30, 9, 130)

Plotting Preprocesed Data

[84]:
cm_plot_style = [
            ("\uf963", "black"), # mask
            ("\uf0c0", "lightgrey"), # ppl
            ("\uf0c0", "grey"), # ppl
            ("\uf0c0", "black"), # ppl
            ("\uf07a", "tab:orange"), # shop 1
            ("\uf07a", "tab:red"), # shop2
            ("\uf549", "black"), # school
            ("\uf19d", "black"), # university
            ("\uf965", "black"), # home
            ("\uf072", "grey"), # plane1
            ("\uf072", "black"), # plane2
            ("\uf238", "black"), # train
            ("\uf1b9", "black"), # car
            ("\uf641", "black") # flyer
        ]
data.summary_plot(cm_plot_style)
'Font Awesome 5 Free-Solid-900.otf' can not be subsetted into a Type 3 font. The entire font will be embedded in the output.
../_images/examples_Preprocessing_Examples_16_1.png

Preprocessed Data Operations

Remove regions with fewer than 100 total deaths:

[83]:
data.remove_regions_min_deaths(100)
Region AL removed since it has 33.0 deaths on the last day
Region AD removed since it has 51.0 deaths on the last day
Region EE removed since it has 67.0 deaths on the last day
Region GE removed since it has 12.0 deaths on the last day
Region IS removed since it has 10.0 deaths on the last day
Region LV removed since it has 24.0 deaths on the last day
Region LT removed since it has 70.0 deaths on the last day
Region MT removed since it has -- deaths on the last day
Region NZ removed since it has 22.0 deaths on the last day
Region SG removed since it has 23.0 deaths on the last day
Region SK removed since it has 28.0 deaths on the last day

Remove region ‘AL’:

[112]:
data.remove_regions_from_codes('AL')

Mask dates for which NPIs were deactivated

[113]:
data.mask_reopenings()
Masking AT from 2020-05-04 00:00:00+00:00
Masking AT from 2020-05-21 00:00:00+00:00
Masking BE from 2020-05-14 00:00:00+00:00
Masking BA from 2020-05-17 00:00:00+00:00
Masking BG from 2020-05-04 00:00:00+00:00
Masking BG from 2020-05-21 00:00:00+00:00
Masking HR from 2020-04-30 00:00:00+00:00
Masking HR from 2020-05-14 00:00:00+00:00
Masking HR from 2020-05-29 00:00:00+00:00
Masking CZ from 2020-04-27 00:00:00+00:00
Masking CZ from 2020-05-14 00:00:00+00:00
Masking CZ from 2020-05-28 00:00:00+00:00
Masking DK from 2020-04-23 00:00:00+00:00
Masking DK from 2020-05-14 00:00:00+00:00
Masking FI from 2020-05-17 00:00:00+00:00
Masking FR from 2020-05-14 00:00:00+00:00
Masking DE from 2020-04-23 00:00:00+00:00
Masking DE from 2020-05-07 00:00:00+00:00
Masking DE from 2020-05-09 00:00:00+00:00
Masking GR from 2020-05-07 00:00:00+00:00
Masking GR from 2020-05-14 00:00:00+00:00
Masking HU from 2020-05-07 00:00:00+00:00
Masking HU from 2020-05-21 00:00:00+00:00
Masking IL from 2020-04-29 00:00:00+00:00
Masking IL from 2020-05-07 00:00:00+00:00
Masking IL from 2020-05-08 00:00:00+00:00
Masking IL from 2020-05-22 00:00:00+00:00
Masking IT from 2020-05-21 00:00:00+00:00
Masking MY from 2020-05-07 00:00:00+00:00
Masking MY from 2020-05-13 00:00:00+00:00
Masking NL from 2020-05-14 00:00:00+00:00
Masking NO from 2020-04-30 00:00:00+00:00
Masking NO from 2020-05-10 00:00:00+00:00
Masking NO from 2020-05-14 00:00:00+00:00
Masking PL from 2020-04-23 00:00:00+00:00
Masking PL from 2020-05-07 00:00:00+00:00
Masking PT from 2020-05-07 00:00:00+00:00
Masking RO from 2020-05-18 00:00:00+00:00
Masking RS from 2020-04-30 00:00:00+00:00
Masking RS from 2020-05-09 00:00:00+00:00
Masking RS from 2020-05-10 00:00:00+00:00
Masking SI from 2020-04-23 00:00:00+00:00
Masking SI from 2020-05-03 00:00:00+00:00
Masking SI from 2020-05-21 00:00:00+00:00
Masking ES from 2020-05-14 00:00:00+00:00
Masking ES from 2020-05-28 00:00:00+00:00
Masking CH from 2020-04-30 00:00:00+00:00
Masking CH from 2020-05-14 00:00:00+00:00
Masking GB from 2020-05-16 00:00:00+00:00

Mask the final 20 days for all regions:

[114]:
data.mask_region_ends(20)

Mask all but the first 14 days for region ‘AT’

[115]:
data.mask_region('AT',days=14)
[115]:
(61, 75)

Unmask data

[116]:
data.unmask_all()
[ ]: