Data Preprocessing

Data Merger

data_merger.py

This file contains the merge_data function, which takes data from different sources and parses it into our own format. If you are a user of EpidemicForecasting.org, you will not need to use this function.

Data Preprocessor

data_preprocessor.py

Preprocesses appropriately formatted CSV data into a PreprocessedData object.

epimodel.preprocessing.data_preprocessor.preprocess_data(data_path, last_day=None, schools_unis='two_separate', drop_features=None, min_confirmed=100, min_deaths=10, smoothing=1, mask_zero_deaths=False, mask_zero_cases=False)

Preprocess a .csv data file in our post-merge format, with configurable options.

Parameters
  • data_path – Path of .csv file to process.

  • last_day – Last day of the analysis window, given as a str, e.g. ‘2020-05-30’. If None (default), use the last day in the .csv file.

  • schools_unis

    How to process the schools and universities NPIs. Options are:
    - two_xor. Two features: one XOR feature (exactly one of the two active) and one AND feature (both active).
    - two_separate. Two features: one schools feature and one universities feature.
    - one_tiered. One tiered feature: 0 if neither is active, 0.5 if either is active, 1 if both are active.
    - one_and. One feature: 1 if both are active, 0 otherwise.

  • drop_features – list of strs, names of NPI features to drop. Defaults to all NPIs not collected by the EpidemicForecasting.org team.

  • min_confirmed – confirmed cases threshold, below which new (daily) cases are ignored.

  • min_deaths – deaths threshold, below which new (daily) deaths are ignored.

  • smoothing – number of days over which to smooth. This should be an odd number. If 1, no smoothing occurs.

  • mask_zero_deaths – bool, whether to ignore (i.e., mask) days with zero deaths.

  • mask_zero_cases – bool, whether to ignore (i.e., mask) days with zero cases.

Returns

PreprocessedData object.
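
The four schools_unis options can be illustrated on two binary indicator arrays. This is only a sketch under stated assumptions, not the library's implementation: encode_schools_unis is a hypothetical helper, and the inputs are assumed to be per-day 0/1 indicators for a single region.

```python
import numpy as np


def encode_schools_unis(schools, unis, mode="two_separate"):
    """Hypothetical helper (not part of epimodel): combine two binary
    NPI indicators according to the documented schools_unis options.

    Returns an array of shape (n_features, n_days).
    """
    schools = np.asarray(schools, dtype=bool)
    unis = np.asarray(unis, dtype=bool)
    both = schools & unis

    if mode == "two_separate":
        features = [schools, unis]
    elif mode == "two_xor":
        # One XOR feature (exactly one active), one AND feature (both active).
        features = [schools ^ unis, both]
    elif mode == "one_tiered":
        # 0 if neither active, 0.5 if either active, 1 if both active.
        features = [0.5 * schools + 0.5 * unis]
    elif mode == "one_and":
        features = [both]
    else:
        raise ValueError(f"unknown schools_unis option: {mode!r}")

    return np.stack([np.asarray(f, dtype=float) for f in features])


# Toy indicators over four days (made-up values for illustration).
schools = [0, 1, 1, 0]
unis = [0, 0, 1, 1]
tiered = encode_schools_unis(schools, unis, mode="one_tiered")
xor_and = encode_schools_unis(schools, unis, mode="two_xor")
```

With these toy inputs, the tiered feature is 0.5 on the days where only one of the two NPIs is active and 1 on the day where both are.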

Preprocessed Data

preprocessed_data.py

PreprocessedData Class definition.

class epimodel.preprocessing.preprocessed_data.PreprocessedData(Active, Confirmed, ActiveCMs, CMs, Rs, Ds, Deaths, NewDeaths, NewCases, RNames)

Bases: object

PreprocessedData Class

Class to hold data that is subsequently passed on to a PyMC3 model. Mostly a data wrapper, with some utility functions.

conditional_activation_plot(cm_plot_style, newfig=True, skip_yticks=False)

Draw conditional-activation plot.

Parameters
  • cm_plot_style – Countermeasure plot style array.

  • newfig – boolean, whether to create the plot in a new figure.

  • skip_yticks – boolean, whether to skip drawing yticks.

cumulative_days_plot(cm_plot_style, newfig=True, skip_yticks=False)

Draw cumulative days plot.

Parameters
  • cm_plot_style – Countermeasure plot style array.

  • newfig – boolean, whether to create the plot in a new figure.

  • skip_yticks – boolean, whether to skip drawing yticks.

mask_region(region, days=14)

Mask all but the first days days (14 by default) of cases and deaths for a specific region.

Parameters
  • region – region code (2 digit EpidemicForecasting.org code) of the region to mask.

  • days – number of days of data to provide to the model.
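
The idea of masking everything after the first few days for one region can be sketched with NumPy masked arrays. This is a simplified illustration, not the method's actual code: the function below is a standalone stand-in, the arrays are assumed to be laid out as (regions, days), the region codes are made up, and days are counted from the start of the window, which may differ from how the library counts them.

```python
import numpy as np

# Toy data: 2 regions x 20 days (hypothetical codes and values).
region_codes = ["AA", "BB"]
new_cases = np.ma.array(np.arange(40, dtype=float).reshape(2, 20))
new_deaths = np.ma.array(np.ones((2, 20)))


def mask_region_sketch(new_cases, new_deaths, region_codes, region, days=14):
    """Mask, in place, all but the first `days` columns for one region."""
    i = region_codes.index(region)
    new_cases[i, days:] = np.ma.masked
    new_deaths[i, days:] = np.ma.masked


mask_region_sketch(new_cases, new_deaths, region_codes, "BB", days=14)

# Number of unmasked days left per region.
cases_visible = new_cases.count(axis=1)
deaths_visible = new_deaths.count(axis=1)
```

Only the chosen region loses data; the other region keeps all of its days.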

mask_region_ends(n_days=20)

Mask the final n_days days across all countries.

Parameters

n_days – number of days to mask.
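
Masking the end of the window for every region at once is a single slice assignment on a masked array. The sketch below assumes a (regions, days) layout and uses a standalone function and toy data rather than the class itself.

```python
import numpy as np


def mask_final_days(arr, n_days=20):
    """Mask, in place, the last `n_days` columns of a (regions, days) masked array."""
    if n_days > 0:  # guard: arr[:, -0:] would select every column
        arr[:, -n_days:] = np.ma.masked


# Toy data: 2 regions x 30 days.
new_cases = np.ma.array(np.ones((2, 30)))
mask_final_days(new_cases, n_days=20)

# Unmasked days remaining per region.
visible = new_cases.count(axis=1)
```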

mask_reopenings(d_min=90, n_extra=0, print_out=True)

Mask reopenings.

This finds the dates on which NPIs reactivate, then masks forwards from those dates, giving 3 days for cases and 12 days for deaths.

Parameters
  • d_min – day after which to mask reopening.

  • n_extra – int, number of extra days to mask

reduce_regions_from_index(reduced_regions_indx)

Reduce the data to only the regions at the given indices. Occurs in place.

e.g., if reduced_regions_indx = [0], the resulting data object will contain data about only the first region.

Parameters

reduced_regions_indx – region indices to retain.
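
Selecting regions by index amounts to indexing every per-region array along its region axis. The following sketch assumes the region axis comes first and uses made-up arrays and codes; the real object holds more fields than shown here.

```python
import numpy as np

# Toy data (hypothetical): 3 regions, 2 NPIs, 4 days.
region_codes = ["AA", "BB", "CC"]
new_cases = np.ma.array(np.arange(12, dtype=float).reshape(3, 4))
active_cms = np.zeros((3, 2, 4))  # assumed layout: regions x NPIs x days

reduced_regions_indx = [0]

# Apply the same selection to every per-region container.
region_codes = [region_codes[i] for i in reduced_regions_indx]
new_cases = new_cases[reduced_regions_indx]
active_cms = active_cms[reduced_regions_indx]
```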

remove_regions_from_codes(regions_to_remove)

Remove the regions whose codes are listed in regions_to_remove. Occurs in place.

Parameters

regions_to_remove – Region codes, corresponding to regions to remove.

remove_regions_min_deaths(min_num_deaths=100)

Remove regions that have fewer than min_num_deaths total deaths at the end of the considered time period. Occurs in place.

Parameters

min_num_deaths – Minimum number of (total) deaths.
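
The filtering rule can be sketched as a comparison on the last column of a cumulative deaths array. This is an illustration only: it assumes cumulative deaths are stored as (regions, days), uses made-up numbers and region codes, and returns new objects rather than modifying anything in place.

```python
import numpy as np

# Toy cumulative deaths (hypothetical): 3 regions x 3 days.
region_codes = ["AA", "BB", "CC"]
deaths = np.array([
    [0, 40, 150],
    [0, 5, 30],
    [10, 90, 100],
])


def regions_meeting_threshold(deaths, min_num_deaths=100):
    """Return indices of regions whose final cumulative total is at least the threshold."""
    final_totals = deaths[:, -1]
    return np.flatnonzero(final_totals >= min_num_deaths)


keep = regions_meeting_threshold(deaths, min_num_deaths=100)
kept_codes = [region_codes[i] for i in keep]
kept_deaths = deaths[keep]
```

A region with exactly min_num_deaths deaths is retained, since only regions with fewer deaths are removed.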

summary_plot(cm_plot_style)

Draw summary plot.

This includes both the cumulative days plot and the conditional-activation plot.

Parameters

cm_plot_style – Countermeasure plot style array.

unmask_all()

Unmask all cases and deaths.