Data Preprocessing
Data Merger
data_merger.py
This file contains the merge_data function, which takes data from different sources and parses it into our own format. If you are a user of EpidemicForecasting.org, you will not need to use this function.
Data Preprocessor
data_preprocessor.py
Preprocesses appropriately formatted .csv data into a PreprocessedData object.
epimodel.preprocessing.data_preprocessor.preprocess_data(data_path, last_day=None, schools_unis='two_separate', drop_features=None, min_confirmed=100, min_deaths=10, smoothing=1, mask_zero_deaths=False, mask_zero_cases=False)
Preprocess a data .csv file, in our post-merge format, with different options.
- Parameters
data_path – Path of .csv file to process.
last_day – Last day of the analysis window, given as a str, e.g. ‘2020-05-30’. If None (default), go to the last day in the .csv file.
schools_unis – how to process schools and unis. Options are:
- two_xor. One xor feature, one and feature.
- two_separate. One schools feature, one university feature.
- one_tiered. One tiered feature. 0 if none active, 0.5 if either active, 1 if both active.
- one_and. One feature, 1 if both active.
drop_features – list of strs, names of NPI features to drop. Defaults to all NPIs not collected by the EpidemicForecasting.org team.
min_confirmed – confirmed cases threshold, below which new (daily) cases are ignored.
min_deaths – deaths threshold, below which new (daily) deaths are ignored.
smoothing – number of days over which to smooth. This should be an odd number. If 1, no smoothing occurs.
mask_zero_deaths – bool, whether to ignore (i.e., mask) days with zero deaths.
mask_zero_cases – bool, whether to ignore (i.e., mask) days with zero cases.
- Returns
PreprocessedData object.
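The four schools_unis options can be illustrated on two binary activation series. This is a minimal sketch, not the library's implementation: the helper name encode_schools_unis is hypothetical, and the ordering of the two features returned for two_xor (xor first, then and) is an assumption.

```python
def encode_schools_unis(schools, unis, mode="two_separate"):
    """Build feature rows from two binary (0/1) activation series.

    Hypothetical helper mirroring the documented schools_unis options.
    """
    pairs = list(zip(schools, unis))
    if mode == "two_separate":
        # One schools feature, one university feature.
        return [list(schools), list(unis)]
    if mode == "two_xor":
        # One feature for "exactly one active", one for "both active".
        return [[s ^ u for s, u in pairs], [s & u for s, u in pairs]]
    if mode == "one_tiered":
        # 0 if none active, 0.5 if one active, 1 if both active.
        return [[(s + u) / 2 for s, u in pairs]]
    if mode == "one_and":
        # Single feature, 1 only when both are active.
        return [[s & u for s, u in pairs]]
    raise ValueError(f"unknown schools_unis option: {mode!r}")


schools = [0, 1, 1, 0]
unis = [0, 0, 1, 1]
for mode in ("two_xor", "two_separate", "one_tiered", "one_and"):
    print(mode, encode_schools_unis(schools, unis, mode))
```

Note that one_tiered and one_and each collapse the two NPIs into a single feature, while the other two options keep two features.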
Preprocessed Data
preprocessed_data.py
PreprocessedData Class definition.
class epimodel.preprocessing.preprocessed_data.PreprocessedData(Active, Confirmed, ActiveCMs, CMs, Rs, Ds, Deaths, NewDeaths, NewCases, RNames)
Bases: object
PreprocessedData Class
Class to hold data which is subsequently passed on to a PyMC3 model. Mostly a data wrapper, with some utility functions.
conditional_activation_plot(cm_plot_style, newfig=True, skip_yticks=False)
Draw conditional-activation plot.
- Parameters
cm_plot_style – Countermeasure plot style array.
newfig – boolean, whether to create the plot in a new figure.
skip_yticks – boolean, whether to skip drawing yticks.
cumulative_days_plot(cm_plot_style, newfig=True, skip_yticks=False)
Draw cumulative days plot.
- Parameters
cm_plot_style – Countermeasure plot style array.
newfig – boolean, whether to create the plot in a new figure.
skip_yticks – boolean, whether to skip drawing yticks.
mask_region(region, days=14)
Mask all but the first few days of cases and deaths for a specific region. The days parameter sets how many days are kept (14 by default).
- Parameters
region – region code (2 digit EpidemicForecasting.org code) of the region to mask.
days – Number of days to provide to the model
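The effect of this kind of masking can be sketched on a single series. This is an illustration only: the function name is hypothetical, and None stands in for a masked entry, whereas the actual class may store masks differently.

```python
def mask_after_first_days(series, days=14):
    """Keep the first `days` observations and mask (None) everything after."""
    return [x if i < days else None for i, x in enumerate(series)]


new_cases = [5, 8, 13, 21, 34]
print(mask_after_first_days(new_cases, days=3))  # [5, 8, 13, None, None]
```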
mask_region_ends(n_days=20)
Mask the final n_days days across all countries.
- Parameters
n_days – number of days to mask.
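Masking the tail of every region's series can be sketched as follows. The function name and the list-of-lists layout (one inner list per region) are assumptions made for illustration, and None again stands in for a masked entry.

```python
def mask_final_days(data, n_days=20):
    """Mask (None) the last n_days entries of every region's series."""
    masked = []
    for series in data:
        # Series shorter than n_days end up fully masked.
        cutoff = max(len(series) - n_days, 0)
        masked.append([x if i < cutoff else None for i, x in enumerate(series)])
    return masked


deaths = [[1, 2, 3, 4], [0, 0, 1, 1]]
print(mask_final_days(deaths, n_days=2))
# [[1, 2, None, None], [0, 0, None, None]]
```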
mask_reopenings(d_min=90, n_extra=0, print_out=True)
Mask reopenings.
This finds the dates on which NPIs reactivate, then masks forwards from them, giving 3 days for cases and 12 days for deaths.
- Parameters
d_min – day after which to mask reopening.
n_extra – int, number of extra days to mask
reduce_regions_from_index(reduced_regions_indx)
Reduce data to only pertain to region indices given. Occurs in place.
e.g., if reduced_regions_indx = [0], the resulting data object will contain data about only the first region.
- Parameters
reduced_regions_indx – region indices to retain.
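The in-place reduction described above can be pictured with a small stand-in for the data object. This is a sketch under stated assumptions: the dict of region-indexed lists and the helper name reduce_regions are hypothetical, and the region codes are made-up sample values.

```python
def reduce_regions(fields, keep_indices):
    """Restrict every region-indexed list in `fields` to keep_indices, in place."""
    for name, values in fields.items():
        # Reassigning existing keys is safe while iterating over the dict.
        fields[name] = [values[i] for i in keep_indices]


# Hypothetical stand-in: each value has one entry per region.
fields = {
    "Rs": ["AT", "BE", "CZ"],
    "Deaths": [[1, 2], [3, 4], [5, 6]],
}
reduce_regions(fields, [0])
print(fields)  # {'Rs': ['AT'], 'Deaths': [[1, 2]]}
```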
remove_regions_from_codes(regions_to_remove)
Remove region codes corresponding to regions in regions_to_remove. Occurs in place.
- Parameters
regions_to_remove – Region codes, corresponding to regions to remove.
remove_regions_min_deaths(min_num_deaths=100)
Remove regions which have fewer than min_num_deaths at the end of the considered time period. Occurs in place.
- Parameters
min_num_deaths – Minimum number of (total) deaths.
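The selection rule can be sketched as a filter on the final cumulative death count. The function name is hypothetical, the region codes are made-up sample values, and the sketch assumes the last entry of each series is the count at the end of the considered period (it does not handle masked values).

```python
def regions_below_min_deaths(region_codes, cumulative_deaths, min_num_deaths=100):
    """Return the codes of regions whose final cumulative deaths fall below the threshold."""
    return [
        code
        for code, series in zip(region_codes, cumulative_deaths)
        if series[-1] < min_num_deaths
    ]


codes = ["AT", "BE", "CZ"]
cumulative = [[10, 60, 150], [5, 20, 99], [0, 40, 100]]
print(regions_below_min_deaths(codes, cumulative))  # ['BE']
```

The returned codes are the ones that would then be dropped, for example by passing them to remove_regions_from_codes.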
summary_plot(cm_plot_style)
Draw summary plot.
This includes both the cumulative days plot, and the conditional activation plot.
- Parameters
cm_plot_style – Countermeasure plot style array.
unmask_all()
Unmask all cases, deaths.