cross_validation¶

The `CROSS_VALIDATION` class¶

class CROSS_VALIDATION(ADSORBATE='CO', INCLUDED_BINDING_TYPES=[1, 2, 3, 4], cv_indices_path=None, cross_validation_path=None, nanoparticle_path=None, high_coverage_path=None, coverage_scaling_path=None, VERBOSE=False, SAVE_ALL_NN=True)[source]¶

Class for running cross validation. Splits primary data into balanced folds and generates sets of secondary data.

Parameters

ADSORBATE (str) – Adsorbate for which the spectra are to be generated.
INCLUDED_BINDING_TYPES (list) – Binding types whose frequencies/intensities from the primary data set will be included in generating the complex spectra.
cv_indices_path (str) – Folder path where indices for cross validation are saved or where they are to be created if they have not yet been created.
cross_validation_path (str) – Folder path where cross validation results are to be created.
nanoparticle_path (str) – File path where nanoparticle or single adsorbate json data is saved.
high_coverage_path (str) – File path where high coverage data for CO is saved.
coverage_scaling_path (str) – File path where coverage scaling coefficients are saved.
VERBOSE (bool) – Controls the printing of status statements.
SAVE_ALL_NN (bool) – Controls whether a NN from each cross validation is saved or just from the test set.

Variables

CV_INDICES_PATH (str) – Folder path for cross validation indices.
CV_PATH (str) – Folder path where cross validation results are to be created.
NANO_PATH (str) – File path where nanoparticle or single adsorbate json data is saved.
HIGH_COV_PATH (str) – File path where high coverage data for CO is saved.
COV_SCALE_PATH (str) – File path where coverage scaling coefficients are saved.
VERBOSE (bool) – Controls the printing of status statements.
SAVE_ALL_NN (bool) – Controls whether a NN from each cross validation is saved or just from the test set.
ADSORBATE (str) – Adsorbate for which the spectra are to be generated.
INCLUDED_BINDING_TYPES (list) – Binding types whose frequencies/intensities from the primary data set will be included in generating the complex spectra.

_get_default_cross_validation_path()[source]¶

Get cross validation path if not set by user.

Returns: cross_validation_path – Folder path where cross validation results are to be created.
Return type: str

_get_default_cv_indices_path()[source]¶

Get cross validation indices path if not set by user.

Returns: cv_indices_path – Folder path where cross validation indices are to be created.
Return type: str

_get_state()[source]¶

Get important state variables of the class

Returns: Dict – State variables necessary to recreate the cross validation run.
Return type: dict

_run_NN(NUM_SAMPLES, INDICES, X_compare, y_compare, IS_TEST)[source]¶

Run the neural network and generate validation statistics.

Parameters

NUM_SAMPLES (int) – Number of complex spectra to generate
INDICES (list of int) – The indices of the primary data that will be selected.
X_compare (numpy.ndarray) – Complex spectra on which to test the model.
y_compare (numpy.ndarray) – The target variable histograms of the test or validation data.
IS_TEST (bool) – Indicates whether comparison set is the test set.

Variables

NN (neural_network.MLPRegressor) – Trained neural network. Only instantiated if IS_TEST == True

Returns

Dict – Dictionary of validation or test results and statistics

Return type

dict

_set_ir_gen_class()[source]¶

Instantiates the class for generating complex spectra with parameters set by set_model_parameters().

Variables

MAINconv (IR_GEN) – An IR_GEN class that will generate spectra and histograms of binding types or GCN groups.
OTHER_SITESconv (IR_GEN) – An IR_GEN class that will generate spectra for binding types not considered by MAINconv. Only instantiated if TARGET == ‘GCN’ and GCN_ALL == False

generate_test_cv_indices(CV_SPLITS=3, BINDING_TYPE_FOR_GCN=[1], test_fraction=0.25, random_state=0, read_file=False, write_file=False)[source]¶

Function to generate the test and cross validation indices that will be used to split the primary data.

Parameters

CV_SPLITS (int) – The number of cross validation splits
BINDING_TYPE_FOR_GCN (list or 'ALL') – The binding types to consider when tabulating the GCN values. Primary data with other binding types is considered noise. If ‘ALL’ is selected all binding types are used and no noise is added from other binding types as there are no other binding types.
test_fraction (float) – The fraction of primary data to keep for testing.
random_state (int) – Can be set to None. Allows reproducibility.
write_file (bool) – Indicates whether the indices of the selected primary data for cross validation are written to a json file.
read_file (bool) – Indicates whether indices for selecting primary data during cross validation and testing are read from a file or generated directly by this function.

Variables

CV_SPLITS (int) – The number of cross validation splits
BINDING_TYPE_FOR_GCN (list or 'ALL') – The binding types to consider when tabulating the GCN values. Primary data with other binding types is considered noise. If ‘ALL’ is selected all binding types are used and no noise is added from other binding types as there are no other binding types.
INDICES_DICTIONARY (dict) – Dictionary of indices used in cross validation for both the binding type model and the GCN model.
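
The kind of split described above can be pictured with a small standard-library sketch. It is not the package's implementation: the helper name `split_test_and_cv`, the `labels` argument, and the returned keys `"TEST"` and `"CV"` are hypothetical, and the balancing rule (shuffle within each label, hold out `test_fraction`, deal the remainder round-robin into folds) is only an assumption about what "balanced" means here.

```python
import random
from collections import defaultdict


def split_test_and_cv(labels, cv_splits=3, test_fraction=0.25, random_state=0):
    """Hypothetical helper: hold out a test set, then deal the rest into folds.

    Indices are grouped by label so that every fold sees each label in
    roughly equal proportion.
    """
    rng = random.Random(random_state)  # random_state=None gives a non-reproducible seed
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    test_indices = []
    folds = [[] for _ in range(cv_splits)]
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = int(round(len(group) * test_fraction))
        test_indices.extend(group[:n_test])
        # Deal the remaining indices round-robin across the folds.
        for position, index in enumerate(group[n_test:]):
            folds[position % cv_splits].append(index)

    return {"TEST": sorted(test_indices), "CV": [sorted(fold) for fold in folds]}


labels = [1] * 8 + [2] * 8  # e.g. two binding types, eight primary data points each
indices = split_test_and_cv(labels)
```

Fixing `random_state` makes repeated calls return the same indices, which is the reproducibility the parameter description refers to.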

get_secondary_data(NUM_SAMPLES, INDICES, iterations=1, IS_TRAINING_SET=False)[source]¶

Get secondary data (complex spectra)

Parameters

NUM_SAMPLES (int) – Number of complex spectra to generate
INDICES (list of int) – The indices of the primary data that will be selected.
iterations (int) – Number of times secondary data will be generated and strung together. Allows for a more diverse set of complex spectra for testing.
IS_TRAINING_SET (bool) – Indicates if the secondary set will be used for training or testing

Returns

X (numpy.ndarray) – Coverage shifted frequencies and intensities
Y (numpy.ndarray) – The target variable histograms, either binding-type or GCN labels.

get_test_results()[source]¶

Returns a dictionary of results from a trained neural network evaluated on the test set.

Variables

X_Test (numpy.ndarray) – Complex spectra on which to test the model.
Y_Test (numpy.ndarray) – The target variable histograms of the test data.

Returns

Dict – Dictionary of test results and statistics

Return type

dict

get_test_secondary_data()[source]¶

Get one batch of secondary data (complex spectra) for the test indices.

Returns

X_Test (numpy.ndarray) – Complex spectra on which to test the model.
Y_Test (numpy.ndarray) – The target variable histograms of the test data.

run_CV(write_file=False, CV_RESULTS_FILE=None)[source]¶

Run cross validation on one cross validation set at a time.

Parameters

write_file (bool) – Indicate whether cross validation results will be written out to a file.
CV_RESULTS_FILE (str or None) – File where cross validation results will be written as a json.

Variables

CV_RESULTS_FILE (str or None) – File where cross validation results will be written as a json.

Returns

DictList – List of dictionaries of cross validation and test results

Return type

list of dict

run_CV_multiprocess(write_file=False, CV_RESULTS_FILE=None, num_procs=None)[source]¶

Run multiple cross validation sets and the test set simultaneously.

Parameters

write_file (bool) – Indicate whether cross validation results will be written out to a file.
CV_RESULTS_FILE (str or None) – File where cross validation results will be written as a json.
num_procs (int) – Number of processes to run at any given time.

Variables

CV_RESULTS_FILE (str or None) – File where cross validation results will be written as a json.

Returns

DictList – List of dictionaries of cross validation and test results

Return type

list of dict

run_single_CV(CV_INDEX_or_TEST)[source]¶

Run a single cross validation process or the test set.

Parameters: CV_INDEX_or_TEST (str or int) – Indicates which cross validation set or if the test set is to be run.
Returns: Dict – Dictionary of a single cross validation or test result
Return type: dict

set_model_parameters(TARGET, COVERAGE, MAX_COVERAGES, NN_PROPERTIES, NUM_TRAIN, NUM_VAL, NUM_TEST, MIN_GCN_PER_LABEL=0, NUM_GCN_LABELS=11, GCN_ALL=False, TRAINING_ERROR=None, LOW_FREQUENCY=200, HIGH_FREQUENCY=2200, ENERGY_POINTS=501)[source]¶

Sets model parameters both for generating the secondary data (complex spectra) and for running the neural network.

Parameters

TARGET (str) – Geometric descriptor for which the target histogram is to be generated. Can be binding_type, GCN, or combine_hollow_sites. If it is combine_hollow_sites then 3-fold and 4-fold sites are grouped together.
COVERAGE (str or float) – The coverage at which the synthetic spectra are generated. If ‘high’, spectra at various coverages are generated.
MAX_COVERAGES (list) – Maximum coverages allowed for each binding-type if COVERAGE is set to ‘high’
NN_PROPERTIES (dict) – Dictionary of properties for the neural network.
NUM_TRAIN (int) – Number of secondary data (complex spectra) in a single training set.
NUM_VAL (int) – Number of secondary data (complex spectra) in each validation set.
NUM_TEST (int) – Number of secondary data (complex spectra) in the test set.
MIN_GCN_PER_LABEL (int) – Minimum number of primary data points in each GCN group
NUM_GCN_LABELS (int) – The target number of GCN labels into which to group the primary data points (simple spectra).
GCN_ALL (bool) – Whether or not to include all binding types in tabulating GCN values. If False, then only the selected binding types determined during the creation of the cross validation indices will be used.
TRAINING_ERROR (str, float, or None) – Indicates the type of error induced in the training set. Can be ‘gaussian’, a float, or None. If ‘gaussian’, then Gaussian error is added to the scaling factor. If a float, then uniform error is added after scaling of the primary DFT data.
LOW_FREQUENCY (float) – The lowest frequency for which synthetic spectra are generated
HIGH_FREQUENCY (float) – The highest frequency for which synthetic spectra are generated
ENERGY_POINTS (int) – The number of points into which the synthetic spectra are discretized

Variables

TARGET (str) – Geometric descriptor for which the target histogram is to be generated. Can be binding_type, GCN, or combine_hollow_sites. If it is combine_hollow_sites then 3-fold and 4-fold sites are grouped together.
COVERAGE (str or float) – The coverage at which the synthetic spectra are generated. If ‘high’, spectra at various coverages are generated.
MAX_COVERAGES (list) – Maximum coverages allowed for each binding-type if COVERAGE is set to ‘high’
NN_PROPERTIES (dict) – Dictionary of properties for the neural network.
NUM_TRAIN (int) – Number of secondary data (complex spectra) in a single training set.
NUM_VAL (int) – Number of secondary data (complex spectra) in each validation set.
NUM_TEST (int) – Number of secondary data (complex spectra) in the test set.
MIN_GCN_PER_LABEL (int) – Minimum number of primary data points in each GCN group
NUM_GCN_LABELS (int) – The target number of GCN labels into which to group the primary data points (simple spectra).
GCN_ALL (bool) – Whether or not to include all binding types in tabulating GCN values. If False, then only the selected binding types determined during the creation of the cross validation indices will be used.
TRAINING_ERROR (str, float, or None) – Indicates the type of error induced in the training set. Can be ‘gaussian’, a float, or None. If ‘gaussian’, then Gaussian error is added to the scaling factor. If a float, then uniform error is added after scaling of the primary DFT data.
LOW_FREQUENCY (float) – The lowest frequency for which synthetic spectra are generated
HIGH_FREQUENCY (float) – The highest frequency for which synthetic spectra are generated
ENERGY_POINTS (int) – The number of points into which the synthetic spectra are discretized
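
To make the role of LOW_FREQUENCY, HIGH_FREQUENCY, and ENERGY_POINTS concrete, here is a minimal sketch of the grid they imply. It assumes the points are evenly spaced and include both endpoints, which the documentation does not state; `frequency_grid` is a hypothetical helper, not part of the package.

```python
def frequency_grid(low_frequency=200.0, high_frequency=2200.0, energy_points=501):
    """Hypothetical helper: evenly spaced grid including both endpoints (an assumption)."""
    step = (high_frequency - low_frequency) / (energy_points - 1)
    return [low_frequency + i * step for i in range(energy_points)]


grid = frequency_grid()
spacing = grid[1] - grid[0]  # distance between neighbouring grid points
```

Under that assumption the default values give a spacing of 4 frequency units between neighbouring points.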

set_nn_parameters(NN_PROPERTIES)[source]¶

Set properties of the neural network so they can be changed after all other model properties have been set

Variables: NN_PROPERTIES (dict) – Dictionary of properties for the neural network.

The `LOAD_CROSS_VALIDATION` class¶

class LOAD_CROSS_VALIDATION(cv_indices_path=None, cross_validation_path=None)[source]¶

Bases: jl_spectra_2_structure.cross_validation.CROSS_VALIDATION

Child class for loading cross validation results.

Parameters

cv_indices_path (str) – Folder path where indices for cross validation are saved.
cross_validation_path (str) – Folder path where cross validation results are stored.

Variables

CV_FILES (list of str) – List of cross validation files.
CV_INDICES_PATH_OLD (str) – Folder path where indices for cross validation are saved.
CV_PATH_OLD (str) – Folder path where cross validation results are stored.

get_NN(dictionary)[source]¶

Loads a neural network from a dictionary containing properties such as number of nodes, activation functions, their coefficients and their intercepts.

Parameters: dictionary (dict) – Dictionary containing neural network results and properties
Returns: NN – Trained neural network.
Return type: neural_network.MLPRegressor

get_NN_ensemble(indices, use_all_cv_NN=False)[source]¶

Loads an ensemble of neural networks from stored cross validation results containing properties such as number of nodes, activation functions, their coefficients and their intercepts.

Parameters

indices (list) – Indicates which cross validation results to use in generation of a NN ensemble.
use_all_cv_NN (bool) – Indicates whether to use NN from all cross validation runs in addition to the one trained on all the cross validation data.

Returns

NN_ensemble – Trained neural network ensemble.

Return type

neural_network.MLPRegressor ensemble

get_best_models(models_per_category, standard_deviations)[source]¶

Identify the best neural networks for each category (CO, GCN, etc.)

Parameters

models_per_category (int) – Number of models to include for each category
standard_deviations (float) – Number of standard deviations to add to the mean cross validation loss in order to select the best model. Optimizes both bias and variance

Variables

BEST_MODELS (dict) – Dictionary of just the best model results and their corresponding model index value

Returns

BEST_MODELS_dict – Dictionary of just the best model results and their corresponding model index value

Return type

dict
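
The selection rule described for standard_deviations can be illustrated with a short sketch. This is an assumption-laden illustration rather than the package's code: `rank_models` and the layout of `cv_losses` are hypothetical, and the population standard deviation is used because the documentation does not say which estimator applies.

```python
from statistics import mean, pstdev


def rank_models(cv_losses, models_per_category=1, standard_deviations=1.0):
    """Hypothetical helper: score each model by mean CV loss plus k standard deviations."""
    scores = {
        model_index: mean(losses) + standard_deviations * pstdev(losses)
        for model_index, losses in cv_losses.items()
    }
    ranked = sorted(scores, key=scores.get)  # lowest score first
    return {index: scores[index] for index in ranked[:models_per_category]}


cv_losses = {
    0: [0.10, 0.10, 0.10],  # steady across folds
    1: [0.02, 0.08, 0.17],  # lower mean loss, but erratic across folds
    2: [0.20, 0.20, 0.20],
}
best = rank_models(cv_losses, models_per_category=1, standard_deviations=1.0)
best_by_mean_only = rank_models(cv_losses, models_per_category=1, standard_deviations=0.0)
```

With one standard deviation added, the steady model 0 is preferred; with none, model 1 wins on mean loss alone. That trade-off between a low mean (bias) and a small spread (variance) is what the parameter controls.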

get_ensemble_cv()[source]¶

Get cross validation error for an ensemble of models

Variables: ENSEMBLE_MODELS (dict) – Dictionary of ensemble models
Returns: ENSEMBLE_MODELS_dict – Dictionary of ensemble models
Return type: dict

get_keys(dictionary)[source]¶

Get all keys in a dictionary of dictionaries for viewing

Parameters: dictionary (dict) – A dictionary of dictionaries.
Returns: dictionary_of_keys – A dictionary of keys.
Return type: dict
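
One plausible reading of "a dictionary of keys" is a copy of the nested structure with the leaf values dropped. The sketch below follows that reading; `collect_keys` is a hypothetical helper, and the choice of `None` as the leaf placeholder is an assumption, not something the documentation specifies.

```python
def collect_keys(dictionary):
    """Hypothetical helper: keep the nested key structure, drop the leaf values."""
    return {
        key: collect_keys(value) if isinstance(value, dict) else None
        for key, value in dictionary.items()
    }


results = {"CO": {"loss": [0.3, 0.1], "score": 0.9}, "GCN": {"loss": [0.4]}}
keys = collect_keys(results)
```

A view like this makes it easy to see what a large results dictionary contains without printing every stored array.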

load_CV_class(index, new_cv_indices_path=None, new_cross_validation_path=None)[source]¶

Load all stored data from a single cross validation run.

Parameters

index (int) – Indicates which cross validation result to load
new_cv_indices_path (str) – Folder path where indices for new cross validation are to be saved.
new_cross_validation_path (str) – Folder path where new cross validation results are to be saved.

Variables

NN (neural_network.MLPRegressor) – Trained neural network.
CV_DICT_LIST (list of dict) – List of dictionaries with cross validation results, statistics, and the neural network trained on all cross validation data.

load_all_CV_data()[source]¶

Load results from all cross validation runs in order to identify the best neural network for each category (CO, GCN, etc.)

Variables: CV_RESULTS (dict) – Dictionary of all cross validation results

plot_models(dictionary, figure_directory='show', model_list=None, xlim=[0, 200], ylim1=[0, 0.3], ylim2=[0, 0.3])[source]¶

Plot model learning curves for all models in a given dictionary.

Parameters

dictionary (dict) – A dictionary of dictionaries.
figure_directory (str) – Folder location to save the learning curves
model_list (None or list of int) – If None each learning curve is displayed independently. If list then the curves are compiled into one figure of two or four panels.
xlim (list) – The start and end number for the epochs to show
ylim1 (list) – The range of loss to show in the first figure if model_list is of length 2. If model_list is of length 4 it is the range on the first and third figures.
ylim2 (list) – The range of loss to show in the second figure if model_list is of length 2. If model_list is of length 4 it is the range on the second and fourth figures.

plot_parity_plots(figure_directory='show', model_list=None, use_ensemble=False)[source]¶

Plot parity plots for models whose index is in model_list

Parameters

figure_directory (str) – Folder location to save the parity plots
model_list (None or list of int) – Can be None only if load_CV_class(index) has already been run.
use_ensemble (bool) – Dictates whether an ensemble of CV models is used.

The `NESTED_DICT` class¶

class NESTED_DICT[source]¶

Implementation of Perl’s autovivification feature.

Returns: dict – A dictionary in which setting a key, value pair at any nesting level creates all necessary intermediate keys if they do not already exist.
Return type: NESTED_DICT
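
Autovivification is easiest to see in a few lines of Python. The class below is a common way to get the behaviour and is offered as a sketch only; `AutoVivifyingDict` is a stand-in name, and NESTED_DICT itself may be implemented differently.

```python
class AutoVivifyingDict(dict):
    """Stand-in for NESTED_DICT: missing keys spring into existence on access."""

    def __missing__(self, key):
        # dict.__getitem__ calls this hook when `key` is absent.
        value = self[key] = type(self)()
        return value


results = AutoVivifyingDict()
results["CO"]["binding_type"]["loss"] = 0.12  # no intermediate setup needed
```

One side effect worth knowing: with this approach, merely reading a missing key through `results[key]` also creates it, so membership checks should use `in` or `.get()`.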

The `NESTED_DICT_to_DICT` function¶

NESTED_DICT_to_DICT(nested_dict)[source]¶

Converts a nested dict to a traditional python dictionary.

Parameters: nested_dict (NESTED_DICT) – A dictionary of type NESTED_DICT
Returns: dictionary – A python dictionary
Return type: dict
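
A conversion of this kind is typically a short recursion. The sketch below uses a `defaultdict`-based tree as the autovivifying input purely for illustration; `to_plain_dict` and `make_tree` are hypothetical names and not the package's functions.

```python
from collections import defaultdict


def to_plain_dict(nested):
    """Hypothetical counterpart of NESTED_DICT_to_DICT: rebuild with built-in dicts."""
    return {
        key: to_plain_dict(value) if isinstance(value, dict) else value
        for key, value in nested.items()
    }


def make_tree():
    """A standard-library way to build an autovivifying dictionary."""
    return defaultdict(make_tree)


tree = make_tree()
tree["CO"]["loss"] = 0.12
plain = to_plain_dict(tree)
```

After conversion, a mistyped key raises `KeyError` instead of silently creating an empty branch, which is usually what is wanted once the results are complete.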

The `NN_ENSEMBLE` class¶

class NN_ENSEMBLE(NN_LIST)[source]¶

Class for generating an ensemble of neural networks.

Parameters: NN_LIST (list) – A list of type NN
Variables: NN_LIST (list) – A list of type NN

predict(X, create_predictions_list=False)[source]¶

Predict value(s) using an ensemble

Parameters

X (numpy.array) – A numpy 2D array of input variables
create_predictions_list (bool) – Indicates whether the individual predictions are stored in PREDICTIONS_LIST.

Variables

PREDICTIONS_LIST (list) – List of predictions if create_predictions_list=True
STD (numpy.array) – A numpy 2D array of standard deviation in the predictions.

Returns

PREDICTIONS – A numpy 2D array of average predictions

Return type

numpy.array
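
The averaging that an ensemble prediction involves can be sketched without any dependencies. In the sketch below plain nested lists stand in for the NumPy 2D arrays the class works with, `ConstantModel` is a toy replacement for a trained network, and the population standard deviation is an assumption; none of these names come from the package.

```python
from statistics import fmean, pstdev


class ConstantModel:
    """Toy stand-in for a trained network: predicts the same row for every sample."""

    def __init__(self, row):
        self.row = list(row)

    def predict(self, X):
        return [list(self.row) for _ in X]


def ensemble_predict(models, X):
    """Return the element-wise mean and standard deviation over member predictions."""
    predictions_list = [model.predict(X) for model in models]
    n_samples, n_outputs = len(X), len(predictions_list[0][0])
    mean = [[0.0] * n_outputs for _ in range(n_samples)]
    std = [[0.0] * n_outputs for _ in range(n_samples)]
    for i in range(n_samples):
        for j in range(n_outputs):
            values = [prediction[i][j] for prediction in predictions_list]
            mean[i][j] = fmean(values)
            std[i][j] = pstdev(values)
    return mean, std


models = [ConstantModel([0.2, 0.8]), ConstantModel([0.4, 0.6])]
X = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]  # two samples, three features
mean, std = ensemble_predict(models, X)
```

The spread between members gives a rough per-output uncertainty alongside the averaged prediction, which is the role STD plays in the variables listed above.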
