cross_validation

The CROSS_VALIDATION class

class CROSS_VALIDATION(ADSORBATE='CO', INCLUDED_BINDING_TYPES=[1, 2, 3, 4], cv_indices_path=None, cross_validation_path=None, nanoparticle_path=None, high_coverage_path=None, coverage_scaling_path=None, VERBOSE=False, SAVE_ALL_NN=True)

Class for running cross validation. Splits primary data into balanced folds and generates sets of secondary data.

Parameters
  • ADSORBATE (str) – Adsorbate for which the spectra are to be generated.

  • INCLUDED_BINDING_TYPES (list) – Binding types whose frequencies/intensities from the primary data set will be included in generating the complex spectra.

  • cv_indices_path (str) – Folder path where indices for cross validation are saved or where they are to be created if they have not yet been created.

  • cross_validation_path (str) – Folder path where cross validation results are to be created.

  • nanoparticle_path (str) – File path where nanoparticle or single adsorbate json data is saved.

  • high_coverage_path (str) – File path where high coverage data for CO is saved.

  • coverage_scaling_path (str) – File path where coverage scaling coefficients are saved.

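  • VERBOSE (bool) – Controls the printing of status statements.

  • SAVE_ALL_NN (bool) – Controls whether a NN from each cross validation is saved or just from the test set.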

Variables
  • CV_INDICES_PATH (str) – Folder path for cross validation indices.

  • CV_PATH (str) – Folder path where cross validation results are to be created.

  • NANO_PATH (str) – File path where nanoparticle or single adsorbate json data is saved.

  • HIGH_COV_PATH (str) – File path where high coverage data for CO is saved.

  • COV_SCALE_PATH (str) – File path where coverage scaling coefficients are saved.

  • VERBOSE (bool) – Controls the printing of status statements.

  • SAVE_ALL_NN (bool) – Controls whether a NN from each cross validation is saved or just from the test set.

  • ADSORBATE (str) – Adsorbate for which the spectra are to be generated.

  • INCLUDED_BINDING_TYPES (list) – Binding types whose frequencies/intensities from the primary data set will be included in generating the complex spectra.

_get_default_cross_validation_path()

Get cross validation path if not set by user.

Returns

cross_validation_path – Folder path where cross validation results are to be created.

Return type

str

_get_default_cv_indices_path()

Get cross validation indices path if not set by user.

Returns

cv_indices_path – Folder path where cross validation indices are to be created.

Return type

str

_get_state()

Get important state variables of the class.

Returns

Dict – State variables necessary to recreate the cross validation run.

Return type

dict

_run_NN(NUM_SAMPLES, INDICES, X_compare, y_compare, IS_TEST)

Run the neural network and generate validation statistics.

Parameters
  • NUM_SAMPLES (int) – Number of complex spectra to generate

  • INDICES (list of int) – The indices of the primary data that will be selected.

  • X_compare (numpy.ndarray) – Complex spectra on which to test the model.

  • y_compare (numpy.ndarray) – The target variable histograms of the test or validation data.

  • IS_TEST (bool) – Indicates whether comparison set is the test set.

Variables

NN (neural_network.MLPRegressor) – Trained neural network. Only instantiated if IS_TEST == True

Returns

Dict – Dictionary of validation or test results and statistics

Return type

dict

_set_ir_gen_class()

Instantiates the class for generating complex spectra with parameters set by set_model_parameters().

Variables
  • MAINconv (IR_GEN) – An IR_GEN class that will generate spectra and histograms of binding types or GCN groups.

  • OTHER_SITESconv (IR_GEN) – An IR_GEN class that will generate spectra for binding types not considered by MAINconv. Only instantiated if TARGET == ‘GCN’ and GCN_ALL == False

generate_test_cv_indices(CV_SPLITS=3, BINDING_TYPE_FOR_GCN=[1], test_fraction=0.25, random_state=0, read_file=False, write_file=False)

Function to generate the test and cross validation indices that will be used to split the primary data.

Parameters
  • CV_SPLITS (int) – The number of cross validation splits

  • BINDING_TYPE_FOR_GCN (list or 'ALL') – The binding types to consider when tabulating the GCN values. Primary data with other binding types is considered noise. If ‘ALL’ is selected all binding types are used and no noise is added from other binding types as there are no other binding types.

  • test_fraction (float) – The fraction of primary data to keep for testing.

  • random_state (int) – Can be set to None. Allows reproducibility.

  • write_file (bool) – Indicates whether the indices of the selected primary data for cross validation are written to a json file.

  • read_file (bool) – Indicates whether indices for selecting primary data during cross validation and testing are read from a file or generated directly by this function.

Variables
  • CV_SPLITS (int) – The number of cross validation splits

  • BINDING_TYPE_FOR_GCN (list or 'ALL') – The binding types to consider when tabulating the GCN values. Primary data with other binding types is considered noise. If ‘ALL’ is selected all binding types are used and no noise is added from other binding types as there are no other binding types.

  • INDICES_DICTIONARY (dict) – Dictionary of indices used in cross validation for both the binding type model and the GCN model.
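
The splitting described above can be illustrated with a minimal, standard-library sketch. It is not the package's implementation: the function name split_test_and_cv, the round-robin balancing per label, and the keys of the returned dictionary are assumptions made purely for illustration.

```python
import random
from collections import defaultdict


def split_test_and_cv(labels, cv_splits=3, test_fraction=0.25, random_state=0):
    """Hypothetical helper: hold out a test set, then deal the rest into balanced folds."""
    rng = random.Random(random_state)  # a fixed seed makes the split reproducible
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    test = []
    folds = [[] for _ in range(cv_splits)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        # Deal the remaining indices round-robin so every fold sees every label.
        for position, index in enumerate(indices[n_test:]):
            folds[position % cv_splits].append(index)
    return {"TEST": sorted(test), "CV": [sorted(fold) for fold in folds]}


# Toy primary data: 8 points of binding type 1 followed by 8 of binding type 2.
binding_types = [1] * 8 + [2] * 8
indices_dictionary = split_test_and_cv(binding_types)
```

With an integer random_state the same call always returns the same indices; in this sketch a different seed simply reshuffles which points land in the test set and in each fold.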

get_secondary_data(NUM_SAMPLES, INDICES, iterations=1, IS_TRAINING_SET=False)

Get secondary data (complex spectra).

Parameters
  • NUM_SAMPLES (int) – Number of complex spectra to generate

  • INDICES (list of int) – The indices of the primary data that will be selected.

  • iterations (int) – Number of times secondary data will be generated and strung together. Allows for a more diverse set of complex spectra for testing.

  • IS_TRAINING_SET (bool) – Indicates if the secondary set will be used for training or testing

Returns

  • X (numpy.ndarray) – Coverage shifted frequencies and intensities

  • Y – The target variable histograms, either binding-type or GCN labels.
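
The role of iterations can be sketched as repeated generation followed by row-wise concatenation. The generator below is a hypothetical stand-in that returns random toy data, and the sketch assumes each iteration yields NUM_SAMPLES rows; neither is taken from the package.

```python
import random


def generate_batch(num_samples, rng):
    """Hypothetical stand-in for one round of complex-spectra generation."""
    X = [[rng.random() for _ in range(4)] for _ in range(num_samples)]  # 4-point toy spectra
    Y = [[rng.random() for _ in range(2)] for _ in range(num_samples)]  # 2-bin toy histograms
    return X, Y


def generate_secondary_data(num_samples, iterations=1, seed=0):
    """Generate several batches and string them together row-wise."""
    rng = random.Random(seed)
    X_all, Y_all = [], []
    for _ in range(iterations):
        X, Y = generate_batch(num_samples, rng)
        X_all.extend(X)
        Y_all.extend(Y)
    return X_all, Y_all


X, Y = generate_secondary_data(num_samples=5, iterations=3)
```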

get_test_results()

Returns a dictionary of results from a trained neural network evaluated on the test set.

Variables
  • X_Test (numpy.ndarray) – Complex spectra on which to test the model.

  • Y_Test (numpy.ndarray) – The target variable histograms of the test data.

Returns

Dict – Dictionary of test results and statistics

Return type

dict

get_test_secondary_data()

Get one batch of secondary data (complex spectra) for the test indices.

Returns

  • X_Test (numpy.ndarray) – Complex spectra on which to test the model.

  • Y_Test (numpy.ndarray) – The target variable histograms of the test data.

run_CV(write_file=False, CV_RESULTS_FILE=None)

Run cross validation on one cross validation set at a time.

Parameters
  • write_file (bool) – Indicate whether cross validation results will be written out to a file.

  • CV_RESULTS_FILE (str or None) – File where cross validation results will be written as a json.

Variables

CV_RESULTS_FILE (str or None) – File where cross validation results will be written as a json.

Returns

DictList – List of dictionaries of cross validation and test results

Return type

list of dict

run_CV_multiprocess(write_file=False, CV_RESULTS_FILE=None, num_procs=None)

Run multiple cross validation sets and the test set simultaneously.

Parameters
  • write_file (bool) – Indicate whether cross validation results will be written out to a file.

  • CV_RESULTS_FILE (str or None) – File where cross validation results will be written as a json.

  • num_procs (int) – Number of processes to run at any given time.

Variables

CV_RESULTS_FILE (str or None) – File where cross validation results will be written as a json.

Returns

DictList – List of dictionaries of cross validation and test results

Return type

list of dict

run_single_CV(CV_INDEX_or_TEST)

Run a single cross validation process or the test set.

Parameters

CV_INDEX_or_TEST (str or int) – Indicates which cross validation set or if the test set is to be run.

Returns

Dict – Dictionary of a single cross validation or test result

Return type

dict

set_model_parameters(TARGET, COVERAGE, MAX_COVERAGES, NN_PROPERTIES, NUM_TRAIN, NUM_VAL, NUM_TEST, MIN_GCN_PER_LABEL=0, NUM_GCN_LABELS=11, GCN_ALL=False, TRAINING_ERROR=None, LOW_FREQUENCY=200, HIGH_FREQUENCY=2200, ENERGY_POINTS=501)

Sets model parameters both for generating the secondary data (complex spectra) and for running the neural network.

Parameters
  • TARGET (str) – Geometric descriptor for which the target histogram is to be generated. Can be binding_type, GCN, or combine_hollow_sites. If it is combine_hollow_sites then 3-fold and 4-fold sites are grouped together.

  • COVERAGE (str or float) – The coverage at which the synthetic spectra are generated. If ‘high’, spectra at various coverages are generated.

  • MAX_COVERAGES (list) – Maximum coverages allowed for each binding-type if COVERAGE is set to ‘high’

  • NN_PROPERTIES (dict) – Dictionary of properties for the neural network.

  • NUM_TRAIN (int) – Number of secondary data (complex spectra) in a single training set.

  • NUM_VAL (int) – Number of secondary data (complex spectra) in each validation set.

  • NUM_TEST (int) – Number of secondary data (complex spectra) in the test set.

  • MIN_GCN_PER_LABEL (int) – Minimum number of primary data points in each GCN group

  • NUM_GCN_LABELS (int) – The target number of GCN labels into which to group the primary data points (simple spectra).

  • GCN_ALL (bool) – Whether or not to include all binding types in tabulating GCN values. If False, then only the selected binding types determined during the creation of the cross validation indices will be used.

  • TRAINING_ERROR (str, float, or None) – Indicates the type of error induced in the training set. Can be ‘gaussian’, a float, or None. If ‘gaussian’ then gaussian error is added to the scaling factor. If a float then uniform error is added after scaling of the primary DFT data.

  • LOW_FREQUENCY (float) – The lowest frequency for which synthetic spectra are generated.

  • HIGH_FREQUENCY (float) – The highest frequency for which synthetic spectra are generated.

  • ENERGY_POINTS (int) – The number of points into which the synthetic spectra are discretized.

Variables
  • TARGET (str) – Geometric descriptor for which the target histogram is to be generated. Can be binding_type, GCN, or combine_hollow_sites. If it is combine_hollow_sites then 3-fold and 4-fold sites are grouped together.

  • COVERAGE (str or float) – The coverage at which the synthetic spectra are generated. If ‘high’, spectra at various coverages are generated.

  • MAX_COVERAGES (list) – Maximum coverages allowed for each binding-type if COVERAGE is set to ‘high’

  • NN_PROPERTIES (dict) – Dictionary of properties for the neural network.

  • NUM_TRAIN (int) – Number of secondary data (complex spectra) in a single training set.

  • NUM_VAL (int) – Number of secondary data (complex spectra) in each validation set.

  • NUM_TEST (int) – Number of secondary data (complex spectra) in the test set.

  • MIN_GCN_PER_LABEL (int) – Minimum number of primary data points in each GCN group

  • NUM_GCN_LABELS (int) – The target number of GCN labels into which to group the primary data points (simple spectra).

  • GCN_ALL (bool) – Whether or not to include all binding types in tabulating GCN values. If False, then only the selected binding types determined during the creation of the cross validation indices will be used.

  • TRAINING_ERROR (str, float, or None) – Indicates the type of error induced in the training set. Can be ‘gaussian’, a float, or None. If ‘gaussian’ then gaussian error is added to the scaling factor. If a float then uniform error is added after scaling of the primary DFT data.

  • LOW_FREQUENCY (float) – The lowest frequency for which synthetic spectra are generated.

  • HIGH_FREQUENCY (float) – The highest frequency for which synthetic spectra are generated.

  • ENERGY_POINTS (int) – The number of points into which the synthetic spectra are discretized.
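
As a rough illustration of how LOW_FREQUENCY, HIGH_FREQUENCY, and ENERGY_POINTS relate, the sketch below builds an evenly spaced frequency grid from the default values. The helper energy_grid is hypothetical, and the assumption that the grid is evenly spaced and includes both end points is not stated in this reference.

```python
def energy_grid(low_frequency=200.0, high_frequency=2200.0, energy_points=501):
    """Hypothetical helper: evenly spaced frequencies that include both end points."""
    step = (high_frequency - low_frequency) / (energy_points - 1)
    return [low_frequency + i * step for i in range(energy_points)]


grid = energy_grid()
spacing = grid[1] - grid[0]  # 4.0 frequency units for the default arguments
```

Under that assumption, the defaults correspond to one point every 4 frequency units across the 200 to 2200 range.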

set_nn_parameters(NN_PROPERTIES)

Set properties of the neural network so they can be changed after all other model properties have been set.

Variables

NN_PROPERTIES (dict) – Dictionary of properties for the neural network.

The LOAD_CROSS_VALIDATION class

class LOAD_CROSS_VALIDATION(cv_indices_path=None, cross_validation_path=None)

Bases: jl_spectra_2_structure.cross_validation.CROSS_VALIDATION

Child class for loading cross validation results.

Parameters
  • cv_indices_path (str) – Folder path where indices for cross validation are saved.

  • cross_validation_path (str) – Folder path where cross validation results are stored.

Variables
  • CV_FILES (list of str) – List of cross validation files.

  • CV_INDICES_PATH_OLD (str) – Folder path where indices for cross validation are saved.

  • CV_PATH_OLD (str) – Folder path where cross validation results are stored.

get_NN(dictionary)

Loads a neural network from a dictionary containing properties such as number of nodes, activation functions, their coefficients and their intercepts.

Parameters

dictionary (dict) – Dictionary containing neural network results and properties

Returns

NN – Trained neural network.

Return type

neural_network.MLPRegressor
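
To illustrate why activation functions, coefficients, and intercepts are enough to restore a trained network, the sketch below rebuilds a forward pass from such a dictionary using only the standard library. The key names (activation, coefs, intercepts) and the helper forward_pass are hypothetical and do not describe the stored format used by this class; the output layer is assumed to be linear, as is common for regression networks.

```python
def relu(value):
    return max(0.0, value)


ACTIVATIONS = {"relu": relu, "identity": lambda value: value}


def forward_pass(nn_dictionary, features):
    """Hypothetical helper: evaluate a stored multilayer perceptron on one sample."""
    activation = ACTIVATIONS[nn_dictionary["activation"]]
    layers = list(zip(nn_dictionary["coefs"], nn_dictionary["intercepts"]))
    values = list(features)
    for layer_number, (weights, biases) in enumerate(layers):
        # weights[i][j] connects input node i to output node j of this layer.
        values = [
            sum(values[i] * weights[i][j] for i in range(len(values))) + biases[j]
            for j in range(len(biases))
        ]
        if layer_number < len(layers) - 1:  # hidden layers only; the output stays linear
            values = [activation(v) for v in values]
    return values


stored = {
    "activation": "relu",
    "coefs": [[[1.0, -1.0], [0.5, 2.0]], [[1.0], [1.0]]],  # 2 inputs, 2 hidden nodes, 1 output
    "intercepts": [[0.0, -1.0], [0.25]],
}
prediction = forward_pass(stored, [2.0, 1.0])
```

The number of nodes per layer is implied by the shapes of the weight matrices, so it does not need to be passed separately in this sketch.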

get_NN_ensemble(indices, use_all_cv_NN=False)

Loads an ensemble of neural networks from dictionaries containing properties such as number of nodes, activation functions, their coefficients and their intercepts.

Parameters
  • indices (list) – Indicates which cross validation results to use in generation of a NN ensemble.

  • use_all_cv_NN (bool) – Indicates whether to use NN from all cross validation runs in addition to the one trained on all the cross validation data.

Returns

NN_ensemble – Trained neural network ensemble.

Return type

neural_network.MLPRegressor ensemble

get_best_models(models_per_category, standard_deviations)

Identify the best neural networks for each category (CO, GCN, etc.)

Parameters
  • models_per_category (int) – Number of models to include for each category.

  • standard_deviations (float) – Number of standard deviations to add to the mean cross validation loss in order to select the best model. Optimizes both bias and variance.

Variables

BEST_MODELS (dict) – Dictionary of just the best model results and their corresponding model index value

Returns

BEST_MODELS_dict – Dictionary of just the best model results and their corresponding model index value

Return type

dict
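
The selection criterion described by standard_deviations can be sketched as ranking each model by its mean cross validation loss plus a multiple of the fold-to-fold standard deviation. The helper rank_models, the layout of cv_losses, the loss values, and the use of the population standard deviation are all assumptions for illustration.

```python
from statistics import fmean, pstdev


def rank_models(cv_losses, models_per_category, standard_deviations):
    """Hypothetical helper: keep the models with the lowest mean + k * std loss."""
    scores = {
        model_index: fmean(losses) + standard_deviations * pstdev(losses)
        for model_index, losses in cv_losses.items()
    }
    best = sorted(scores, key=scores.get)[:models_per_category]
    return {model_index: scores[model_index] for model_index in best}


# Made-up validation losses for three models over three folds.
cv_losses = {
    0: [0.10, 0.10, 0.10],  # slightly higher mean, but consistent across folds
    1: [0.02, 0.09, 0.16],  # lowest mean, but large fold-to-fold spread
    2: [0.20, 0.21, 0.22],
}
best_by_mean = rank_models(cv_losses, models_per_category=1, standard_deviations=0)
best_penalized = rank_models(cv_losses, models_per_category=1, standard_deviations=1)
```

With standard_deviations=0 the model with the lowest mean loss is chosen (model 1 here); adding one standard deviation favors the more consistent model 0, which is the bias and variance trade-off the parameter controls.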

get_ensemble_cv()

Get cross validation error for an ensemble of models.

Variables

ENSEMBLE_MODELS (dict) – Dictionary of ensemble models

Returns

ENSEMBLE_MODELS_dict – Dictionary of ensemble models

Return type

dict

get_keys(dictionary)

Get all keys in a dictionary of dictionaries for viewing.

Parameters

dictionary (dict) – A dictionary of dictionaries.

Returns

dictionary_of_keys – A dictionary of keys.

Return type

dict
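
One way to picture a dictionary of keys is a copy of the nested structure with the values dropped. The sketch below is only an illustration: the helper collect_keys, the sample data, and the choice of None as a placeholder for leaf values are assumptions, and the actual return layout of get_keys may differ.

```python
def collect_keys(dictionary):
    """Hypothetical helper: mirror the key structure of nested dictionaries without the values."""
    return {
        key: collect_keys(value) if isinstance(value, dict) else None
        for key, value in dictionary.items()
    }


results = {
    "CO": {"loss": [0.3, 0.2], "NN": {"coefs": [], "intercepts": []}},
    "GCN": {"loss": [0.4]},
}
keys_only = collect_keys(results)
```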

load_CV_class(index, new_cv_indices_path=None, new_cross_validation_path=None)

Load all stored data from a single cross validation run.

Parameters
  • index (int) – Indicates which cross validation result to load

  • new_cv_indices_path (str) – Folder path where indices for new cross validation are to be saved.

  • new_cross_validation_path (str) – Folder path where new cross validation results are to be saved.

Variables
  • NN (neural_network.MLPRegressor) – Trained neural network.

  • CV_DICT_LIST (list of dict) – List of dictionaries with cross validation results, statistics, and the neural network trained on all cross validation data.

load_all_CV_data()

Load results from all cross validation runs in order to identify the best neural network for each category (CO, GCN, etc.)

Variables

CV_RESULTS (dict) – Dictionary of all cross validation results

plot_models(dictionary, figure_directory='show', model_list=None, xlim=[0, 200], ylim1=[0, 0.3], ylim2=[0, 0.3])

Plot model learning curves for all models in a given dictionary.

Parameters
  • dictionary (dict) – A dictionary of dictionaries.

  • figure_directory (str) – Folder location to save the learning curves

  • model_list (None or list of int) – If None each learning curve is displayed independently. If list then the curves are compiled into one figure of two or four panels.

  • xlim (list) – The start and end number for the epochs to show

  • ylim1 (list) – The range of loss to show in the first figure if model_list is of length 2. If model_list is of length 4 it is the range on the first and third figures.

  • ylim2 (list) – The range of loss to show in the second figure if model_list is of length 2. If model_list is of length 4 it is the range on the second and fourth figures.

plot_parity_plots(figure_directory='show', model_list=None, use_ensemble=False)

Plot parity plots for models whose index is in model_list.

Parameters
  • figure_directory (str) – Folder location to save the parity plots

  • model_list (None or list of int) – Can be None only if load_CV_class(index) has already been run.

  • use_ensemble (bool) – Dictates whether an ensemble of CV models is used.

The NESTED_DICT class

class NESTED_DICT

Implementation of Perl’s autovivification feature.

Returns

dict – A dictionary in which setting a key, value pair at any nesting level creates all necessary intermediate keys if they do not already exist.

Return type

NESTED_DICT
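
Autovivification can be sketched in a few lines with the dict.__missing__ hook. The class below is an illustrative stand-in with a hypothetical name and sample keys, not the source of NESTED_DICT.

```python
class AutoVivifyingDict(dict):
    """Illustrative stand-in showing autovivification (not the package's NESTED_DICT source)."""

    def __missing__(self, key):
        # dict.__getitem__ calls this when the key is absent:
        # create a child dictionary, store it under the key, and return it.
        child = self[key] = type(self)()
        return child


results = AutoVivifyingDict()
results["CO"]["binding_type"]["loss"] = 0.12  # intermediate levels are created on demand
```

A side effect of this pattern is that merely reading a missing key also creates it, which is one reason to convert to a plain dictionary once the structure has been filled in.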

The NESTED_DICT_to_DICT function

NESTED_DICT_to_DICT(nested_dict)

Converts a nested dict to a traditional Python dictionary.

Parameters

nested_dict (NESTED_DICT) – A dictionary of type NESTED_DICT

Returns

dictionary – A Python dictionary

Return type

dict
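
The conversion can be sketched as a recursive copy into built-in dictionaries. The helper to_plain_dict is hypothetical, and a collections.defaultdict tree stands in for the autovivifying input so that the example is self-contained.

```python
from collections import defaultdict


def to_plain_dict(nested):
    """Hypothetical helper: recursively copy any dict subclass into built-in dicts."""
    return {
        key: to_plain_dict(value) if isinstance(value, dict) else value
        for key, value in nested.items()
    }


def tree():
    return defaultdict(tree)  # stand-in for an autovivifying dictionary


nested = tree()
nested["CO"]["GCN"]["loss"] = 0.2
plain = to_plain_dict(nested)
```

After conversion, looking up a missing key raises KeyError again instead of silently creating a new entry.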

The NN_ENSEMBLE class

class NN_ENSEMBLE(NN_LIST)

Class for generating an ensemble of neural networks.

Parameters

NN_LIST (list) – A list of type NN

Variables

NN_LIST (list) – A list of type NN

predict(X, create_predictions_list=False)

Predict value(s) using an ensemble.

Parameters

X (numpy.array) – A numpy 2D array of input variables

Variables
  • PREDICTIONS_LIST (list) – List of predictions if create_predictions_list=True

  • STD (numpy.array) – A numpy 2D array of standard deviation in the predictions.

Returns

PREDICTIONS – A numpy 2D array of average predictions

Return type

numpy.array
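
The averaging performed by an ensemble can be sketched with the standard library alone. The model class and helper below are hypothetical stand-ins, nested lists are used in place of numpy arrays, and the population standard deviation is an assumption about how the spread is measured.

```python
from statistics import fmean, pstdev


class ConstantShiftModel:
    """Hypothetical stand-in for a trained network exposing a predict method."""

    def __init__(self, shift):
        self.shift = shift

    def predict(self, X):
        return [[value + self.shift for value in row] for row in X]


def ensemble_predict(models, X):
    """Average member predictions element-wise and report their spread."""
    predictions_list = [model.predict(X) for model in models]
    n_rows, n_columns = len(X), len(predictions_list[0][0])
    mean = [[fmean([p[r][c] for p in predictions_list]) for c in range(n_columns)]
            for r in range(n_rows)]
    std = [[pstdev([p[r][c] for p in predictions_list]) for c in range(n_columns)]
           for r in range(n_rows)]
    return mean, std, predictions_list


models = [ConstantShiftModel(shift) for shift in (-1.0, 0.0, 1.0)]
mean, std, predictions_list = ensemble_predict(models, [[1.0, 2.0], [3.0, 4.0]])
```

Because the three toy models shift the input by -1, 0, and +1, their average reproduces the input and the standard deviation is the same for every element.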