cross_validation¶
The CROSS_VALIDATION
class¶
-
class
CROSS_VALIDATION
(ADSORBATE='CO', INCLUDED_BINDING_TYPES=[1, 2, 3, 4], cv_indices_path=None, cross_validation_path=None, nanoparticle_path=None, high_coverage_path=None, coverage_scaling_path=None, VERBOSE=False, SAVE_ALL_NN=True)[source]¶ Class for running cross validation. Splits primary data into balanced folds and generates sets of secondary data.
- Parameters
ADSORBATE (str) – Adsorbate for which the spectra are to be generated.
INCLUDED_BINDING_TYPES (list) – Binding types whose frequencies/intensities from the primary data set will be included in generating the complex spectra.
cv_indices_path (str) – Folder path where indices for cross validation are saved or where they are to be created if they have not yet been created.
cross_validation_path (str) – Folder path where cross validation results are to be created.
nanoparticle_path (str) – File path where nanoparticle or single adsorbate json data is saved.
high_coverage_path (str) – File path where high coverage data for CO is saved.
coverage_scaling_path (str) – File path where coverage scaling coefficients are saved.
VERBOSE (bool) – Controls the printing of status statements.
- Variables
CV_INDICES_PATH (str) – Folder path for cross validation indices.
CV_PATH (str) – Folder path where cross validation results are to be created.
NANO_PATH (str) – File path where nanoparticle or single adsorbate json data is saved.
HIGH_COV_PATH (str) – File path where high coverage data for CO is saved.
COV_SCALE_PATH (str) – File path where coverage scaling coefficients are saved.
VERBOSE (bool) – Controls the printing of status statements.
SAVE_ALL_NN (bool) – Controls whether a NN from each cross validation is saved or just from the test set.
ADSORBATE (str) – Adsorbate for which the spectra are to be generated.
INCLUDED_BINDING_TYPES (list) – Binding types whose frequencies/intensities from the primary data set will be included in generating the complex spectra.
-
_get_default_cross_validation_path
()[source]¶ Get cross validation path if not set by user.
- Returns
cross_validation_path – Folder path where cross validation results are to be created.
- Return type
-
_get_default_cv_indices_path
()[source]¶ Get cross validation indices path if not set by user.
- Returns
cv_indices_path – Folder path where cross validation indices are to be created.
- Return type
-
_get_state
()[source]¶ Get important state variables of the class
- Returns
Dict – State variables necessary to recreate the cross validation run.
- Return type
-
_run_NN
(NUM_SAMPLES, INDICES, X_compare, y_compare, IS_TEST)[source]¶ Run the neural network and generate validation statistics.
- Parameters
NUM_SAMPLES (int) – Number of complex spectra to generate
INDICES (list of int) – The indices of the primary data that will be selected.
X_compare (numpy.ndarray) – Complex spectra on which to test the model.
y_compare (numpy.ndarray) – The target variable histograms of the test or validation data.
IS_TEST (bool) – Indicates whether comparison set is the test set.
- Variables
NN (neural_network.MLPRegressor) – Trained neural network. Only instantiated if IS_TEST == True
- Returns
Dict – Dictionary of validation or test results and statistics
- Return type
-
_set_ir_gen_class
()[source]¶ Instantiates the class for generating complex spectra with parameters set by set_model_parameters().
-
generate_test_cv_indices
(CV_SPLITS=3, BINDING_TYPE_FOR_GCN=[1], test_fraction=0.25, random_state=0, read_file=False, write_file=False)[source]¶ Function to generate the test and cross validation indices that will be used to split the primary data.
- Parameters
CV_SPLITS (int) – The number of cross validation splits
BINDING_TYPE_FOR_GCN (list or 'ALL') – The binding types to consider when tabulating the GCN values. Primary data with other binding types is considered noise. If ‘ALL’ is selected all binding types are used and no noise is added from other binding types as there are no other binding types.
test_fraction (float) – The fraction of primary data to keep for testing.
random_state (int) – Can be set to None. Allows reproducibility.
write_file (bool) – Indicates whether the indices of the selected primary data for cross validation are written to a json file.
read_file (bool) – Indicates whether indices for selecting primary data during cross validation and testing are read from a file or generated directly by this function.
- Variables
CV_SPLITS (int) – The number of cross validation splits
BINDING_TYPE_FOR_GCN (list or 'ALL') – The binding types to consider when tabulating the GCN values. Primary data with other binding types is considered noise. If ‘ALL’ is selected all binding types are used and no noise is added from other binding types as there are no other binding types.
INDICES_DICTIONARY (dict) – Dictionary of indices used in cross validation for both the binding type model and the GCN model.
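The splitting described above can be illustrated with a minimal sketch. It assumes a stratified scheme in which each label contributes the same share to the test set and to every fold. The function `split_test_and_cv`, its arguments, and the layout of the returned dictionary are hypothetical and are not taken from the library's implementation.

```python
import random
from collections import defaultdict


def split_test_and_cv(labels, cv_splits=3, test_fraction=0.25, random_state=0):
    """Stratified split of sample indices into a test set and CV folds."""
    rng = random.Random(random_state)  # random_state=None gives a fresh seed

    # Group the sample indices by label (for example by binding type).
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    test_indices = []
    folds = [[] for _ in range(cv_splits)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        num_test = round(len(indices) * test_fraction)
        test_indices.extend(indices[:num_test])
        # Deal the remaining indices round-robin so the folds stay balanced.
        for position, index in enumerate(indices[num_test:]):
            folds[position % cv_splits].append(index)

    # Each CV split validates on one fold and trains on the others.
    cv_indices = []
    for k in range(cv_splits):
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        cv_indices.append({"train": train, "val": sorted(folds[k])})
    return {"test": sorted(test_indices), "cv": cv_indices}


# Sixteen hypothetical primary data points, eight per label.
split = split_test_and_cv([1] * 8 + [2] * 8)
print(len(split["test"]), [len(fold["val"]) for fold in split["cv"]])
```

Fixing `random_state` makes the split reproducible, which is what allows indices written to a file to be reused in a later run.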
-
get_secondary_data
(NUM_SAMPLES, INDICES, iterations=1, IS_TRAINING_SET=False)[source]¶ Get secondary data (complex spectra)
- Parameters
NUM_SAMPLES (int) – Number of complex spectra to generate
INDICES (list of int) – The indices of the primary data that will be selected.
iterations (int) – Number of times secondary data will be generated and strung together. Allows for a more diverse set of complex spectra for testing.
IS_TRAINING_SET (bool) – Indicates if the secondary set will be used for training or testing
- Returns
X (numpy.ndarray) – Coverage shifted frequencies and intensities
Y – The target variable histograms, either binding-type or GCN labels.
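To make the idea of a target histogram concrete, here is a small sketch. It assumes that the histogram for one complex spectrum is the fraction of contributing sites in each category; the source does not confirm this normalization, and `target_histogram` is a hypothetical helper rather than a library function.

```python
from collections import Counter


def target_histogram(site_labels, categories):
    """Return the fraction of sites that fall into each category."""
    counts = Counter(site_labels)
    total = sum(counts[category] for category in categories)
    if total == 0:
        return [0.0] * len(categories)
    return [counts[category] / total for category in categories]


# One hypothetical complex spectrum built from five adsorbed molecules:
# three on binding type 1 and two on binding type 2.
print(target_histogram([1, 1, 2, 1, 2], categories=[1, 2, 3, 4]))
# prints [0.6, 0.4, 0.0, 0.0]
```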
-
get_test_results
()[source]¶ Returns a dictionary of a trained neural network evaluated on the test results.
- Variables
X_Test (numpy.ndarray) – Complex spectra on which to test the model.
Y_Test (numpy.ndarray) – The target variable histograms of the test data.
- Returns
Dict – Dictionary of test results and statistics
- Return type
-
get_test_secondary_data
()[source]¶ Get one batch of secondary data (complex spectra) for the test indices.
- Returns
X_Test (numpy.ndarray) – Complex spectra on which to test the model.
Y_Test (numpy.ndarray) – The target variable histograms of the test data.
-
run_CV
(write_file=False, CV_RESULTS_FILE=None)[source]¶ Run cross validation on one cross validation set at a time.
-
run_CV_multiprocess
(write_file=False, CV_RESULTS_FILE=None, num_procs=None)[source]¶ Run multiple cross validation sets and the test set simultaneously.
-
set_model_parameters
(TARGET, COVERAGE, MAX_COVERAGES, NN_PROPERTIES, NUM_TRAIN, NUM_VAL, NUM_TEST, MIN_GCN_PER_LABEL=0, NUM_GCN_LABELS=11, GCN_ALL=False, TRAINING_ERROR=None, LOW_FREQUENCY=200, HIGH_FREQUENCY=2200, ENERGY_POINTS=501)[source]¶ Sets model parameters both for generating the secondary data (complex spectra) and for running the neural network.
- Parameters
TARGET (str) – Geometric descriptor for which the target histogram is to be generated. Can be binding_type, GCN, or combine_hollow_sites. If it is combine_hollow_sites then 3-fold and 4-fold sites are grouped together.
COVERAGE (str or float) – The coverage at which the synthetic spectra are generated. If ‘high’, spectra at various coverages are generated.
MAX_COVERAGES (list) – Maximum coverages allowed for each binding-type if COVERAGE is set to ‘high’
NN_PROPERTIES (dict) – Dictionary of properties for the neural network.
NUM_TRAIN (int) – Number of secondary data (complex spectra) in a single training set.
NUM_VAL (int) – Number of secondary data (complex spectra) in each validation set.
NUM_TEST (int) – Number of secondary data (complex spectra) in the test set.
MIN_GCN_PER_LABEL (int) – Minimum number of primary data points in each GCN group
NUM_GCN_LABELS (int) – The target number of GCN labels into which to group the primary data points (simple spectra).
GCN_ALL (bool) – Whether or not to include all binding types in tabulating GCN values. If False, then only the selected binding types determined during the creation of the cross validation indices will be used.
TRAINING_ERROR (str, float, or None) – Indicates the type of error induced in the training set. Can be ‘gaussian’, a float, or None. If ‘gaussian’, then Gaussian error is added to the scaling factor. If a float, then uniform error is added after scaling of the primary DFT data.
LOW_FREQUENCY (float) – The lowest frequency for which synthetic spectra are generated.
HIGH_FREQUENCY (float) – The highest frequency for which synthetic spectra are generated.
ENERGY_POINTS (int) – The number of points into which the synthetic spectra are discretized.
- Variables
TARGET (str) – Geometric descriptor for which the target histogram is to be generated. Can be binding_type, GCN, or combine_hollow_sites. If it is combine_hollow_sites then 3-fold and 4-fold sites are grouped together.
COVERAGE (str or float) – The coverage at which the synthetic spectra are generated. If ‘high’, spectra at various coverages are generated.
MAX_COVERAGES (list) – Maximum coverages allowed for each binding-type if COVERAGE is set to ‘high’
NN_PROPERTIES (dict) – Dictionary of properties for the neural network.
NUM_TRAIN (int) – Number of secondary data (complex spectra) in a single training set.
NUM_VAL (int) – Number of secondary data (complex spectra) in each validation set.
NUM_TEST (int) – Number of secondary data (complex spectra) in the test set.
MIN_GCN_PER_LABEL (int) – Minimum number of primary data points in each GCN group
NUM_GCN_LABELS (int) – The target number of GCN labels into which to group the primary data points (simple spectra).
GCN_ALL (bool) – Whether or not to include all binding types in tabulating GCN values. If False, then only the selected binding types determined during the creation of the cross validation indices will be used.
TRAINING_ERROR (str, float, or None) – Indicates the type of error induced in the training set. Can be ‘gaussian’, a float, or None. If ‘gaussian’, then Gaussian error is added to the scaling factor. If a float, then uniform error is added after scaling of the primary DFT data.
LOW_FREQUENCY (float) – The lowest frequency for which synthetic spectra are generated.
HIGH_FREQUENCY (float) – The highest frequency for which synthetic spectra are generated.
ENERGY_POINTS (int) – The number of points into which the synthetic spectra are discretized.
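As a worked example of how LOW_FREQUENCY, HIGH_FREQUENCY, and ENERGY_POINTS relate, the sketch below builds the frequency grid implied by the default values. It assumes evenly spaced points with both end points included, which the source does not state explicitly; `frequency_grid` is a hypothetical helper.

```python
def frequency_grid(low_frequency=200, high_frequency=2200, energy_points=501):
    """Evenly spaced grid that includes both end points (an assumption)."""
    step = (high_frequency - low_frequency) / (energy_points - 1)
    return [low_frequency + i * step for i in range(energy_points)]


grid = frequency_grid()
# With the defaults the spacing is (2200 - 200) / (501 - 1) = 4.
print(len(grid), grid[0], grid[1], grid[-1])
```

Under this assumption, doubling the resolution means choosing ENERGY_POINTS=1001 rather than 1002, because the number of intervals is one less than the number of points.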
The LOAD_CROSS_VALIDATION
class¶
-
class
LOAD_CROSS_VALIDATION
(cv_indices_path=None, cross_validation_path=None)[source]¶ Bases:
jl_spectra_2_structure.cross_validation.CROSS_VALIDATION
Child class for loading cross validation results.
- Parameters
- Variables
-
get_NN
(dictionary)[source]¶ Loads a neural network from a dictionary containing properties such as number of nodes, activation functions, their coefficients and their intercepts.
- Parameters
dictionary (dict) – Dictionary containing neural network results and properties
- Returns
NN – Trained neural network.
- Return type
neural_network.MLPRegressor
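The sketch below shows how stored coefficients and intercepts are enough to reproduce a network's predictions. The dictionary keys `activation`, `coefs`, and `intercepts` are hypothetical and may not match the keys the library writes. The weight layout (input index first, output index second) mirrors scikit-learn's `coefs_` attribute, and the linear output layer is an assumption that matches `MLPRegressor`.

```python
import math

ACTIVATIONS = {
    "relu": lambda z: max(0.0, z),
    "tanh": math.tanh,
    "identity": lambda z: z,
}


def predict_from_dict(nn_dict, x):
    """Evaluate a feed-forward network stored as plain lists."""
    activate = ACTIVATIONS[nn_dict["activation"]]
    layers = list(zip(nn_dict["coefs"], nn_dict["intercepts"]))
    values = list(x)
    for layer_number, (weights, biases) in enumerate(layers):
        # weights[i][j] connects input i to output j of this layer.
        outputs = [
            sum(values[i] * weights[i][j] for i in range(len(values))) + biases[j]
            for j in range(len(biases))
        ]
        is_last = layer_number == len(layers) - 1
        # Hidden layers use the stored activation; the output layer is linear.
        values = outputs if is_last else [activate(z) for z in outputs]
    return values


# Hypothetical network: two inputs, two hidden nodes, one output.
nn_dict = {
    "activation": "relu",
    "coefs": [[[1.0, -1.0], [0.5, 2.0]], [[1.0], [1.0]]],
    "intercepts": [[0.0, 0.0], [0.5]],
}
print(predict_from_dict(nn_dict, [2.0, 1.0]))  # prints [3.0]
```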
-
get_NN_ensemble
(indices, use_all_cv_NN=False)[source]¶ Loads a neural network from a dictionary containing properties such as number of nodes, activation functions, their coefficients and their intercepts.
- Parameters
indices (list) – Indicates which cross validation results to use in generation of a NN ensemble.
use_all_cv_NN (bool) – Indicates whether to use NN from all cross validation runs in addition to the one trained on all the cross validation data.
- Returns
NN_ensemble – Trained neural network ensemble.
- Return type
neural_network.MLPRegressor ensemble
-
get_best_models
(models_per_category, standard_deviations)[source]¶ Identify the best neural networks for each category (CO, GCN, etc.)
- Parameters
- Variables
BEST_MODELS (dict) – Dictionary of just the best model results and their corresponding model index value
- Returns
BEST_MODELS_dict – Dictionary of just the best model results and their corresponding model index value
- Return type
-
load_CV_class
(index, new_cv_indices_path=None, new_cross_validation_path=None)[source]¶ Load all stored data from a single cross validation run.
- Parameters
- Variables
NN (neural_network.MLPRegressor) – Trained neural network.
CV_DICT_LIST (list of dict) – List of dictionaries with cross validation results, statistics, and the neural network trained on all cross validation data.
-
load_all_CV_data
()[source]¶ Load results from all cross validation runs in order to identify the best neural network for each category (CO, GCN, etc.)
- Variables
CV_RESULTS (dict) – Dictionary of all cross validation results
-
plot_models
(dictionary, figure_directory='show', model_list=None, xlim=[0, 200], ylim1=[0, 0.3], ylim2=[0, 0.3])[source]¶ Plot model learning curves for all models in a given dictionary.
- Parameters
dictionary (dict) – A dictionary of dictionaries.
figure_directory (str) – Folder location to save the learning curves
model_list (None or list of int) – If None each learning curve is displayed independently. If list then the curves are compiled into one figure of two or four panels.
xlim (list) – The start and end number for the epochs to show
ylim1 (list) – The range of loss to show in the first figure if model_list is of length 2. If model_list is of length 4 it is the range on the first and third figures.
ylim2 (list) – The range of loss to show in the second figure if model_list is of length 2. If model_list is of length 4 it is the range on the second and fourth figures.
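The rule for assigning ylim1 and ylim2 to panels can be written compactly. The sketch below only restates the mapping described in the parameter list; `panel_ylims` is a hypothetical helper and does no plotting.

```python
def panel_ylims(model_list, ylim1=(0, 0.3), ylim2=(0, 0.3)):
    """Return the y-range for each panel, in panel order."""
    if len(model_list) not in (2, 4):
        raise ValueError("expected a model_list of length two or four")
    # First and third panels use ylim1; second and fourth panels use ylim2.
    return [
        ylim1 if panel % 2 == 0 else ylim2
        for panel in range(len(model_list))
    ]


# Four hypothetical model indices give four panels.
print(panel_ylims([0, 1, 2, 3], ylim1=(0, 0.2), ylim2=(0, 0.5)))
# prints [(0, 0.2), (0, 0.5), (0, 0.2), (0, 0.5)]
```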
The NESTED_DICT
class¶
The NESTED_DICT_to_DICT
function¶
-
NESTED_DICT_to_DICT
(nested_dict)[source]¶ Converts a nested dict to a traditional python dictionary.
- Parameters
nested_dict (NESTED_DICT) – A dictionary of type NESTED_DICT
- Returns
dictionary – A python dictionary
- Return type
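A minimal sketch of this kind of conversion follows. It assumes the nested type behaves like a recursive `collections.defaultdict`; the class `NestedDict` and the function `nested_to_dict` are hypothetical stand-ins, not the library's NESTED_DICT or NESTED_DICT_to_DICT.

```python
from collections import defaultdict


class NestedDict(defaultdict):
    """Hypothetical stand-in: missing keys create further nested levels."""

    def __init__(self):
        super().__init__(NestedDict)


def nested_to_dict(nested):
    """Recursively copy a nested mapping into built-in dicts."""
    return {
        key: nested_to_dict(value) if isinstance(value, dict) else value
        for key, value in nested.items()
    }


results = NestedDict()
results["GCN"]["model_0"]["loss"] = 0.12  # intermediate levels are created on access
plain = nested_to_dict(results)
print(plain)  # prints {'GCN': {'model_0': {'loss': 0.12}}}
```

One reason to convert is that a plain dictionary raises `KeyError` on a missing key instead of silently creating an empty entry.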
The NN_ENSEMBLE
class¶
-
class
NN_ENSEMBLE
(NN_LIST)[source]¶ Class for generating an ensemble of neural networks.
-
predict
(X, create_predictions_list=False)[source]¶ Predict value(s) using an ensemble
- Parameters
X (numpy.array) – A numpy 2D array of input variables
- Variables
PREDICTIONS_LIST (list) – List of predictions if create_predictions_list=True
STD (numpy.array) – A numpy 2D array of standard deviation in the predictions.
- Returns
PREDICTIONS – A numpy 2D array of average predictions
- Return type
numpy.array
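The averaging behaviour described for predict can be sketched as follows. Plain nested lists stand in for the NumPy arrays, the population standard deviation is an assumption, and the classes `ShiftModel` and `EnsembleSketch` are hypothetical; this is not the library's NN_ENSEMBLE.

```python
from statistics import mean, pstdev


class ShiftModel:
    """Toy model with a scikit-learn style predict(); stands in for a trained NN."""

    def __init__(self, shift):
        self.shift = shift

    def predict(self, X):
        return [[value + self.shift for value in row] for row in X]


class EnsembleSketch:
    def __init__(self, nn_list):
        self.nn_list = nn_list
        self.predictions_list = None
        self.std = None

    def predict(self, X, create_predictions_list=False):
        predictions = [nn.predict(X) for nn in self.nn_list]
        if create_predictions_list:
            self.predictions_list = predictions

        def across_models(statistic):
            # Apply the statistic element-wise over the member predictions.
            return [
                [statistic([p[r][c] for p in predictions])
                 for c in range(len(predictions[0][r]))]
                for r in range(len(X))
            ]

        self.std = across_models(pstdev)  # spread between ensemble members
        return across_models(mean)


ensemble = EnsembleSketch([ShiftModel(0.0), ShiftModel(1.0), ShiftModel(2.0)])
print(ensemble.predict([[1.0, 2.0]]))  # prints [[2.0, 3.0]]
```

The standard deviation across members gives a rough indication of how much the individual networks disagree for a given input.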
-