jl_spectra_2_structure¶
The get_defaults
function¶
-
get_default_data_paths
(adsorbate)[source]¶ Get default paths to primary data of frequencies and intensities.
- Parameters
adsorbate (str) – Name of the adsorbate for which to get the default primary DFT data. Data is already provided for ‘CO’, ‘NO’, and ‘C2H4’.
- Returns
nanoparticle_path (str) – File path where nanoparticle or single adsorbate data will be saved.
isotope_path (str) – File path where isotope data for CO and NO will be saved.
high_coverage_path (str) – File path where high coverage data for CO will be saved.
coverage_scaling_path (str) – File path where coverage scaling coefficients will be saved.
Notes
Returns default frequencies to project intensities onto as well as default paths for locations of the pure and mixture spectroscopic data.
The get_exp_data_path
function¶
The IR_GEN
class¶
-
class
IR_GEN
(ADSORBATE='CO', INCLUDED_BINDING_TYPES=[1, 2, 3, 4], TARGET='binding_type', NUM_TARGETS=None, nanoparticle_path=None, high_coverage_path=None, coverage_scaling_path=None, VERBOSE=False)[source]¶ Class for generating complex synthetic IR spectra
- Parameters
ADSORBATE (str) – Adsorbate for which the spectra is to be generated.
INCLUDED_BINDING_TYPES (list) – Binding types whose frequencies/intensities from the primary data set will be included in generating the complex spectra.
TARGET (str) – Geometric descriptor for which the target histogram is to be generated. Can be binding_type, GCN, or combine_hollow_sites. If it is combine_hollow_sites then 3-fold and 4-fold sites are grouped together.
NUM_TARGETS (int) – Number of GCN groups to predict. Only used if TARGET=’GCN’ and GCN must be discretized.
nanoparticle_path (str) – File path where nanoparticle or single adsorbate json data is saved.
high_coverage_path (str) – File path where high coverage data for CO is saved.
coverage_scaling_path (str) – File path where coverage scaling coefficients are saved.
VERBOSE (bool) – Controls the printing of status statements.
- Variables
BINDING_TYPES_with_4fold (list) – List of all binding types.
TARGET (str) – Set during initialization.
NUM_TARGETS (int) – Set during initialization.
X0cov (numpy.ndarray) – Set of frequencies and intensities at low coverage.
BINDING_TYPES (list) – Binding types to be predicted, accounts for both filtering and merging of certain sites.
GCNList (list) – GCN values for the data points
NANO_PATH (str) – Set during initialization.
HIGH_COV_PATH (str) – Set during initialization.
COV_SCALE_PATH (str) – Set during initialization.
ADSORBATE (str) – Set during initialization.
INCLUDED_BINDING_TYPES (list) – Set during initialization.
VERBOSE (bool) – Set during initialization.
-
_add_high_coverage_data
(X, Y)[source]¶ Method for adding high coverage GCN data to the set of spectra.
- Parameters
X (numpy.ndarray) – Zero coverage frequencies and intensities
Y (numpy.ndarray) – GCN labels
- Returns
X_new (numpy.ndarray) – Frequencies and intensities with high coverage data
Y_new (numpy.ndarray) – New target vector with high coverage GCN data added
-
_coverage_shift
(X, BINDING_TYPES, SELF_COVERAGE, TOTAL_COVERAGE)[source]¶ Shift frequencies and intensities to account for coverage effects.
- Parameters
X (numpy.ndarray) – 3-D numpy array of frequencies and intensities. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.
BINDING_TYPES (numpy.ndarray) – Binding types of the primary datapoints.
SELF_COVERAGE (numpy.ndarray of floats) – Relative spatial coverage of each binding-type
TOTAL_COVERAGE (numpy.ndarray of floats) – Relative combined coverage at which the primary data point “sits”
- Returns
Xcov – 3-D numpy array of frequencies and intensities that have been shifted to account for coverage effects. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.
- Return type
-
_generate_spectra
(Xfrequencies, Xintensities, ENERGIES)[source]¶ Convert set of frequencies and intensities to spectra with minimal line width by Gaussian convolution.
- Parameters
Xfrequencies (numpy.ndarray) – 2-D numpy array of frequencies. The ndarray has dimensions \(m x n\) where \(m\) is the number of primary datapoints, \(n\) is the number of frequencies/intensities for each datapoint.
Xintensities (numpy.ndarray) – 2-D numpy array of intensities. The ndarray has dimensions \(m x n\) where \(m\) is the number of primary datapoints, \(n\) is the number of frequencies/intensities for each datapoint.
ENERGIES (numpy.ndarray) – Numpy array of energies onto which frequencies and intensities will be convolved.
- Returns
intmesh – numpy ndarray with dimensions \(m x p\) where \(m\) is the number of primary datapoints and \(p\) is the number of energy points onto which the frequencies and intensities are projected with Fourier convolution.
- Return type
Notes
This preconvolution is necessary for numerical reasons before data is convoluted to induce greater spectral broadening.
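The preconvolution described above can be sketched with plain NumPy. This is only an illustration, assuming that a narrow Gaussian of fixed width is placed at each frequency and summed onto the energy grid; the function name project_onto_grid and the width sigma are hypothetical and are not taken from the library.

```python
import numpy as np

def project_onto_grid(frequencies, intensities, energies, sigma=2.0):
    """Place a narrow Gaussian at each frequency and sum onto the grid.

    frequencies, intensities: arrays of shape (m, n)
    energies: array of shape (p,)
    sigma: assumed Gaussian standard deviation (hypothetical value)
    Returns an array of shape (m, p).
    """
    frequencies = np.asarray(frequencies, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    energies = np.asarray(energies, dtype=float)
    # Broadcast to (m, n, p): offset of every grid point from every peak.
    diff = energies[None, None, :] - frequencies[:, :, None]
    peaks = intensities[:, :, None] * np.exp(-0.5 * (diff / sigma) ** 2)
    # Sum the contributions of the n peaks belonging to each datapoint.
    return peaks.sum(axis=1)

energies = np.linspace(200, 2200, 501)
freqs = np.array([[450.0, 2050.0], [380.0, 1900.0]])
ints = np.array([[0.1, 1.0], [0.2, 0.8]])
intmesh = project_onto_grid(freqs, ints, energies)
print(intmesh.shape)  # (2, 501)
```

Each row of intmesh is a narrow-line spectrum that can later be broadened by a wider line shape.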
-
_get_balanced_data
(X_new, Y_new, indices)[source]¶ Correct for imbalances in the data for improved learning
- Parameters
X_new (numpy.ndarray) – Full set of frequencies and intensities
Y_new (numpy.ndarray) – class/groups that are imbalanced
indices (list) – Indices for which individual X_new and Y_new will be selected.
- Returns
X_balanced (numpy.ndarray) – Frequencies and intensities corresponding to balanced Y
Y_balanced (numpy.ndarray) – Balanced set of classes/groups.
BINDING_TYPES_balanced (numpy.ndarray) – Balanced set of binding-types that correspond to Y_balanced
-
_get_coverage_shifted_X
(X0cov, Y, COVERAGE)[source]¶ Get frequencies and intensities shifted with coverage.
- Parameters
X0cov (numpy.ndarray) – Numpy array of size (n,2) where \(n\) is the number of frequency and intensity pairs.
Y (numpy.ndarray) – Target classes/groups (binding-types or GCN labels)
COVERAGE (str or float) – The coverage at which the synthetic spectra is generated. If high, spectra at various coverages is generated.
- Returns
X (numpy.ndarray) – Frequencies and intensities shifted to account for coverage
NUM_TARGETS (int) – Number of target classes/groups. Updated if TARGET is GCN and coverage is ‘high’
-
_get_probabilities
(num_samples, NUM_TARGETS)[source]¶ Get probabilities for sampling data from different classes/groups.
- Parameters
- Returns
probabilities – The probabilities to select each binding-type or GCN group. Has dimensions (num_samples, NUM_TARGETS).
- Return type
Notes
Returns a 2-D numpy.ndarray of length num_samples along the first dimension and NUM_TARGETS along the second dimension. Elements correspond to the probability of selecting a primary data point from a specific class/group, such that the array sums to 1 along the second dimension. The probability assigned to the first index along the second dimension comes from a uniform distribution between 0 and 1, while the probability assigned to each following index \(i = n\) comes from the uniform distribution \(1/\sum{p_{i}}\) where \(i\) in \(p_{i}\) comes from all previous index values \(i < n\). The probabilities are then shuffled along the second dimension. With this distribution, the most likely contribution to a spectrum from any given class/label is zero, and the likelihood of a contribution decreases monotonically as the contribution goes to 1. This distribution ensures that all fractional contributions are sampled as equally as possible.
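One possible reading of the sampling scheme in these Notes is sketched below. It assumes that each successive draw is uniform between 0 and the fraction not yet assigned, and that the last column takes the remainder so each row sums to 1. The function name sample_fractions is hypothetical and the library's actual implementation may differ.

```python
import numpy as np

def sample_fractions(num_samples, num_targets, rng=None):
    """Draw rows of non-negative fractions that each sum to 1.

    Assumption: every draw is uniform on [0, remaining fraction], and the
    final column receives whatever is left over.
    """
    rng = np.random.default_rng() if rng is None else rng
    probabilities = np.zeros((num_samples, num_targets))
    used = np.zeros(num_samples)
    for i in range(num_targets - 1):
        # Guard against tiny negative remainders from floating-point rounding.
        remaining = np.clip(1.0 - used, 0.0, None)
        probabilities[:, i] = rng.random(num_samples) * remaining
        used += probabilities[:, i]
    probabilities[:, -1] = np.clip(1.0 - used, 0.0, None)
    # Shuffle each row independently so that no column is favoured.
    return rng.permuted(probabilities, axis=1)

p = sample_fractions(5, 4, rng=np.random.default_rng(0))
print(p.sum(axis=1))
```

Without the final shuffle the first column would systematically receive the largest fractions, which is why the rows are permuted before use.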
-
_get_sample_data
(X, Y, indices, IS_TRAINING_SET, TRAINING_ERROR)[source]¶ Apply scaling factor and perturb by error associated with scaling factor.
- Parameters
X (numpy.ndarray) – Zero coverage frequencies and intensities
Y (numpy.ndarray) – GCN labels
indices (list) – Indices for which individual X_new and Y_new will be selected.
IS_TRAINING_SET (bool) – Indicates whether synthetic spectra to be generated is for training or validating
TRAINING_ERROR (float or str) – Tag that indicates treatment of error in scaling factor for training data.
- Returns
X_sample (numpy.ndarray) – Sample of frequencies and intensities that will be mixed and then convoluted.
Y_sample (numpy.ndarray) – Target classes that will be tabulated to generate fractional contributions to spectra
BINDING_TYPES_sample (numpy.ndarray) – Binding-types that will be used in coverage scaling if scaling is set to ‘high’
-
_mixed_lineshape
(FWHM, fL, ENERGY_POINTS, energy_spacing)[source]¶ Convolute spectra with a Lorentzian or Gaussian to induce broadening.
- Parameters
FWHM (float) – Full-width-half-maximum of the desired spectra
fL (float) – Fraction of spectra to be convoluted by a Lorentzian transform. The remaining portion of the transform needed to make up the FWHM comes from a Gaussian convoluting function.
ENERGY_POINTS (int) – Number of energy points in the desired spectra
energy_spacing (float) – spacing of the energy points
- Returns
transform – transform that is convolved with a spectra that has a narrow line width in order to produce spectra with greater line widths.
- Return type
Notes
Accepts a full-width-half-maximum, a fraction of Lorentzian, energy spacing and number of points and produces the transform with which the spectra is convolved through Fourier convolution in order to produce realistic experimental spectra.
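As an illustration of a mixed line shape, the sketch below builds a pseudo-Voigt kernel, a weighted sum of a Lorentzian and a Gaussian that share the same FWHM. This is a common simplification and is named here as a stand-in. It is not claimed to be how the library combines the two components, and it uses direct convolution with np.convolve instead of Fourier convolution. The function name pseudo_voigt_kernel is hypothetical.

```python
import numpy as np

def pseudo_voigt_kernel(fwhm, fL, energy_points, energy_spacing):
    """Unit-sum kernel: fL * Lorentzian + (1 - fL) * Gaussian, same FWHM."""
    # Symmetric energy axis centred on zero.
    x = (np.arange(energy_points) - (energy_points - 1) / 2) * energy_spacing
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # Gaussian std from FWHM
    gamma = fwhm / 2.0                                  # Lorentzian half width
    gaussian = np.exp(-0.5 * (x / sigma) ** 2)
    lorentzian = gamma ** 2 / (x ** 2 + gamma ** 2)
    # Normalise each component on the discrete grid before mixing.
    return fL * lorentzian / lorentzian.sum() + (1 - fL) * gaussian / gaussian.sum()

kernel = pseudo_voigt_kernel(fwhm=50.0, fL=0.5, energy_points=501, energy_spacing=4.0)

# Broaden a single sharp line located at the centre of the grid.
narrow = np.zeros(501)
narrow[250] = 1.0
broad = np.convolve(narrow, kernel, mode="same")
print(np.argmax(broad))  # 250
```

Because the kernel sums to 1, the broadened spectrum keeps the integrated intensity of the narrow one (up to edge effects).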
-
_perturb_and_shift
(perturbations, X, y, BINDING_TYPES=None)[source]¶ Shift spectra with a scaling factor that has Gaussian error.
- Parameters
perturbations (int) – The number of perturbed primary datapoints per original datapoint.
X (numpy.ndarray) – 3-D numpy array of frequencies and intensities. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities.
y (numpy.ndarray) – The array of target variables to be regressed (binding types or GCN-labels)
BINDING_TYPES (numpy.ndarray) – Binding types of the primary datapoints.
- Returns
Xperturbed (numpy.ndarray) – Perturbed intensities and frequencies. Has dimensions \(m_{new} x n x p\) where \(m_{new} = m x perturbations\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.
yperturbed (numpy.ndarray) – Some y-variable to be regressed such as binding-type or GCN group.
BINDING_TYPES_perturbed (numpy.ndarray) – Binding-types of the perturbed primary datapoints.
Notes
Returns a tuple of perturbed primary datapoints. The goal is to incorporate error in the scaling factor into the validation and test sets. The scaling factor is perturbed according to its calculated standard error.
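A minimal sketch of this idea follows, assuming the frequencies sit at index 0 of the second axis of X and that one noisy scaling factor is drawn per replicated datapoint. The function name and the values of scale and scale_std are hypothetical placeholders, not the library's fitted scaling factor or its standard error.

```python
import numpy as np

def perturb_with_scaling_error(perturbations, X, y, scale=1.0, scale_std=0.01, rng=None):
    """Replicate each datapoint and rescale its frequencies by a noisy factor.

    X has shape (m, 2, p); row 0 of the second axis is assumed to hold the
    frequencies and row 1 the intensities.
    """
    rng = np.random.default_rng() if rng is None else rng
    # np.repeat returns copies, so the inputs are left untouched.
    X_rep = np.repeat(np.asarray(X, dtype=float), perturbations, axis=0)
    y_rep = np.repeat(np.asarray(y), perturbations, axis=0)
    # One Gaussian-distributed scaling factor per replicated datapoint.
    factors = rng.normal(scale, scale_std, size=X_rep.shape[0])
    X_rep[:, 0, :] *= factors[:, None]
    return X_rep, y_rep

X = np.array([[[2000.0, 400.0], [1.0, 0.1]]])  # one datapoint, two modes
y = np.array([1])
X_pert, y_pert = perturb_with_scaling_error(3, X, y, rng=np.random.default_rng(0))
print(X_pert.shape)  # (3, 2, 2)
```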
-
_perturb_spectra
(perturbations, X, y, a=0.999, b=1.001, BINDING_TYPES=None)[source]¶ Generate perturbations of the frequencies and intensities uniformly.
- Parameters
perturbations (int) – The number of perturbed primary datapoints per original datapoint.
X (numpy.ndarray) – 3-D numpy array of frequencies and intensities. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.
y (numpy.ndarray) – The array of target variables to be regressed (binding types or GCN-labels)
a (float) – The lower bound of the uniform distribution.
b (float) – The upper bound of the uniform distribution.
BINDING_TYPES (numpy.ndarray) – Binding types of the primary datapoints.
- Returns
Xperturbed (numpy.ndarray) – Perturbed intensities and frequencies. Has dimensions \(m_{new} x n x p\) where \(m_{new} = m x perturbations\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.
yperturbed (numpy.ndarray) – Some y-variable to be regressed such as binding-type or GCN group.
BINDING_TYPES_perturbed (numpy.ndarray) – Binding-types of the perturbed primary datapoints.
Notes
Returns a tuple of perturbed primary datapoints. The goal is to improve the predictive range of the regressed model by expanding the datapoints and accounting for error in the primary data (DFT). Spectra are perturbed according to a uniform distribution between \(a\) and \(b\).
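The uniform perturbation can be sketched as an element-wise multiplication by factors drawn between a and b. Whether the library perturbs every element independently, as done here, is an assumption; the function name perturb_uniform is hypothetical.

```python
import numpy as np

def perturb_uniform(perturbations, X, y, a=0.999, b=1.001, rng=None):
    """Replicate each datapoint and multiply every entry by U(a, b) noise."""
    rng = np.random.default_rng() if rng is None else rng
    X_rep = np.repeat(np.asarray(X, dtype=float), perturbations, axis=0)
    y_rep = np.repeat(np.asarray(y), perturbations, axis=0)
    # Independent multiplicative factor for every frequency and intensity.
    X_rep *= rng.uniform(a, b, size=X_rep.shape)
    return X_rep, y_rep

X = np.array([[[2000.0, 400.0], [1.0, 0.1]]])  # shape (1, 2, 2)
y = np.array([1])
X_pert, y_pert = perturb_uniform(2, X, y, rng=np.random.default_rng(0))
ratio = X_pert / np.repeat(X, 2, axis=0)
print(ratio.min(), ratio.max())
```

With the default bounds each value moves by at most 0.1 percent, so the perturbed set stays close to the primary data while still enlarging it.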
-
_scale_and_perturb
(X_balanced, Y_balanced, BINDING_TYPES_balanced, IS_TRAINING_SET, TRAINING_ERROR)[source]¶ Apply scaling factor and perturb by error associated with scaling factor.
- Parameters
X_balanced (numpy.ndarray) – Full set of frequencies and intensities corresponding to balanced data
Y_balanced (numpy.ndarray) – Balanced class/groups
BINDING_TYPES_balanced (list) – Balanced binding types that correspond to spectra and target data
IS_TRAINING_SET (bool) – Indicates whether synthetic spectra to be generated is for training or validating
TRAINING_ERROR (float or str) – Tag that indicates treatment of error in scaling factor for training data.
- Returns
X_sample (numpy.ndarray) – Sample of frequencies and intensities that will be mixed and then convoluted.
Y_sample (numpy.ndarray) – Target classes that will be tabulated to generate fractional contributions to spectra
BINDING_TYPES_sample (numpy.ndarray) – Binding-types that will be used in coverage scaling if scaling is set to ‘high’
-
_xyconv
(X_sample, Y_sample, probabilities, BINDING_TYPES_sample)[source]¶ Convolutes a balanced sample of primary data and generates complex mixed spectra.
- Parameters
perturbations (int) – The number of perturbed primary datapoints per original datapoint.
X_sample (numpy.ndarray) – 3-D numpy array of frequencies and intensities. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints after data class balancing, \(n = 2\) and \(p\) is the number of frequencies/intensities.
Y_sample (numpy.ndarray) – The array of target variables to be regressed (binding types or GCN-labels)
probabilities (numpy.ndarray) – The probabilities to select each binding-type or GCN group. Has dimensions (num_samples, NUM_TARGETS).
BINDING_TYPES_sample (numpy.ndarray) – Binding types of the primary datapoints after data class balancing.
- Returns
Xconv (numpy.ndarray) – Convoluted complex spectra of dimensions \(m x n\) where \(m\) is the desired number of samples and \(n\) is the number of energy points.
yconv (numpy.ndarray) – Fraction of occupied binding-types or GCN groups that contribute to the total spectra. yconv has dimensions \(m x p\) where \(m\) is the desired number of samples and \(p\) is the number of targets.
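The relationship between the fractional targets and the mixed spectra can be illustrated with a deliberately simplified example. It assumes a single representative spectrum per class, whereas the method samples many primary datapoints per class. All array names and values below are made up for illustration.

```python
import numpy as np

# One representative spectrum per class: 2 classes x 3 energy points.
class_spectra = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]])

# Fractional contribution of each class to each mixed spectrum (rows sum to 1).
fractions = np.array([[0.25, 0.75],
                      [1.00, 0.00]])

# Each mixed spectrum is the fraction-weighted sum of the class spectra.
Xconv = fractions @ class_spectra
yconv = fractions
print(Xconv)
```

Here yconv plays the role of the target histogram that a model would learn to recover from Xconv.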
-
add_noise
(Xconv_main, Xconv_noise, noise2signalmax=0.67)[source]¶ Adds two sets of complex spectra, multiplying one by a uniform random variable between 0 and 0.67 which is treated as noise.
- Parameters
Xconv_main (numpy.ndarray) – Convoluted complex spectra of dimensions \(m x n\) where \(m\) is the desired number of samples and \(n\) is the number of energy points.
Xconv_noise (numpy.ndarray) – Convoluted complex spectra of dimensions \(m x n\) where \(m\) is the desired number of samples and \(n\) is the number of energy points. This spectra is treated as noise.
- Returns
X_noisey – Convoluted complex spectra of dimensions \(m x n\) where \(m\) is the desired number of samples and \(n\) is the number of energy points. Data has noise from Xconv_noise.
- Return type
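A minimal sketch of this kind of noise mixing, assuming one uniform weight is drawn per spectrum and that no renormalisation is applied afterwards. Both points are assumptions, and the function name add_noise_sketch is hypothetical.

```python
import numpy as np

def add_noise_sketch(main, noise, noise2signalmax=0.67, rng=None):
    """Add a randomly weighted noise spectrum to each main spectrum."""
    rng = np.random.default_rng() if rng is None else rng
    main = np.asarray(main, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # One weight in [0, noise2signalmax) per sample, broadcast over energies.
    weights = rng.uniform(0.0, noise2signalmax, size=(main.shape[0], 1))
    return main + weights * noise

main = np.ones((4, 6))
noise = np.ones((4, 6))
noisy = add_noise_sketch(main, noise, rng=np.random.default_rng(0))
print(noisy.shape)  # (4, 6)
```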
-
get_synthetic_spectra
(NUM_SAMPLES, indices, IS_TRAINING_SET, TRAINING_ERROR=None)[source]¶ Obtain convoluted complex synthetic spectra
- Parameters
NUM_SAMPLES (int) – Number of spectra to generate.
indices (list) – List of indices from primary dataset from which to generate spectra
IS_TRAINING_SET (bool) – Indicates whether primary data should be used to compute training or validation set.
TRAINING_ERROR (int, str, or None) – Indicates the kind of perturbations to induce in the primary training data. If an integer, the perturbations are uniform; if ‘gaussian’, the perturbations are Gaussian with the same variance as that used to perturb the validation data.
- Returns
X (numpy.ndarray) – Coverage shifted frequencies and intensities
Y (numpy.ndarray) – The target variable histograms. Either binding-type or GCN label
-
scaling_factor_shift
(X)[source]¶ Shift frequencies by a scaling factor to match experiment.
- Parameters
X (numpy.ndarray) – 3-D numpy array of frequencies and intensities. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.
- Returns
X – 3-D numpy array of frequencies and intensities. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.
- Return type
Notes
A scaling factor shift accounts for systematic errors in the functional, which may over- or under-bind, as well as systematic errors in the harmonic approximation. Computed according to https://cccbdb.nist.gov/vibnotes.asp
-
set_GCNlabels
(Minimum=0, BINDING_TYPE_FOR_GCN=[1], showfigures=False, figure_directory='show')[source]¶ Cluster GCN values into groups/classes using k-means clustering.
- Parameters
Minimum (int) – Minimum number of datapoints in each cluster. If a generated cluster has fewer than this number of datapoints it is merged with the next cluster. If the last cluster has fewer than the minimum number of datapoints it is merged with the previous cluster.
showfigures (bool) – Whether or not to generate figures visualizing the clusters and their location in GCN-space.
figure_directory (str) – Either a directory where figures are to be saved or the string ‘show’ which indicates that the figure is supposed to be sent to gui output.
BINDING_TYPE_FOR_GCN (list) – List of binding types whose GCN values are to be included in clustering. Binding-types included will have a GCN label of 1 through the number of clusters. Binding-types not included will be assigned a GCN label of zero.
- Variables
GCNlabels (numpy.ndarray) – GCN label assigned to each primary datapoint.
NUM_TARGETS (int) – Updated number of targets. If n clusters generated by the k-means algorithm had fewer than the minimum number of datapoints and were merged, the NUM_TARGETS originally instantiated by the class is reduced by n.
Notes
Assigns each primary datapoint a GCN label based on the GCN value using k-means clustering where the number of target clusters is equal to the number of targets instantiated with the class. This is required to be run if one wishes to learn a distribution of GCN sites as GCN is continuous. K-means clustering is an unsupervised learning technique that generates clusters/groups that are relatively evenly spaced with roughly the same number of datapoints in each cluster.
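To illustrate how continuous GCN values can be discretized into labels, the sketch below runs a small one-dimensional k-means (Lloyd's algorithm) in plain NumPy. It is an assumption-laden stand-in: the library may rely on a different k-means implementation, and the merging of undersized clusters described above is omitted. The function name kmeans_1d and the sample values are hypothetical.

```python
import numpy as np

def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm in one dimension; returns labels and centres."""
    values = np.asarray(values, dtype=float)
    # Deterministic start: centres at evenly spaced quantiles of the data.
    centres = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(iterations):
        # Assign every value to its nearest centre.
        labels = np.argmin(np.abs(values[:, None] - centres[None, :]), axis=1)
        # Move each centre to the mean of its members (keep it if empty).
        new_centres = np.array([
            values[labels == j].mean() if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

gcn_values = np.array([1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.0, 9.1])
labels, centres = kmeans_1d(gcn_values, k=3)
# Labels start at 1 so that 0 stays free for binding types excluded from clustering.
gcn_labels = labels + 1
print(gcn_labels)
```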
-
set_spectra_properties
(COVERAGE=None, MAX_COVERAGES=[1, 1, 1, 1], LOW_FREQUENCY=200, HIGH_FREQUENCY=2200, ENERGY_POINTS=501)[source]¶ Set spectra specific properties
- Parameters
COVERAGE (str or float) – The coverage at which the synthetic spectra is generated. If high, spectra at various coverages is generated.
MAX_COVERAGES (list) – Maximum coverages allowed for each binding-type if COVERAGE is set to ‘high’
LOW_FREQUENCY (float) – The lowest frequency for which synthetic spectra is generated
HIGH_FREQUENCY (float) – The high frequency for which synthetic spectra is generated
ENERGY_POINTS (int) – The number of points the synthetic spectra is discretized into
- Variables
X (numpy.ndarray) – Coverage shifted frequencies and intensities
Y (numpy.ndarray) – The target variables. Either binding-type or GCN label.
NUM_TARGETS (int) – The number of binding-types or GCN labels. This is set by the user and can be altered by set_GCNlabels() and _get_coverage_shifted_X.
X0cov (numpy.ndarray) – The zero coverage frequency and intensity pairs
LOW_FREQUENCY (float) – The low frequency set by the user.
HIGH_FREQUENCY (float) – The high frequency set by the user
ENERGY_POINTS (int) – The number of energy points set by the user.
MAX_COVERAGES (list) – List of maximum coverages set by the user.
Notes
This function calls _get_coverage_shifted_X in order to shift X frequencies according to the specified coverage.
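For orientation, the default values in the signature correspond to the following energy grid, assuming the grid is evenly spaced and includes both endpoints (an assumption about the implementation):

```python
import numpy as np

LOW_FREQUENCY, HIGH_FREQUENCY, ENERGY_POINTS = 200, 2200, 501

# Evenly spaced grid that includes both end points.
energies = np.linspace(LOW_FREQUENCY, HIGH_FREQUENCY, ENERGY_POINTS)
energy_spacing = energies[1] - energies[0]
print(energy_spacing)  # 4.0
```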
The fold
function¶
-
fold
(frequencies, intensities, LOW_FREQUENCY, HIGH_FREQUENCY, ENERGY_POINTS, FWHM, fL)[source]¶ Generate spectra from set of frequencies and intensities
- Parameters
frequencies (list or numpy.ndarray) – Set of molecular frequencies
intensities (list or numpy.ndarray) – Intensities that correspond to frequencies
LOW_FREQUENCY (float) – The lowest frequency for which synthetic spectra is generated
HIGH_FREQUENCY (float) – The high frequency for which synthetic spectra is generated
ENERGY_POINTS (int) – The number of points the synthetic spectra is discretized into
FWHM (float) – Full-width-half-maximum of the desired spectra
fL (float) – Fraction of spectra to be convoluted by a Lorentzian transform. The remaining portion of the transform needed to make up the FWHM comes from a Gaussian convoluting function.
- Returns
spectrum – The spectrum that is generated from the set of frequencies and intensities.
- Return type
The HREEL_2_scaledIR
function¶
-
HREEL_2_scaledIR
(HREEL, frequency_range=None, PEAK_CONV=2.7)[source]¶ Convert an HREEL spectrum to a scaled IR spectrum.
- Parameters
HREEL (numpy.ndarray) – HREEL spectra of size (2,n) where n is the number of points into which the frequency and intensity are discretized.
frequency_range (numpy.ndarray) – Frequency range onto which HREEL will be interpolated after conversion to IR.
PEAK_CONV (float) – The intensity is scaled by the wavenumber raised to PEAK_CONV.
- Returns
IR_scaled (numpy.ndarray) – The IR spectrum that the HREEL spectrum is converted to.
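The conversion described by the parameters can be sketched as follows. The sketch assumes row 0 of HREEL holds the frequencies (in increasing order) and row 1 the intensities, and that linear interpolation is used. These are assumptions and the function name hreel_to_scaled_ir is hypothetical. Any normalisation the library may apply is left out.

```python
import numpy as np

def hreel_to_scaled_ir(hreel, frequency_range=None, peak_conv=2.7):
    """Scale HREEL intensities by wavenumber**peak_conv, then interpolate."""
    hreel = np.asarray(hreel, dtype=float)
    frequencies, intensities = hreel[0], hreel[1]
    # Intensity is scaled by the wavenumber raised to peak_conv.
    scaled = intensities * frequencies ** peak_conv
    if frequency_range is not None:
        # np.interp expects the sample frequencies to be increasing.
        scaled = np.interp(frequency_range, frequencies, scaled)
    return scaled

hreel = np.array([[100.0, 200.0, 400.0],
                  [1.0, 1.0, 1.0]])
ir = hreel_to_scaled_ir(hreel, peak_conv=2.0)
ir_interp = hreel_to_scaled_ir(hreel, frequency_range=np.array([300.0]), peak_conv=2.0)
print(ir)
```

A flat HREEL spectrum therefore maps to an IR spectrum that grows with wavenumber, which is the effect of the PEAK_CONV exponent.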