jl_spectra_2_structure¶

The `get_defaults` function¶

get_default_data_paths(adsorbate)[source]¶

Get default paths to primary data of frequencies and intensities.

Parameters

adsorbate (str) – Name of the adsorbate for which to get the default primary DFT data. Data is already provided for ‘CO’, ‘NO’, and ‘C2H4’.

Returns

nanoparticle_path (str) – File path where nanoparticle or single adsorbate data will be saved.
isotope_path (str) – File path where isotope data for CO and NO will be saved.
high_coverage_path (str) – File path where high coverage data for CO will be saved.
coverage_scaling_path (str) – File path where coverage scaling coefficients will be saved.

Notes

Returns default frequencies to project intensities onto as well as default paths for locations of the pure and mixture spectroscopic data.

The `get_exp_data_path` function¶

get_exp_data_path()[source]¶

Get default paths to experimental data.

Returns: experimental_data – list of paths to experimental data
Return type: list

The `IR_GEN` class¶

class IR_GEN(ADSORBATE='CO', INCLUDED_BINDING_TYPES=[1, 2, 3, 4], TARGET='binding_type', NUM_TARGETS=None, nanoparticle_path=None, high_coverage_path=None, coverage_scaling_path=None, VERBOSE=False)[source]¶

Class for generating complex synthetic IR spectra

Parameters

ADSORBATE (str) – Adsorbate for which the spectra is to be generated.
INCLUDED_BINDING_TYPES (list) – Binding types whose frequencies/intensities from the primary data set will be included in generating the complex spectra.
TARGET (str) – Geometric descriptor for which the target histogram is to be generated. Can be binding_type, GCN, or combine_hollow_sites. If it is combine_hollow_sites, then 3-fold and 4-fold sites are grouped together.
NUM_TARGETS (int) – Number of GCN groups to predict. Only used if TARGET=’GCN’ and GCN must be discretized.
nanoparticle_path (str) – File path where nanoparticle or single adsorbate json data is saved.
high_coverage_path (str) – File path where high coverage data for CO is saved.
coverage_scaling_path (str) – File path where coverage scaling coefficients are saved.
VERBOSE (bool) – Controls the printing of status statements.

Variables

BINDING_TYPES_with_4fold (list) – List of all binding types.
TARGET (str) – Set during initialization.
NUM_TARGETS (int) – Set during initialization.
X0cov (numpy.ndarray) – Set of frequencies and intensities at low coverage.
BINDING_TYPES (list) – Binding types to be predicted, accounts for both filtering and merging of certain sites.
GCNList (list) – GCN values for the data points
NANO_PATH (str) – Set during initialization.
HIGH_COV_PATH (str) – Set during initialization.
COV_SCALE_PATH (str) – Set during initialization.
ADSORBATE (str) – Set during initialization.
INCLUDED_BINDING_TYPES (list) – Set during initialization.
VERBOSE (bool) – Set during initialization.

_add_high_coverage_data(X, Y)[source]¶

Add high coverage GCN data to the set of spectra.

Parameters

X (numpy.ndarray) – Zero coverage frequencies and intensities
Y (numpy.ndarray) – GCN labels

Returns

X_new (numpy.ndarray) – Frequencies and intensities with high coverage data
Y_new (numpy.ndarray) – New target vector with high coverage GCN data added

_coverage_shift(X, BINDING_TYPES, SELF_COVERAGE, TOTAL_COVERAGE)[source]¶

Shift frequencies and intensities to account for coverage effects.

Parameters

X (numpy.ndarray) – 3-D numpy array of frequencies and intensities. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.
BINDING_TYPES (numpy.ndarray) – Binding types of the primary datapoints.
SELF_COVERAGE (numpy.ndarray of floats) – Relative spatial coverage of each binding-type
TOTAL_COVERAGE (numpy.ndarray of floats) – Relative combined coverage at which the primary data point “sits”

Returns

Xcov – 3-D numpy array of frequencies and intensities that have been shifted to account for coverage effects. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.

Return type

numpy.ndarray

_generate_spectra(Xfrequencies, Xintensities, ENERGIES)[source]¶

Convert set of frequencies and intensities to spectra with minimal line width by Gaussian convolution.

Parameters

Xfrequencies (numpy.ndarray) – 2-D numpy array of frequencies. The ndarray has dimensions \(m x n\) where \(m\) is the number of primary datapoints, \(n\) is the number of frequencies/intensities for each datapoint.
Xintensities (numpy.ndarray) – 2-D numpy array of intensities. The ndarray has dimensions \(m x n\) where \(m\) is the number of primary datapoints, \(n\) is the number of frequencies/intensities for each datapoint.
ENERGIES (numpy.ndarray) – Numpy array of energies onto which frequencies and intensities will be convolved.

Returns

intmesh – numpy ndarray with dimensions \(m x p\) where \(m\) is the number of primary datapoints and \(p\) is the number of energy points onto which the frequencies and intensities are projected with Fourier convolution.

Return type

numpy.ndarray

Notes

This preconvolution is necessary for numerical reasons before data is convoluted to induce greater spectral broadening.
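The preconvolution step can be pictured with a minimal sketch that places a narrow Gaussian on each frequency and sums the result on an energy grid. The function name `gaussian_broaden`, the `sigma` width parameter, and the example frequencies are assumptions made for illustration; the library's own routine may normalise or discretise differently.

```python
import math

def gaussian_broaden(frequencies, intensities, energies, sigma):
    """Project stick frequencies/intensities onto an energy grid with narrow Gaussians."""
    spectrum = [0.0] * len(energies)
    for freq, inten in zip(frequencies, intensities):
        for k, energy in enumerate(energies):
            # Each stick contributes a Gaussian centred at its frequency
            spectrum[k] += inten * math.exp(-0.5 * ((energy - freq) / sigma) ** 2)
    return spectrum

# Hypothetical grid: 501 points from 200 to 2200 (spacing of 4)
energies = [200.0 + 4.0 * k for k in range(501)]
spectrum = gaussian_broaden([2000.0, 1800.0], [1.0, 0.5], energies, sigma=4.0)

# Index of the tallest point on the grid
peak_index = max(range(len(spectrum)), key=spectrum.__getitem__)
```

Keeping `sigma` close to the grid spacing gives every stick a finite width on the grid, so a later, broader convolution has well-resolved peaks to act on.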

_get_balanced_data(X_new, Y_new, indices)[source]¶

Correct for imbalances in the data for improved learning

Parameters

X_new (numpy.ndarray) – Full set of frequencies and intensities
Y_new (numpy.ndarray) – class/groups that are imbalanced
indices (list) – Indices for which individual X_new and Y_new will be selected.

Returns

X_balanced (numpy.ndarray) – Frequencies and intensities corresponding to balanced Y
Y_balanced (numpy.ndarray) – Balanced set of classes/groups.
BINDING_TYPES_balanced (numpy.ndarray) – Balanced set of binding-types that correspond to Y_balanced

_get_coverage_shifted_X(X0cov, Y, COVERAGE)[source]¶

Get frequencies and intensities shifted with coverage.

Parameters

X0cov (numpy.ndarray) – Numpy array of size (n,2) where \(n\) is the number of frequency and intensity pairs.
Y (numpy.ndarray) – Target classes/groups (binding-types or GCN labels)
COVERAGE (str or float) – The coverage at which the synthetic spectra is generated. If high, spectra at various coverages is generated.

Returns

X (numpy.ndarray) – Frequencies and intensities shifted to account for coverage
NUM_TARGETS (int) – Number of target classes/groups. Updated if TARGET is GCN and coverage is ‘high’

_get_probabilities(num_samples, NUM_TARGETS)[source]¶

Get probabilities for sampling data from different classes/groups.

Parameters

num_samples (int) – number of complex spectra to generate
NUM_TARGETS (int) – number of classes/labels that are desired. Usually the number of binding-types or the number of GCN labels > 0

Returns

probabilities – The probabilities to select each binding-type or GCN group. Has dimensions (num_samples, NUM_TARGETS).

Return type

numpy.ndarray

Notes

Returns a 2-D numpy.ndarray that is of length num_samples along the first dimension and NUM_TARGETS along the second dimension. Elements correspond to the probability of selecting a primary data point from a specific class/group such that the array sums to 1 along the 2nd dimension. The probability assigned to the first index along the 2nd dimension comes from a uniform distribution between 0 and 1, while the probability assigned to each following index \(n\) comes from a uniform distribution between 0 and \(1 - \sum_{i<n} p_{i}\), the probability not yet assigned to the previous indices \(i < n\). The probabilities are then shuffled along the second dimension. With this probability distribution, the contribution to spectra from any given class/label is most likely zero, and the likelihood of the contribution monotonically decreases as the contribution goes to 1. This distribution ensures that all fractional contributions are as equally sampled as possible.
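A minimal sketch of this sampling scheme follows. It assumes that each draw after the first is uniform over the probability still unassigned, which is what the sum-to-1 constraint implies, and that the last class takes whatever remains. The function name `sample_probabilities` and the `seed` argument are hypothetical.

```python
import random

def sample_probabilities(num_samples, num_targets, seed=0):
    """Draw rows of class probabilities that each sum to 1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num_samples):
        row = []
        remaining = 1.0
        for _ in range(num_targets - 1):
            # Uniform over the probability not yet assigned (assumption)
            p = rng.uniform(0.0, remaining)
            row.append(p)
            remaining -= p
        # The last class takes what is left so the row sums to 1
        row.append(remaining)
        # Shuffle so no class is favoured by the draw order
        rng.shuffle(row)
        rows.append(row)
    return rows

probabilities = sample_probabilities(num_samples=5, num_targets=4)
```

Without the shuffle, the first class would systematically receive the largest share, since later draws are confined to an ever smaller interval.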

_get_sample_data(X, Y, indices, IS_TRAINING_SET, TRAINING_ERROR)[source]¶

Apply scaling factor and perturb by error associated with scaling factor.

Parameters

X (numpy.ndarray) – Zero coverage frequencies and intensities
Y (numpy.ndarray) – GCN labels
indices (list) – Indices for which individual X_new and Y_new will be selected.
IS_TRAINING_SET (bool) – Indicates whether synthetic spectra to be generated is for training or validating
TRAINING_ERROR (float or str) – Tag that indicates treatment of error in scaling factor for training data.

Returns

X_sample (numpy.ndarray) – Sample of frequencies and intensities that will be mixed and then convoluted.
Y_sample (numpy.ndarray) – Target classes that will be tabulated to generate fractional contributions to spectra
BINDING_TYPES_sample (numpy.ndarray) – Binding-types that will be used in coverage scaling if scaling is set to ‘high’

_mixed_lineshape(FWHM, fL, ENERGY_POINTS, energy_spacing)[source]¶

Convolute spectra with a Lorentzian or Gaussian to induce broadening.

Parameters

FWHM (float) – Full-width-half-maximum of the desired spectra
fL (float) – Fraction of spectra to be convoluted by a Lorentzian transform. The remaining portion of the transform that makes up the FWHM comes from a Gaussian convoluting function.
ENERGY_POINTS (int) – Number of energy points in the desired spectra
energy_spacing (float) – spacing of the energy points

Returns

transform – transform that is convolved with a spectra that has a narrow line width in order to produce spectra with greater line widths.

Return type

numpy.ndarray

Notes

Accepts a full-width-half-maximum, a fraction of Lorentzian, energy spacing and number of points and produces the transform with which the spectra is convolved through Fourier convolution in order to produce realistic experimental spectra.
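One common way to mix the two lineshapes is a pseudo-Voigt profile: a weighted sum of a Lorentzian and a Gaussian that share the same FWHM. The sketch below builds such a kernel from the four inputs described above. It is an illustration under that assumption only; the transform produced by `_mixed_lineshape` may combine the two contributions differently, and the name `pseudo_voigt_kernel` is hypothetical.

```python
import math

def pseudo_voigt_kernel(fwhm, fL, energy_points, energy_spacing):
    """Weighted sum of a Lorentzian and a Gaussian with a common FWHM, normalised to unit sum."""
    half = energy_points // 2
    kernel = []
    for k in range(energy_points):
        x = (k - half) * energy_spacing          # offset from the kernel centre
        lorentz = 1.0 / (1.0 + (2.0 * x / fwhm) ** 2)
        gauss = math.exp(-4.0 * math.log(2.0) * (x / fwhm) ** 2)
        kernel.append(fL * lorentz + (1.0 - fL) * gauss)
    total = sum(kernel)
    return [value / total for value in kernel]

kernel = pseudo_voigt_kernel(fwhm=40.0, fL=0.5, energy_points=501, energy_spacing=4.0)
```

Both component shapes equal one half of their peak value at an offset of FWHM/2, so the mixture keeps the requested width for any `fL` between 0 and 1.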

_perturb_and_shift(perturbations, X, y, BINDING_TYPES=None)[source]¶

Shift spectra with scaling factor that has Gaussian error.

Parameters

perturbations (int) – The number of perturbed primary datapoints per original datapoint.
X (numpy.ndarray) – 3-D numpy array of frequencies and intensities. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities.
y (numpy.ndarray) – The array of target variables to be regressed (binding types or GCN-labels)
BINDING_TYPES (numpy.ndarray) – Binding types of the primary datapoints.

Returns

Xperturbed (numpy.ndarray) – Perturbed intensities and frequencies. Has dimensions \(m_{new} x n x p\) where \(m_{new} = m x perturbations\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.
yperturbed (numpy.ndarray) – Some y-variable to be regressed such as binding-type or GCN group.
BINDING_TYPES_perturbed (numpy.ndarray) – Binding-types of the perturbed primary datapoints.

Notes

Returns a tuple of perturbed primary datapoints. The goal is to incorporate error in the scaling factor into the validation and test set. The scaling factor is perturbed according to its calculated standard error.

_perturb_spectra(perturbations, X, y, a=0.999, b=1.001, BINDING_TYPES=None)[source]¶

Generate perturbations of the frequencies and intensities uniformly.

Parameters

perturbations (int) – The number of perturbed primary datapoints per original datapoint.
X (numpy.ndarray) – 3-D numpy array of frequencies and intensities. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.
y (numpy.ndarray) – The array of target variables to be regressed (binding types or GCN-labels)
a (float) – The lower bound of the uniform distribution.
b (float) – The upper bound of the uniform distribution.
BINDING_TYPES (numpy.ndarray) – Binding types of the primary datapoints.

Returns

Xperturbed (numpy.ndarray) – Perturbed intensities and frequencies. Has dimensions \(m_{new} x n x p\) where \(m_{new} = m x perturbations\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.
yperturbed (numpy.ndarray) – Some y-variable to be regressed such as binding-type or GCN group.
BINDING_TYPES_perturbed (numpy.ndarray) – Binding-types of the perturbed primary datapoints.

Notes

Returns a tuple of perturbed primary datapoints. The goal is to improve the predictive range of the regressed model by expanding the datapoints and accounting for error in the primary data (DFT). Spectra is perturbed according to a uniform distribution between \(a\) and \(b\).
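As a rough illustration, the sketch below makes several perturbed copies of a list of values, multiplying each value by a factor drawn uniformly between `a` and `b`. Treating the perturbation as multiplicative is an assumption based on the default bounds of 0.999 and 1.001; the function name `perturb_uniform`, the `seed` argument, and the example values are hypothetical.

```python
import random

def perturb_uniform(values, perturbations, a=0.999, b=1.001, seed=0):
    """Return `perturbations` copies of `values`, each scaled by factors drawn from U(a, b)."""
    rng = random.Random(seed)
    return [[v * rng.uniform(a, b) for v in values] for _ in range(perturbations)]

values = [2000.0, 1800.0]
perturbed = perturb_uniform(values, perturbations=3)
```

Each original datapoint yields `perturbations` new datapoints, which matches the \(m_{new} = m x perturbations\) size stated for Xperturbed.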

_scale_and_perturb(X_balanced, Y_balanced, BINDING_TYPES_balanced, IS_TRAINING_SET, TRAINING_ERROR)[source]¶

Apply scaling factor and perturb by error associated with scaling factor.

Parameters

X_balanced (numpy.ndarray) – Full set of frequencies and intensities corresponding to balanced data
Y_balanced (numpy.ndarray) – Balanced class/groups
BINDING_TYPES_balanced (list) – Balanced binding types that correspond to spectra and target data
IS_TRAINING_SET (bool) – Indicates whether synthetic spectra to be generated is for training or validating
TRAINING_ERROR (float or str) – Tag that indicates treatment of error in scaling factor for training data.

Returns

X_sample (numpy.ndarray) – Sample of frequencies and intensities that will be mixed and then convoluted.
Y_sample (numpy.ndarray) – Target classes that will be tabulated to generate fractional contributions to spectra
BINDING_TYPES_sample (numpy.ndarray) – Binding-types that will be used in coverage scaling if scaling is set to ‘high’

_xyconv(X_sample, Y_sample, probabilities, BINDING_TYPES_sample)[source]¶

Convolutes balanced sample of primary data and generates complex, mixed spectra.

Parameters

X_sample (numpy.ndarray) – 3-D numpy array of frequencies and intensities. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints after data class balancing, \(n = 2\) and \(p\) is the number of frequencies/intensities.
Y_sample (numpy.ndarray) – The array of target variables to be regressed (binding types or GCN-labels)
probabilities (numpy.ndarray) – The probabilities to select each binding-type or GCN group. Has dimensions (num_samples, NUM_TARGETS).
BINDING_TYPES_sample (numpy.ndarray) – Binding types of the primary datapoints after data class balancing.

Returns

Xconv (numpy.ndarray) – Convoluted complex spectra of dimensions \(m x n\) where \(m\) is the desired number of samples and \(n\) is the number of energy points.
yconv (numpy.ndarray) – Fraction of occupied binding-types or GCN groups that contribute to the total spectra. yconv has dimensions \(m x p\) where \(m\) is the desired number of samples and \(p\) is the number of targets.

add_noise(Xconv_main, Xconv_noise, noise2signalmax=0.67)[source]¶

Adds two sets of complex spectra, multiplying one by a uniform random variable between 0 and noise2signalmax (0.67 by default), which is treated as noise.

Parameters

Xconv_main (numpy.ndarray) – Convoluted complex spectra of dimensions \(m x n\) where \(m\) is the desired number of samples and \(n\) is the number of energy points.
Xconv_noise (numpy.ndarray) – Convoluted complex spectra of dimensions \(m x n\) where \(m\) is the desired number of samples and \(n\) is the number of energy points. This spectra is treated as noise.

Returns

X_noisey – Convoluted complex spectra of dimensions \(m x n\) where \(m\) is the desired number of samples and \(n\) is the number of energy points. Data has noise from Xconv_noise.

Return type

numpy.ndarray
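The idea can be sketched in a few lines: each noise spectrum is scaled by a random factor between 0 and the maximum noise-to-signal ratio and then added to the matching main spectrum. Drawing one factor per spectrum and skipping any renormalisation are assumptions; the function name `add_scaled_noise`, the `seed` argument, and the example arrays are hypothetical.

```python
import random

def add_scaled_noise(main, noise, noise2signalmax=0.67, seed=0):
    """Add a randomly scaled noise spectrum to each main spectrum."""
    rng = random.Random(seed)
    noisy = []
    for main_row, noise_row in zip(main, noise):
        # One scale factor per spectrum (assumption)
        scale = rng.uniform(0.0, noise2signalmax)
        noisy.append([m + scale * n for m, n in zip(main_row, noise_row)])
    return noisy

main = [[1.0, 2.0], [3.0, 4.0]]
noise = [[1.0, 1.0], [2.0, 0.0]]
noisy = add_scaled_noise(main, noise)
```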

get_synthetic_spectra(NUM_SAMPLES, indices, IS_TRAINING_SET, TRAINING_ERROR=None)[source]¶

Obtain convoluted complex synthetic spectra

Parameters

NUM_SAMPLES (int) – Number of spectra to generate.
indices (list) – List of indices from primary dataset from which to generate spectra
IS_TRAINING_SET (bool) – Indicates whether primary data should be used to compute training or validation set.
TRAINING_ERROR (int, str, or None) – Indicates the kind of perturbations to induce in the primary training data. If an integer, the perturbations are uniform; if ‘gaussian’, the perturbations are Gaussian with the same variance as that used to perturb the validation data.

Returns

X (numpy.ndarray) – Coverage shifted frequencies and intensities
Y (numpy.ndarray) – The target variable histograms. Either binding-type or GCN label

scaling_factor_shift(X)[source]¶

Shift frequencies by a scaling factor to match experiment.

Parameters: X (numpy.ndarray) – 3-D numpy array of frequencies and intensities. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.
Returns: X – 3-D numpy array of frequencies and intensities. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.
Return type: numpy.ndarray

Notes

A scaling factor shift accounts for systematic errors in the functional to over- or under-bind as well as systematic errors in the harmonic approximation. Computed according to https://cccbdb.nist.gov/vibnotes.asp
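The sketch below follows the least-squares form described in the NIST CCCBDB notes: the scaling factor is \(c = \sum \nu_i \omega_i / \sum \omega_i^2\), with experimental frequencies \(\nu_i\) and computed frequencies \(\omega_i\), and its uncertainty is the square root of \(\sum \omega_i^2 (c - \nu_i/\omega_i)^2 / \sum \omega_i^2\). The function name `scaling_factor` is hypothetical and the frequencies are made-up numbers chosen so the answer is easy to check; this is not the data the library uses.

```python
import math

def scaling_factor(theory, experiment):
    """Least-squares frequency scaling factor and its uncertainty."""
    denom = sum(w * w for w in theory)
    c = sum(v * w for v, w in zip(experiment, theory)) / denom
    variance = sum(w * w * (c - v / w) ** 2 for v, w in zip(experiment, theory)) / denom
    return c, math.sqrt(variance)

# Made-up frequencies: every "experimental" value is 0.99 times the computed one
theory = [2100.0, 1900.0, 400.0]
experiment = [2079.0, 1881.0, 396.0]
c, u = scaling_factor(theory, experiment)

# Applying the shift to computed frequencies
shifted = [c * w for w in theory]
```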

set_GCNlabels(Minimum=0, BINDING_TYPE_FOR_GCN=[1], showfigures=False, figure_directory='show')[source]¶

Cluster GCN values into groups/classes using k-means clustering.

Parameters

Minimum (int) – Minimum number of datapoints in each cluster. If a generated cluster has fewer than this number of datapoints it is merged with the next cluster. If the last cluster has fewer than the minimum number of datapoints it is merged with the previous cluster.
showfigures (bool) – Whether or not to generate figures visualizing the clusters and their location in GCN-space.
figure_directory (str) – Either a directory where figures are to be saved or the string ‘show’ which indicates that the figure is supposed to be sent to gui output.
BINDING_TYPE_FOR_GCN (list) – List of binding types whose GCN values are to be included in clustering. Binding-types included will have a GCN label of 1 through the number of clusters. Binding-types not included will be assigned a GCN label of zero.

Variables

GCNlabels (numpy.ndarray) – GCN label assigned to each primary datapoint.
NUM_TARGET (int) – Updated number of targets. If n clusters generated by the k-means algorithm had fewer than the minimum number of datapoints and were merged, then the NUM_TARGET originally instantiated by the class is reduced by n.

Notes

Assigns each primary datapoint a GCN label based on the GCN value using k-means clustering where the number of target clusters is equal to the number of targets instantiated with the class. This is required to be run if one wishes to learn a distribution of GCN sites as GCN is continuous. K-means clustering is an unsupervised learning technique that generates clusters/groups that are relatively evenly spaced with roughly the same number of datapoints in each cluster.
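To make the clustering step concrete, here is a minimal one-dimensional k-means (Lloyd's algorithm) applied to a handful of made-up GCN values. It is a sketch only: the quantile-based initialisation, the 0-based labels, and the names `kmeans_1d` and `gcn_values` are assumptions, and the merging of undersized clusters described above is left out.

```python
def kmeans_1d(values, k, iterations=100):
    """Cluster scalar values into k groups with Lloyd's algorithm."""
    ordered = sorted(values)
    # Deterministic start: centres at evenly spaced quantiles (assumption)
    centres = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(iterations):
        # Assignment step: nearest centre for every value
        labels = [min(range(k), key=lambda j: abs(v - centres[j])) for v in values]
        # Update step: each centre moves to the mean of its members
        new_centres = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centres.append(sum(members) / len(members) if members else centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    return labels, centres

gcn_values = [1.0, 1.2, 1.1, 4.0, 4.2, 3.9, 7.5, 7.4, 7.6]
labels, centres = kmeans_1d(gcn_values, k=3)
```

Adding 1 to each label would reproduce the convention above, where label 0 is reserved for binding types excluded from clustering.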

set_spectra_properties(COVERAGE=None, MAX_COVERAGES=[1, 1, 1, 1], LOW_FREQUENCY=200, HIGH_FREQUENCY=2200, ENERGY_POINTS=501)[source]¶

Set spectra specific properties

Parameters

COVERAGE (str or float) – The coverage at which the synthetic spectra is generated. If high, spectra at various coverages is generated.
MAX_COVERAGES (list) – Maximum coverages allowed for each binding-type if COVERAGE is set to ‘high’
LOW_FREQUENCY (float) – The lowest frequency for which synthetic spectra is generated
HIGH_FREQUENCY (float) – The highest frequency for which synthetic spectra is generated
ENERGY_POINTS (int) – The number of points the synthetic spectra is discretized into

Variables

X (numpy.ndarray) – Coverage shifted frequencies and intensities
Y (numpy.ndarray) – The target variables. Either binding-type or GCN label.
NUM_TARGETS (int) – The number of binding-types or GCN labels. This is set by the user and can be altered by set_GCNlabels() and _get_coverage_shifted_X
X0cov (numpy.ndarray) – The zero coverage frequency and intensity pairs
COVERAGE (str or float) – The COVERAGE set by the user.
LOW_FREQUENCY (float) – The low frequency set by the user.
HIGH_FREQUENCY (float) – The high frequency set by the user
ENERGY_POINTS (int) – The number of energy points set by the user.
MAX_COVERAGES (list) – List of maximum coverages set by the user.

Notes

This function calls _get_coverage_shifted_X in order to shift X frequencies according to the specified coverage.
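For orientation, the default arguments imply the energy grid sketched below, assuming the grid includes both end points and is evenly spaced. That layout is an assumption; the class may build its grid differently.

```python
# Defaults from the signature of set_spectra_properties
LOW_FREQUENCY, HIGH_FREQUENCY, ENERGY_POINTS = 200.0, 2200.0, 501

# Evenly spaced grid that includes both end points (assumption)
energy_spacing = (HIGH_FREQUENCY - LOW_FREQUENCY) / (ENERGY_POINTS - 1)
energies = [LOW_FREQUENCY + k * energy_spacing for k in range(ENERGY_POINTS)]
```

With these defaults the spacing works out to 4 frequency units per point.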

The `fold` function¶

fold(frequencies, intensities, LOW_FREQUENCY, HIGH_FREQUENCY, ENERGY_POINTS, FWHM, fL)[source]¶

Generate spectra from set of frequencies and intensities

Parameters

frequencies (list or numpy.ndarray) – Set of molecular frequencies
intensities (list or numpy.ndarray) – Intensities that correspond to frequencies
LOW_FREQUENCY (float) – The lowest frequency for which synthetic spectra is generated
HIGH_FREQUENCY (float) – The highest frequency for which synthetic spectra is generated
ENERGY_POINTS (int) – The number of points the synthetic spectra is discretized into
FWHM (float) – Full-width-half-maximum of the desired spectra
fL (float) – Fraction of spectra to be convoluted by a Lorentzian transform. The remaining portion of the transform that makes up the FWHM comes from a Gaussian convoluting function.

Returns

spectrum – The spectrum that is generated from the set of frequencies and intensities.

Return type

numpy.ndarray

The `HREEL_2_scaledIR` function¶

HREEL_2_scaledIR(HREEL, frequency_range=None, PEAK_CONV=2.7)[source]¶

Convert HREEL spectra to scaled IR spectra.

Parameters

HREEL (numpy.ndarray) – HREEL spectra of size (2,n), where n is the number of points into which the frequency and intensity are discretized.
frequency_range (numpy.ndarray) – Frequency range onto which HREEL will be interpolated after conversion to IR.
PEAK_CONV (float) – The intensity is scaled by the wavenumber raised to PEAK_CONV.

Returns

IR_scaled – The IR spectra the HREELs is converted to.

Return type

numpy.ndarray
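The intensity scaling can be sketched as follows, reading "scaled by the wavenumber raised to PEAK_CONV" as a multiplication of each intensity by its frequency to the power PEAK_CONV. That reading, the normalisation to a maximum of 1, and the name `hreel_to_scaled_ir` are assumptions; the interpolation onto `frequency_range` is left out.

```python
def hreel_to_scaled_ir(frequencies, intensities, peak_conv=2.7):
    """Scale each intensity by frequency**peak_conv, then normalise the largest value to 1."""
    scaled = [i * f ** peak_conv for f, i in zip(frequencies, intensities)]
    top = max(scaled)
    return [s / top for s in scaled]

# Two made-up peaks of equal HREEL intensity
ir_scaled = hreel_to_scaled_ir([400.0, 2000.0], [1.0, 1.0])
```

Under this reading, equal HREEL intensities translate into a much stronger IR peak at the higher frequency.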
