jl_spectra_2_structure

The get_default_data_paths function

get_default_data_paths(adsorbate)[source]

Get default paths to primary data of frequencies and intensities.

Parameters

adsorbate (str) – Name of the adsorbate for which to get the default primary DFT data. Data is already provided for ‘CO’, ‘NO’, and ‘C2H4’.

Returns

  • nanoparticle_path (str) – File path where nanoparticle or single adsorbate data will be saved.

  • isotope_path (str) – File path where isotope data for CO and NO will be saved.

  • high_coverage_path (str) – File path where high coverage data for CO will be saved.

  • coverage_scaling_path (str) – File path where coverage scaling coefficients will be saved.

Notes

Returns default frequencies to project intensities onto as well as default paths for locations of the pure and mixture spectroscopic data.

The get_exp_data_path function

get_exp_data_path()[source]

Get default paths to experimental data.

Returns

experimental_data – list of paths to experimental data

Return type

list

The IR_GEN class

class IR_GEN(ADSORBATE='CO', INCLUDED_BINDING_TYPES=[1, 2, 3, 4], TARGET='binding_type', NUM_TARGETS=None, nanoparticle_path=None, high_coverage_path=None, coverage_scaling_path=None, VERBOSE=False)[source]

Class for generating complex synthetic IR spectra

Parameters
  • ADSORBATE (str) – Adsorbate for which the spectra is to be generated.

  • INCLUDED_BINDING_TYPES (list) – Binding types whose frequencies/intensities from the primary data set will be included in generating the complex spectra.

  • TARGET (str) – Geometric descriptor for which the target histogram is to be generated. Can be binding_type, GCN, or combine_hollow_sites. If it is combine_hollow_sites then 3-fold and 4-fold sites are grouped together.

  • NUM_TARGETS (int) – Number of GCN groups to predict. Only used if TARGET=’GCN’ and GCN must be discretized.

  • nanoparticle_path (str) – File path where nanoparticle or single adsorbate json data is saved.

  • high_coverage_path (str) – File path where high coverage data for CO is saved.

  • coverage_scaling_path (str) – File path where coverage scaling coefficients are saved.

  • VERBOSE (bool) – Controls the printing of status statements.

Variables
  • BINDING_TYPES_with_4fold (list) – List of all binding types.

  • TARGET (str) – Set during initialization.

  • NUM_TARGETS (int) – Set during initialization.

  • X0cov (numpy.ndarray) – Set of frequencies and intensities at low coverage.

  • BINDING_TYPES (list) – Binding types to be predicted, accounts for both filtering and merging of certain sites.

  • GCNList (list) – GCN values for the data points

  • NANO_PATH (str) – Set during initialization.

  • HIGH_COV_PATH (str) – Set during initialization.

  • COV_SCALE_PATH (str) – Set during initialization.

  • ADSORBATE (str) – Set during initialization.

  • INCLUDED_BINDING_TYPES (list) – Set during initialization.

  • VERBOSE (bool) – Set during initialization.

_add_high_coverage_data(X, Y)[source]

Add high coverage GCN data to the set of spectra.

Parameters
Returns

  • X_new (numpy.ndarray) – Frequencies and intensities with high coverage data

  • Y_new (numpy.ndarray) – New target vector with high coverage GCN data added

_coverage_shift(X, BINDING_TYPES, SELF_COVERAGE, TOTAL_COVERAGE)[source]

Shift frequencies and intensities to account for coverage effects.

Parameters
  • X (numpy.ndarray) – 3-D numpy array of frequencies and intensities. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.

  • BINDING_TYPES (numpy.ndarray) – Binding types of the primary datapoints.

  • SELF_COVERAGE (numpy.ndarray of floats) – Relative spatial coverage of each binding-type

  • TOTAL_COVERAGE (numpy.ndarray of floats) – Relative combined coverage at which the primary data point “sits”

Returns

Xcov – 3-D numpy array of frequencies and intensities that have been shifted to account for coverage effects. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.

Return type

numpy.ndarray

_generate_spectra(Xfrequencies, Xintensities, ENERGIES)[source]

Convert set of frequencies and intensities to spectra with minimal line width by Gaussian convolution.

Parameters
  • Xfrequencies (numpy.ndarray) – 2-D numpy array of frequencies. The ndarray has dimensions \(m x n\) where \(m\) is the number of primary datapoints, \(n\) is the number of frequencies/intensities for each datapoint.

  • Xintensities (numpy.ndarray) – 2-D numpy array of intensities. The ndarray has dimensions \(m x n\) where \(m\) is the number of primary datapoints, \(n\) is the number of frequencies/intensities for each datapoint.

  • ENERGIES (numpy.ndarray) – Numpy array of energies onto which frequencies and intensities will be convolved.

Returns

intmesh – numpy ndarray with dimensions \(m x p\) where \(m\) is the number of primary datapoints and \(p\) is the number of energy points onto which the frequencies and intensities are projected with Fourier convolution.

Return type

numpy.ndarray

Notes

This preconvolution is necessary for numerical reasons before data is convoluted to induce greater spectral broadening.
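
The projection step can be pictured with a minimal sketch. It places one narrow Gaussian per frequency/intensity pair on a fixed energy grid by direct summation. The function name `project_onto_grid` and the width parameter `sigma` are hypothetical, chosen for illustration; the width the library actually uses and its Fourier-based implementation are not stated here.

```python
import math

def project_onto_grid(frequencies, intensities, energies, sigma):
    """Sum one narrow Gaussian per (frequency, intensity) pair on a grid.

    sigma is a hypothetical line-width parameter (same units as energies).
    """
    spectrum = [0.0] * len(energies)
    for freq, inten in zip(frequencies, intensities):
        for k, energy in enumerate(energies):
            # Peak-normalised Gaussian centred on the stick frequency.
            spectrum[k] += inten * math.exp(-0.5 * ((energy - freq) / sigma) ** 2)
    return spectrum

# 501 evenly spaced points from 200 to 2200 (spacing of 4).
energies = [200.0 + 4.0 * k for k in range(501)]
spectrum = project_onto_grid([2000.0], [1.0], energies, sigma=4.0)
peak_index = max(range(len(spectrum)), key=spectrum.__getitem__)
```

A single stick at 2000 becomes a narrow peak centred on the grid point at 2000, which a broader kernel can then widen.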

_get_balanced_data(X_new, Y_new, indices)[source]

Correct for imbalances in the data for improved learning

Parameters
  • X_new (numpy.ndarray) – Full set of frequencies and intensities

  • Y_new (numpy.ndarray) – class/groups that are imbalanced

  • indices (list) – Indices for which individual X_new and Y_new will be selected.

Returns

  • X_balanced (numpy.ndarray) – Frequencies and intensities corresponding to balanced Y

  • Y_balanced (numpy.ndarray) – Balanced set of classes/groups.

  • BINDING_TYPES_balanced (numpy.ndarray) – Balanced set of binding-types that correspond to Y_balanced

_get_coverage_shifted_X(X0cov, Y, COVERAGE)[source]

Get frequencies and intensities shifted with coverage.

Parameters
  • X0cov (numpy.ndarray) – Numpy array of size (n,2) where \(n\) is the number of frequency and intensity pairs.

  • Y (numpy.ndarray) – Target classes/groups (binding-types or GCN labels)

  • COVERAGE (str or float) – The coverage at which the synthetic spectra is generated. If high, spectra at various coverages is generated.

Returns

  • X (numpy.ndarray) – Frequencies and intensities shifted to account for coverage

  • NUM_TARGETS (int) – Number of target classes/groups. Updated if TARGET is GCN and coverage is ‘high’

_get_probabilities(num_samples, NUM_TARGETS)[source]

Get probabilities for sampling data from different classes/groups.

Parameters
  • num_samples (int) – number of complex spectra to generate

  • NUM_TARGETS (int) – number of classes/labels that are desired. Usually the number of binding-types or the number of GCN labels > 0

Returns

probabilities – The probabilities to select each binding-type or GCN group. Has dimensions (num_samples, NUM_TARGETS).

Return type

numpy.ndarray

Notes

Returns a 2-D numpy.ndarray that is of length num_samples along the first dimension and NUM_TARGETS along the second dimension. Elements correspond to the probability of selecting a primary data point from a specific class/group, such that the array sums to 1 along the second dimension. The probability assigned to the first index along the second dimension comes from a uniform distribution between 0 and 1, while the probability assigned to each following index \(i = n\) comes from the uniform distribution \(1/\sum{p_{i}}\) where \(i\) in \(p_{i}\) comes from all previous index values \(i < n\). The probabilities are then shuffled along the second dimension. Under this probability distribution, the contribution to a spectrum from any given class/label is most likely zero, and the likelihood of a contribution decreases monotonically as the contribution goes to 1. This distribution ensures that all fractional contributions are sampled as equally as possible.
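
A minimal sketch of this sampling scheme follows. It assumes each successive probability is drawn uniformly between 0 and the probability mass not yet assigned, with the last entry taking the remainder; that reading is an assumption, chosen so that every row sums to 1. The function name `sample_probabilities` is hypothetical, and plain lists stand in for numpy arrays.

```python
import random

def sample_probabilities(num_samples, num_targets, seed=None):
    """Sketch: rows of class-selection probabilities that each sum to 1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(num_samples):
        row = []
        remaining = 1.0
        # Every entry but the last is uniform on [0, remaining mass].
        for _ in range(num_targets - 1):
            p = rng.uniform(0.0, remaining)
            row.append(p)
            remaining -= p
        # The last entry takes whatever is left so the row sums to 1.
        row.append(remaining)
        # Shuffle so that no target index is systematically favoured.
        rng.shuffle(row)
        rows.append(row)
    return rows

probabilities = sample_probabilities(num_samples=5, num_targets=4, seed=0)
```

Because later draws are bounded by what earlier draws left over, small contributions are common and contributions near 1 are rare, in line with the behaviour described above.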

_get_sample_data(X, Y, indices, IS_TRAINING_SET, TRAINING_ERROR)[source]

Apply scaling factor and perturb by error associated with scaling factor.

Parameters
  • X (numpy.ndarray) – Zero coverage frequencies and intensities

  • Y (numpy.ndarray) – GCN labels

  • indices (list) – Indices for which individual X_new and Y_new will be selected.

  • IS_TRAINING_SET (bool) – Indicates whether synthetic spectra to be generated is for training or validating

  • TRAINING_ERROR (float or str) – Tag that indicates treatment of error in scaling factor for training data.

Returns

  • X_sample (numpy.ndarray) – Sample of frequencies and intensities that will be mixed and then convoluted.

  • Y_sample (numpy.ndarray) – Target classes that will be tabulated to generate fractional contributions to spectra

  • BINDING_TYPES_sample (numpy.ndarray) – Binding-types that will be used in coverage scaling if scaling is set to ‘high’

_mixed_lineshape(FWHM, fL, ENERGY_POINTS, energy_spacing)[source]

Convolute spectra with a Lorentzian or Gaussian to induce broadening.

Parameters
  • FWHM (float) – Full-width-half-maximum of the desired spectra

  • fL (float) – Fraction of spectra to be convoluted by a Lorentzian transform. The remaining portion of the transform that makes up the FWHM comes from a Gaussian convoluting function.

  • ENERGY_POINTS (int) – Number of energy points in the desired spectra

  • energy_spacing (float) – spacing of the energy points

Returns

transform – transform that is convolved with a spectra that has a narrow line width in order to produce spectra with greater line widths.

Return type

numpy.ndarray

Notes

Accepts a full-width-half-maximum, a Lorentzian fraction, an energy spacing and a number of points, and produces the transform with which the spectra is convolved through Fourier convolution in order to produce realistic experimental spectra.
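
One common way to build such a kernel is a pseudo-Voigt profile: a weighted sum of a Lorentzian and a Gaussian that share the same FWHM. The sketch below uses that form as an assumption; it is not confirmed to be the mixing rule the library applies, and the function name `mixed_lineshape_sketch` is hypothetical.

```python
import math

def mixed_lineshape_sketch(fwhm, f_lorentz, energy_points, energy_spacing):
    """Pseudo-Voigt kernel: f_lorentz * Lorentzian + (1 - f_lorentz) * Gaussian."""
    center = (energy_points - 1) / 2.0
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))  # Gaussian std dev
    gamma = fwhm / 2.0                                     # Lorentzian half width
    kernel = []
    for k in range(energy_points):
        x = (k - center) * energy_spacing
        gauss = math.exp(-0.5 * (x / sigma) ** 2)
        lorentz = gamma ** 2 / (x ** 2 + gamma ** 2)
        kernel.append(f_lorentz * lorentz + (1.0 - f_lorentz) * gauss)
    # Normalise to unit sum so that convolution preserves total intensity.
    total = sum(kernel)
    return [value / total for value in kernel]

kernel = mixed_lineshape_sketch(fwhm=40.0, f_lorentz=0.5,
                                energy_points=501, energy_spacing=4.0)
```

Both components fall to half their peak value at a distance of FWHM/2 from the centre, so the mixture keeps the requested FWHM for any value of the Lorentzian fraction.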

_perturb_and_shift(perturbations, X, y, BINDING_TYPES=None)[source]

Shift spectra with a scaling factor that has Gaussian error.

Parameters
  • perturbations (int) – The number of perturbed primary datapoints per original datapoint.

  • X (numpy.ndarray) – 3-D numpy array of frequencies and intensities. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities.

  • y (numpy.ndarray) – The array of target variables to be regressed (binding types or GCN-labels)

  • BINDING_TYPES (numpy.ndarray) – Binding types of the primary datapoints.

Returns

  • Xperturbed (numpy.ndarray) – Perturbed intensities and frequencies. Has dimensions \(m_{new} x n x p\) where \(m_{new} = m x perturbations\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.

  • yperturbed (numpy.ndarray) – Some y-variable to be regressed such as binding-type or GCN group.

  • BINDING_TYPES_perturbed (numpy.ndarray) – Binding-types of the perturbed primary datapoints.

Notes

Returns a tuple of perturbed primary datapoints. The goal is to incorporate error in the scaling factor into the validation and test sets. The scaling factor is perturbed according to its calculated standard error.
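
The idea can be sketched as follows, assuming one Gaussian-distributed scaling factor is drawn per perturbed datapoint and applied to the frequencies only. The function name and the values of `scale_mean` and `scale_std` are made up for illustration; the library's fitted scaling factor and its standard error are not given here.

```python
import random

def perturb_and_shift_sketch(X, perturbations, scale_mean, scale_std, seed=None):
    """Return perturbations * len(X) datapoints with rescaled frequencies."""
    rng = random.Random(seed)
    shifted = []
    for _ in range(perturbations):
        for frequencies, intensities in X:
            # One scaling factor per datapoint, drawn from a Gaussian.
            factor = rng.gauss(scale_mean, scale_std)
            shifted.append([[f * factor for f in frequencies], list(intensities)])
    return shifted

# One datapoint: [frequencies, intensities], mirroring the m x 2 x p layout.
X = [[[2050.0, 480.0], [1.0, 0.1]]]
X_shifted = perturb_and_shift_sketch(X, perturbations=4,
                                     scale_mean=0.99, scale_std=0.005, seed=2)
```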

_perturb_spectra(perturbations, X, y, a=0.999, b=1.001, BINDING_TYPES=None)[source]

Generate perturbations of the frequencies and intensities uniformly.

Parameters
  • perturbations (int) – The number of perturbed primary datapoints per original datapoint.

  • X (numpy.ndarray) – 3-D numpy array of frequencies and intensities. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.

  • y (numpy.ndarray) – The array of target variables to be regressed (binding types or GCN-labels)

  • a (float) – The lower bound of the uniform distribution.

  • b (float) – The upper bound of the uniform distribution.

  • BINDING_TYPES (numpy.ndarray) – Binding types of the primary datapoints.

Returns

  • Xperturbed (numpy.ndarray) – Perturbed intensities and frequencies. Has dimensions \(m_{new} x n x p\) where \(m_{new} = m x perturbations\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.

  • yperturbed (numpy.ndarray) – Some y-variable to be regressed such as binding-type or GCN group.

  • BINDING_TYPES_perturbed (numpy.ndarray) – Binding-types of the perturbed primary datapoints.

Notes

Returns a tuple of perturbed primary datapoints. The goal is to improve the predictive range of the regressed model by expanding the datapoints and accounting for error in the primary data (DFT). Spectra is perturbed according to a uniform distribution between \(a\) and \(b\).
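
A minimal sketch of uniform perturbation is shown here. It assumes the perturbation is multiplicative and that an independent factor between a and b is drawn for every frequency and intensity value; both points are assumptions, and the function name `perturb_uniform_sketch` is hypothetical.

```python
import random

def perturb_uniform_sketch(X, perturbations, a=0.999, b=1.001, seed=None):
    """Return perturbations * len(X) copies of X with values scaled in [a, b]."""
    rng = random.Random(seed)
    perturbed = []
    for _ in range(perturbations):
        for datapoint in X:  # datapoint = [frequencies, intensities]
            perturbed.append(
                [[value * rng.uniform(a, b) for value in row] for row in datapoint]
            )
    return perturbed

# Two datapoints in the m x 2 x p layout (m = 2, p = 2).
X = [[[2050.0, 480.0], [1.0, 0.1]], [[1850.0, 390.0], [0.8, 0.2]]]
X_perturbed = perturb_uniform_sketch(X, perturbations=3, seed=1)
```

With three perturbations of two datapoints the result holds six datapoints, matching the relation \(m_{new} = m x perturbations\) given above.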

_scale_and_perturb(X_balanced, Y_balanced, BINDING_TYPES_balanced, IS_TRAINING_SET, TRAINING_ERROR)[source]

Apply scaling factor and perturb by error associated with scaling factor.

Parameters
  • X_balanced (numpy.ndarray) – Full set of frequencies and intensities corresponding to balanced data

  • Y_balanced (numpy.ndarray) – Balanced class/groups

  • BINDING_TYPES_balanced (list) – Balanced binding types that correspond to spectra and target data

  • IS_TRAINING_SET (bool) – Indicates whether synthetic spectra to be generated is for training or validating

  • TRAINING_ERROR (float or str) – Tag that indicates treatment of error in scaling factor for training data.

Returns

  • X_sample (numpy.ndarray) – Sample of frequencies and intensities that will be mixed and then convoluted.

  • Y_sample (numpy.ndarray) – Target classes that will be tabulated to generate fractional contributions to spectra

  • BINDING_TYPES_sample (numpy.ndarray) – Binding-types that will be used in coverage scaling if scaling is set to ‘high’

_xyconv(X_sample, Y_sample, probabilities, BINDING_TYPES_sample)[source]
Convolutes a balanced sample of primary data and generates complex mixed spectra.

Parameters
  • perturbations (int) – The number of perturbed primary datapoints per original datapoint.

  • X_sample (numpy.ndarray) – 3-D numpy array of frequencies and intensities. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints after data class balancing, \(n = 2\) and \(p\) is the number of frequencies/intensities.

  • Y_sample (numpy.ndarray) – The array of target variables to be regressed (binding types or GCN-labels)

  • probabilities (numpy.ndarray) – The probabilities to select each binding-type or GCN group. Has dimensions (num_samples, NUM_TARGETS).

  • BINDING_TYPES_sample (numpy.ndarray) – Binding types of the primary datapoints after data class balancing.

Returns

  • Xconv (numpy.ndarray) – Convoluted complex spectra of dimensions \(m x n\) where \(m\) is the desired number of samples and \(n\) is the number of energy points.

  • yconv (numpy.ndarray) – Fraction of occupied binding-types or GCN groups that contribute to the total spectra. yconv has dimensions \(m x p\) where \(m\) is the desired number of samples and \(p\) is the number of targets.

add_noise(Xconv_main, Xconv_noise, noise2signalmax=0.67)[source]

Adds two sets of complex spectra, multiplying one, which is treated as noise, by a uniform random variable between 0 and noise2signalmax (0.67 by default).

Parameters
  • Xconv_main (numpy.ndarray) – Convoluted complex spectra of dimensions \(m x n\) where \(m\) is the desired number of samples and \(n\) is the number of energy points.

  • Xconv_noise (numpy.ndarray) – Convoluted complex spectra of dimensions \(m x n\) where \(m\) is the desired number of samples and \(n\) is the number of energy points. This spectra is treated as noise.

Returns

X_noisey – Convoluted complex spectra of dimensions \(m x n\) where \(m\) is the desired number of samples and \(n\) is the number of energy points. Data has noise from Xconv_noise.

Return type

numpy.ndarray
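
The operation can be sketched as below, assuming a single random weight is drawn per spectrum and that no renormalization follows; both are assumptions about details the description leaves open. The function name `add_noise_sketch` is hypothetical, and lists stand in for numpy arrays.

```python
import random

def add_noise_sketch(main_spectra, noise_spectra, noise2signalmax=0.67, seed=None):
    """Add a randomly weighted noise spectrum to each main spectrum."""
    rng = random.Random(seed)
    noisy = []
    for main, noise in zip(main_spectra, noise_spectra):
        # One weight per spectrum, uniform on [0, noise2signalmax].
        weight = rng.uniform(0.0, noise2signalmax)
        noisy.append([m + weight * n for m, n in zip(main, noise)])
    return noisy

main_spectra = [[1.0, 0.0, 0.0]]
noise_spectra = [[0.0, 1.0, 0.0]]
X_noisy = add_noise_sketch(main_spectra, noise_spectra, seed=3)
```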

get_synthetic_spectra(NUM_SAMPLES, indices, IS_TRAINING_SET, TRAINING_ERROR=None)[source]

Obtain convoluted complex synthetic spectra

Parameters
  • NUM_SAMPLES (int) – Number of spectra to generate.

  • indices (list) – List of indices from primary dataset from which to generate spectra

  • IS_TRAINING_SET (bool) – Indicates whether primary data should be used to compute training or validation set.

  • TRAINING_ERROR (int, str, or None) – Indicates the kind of perturbations to induce in the primary training data. If an integer, the perturbations are uniform; if ‘gaussian’, the perturbations are Gaussian with the same variance as that used to perturb the validation data.

Returns

  • X (numpy.ndarray) – Coverage shifted frequencies and intensities

  • Y (numpy.ndarray) – The target variable histograms. Either binding-type or GCN label

scaling_factor_shift(X)[source]

Shift frequencies by a scaling factor to match experiment.

Parameters

X (numpy.ndarray) – 3-D numpy array of frequencies and intensities. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.

Returns

X – 3-D numpy array of frequencies and intensities. The ndarray has dimensions \(m x n x p\) where \(m\) is the number of primary datapoints, \(n = 2\) and \(p\) is the number of frequencies/intensities for each datapoint.

Return type

numpy.ndarray

Notes

A scaling factor shift accounts for systematic errors in the functional to over- or under-bind as well as systematic errors in the harmonic approximation. Computed according to https://cccbdb.nist.gov/vibnotes.asp
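
As a rough illustration, the sketch below fits a single scaling factor by least squares, \(c = \sum \nu_i \omega_i / \sum \omega_i^2\), where \(\omega_i\) are computed harmonic frequencies and \(\nu_i\) are measured ones. This is the least-squares form as understood from the linked notes; the frequencies are made-up numbers, not data from the library, and the function name is hypothetical.

```python
def least_squares_scaling_factor(computed, experimental):
    """Scaling factor c that minimises sum((c * omega_i - nu_i) ** 2)."""
    numerator = sum(nu * omega for nu, omega in zip(experimental, computed))
    denominator = sum(omega ** 2 for omega in computed)
    return numerator / denominator

computed = [2100.0, 1900.0, 450.0]      # made-up harmonic frequencies (cm^-1)
experimental = [2058.0, 1862.0, 441.0]  # made-up measured frequencies (cm^-1)
c = least_squares_scaling_factor(computed, experimental)
scaled = [c * omega for omega in computed]
```

In this toy data every measured frequency is 0.98 times the computed one, so the fitted factor comes out at 0.98 and the scaled frequencies match the measured values.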

set_GCNlabels(Minimum=0, BINDING_TYPE_FOR_GCN=[1], showfigures=False, figure_directory='show')[source]

Cluster GCN values into groups/classes using k-means clustering.

Parameters
  • Minimum (int) – Minimum number of datapoints in each cluster. If a generated cluster has fewer than this number of datapoints it is merged with the next cluster. If the last cluster has fewer than the minimum number of datapoints it is merged with the previous cluster.

  • showfigures (bool) – Whether or not to generate figures visualizing the clusters and their location in GCN-space.

  • figure_directory (str) – Either a directory where figures are to be saved or the string ‘show’ which indicates that the figure is supposed to be sent to gui output.

  • BINDING_TYPE_FOR_GCN (list) – List of binding types whose GCN values are to be included in clustering. Binding-types included will have a GCN label of 1 through the number of clusters. Binding-types not included will be assigned a GCN label of zero.

Variables
  • GCNlabels (numpy.ndarray) – GCN label assigned to each primary datapoint.

  • NUM_TARGET (int) – Updated number of targets. If n clusters (after merging) generated by the k-means algorithm had fewer than the minimum number of datapoints, then the NUM_TARGET originally instantiated by the class is reduced by n.

Notes

Assigns each primary datapoint a GCN label based on its GCN value using k-means clustering, where the number of target clusters is equal to the number of targets instantiated with the class. This must be run if one wishes to learn a distribution of GCN sites, as GCN is continuous. K-means clustering is an unsupervised learning technique that generates clusters/groups that are relatively evenly spaced with roughly the same number of datapoints in each cluster.
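
The clustering step can be illustrated with a small one-dimensional k-means (Lloyd's algorithm) in plain Python. This is a sketch only: the initialization rule, the helper name `kmeans_1d`, and the GCN values are assumptions, and the merging of undersized clusters described above is left out.

```python
def kmeans_1d(values, k, iterations=50):
    """Plain Lloyd's algorithm on scalars; returns one label (1..k) per value."""
    ordered = sorted(values)
    # Initialise centres at evenly spaced positions of the sorted data.
    centers = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: nearest centre for every value.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    # Relabel so that labels increase with the centre value, starting at 1.
    order = sorted(range(k), key=lambda j: centers[j])
    rank = {j: r + 1 for r, j in enumerate(order)}
    return [rank[lab] for lab in labels]

gcn_values = [1.0, 1.2, 1.1, 4.0, 4.2, 3.9, 7.5, 7.4, 7.6]  # made-up GCN values
gcn_labels = kmeans_1d(gcn_values, k=3)
```

Labels start at 1 here so that 0 stays free for binding types excluded from clustering, as the parameter description for BINDING_TYPE_FOR_GCN indicates.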

set_spectra_properties(COVERAGE=None, MAX_COVERAGES=[1, 1, 1, 1], LOW_FREQUENCY=200, HIGH_FREQUENCY=2200, ENERGY_POINTS=501)[source]

Set spectra specific properties

Parameters
  • COVERAGE (str or float) – The coverage at which the synthetic spectra is generated. If high, spectra at various coverages is generated.

  • MAX_COVERAGES (list) – Maximum coverages allowed for each binding-type if COVERAGE is set to ‘high’

  • LOW_FREQUENCY (float) – The lowest frequency for which synthetic spectra is generated

  • HIGH_FREQUENCY (float) – The high frequency for which synthetic spectra is generated

  • ENERGY_POINTS (int) – The number of points the synthetic spectra is discretized into

Variables
  • X (numpy.ndarray) – Coverage shifted frequencies and intensities

  • Y (numpy.ndarray) – The target variables. Either binding-type or GCN label.

  • NUM_TARGETS (int) – The number of binding-types or GCN labels. This is set by the user and can be altered by set_GCNlabels() and _get_coverage_shifted_X

  • X0cov (numpy.ndarray) – The zero coverage frequency and intensity pairs

  • COVERAGE (str or float) – The COVERAGE set by the user.

  • LOW_FREQUENCY (float) – The low frequency set by the user.

  • HIGH_FREQUENCY (float) – The high frequency set by the user

  • ENERGY_POINTS (int) – The number of energy points set by the user.

  • MAX_COVERAGES (list) – List of maximum coverages set by the user.

Notes

This function calls _get_coverage_shifted_X in order to shift X frequencies according to the specified coverage.
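
For orientation, the default frequency window and point count imply a 4-unit energy spacing if the grid is evenly spaced and includes both endpoints. That layout is an assumption, and the helper `energy_grid` below is hypothetical.

```python
def energy_grid(low_frequency=200.0, high_frequency=2200.0, energy_points=501):
    """Evenly spaced grid that includes both end points, plus its spacing."""
    spacing = (high_frequency - low_frequency) / (energy_points - 1)
    energies = [low_frequency + k * spacing for k in range(energy_points)]
    return energies, spacing

energies, energy_spacing = energy_grid()
```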

The fold function

fold(frequencies, intensities, LOW_FREQUENCY, HIGH_FREQUENCY, ENERGY_POINTS, FWHM, fL)[source]

Generate spectra from set of frequencies and intensities

Parameters
  • frequencies (list or numpy.ndarray) – Set of molecular frequencies

  • intensities (list or numpy.ndarray) – Intensities that correspond to frequencies

  • LOW_FREQUENCY (float) – The lowest frequency for which synthetic spectra is generated

  • HIGH_FREQUENCY (float) – The high frequency for which synthetic spectra is generated

  • ENERGY_POINTS (int) – The number of points the synthetic spectra is discretized into

  • FWHM (float) – Full-width-half-maximum of the desired spectra

  • fL (float) – Fraction of spectra to be convoluted by a Lorentzian transform. The remaining portion of the transform that makes up the FWHM comes from a Gaussian convoluting function.

Returns

spectrum – The spectrum that is generated from the set of frequencies and intensities.

Return type

numpy.ndarray

The HREEL_2_scaledIR function

HREEL_2_scaledIR(HREEL, frequency_range=None, PEAK_CONV=2.7)[source]

Convert HREEL spectra to scaled IR spectra.

Parameters
  • HREEL (numpy.ndarray) – HREEL spectra of size (2,n) where n is the number of points into which the frequency and intensity are discretized.

  • frequency_range (numpy.ndarray) – Frequency range onto which HREEL will be interpolated after conversion to IR.

  • PEAK_CONV (float) – The intensity is scaled by the wavenumber raised to PEAK_CONV.

Returns

IR_scaled – The IR spectra the HREELs is converted to.

Return type

numpy.ndarray
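
A minimal sketch of the intensity conversion follows. It reads "scaled by the wavenumber raised to PEAK_CONV" as a multiplication and normalizes the result to a unit maximum. Both choices are assumptions, the interpolation onto frequency_range is omitted, and the function name is hypothetical.

```python
def hreel_to_scaled_ir_sketch(frequencies, intensities, peak_conv=2.7):
    """Multiply each intensity by frequency ** peak_conv, then normalise."""
    scaled = [inten * freq ** peak_conv
              for freq, inten in zip(frequencies, intensities)]
    peak = max(scaled)
    return [value / peak for value in scaled]  # unit maximum (assumed)

frequencies = [500.0, 1000.0, 2000.0]  # made-up wavenumbers (cm^-1)
intensities = [1.0, 1.0, 1.0]          # made-up HREEL intensities
ir_scaled = hreel_to_scaled_ir_sketch(frequencies, intensities)
```

Under this reading, equal HREEL intensities translate into IR intensities that grow steeply with wavenumber.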