

Machine Learning Units¶

Multiple additional unit types exist in the context of Workflows based upon the Machine Learning Model, besides those of general applicability .

The below table describes the types of Machine Learning-specific units are available.

Category	Definition	Example
Pre-Processing	Transformations performed on the data prior to training	Data Standardization
Feature Extraction	Creates new featues from existing data	Principle-Component Analysis
Feature Selection	Automatically selects features as descriptors	Random Forest Importance Score
Hyperparameter tuning	Automatically optimizes the model architecture	Model Generation via Genetic Algorithm
Model Training	Adjusting model parameters and coefficients to minimize a training set loss	Training a neural network
Model Prediction	Using a trained model, predict what a new datapoint's values will be	Predicting a band gap with a trained neural network
Post-Processing	Any action taken after a model is trained, we lump into the post-processing category	Showing a learning curve of the neural network

Pre-Processing¶

The DataFrame input/output type of units are used to define the training data, to select target properties that need to be predicted as output of the Machine Learning computation, and feature properties used as input. The data is returned in the Pandas DataFrame format ¹.

Processing units are used for general data manipulation, for example cleaning missing data or remove duplicates in the data. Other examples might include the scaling and reducing of the input data. Scaling is a method used to standardize the range of independent variables or features of data, through for instance the normalization of the data.

Feature Extraction¶

Feature extraction is the process of generating new features from an existing set of features. For example, Auto-Encoders ² are commonly-used in dimensionality-reduction, taking an input vector and encoding into a vector with fewer dimensions.

Feature Selection¶

Feature selection units involve selecting the number of feature properties for model training.

Hyperparameter Tuning¶

Oftentimes, models will have hyperparameters which must be tuned prior to their use. For example, to develop LASSO Regression ³, an extra penalty term is added to the loss function. This penalty term is equal to a coefficient (Lambda) multiplied by the sum of the absolue values of the model coefficients. This extra penalty term helps prevent overfitting, and results in LASSO reducing some coefficients to 0 at higher values of lambda. By choosing lambda, one can control how many coefficients are extinguished. Becuse lambda is chosen before the model is trained, it is known as a hyperparameter to the model.

Model Training¶

Training a model refers to optimizing a model's parameters such that some loss function is minimized. For example, when a Convolutional Neural Network ⁴ is trained for image classification, the weights and biases of the network are adjusted to improve the model's performance against the training set.

Model Prediction¶

Prediction refers to taking a set of data for-which the targets are unknown, and using a model to decide what value the targets are likely to have. For example, a classifier might be used to predict whether a picture contains a dog or cat. A regressor might be used to predict a material's catalytic activity, given descriptors such as the band-gap or chemical potential.

Post-Processing¶

Post-processing refers to the actions taken after a model is trained, oftentimes aimed at simplifying a model for performance reasons. For example, this can include pruning ⁵ a neural network to reduce the number of mathematical optimizations that need to be performed, or knowledge distillation ⁶, where a simpler model is trained using the outputs of a more complex model.

In addition, on the platform we use "post-processing" to refer to any action which is taken after a model is trained, such as the visual display of data (such as training metrics or a parity plot).