# 1-bit Stochastic Gradient Descent (1-bit SGD)

1-bit Stochastic Gradient Descent is a technique from Microsoft Research aimed at increasing the data parallelism inherent in training deep neural networks. They describe the technique in the paper 1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs.

They accelerate training neural networks with stochastic gradient descent by:

1. splitting up the computation for each minibatch across many nodes in a distributed system.
2. reducing the bandwidth requirements for communication between nodes by exchanging gradients (instead of model parameters) and quantizing those gradients all the way to just 1 bit.
3. they add the quantization error from Step 2 into the next minibatch gradient before quantization.

1-bit Stochastic Gradient Descent is available is a technique in Microsoft’s Cognitive Toolkit (CNTK).

# 1x1 Convolution

A 1x1 convolution or a network in network is an architectural technique used in some convolutional neural networks.

The technique was first described in the paper Network In Network.

A 1x1 convolution is a convolutional layer where the filter is of dimension (1 1).

The filter takes in a tensor of dimension (n_h n_w n_c), over the (n_c) values in the third dimension and outputting a (n_h n_w) matrix. Subsequently, an activation function (like ReLU) is applied to the output matrix.

If we have (p) (1 1) filters, then the output of the layer is a tensor of dimension (n_h n_w p). This is useful if the number of channels (n_c) in the previous layer of the network has grown too large and needs to be altered to (p) channels.

The (1 1) convolution technique was featured in paper introducing the Inception network architecture, titled Going Deeper With Convolutions.

# Abstractive sentence summarization

Abstractive sentence summarization refers to creating a shorter version of a sentence with the same meaning.

This is in contrast to extractive sentence summarization, which pulls the most informative sentences from a document.

# ACCAMS

ACCAMS: Additive Co-Clustering to Approximate Matrices Succinctly

# Activation function

In neural networks, an activation function defines the output of a neuron.

The activation function takes the dot product of the input to the neuron ((x)) and the weights ((w)).

Typically activation functions are nonlinear, as that allows the network to approximate a wider variety of functions.

## References

AdaBoost, short for Adaptive Boosting, aka Weight Boosted Trees, uses the same training set over and over. Increases the margin (separation) like SVMs.

## References

AdaGrad is a gradient-descent based optimization algorithm. It automatically tunes the learning rate based on its observations of the data’s geometry. AdaGrad is designed to perform well with datasets that have infrequently-occurring features.

## References

ADAM, or Adaptive Moment Estimation, is a stochastic optimization method introduced by Diederik P. Kingma and Jimmy Lei Ba.

## References

The term adaptive learning rate refers to variants of stochastic gradient descent with learning rates that change over the course of the algorithm’s execution.

Allowing the learning rate to change dynamically eliminates the need to pick a “good” static learning rate, and can lead to faster training and a trained model with better performance.

# AlexNet

AlexNet is a convolutional neural network architecture proposed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton in 2012.

At the time, it achieved state-of-the-art performance on the test set for the 2010 ImageNet Large Scale Visual Recognition Competition (LSVRC). A variant of the model won the 2012 ImageNet LSVRC with a top-5 test error rate of 15.3%–ten percentage points ahead of the second place winner.

# Anchor box

Anchor boxes are a technique used in some computer vision object detection algorithms to help identify objects of different shapes.

Anchor boxes are hand-picked boxes of different height/width ratios (for 2-dimensional boxes) designed to match the relative ratios of the object classes being detected. For example, an object detector that detects cars and people may have a wide anchor box to detect cars and a tall, narrow box to detect people.

The Fast R-CNN paper introduced the idea of using the $$k$$-means-clustering to automatically determine the appropriate anchor box dimensions for a given $$k$$ number of anchor boxes.

# Autoencoder

Autoencoders are an unsupervised learning model that aim to learn distributed representations of data.

Typically an autoencoder is a neural network trained to predict its own input data. A large enough network will simply memorize the training set, but there are a few things that can be done to generate useful distributed representations of input data, including:

1. constraining the size of the model, forcing it to learn a lower-dimensional representation that can be used to re-construct the original higher-dimensional data points.
2. adding artificial noise to the initial data points, and training the autoencoder to predict the data points minus the artificial noise. See denoising autoencoder for more information.

# Backprop

See Backpropagation.

# Backpropagation

A technique to find good weight values in a neural network by trying different weights, and seeing if the change contributes positively to prediction quality.

• Backprop

# Bag-of-n-grams

A bag-of-$$n$$-grams model is a way to represent a document, similar to a [bag-of-words][/terms/bag-of-words/] model.

A bag-of-$$n$$-grams model represents a text document as an unordered collection of its $$n$$-grams.

For example, let’s use the following phrase and divide it into bi-grams ($$n = 2$$).

James is the best person ever.

becomes

• <start>James
• James is
• is the
• the best
• person ever.
• ever.<end>

In a typical bag-of-$$n$$-grams model, these 6 bigrams would be a sample from a large number of bigrams observed in a corpus. And then James is the best person ever. would be encoded in a representation showing which of the corpus’s bigrams were observed in the sentence.

A bag-of-$$n$$-grams model has the simplicity of the bag-of-words model, but allows the preservation of more word locality information.

# Bag-of-words

The phrase bag-of-words typically refers to a way of representing text in natural language processing, although it has been applied to computer vision.

A bag-of-words representation contains how many times each word appears in a document, but disregards the order of the words.

Often, bag-of-words models will only include the $$k$$ most frequent words in a corpus. This reduces the memory needed to store relatively-infrequent words, and the underlying representation of the document is mostly the same because common words dominate the document.

Bag-of-words models are often highly effective at representing documents in tasks like classification, clustering, or topic modeling. But they can struggle with tasks where word order matters, like sentiment analysis and machine translation. For example, in a bag-of-words model, the phrase “dog bites man” and “man bites dog” have identical representations.

# Bagging

Bagging, short for bootstrap aggregating, is training different base learners on different subsets of the training set randomly, by drawing random training sets from the given sample (with replacement).

# Beam search

Beam search is a memory-restricted version of breadth-first search.

Beam search commonly appears in machine translation or other literature where sequence-to-sequence learning is common. In this domain, beam search allows the neural network to consider many candidate responses instead of selecting the highest-scoring token at each step.

Google’s blog post announcing SyntaxNet explains the advantages of beam search for sequence-to-sequence learning in greater detail: > At each point in processing many decisions may be possible—due to ambiguity—and a neural network gives scores for competing decisions based on their plausibility. For this reason, it is very important to use beam search in the model. Instead of simply taking the first-best decision at each point, multiple partial hypotheses are kept at each step, with hypotheses only being discarded when there are several other higher-ranked hypotheses under consideration.

## References

The bias-variance tradeoff refers to the problem of minimizing two different sources of error when training a supervised learning model:

1. Bias - Bias is a consistent error, possibly from the algorithm having made an incorrect assumption about the training data. Bias is often related to underfitting.

2. Variance - Variances comes from a high sensitivity to differences in training data. Variance is often related to overfitting.

It is typically difficult to simultaneously minimize bias and variance.

# Boosting

Learners trained serially so that instances on which the preceding base learners are not accurate are given more emphasis in training later base-learners; actively tries to generate complementary learners, instead of leaving this to chance.

# Bounding box

A bounding box is a rectangle (in 2D datasets) or rectangular prism (in 3D datasets) drawn around an object identified in an image.

Object localization is a task in computer vision where a model is trained to draw bounding boxes around object detected in an image.

# Burstiness

Wikipedia defines burstiness as follows:

In statistics, burstiness is the intermittent increases and decreases in activity or frequency of an event. One of measures of burstiness is the Fano factor—a ratio between the variance and mean of counts.

## Word Burstiness

In natural language processing, burstiness has a slightly more specific definition, defined by Slava Katz in the mid 1990s.

The authors of Accounting for Burstiness in Topic Models give the following succinct definition of burstiness:

Church and Gale (1995) note that real texts systematically exhibit this phenomenon: a word is more likely to occur again in a document if it has already appeared in the document. Importantly, the burstiness of a word and its semantic content are positively correlated; words that are more informative are also more bursty.

Additionally, burstiness also tells us that later appearances of a word are less significant than the first appearance.

If a term is used once in a document, then it is likely to be used again. This phenomenon is called burstiness, and it implies that the second and later appearances of a word are less significant than the first appearance.

# Capture-recapture model

A capture-recapture model is a technique to estimate an unknown population by capturing, tagging, and re-capturing samples from the population.

In the article How many Mechanical Turk workers are there?, Panos Ipeirotis explains a simple version of a capture-recapture model as follows:

The simplest possible technique is the following:

• Capture/marking phase: Capture $$n_1$$ animals, mark them, and release them back.

• Recapture phase: A few days later, capture $$n_2$$ animals. Assuming there are $$N$$ animals overall, $$n_1/N$$ of them are marked. So, for each of the $$n_2$$ captured animals, the probability that the animal is marked is $$n_1/N$$ (from the capture/marking phase).

• Calculation: On expectation, we expect to see $$n_2 \cdot \frac{n_1}{N}$$ marked animals in the recapture phase. (Notice that we do not know $$N$$.) So, if we actually see $$m$$ marked animals during the recapture phase, we set $$m = n_2 \cdot \frac{n_1}{N}$$ and we get the estimate that:

$N=n_1 \cdot \frac{n_2}{m}$

He adds that this basic version of a capture-recapture model makes the following assumptions, and the estimate $$N$$ can be inaccurate when these assumptions are violated:

• Assumption of no arrivals / departures (“closed population”): The vanilla capture-recapture scheme assumes that there are no arrivals or departures of workers between the capture and recapture phase.

• Assumption of no selection bias (“equal catchability”): The vanilla capture-recapture scheme assumes that every worker in the population is equally likely to be captured.

# Catastrophic forgetting

Catastrophic forgetting (or catastrophic interference) is a problem in machine learning where a model forgets an existing learned pattern when learning a new one.

The model uses the same parameters to recognize both patterns, and learning the second pattern overwrites the parameters’ configuration from having learned the first pattern.

# Chunking

The paper Natural Language Processing (almost) from Scratch describes chunking as:

Also called shallow parsing, chunking aims at labeling segments of a sentence with syntactic constituents such as noun or verb phrases (NP or VP). Each word is assigned only one unique tag, often encoded as a begin-chunk (e.g., B-NP) or inside-chunk tag (e.g., I-NP).

# Clustering

## Related Terms

In neural networks, co-adaptation refers to when different hidden units in a neural networks have highly correlated behavior.

It is better for computational efficiency and the the model’s ability to learn a general representation if hidden units can detect features independently of each other.

A few different regularization techniques aim at reducing co-adapatation–dropout being a notable one.

# Community detection

Community detection refers to the problem of detecting whether a graph has community structure.

# Computer vision

Computer vision is the field of teaching computers to perceive sensor data–such as from cameras, RADAR, and LIDAR sensors–to achieve an understanding of what is in the data.

Computer vision is a wide-ranging field that comprises of many techniques and subfields.

# Connectionist Temporal Classification (CTC)

Connectionist Temporal Classification is a term coined in a paper by Graves et al. titled Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks.

It refers to the use of recurrent neural networks (which is a form of connectionism) for the purpose of labeling unsegmented data sequences (AKA temporal classification).

# Continuous-Bag-of-Words (CBOW)

Continuous Bag of Words refers to a algorithm that predicts a target word from its surrounding context.

CBOW is one of the algorithms used for training word2vec vectors.

# Convex combination

A convex combination is a linear combination, where all the coefficients are greater than 0 and sum to 1.

The Convex combination Wikipedia article gives the following example:

Given a finite number of points $$x_1, x_2, \ldots, x_n$$ in a real vector space, a convex combination of these points is a point of the form

$a_1 x_1 + a_2 x _2 + \ldots + a_n x_n$ is a convex combination if all real numbers $$a_i \geq 0$$ and $$a_1 + a_2 + \ldots + a_n = 1$$

# Convex hull

The convex hull of a set $$X$$ in an affine space over the reals is the smallest convex set that contains $$X$$. When the points are two dimensional, the convex hull can be thought of as the rubber band around the points of $$X$$.

As per Wikipedia, a convex set is the smallest affine space closed under convex combination.

A convex combination is a linear combination where all the coefficients are greater than 0 and all sum to 1.

# Convolution

Convolution is an operation where a filter (a small matrix) is applied to some input (typically a much larger matrix).

Convolution is a common operation in image processing, but convolution has also found some applications in natural language processing, audio processing, and other fields of machine learning.

Padding is a preprocessing step before a convolution operation. The input matrix is often padded to control the output dimensions of the convolution, or ot preserve information around the edges of the input matrix.

## Stride

Stride is the number of steps the filter takes in the convolution operation.

## Calculating the output dimensions of a convolution

For example, let’s say we have $$n \times n$$ matrix $$A$$ and a $$f \times f$$ filter $$F$$. The output dimension depends on two parameters – padding $$p$$ and stride $$s$$.

The dimensions for the output matrix $$A * F$$ will be

$\left \lfloor \frac{n + 2p - f}{s} \right \rfloor + 1 \times \left \lfloor \frac{n + 2p - f}{s} \right \rfloor + 1$.

In a same convolution, $$s = 1$$ and $$p = \frac{f - 1}{2}$$. The $$n \times n$$ matrix $$A$$ gets padded to $n + p n + p$ and the output matrix becomes $$n \times n$$.

# Coreference resolution

The Stanford NLP group defines coreference resolution as:

Coreference resolution is the task of finding all expressions that refer to the same entity in a text. It is an important step for a lot of higher level NLP tasks that involve natural language understanding such as document summarization, question answering, and information extraction.

Coreference resolution should not be confused with Named Entity Recognition, which is focused on labeling sequences of text that refer to entities–but not focused on linking those entities together.

# Data augmentation

Data augmentation is the process of using computer algorthms or other synthetic means to increase the size of a collected dataset.

Machine learning algorithms typically become more resistant to overfitting when they are trained with more data. But in many cases, it can be expensive to collect more data.

It is often possible to significantly increase the size of a dataset by computing simple transformations that are unlikely to be learned by the model, but also do not change the value of the labels.

For example, if one were building a model to identify pictures of cats versus non-cats, flipping the pictures horizontally or vertically increases the size of the available image data. When the pictures are flipped, whether they contain a cat or not doesn’t change. Despite this, many statistical models would be unlikely to learn flip-invariance without a significant amount of data–either collected from the real-world or generated through data augmentation.

# Data parallelism

Data parallelism is when data is distributed across multiple nodes in a distributed computing environment, and then each node acts on the data in parallel.

On each node, the computation is the same, but the data is different.

# Decision tree

A supervised learning method that iteratively refines a prediction by asking questions about the input feature most likely to affect the outcome, making a ‘tree’ of question branches.

# Deep Convolutional Generative Adversarial Network (DCGAN)

DCGAN refers to a model described by Radford, Metz, and Chintala that uses deep convolutional neural networks in a generative adversarial network model.

Generative adversarial networks (GANs) are structured as a competition between two models:

1. a generative model that tries to create fake examples of training data interspersed with real training data.
2. a discriminative model that tries to classify real examples from fake ones.

DCGAN uses deep convolutional neural networks for both models. Convolutional neural networks (CNNs) are well-known for their performance on image data. DCGAN uses the strong performance of (CNNs) to learn unsupervised representations of the input data.

# Deep Learning

Deep Learning is about learning using neural networks with multiple [hidden layers][2].

# Dimensionality reduction

Dimensionality reduction is about taking a set of data, and reducing its number of dimensions in such a way as to balance information size and independence of the features’ information. The point is to get a smaller dataset that still retains most of the original information.

# Distance metric learning

Distance metric learning is the task of using a labeled dataset to learn a similarity learning where the similarity function has to obey the four axioms of a distance metric.

The problem is typically defined with a dataset where there some of the datapoints are known to be “similar” and should be closer to each other than another arbitrarily-chosen datapoint in the dataset.

Distance metric learning first received significant attention in the machine learning community from the 2002 NIPS paper titled Distance metric learning, with application to clustering with side-information.

## Axioms of a distance metric

The four axioms of a distance metric are:

• Non-negativity: $$d(x, y) \geq 0$$ – The distance must always be greater than zero.
• Identity of indiscernibles: $$d(x, y) = 0 \Leftrightarrow x = y$$ – The distance must be zero for two elements that are the same (i.e. indiscernible from each other).
• Symmetry: $$d(x,y) = d(y,x)$$ – The distances must be the same, no matter which order the parameters are given.
• Triangle inequality: $$d(x,z) \leq d(x,y) + d(y,z)$$ – For three elements in the set, the sum of the distances for any two pairs must be greater than the distance for the remaining pair.

# Distance metric

As per Wikipedia, a distance metric, metric, or distance function, “is a function that defines a distance between each pair of elements of a set.”

A distance metric $$d(\cdot)$$ requires the following four axioms to be true for all elements $$x$$, $$y$$, and $$z$$ in a given set.

• Non-negativity: $$d(x, y) \geq 0$$ – The distance must always be greater than zero.
• Identity of indiscernibles: $$d(x, y) = 0 \Leftrightarrow x = y$$ – The distance must be zero for two elements that are the same (i.e. indiscernible from each other).
• Symmetry: $$d(x,y) = d(y,x)$$ – The distances must be the same, no matter which order the parameters are given.
• Triangle inequality: $$d(x,z) \leq d(x,y) + d(y,z)$$ – For three elements in the set, the sum of the distances for any two pairs must be greater than the distance for the remaining pair.

# Distributed representation

In machine learning, data with a local representation typically has 1 unit per element. A 5-word vocabulary might be defined by a 5-dimensional vector, with $$[1, 0, 0, 0, 0]^T$$ denoting the first word, $$[0, 1, 0, 0, 0]^T$$ denoting the second word, and so forth.

Distributed representations are the opposite, instead of concentrating the meaning of a data point into one component or one “element”, the meaning of the data is distributed across the whole vector.

The word that is $$[1, 0, 0, 0, 0]^T$$ in a local representation might look like $$[-0.150, -0.024, -0.233, -0.253, -0.183]^T$$ in a distributed representation.

# Distributional similarity

Distributional similarity is the idea that the meaning of words can be understood from their context.

This should not be confused with the term distributed representation, which refers to the idea of representing information with relatively dense vectors as opposed to a one-hot representation.

# doc2vec

doc2vec is the gensim library’s name for its paragraph vector implementation. doc2vec can be used to generate unsupervised representations of sentences, paragraphs, and documents.

# Dropout

Dropout is a technique for regularizing neural networks, developed by Hinton et al. and published in the paper Improving neural networks by preventing co-adaptation of feature detectors.

The core idea behind dropout is to randomly set some of the weights in a neural network to $$0$$ during the training phase.

Dropout add a hyperparameter of a “keep probability”. This is the probability that a weight value is left undisturbed– that it will not be set to $$0$$.

Dropout can be considered analogous to model-averaging because the process simulates training many similar neural networks on the same data.

# Dying ReLU

Dying ReLU refers to a problem when training neural networks with rectified linear units (ReLU). The unit dies when it only outputs 0 for any given input.

When training with stochastic gradient descent, the unit is not likely to return to life, and the unit will no longer be useful during training.

Leaky ReLU is a variant that solves the Dying ReLU problem by returning a small value when the input $$x$$ is less than 0.

# Embed, Encode, Attend, Predict

The phrase Embed, Encode, Attend, Predict refers to Matthew Honnibal’s conceptual framework for deep learning for natural language processing.

The steps have the following meanings:

1. Embed – This is the process of turning text or sparse vectors into dense word embeddigs. These embeddings are much easier to work with than other representations, and do an excellent job of capturing semantic information.

2. Encode – This is the process of encoding a sequence of word vectors into a matrix, using techniques like recurrent neural networks or LSTMs.

3. Attend – This refers to taking the matrix from the Encode step and transforming it into a vector, most likely using an attention mechanism.

4. Predict – The final step in the Natural Language Processing pipeline is making a prediction given the input text.

# Entailment

The word entailment has many meanings. Please review Wikipedia’s disambiguation page for “entail” for definitions that do not refer to textual entailment.

Textual entailment refers to drawing a relation from an entailing text to an entailed hypothesis.

Textual entailment can be illustrated with examples of three different relations:

An example of a positive TE (text entails hypothesis) is:

• text: If you help the needy, God will reward you.
• hypothesis: Giving money to a poor man has good consequences.

An example of a negative TE (text contradicts hypothesis) is:

• text: If you help the needy, God will reward you.
• hypothesis: Giving money to a poor man has no consequences.

An example of a non-TE (text does not entail nor contradict) is:

• text: If you help the needy, God will reward you.
• hypothesis: Giving money to a poor man will make you a better person.

## Synonyms

• Textual entailment

# Extractive sentence summarization

Extractive sentence summarization refers to programmatically creating a shorter version of a document by extracting the “important” parts.

TextRank is an example of an algorithm that can rank sentences in a document for the purpose of extractive summarization.

# Face detection

Face detection is the problem of detecting wheter an image has a (usually human) face in it.

The problem of identifying whether the image has a specific single person’s face is called face verification. The problem of identifying whether the image has any of $$k$$ person’s faces is called face recognition.

# Face recognition

Face recognition is the problem of identifying whether an input image contains the faces of any of $$k$$ people… or if the image has none of the $$k$$ faces.

Face recognition is a harder problem than face verification because face verification only compares a single image to one person, whereas face recognition does this for $$k$$ people.

# Face verification

Face verification is the problem of identifying whether an image belongs to a person–given the one image and one person as the input.

Face verification is an easier problem than face recognition because face verification only compares a single image to one person, whereas face recognition does this for $$k$$ people.

# Facet

In statistical plotting, a facet is a type of plot. Data is split into subsets and the subsets are plotted in a row or grid of subplots.

The term is common among users of ggplot2, a plotting package for the R statistical computing language.

Facet is also the name of a plotting and visualizaton tool created by the People + AI Research (PAIR) initiative at Google.

# Facet (disambiguation)

A facet can refer to a type of plot or chart designed to efficiently display multidimensional data.

Facets is also the name of a tool from Google’s People + AI Research (PAIR) lab designed to help explore datasets.

# Facet (visualization tool)

Facets is a a plotting and visualizaton tool created by the People + AI Research (PAIR) initiative at Google.

Facets is broken into two tools with the following goals:

• Facets Overview – summarize statistics for features collected from datasets
• Facets Dive – explore the relationship between different features in a dataset

From the Facets homepage, they state that

Success stories of (Facets) Dive include the detection of classifier failure, identification of systematic errors, evaluating ground truth and potential new signals for ranking.

# fastText

fastText is a project from Facebook Research for producing word embeddings and sentence classification.

The fastText project is hosted on Github and instructions for using their pre-trained word embeddings can be found here.

# Feature learning

See Representation learning.

# Filter (convolution)

A filter (also known as a kernel) is a small matrix used in convolution operations.

Convolution filters are commonly used in image processing to modify images or extract features.

The dimensions of a convolution filter are typically small, odd, and square. For example, convolution filters are typically $$3 \times 3$$ or $$5 \times 5$$ matrices. Odd dimensions are preferred to even dimensions.

## Synonyms

• Kernel (convolution)

# First-order information

First-order information is a term used to mean information obtained by computing the first derivative of a function. The first derivative of a function reveals the slope of a tangent line to the function. This gives a general idea of how the function is changing at that point, but does not give information about the curvature of the function–the second derivative is required for that.

First-order information should not be confused with first-order logic.

# GloVe (Global Vectors) embeddings

GloVe, or Global Vectors, refers to a word embedding algorithm from the Stanford NLP group.

## References

Gradient descent is an optimization algorithm designed to find the minimum of a function. Many machine learning algorithms use gradient descent or a variant.

Common variants include: - Stochastic Gradient Descent (SGD) - Minibatch Gradient Descent

## Related Terms

The gradient is the vector generalization of the derivative.

For a function $$f([x_1, \ldots, x_n]^T)$$, the gradient $$\nabla_x f([x_1, \ldots, x_n]^T)$$ is the vector containing the $$n$$ partial derivatives of $$f$$ with respect to each $$x_i$$.

# Gram matrix

Given a set $$V$$ of $$m$$ vectors (points in $$\mathcal R^n$$), the Gram matrix $$G$$ is the matrix of all possible inner products of $$V$$, i.e., $g_{ij} = \mathbf v_i^T \mathbf v_j$

# Grid search

The Hadamard product refers to component-wise multiplication of the same dimension. The $$\odot$$ symbol is commonly used as the Hadamard product operator.

Here is an example for the Hadamard product for a pair of $$3 \times 3$$ matrices.

$\begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \odot \begin{bmatrix} j & k & l \\ m & n & o \\ p & q & r \end{bmatrix} = \begin{bmatrix} aj & bk & cl \\ dm & ne & fo \\ gp & hq & ir \end{bmatrix}$

# Harmonic Precision-Recall Mean (F1 Score)

The $$F_1$$ score is a classification accuracy metric that combines precision and recall. It is designed to be useful metric when classifying between unbalanced classes or other cases when simpler metrics could be misleading.

When classifying between two cases (“positive” and “negative”), there are the four possible results of prediction:

 Actual Positive Actual Negative Predicted Positive True Positives False Positives Predicted Negative False Negatives True Negatives

Precision answers the question, “What fraction of positive predictions are true predictions?”

A cancer diagnostic test that suggested that all patients have cancer would achieve

$\mathrm{Precision} = \frac{\mathrm{True\;Positives}}{\mathrm{True\;Positives} + \mathrm{False\;Positives}}$

Recall answers the question, “Out of all the true positives, what fraction of them did we identify?”

A cancer diagnostic test that suggested that all patients have cancer would achieve perfect recall, as all patients that actually have cancer would be identified.

$\mathrm{Recall} = \frac{\mathrm{True\;Positives}}{\mathrm{True\;Positives + False\;Negatives}}$

The F1 score is a way to combine precision and recall in the following way:

$F_1 = 2 * \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

For a classifier to have a high $$F_1$$ score, it needs to have high precision and high recall.

# He initialization

The term He initialization refers to the first author in the paper “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”.

He initialization initializes the bias vectors of a neural network to $$0$$ and the weights to random numbers drawn from a Gaussian distribution where the mean is $$0$$ and the variance is $$\sqrt(2/n_l)$$ where $$n_l$$ is the dimension of the previous layer.

# Helvetica scenario

See Mode collapse.

# Hinge loss

From the scikit-learn documentation, we get the following definition:

The hinge_loss function computes the average distance between the model and the data using hinge loss, a one-sided metric that considers only prediction errors. (Hinge loss is used in maximal margin classifiers such as support vector machines.)

# Hopfield Network

A Hopfield network (HN) is a type of recurrent neural network(RNN). The HNs have only one layer, with each neuron connected to every other neuron: All neurons act as input and output. The model of the network consists of a set of neurons and corresponding set of unit delays, forming a multiple loop feedback system. The output of a neuron is, via a unit delay element, sent to all of the others neurons in the HN, except itself.

Some important properties of HNs are that all neurons have binary outputs (either 0, 1 or -1, 1), it isn’t allowed a connection from a neuron to itself, the neurons are updated at random order (asynchronously), and the weights between the neurons are symmetric. This last property guarantees that the energy function decreases while following the activation function rules, and HNs are guaranteed to converge to a local minimum.

Hopfield networks are used for pattern recognition, and provides a model for understanding human memory. A HN is initially trained to store a number of patterns, and then it’s able to recognize any of the learned patterns by exposure to only partial or even corrupted information about the pattern (it returns the closest pattern or the best guess). This property makes the Hopfield network a form of associative memory (or content-addressable memory). It’s applications include image detection and recognition, enhancement of X-Ray images, medical images restoration, etc.

# Hypergraph

A hypergraph is a generalization of the graph. A graph has edges that connect pairs of vertices, but a hypergraph has hyperedges that can connect any number of vertices.

# Inception

Inception refers to a particular neural network model in the CVPR 2015 paper titled Going Deeper With Convolutions.

# Inceptionism

Inceptionism refers to a visualization technique to understand what a neural network learned. The network is fed an image, asked what the network detected, and then that feature in the image is amplified. The full technique is described in the Google Research blog post titled Inceptionism: Going Deeper into Neural Networks.

# Independent and Identically Distributed (i.i.d)

A collection of random variables is independent and identically distributed if they have these properties:

1. they all have the same probability distribution.
2. they are all mutually independent of each other.

# Internal covariate shift

The term interal covariate shift comes from the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.

The authors’ precise definition is:

We define Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training.

In neural networks, the output of the first layer feeds into the second layer, the output of the second layer feeds into the third, and so on. When the parameters of a layer change, so does the distribution of inputs to subsequent layers.

These shifts in input distributions can be problematic for neural networks, especially deep neural networks that could have a large number of layers.

Batch normalization is a method intended to mitigate internal covariate shift for neural networks.

# Inverted dropout

Inverted dropout is a variant of the original dropout technique developed by Hinton et al.

Just like traditional dropout, inverted dropout randomly keeps some weights and sets others to zero. This is known as the “keep probability” $$p$$.

The one difference is that, during the training of a neural network, inverted dropout scales the activations by the inverse of the keep probability $$q = 1 - p$$.

This prevents network’s activations from getting too large, and does not require any changes to the network during evaluation.

In contrast, traditional dropout requires scaling to be implemented during the test phase.

# Jaccard index (intersection over union)

The Jaccard index–otherwise known as intersection over union–is used to calculate the similarity or difference of sample sets.

$J(\mathbb{A}, \mathbb{B}) = \frac{\left | \mathbb{A} \cap \mathbb{B} \right |} {\left | \mathbb{A} \cup \mathbb{B} \right |}$

$0 \leq J(\mathbb{A}, \mathbb{B}) \leq 1$

The index is defined to be 1 if the sets are empty.

# Kernel (convolution)

See Filter (convolution).

# Language segmentation

This phrase is most concisely described in in this work by David Alfter:

Language segmentation consists in finding the boundaries where one language ends and another language begins in a text written in more than one language. This is important for all natural language processing tasks.

[…]

One important point that has to be borne in mind is the difference between language identification and language segmentation. Language identification is concerned with recognizing the language at hand. It is possible to use language identification for language segmentation. Indeed, by identifying the languages in a text, the segmentation is implicitly obtained. Language segmentation on the other hand is only concerned with identifying language boundaries. No claims about the languages involved are made.

# Latent Dirichlet Allocation Differential Evolution (LDADE)

LDADE is a tool proposed by Agrawal et al. in a paper titled What is Wrong with Topic Modeling? (and How to Fix it Using Search-based Software Engineering).

It tunes LDA parameters using Differential Evolution to increase the clustering stability of standard LDA.

# Latent semantic analysis (LSA)

See Latent Semantic Indexing (LSI).

# Latent Semantic Indexing (LSI)

## Synonyms

• Latent semantic analysis (LSA)

# Leaky ReLU

Leaky ReLU is a type of activation function that tries to solve the Dying ReLU problem.

A traditional rectified linear unit $$f(x)$$ returns 0 when $$x \leq 0$$. The Dying ReLU problem refers to when the unit gets stuck this way–always returning 0 for any input.

Leaky ReLU aims to fix this by returning a small, negative, non-zero value instead of 0, as such:

$f(x) = \begin{cases} \max(0,x) & x > 0 \\ \alpha x & x \leq 0 \end{cases}$ where $$\alpha$$ is typically a small value like $$\alpha = 0.0001$$.

# Learning rate annealing

See Learning rate decay.

# Learning rate decay

## Synonyms

• Learning rate annealing

# Learning To Rank (LTR)

Learning-to-rank is the application of machine learning to ranking search results, recommendations, or similar information.

# LeNet

LeNet was an early convolutional neural network proposed by Lecun et al in the paper Gradient-Based Learning Applied to Document Recognition.

LeNet was designed for handwriting recognition. Many modern convolutional neural network architectures are inspired by LeNet.

# Likelihood

In statistics, likelihood is the hypothetical probability that a past event would yield a specific outcome. Probability is concerned with the future, but likelihood is concerned with the past.

# Long Short-Term Memory (LSTM)

Long short term memory (LSTM) networks try to reduce the vanishing and exploding gradient problem during the backpropagation in recurrent neural networks. LSTM are in general, a RNN where each neuron has a memory cell and three gates: input, output and forget. The purpose of the memory cell is to retain information previously used by the RNN, or forget if needed. LSTMs are explicitly designed to avoid the long-term dependency problem in RNNs, and have been shown to be able to learn complex sequences better than simple RNNs.

The structure of a memory cell is: an input gate, that determines how much of information from the previous layer gets stored in the cell; the output gate, that determines how of the next layer gets to know about the state of the current cell; and the forget gate, which determines what to forget about the current state of the memory cell.

# Magnet loss

Magnet loss is a type of loss function used in distance metric learning machine learning problems. It was introduced in the paper Metric Learning with Adaptive Density Discrimination. The authors of the paper introduced magnet loss as an improvement over triplet loss and other loss functions designed to learn a distance metric.

Instead of working on individuals, pairs, or triplets of data points, magnet loss operates on entire regions of the embedding space that the data points inhabit. Magnet loss models the distributions of different classes in the embedding space and works to reduce the overlap between distributions.

# Margin

In machine learning, a margin often refers to the distance between the two hyperplanes that separate linearly-separable classes of data points.

The term is most commonly used when discussing support vector machines, but often appears in other literature discussing boundaries between points in a vector space.

# Mean Reciprocal Rank (MRR)

$$\newcommand{\Correctrank}{\mathrm{rank}}$$

Mean Reciprocal Rank is a measure to evaluate systems that return a ranked list of answers to queries.

For a single query, the reciprocal rank is $$\frac 1 \Correctrank$$ where $$\Correctrank$$ is the position of the highest-ranked answer ($$1, 2, 3, \ldots, N$$ for $$N$$ answers returned in a query). If no correct answer was returned in the query, then the reciprocal rank is 0.

For multiple queries $$Q$$, the Mean Reciprocal Rank is the mean of the $$Q$$ reciprocal ranks.

$\mathrm{MRR} = \frac 1 Q \sum_{i=1}^{Q} \frac 1 {\Correctrank_i}$

# METEOR

METEOR is an automatic evaluation metric for machine translation, designed to mitigate perceived weaknesses in BLEU. METEOR scores machine translation hypotheses by aligning them to reference translations, much like BLEU does.

## Synonyms

• Mini-Batching

# Mode collapse

Mode collapse, also known as the Helvetica scenario, is a common problme when training generative adversarial networks.

## Synonyms

• Helvetica scenario

# Model parallelism

Model parallelism is where multiple computing nodes evaluate the same model with the same data, but using different parameters or hyperparameters.

In contrast to model parallelism, data parallelism where the different computing nodes have the same parameters but different data.

# Momentum

Momentum is commonly understood as a variation of stochastic gradient descent, but with one important difference: stochastic gradient descent can unnecessarily oscillate, and doesn’t accelerate based on the shape of the curve.

In contrast, momentum can dampen oscillations and accelerate convergence.

Momentum was originally proposed in 1964 by Boris T. Polyak.

# Multinomial mixture model

A multinomial mixture model is a mixture of multinomial distributions.

The Wikipedia page for the multinomial distribution notes the following regarding the relationship between the multinomial distribution and the categorical distribution:

Note that, in some fields, such as natural language processing, the categorical and multinomial distributions are conflated, and it is common to speak of a “multinomial distribution” when a categorical distribution is actually meant. This stems from the fact that it is sometimes convenient to express the outcome of a categorical distribution as a “1-of-$$K$$” vector (a vector with one element containing a 1 and all other elements containing a 0) rather than as an integer in the range $${\displaystyle 1\dots K} 1 \dots K$$; in this form, a categorical distribution is equivalent to a multinomial distribution over a single trial.

This is important to remember when reading about categorical mixture models versus multinomial mixture models.

# Multiple crops at test time

Multi-crop at test time is a form of data augmentation that a model uses during test time, as opposed to most data augmentation techniques that run during training time.

• cropping a test image in multiple ways
• using the model to classify these cropped variants of the test image
• averaging the results of the model’s many predictions

Multi-crop at test time is a technique that some machine learning researchers use to improve accuracy at test time. The technique found popularity among some competitors in the ImageNet Large Scale Visual Recognition Competition after the famous AlexNet paper, titled ImageNet Classification with Deep Convolutional Neural Networks, used the technique.

# Named Entity Recognition in Query (NERQ)

Named Entity Recognition in Query is a phrase used in a research paper and patent from Microsoft, referring to the Named Entity Recognition problem in web search queries.

# Neural checklist model

Neural checklist models were introduced in the paper Globally Coherent Text Generation with Neural Checklist Models by Kiddon et al.

A neural checklist model is a recurrent neural network that tracks an agenda of text strings that should be mentioned in the output.

This technique allows the neural checklist model to generate globally coherent text, as opposed to text from traditional RNNs that is only locally coherent.

The original paper describes applying the neural checklist model to recipes and dialogue responses for information systems, where there already exists a pre-existing notion of all the topics that should be present in a natural language response.

# Neural style transfer

Neural style transfer refers to the use of neural networks to apply the style of a style image to a content image.

(Most of the literature in neural style transfer refers to images, but recent research has explored the use of neural style transfer techniques to other domains.)

## Examples

The below examples of neural style transfer are from Justin Johnson of Stanford University. The top left image is Vincent van Gogh’s The Starry Night and the top right image is a photograph of Stanford’s Hoover tower. The bottom image is generated by Justin Johnson using neural style transfer.

## History

In the 1990s, some computer science researchers began exploring non-photorealistic rendering, which offered ways to generate images inspired by the style and texture of specific artwork (such as oil paintings). However, these techniques were generally limited in the types of styles they could generate target images for.

This limitation began to disappear in 2015 with the publication of the paper A Neural Algorithm of Artistic Style. The paper’s authors demonstrated the use of convolutional neural networks to generate images inspired by many different styles.

## Loss Function

Broadly, neural style transfer is the technique of using gradient descent to minimize the loss function over the following variables:

• $$x_c$$ – our input content image. In the above example, this would be the photograph of Stanford University’s Hoover Tower.
• $$x_s$$ – our style image. In the above example, this would be van Gogh’s The Starry Night.
• $$x$$ – our generated image. In the above example, this is the generated image of Hoover Tower in the style of The Starry Night.

We intend to find a value of $$x$$ that minimizes $$\alpha E_c(x, x_c) + \beta E_s(x, x_s)$$ where:

• $$E_c(\cdot)$$ is a function that compares the difference of content between $$x$$ and $$x_c$$.
• $$E_s(\cdot)$$ is a function that compares the difference of style between $$x$$ and $$x_s$$.
• $$\alpha$$ and $$\beta$$ are weighting factors to balance whether the generated image favors accurate content or accurate style.

# Neural Turing Machine (NTM)

Neural Turing Machines (NTM) consists of a RNN (commonly with LSTM), and a memory bank, where the neural network can make write and read operations. By making each operation of the NTM differentiable, it can be trained efficiently trained with gradient descent.

The main idea of the NTM is to use the memory bank – a large, addressable memory – to give a memory to the RNN so that it can read and write to, yielding a practical mechanism to learn programs. The NTM has been shown to be able to infer simple algorithms, such as copying, sorting and associative recall from input and output examples.

# No Free Lunch (NFL) theorem

The “No Free Lunch” theorem is the idea that all optimizers perform equally well when averaged across all possible optimization problems.

# Non-max suppression

Non-max suppression refers to the idea of suppressing predicted information that is not predicted with the highest confidence.

Non-max suppression is commonly used in computer vision software, where a computer vision model may identify many nearby redundant edges or bounding boxes.

To reduce this redundancy, only the maximum feature in a set of related features are kept. For example, a computer vision model might identify a single object using multiple overlapping bounding boxes. The bounding box with the highest prediction probability is kept, and the overlapping boxes are removed in favor of the kept box. Other boxes may be unaffected by non-max suppression if they are too far away.

In the case of bounding boxes, non-max suppression is said to prune low-confidence bounding boxes that have a high intersection over union with the highest-confidence bounding box they intersect with.

# Object localization

Object localization is the machine learning problem that encompasses object detection–finding whether an object exists exists in an image–and finding the location of the object an image.

The location of an object in an image is typically represented as a “bounding box”.

## An example of 2-dimensional object localization

Below is a rough example of what a machine learning model would output if trained and evaluated on 2D images for the purpose of drawing 2D bounding boxes around a single object. In the real world, many of the exact details may vary.

To train a machine learning model to identify the presence or absence of a single object, the model’s output would be a single 1 (for the object’s presence) or 0 (for the object’s absence). We’ll refer to this as $$p_{\mathrm{exists}}$$.

But to draw a two-dimensional bounding box showing where the object is in the image, the model would need to process the image ane return a vector that looked like the following:

$\begin{bmatrix} p_{\mathrm{exists}} \\ b_{\mathrm x} \\ b_{\mathrm y} \\ b_{\mathrm{width}} \\ b_{\mathrm{height}} \end{bmatrix}$

If the object exists, then the remaining values in the vector have the following meanings:

• $$b_{\mathrm x}$$ is the X position of the object in the image. This may either be the center of the object, or the X position of one of the corners.
• $$b_{\mathrm y}$$ is the Y position of the object in the image. This may either be the center of the object, or the Y position of one of the corners.
• $$b_{\mathrm{width}}$$ is the width of the bounding box.
• $$b_{\mathrm{height}}$$ is the height of the bounding box.

When the object does not exist in the given image, many machine learning models may return vectors with random numbers.

When training an object localization model, one typically uses a dataset where $$p_{\mathrm{exists}}$$ is known to be 1 or 0. When evaluating an object localization model, the model typically returns $$p_{\mathrm{exists}}$$ as a probability between 0 and 1. When using this probability, we may pick a threshold–such as 0.5–where we assume the model has found an object.

## Different variations in object localization

In addition to 2D images from digital cameras, object localization is also relevant in other sensor data that might capture objects–such as a RADAR or LIDAR scan.

Object localization models may be trained to identify multiple objects, or to identify objects with 3D bounding boxes or non-rectangular bounding “boxes”.

# One-hot encoding

One-hot encoding refers to a way of transforming data into vectors where all components are 0, except for one component with a value of 1, e,g.: $0 = [1, 0, 0, 0, 0]^T$ $1 = [0, 1, 0, 0, 0]^T$ $\ldots$ $4 = [0, 0, 0, 0, 1]^T$ and so on.

One-hot encoding can make it easier for machine learning algorithms to manipulate and learn categorical variables.

# One-shot learning

One-shot learning refers to the problem of training a statistical model (such as a classifier) with only a single example per class.

One way to build a system capable of one-shot learning is to use representation learning, to learn representations or features of data that can be used to accurately classify a single example.

# Out-of-core

The term out-of-core typically refers to processing data that is too large to fit into a computer’s main memory.

Typically, when a dataset fits neatly into a computer’s main memory, randomly accessing sections of data has a (relatively) small performance penalty.

When data must be stored in a medium like a large spinning hard drive or an external computer network, it becomes very expensive to randomly seek to an arbitrary section of data or to process the same data multiple times.

In such a case, an out-of-core algorithm would try to access all relevant data in one sequence.

However, modern computers have a deep memory hierarchy, and replacing random access with sequential access can increase performance even on datasets that fit within memory.

# Overfitting

Overfitting is when a statistical model learns patterns in the training data that are too complex to generalize well.

## References

Padding is a preprocessing step before a convolution operation.

When we convolve a $$n \times n$$ image with an $$f \times f$$ filter and a stride length of $$1$$, the output is a matrix of dimension $$n - f \times n - f$$.

For deep convolutional neural networks that may do many convolutions, this would cause the input matrix to dramatically shrink and become too small.

Additionally, values in the middle of the input matrix have a greater influence on the output than values on the edges.

There are several different methods for choosing what values to pad an input matrix with:

• Repeating the nearest border values as values for padding
• Using values from the opposite side of the matrix as padding values

# Paragraph vector

Paragraph Vectors is the name of the model proposed by Le and Mikolov to generate unsupervised representations of sentences, paragraphs, or entire documents without losing local word order.

This is in contrast to bag-of-words representations, which can offer useful representations of documents but lose all word order information.

# Parameter budget

A parameter budget refers to the idea of constraining the number of learnable parameters for a machine learning model. Some types of parameters are more useful for improving a model than others, thus they should be prioritized in a model with a restricted parameter budget.

In neural networks, deeper networks seem to work better when the parameter budget is constrained.

A related idea is the computational budget, but the budget for overall computation is not strictly tied to the number of parameters in a model.

# Parameter sharing

See Weight sharing.

# Part-of-Speech (POS) Tagging

Part-of-Speech tagging is the process of reading natural language text and assigning parts of speech to each token.

One could imagine taking in a sentence like:

The dog ran away.

and creating a data structure that had the following annotations:

Words can have different parts-of-speech depending on their context. For example, the word away can be either an adverb or an adjective, or part of a larger phrase.

# PCA Color Augmentation

The term PCA Color Augmentation refers to a type of data augmentation technique first mentioned in the paper titled ImageNet Classification with Deep Convolutional Neural Networks. This paper is famous for introducing the AlexNet convolutional neural network architecture, which won the 2012 ImageNet Large Scale Visual Recognition Competition.

In general, data augmentation is the process of increasing the size of a dataset by transforming it in ways that a neural network is unlikely to learn by itself. For example, image-recognition datasets often train with images flipped vertically or horizontally.

Another form of data augmentation on image datasets is to alter the color balance of the image–for example, adjusting the values of the red, green, and blue pixels in the image.

Specifically, PCA Color Augmentation is designed to shift those values based on which values are the most present in the image. Images with heavy red values and minimal green values will have their red values altered the most through PCA Color Augmentation.

The specific mechanism relies on the principal component analysis (PCA) algorithm to find the relative color balance of a given image.

The AlexNet paper describes PCA Color Augmentation in this paragraph:

Specifically, we perform PCA on the set of RGB pixel values throughout the ImageNet training set. To each training image, we add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1. Therefore to each RGB image pixel $$I_{xy} = \left [ I^R_{xy}, I^G_{xy}, I^B_{xy} \right ]^T$$, we add the following quantity:

$\left [ \mathbf p_1, \mathbf p_2, \mathbf p_3 \right ] \left [ \alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3 \right ]^T$

# Peephole connection

Peephole connections refer to a modification to the basic LSTM architecture. WildML describes LSTM peephole connections as the following:

Notably, there exist several variations on the basic LSTM architecture. A common one is creating peephole connections that allow the gates to not only depend on the previous hidden state $$s_{t-1}$$, but also on the previous internal state $$c_{t-1}$$, adding an additional term in the gate equations.

Peephole connections were originally introduced by Gers and Schmidhuber in 2000 to help LSTMs learn precise timings. From their abstract:

Surprisingly, LSTM augmented by “peephole connections” from its internal cells to its multiplicative gates can learn the fine distinction between sequences of spikes separated by either 50 or 49 discrete time steps.

The paper LSTM: A Search Space Odyssey gives an overview of various types of LSTM architectures and their performance, including LSTMs with peephole connections.

# Perplexity

Wikipedia defines perplexity as the following:

In information theory, perplexity is a measurement of how well a probability distribution or probability model predicts a sample. It may be used to compare probability models. A low perplexity indicates the probability distribution is good at predicting the sample.

# Point Estimator

A point estimator estimates population parameters (e.g. mean, variance) with sample data.

# Pooling layer

A pooling layer is a common type of layer in a convolutional neural network (CNN). A pooling layer does not contain any weights that need to be learned during neural network training.

However, pooling layers come with two hyperparameters: - Stride $$s$$ - Filter (or kernel) size $$f$$

Both of these hyperparameters have the same meaning as they do for convolutional layers.

Pooling layers help reduce the amount of computation a convolutional neural network needs, and can often help increase the performance of a CNN.

A pooling layer will turn an $$n \times n$$ matrix into a $$\left \lfloor \frac{n - f}{s} \right ]\rfloor + 1 \times \left \lfloor \frac{n - f}{s} \right ]\rfloor + 1$$ matrix.

There are several types of pooling: - Max pooling, the most common type of pooling in convolutional neural networks. - Average pooling. - Global Average Pooling (GAP). - $$k$$-Max Pooling and the closely-related Dynamic $$k$$-Max Pooling, which have found applications in neural networks for natural language processing.

# Provenance tracking

In data acquisition, provenance tracking is keeping track of “where each piece of data comes from and whether it is still up-to-date”. This quoted definition comes from Philip Guo in an article for the ACM where he elaborates:

The main problem that programmers face in data acquisition is keeping track of provenance, i.e., where each piece of data comes from and whether it is still up-to-date. It is important to accurately track provenance, since data often needs to be re-acquired in the future to run updated experiments. Re-acquisition can occur either when the original data sources get updated or when researchers want to test alternate hypotheses. Also, provenance can enable downstream analysis errors to be traced back to the original data sources.

In a data science tutorial, Andrew Davison writes about the history of the term:

The term comes originally from the art world, where it refers to the chronology of the ownership and/or location of a work of art. Having detailed evidence of provenance can help to establish that a painting has not been altered or stolen, and is not a forgery. Wikipedia has a nice entry on the provenance of the Arnolfini Portrait by van Eyck. More recently, the term has been applied to other fields, including archaeology, palaeontology, and science more generally, where it refers to having knowledge of all the steps involved in producing a scientific result, such as a figure, from experiment design through acquisition of raw data, and all the subsequent steps of data selection, analysis and visualization. Such information is necessary for reproduction of a given result, and can serve to establish precedence (in case of patents, Nobel prizes, etc.)

# Random initialization

Random initialization refers to the practice of using random numbers to initialize the weights of a machine learning model.

Random initialization is one way of performing symmetry breaking, which is the act of preventing all of the weights in the machine learning model from being the same.

# Rectified Linear Unit (ReLU)

A Rectified Linear Unit is a common name for a neuron (the “unit”) with an activation function of $$f(x) = \max(0,x)$$.

Neural networks built with ReLU have the following advantages:

• gradient computation is simpler because the activation function is computationally similar than comparable activation functions like $$\tanh(x)$$.
• Neural networks with ReLU are less susceptible to the vanishing gradient problem but may suffer from the dying ReLU problem.

# Recurrent Neural Network Language Model (RNNLM)

A Recurrent Neural Network Language Model (RNNLM) is a recurrent neural network tasked with modeling languages.

# Recurrent Neural Network

Recurrent neural networks (RNN) are feed-forward neural networks, but differently than traditional feed-forward models, RNNs contain an internal memory. A RNN have an internal loop that allows information to persist in the network. Neurons receive information not just from the previous layer, but also from themselves from the previous pass. This means that the order of inputs to the RNN matter, and may give different results with different order, as RNNs have state.

RNNs are particularly sensitive to the vanishing and exploding gradient problems, where depending on the activation functions used, the information can get lost over time. Long short-term memory networks(LSTMs) addresses this problem. RNNs are commonly used with sequential data, due to the fact that an RNN is sensitive to the order of the inputs, like in the natural language processing area. Common examples of sequential data includes texts, audio and video.

# Reinforcement Learning (RL)

Reinforcement learning is about learning from feedback (reinforcement) while learning ‘on the job’, i.e. learning by trying, rather than from labeled answer data. This is how robots may learn, but is also used for playing games, a tight feedback loop through e.g. score may help give the algorithm an idea of what works well.

# Representation learning

## Synonyms

• Feature learning

# ResNet

ResNet stands for “Residual Network” and was introduced in the paper Deep Residual Learning for Image Recognition. ResNet won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2015 competition.

# Same convolution

A same convolution is a type of convolution where the output matrix is of the same dimension as the input matrix.

For a $$n \times n$$ input matrix $$A$$ and a $$f \times f$$ filter matrix $$F$$, the output of the convolution $$A * F$$ is of dimension $$\left \lfloor \frac{n + 2p - f}{s} \right \rfloor + 1 \times \left \lfloor \frac{n + 2p - f}{s} \right \rfloor + 1$$ where $$s$$ represents the stride length and $$p$$ represents the padding.

In a same convolution:

• $$s$$ is typically set to $$1$$
• $$p$$ is set to $$\frac{f - 1}{2}$$
• $$f$$ is an odd number

The result is that $$A$$ is padded to be $$n + p \times n + p$$ and $$A * F$$ becomes $$n \times n$$ – the same as the original dimensions of $$A$$.

# Search-based software engineering (SBSE)

Search-based software engineering applies search and optimization techniques to software engineering problems.

[LDADE][/terms/latent-dirichlet-allocation-differential-evolution-ldade] is an example of a system that applies search-based software engineering to optimize topic modeling.

# Second-order information

The term second-order information refers to information about a function gained by computing its second derivative. The second derivative reveals information about the function’s curvature.

# Self-supervised learning

Self-supervised learning is a type of supervised learning where the training labels are determined by the input data.

word2vec and similar word embeddings are a good example of self-supervised learning. word2vec models predict a word from its surrounding words (and vice versa). Unlike “traditional” supervised learning, the class labels are not separate from the input data.

Autoencoders are another example of self-supervised learning, as they are trained to shrink and reconstruct their inputs.

# Semantic hashing

A hash is a small, lossy representation of a data point. Typically hashes are fixed-sized representations of variable-sized data.

Cryptographic hash functions take and data and return fixed-size representations. These typically have the property that the hash value (ideally) reveals no information about the original piece of data.

The MD5 hash for the string hello is 5d41402abc4b2a76b9719d911017c592, but the hash for Hello is 8b1a9953c4611296a827abf8c47804d7. It is typically not trivial to infer that these two distinct-looking hashes have similar inputs.

Semantic hashes are the opposite – they are fixed-length representations where similar pieces of input data should have similar hashes. Word embeddings are a type of semantic hash for words. Words that occur in contexts end up with similar embeddings.

Semantic hashes are useful for document search, information retrieval, or other problems where one needs to find similar pieces of data.

# Semi-supervised learning

Semi-supervised learning mixes labeled and labeled data to produce better models.

In machine learning, finding or creating correctly-labeled data (e.g. images with annotations and descriptions, audio with text transcriptions) can be difficult or expensive.

Semi-supervised learning techniques take advantage of a small amount of labeled data and a large amount of unlabeled data to produce a better model than a purely supervised learning or a purely unsupervised learning technique.

# sense2vec

sense2vec refers to a system in a paper titled sense2vec - A Fast and Accurate Method for Word Sense Disambiguation In Neural Word Embeddings. It solves a problem with previous word embeddings like word2vec and GloVe where words of different senses (e.g. “duck” as an animal and “duck” as a verb) are represented by the same embedding.

sense2vec uses word sense information to train more accurate word embeddings.

# Sequence to Sequence Learning (seq2seq)

This typically refers to the method originally described by Sutskever et al. in the paper Sequence to Sequence Learning with Neural Networks.

Feedforward neural networks and many other models can learn complex patterns, but require fixed-length input. This makes it difficult for these models to learn variable-length sequences. To solve this, the authors applied one LSTM to read the input seqeunce and a second LSTM to generate the output sequence.

A few potential applications of sequence to sequence learning include:

• Machine translation
• Text summarization
• Speech-to-text conversion

# Siamese neural network

A Siamese neural network is a neural network architecture that runs two pieces of data through identical neural networks, and then the outputs are fed to a loss function measuring similarity between outputs.

Siamese neural networks are a common model architecture for one-shot learning.

For example, a Siamese neural network might be used to train a model to measure similarity between two different images, for the purpose of identifying whether the images are of the object…. but without training on many examples of that object.

# Similarity learning

Similarity learning is the area of machine learning focusing on learning how similar or different two objects are.

Similarity learning is often used in areas of machine learning where there are not a fixed number of classes that objects fit into. Face verification is one example of such an area.

Learning-to-rank is another area of machine learning where the model needs to learn a similarity function between the pieces of data in the dataset.

Distance metric learning is related to similarity learning, but where the similarity function is also required to obey the four axioms of a distance metric. Outside of distance metric learning, similarity learning often learns a pseudometric, where not all four axioms of a distance metric are true.

# Sobel filter (convolution)

The Sobel filter is a set of two convolution filters used to detect horizontal and vertical edges in images.

The horizontal filter is

$\begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix}$

and the vertical filter is

$\begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$

# Softmax

The softmax turns $$n$$ numbers in $$\mathbb R^N$$ into a probability distribution proportional to the size of the numbers.

Given an $$n$$-dimensional vector $$\mathbf v$$ with all component terms in $$\mathbb R^N$$, the softmax of $$\mathbb v$$ is: $\mathrm{softmax}(\mathbb v)_i = \frac{\exp{(v_i)}} {\sum_{j=1}^{n} \exp{(v_i)}}$

# Stationary environment

A stationary environment refers to data-generating distributions that do not change over time.

A non-stationary environment, in contrast, refers to data-generating distributions that do change over time.

It is a difficult problem to train machine learning algorithms to generalize well in non-stationary environments. See Machine Learning in Non-Stationary Environments for more information.

# Stochastic block model (SBM)

The stochastic block model is a common model for detection of community structure

# Stride (convolution)

In convolutions, the stride is the number of horizontal and vertical steps that the filter takes over the original matrix.

# Supervised learning

Supervised learning is about training on labeled data, i.e. questions with known answers. The task is to find the function that maps a set of inputs (x1, x2, x3, …) to the value to be predicted (y).

# Support Vector Machine (SVM)

Support Vector Machine is a classification method in supervised learning that seeks to use support vectors (cases close to the boundary) to find an optimal hyperplane separating items from different classes.

# Symmetry breaking

Symmetry breaking refer to a requirement of initializing machine learning models like neural networks.

When some machine learning models have weights all initialized to the same value, it can be difficult or impossible for the weights to differ as the model is trained. This is the “symmetry”.

Initializing the model to small random values breaks the symmetry and allows different weights to learn independently of each other.

# Synset

A synset is WordNet’s terminology for a synonym ring.

WordNet is a database of English words grouped into sets of synonyms. WordNet’s synsets are often useful in information retrieval and natural langauge processing tasks to discover when two different words can mean similar things.

# SyntaxNet

SyntaxNet is a framework for natural language syntactic parsers released by Google in 2016.

SyntaxNet tags words in a sentence with their syntactic part-of-speech and creates a parse tree showing dependencies between words in a sentence.

Parsey McParseface is a SyntaxNet model trained on the English language. At its time of release, Parsey McParseface is the world’s most accurate model of its kind.

# Termite

Termite is a visual analysis tool to determine the quality of topic models like latent Dirichlet allocation.

Termite lays out document terms as a table of circles where:

• rows represent document terms
• columns represent topics
• circular areas represent term probabilities

# Test term

This is the body of yet another test term. I think that this is going to be extremely promising.

$$_0^100 x^2$$

# TextRank

TextRank is a graph ranking algorithm applied to text. It is useful for various unsupervised learning tasks in natural language processing.

See Entailment.

# Top-1 error rate

The term top-1 error rate refers method of benchmarking machine learning models in the ImageNet Large Scale Visual Recognition Competition.

The model is considered to have classified a given image correctly if the target label is the model’s top prediction. This is in contrast to the top-5 error rate where the model only needs to identify the correct label in the model’s top 5 predictions.

# Top-5 error rate

The term top-5 error rate refers method of benchmarking machine learning models in the ImageNet Large Scale Visual Recognition Competition.

The model is considered to have classified a given image correctly if the target label is one of the model’s top 5 predictions.

# Transduction

Similar to supervised learning, but does not explicitly construct a function: instead, tries to predict new outputs based on training inputs, training outputs, and new inputs.

# Tree-LSTM

Tree-LSTMs are a variant of Long Short Term Memory (LSTM) neural networks.

A traditional LSTM is structured as a linear chain, and displays strong performance on sequence modeling tasks–such as machine translation.

However, some types of data (such as text) are better represented as tree structures instead of sequences. Thus, Tree-LSTMs were introduced by Tai, et al in 2015.

# Triplet loss function

See Triplet loss.

# Triplet loss

Triplet loss is a loss function that come from the paper FaceNet: A Unified Embedding for Face Recognition and Clustering. The loss function is designed to optimize a neural network that produces embeddings used for comparison.

The loss function operates on triplets, which are three examples from the dataset:

• $$x_i^a$$ – an anchor example. In the context of FaceNet, $$x_i^a$$ is a photograph of a person’s face.
• $$x_i^p$$ – a positive example that has the same identity as the anchor. In FaceNet, this is a second picture of the same person as the picture from the anchor example.
• $$x_i^n$$ – a negative example that represents a different entity. For FaceNet, this would be an image of a second person–someone different than the person represented by the anchor and positive examples.

The triplet loss function is designed to train the model to produce embeddings such that the positive example $$x_i^p$$ is closer to the anchor than the negative example $$x_i^n$$.

## Math Details

More formally, for an embedding function $$f(x) \in \mathbb R^d$$ that embeds input data $$x$$ into a $$d$$-dimensional vector, we want

$||f(x_i^a) - f(x_i^p)||_2^2 + \alpha \leq ||f(x_i^a) - f(x_i^n)||_2^2$

for all $$N$$ possible triplets of $$x_i^a$$, $$x_i^p$$, and $$x_i^n$$. The $$||x||_2^2$$ operator is the square of the Eucledian norm. The $$\alpha$$ symbol stands for a margin to ensure that the model doesn’t make the embeddings $$f(x_i^a)$$, $$f(x_i^p)$$, and $$f(x_i^n)$$ equal each other to trivially satisfy the above inequality.

This leads to the following loss function over the $$N$$ possible triplets.

$\sum_i^N \left [ ||f(x_i^a) - f(x_i^p)||_2^2 - ||f(x_i^a) - f(x_i^n)||_2^2 + \alpha \right ]_+$

the $$[x]_+$$ operator stands for $$\max(0,x)$$.

## Triplet selection

In a typical dataset, many triplets of $$x_i^a$$, $$x_i^p$$, and $$x_i^n$$ will satisfy the inequality in the previous section without the algorithm learning a useful embedding. This slows down the training of a machine learning algorithm that uses the triplet loss function.

To speed training back up, it makes sense to train the algorithm on examples where $$f(x_i^a)$$ is closer to $$f(x_i^n)$$ than $$f(x_i^p)$$ in the embedding space (ignoring the term $$\alpha$$).

## Synonyms

• Triplet loss function

# UNK

UNK, unk, <unk> are variants of a symbol in natural language processing and machine translation to indicate an out-of-vocabulary word.

Many language models do calculations upon representations of the $$n$$ most frequent words in the corpus. Words that are less frequent are replaced with the <unk> symbol.

This is what such a transformation might look like. The below is an example of a source document in a corpus with common English words.

Today I’ll bake; tomorrow I’ll brew, Then I’ll fetch the queen’s new child, It is good that no one knows, Rumpelstiltskin is my name.

Every word in the above quote is common in English, except for Rumpelstiltskin, which is replaced as following:

Today I’ll bake; tomorrow I’ll brew, Then I’ll fetch the queen’s new child, It is good that no one knows, <unk> is my name.

# Unsupervised learning

Unsupervised learning is about problems where we don’t have labeled answers, such as clustering, dimensionality reduction, and anomaly detection.

# Valid convolution

A valid convolution is a type of convolution operation that does not use any padding on the input.

For an $$n \times n$$ input matrix and an $$f \times f$$ filter, a valid convolution will return an output matrix of dimensions

$\left \lfloor \frac{n - f}{s} \right \rfloor + 1 \times \left \lfloor \frac{n - f}{s} \right \rfloor + 1$

where $$s$$ is the stride length of the convolution.

This is in contrast to a same convolution, which pads the $$n \times n$$ input matrix such that the output matrix is also $$n \times n$$.

# Variance

Wikipedia describes variance as follows:

In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its mean, and it informally measures how far a set of (random) numbers are spread out from their mean.

Variance the square of standard deviation, or rather, standard deviation is the square root of variance. Thus, sometimes variance is written as $$\sigma^2$$ where $$\sigma$$ stands for the standard deviation.

# VGGNet

VGGNet is a deep convolutional neural network for image recognition, trained by the Visual Geometry Group (VGG) at the University of Oxford.

VGGNet helped the VGG team secure the first place in Localization and second place in Classification in the 2014 ImageNet Large Scale Visual Recognition Competition.

# Weight sharing

In neural networks, weight sharing is a way to reduce the number of parameters while allowing for more robust feature detection. Reducing the number of parameters can be considered a form of model compression.

## Synonyms

• Parameter sharing

# Word embedding

A word embedding (or word vector) refers to a dense vector representation of a word.

Before word embeddings, words were typically represented with sparse vectors in bag-of-words models or with n-grams.

Word embeddings are typically trained with an unsupervised model over a large corpus of text. In the training process, the vectors are updated to better predict elements of the corpus, such as words surround a given target word.

At the end of this process, word embeddings often have geometric relations to each other that encode semantic meaning. A common example for this is using vector addition and subtraction to find related words. The vector $$\mathrm{King}_v - \mathrm{Man}_v + \mathrm{Woman}$$ is most similar to the vector $$\mathrm{Queen}_v$$. This example comes from the project page for word2vec.

Common implementations of word embeddings include:

# word2phrase

word2phrase refers to a program in the word2vec toolkit that discovers multi-word phrases in a corpus of words.

From the original word2vec Google Code page:

In certain applications, it is useful to have vector representation of larger pieces of text. For example, it is desirable to have only one vector for representing ‘san francisco’. This can be achieved by pre-processing the training data set to form the phrases using the word2phrase tool, as is shown in the example script ./demo-phrases.sh.

# word2vec

word2vec refers to a pair of models, open-source software, and pre-trained word embeddings from Google.

The models are:

The original paper is titled Efficient Estimation of Word Representations in Vector Space by Mikolov et al.

The source code was originally hosted on Google Code but is now located on Github.

# YOLO (object detection algorithm)

YOLO (an acronym standing for the phrase “You Only Look Once”) refers to a fast object detection algorithm. Previous attempts at building object detection algorithms involved running object detectors or object localizers multiple times over a single image.

Instead of needing multiple executions over a single image, YOLO detects objects through sending an image through a single forward pass through a convolutional neural network.

# YOLO9000 (object detection algorithm)

See YOLOv2 (object detection algorithm).

# YOLOv2 (object detection algorithm)

## Synonyms

• YOLO9000 (object detection algorithm)

## Signal processing

In signal processing, zero padding refers to the practice of adding zeroes to a time-domain signal. Zero-padding is often done before performing a fast Fourier transform on the time-domain signal.

## Neural networks

In convolutional neural networks, zero-padding refers to surrounding a matrix with zeroes. This can help preserve features that exist at the edges of the original matrix and control the size of the output feature map.

Below is an example of a padding operator $$\mathrm{Pad}(n, \mathbf X)$$ that adds $$n$$ layers of zeroes around the matrix $$\mathbf X$$. $\mathbf X = \begin{bmatrix} a & b & c \\ d & f & g \\ h & j & k \end{bmatrix},$ $\mathrm{Pad}(1, \mathbf X) = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & a & b & c & 0 \\ 0 & d & f & g & 0 \\ 0 & h & j & k & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}$

# Zero-shot learning

Ian Goodfellow in a Quora answer defines zero-shot learning as the following:

Zero-shot learning is being able to solve a task despite not having received any training examples of that task. For a concrete example, imagine recognizing a category of object in photos without ever having seen a photo of that kind of object before. If you’ve read a very detailed description of a cat, you might be able to tell what a cat is in a photograph the first time you see it.