Statistics and clustering

This vignette is adapted from the official Armadillo documentation.

Mean

The mean function computes the mean of a vector, matrix, or cube. For a vector argument, the mean is calculated using all the elements of the vector. For a matrix argument, the mean is calculated for each column by default (dim = 0), or for each row (dim = 1). For a cube argument, the mean is calculated along the specified dimension: dim = 0 and dim = 1 as for matrices, plus dim = 2 for slices.

Usage:

mean(V)

mean(M)
mean(M, dim)

mean(Q)
mean(Q, dim)

Examples

[[cpp11::register]] list mean1_(const doubles_matrix<>& X,
  const doubles_matrix<>& Y) {
  mat A = as_Mat(X);
  mat B = as_Mat(Y);
  
  // create a cube with 3 copies of B + random noise
  cube C(B.n_rows, B.n_cols, 3);
  C.slice(0) = B + 0.1 * randn<mat>(B.n_rows, B.n_cols);
  C.slice(1) = B + 0.2 * randn<mat>(B.n_rows, B.n_cols);
  C.slice(2) = B + 0.3 * randn<mat>(B.n_rows, B.n_cols);

  vec D = mean(A).t();  // column means of A
  vec E = mean(A, 1);   // row means of A
  vec F = mean(B, 1);   // row means of B
  cube G = mean(C, 2);  // mean across the slices of C

  writable::list res(4);
  res[0] = as_doubles(D);
  res[1] = as_doubles(E);
  res[2] = as_doubles(F);
  res[3] = as_doubles_matrix(G.slice(0));

  return res;
}

Median

The median function computes the median of a vector or matrix. For a vector argument, the median is calculated using all the elements of the vector. For a matrix argument, the median is calculated for each column by default (dim = 0), or for each row (dim = 1).

Usage:

median(V)

median(M)
median(M, dim)

Examples

[[cpp11::register]] list median1_(const doubles_matrix<>& X,
  const doubles_matrix<>& Y) {
  mat A = as_Mat(X);
  mat B = as_Mat(Y);

  vec C = median(A).t();  // column medians of A
  vec D = median(A, 1);   // row medians of A
  vec E = median(B, 1);   // row medians of B

  writable::list res(3);
  res[0] = as_doubles(C);
  res[1] = as_doubles(D);
  res[2] = as_doubles(E);

  return res;
}

Standard deviation

The stddev function computes the standard deviation of a vector or matrix. For a vector argument, the standard deviation is calculated using all the elements of the vector. For a matrix argument, the standard deviation is calculated for each column by default (dim = 0), or for each row (dim = 1).

The norm_type argument is optional; by default norm_type = 0 is used. The norm_type argument controls the type of normalization used, with N denoting the number of observations:

  • for norm_type = 0, normalization is done using N-1, providing the best unbiased estimation of the standard deviation (if the observations are from a normal distribution).
  • for norm_type = 1, normalization is done using N, which provides the second moment of the observations about their mean.
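
For a vector V with N elements and sample mean x̄, the two settings correspond to:

$$ \text{stddev}(V, 0) = \sqrt{\frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad \text{stddev}(V, 1) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2} $$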

Usage:

stddev(V)
stddev(V, norm_type)

stddev(M)
stddev(M, norm_type)
stddev(M, norm_type, dim)

Examples

[[cpp11::register]] list stddev1_(const doubles_matrix<>& X) {
  mat A = as_Mat(X);

  vec C = stddev(A).t();     // per-column, norm_type = 0
  vec D = stddev(A, 1).t();  // per-column, norm_type = 1
  vec E = stddev(A, 1, 1);   // per-row, norm_type = 1

  writable::list res(3);
  res[0] = as_doubles(C);
  res[1] = as_doubles(D);
  res[2] = as_doubles(E);

  return res;
}

Variance

The var function computes the variance of a vector or matrix. For a vector argument, the variance is calculated using all the elements of the vector. For a matrix argument, the variance is calculated for each column by default (dim = 0), or for each row (dim = 1).

The norm_type argument is optional; by default norm_type = 0 is used. The norm_type argument controls the type of normalization used, with N denoting the number of observations:

  • for norm_type = 0, normalization is done using N-1, providing the best unbiased estimation of the variance (if the observations are from a normal distribution).
  • for norm_type = 1, normalization is done using N, which provides the second moment of the observations about their mean.

Usage:

var(V)
var(V, norm_type)

var(M)
var(M, norm_type)
var(M, norm_type, dim)

Examples

[[cpp11::register]] list var1_(const doubles_matrix<>& X) {
  mat A = as_Mat(X);

  vec C = var(A).t();     // per-column, norm_type = 0
  vec D = var(A, 1).t();  // per-column, norm_type = 1
  vec E = var(A, 1, 1);   // per-row, norm_type = 1

  writable::list res(3);
  res[0] = as_doubles(C);
  res[1] = as_doubles(D);
  res[2] = as_doubles(E);

  return res;
}

Range

The range function computes the range of a vector or matrix. For a vector argument, the range is calculated using all the elements of the vector. For a matrix argument, the range is calculated for each column by default (dim = 0), or for each row (dim = 1).

Usage:

range(V)

range(M)
range(M, dim)

Examples

[[cpp11::register]] list range1_(const doubles_matrix<>& X) {
  mat A = as_Mat(X);

  vec C = range(A).t();  // range of each column
  vec D = range(A, 1);   // range of each row

  writable::list res(2);
  res[0] = as_doubles(C);
  res[1] = as_doubles(D);

  return res;
}

Covariance

The cov function computes the covariance between two matrices or vectors. If each row of X and Y is an observation and each column is a variable, the (i,j)-th entry of cov(X,Y) is the covariance between the i-th variable in X and the j-th variable in Y.

For two matrix arguments X and Y, cov(X,Y) computes the covariance between the two matrices. For vector arguments, the orientation of the vector (row or column) is ignored and each element is treated as an observation. cov(X) is equivalent to cov(X, X).

The norm_type argument is optional; by default norm_type = 0 is used. The norm_type argument controls the type of normalization used, with N denoting the number of observations:

  • for norm_type = 0, normalization is done using N-1, providing the best unbiased estimation of the covariance matrix (if the observations are from a normal distribution).
  • for norm_type = 1, normalization is done using N, which provides the second moment matrix of the observations about their mean.

Usage:

cov(X, Y)
cov(X, Y, norm_type)

cov(X)
cov(X, norm_type)

Examples

[[cpp11::register]] list cov1_(const doubles_matrix<>& X,
  const doubles_matrix<>& Y) {
  mat A = as_Mat(X);
  mat B = as_Mat(Y);

  mat C = cov(A, B);
  mat D = cov(A, B, 1);

  writable::list res(2);
  res[0] = as_doubles_matrix(C);
  res[1] = as_doubles_matrix(D);

  return res;
}

Correlation

The cor function computes the correlation coefficient between two matrices or vectors. If each row of X and Y is an observation and each column is a variable, the (i,j)-th entry of cor(X,Y) is the correlation coefficient between the i-th variable in X and the j-th variable in Y.

For two matrix arguments X and Y, cor(X,Y) computes the correlation coefficient between the two matrices. For vector arguments, the orientation of the vector (row or column) is ignored and each element is treated as an observation.

The norm_type argument is optional; by default norm_type = 0 is used. The norm_type argument controls the type of normalization used, with N denoting the number of observations:

  • for norm_type = 0, normalization is done using N-1.
  • for norm_type = 1, normalization is done using N.

Usage:

cor(X, Y)
cor(X, Y, norm_type)

cor(X)
cor(X, norm_type)

Examples

[[cpp11::register]] list cor1_(const doubles_matrix<>& X,
  const doubles_matrix<>& Y) {
  mat A = as_Mat(X);
  mat B = as_Mat(Y);

  mat C = cor(A, B);
  mat D = cor(A, B, 1);

  writable::list res(2);
  res[0] = as_doubles_matrix(C);
  res[1] = as_doubles_matrix(D);

  return res;
}

Histogram

The hist function computes a histogram of counts for a vector or matrix. For a vector argument, the histogram is calculated using all the elements of the vector. For a matrix argument, the histogram is calculated for each column by default (dim = 0), or for each row (dim = 1).

The bin centers can be automatically determined from the data, with the number of bins specified via n_bins (the default is 10). The range of the bins is determined by the range of the data. The bin centers can be explicitly specified in the centers vector, which must contain monotonically increasing values.

Usage:

hist(V)
hist(V, n_bins)
hist(V, centers)

hist(M, centers)
hist(M, centers, dim)

Examples

[[cpp11::register]] list hist1_(const int& n) {
  vec A = randn<vec>(n);

  uvec h1 = hist(A, 11);                        // 11 bins determined from the data
  uvec h2 = hist(A, linspace<vec>(-2, 2, 11));  // 11 bin centers from -2 to 2

  writable::list res(2);
  res[0] = as_integers(h1);
  res[1] = as_integers(h2);

  return res;
}

Histogram of counts with user-specified edges

The histc function computes a histogram of counts for a vector or matrix. For a vector argument, the histogram is calculated using all the elements of the vector. For a matrix argument, the histogram is calculated for each column by default (dim = 0), or for each row (dim = 1).

The bin edges have to be specified and contain monotonically increasing values.

Usage:

histc(V, edges)

histc(M, edges)
histc(M, edges, dim)

Examples

[[cpp11::register]] integers histc1_(const int& n) {
  vec A = randn<vec>(n);

  uvec h = histc(A, linspace<vec>(-2, 2, 11));  // counts for the bins defined by 11 edges

  return as_integers(h);
}

Quantiles of a dataset

The quantile function computes the quantiles corresponding to the cumulative probability values for a vector or matrix. For a vector argument, the quantiles are calculated using all the elements of the vector. For a matrix argument, the quantiles are calculated for each column by default (dim = 0), or for each row (dim = 1).

The probabilities have to be specified as a second argument P.

The algorithm for calculating the quantiles is based on Definition 5 in: Rob J. Hyndman and Yanan Fan. Sample Quantiles in Statistical Packages. The American Statistician, 50(4), 361-365, 1996. DOI: 10.2307/2684934
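
Under Definition 5, the N sorted sample values are taken to be the quantiles at the cumulative probabilities

$$ p_k = \frac{k - 0.5}{N}, \qquad k = 1, \ldots, N, $$

with linear interpolation between adjacent points, and the smallest (largest) sample value returned for probabilities below (above) this range.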

Usage:

quantile(V, P)

quantile(M, P)
quantile(M, P, dim)

Examples

[[cpp11::register]] doubles quantile1_(const int& n) {
  vec A = randu<vec>(n);

  vec P = {0.0, 0.25, 0.50, 0.75, 1.0};
  vec Q = quantile(A, P);

  return as_doubles(Q);
}

Principal component analysis (PCA)

TODO: This needs a custom method.
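
In the meantime, Armadillo's princomp() can be wrapped directly; the following is a minimal sketch (the function name pca1_ is illustrative, not part of the package):

[[cpp11::register]] list pca1_(const doubles_matrix<>& X) {
  mat A = as_Mat(X);

  mat coeff;     // principal component coefficients (loadings)
  mat score;     // data projected onto the principal components
  vec latent;    // eigenvalues of the covariance matrix of A
  vec tsquared;  // Hotelling's T-squared statistic for each sample

  bool status = princomp(coeff, score, latent, tsquared, A);

  if (status == false) {
    stop("decomposition failed");
  }

  writable::list res(4);
  res[0] = as_doubles_matrix(coeff);
  res[1] = as_doubles_matrix(score);
  res[2] = as_doubles(latent);
  res[3] = as_doubles(tsquared);

  return res;
}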

Probability density function of normal distribution

The normpdf function computes the probability density function of a normal distribution for a scalar, vector, or matrix. For each scalar x in X, the probability density function is computed according to a Gaussian (normal) distribution using the corresponding mean value m in M and the corresponding standard deviation value s in S.

$$ y = \frac{1}{s \sqrt{2\pi}} \exp\left[-\frac{(x - m)^2}{2s^2}\right] $$

  • X can be a scalar, vector, or matrix.
  • M and S can jointly be either scalars, vectors, or matrices.
  • If M and S are omitted, their values are assumed to be 0 and 1, respectively.

Caveat

To reduce the incidence of numerical underflows, consider using log_normpdf().

Examples

[[cpp11::register]] list normpdf1_(const int& n) {
  vec X = randu<vec>(n);
  vec M = randu<vec>(n);
  vec S = randu<vec>(n);

  vec P1 = normpdf(X);
  vec P2 = normpdf(X, M, S);
  vec P3 = normpdf(1.23, M, S);
  vec P4 = normpdf(X, 4.56, 7.89);
  double P5 = normpdf(1.23, 4.56, 7.89);

  writable::list res(5);

  res[0] = as_doubles(P1);
  res[1] = as_doubles(P2);
  res[2] = as_doubles(P3);
  res[3] = as_doubles(P4);
  res[4] = as_doubles({P5});

  return res;
}

Logarithm of probability density function of normal distribution

The log_normpdf function computes the logarithm of the probability density function of a normal distribution for a scalar, vector, or matrix. For each scalar x in X, the log probability density function is computed according to a Gaussian (normal) distribution using the corresponding mean value m in M and the corresponding standard deviation value s in S.

$$ y = \log\left[\frac{1}{s \sqrt{2\pi}} \exp\left(-\frac{(x - m)^2}{2s^2}\right)\right] $$

  • X can be a scalar, vector, or matrix.
  • M and S can jointly be either scalars, vectors, or matrices.
  • If M and S are omitted, their values are assumed to be 0 and 1, respectively.

Examples

[[cpp11::register]] list lognormpdf1_(const int& n) {
  vec X = randu<vec>(n);
  vec M = randu<vec>(n);
  vec S = randu<vec>(n);

  vec P1 = log_normpdf(X);
  vec P2 = log_normpdf(X, M, S);
  vec P3 = log_normpdf(1.23, M, S);
  vec P4 = log_normpdf(X, 4.56, 7.89);
  double P5 = log_normpdf(1.23, 4.56, 7.89);

  writable::list res(5);

  res[0] = as_doubles(P1);
  res[1] = as_doubles(P2);
  res[2] = as_doubles(P3);
  res[3] = as_doubles(P4);
  res[4] = as_doubles({P5});

  return res;
}

Cumulative distribution function of normal distribution

The normcdf function computes the cumulative distribution function of a normal distribution for a scalar, vector, or matrix. For each scalar x in X, the cumulative distribution function is computed according to a Gaussian (normal) distribution using the corresponding mean value m in M and the corresponding standard deviation value s in S.

  • X can be a scalar, vector, or matrix.
  • M and S can jointly be either scalars, vectors, or matrices.
  • If M and S are omitted, their values are assumed to be 0 and 1, respectively.
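
The value can be written in terms of the Gaussian error function:

$$ y = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x - m}{s \sqrt{2}}\right)\right] $$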

Examples

[[cpp11::register]] list normcdf1_(const int& n) {
  vec X = randu<vec>(n);
  vec M = randu<vec>(n);
  vec S = randu<vec>(n);

  vec P1 = normcdf(X);
  vec P2 = normcdf(X, M, S);
  vec P3 = normcdf(1.23, M, S);
  vec P4 = normcdf(X, 4.56, 7.89);
  double P5 = normcdf(1.23, 4.56, 7.89);

  writable::list res(5);

  res[0] = as_doubles(P1);
  res[1] = as_doubles(P2);
  res[2] = as_doubles(P3);
  res[3] = as_doubles(P4);
  res[4] = as_doubles({P5});

  return res;
}

Random vectors from multivariate normal distribution

Generate a matrix with random column vectors from a multivariate Gaussian (normal) distribution with parameters M and C.

  • M is the mean and it must be a column vector.
  • C is the covariance matrix and it must be symmetric positive semi-definite (ideally positive definite).
  • N is the number of column vectors to generate. If N is omitted, it is assumed to be 1.

Usage:

X = mvnrnd(M, C)
X = mvnrnd(M, C, N)

mvnrnd(X, M, C)
mvnrnd(X, M, C, N)

The first two forms throw an error if the generation fails. The last two forms reset X and return a bool set to false (without throwing an error) if the generation fails.

Examples

[[cpp11::register]] doubles_matrix<> mvnrnd1_(const int& n, const int& m) {
  vec M = randu<vec>(n);

  mat B = randu<mat>(n, n);
  mat C = B.t() * B;

  mat X = mvnrnd(M, C, m);

  return as_doubles_matrix(X);
}
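
The example above uses the throwing form. A minimal sketch of the non-throwing form (the function name mvnrnd2_ is illustrative, not part of the package):

[[cpp11::register]] doubles_matrix<> mvnrnd2_(const int& n, const int& m) {
  vec M = randu<vec>(n);

  mat B = randu<mat>(n, n);
  mat C = B.t() * B;  // symmetric positive semi-definite covariance

  mat X;
  bool status = mvnrnd(X, M, C, m);  // X is reset and status is false on failure

  if (status == false) {
    stop("sampling failed");
  }

  return as_doubles_matrix(X);
}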

Random numbers from chi-squared distribution

Generate a random scalar, vector, or matrix with elements sampled from a chi-squared distribution with the degrees of freedom specified by parameter DF or DF_scalar.

  • DF is a vector or matrix, while DF_scalar is a scalar.
  • For the chi2rnd(DF) form, the output vector or matrix has the same size and type as DF.
  • Each value in DF and DF_scalar must be greater than zero.
  • Each element in DF specifies a separate degree of freedom.

Usage:

v = chi2rnd(DF)
X = chi2rnd(DF)

double s = chi2rnd<double>(DF_scalar) // float also works
vec v = chi2rnd<vec>(DF_scalar, n_elem)
mat X = chi2rnd<mat>(DF_scalar, n_rows, n_cols)
mat Y = chi2rnd<mat>(DF_scalar, size(X))

Examples

[[cpp11::register]] list chi2rnd1_(const int& n, const int& m) {
  mat X = chi2rnd<mat>(2.0, n, m);               // n x m draws with 2 degrees of freedom
  mat Y = randi<mat>(n, m, distr_param(1, 10));  // random integer degrees of freedom
  mat Z = chi2rnd(Y);                            // element-wise degrees of freedom from Y

  writable::list res(2);
  res[0] = as_doubles_matrix(X);
  res[1] = as_doubles_matrix(Z);

  return res; 
}

Random matrix from Wishart distribution

Generate a random matrix sampled from the Wishart distribution with parameters S and df.

  • S is a symmetric positive definite matrix (e.g., a covariance matrix).
  • df is a scalar specifying the degrees of freedom; it can be a non-integer value.
  • D is an optional argument to specify the Cholesky decomposition of S.

Usage:

W = wishrnd(S, df)
W = wishrnd(S, df, D)

wishrnd(W, S, df)
wishrnd(W, S, df, D)

The first two forms throw an error if the generation fails. The last two forms reset W and return a bool set to false (without throwing an error) if the generation fails.

Examples

[[cpp11::register]] doubles_matrix<> wishrnd1_(const int& n) {
  mat X = randu<mat>(n, n);
  mat S = X.t() * X;

  mat W = wishrnd(S, 6.7);

  return as_doubles_matrix(W);
}
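
When many draws use the same S, the optional D argument avoids refactoring S on every call. A minimal sketch (the function name wishrnd2_ is illustrative, and it assumes D is Armadillo's default upper-triangular factor from chol(S)):

[[cpp11::register]] doubles_matrix<> wishrnd2_(const int& n, const int& m) {
  mat X = randu<mat>(n, n);
  mat S = X.t() * X;

  mat D = chol(S);  // upper-triangular factor, S = D.t() * D

  mat W(n, n, fill::zeros);

  // average m draws, reusing the precomputed factor
  for (int i = 0; i < m; ++i) {
    W += wishrnd(S, 6.7, D);
  }

  mat W_avg = W / double(m);

  return as_doubles_matrix(W_avg);
}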

Random matrix from inverse Wishart distribution

Generate a random matrix sampled from the inverse Wishart distribution with parameters T and df.

  • T is a symmetric positive definite matrix.
  • df is a scalar specifying the degrees of freedom; it can be a non-integer value.
  • Dinv is an optional argument; it specifies the Cholesky decomposition of the inverse of T. If Dinv is provided, T is ignored. Using Dinv is more efficient if iwishrnd() needs to be used many times for the same T matrix.

Usage:

W = iwishrnd(T, df)
W = iwishrnd(T, df, Dinv)

iwishrnd(W, T, df)
iwishrnd(W, T, df, Dinv)

The first two forms throw an error if the generation fails. The last two forms reset W and return a bool set to false (without throwing an error) if the generation fails.

Examples

[[cpp11::register]] doubles_matrix<> iwishrnd1_(const int& n, const double& d) {
  mat X = randu<mat>(n, n);
  mat T = X.t() * X;

  mat W = iwishrnd(T, d);

  return as_doubles_matrix(W);
}

Cluster data into disjoint sets

Cluster given data into k disjoint sets.

Usage:

kmeans(means, data, k, seed_mode, n_iter, print_mode)

  • The means parameter is the output matrix for storing the resulting centroids of the sets, with each centroid stored as a column vector. If the clustering fails, the means matrix is reset and the function returns false.
  • The data parameter is the input data matrix, with each sample stored as a column vector.
  • The k parameter indicates the number of centroids.
  • The seed_mode parameter specifies how the initial centroids are seeded. It is one of:
    • keep_existing: use the centroids specified in the means matrix as the starting point.
    • static_subset: use a subset of the data vectors (repeatable).
    • random_subset: use a subset of the data vectors (random).
    • static_spread: use a maximally spread subset of data vectors (repeatable).
    • random_spread: use a maximally spread subset of data vectors (random start).
  • The n_iter parameter specifies the number of clustering iterations. This is data dependent, but about 10 is typically sufficient.
  • The print_mode parameter is either true or false, indicating whether progress is printed during clustering.

Caveats

  • The number of samples in the data matrix should be larger than k.
  • Works much faster with OpenMP enabled in the compiler (e.g., -fopenmp in GCC and clang). Cpp11armadillo finds OpenMP and uses it by default.
  • For probabilistic clustering, use the gmm_diag or gmm_full classes instead.

Examples

[[cpp11::register]] list kmeans1_(const int& n, const int& d) {
  mat data(d, n, fill::randu);

  mat means;

  bool status = kmeans(means, data, 2, random_subset, 10, true);  // k = 2, n_iter = 10

  if (status == false) {
    stop("clustering failed");
  }

  writable::list res(2);

  res[0] = writable::logicals({status});
  res[1] = as_doubles_matrix(means);

  return res;
}

Probabilistic clustering and likelihood calculation via mixture of Gaussians

gmm_diag and gmm_full are classes for multi-variate probabilistic clustering and likelihood calculation via Gaussian Mixture Models (GMMs). The implementation details are available in Sanderson and Curtin (2017).

The distribution of data is modelled as:

$$ p(x) = \sum_{g=0}^{n_{\text{gaus}}-1} h_g N(x \mid m_g, C_g) $$

where:

  • x is a column vector.
  • n_gaus is the number of Gaussians, with n_gaus ≥ 1.
  • N(x ∣ m_g, C_g) represents a Gaussian (normal) distribution with mean m_g and covariance C_g.
  • Each Gaussian g has the following parameters:
    • h_g is the heft (weight), with constraints h_g ≥ 0 and ∑ h_g = 1.
    • m_g is the mean vector (centroid) with dimensionality n_dims.
    • C_g is the covariance matrix (either diagonal or full).

Both gmm_diag and gmm_full include tailored k-means and Expectation Maximisation algorithms for learning model parameters from training data.

For an instance of gmm_diag or gmm_full named M, the member functions and variables are:

  • M.log_p(V): return a scalar representing the log-likelihood of vector V (of type vec).
  • M.log_p(V, g): return a scalar representing the log-likelihood of vector V (of type vec), according to Gaussian with index g.
  • M.log_p(X): return a row vector (of type rowvec) containing log-likelihoods of each column vector in matrix X (of type mat).
  • M.log_p(X, g): return a row vector (of type rowvec) containing log-likelihoods of each column vector in matrix X (of type mat), according to Gaussian with index g.
  • M.sum_log_p(X): return a scalar representing the sum of log-likelihoods for all column vectors in matrix X (of type mat).
  • M.sum_log_p(X, g): return a scalar representing the sum of log-likelihoods for all column vectors in matrix X (of type mat), according to Gaussian with index g.
  • M.avg_log_p(X): return a scalar representing the average log-likelihood of all column vectors in matrix X (of type mat).
  • M.avg_log_p(X, g): return a scalar representing the average log-likelihood of all column vectors in matrix X (of type mat), according to Gaussian with index g.
  • M.assign(V, dist_mode): return the index of the closest mean (or Gaussian) to vector V (of type vec); parameter dist_mode is one of:
    • eucl_dist: Euclidean distance (takes only means into account).
    • prob_dist: probabilistic “distance”, defined as the inverse likelihood (takes into account means, covariances, and hefts).
  • M.assign(X, dist_mode): return a row vector (of type urowvec) containing the indices of the closest means (or Gaussians) to each column vector in matrix X (of type mat); parameter dist_mode is either eucl_dist or prob_dist (as per the .assign() function above).
  • M.raw_hist(X, dist_mode): return a row vector (of type urowvec) representing the raw histogram of counts; each entry is the number of counts corresponding to a Gaussian; each count is the number of times the corresponding Gaussian was the closest to each column vector in matrix X; parameter dist_mode is either eucl_dist or prob_dist (as per the .assign() function above).
  • M.norm_hist(X, dist_mode): similar to the .raw_hist() function above; return a row vector (of type rowvec) containing normalised counts; the vector sums to one; parameter dist_mode is either eucl_dist or prob_dist (as per the .assign() function above).
  • M.generate(): return a column vector (of type vec) representing a random sample generated according to the model’s parameters.
  • M.generate(N): return a matrix (of type mat) containing N column vectors, with each vector representing a random sample generated according to the model’s parameters.
  • M.save(filename): save the model to a file and return a bool indicating either success (true) or failure (false).
  • M.load(filename): load the model from a file and return a bool indicating either success (true) or failure (false).
  • M.n_gaus(): return the number of means/Gaussians in the model.
  • M.n_dims(): return the dimensionality of the means/Gaussians in the model.
  • M.reset(n_dims, n_gaus): set the model to have dimensionality n_dims, with n_gaus number of Gaussians; all the means are set to zero, all covariance matrix representations are equivalent to the identity matrix, and all the hefts (weights) are set to be uniform.
  • M.hefts: read-only row vector (of type rowvec) containing the hefts (weights).
  • M.means: read-only matrix (of type mat) containing the means (centroids), stored as column vectors.
  • M.dcovs: read-only matrix (of type mat) containing the representation of diagonal covariance matrices, with the set of diagonal covariances for each Gaussian stored as a column vector; applicable only to the gmm_diag class.
  • M.fcovs: read-only cube containing the full covariance matrices, with each covariance matrix stored as a slice within the cube; applicable only to the gmm_full class.
  • M.set_hefts(V): set the hefts (weights) of the model to be as specified in row vector V (of type rowvec); the number of hefts must match the existing model.
  • M.set_means(X): set the means to be as specified in matrix X (of type mat); the number of means and their dimensionality must match the existing model.
  • M.set_dcovs(X): set the diagonal covariance matrices to be as specified in matrix X (of type mat), with the set of diagonal covariances for each Gaussian stored as a column vector; the number of covariance matrices and their dimensionality must match the existing model; applicable only to the gmm_diag class.
  • M.set_fcovs(X): set the full covariance matrices to be as specified in cube X, with each covariance matrix stored as a slice within the cube; the number of covariance matrices and their dimensionality must match the existing model; applicable only to the gmm_full class.
  • M.set_params(means, covs, hefts): set all the parameters at the same time; the type and layout of the parameters is as per the .set_hefts(), .set_means(), .set_dcovs(), and .set_fcovs() functions above; the number of Gaussians and dimensionality can be different from the existing model.
  • M.learn(data, n_gaus, dist_mode, seed_mode, km_iter, em_iter, var_floor, print_mode): learn the model parameters via multi-threaded k-means and/or EM algorithms; return a bool value, with true indicating success, and false indicating failure; the parameters have the following meanings:
    • data: matrix (of type mat) containing training samples; each sample is stored as a column vector.
    • n_gaus: set the number of Gaussians to n_gaus; to help convergence, it is recommended that the given data matrix (above) contains at least 10 samples for each Gaussian.
    • dist_mode: specifies the distance used during the seeding of initial means and k-means clustering:
      • eucl_dist: Euclidean distance.
      • maha_dist: Mahalanobis distance, which uses a global diagonal covariance matrix estimated from the training samples; this is recommended for probabilistic applications.
    • seed_mode: specifies how the initial means are seeded prior to running k-means and/or EM algorithms:
      • keep_existing: keep the existing model (do not modify the means, covariances, and hefts).
      • static_subset: a subset of the training samples (repeatable).
      • random_subset: a subset of the training samples (random).
      • static_spread: a maximally spread subset of training samples (repeatable).
      • random_spread: a maximally spread subset of training samples (random start).
      • Caveat: seeding the initial means with static_spread and random_spread can be much more time consuming than with static_subset and random_subset.
    • km_iter: the number of iterations of the k-means algorithm; this is data dependent, but typically 10 iterations are sufficient.
    • em_iter: the number of iterations of the EM algorithm; this is data dependent, but typically 5 to 10 iterations are sufficient.
    • var_floor: the variance floor (smallest allowed value) for the diagonal covariances; setting this to a small non-zero value can help with convergence and/or better quality parameter estimates.
    • print_mode: either true or false; enable or disable printing of progress during the k-means and EM algorithms.

Caveats

  • gmm_diag is tailored for diagonal covariance matrices.
  • gmm_full is tailored for full covariance matrices.
  • gmm_diag is considerably faster than gmm_full, at the cost of some reduction in modelling accuracy.
  • For faster execution on multi-core machines, enable OpenMP in your compiler (e.g., -fopenmp in GCC and clang). Cpp11armadillo finds OpenMP and uses it by default.

Examples

[[cpp11::register]] list gmm1_(const int& n, const int& d) {
  // create synthetic data with 2 Gaussians

  mat data(d, n, fill::zeros);

  vec mean0 = linspace<vec>(1, d, d);
  vec mean1 = mean0 + 2;

  int i = 0;

  // two thirds of the samples are drawn around mean0, one third around mean1
  while (i < n) {
    data.col(i) = mean0 + randn<vec>(d);
    ++i;
    if (i < n) {
      data.col(i) = mean0 + randn<vec>(d);
      ++i;
    }
    if (i < n) {
      data.col(i) = mean1 + randn<vec>(d);
      ++i;
    }
  }

  // model the data as a diagonal GMM with 2 Gaussians

  gmm_diag model;

  bool status = model.learn(data, 2, maha_dist, random_subset, 10, 5, 1e-5,
    true);

  if (status == false) {
    stop("learning failed");
  }

  model.means.print("means:");

  double scalar_likelihood = model.log_p(data.col(0));
  rowvec set_likelihood = model.log_p(data.cols(0, 9));

  double overall_likelihood = model.avg_log_p(data);

  uword gaus_id = model.assign(data.col(0), eucl_dist);
  urowvec gaus_ids = model.assign(data.cols(0, 9), prob_dist);

  rowvec hist1 = model.norm_hist(data, eucl_dist);
  urowvec hist2 = model.raw_hist(data, prob_dist);

  writable::list res(9);

  res[0] = writable::logicals({status});
  res[1] = as_doubles_matrix(model.means);
  res[2] = as_doubles({scalar_likelihood});
  res[3] = as_doubles(set_likelihood.t());
  res[4] = as_doubles({overall_likelihood});
  res[5] = as_integers(gaus_id);
  res[6] = as_integers(gaus_ids.t());
  res[7] = as_doubles(hist1.t());
  res[8] = as_integers(hist2.t());

  return res;
}

References

Sanderson, Conrad, and Ryan Curtin. 2017. “An Open Source C++ Implementation of Multi-Threaded Gaussian Mixture Models, k-Means and Expectation Maximisation.” In 2017 11th International Conference on Signal Processing and Communication Systems (ICSPCS), 1–8. https://doi.org/10.1109/ICSPCS.2017.8270510.