This vignette is adapted from the official Armadillo documentation.
The `mean` function computes the mean of a vector, matrix, or cube. For
a vector argument, the mean is calculated using all the elements of the
vector. For a matrix argument, the mean is calculated for each column by
default (`dim = 0`), or for each row (`dim = 1`). For a cube argument,
the mean is calculated along the specified dimension (the same as for
matrices, plus `dim = 2` for slices).
Usage:
```cpp
[[cpp11::register]] list mean1_(const doubles_matrix<>& X,
                                const doubles_matrix<>& Y) {
  mat A = as_Mat(X);
  mat B = as_Mat(Y);

  // create a cube with 3 copies of B + random noise
  cube C(B.n_rows, B.n_cols, 3);
  C.slice(0) = B + 0.1 * randn<mat>(B.n_rows, B.n_cols);
  C.slice(1) = B + 0.2 * randn<mat>(B.n_rows, B.n_cols);
  C.slice(2) = B + 0.3 * randn<mat>(B.n_rows, B.n_cols);

  vec D = mean(A).t(); // mean of each column of A
  vec E = mean(A, 1);  // mean of each row of A

  cube Fc = mean(C, 2); // mean along the slices of C (dim = 2)
  mat F = Fc.slice(0);  // the result has a single slice

  writable::list res(3);
  res[0] = as_doubles(D);
  res[1] = as_doubles(E);
  res[2] = as_doubles_matrix(F);

  return res;
}
```
The `median` function computes the median of a vector or matrix. For a
vector argument, the median is calculated using all the elements of the
vector. For a matrix argument, the median is calculated for each column
by default (`dim = 0`), or for each row (`dim = 1`).
Usage:
```cpp
[[cpp11::register]] list median1_(const doubles_matrix<>& X,
                                  const doubles_matrix<>& Y) {
  mat A = as_Mat(X);
  mat B = as_Mat(Y);

  vec C = median(A).t(); // median of each column of A
  vec D = median(A, 1);  // median of each row of A
  vec E = median(B, 1);  // median of each row of B

  writable::list res(3);
  res[0] = as_doubles(C);
  res[1] = as_doubles(D);
  res[2] = as_doubles(E);

  return res;
}
```
The `stddev` function computes the standard deviation of a vector or
matrix. For a vector argument, the standard deviation is calculated
using all the elements of the vector. For a matrix argument, the
standard deviation is calculated for each column by default
(`dim = 0`), or for each row (`dim = 1`).
The `norm_type` argument is optional; by default `norm_type = 0` is
used. The `norm_type` argument controls the type of normalization used,
with `N` denoting the number of observations:

- `norm_type = 0`: normalization is done using `N-1`, providing the best
  unbiased estimation of the standard deviation (if the observations are
  from a normal distribution).
- `norm_type = 1`: normalization is done using `N`, which provides the
  second moment of the observations about their mean.

Usage:
```cpp
[[cpp11::register]] list stddev1_(const doubles_matrix<>& X,
                                  const doubles_matrix<>& Y) {
  mat A = as_Mat(X);
  mat B = as_Mat(Y);

  vec C = stddev(A).t();    // stddev of each column (norm_type = 0)
  vec D = stddev(A, 1).t(); // stddev of each column (norm_type = 1)
  vec E = stddev(A, 1, 1);  // stddev of each row (norm_type = 1)

  writable::list res(3);
  res[0] = as_doubles(C);
  res[1] = as_doubles(D);
  res[2] = as_doubles(E);

  return res;
}
```
The `var` function computes the variance of a vector or matrix. For a
vector argument, the variance is calculated using all the elements of
the vector. For a matrix argument, the variance is calculated for each
column by default (`dim = 0`), or for each row (`dim = 1`).
The `norm_type` argument is optional; by default `norm_type = 0` is
used. The `norm_type` argument controls the type of normalization used,
with `N` denoting the number of observations:

- `norm_type = 0`: normalization is done using `N-1`, providing the best
  unbiased estimation of the variance (if the observations are from a
  normal distribution).
- `norm_type = 1`: normalization is done using `N`, which provides the
  second moment of the observations about their mean.

Usage:
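The example below is a minimal sketch mirroring the `stddev` example
above; the `var1_` name and its signature are illustrative:

```cpp
[[cpp11::register]] list var1_(const doubles_matrix<>& X) {
  mat A = as_Mat(X);

  vec C = var(A).t();    // variance of each column (norm_type = 0)
  vec D = var(A, 1).t(); // variance of each column (norm_type = 1)
  vec E = var(A, 1, 1);  // variance of each row (norm_type = 1)

  writable::list res(3);
  res[0] = as_doubles(C);
  res[1] = as_doubles(D);
  res[2] = as_doubles(E);

  return res;
}
```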
The `range` function computes the range of a vector or matrix. For a
vector argument, the range is calculated using all the elements of the
vector. For a matrix argument, the range is calculated for each column
by default (`dim = 0`), or for each row (`dim = 1`).
Usage:
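A minimal sketch in the style of the preceding examples (`range1_` is an
illustrative name):

```cpp
[[cpp11::register]] list range1_(const doubles_matrix<>& X) {
  mat A = as_Mat(X);

  vec C = range(A).t(); // range (max - min) of each column
  vec D = range(A, 1);  // range of each row

  writable::list res(2);
  res[0] = as_doubles(C);
  res[1] = as_doubles(D);

  return res;
}
```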
The `cov` function computes the covariance between two matrices or
vectors. If each row of `X` and `Y` is an observation and each column is
a variable, the (i,j)-th entry of `cov(X,Y)` is the covariance between
the i-th variable in `X` and the j-th variable in `Y`.

For two matrix arguments `X` and `Y`, the `cov(X,Y)` function computes
the covariance between the two matrices. For vector arguments, the type
of vector is ignored and each element in the vector is treated as an
observation. The `cov(X)` function is equivalent to `cov(X, X)`.
The `norm_type` argument is optional; by default `norm_type = 0` is
used. The `norm_type` argument controls the type of normalization used,
with `N` denoting the number of observations:

- `norm_type = 0`: normalization is done using `N-1`, providing the best
  unbiased estimation of the covariance matrix (if the observations are
  from a normal distribution).
- `norm_type = 1`: normalization is done using `N`, which provides the
  second moment matrix of the observations about their mean.

Usage:
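A minimal sketch (`cov1_` is an illustrative name):

```cpp
[[cpp11::register]] list cov1_(const doubles_matrix<>& X,
                               const doubles_matrix<>& Y) {
  mat A = as_Mat(X);
  mat B = as_Mat(Y);

  mat C = cov(A, B);    // covariance between A and B (norm_type = 0)
  mat D = cov(A);       // equivalent to cov(A, A)
  mat E = cov(A, B, 1); // normalization using N

  writable::list res(3);
  res[0] = as_doubles_matrix(C);
  res[1] = as_doubles_matrix(D);
  res[2] = as_doubles_matrix(E);

  return res;
}
```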
The `cor` function computes the correlation coefficient between two
matrices or vectors. If each row of `X` and `Y` is an observation and
each column is a variable, the (i,j)-th entry of `cor(X,Y)` is the
correlation coefficient between the i-th variable in `X` and the j-th
variable in `Y`.

For two matrix arguments `X` and `Y`, the `cor(X,Y)` function computes
the correlation coefficient between the two matrices. For vector
arguments, the type of vector is ignored and each element in the vector
is treated as an observation.
The `norm_type` argument is optional; by default `norm_type = 0` is
used. The `norm_type` argument controls the type of normalization used,
with `N` denoting the number of observations:

- `norm_type = 0`: normalization is done using `N-1`.
- `norm_type = 1`: normalization is done using `N`.

Usage:
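A minimal sketch, analogous to the `cov` example above (`cor1_` is an
illustrative name):

```cpp
[[cpp11::register]] list cor1_(const doubles_matrix<>& X,
                               const doubles_matrix<>& Y) {
  mat A = as_Mat(X);
  mat B = as_Mat(Y);

  mat C = cor(A, B);    // correlation between A and B (norm_type = 0)
  mat D = cor(A);       // equivalent to cor(A, A)
  mat E = cor(A, B, 1); // normalization using N

  writable::list res(3);
  res[0] = as_doubles_matrix(C);
  res[1] = as_doubles_matrix(D);
  res[2] = as_doubles_matrix(E);

  return res;
}
```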
The `hist` function computes a histogram of counts for a vector or
matrix. For a vector argument, the histogram is calculated using all the
elements of the vector. For a matrix argument, the histogram is
calculated for each column by default (`dim = 0`), or for each row
(`dim = 1`).

The bin centers can be automatically determined from the data, with the
number of bins specified via `n_bins` (the default is 10); the range of
the bins is determined by the range of the data. Alternatively, the bin
centers can be explicitly specified in the `centers` vector, which must
contain monotonically increasing values.
Usage:
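A minimal sketch for the vector case, assuming the package's `as_Col()`
converter; the `hist1_` name and the example centers are illustrative:

```cpp
[[cpp11::register]] list hist1_(const doubles& x) {
  vec V = as_Col(x);

  uvec H1 = hist(V, 10); // 10 bins, centers determined from the data

  vec centers = linspace<vec>(-3, 3, 13); // monotonically increasing
  uvec H2 = hist(V, centers);             // explicitly specified centers

  writable::list res(2);
  res[0] = as_integers(H1);
  res[1] = as_integers(H2);

  return res;
}
```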
The `histc` function computes a histogram of counts for a vector or
matrix. For a vector argument, the histogram is calculated using all the
elements of the vector. For a matrix argument, the histogram is
calculated for each column by default (`dim = 0`), or for each row
(`dim = 1`).

The bin edges have to be specified and must contain monotonically
increasing values.
Usage:
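A minimal sketch, analogous to the `hist` example; the `histc1_` name
and the example edges are illustrative:

```cpp
[[cpp11::register]] integers histc1_(const doubles& x) {
  vec V = as_Col(x);

  vec edges = linspace<vec>(-3, 3, 13); // monotonically increasing edges
  uvec H = histc(V, edges);             // counts between successive edges

  return as_integers(H);
}
```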
The `quantile` function computes the quantiles corresponding to the
cumulative probability values for a vector or matrix. For a vector
argument, the quantiles are calculated using all the elements of the
vector. For a matrix argument, the quantiles are calculated for each
column by default (`dim = 0`), or for each row (`dim = 1`).

The probabilities have to be specified as a second argument `P`.

The algorithm for calculating the quantiles is based on Definition 5 in:
Rob J. Hyndman and Yanan Fan. Sample Quantiles in Statistical Packages.
The American Statistician, 50(4), 361-365, 1996. DOI: 10.2307/2684934.
Usage:
TODO: This needs a custom method.
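Pending that, a minimal sketch for the vector case; the `quantile1_`
name and the probability values are illustrative:

```cpp
[[cpp11::register]] doubles quantile1_(const doubles& x) {
  vec V = as_Col(x);

  vec P = {0.25, 0.50, 0.75}; // cumulative probability values
  vec Q = quantile(V, P);     // quantiles of V corresponding to P

  return as_doubles(Q);
}
```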
The `normpdf` function computes the probability density function of a
normal distribution for a scalar, vector, or matrix. For each scalar `x`
in `X`, the probability density function is computed according to a
Gaussian (normal) distribution using the corresponding mean value `m` in
`M` and the corresponding standard deviation value `s` in `S`:
$$ y = \frac{1}{s \sqrt{2\pi}} \exp\left[-\frac{(x - m)^2}{2s^2}\right] $$
- `X` can be a scalar, vector, or matrix.
- `M` and `S` can jointly be either scalars, vectors, or matrices.
- If `M` and `S` are omitted, their values are assumed to be 0 and 1,
  respectively.
- To reduce the incidence of numerical underflows, consider using
  `log_normpdf()`.

Usage:
```cpp
[[cpp11::register]] list normpdf1_(const int& n) {
  vec X = randu<vec>(n);
  vec M = randu<vec>(n);
  vec S = randu<vec>(n);

  vec P1 = normpdf(X);             // m = 0, s = 1
  vec P2 = normpdf(X, M, S);       // element-wise m and s
  vec P3 = normpdf(1.23, M, S);    // scalar x, element-wise m and s
  vec P4 = normpdf(X, 4.56, 7.89); // element-wise x, scalar m and s
  double P5 = normpdf(1.23, 4.56, 7.89); // all scalars

  writable::list res(5);
  res[0] = as_doubles(P1);
  res[1] = as_doubles(P2);
  res[2] = as_doubles(P3);
  res[3] = as_doubles(P4);
  res[4] = as_doubles({P5});

  return res;
}
```
The `log_normpdf` function computes the logarithm of the probability
density function of a normal distribution for a scalar, vector, or
matrix. For each scalar `x` in `X`, the logarithm of the probability
density function is computed according to a Gaussian (normal)
distribution using the corresponding mean value `m` in `M` and the
corresponding standard deviation value `s` in `S`:

$$ y = \log\left[\frac{1}{s \sqrt{2\pi}} \exp\left(-\frac{(x - m)^2}{2s^2}\right)\right] $$
- `X` can be a scalar, vector, or matrix.
- `M` and `S` can jointly be either scalars, vectors, or matrices.
- If `M` and `S` are omitted, their values are assumed to be 0 and 1,
  respectively.

Usage:

```cpp
[[cpp11::register]] list lognormpdf1_(const int& n) {
  vec X = randu<vec>(n);
  vec M = randu<vec>(n);
  vec S = randu<vec>(n);

  vec P1 = log_normpdf(X);             // m = 0, s = 1
  vec P2 = log_normpdf(X, M, S);       // element-wise m and s
  vec P3 = log_normpdf(1.23, M, S);    // scalar x, element-wise m and s
  vec P4 = log_normpdf(X, 4.56, 7.89); // element-wise x, scalar m and s
  double P5 = log_normpdf(1.23, 4.56, 7.89); // all scalars

  writable::list res(5);
  res[0] = as_doubles(P1);
  res[1] = as_doubles(P2);
  res[2] = as_doubles(P3);
  res[3] = as_doubles(P4);
  res[4] = as_doubles({P5});

  return res;
}
```
The `normcdf` function computes the cumulative distribution function of
a normal distribution for a scalar, vector, or matrix. For each scalar
`x` in `X`, the cumulative distribution function is computed according
to a Gaussian (normal) distribution using the corresponding mean value
`m` in `M` and the corresponding standard deviation value `s` in `S`.
- `X` can be a scalar, vector, or matrix.
- `M` and `S` can jointly be either scalars, vectors, or matrices.
- If `M` and `S` are omitted, their values are assumed to be 0 and 1,
  respectively.

Usage:

```cpp
[[cpp11::register]] list normcdf1_(const int& n) {
  vec X = randu<vec>(n);
  vec M = randu<vec>(n);
  vec S = randu<vec>(n);

  vec P1 = normcdf(X);             // m = 0, s = 1
  vec P2 = normcdf(X, M, S);       // element-wise m and s
  vec P3 = normcdf(1.23, M, S);    // scalar x, element-wise m and s
  vec P4 = normcdf(X, 4.56, 7.89); // element-wise x, scalar m and s
  double P5 = normcdf(1.23, 4.56, 7.89); // all scalars

  writable::list res(5);
  res[0] = as_doubles(P1);
  res[1] = as_doubles(P2);
  res[2] = as_doubles(P3);
  res[3] = as_doubles(P4);
  res[4] = as_doubles({P5});

  return res;
}
```
The `mvnrnd` function generates a matrix with random column vectors from
a multivariate Gaussian (normal) distribution with parameters `M` and
`C`:

- `M` is the mean and it must be a column vector.
- `C` is the covariance matrix and it must be symmetric positive
  semi-definite (ideally positive definite).
- `N` is the number of column vectors to generate. If `N` is omitted, it
  is assumed to be 1.

Usage:
The first form returns an error if the generation fails. The second and
third forms reset `X` and return a boolean set to `false` without error
if the generation fails.
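A minimal sketch; the `mvnrnd1_` name and the example parameters are
illustrative:

```cpp
[[cpp11::register]] list mvnrnd1_(const int& n) {
  vec M = {1.0, 2.0};  // mean, stored as a column vector

  mat C = {{1.0, 0.3}, // symmetric positive definite covariance matrix
           {0.3, 1.0}};

  mat X = mvnrnd(M, C, n); // n random column vectors; errors on failure

  mat Y;
  bool status = mvnrnd(Y, M, C, n); // returns false on failure

  writable::list res(2);
  res[0] = as_doubles_matrix(X);
  res[1] = writable::logicals({status});

  return res;
}
```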
The `chi2rnd` function generates a random scalar, vector, or matrix with
elements sampled from a chi-squared distribution with the degrees of
freedom specified by parameter `DF` or `DF_scalar`:

- `DF` is a vector or matrix, while `DF_scalar` is a scalar.
- For the `chi2rnd(DF)` form, the output vector or matrix has the same
  size and type as `DF`.
- `DF` and `DF_scalar` must be greater than zero.
- Each element of `DF` specifies a separate degree of freedom.

Usage:
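A minimal sketch; the `chi2rnd1_` name and the degrees of freedom are
illustrative:

```cpp
[[cpp11::register]] list chi2rnd1_(const int& n) {
  // scalar degrees of freedom: n values drawn from chi-squared(2)
  vec A = chi2rnd<vec>(2, n);

  // element-wise degrees of freedom: output has the same size as DF
  vec DF = 2.0 + randu<vec>(n); // degrees of freedom must be > 0
  vec B = chi2rnd(DF);

  writable::list res(2);
  res[0] = as_doubles(A);
  res[1] = as_doubles(B);

  return res;
}
```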
The `wishrnd` function generates a random matrix sampled from the
Wishart distribution with parameters `S` and `df`:

- `S` is a symmetric positive definite matrix (e.g., a covariance
  matrix).
- `df` is a scalar specifying the degrees of freedom; it can be a
  non-integer value.
- `D` is an optional argument to specify the Cholesky decomposition of
  `S`.

Usage:
The first form returns an error if the generation fails. The second form
resets `W` and returns a boolean set to `false` without error if the
generation fails.
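A minimal sketch; the `wishrnd1_` name and its arguments are
illustrative:

```cpp
[[cpp11::register]] list wishrnd1_(const doubles_matrix<>& X,
                                   const double& df) {
  mat S = as_Mat(X); // must be symmetric positive definite

  mat W1 = wishrnd(S, df); // errors on failure

  mat W2;
  bool status = wishrnd(W2, S, df); // returns false on failure

  writable::list res(2);
  res[0] = as_doubles_matrix(W1);
  res[1] = writable::logicals({status});

  return res;
}
```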
The `iwishrnd` function generates a random matrix sampled from the
inverse Wishart distribution with parameters `T` and `df`:

- `T` is a symmetric positive definite matrix.
- `df` is a scalar specifying the degrees of freedom; it can be a
  non-integer value.
- `Dinv` is an optional argument; it specifies the Cholesky
  decomposition of the inverse of `T`. If `Dinv` is provided, `T` is
  ignored. Using `Dinv` is more efficient if `iwishrnd()` needs to be
  used many times for the same `T` matrix.

Usage:
The first form returns an error if the generation fails. The second form
resets `W` and returns a boolean set to `false` without error if the
generation fails.
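A minimal sketch, analogous to the `wishrnd` example; the `iwishrnd1_`
name and its arguments are illustrative:

```cpp
[[cpp11::register]] list iwishrnd1_(const doubles_matrix<>& X,
                                    const double& df) {
  mat T = as_Mat(X); // must be symmetric positive definite

  mat W1 = iwishrnd(T, df); // errors on failure

  mat W2;
  bool status = iwishrnd(W2, T, df); // returns false on failure

  writable::list res(2);
  res[0] = as_doubles_matrix(W1);
  res[1] = writable::logicals({status});

  return res;
}
```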
The `kmeans` function clusters the given data into `k` disjoint sets:

- The `means` parameter is the output matrix for storing the resulting
  centroids of the sets, with each centroid stored as a column vector.
  If the clustering fails, the `means` matrix is reset and `false` is
  returned.
- The `data` parameter is the input data matrix, with each sample stored
  as a column vector.
- The `k` parameter indicates the number of centroids.
- The `seed_mode` parameter specifies how the initial centroids are
  seeded. It is one of:
  - `keep_existing`: use the centroids specified in the `means` matrix
    as the starting point.
  - `static_subset`: use a subset of the data vectors (repeatable).
  - `random_subset`: use a subset of the data vectors (random).
  - `static_spread`: use a maximally spread subset of data vectors
    (repeatable).
  - `random_spread`: use a maximally spread subset of data vectors
    (random start).
- The `n_iter` parameter specifies the number of clustering iterations.
  This is data dependent, but about 10 is typically sufficient.
- The `print_mode` parameter is either `true` or `false`, indicating
  whether progress is printed during clustering.
- The number of samples in the data matrix should be much larger than
  `k`.
- To use multiple threads, enable OpenMP in the compiler (e.g.,
  `-fopenmp` in GCC and Clang). Cpp11armadillo finds OpenMP and uses it
  by default.
- For probabilistic clustering, consider the `gmm_diag` or `gmm_full`
  classes instead.

Usage:

```cpp
[[cpp11::register]] list kmeans1_(const int& n, const int& d) {
  mat data(d, n, fill::randu);

  mat means;
  bool status = kmeans(means, data, 2, random_subset, 10, true);

  if (status == false) {
    stop("clustering failed");
  }

  writable::list res(2);
  res[0] = writable::logicals({status});
  res[1] = as_doubles_matrix(means);

  return res;
}
```
`gmm_diag` and `gmm_full` are classes for multi-variate probabilistic
clustering and likelihood calculation via Gaussian Mixture Models
(GMMs). The implementation details are available in Sanderson and Curtin
(2017).

The distribution of data is modelled as:

$$ p(x) = \sum_{g=0}^{n_{\text{gaus}}-1} h_g N(x \mid m_g, C_g) $$

where:

- $n_{\text{gaus}}$ is the number of Gaussians,
- $N(x \mid m_g, C_g)$ represents a Gaussian (normal) distribution with
  mean $m_g$ and covariance matrix $C_g$,
- $h_g$ is the heft (weight) of the $g$-th Gaussian, with $h_g \geq 0$
  and $\sum_g h_g = 1$.
Both `gmm_diag` and `gmm_full` include tailored k-means and Expectation
Maximisation algorithms for learning model parameters from training
data.

For an instance of `gmm_diag` or `gmm_full` named `M`, the member
functions and variables are:
- `M.log_p(V)`: return a scalar representing the log-likelihood of
  vector `V` (of type `vec`).
- `M.log_p(V, g)`: return a scalar representing the log-likelihood of
  vector `V` (of type `vec`), according to the Gaussian with index `g`.
- `M.log_p(X)`: return a row vector (of type `rowvec`) containing the
  log-likelihoods of each column vector in matrix `X` (of type `mat`).
- `M.log_p(X, g)`: return a row vector (of type `rowvec`) containing the
  log-likelihoods of each column vector in matrix `X` (of type `mat`),
  according to the Gaussian with index `g`.
- `M.sum_log_p(X)`: return a scalar representing the sum of
  log-likelihoods for all column vectors in matrix `X` (of type `mat`).
- `M.sum_log_p(X, g)`: return a scalar representing the sum of
  log-likelihoods for all column vectors in matrix `X` (of type `mat`),
  according to the Gaussian with index `g`.
- `M.avg_log_p(X)`: return a scalar representing the average
  log-likelihood of all column vectors in matrix `X` (of type `mat`).
- `M.avg_log_p(X, g)`: return a scalar representing the average
  log-likelihood of all column vectors in matrix `X` (of type `mat`),
  according to the Gaussian with index `g`.
- `M.assign(V, dist_mode)`: return the index of the closest mean (or
  Gaussian) to vector `V` (of type `vec`); parameter `dist_mode` is one
  of:
  - `eucl_dist`: Euclidean distance (takes only means into account).
  - `prob_dist`: probabilistic "distance", defined as the inverse
    likelihood (takes into account means, covariances, and hefts).
- `M.assign(X, dist_mode)`: return a row vector (of type `urowvec`)
  containing the indices of the closest means (or Gaussians) to each
  column vector in matrix `X` (of type `mat`); parameter `dist_mode` is
  either `eucl_dist` or `prob_dist` (as per the `.assign()` function
  above).
- `M.raw_hist(X, dist_mode)`: return a row vector (of type `urowvec`)
  representing the raw histogram of counts; each entry is the number of
  counts corresponding to a Gaussian; each count is the number of times
  the corresponding Gaussian was the closest to each column vector in
  matrix `X`; parameter `dist_mode` is either `eucl_dist` or `prob_dist`
  (as per the `.assign()` function above).
- `M.norm_hist(X, dist_mode)`: similar to the `.raw_hist()` function
  above; return a row vector (of type `rowvec`) containing normalised
  counts; the vector sums to one; parameter `dist_mode` is either
  `eucl_dist` or `prob_dist` (as per the `.assign()` function above).
- `M.generate()`: return a column vector (of type `vec`) representing a
  random sample generated according to the model's parameters.
- `M.generate(N)`: return a matrix (of type `mat`) containing `N` column
  vectors, with each vector representing a random sample generated
  according to the model's parameters.
- `M.reset(n_dims, n_gaus)`: set the model to have dimensionality
  `n_dims`, with `n_gaus` number of Gaussians; all the means are set to
  zero, all covariance matrix representations are equivalent to the
  identity matrix, and all the hefts (weights) are set to be uniform.
- `M.hefts`: read-only row vector (of type `rowvec`) containing the
  hefts (weights).
- `M.means`: read-only matrix (of type `mat`) containing the means
  (centroids), stored as column vectors.
- `M.dcovs`: read-only matrix (of type `mat`) containing the
  representation of diagonal covariance matrices, with the set of
  diagonal covariances for each Gaussian stored as a column vector;
  applicable only to the `gmm_diag` class.
- `M.fcovs`: read-only cube containing the covariance matrices, with
  each covariance matrix stored as a slice; applicable only to the
  `gmm_full` class.
- `M.set_hefts(V)`: set the hefts (weights) of the model to be as
  specified in row vector `V` (of type `rowvec`); the number of hefts
  must match the existing model.
- `M.set_means(X)`: set the means to be as specified in matrix `X` (of
  type `mat`); the number of means and their dimensionality must match
  the existing model.
- `M.set_dcovs(X)`: set the diagonal covariances to be as specified in
  matrix `X` (of type `mat`), with the set of diagonal covariances for
  each Gaussian stored as a column vector; the number of covariance
  matrices and their dimensionality must match the existing model;
  applicable only to the `gmm_diag` class.
- `M.set_fcovs(X)`: set the full covariances to be as specified in cube
  `X`, with each covariance matrix stored as a slice within the cube;
  the number of covariance matrices and their dimensionality must match
  the existing model; applicable only to the `gmm_full` class.
- `M.set_params(means, covs, hefts)`: set all the parameters at the same
  time, using the same format as the `.set_hefts()`, `.set_means()`,
  `.set_dcovs()`, and `.set_fcovs()` functions above; the number of
  Gaussians and dimensionality can be different from the existing model.
- `M.learn(data, n_gaus, dist_mode, seed_mode, km_iter, em_iter,
  var_floor, print_mode)`: learn the model parameters via the k-means
  and/or EM algorithms, and return a boolean set to `true` if the
  training was successful:
  - `data`: matrix (of type `mat`) containing training samples; each
    sample is stored as a column vector.
  - `n_gaus`: set the number of Gaussians to `n_gaus`; to help
    convergence, it is recommended that the given data matrix (above)
    contains at least 10 samples for each Gaussian.
  - `dist_mode`: specifies the distance used during the seeding of
    initial means and k-means clustering:
    - `eucl_dist`: Euclidean distance.
    - `maha_dist`: Mahalanobis distance, which uses a global diagonal
      covariance matrix estimated from the training samples; this is
      recommended for probabilistic applications.
  - `seed_mode`: specifies how the initial means are seeded prior to
    running the k-means and/or EM algorithms:
    - `keep_existing`: keep the existing model (do not modify the means,
      covariances, and hefts).
    - `static_subset`: a subset of the training samples (repeatable).
    - `random_subset`: a subset of the training samples (random).
    - `static_spread`: a maximally spread subset of training samples
      (repeatable).
    - `random_spread`: a maximally spread subset of training samples
      (random start).
    Seeding with `static_spread` and `random_spread` can be much more
    time consuming than with `static_subset` and `random_subset`.
  - `km_iter`: the number of iterations of the k-means algorithm; this
    is data dependent, but typically 10 iterations are sufficient.
  - `em_iter`: the number of iterations of the EM algorithm; this is
    data dependent, but typically 5 to 10 iterations are sufficient.
  - `var_floor`: the variance floor (smallest allowed value) for the
    diagonal covariances; setting this to a small non-zero value can
    help with convergence and/or better quality parameter estimates.
  - `print_mode`: either `true` or `false`; enable or disable printing
    of progress during the k-means and EM algorithms.

Notes:

- `gmm_diag` is tailored for diagonal covariance matrices.
- `gmm_full` is tailored for full covariance matrices.
- `gmm_diag` is considerably faster than `gmm_full`, at the cost of some
  reduction in modelling accuracy.
- To use multiple threads, enable OpenMP in the compiler (e.g.,
  `-fopenmp` in GCC and Clang). Cpp11armadillo finds OpenMP and uses it
  by default.

Usage:

```cpp
[[cpp11::register]] list gmm1_(const int& n, const int& d) {
  // create synthetic data with 2 Gaussians
  // (two thirds of the samples around mean0, one third around mean1)
  mat data(d, n, fill::zeros);

  vec mean0 = linspace<vec>(1, d, d);
  vec mean1 = mean0 + 2;

  int i = 0;

  while (i < n) {
    if (i < n) {
      data.col(i) = mean0 + randn<vec>(d);
      ++i;
    }
    if (i < n) {
      data.col(i) = mean0 + randn<vec>(d);
      ++i;
    }
    if (i < n) {
      data.col(i) = mean1 + randn<vec>(d);
      ++i;
    }
  }

  // model the data as a diagonal GMM with 2 Gaussians
  gmm_diag model;

  bool status = model.learn(data, 2, maha_dist, random_subset, 10, 5, 1e-5,
                            true);

  if (status == false) {
    stop("learning failed");
  }

  model.means.print("means:");

  double scalar_likelihood = model.log_p(data.col(0));
  rowvec set_likelihood = model.log_p(data.cols(0, 9));
  double overall_likelihood = model.avg_log_p(data);

  uword gaus_id = model.assign(data.col(0), eucl_dist);
  urowvec gaus_ids = model.assign(data.cols(0, 9), prob_dist);

  rowvec hist1 = model.norm_hist(data, eucl_dist);
  urowvec hist2 = model.raw_hist(data, prob_dist);

  writable::list res(9);
  res[0] = writable::logicals({status});
  res[1] = as_doubles_matrix(model.means);
  res[2] = as_doubles({scalar_likelihood});
  res[3] = as_doubles(set_likelihood.t());
  res[4] = as_doubles({overall_likelihood});
  res[5] = as_integers(gaus_id);
  res[6] = as_integers(gaus_ids.t());
  res[7] = as_doubles(hist1.t());
  res[8] = as_integers(hist2.t());

  return res;
}
```