gpp_linear_algebra

Contents:

gpp_linear_algebra.hpp

This file provides low-level linear algebra functionality in support of the other gpp_* components. The functions here generally wrap BLAS (levels 1 through 3) and LAPACK functionality, plus a few utilities for convenience/debugging (e.g., matrix printouts).

First, look over gpp_common.hpp for general comments on loop layouts, storage formats, and shorthand as well as definitions of standard notation.

The functions here wrap BLAS/LAPACK functionality to make it convenient/easy to switch between different library implementations (e.g., different vendors or even on a GPU). They also provide an opportunity to avoid associated overhead for “small” problem sizes. These functions wrap BLAS:

  • Level 1: O(n) operations; vector scale, dot product, etc.
  • Level 2: O(n^2) operations; Matrix-vector multiplies (+ special cases) and triangular solves
  • Level 3: O(n^3) operations; matrix-matrix multiplies and triangular solves

and LAPACK:

  • O(n^3), triangular factorization routines (PLU, Cholesky)
  • O(n^3), matrix inverse

Matrix storage formats are always column-major as prescribed in gpp_common.hpp. We also only deal with lower triangular matrices here; these are stored in the lower triangle. The contents of the upper triangle are ignored and can technically have any value, even NaN. See PLU for further description of its special output format.

namespace optimal_learning

Macro to allow restrict as a keyword for C++ compilation and CUDA/nvcc compilation. See related entry in gpp_common.hpp for more details.

Functions

OL_NONNULL_POINTERS void VectorScale(int size, double alpha, double *restrict vector)

Multiplies first size elements of vector by alpha, vector := vector*alpha.

Should be equivalent to BLAS call: dscal(size, alpha, vector, 1);

Parameters:
size:number of elements in vector
alpha:number to scale by
vector[size]:vector to scale
Outputs:
vector[size]:vector with elements scaled

OL_NONNULL_POINTERS void VectorAXPY(int size, double alpha, double const *restrict vec1, double *restrict vec2)

Computes y_i = alpha * x_i + y_i; y is modified in-place.

Parameters:
size:number of elements in x, y
alpha:quantity to scale by
vec1[size]:x, vector to scale and add to y
vec2[size]:y, vector to add to
Outputs:
vec2[size]:input y plus alpha*x

OL_PURE_FUNCTION OL_NONNULL_POINTERS OL_WARN_UNUSED_RESULT double DotProduct(double const *restrict vector1, double const *restrict vector2, int size)

Computes dot product between two vectors.

Equivalent BLAS call: ddot(size, vector1, 1, vector2, 1);

Parameters:
vector1[size]:first vector
vector2[size]:second vector
size:length of vectors
Returns:
dot/inner product, <vector1, vector2>
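
For illustration, a hypothetical usage of these level-1 wrappers (a minimal sketch, assuming the declarations above are visible via gpp_linear_algebra.hpp; not taken from the library's own tests):

#include <cstdio>

void Level1Example() {
  double x[3] = {1.0, 2.0, 3.0};
  double y[3] = {4.0, 5.0, 6.0};
  optimal_learning::VectorScale(3, 2.0, x);          // x = {2, 4, 6}
  optimal_learning::VectorAXPY(3, -1.0, x, y);       // y = -1*x + y = {2, 1, 0}
  double d = optimal_learning::DotProduct(x, y, 3);  // 2*2 + 4*1 + 6*0 = 8
  std::printf("dot product: %f\n", d);
}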

OL_NONNULL_POINTERS void CholeskyFactorLMatrixVectorSolve(double const *restrict A, int size_m, double *restrict x)

Solves A * x = b IN-PLACE, where A has been previously cholesky-factored (A = L * L^T) such that the lower triangle of A contains L. Consists of two calls to TriangularMatrixVectorSolve. As in that function, before calling, x holds the RHS, b. After return, x will be OVERWRITTEN with the solution. No inputs may be nullptr.

Math:

  A * x = b
  L * L^T * x = b
  L^T * x = L \ b          (dtrsv, no transpose)
  x = L^T \ (L \ b)        (dtrsv, transpose)

Should be equivalent to BLAS call: dpotrs('L', size_m, 1, A, size_m, x, size_m, &info);

Parameters:
A[size_m][size_m]:
 cholesky-factored system of equations such that its lower triangle contains L; i.e., the in/out argument 'chol' after calling ComputeCholeskyFactorL(size_m, chol) with chol initially holding A
size_m:dimension of A
x[size_m]:the RHS vector, b
Outputs:
x[size_m]:the solution, A\b.
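
A minimal sketch of the two-solve structure described above (assuming the TriangularMatrixVectorSolve declaration from this header is in scope; illustrative, not the library's exact code):

void CholeskySolveSketch(double const *A, int size_m, double *x) {
  TriangularMatrixVectorSolve(A, 'N', size_m, size_m, x);  // x := L \ b
  TriangularMatrixVectorSolve(A, 'T', size_m, size_m, x);  // x := L^T \ (L \ b) = A \ b
}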

OL_NONNULL_POINTERS void CholeskyFactorLMatrixMatrixSolve(double const *restrict A, int size_m, int size_n, double *restrict X)

Same as CholeskyFactorLMatrixVectorSolve except this accepts matrix RHSs; it solves multi-RHS linear systems of the form: A * X = B. A must have been Cholesky-factored (L * L^T = A) beforehand, and it must hold the factor L in its lower triangle. Usually this is obtained via ComputeCholeskyFactorL().

Note that this operation is done IN-PLACE. X initially holds B and is overwritten with the solution.

MATH: This is analogous to CholeskyFactorLMatrixVectorSolve; see that function for further details.

Should be equivalent to BLAS call: dpotrs('L', size_m, size_n, A, size_m, X, size_m, &info);

Parameters:
A[size_m][size_m]:
 non-singular matrix holding L, the cholesky factor of A, in its lower triangle
size_m:number of rows of A, X, B; number of columns of A
size_n:number of columns of X, B
X[size_m][size_n]:
 matrix of RHS vectors, B
Outputs:
X[size_m][size_n]:
 matrix of solutions, A\B
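
A minimal sketch of the multi-RHS structure (column-major X; the single-RHS solve is applied to each column in turn; illustrative only):

void CholeskyMatrixSolveSketch(double const *A, int size_m, int size_n, double *X) {
  for (int j = 0; j < size_n; ++j) {
    CholeskyFactorLMatrixVectorSolve(A, size_m, X + j*size_m);  // solve for column j of X
  }
}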

gpp_linear_algebra.cpp

Implementations of functions for linear algebra operations; the functionality here largely is a subset of that supported by BLAS and LAPACK.

The linear algebra functions currently do not call libraries like BLAS/LAPACK because, for the [currently] small problem sizes, the associated overhead kills their performance advantage. Additionally, the custom implementations gain some performance on our specific use cases through more restrictive assumptions on data ordering and by skipping advanced optimizations like blocking, which small problem sizes do not need.

However, if/when BLAS is needed, current linear algebra functions are designed to easily map into BLAS calls so they can serve as wrappers later. This also makes it easy to handle BLAS from different vendors and on different computing environments (e.g., GPUs, Xeon Phi).

See gpp_linear_algebra.hpp file docs and (primarily) gpp_common.hpp for a few important implementation notes (e.g., restrict, memory allocation, matrix storage style, etc). Note the matrix looping idiom (gpp_common.hpp, item 8) in particular; in summary, we use:

// Loop idiom (see gpp_common.hpp, item 8): the inner loop reads A contiguously;
// instead of indexing A[i*n + j], the pointer A is advanced past the n entries
// just consumed.
for (int i = 0; i < m; ++i) {
  y[i] = 0;
  for (int j = 0; j < n; ++j) {
    y[i] += A[j]*x[j];  // accumulate against the current contiguous slice of A
  }
  A += n;  // step A to the next slice
}

Defines

OL_CHOL(i, j)

namespace optimal_learning

Macro to allow restrict as a keyword for C++ compilation and CUDA/nvcc compilation. See related entry in gpp_common.hpp for more details.

Functions

double VectorNorm(double const *restrict vector, int size)

Slow (compared to naively computing \sqrt(\sum_i x_i^2)) but stable computation of \|vector\|_2.

Computing norm += Square(vector[i]) can be unsafe due to overflow & precision loss.

Note that for very large vectors, this method is still potentially inaccurate. BUT we are not using anything nearly that large right now. The best solution for accuracy would be to use Kahan summation. Around 10^16 elements, this function will fail most of the time. Around 10^8 elements, the loss of precision may already be substantial.

Computes \|x\|_2 in a reasonably (see implementation notes) accurate and stable way.

Slower than the naive implementation due to scaling done to prevent overflow & reduce precision loss.

Parameters:
vector[size]:the vector x
size:number of elements in x
Returns:
The vector 2-norm (aka Euclidean norm) of x.
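
A minimal sketch of one overflow-resistant 2-norm (scaling by the largest magnitude before squaring); this is an assumption about the approach, not necessarily the library's exact algorithm:

#include <algorithm>
#include <cmath>

double VectorNormSketch(double const *vector, int size) {
  double max_abs = 0.0;
  for (int i = 0; i < size; ++i) {
    max_abs = std::max(max_abs, std::fabs(vector[i]));
  }
  if (max_abs == 0.0) {
    return 0.0;  // the zero vector
  }
  double sum = 0.0;
  for (int i = 0; i < size; ++i) {
    double scaled = vector[i] / max_abs;  // |scaled| <= 1, so squaring cannot overflow
    sum += scaled * scaled;
  }
  return max_abs * std::sqrt(sum);
}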

void MatrixTranspose(double const *restrict matrix, int num_rows, int num_cols, double *restrict transpose)

Transposes a 2D matrix. O(num_rows*num_cols) operations, i.e., O(n^2) for square matrices. Has no effect on symmetric matrices.

For example, the 3x4 matrix stored (column-major) as A[3][4] = [4 53 81 32 12 2 5 8 93 2 1 0], i.e.,

  [  4  32   5   2
    53  12   8   1
    81   2  93   0 ]

becomes the 4x3 matrix stored as A[4][3] = [4 32 5 2 53 12 8 1 81 2 93 0], i.e.,

  [  4  53  81
    32  12   2
     5   8  93
     2   1   0 ]

Parameters:
matrix[num_rows][num_cols]:
 matrix to be transposed
num_rows:number of rows in matrix
num_cols:number of columns in matrix
Outputs:
transpose[num_cols][num_rows]:
 transpose of matrix
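
A minimal sketch of a column-major transpose (the loop order is illustrative): element (i, j) of matrix lives at matrix[j*num_rows + i] and lands at transpose[i*num_cols + j].

void MatrixTransposeSketch(double const *matrix, int num_rows, int num_cols,
                           double *transpose) {
  for (int j = 0; j < num_cols; ++j) {    // column j of matrix
    for (int i = 0; i < num_rows; ++i) {  // row i of matrix
      transpose[i*num_cols + j] = matrix[j*num_rows + i];
    }
  }
}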

void ZeroUpperTriangle(int size, double *restrict matrix)

Zeroes the strict upper triangle of a matrix (assuming column-major storage)

Parameters:
size:dimension of matrix
matrix[size][size]:
 matrix whose upper tri is to be zeroed (on input)
Outputs:
matrix[size][size]:
 lower triangular part of input matrix

int ComputeCholeskyFactorL(int size_m, double *restrict chol)

Cholesky factorization, A = L * L^T (see Smith 1995 or Golub, Van Loan 1983, etc.) This implementation uses the outer-product formulation. The outer-product version is 2x slower than gaxpy or dot product style implementations; this is to remain consistent with Smith 1995’s formulation of the gradient of cholesky.

This implementation is not optimized, nor does it pivot when symmetric indefinite matrices or poorly conditioned SPD matrices are detected.

Instead, non-SPD matrices trigger an error printed to stdout.

Should be the same as BLAS call: dpotrf('L', size_m, A, size_m, &info); Implementation is similar to dpotf2, the unblocked version (same arg list as dpotrf).

Computes the cholesky factorization of a symmetric, positive-definite (SPD) matrix, A = L * L^T; A is the input matrix, L is the (lower triangular) cholesky factor. O(n^3) operations. No inputs may be nullptr.

A must be SPD. Calling this function on an indefinite matrix will produce a nonsensical L. Calling this function on a semi-definite matrix may result in a severe loss of precision as well as inaccurate L.

The strict upper triangle of chol is NOT accessed.

Parameters:
size_m:dimension of matrix
chol[size_m][size_m]:
 SPD (square) matrix (A) (on entry)
Outputs:
chol[size_m][size_m]:
 cholesky factor of A (L). L is stored in the lower triangle of A. Do not access the upper triangle of L. (on exit)
Returns:
0 if successful. Otherwise the matrix is NOT positive definite and this returns i, the index of the i-th leading minor that is not positive definite.
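
A minimal sketch of an unblocked, outer-product Cholesky on a column-major SPD matrix (illustrative; the library's error reporting and exact loop structure may differ). chol holds A on entry; on exit its lower triangle holds L:

#include <cmath>

int CholeskyFactorSketch(int size_m, double *chol) {
  for (int k = 0; k < size_m; ++k) {
    double pivot = chol[k*size_m + k];
    if (pivot <= 0.0) {
      return k + 1;  // (k+1)-th leading minor is not positive definite
    }
    pivot = std::sqrt(pivot);
    chol[k*size_m + k] = pivot;
    for (int i = k + 1; i < size_m; ++i) {
      chol[k*size_m + i] /= pivot;  // scale column k below the diagonal
    }
    for (int j = k + 1; j < size_m; ++j) {  // rank-1 update of the trailing lower triangle
      for (int i = j; i < size_m; ++i) {
        chol[j*size_m + i] -= chol[k*size_m + i] * chol[k*size_m + j];
      }
    }
  }
  return 0;
}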

void TriangularMatrixVectorSolve(double const *restrict A, char trans, int size_m, int lda, double *restrict x)

Solve A*x = b or A^T*x = b when A is lower triangular IN-PLACE. Uses the standard “backsolve” technique, instead of forming A^-1 which is VERY poorly conditioned. Backsolve, however, is backward-stable.

See .h file docs for information on “lda”.

Should be equiv to BLAS call: dtrsv('L', trans, 'N', size_m, A, lda, x, 1);

Solves the system A*x = b or A^T * x = b when A is lower triangular. A must be nonsingular. Before calling, x holds the RHS, b. After return, x will be OVERWRITTEN with the solution. No inputs may be nullptr.

DOES NOT form A^-1 explicitly.

Nonsensical output for singular A.

Parameters:
A[size_m][size_m]:
 input to be solved; must be lower triangular and non-singular
trans:‘N’ to solve A * x = b, ‘T’ to solve A^T * x = b
size_m:dimension of A
lda:the first dimension of A as declared by the caller; lda >= size_m
x[size_m]:the RHS vector, b
Outputs:
x[size_m]:the solution, A\b.
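
A minimal sketch of the in-place forward/back substitution for a column-major, lower-triangular A with leading dimension lda (illustrative only):

void TriangularSolveSketch(double const *A, char trans, int size_m, int lda, double *x) {
  if (trans == 'N') {
    // Forward substitution: solve A*x = b, proceeding from x[0] to x[size_m-1].
    for (int i = 0; i < size_m; ++i) {
      for (int j = 0; j < i; ++j) {
        x[i] -= A[j*lda + i] * x[j];  // A(i, j), i > j
      }
      x[i] /= A[i*lda + i];
    }
  } else {
    // Back substitution: solve A^T*x = b, proceeding from x[size_m-1] down to x[0].
    for (int i = size_m - 1; i >= 0; --i) {
      for (int j = i + 1; j < size_m; ++j) {
        x[i] -= A[i*lda + j] * x[j];  // A^T(i, j) = A(j, i)
      }
      x[i] /= A[i*lda + i];
    }
  }
}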

void TriangularMatrixMatrixSolve(double const *restrict A, char trans, int size_m, int size_n, int lda, double *restrict X)

Calls the dtrsv equivalent (TriangularMatrixVectorSolve) on each column of X, solving A * X_i = B_i (X_i being the i-th column of X). Does not use blocking or any other optimization techniques.

Should be equiv to BLAS call: dtrsm('L', 'L', trans, 'N', size_m, size_n, 1.0, A, lda, X, size_m);

Solve A * X = B or A^T * X = B (A, X, B matrices) when A is lower triangular. Solves IN-PLACE.

Parameters:
A[size_m][size_m]:
 input to be solved; must be lower triangular and non-singular
trans:‘N’ to solve A * X = B, ‘T’ to solve A^T * X = B
size_m:dimension of A
lda:the first dimension of A as declared by the caller; lda >= size_m
X[size_m][size_n]:
 the RHS matrix, B
Outputs:
X[size_m][size_n]:
 the solution, A\B.
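
A minimal sketch of the column-by-column structure: each column of X is an independent RHS handed to the vector solve (illustrative only):

void TriangularMatrixMatrixSolveSketch(double const *A, char trans, int size_m,
                                       int size_n, int lda, double *X) {
  for (int j = 0; j < size_n; ++j) {
    TriangularMatrixVectorSolve(A, trans, size_m, lda, X + j*size_m);  // column j of X
  }
}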

void TriangularMatrixInverse(double const *restrict matrix, int size_m, double *restrict inv_matrix)

Computes L^-1, the inverse of a lower triangular matrix L (e.g., a previously computed cholesky factor). Only the lower triangle of the input is read.

Computes inverse by successive x_i = A \ e_i operations, where e_i is the unit vector with a 1 in the i-th entry. Implementation saves computation (factor of 2) by ignoring all leading zeros in x_i. So x_0 = A \ e_0 is \approx m^2 operations, x_1 = A \ e_1 is \approx (m-1)^2 operations, ..., and x_{m-1} = A \ e_{m-1} is 1 operation.

This is NOT backward-stable and should NOT be used! Substantial superfluous numerical error can occur for poorly conditioned matrices. Caveat: may have utility if you are very certain of what you are doing in the face of [severe] loss of precision

Computes L^-1 when L is lower triangular. L must be nonsingular otherwise the output will be nonsense.

Parameters:
matrix[size_m][size_m]:
 triangular matrix, L, to be inverted
size_m:dimension of matrix
Outputs:
inv_matrix[size_m][size_m]:
 inverse of L

void TriangularMatrixVectorMultiply(double const *restrict A, char trans, int size_m, double *restrict x)

Special case of GeneralMatrixVectorMultiply. As long as A has zeros in the strict upper-triangle, GeneralMatrixVectorMultiply will work too (but take >= 2x as long).

Computes results IN-PLACE. Avoids accessing the strict upper triangle of A.

Should be equivalent to BLAS call: dtrmv('L', trans, 'N', size_m, A, size_m, x, 1);

Computes A * x or A^T * x in-place. A must be lower-triangular. The vector x is OVERWRITTEN with the result before return.

Parameters:
A[size_m][size_m]:
 lower triangular matrix to be multiplied
trans:‘N’ for A * x, ‘T’ for A^T * x
size_m:dimension of A, x
x[size_m]:vector to multiply by A
Outputs:
x[size_m]:the product A * x or A^T * x
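
A minimal sketch of the in-place lower-triangular multiply (column-major A; illustrative only). The traversal order ensures each entry of x is read before it is overwritten:

void TriangularMatrixVectorMultiplySketch(double const *A, char trans, int size_m, double *x) {
  if (trans == 'N') {
    // x := A * x; row i of the result depends only on x[0..i], so go bottom-up.
    for (int i = size_m - 1; i >= 0; --i) {
      double sum = 0.0;
      for (int j = 0; j <= i; ++j) {
        sum += A[j*size_m + i] * x[j];  // A(i, j), i >= j
      }
      x[i] = sum;
    }
  } else {
    // x := A^T * x; row i of the result depends only on x[i..size_m-1], so go top-down.
    for (int i = 0; i < size_m; ++i) {
      double sum = 0.0;
      for (int j = i; j < size_m; ++j) {
        sum += A[i*size_m + j] * x[j];  // A^T(i, j) = A(j, i)
      }
      x[i] = sum;
    }
  }
}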

void SymmetricMatrixVectorMultiply(double const *restrict A, double const *restrict x, int size_m, double *restrict y)

Special case of GeneralMatrixVectorMultiply for symmetric A (need not be SPD). As long as A is stored fully (i.e., upper triangle is valid), GeneralMatrixVectorMultiply will work too (but take >= 2x as long).

Avoids accessing the strict upper triangle of A.

Should be equivalent to BLAS call: dsymv('L', size_m, 1.0, A, size_m, x, 1, 0.0, y, 1);

Computes y = A * x (or equivalently y = A^T * x). This is NOT done in-place. A must be symmetric. Only the lower triangular part of A is read, so there is no need to store the duplicate values when using this function. No inputs may be nullptr.

Parameters:
A[size_m][size_m]:
 symmetric matrix to be multiplied
x[size_m]:vector to multiply by A
size_m:dimension of A, x
Outputs:
y[size_m]:the product A * x
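
A minimal sketch of y = A * x that reads only the lower triangle of a symmetric, column-major A (illustrative; the library's loop structure may differ):

void SymmetricMatrixVectorMultiplySketch(double const *A, double const *x,
                                         int size_m, double *y) {
  for (int i = 0; i < size_m; ++i) {
    y[i] = 0.0;
  }
  for (int j = 0; j < size_m; ++j) {
    y[j] += A[j*size_m + j] * x[j];       // diagonal term A(j, j)
    for (int i = j + 1; i < size_m; ++i) {
      double a_ij = A[j*size_m + i];      // stored entry A(i, j), i > j
      y[i] += a_ij * x[j];                // contribution of A(i, j)
      y[j] += a_ij * x[i];                // mirrored contribution of A(j, i)
    }
  }
}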

void GeneralMatrixVectorMultiply(double const *restrict A, char trans, double const *restrict x, double alpha, double beta, int size_m, int size_n, int lda, double *restrict y)

Computes matrix-vector product y = alpha * A * x + beta * y or y = alpha * A^T * x + beta * y. Since A is stored column-major, we treat the matrix-vector product as a weighted sum of the columns of A, where x provides the weights.

That is, a matrix-vector product can be thought of as (trans = 'T'):

  [  a_row1  ][   ]
  [  a_row2  ][ x ]
  [    ...   ][   ]
  [  a_rowm  ][   ]

That is, y_i is the dot product of the i-th row of A with x.

OR the “dual” view (trans = 'N'):

  [        |        |     |        ][ x_1 ]
  [ a_col1 | a_col2 | ... | a_coln ][ ... ]  =  x_1*a_col1 + ... + x_n*a_coln
  [        |        |     |        ][ x_n ]

That is, y is the weighted sum of columns of A.

Should be equivalent to BLAS call: dgemv(trans, size_m, size_n, alpha, A, size_m, x, 1, beta, y, 1);

Computes y = alpha * A * x + beta * y or y = alpha * A^T * x + beta * y. This is NOT done in-place. A can be any 2-dimensional matrix (no requirements on triangularity, symmetry, etc.). No inputs may be nullptr.

Parameters:
A[size_m][size_n]:
 input matrix to be multiplied
trans:whether to multiply by A (‘N’) or A^T (‘T’)
x[size_n OR size_m]:
 vector to multiply by A, size_n if trans=='N', size_m if trans=='T'
alpha:scale factor on A*x
beta:scale factor on y
size_m:number of rows of A; size of y if trans == 'N', size of x if trans == 'T'
size_n:number of columns of A, size of x if trans == 'N', size of y if trans == 'T'
lda:the first dimension of A as declared by the caller; lda >= size_m
Outputs:
y[size_m OR size_n]:
 the result, alpha * A * x + beta * y (or with A^T); length size_m if trans == 'N', size_n if trans == 'T'
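
A minimal sketch of the trans == 'N' case, computing y = alpha * A * x + beta * y as a weighted sum of the columns of A (column-major storage with leading dimension lda; illustrative only):

void GemvNoTransSketch(double const *A, double const *x, double alpha, double beta,
                       int size_m, int size_n, int lda, double *y) {
  for (int i = 0; i < size_m; ++i) {
    y[i] *= beta;  // for simplicity, beta is applied by multiplication; a real dgemv special-cases beta == 0
  }
  for (int j = 0; j < size_n; ++j) {
    double weight = alpha * x[j];       // weight for column j
    double const *column = A + j*lda;   // column j of A (column-major)
    for (int i = 0; i < size_m; ++i) {
      y[i] += weight * column[i];
    }
  }
}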

void GeneralMatrixMatrixMultiply(double const *restrict Amat, char transA, double const *restrict Bmat, double alpha, double beta, int size_m, int size_k, int size_n, double *restrict Cmat)

Matrix-matrix product C = alpha * op(A) * B + beta * C, where op(A) is A or A^T. Does so by computing matrix-vector products of A with each column of B (to generate corresponding column of C).

Does not use blocking or other advanced optimization techniques.

Should be equivalent to BLAS call (shown for transA == 'N'): dgemm('N', 'N', size_m, size_n, size_k, alpha, A, size_m, B, size_k, beta, C, size_m);

Computes the matrix-matrix product C = alpha * op(A) * B + beta * C, where op(A) = A or A^T depending on transA. A, B, C can be general matrices (no requirements on symmetry, etc.). Equivalent to calling GeneralMatrixVectorMultiply with op(A) against each column of B. No inputs may be nullptr.

Parameters:
A[size_m][size_k]:
 left matrix multiplicand
transA:whether to multiply by A (‘N’) or A^T (‘T’)
B[size_k][size_n]:
 right matrix multiplicand
alpha:scale factor on A*B
beta:scale factor on C
size_m:rows of op(A), C
size_k:cols of op(A), rows of B
size_n:cols of B, C
Outputs:
C[size_m][size_n]:
 the result, alpha * op(A) * B + beta * C
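
A minimal sketch of the described structure for transA == 'N': each column of C is formed by one matrix-vector product against the corresponding column of B (illustrative only):

void GemmNoTransSketch(double const *Amat, double const *Bmat, double alpha,
                       double beta, int size_m, int size_k, int size_n, double *Cmat) {
  for (int j = 0; j < size_n; ++j) {
    // C(:, j) = alpha * A * B(:, j) + beta * C(:, j)
    GeneralMatrixVectorMultiply(Amat, 'N', Bmat + j*size_k, alpha, beta,
                                size_m, size_k, size_m, Cmat + j*size_m);
  }
}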

void SPDMatrixInverse(double const *restrict matrix, int size_m, double *restrict inv_matrix)

Computes A^-1 by cholesky-factoring A = L * L^T, computing L^-1, and then computing A^-1 = L^-T * L^-1.

This is NOT backward-stable and should NOT be used! Substantial superfluous numerical error can occur for poorly conditioned matrices. Caveat: may have utility if you are very certain of what you are doing in the face of [severe] loss of precision

Computes A^-1 when A is SPD. Output is nonsense for non-SPD A matrices. A must be previously cholesky-factored (e.g., via ComputeCholeskyFactorL)

Parameters:
matrix[size_m][size_m]:
 matrix, A, to be inverted (only the lower-triangle is read); A must be previously cholesky-factored.
size_m:dimension of A
Outputs:
inv_matrix[size_m][size_m]:
 A^-1 (stored as a full matrix even though it is symmetric)

int ComputePLUFactorization(int r, int *restrict pivot, double *restrict A)

Computes the PLU factorization of a matrix A using LU-decomposition with partial pivoting: A = P * L * U. P is a permutation matrix: an identity matrix with some rows (potentially) swapped. L is lower triangular with 1s on the diagonal. U is upper triangular. Since L's diagonal is unit, L and U can be stored together in A (omitting the 1s). Since P is just a permuted identity, we can instead track the row-swaps done in a vector.

Fails with an error code if the matrix is definitely singular, i.e., a pivot element is 0 or subnormal.

For further details:

  1. http://en.wikipedia.org/wiki/LU_decomposition (not the best info really)
  2. L. Trefethen and D. Bau, Numerical Linear Algebra, Chp 20-23
  3. G. Golub and C. Van Loan, Matrix Computations, Chp 3
  4. keywords: PLU, LU factorization/decomposition [with partial pivoting]

Parameters:
r:dimension of the matrix
A[size][size]:the matrix to be factored (e.g., describing a system of equations)
Outputs:
pivot[size]:array of pivot positions (i.e., row-swaps) used during factorization
A[size][size]:a matrix containing the L, U factors
Returns:
0 if successful. Otherwise the index (fortran-style, from 1) of the failed pivot element.
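
A minimal sketch of LU factorization with partial pivoting on a column-major matrix, done in place (illustrative only; in particular, it assumes pivot[k] records the 0-based row swapped with row k, and its singularity test is simplified relative to the description above):

#include <cmath>
#include <utility>

int PLUFactorSketch(int r, int *pivot, double *A) {
  for (int k = 0; k < r; ++k) {
    // Find the largest-magnitude entry in column k, on or below the diagonal.
    int p = k;
    for (int i = k + 1; i < r; ++i) {
      if (std::fabs(A[k*r + i]) > std::fabs(A[k*r + p])) p = i;
    }
    pivot[k] = p;
    if (A[k*r + p] == 0.0) {
      return k + 1;  // singular: no usable pivot in column k
    }
    if (p != k) {  // swap rows k and p across all columns
      for (int j = 0; j < r; ++j) std::swap(A[j*r + k], A[j*r + p]);
    }
    // Eliminate below the pivot; store the multipliers (entries of L) in place.
    for (int i = k + 1; i < r; ++i) {
      A[k*r + i] /= A[k*r + k];
      for (int j = k + 1; j < r; ++j) {
        A[j*r + i] -= A[k*r + i] * A[j*r + k];
      }
    }
  }
  return 0;
}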

void PLUMatrixVectorSolve(int r, double const *restrict LU, int const *restrict pivot, double *restrict b)

Solves the system P * L * U * x = b.

  1. Apply the permutation P (pivot) to b.
  2. Forward-substitute to solve Ly = Pb. This loop is unrolled 4 times for speed.
  3. Backward-substitute to solve Ux = y. This loop is also unrolled 4 times.

Solves the system of equations A*x = b, where A has been previously PLU-factored (by ComputePLUFactorization()). So in essence, solves: P * L * U * x = b. The input b is overwritten with the solution x. A is provided via the matrix LU and the array pivot. pivot represents the permutation matrix P. The triangular matrices L and U are stored in LU: the upper triangle holds U, and the strict lower triangle holds L (since L's diagonal is unit, it need not be stored).

Parameters:
r:size of vector, dimension of matrix
LU[size][size]:factored matrix containing the L, U factors of the original system, A
pivot[size]:pivot position array generated during LU decomposition with partial pivoting
b[size]:the right hand side
Outputs:
b[size]:the solution vector
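
A minimal sketch of the three steps above, under the same assumed pivot convention as the factorization sketch (pivot[i] is the 0-based row swapped with row i); the loops here are not unrolled:

#include <utility>

void PLUSolveSketch(int r, double const *LU, int const *pivot, double *b) {
  // 1. Apply the permutation P to b.
  for (int i = 0; i < r; ++i) {
    std::swap(b[i], b[pivot[i]]);
  }
  // 2. Forward substitution: L*y = P*b (L has an implicit unit diagonal).
  for (int i = 0; i < r; ++i) {
    for (int j = 0; j < i; ++j) {
      b[i] -= LU[j*r + i] * b[j];  // L(i, j), i > j
    }
  }
  // 3. Back substitution: U*x = y.
  for (int i = r - 1; i >= 0; --i) {
    for (int j = i + 1; j < r; ++j) {
      b[i] -= LU[j*r + i] * b[j];  // U(i, j), j > i
    }
    b[i] /= LU[i*r + i];
  }
}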