ST793 Project: A Blogpost on “Doubly Enhanced EM Algorithm for Model-Based Tensor Clustering” by Mai et al

Ayumi Mutoh, Jisu Oh, Shih-Ni Prim

December 7, 2023

1 Introduction

In recent decades, tensor data have gained popularity in modern science, but their high-dimensional structure often poses challenges for statistical analysis, particularly in model-based clustering. Model-based clustering is a statistical approach in which the observed data are assumed to arise from a finite mixture of component models, such as the Gaussian mixture model (GMM). Since its formalization by Dempster et al. (1977), the expectation-maximization (EM) algorithm has been employed in the majority of model-based clustering applications. While GMMs can be readily extended to higher-order tensors using the standard EM algorithm, performance can be further enhanced by the Doubly Enhanced EM algorithm (DEEM) proposed by Mai et al. (2022). The authors consider a tensor normal mixture model (TNMM) that incorporates the tensor correlation structure and variable selection for clustering and parameter estimation, and they develop the DEEM algorithm, which excels in high-dimensional tensor data analysis. Like the standard EM algorithm, DEEM alternates between an E-step and an M-step, both suitably enhanced.

In this blogpost, we first introduce the DEEM method, spelling out intermediate steps of the theoretical derivation. The objective is to break down the steps so that the derivation is more accessible for our readers. Subsequently, we conduct a simulation study to evaluate the performance of DEEM.

2 Theoretical Derivation

2.1 EM Algorithm

Before delving into DEEM, we review the EM algorithm and how it is used for clustering.

The EM algorithm is an iterative approach that cycles between two steps for maximum likelihood estimation in the presence of latent variables. The observed data $Y$ are incomplete, and the data $Z$ are missing. The first step is to write down the joint likelihood $L_c(\theta \mid Y, Z)$ of the “complete” data $(Y, Z)$. The E-step of the EM algorithm computes the conditional expectation of the log-likelihood, $\log L_c(\theta \mid Y, Z)$, given $Y$, assuming the true parameter value is $\theta^{(\nu)}$:

$$Q(\theta, \theta^{(\nu)}, Y) = E_{\theta^{(\nu)}}\bigl( \log L_c(\theta \mid Y, Z) \mid Y \bigr).$$

In the M-step, we maximize $Q(\theta, \theta^{(\nu)}, Y)$ with respect to $\theta$ with $\theta^{(\nu)}$ fixed. We repeat the E-step and the M-step until convergence.

The EM algorithm is well-known for its use in unsupervised learning problems such as clustering with a mixture model. The process from Yang et al. (2012) goes as follows:

  1. Identify the number of clusters.
  2. Define each cluster by generating a Gaussian model.
  3. For every observation, calculate the probability that it belongs to each cluster (e.g., observation 12 has a 40% probability of belonging to Cluster A and a 60% probability of belonging to Cluster B).
  4. Use the above probabilities to recalculate the Gaussian models.
  5. Repeat until observations “converge” on their assignments.

Let’s consider a simple example. Suppose we have data $X_i$, as shown in Figure 1, which come from two distinct classes. We want to build a Gaussian model for each class. Since we don’t know which class each observation belongs to, there is no straightforward way to construct two Gaussian models that partition the data. Therefore, we begin with random guesses of the Gaussian model parameters $\mu_1, \sigma_1^2, \mu_2, \sigma_2^2$.

The class labels are the “missing” data: each $X_i$ belongs to one of the two distributions, but we do not observe which. After initializing two random Gaussian models, we move to the E-step, where we compute, for each observation $X_i$, the probability that it belongs to each of the two distributions. We now have each point’s probability of belonging to either distribution.

In the M-step, we update the parameters $\mu_1, \sigma_1^2, \mu_2, \sigma_2^2$ to their most likely values. For the new $\mu_1$, we take a weighted average of all the points, weighted by the probability that they belong to the first distribution. Let $p_i$ denote the probability that $X_i$ belongs to the first distribution:

$$\mu_1 = \frac{p_1 X_1 + p_2 X_2 + \cdots + p_n X_n}{p_1 + p_2 + \cdots + p_n}.$$

The new $\sigma_1^2$ is updated similarly:

$$\sigma_1^2 = \frac{p_1 (X_1 - \mu_1)^2 + p_2 (X_2 - \mu_1)^2 + \cdots + p_n (X_n - \mu_1)^2}{p_1 + p_2 + \cdots + p_n}.$$

We repeat this process for $\mu_2$ and $\sigma_2^2$ and update our distributions. We iterate through the E-step and M-step until convergence, obtaining the two clusters shown in Figure 2.
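To make the two-component example concrete, here is a minimal base-R sketch of this EM loop. The toy data, the initial values, and the update of the mixing proportions (which the prose above leaves implicit) are our own illustrative choices, not part of the original example.

```r
# Minimal EM for a two-component, one-dimensional Gaussian mixture (illustrative sketch)
set.seed(1)
x <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 4, sd = 1))  # toy data

mu     <- c(-1, 1)       # random initial guesses
sigma2 <- c(1, 1)
pi_k   <- c(0.5, 0.5)

for (iter in 1:200) {
  # E-step: probability p_i that each x_i belongs to each component
  d1 <- pi_k[1] * dnorm(x, mu[1], sqrt(sigma2[1]))
  d2 <- pi_k[2] * dnorm(x, mu[2], sqrt(sigma2[2]))
  p1 <- d1 / (d1 + d2)
  p2 <- 1 - p1

  # M-step: weighted means, variances, and mixing proportions
  mu_new     <- c(sum(p1 * x) / sum(p1), sum(p2 * x) / sum(p2))
  sigma2_new <- c(sum(p1 * (x - mu_new[1])^2) / sum(p1),
                  sum(p2 * (x - mu_new[2])^2) / sum(p2))
  pi_new     <- c(mean(p1), mean(p2))

  delta <- max(abs(mu_new - mu))           # crude convergence check on the means
  mu <- mu_new; sigma2 <- sigma2_new; pi_k <- pi_new
  if (delta < 1e-8) break
}
round(c(mu, sigma2, pi_k), 3)
```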


Figure 1: Mixture of two Gaussian Distributions


Figure 2: Clusters Found by EM algorithm

2.2 Tensor

While the term “tensor” might sound unfamiliar to some, tensors are simply multi-way arrays. Data are often structured as matrices, which are in fact second-order tensors. When we use the term “tensor,” we usually mean tensors of third or higher order. The order of a tensor is its number of dimensions, and the dimensions are also called modes. You can think of a third-order tensor as a cube. As shown in Figure 3, a tensor can be manipulated similarly to a matrix. In a matrix, we talk about rows and columns. In a third-order tensor, we talk about fibers, obtained by fixing all modes but one and letting the remaining mode run over all of its values; the fiber is named after that free mode. Slices are obtained by fixing one mode and keeping all values of the remaining modes; the slice is named after the fixed mode.


Figure 3: Dimensions and Terminology of a Tensor, taken from Kolda and Bader (2009)
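As a quick base-R illustration of fibers and slices (our own example, not from the paper), the indexing of an `array` makes the definitions concrete; the 3 × 4 × 2 array below is the same tensor used in the matricization example later in this subsection.

```r
# A 3 x 4 x 2 array viewed as a third-order tensor
X <- array(1:24, dim = c(3, 4, 2))

X[, 2, 1]   # a mode-1 (column) fiber: modes 2 and 3 fixed, mode 1 runs over all values
X[1, , 2]   # a mode-2 (row) fiber:    modes 1 and 3 fixed
X[1, 2, ]   # a mode-3 (tube) fiber:   modes 1 and 2 fixed

X[, , 1]    # a frontal slice:    mode 3 fixed
X[1, , ]    # a horizontal slice: mode 1 fixed
X[, 2, ]    # a lateral slice:    mode 2 fixed
```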

Before we continue on to the DEEM algorithm, some concepts and notation are needed to understand the derivations in the following section. Note that the following notation is taken from Kolda and Bader (2009). First of all, we should go over the concept of matricization. To matricize a third-order tensor, we can think of cutting a cube into slices and placing the slices side by side to form a matrix. We now borrow an example (Example 2.1) from Kolda and Bader (2009) to demonstrate how to matricize a third-order tensor $X \in \mathbb{R}^{3 \times 4 \times 2}$ whose frontal slices are

$$X_1 = \begin{bmatrix} 1 & 4 & 7 & 10 \\ 2 & 5 & 8 & 11 \\ 3 & 6 & 9 & 12 \end{bmatrix}, \qquad X_2 = \begin{bmatrix} 13 & 16 & 19 & 22 \\ 14 & 17 & 20 & 23 \\ 15 & 18 & 21 & 24 \end{bmatrix}.$$

The three mode-n unfoldings/matricizations are then

$$X_{(1)} = \begin{bmatrix} 1 & 4 & 7 & 10 & 13 & 16 & 19 & 22 \\ 2 & 5 & 8 & 11 & 14 & 17 & 20 & 23 \\ 3 & 6 & 9 & 12 & 15 & 18 & 21 & 24 \end{bmatrix},$$
$$X_{(2)} = \begin{bmatrix} 1 & 2 & 3 & 13 & 14 & 15 \\ 4 & 5 & 6 & 16 & 17 & 18 \\ 7 & 8 & 9 & 19 & 20 & 21 \\ 10 & 11 & 12 & 22 & 23 & 24 \end{bmatrix},$$
$$X_{(3)} = \begin{bmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 \\ 13 & 14 & 15 & 16 & 17 & 18 & 19 & 20 & 21 & 22 & 23 & 24 \end{bmatrix}.$$

The example above shows that the mode-1 unfolding places the frontal slices $X_1$ and $X_2$ side by side, the mode-2 unfolding places their transposes side by side, and the mode-3 unfolding turns each slice into a row.

This concept is intuitive but much more awkward to define formally. In Kolda and Bader (2009), the mode-$n$ matricization of a tensor $X \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_M}$ is denoted by $X_{(n)}$, a matrix of dimension $I_n \times \prod_{p \neq n} I_p$. (The number of rows is the dimension of mode $n$, and the number of columns is the product of the dimensions of all the other modes.) The tensor element $(i_1, \ldots, i_M)$ is mapped to the matrix element $(i_n, j)$ in the following manner:

$$j = 1 + \sum_{\substack{k = 1 \\ k \neq n}}^{M} (i_k - 1) J_k, \qquad \text{with } J_k = \prod_{\substack{m = 1 \\ m \neq n}}^{k - 1} I_m.$$
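The index formula can be checked directly in R. The sketch below (our own helper, named `unfold`; not the authors' code) implements the mode-$n$ unfolding by permuting the array so that mode $n$ comes first and then reshaping column-major, and it reproduces $X_{(1)}$, $X_{(2)}$, and $X_{(3)}$ from the example above.

```r
# Mode-n matricization of an array: bring mode n to the front, then reshape column-major
unfold <- function(X, n) {
  dims <- dim(X)
  perm <- c(n, seq_along(dims)[-n])       # mode n first, remaining modes in their original order
  matrix(aperm(X, perm), nrow = dims[n])  # I_n rows, product of the other dimensions as columns
}

X <- array(1:24, dim = c(3, 4, 2))        # the example tensor, with frontal slices X1 and X2
unfold(X, 1)   # 3 x 8, equals [X1 X2]
unfold(X, 2)   # 4 x 6, equals [t(X1) t(X2)]
unfold(X, 3)   # 2 x 12
```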

Next, the shorthand notation $[\![\, \cdot \,]\!]$ is defined as

$$[\![\, G;\, A^{(1)}, A^{(2)}, \ldots, A^{(M)} \,]\!] := G \times_1 A^{(1)} \times_2 A^{(2)} \cdots \times_M A^{(M)},$$

where the $A^{(n)}$ are matrices and $G$ is a tensor. The symbol $\times_n$ denotes the $n$-mode (matrix) product of a tensor $G \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_M}$ with a matrix $A^{(n)} \in \mathbb{R}^{J \times I_n}$. The product $G \times_n A^{(n)}$ is then a tensor of dimension $I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_M$, with elements

$$\bigl( G \times_n A^{(n)} \bigr)_{i_1 \cdots i_{n-1}\, j\, i_{n+1} \cdots i_M} = \sum_{i_n = 1}^{I_n} g_{i_1 i_2 \cdots i_M}\, a^{(n)}_{j i_n}.$$

Also, for any two tensors $A$ and $B$ in $\mathbb{R}^{I_1 \times I_2 \times \cdots \times I_M}$, we define their inner product by

$$\langle A, B \rangle = \sum_{\mathcal{J}} a_{\mathcal{J}}\, b_{\mathcal{J}},$$
where the sum runs over all index tuples $\mathcal{J} = (i_1, \ldots, i_M)$.
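Both operations are easy to sketch in base R on top of the `unfold` helper from the previous code block (again our own illustration, not the authors' implementation): the $n$-mode product is computed via $Y_{(n)} = A^{(n)} X_{(n)}$ and then folded back, and the inner product is an elementwise sum.

```r
# n-mode (matrix) product: multiply the matrix A into mode n of the tensor X
ttm <- function(X, A, n) {
  dims <- dim(X)
  Yn <- A %*% unfold(X, n)                 # J x prod(other dims)
  new_dims <- dims; new_dims[n] <- nrow(A)
  perm <- c(n, seq_along(dims)[-n])
  Y <- array(Yn, dim = new_dims[perm])     # fold back with mode n in front ...
  aperm(Y, order(perm))                    # ... then restore the original mode order
}

# inner product of two tensors with identical dimensions
tinner <- function(A, B) sum(A * B)

X <- array(rnorm(3 * 4 * 2), dim = c(3, 4, 2))
A <- matrix(rnorm(5 * 4), 5, 4)
dim(ttm(X, A, 2))   # 3 x 5 x 2
tinner(X, X)        # squared Frobenius norm of X
```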

With that, we are ready to learn about the DEEM algorithm. If you are interested in knowing more about tensors, Kolda and Bader (2009) has lots of great details. So be sure to check it out!

2.3 Doubly Enhanced EM Algorithm

In this subsection, we introduce the doubly enhanced EM (DEEM) algorithm and discuss its theoretical properties. Algorithm 1 from Mai et al. (2022) is provided in the Appendix as our Figure 6.

Let $Z$ denote a random tensor in $\mathbb{R}^{p_1 \times \cdots \times p_M}$ whose elements are i.i.d. $N(0, 1)$. Then we say that a random tensor $X$ has a tensor normal distribution, denoted by $X \sim \mathrm{TN}(\mu; \Sigma_1, \ldots, \Sigma_M)$, if $X = \mu + [\![\, Z;\, \Sigma_1^{1/2}, \ldots, \Sigma_M^{1/2} \,]\!]$, where $\mu \in \mathbb{R}^{p_1 \times \cdots \times p_M}$ is the mean tensor and each $\Sigma_i \in \mathbb{R}^{p_i \times p_i}$ is the covariance matrix along the $i$th mode. In other words, the tensor normal distribution generalizes the multivariate normal distribution to higher-order arrays. An $M$th-order random tensor $X$ has a total of $\prod_{i=1}^{M} p_i$ mean entries and $M$ covariance matrices, each of dimension $p_i \times p_i$ for $i = 1, \ldots, M$. The density of $X$ has the form

$$p(X \mid \mu; \Sigma_1, \ldots, \Sigma_M) = \frac{1}{(2\pi)^{p/2}\, |\Sigma_1|^{q_1/2} \cdots |\Sigma_M|^{q_M/2}} \exp\!\left( -\frac{1}{2} \Bigl\langle [\![\, X - \mu;\, \Sigma_1^{-1}, \ldots, \Sigma_M^{-1} \,]\!],\; X - \mu \Bigr\rangle \right), \tag{1}$$

where $p = p_1 p_2 \cdots p_M$ and $q_i = p / p_i$.
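For readers who want to experiment, here is a hedged base-R sketch (ours, not the authors' code) that draws a tensor normal variate via the definition $X = \mu + [\![\, Z;\, \Sigma_1^{1/2}, \ldots, \Sigma_M^{1/2} \,]\!]$ and evaluates the logarithm of density (1); it reuses the `ttm`, `tinner`, and `unfold` helpers sketched in Section 2.2.

```r
# matrix square root via eigendecomposition (symmetric positive definite input assumed)
msqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(sqrt(pmax(e$values, 0)), nrow(S)) %*% t(e$vectors)
}

# draw one X ~ TN(mu; Sigma_1, ..., Sigma_M)
rtn <- function(mu, Sigmas) {
  Z <- array(rnorm(prod(dim(mu))), dim = dim(mu))
  X <- Z
  for (m in seq_along(Sigmas)) X <- ttm(X, msqrt(Sigmas[[m]]), m)
  mu + X
}

# log of the density (1): p = prod(p_i), q_i = p / p_i
tn_logdens <- function(X, mu, Sigmas) {
  p  <- prod(dim(X))
  qs <- p / dim(X)
  R  <- X - mu
  W  <- R
  for (m in seq_along(Sigmas)) W <- ttm(W, solve(Sigmas[[m]]), m)  # [[X - mu; Sigma_1^-1, ...]]
  logdets <- sapply(Sigmas, function(S) as.numeric(determinant(S, logarithm = TRUE)$modulus))
  -0.5 * p * log(2 * pi) - 0.5 * sum(qs * logdets) - 0.5 * tinner(W, R)
}
```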

We consider independent tensor-variate observations in $\mathbb{R}^{p_1 \times \cdots \times p_M}$ drawn from $K$ clusters that share the same within-class covariance matrices. Let $\mu_k$ denote the mean tensor of the $k$th cluster, and let $\pi_k$ be the probability that an observation is drawn from the $k$th cluster.

Then a sample $\{X_i\}_{i=1}^{n}$ from the mixture of tensor normal distributions can be written as

$$X_i \sim \sum_{k=1}^{K} \pi_k\, \mathrm{TN}(\mu_k; \Sigma_1, \ldots, \Sigma_M), \qquad i = 1, 2, \ldots, n,$$

or equivalently,

$$P(Y_i = k) = \pi_k, \qquad X_i \mid Y_i = k \;\sim\; \mathrm{TN}(\mu_k; \Sigma_1, \ldots, \Sigma_M), \qquad i = 1, 2, \ldots, n. \tag{2}$$

Hence $Y_i$ indicates the cluster from which $X_i$ was drawn; given $Y_i = k$, $X_i$ has a tensor normal distribution with the mean $\mu_k$ of cluster $k$ and the within-class covariance matrices $\Sigma_1, \ldots, \Sigma_M$. (Recall that we assume all clusters share the same within-class covariance matrices.)

Suppose that $\{X_i\}_{i=1}^{n}$ is a sample from model (2). Let $\theta = \{\pi_k, \mu_k, \Sigma_j : 1 \le k \le K,\ 1 \le j \le M\}$ denote the set of all parameters in the model. If we could observe the $Y_i$, the complete-data log-likelihood would be

$$\ell_c(\theta; X, Y) = \log \prod_{i=1}^{n} \pi_{Y_i}\, p(X_i \mid \mu_{Y_i}; \Sigma_1, \ldots, \Sigma_M) = \sum_{i=1}^{n} \bigl[ \log \pi_{Y_i} + \log p(X_i \mid \mu_{Y_i}; \Sigma_1, \ldots, \Sigma_M) \bigr].$$

In general, however, we cannot observe the $Y_i$. Hence, starting from an initial value $\tilde{\theta}^{(0)}$, we create a sequence $\tilde{\theta}^{(t)}$ by alternating the E-step, which computes the $Q$ function

$$Q(\theta; \tilde{\theta}^{(t)}) = E_{Y \mid X, \tilde{\theta}^{(t)}}\bigl[ \ell_c(\theta; X, Y) \bigr] = \sum_{i=1}^{n} \sum_{k=1}^{K} \tilde{\xi}_{ik}^{(t)} \bigl[ \log \pi_k + \log p(X_i \mid \mu_k; \Sigma_1, \ldots, \Sigma_M) \bigr],$$

where

$$\tilde{\xi}_{ik}^{(t)} = P(Y_i = k \mid X, \tilde{\theta}^{(t)}) = \frac{\tilde{\pi}_k^{(t)}\, p(X_i \mid \tilde{\mu}_k^{(t)}; \tilde{\Sigma}_1^{(t)}, \ldots, \tilde{\Sigma}_M^{(t)})}{\sum_{j=1}^{K} \tilde{\pi}_j^{(t)}\, p(X_i \mid \tilde{\mu}_j^{(t)}; \tilde{\Sigma}_1^{(t)}, \ldots, \tilde{\Sigma}_M^{(t)})}, \tag{3}$$

and the M-step, which updates the parameters via

$$\tilde{\theta}^{(t+1)} = \operatorname*{arg\,max}_{\theta}\, Q(\theta; \tilde{\theta}^{(t)}).$$
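As a concrete illustration of the E-step in (3), the responsibilities can be computed stably on the log scale with the log-sum-exp trick. The sketch below reuses the `tn_logdens` helper from the previous code block and is our own illustration, not the authors' implementation.

```r
# E-step: responsibilities xi_ik for a list of observed tensors X_list,
# mixture weights pis, cluster means mus, and shared covariance matrices Sigmas
e_step <- function(X_list, pis, mus, Sigmas) {
  K <- length(pis)
  logw <- sapply(1:K, function(k)
    sapply(X_list, function(Xi) log(pis[k]) + tn_logdens(Xi, mus[[k]], Sigmas)))
  m  <- apply(logw, 1, max)      # logw is n x K; normalize each row via log-sum-exp
  xi <- exp(logw - m)
  xi / rowSums(xi)
}
```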

The EM sequence $\tilde{\theta}^{(t)}$ then converges to a (local) maximizer of the likelihood, but two issues arise in our situation. First, obtaining the updates for $\pi_k$ and $\mu_k$ is easy and straightforward, but obtaining the updates for the covariance matrices $\Sigma_j$ is challenging. Second, when we compute $\tilde{\xi}_{ik}^{(t)}$ in (3), all the elements of $X_i$ are used, and the standard EM algorithm has no mechanism for variable selection. With an excessive number of parameters in the model, estimation errors can accumulate, potentially resulting in inaccurate estimates.

To overcome these problems, DEEM introduces an enhanced E-step, in which $\tilde{\xi}_{ik}^{(t)}$ is replaced with an estimator $\hat{\xi}_{ik}^{(t)}$ that can be computed relatively quickly under a sparsity assumption. We want to find an objective function $Q_{\mathrm{DEEM}}$ with better properties than the standard $Q$ function above. First, it can be seen that

$$\xi_{i1} = P(Y_i = 1 \mid X_i, \theta) = \frac{\pi_1}{\pi_1 + \sum_{k=2}^{K} \pi_k \exp\bigl[ \bigl\langle X_i - (\mu_k + \mu_1)/2,\; B_k \bigr\rangle \bigr]} \tag{4}$$

and

$$\xi_{ik} = P(Y_i = k \mid X_i, \theta) = \frac{\pi_k \exp\bigl[ \bigl\langle X_i - (\mu_k + \mu_1)/2,\; B_k \bigr\rangle \bigr]}{\pi_1 + \sum_{j=2}^{K} \pi_j \exp\bigl[ \bigl\langle X_i - (\mu_j + \mu_1)/2,\; B_j \bigr\rangle \bigr]} \tag{5}$$

for $k \ge 2$, where

$$B_k = [\![\, \mu_k - \mu_1;\ \Sigma_1^{-1}, \ldots, \Sigma_M^{-1} \,]\!] \in \mathbb{R}^{p_1 \times \cdots \times p_M}.$$

Note that we are considering the clustering problem; that is, we are interested in recovering the $Y_i$ and in finding a method that minimizes the clustering error. It can be shown that the covariance matrices $\Sigma_j$ are nuisance parameters in the optimal clustering rule, i.e., the values of $\Sigma_j$ are not needed by the optimal clustering rule if we already know the $B_k$. To cover more general cases, we do not impose conditions on the $\Sigma_j$. Instead, we assume a sparsity condition on the $B_k$: writing $B_k = [b_{k,\mathcal{J}}]_{\mathcal{J}}$, where $\mathcal{J} = (j_1, \ldots, j_M)$ denotes an index of the tensor, we impose the condition $b_{2,\mathcal{J}} = \cdots = b_{K,\mathcal{J}} = 0$ for most indices $\mathcal{J}$. In other words, if $\mathcal{D} = \{\mathcal{J} : b_{k,\mathcal{J}} \neq 0 \text{ for some } k = 2, \ldots, K\}$, then we assume that the number of elements in $\mathcal{D}$ is significantly smaller than $p = p_1 p_2 \cdots p_M$. This assumption reflects the belief that, in the high-dimensional setting, most of the variables are not informative for clustering. The expressions for $\xi_{ik}$ above show how this assumption reduces the computational cost and improves the estimation efficiency.

If we accept the fact that the true $(B_2, \ldots, B_K)$ minimizes the quantity (a short verification is sketched after the definition of $Q_{\mathrm{DEEM}}$ below)

$$\sum_{k=2}^{K} \Bigl( \bigl\langle B_k,\ [\![\, B_k;\ \Sigma_1, \ldots, \Sigma_M \,]\!] \bigr\rangle - 2 \bigl\langle B_k,\ \mu_k - \mu_1 \bigr\rangle \Bigr),$$

then it is reasonable to obtain the sequence of estimates $\hat{B}_k^{(t+1)}$ by solving the optimization problem

$$\operatorname*{arg\,min}_{B_2, \ldots, B_K}\; \left\{ \sum_{k=2}^{K} \Bigl( \bigl\langle B_k,\ [\![\, B_k;\ \hat{\Sigma}_1^{(t)}, \ldots, \hat{\Sigma}_M^{(t)} \,]\!] \bigr\rangle - 2 \bigl\langle B_k,\ \hat{\mu}_k^{(t)} - \hat{\mu}_1^{(t)} \bigr\rangle \Bigr) + \lambda^{(t+1)} \sum_{\mathcal{J}} \sqrt{\sum_{k=2}^{K} b_{k,\mathcal{J}}^2} \right\},$$

where the lasso-type penalty term encourages the estimates to satisfy the sparsity assumption. Using these $\hat{B}_k^{(t+1)}$, we obtain the sequence $\hat{\xi}_{ik}^{(t+1)}$ by replacing the parameters in (4) and (5) with their estimates. The objective $Q_{\mathrm{DEEM}}$ is then defined using $\hat{\xi}_{ik}^{(t)}$ as follows:

$$Q_{\mathrm{DEEM}}(\theta; \hat{\theta}^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \hat{\xi}_{ik}^{(t)} \bigl[ \log \pi_k + \log p(X_i \mid \mu_k; \Sigma_1, \ldots, \Sigma_M) \bigr].$$

In light of the sparsity assumption, $\hat{\xi}_{ik}^{(t)}$ can be computed from a relatively small number of variables.
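As promised earlier, here is a short verification (our own, not reproduced from the paper) that the true $(B_2, \ldots, B_K)$ minimizes the unpenalized objective: the $k$th summand is a quadratic function of $B_k$, and setting its gradient to zero gives
$$2\,[\![\, B_k;\ \Sigma_1, \ldots, \Sigma_M \,]\!] - 2(\mu_k - \mu_1) = 0 \quad \Longleftrightarrow \quad B_k = [\![\, \mu_k - \mu_1;\ \Sigma_1^{-1}, \ldots, \Sigma_M^{-1} \,]\!],$$
which is exactly the definition of $B_k$ in the enhanced E-step.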

In the M-step, the parameters are updated from the proposed $Q_{\mathrm{DEEM}}$ function. The estimates of $\pi_k$ and $\mu_k$ are obtained by the formulas

$$\hat{\pi}_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \hat{\xi}_{ik}^{(t+1)} \qquad \text{and} \qquad \hat{\mu}_k^{(t+1)} = \frac{\sum_{i=1}^{n} \hat{\xi}_{ik}^{(t+1)} X_i}{\sum_{i=1}^{n} \hat{\xi}_{ik}^{(t+1)}}, \qquad k = 1, 2, \ldots, K.$$

Then, given the $\hat{\xi}_{ik}^{(t+1)}$, we calculate the intermediate covariance matrices

$$\Sigma_j^{*(t+1)} = \frac{1}{n q_j} \sum_{i=1}^{n} \sum_{k=1}^{K} \hat{\xi}_{ik}^{(t+1)} \bigl( X_i - \hat{\mu}_k^{(t+1)} \bigr)_{(j)} \bigl( X_i - \hat{\mu}_k^{(t+1)} \bigr)_{(j)}^{T},$$

and the conditional variance of the first entry $X_{i,1\cdots1}$,

$$\bigl( \hat{\sigma}^{1\cdots1} \bigr)^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} \hat{\xi}_{ik}^{(t+1)} \bigl( X_{i,1\cdots1} - \hat{\mu}_{k,1\cdots1}^{(t+1)} \bigr)^2.$$

The final covariance estimators are obtained by rescaling the intermediate covariances with $\bigl( \hat{\sigma}^{1\cdots1} \bigr)^{(t+1)}$ and the $(1,1)$ entries $\bigl( \sigma_j^{*11} \bigr)^{(t+1)} = \bigl[ \Sigma_j^{*(t+1)} \bigr]_{11}$:

$$\hat{\Sigma}_j^{(t+1)} = \begin{cases} \dfrac{1}{\bigl( \sigma_j^{*11} \bigr)^{(t+1)}}\, \Sigma_j^{*(t+1)}, & j \ge 2, \\[2ex] \dfrac{\bigl( \hat{\sigma}^{1\cdots1} \bigr)^{(t+1)}}{\bigl( \sigma_1^{*11} \bigr)^{(t+1)}}\, \Sigma_1^{*(t+1)}, & j = 1. \end{cases}$$

The general process is summarized in Algorithm 1 in the Appendix as our Figure 6.
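To make the enhanced M-step concrete, the sketch below implements the weight, mean, and intermediate covariance updates in base R, reusing the `unfold` helper from Section 2.2. It is our own illustration (the final rescaling by $\bigl( \hat{\sigma}^{1\cdots1} \bigr)^{(t+1)}$ is only indicated in a comment), not the implementation in the TensorClustering package.

```r
# M-step updates given responsibilities xi (an n x K matrix) and observations X_list
m_step <- function(X_list, xi) {
  n <- length(X_list); K <- ncol(xi)
  dims <- dim(X_list[[1]]); p <- prod(dims)
  pis <- colMeans(xi)                                    # pi_k update
  mus <- lapply(1:K, function(k) {                       # mu_k update: weighted average
    Reduce(`+`, Map(function(Xi, w) w * Xi, X_list, xi[, k])) / sum(xi[, k])
  })
  Sigmas <- lapply(seq_along(dims), function(j) {        # intermediate Sigma_j update
    qj <- p / dims[j]
    S <- matrix(0, dims[j], dims[j])
    for (i in 1:n) for (k in 1:K) {
      Rj <- unfold(X_list[[i]] - mus[[k]], j)
      S <- S + xi[i, k] * Rj %*% t(Rj)
    }
    S / (n * qj)
  })
  # The enhanced M-step then rescales these intermediate matrices using the estimated
  # variance of the first entry, as in the display above; omitted here for brevity.
  list(pis = pis, mus = mus, Sigmas = Sigmas)
}
```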

Now we are interested in how the sequence of parameters $\hat{\theta}^{(t)}$ behaves. In fact, under a suitable initialization condition, it can be seen that there exist a constant $C$ and $0 < \kappa < 1/2$ such that, for large $t$, with probability $1 - O\bigl( \sum_{i=1}^{M} p_i^{-1} \bigr)$,

$$\bigl\| \hat{B}^{(t)} - B^{*} \bigr\|_F \;\le\; \kappa^{t} d_0 + C\, s \sqrt{\frac{\sum_{i=1}^{M} \log p_i}{n}},$$

where $s$ is the number of important variables (the cardinality of $\mathcal{D}$), with $s = o\bigl( \sqrt{n / \sum_{i=1}^{M} \log p_i} \bigr)$, and $d_0$ is a measure of the difference between the initial value $\hat{\theta}^{(0)}$ and the true parameter $\theta^{*}$. This result implies that if the number of iterations $t$ is large, the DEEM estimator $\hat{B}^{(t)}$ converges to the true parameter $B^{*}$ at the rate shown above. The condition on $s$ encodes the sparsity assumption.

We consider the error rate of DEEM and the optimal clustering rule:

$$R(\mathrm{DEEM}) = \min_{\Pi}\, P\bigl( \Pi(\hat{Y}_i^{\mathrm{DEEM}}) \neq Y_i \bigr) \qquad \text{and} \qquad R(\mathrm{Opt}) = P\bigl( \hat{Y}_i^{\mathrm{opt}} \neq Y_i \bigr),$$

where $\Pi$ ranges over permutations of the cluster labels and $\hat{Y}_i^{\mathrm{DEEM}} = \arg\max_k \hat{\xi}_{ik}$. Here, the optimal clustering rule is optimal in the sense that it minimizes the clustering error. From the result above, we can show that if $t$ is large, then with probability $1 - O\bigl( \sum_{i=1}^{M} p_i^{-1} \bigr)$,

$$R(\mathrm{DEEM}) - R(\mathrm{Opt}) \;\le\; C\, s \sqrt{\frac{\sum_{i=1}^{M} \log p_i}{n}}.$$

Consequently, this result shows that the error rate of DEEM converges to the error rate of the optimal clustering rule if t is large.

3 Simulation Study

3.1 Data Generation

For our simulation studies, we follow the framework used in Mai et al. (2022). For each setting, $K$ denotes the number of mixture groups, and the observations are generated as $M$th-order tensors from the mixture

$$X_i \sim \sum_{k=1}^{K} \pi_k\, \mathrm{TN}(\mu_k; \Sigma_1, \ldots, \Sigma_M), \qquad i = 1, \ldots, n. \tag{6}$$

For the first mixture group, the observations are simply the tensor normal noise; for the other $K - 1$ mixture groups, $X_i$ is the given $B_k$ plus that noise. Mai et al. consider two ways of specifying the covariance matrices:

$$\mathrm{AR}(\rho): \ \omega_{ij} = \rho^{|i - j|}, \qquad \mathrm{CS}(\rho): \ \omega_{ij} = \rho + (1 - \rho)\, \mathbf{1}(i = j).$$

Covariance matrices constructed in either of these two ways are not sparse. For each setting, we generate 100 independent datasets, the same number of replicates used by the authors, and present the mean error rate and standard deviation.
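Below is a hedged base-R sketch of how one dataset could be generated for a setting like M1, reusing the `rtn` sampler from the sketch in Section 2.3. The constructors follow the AR and CS displays above; the sample size `n = 100`, the equal mixing proportions, and the helper names are our own choices (in the simulations reported here we used Tlasso::Trnorm instead).

```r
AR <- function(p, rho) rho^abs(outer(1:p, 1:p, "-"))         # AR(rho): omega_ij = rho^|i-j|
CS <- function(p, rho) matrix(rho, p, p) + diag(1 - rho, p)  # CS(rho): compound symmetry

# one simulated dataset resembling setting M1 (K = 2, 10 x 10 x 4 tensors)
gen_data <- function(n = 100, pis = c(0.5, 0.5)) {
  dims   <- c(10, 10, 4)
  Sigmas <- list(CS(10, 0.3), AR(10, 0.8), CS(4, 0.3))
  mu1 <- array(0, dims)                          # cluster 1: pure noise
  mu2 <- array(0, dims); mu2[1:6, 1, 1] <- 0.5   # cluster 2: the given B_2 plus noise
  mus <- list(mu1, mu2)
  y <- sample(seq_along(pis), n, replace = TRUE, prob = pis)
  X <- lapply(y, function(k) rtn(mus[[k]], Sigmas))
  list(X = X, y = y)
}
```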

3.2 Settings

The settings are provided in Table 1. Note that, for $B_k$, the entries whose indices are not listed in the subscript are 0; in other words, $B_k$ is a sparse tensor. Both the DEEM and EM algorithms require $K$ to be given. $\lambda$ is a tuning parameter for regularization; however, due to the long computation time, we experimented with several values of $\lambda$ and fixed it at 0.05 instead of tuning it for each setting. We encourage interested readers to try the two functions tune_K and tune_lambda in the R package TensorClustering.

We choose four of the seven settings in the paper because they are increasingly computationally expensive and, as shown in Table 2, they demonstrate the accuracy advantage of the DEEM algorithm over the classical EM algorithm. We also run three extra settings of our own, which we call M8, M9, and M10 to avoid confusion with the seven settings in the paper. The results for these extra settings are shown in Tables 4 and 5.

Setting   Parameters
M1        K = 2, p = 10 × 10 × 4, Σ1 = CS(0.3), Σ2 = AR(0.8), Σ3 = CS(0.3), B2,[1:6,1,1] = 0.5
M3        K = 3, p = 10 × 10 × 4, Σ1 = CS(0.3), Σ2 = AR(0.8), Σ3 = CS(0.5), B2,[1:6,1,1] = 0.5, B3,[1:6,1,1] = 0.5
M4        K = 4, p = 10 × 10 × 4, Σ1 = I10, Σ2 = AR(0.8), Σ3 = I4, B2,[1:6,1,1] = 0.8, B3,[1:6,1,1] = 0.8
M5        K = 6, p = 10 × 10 × 4, Σ1 = AR(0.9), Σ2 = CS(0.6), Σ3 = AR(0.9), B2,[1:6,1,1] = 0.6, B3,[1:6,1,1] = 1.2, B4,[1:6,1,1] = 1.8, B5,[1:6,1,1] = 2.4, B6,[1:6,1,1] = 3
M8        K = 2, p = 10 × 10 × 10, Σ1 = AR(0.5), Σ2 = CS(0.5), Σ3 = AR(0.5), B2,[1:6,1,1] = 1.5
M9        K = 2, p = 10 × 10 × 10, Σ1 = I10, Σ2 = I10, Σ3 = I10, B2,[1:6,1,1] = 1.5
M10       K = 2, p = 10 × 10 × 4 × 4 × 4, Σ1 = AR(0.5), Σ2 = CS(0.5), Σ3 = AR(0.5), Σ4 = I4, Σ5 = I4, B2,[1:6,1,1,1,1] = 5
Table 1: Simulation settings

3.3 Metrics

Note that calculating the mean error rate for a clustering problem is not as straightforward as for a classification problem. Both methods return labels for the groups; however, the group labels themselves are arbitrary. For example, if there are five observations whose true group labels are (1, 1, 2, 2, 2) and a method returns (2, 2, 1, 1, 1), the error rate should be 0. In the paper, the authors explain that the clustering error rate is calculated as

$$\min_{\Pi} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\bigl( \hat{Y}_i \neq \Pi(Y_i) \bigr) \qquad \text{over all permutations } \Pi : \{1, \ldots, K\} \to \{1, \ldots, K\}.$$

We write a function (sketched below) that permutes the true labels, compares the estimated labels with each permuted version, and returns the lowest error rate. To compare the speed of the two methods, we also record the computation time. Tables 3 and 5 provide the mean computation time and standard error (in parentheses) for each setting.
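Our error-rate helper can be sketched as follows. It relies on gtools::permutations to enumerate all relabelings, which is feasible for the small values of K considered here; the function name is our own.

```r
library(gtools)

# clustering error rate minimized over all permutations of the K labels
cluster_error <- function(y_hat, y_true, K = max(y_true)) {
  perms <- permutations(K, K)                   # all K! relabelings of 1, ..., K
  errs  <- apply(perms, 1, function(pm) mean(y_hat != pm[y_true]))
  min(errs)
}

cluster_error(c(2, 2, 1, 1, 1), c(1, 1, 2, 2, 2))   # flipped labels -> error rate 0
```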

3.4 External R Packages and Functions

For the DEEM algorithm, we use the function DEEM; for the standard EM algorithm, we use the function TGMM. Both functions are from the R package TensorClustering. We use the Trnorm function from the R package Tlasso to generate tensor normal noise with the designated covariance matrices, and the permutations function from the gtools package to permute the true labels. In short, be sure to install the three R packages TensorClustering, Tlasso, and gtools if you would like to reproduce our simulation.

3.5 Simulation result

The error rates and computation times are shown in Tables 2 and 3. DEEM clearly has lower mean error rates in all four settings. The error rates are in general higher than those reported in the article, possibly because we did not tune the hyperparameters for each setting.

The computation time tells a different story, however. As seen in Table 3, DEEM is not always the winner in terms of speed. As the setting becomes more complicated and estimating the clusters becomes more challenging, DEEM takes longer to converge. For setting M5, DEEM may have reached the maximum number of iterations in some runs.

Setting DEEM EM
M1 0.41 (0.05) 0.45 (0.03)
M3 0.46 (0.09) 0.56 (0.05)
M4 0.35 (0.03) 0.57 (0.06)
M5 0.31 (0.11) 0.43 (0.06)
Table 2: Error Rates from 100 Replicates

Setting DEEM EM
M1 0.72 (0.45) 0.93 (0.39)
M3 13.95 (7.78) 7.74 (3.99)
M4 15.8 (0.81) 21.9 (9.68)
M5 332.96 (124.38) 14.66 (5.88)
Table 3: Computation Time (seconds) from 100 Replicates

Next, we present the values in the tables as figures. As shown in Figure 4, DEEM always has the lower mean error rate, but as the model becomes more complicated, its error rates become more variable, even though the mean remains lower. Figure 5 tells a more complicated story. (Note that we cannot use the same y-axis for all four plots, because DEEM's computation time for M5 is so long that some of the boxes would become very small and uninformative.) For settings M1 and M4, DEEM has the shorter computation time. For M3, DEEM's computation time is much more variable, and EM is faster overall. For M5, DEEM takes very long; in fact, the 100 replicates took almost 10 hours. It is unclear whether the reduced error rate is worth the computational cost.


Figure 4: Boxplots of Mean Error Rates from 100 Replicates


Figure 5: Boxplots of Mean Computation Time (in seconds) from 100 Replicates

Turning to the extra settings, we notice that DEEM's error rate is much lower than EM's for M8, in which we set two clusters whose covariance matrices have moderate correlation. M9 is designed to be an easy case, since the covariance matrices are all identity matrices, and DEEM still has a much lower mean error rate than EM. For M10, we use a fifth-order tensor and make the estimation task an easy one. As seen in Table 4, DEEM again has a much lower mean error rate than EM, although its computation time is once more quite long.

One recurring issue in our simulations is that DEEM has a much longer running time than EM. We leave this question about DEEM's computation time to interested readers.

Setting DEEM EM
M8 0.28 (0.11) 0.45 (0.03)
M9 0.18 (0.05) 0.34 (0.10)
M10 0.0003 (0.002) 0.31 (0.13)
Table 4: Error Rates from 100 Replicates

Setting DEEM EM
M8 23.7 (4.3) 0.17 (0.07)
M9 7.99 (1.88) 0.32 (0.09)
M10 72.8 (211.41) 1.51 (0.08)
Table 5: Computation Time (seconds) from 100 Replicates

4 Summary

In this blogpost, we reviewed the method proposed by Mai et al. (2022), which is essentially an upgraded version of the classical EM algorithm for tensor data. The new method, DEEM, tends to achieve lower error rates on tensor data. However, despite the paper's claim that the enhanced M-step in the DEEM algorithm facilitates fast covariance estimation, we encountered situations where the running time was prohibitive. While DEEM proves effective in handling tensor data, there remains potential for further enhancement, particularly in computational speed.

References

   Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–22.

   Kolda, T. G. and Bader, B. W. (2009) Tensor decompositions and applications. SIAM Review, 51, 455–500. URL: https://doi.org/10.1137/07070111X.

   Mai, Q., Zhang, X., Pan, Y. and Deng, K. (2022) A doubly enhanced EM algorithm for model-based tensor clustering. Journal of the American Statistical Association, 117, 2120–2134.

   Yang, M.-S., Lai, C.-Y. and Lin, C.-Y. (2012) A robust EM clustering algorithm for Gaussian mixture models. Pattern Recognition, 45, 3950–3961.

Appendix


Figure 6: DEEM algorithm