I recently gave a tutorial at CMU about spectral learning for NLP. This tutorial was based on a tutorial I had given last year with Michael Collins, Dean Foster, Karl Stratos and Lyle Ungar at NAACL.
One of the algorithms I explained there was the spectral learning algorithm for HMMs by Hsu, Kakade and Zhang (2009). This algorithm estimates parameters of HMMs in the “unsupervised setting” — only from sequences of observations. (Just like the Baum-Welch algorithm — expectation-maximization for HMMs — does.)
I want to repeat this explanation here, and give some intuition about this algorithm, since it seems to confuse people quite a lot. At first glance, it looks quite mysterious why the algorithm works, though its implementation is very simple. It is one of the earlier algorithms in this area of latent-variable learning using the method of moments and spectral methods, and it prompted the creation of other algorithms for latent-variable learning.
So here are the main ideas behind it, with some intuition. In my explanation of the algorithm, I am going to forget about the “spectral” part. No singular value decomposition will be involved, or any type of spectral decomposition. Just plain algebraic and matrix multiplication tricks that require understanding what marginal probabilities are and how matrix multiplication and inversion work, and nothing more. Pedagogically, I think that’s the right thing to do, since introducing the SVD step complicates the understanding of the algorithm.
Consider a hidden Markov model. The parameters are represented in matrix form \( T \), \( O \) and \( \pi \). We assume \( m \) latent states and \( n \) observation symbols. More specifically, \( T \) is an \( m \times m \) stochastic matrix such that \( T_{hh'} \) is the probability of transitioning to state \( h \) from state \( h' \) (so each column of \( T \) sums to 1). \( O \) is an \( n \times m \) matrix such that \( O_{xh} \) is the probability of emitting symbol \( x \) (an observation) from latent state \( h \). \( \pi \) is an \( m \)-length vector with \( \pi_h \) being the initial probability for state \( h \).
To completely get rid of the SVD step, and simplify things, we will have to make the assumption that \(m = n\). This means that the number of states equals the number of observations. Not a very useful HMM, perhaps, but it definitely makes the derivation more clear. The fact that \( m=n\) means that \( O \) is now a square matrix — and we will assume it is invertible. We will also assume that \(T \) is invertible, and that \( \pi \) is positive in all coordinates.
If we look at the joint distribution \(p(X_1 = x_1,X_2 = x_2)\) of the first two observations in the HMM, then it can be written as:
\( p(X_1 = x_1, X_2 = x_2) = \sum_{h_1,h_2} p(X_1 = x_1, H_1 = h_1, X_2 = x_2, H_2 = h_2) = \sum_{h_1,h_2} \pi_{h_1} O_{x_1,h_1} T_{h_2,h_1} O_{x_2,h_2} \)
Nothing special here, just marginal probability summing out the first two latent states.
It is not hard to see that this can be rewritten in matrix form, i.e. if we define \( [P_{2,1}]_{x_2,x_1} = p(X_1 = x_1, X_2= x_2) \) then:
\( P_{2,1} = O T \mathrm{diag}(\pi)O^{\top} \)
where \( \mathrm{diag}(\pi) \) is just an \( m \times m \) diagonal matrix with \( \pi_h \) on the diagonal.
Just write down this matrix multiplication step-by-step explicitly, multiplying, say, from right to left, and you will be able to verify this identity for \( P_{2,1} \). Essentially, the matrix product, which involves dot products between the rows of one matrix and the columns of the other, eliminates and sums out the latent states (and does other things, like multiplying in the starting probabilities).
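If you would rather let the computer do the bookkeeping, here is a minimal NumPy sketch that checks the identity numerically. The random parameters, the seed and the helper name `random_column_stochastic` are my own choices for illustration; the only things taken from the text are the conventions \( T_{h_2,h_1} \), \( O_{x,h} \) and the assumption \( m = n \).

```python
import numpy as np

rng = np.random.default_rng(0)
m = 3  # number of latent states = number of observation symbols (m = n)

def random_column_stochastic(rng, m):
    # Hypothetical helper: each column is a distribution. Adding m * I makes
    # the matrix strictly diagonally dominant, hence invertible.
    A = rng.random((m, m)) + m * np.eye(m)
    return A / A.sum(axis=0)

T = random_column_stochastic(rng, m)   # T[h2, h1] = p(H2 = h2 | H1 = h1)
O = random_column_stochastic(rng, m)   # O[x, h]   = p(X = x | H = h)
pi = rng.random(m) + 0.1               # strictly positive initial distribution
pi /= pi.sum()

# Brute force: sum out the two latent states, term by term.
P21_sum = np.zeros((m, m))
for x1 in range(m):
    for x2 in range(m):
        for h1 in range(m):
            for h2 in range(m):
                P21_sum[x2, x1] += pi[h1] * O[x1, h1] * T[h2, h1] * O[x2, h2]

# Matrix form from the text.
P21 = O @ T @ np.diag(pi) @ O.T

print(np.allclose(P21, P21_sum))
```

The four nested loops are exactly the double sum over \( h_1, h_2 \), evaluated for every pair \( (x_2, x_1) \); the single line with `@` does the same work.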
Alright. So far, so good.
Now, what about the joint distribution of three observations?
\( p(X_1 = x_1, X_2 = x_2, X_3=x_3) = \sum_{h_1,h_2,h_3} p(X_1 = x_1, H_1 = h_1, X_2 = x_2, H_2 = h_2, X_3=x_3, H_3 = h_3) = \sum_{h_1,h_2,h_3} \pi_{h_1} O_{x_1,h_1} T_{h_2,h_1} O_{x_2,h_2} T_{h_3,h_2} O_{x_3,h_3} \)
Does this have a matrix form too? Yes, not surprisingly. If we fix \( x \), the second observation, and define \( [P_{3,x,1}]_{x_3,x_1} = p(X_1 = x_1, X_2 = x, X_3 = x_3) \), (i.e. \( P_{3,x,1} \) is an \( m \times m \) matrix defined for each observation symbol \( x \)), then
\( P_{3,x,1} = OT \mathrm{diag}(O_x) T \mathrm{diag}(\pi) O^{\top} \).
Here, \( \mathrm{diag}(O_x) \) is a diagonal matrix with the \(x\)th row of \( O \) on its diagonal.
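The same kind of numerical check works for the three-observation case. The sketch below is again on a random HMM of my own choosing. It uses `np.einsum` to spell out the triple sum over latent states, then compares it with the matrix form for every middle symbol \( x \).

```python
import numpy as np

rng = np.random.default_rng(1)
m = 3  # m = n, as assumed in the text

def random_column_stochastic(rng, m):
    # Hypothetical helper: columns are distributions; m * I keeps it invertible.
    A = rng.random((m, m)) + m * np.eye(m)
    return A / A.sum(axis=0)

T = random_column_stochastic(rng, m)   # T[h2, h1] = p(h2 | h1)
O = random_column_stochastic(rng, m)   # O[x, h]   = p(x | h)
pi = rng.random(m) + 0.1
pi /= pi.sum()

# Triple sum over (h1, h2, h3) = (f, g, h); output is indexed [x, x3, x1].
P3x1_sum = np.einsum('ch,hg,xg,gf,f,af->xca', O, T, O, T, pi, O)

# Matrix form: one m x m matrix per middle symbol x; O[x] is the x-th row of O.
P3x1 = np.stack([O @ T @ np.diag(O[x]) @ T @ np.diag(pi) @ O.T
                 for x in range(m)])

print(np.allclose(P3x1, P3x1_sum))
```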
Now define \( B_x = P_{3,x,1}P_{2,1}^{-1} \) (this is well-defined because \( P_{2,1} \) is invertible — all the conditions we had on the HMM parameters make sure that it is true), then:
\( B_x = OT \mathrm{diag}(O_x) T \mathrm{diag}(\pi) O^{\top} \times (O T\mathrm{diag}(\pi)O^{\top})^{-1} = OT\mathrm{diag}(O_x)O^{-1} \)
(just recall that \( (AB)^{-1} = B^{-1} A^{-1} \) whenever both sides are defined and \( A \) and \( B \) are square matrices.)
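Here is a small sketch of that cancellation in NumPy, under the same assumptions as before (a random HMM of my own choosing with \( m = n \), invertible \( O \) and \( T \), positive \( \pi \)). It builds \( B_x \) only from the two observable moment matrices and then compares it with the parameter form \( OT\mathrm{diag}(O_x)O^{-1} \).

```python
import numpy as np

rng = np.random.default_rng(2)
m = 3

def random_column_stochastic(rng, m):
    # Hypothetical helper: columns are distributions; m * I keeps it invertible.
    A = rng.random((m, m)) + m * np.eye(m)
    return A / A.sum(axis=0)

T = random_column_stochastic(rng, m)
O = random_column_stochastic(rng, m)
pi = rng.random(m) + 0.1
pi /= pi.sum()

# Observable moments (computed exactly from the parameters here).
P21 = O @ T @ np.diag(pi) @ O.T
P3x1 = [O @ T @ np.diag(O[x]) @ T @ np.diag(pi) @ O.T for x in range(m)]

# B_x built from observable quantities only.
P21_inv = np.linalg.inv(P21)
B = [P3x1[x] @ P21_inv for x in range(m)]

# The form in terms of the hidden parameters.
O_inv = np.linalg.inv(O)
B_params = [O @ T @ np.diag(O[x]) @ O_inv for x in range(m)]

print(all(np.allclose(B[x], B_params[x]) for x in range(m)))
```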
This part of getting \( B_x \) (and I will explain in a minute why we need it) is the hardest part in our derivation so far. Next, define \( P_1 \) as the vector of first-observation probabilities, \( [P_1]_x = p(X_1 = x) \). We can verify that \( P_1 \) equals \( O\pi \). Let’s call this vector \( b_1 \), i.e. \( b_1 \) is exactly the vector \( P_1 \).
We can also rewrite \( P_1 \) the following way:
\( P_1^{\top} = 1^{\top} T \mathrm{diag}(\pi) O^{\top} = 1^{\top} O^{-1} \underbrace{O T \mathrm{diag}(\pi) O^{\top}}_{P_{2,1}} \)
where \( 1^{\top} \) is a \( 1 \times m \) vector with the value 1 in all coordinates. The first equality is the “surprising” one, since we use \( T \) to calculate the distribution \( p(X_1 = x_1) \). But if you write down this matrix multiplication explicitly, you will discover that we are summing over the elements of \( T \) in such a way that it does not play a role in the sum: each column of \( T \) sums to 1, so \( 1^{\top} T = 1^{\top} \). (As Hsu et al. put it in their paper: this is an unusual but easily verified form to write \( P_1 \).)
The above leads to the identity \( P_1^{\top} = 1^{\top} O^{-1} P_{2,1} \).
Now, it can be easily verified from the above form of \( P_1 \) that for \( b_{\infty} \) defined as \( (P^{\top}_{2,1})^{-1} P_1 \), an \(m\)-length vector, we have:
\( b_{\infty}^{\top} = 1^{\top} O^{-1} \).
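A quick numerical sketch of these two facts, on a random HMM of my own choosing with \( m = n \). It reads \( P_1 \) off \( P_{2,1} \) by summing out the second observation, and solves a linear system rather than forming the inverse explicitly (a stylistic choice of mine, not something the derivation requires).

```python
import numpy as np

rng = np.random.default_rng(3)
m = 3

def random_column_stochastic(rng, m):
    # Hypothetical helper: columns are distributions; m * I keeps it invertible.
    A = rng.random((m, m)) + m * np.eye(m)
    return A / A.sum(axis=0)

T = random_column_stochastic(rng, m)
O = random_column_stochastic(rng, m)
pi = rng.random(m) + 0.1
pi /= pi.sum()

P21 = O @ T @ np.diag(pi) @ O.T

# [P21]_{x2, x1} = p(x1, x2), so summing over rows (x2) leaves p(x1).
P1 = P21.sum(axis=0)
print(np.allclose(P1, O @ pi))

# b_inf = (P21^T)^{-1} P1, computed as the solution of P21^T b = P1.
b_inf = np.linalg.solve(P21.T, P1)
print(np.allclose(b_inf, np.ones(m) @ np.linalg.inv(O)))
```

In this square case the columns of \( O \) sum to 1, so \( 1^{\top} O = 1^{\top} \) and therefore \( 1^{\top} O^{-1} \) is itself the all-ones vector. That makes for a handy sanity check on `b_inf`.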
So what do we have so far? We managed to define the following matrices and vectors based only on the joint distribution of the first three symbols in the HMM:
\( B_x = P_{3,x,1}P_{2,1}^{-1} = OT\mathrm{diag}(O_x)O^{-1}, \)
\( b_1 = P_1 = O\pi, \)
\( b_{\infty}^{\top} = ((P^{\top}_{2,1})^{-1} P_1)^{\top} = P_1^{\top} P_{2,1}^{-1} = 1^{\top} O^{-1}. \)
The matrix \( B_x \in \mathbb{R}^{m \times m} \) and vectors \( b_{\infty} \in \mathbb{R}^m \) and \( b_1 \in \mathbb{R}^m \) will now play the role of our HMM parameters. How do we use them as our parameters?
Say we just observe a single symbol in our data, i.e. the length of the sequence is 1, and that symbol is \(x\). Let’s multiply \( b^{\top}_{\infty} B_x b_1 \).
According to the above equalities, it is true that this equals:
\( b^{\top}_{\infty} B_x b_1 = (1^{\top} O^{-1}) (O T \mathrm{diag}(O_x) O^{-1}) (O \pi) = 1^{\top} T \mathrm{diag}(O_x) \pi \).
Note that this quantity is a scalar. We are multiplying a matrix by a vector from left and right. Undo this matrix multiplication, and write it the way we like in terms of sums over the latent states, and what do we get? The above just equals:
\( b^{\top}_{\infty} B_x b_1 = \sum_{h_1,h_2} T_{h_2,h_1} O_{x,h_1} \pi_{h_1} = \sum_{h_1,h_2} p(H_1 = h_1) p(X_1 = x | H_1 = h_1) p(H_2 = h_2 | H_1 = h_1) = p(X_1 = x) \).
So, this triplet-product gave us back the distribution over the first observation. That’s not very interesting, we could have done it just by using \( b_1 \) directly. But… let’s go on and compute:
\( b^{\top}_{\infty} B_{x_2} B_{x_1} b_1. \)
This can be easily verified to equal \( p(X_1 = x_1, X_2 = x_2) \).
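For completeness, here is that verification written out, using only the identities for \( B_x \), \( b_1 \) and \( b_{\infty} \) from above. Each \( O^{-1} O \) pair in the middle cancels:
\( b^{\top}_{\infty} B_{x_2} B_{x_1} b_1 = (1^{\top} O^{-1}) (O T \mathrm{diag}(O_{x_2}) O^{-1}) (O T \mathrm{diag}(O_{x_1}) O^{-1}) (O \pi) = 1^{\top} T \mathrm{diag}(O_{x_2}) T \mathrm{diag}(O_{x_1}) \pi \)
Undoing the matrix multiplication as before gives
\( 1^{\top} T \mathrm{diag}(O_{x_2}) T \mathrm{diag}(O_{x_1}) \pi = \sum_{h_1,h_2,h_3} T_{h_3,h_2} O_{x_2,h_2} T_{h_2,h_1} O_{x_1,h_1} \pi_{h_1} = p(X_1 = x_1, X_2 = x_2) \),
where the sum over \( h_3 \) contributes a factor of 1 because each column of \( T \) sums to 1.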
The interesting part is that in the general case, for a sequence of any length \( t \),
\( b^{\top}_{\infty} B_{x_t} B_{x_{t-1}} \cdots B_{x_1} b_1 = p(X_1 = x_1, \ldots, X_t = x_t). \)
That is, we can now calculate the probability of any observation sequence in the HMM only by knowing the distribution over the first three observations! (To convince yourself about the general case above, just look at Lemma 1 in the Hsu et al. paper.)
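To see this at work, the sketch below computes the probability of a sequence in two ways: with the observable operators, and with the usual forward algorithm on the hidden parameters. The HMM is random and of my own choosing, the function names are hypothetical, and the moments are computed exactly from the parameters rather than estimated, so the two numbers should agree up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(4)
m = 3  # m = n

def random_column_stochastic(rng, m):
    # Hypothetical helper: columns are distributions; m * I keeps it invertible.
    A = rng.random((m, m)) + m * np.eye(m)
    return A / A.sum(axis=0)

T = random_column_stochastic(rng, m)   # T[h2, h1] = p(h2 | h1)
O = random_column_stochastic(rng, m)   # O[x, h]   = p(x | h)
pi = rng.random(m) + 0.1
pi /= pi.sum()

# Observable quantities: only P1, P21 and P3x1 are used below this point.
P1 = O @ pi
P21 = O @ T @ np.diag(pi) @ O.T
P3x1 = [O @ T @ np.diag(O[x]) @ T @ np.diag(pi) @ O.T for x in range(m)]

b1 = P1
b_inf = np.linalg.solve(P21.T, P1)
B = [P3x1[x] @ np.linalg.inv(P21) for x in range(m)]

def prob_observable(seq):
    """b_inf^T B_{x_t} ... B_{x_1} b_1, applied right to left."""
    v = b1
    for x in seq:
        v = B[x] @ v
    return b_inf @ v

def prob_forward(seq):
    """Standard forward algorithm on the hidden parameters."""
    alpha = pi * O[seq[0]]
    for x in seq[1:]:
        alpha = (T @ alpha) * O[x]
    return alpha.sum()

seq = [0, 2, 1, 1, 0]
print(prob_observable(seq), prob_forward(seq))
```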
In order to turn this into an estimation algorithm, we just need to estimate from data \( P_1 \), \( P_{2,1} \) and \( P_{3,x,1} \) for each observation symbol \( x \) (all observed, just “count and normalize”), and voila, you can estimate the probability of any sequence of observations (one of the basic problems with HMMs according to this old classic paper, for example).
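As a rough illustration of “count and normalize”, here is a sketch that samples independent length-3 sequences from a small HMM, builds the empirical moments, and plugs them into the formulas above. Everything specific here is an assumption of mine: the toy parameter values, the sample size, and the helper `sample_from_columns`. With finite data the estimate is only approximate, and nothing forces it to be a valid probability.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2
T = np.array([[0.9, 0.2], [0.1, 0.8]])   # T[h2, h1] = p(h2 | h1)
O = np.array([[0.8, 0.1], [0.2, 0.9]])   # O[x, h]   = p(x | h)
pi = np.array([0.6, 0.4])

def sample_from_columns(rng, M, cols):
    """For each c in cols, draw a row index from the distribution M[:, c]."""
    cdf = M.cumsum(axis=0)[:, cols]               # shape (rows, N)
    u = rng.random(len(cols))
    return np.minimum((u > cdf).sum(axis=0), M.shape[0] - 1)

# Sample N independent triples (x1, x2, x3); the hidden states are discarded.
N = 300_000
h1 = rng.choice(m, size=N, p=pi)
h2 = sample_from_columns(rng, T, h1)
h3 = sample_from_columns(rng, T, h2)
x1, x2, x3 = (sample_from_columns(rng, O, h) for h in (h1, h2, h3))

# Count and normalize.
P1_hat = np.bincount(x1, minlength=m) / N
P21_hat = np.zeros((m, m))
np.add.at(P21_hat, (x2, x1), 1.0 / N)             # [x2, x1]
P3x1_hat = np.zeros((m, m, m))
np.add.at(P3x1_hat, (x2, x3, x1), 1.0 / N)        # [x][x3, x1]

# Plug the empirical moments into the formulas from the text.
b1 = P1_hat
b_inf = np.linalg.solve(P21_hat.T, P1_hat)
B = [P3x1_hat[x] @ np.linalg.inv(P21_hat) for x in range(m)]

def prob_observable(seq):
    v = b1
    for x in seq:
        v = B[x] @ v
    return b_inf @ v

def prob_forward(seq):
    alpha = pi * O[seq[0]]
    for x in seq[1:]:
        alpha = (T @ alpha) * O[x]
    return alpha.sum()

seq = [0, 1, 1, 0]
estimated, exact = prob_observable(seq), prob_forward(seq)
print(estimated, exact)
```

The estimate comes from observations only, while `prob_forward` peeks at the true parameters and is there purely for comparison.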
But… We made a heavy assumption. We assumed that \( n = m \), i.e. that we have as many observation symbols as latent states. What do we do if that’s not true (i.e. if \( m < n \))? That’s where the “spectral” part kicks in. Basically, what we need to do is to reduce our \( O \) matrix into an \( m \times m \) matrix using some \( U \) matrix, while ensuring that \( U^{\top}O \) is invertible (just like we assumed \( O \) was invertible before). Note that \( U \) needs to be \( n \times m \).
It turns out that a \( U \) that will be optimal in some sense, and will also make all of the above algebraic tricks work, is the matrix whose columns are the top \( m \) left singular vectors of \( P_{2,1} \). Understanding why this is the case requires some basic knowledge of linear algebra. Read the paper to understand this!
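To give a feel for how the pieces change when \( m < n \), here is a sketch with \( n = 5 \) symbols and \( m = 3 \) states. The definitions \( b_1 = U^{\top} P_1 \), \( b_{\infty} = (P_{2,1}^{\top} U)^{+} P_1 \) and \( B_x = U^{\top} P_{3,x,1} (U^{\top} P_{2,1})^{+} \), with \( ^{+} \) the pseudo-inverse, are how I recall them from the Hsu et al. paper, so please check the paper for the authoritative statement. The random HMM is my own, and the moments are exact rather than estimated.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 3, 5  # fewer latent states than observation symbols

# Column-stochastic parameters; the added m * eye terms keep T invertible
# and give O full column rank.
T = rng.random((m, m)) + m * np.eye(m)
T /= T.sum(axis=0)
O = rng.random((n, m)) + m * np.eye(n, m)
O /= O.sum(axis=0)
pi = rng.random(m) + 0.1
pi /= pi.sum()

# Exact moments: P21 is now n x n but only has rank m.
P1 = O @ pi
P21 = O @ T @ np.diag(pi) @ O.T
P3x1 = [O @ T @ np.diag(O[x]) @ T @ np.diag(pi) @ O.T for x in range(n)]

# U: the top-m left singular vectors of P21 (an n x m matrix).
U = np.linalg.svd(P21)[0][:, :m]

# Projected observable parameters (pseudo-inverses replace inverses).
b1 = U.T @ P1
b_inf = np.linalg.pinv(P21.T @ U) @ P1
B = [U.T @ P3x1[x] @ np.linalg.pinv(U.T @ P21) for x in range(n)]

def prob_observable(seq):
    v = b1
    for x in seq:
        v = B[x] @ v
    return b_inf @ v

def prob_forward(seq):
    alpha = pi * O[seq[0]]
    for x in seq[1:]:
        alpha = (T @ alpha) * O[x]
    return alpha.sum()

seq = [4, 0, 2, 2, 1]
print(prob_observable(seq), prob_forward(seq))
```

Everything after the SVD line has the same shape as in the square case, except that the operators live in the \( m \)-dimensional space spanned by the columns of \( U \).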