Phylogenetics and Algebraic Geometry

Ok, this post is going a little bit out of my field. If anyone can fill in some of the gaps in my understanding of phylogenetics and in how to get from there to the math, please do so in the comments. So, anyway, phylogenetics is the study of how various organisms are related, and is very closely related to taxonomy, the classification of organisms. It doesn’t look, at first, like it’s the sort of thing that would grab techniques from algebraic geometry, the study of solutions to polynomial equations. My thought is that it just looks too hard on the surface to hope to have anything nice popping out mathematically that’s that elementary. Well…that’s wrong. Without realizing it, algebraic geometers have been looking at problems relevant to phylogenetics for quite some time.

Let $T$ be a tree. That is, take some collection of vertices $V$ and a collection of edges $E$ connecting pairs of vertices such that there are no loops and the whole thing is connected. In fact, we’ll have a distinguished vertex called the root, which we think of as the starting point for the tree. At the root, we stick a vector $\pi$, which is a probability vector (ie, the entries add to one) and which has $k$ entries. We call this the number of states, and we think of $\pi$ as a probability distribution: what is the chance that the root is in state $i$? By definition, it’s $\pi_i$.

So now there’s the rest of the tree. Along each edge, we stick a $k\times k$ matrix, whose $a_{ij}$ element represents the probability of the system moving from state $i$ to state $j$. We can put a new matrix on each edge if we like. So now we’ve got a whole mess of variables representing the probability distribution we started with and how the distribution can change. So now, each vertex can be taken to be a random variable with $k$ possible states by giving it the distribution $\pi$ multiplied by the matrices of the unique chain of edges from the root to the vertex.

Now, for any reasonable modeling in phylogenetics, we need to constrain the entries of the matrices and the entries of $\pi$ in biologically sensible ways. So we had a system of $N$ variables, that is, the entries of the matrices and the vector $\pi$, and we put in some constraints, which gives us a subset $P$ of $\mathbb{R}^N$. So specifying $P$ is the same as determining a model for evolution. More on that in a moment.

So now say that there are $n$ leaves, with a leaf just being a vertex in our tree which is only contained in one edge. Then at each leaf, we can see $k$ possible outcomes. That is, the leaf can be in any of the $k$ states that we’re looking at. So there are $k^n$ possible outcomes, and given $\pi$ and all the matrices, we can at least write down probabilities for all of them. In fact, and here’s where the algebraic geometry begins, the probability of any given outcome is given as a polynomial in the entries of the matrices and $\pi$! So we get a polynomial map $\phi:\mathbb{R}^N\to \mathbb{R}^{k^n}$.

Now, from this we have complete freedom in all of our variables, so the map only relies on the tree and the number of states. Not terribly good for us. However, we have $P\subset \mathbb{R}^N$ which actually defines our model, so what we should be interested in is $\phi(P)$. So what can we say about this image? Well, probability dictates that $\phi(P)$ is contained in the standard $k^n-1$ simplex, simply because the image consists of probability vectors, and so the entries must add up to one.

Before doing any more, a very simple example is in order. We’ll look at the binary tree with three leaves. That means that there’s the root, then there’s two edges $a$ and $b$. Then, off of $b$, there are two more edges $c$ and $d$, and the vertices at the end of $a,c,d$ are the three leaves. We’ll also take $k=2$, so that our tree gives a binary random variable.

Assume the first leaf, attached to $a$, is in state $i$, the second in $j$ and the third in state $k$. Then there’s a polynomial $\phi_{ijk}=\pi_0a_{0i}b_{00}c_{0j}d_{0k}+\pi_0 a_{0i}b_{01}c_{1j}d_{1k}+\pi_1a_{1i}b_{10}c_{0j}d_{0k}+\pi_1 a_{1i}b_{11}c_{1j}d_{1k}$. Each term represents one possibility. The first, for instance, is the probability that we started in state 0, then moved to state $i$ on $a$ and that along $b$ nothing changed, and then along $c$ and $d$ we came out with the final configuration we wanted. The other terms have similar meanings, so $\phi_{ijk}$, when given numbers, tells us the probability that the final state will be $i,j,k$. So we get 8 polynomials which define a map $\mathbb{R}^{18}\to \mathbb{R}^8$, and in fact give a map on a nine dimensional cube in $\mathbb{R}^{18}$ when you remember that all these numbers must be positive, between zero and one, and that pairs of them must add to 1, as they are probailities.

We’re going to ignore that, however, and notice that this gives a set of perfectly good polynomials if we plug in complex numbers. Additionally, we’re not really working on $\mathbb{C}^{18}$ at this point. Each matrix acts as homogeneous coordinates on a copy of $\mathbb{P}^3$ and the starting probability vector acts as coordinates on $\mathbb{P}^1$. In fact, by working over projective space, we are kind of preserving the probability issues, because we can always choose homogeneous coordinates for a point which sum up to one. So our map is really $\mathbb{P}^3\times\mathbb{P}^3\times\mathbb{P}^3\times\mathbb{P}^3\times\mathbb{P}^1\to\mathbb{P}^7$.In general, we’ll get a $\mathbb{P}^{k^2-1}$ for each edge and a $\mathbb{P}^{k-1}$ for the probability vector as the domain.

Now back to the biology a little bit. Generally, we’ll have $k$ is either two (because the variable is either “yes” or “no”) or else four (usually A,C,G,T, so we’ve got random DNA), and the remaining question regards how can one thing change to another. One model, the general Markov model, is what we did above. We make no assumptions whatsoever on the entries of the matrices, they can be whatever they want. On the other hand, and this one I actually see the biology in, there’s the stationary base composition model. This one doesn’t require anything of the matrix entries, but rather requires that the matrices all have $\pi$ as an eigenvector, so that the distribution of the states stays the same. Say, there’s always the same fraction of nucleotides of each type.

Now I’m going to do a rather simple example in order to illustrate the classical geometry that comes out of it. Start with a root and then two leaves, so it’s a tree with exactly two edges. Give these edges the same matrix and take $k=2$. Then we’ll further require that the matrix be of the form $\left[\begin{array}{cc}a_0&a_1\\a_1&a_0\end{array}\right]$. Then we get $\phi_{ij}=\pi_1a_{0i}a_{0j}+\pi_1a_{1i}a_{1j}$, which simplifies out to $\phi_{00}=\pi_0a_0^2+\pi_1a_1^2$, $\phi_{11}=\pi_0a_1^2+\pi_1a_0^2$ and $\phi_{01}=\phi_{10}=\pi_0a_0a_1+\pi_1a_0a_1$. Now, working with Groebner bases to eliminate the $\pi_i,a_i$ variables, this tells us that the image is $\mathbb{P}^2\subset\mathbb{P}^3$. In fact, we can get a bit more out of this if we try: it turns out to be the map given by the Segre embedding of $\mathbb{P}^1\times\mathbb{P}^1$ into $\mathbb{P}^3$, followed by a projection down to $\mathbb{P}^2$.

Charles Siegel is currently a postdoc at Kavli IPMU in Japan. He works on the geometry of the moduli space of curves.
This entry was posted in Algebraic Geometry, Biology, Combinatorics, Probability. Bookmark the permalink.

7 Responses to Phylogenetics and Algebraic Geometry

1. carlbrannen says:

Wow. The “possibly related posts” connected this blog up to my blog, but it got the wrong post. I’ve spent more than my share of time solving this sort of problem, but in the quantum mechanical equivalent.

For an elementary particle that is made up of several components, that is, a bound state, one finds that this should not depend on time; it must be an eigenstate of the Hamiltonian. Translated into Markov chanis, this means that the probability amplitudes, when written in N-dimensional matrix form, M, should be idempotent, MM = M.

And this gives a set of N coupled quadratic equations in N unkonwns similar to what you’re working on, except for the idempotency. One can take a rather simple set of equations, and from them, get the weak hypercharge and weak isospin quantum numbers of the elementary fermions.

2. Charles says:

Cool. I really need to get back to studying particle physics…perhaps once my orals are done I’ll have time.