I. "Have Any of These People Ever Been to a Chinese Restaurant?"
The Dirichlet process is a stochastic process that can be used to partition a set of elements into a set of subsets. In biological modeling, it is commonly used to assign elements into groups, such as molecular sequence sites into distinct rate categories. Very often, an intuitive explanation as to how it works invokes the "Chinese Restaurant Process" analogy. I have always found this analogy very jarring and confusing, as, while it is a good description of how the Dirichlet process works, it is a terrible description of how Chinese restaurants, or, indeed, any other type of restaurant run by (and patronized by) humans[1], works. As Dr. Emily Jane McTavish says, "Have any of these people ever been to a Chinese restaurant?" Indeed. In fact, one may wonder if any of them have been to any restaurant outside of a Kafkaesque performance-art themed one.
II. The "Traveler's Restaurant Process" Analogy
I believe a far more intuitive analogy might be given by the "Traveler's Restaurant [Selection] Process". A common dictum in traveler lore, guides, advice, and general wisdom is that when picking a restaurant for a meal to prefer establishments that appear to be more popular or frequented with locals. And so, we can imagine instead, describing how a group of N travellers distribute themselves among the various restaurants at a food court. As with the original analogy, we have the travelers entering the establishment one-by-one, with the food court initially empty. (Yes, this is still a weakness in the analogy that lends a surreal contrivance to the whole tale, but maybe this can be alleviated somewhat by imagining that it is (a) 1 am in the morning; (b) the travelers have just arrived and are suffering from jet lag; and (c) wander down to the food court after checking in their respective hotels/motels/pensions, and variances in processes times lead them to come in straggling individually. As for the food court open at 1 am -- in a number of parts of the world, this is pretty commonplace, I assure you!). The first traveler finds an empty food court, and picks a restaurant at random. The next traveler enters the food court, and looking around, sees just one restaurant occupied. It is possible that she makes a beeline for that restaurant, but maybe she is feeling cranky or does not particular like the other traveler (or other travelers in general), and heads off to a different restaurant. And so on with the next traveler, and the next, and the next, until the last one, each of whom makes an individual decision whether to try a new, empty restaurant (if they are feeling particularly adventurous or misanthrophic or introverted or have squabbled with or are otherwise disdainful of everyone else) or else, at the other extreme, head for the most crowded restaurant (if they are sticking to the true and tried traveler's restaurant selection dictum or are feeling particular sociable), or everything in between. At the end of it all we take the clusters of travelers as they have distributed themselves into individual restaurants as the partition of the original the set of travelers.
Makes (better) sense?
Maybe not!
III. Working Through the Process
Emily also says that many times just walking through the model expressions and formulas often yields much better intuition than contrived and twisted analogies and metaphors. And maybe she is right!
So, let us take a set of $N$ elements: $t_1, t_2, ..., t_N$. We can imagine each element to be a traveler from the "Traveler's Restaurant Process" analogy, or vice versa, or anything else you prefer. Each element is assigned to a group in turn, creating new groups as needed. With no existing groups, the first element is automatically assigned to a new group. Well, technically, the first element is assigned to new group with probability given by:
\[ \begin{align} \frac{\alpha}{\alpha + n - 1}, \end{align} \]
where $\alpha$ is known as the concentration parameter, and $n$ is is the 1-based index of the element (i.e., the first element has $n=1$, the second has $n=2$, the third has $n=3$, and so on). The first element has $n=1$, and the above expression evaluates to 1.0, so is always assigned a new group. The more general case, applicable to all subsequent elements up to and including the last one, is that the $n^{\text{th}}$ element is:
- Either assigned a new group, $g_*$ with probability according to expression (1) above,
- Or assigned to one of the existing groups, $g_i$, with probability: $$ \begin{align} \frac{|g_i|}{\alpha + n - 1}, \end{align} $$
where $\alpha$ is the concentration parameter and $n$ is is the 1-based index of the element, as previously described, and $|g_i|$ is the number of elements in group $g_i$.
IV. How/Why Does it Work?
How do we know that each element will definitely be assigned to a group? We can check that the probabilities of either being assigned to a new group or existing group sum to one to reassure ourselves of this. Let us say that we are dealing with $n^{\text{th}}$ element out of $N$ elements, where $1 \leq n \leq N$. As described above, the probability that this element will be assigned to a new group is given by expression (1). On the other hand, the probability of being assigned to any one of the existing groups is given by:
\[ \begin{align} \sum_{i=1}^{M}\frac{|g_i|}{\alpha + n - 1} \end{align} \]
where $M$ is the number of existing groups, $g_1, g_2, ..., g_M$. Now, while we do not know in this general case exactly how the previous $n-1$ elements are distributed amongst the various $M$ existing groups, we do know that, regardless, there must be a total of $n-1$ elements across all groups. So, then, the above expression reduces to:
\[ \begin{align} \frac{\sum_{i=1}^{M} |g_i|}{\alpha + n - 1} = \frac{n-1}{\alpha + n - 1}. \end{align} \]
So the probability of either being assigned to a new group or being assigned to an existing group is given by the sum of Expression (1) and (4), which is:
\[ \begin{align} \frac{\alpha}{\alpha + n - 1} + \frac{n-1}{\alpha + n - 1} &= \frac{\alpha + n - 1}{\alpha + n - 1} \\ &= 1. \end{align} \]
And thus, we are assured of all elements being assigned to a group, whether a new one or an existing one, as well as being able to sleep at night knowing that the distribution of partitions under the Dirichlet process is a proper probability as it sums to 1.0.
V. The (Anti-?)Concentration Parameter
We have mentioned and used the concentration parameter, $\alpha$, above, without actually explaining it. Basically, the concentration parameter, as its name implies, determines how "clumpy" the process is. Unfortunately, contrary to what its name implies, at low values the process is more "clumpy" --- i.e., yielding partitions where elements tend to be grouped together more frequently, resulting in larger and correspondingly fewer subsets --- while at high values the process is less "clumpy" --- yielding partitions where elements tend to be more dispersed, resulting in smaller and correspondingly more subsets. Yeah, it really should have been called the "anti-concentration" or "dispersion" parameter. Or perhaps folks should just stick to using the more neutral and less evocative, but non-misleading, standard term: the "scaling" parameter.
(See updates here and here that clear up this issue! Basically, the "concentration" parameter gets its name not due to way it concentrates elements, as naive and incorrect intuition led me to think, but rather due to how increasing it leads to the distribution of values across the subsets converging on the base distribution! What values? What base distribution? EXACTLY! Those concepts do not really enter into any of the analogies we have discussed so far, or, indeed, even in the way we have explained the process with reference to the equations and model. Only when considering a version of the DP process where our elements are not just anonymous exchangeable atoms, but value-bearing elements that we want to cluster based on values, does the base distribution enter the picture, and only then does the "concentration" parameter do its concentrating the higher it gets!)
VI. In Action
This Gist wraps up all this logic in a script that you can play with to get a feel for how different concentration parameters work.
With low values of the concentration parameter, we pretty much get all the elements packed into a single set:
# python dirichlet_partition.py -n100 -v0 -a 0.01 Mean number of subsets per partition: 1.04 Mean number of elements per subset: 9.8
On the other hand, with high values of the scaling parameter, we the trend is toward each element being in its own set, with nearly as many subsets in the partition as there are elements in the full set:
# python dirichlet_partition.py -n100 -v0 -a 100 Mean number of subsets per partition: 9.65 Mean number of elements per subset: 1.03944444444
And, of course, moderate values result in something in between:
# python dirichlet_partition.py -n100 -v0 -a 1 Mean number of subsets per partition: 3.03 Mean number of elements per subset: 3.92166666667 # python dirichlet_partition.py -n100 -v0 -a 5 Mean number of subsets per partition: 5.76 Mean number of elements per subset: 1.88055555556 # python dirichlet_partition.py -n100 -v0 -a 10 Mean number of subsets per partition: 7.01 Mean number of elements per subset: 1.49297619048
VII. The Script
The script to run this is shown below. If you are interested in downloading and using it, please visit: https://gist.github.com/jeetsukumaran/605de7053e9122af39d091f74bdd946d.
[gist id="605de7053e9122af39d091f74bdd946d"]
Footnotes
- Interestingly enough, I can imagine a restaurant that was patronized by cats working pretty much exactly like the traditional analogy, given the way some cats tend to be clumpers and others loners. So, if you don't like the "Traveler's Restaurant Process" analogy, please feel free to use the "Cat Restaurant Process" analogy, or, better yet, the "Cat Room/Furniture Occupation Process" analogy.
- Update 2017-07-25: So it seems that the concentration parameter gets its name not from the fact that it (inversely) controls the concentration or clustering or elements, but because of its relationship to the base distribution of the Dirichlet process. The base distribution is something that we have not talked about, and I will cover it in a future post once I understand it well enough to talk about it in a useful and interesting way! But for the time being, it is sufficient to say that it is the distribution over the elements themselves before they are assigned into groups. The concentration parameter is so called because, as it increases in value, it increasingly "concentrates" the values of elements on the base distribution (while at the same time increasingly disperses the elements to distinct groups). So the reason for the counter-intuitive name for the parameter is wrong intuition --- it is not referring to how "concentrated" the elements are in terms of the subsets of partition, but how closely the distribution of elements resemble the base distribution. More details on this can be found here, where the concentration parameter is described as a prior belief in the base distribution.
- Update 2017-07-26: So, here is a GREAT explanation of what is going on: http://blog.echen.me/2012/03/20/infinite-mixture-models-with-nonparametric-bayes-and-the-dirichlet-process/ Summarizing: basically, if we understand the Dirichlet process using the Chinese Restaurant Process or the Traveler's Restaurant Process or the Hair-Clog Process etc. etc., we are ignoring the base distribution as we do not care about the values of the elements that are clustered, or the distribution of those values within each subset. Only with reference to, e.g. the Polya Urn Model, where each element has associated with it a value which, and is clustered to a group based on this values to the distribution of values associated with each group, while parameters of the distribution of each "urn" are drawn from a base distribution, do things make sense. As the concentration parameter increases, the elements spread out across more and more subsets/urns/tables, and as the parameters of the distribution of each subset/urn/table's "value" are sampled from the base distribution, in effect, the base distribution itself gets more and more (and better and better) sampled: as each subset/urn/table represents an independent sampling of the base distribution. Thus the distribution across all subset/urns/tables converges on the base distribution. Conversely, as the concentration parameter decreases, the elements cluster into fewer groups, and, with fewer groups we get more limited sampling of the the base distribution.