In probability theory, the Chinese restaurant process is a discrete-time stochastic process, analogous to seating customers at tables in a restaurant. Imagine a restaurant with an infinite number of circular tables, each with infinite capacity. Customer 1 sits at the first table. The next customer either sits at the same table as customer 1, or the next table. This continues, with each customer choosing to either sit at an occupied table with a probability proportional to the number of customers already there (i.e., they are more likely to sit at a table with many customers than few), or an unoccupied table. At time n, the n customers have been partitioned among m ≤ n tables (or blocks of the partition). The results of this process are exchangeable, meaning the order in which the customers sit does not affect the probability of the final distribution. This property greatly simplifies a number of problems in population genetics, linguistic analysis, and image recognition.
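To make the seating dynamics concrete, here is a minimal simulation sketch in plain Python (the function name and interface are my own, not from any reference implementation): each arriving customer joins an existing table with probability proportional to its occupancy, or opens a new one.

```python
import random

def crp(n, seed=None):
    """Simulate one draw of the basic (theta = 1) Chinese restaurant process.

    Returns the list of table sizes after n customers have been seated."""
    rng = random.Random(seed)
    tables = []  # tables[j] = number of customers at table j
    for i in range(n):  # customer i+1 arrives; i customers are already seated
        # Join table j with probability tables[j]/(i+1); open a new table
        # with probability 1/(i+1). A single uniform draw on [0, i+1) does both.
        r = rng.uniform(0, i + 1)
        acc = 0.0
        for j, size in enumerate(tables):
            acc += size
            if r < acc:
                tables[j] += 1
                break
        else:
            tables.append(1)  # r landed in the top 1/(i+1) slice: new table
    return tables
```

Running `crp(10)` a few times shows the rich-get-richer effect: large tables tend to attract most later customers.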
The restaurant analogy first appeared in a 1985 write-up by David Aldous,[1] where it was attributed to Jim Pitman (who additionally credits Lester Dubins).[2]
Formal definition
For any positive integer $n$, let $\mathcal{P}_n$ denote the set of all partitions of the set $\{1,2,3,\dots,n\} \triangleq [n]$. The Chinese restaurant process takes values in the infinite Cartesian product $\prod_{n \geq 1} \mathcal{P}_n$.
The value of the process at time $n$ is a partition $B_n$ of the set $[n]$, whose probability distribution is determined as follows. At time $n = 1$, the trivial partition $B_1 = \{\{1\}\}$ is obtained with probability one. At time $n + 1$ the element $n + 1$ is either:

1. added to one of the blocks of the partition $B_n$, where each block $b$ is chosen with probability $|b|/(n+1)$, with $|b|$ the size of the block, or
2. added to the partition $B_n$ as a new singleton block, with probability $1/(n+1)$.
The random partition so generated has some special properties. It is exchangeable in the sense that relabeling $\{1,\dots,n\}$ does not change the distribution of the partition, and it is consistent in the sense that the law of the partition of $[n-1]$ obtained by removing the element $n$ from the random partition $B_n$ is the same as the law of the random partition $B_{n-1}$.
The probability assigned to any particular partition (ignoring the order in which customers sit around any particular table) is
$$\Pr(B_n = B) = \frac{\prod_{b \in B} (|b| - 1)!}{n!}, \qquad B \in \mathcal{P}_n$$

where $b$ is a block in the partition $B$ and $|b|$ is the size of $b$.
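This formula is easy to sanity-check numerically: summing it over every partition of a small set should give exactly 1, since the numerators $\prod_b (|b|-1)!$ count permutations of $[n]$ by their cycle structure. A small brute-force sketch (helper names are my own):

```python
from math import factorial

def set_partitions(elements):
    """Yield all partitions of `elements` (a list) as lists of blocks."""
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for part in set_partitions(rest):
        yield [[first]] + part              # `first` in its own block
        for i in range(len(part)):          # or placed inside an existing block
            yield part[:i] + [[first] + part[i]] + part[i + 1:]

def partition_prob(blocks, n):
    """Pr(B_n = B) = prod_b (|b|-1)! / n! for the theta = 1 process."""
    num = 1
    for b in blocks:
        num *= factorial(len(b) - 1)
    return num / factorial(n)

n = 5
total = sum(partition_prob(p, n) for p in set_partitions(list(range(1, n + 1))))
# `total` should equal 1 up to floating-point error
```

For $n = 2$, for instance, the two partitions $\{\{1\},\{2\}\}$ and $\{\{1,2\}\}$ each get probability $1/2$.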
The definition can be generalized by introducing a parameter $\theta > 0$, which modifies the probability of the new customer sitting at a new table to $\frac{\theta}{n+\theta}$ and correspondingly modifies the probability of them sitting at a table of size $|b|$ to $\frac{|b|}{n+\theta}$. The vanilla process introduced above can be recovered by setting $\theta = 1$. Intuitively, $\theta$ can be interpreted as the effective number of customers sitting at the first empty table.
Alternative definition
An equivalent, but subtly different, way to define the Chinese restaurant process is to let new customers choose companions rather than tables.[3] Customer $n+1$ chooses to sit at the same table as any one of the $n$ seated customers with probability $\frac{1}{n+\theta}$, or chooses to sit at a new, unoccupied table with probability $\frac{\theta}{n+\theta}$. Notice that in this formulation the customer chooses a table without having to count table occupancies; the block sizes $|b|$ are never needed.
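A direct transcription of this companion-choice rule into code (a sketch with invented names, for general $\theta$) makes the bookkeeping point obvious: we only ever pick a uniformly random seated customer, never tally table sizes.

```python
import random

def crp_companion(n, theta=1.0, seed=None):
    """One draw of the CRP via the 'choose a companion' rule.

    Customer i+1 joins the table of a uniformly chosen seated customer,
    each with probability 1/(i + theta), or opens a new table with
    probability theta/(i + theta). Returns table_of[c] = table index of
    customer c."""
    rng = random.Random(seed)
    table_of = []
    n_tables = 0
    for i in range(n):
        if rng.random() < theta / (i + theta):
            table_of.append(n_tables)          # open a new table
            n_tables += 1
        else:
            companion = rng.randrange(i)       # uniform over the i seated customers
            table_of.append(table_of[companion])
    return table_of
```

Joining a uniformly chosen companion lands you at table $b$ with probability $|b|/(n+\theta)$, which is why the two definitions agree.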
Distribution of the number of tables
Chinese restaurant table distribution
Parameters: $\theta > 0$; $n \in \{1, 2, \ldots\}$
Support: $k \in \{1, 2, \ldots, n\}$
PMF: $\frac{\Gamma(\theta)}{\Gamma(n+\theta)}\, |s(n,k)|\, \theta^k$
Mean: $\theta\,(\psi(\theta+n) - \psi(\theta))$
The Chinese restaurant table (CRT) distribution is the probability distribution on the number of tables in the Chinese restaurant process.[4] It can be understood as the sum of $n$ independent Bernoulli random variables, each with a different parameter:

$$K = \sum_{i=1}^{n} b_i, \qquad b_i \sim \operatorname{Bernoulli}\!\left(\frac{\theta}{i-1+\theta}\right)$$

The probability mass function of $K$ is given by[5]

$$f(k) = \frac{\Gamma(\theta)}{\Gamma(n+\theta)}\, |s(n,k)|\, \theta^k, \qquad k = 1, \dots, n,$$

where $s$ denotes Stirling numbers of the first kind.
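The two characterizations can be checked against each other numerically. The sketch below (helper names are my own) builds the unsigned Stirling numbers with the standard recurrence $|s(n,k)| = |s(n-1,k-1)| + (n-1)\,|s(n-1,k)|$, evaluates the PMF, and compares its mean with the Bernoulli-sum mean $\sum_i \theta/(i-1+\theta)$:

```python
from math import gamma

def stirling1_unsigned(n):
    """Row n of unsigned Stirling numbers of the first kind: |s(n,k)|, k=0..n."""
    row = [1]  # n = 0
    for m in range(1, n + 1):
        new = [0] * (m + 1)
        for k in range(1, m + 1):
            new[k] = row[k - 1] + (m - 1) * (row[k] if k < m else 0)
        row = new
    return row

def crt_pmf(n, theta):
    """PMF of the number of tables K for given n and theta (index k = 0..n)."""
    s = stirling1_unsigned(n)
    return [gamma(theta) / gamma(n + theta) * s[k] * theta ** k
            for k in range(n + 1)]

n, theta = 10, 2.0
pmf = crt_pmf(n, theta)
mean_pmf = sum(k * p for k, p in enumerate(pmf))
mean_bern = sum(theta / (i - 1 + theta) for i in range(1, n + 1))
# pmf sums to 1 and the two means agree
```

The agreement of the means is exactly the identity $\sum_k k\,f(k) = \theta(\psi(\theta+n) - \psi(\theta))$ quoted above.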
Generalization
This construction can be generalized to a model with two parameters, $\theta$ and $\alpha$,[2][6] commonly called the strength (or concentration) and discount parameters respectively. At time $n+1$, the next customer to arrive finds $|B|$ occupied tables and decides to sit at an empty table with probability

$$\frac{\theta + |B|\alpha}{n + \theta},$$

or at an occupied table $b$ of size $|b|$ with probability

$$\frac{|b| - \alpha}{n + \theta}.$$
In order for the construction to define a valid probability measure it is necessary to suppose that either $\alpha < 0$ and $\theta = -L\alpha$ for some $L \in \{1, 2, \ldots\}$, or that $0 \leq \alpha < 1$ and $\theta > -\alpha$.
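A sampler for the two-parameter seating rule is a small variation on the one-parameter case. The sketch below (my own naming, assuming the common range $0 \leq \alpha < 1$, $\theta > -\alpha$) follows the two probabilities displayed above:

```python
import random

def crp_two_param(n, theta, alpha, seed=None):
    """Table sizes after n customers in the two-parameter (theta, alpha) CRP.

    Assumes 0 <= alpha < 1 and theta > -alpha."""
    rng = random.Random(seed)
    tables = []
    for i in range(n):          # i customers seated, len(tables) = |B| tables
        if not tables:
            tables.append(1)    # the first customer always opens a table
            continue
        if rng.random() < (theta + len(tables) * alpha) / (i + theta):
            tables.append(1)    # empty table: prob (theta + |B| alpha)/(i + theta)
        else:
            # occupied table b: prob proportional to |b| - alpha
            r = rng.uniform(0, i - len(tables) * alpha)
            acc = 0.0
            for j, size in enumerate(tables):
                acc += size - alpha
                if r < acc:
                    tables[j] += 1
                    break
            else:
                tables[-1] += 1  # guard against floating-point round-off
    return tables
```

A positive discount $\alpha$ produces heavier tails in the table-size distribution than the one-parameter process: more tables, each somewhat smaller.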
Under this model the probability assigned to any particular partition $B$ of $[n]$, in terms of the Pochhammer $k$-symbol, is

$$\Pr(B_n = B) = \frac{(\theta + \alpha)_{|B|-1,\alpha}}{(\theta + 1)_{n-1,1}} \prod_{b \in B} (1 - \alpha)_{|b|-1,1}$$

where, by convention, $(a)_{0,c} = 1$, and for $b > 0$

$$(a)_{b,c} = \prod_{i=0}^{b-1} (a + ic) = \begin{cases} a^b & \text{if } c = 0, \\[4pt] \dfrac{c^b\, \Gamma(a/c + b)}{\Gamma(a/c)} & \text{otherwise}. \end{cases}$$
Thus, for the case when $\theta > 0$ the partition probability can be expressed in terms of the Gamma function as

$$\Pr(B_n = B) = \frac{\Gamma(\theta)}{\Gamma(\theta + n)} \cdot \frac{\alpha^{|B|}\, \Gamma(\theta/\alpha + |B|)}{\Gamma(\theta/\alpha)} \prod_{b \in B} \frac{\Gamma(|b| - \alpha)}{\Gamma(1 - \alpha)}.$$

In the one-parameter case, where $\alpha$ is zero, this simplifies to

$$\Pr(B_n = B) = \frac{\Gamma(\theta)\, \theta^{|B|}}{\Gamma(\theta + n)} \prod_{b \in B} \Gamma(|b|).$$

Or, when $\theta$ is zero,

$$\Pr(B_n = B) = \frac{\alpha^{|B|-1}\, \Gamma(|B|)}{\Gamma(n)} \prod_{b \in B} \frac{\Gamma(|b| - \alpha)}{\Gamma(1 - \alpha)}.$$
As before, the probability assigned to any particular partition depends only on the block sizes, so the random partition is exchangeable in the sense described above. The consistency property likewise still holds by construction.
If $\alpha = 0$, the probability distribution of the random partition of the integer $n$ thus generated is the Ewens distribution with parameter $\theta$, used in population genetics and the unified neutral theory of biodiversity.
Animation of a Chinese restaurant process with scaling parameters $\theta = 0.5$, $\alpha = 0$.[7]

Derivation
Here is one way to derive this partition probability. Let $C_i$ be the random block into which the number $i$ is added, for $i = 1, 2, 3, \dots$. Then

$$\Pr(C_i = c \mid C_1, \ldots, C_{i-1}) = \begin{cases} \dfrac{\theta + |B|\alpha}{\theta + i - 1} & \text{if } c \in \text{new block}, \\[6pt] \dfrac{|b| - \alpha}{\theta + i - 1} & \text{if } c \in b. \end{cases}$$

The probability that $B_n$ is any particular partition of the set $\{1, \dots, n\}$ is the product of these probabilities as $i$ runs from $1$ to $n$. Now consider the size of block $b$: it increases by one each time an element is added to it, so when the last element of block $b$ is added, the block size is $|b| - 1$. For example, consider this sequence of choices: (generate a new block $b$)(join $b$)(join $b$)(join $b$). In the end, block $b$ has 4 elements, and the product of the corresponding numerators in the above equation is $\theta \cdot 1 \cdot 2 \cdot 3$. Following this logic, we obtain $\Pr(B_n = B)$ as above.
Expected number of tables
For the one-parameter case, with $\alpha = 0$ and $0 < \theta < \infty$, the number of tables is distributed according to the Chinese restaurant table distribution. The expected value of this random variable, given that there are $n$ seated customers, is[8]

$$\sum_{k=1}^{n} \frac{\theta}{\theta + k - 1} = \theta \cdot (\psi(\theta + n) - \psi(\theta))$$

where $\psi(\theta)$ is the digamma function. In the general case ($\alpha > 0$) the expected number of occupied tables is[6]

$$\frac{\Gamma(\theta + n + \alpha)\, \Gamma(\theta + 1)}{\alpha\, \Gamma(\theta + n)\, \Gamma(\theta + \alpha)} - \frac{\theta}{\alpha},$$

however, note that the $\Gamma(\cdot)$ function here is not the standard gamma function.[6]
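Because the number of tables is a sum of independent new-table indicator variables (the Bernoulli representation of the CRT distribution), the one-parameter expectation can be verified by simulating only those indicators. A quick sketch (function names are my own):

```python
import random

def expected_tables(n, theta):
    """E[#tables] for the one-parameter CRP: sum_{k=1}^{n} theta/(theta+k-1),
    which equals theta * (psi(theta+n) - psi(theta))."""
    return sum(theta / (theta + k - 1) for k in range(1, n + 1))

def sim_tables(n, theta, rng):
    """Count tables by simulating only the new-table events: customer i+1
    opens a table with probability theta/(i + theta)."""
    k = 0
    for i in range(n):
        if rng.random() < theta / (i + theta):
            k += 1
    return k

rng = random.Random(0)
n, theta = 50, 2.0
trials = 20000
mc = sum(sim_tables(n, theta, rng) for _ in range(trials)) / trials
exact = expected_tables(n, theta)
# mc should be close to exact (Monte Carlo error shrinks like 1/sqrt(trials))
```

Note the sum grows like $\theta \log n$: the number of tables grows only logarithmically with the number of customers.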
The Indian buffet process
It is possible to adapt the model such that each data point is no longer uniquely associated with a class (i.e., we are no longer constructing a partition), but may be associated with any combination of the classes. This strains the restaurant-tables analogy and so is instead likened to a process in which a series of diners samples from some subset of an infinite selection of dishes on offer at a buffet. The probability that a particular diner samples a particular dish is proportional to the popularity of the dish among diners so far, and in addition the diner may sample from the untested dishes. This has been named the Indian buffet process and can be used to infer latent features in data.[9]
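For illustration, here is a sketch of the buffet analogy in the form it is usually sampled (my own code and naming, not taken from the cited paper): diner $i$ takes each previously tried dish $d$ with probability proportional to its popularity $m_d$, then tries a Poisson($\alpha/i$) number of brand-new dishes.

```python
import math
import random

def ibp(n, alpha, seed=None):
    """One draw from the Indian buffet process: for each of n diners,
    the list of dish indices they sampled."""
    rng = random.Random(seed)

    def poisson(lam):
        # Knuth's multiplication method; fine for the small rates alpha/i
        L, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= L:
                return k
            k += 1

    dish_counts = []           # dish_counts[d] = diners who tried dish d so far
    rows = []
    for i in range(1, n + 1):
        # take existing dish d with probability m_d / i
        taken = [d for d, m in enumerate(dish_counts) if rng.random() < m / i]
        # then try Poisson(alpha / i) new dishes
        for _ in range(poisson(alpha / i)):
            taken.append(len(dish_counts))
            dish_counts.append(0)
        for d in taken:
            dish_counts[d] += 1
        rows.append(taken)
    return rows
```

Unlike the CRP, a diner's row is a *set* of features rather than a single table assignment, which is what makes the IBP suitable as a prior over binary feature matrices.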
Applications
The Chinese restaurant process is closely connected to Dirichlet processes and Pólya's urn scheme, and is therefore useful in applications of Bayesian statistics, including nonparametric Bayesian methods. The generalized Chinese restaurant process is closely related to the Pitman–Yor process. These processes have been used in many applications, including modeling text, clustering biological microarray data,[10] biodiversity modelling, and image reconstruction.[11][12]
This article is the third part of the series on Clustering with Dirichlet Process Mixture Models. Last time we defined the Finite Mixture Model based on the Dirichlet Distribution and posed the question of how to make this particular model infinite. We briefly discussed the idea of taking the limit of the model as the number of clusters k tends to infinity, but as we stressed, the existence of such an object is not trivial (in other words, how do we actually "take the limit of a model"?). As a reminder, the reason we want to make k infinite is that this gives us a non-parametric model which does not require us to predefine the total number of clusters in the data.
Update: The Datumbox Machine Learning Framework is now open-source and free to download. Check out the package com.datumbox.framework.machinelearning.clustering to see the implementation of Dirichlet Process Mixture Models in Java.
Even though our target is to build a model capable of performing clustering on datasets, we must first discuss Dirichlet Processes. We will provide both the strict mathematical definitions and more intuitive explanations of the DP, and we will discuss ways to construct the process. These constructions/representations can be seen as ways to find occurrences of the Dirichlet Process in "real life".
Although I have tried to adapt my research report so that these blog posts are easy to follow, it is still important to define the necessary mathematical tools and distributions before we jump into using the models. Dirichlet Process models are a topic of active research, and they require a good understanding of statistics and stochastic processes. A further difficulty is that, as we will see in this article, Dirichlet Processes can be represented/constructed in numerous ways, so several academic papers use completely different notations/conventions and examine the problem from different points of view. In this post I try to explain them as simply as possible and use a consistent notation. Hopefully things will become clearer with the two upcoming articles, which focus on the definition of Dirichlet Process Mixture Models and on how to actually use them to perform cluster analysis.
A Dirichlet process over a space Θ is a stochastic process. It is a probability distribution over "probability distributions over Θ", and a draw from it is a discrete distribution. More formally, a Dirichlet Process is a distribution over probability measures. A probability measure is a function from subsets of the space Θ to [0,1]. G is a DP-distributed random probability measure, denoted G ~ DP(α, G0), if for any finite partition (A1,…,An) of the space Θ we have that (G(A1),…,G(An)) ~ Dirichlet(αG0(A1),…,αG0(An)).
Figure 1: Marginals on finite partitions are Dirichlet distributed.
The DP has two parameters: the first is the base distribution G0, which serves as a mean; the second is the strength parameter α, which is strictly positive and serves as an inverse variance. It determines the extent to which values of the output distribution repeat: the higher the value of α, the less the repetition; the smaller the value, the more the repetition. Finally, the space Θ is the parameter space on which we define the DP. Moreover, Θ is also the support of G0, which is the same as that of G.
A simpler and more intuitive way to explain a Dirichlet Process is the following. Suppose that we have a space Θ that can be partitioned in any finite way (A1,…,An) and a probability distribution G which assigns probabilities to the parts. G is one specific probability distribution over Θ, but there are many others. The Dirichlet Process on Θ models exactly this: it is a distribution over all possible probability distributions on the space Θ. The Dirichlet Process is parameterized by the base distribution G0 and the concentration parameter α. We say that G is distributed according to a DP with parameters α and G0 if the joint distribution of the probabilities that G assigns to any finite partition of Θ follows the Dirichlet Distribution.
Figure 2: Graphical Model of Dirichlet Process
Finally, above we can see the graphical model of a DP. Note that α is a scalar hyperparameter, G0 is the base distribution of the DP, G is a random distribution over the parameter space Θ sampled from the DP which assigns probabilities to the parameters, and θi is a parameter vector drawn from the distribution G, hence an element of the space Θ.
The Posterior Dirichlet Process was discussed by Ferguson. We start by drawing a random probability measure G from a Dirichlet Process, G ~ DP(α, G0). Since G is a probability distribution over Θ, we can also sample from it and draw independent identically distributed samples θ1,…,θn ~ G. Since draws from a Dirichlet Process are discrete distributions, we can represent G = Σk πk δθk(·), where δθk(·) is short notation for a delta function that takes the value 1 at θk and 0 elsewhere. An interesting effect of this is that, since G is defined this way, there is a positive probability of different samples having the same value. As we will see later on, this creates a clustering effect that can be used to perform cluster analysis on datasets.
Using the above definitions and observations, we want to estimate the posterior of the Dirichlet Process given the samples θ. Since we know that (G(A1),…,G(An)) ~ Dirichlet(αG0(A1),…,αG0(An)), by using Bayes' rule and the conjugacy between the Dirichlet and the Multinomial we have that the posterior is again a Dirichlet Process: G | θ1,…,θn ~ DP(α + n, (αG0 + Σi δθi)/(α + n)).
Equation 1: Posterior Dirichlet Process
This property is very important and it is used by the various DP representations.
In the previous segments we defined the Dirichlet Process and presented its theoretical model. One important question that we must answer is how we know that such an object exists, and how we can construct and represent a Dirichlet Process.
The first indication of existence was provided by Ferguson, who used the Kolmogorov Consistency Theorem, gave the definition of a Dirichlet Process, and described the Posterior Dirichlet Process. Continuing this research, Blackwell and MacQueen used de Finetti's theorem to prove the existence of such a random probability measure and introduced the Blackwell-MacQueen urn scheme, which satisfies the properties of the Dirichlet Process. In 1994 Sethuraman provided an additional simple and direct way of constructing a DP by introducing the Stick-breaking construction. Finally, another representation was provided by Aldous, who introduced the Chinese Restaurant Process as an effective way to construct a Dirichlet Process.
The various Representations of the Dirichlet Process are mathematically equivalent but their formulation differs because they examine the problem from different points of view. Below we present the most common representations found in the literature and we focus on the Chinese Restaurant Process which provides a simple and computationally efficient way to construct inference algorithms for Dirichlet Process.
The Blackwell-MacQueen urn scheme can be used to represent a Dirichlet Process and was introduced by Blackwell and MacQueen. It is based on the Pólya urn scheme, which can be seen as the opposite of sampling without replacement. In the Pólya urn scheme we assume that we have an opaque urn containing colored balls and we draw balls at random. When we draw a ball, we observe its color, put it back in the urn, and add an additional ball of the same color. A similar scheme is used by Blackwell and MacQueen to construct a Dirichlet Process.
This scheme produces a sequence θ1, θ2, … with conditional probabilities P(θn+1 ∈ · | θ1,…,θn) = (αG0(·) + Σi δθi(·))/(α + n). In this scheme we assume that G0 is a distribution over colors and each θn represents the color of the ball that is placed in the urn. The algorithm is as follows:
· We start with an empty urn.
· With probability proportional to α we draw a new color θ ~ G0 and we add a ball of this color to the urn.
· With probability proportional to n-1 we draw a random ball from the urn, observe its color, place it back in the urn, and add an additional ball of the same color to the urn.
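The urn algorithm above translates almost line for line into code. In this sketch (names are my own, and `base_draw` stands in for sampling a new color from G0):

```python
import random

def blackwell_macqueen(n, alpha, base_draw, seed=None):
    """n draws theta_1..theta_n from the Blackwell-MacQueen urn.

    theta_{i+1} is a fresh draw from the base distribution G0 with
    probability alpha/(i + alpha), or a uniformly chosen earlier draw
    with probability i/(i + alpha). `base_draw(rng)` samples from G0."""
    rng = random.Random(seed)
    thetas = []
    for i in range(n):
        if rng.random() < alpha / (i + alpha):
            thetas.append(base_draw(rng))      # new "color" from G0
        else:
            thetas.append(rng.choice(thetas))  # reuse an old draw
    return thetas
```

Running it with a continuous G0 (for example a Gaussian) shows the clustering effect: the n draws contain far fewer than n distinct values.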
Previously we started with a Dirichlet Process and derived the Blackwell-MacQueen scheme. Now let's start in reverse, from the Blackwell-MacQueen scheme, and derive the DP. Since the θi were drawn in an iid manner from G, their joint distribution is invariant to any finite permutation, and thus they are exchangeable. Consequently, by de Finetti's theorem, there must exist a distribution over measures that makes them iid, and this distribution is the Dirichlet Process. As a result we have shown that the Blackwell-MacQueen urn scheme is a representation of the DP, and it gives us a concrete way to construct it. As we will see later, this scheme is mathematically equivalent to the Chinese Restaurant Process.
The Stick-breaking construction is an alternative way to represent a Dirichlet Process, introduced by Sethuraman. It is a constructive way of forming the distribution and uses the following analogy: we assume that we have a stick of length 1, we break it at position β1, and we assign to π1 the length of the piece that we broke off. We repeat the same process on the remainder of the stick to obtain π2, π3, and so on; because of the way the scheme is defined, we can continue doing this infinitely many times.
Based on the above, the πk can be modeled as πk = βk · ∏l<k (1 − βl), where βk ~ Beta(1, α), while as in the previous schemes the θk are sampled directly from the base distribution G0. Consequently the distribution G can be written as a sum of delta functions weighted by the πk probabilities: G = Σk πk δθk. Thus the Stick-breaking construction gives us a simple and intuitive way to construct a Dirichlet Process.
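In practice the construction is truncated after finitely many breaks. A small sketch (my own naming; `betavariate` draws the Beta(1, α) break proportions):

```python
import random

def stick_breaking(alpha, n_atoms, base_draw, seed=None):
    """Truncated stick-breaking construction of G ~ DP(alpha, G0).

    beta_k ~ Beta(1, alpha), pi_k = beta_k * prod_{l<k} (1 - beta_l),
    atoms theta_k ~ G0 via base_draw(rng). Returns (weights, atoms)."""
    rng = random.Random(seed)
    weights, atoms = [], []
    remaining = 1.0                 # length of the stick still unbroken
    for _ in range(n_atoms):
        beta = rng.betavariate(1, alpha)
        weights.append(remaining * beta)
        remaining *= 1 - beta
        atoms.append(base_draw(rng))
    return weights, atoms
```

The truncated weights sum to slightly less than 1 (the missing mass is the unbroken remainder, which shrinks geometrically with the number of breaks).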
The Chinese Restaurant Process, which was introduced by Aldous, is another effective way to represent a Dirichlet Process, and it can be directly linked to the Blackwell-MacQueen urn scheme. This scheme uses the following analogy: we assume that there is a Chinese restaurant with infinitely many tables. As the customers enter the restaurant, they either sit at one of the occupied tables at random or choose to sit at the first available empty table.
The CRP defines a distribution on the space of partitions of the positive integers. We start by drawing θ1,…,θn from the Blackwell-MacQueen urn scheme. As we discussed in the previous segments, we expect to see a clustering effect, and thus the total number k of unique θ values will be significantly less than n. This defines a partition of the set {1,2,…,n} into k clusters. Consequently, drawing from the Blackwell-MacQueen urn scheme induces a random partition of the set {1,2,…,n}. The Chinese Restaurant Process is this induced distribution over partitions. The algorithm is as follows:
· We start with an empty restaurant.
· The 1st customer always sits at the 1st table.
· The n+1th customer has 2 options:
o Sit at the first unoccupied table with probability α/(α+n)
o Sit at any of the k occupied tables with probability nk/(α+n), where nk is the number of people sitting at that table
Here α is the dispersion value of the DP and n is the total number of customers in the restaurant at a given time. The latent variable zi stores the table number of the ith customer and takes values from 1 to kn, where kn is the total number of occupied tables when n customers are in the restaurant. Note that kn will always be less than or equal to n, and on average it is about α·log(n). Finally, the probability of a table arrangement is invariant to permutations. Thus the zi are exchangeable, which implies that tables with the same number of customers have the same probability.
The Chinese Restaurant Process is strongly connected to the Pólya urn scheme and the Dirichlet Process. The CRP is a way to specify a distribution over partitions (table assignments) of n points and can be used as a prior on the space of the latent variables zi, which determine the cluster assignments. The CRP is equivalent to Pólya's urn scheme, with the only difference being that it does not assign parameters to each table/cluster. To go from the CRP to Pólya's urn scheme we draw θk ~ G0 for all tables k = 1, 2, … and then for every xi which is grouped to table zi we assign it the parameter θzi of that table. Finally, since we cannot assign a θ to infinitely many tables from the beginning, we can just draw a new θ every time someone sits at a new table. Due to all the above, the CRP can help us build computationally efficient algorithms to perform cluster analysis on datasets.
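The CRP-to-Pólya-urn recipe above, including drawing a new θ lazily whenever a new table opens, can be sketched as follows (names are my own; `base_draw` plays the role of G0):

```python
import random

def crp_with_params(n, alpha, base_draw, seed=None):
    """CRP table assignments z_i plus a parameter theta_k drawn from G0 for
    each table as it is created (the CRP -> Polya-urn correspondence).

    Returns (z, x_params): table index and table parameter for each point."""
    rng = random.Random(seed)
    z, sizes, table_params = [], [], []
    for i in range(n):
        # join table k with prob sizes[k]/(i+alpha); new table with alpha/(i+alpha)
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for k, s in enumerate(sizes):
            acc += s
            if r < acc:
                z.append(k)
                sizes[k] += 1
                break
        else:
            z.append(len(sizes))
            sizes.append(1)
            table_params.append(base_draw(rng))  # lazy theta for the new table
    x_params = [table_params[k] for k in z]      # parameter assigned to each point
    return z, x_params
```

Points seated at the same table share the same θ, which is exactly the clustering structure a DP mixture model exploits.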
In this post we discussed the Dirichlet Process and several ways to construct it. We will use the above ideas in the next article, where we will introduce the Dirichlet Process Mixture Model and use the Chinese Restaurant representation to construct the Dirichlet Process and perform cluster analysis. If you missed a few points, don't worry: things will start becoming clearer with the next two articles.
My name is Vasilis Vryniotis. I'm a Machine Learning Engineer and a Data Scientist.