Thoughts about Extracting Information from Sequences of Noisy Observations

Often we have some quantity of interest that we can only observe indirectly. In these situations, we would like to know how much relevant information we can extract from our noisy, indirect observations, and how to perform the extraction. Figuring out how to do this would in theory let us extract any feature of interest from any sort of observation. The question has been studied before as the 'information bottleneck problem.' In a perfect world, one might expect that the amount of information we can extract is just the mutual information between the variable of interest and the observation. Unfortunately, the answer turns out to be not so simple, and the general solution for single observations has only been reduced to an implicit equation involving the source/observation joint distribution (cf. these people's work). However, to my knowledge, the question has not been studied for sequences of observations. I strongly suspect that, in the limit of long sequences, the answer is on average the mutual information. This post contains my thoughts on a proof strategy.

1: Preliminary Setup

Say we are interested in a sequence of variables distributed like an RV \(X\) in a finite alphabet \(\mathcal{X}\), but we can only observe a sequence distributed like an RV \(Y\) in a finite alphabet \(\mathcal{Y}\) and dependent on \(X\). Denote the vector of the first \(n\) iid draws of \(X\) as \(X^n,\) and similarly \(Y^n\) for \(Y\). From our observation \(Y^n\), we want to create the strongest possible characterization \(U^n\) of the variable of interest, \(X^n\).
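As a concrete way to play with this setup, here is a minimal Python sketch; the joint pmf `P` is an arbitrary toy example and every name in it is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint pmf P(x, y) over finite alphabets; rows index X, columns index Y.
P = np.array([[0.125, 0.125, 0.0,   0.0  ],
              [0.125, 0.125, 0.0,   0.0  ],
              [0.0,   0.0,   0.125, 0.125],
              [0.0,   0.0,   0.125, 0.125]])

def sample_iid_pairs(P, n):
    """Draw n iid pairs (X_i, Y_i) from the joint pmf P, returning (X^n, Y^n)."""
    flat = rng.choice(P.size, size=n, p=P.ravel())
    return np.unravel_index(flat, P.shape)   # (x_seq, y_seq)

def mutual_information(P):
    """I(X;Y) in bits for a joint pmf P."""
    px = P.sum(axis=1, keepdims=True)
    py = P.sum(axis=0, keepdims=True)
    mask = P > 0
    return float((P[mask] * np.log2(P[mask] / (px @ py)[mask])).sum())

x_seq, y_seq = sample_iid_pairs(P, n=1000)
print(mutual_information(P))   # 1.0 bit for this block-structured toy example
```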

I am trying to prove the following hypothesis:

Given some arbitrarily small \(\varepsilon>0\), there exists some \(N_\varepsilon\) so that if \(n>N_\varepsilon\), then we can make a random variable \(U^n\) that satisfies the following three properties:

  1. \(X^n \rightarrow Y^n \rightarrow U^n.\) That is, \(U^n\) is built only from what we can observe, and doesn't use the source sequence \(X^n\) as a hidden input.
  2. \(\frac{I(X^n; U^n)}{n} \geq I(X;Y)-\varepsilon.\) That is, the sequence \(U^n\) preserves the information \(Y^n\) contains about \(X^n\) with negligible loss.
  3. \(\frac{H(U^n)}{n} \leq I(X; Y) + \varepsilon.\) That is, the sequence \(U^n\) contains only a negligible amount of information extraneous to \(X\).

We know such \(U^n\) exist in specific cases, one of which will be shown later on (Figure 3).

Notice that \(\frac{1}{n}\cdot I(U^n;X^n)=\frac{1}{n}\cdot (H(U^n)-H(U^n|X^n)).\) Rearranging properties (2) and (3) with this identity, we can replace them with the following equivalent properties:

(2') \(\frac{1}{n}H(U^n|X^n) \leq 2\varepsilon.\) That is, \(U^n\) is almost completely determined by \(X^n\).

(3') \(\frac{1}{n} H(U^n) \in \left(I(X;Y)-\varepsilon, \ I(X;Y)+\varepsilon\right).\)
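Spelling out the rearrangement (using the identity above together with \(\frac{1}{n}I(X^n;U^n)\leq \frac{1}{n}H(U^n)\)):

\[ \frac{1}{n}H(U^n|X^n) = \frac{1}{n}H(U^n)-\frac{1}{n}I(X^n;U^n) \leq \big(I(X;Y)+\varepsilon\big)-\big(I(X;Y)-\varepsilon\big) = 2\varepsilon, \]

\[ I(X;Y)-\varepsilon \leq \frac{1}{n}I(X^n;U^n) \leq \frac{1}{n}H(U^n) \leq I(X;Y)+\varepsilon, \]

which gives \(\mathbf{(2')}\) and \(\mathbf{(3')}\).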

This will prove to be useful later on.

2: Bipartite Graphs and Typical Sets

We can form a graph which represents the joint probability distribution of \((X^n,Y^n)\) by putting an edge between every member \(x^n\in \mathcal{X}^n\) and every member \(y^n\in \mathcal{Y}^n\), and weighting each edge with \(P(x^n,y^n).\) See Figure 1 for an example. The probability that \(x^n\) and \(y^n\) occur together is the weight of the edge connecting them.

\(\textbf{Figure 1}\): A case with \(\left|\mathcal{X}\right|=\left|\mathcal{Y}\right|=4\) and \(n=1.\) In graph theory terms this is an undirected, weighted bipartite graph \(G=(\mathcal{X}^n,\mathcal{Y}^n,E).\) The adjacency matrix of this graph is just the probability matrix whose elements are specified by \(P(x^n,y^n).\)
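If the single-letter joint pmf is stored as a matrix, the probability matrix for \((X^n,Y^n)\) is just its \(n\)-fold Kronecker power, since the draws are iid; that matrix is exactly the weighted biadjacency matrix of the graph above. A minimal sketch, with an arbitrary toy pmf:

```python
from functools import reduce
import numpy as np

P = np.array([[0.4, 0.1],
              [0.1, 0.4]])   # arbitrary single-letter joint pmf

def joint_pmf_n(P, n):
    """Probability/biadjacency matrix for (X^n, Y^n): rows are x^n tuples and
    columns are y^n tuples in lexicographic order. Because the draws are iid,
    P(x^n, y^n) = prod_i P(x_i, y_i), i.e. the n-fold Kronecker power of P."""
    return reduce(np.kron, [P] * n)

P2 = joint_pmf_n(P, 2)       # a |X|^2 x |Y|^2 matrix
assert np.isclose(P2.sum(), 1.0)
```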

Having access to a sequence of observations helps us because as our sequence length \(n\) grows, the outcomes most likely to occur all fall within structured, high-probability 'typical sets' \(A^n_\varepsilon(X),\ A^n_\varepsilon(Y), \ A^n_\varepsilon(X,Y).\) Typical sets, by their definition as presented in Cover's information theory text, have a number of properties we care about:

  1. For \(n\) large enough, the jointly typical set captures nearly all of the probability: \(P\left((X^n,Y^n)\in A^n_\varepsilon(X,Y)\right) > 1-\varepsilon.\)
  2. Every typical \(x^n\) has probability close to \(2^{-nH(X)}\) (within a factor of \(2^{\pm n\varepsilon}\)), and likewise typical \(y^n\) and typical pairs \((x^n,y^n)\) have probabilities close to \(2^{-nH(Y)}\) and \(2^{-nH(X,Y)}\), respectively.
  3. Consequently, the typical sets have sizes \(\left|A^n_\varepsilon(X)\right|\approx 2^{nH(X)},\ \left|A^n_\varepsilon(Y)\right|\approx 2^{nH(Y)},\) and \(\left|A^n_\varepsilon(X,Y)\right|\approx 2^{nH(X,Y)}.\)

A small aside: Typical sets get large very quickly as \(n\) grows, and we need \(n\) to be fairly large for the approximate uniformity to kick in, so realistic typical sets are tough to visualize. However, if \(X\), \(Y\), and \((X,Y)\) are all uniformly distributed to begin with, then the typical sets \(A^n_\varepsilon(\cdot)\) are the entire alphabets for any choice of \(n.\) So for all our diagrams we use uniformly distributed \(X,Y,(X,Y)\) with small alphabet sizes \(\left|\mathcal{X}\right|,\left|\mathcal{Y}\right|\) and let \(n=1.\) In real life the typical sets will be similar in structure to the diagrams seen here, but with much larger dimensions.
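For concreteness, here is one way the joint-typicality test could be written down; the function name is made up for illustration, and the caller supplies the sequences, the pmf matrix, and \(\varepsilon\):

```python
import numpy as np

def is_jointly_typical(x_seq, y_seq, P, eps):
    """Check whether (x^n, y^n) lies in A^n_eps(X, Y) for the joint pmf P:
    the empirical rates -log2 p(.)/n of x^n, of y^n, and of the pair must all
    be within eps of H(X), H(Y), and H(X,Y) respectively."""
    n = len(x_seq)
    px, py = P.sum(axis=1), P.sum(axis=0)
    hx = -(px[px > 0] * np.log2(px[px > 0])).sum()
    hy = -(py[py > 0] * np.log2(py[py > 0])).sum()
    hxy = -(P[P > 0] * np.log2(P[P > 0])).sum()
    # Empirical rates; a zero-probability pair gives infinity and is (correctly) rejected.
    rx = -np.log2(px[list(x_seq)]).sum() / n
    ry = -np.log2(py[list(y_seq)]).sum() / n
    rxy = -np.log2(P[list(x_seq), list(y_seq)]).sum() / n
    return abs(rx - hx) < eps and abs(ry - hy) < eps and abs(rxy - hxy) < eps
```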

This does something interesting to our bipartite graph. To simplify exposition, for the moment we ignore the negligible \(k\varepsilon\) term. We can safely omit any edges between \((X^n,Y^n)\) pairs that aren't jointly typical, since those pairs occur only with very low probability. The remaining edge weights are close to \(2^{-nH(X,Y)}.\) Now we're left with a graph that has \(2^{nH(X)}\) nodes on one side, \(2^{nH(Y)}\) nodes on the other, and \(2^{nH(X,Y)}\) edges, all with roughly equal weight. See Figure 2 for a small example.

Each typical \(x^n\) occurs with probability about \(2^{-nH(X)}\), so the sum of all the edge weights connected to it must be about \(2^{-nH(X)}.\) But all the edges have roughly equal weight \(2^{-nH(X,Y)},\) so if we let \(C^n\) be the degree of a typical \(x^n\), then \(C^n\cdot 2^{-nH(X,Y)} \approx 2^{-nH(X)}.\) That is, each typical \(x^n\) is connected to roughly \(C^n=2^{nH(Y|X)}\) different \(y^n.\) Similarly, each typical \(y^n\) is connected to roughly \(R^n=2^{nH(X|Y)}\) typical \(x^n.\) If we let \(K^n=2^{nI(X;Y)}\), then in any scenario we have a sequence of bipartite graphs \(G_n=(U_n,V_n,E_n)\) with the following structure:

  1. \(\left|U_n\right| = 2^{nH(X)} = (R\cdot K)^n\) vertices on one side (the typical \(x^n\)) and \(\left|V_n\right| = 2^{nH(Y)} = (C\cdot K)^n\) vertices on the other (the typical \(y^n\)).
  2. Every vertex in \(U_n\) has degree about \(C^n\), every vertex in \(V_n\) has degree about \(R^n\), and so \(\left|E_n\right| \approx 2^{nH(X,Y)} = R^nC^nK^n.\)
  3. All edge weights are roughly equal, namely about \(2^{-nH(X,Y)}.\)
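A quick numeric sanity check of this bookkeeping; the joint pmf is again an arbitrary toy example:

```python
import numpy as np

P = np.array([[0.3, 0.1],
              [0.1, 0.5]])             # arbitrary toy joint pmf, rows = x, cols = y

def H(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

hx, hy, hxy = H(P.sum(axis=1)), H(P.sum(axis=0)), H(P.ravel())
i_xy = hx + hy - hxy

n = 20
left_nodes  = 2 ** (n * hx)            # ~ |U_n| = (R*K)^n typical x^n
right_nodes = 2 ** (n * hy)            # ~ |V_n| = (C*K)^n typical y^n
deg_left    = 2 ** (n * (hxy - hx))    # ~ C^n = 2^{n H(Y|X)}
deg_right   = 2 ** (n * (hxy - hy))    # ~ R^n = 2^{n H(X|Y)}
edges       = 2 ** (n * hxy)           # ~ |E_n| = R^n C^n K^n

# Counting edges from either side gives the same total, and the exponents add up:
assert np.isclose(left_nodes * deg_left, edges)
assert np.isclose(right_nodes * deg_right, edges)
assert np.isclose((hxy - hx) + (hxy - hy) + i_xy, hxy)   # H(Y|X)+H(X|Y)+I = H(X,Y)
```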

Up to scale and a negligible factor \(k\varepsilon\), the adjacency matrix and the probability matrix are the same. So rather than thinking about probability distributions, we can instead consider general bipartite graphs with the properties listed above.

\(\textbf{Figure 2}\): An example with a probability matrix shown on the left.   \(n=1,\ H(X)=H(Y)=2,\ H(X,Y)=3\)

3: Connection to Graph Partitioning

We observe some \(Y^n\) and would like to use it to determine what \(X^n\) happened to as fine a degree as possible, while still discarding the features of \(Y^n\) that do not help describe \(X^n\). One way to do this is to quantize \(Y^n\) into decision regions by partitioning \(\mathcal{Y}^n\) into some \(N\) sets \(\mathcal{A}=\lbrace A_1,\dots, A_N \rbrace.\) (All the \(A_i\) are nonempty subsets of \(\mathcal{Y}^n\), \(A_i\cap A_j = \varnothing\) for \(i\neq j\), and \(\cup_i A_i = \mathcal{Y}^n\).) Then if the observation happens to fall in \(A_i\), we decide that the \(X^n\) that generated our observation is one of the vertices neighboring a vertex in \(A_i\). If our partition is chosen well enough, it suffices to describe the index of the bin our observation fell into, rather than the entire observation, to convey what we know about \(X^n\). If \(N=\left|\mathcal{Y}^n\right|\), then the information is perfectly preserved, but it takes as many bits to describe the quantization as it does the original observation (the bins all have size 1). On the other hand, if \(N\) is too small, then information about \(X^n\) is lost in quantization. A problem with the underlying distribution can also arise, as we will see in the next paragraph.
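Here is a sketch of how a partition of \(\mathcal{Y}^n\) turns into the random variable \(U^n\) and how the quantities from Section 1 could then be computed; the partition is passed in as an array mapping each \(y^n\) index to its bin index, and all names are made up for illustration:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def quantization_report(P_n, bin_of_y):
    """Given the joint pmf matrix P_n over (X^n, Y^n) and a partition of the
    y-alphabet (bin_of_y[j] = bin index of column j), define U = bin(Y) and
    return H(U), H(U|X), and I(X;U)."""
    n_bins = int(bin_of_y.max()) + 1
    # Joint pmf of (X, U): merge the columns of P_n that land in the same bin.
    P_xu = np.zeros((P_n.shape[0], n_bins))
    for j, b in enumerate(bin_of_y):
        P_xu[:, b] += P_n[:, j]
    pu, px = P_xu.sum(axis=0), P_xu.sum(axis=1)
    h_u = entropy(pu)
    h_u_given_x = sum(px[i] * entropy(P_xu[i] / px[i]) for i in range(len(px)) if px[i] > 0)
    return h_u, h_u_given_x, h_u - h_u_given_x   # H(U), H(U|X), I(X;U)
```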

\(\textbf{Figure 3}\): A situation where decision regions precisely describe the information about \(X\) contained in \(Y\).

An example of a case where decision regions work well is shown in Figure 3, where we partition \(\mathcal{Y}^n\) into an orange set and a pink set. Instead of describing \(Y^n\) in its entirety, we can simply describe the partition index and still preserve all the information \(Y^n\) contains about \(X^n\). Indeed, it is easy to check that the random variable defined by the bin index (color) our observation falls into satisfies all three properties of \(U^n\) described in Section 1. However, our ability to make a good partition hinges on the fact that the graph has two distinct connected components. An example of a problematic (and probably more common) case is shown in Figure 4. In this case, the graph is connected, and the decision regions' neighborhoods have nonempty intersection.

\(\textbf{Figure 4}\): A situation where a pathology in \(P(X,Y)\) prevents decision regions from precisely describing the information \(Y\) contains about \(X\). \(x_2\) and \(x_3\) are associated with more than one bin: if one of these occurs, we may observe something pink or orange. This is a problem because if we let \(U^n=(\text{bin color})\), then \(H(U^n|X^n)\gg 0\), violating property \(\mathbf{(2')}\) described in Section 1. Any other partition we form for this distribution has the same problem.
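In code, the partition that works in the Figure 3 case is simply 'one bin per connected component of the support graph'; in the Figure 4 case the same construction collapses to a single bin, which is exactly the failure described above. A sketch, assuming the support graph is read off from the nonzero (or non-negligible) entries of the probability matrix; the function names are mine:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def bins_from_components(P_n, tol=0.0):
    """Partition the y-alphabet by connected components of the bipartite support
    graph of P_n: two y^n land in the same bin iff they are linked through shared
    (non-negligible) x^n neighbors. Returns bin_of_y as used in quantization_report."""
    B = P_n > tol                                   # 0/1 biadjacency of the support graph
    m, k = B.shape
    # Symmetric adjacency of the full bipartite graph: x-nodes first, then y-nodes.
    A = np.block([[np.zeros((m, m), dtype=bool), B],
                  [B.T, np.zeros((k, k), dtype=bool)]])
    _, labels = connected_components(csr_matrix(A), directed=False)
    # Relabel the components of the y-nodes to consecutive bin indices 0..N-1.
    _, bin_of_y = np.unique(labels[m:], return_inverse=True)
    return bin_of_y
```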

I suspect that as the sequence length \(n\) grows, the connectedness across clusters decreases. That is, if we choose our bins right, then given which \(X^n\) occurred, the bin our observation \(Y^n\) falls into is determined up to a negligible probability of error. Here is why I think this: as \(n\) increases, \(K^n\) grows large. The graph's adjacency matrix has dimensions \(R^nK^n \times C^nK^n\), so it has \(R^nC^nK^{2n}\) elements, but only \(\left|E_n\right|\) of those are nonzero (nonnegligible, really). From the properties at the end of Section 2, the total number of edges in our graph is \(\left|E_n\right|=R^nC^nK^n \ll R^nC^nK^{2n}.\) So as \(n\) increases, the adjacency matrix becomes arbitrarily sparse. I suspect that this sparsity forces the coupling between clusters to be carried by a small fraction of the edges (which this work seems to support experimentally).
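Concretely, the fraction of entries of the adjacency matrix that are nonnegligible is

\[ \frac{\left|E_n\right|}{R^nC^nK^{2n}} \;=\; \frac{R^nC^nK^n}{R^nC^nK^{2n}} \;=\; K^{-n} \;=\; 2^{-nI(X;Y)} \;\longrightarrow\; 0 \]

as \(n\rightarrow\infty\), whenever \(I(X;Y)>0.\)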

If this is true, we can just pretend that the cluster-to-cluster edges don't exist and form a partition of \(\mathcal{Y}^n\) accordingly. If there are few enough cluster-to-cluster edges, then the mutual information between the observation's bin index and the source should be close to \(I(X^n;Y^n)\), since the bins nearly characterize the relationship between \(X^n\) and \(Y^n\), as they did in Figure 3. In summary, the information bottleneck problem for sequences is solved if we can prove the following:

Given a sequence of bipartite graphs \(G_n=(U_n,V_n, E_n)\) where, for some \(R,C,K>0\):

  1. \(\left|U_n\right|=(R\cdot K)^n\)
  2. \(\left|V_n\right|=(C\cdot K)^n\)
  3. If \(u_n \in U_n,\) then \(\left|N(u_n)\right|=C^n\)
  4. If \(v_n \in V_n,\) then \(\left|N(v_n)\right|=R^n\)

...then does the size of the minimal edge separator become small relative to the total number of edges as \(n\rightarrow \infty\)?
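This is now a pure graph question, so it can at least be poked at numerically before attempting a proof. Below is a rough experiment sketch (my own, not part of any argument above): spectral bisection gives an upper bound on the smallest balanced edge cut, so watching the cut fraction as the graph grows is a cheap plausibility check. The random graph used as input is only a stand-in for \(G_n\), not a graph actually derived from a joint distribution:

```python
import numpy as np

def relative_cut_fraction(B):
    """Upper-bound the relative size of a balanced edge separator of the bipartite
    graph with 0/1 biadjacency matrix B, via spectral bisection (Fiedler vector)."""
    m, k = B.shape
    A = np.block([[np.zeros((m, m)), B],
                  [B.T, np.zeros((k, k))]])     # full symmetric adjacency matrix
    L = np.diag(A.sum(axis=1)) - A              # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)                 # eigenvalues in ascending order
    fiedler = vecs[:, 1]                        # eigenvector of the 2nd-smallest eigenvalue
    side = fiedler >= np.median(fiedler)        # split the nodes into two halves
    cut = A[np.ix_(side, ~side)].sum()          # edges crossing the split (counted once)
    return cut / (A.sum() / 2)                  # fraction of all edges that get cut

# Purely illustrative input: a sparse random bipartite graph with average degree ~8.
rng = np.random.default_rng(0)
B = (rng.random((64, 64)) < 8 / 64).astype(float)
print(relative_cut_fraction(B))
```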

More coffee is required...
