Lately I have been thinking about Kullback–Leibler (KL) divergence between probability distributions, also called relative entropy. It is often useful to think about this quantity as a distance, even though it doesn't behave much like one: it is not symmetric, it does not satisfy the triangle inequality, and even tiny perturbations in other metrics like total variation and \(L^p\) can make the KL divergence between two distributions infinite or undefined. These might seem like strong reasons to avoid KL divergence, but the unfortunate truth is that it shows up all the time in communications and information theory. To my understanding, information-theoretic results often only hold up to small perturbations in KL divergence rather than in some nicer metric.
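To make this fragility concrete, here is a small self-contained sketch (the particular distributions are made up for illustration): two discrete distributions that are within \(10^{-9}\) of each other in total variation, while the KL divergence between them is infinite.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats for discrete distributions.
    Returns inf when p puts mass where q puts none."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # 0 * log(0/q) = 0 by convention
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def tv(p, q):
    """Total variation distance between discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

eps = 1e-9
p = [0.5, 0.5 - eps, eps]  # a tiny perturbation of q
q = [0.5, 0.5, 0.0]

print(tv(p, q))  # eps: nearly identical in total variation
print(kl(p, q))  # inf: q assigns zero mass to a point where p does not
```

The blow-up is entirely driven by the single atom where \(q\) vanishes, which is exactly the phenomenon discussed below.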

For this reason, you would think there would be a lot of study of what kinds of conditions on distributions imply that the KL divergence between them is small. Unfortunately, I haven't been able to find any simple results of this sort. The state of the art, to my knowledge, appears to be summarized in this document. There is also probably a lot of pertinent work in the field of information geometry, but I haven't built up enough background to investigate it yet.

I have a hunch that while KL divergence is very sensitive to some specific types of perturbation, widely-applicable conditions must exist that imply the divergence between two distributions is small, at least most of the time. To get to such a result, we need to examine what can go wrong and make the KL divergence between two distributions blow up.

In the most general case, the definition of KL divergence is necessarily close to the ground-up definition of the Lebesgue integral, since it needs to handle all sorts of probability measures. However, in the simple case where \(P\) and \(Q\) are absolutely continuous with densities \(p\) and \(q\), it is true that

\[
D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx
\]

when such a quantity is defined. The following things can make this formula large or undefined:

- The support of \(Q\) does not contain the support of \(P\).
- The quotient \(p/q\) is large. That is, what \(Q\) believes to be low probability, \(P\) thinks is relatively high probability.
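As a sanity check, here is a quick numeric sketch (a midpoint-rule approximation I wrote for illustration) comparing \(\int p \log(p/q)\,dx\) against the well-known closed form for the KL divergence between two Gaussians:

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def kl_numeric(mu1, s1, mu2, s2, lo=-20.0, hi=20.0, n=100_000):
    """Midpoint-rule approximation of the integral p * log(p/q) over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        p, q = gauss_pdf(x, mu1, s1), gauss_pdf(x, mu2, s2)
        if p > 0.0 and q > 0.0:  # skip floating-point underflow in the far tails
            total += p * math.log(p / q) * h
    return total

def kl_closed(mu1, s1, mu2, s2):
    """Standard closed form for D(N(mu1, s1^2) || N(mu2, s2^2))."""
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5

print(kl_numeric(0, 1, 1, 2))  # ~0.4431
print(kl_closed(0, 1, 1, 2))   # log 2 + 2/8 - 1/2 ~ 0.4431
```

Widening the variance of \(Q\) keeps the quotient \(p/q\) under control everywhere; shrinking it instead makes the tail contributions dominate, which is the second failure mode above.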

Arguably, the first point is a technicality and can be avoided, so I will mostly focus on the second. To begin thinking about it, we need to formalize what a 'low-probability region' of a distribution is. The following is my attempt to do so, along with a preliminary result about such regions.

For an absolutely continuous (w.r.t. Lebesgue) distribution \(P\) on the reals with CDF \(F\), a point \(x \in \overline{\mathsf{supp}(P)}\) is an *l-tail* (*r-tail*) if the left (right) derivative of \(F\) at \(x\) is 0.
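To illustrate the definition, here is a small numeric sketch (my own toy example): an equal mixture of uniforms on \([0,1]\) and \([2,3]\). The point \(x = 1\) lies in \(\overline{\mathsf{supp}(P)}\) and the right derivative of \(F\) vanishes there, so it is an r-tail; symmetrically, \(x = 2\) is an l-tail.

```python
def cdf(x):
    """CDF of the mixture (1/2) Uniform[0,1] + (1/2) Uniform[2,3]."""
    def unif_cdf(a, b, t):
        return min(max((t - a) / (b - a), 0.0), 1.0)
    return 0.5 * unif_cdf(0, 1, x) + 0.5 * unif_cdf(2, 3, x)

h = 1e-6  # step for one-sided finite differences
left_at_1 = (cdf(1) - cdf(1 - h)) / h    # ~0.5: the density just left of 1
right_at_1 = (cdf(1 + h) - cdf(1)) / h   # 0: x = 1 is an r-tail
left_at_2 = (cdf(2) - cdf(2 - h)) / h    # 0: x = 2 is an l-tail
print(left_at_1, right_at_1, left_at_2)
```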

One natural question is how many tails a distribution can have. Unfortunately, the set of tails can be uncountable. Take the countable collection of disjoint open intervals whose union forms the complement of the Cantor set in \([0,1]\) (the complement is open, so such a collection is guaranteed to exist). Consider a probability distribution whose PDF is 0 except on these intervals, where it has a smooth bump, with bump heights shrinking fast enough toward the Cantor set. It is easy to check that every point of the Cantor set is then a tail, and that set is uncountable and nowhere dense. Even worse, you can make Cantor-like sets with positive measure. However, at least we can say the following:

*Remark:* The collection \(T\) of l-tails (r-tails) of a distribution \(P\) with \(S = \overline{\mathsf{supp}(P)}\) is nowhere dense in \(S\).

*Proof:* \(T\) cannot be dense in any open sub-interval \(I \subseteq S\): if it were, then either \(P(I) > 0\), meaning \(P\) is not absolutely continuous, or \(P(I) = 0\), meaning \(I\) is not part of \(S\). Either case violates our assumptions. \(\square\)
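Returning to the Cantor construction above, here is a rough numerical sketch of it. The bump weights are my own choice for illustration: each removed interval carries mass proportional to the *square* of its length, so bump heights are proportional to the length and shrink toward the Cantor set. Evaluating the (un-normalized) density at midpoints of the removed intervals just left of the Cantor point \(x = 1/3\) shows it tending to 0, which is what makes \(1/3\) an l-tail.

```python
import math

def bump(x, a, b):
    """Smooth bump on (a, b), vanishing to all orders at the endpoints."""
    t = 2.0 * (x - a) / (b - a) - 1.0  # map (a, b) onto (-1, 1)
    return math.exp(-1.0 / (1.0 - t * t)) if abs(t) < 1.0 else 0.0

def removed_intervals(depth):
    """Open middle-third intervals removed from [0, 1] in the first `depth` steps."""
    removed, kept = [], [(0.0, 1.0)]
    for _ in range(depth):
        new_kept = []
        for a, b in kept:
            third = (b - a) / 3.0
            removed.append((a + third, b - third))
            new_kept += [(a, a + third), (b - third, b)]
        kept = new_kept
    return removed

def pdf(x, depth=10):
    """Un-normalized density: a bump on each removed interval with mass
    proportional to length**2, hence height proportional to length."""
    return sum((b - a) * bump(x, a, b)
               for a, b in removed_intervals(depth) if a < x < b)

# Midpoints of the removed intervals (1/3 - 2*3**-n, 1/3 - 3**-n) approach 1/3:
for n in [2, 4, 6, 8]:
    x = 1 / 3 - 1.5 * 3.0 ** -n
    print(n, pdf(x))  # heights shrink like 3**-n as x -> 1/3
```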

There is some idea that rather than looking at the places where distributions go to \(0\), we may want to look at the places where they drop below some \(\varepsilon\). I am hoping that, under mild conditions, the event of two distributions staying away from their sets of \(\varepsilon\)-tails has probability close to \(1\). The ratio between the two densities on that event is then bounded above, so the KL divergence conditioned on that event is bounded. Under some weak type of convergence (maybe not weak convergence itself), we could then expect the conditional KL divergence to go to 0, at least when conditioning on a sequence of empirically-high-probability sets.
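Here is a rough sketch of that conditioning idea (my own toy construction, restricting to the event \(A_\varepsilon = \{x : \min(p(x), q(x)) \ge \varepsilon\}\) rather than a formal notion of \(\varepsilon\)-tails): on \(A_\varepsilon\) the density ratio is bounded, the conditional KL divergence is finite, and as \(\varepsilon \to 0\) it approaches the full divergence.

```python
import math

def kl_conditional(p, q, eps, lo=-30.0, hi=30.0, n=120_000):
    """KL divergence between P and Q, each conditioned on the event
    A = {x : min(p(x), q(x)) >= eps}, where the density ratio is bounded."""
    h = (hi - lo) / n
    pa = qa = acc = 0.0  # P(A), Q(A), and the integral of p*log(p/q) over A
    for i in range(n):
        x = lo + (i + 0.5) * h
        px, qx = p(x), q(x)
        if min(px, qx) >= eps:
            pa += px * h
            qa += qx * h
            acc += px * math.log(px / qx) * h
    # D(P|A || Q|A) after renormalizing both densities on A
    return acc / pa + math.log(qa / pa)

p = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)           # N(0, 1)
q = lambda x: math.exp(-0.5 * (x - 0.5) ** 2) / math.sqrt(2 * math.pi)  # N(0.5, 1)

for eps in [1e-2, 1e-4, 1e-8]:
    print(eps, kl_conditional(p, q, eps))
# As eps -> 0, the conditional divergence approaches the full KL, here 0.125
```

Here both Gaussians have light tails, so nothing blows up; the interesting question is what happens for pairs of distributions whose \(\varepsilon\)-tail regions carry non-negligible mass.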

Hopefully these really preliminary ideas will lead to isolating the low-probability regions of absolutely continuous distributions, which are one of the big problem areas for KL divergence.