Taming Some Inequalities
As used to solve a classic problem about distinguishing distributions
Composite of src1, src2
Gregory Valiant and Paul Valiant are top researchers who are not unrelated to each other. Families like the Valiants and the Blums could be a subject for another post, or how to distinguish them from those who are unrelated.
Today Ken and I wish to talk about a wonderful paper of theirs, “An Automatic Inequality Prover and Instance Optimal Identity Testing.”
I really like this paper, which I saw at FOCS 2014. I am less keen on the title—it suggests to me that they are working on some type of AI area. The phrase “Automatic Inequality Prover” sounds like an algorithm that takes in an inequality and outputs a proof that it is correct. Of course it does this only when the inequality is true. This is not what they do.
Ken chimes in: The paper has a neat mix of other ideas. There are natural cases of extremely succinct descriptions of strings that would take double exponential time to write out. It is notable that the formal system underlying their derivation of inequalities does not yield exponential complexity (let alone undecidability) but stays within linear programming—do P-completeness results lurk? The instance-optimality notion is a hybrid of uniform and nonuniform lower bounds in a hybrid white-box/black-box setting.
The Result on Distributions
They study the problem of telling one distribution from another. Suppose that $p$ and $q$ are two discrete distributions:

$$p = (p_1, p_2, \dots, p_n) \qquad\text{and}\qquad q = (q_1, q_2, \dots, q_n).$$

The question is how many independent samples are needed to distinguish the two cases: (i) $q = p$ from (ii) $q \neq p$.

The answer is too many, since $p$ and $q$ could be very close. So they replace (ii) by the weaker condition that $p$ and $q$ are far apart in the $\ell_1$ metric,

$$\|p - q\|_1 = \sum_{i=1}^{n} |p_i - q_i| \;\geq\; \epsilon.$$
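For concreteness, here is a minimal Python sketch of this distance on two made-up distributions (the numbers are ours, purely for illustration):

```python
import numpy as np

# Two made-up discrete distributions on {1,...,4}.
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])

# The l1 distance ||p - q||_1 = sum_i |p_i - q_i|.
print(np.abs(p - q).sum())  # 0.4
```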
Now they are able to give essentially optimal answers to the question. There is a twist on the notion of optimal. Given the distribution $p$ explicitly as an $n$-tuple, their algorithm is allowed to run within factors that depend on $p$. Their optimal bounds $f(p,\epsilon)$ do not even separate out as a product $g(p)\,h(\epsilon)$, though they give close upper bounds that do. In fact, they technically have the form

$$f(p,\epsilon) = \max\left(\frac{1}{\epsilon},\; \frac{\|p^{-\epsilon}_{-\max}\|_{2/3}}{\epsilon^2}\right),$$

where the "$-\epsilon$" part governs an adjustment to the norm of $p$ for the upper bound. The lower bound is expressed by the statement:
Theorem 1 There is a global constant $c$, $c > 0$, such that for all distributions $p$, all sufficiently small $\epsilon > 0$ (possibly depending on $p$), and every tester $T$ given white-box knowledge of $p$ but limited to samples $S$ of size $|S| \leq c \cdot f(p,\epsilon)$ from the unknown distribution $q$, there exists a distribution $q$ that is either equal to $p$ or has $\|q - p\|_1 \geq \epsilon$, such that the probability over $S$ of $T$ giving the wrong answer is at least $1/3$.
We can “almost” jump ahead of the choice of $q$ in the quantifier order. Trivially we can’t: if $T$ is the “tester” that always says “distinct” then we need to choose $q = p$ to break it. But if $T$ avoids false positives when $q = p$, then what they build is a distribution $\mathcal{Q}$ over distributions $q$ with $\|q - p\|_1 \geq \epsilon$ such that a random member of $\mathcal{Q}$ impersonates $p$ well enough on small samples to make $T$ give a false negative. What’s neat is that no uniformity or complexity bound is needed on $T$—it’s a combinatorial argument directly on $p$ and the small sample size. The white-box nature of $T$ separates their bounds from the higher number of samples needed by a tester to learn $p$, if $p$ too were unknown.
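To see the flavor of such an ensemble, here is a toy Python experiment. It uses a Paninski-style paired perturbation of the uniform distribution, which is our stand-in rather than the construction in the paper; at small sample sizes a standard collision statistic barely separates $p$ from a random far-away $q$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, m = 1000, 0.5, 50    # domain size, l1 distance, sample size

# The known distribution p is uniform on n points.
p = np.full(n, 1.0 / n)

# A random q with ||q - p||_1 = eps: perturb paired points by +-eps/n.
signs = rng.choice([-1.0, 1.0], size=n // 2)
q = p.copy()
q[0::2] += signs * eps / n
q[1::2] -= signs * eps / n

def collisions(sample):
    """Count colliding pairs in the sample, a standard test statistic."""
    _, counts = np.unique(sample, return_counts=True)
    return int((counts * (counts - 1) // 2).sum())

# The two distributions of the statistic overlap heavily at this sample size.
for dist, name in [(p, "p"), (q, "q")]:
    stats = [collisions(rng.choice(n, size=m, p=dist)) for _ in range(300)]
    print(name, np.mean(stats), np.std(stats))
```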
See their paper for the details on $f(p,\epsilon)$, discussion of instance-optimality, and delicate issues such as not being able to define-away the “$-\epsilon$” adjustment by tuning the constants. (In the paper there are constants $c_1$ and $c_2$ rather than a single $c$.) Their relaxed upper bound is indeed simple:

$$O\!\left(\frac{\|p\|_{2/3}}{\epsilon^2}\right).$$

This improved a previous upper bound that had $\epsilon^4$ in the denominator and polylog($n$) factors in the numerator. That is quite a jump, and it comes from a beautifully tight piece of analysis.
Hairy and Basic Inequalities
What strikes us even more is the “Automatic Inequality Prover” part. In order to analyze their algorithm for deciding the above question on distributions, they are naturally led to consider some “hairy” inequalities—their term. I suspect that Greg and Paul are quite facile at handling almost any inequalities, so for them to say the inequalities were hairy means they were indeed complicated inequalities.
This leads to the neatest part of their paper, in our opinion. They give a complete characterization of a general class of inequalities that extend Cauchy-Schwarz, Hölder’s inequality, and the monotonicity of $\ell_p$ norms. Here $p$—the traditional letter, not the same as the distribution $p$ above—means that the norm takes the sum $\sum_i |x_i|^p$ of absolute $p$-th powers and outputs the positive branch of its $p$-th root. As they implicitly do, let’s switch to $a$ and $b$ in the norm context.
For example, their expression uses the $\ell_{2/3}$ norm of the distribution $p$. The choice $2/3$ may seem strange, but for the uniform distribution $U_n$,

$$\|U_n\|_{2/3} = \left(n \cdot (1/n)^{2/3}\right)^{3/2} = \sqrt{n},$$

so it relates to long-known bounds involving $\sqrt{n}$ samples for the uniform distribution. The inequalities in their analysis, however, involve other values of the norm power $a$. For $0 < a < b$, the following inequality for non-negative numbers $x_1,\dots,x_n$ is equivalent to the $\ell_a$ norm being monotone decreasing in $a$:

$$\left(\sum_i x_i^a\right)^{1/a} \;\geq\; \left(\sum_i x_i^b\right)^{1/b}. \tag{1}$$
If we raise both sides to the power $b$ we see what happens for the ratio $t = a/b$:

$$\left(\sum_i x_i^a\right)^{b/a} \;\geq\; \sum_i x_i^b,$$

so with $z_i = x_i^b$, that is $x_i^a = z_i^t$, we get the equivalent form

$$\left(\sum_i z_i^t\right)^{1/t} \;\geq\; \sum_i z_i. \tag{2}$$

Thus the original form (1) for $0 < a < b$ captures all the needed information, while the form (2) for a single power $0 < t < 1$ is a kind of dual. This also shows the idea of substituting powers $z_i^t$ for $x_i$.
When we have a second non-negative sequence $y_1,\dots,y_n$ we can give Otto Hölder’s generalization for $0 < \lambda < 1$ of Cauchy-Schwarz, which is the case $\lambda = 1/2$:

$$\sum_i x_i y_i \;\leq\; \left(\sum_i x_i^{1/\lambda}\right)^{\lambda}\left(\sum_i y_i^{1/(1-\lambda)}\right)^{1-\lambda}.$$

We can obtain dual forms here too, and it helps to consider substitutions of the form $x_i = u_i^{\lambda}$ and similarly $y_i = v_i^{1-\lambda}$. Doing so, and dividing out the smaller sides, we obtain the basic forms for monotonicity with $0 < t < 1$ and Hölder with $0 < \lambda < 1$ as ratios that are at least $1$:

$$\frac{\left(\sum_i z_i^t\right)^{1/t}}{\sum_i z_i} \;\geq\; 1, \qquad \frac{\left(\sum_i u_i\right)^{\lambda}\left(\sum_i v_i\right)^{1-\lambda}}{\sum_i u_i^{\lambda} v_i^{1-\lambda}} \;\geq\; 1.$$
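These basic forms are easy to spot-check numerically; in this quick Python sketch the exponents and vectors are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.random(8), rng.random(8)     # non-negative test vectors
a, b, lam, t = 0.5, 3.0, 0.3, 0.4

# (1): monotonicity, (sum x^a)^(1/a) >= (sum x^b)^(1/b) for 0 < a < b.
assert (x**a).sum() ** (1/a) >= (x**b).sum() ** (1/b)

# Hölder: sum x*y <= (sum x^(1/lam))^lam * (sum y^(1/(1-lam)))^(1-lam).
assert (x * y).sum() <= (x**(1/lam)).sum()**lam * (y**(1/(1-lam))).sum()**(1-lam)

# The two basic forms, as ratios that should be at least 1.
print((x**t).sum() ** (1/t) / x.sum())
print(x.sum()**lam * y.sum()**(1-lam) / (x**lam * y**(1-lam)).sum())
```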
What They Prove
Let’s just state their main auxiliary result. Specifically, they determine for all length-$d$ sequences $a = (a_1,\dots,a_d)$, $b = (b_1,\dots,b_d)$, and $c = (c_1,\dots,c_d)$ of rational numbers whether the inequality

$$\prod_{j=1}^{d}\left(\sum_{i=1}^{n} x_i^{a_j}\, y_i^{b_j}\right)^{c_j} \;\geq\; 1 \tag{3}$$

holds for all $n$ and all non-negative real $n$-vectors $x$ and $y$. For example, taking $d = 3$, $a = (2, 0, 1)$, $b = (0, 2, 1)$, and $c = (1/2, 1/2, -1)$ gives us the Hölder inequality with $\lambda = 1/2$, that is, Cauchy-Schwarz.
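Experimenting with the class (3) is straightforward; here is a small Python evaluator (the function name is ours), checked on the Cauchy-Schwarz instance by random trials, which of course stand in for a proof:

```python
import numpy as np

def product_form(a, b, c, x, y):
    # Evaluate prod_j (sum_i x_i^{a_j} y_i^{b_j})^{c_j}; inequality (3) asks >= 1.
    return np.prod([(x**aj * y**bj).sum() ** cj for aj, bj, cj in zip(a, b, c)])

a, b, c = (2, 0, 1), (0, 2, 1), (0.5, 0.5, -1)   # Cauchy-Schwarz

rng = np.random.default_rng(2)
vals = [product_form(a, b, c, rng.random(5), rng.random(5)) for _ in range(1000)]
print(min(vals))   # always >= 1; equality needs x proportional to y
```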
Theorem 2 The inequality (3) holds if and only if the left-hand side can be expressed as a finite product of positive powers of instances of the left-hand sides of the basic monotonicity and Hölder inequalities.

Moreover, there is an algorithm that in time polynomial in $d$ and the maximum bits in any $a_j$, $b_j$, or $c_j$ will output the terms of such a product if the inequality is true, or a description of a counterexample $(x, y)$ if it is false.
The proof uses variables standing for the logarithms of sums of the form $\sum_i x_i^{a_j} y_i^{b_j}$ and takes logs of the inequalities to make linear constraints in these variables with non-negative values. The key idea is that if a derived objective function has a nonzero optimum then its argument (and a neighborhood of it) gives counterexamples, whereas if the optimum is $0$ then the dual program comes into play to yield the desired product. The bound on the dimension $n$ is not explicitly represented, but it comes out in counterexamples.
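To make the log-variable idea concrete, here is a toy Python rendition. Everything about it is a simplification of ours: it only uses exponent points already present in the input, whereas their algorithm also introduces auxiliary grid points via the game described below, so this sketch is sound but incomplete:

```python
import numpy as np
from scipy.optimize import linprog

def derivable(a, b, c, tol=1e-9):
    # Variable V_j stands for log(sum_i x_i^{a_j} y_i^{b_j}).  We collect
    # linear constraints valid for every non-negative x, y and ask whether
    # they force the objective sum_j c_j V_j to be >= 0.
    pts = np.array(list(zip(a, b)), dtype=float)
    d, rows = len(pts), []
    # Hölder = log-convexity: if point j3 lies inside segment (j1, j2) then
    # (1-s)*V_j1 + s*V_j2 - V_j3 >= 0.
    for j1 in range(d):
        for j2 in range(d):
            for j3 in range(d):
                if len({j1, j2, j3}) < 3:
                    continue
                diff = pts[j2] - pts[j1]
                if diff @ diff < tol:
                    continue
                s = ((pts[j3] - pts[j1]) @ diff) / (diff @ diff)
                if tol < s < 1 - tol and np.allclose(pts[j1] + s * diff, pts[j3]):
                    row = np.zeros(d)
                    row[j1], row[j2], row[j3] = 1 - s, s, -1.0
                    rows.append(row)
    # Monotonicity: if point j3 = t * point j1 with 0 < t < 1 then
    # V_j3 - t*V_j1 >= 0.
    for j1 in range(d):
        for j3 in range(d):
            if j1 != j3 and pts[j1] @ pts[j1] > tol:
                t = (pts[j3] @ pts[j1]) / (pts[j1] @ pts[j1])
                if tol < t < 1 - tol and np.allclose(t * pts[j1], pts[j3]):
                    row = np.zeros(d)
                    row[j3], row[j1] = 1.0, -t
                    rows.append(row)
    A = np.array(rows) if rows else np.zeros((0, d))
    # Minimize c.V over the cone A V >= 0.  Optimum 0 means c.V >= 0 is
    # derivable (the dual multipliers are the powers in the product);
    # an unbounded program means this restricted LP finds no derivation.
    res = linprog(c, A_ub=-A, b_ub=np.zeros(len(A)), bounds=[(None, None)] * d)
    return res.status == 0 and abs(res.fun) < tol

# Cauchy-Schwarz: the point (1,1) is the midpoint of (2,0) and (0,2).
print(derivable([2, 0, 1], [0, 2, 1], [0.5, 0.5, -1]))  # True
```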
The process is so nicely effective that they are able to represent it by a human-friendly game on grid-points in the plane that is somewhat reminiscent of (and easier than) the solitaire peg-jumping game—well worth a look. They say:
Our characterization is of a non-traditional nature in that it uses linear programming to compute a derivation that may otherwise have to be sought through trial and error, by hand. We do not believe such a characterization has appeared in the literature, and hope its computational nature will be useful to others, and facilitate analyses like the one here.
In particular, it helped them with inequalities involving that “hairy” power $2/3$ which they used to prove their main results about sampling.
Why Isn’t it Exponential?
The short answer to why their algorithm runs in polynomial time is that it involves linear programming. There isn’t a tight analogy between their game moves and basic pivot moves of the simplex algorithm, however, because there are cases where the latter takes exponential time. It is unclear to us whether they are using a subcase of linear programming that stops short of being P-complete, or whether this will lead to new P-complete problems involving inequalities. They do have an even more general form with $k$-many sequences and summands $x_{1,i}^{a_{1,j}} \cdots x_{k,i}^{a_{k,j}}$ for $j = 1$ to $d$, for any $k$, in place of the two-sequence summands $x_i^{a_j} y_i^{b_j}$ of (3).
Most notable is that the minimum dimension of a counterexample can be doubly exponential in the bit-size of the arguments, and yet the algorithm can still describe it. This happens even for a single sequence $x$, as they note, via an explicit family of inequalities whose middle factor is legal under the rules of (3). Each member of the family is true in low dimensions but false for large $n$, with the least falsifying $n$ growing exponentially in the numeric values of the exponents, hence doubly exponentially in their bit-size. Yet counterexamples have succinct descriptions—as they note, taking $x$ to be one big entry followed by a long run of trailing $1$s makes it false: the exponentially-many trailing $1$s make the corresponding sum large enough for the middle term to overpower the rest. Put another way, the domain formed by their constraints is regular enough that extrema and specific points in their neighborhoods have short descriptions.
We close with a speculation about duality and dimension. The monotonicity inequalities become equalities only when at most one $x_i$ is nonzero. The Hölder inequalities become equalities only when $x$ is proportional to $y$—that is, they are the same projective point. We wonder if their algorithm can point the way to a useful tradeoff between homogeneity and dimension, which might help tame cases where the dimension is large.
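A quick numeric illustration of the two equality cases, with arbitrary vectors of ours:

```python
import numpy as np

# Monotonicity is tight when at most one entry is nonzero.
x, t = np.array([3.0, 0.0, 0.0]), 0.4
print((x**t).sum() ** (1/t), x.sum())    # both 3.0

# Hölder is tight when the two vectors are proportional.
u = np.array([1.0, 2.0, 4.0])
v, lam = 2.5 * u, 0.3
print(u.sum()**lam * v.sum()**(1-lam), (u**lam * v**(1-lam)).sum())  # equal
```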
Open Problems
Can you find further applications for their result on inequalities? Can the class of inequalities that can be given an “automatic” treatment be extended further? For one instance we mention the Chebyshev sum inequality

$$n \sum_{i=1}^{n} a_i b_i \;\geq\; \left(\sum_{i=1}^{n} a_i\right)\left(\sum_{i=1}^{n} b_i\right),$$

which depends critically on the monotone orderings $a_1 \geq a_2 \geq \cdots \geq a_n$ and $b_1 \geq b_2 \geq \cdots \geq b_n$.
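As a parting numeric check (our own, with random data), the inequality holds for identically sorted sequences and reverses for oppositely sorted ones:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
a = np.sort(rng.random(n))[::-1]   # a_1 >= ... >= a_n
b = np.sort(rng.random(n))[::-1]   # same monotone ordering

assert n * (a * b).sum() >= a.sum() * b.sum()        # Chebyshev sum inequality
assert n * (a * b[::-1]).sum() <= a.sum() * b.sum()  # reversed ordering flips it
```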