A Note on Distributions and Approximation
A helpful analogy for the quantum computing debate?
Roman Smolensky was a complexity theorist—unfortunately “was” is the correct word here. He passed away in 1995, way too young, at the age of 35. His STOC 1987 paper “Algebraic methods in the theory of lower bounds for Boolean circuit complexity” is a classic, and one can only imagine what he might have done had he lived longer.
Today I, Ken, wish to explain a well-known result of Smolensky’s about low-level complexity theory. Then I draw an analogy that we hope may help in understanding mathematical issues underlying our current debate over scalable quantum computers.
Dick raises a principle of mathematics that is not well known, or not as well known as it should be: when it is hard to work with sets, replace them by distributions. Even though distributions are a more general notion, they have some nicer properties. Properties of distributions over qubits and their errors are at the heart of the debate, so we feel noting some properties of distributions on ordinary bits or vectors over a finite field, and raising some questions, may be a helpful stepping stone.
Smolensky considered bounded (or sub-logarithmic) depth circuits of unbounded fan-in $\mathrm{AND}$ and $\mathrm{OR}$ gates (with $\mathrm{NOT}$ gates at inputs), or equivalently circuits of $\mathrm{NAND}$ or $\mathrm{NOR}$ gates, together with unbounded fan-in gates telling whether the number of incoming 1’s is a multiple of a prime $p$. He proved that they cannot compute the mod-$q$ function unless $q$ is a power of $p$, and indeed cannot even approximate it in a sense whose dependence on distributions we wish to bring out. Let us first turn to Smolensky’s work, in the particular case of parity, i.e., $q = 2$ and $p$ is an odd prime.
Approximating NANDs
As Smolensky’s paper acknowledges, the following lemma on approximating NANDs by low-degree polynomials was also proved by Alexander Razborov for $p = 2$, and by David Mix Barrington for all $p$:
Lemma 1 For any distribution $D$ on $\{0,1\}^m$, and any $\ell \geq 1$, there exists a polynomial $g$ of degree at most $(p-1)\ell$ over $GF(p)$ such that

$$\Pr_{x \sim D}\bigl[\,g(x) = \mathrm{NAND}(x_1,\dots,x_m)\,\bigr] \;\geq\; 1 - p^{-\ell}.$$
Proof: Consider polynomials of the form

$$g'(x) \;=\; \bigl(c_1(1-x_1) + c_2(1-x_2) + \cdots + c_m(1-x_m)\bigr)^{p-1},$$

where $x = (x_1,\dots,x_m) \in \{0,1\}^m$, with coefficients $c_1,\dots,c_m \in GF(p)$.

If $x$ makes $\mathrm{NAND}(x_1,\dots,x_m)$ false, then since $1$ stands for true, each $x_i = 1$ and every term $c_i(1 - x_i)$ vanishes, and so $g'(x) = 0$, which stands for false.

Otherwise, some $x_i = 0$, and thus the sum inside the parentheses has an additive term that is simply $c_i$ by itself. Whatever the value of the rest of the sum, there is exactly a $1 - \frac{1}{p}$ probability that a uniformly random choice of coefficients including $c_i$ will produce a value that is nonzero in $GF(p)$. Raising that to the power $p-1$ produces $1$, by Fermat's little theorem, and so $g'(x) = 1$, which stands for true.

Furthermore, using $\ell$-many sets of $m$-many coefficients $c_{j,1},\dots,c_{j,m}$ to define $g'_1,\dots,g'_\ell$ in the manner of $g'$, we define

$$g(x) \;=\; 1 - \prod_{j=1}^{\ell}\bigl(1 - g'_j(x)\bigr).$$

Then when $\mathrm{NAND}(x)$ is false, all $g'_j(x) = 0$, so $g(x) = 0$, which stands for false, while for all other $x$, the chance of making some $g'_j(x)$ nonzero, so that the whole product is zeroed out and $g(x) = 1$, is $1 - p^{-\ell}$.

Thus for every $x$, the probability over the choices of the coefficients $c_{j,i}$ that the resulting $g$ agrees with $\mathrm{NAND}$ on that $x$ is at least $1 - p^{-\ell}$. Picture the $x$'s as rows of a big table and the choices for all the $c_{j,i}$ as columns. Each row has density at least $1 - p^{-\ell}$ of agreements. It follows that regardless of the distribution $D$ on the rows, some column has density at least $1 - p^{-\ell}$ of agreements when $x$ is chosen according to $D$. $\blacksquare$
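Here is a small numerical illustration of this construction in Python. The code and the parameters ($p = 3$, $m = 4$, $\ell = 2$) are ours and purely illustrative, not from Smolensky's paper: for each input $x$ it estimates the "row density" from the proof, namely how often $g(x)$ agrees with $\mathrm{NAND}(x)$ over random coefficient choices.

```python
# Illustration (our own toy code) of the Lemma's construction over GF(p), p = 3.
# For each input x in {0,1}^m it estimates, over random coefficient choices,
# how often the polynomial g agrees with NAND(x) -- the "row density" in the proof.
import itertools
import random

p, m, ell = 3, 4, 2      # prime, fan-in of the NAND gate, number of coefficient sets
trials = 2000

def g_value(x, coeffs):
    # coeffs is an ell-by-m array over GF(p);  g = 1 - prod_j (1 - g'_j)
    prod = 1
    for row in coeffs:
        s = sum(c * (1 - xi) for c, xi in zip(row, x)) % p
        g_prime = pow(s, p - 1, p)           # 0 if s == 0, else 1 (Fermat's little theorem)
        prod = (prod * (1 - g_prime)) % p
    return (1 - prod) % p

def nand(x):
    return 0 if all(x) else 1

worst = 1.0
for x in itertools.product([0, 1], repeat=m):
    agree = 0
    for _ in range(trials):
        coeffs = [[random.randrange(p) for _ in range(m)] for _ in range(ell)]
        agree += (g_value(x, coeffs) == nand(x))
    worst = min(worst, agree / trials)

# The proof says each row's density of agreements is at least 1 - p^{-ell}.
print(f"estimated worst per-input agreement: {worst:.3f}"
      f"  (value from the proof: {1 - p ** -ell:.3f})")
```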
The reason for caring about arbitrary distributions on $\{0,1\}^m$ is that the $x_i$ will be outputs from $\mathrm{NAND}$ gates higher in the circuit, and those values might be correlated. Lecture notes by Alexis Maciel and David Mix Barrington (which I am using in a course) in fact represent the $x_i$ as functions $f_i(y)$ of a single variable $y$. Here $y$ is ultimately interpreted as the vector of inputs $x_1,\dots,x_n$ to the circuit $C$, but it could be arbitrary.
Lower Bound for Parity
The point of the lemma is that if the circuit has small depth $d$, then composing the polynomials $g_h$ obtained for each $\mathrm{NAND}$ gate $h$ yields a polynomial $g$ of degree at most $((p-1)\ell)^d$ that agrees with $C$ on all but an $s\,p^{-\ell}$ proportion of inputs $x$, where $s$ is the number of gates in $C$. When $d = O(1)$, there is some slack to choose $\ell$ non-constant but growing slowly enough with $n$ (for some fixed $\epsilon > 0$ used as below) to make:

$$\deg(g) \;=\; \bigl((p-1)\ell\bigr)^d \;=\; o(\sqrt{n}) \qquad\text{while}\qquad s\cdot p^{-\ell} \;<\; \epsilon,$$

where again $s$ is the size of the circuit. This allows $s$ to be as big as $2^{n^{\Omega(1/d)}}$, and this becomes the circuit size lower bound obtained by Smolensky’s argument.
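For a feel of the numbers, here is a quick back-of-envelope check; the constant $0.4$ and the sample values of $n$ and $d$ are our own choices, just to see the trade-off between the composed degree and the tolerated circuit size.

```python
# A rough look (our own numbers) at the trade-off: with p = 3 and depth d,
# taking ell proportional to n^(1/(2d)) keeps the composed degree ((p-1)*ell)^d
# below sqrt(n), while the tolerated circuit size grows like p^ell.
import math

p = 3
for d in (2, 3):
    for n in (10**4, 10**6, 10**9):
        ell = round(0.4 * n ** (1.0 / (2 * d)))
        degree = ((p - 1) * ell) ** d
        print(f"d={d}  n={n:>10}  ell={ell:>3}  degree={degree:>6}"
              f"  sqrt(n)={math.isqrt(n):>6}  tolerated size ~ 3^{ell}")
```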
Note that regardless of the distribution on the inputs, there always exists a setting of the $\ell m$-many coefficients $c_{j,i}$ over all the gates $h$ that achieves the bound $s\,p^{-\ell}$ on the error probability over that distribution of inputs. In particular, this works for the uniform distribution over the inputs $x \in \{0,1\}^n$, but does not require it. Thus $C$ can be approximated by some polynomial $g$ of low degree $o(\sqrt{n})$.
This sets the stage for Smolensky’s famous “trick,” which we give in the case of parity. Over the domain $\{1,-1\}^n$, parity is just the product $P(x) = x_1 x_2 \cdots x_n$ of all the variables, and $x_i^2 = 1$ for each $i$. Thus parity is a polynomial $P$ that is “complete” in the following sense defined in his paper:

For every function $f: \{1,-1\}^n \rightarrow GF(p)$ there are polynomials $q, r$ both of degree at most $\frac{n}{2}$ such that

$$f \;=\; P\cdot q + r.$$
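The identity behind this completeness is just that $x_i^2 = 1$ on the $\pm 1$ domain, so any monomial of degree more than $n/2$ is $P$ times the complementary monomial of degree less than $n/2$. A tiny check of that identity (our own toy code, with $-1$ represented as $2$ modulo $3$):

```python
# Toy verification (ours) of the identity behind the completeness claim:
# over the domain {+1,-1}^n with arithmetic mod 3, a monomial of degree > n/2
# equals parity P(x) = x_1...x_n times the complementary monomial of degree < n/2,
# because x_i^2 = 1 on this domain.
import itertools

n = 5
S = (0, 1, 2, 3)                               # |S| = 4 > n/2: a high-degree monomial
comp = tuple(i for i in range(n) if i not in S)

def monomial(indices, x):
    out = 1
    for i in indices:
        out = (out * x[i]) % 3
    return out

for x in itertools.product([1, 2], repeat=n):  # 2 plays the role of -1 mod 3
    high = monomial(S, x)
    parity_times_low = (monomial(range(n), x) * monomial(comp, x)) % 3
    assert high == parity_times_low
print("x_S = P * x_{complement of S} holds on all of {+1,-1}^n")
```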
The genius of this is that if $g$ agrees with $P$ on some subset $A \subseteq \{1,-1\}^n$ and has low degree $D = o(\sqrt{n})$, then we can bound the size of $A$. Namely, every function $f$ on $\{1,-1\}^n$, when restricted further to $A$, can be written as a polynomial of degree at most $\frac{n}{2} + D$, because on $A$ we may substitute $g$ for $P$ in $f = P\cdot q + r$. There are $p^{|A|}$ such functions, but only $p^{M}$ ways to assign the coefficients of the terms of degree at most $\frac{n}{2} + D$, where $M$ is the number of such monomials. Taking logs base $p$ we get

$$|A| \;\leq\; M \;=\; \sum_{i=0}^{n/2 + D}\binom{n}{i} \;\leq\; \Bigl(\frac{1}{2} + o(1)\Bigr)2^n.$$
To understand the right-hand inequality intuitively, note that the standard deviation of $n$ coin flips is $\frac{\sqrt{n}}{2}$, so having $D$ be only an $o(\sqrt{n})$ deviation above $\frac{n}{2}$ makes the overall density of monomials of degree at most $\frac{n}{2} + D$ only vanishingly above one-half. Thus $g$ and $P$ can agree on barely over half the inputs in $\{1,-1\}^n$, making $P$, that is parity, inapproximable by any small-depth, sub-exponential size circuits $C$. A tour-de-force.
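To see those densities numerically, here is a quick computation; the values of $n$, and of $D$ as a fraction of $\sqrt{n}$, are our own arbitrary choices. The smaller $D$ is relative to $\sqrt{n}$, the closer the fraction sits to one-half.

```python
# Quick numeric look (ours) at the counting step: the fraction of the 2^n
# multilinear monomials that have degree at most n/2 + D.  When D is a small
# fraction of sqrt(n), the fraction is only slightly above 1/2, which is what
# caps the size of the agreement set A.
from math import comb, sqrt

for n in (400, 1600, 6400):
    for c in (0.05, 0.2, 0.5):
        D = int(c * sqrt(n))
        frac = sum(comb(n, i) for i in range(n // 2 + D + 1)) / 2 ** n
        print(f"n={n:>5}  D={D:>3}  fraction of monomials of degree <= n/2 + D: {frac:.3f}")
```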
Note that the agreement set $A$ is implicitly talking about correlation with the uniform distribution on $\{1,-1\}^n$. We would be interested in treatments that consider non-independent distributions on the inputs, and note the presentation of Smolensky’s theorem by Emanuele Viola in this paper in that regard.
The Analogy
The analogy is that the proof technique, for the Lemma, does not require the distribution over the inputs to be uniform. Indeed the inputs can be joint functions of some “hidden variable” $y$, and be in that sense “entangled.” More concretely, they can be arbitrarily strongly correlated. The proof technique copes with this lack of entropy in the inputs by creating its own entropy in the ability to fit the coefficients $c_{j,i}$ when taking a low-degree polynomial that succeeds with high probability. Although the theorem is usually presented assuming the uniform distribution over the inputs, this is not actually needed for the Lemma.
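As a tiny concrete check of this point (our own code and parameters, with $p = 3$, fan-in $3$, $\ell = 2$): even if the gate's inputs are all functions of a single hidden bit $y$, so that the input distribution has almost no entropy, exhaustive search over coefficient settings finds a fixed $g$ that agrees with $\mathrm{NAND}$ with probability at least $1 - p^{-\ell}$ under that distribution, as the Lemma promises.

```python
# Tiny demo (ours) that the Lemma survives a highly correlated input distribution.
# The three inputs to the NAND gate are functions of a single hidden bit y, so the
# distribution D has only two support points; still, some fixed coefficient setting
# makes g agree with NAND with probability >= 1 - p^{-ell} under D.
import itertools

p, m, ell = 3, 3, 2

def g_value(x, coeffs):
    prod = 1
    for row in coeffs:
        s = sum(c * (1 - xi) for c, xi in zip(row, x)) % p
        prod = (prod * (1 - pow(s, p - 1, p))) % p
    return (1 - prod) % p

def nand(x):
    return 0 if all(x) else 1

# hidden variable y in {0,1}; each input bit is a function of y alone
support = {0: (1, 1, 1), 1: (0, 1, 0)}
dist = {0: 0.5, 1: 0.5}

best = 0.0
for flat in itertools.product(range(p), repeat=ell * m):
    coeffs = [flat[j * m:(j + 1) * m] for j in range(ell)]
    agree = sum(prob for y, prob in dist.items()
                if g_value(support[y], coeffs) == nand(support[y]))
    best = max(best, agree)

print(f"best fixed-coefficient agreement under D: {best:.3f}"
      f"  (Lemma's bound: {1 - p ** -ell:.3f})")
```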
The analogy with our quantum debate is made by asking, do fault-tolerance schemes work more like the Lemma, or do they really need independence? At the risk of straining what we are trying to say, we can draw out the analogy even further:
“Model of Noise”: A class of distributions $D$ on the inputs to the circuit $C$, or on the inputs to any $\mathrm{NAND}$ gate of $C$.

“Correlated/Entangled Errors”: $D$ is the image distribution under mappings $f_1, f_2, \dots, f_m$ of a distribution $E$ on a set $Y$, where $Y$ may have lower entropy/fewer degrees of freedom. Note that gates lower down in the circuit may have a greater degree of correlation resulting from the computation by gates above them.

“Reasonable Noise Model”: The distribution $E$ is not too complex, such as being $\mathsf{P}$-samplable, or is natural in some other sense. For instance, $E$ could be induced from some distribution $E'$ on a larger set of co-ordinates by some operation on the extra co-ordinates, such as projection or “tracing out.”

“Feasible States”: The coefficients $c_{j,i}$ that achieve the desired approximations are easy to compute, for instance $\mathsf{P}$-uniform.
The main lack in this analogy is that the complexity-class side has no notion of “noise” per se or “errors” or “correction.” But we offer it nevertheless because we see some similarity in the mathematical issues. We also would like to ask whether the feasible-construction issues have been addressed as such by complexity theorists.
We have a further reason for thinking along lines of distributions, a.k.a. measures, in a related context. A survey paper by Akshay Venkatesh of Stanford about work on the Littlewood conjecture has some interesting examples and declares straight out:
Measures are often easier to work with than sets.
In particular, on pages 12–13 he gives an example of a statement about sets that is “wishful thinking” as stated, but which he can “salvage” in terms of distributions. We intend to elaborate on this point soon.
Open Problems
What is the complexity of constructing the polynomials that approximate a $\mathrm{NAND}$ gate, or a circuit of $\mathrm{NAND}$ gates, in terms of the complexity of the distribution $D$ on their inputs? Under what conditions are the resulting polynomials computable jointly in uniform polynomial time? When are they succinct?
Why does Smolensky’s argument fail to give a super-polynomial size lower bound for constant-depth circuits with $\mathrm{MOD}_6$ gates that compute parity? Can we give a better, deeper answer than, “because $\mathbb{Z}_6$ is not a field”?



Roman Smolensky was a postdoc at the Hebrew University of Jerusalem in the early 1990s and it was always a pleasure to talk with him. We often met and chatted at the Belgium House (our faculty club). The Razborov-Smolensky method for lower bounds for bounded-depth circuits with MOD gates is indeed very beautiful and important. It is a major mystery what to do for the non-prime-power case. (There are also some important mysteries in the prime case as well.)
In the context of Möbius randomness, which was mentioned in several recent posts, it is still a major open problem to show that the Möbius function is nearly orthogonal to functions that can be computed by bounded-depth Boolean circuits with mod-2 gates. By the Razborov-Smolensky theorem, we need to show that the Möbius function is asymptotically orthogonal to low-degree (polylog-degree) polynomials over GF(2). The case of degree-one polynomials is precisely the case of Walsh functions, which is crucial for “polylog frequency” in Green’s proof of AC^0 Möbius randomness, and was settled for general Walsh functions by Bourgain. But even degree-two polynomials over GF(2) represent a very difficult open problem, which includes the (as yet unknown) Möbius randomness of several famous sequences arising in dynamics.
Regarding QFT, what is required for quantum fault tolerance is much weaker than independence. On the other hand, noise models that fail QFT can also be very simple. In the more detailed proposed models the noise, in some sense, “mimics” the computation itself. At some early time I was interested in how the noise that interests me can arise from very simple circuits, but I did not pursue this direction.
Hello Professor Ken,
With respect to your introductory paragraph, I would like to ask a question.
I would like you to illustrate the principle “when it is hard to work with sets, replace them by distributions” with a few (possibly simple) examples. If possible, may I also request more than one example so that I can appreciate this principle better.
Thanks a lot for your time, and please allow me to apologize for any inconvenience.
Thanks for the interest; I’m limited only by finite time. We do intend to make a later post on this, as stated in regard to the cited survey by Akshay Venkatesh at the end.
An example that’s maybe too simple to be on-target is that I originally conceived the chess anti-computer-cheating application (which you can find here and in papers here) as being one of distance between distributions. Three parties are represented as distributions to compare: the computer chess program, the player, and the model’s idealization of the player. The player is formally denoted by the set of played moves, but it was convenient to treat this as the ensemble of distributions giving 1 on the played move and 0 on the others, over all turns in the player’s games. (Later on I switched to a simpler frequentist model.)
My actual motive on the non-quantum side was the opposite: to ask how the Barrington/Razborov/Smolensky theorems might usefully be generalized to distributions over the inputs, with the technical obstacle that you may no longer have the simple binomial counting estimate that worked when the agreement domain A is treated as a set.