
A Proof Of The Halting Theorem

September 19, 2016


Toward teaching computability and complexity simultaneously

[Photo of Wilhelm Ackermann; image source: “Large Numbers in Computing”]

Wilhelm Ackermann was a mathematician best known for work in constructive aspects of logic. The Ackermann function is named after him. It is used both in complexity theory and in data structure theory. That is a pretty neat combination.

I would like today to talk about a proof of the famous Halting Problem.

This term at Georgia Tech I am teaching CS4510, which is the introduction to complexity theory. We usually study general Turing machines and then use the famous Cantor Diagonal method to show that the Halting Problem is not computable. My students over the years have always had trouble with this proof. We have discussed this method multiple times: see here and here and here and in motion pictures here.

This leads always to the question, what really is a proof? The formal answer is that it is a derivation of a theorem statement in a sound and appropriate system of logic. But as reflected in our last two posts, such a proof might not help human understanding. The original meaning of “proof” in Latin was the same as “probe”—to test and explore. I mean “proof of the Halting Problem” in this sense. We think the best proofs are those that show a relationship between concepts that one might not have thought to juxtapose.

Diagonal and Impossible

The question is how best to convince students that there is no way to compute a halting function. We can define Turing machines {M_x} in a particular way—or define other kinds of machines. Then we get the particular definition

\displaystyle  H(x,y) = 1 \text{ if the machine } M_x \text{ halts on input } y;\;\; H(x,y) = 0 \text{ otherwise.}

How can we prove that {H(x,y)} is not computable? We want to convey not only that this particular {H(x,y)} is uncomputable, but also that no function like it is computable.

Trying the diagonal method means first defining the set

\displaystyle  D = \{x: M_x \text{ does not accept } x\}.

We need to have already defined what “accept” means. OK, we show that there is no machine {M_d} whose set of accepted strings equals {D}. Then what? We can say that the complementary language {K = \{x: M_x \text{ \emph{does} accept } x\}} is not decidable, but we still need another step to conclude that {H} is uncomputable. And when you trace back the reason, you have to fall back on the diagonal contradiction—which feels disconnected and ad hoc, tied as it is to the particular way {D} and {H} are defined.
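
For the record, the diagonal contradiction being appealed to is just this: if some machine {M_d} accepted exactly the strings in {D}, then applying the definition of {D} to the index {d} itself gives

\displaystyle  d \in D \iff M_d \text{ does not accept } d \iff d \notin D,

which is absurd, so no such {M_d} exists.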

Ken in his classes goes the {D} route first, but Sipser’s and several other common textbooks try to hit {H} directly. The targeted reason is one that anyone can grab:

It is impossible for a function {f} to give the value {f(n) = f(n) + 1}—or any greater value.

Implementations of this, however, resort to double loops to define {f(n)}. Or, like Sipser’s, they embed the “{D}” idea in the proof anyway, which strikes us as making it harder than doing separate steps as above. We want the cleanest way.

The Proof

Here is the plan. As usual we need to say that {M_{x}(y)} represents a computation. If the computation halts then it returns a result. We allow the machine {M_{x}} to return a natural number, not just accept or reject. If the machine does not halt then we can let this value be undefined; our point will be that by “short-circuit” reasoning the question of an undefined value won’t even enter.

Now let {H(x,y)} be defined as the halting function above.

Theorem 1 The function {H(x,y)} is not computable.

Proof: Define the function {f(y)} as follows:

\displaystyle  f(y) = 1 + \sum_{x=0}^{y} H(x,y) \cdot M_{x}(y).

Suppose that {H} is computable. Then so is {f(y)}. This is easy to see: just do the summation and when computing {H(x,y) \cdot M_{x}(y)} compute the {H} part first. If it is {0} then it adds nothing to the summation, so it “short-circuits” {M_{x}(y)} and we move on to the next {x}. If it is {1} then we compute {M_{x}(y)} and add that to the summation. Let {A_y = \sum_{x=0}^{y - 1} H(x,y) \cdot M_{x}(y)} stand for the summation before the last {x = y} term; then {A_y \geq 0}, since every term is a natural number.
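
To make the “short-circuit” point concrete, here is a minimal Python sketch of how {f(y)} would be computed. The helpers H and run are hypothetical stand-ins, not anything from the post: H(x, y) is the assumed computable halting function, and run(x, y) is a universal simulator returning {M_{x}(y)}; note that run is only ever invoked after H has answered 1, so in that case it halts.

    def f(y, H, run):
        # f(y) = 1 + sum_{x=0}^{y} H(x, y) * M_x(y), computed term by term.
        # H(x, y): hypothetical computable halting function, returning 0 or 1.
        # run(x, y): hypothetical universal simulator, returning the value M_x(y).
        total = 1
        for x in range(y + 1):
            if H(x, y) == 1:      # short-circuit: simulate M_x on y only when it halts
                total += run(x, y)
        return total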

Now if the theorem is false, then there must be some {n} such that the machine {M_n} computes {f(y)}. Since {f} is total, {M_n} halts on every input; in particular {H(n,n) = 1} and {M_{n}(n) = f(n)}. But then

\displaystyle  f(n) = 1 + \sum_{x=0}^n H(x,n)M_{x}(n) = 1 + A_n + M_{n}(n) \ge 1 + M_{n}(n) = 1 + f(n).

This is impossible and so the theorem follows. \Box

Extending Simplicity

What Ken and I are really after is relating this to hierarchies in complexity classes. When the {M_{x}} are the machines of a complexity class {\mathsf{C}}, the functions {H} and {f} are computable. It follows that {f} is not computed by any {M_n} and so does not belong to {\mathsf{C}}. What we want is to find similar functions {f'} that are natural.

Ackermann’s famous function {A(n)} does this when {\mathsf{C}} is the class of primitive recursive functions. There are various ways to define the machines {M_{x}}, for instance by programs with counted loops only. The {f(n)} that tumbles out is not primitive recursive—indeed it out-grows all primitive recursive functions. Showing that {A(n)} does likewise takes a little more formal work.
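
For concreteness, here is the standard two-argument Ackermann-Péter recursion in Python, together with the diagonal version; the post’s {A(n)} is of this kind, though the exact variant may differ in details. The code only exhibits the definition: even ack(4, 4) is far beyond practical evaluation.

    def ack(m, n):
        # Ackermann-Peter function: total and computable, but not primitive recursive.
        if m == 0:
            return n + 1
        if n == 0:
            return ack(m - 1, 1)
        return ack(m - 1, ack(m - 1, n - 1))

    def A(n):
        # Diagonal version, which out-grows every primitive recursive function.
        return ack(n, n)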

In complexity theory we have various time and space hierarchy theorems, say where {\mathsf{C}} is {\mathsf{DTIME}[O(n^2)]}. For any time constructible {t(n) = \omega(n^2 \log n)}, we can separate {\mathsf{C}} from {\mathsf{C}' = \mathsf{DTIME}[t(n)]} by a “slowed” diagonalization. The {f'} obtained this way, however, needs knowledge of {t(n)} and its constructibility to define it. By further “padding” and “translation” steps, one can laboriously make it work for {t(n) = \omega(n^2 (\log n)^{\epsilon})}, for any fixed {\epsilon > 0}, and a similar theorem for deterministic space needs no log factors at all. This is all technical in texts and lectures.
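
As a rough sketch of what the “slowed” diagonalization looks like (ignoring the encoding, padding, and simulation-overhead issues that the real proofs must handle), one can picture a diagonalizer along the following lines. Both names are hypothetical: t is the time-constructible bound, and simulate(x, y, max_steps) is a clocked universal simulator that returns {M_{x}(y)} if that computation halts within the step budget and None otherwise.

    def diagonalizer(y, t, simulate):
        # Sketch only: a function intended to differ from every machine running in time o(t).
        # t: a time-constructible bound, e.g. t(n) = n**3.
        # simulate(x, y, max_steps): hypothetical clocked universal simulator.
        budget = t(len(str(y)))           # spend roughly t(|y|) steps on the simulation
        result = simulate(y, y, budget)   # run machine number y on input y, clocked
        return 0 if result is None else result + 1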

Suppose we’re happy with {\mathsf{C}' = \mathsf{DTIME}[n^3]}, that is, with a non-“tight” hierarchy. Can we simply find a natural {f'} that works for {\mathsf{C}'}? Or suppose {\mathsf{C}} is a combined time and space class, say machines that run in {O(n^2)} time and {O(n)} space simultaneously. Can we possibly get a natural {\mathsf{C}'} that is different from what we get by considering time or space separately?

We’d like the “non-tight” proofs to be simple enough to combine with the above proof for halting. This leads into another change we’d like to see. Most textbooks define computability several chapters ahead of complexity, so the latter feels like a completely different topic. Why should this be so? It is easy to define the length and space usage of a computation in the same breath. Even when finite automata are included in the syllabus, why not present them as special cases of Turing machines and say they run in linear time, indeed time {n+1}?

Open Problems

Is the above Halting Problem proof clearer than the usual ones? Or is it harder to follow?

What suggestions would you make for updating and tightening theory courses? Note some discussion in the comments to two other recent posts.

[some word fixes]

9 Comments
  1. Peter Gerdes
    September 20, 2016 2:04 am

    This is *way* less clear. I feel you hide the self-reference in the trick of summing up over all computations in a way that doesn’t easily generalize to other similar proofs.

    The intuition one wants to convey with the halting problem is this: if the halting problem were computable, we could define a computation that asks whether that very computation halts and does the opposite. I feel that if you don’t get this across you might as well just assert it’s not computable, since this is the essential trick used to prove any similar statement.

    Yes, that proof goes through the fixed point theorem, but I actually find that the fixed point theorem (or at least the part needed here) is super easy for modern students to grasp. OF COURSE a computation can scan through memory and read off its own source.

    So take the program that reads its own `source’ off and then on input y runs H(‘source’, y)… then explain that this is all the fixed point theorem does in the formal version.

    • September 20, 2016 2:23 am

      I agree, this feels like a tricky calculation that mysteriously works. I think that one thing you can rely on is that today students know (at least to some extent) how to code and what a simple computer program looks like. Instead of Turing machines, you can present the whole proof in C++!

      typo: night should be might.

      • rjlipton
        September 20, 2016 9:10 am

        Thanks for comments from both.

        The method used here does work in other cases, by the way. Perhaps we should have spelled that out, or we will do that in the future. Separating classes by growth is a common tool. It is used, for example, with primitive recursive functions and in a proof of the Paris-Harrington result.

        Thanks

  2. September 20, 2016 12:04 pm

    “Most textbooks define computability several chapters ahead of complexity, so the latter feels like a completely different topic. Why should this be so? It is easy to define the length and space usage of a computation in the same breath.”

    Shameless plug: Stephan Mertens and I agree with you 🙂

    Why not start (at least to give the students intuition) with a “software-level” proof: suppose that there is a subroutine Halts(P,x) that tells us, in finite time, whether or not a program P halts when given input x. Then we could use it in the following program,

    Catch22(P)
    if Halts(P,P) then loop forever
    else halt

    and then ask what would happen if we ran Catch22(Catch22).

    When I teach this stuff, I start by giving this proof. I then try to impress on the students that ideas like subroutines, universal programs that run other programs, and so on were hard-won — that people like Turing, Church, and Gödel had to do all this “from scratch”, using things like Gödel numbers instead of the modern notion of source code. Their achievements are incredible, but we can honor them by using the high-level picture that they made possible.

    We then talk about e.g. the time hierarchy theorem by noting that we _can_ solve the Halting problem for P, if we have an upper bound on P’s running time — if f(n)=o(g(n)), then we can solve the Halting Problem for programs that run in time f(n), using g(n) time to simulate them. But the very same argument shows that g(n) _needs_ to grow faster than f(n), and this shows that TIME(f(n)) is a proper subset of TIME(g(n)). To put it differently, any universal simulation of programs by other programs must have some overhead* — which is a lovely high-level conclusion to be able to draw.

    *although not necessarily the log factor we get from Turing machines. For instance, in the RAM model the overhead can be a multiplicative constant.

  3. kevembuangga
    September 20, 2016 12:08 pm

    It’s of no use to try to “better explain” Gödel’s incompleteness theorem: the Smoryński proof is among the simplest, yet it is not “well received” (to say the least) to dismiss the Profound Metaphysical Problem of incompleteness (consciousness, AI, humans over machines, God, etc., etc…).
    See my comment at Peter Smith’s blog Logic Matters, and the replies…

  4. September 21, 2016 12:16 pm

    1. Is there no better picture of Ackermann anywhere? He lived until 1962, so there should be a more flattering one. (Sorry, a bit of a hobby horse of mine. I posted on the math stackexchange without any results.)

    2. Unlike others here, I teach undergrads solely, and in this course I feel that one of the main needs is to convey what the results are “really about”. It is great to bring in complexity, because that’s where it is at, right? But I don’t know that hiding the diagonalization, or at least making it less overt, helps with the “really.” (For what it’s worth, I do what Chris Moore does.)

    I teach the Recursion Theorem as what happens when diagonalization fails and so at the moment I like having diagonalization be overt, although of course I could be convinced that the tradeoff is worth it.

    • September 21, 2016 12:44 pm

      Hi Jim. I totally agree with showing Cantor’s diagonalization (as well as Russell’s paradox, Berry’s, etc.) Then we can point out that if we have rows and columns corresponding to programs, calling Catch22(Catch22) goes along the diagonal, and the internal “if” of Catch22 provides the flip.

      I always throw in some descriptions of snakes eating their own tails, Möbius strips, and so on…

      Another diagonalization, which deserves to be taught more often (including at the undergrad level), is this:

      Tweak(P)
        return P(P)+1 [or if you want a Boolean, not(P(P))]

      with the understanding that (just as would happen if we tried to run this in real life) Tweak(P) never halts if P(P) never halts. If we run Tweak(Tweak) and it halts, it returns a value such that

      Tweak(Tweak) = Tweak(Tweak) + 1

      The escape from this contradiction is that some programs never halt on some inputs (and indeed, Tweak(Tweak) is one of these). That is, some partial recursive functions are really partial: there’s no model of computation which 1) is universal in the sense of being able to run any program, and 2) always halts.

  5. Aula
    September 23, 2016 9:24 am

    Why is the partial sum A_y nonnegative? As far as I can see, nothing forbids M_x from returning negative integers.

