Logic and Star-Free, Part Deux
A visual proof with no abstract-algebra overhead
|
| Composite crop of src1, src2 |
Dominique Perrin and Jean-Éric Pin are French mathematicians who have done significant work in automata theory. Their 1986 paper “First-Order Logic and Star-Free Sets” gave a new proof that first-order logic plus the relation characterizes star-free regular sets.
Today we present their proof in a new visual way, using “stacked words” rather than their “marked words.” We also sidestep algebra by appealing to the familiar theorem that every regular language has a unique minimal deterministic finite automaton (DFA).
The first post in this series defined the classes for star-free regular and
for first-order logic where variables
range over the indices
of a string
of length
. For any character
in the alphabet
there is the predicate
saying that the character in position
of
equals
. The first part proved that for any language
there is a sentence
using only predicates
and the relation
besides the usual quantifiers and Boolean operations of first-order logic. such that
defined
. That is, it proved
.
A key trick was to focus not on but on a formula
expressing that the part of
from position
(inclusive) to
(exclusive) belongs to
. To prove the converse direction,
, we focus not on middles of strings but on prefixes and suffixes. The previous post proved two lemmas showing that if
is star-free then for any string
, so are its prefix and suffix languages:
We will piece together star-free expressions for a multitude of and
sets that come from analyzing minimum DFAs
at each induction step, until we have built a big star-free expression
for
. This sounds complex—and is complex—but all the steps are elementary. At the end we’ll try to see how big
gets. Again the rest of this post up to end notes is written by Daniel Winton.
Stacked Words and FO Formulas
We implement the “marked words” idea of Perrin and Pin by using extra dimensions rather than marks. Consider any formula in FO[
] where
are the free variables. Our "stacked words" have
extra rows underneath the top level holding the string
. Each row
has the same length
as
and contains a single
and
s. The column
in which the
lies gives the value of the variable
. Thus the alphabet of our stacked words is
. Here is an example:
Each column is really a single character , but we picture its entries as characters in
or
. For any row
, define
to be the disjunction of
over all
that have a
in entry
, and define
similarly. Then the following sentence expresses the “format condition” that row
has exactly one
:
This readily transforms into a star-free expression for the language of stacked words with exactly one in row
:
Here is the union of all stacked characters that have a
in entry
, and
is similarly defined. This is where the examples in the previous post about finite unions inside stars come in handy. The upshot is that
is a expression that enforces the “format condition” over all rows. We will use this at the crux of the proof; whether
is really needed is a question at the end. If we take the format for granted, then the main advantage of our stacked words will be visualizing how to decompose automata according to when the
in a critical row is read.
From FO To SF
The proof manipulates formulas having the
free variables shown. The corresponding language
is the set of stacked words that make
true, where for each
,
is the position of the lone
in row
, which gives the value of
.
Theorem 1 For all formulas
over
,
.
Without loss of generality we may suppose is in prenex form, meaning that it has all quantifiers out front. If there are
variables of which
are free, this means
where each is
or
and
has no quantifiers. If we picture the variables
as once having had quantifiers
, then the proof—after dealing with
—restores them one at a time, beginning with
. That is the induction. We prove the base case
here and the induction case in the coming sections.
Proof: Quantifier free formulas in are Boolean combinations of the atomic predicates
and
. Here
is only for the characters
, not the stacked characters
. But inside
(when we have
) we have to deal with the whole stack to represent
. Putting
, the expression can be viewed schematically as:
As illustrated above, using the initial lemma in the part-1 post, this is equivalent to a star-free regular expression. The other atomic predicate, , is true when we have a stacked word
of the form:
It is not important to have ; these are just the labels of the variables. Their values
must obey
, meaning that the
in row
comes in an earlier column than the
in row
. The characters of
in the top row are immaterial. We can capture this condition over all strings
over
by the expression:
Again by the lemma in Part 1, this yields a star-free expression. To complete the basis, we note that if and
have star-free expressions
and
corresponding to
, then
is represented by
and
by
, both of which are clearly star-free.
Strategy For the Induction Case
Assume that all formulas in with
quantifiers have star-free translations. We will show that any formula in
with
quantifiers also has a star-free translation. Let
where
is a formula in
that has
total variables,
free variables and
quantifiers. We are allowed to assume the first quantifier in
is existential as
and we observed above that the star-free expressible formulas in
are closed under negation.
The language of is given by:
we can add a
row to each of the
characters of
using
minus one
s and one
such that the resulting string
belongs to
We must show that is star-free. As
has
quantifiers, then by assumption
is star-free regular.
Note that has one more row than
because it has one more free variable. Without loss of generality we may suppose it is the bottom row.
The Tricky Point
Since we have an existential quantifier we might think we can simply delete the last row of . But here is a counterexample to show why we must be careful not to “lose information”:
Let
and
Notice that equals
with its final row removed. We have
and
Then is the language of strings with at least three zeroes in a row and
is the language of strings with at least two zeroes in a row.
We cannot add a second row consisting of zeroes and one 1 to each word in
to force
to be in
. For a trivial example, consider the word
The fault can be interpreted as not being recoverable uniquely from
owing to a loss of information. We could also pin the fault in this case on the central
, noting that
is generally not equivalent to
.
Either way, this means we must be careful with how we express (the final row of) . The idea is to give
in segments such that the final row of each segment is all-zero or a single
, so that the act of removing the final row could be inverted. This is the crux of the proof. To handle it we employ the unique DFA for the regular expression already obtained by induction for
.
Main Part of the Proof
By the Myhill-Nerode Theorem, there must exist a unique minimal automaton over the alphabet of our stacked words defining the language
of
.
The state set of
can be partitioned into the set
of states before any
is read in the final row, the set
of states reached after reading a
in the final row, plus a dead state
belonging to neither
nor
. The start state belongs to
and all accepting states belong to
. Since any stacked word and therefore accepting computation has exactly one
in any row, all arcs within
and
must have a only
s in their final row, while the letters connecting states in
and
must have a
in their final row. The minimality of
enforces these properties. Here is a sketch, omitting
:
Let be the set of edges that cross from
to
. With
we can label their origins by states
(not necessarily all distinct), and similarly label their characters by
and the destination states by
. Then
collects all the characters used by
that have a
in row
(except those into or at the dead state
). We can identify
with the set of instructions
.
Finally let denote the set of strings that take
from state
to state
, and let
. Then we can define
by the expression
What remains is to find expressions for
and
for each
. The latter is easy: Pick any string
in
. Then
The closure of under left quotients supplies expressions
for
.
The case of , however, is harder. What we would like to do is choose some (any) string
that goes from
to an accepting state and claim that
. The problem is that
might be accepted from some state
that has an incoming arc on the same character
from some other state
. Then
, which is disjoint from
, is included also in
. There may, however, be strings
and
such that
is not accepted by
. Then the string
would be wrongly included if we substituted
for
(or for
) in (1).
There is also a second issue when crossing edges on the same character come in from distinct states
to the same state
. To fix all this, we need to use sets of strings that distinguish
from all the other states
. This requires the most particular use of the minimality of
: If
, then there is either (or both):
-
a string
in
, or
-
a string
in
.
The strings in question need not begin with a character in —they need not cross right away. Let
stand for a choice of strings of the former kind,
the latter kind, covering all states
. Then
We could not use just the complement of
because that would allow strings that violate the “one
” condition in the rows. Note that
. Whether one can do the induction for
and
in tandem without invoking
is a riddle we pose at the end.
Either way, the closure of under right quotients yields a star-free expression
for
. Then we just have to plug our expressions
and
into (1) to get a star-free expression
for
. Now we are ready for the crude final step:
Form
by knocking out the last entry in every character occurring in
. Then
represents
.
To prove this final statement, first suppose is a stacked word in
. Then
satisfies
, so there is a value
so that
satisfies
with
set equal to the value
. This means that the stacked word
formed from
and a last row with
in position
belongs to
and hence matches
. Since
is obtained by knocking out the last row of
,
matches
.
The converse is the part where the tricky point we noted at the outset could trip us up. Suppose matches
. From the form of (1) as a union of terms
, we can trace how
is matched to identify a place in
that matches a “middle character”
obtained from a
in a crossing edge. Putting a
in a new row underneath this place and
s elsewhere hence makes a word
that matches the unique regular expression
obtained by appending a
component to every “middle character”
and a
to all other characters in
. Then
is equivalent to
, so
satisfies
. Since the last row that was added to
amounts to supplying a value
for the variable
, it follows that
satisfies
with
, hence satisfies
. This yields
and so the entire induction goes through.
Open Problems
First, as an end note, the end of the proof wound up different from what we envisioned until polishing this post. It originally applied the distinct-states argument to states on the far side of the crossing, but we had to backtrack on having previously thought that the “second issue” did not matter. Two other questions we pose for our readers:
-
Do the definitions of
and
need to extend over all states
, not just the other states
on the border?
-
Is
needed in (2), or in the proof overall?
A larger question concerns how the size of the translated expressions increases as we add more quantifiers. We discuss the blowup in terms of quantifier depth and length
. Let
be a first-order formula with no quantifiers and let
denote
with
quantifiers applied to it. Also let
be a star-free translation of
and the length of
, denoted by len
, equal
.
To obtain from
we represent
by a union over crossing edges between
and
and the final states of
of the concatenation of prefix and suffix languages and a crossing character between them. We have that
gives a decent approximation for—worst-case—length of the prefix languages, suffix languages and the product of crossing edges and final states. Using these approximations we have
. Iterating gives
,
, and in general,
. This says that our blowup could be doubly exponential. Is there a tighter estimate?
Our final question is: how well do our “stacked words” help to visualize the proof? Are they helpful for framing proofs of equivalences between language classes and logics that are higher up?









Shouldn’t a proof of Theorem 1 provide an Algorithm that, given an FO[<]-formula, outputs a corresponding r.e. expression?
Thanks for noting that desire. Our proof does indeed yield a recursive procedure to compute the SF expression. It needs to include the computation of expressions for left and right quotients from the lemmas in the Part 1 post. Then you carry out each step of the induction beginning with the quantifier-free matrix of the formula, and computing the minimal DFAs at each step of re-introducing a quantifier. Under “Open Problems” we note that the size blowup from this algorithm is, however, severe. We would love to know if there is a (proof that yields a) more-efficient algorithm.