Phil 455: Strings Part 1

Numerals and Numbers

We call the linguistic symbols "0", "1" and so on digits. When one or more (but finitely many) of these are strung together, we call that a numeral. We could also allow a numeral to include an initial sign, and/or to include a single period, and perhaps to include a bar over a trailing portion that begins sometime after the period. But I’ll ignore these possibilities here. Each numeral has a finite length. If it’s a “base 10” numeral, then the digits it’s allowed to be made of range from "0" to "9". If it’s a “base 2,” or binary, numeral, then only the digits "0" and "1" are allowed. Other bases are also possible. We’ll work with base 3, or ternary, numerals in one upcoming homework. Some bases have more than ten digits, and after "9" the convention is to start using letters from the Latin alphabet, so "a" or "A" are digits representing the number ten, "b" or "B" are digits representing the number eleven, and so on.

Understood in this way, numerals are linguistic symbols. Moreover, on the assumptions that there are only finitely many digits, and that each numeral has a finite length (but unbounded — there is no maximum length), there are only countably many numerals. We can see this by enumerating the numerals (recall Homework Problem 18), like this: First include all the length 1 numerals in alphabetical order; next include all the length 2 numerals in alphabetical order; and so on.
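As a rough illustration of that enumeration, here is a minimal Python sketch. The function name `numerals` is made up for this example, and the sketch does not filter out numerals with leading zeros.

```python
from itertools import count, islice, product

def numerals(digits):
    """Yield every finite, non-empty string of digits: shortest first,
    and in alphabetical order within each length."""
    ordered = sorted(digits)
    for length in count(1):              # lengths 1, 2, 3, ... with no maximum
        for combo in product(ordered, repeat=length):
            yield "".join(combo)

# The first eight binary numerals in this enumeration.
first_eight = list(islice(numerals("01"), 8))
print(first_eight)  # ['0', '1', '00', '01', '10', '11', '000', '001']
```

Every numeral shows up at some finite position in the list, which is all that countability requires.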

Fancier enumerations are also possible. For example, we could first include all the length 1 numerals that use only the digit "0" (there is just one of these): call these numerals whose max digit is "0". Next we include all the length 2 numerals that use only "0". Here there’s an awkwardness that usually we only want to allow a single numeral that begins with the digit "0", namely the one with length 1. But I’ll ignore that for now. Then all the length 1 numerals that use "1" and that may also (but need not, and if they’re length 1 don’t have room to) use "0": call these numerals whose max digit is "1". Then all the length 3 numerals that use only "0"; then all the length 2 numerals whose max digit is "1"; then all the length 1 numerals whose max digit is "2". Then all the length 4 numerals that use only "0"; then all the length 3 numerals whose max digit is "1"; then all the length 2 numerals whose max digit is "2"; then all the length 1 numerals whose max digit is "3". And so on. If you think about this fancier enumeration, you’ll see that the pattern continues working even if there are countably infinitely many digits. (We keep the assumption that each numeral has a finite, but unbounded, length.) In that case too, there are only countably many numerals.
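The fancier enumeration can be sketched in Python as follows. This is only an illustration under some assumptions of mine: the digits are modeled as the integers 0, 1, 2, ... (so that there is an endless supply of them), each numeral is modeled as a list of such digits, and the name `numerals_by_stage` is invented for the example.

```python
from itertools import count, islice, product

def numerals_by_stage():
    """Enumerate finite sequences over the digit supply 0, 1, 2, ...
    Stage k lists the sequences whose length plus max digit equals k,
    starting with max digit 0 (the longest) and ending with max digit k - 1."""
    for k in count(1):
        for max_digit in range(k):
            length = k - max_digit
            for seq in product(range(max_digit + 1), repeat=length):
                if max(seq) == max_digit:    # must actually use the max digit
                    yield list(seq)

first_ten = list(islice(numerals_by_stage(), 10))
print(first_ten)
# [[0], [0, 0], [1], [0, 0, 0], [0, 1], [1, 0], [1, 1], [2], [0, 0, 0, 0], [0, 0, 1]]
```

Each stage is finite, so any given sequence is reached after finitely many steps, even though the supply of digits never runs out.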

Usually theorists want to distinguish numerals, understood as linguistic symbols, from the numbers we use those symbols to talk about. Some theorists controversially choose instead to identify numbers with numerals — this is a position in the philosophy of mathematics called formalism. Of course, since we only have countably many numerals, we can’t identify them with the real numbers, of which there are uncountably many. But formalists will instead identify the reals with certain kinds of mathematical constructions using (possibly infinite) sets and/or sequences of symbols. Formalism used to be more popular, but nowadays most theorists want to distinguish numerals and numbers, even when the numbers they’re talking about are just the countably many ℕs. We’re not going to get into the debates about this. Even among theorists who do think that fundamentally, there’s a difference between the numeral or symbol "2" and the number two, there is of course a bijection between (a certain choice of) numerals and the ℕs (we sketched two in the preceding paragraphs), and where there’s a bijection, there may be an isomorphism (this will depend on what additional structure one assumes for the numerals and the numbers). And for many mathematical purposes isomorphic structures can be regarded as equivalent.

We’re going to talk as though numerals like "2" and "20" are different from the numbers they’re standardly understood to designate. But often we’ll be saying things which will be true for both (perhaps after translating some numeral-talk into number-talk, or vice versa).

Sequences and Strings

We assume we have some non-empty set of what we’ll call atoms. These can be anything you want. As an example, let our set of atoms be the numbers {0, 1, 10}. A sequence or list is an ordered collection of zero or more of those atoms. Repeats are allowed, and make a difference to the result. We count the length 0 collection, which contains no atoms, as also being a sequence. Each of these is a sequence, and is different from each of the others:

[]
[0]
[0, 0]
[1, 0, 0]
[0, 1, 0]
[0, 10]

The length 0 or “empty” sequence is written here as []. Sometimes it’s instead called nil (not to be confused with the empty set ∅, which is pronounced “the null set”).

Instead of [0, 1, 0], some authors write ⟨0, 1, 0⟩ or (0, 1, 0). I’m using those notations instead for n-tuples (ordered pairs, or triples, etc), which I distinguish from sequences. Some authors don’t distinguish these.

Often theorists talk about strings instead of sequences. Mathematically, there’s no difference. This is just a different perspective or way of talking about sequences.

Strings in this mathematical sense are a more general/abstract kind of structure than you may be accustomed to thinking of as a “string.” For example, the sequences we talked about a moment ago had numbers as atoms, which we’re saying are not the same as any linguistic symbols standardly used to designate them. Similarly, we can take the people in our class — the people themselves, not their names — as our atoms, and then talk about sequences or strings built from those atoms. Or planets; or sets; or other sequences.

When we go in for the “string” way of talking about sequences, though, usually we’ll be thinking of the atoms as being linguistic symbols — like digits, or characters from the Latin or Greek or some other alphabet, or words from the English dictionary, or sentences in Spanish, or so on.

If our atoms are Latin letters like "t", "h", and "e", then one sequence built from them is:

["t", "h", "e"]

We might also allow punctuation symbols, like spaces, among our atoms, and then we could have sequences like:

["t", "e", "e", space, "t", "h", "e"]

It’s conventional to write such sequences like this:

"tee the"

Using this notation, the empty sequence or string [] (also called nil) could be written:

""

The empty string is sometimes instead written as Λ or as ɛ or empty. (Again, not to be confused with the null set ∅. And note the difference between the symbol ɛ and the ∈ symbol that refers to the relation of being a member of a set. Both are lowercase Greek epsilons, but they use different typography.)

Here are two things that may be confusing. First, instead of single characters, the atoms might instead include English words like "tee" and "the". With atoms like that, one possible sequence is:

["tee", "the"]

And that could conventionally be written like this (here we use spaces to indicate the separation between words):

"tee the"

The atoms we’re working with might also include the word "teethe" (that is, what babies do), and the length 1 string or sequence built from that atom would be different than the length 2 sequence described a moment ago. (This is why we add the space to make it clear we’re talking about the sequence ["tee", "the"] instead of the sequence ["teethe"].)

So if you see notation like "tee the" or "teethe", you can’t tell just from that whether the alphabet is {"t", "h", "e", space} or instead {"tee", "the", "teethe"}, with spaces just serving to mark the difference between examples like ["tee", "the"] and ["teethe"].
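To make the ambiguity concrete, here is a small Python sketch that models sequences as tuples. The names `over_characters`, `over_words`, and `display` are made up for the example.

```python
# Two different sequences that are both conventionally written "tee the".
SPACE = " "
over_characters = ("t", "e", "e", SPACE, "t", "h", "e")  # alphabet {"t", "h", "e", space}
over_words = ("tee", "the")                              # alphabet {"tee", "the", "teethe"}

def display(seq, separator=""):
    """Render a sequence in the conventional quoted notation."""
    return separator.join(seq)

print(display(over_characters))                # tee the
print(display(over_words, " "))                # tee the
print(len(over_characters), len(over_words))   # 7 2
print(over_words == ("teethe",))               # False
```

The two sequences print the same way, but they have different lengths and are built from different atoms.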

The second thing that may be confusing is the technical vocabulary mathematicians use to talk about sets of atoms and the sequences or strings one can make out of them.

- They call the set of atoms an alphabet or vocabulary. Whatever it’s made of. Even if it’s made of words like "the" instead of letters like "t"; even if it’s made of numbers like the number two, rather than numerals like "2"; even if it’s made of people like us instead of anything linguistic like our names. Whatever the set of atoms is, that set is called an alphabet.
- They call the atoms themselves letters or symbols. Even if they’re numbers or English words or people. In the mathematical jargon for talking about sequences or strings, anything playing the role of an atom gets called a letter.
- Whatever the letters are, they call a sequence built from them a string or word. Even if the alphabet is people instead of names. Even if the alphabet is words from the English dictionary like "tee" and "the". They’d call those atoms letters, and sequences like ["tee", "the"] (also written as "tee the") they’d call a string or word. I will avoid this way of talking about “words,” but I’m telling you about it because you may see it in other sources. We’re still going to call the sequences “strings,” even if the “letters” we’re working with are people or numbers.

The alphabet (set of “letters” or atoms) is always assumed to be non-empty. Generally, it’s assumed to also be finite, but sometimes it’s allowed to be countably infinite, and sometimes even allowed to have a higher cardinality. (There’s no principled obstacle to taking the real numbers to be your alphabet, for example.) But when it’s not explicitly stated, assume that the alphabet is a (non-empty) finite set.

Generally, we’re going to work with sequences or strings understood so that each string has a finite length. In some contexts, theorists also work with strings allowed to be infinitely long. But unless we explicitly allow that, assume each string has a finite length.

Generalizing our remarks above about numerals (strings whose alphabet consists of a set of digits), the set of strings with finite (but unbounded) length is countably infinite. (As we observed before, this is still the cardinality even when the alphabet is allowed to be countably infinite.)

Some notation you will see is that when the alphabet set is Σ or Α, then the set of all finite strings built from that alphabet is referred to as Σ^* or Α^*. The ^* superscript here means “a sequence of length 0 or more.” If you want to exclude the empty string, you’d instead talk about Σ⁺ or Α⁺, with the ⁺ superscript meaning “a sequence of length 1 or more.”

These sets of all the strings you can make from a given alphabet are called languages. Any subset of them is also called a language. That is, in this sense, a language over an alphabet is any set of the strings buildable from that alphabet. You don’t have to include the empty string, but you may. You don’t have to include any strings at all. (And the language containing no strings is different from the language containing the empty string and no others.) When we want to talk about languages which leave out some of the strings you can make from the relevant alphabet, I’ll tend to talk about it as a restricted language. (This isn’t standard vocabulary, I’m just introducing it to help you keep track.)

Later we’ll be talking about systematic ways to specify or describe restricted languages. For the time being, we’ll mostly be talking about unrestricted languages, and if we ever need or want to talk about a restricted language, I’ll describe it using English. For example, back in Homework Problem 17, we talked about the set of strings (of length ≥ 0) made from the letters "a" and "b" that never have an "a" occurring after a "b". This is a restricted language, whose alphabet is the set {"a", "b"}. "baa" is a string built from that alphabet, but it’s not a member of the restricted language described. English is another restricted language. It includes the string "the mouse eats cheese", but does not include strings like "cheese the eats". The logical languages we look at will also be restricted. The language of first-order predicate logic will include strings like (Fa ∨ ∃y Rxy), but exclude strings like ∨ ∃y.
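For the {"a", "b"} example, membership in the unrestricted and the restricted language can be checked with a short Python sketch. The function names here are invented for illustration, and the atoms are assumed to be single characters.

```python
ALPHABET = {"a", "b"}

def in_unrestricted(s):
    """True if s is built only from letters of the alphabet."""
    return all(letter in ALPHABET for letter in s)

def in_restricted(s):
    """True if s is over the alphabet and no "a" occurs after a "b"."""
    seen_b = False
    for letter in s:
        if letter not in ALPHABET:
            return False
        if letter == "b":
            seen_b = True
        elif seen_b:              # an "a" showing up after some "b"
            return False
    return True

print(in_unrestricted("baa"), in_restricted("baa"))   # True False
print(in_restricted(""), in_restricted("aabbb"))      # True True

# The language with no strings differs from the language holding only "".
print(set() == {""})                                  # False
```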

Another notion we’ll need is concatenation, which I’ll write as ⁀. You can use the caret symbol ^ above the 6 on a US keyboard if you find that easier. In different programming languages it’s symbolized in a variety of ways. Off the top of my head, I’ve seen + and ++ and .. and @. In logic/math texts, the usual notation is ⁀ but you may also see "tee" ⁀ "the" expressed as "tee" ⋅ "the" or just "tee" "the" (simply juxtaposing the strings).

Concatenation joins any two sequences or strings together into a (possibly) longer one. For example:

["tee", "the"] ⁀ ["tee"] = ["tee", "the", "tee"]
"tee the" ⁀ "tee" = "tee the tee"

and:

["t"] ⁀ ["h", "e"] = ["t", "h", "e"]
"t" ⁀ "he" = "the"

and:

"the" ⁀ ɛ = "the" = ɛ ⁀ "the"

As we’ve already noted in previous classes, the concatenation operation is associative, but not commutative and not idempotent.
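These properties are easy to spot-check in Python, where + plays the role of ⁀ on both tuples and built-in strings. The name `concat` and the variable names below are just for this sketch, and a few passing examples illustrate the properties rather than prove them.

```python
def concat(alpha, beta):
    """Join two sequences (tuples or Python strings) end to end."""
    return alpha + beta

EMPTY = ""

# Examples from the text, with word atoms and with character atoms.
words_example = concat(("tee", "the"), ("tee",))   # ("tee", "the", "tee")
chars_example = concat("t", "he")                  # "the"

# Associative: grouping does not matter.
associative = concat(concat("t", "h"), "e") == concat("t", concat("h", "e"))
# Not commutative: order matters.
commutes = concat("tee", "the") == concat("the", "tee")
# Not idempotent: joining a string to itself usually gives a new string.
idempotent = concat("the", "the") == "the"
# The empty string is an identity on both sides.
identity = concat("the", EMPTY) == "the" == concat(EMPTY, "the")

print(associative, commutes, idempotent, identity)  # True False False True
```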

For any string like "th", if we write it with a numerical superscript, like "th"³, that means the string which is the result of concatenating three instances of "th":

"th"³ = "th" ⁀ "th" ⁀ "th" = "ththth"

For "th"¹, that’s just "th". For "th"⁰, that’s understood to be the empty string.
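The superscript notation amounts to repeated concatenation starting from the empty string. A minimal Python sketch, with the name `power` chosen only for this example:

```python
def power(s, n):
    """Concatenate n instances of s, starting from the empty string."""
    result = ""
    for _ in range(n):
        result = result + s
    return result

print(power("th", 3))        # ththth
print(power("th", 1))        # th
print(repr(power("th", 0)))  # ''
```

Python's built-in `"th" * 3` computes the same thing.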

We’ll sometimes talk about prefixes of a string. For the string "tee", its prefixes include "t" and "te". As we do with subsets, we also count the empty string as a prefix, and we also count the whole string as a prefix of itself. So all four of these strings are prefixes of "tee":

""
"t"
"te"
"tee"

If α is a prefix of the string "tee" — generally I will use Greek variables to designate or stand for strings — then there is some string β such that α ⁀ β is identical to "tee". The notion of a suffix is understood similarly.
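Here is a short Python sketch of prefixes and suffixes, assuming single-character atoms. The helper names are invented for illustration; `is_prefix` follows the definition just given by searching for a suitable β.

```python
def prefixes(s):
    """All prefixes of s, from the empty string up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """All suffixes of s, from s itself down to the empty string."""
    return [s[i:] for i in range(len(s) + 1)]

def is_prefix(alpha, s):
    """alpha is a prefix of s if some string beta makes alpha + beta equal s.
    Any such beta must be a suffix of s, so those are the only candidates."""
    return any(alpha + beta == s for beta in suffixes(s))

print(prefixes("tee"))         # ['', 't', 'te', 'tee']
print(suffixes("tee"))         # ['tee', 'ee', 'e', '']
print(is_prefix("te", "tee"))  # True
print(is_prefix("ee", "tee"))  # False
```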

When Σ is any alphabet, and Σ^* is understood as we explained to be the unrestricted language of all strings built from that alphabet, then as you’ll argue in Homework Problem 45, the structure ⟨Σ^*, ⁀, ɛ⟩ is a monoid. (It’s called the free monoid over the alphabet Σ.)

As we discussed in a previous class, the length function on strings is a homomorphism between that structure and the monoid ⟨ℕ, +, 0⟩. The length of a string counts how many atoms it’s made of, with repeats counting again for each occurrence. Thus with the alphabet {"t", "h", "e", space}, the string "tee the" has a length of seven. If the alphabet is instead {"tee", "the", "teethe"}, then the string "tee the" is understood as ["tee", "the"] and has a length of two.
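The homomorphism claim says that length turns ⁀ into + and turns ɛ into 0. The following Python sketch spot-checks this on a handful of sample strings; the sample list is arbitrary, and passing these checks illustrates the claim rather than proving it.

```python
from itertools import product

def length(s):
    """Number of atoms in the sequence, counting repeats."""
    return len(s)

samples = ["", "t", "he", "tee the"]

# length(alpha ⁀ beta) = length(alpha) + length(beta) for every sample pair.
preserves_operation = all(
    length(alpha + beta) == length(alpha) + length(beta)
    for alpha, beta in product(samples, repeat=2)
)
# The identity of the string monoid maps to the identity of ⟨ℕ, +, 0⟩.
preserves_identity = length("") == 0

print(preserves_operation, preserves_identity)  # True True
print(length("tee the"))                        # 7, with characters as atoms
print(length(("tee", "the")))                   # 2, with words as atoms
```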

Restricted languages may be, but need not be, closed under ⁀. In the example we discussed from Homework Problem 17, "b" and "aa" are strings in the language, but their concatenation "b" ⁀ "aa" (that is, "baa") is not.
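The failure of closure can be checked directly. In this Python sketch the restricted language is described with the regular expression `a*b*` (some "a"s followed by some "b"s), which is my own restatement of "no "a" occurs after a "b""; the function name is likewise made up.

```python
import re

def in_language(s):
    """Strings over {"a", "b"} in which no "a" comes after a "b"."""
    return re.fullmatch(r"a*b*", s) is not None

b, aa = "b", "aa"
print(in_language(b), in_language(aa))  # True True
print(in_language(b + aa))              # False: "baa" is not in the language
print(in_language(aa + b))              # True: "aab" is in the language
```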

Note that the arguments/operands of ⁀ are always themselves strings, though they may be strings of length 1 (or 0). In some contexts, it’s important to distinguish strings of length 1 from the “letters” they’re made of. For example, if our “letters” are numbers, we may not want to identify the number 10 with the length 1 sequence or string [10]. In other contexts, theorists don’t make any distinction between letters and length 1 strings. The present observation is that ⁀ only applies to arguments that you are counting as strings.

When we want to refer specifically to length 1 strings made from some alphabet, without taking a stand on whether these are the same as the letters or not, I will call the length 1 strings units. (This is terminology I just made up myself.) Recall: letters or atoms are the members of the alphabet, and units are the strings of length 1 that are made from a single atom. Sometimes letters/atoms are identified with units, sometimes they are distinguished. I will sidestep those issues by mostly talking about “units over an alphabet” from here on, instead of letters/atoms.

Defining Strings Formally

There are two strategies for defining strings formally.

One strategy uses the notion of “the first n members of ℕ,” for n ≥ 0. This means {x ∈ ℕ | x < n}, or in other words:
- when n is 3, this is the set {0, 1, 2}
- when n is 2, this is the set {0, 1}
- when n is 1, this is the set {0}
- when n is 0, this is the set {}, that is ∅
These sets may also be referred to as “initial segments of ℕ,” or as {0...n-1}. Keep in mind that the latter notation has to be understood so as to mean {0} when n = 1, and to mean {} when n = 0.

Given that notion, we can then define a sequence or string over an alphabet Σ as a total function from an initial segment of ℕ into Σ. If Σ is {"t", "h", "e"}, and n = 3, one such function will map 0 ↦ "t", 1 ↦ "e", and 2 ↦ "e". This function will be the sequence ["t", "e", "e"], also written as "tee".
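One way to picture this definition is to model a function as a Python dict from positions to letters. This is only a sketch: the helper names are invented, and a dict is just one convenient stand-in for a mathematical function.

```python
def as_function(s):
    """Represent a sequence as a function (dict) from {0...n-1} to its letters."""
    return {i: letter for i, letter in enumerate(s)}

def is_string_over(f, alphabet):
    """Check that f is a total function from an initial segment of ℕ into alphabet."""
    n = len(f)
    return set(f) == set(range(n)) and all(v in alphabet for v in f.values())

SIGMA = {"t", "h", "e"}
tee = as_function("tee")
print(tee)                                # {0: 't', 1: 'e', 2: 'e'}
print(is_string_over(tee, SIGMA))         # True
print(is_string_over({}, SIGMA))          # True: the empty string, with n = 0
print(is_string_over({1: "t"}, SIGMA))    # False: {1} is not an initial segment
```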

Sometimes authors define strings as functions from the domain {1...n} instead of {0...n-1}. Some programming languages follow the one convention, others follow the other. Many people develop strong preferences for the convention they’re most familiar with.

Another strategy for defining strings proceeds inductively. Here is how we define the notion of a “string over alphabet Σ” using this strategy. We help ourselves to primitive notions of “the empty string” and of the concatenation operation ⁀, and to the notion of a “unit over Σ.”
i. The empty string ɛ is a string over Σ.
ii. If α is a unit over Σ, and σ is any string over Σ, then α ⁀ σ is a string over Σ.
iii. Nothing else is a string over Σ (except what’s implied by i and ii).
This can look problematically circular, because clause ii appeals to the notion of “a string over Σ,” which is what we’re supposed to be defining. (It also helps itself to the notion of a “unit over Σ,” but we said that we’re taking that to be a primitive notion — not itself defined in terms of the notion of a string of a given length.)

But in fact this definition is fine, because you can think of clause ii as saying: If there are some σs that the definition already counts as “strings over Σ,” then here are some more things that also have to count as “strings over Σ.”

This is akin to how you learned to define the factorial function in math classes. Except here we are defining not a function but a kind of structured object.
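The inductive definition can be mirrored in Python by building strings as nested pairs. This sketch makes several assumptions of its own: ɛ is modeled as `None`, α ⁀ σ as the pair `(α, σ)`, and units are simply identified with letters. The `length` function is defined by recursion on that structure, much as factorial is defined by recursion on ℕ.

```python
EMPTY = None                       # plays the role of ɛ (the base clause)

def prefix_unit(unit, sigma):
    """The step clause: build unit ⁀ sigma from a unit and an existing string."""
    return (unit, sigma)

def is_string_over(x, alphabet):
    """The 'nothing else' clause: accept only what the first two clauses generate."""
    while x is not EMPTY:
        if not (isinstance(x, tuple) and len(x) == 2 and x[0] in alphabet):
            return False
        x = x[1]
    return True

def length(sigma):
    """Defined by recursion on how the string was built."""
    return 0 if sigma is EMPTY else 1 + length(sigma[1])

SIGMA = {"t", "h", "e"}
tee = prefix_unit("t", prefix_unit("e", prefix_unit("e", EMPTY)))
print(tee)                                     # ('t', ('e', ('e', None)))
print(is_string_over(tee, SIGMA))              # True
print(length(tee))                             # 3
print(is_string_over(("t", "oops"), SIGMA))    # False
```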

Why do we need clause iii? Suppose we were instead trying to define the notion of “an even number,” and we did it like this:
iv. 0 is an even number.
v. If k is an even number, then k + 2 is an even number.
And suppose we stopped there. That definition would tell us that the even numbers include at least {0, 2, 4, 6, ...}. Clauses iv and v do not tell us that 7 is an even number. But do they tell us that 7 is not an even number? They do not. Clauses iv and v are consistent with its also being the case that 5 is an even number, and so (by clause v) also 7 and 9 and so on. This is not what we want, so we should add an extra clause that says:
vi. Nothing else is an even number (except what’s implied by iv and v).
Another way to think of these inductive definitions is like this. Some of the clauses, like i and iv, are base clauses. (In these examples, there’s just one base clause, but sometimes there can be more.) They give us a set of positive cases. Other clauses tell us how to extend that set, by taking its closure under some operation. In the first example, the operation is prefixing some unit. In the second example, the operation is adding two. (In these examples, we’re just taking the closure under one operation, but sometimes there can be more.) The “nothing else” clause tells us we want the least set that’s closed in that way: the one that’s a subset of every other superset of the base set that’s closed under the relevant operation(s). The set we’ve specified in this way is the set of “strings over Σ,” or of “even numbers,” or of whatever notion we’re defining.
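The “least closed set” idea can be illustrated with a bounded Python sketch. Since a program cannot hold all of ℕ, the closure is cut off at an arbitrary `limit`; that cutoff, the function name, and the choice of 20 are assumptions made only for this example.

```python
def least_closed_set(base, step, limit):
    """Smallest set containing `base` and closed under `step`, cut off at `limit`."""
    result = set()
    frontier = set(base)
    while frontier:
        result |= frontier
        frontier = {step(x) for x in frontier if step(x) <= limit} - result
    return result

def add_two(k):
    return k + 2

evens = least_closed_set({0}, add_two, 20)
# A larger set that also contains 0 and is closed under adding two (up to the limit).
bigger = least_closed_set({0, 5}, add_two, 20)

print(sorted(evens))                # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
print(7 in evens, 7 in bigger)      # False True
print(evens < bigger)               # True: evens is a proper subset
```

Both sets satisfy the base clause and the closure clause, but only `evens` is the least such set, which is what the “nothing else” clause picks out.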

You’ll see some texts defining sequences/strings with the one strategy, and others with the other.

An advantage of the first strategy is that it extends naturally to infinite sequences: these are just functions whose domain is ℕ instead of initial segments of ℕ.

An advantage of the second strategy is that it extends naturally to more complex objects, like trees, that don’t have a flat linear structure.