Phil 455: Other Encodings to/from Strings and Numbers

Sometimes one has a problem where it’s most natural to think of the input (or the output, or both) as being strings of some language, but one is working with a formal model where the input (or the output, or both) is expected to be a number (an element of ℕ). Or it might go the other way: it’s natural to think of the input and/or output as numbers, but you’re working with a formal model where they’re expected to be strings.

A related issue is when the formal model expects to receive just one number/string as input, and return just one as output, but it’s most natural to think of the input and/or output as being a collection of numbers/strings. (Here we’re just thinking about when you want a collection of outputs given all at once; we’ll talk later about algorithms that “enumerate” a sequence, or deliver it a bit at a time.)

Translating between sequences of strings and a single string

We already considered this in Homework 5 Problems 67–69.

Translating between a single string and a number

Generally we are assuming our strings are built from a finite alphabet (set of letters).

If the alphabet has only a single letter, then the strings built from it correspond straightforwardly to numbers: ɛ pairs with 0, "a" pairs with 1, "aa" pairs with 2, "aaa" pairs with 3, and so on. As we said, an abbreviation theorists often use is to write "a"ⁿ to mean n concatenated copies of "a". So "a" would be "a"¹, "aa" would be "a"², and so on. Then we can state the correspondence to numbers very concisely: "a"ⁿ pairs with n.

What if the alphabet has more than one letter?

In that case, one strategy would be to order the strings in a “quasi-alphabetic” way, and then pair each string with its position in that order. I say “quasi-alphabetic” because we have to choose the order here carefully. Suppose we used what’s called the “lexicographic ordering” from Homework 3 Problem 38. As we saw, that ordering has infinite chains embedded within it. So it would pair ɛ with 0, and then "a" with 1, "aa" with 2, "aaa" with 3, and so on. It would never get around to pairing the strings "aab", "ab", "b", and so on, with any numbers. So that ordering won’t work.

A different ordering that does work is the one I’ve sometimes used in class, where first we go through all the length 0 strings (ɛ), then all the length 1 strings in alphabetic order ("a" then "b"), then all the length 2 strings in alphabetic order ("aa" then "ab" then "ba" then "bb"), and so on. This is called a shortlex ordering. So long as we’ve decided which order the letters come in, and there are only finitely many of them, this determines a unique pairing between strings and numbers, in which every string and every number gets exactly one partner. We could use it to go from strings to numbers, or to go from numbers to strings.
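Here is a minimal sketch of the shortlex pairing in Python. The function names `shortlex_rank` and `shortlex_unrank` are my own labels for illustration, and the default two-letter alphabet "ab" is just an assumption matching the example above. The idea is that a string's position is the count of all shorter strings, plus its position among the strings of its own length.

```python
def shortlex_rank(s, alphabet="ab"):
    """Position of string s in the shortlex ordering (the empty string is 0)."""
    m = len(alphabet)
    # Number of strings strictly shorter than s: m^0 + m^1 + ... + m^(len(s)-1).
    shorter = sum(m ** i for i in range(len(s)))
    # Position of s among strings of its own length: read it as a base-m numeral.
    value = 0
    for ch in s:
        value = value * m + alphabet.index(ch)
    return shorter + value


def shortlex_unrank(n, alphabet="ab"):
    """The string sitting at position n in the shortlex ordering."""
    m = len(alphabet)
    length = 0
    # Skip past whole blocks of strings of length 0, 1, 2, ...
    while n >= m ** length:
        n -= m ** length
        length += 1
    # What remains is the string's position within its own length block.
    letters = []
    for _ in range(length):
        n, digit = divmod(n, m)
        letters.append(alphabet[digit])
    return "".join(reversed(letters))


print([shortlex_unrank(i) for i in range(7)])
# ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```

Going in either direction only requires knowing the order of the letters, which is passed in as the `alphabet` string.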

What if we have countably infinitely many letters? Usually we won’t. But in case we do, we could use the kind of ordering I discussed for numerals at the start of the notes for Strings Part 1.

Assuming we have a finite number m of letters, a different strategy uses an idea Alex suggested in an earlier class. This pairs off each letter with a number between 0 and m - 1, and then lets the string constitute a base m numeral. For example, if we have three letters "a" and "b" and "c", paired with 0 and 1 and 2 respectively, then "bbac" might be interpreted as a ternary numeral expression of 1⋅twenty-seven + 1⋅nine + 0⋅three + 2⋅one, which is thirty-eight. The only difficulty here is that whatever letter you pair with 0, if it comes at the start of the string it wouldn’t make a difference to the result. Thus in the encoding just suggested, "abbac" would also represent thirty-eight, even though it’s a different string.

Alex’s good suggestion was to understand all the numerals as if they had an implicit 1 at the start, so that "bbac" becomes implicit 1⋅eighty-one + 1⋅twenty-seven + 1⋅nine + 0⋅three + 2⋅one, which is one hundred and nineteen. And "abbac" becomes implicit 1⋅two hundred and forty-three + 0⋅eighty-one + 1⋅twenty-seven + 1⋅nine + 0⋅three + 2⋅one, which is two hundred and eighty-one. We can subtract two from all of these results, so that the simple string "a" gets mapped to one instead of three (implicit 1⋅three + 0⋅one - two). And we can treat the empty string specially, always mapping it to zero.
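The arithmetic in this example can be checked with a short Python sketch. The function names are hypothetical, and subtracting m - 1 in general is my own assumption: the text only says to subtract two in the three-letter case, and m - 1 is the offset that sends "a" to one for any m.

```python
def implicit_one_value(s, alphabet="abc"):
    """Read s as a base-m numeral with an implicit digit 1 in front."""
    m = len(alphabet)
    value = 1  # the implicit leading 1
    for ch in s:
        value = value * m + alphabet.index(ch)
    return value


def string_to_number(s, alphabet="abc"):
    """Implicit-1 encoding, shifted so that "a" maps to 1 and the empty string to 0."""
    if s == "":
        return 0  # the empty string is treated specially
    # Assumption: subtract m - 1 (which is two when there are three letters).
    return implicit_one_value(s, alphabet) - (len(alphabet) - 1)


print(implicit_one_value("bbac"))   # 119
print(implicit_one_value("abbac"))  # 281
print(string_to_number("a"))        # 1
```

Because of the implicit 1, "bbac" and "abbac" now get different numbers, which was the point of the suggestion.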

A different solution would be to instead interpret "a" and "b" and "c" as non-zero digits in a base four numeral; and not have any letter paired with the digit 0. The downside of that is that then some numbers won’t have any strings mapped to them. For some purposes, this may not matter.
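As a rough illustration of this alternative, here is a sketch that reads each of three letters as one of the digits 1, 2, 3 of a base four numeral. The function name and the choice to send the empty string to 0 are assumptions made for the example.

```python
from itertools import product


def nonzero_digit_value(s, alphabet="abc"):
    """Read s as a numeral in base len(alphabet) + 1, where no letter stands for 0."""
    base = len(alphabet) + 1
    value = 0
    for ch in s:
        value = value * base + (alphabet.index(ch) + 1)  # digits run 1..m
    return value


# Collect the numbers reached by all strings of length 0, 1, or 2.
reached = {
    nonzero_digit_value("".join(letters))
    for length in range(3)
    for letters in product("abc", repeat=length)
}
print(sorted(n for n in range(16) if n not in reached))
# [4, 8, 12]
```

The numbers printed at the end are ones whose base four numerals contain the digit 0, so no string is mapped to them.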

Your email client may use a variation of these strategies called Base64 to encode attachments. In that scheme, each of the upper- and lower-case Latin letters is treated as a digit, and so are 0..9, and so are two punctuation characters (usually + and /), giving 64 digits altogether. An email attachment, interpreted as a sequence of numbers, is translated into a base-64 string so that it can be safely conveyed through the email networks, which often reject data that isn’t normal textual characters.
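If you want to see Base64 in action, Python’s standard library has a `base64` module. The sketch below only shows a round trip on a three-byte input; the sample bytes are an arbitrary choice for illustration.

```python
import base64

raw = b"Man"                          # three bytes, i.e. the numbers 77, 97, 110
encoded = base64.b64encode(raw)       # four base-64 digits
decoded = base64.b64decode(encoded)   # back to the original bytes

print(list(raw))   # [77, 97, 110]
print(encoded)     # b'TWFu'
print(decoded)     # b'Man'
```

In practice the scheme works on the data three bytes at a time, turning each group into four base-64 digits, rather than treating a whole attachment as one enormous numeral.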

Translating between pairs of numbers and a single number

There are multiple ways to do this. One idea is based on a strategy for a “linear walk of an infinite 2D table” that we considered when discussing cardinality, where we walk through the table one diagonal at a time.

This would pair (0,0) with position 0, (0,1) with position 1, (1,0) with position 2, (0,2) with position 3, (1,1) with position 4, and so on.

For reference, here are the functions for going from the pair to their single-number encoding by this strategy, and then back again:

methodAFromPair(x,y) =def (x + y)⋅(x + y + 1)/2 + x

methodAToPair(n) =def let m = (sqrt(1 + 8⋅n) - 1)/2 in
                      let x = n - m⋅(m + 1)/2 in
                      let y = m - x in
                      (x,y)

In the second function, sqrt(z) is a function that returns the greatest element of ℕ whose square is ≤ z: thus sqrt(7) is 2. Also, the divisions by 2 throw away any remainder; this matters for the first line of methodAToPair.
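The two functions above translate fairly directly into Python. This is a sketch rather than anything official: the snake_case names are mine, `math.isqrt` plays the role of the rounding-down sqrt, and `//` is the division that throws away the remainder.

```python
from math import isqrt


def method_a_from_pair(x, y):
    """Encode the pair (x, y) as a single number by the diagonal walk."""
    return (x + y) * (x + y + 1) // 2 + x


def method_a_to_pair(n):
    """Recover the pair (x, y) from its single-number encoding."""
    m = (isqrt(1 + 8 * n) - 1) // 2   # which diagonal n sits on (m = x + y)
    x = n - m * (m + 1) // 2          # how far along that diagonal
    y = m - x
    return (x, y)


print([method_a_to_pair(n) for n in range(5)])
# [(0, 0), (0, 1), (1, 0), (0, 2), (1, 1)]
```

The printed list matches the pairing described above for positions 0 through 4.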

A different strategy is to pair (0,0) with position 0, then skip one and pair (0,1) with position 2, skip one again and pair (0,2) with position 4, and so on. Then we pair (1,0) with the first position left unpaired with (0,anything), namely position 1. We skip one unpaired position (position 3) and pair (1,1) with the next, being position 5, and so on. Then we pair (2,0) with the first position left unpaired with (0,anything) or (1,anything), namely position 3. We skip one unpaired position and pair (2,1) with the next, and so on.

For reference, here are the functions for going from the pair to their single-number encoding by this strategy, and then back again:

methodBFromPair(x,y) =def 2^x⋅(2⋅y + 1) - 1

methodBToPair(n) =def let x = cto(n) in let y = (n + 1)/2^{x + 1} in (x,y)

In the second function, cto(n) is a function that returns how many consecutive 1s there are at the right-side of the binary representation of n. (The name is an acronym for “count trailing ones.”) Since nineteen in binary is 10011, cto(19) = 2. As before, the division in methodBToPair throws away any remainder.
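Here is the same second strategy as a Python sketch. As before, the snake_case names are my own, and `cto` is written out by hand with bit operations rather than taken from any library.

```python
def cto(n):
    """Count trailing ones: how many consecutive 1s end the binary form of n."""
    count = 0
    while n & 1:      # while the last binary digit is 1
        n >>= 1       # drop that digit
        count += 1
    return count


def method_b_from_pair(x, y):
    """Encode the pair (x, y) as 2^x * (2y + 1) - 1."""
    return 2 ** x * (2 * y + 1) - 1


def method_b_to_pair(n):
    """Recover the pair (x, y) from its single-number encoding."""
    x = cto(n)
    y = (n + 1) // 2 ** (x + 1)   # division throws away the remainder
    return (x, y)


print(cto(19))                                       # 2
print([method_b_from_pair(0, y) for y in range(3)])  # [0, 2, 4]
print([method_b_from_pair(1, y) for y in range(2)])  # [1, 5]
print(method_b_to_pair(19))                          # (2, 2)
```

The printed positions agree with the walk-through above: the pairs (0, anything) take the even positions, and the pairs (1, anything) take every other position among those left over.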

Translating between triples of numbers and a single number

One expedient strategy for encoding (x, y, z) is to treat it like (x, (y, z)), and then use one of the methods discussed earlier for encoding pairs (twice). Or you could treat it like ((x, y), z) and do the same. Or you could come up with a custom strategy for dealing with triples instead of pairs.
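For instance, here is a sketch of the (x, (y, z)) approach, using the first pairing formulas from earlier (methodAFromPair and methodAToPair) for both steps. The names `pair`, `unpair`, `triple_to_number`, and `number_to_triple` are just labels I’ve chosen for the example.

```python
from math import isqrt


def pair(x, y):
    """Pair encoding: (x + y)(x + y + 1)/2 + x."""
    return (x + y) * (x + y + 1) // 2 + x


def unpair(n):
    """Inverse of pair."""
    m = (isqrt(1 + 8 * n) - 1) // 2
    x = n - m * (m + 1) // 2
    return (x, m - x)


def triple_to_number(x, y, z):
    """Encode (x, y, z) by treating it as (x, (y, z))."""
    return pair(x, pair(y, z))


def number_to_triple(n):
    """Undo triple_to_number: peel off x, then split the rest into y and z."""
    x, yz = unpair(n)
    y, z = unpair(yz)
    return (x, y, z)


print(triple_to_number(1, 2, 3))   # 172
print(number_to_triple(172))       # (1, 2, 3)
```

Treating the triple as ((x, y), z) instead would only change which two components get paired first.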

Translating between finite sequences of numbers and a single number

What if you have a finite sequence of numbers that you want to translate to a single number, but you don’t know in advance how long the sequence will be? Let’s say you have the length n sequence [x₀, x₁, ..., xₙ₋₁], assuming n ≥ 1. Then you might treat that like the pair (n, xx), where xx is the encoding of the sequence as a pair, or triple, or … whatever kind of length n sequence you have. If you have a length 1 sequence [x₀], then xx will just be that single number x₀. Then the pair (n, xx) can itself be encoded using a strategy for encoding pairs.
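One way this could go is sketched below. It assumes the first pairing formulas from earlier (methodAFromPair and methodAToPair), nests the elements to the right as in (x₀, (x₁, (x₂, ...))), and only handles sequences of length at least 1. The function names are mine.

```python
from math import isqrt


def pair(x, y):
    """Pair encoding: (x + y)(x + y + 1)/2 + x."""
    return (x + y) * (x + y + 1) // 2 + x


def unpair(n):
    """Inverse of pair."""
    m = (isqrt(1 + 8 * n) - 1) // 2
    x = n - m * (m + 1) // 2
    return (x, m - x)


def encode_sequence(xs):
    """Encode a non-empty list of numbers as the pair (length, xx)."""
    if not xs:
        raise ValueError("this sketch only handles sequences of length >= 1")
    xx = xs[-1]                       # a length 1 sequence is just its one number
    for x in reversed(xs[:-1]):
        xx = pair(x, xx)              # build (x0, (x1, (..., x_last)))
    return pair(len(xs), xx)


def decode_sequence(code):
    """Undo encode_sequence: read off the length, then unnest that many times."""
    n, xx = unpair(code)
    if n == 0:
        raise ValueError("no sequence of length >= 1 has this code")
    xs = []
    for _ in range(n - 1):
        x, xx = unpair(xx)
        xs.append(x)
    xs.append(xx)
    return xs


print(decode_sequence(encode_sequence([1, 2, 3])))   # [1, 2, 3]
print(decode_sequence(encode_sequence([7])))         # [7]
```

The length has to come first when decoding, since it tells us how many times to split xx apart.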

If you wanted to also allow for empty (length-0) sequences, you’d have to modify this somewhat. (What should you use for xx in that case?)

A different strategy would be to let xx be 2^x₀⋅3^x₁⋅...⋅p^xₙ₋₁, where p is the nth prime. We’d still then need to encode (n, xx), in order to tell the difference between, for example, the sequence [1,2] and the sequence [1,2,0].
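To see why the length is still needed, here is a small Python sketch of the prime-power idea. The helper names are hypothetical, and the primes are found by simple trial division, which is fine for short sequences.

```python
def first_primes(k):
    """The first k primes, found by trial division against earlier primes."""
    primes = []
    candidate = 2
    while len(primes) < k:
        if all(candidate % p != 0 for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def prime_power_code(xs):
    """xx = 2^x0 * 3^x1 * ... * p^x_last, with p the len(xs)-th prime."""
    code = 1
    for p, x in zip(first_primes(len(xs)), xs):
        code *= p ** x
    return code


def prime_power_decode(n, code):
    """Recover a length n sequence by counting how often each prime divides code."""
    xs = []
    for p in first_primes(n):
        exponent = 0
        while code % p == 0:
            code //= p
            exponent += 1
        xs.append(exponent)
    return xs


print(prime_power_code([1, 2]))      # 18
print(prime_power_code([1, 2, 0]))   # 18
print(prime_power_decode(2, 18))     # [1, 2]
print(prime_power_decode(3, 18))     # [1, 2, 0]
```

Both sequences give the same xx, and it is only the separately supplied length that lets the decoder tell them apart.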

Most of these strategies would leave us with some numbers that don’t have any sequences paired with them. Again, for some purposes this might not matter; for others it will.

A different strategy would be to encode your sequence of numbers as a single string (see next entry), and then translate that back into a single number.

Translating between finite sequences of numbers and a single string

If you have a sequence of numbers x_i, you can encode each of them as a string, and then use the method mentioned above (from Homework 5 Problems 67–69) for encoding a sequence of strings as a single string.

A different strategy for pairing (a restricted selection of) strings with pairs of numbers is suggested in Homework 2 Problem 17, where (n, m) is encoded as "a"ⁿ ⁀ "b"ᵐ. This can be generalized to longer sequences of numbers, so long as you have enough letters to work with. (Alternatively, you could alternate between just two letters, but in that case you’d need to make sure that none of the numbers being encoded is 0.)
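A sketch of the pair case in Python might look like this. The function names are my own, and the decoder rejects any string that is not a run of "a"s followed by a run of "b"s, since only that restricted selection of strings is paired with anything.

```python
def pair_to_string(n, m):
    """Encode (n, m) as n copies of "a" followed by m copies of "b"."""
    return "a" * n + "b" * m


def string_to_pair(s):
    """Decode a string of the form a...ab...b back into (n, m)."""
    n = len(s) - len(s.lstrip("a"))   # how many leading "a"s
    m = len(s) - n                    # everything else should be "b"s
    if s != "a" * n + "b" * m:
        raise ValueError("not of the form a^n b^m: " + repr(s))
    return (n, m)


print(pair_to_string(2, 3))        # aabbb
print(string_to_pair("aabbb"))     # (2, 3)
print(string_to_pair(""))          # (0, 0)
```

With a separate letter for each position, zeros cause no trouble: (0, 3) is just "bbb".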

Of course, these various translations become tedious, but you don’t have to do them by hand. Just specify algorithms that are capable of doing them for you!