Finding all numbers produced by short, digitless 'dc' programs

Manual exploration

Stream of consciousness updates: More bugs in my search program, found and excised. More, shorter numbers. Maybe all 10-character programs.

A few years ago, possibly in 2014, I saw this code golf problem, which invites you write a short program that prints "2014"—the caveat being, the program may not contain any digits in its source code.

When New Years 2018 rolled around, I was thinking about this problem. I remembered that a fairly short submission for the 2014 puzzle had been for "dc", an arbitrary-precision RPN calculator available on most UNIX systems, including Linux and BSD. Documentation of dc

In dc, to add 1 and 2 and print the result, you write the program "1 2 + p". The numbers 1 and 2 are placed on a "stack", and when "+" is encountered, the most recent numbers are taken from the stack, their sum is computed, and that value is put back on the stack. When "p" is encountered, the top value is taken and printed. So, the output of "1 2 + p" is "3".

A pecularity of dc (well, it's really *all* peculiarity!) is that it has a configurable input and output base. When dc encounters "I", it puts the number representing the current input base on the stack. Normally, of course, this is the value 10. So if you want to write 10 in a dc program without using digits, you can just write "I". Thus, one solution to the 2018 problem becomes: Using only the number 10, but any of the usual mathematical operators, how do you form the number 2018? Here's a short version I came up with, my first "solution" to the 2018 problem in dc: III**I+II/-II/d+*p

Clear as mud. And 18 characters is pretty short, right? Let's see if we can do better...

The next peculiarity of dc is that it is designed to accept input numbers in a variety of bases from 2 to 16. If you're familiar with base 16, hexadecimal, you know that "A" represents the value 10, and "F" represents the value 15. And, similar to how decimal 91 represents 9*10+1, hexadecimal (base-16) AF represents 10*16+15, or decimal 175. Below, I represent the base by showing it in parens after the number, like AF(16).

You can change the "input base" by feeding any number from 2 to 16 to "i". This led to my second solution: "II*d+viABEOO*-O-p". First, by an unlikely sequence the value 14 is formed (v is sqrt, and rounds down). Then the base-14 value ABE is entered, which happens to be 2128 decimal. 110 is formed from "O"s (tens) and subtracted, giving 2018, and that value is printed.

Then I remembered something else: you can print a partial line with "n". This program prints 20 followed by 18, but without a space or newline, meeting the requirements set out to print "2018" using just 12 characters: "II+dnII/d+-p". But what a drag to have to subtract 2, when you could just change the *output* base to 12 and print the same value again. This is the shortest program I found "by hand" at 8 bytes: "Id+dnCop"

At this point I actually took the trouble to find the "2014" puzzle, and was pretty surprised to see a 6-character solution: "DiBBCp". The poster states: "D pushes 13 on the stack, even tho the input radix is 10 initially. i changes input radix (to 13) and BBC is 2014 base 13. p prints." Wow, who knew dc worked like that? Where does the peculiarity end? Unfortunately, that particular program form only works for the years of BBA(13) through BBF(13), AKA 2012-2017. One year too late!

Automating it

I didn't think I was going to have any epiphanies about clever dc programs, so I wondered if it was possible to search all the "likely" programs of 7 characters or fewer to find out if any printed "2018". Based on my own efforts, it seemed that these 18 characters are most useful: "+-/*ABCDEFIOdinopr". The program will always end in "p", so if the other 6 characters can be any of those 18, that's just 34 million alternatives to consider. I knew I should, for instance, eliminate programs with "+" in the first position, since that'll print "dc: stack empty" instead of a nice juicy number, but that's OK—surely most of the programs are valid, or at any rate it's just as fast to eliminate them when they turn out not to print numbers, rather than skip running them in the first place.

My first effort had to execute the "dc" program once for each input. Running 34 million programs actually takes quite a bit longer than generating the 34 million strings, so I didn't let it run to completion. However, it did find a 6-character solution for 2018: "CiDBEp"! Funny story, dc accepts all the letters A-F no matter the input radix. After setting base 12, it computes the value 13*(12^2) + 11*12 + 14, which turns out to be 2018. (canonically, 1202(12) is 2018(10)). Incidentally, for 2014, my program found CiDBAp which is clearly better than the stackexchange answer since it would appear earlier in a dictionary of all dc programs.

My exhaustive search program actually was programmed to print out the solutions it found for all values 1..9999. Imagine my sadness when there were gaps—not all those numbers were reachable by length-6 digitless dc programs. So a challenge remained: could I find programs for all of 1..9999 of the shortest length possible.

I believe I've done so. There were several more steps along the way:

At this point, I was able to run to completion on 7-character programs. But some gaps remained. Then the final idea: Remember when I wanted to exclude all those programs that start with "+"? Let's do that! In fact, let's record the prefixes of all valid programs. For instance, upon discovering that "Op" prints 10, record that a valid program starts with "O". But on discovering that "+p" is an error, ensure that you never start a program with "+" again. And so on, for programs of any length. (in general, if you have the valid prefixes of length N, try adding each character in the subset to the end of it, followed by "n". If that works, then it's a valid prefix, and also you should check the number to see if it's one you haven't encountered before. It turns out that all the valid length-7 prefixes fit in around 4GB of RAM.

Results

(Some of the results here refer to the version which used a flawed valid-prefix search, others have been updated for my new not yet posted version)

This let me run enough length-9 programs to capture all the numbers 0..9999. The first number is reported by my dc implementation, the middle by gnu dc, and the program is shown in the third column. The programs all end in "n" (print without newline), change them to "p" if you want the newline.

The smallest number for which my shortest program has 4 characters is 0: AI~n

The smallest number for which my shortest program has 5 characters is 6: Avd+n

The smallest number for which my shortest program has 6 characters is 7: AA E/n 19: Ad+Bon -- 7 is Adv-n but was not found earlier due to the subtraction bug.

The smallest number for which my shortest program has 7 characters is 59: AD C~nn61: Ad+E~nn -- 59 is AviFEn but was not found earlier due to the excluded prefix bug.

The smallest number for which my shortest program has 8 characters is 398: AFd*C*vn

The smallest number for which my shortest program has 9 characters is 1864: AAd+FDE+n2789: AAEFEI~/n -- 1864 is BF F*B-n but was not found earlier due to the subtraction bug.

(I think the largest number from a 4 character program is probably OF^n = 1000000000000000 but my program doesn't report it)

The characters in order of frequency of use are: nABCDFEdI|o* v+-/i^r. "O" (fetch output radix) is never used, probably because "A" and "I" come earlier in the list, providing two alternate ways to get the value 10 most of the time.I'll update the frequency information soon.

Build with: g++ -std=c++11 -O3 minidc.cc -lgmp -o minidc

Needs libboost and libgmp.

Run in "find all values" mode with: ./minidc

Run dc expressions from the commandline: ./minidc 'IF^n'

.. on my machine, the "find all values" mode eventually ran for 335 minutes and checked about 3 billion dc expressions. The naive program would have checked 22^8 = 54 billion expressions. So the savings by tracking valid prefixes was about 18x. It found just 259965 of the values below 1,000,000. The first missing value is 17519.I'll update this information soon.

Preliminary: The new program searched up to length 9 in about 490 minutes CPU time for about 7 billion expressions checked, finding 344930, all of which are checked by real dc. (actually, I think it's still on programs that start with "O", but those will only duplicate results of programs that begin with "I"...) Because the new approach does not retain all prefixes in RAM, it will start on length 10 programs next, but that will probably take a week to run to completion.

Sequences my program doesn't discover: I discounted many characters from the set of valid ones in 'dc' as less likely to contribute to shorter sequences. Of course, it turns out they do. For instance, "Azn" is a shorter program for 1 (I find "Ad/n" or somesuch instead) and can improve least 15 other values below one million, but I don't search for "z". At least 11 sequences are shortened by use of "a", such as "CCEdnan" for 13346, instead of the "AFd*BB+n" that my program finds. Implementing "z" would not be difficult (it just pushes the depth of the stack onto the stack), but implementing "a" would be: it means that stack variables would now need to be EITHER a string type (albeit just one character long) or a number type. Of course, growing the set of characters tested increases the space of programs exponentially. Discounting the ability of the program to prune the search space relatively effectively, going from e.g., 22 to 23 characters means approximately 1.5 times as many programs to search; going from 23 to 24 is a similar increase.

At some point, you will inevitably reach numbers that are best printed by "real programs" which contain e.g., loops, as opposed to code that runs just once from start to finish. For instance, 11235813​21345589​14423337​76109871​59725844​18167651​09461771​12865746​36875025​12139319​64183178​11514229​83204013​46269217​83093524​57857028​87922746​51493035​22415781​73908816​96324598​61023341​55165580​14126791​42964334​94437701​408733 is the concatenation of the first 44 Fibonacci numbers. That is printed by the relatively short digitless program '.zd[dnrdk+KdZA>x]dsxx' (adapted from a program on esolangs.org) but probably the same value can't be printed by any shorter program that doesn't use the 'x' character that (in this case) implements a clever tail recursive algorithm. Aside from the impossible search space of 21-character programs making 214-digit numbers, you also get into the puzzle of deciding terminating from nonterminating programs. These considerations put 'x' much further on the "not to be implemented" side of the continuum than 'a'.

More bugs:

My dc implementation evaluates '_2 11 1000000 |' ((-2) ** 11 mod 1000000) as 997952, but bc gives -2048. This affects at least 12 values, the earliest being the above (but spelled as 'IC-BIC^v|n') This is actually a good thing, as if gnu dc implemented modular exponentiation as I did, there would probably be a lot of uses for "_", which I don't implement.

A bug in dc?

Oddly, in gnu dc, "0 _2 100 | f" results in no error and prints the stack as "100 -2 0", i.e., evaluation of "|" is silently stopped and the original operands remain. Weird! (I reported this to the authors of dc)

Entry first conceived on 10 February 2018, 2:58 UTC, last modified on 17 February 2018, 17:31 UTC
Website Copyright © 2004-2017 Jeff Epler