I did something related in the past: https://github.com/RussellSprouts/6502-enumerator. It uses C++ templates to share an emulator implementation between z3-powered symbolic execution and actual execution. It was meant to find equivalence between random instruction sequences for peephole optimization, rather than optimizing a specific input sequence.
Shadow instructions are very interesting and cursed. I've seen them used in very limited circumstances for NOP-slides for timing code: https://www.pagetable.com/?p=669. It would be fun to see it applied in otherwise normal code. My enumerator wouldn't support this -- it didn't execute actual 6502 instructions from bytes -- it had its own internal representation for `the first arbitrary absolute pointer` or `the second arbitrary immediate constant`. These would either be initialized with random concrete values or z3 variables.
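The enumerator's C++ isn't shown here, but the idea of one emulator shared between concrete and symbolic execution can be sketched in Python: write each instruction step once, generic over the value domain. All names below are hypothetical, and a toy expression tree stands in for z3 variables.

```python
class Sym:
    """A tiny symbolic value: a named variable or an operator node."""
    def __init__(self, op, *args):
        self.op, self.args = op, args
    def __add__(self, other):
        return Sym('+', self, other)
    def __and__(self, other):
        return Sym('&', self, other)
    def __repr__(self):
        if not self.args:
            return self.op
        return '(' + (' %s ' % self.op).join(map(repr, self.args)) + ')'

def adc(a, operand, carry):
    """One shared ADC step: works for ints and Sym, since both support + and &."""
    total = a + operand + carry
    return total & 0xff  # wrap to 8 bits (the Sym domain just records the mask)

# Concrete execution:
print(adc(250, 10, 1))   # 5  (261 & 0xff)
# Symbolic execution through the very same code path:
print(adc(Sym('A'), Sym('imm0'), Sym('C')))   # (((A + imm0) + C) & 255)
```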
Your project was very much something I looked into when designing this! Fun to see you commenting.
But yes, different goals. I did look into using z3, but quickly found that it's pretty slow compared to just checking whether a test case passes when run through the candidate program.
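The "cheap tests first" idea can be sketched in a few lines of Python: run each candidate on a set of test inputs and reject on the first mismatch, so only survivors would need an expensive prover like z3. Names and inputs here are illustrative, not from either project.

```python
def spec(a):
    return (a * 10) & 0xff          # desired behavior: A*10 with 8-bit wrap

def passes_tests(candidate, tests):
    # Reject a candidate as soon as any test input disagrees with the spec.
    return all(candidate(a) == spec(a) for a in tests)

# In practice these would be random; fixed values keep the sketch deterministic.
tests = list(range(0, 256, 17))

good = lambda a: ((a << 1) + (a << 3)) & 0xff   # A*2 + A*8 == A*10
bad  = lambda a: (a << 3) & 0xff                # A*8: wrong

print(passes_tests(good, tests))   # True
print(passes_tests(bad, tests))    # False
```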
Interesting project and well written. That only made me miss some links to prior art more though.
iirc, there was a superoptimizer (I believe the term was coined and motivated in that paper) in the late eighties for the M68k. https://dl.acm.org/doi/pdf/10.1145/36206.36194 might have been that.
Interesting and fun read - we are well into the terrain of what was completely impossible to do back then. Now I can't wait to see a faster AppleSoft ROM ;-)
If you assume that A * 10 won't overflow, then ASL A always shifts a 0 into the carry flag (so no CLC is needed), and instead of using the undocumented RRA opcode you can just do:
sta $00  ; save A
asl a    ; A*2
asl a    ; A*4
adc $00  ; A*4 + A = A*5 (carry is 0 by assumption)
asl a    ; A*10
This is also 7 bytes, but is faster since adc $00 is 3 cycles, vs rra $00 being 5 cycles.
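Under the stated no-overflow assumption (A <= 25, so A*10 fits in a byte), the sta/asl/adc sequence can be checked exhaustively with a few lines of Python modelling the relevant 6502 semantics:

```python
def mul10(a):
    """Model sta $00 / asl a / asl a / adc $00 / asl a on an 8-bit A."""
    zp = a                      # sta $00
    carry = (a >> 7) & 1        # asl a shifts bit 7 into carry
    a = (a << 1) & 0xff
    carry = (a >> 7) & 1        # asl a
    a = (a << 1) & 0xff
    total = a + zp + carry      # adc $00 (carry comes from the last asl)
    carry = total >> 8
    a = total & 0xff
    a = (a << 1) & 0xff         # asl a
    return a

# With no overflow, each asl shifts out a 0 and the result is exactly A*10:
assert all(mul10(a) == a * 10 for a in range(26))
```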
The A = max(A, X) example is certainly interesting, but not very useful since it loops through the code twice (very slow) and assumes that $8a is available. The much faster obvious version only adds one byte:
stx $00
cmp $00
bcs done
txa
done:
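The obvious stx/cmp/bcs/txa version is easy to sanity-check exhaustively; here's a quick Python model of the compare-and-branch semantics (CMP sets carry when A >= operand, unsigned):

```python
def max_ax(a, x):
    """Model stx $00 / cmp $00 / bcs done / txa for 8-bit A and X."""
    zp = x                 # stx $00
    carry = a >= zp        # cmp $00: carry set iff A >= operand (unsigned)
    if not carry:          # bcs done not taken
        a = x              # txa
    return a               # done:

# Exhaustive over all 65536 input pairs:
assert all(max_ax(a, x) == max(a, x)
           for a in range(256) for x in range(256))
```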
Sure. Note that I picked those examples to demonstrate the two fairly quirky classes of things the optimizer tends to find. If the programmer has different requirements they can specify that, and it'll spit out the examples you gave (or something equivalent).
I like the idea of exhaustive search, which the simplicity of the 6502 seems ideally suited for, but the search speed seems a bit limiting. I wonder if there's potential for more generation restriction (e.g. code can only use a specific N bytes of zero page) and heavier search pruning to speed it up? If it could generate optimal 20-30 op sequences in semi-reasonable time, that'd make it very useful.
Demo coding is indeed the primary use case for this, and the reason I started tinkering on it in the first place. That, and people who make homebrew NES / C64 video games should find it fairly useful for optimizing tight loops and such.
"Although NESFab performance is good, writing assembly by hand can obviously surpass it"
These projects obviously have different goals. DeiMOS is for people already writing assembly who want optimal assembly.
NESFab is a compiler - apparently a very good one (hard to do for the 6502), but nonetheless not directly competing with people hand-writing assembly and looking to save every byte and/or cycle.
20-30 ops is probably impossible, unfortunately. The combinatorial explosion is just too enormous.
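A back-of-the-envelope calculation shows why. The per-slot figure of 500 opcode+operand combinations below is a made-up ballpark, not a measured number, but the shape of the growth doesn't depend on it:

```python
# Rough illustration of the combinatorial explosion in sequence length.
per_slot = 500  # hypothetical count of distinct opcode+operand choices

for length in (3, 7, 20):
    print(length, per_slot ** length)

# Even filtering ~1e9 candidates/second, 500**20 (~9.5e53) sequences would
# take on the order of 3e37 years -- brute force is hopeless at that length.
```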
https://en.wikipedia.org/wiki/E-graph
https://www.cs.cornell.edu/courses/cs6120/2025sp/blog/supero...
https://github.com/philzook58/awesome-egraphs
I imagine lots of demo coders glancing back and forth between that writeup and their own carefully hand-tuned assembly.