Anyone trying to do this... the first thing you do is avoid lex/yacc/bison/antlr. You do not need all this ceremony. A recursive descent parser that uses Pratt parsing will work for a vast majority of cases.
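For anyone who hasn't seen the technique, here is a minimal sketch of a Pratt-style expression parser in Python. The token set, binding-power table, and tuple AST are made up for illustration; the point is that the whole precedence story lives in one dict instead of layered grammar rules:

```python
# Binding power per binary operator (higher binds tighter).
BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}

def tokenize(src):
    # Trivial hand-rolled tokenizer: integers, operators, parentheses.
    toks, i = [], 0
    while i < len(src):
        c = src[i]
        if c.isspace():
            i += 1
        elif c.isdigit():
            j = i
            while j < len(src) and src[j].isdigit():
                j += 1
            toks.append(src[i:j])
            i = j
        else:
            toks.append(c)
            i += 1
    return toks

def parse_expr(tokens, min_bp=0):
    tok = tokens.pop(0)
    if tok == "(":
        left = parse_expr(tokens)
        tokens.pop(0)  # consume ")"
    else:
        left = int(tok)
    # Keep consuming operators while they bind at least as tightly
    # as the caller's minimum binding power.
    while tokens and BINDING_POWER.get(tokens[0], -1) >= min_bp:
        op = tokens.pop(0)
        # Parse the right side with a higher minimum binding power so that
        # equal-precedence operators associate to the left.
        right = parse_expr(tokens, BINDING_POWER[op] + 1)
        left = (op, left, right)
    return left
```

Extending it to unary operators, calls, or indexing is a matter of adding cases, which is why it scales to real languages.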
The lexer/parser is never the bottleneck. In fact, you can write those two by hand over a single weekend for a largish language. With LLMs, it takes 15 minutes if you have an unambiguous spec.
The biggest time sink, and the reason you will fail for sure, is the inability to restrict the scope of the project. You start with a limited feature set and produce the entire compiler/vm toolchain. Then you get greedy and fiddle with the type system, adding features that you have never used and probably never will. And now you have to change every single phase from start to end.
I mostly give up at this stage.
Jonathan Blow wrote his own game engine, and for that he wrote his own programming language.
He went straight for a recursive descent parser and said the same thing.
I think compiler courses teach yacc, bison, etc., and that's where this whole thing came from, but in practice people discovered that hand-written recursive descent parsers are all you need.
> I think compiler courses teach yacc, bison, etc., and that's where this whole thing came from
Very true. I have a shelf full of books on compiler development and optimization. I have read them selectively, a chapter here, a chapter there. But that shelf is useless for a vast majority of people.
You might find it useful if you are developing a production-level compiler/vm (I cannot make this statement with a straight face while Python rules the world). But a simple and sensible architecture that uses recursive-descent parsing takes you a long way.
Most hobbyist compilers (and even some production ones) are written as a heavy front-end compiling down to C or LLVM. Very few people actually write their own backend.
I learned to do this about 2 years ago (pre LLM). I have been developing software for ~30 years, and somehow doing something like this was a major mental obstacle, mostly created by the perception of "the dragon book", as in this topic being full of mystical, unobtainable incantations, so I never even dared venture into this space. Silly, I know. However, after diving into this and learning to write a recursive descent parser for a DSL I wanted to write, it felt like I'd acquired a superpower. I totally understand that there are many more layers to all of this, layers that can get very complex, but just learning that first bit...
I wish people would start with Nystrom's https://www.craftinginterpreters.com/ and avoid the dragon etc unless they really, really need it. Almost everything I have learnt about compiler/vm development, I have done so by reading random blogs and articles on various aspects and small tutorials on writing parsers and vms.
Even stuff like Crenshaw's Let's Build a Compiler was more useful to me than all these books that do lexical analysis using regular expressions. I have written lexers and parsers hundreds of times for all kinds of DSLs and config languages and not once have I used regular expressions to scan the text.
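A hand-written scanner of that sort really is just a loop with one character of lookahead, no regular expressions needed. A rough sketch in Python; the token categories and keyword set are invented for the example:

```python
KEYWORDS = {"let", "if", "else", "while"}

def scan(src):
    tokens, i = [], 0
    while i < len(src):
        c = src[i]
        if c.isspace():
            i += 1
        elif c.isalpha() or c == "_":
            # Identifiers and keywords: consume the whole word, then classify.
            j = i
            while j < len(src) and (src[j].isalnum() or src[j] == "_"):
                j += 1
            word = src[i:j]
            tokens.append(("KEYWORD" if word in KEYWORDS else "IDENT", word))
            i = j
        elif c.isdigit():
            # Integer literals: a run of digits.
            j = i
            while j < len(src) and src[j].isdigit():
                j += 1
            tokens.append(("NUMBER", src[i:j]))
            i = j
        elif c in "=+-*/(){};<>,":
            tokens.append(("PUNCT", c))
            i += 1
        else:
            raise SyntaxError(f"unexpected character {c!r} at offset {i}")
    return tokens
```

Multi-character operators and comments are handled the same way: peek one more character and branch.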
I wrote a few of these due to an interest in compilers and hardware.
The easiest syntax to copy if you’re looking for a high level language is Smalltalk.
But most of the time, I wouldn't even use that. Simple imperative languages that look like BASIC work pretty well in most domains. If you simplify the syntax a little, it's very easy to understand the compiler and use it when, say, you want users to input code into existing systems.
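As a sketch of how little machinery a BASIC-flavored language needs, here is a toy line-oriented interpreter in Python. The three statement forms are invented for illustration, one statement per line, no expression nesting:

```python
def run(program):
    env, output = {}, []
    for line in program.strip().splitlines():
        parts = line.split()
        if parts[0] == "LET":      # LET x = 5
            env[parts[1]] = int(parts[3])
        elif parts[0] == "ADD":    # ADD x y  ->  x = x + (register y or literal)
            y = parts[2]
            env[parts[1]] += env[y] if y in env else int(y)
        elif parts[0] == "PRINT":  # PRINT x
            output.append(env[parts[1]])
        else:
            raise SyntaxError(f"unknown statement: {line}")
    return output
```

Because every statement starts with a keyword, the "parser" is just a split and a dispatch, which is exactly the property that makes this style friendly to end users typing code into an existing system.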
I have written compilers for two families over the years: C and ML. My current preference is Python. I am currently working on a statically typed language that is inspired by Python (minus objects and OOP) that runs on a register VM.
Syntax is a minor issue but something that people are very opinionated about. You could technically build multiple front ends that share the typechecking, CFG validation, optimization, register allocation and byte code emission phases. But it is too much work for what is presently a personal project.
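For readers unfamiliar with register VMs: unlike a stack VM, instructions name their registers directly, so the dispatch loop is very plain. A minimal sketch in Python; the opcodes and tuple encoding are made up (a real design would pack operands into integers):

```python
def execute(code):
    # Eight general-purpose registers; a real VM would size this per function.
    regs, pc = [0] * 8, 0
    while True:
        op, a, b, c = code[pc]
        if op == "loadi":    # regs[a] = immediate b
            regs[a] = b
        elif op == "add":    # regs[a] = regs[b] + regs[c]
            regs[a] = regs[b] + regs[c]
        elif op == "sub":
            regs[a] = regs[b] - regs[c]
        elif op == "mul":
            regs[a] = regs[b] * regs[c]
        elif op == "jnz":    # if regs[a] != 0, jump to instruction index b
            if regs[a] != 0:
                pc = b
                continue
        elif op == "ret":    # halt, returning regs[a]
            return regs[a]
        pc += 1
```

A shared middle end would emit this bytecode from the CFG, and any number of front ends could sit on top of it.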
I wrote my own interpreted language 25+ years ago to write online surveys. It made it easy to create complex surveys with many branches. I think I wrote it in Objective-C.
The team implementing the survey system wound up using the same language to implement the runtime portion, something I never expected or designed in.
I don't recall anything about what it looked like now. I do remember it was a lot of fun to write.
I had a similar surprise about how approachable PL is, but coming from "the bottom up" instead of from a normal language.
I wrote a compiler toolchain and debugger that takes a Turing machine description plus input string and emits an encoded tape runnable by a Universal Turing Machine [0]. I had some prior PL experience, but never did an end-to-end compiler pipeline, at least not this low level.
It started as a joke/experiment, but I couldn't believe how fast it pulled me into designing:
- a small low-level ASM for building the UTM
- an ABI for symbol widths and encoding grammar
- an interpreter used as the behavioral oracle
- raw TM transitions for each ASM instruction, generated by having an LLM iterate on candidate emissions and checked against the interpreter oracle
- a CFG-style IR to fix the LLM mess once direct ASM -> TM emission became too hard to keep sane (LLM did a decent job actually, I don't think I would have done a much better job without the IR either)
- a gdb-style debugger for raw transitions, ASM routines, and blocks
- a trace visualizer
- a bootstrapping experiment where an L1 UTM/input pair was itself run through an L2 UTM
- optimisation experiments
And every step came quite naturally and was easy to tie in with everything else. Each one was just the next local repair needed to make the previous layer tractable.

[0] Repo: https://github.com/ouatu-ro/mtm
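As a sketch of what such a behavioral oracle can look like, here is a minimal single-tape TM interpreter in Python: run the same machine description through it and through the UTM encoding, then compare results. The transition-table format here is an assumption, not the project's actual encoding:

```python
def run_tm(transitions, tape, state="start", blank="_", max_steps=10_000):
    # Sparse tape: cell index -> symbol; unseen cells read as blank.
    tape = dict(enumerate(tape))
    head = 0
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = tape.get(head, blank)
        # transitions: (state, symbol) -> (new_state, write_symbol, move)
        state, write, move = transitions[(state, symbol)]
        tape[head] = write
        head += 1 if move == "R" else -1
    cells = [tape[i] for i in sorted(tape)]
    return state, "".join(cells).strip(blank)
```

Checking LLM-generated transition tables then reduces to asserting that both executions reach "halt" with the same tape contents.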
Yes, it's true that someone can put together a simple language, like in a university course. The difficulties, as mentioned at the bottom of the post, are things like metaprogramming features or optimizing compilers.
The tail ends of a language implementation (parsing and code generation) are a fixed cost; the "middle end" can grow unbounded as more production-quality items are added.
My language: https://www.empirical-soft.com
I was very deep into .NET until recently but somehow I didn't know this existed. Looks like C# with extra Linq-to-SQL syntax; I guess it's a DSL made with Roslyn for ERP jobs? I wonder how they picked the name.
Like most things in programming, handling the easy stuff is easy, but it's all the edge cases that kill you. I'm writing an IDE in Flutter right now, and all of the defensive programming I have to do to handle the unhappy path is where 50% of my code goes.
I've been having a lot of fun building my own programming language [1]. Getting to the point where you can write programs in your own language was surprisingly easy.
The language, Sapphire, is Ruby inspired, so the most interesting part is digging into the internals of the latter when I'm trying to figure out how something should work.
[1] https://github.com/sapphire-project/sapphire
This project is pretty interesting, although I'm wondering how they're planning to address the "easy sandboxing" design goal in a compiled language with raw pointer arithmetic and C library interop... In that regard, I think Lua would have been a lot easier to sandbox, despite the author's concerns.
(Also, they might want to look into Lua userdata, since that would address their concern about the overhead of converting between native and Lua data structures. The language is designed to be embedded in C programs, after all.)
At least in my opinion, it might depend on the kind of game. For example, there are card games, certain kinds of puzzle games, Pokemon-style games, etc. There are also considerations such as what portability you want, what sandboxing you will want, and so on.
(There are game engines that have their own programming language for those kinds of games, and some of them are ones I made up too.)
It is absolutely not the case that all problems worth solving are solved already. Programming language development isn't necessarily about being a genius but rather a willingness to put in a monumental amount of work. Writing a language that compiles is easy enough. Getting a language off the ground to an actually useful place is tedious, simply in terms of the sheer amount of work to be done. Specification, implementation, documentation, diagnostics, optimization, configuration, tooling support, and creating a standard library (especially a cross-platform one) are things that will mire you in many hundreds of hours of work.
Making your own language is easy. Creating the library that will actually solve problems without forcing developers to reinvent the wheel is the crux. There is a reason why C++ / Java / JavaScript etc. are established: it's the already proven libraries around those languages that allow them to be so successful.
I have only read the first part of the article, but I can't help but think that a project like libriscv [0] would've/could've worked for their game project too, because, fun fact, the creator of libriscv, the legendary fwsgonzo, is also making a game. I highly recommend that people check out their Discord server.
But my main point is that libriscv is one of the fastest RISC-V emulators, so something like C/C++/Lua could have been used, sandboxed, for the purposes of the game.
Am I missing something? Although, making a programming language is its own kind of project, and that's really cool as well :-D
But I would also love to hear the author's opinion on libriscv, as it feels like it ticks all the boxes from my understanding.
[0]: https://github.com/libriscv/libriscv
The only thing going for C++ is that it has acceptable (though not straightforward) C interop.
I don't like C# and X++ because the language surface is huge, but if you use a limited subset then, needless to say, they are very useful and handy languages too.
Coming up with new ideas is hard. Especially since you have to test them in the real world.
I'm kind of curious and want to try it for fun when I get some free time ^^
Maybe AI is good enough now to help me with that...
The last time I tried, Claude couldn't even help me build a syntax highlighter for a hypothetical language.
Roughly 100%.