My experience designing and implementing an interpreter from scratch in Python!
For about a year, I've been thinking about how cool it would be to implement a programming language from scratch! After a long time, I decided to set aside all my other projects, at least for a while (even a startup company)...
First of all, maybe I should talk about the name of the language! This is the timeline of the chosen names in order:
My initial idea was to create a language heavily inspired by Bash; a language with simple, understandable code (I'm still not sure if Bash is a simple language?!) that can be used in the terminal for many tasks (such as manipulating strings and files, automating processes, etc.)...
And this was the chosen grammar for the Ammepsand language:
In the middle of the night, I said to myself, maybe I should make the syntax more similar to the C family... But after about 15 minutes, I saw that we were far away from Bash! After that, I decided to change the name of the language to Amme along with the syntax.
This was the calculator program I wrote in the fictional Amme language:
It didn't take long for me to decide that we should have some rare features in the language we were going to implement:
Not every part can be explained in this post; I will only mention the challenges I faced and the solutions that came to my mind during that period.
For many people, the first solution that comes to mind for tokenization and even parsing is to use ANTLR, Bison, and other similar tools... I am not saying that it is a bad idea, but we are supposed to understand the process of creating a programming language from scratch; not just the implementation of several node visitors!
Instead of reading the source character by character and labeling each piece by hand, I went with regular expressions... I really wanted to implement the project, as far as possible, in a way that others could easily modify; and this was the main reason for using regular expressions in this phase!
A little further on, you will understand what GroupedTokens is for! And yes, I know that when every field of a data class takes the same parameter, it can be set once in the decorator itself (e.g., @dataclass(frozen=True, kw_only=True))...
Our scanner logic:
In the constructor, we create a pattern based on the groups defined by the developer(s) to break the code into smaller parts. The tokenize method is probably not that difficult to understand; but I must mention that it is much better to use generators instead of returning a list at the end of the function...
Here we define our own language tokens beautifully:
This part is very, very long, so I will only show small parts of it...
This is the base parser we use with a very simple logic:
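As a rough sketch only (the class and method names below are illustrative assumptions, not copied from Farr), a base parser of this kind boils down to a cursor over the token list plus a few helpers:

```python
from typing import NamedTuple, Optional, Sequence


class Token(NamedTuple):  # hypothetical stand-in for the scanner's tokens
    name: str
    value: str


class BaseParser:
    def __init__(self, tokens: Sequence[Token]) -> None:
        self._tokens = tokens
        self._position = 0

    def peek(self, offset: int = 0) -> Optional[Token]:
        """Return a token at or ahead of the cursor without consuming it."""
        index = self._position + offset
        return self._tokens[index] if index < len(self._tokens) else None

    def check(self, *names: str) -> bool:
        """Tell whether the current token has one of the given names."""
        token = self.peek()
        return token is not None and token.name in names

    def advance(self) -> Token:
        """Consume and return the current token."""
        token = self.peek()
        if token is None:
            raise SyntaxError("Unexpected end of input")
        self._position += 1
        return token

    def expect(self, *names: str) -> Token:
        """Consume the current token, or fail if it has the wrong name."""
        if not self.check(*names):
            found = self.peek()
            raise SyntaxError(
                f"Expected {' or '.join(names)}, "
                f"found {found.name if found else 'end of input'}"
            )
        return self.advance()
```

Every concrete parsing method can then be written in terms of check, advance and expect, without touching the cursor directly.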
We have over 65 nodes in Farr, but I'll only include 30% of them here to give you an idea of how they're structured and inherited:
A not-so-interesting way to create the nodes needed to build a syntax tree is to reduce them all to a single node type, with fields such as the node kind and its children... There are definitely different ideas for different phases; for example, another simple approach is to sort the nodes into general groups, such as literals, loops, etc.
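The two approaches can be put side by side in a short sketch. All class names here are invented for the comparison and are not Farr's real node classes:

```python
from dataclasses import dataclass, field
from typing import Any


# Option 1: a single generic node, distinguished only by its `kind` field.
@dataclass
class Node:
    kind: str
    value: Any = None
    children: list["Node"] = field(default_factory=list)


# Option 2: one class per construct, grouped under shared base classes.
@dataclass
class ASTNode:
    pass


@dataclass
class LiteralNode(ASTNode):
    value: Any


@dataclass
class IntegerNode(LiteralNode):
    pass


@dataclass
class BinaryOperationNode(ASTNode):
    operator: str
    left: ASTNode
    right: ASTNode


# The expression `1 + 2` in both styles:
generic = Node("BinaryOperation", "+", [Node("Integer", 1), Node("Integer", 2)])
typed = BinaryOperationNode("+", IntegerNode(1), IntegerNode(2))
```

With the first option the interpreter has to branch on strings; with the second it can dispatch on the class, and a check like isinstance(node, LiteralNode) covers a whole group at once.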
Let's parse three literals (_parse_integer, _parse_string and _parse_identifier), a term (_parse_call) and a statement (_parse_function):
As a programmer and maintainer of this language, I have to admit that Farr could have been smarter in parsing! I mean mostly in the part that combines terms (_process_expression)...
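For comparison, one well-known way to combine terms is precedence climbing. To be clear, this is a different technique from what Farr does in _process_expression; the operator table, the function names and the tuple-based tree below are assumptions made up for this sketch:

```python
import re

# Hypothetical binary operators: (precedence, is_right_associative)
OPERATORS = {
    "+": (1, False),
    "-": (1, False),
    "*": (2, False),
    "/": (2, False),
    "^": (3, True),
}


def parse_expression(tokens: list[str], min_precedence: int = 1):
    """Parse the tokens (consumed in place) into nested tuples."""
    left = parse_term(tokens)
    while tokens and tokens[0] in OPERATORS:
        precedence, right_associative = OPERATORS[tokens[0]]
        if precedence < min_precedence:
            break
        operator = tokens.pop(0)
        # A left-associative operator only lets tighter operators bind
        # on its right; a right-associative one also accepts itself.
        next_minimum = precedence if right_associative else precedence + 1
        right = parse_expression(tokens, next_minimum)
        left = (operator, left, right)
    return left


def parse_term(tokens: list[str]):
    """Parse an integer or a parenthesized expression."""
    if not tokens:
        raise SyntaxError("Unexpected end of input")
    token = tokens.pop(0)
    if token == "(":
        inner = parse_expression(tokens)
        if not tokens or tokens.pop(0) != ")":
            raise SyntaxError("Expected ')'")
        return inner
    if token.isdigit():
        return int(token)
    raise SyntaxError(f"Unexpected token {token!r}")


def parse(source: str):
    # Toy tokenizer: anything that is not a number or operator is skipped
    tokens = re.findall(r"\d+|[-+*/^()]", source)
    tree = parse_expression(tokens)
    if tokens:
        raise SyntaxError(f"Unexpected token {tokens[0]!r}")
    return tree
```

Here parse("1 + 2 * 3") groups the multiplication first, and adding a new operator is just one more row in the table rather than one more parsing method.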
Oh, we have to go to the back-end part of our project! This part is both interesting and scary! Very interesting and very scary!
Preparing the environment for managing variables, functions, structs and methods; and the not so good base of the interpreter:
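Stripped down to its core, an environment of this kind is a symbol table that falls back to an enclosing scope. The sketch below shows only that core idea; the Environment class and its method names are assumptions, and Farr's own version also has to deal with functions, structs and methods:

```python
from typing import Any, Optional


class Environment:
    """A symbol table that falls back to its enclosing scope."""

    def __init__(self, parent: Optional["Environment"] = None) -> None:
        self._symbols: dict[str, Any] = {}
        self._parent = parent

    def define(self, name: str, value: Any) -> None:
        """Create (or shadow) a name in the current scope."""
        self._symbols[name] = value

    def lookup(self, name: str) -> Any:
        """Find a name here or in any enclosing scope."""
        if name in self._symbols:
            return self._symbols[name]
        if self._parent is not None:
            return self._parent.lookup(name)
        raise NameError(f"Nothing was found with the name `{name}`")

    def assign(self, name: str, value: Any) -> None:
        """Update an existing name in the scope where it was defined."""
        if name in self._symbols:
            self._symbols[name] = value
        elif self._parent is not None:
            self._parent.assign(name, value)
        else:
            raise NameError(f"Cannot assign to undefined name `{name}`")


global_scope = Environment()
global_scope.define("x", 1)
function_scope = Environment(parent=global_scope)
function_scope.define("y", 2)
function_scope.assign("x", 10)  # reaches through to the global scope
```

Calling a function then means creating a fresh child environment, defining the parameters in it, and throwing it away when the call returns.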
This section can be very difficult to understand without getting into the code, especially the handling of chained expressions (_process_chain_target) and calls (_populate_params)...
About half of the main code of the Farr language interpreter is as follows:
I can say that my biggest problem in implementing the interpreter was my inability to understand the logic of matching arguments with parameters! I was involved in the implementation of the interpreter for about a week...
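In hindsight, the matching logic can be broken into three small steps. The following is a simplified sketch under my own assumptions (the Parameter class and the bind_arguments function are hypothetical and do not mirror _populate_params), and it ignores variadic parameters entirely:

```python
from dataclasses import dataclass
from typing import Any

_MISSING = object()  # sentinel: "this parameter has no default"


@dataclass(frozen=True)
class Parameter:  # hypothetical; Farr's own parameter nodes differ
    name: str
    default: Any = _MISSING


def bind_arguments(
    parameters: list[Parameter],
    positional: list[Any],
    keyword: dict[str, Any],
) -> dict[str, Any]:
    """Map call arguments onto parameter names."""
    names = [parameter.name for parameter in parameters]
    if len(positional) > len(parameters):
        raise TypeError(
            f"Expected at most {len(parameters)} arguments, "
            f"got {len(positional)}"
        )
    # 1. Positional arguments fill the parameters from left to right.
    bound = dict(zip(names, positional))
    # 2. Keyword arguments fill the remaining parameters by name.
    for name, value in keyword.items():
        if name not in names:
            raise TypeError(f"Unknown parameter `{name}`")
        if name in bound:
            raise TypeError(f"Parameter `{name}` received more than one value")
        bound[name] = value
    # 3. Whatever is still empty falls back to its default, if it has one.
    for parameter in parameters:
        if parameter.name not in bound:
            if parameter.default is _MISSING:
                raise TypeError(f"Missing argument for `{parameter.name}`")
            bound[parameter.name] = parameter.default
    return bound
```

The resulting dictionary is exactly what gets defined in the new environment of the function being called.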
Some of the objects that the language provides natively:
Of course, in this section it is not necessary to divide the objects into two groups, expressions and statements (it wasn't even necessary in the parser phase, but it was done for simplicity)...
If I were to design and implement a programming language again (seriously, not just to gain experience), this project has taught me that I should spend more time on the design phase, so that I don't run into ideological conflicts along the way...
See the project on GitHub, test it and also give feedback (if you like)!