Minerva Architecture
Posted December 27th, 2021
In my previous articles I introduced my programming language and demonstrated how I added a feature to it. In this article I want to describe its architecture.
I'll describe what happens at every stage, and how I'd like to change it to make it the language I want.
The implementation of the Minerva compiler is heavily inspired by Bob Nystrom, author of Crafting Interpreters. I'll be critiquing some of the decisions inspired by his book, but don't take that as a criticism of his book. The decisions in the book were very pragmatic. If he did otherwise his book would not be as effective as it was. He often commented on how certain decisions are not ideal in a real compiler, and how different languages make design decisions other than the ones he chose in the book.
Lexical Analysis
The program first sees a string, a list of characters. If you provide the following code:
var message = "Hello, World";
Then the compiler just sees:
['v', 'a', 'r', ' ', 'm', 'e', 's', 's', 'a', 'g', 'e', ' ', '=', ' ', '"', 'H', 'e', 'l', 'l', 'o', ',', ' ', 'W', 'o', 'r', 'l', 'd', '"', ';']
It is the job of the scanner to chunk together the relevent characters into "words".
The output of this stage is a list of "tokens".
This is how a token is defined in Minerva.
class Token(val type: TokenType, val lexeme: String, val literal: Any?, val line: Int)
The token type is an enum representing all the types of token available. Here is the first line:
enum class TokenType {
LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE, LEFT_SUB, RIGHT_SUB,
I have two problems with this representation.
The first problem is that I only emit tokens that are eventually executed. I don't emit whitepace tokens, comments, etc. The problem with this is that this is useful information for code editors and syntax highlighting.
The second problem is that I'm using enums and nullable fields rather than Kotlin sealed types (Kotlin's answer to union types). This makes the calling code more clunky, and obscures intent.
Eventually, I will have to fix these problems, but for now they are good enough.
The scanner works by iterating over the list of characters, and matching the character in the current position. Depending on the current character, it may move the current pointer forward until a rule stops applying, and it appends a new token to the list. Only basic error checking is done here, e.g, string termination.
Parsing
The next stage is parsing. This takes in the list of tokens, and outputs an abstract syntax tree. I'll show simplified versions of the current syntax tree:
sealed class Stmt {
class If(val condition: Expr, val thenBranch: Stmt, val elseBranch: Stmt?): Stmt()
class Print(val expression: Expr): Stmt()
class Expression(val expression: Expr) : Stmt()
class Function(val name: Token, val functionBody: Expr.Function): Stmt()
class Var(val name: Token, val initializer: Expr, var type: Type): Stmt()
class While(val condition: Expr, val body: Stmt): Stmt()
}
sealed class Expr(var type: Type) {
class Binary(val left: Expr, val operator: Token, val right: Expr) : Expr(Type.AnyType())
class Call(val callee: Expr, val arguments: List<Expr>, val typeArguments: List<Type>): Expr(Type.AnyType())
class Grouping(val expr: Expr): Expr(Type.AnyType())
class Function(
val parameters: List<Token>,
val typeParameters: List<Token>,
val body: Expr)
: Expr(Type.AnyType())
class If(val condition: Expr, val thenBranch: Expr, val elseBranch: Expr): Expr(Type.AnyType())
class Literal(val value: Any?) : Expr(Type.AnyType())
class Logical(val left: Expr, val operator: Token, val right: Expr) : Expr(Type.AnyType())
class Unary(val operator: Token, val right: Expr) : Expr(Type.AnyType())
class Variable(val name: Token): Expr(Type.AnyType())
}
The parser uses recursive descent to check if the token matches a grammar rule, and walks the precedence chain until it finds a match, and then uses the information it finds to return a syntax tree node - either a statement or an expression. All expressions have types, but only basic type processing is done here - literal types and type declarations are extracted where present, otherwise place holder types are provided until they can later be analysed.
I have two issues with the current implementation.
The first is that all error handling uses exceptions. This works for now, but code editors typically still need to work on code with errors. The way to get around this is to return errors as values, and try to parse the rest of the code, even if errors are detected. This will require a lot of changes to the compiler, so I am putting it off.
The second issue is that I am implicitly deciding the precedence rules, of the grammar, and run into ambiguities as a result. I think that if I switched to a different parsing technique, such as Pratt Parsing, I can avoid these issues and make the grammar more extensible. This might make it possible to extend the language within Minerva - e.g, allow users to create new control flow constructs. However, like the other problems, this will require a lot of work and I can do without it for now.
You can also see one of the issues from the previous stage, the Binary expression takes in a token as an operator, this means that in the code it treats every token type as though it could be an operator, even though only a subset are binary operators. This makes it easy to miss one out.
Static Analysis
I need to make sure that variables are used correctly, and that variables are bound by scope. I do this using a resolver that visits every node in the syntax tree, and it tracks the current scopes, variable names, and where the values are in the environment. This is especially important with closures. This also uses exceptions for errors, and has one more issue: it does not handle mutual recursion well because functions have to be declared before they are called. To handle this properly, I need to make sure I visit declarations before other statements, but I need to do this for every single scope.
Type Checking
The next stage of lexical analysis is type checking. The type checker makes sure that every node in the syntax tree has a type, and makes sure that all the types are being used correctly, by comparing the types it inferred to the types provided in the source code. This is quite similar to the interpreter, only that instead of returning values it returns types, that it then uses to modify the placeholder types in the syntax tree. Since this code is newer, made after migrating from Java to Kotlin, I don't use exceptions to handle errors. Instead, I append to a string array, which is not ideal but is a step up. That way it tries to get as many type errors as possible, and can theoretically be used for code highlighting. Like the resolver, type recursion is a problem. I will need to fix this if I want to write the compiler in Minerva.
sealed class Expr(var type: Type) {
class Array(val values: List<Expr>): Expr(Type.ArrayType(Type.AnyType()))
class Assign(val name: Token, val value: Expr) : Expr(Type.NullType())
class Binary(val left: Expr, val operator: Token, val right: Expr) : Expr(Type.AnyType())
}
In Minerva I'd like to do something like this:
type Array = class(const values: List<Expr>)
type Assign = class(const name: Token, const value: Expr)
type Binary(const left: Expr, const operator: Token, const right: Expr)
type Expr = Array | Assign | Binary
Interpreting
The final stage is executing the code. This is simply a switch statement over the syntax tree, that handles each node type appropraitely, returing values for expressions and executing statements.
The main thing I'd like to change here is adding more backends, and adding an optimisation stage.
That's pretty much it
There are more specific implementation details and design decisions, but this is a high level picture of how the compiler works. If there is anything else you'd like to know about how it works, you can ask in the comments.