The parser infrastructure lives mainly in the narxia-syn crate.
It exposes the following types:
- narxia_syn::token_source::TokenSource: The TokenSource trait, used as an input to the Parser.
- narxia_syn::token_source::text_ts::TextTokenSource: The main token source implementation.
- narxia_syn::parser::parse_event_handler::TreeBuilder: The TreeBuilder trait, used by the parser to actually build the syntax trees.
- narxia_syn::parser::Parser: The main type. It gets a narxia_syn::token_source::TokenSource and creates parser events. In the end, this list of parser events can be consumed by a tree builder.
Token source
The token source, i.e. the narxia_syn::token_source::TokenSource trait, represents some source of tokens. To make parsing fast and convenient, it exposes a much larger interface to the parser than most simple lexers provide.
In particular, it supports rollbacks, for the case where the parser decides it should take some other branch.
It also supports specific lexer modes depending on where we are in the source code. For example, a string literal can look like "lsnglkan ${expression} njnglener". String literals are therefore handled by the Parser: we want the parser to descend into the expression, while any characters other than $ and \ should be treated as part of the string literal. More on this under Parsing string literals.
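The rollback idea can be illustrated with a small sketch. Everything here (SketchTokenSource, Checkpoint, try_sequence) is a hypothetical stand-in written for this example, not the actual TokenSource interface; it only assumes that a rollback amounts to restoring a previously saved position.

```rust
// Hypothetical sketch: these names are NOT the narxia_syn API.

/// A checkpoint is just a saved position in the token stream.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Checkpoint(usize);

/// A toy token source over pre-lexed tokens that supports rollback.
pub struct SketchTokenSource {
    tokens: Vec<String>,
    pos: usize,
}

impl SketchTokenSource {
    pub fn new(tokens: &[&str]) -> Self {
        Self {
            tokens: tokens.iter().map(|t| t.to_string()).collect(),
            pos: 0,
        }
    }

    /// Look at the current token without consuming it.
    pub fn peek(&self) -> Option<&str> {
        self.tokens.get(self.pos).map(|s| s.as_str())
    }

    /// Consume and return the current token, if any.
    pub fn advance(&mut self) -> Option<String> {
        let tok = self.tokens.get(self.pos).cloned();
        if tok.is_some() {
            self.pos += 1;
        }
        tok
    }

    /// Remember where we are so a failed branch can be undone.
    pub fn checkpoint(&self) -> Checkpoint {
        Checkpoint(self.pos)
    }

    /// Return to a previously saved position.
    pub fn rollback(&mut self, cp: Checkpoint) {
        self.pos = cp.0;
    }
}

/// Try to consume `expected` in order. On the first mismatch, roll back
/// so the caller can try another branch from the same position.
pub fn try_sequence(ts: &mut SketchTokenSource, expected: &[&str]) -> bool {
    let cp = ts.checkpoint();
    for want in expected {
        if ts.advance().as_deref() != Some(*want) {
            ts.rollback(cp);
            return false;
        }
    }
    true
}
```

A failed try_sequence call leaves the source exactly where it started, which is what lets a parser speculate on one branch and then fall back to another.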
Text token source
The text token source, provided as narxia_syn::token_source::text_ts::TextTokenSource, takes in a &str and implements the token source trait on that text, lexing it into individual tokens.
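As a rough illustration of lexing tokens on demand from a borrowed &str, here is a minimal sketch. SketchTextSource and Tok are hypothetical names for this example, and the token kinds are a simplification; the real TextTokenSource implements a much larger interface.

```rust
// Hypothetical sketch; `SketchTextSource` and `Tok` are illustrative only.

#[derive(Debug, PartialEq)]
pub enum Tok<'a> {
    Ident(&'a str),
    Number(&'a str),
    Punct(char),
}

pub struct SketchTextSource<'a> {
    text: &'a str,
    pos: usize, // byte offset into `text`
}

impl<'a> SketchTextSource<'a> {
    pub fn new(text: &'a str) -> Self {
        Self { text, pos: 0 }
    }

    /// Length in bytes of the leading run of chars satisfying `pred`.
    fn run_len(rest: &str, pred: impl Fn(char) -> bool) -> usize {
        rest.find(|c: char| !pred(c)).unwrap_or(rest.len())
    }

    /// Lex the next token directly out of the borrowed text.
    pub fn next_token(&mut self) -> Option<Tok<'a>> {
        let text: &'a str = self.text;
        // Skip leading whitespace.
        self.pos += Self::run_len(&text[self.pos..], char::is_whitespace);
        let rest = &text[self.pos..];
        let first = rest.chars().next()?;
        let (tok, len) = if first.is_alphabetic() || first == '_' {
            let len = Self::run_len(rest, |c| c.is_alphanumeric() || c == '_');
            (Tok::Ident(&rest[..len]), len)
        } else if first.is_ascii_digit() {
            let len = Self::run_len(rest, |c| c.is_ascii_digit());
            (Tok::Number(&rest[..len]), len)
        } else {
            // Anything else becomes a single-character token.
            (Tok::Punct(first), first.len_utf8())
        };
        self.pos += len;
        Some(tok)
    }
}
```

Because the tokens borrow from the input text, no string data is copied while lexing.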
Parser
The parser itself is a concrete type: narxia_syn::parser::Parser. It takes an input TokenSource and outputs events to a TreeBuilder.
Changing the parser typically involves adding some new parser functions, which can be done through the parse_fn_decl proc macro.
parse_fn_decl
parse_fn_decl is a procedural macro defined in narxia_proc, but it is intended to be used as part of the parsing infrastructure. In particular, it abstracts away some of the particularities of the parser and allows simple, high-level definitions of parsing functions, while still making sure that the calls they compile down to work as intended.
Here’s some of the syntax that parse_fn_decl accepts:
The parse-instructions syntax can be defined roughly by the following CFG:
In a parse-call, the first argument is always the Parser, so it is added automatically by the parse_fn_decl macro.
That covers the basics of the syntax. For examples, you can always look at the parser code that’s already there (the narxia_syn::parser::{self, expr, fun, stmt} modules).
Parser events
The parser writes individual parse events into an internal “buffer”. After it’s done parsing everything from the input token source, its finish() function takes in a narxia_syn::parser::parse_event_handler::TreeBuilder and “flushes” the buffer by calling the functions of that TreeBuilder. It also provides a finish_to_tree() function that outputs a (narxia_syn::syntree::GreenTree, Vec<narxia_syn::parse_error::ParseError>), which, as the types imply, is the parsed tree and the list of ParseErrors that occurred during parsing.
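The buffer-then-flush flow can be sketched as follows. The Event enum, the SketchTreeBuilder trait and its three callbacks, and the SexprBuilder are all assumptions made for this example; the real TreeBuilder trait may have different methods and signatures.

```rust
// Hypothetical sketch; none of these names or signatures come from narxia_syn.

#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    StartNode(&'static str),
    Token(String),
    FinishNode,
}

/// Stand-in for a tree-builder trait: one callback per event kind.
pub trait SketchTreeBuilder {
    fn start_node(&mut self, kind: &'static str);
    fn token(&mut self, text: &str);
    fn finish_node(&mut self);
}

/// A toy parser that only records events while parsing.
#[derive(Default)]
pub struct SketchParser {
    events: Vec<Event>,
}

impl SketchParser {
    pub fn start(&mut self, kind: &'static str) {
        self.events.push(Event::StartNode(kind));
    }
    pub fn token(&mut self, text: &str) {
        self.events.push(Event::Token(text.to_string()));
    }
    pub fn end(&mut self) {
        self.events.push(Event::FinishNode);
    }

    /// Flush the buffered events into `builder`, in recording order.
    pub fn finish(self, builder: &mut impl SketchTreeBuilder) {
        for ev in self.events {
            match ev {
                Event::StartNode(kind) => builder.start_node(kind),
                Event::Token(text) => builder.token(&text),
                Event::FinishNode => builder.finish_node(),
            }
        }
    }
}

/// Example builder that renders the tree as an S-expression string.
#[derive(Default)]
pub struct SexprBuilder {
    pub out: String,
}

impl SketchTreeBuilder for SexprBuilder {
    fn start_node(&mut self, kind: &'static str) {
        if !self.out.is_empty() {
            self.out.push(' ');
        }
        self.out.push('(');
        self.out.push_str(kind);
    }
    fn token(&mut self, text: &str) {
        self.out.push(' ');
        self.out.push_str(text);
    }
    fn finish_node(&mut self) {
        self.out.push(')');
    }
}
```

Keeping the events in a plain buffer means the parser never touches tree types directly, so the same parse can be replayed into different builders.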
Tree builder
The narxia_syn::parser::parse_event_handler::TreeBuilder trait has a few functions that get called by the parser once it’s time to build the concrete syntax tree.
The main implementations of this trait are narxia_syn::parser::parse_event_handler::GreenTreeBuilder and narxia_syn::parser::parse_event_handler::GreenTreeBuilderSD, which build a narxia_syn::syntree::GreenTree.
Parsing string literals
This is quite complex. String literals in Narxia can have subexpressions, like here:
let x = 1
println("$x some text $?{"hello " + x} some more text")
A string literal like this would confuse the lexer. So once the lexer sees a " character, it consumes it and immediately returns, without attempting to go beyond it.
When the parser sees that, it changes the lexer mode into “string literal” mode. In this mode, the lexer treats any characters other than \, " and $ as normal characters. Those three characters have special meaning inside a string:
- \ means escape the next character.
- " marks the end of the string, unless it is escaped.
- $ is where the fun begins. When the parser sees a $ or $?, it has to switch the lexer back into normal mode and attempt to parse the next thing as either an identifier or a full block expression enclosed in curly braces ({}). In both cases, when we are back at the outer string literal, the parser switches the lexer back into string literal mode so it can continue.
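The mode switching described above can be sketched in a few dozen lines. ModalLexer, Mode, Tok, Part, parse_string and parse_block are hypothetical names for this example, not the Narxia implementation. The sketch makes several simplifying assumptions: an escape just takes the next character literally, a block is captured as raw source text instead of being parsed, error recovery is omitted, and the $? form is left out because the text above does not say how it differs from $.

```rust
// Hypothetical sketch of parser-driven lexer modes; not the narxia_syn API.

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Mode { Normal, StringLiteral }

#[derive(Debug, PartialEq)]
pub enum Tok { Quote, Dollar, Text(String), Ident(String), LBrace, RBrace, Other(char) }

#[derive(Debug, PartialEq)]
pub enum Part { Text(String), Var(String), Block(String) }

pub struct ModalLexer { chars: Vec<char>, pos: usize, pub mode: Mode }

impl ModalLexer {
    pub fn new(src: &str) -> Self {
        Self { chars: src.chars().collect(), pos: 0, mode: Mode::Normal }
    }

    fn peek(&self) -> Option<char> { self.chars.get(self.pos).copied() }

    pub fn next_token(&mut self) -> Option<Tok> {
        if self.mode == Mode::Normal {
            while self.peek().map_or(false, char::is_whitespace) { self.pos += 1; }
        }
        let c = self.peek()?;
        // Single-character tokens. A quote is consumed alone in either mode.
        let single = match (self.mode, c) {
            (_, '"') => Some(Tok::Quote),
            (Mode::StringLiteral, '$') => Some(Tok::Dollar),
            (Mode::Normal, '{') => Some(Tok::LBrace),
            (Mode::Normal, '}') => Some(Tok::RBrace),
            _ => None,
        };
        if let Some(tok) = single { self.pos += 1; return Some(tok); }
        match self.mode {
            Mode::StringLiteral => {
                // Everything up to the next unescaped `"` or `$` is plain text.
                let mut text = String::new();
                while let Some(c) = self.peek() {
                    if c == '"' || c == '$' { break; }
                    self.pos += 1;
                    if c == '\\' {
                        // Simplification: keep the escaped character as-is.
                        if let Some(e) = self.peek() { text.push(e); self.pos += 1; }
                    } else {
                        text.push(c);
                    }
                }
                Some(Tok::Text(text))
            }
            Mode::Normal if c.is_alphabetic() || c == '_' => {
                let start = self.pos;
                while self.peek().map_or(false, |c| c.is_alphanumeric() || c == '_') {
                    self.pos += 1;
                }
                Some(Tok::Ident(self.chars[start..self.pos].iter().collect()))
            }
            Mode::Normal => { self.pos += 1; Some(Tok::Other(c)) }
        }
    }
}

/// Call after the opening quote has been consumed in normal mode.
pub fn parse_string(lx: &mut ModalLexer) -> Vec<Part> {
    let mut parts = Vec::new();
    lx.mode = Mode::StringLiteral;
    loop {
        match lx.next_token() {
            Some(Tok::Text(t)) => parts.push(Part::Text(t)),
            Some(Tok::Dollar) => {
                lx.mode = Mode::Normal; // lex the interpolation normally
                match lx.next_token() {
                    Some(Tok::Ident(name)) => parts.push(Part::Var(name)),
                    Some(Tok::LBrace) => parts.push(Part::Block(parse_block(lx))),
                    _ => {} // error recovery omitted in this sketch
                }
                lx.mode = Mode::StringLiteral; // back in the outer string
            }
            // Closing quote or end of input: leave string mode.
            _ => { lx.mode = Mode::Normal; return parts; }
        }
    }
}

/// Skip to the matching `}` and return the raw block source, trimmed.
fn parse_block(lx: &mut ModalLexer) -> String {
    let start = lx.pos;
    let mut depth = 1;
    loop {
        let before = lx.pos;
        match lx.next_token() {
            Some(Tok::LBrace) => depth += 1,
            Some(Tok::RBrace) => {
                depth -= 1;
                if depth == 0 {
                    let raw: String = lx.chars[start..before].iter().collect();
                    return raw.trim().to_string();
                }
            }
            Some(Tok::Quote) => { parse_string(lx); } // nested string literal
            Some(_) => {}
            None => return lx.chars[start..].iter().collect(),
        }
    }
}
```

The nested call to parse_string inside parse_block is what lets a quote inside ${...} open a new string literal instead of closing the outer one.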