The parser infrastructure is mainly in the narxia-syn crate.

It exposes the following types:

narxia_syn::token_source::TokenSource: The TokenSource trait, used as an input to the Parser.
narxia_syn::token_source::text_ts::TextTokenSource: The main Token Source implementation.
narxia_syn::parser::parse_event_handler::TreeBuilder: The TreeBuilder trait, used by the parser to actually build the syntax trees.
narxia_syn::parser::Parser: The main type. It takes a narxia_syn::token_source::TokenSource and produces a list of parser events, which a TreeBuilder can then consume.

Token source

The token source, i.e. the narxia_syn::token_source::TokenSource trait, represents a source of tokens. To make parsing fast and convenient, it exposes a much larger interface to the parser than most simple lexers provide.

In particular, it supports rollbacks, for when the parser decides it should take a different branch.
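To illustrate the rollback idea, here is a minimal sketch over a pre-lexed slice of tokens. Everything in it (Tok, State, SliceTokenSource, save, restore, try_call) is a hypothetical name made up for this example and is not the actual TokenSource interface; the real saved state likely carries more than a position.

```rust
/// Hypothetical token kinds, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tok {
    Ident,
    LParen,
    RParen,
    Eof,
}

/// Hypothetical saved state: here just a position in the token slice.
#[derive(Debug, Clone, Copy)]
struct State(usize);

/// A toy token source over already-lexed tokens (an assumption of this sketch).
struct SliceTokenSource<'a> {
    tokens: &'a [Tok],
    pos: usize,
}

impl<'a> SliceTokenSource<'a> {
    fn new(tokens: &'a [Tok]) -> Self {
        Self { tokens, pos: 0 }
    }

    /// Look at the current token without consuming it.
    fn current(&self) -> Tok {
        self.tokens.get(self.pos).copied().unwrap_or(Tok::Eof)
    }

    /// Consume the current token, if there is one.
    fn advance(&mut self) {
        if self.pos < self.tokens.len() {
            self.pos += 1;
        }
    }

    /// Remember where we are, so a failed branch can be undone.
    fn save(&self) -> State {
        State(self.pos)
    }

    /// Go back to a previously saved state.
    fn restore(&mut self, state: State) {
        self.pos = state.0;
    }
}

/// Try to parse `Ident ( )`. On a mismatch, roll back and report failure,
/// leaving the token source exactly where it was before the attempt.
fn try_call(ts: &mut SliceTokenSource<'_>) -> bool {
    let saved = ts.save();
    for expected in [Tok::Ident, Tok::LParen, Tok::RParen] {
        if ts.current() != expected {
            ts.restore(saved);
            return false;
        }
        ts.advance();
    }
    true
}
```

After a failed try_call, the caller can attempt another branch from the same position, which is the point of supporting rollbacks in the token source rather than in each parse function.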

It also supports specific lexer modes, depending on where we are in the source code. For example, a string literal can look like "lsnglkan ${expression} njnglener". Here we want the parser to descend into the expression, while any characters other than \, " and $ are treated as part of the string literal, so string literals are handled by the Parser rather than by the lexer alone. More on this in the Parsing string literals section below.

Text token source

The text token source, provided in narxia_syn::token_source::text_ts::TextTokenSource, takes in a &str and implements the token source trait over that text, lexing it into individual tokens.

Parser

The parser itself is a concrete type: the narxia_syn::parser::Parser. It takes an input TokenSource and outputs events to a TreeBuilder.

Changing the parser typically involves adding some new parser functions, which can be done through the parse_fn_decl proc macro.

parse_fn_decl

parse_fn_decl is a procedural macro defined in narxia_proc, but it is intended to be used as part of the parsing infrastructure. In particular, it abstracts away some of the particularities of the parser, and allows for simple high-level definition of parsing functions, while still making sure that the calls they compile down to work like they are supposed to.

Here’s the general shape of the syntax parse_fn_decl accepts:

parse_fn_decl! {
	parse_fn_name: SKind ::= <parse-instructions>
}

The parse-instructions syntax can be defined roughly by the following CFG:

parse-instructions ::= parse-instruction*
parse-instruction ::= instr-call
			| instr-expect-token
			| instr-extra
 
instr-call ::= "$" (rust function call expression) // call the given function
instr-expect-token ::= "$!" "[" token "]" // expect, and consume the token in the token source
token ::= (any token, goes into the T![] macro)
name ::= (rust-ident)
instr-extra ::= "$/" instr-extra-extra
 
instr-extra-extra ::=
	"ws" ":" "wc" | // skip whitespace and comments, stop at newline
	"ws" ":" "wcn" | // skip whitespace, comments, and newlines
	"match" "{" match-arm* match-wildcard-arm? "}" | // match on the kind of token we have
	"state" ":" name | // save the state into a variable called name
	"restore_state" ":" name | // restore the state from a variable called name
	"if" if-condition "{" parse-instructions "}" // sequentially test the conditions, executing the first branch whose condition matches
	("$/" "else" "if" if-condition "{" parse-instructions "}")*
	("$/" "else" "{" parse-instructions "}")?
 
match-arm ::= match-selector* match-action // if any of the selectors matches, run the action, otherwise attempt to match the next arm
match-wildcard-arm ::= "_" "=>" "{" parse-instructions "}" // catch-all. If not supplied, then an err_unexpected is generated
 
match-selector ::= "[" token "]"
match-action ::= "!" // consume the token, and do nothing else
		| "=>" "{" parse-instructions "}" // run the set of parse-instructions
 
if-condition ::=
	"at" "[" token "]" // parser is at token kind T![token]
	| "||" "[" (if-condition ",")* if-condition? "]" // any of the given conditions

In an instr-call, the first argument is always the Parser, so the parse_fn_decl macro adds it automatically.

That covers the basics of the syntax. For examples, take a look at the existing parser code (the narxia_syn::parser::{self,expr,fun,stmt} modules).

Parser events

The parser writes individual parse events into an internal “buffer”. Once it’s done parsing everything from the input token source, its finish() function takes a narxia_syn::parser::parse_event_handler::TreeBuilder and “flushes” the buffer by calling that TreeBuilder’s functions. It also provides a finish_to_tree() function that returns a (narxia_syn::syntree::GreenTree, Vec<narxia_syn::parse_error::ParseError>): as the types imply, the parsed tree and the list of ParseErrors that occurred during parsing.
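The buffer-then-flush flow can be sketched roughly as follows. This is not the actual narxia_syn code: Event, Builder, EventParser and SExprBuilder are hypothetical stand-ins invented for the example, and the real events and TreeBuilder functions may look quite different.

```rust
/// Hypothetical event type; the real parser's events may differ.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    StartNode(&'static str),
    Token(String),
    FinishNode,
}

/// Hypothetical stand-in for the TreeBuilder trait.
trait Builder {
    fn start_node(&mut self, kind: &'static str);
    fn token(&mut self, text: &str);
    fn finish_node(&mut self);
}

/// A toy "parser" that only records events into its buffer.
#[derive(Default)]
struct EventParser {
    events: Vec<Event>,
}

impl EventParser {
    fn start(&mut self, kind: &'static str) {
        self.events.push(Event::StartNode(kind));
    }

    fn token(&mut self, text: &str) {
        self.events.push(Event::Token(text.to_owned()));
    }

    fn end(&mut self) {
        self.events.push(Event::FinishNode);
    }

    /// Flush the buffer: replay every event, in order, into the builder.
    fn finish(self, builder: &mut impl Builder) {
        for event in self.events {
            match event {
                Event::StartNode(kind) => builder.start_node(kind),
                Event::Token(text) => builder.token(&text),
                Event::FinishNode => builder.finish_node(),
            }
        }
    }
}

/// A builder that renders the tree as an S-expression-like string,
/// instead of building a real syntax tree.
#[derive(Default)]
struct SExprBuilder {
    out: String,
}

impl Builder for SExprBuilder {
    fn start_node(&mut self, kind: &'static str) {
        // Separate nested nodes from whatever came before them.
        if !self.out.is_empty() {
            self.out.push(' ');
        }
        self.out.push('(');
        self.out.push_str(kind);
    }

    fn token(&mut self, text: &str) {
        self.out.push(' ');
        self.out.push_str(text);
    }

    fn finish_node(&mut self) {
        self.out.push(')');
    }
}
```

Recording start("Call"), token("f"), start("Args"), token("x"), end(), end() and then calling finish with an SExprBuilder leaves "(Call f (Args x))" in out. Keeping the events in a buffer means the parser never needs to know which kind of tree, if any, gets built from them.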

Tree builder

The narxia_syn::parser::parse_event_handler::TreeBuilder trait has a few functions that get called by the parser once it’s time to build the concrete syntax tree.

The main implementations of this trait are the narxia_syn::parser::parse_event_handler::GreenTreeBuilder and narxia_syn::parser::parse_event_handler::GreenTreeBuilderSD, which build a narxia_syn::syntree::GreenTree.

Parsing string literals

This is quite complex. String literals in Narxia can have subexpressions, like here:

let x = 1
println("$x some text $?{"hello " + x} some more text")

A string literal like this would confuse the lexer on its own, so once the lexer sees a " character, it consumes it and immediately returns, without attempting to lex any further.

When the parser sees that token, it switches the lexer into “string literal” mode. In this mode, the lexer treats any character other than \, " and $ as an ordinary character. Those three characters have special meaning inside a string:

\ means escape the next character.
" marks the end of the string, unless it is escaped.
$ is where the fun begins. When the parser sees a $ or $?, that means it has to switch the lexer back into normal mode, and attempt to parse the next thing as either an identifier or a full block expression, enclosed by curly braces ({}). In both cases, when we are back at the outer string literal, the parser then switches the lexer back into string literal mode, so it can continue.
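The three rules above can be sketched as a small standalone lexer. This is only an illustration under simplifying assumptions: StrTok, Part, next_in_string and parse_string_body are hypothetical names, the "switch back to normal mode" step is reduced to reading a plain identifier after $, and the $? form and {} block expressions are left out entirely.

```rust
/// Hypothetical tokens produced while in "string literal" mode.
#[derive(Debug, PartialEq, Eq)]
enum StrTok<'a> {
    Text(&'a str), // a run of ordinary characters
    Escape(char),  // a backslash followed by the escaped character
    Dollar,        // the parser should leave string mode here
    Quote,         // an unescaped closing quote
}

/// Lex one string-mode token from the start of `input`.
/// Returns the token and the remaining input, or None if the input
/// is empty or ends right after a backslash.
fn next_in_string(input: &str) -> Option<(StrTok<'_>, &str)> {
    let mut chars = input.chars();
    match chars.next()? {
        '\\' => {
            let escaped = chars.next()?;
            Some((StrTok::Escape(escaped), chars.as_str()))
        }
        '"' => Some((StrTok::Quote, chars.as_str())),
        '$' => Some((StrTok::Dollar, chars.as_str())),
        _ => {
            // Everything up to the next special character is plain text.
            let end = input
                .find(|c: char| matches!(c, '\\' | '"' | '$'))
                .unwrap_or(input.len());
            Some((StrTok::Text(&input[..end]), &input[end..]))
        }
    }
}

/// Hypothetical pieces of a parsed string literal.
#[derive(Debug, PartialEq, Eq)]
enum Part<'a> {
    Text(&'a str),
    Escape(char),
    Ident(&'a str),
}

/// Walk the body of a string literal (the input starts right after the
/// opening quote). Returns the parts and whatever follows the closing
/// quote, or None if the literal is unterminated.
fn parse_string_body(mut rest: &str) -> Option<(Vec<Part<'_>>, &str)> {
    let mut parts = Vec::new();
    loop {
        let (tok, after) = next_in_string(rest)?;
        rest = after;
        match tok {
            StrTok::Text(text) => parts.push(Part::Text(text)),
            StrTok::Escape(c) => parts.push(Part::Escape(c)),
            StrTok::Quote => return Some((parts, rest)),
            StrTok::Dollar => {
                // Stand-in for "normal mode": read one identifier, then
                // fall back to string mode on the next loop iteration.
                let end = rest
                    .find(|c: char| !(c.is_alphanumeric() || c == '_'))
                    .unwrap_or(rest.len());
                parts.push(Part::Ident(&rest[..end]));
                rest = &rest[end..];
            }
        }
    }
}
```

In this sketch the mode switch is implicit in the control flow: the Dollar arm plays the role of normal mode, and every other arm is string literal mode. A real implementation would hand control back to the expression parser at that point instead of reading an identifier by hand.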
