Lunar Programming Language

by David A. Moon
January 2017 - January 2018

Syntax

A program consists of a series of expressions that are executed in turn. Most of these expressions will be definition statements.

The expressions in a program can be divided among a sequence of source files, which are the units of compilation. Source files are only significant for scoping the sealed class modifier.

Any top-level expression can be preceded by the keyword module:, in which case the expression is parsed in the modules module and must evaluate to a module. The remaining expressions in the current source file are parsed in that module. The expression can be a defmodule or a name whose definition is a module.

If the keyword syntax: appears at the top level of a program, it is followed by a name. If the name is xyz, the remainder of the source file is parsed by calling parse_xyz rather than parse_expression. This allows switching to an alternative user-defined syntax without having to change the compiler. Any parsing function that returns an expression can be used.

Any top-level expression can be preceded by the keyword export:, in which case the expression must parse as a name, a definition, or a prog_expression containing definitions. Each name defined is exported from the current module, except that in a prog_expression any names after the first that start with _ are not exported. This allows putting export: in front of a defclass to export the class, the constructor, and any public slot readers and writers, but not private slot readers and writers. The constructor is likewise not exported if the constructor: option is used and the constructor name starts with an underscore to indicate privacy. If the expression is just a name, that name is exported from the current module.
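The export rule for a prog_expression can be illustrated with a small Python sketch. It assumes the names defined by the prog_expression are already available as a list in definition order; exported_names is a hypothetical helper used only for illustration and is not part of Lunar.

```python
def exported_names(defined_names):
    """Return the names that export: would export from a prog_expression.

    The first name is always exported. Later names are exported only if
    they do not start with an underscore.
    """
    return [name for i, name in enumerate(defined_names)
            if i == 0 or not name.startswith("_")]
```

Under these assumptions, exported_names(["point", "_make_point", "x", "_x!"]) returns ["point", "x"]: the class and the public slot reader are exported, while the underscore-prefixed constructor and slot writer are not.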

Program syntax requires consistent indentation, so each top-level expression must be on a separate line and all top-level expressions must begin at the first column.

Tokens, Comments, and Whitespace

Lexically, a program comprises tokens, comments, and whitespace.

Whitespace is one or more space, tab, and/or newline characters.

A comment begins with ; and continues through the end of the line. By convention double and triple semicolons introduce "paragraph" and "section" comments. Lexically a comment is equivalent to a newline.

A token is a name, a keyword, a literal, or a newline. Yes, newline is both whitespace and an implied token.

A literal is an integer, floating-point, character, or string literal. The syntax of literals is the same as in C except that a floating-point literal must start with a digit. A numeric literal cannot have a sign. Thus a literal begins with a digit, ', or ". Note that $ in a string literal may have special significance, indicating string interpolation. This is the only case where a literal represents a computation rather than a constant datum. See String Interpolation.

A name is one of:

An alphanumeric name is one or more alphabetic, digit, _, ?, !, ¿, or ¡ characters, not starting with a digit.
A symbolic name is one or more +, -, *, /, %, ^, ~, &, |, =, <, >, :, or . characters and/or Unicode characters not in the alphabetic, digit, bracket, or blank character classes.
A punctuation character is exactly one `, @, #, $, (, ), [, ], {, }, \, or , character, or one Unicode bracket character.

In grammar patterns, name matches a name that is not punctuation and does not have a definition as an operator or macro visible in the current scope, while anyname just matches any name without those restrictions. Either name or anyname matches \ followed by a name or a string.

A keyword is an alphanumeric name immediately followed by a colon (:).

As a special case, ... is punctuation rather than a symbolic name.
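The classification of names and keywords described above can be sketched in Python. This is an approximation under stated assumptions: Unicode character classes are taken from Python's unicodedata module (brackets are categories Ps and Pe), the Unicode fallback for symbolic names is applied only to non-ASCII characters, and classify is a hypothetical helper that receives a token already cut out of the source text.

```python
import unicodedata

SYMBOLIC_ASCII = set("+-*/%^~&|=<>:.")
PUNCTUATION_ASCII = set("`@#$()[]{}\\,")
ALNUM_EXTRA = set("_?!¿¡")


def is_alnum_char(ch):
    # Alphabetic or digit, or one of the extra characters allowed in names.
    return ch.isalpha() or ch.isdigit() or ch in ALNUM_EXTRA


def is_symbolic_char(ch):
    if ch in SYMBOLIC_ASCII:
        return True
    # Assumption: only non-ASCII characters reach the Unicode fallback.
    if ord(ch) < 128 or ch in ALNUM_EXTRA:
        return False
    is_bracket = unicodedata.category(ch) in ("Ps", "Pe")
    return not (ch.isalpha() or ch.isdigit() or ch.isspace() or is_bracket)


def classify(token):
    """Return 'keyword', 'alphanumeric', 'symbolic', 'punctuation',
    or 'other' (literals and anything this sketch does not recognize)."""
    if not token:
        return "other"
    if token == "...":
        return "punctuation"  # special case: not a symbolic name
    if len(token) == 1 and (token in PUNCTUATION_ASCII
                            or unicodedata.category(token) in ("Ps", "Pe")):
        return "punctuation"
    if token[0].isdigit() or token[0] in "'\"":
        return "other"  # a literal begins with a digit, ', or "
    if (len(token) > 1 and token.endswith(":")
            and all(is_alnum_char(c) for c in token[:-1])):
        return "keyword"
    if all(is_alnum_char(c) for c in token):
        return "alphanumeric"
    if all(is_symbolic_char(c) for c in token):
        return "symbolic"
    return "other"
```

With these definitions, classify("module:") gives "keyword", classify("foo?") gives "alphanumeric", classify(":=") gives "symbolic", and classify("...") gives "punctuation".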

A newline token is implied between any two non-newline tokens that are on different lines. There is only one newline token no matter how many comments or blank lines intervene between the two tokens. A newline token has an associated indentation, which is simply the amount of whitespace between the beginning of the line and the next token. The "indentation of a line" is the indentation preceding the first token on that line and is the same as the indentation of the newline token preceding the first token on that line.
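The way newline tokens are implied can be sketched in Python. The sketch is deliberately simplified and makes several assumptions that are not part of Lunar: tokens are split on whitespace, a ; always starts a comment (string literals are not handled), and indentation is counted in characters. tokens_with_newlines is a hypothetical name.

```python
def tokens_with_newlines(source):
    """Return word tokens interleaved with ('NEWLINE', indentation) tuples.

    One newline token is produced between two tokens on different lines,
    no matter how many blank or comment-only lines separate them.
    """
    result = []
    for line in source.splitlines():
        code = line.split(";", 1)[0]      # drop the comment, if any
        words = code.split()
        if not words:
            continue                      # blank or comment-only line
        if result:                        # there are tokens on an earlier line
            indent = len(code) - len(code.lstrip())
            result.append(("NEWLINE", indent))
        result.extend(words)
    return result
```

For the source text "if a\n\n  ; note\n  b c\nd" this returns ["if", "a", ("NEWLINE", 2), "b", "c", ("NEWLINE", 0), "d"]: the blank line and the comment line do not produce newline tokens of their own.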

Indentation is significant in the syntax and is used to indicate nesting, rather than using brackets or an "end" statement. If an expression or other construct extends to more than one line, we call lines after the first continuation lines. All continuation lines must be indented at least as much as the first line. Most constructs require that all continuation lines have equal indentation (unless enclosed in a nested construct) and that this indentation be greater than the indentation of the first line.

A \ at the end of a line prevents inserting a newline token before the next line. Both the \ and the line break are ignored by the parser. The next line is considered part of the previous line, not a continuation line, and does not participate in the indentation rules.
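The effect of a trailing \ can be sketched as a pre-pass over the source lines. This Python sketch assumes the \ is the last non-whitespace character on the line and ignores comments and string literals; join_continued_lines is a hypothetical helper, not a description of how a Lunar lexer is actually organized.

```python
def join_continued_lines(source):
    """Merge each line ending in a backslash with the line that follows,
    dropping both the backslash and the line break."""
    joined = []
    pending = None                        # text carried over from a \ line
    for line in source.splitlines():
        text = line if pending is None else pending + line
        stripped = text.rstrip()
        if stripped.endswith("\\"):
            pending = stripped[:-1]       # drop the backslash, keep going
        else:
            joined.append(text)
            pending = None
    if pending is not None:               # source ended on a backslash
        joined.append(pending)
    return joined
```

Because the merged text forms a single line, the leading whitespace of the second physical line becomes ordinary whitespace between tokens and never reaches the indentation rules.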

Two adjacent numeric literals and/or alphanumeric names must be separated by whitespace. Two adjacent symbolic names must be separated by whitespace. A numeric literal or alphanumeric name preceding a keyword must be separated by whitespace. An alphanumeric name preceding a symbolic name that starts with : must be separated by whitespace.
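The four separation rules can be written as a single predicate. The Python sketch below assumes each token has already been classified as one of 'number', 'alnum', 'symbolic', 'keyword', or 'other'; the kind labels and the name needs_whitespace are assumptions made for illustration.

```python
def needs_whitespace(left, right):
    """Decide whether whitespace is required between two adjacent tokens.

    left and right are (kind, text) pairs, where kind is one of
    'number', 'alnum', 'symbolic', 'keyword', or 'other'.
    """
    left_kind, _ = left
    right_kind, right_text = right
    wordlike = ("number", "alnum")
    if left_kind in wordlike and right_kind in wordlike:
        return True   # numeric literals and/or alphanumeric names
    if left_kind == "symbolic" and right_kind == "symbolic":
        return True   # two symbolic names
    if left_kind in wordlike and right_kind == "keyword":
        return True   # literal or name before a keyword
    if (left_kind == "alnum" and right_kind == "symbolic"
            and right_text.startswith(":")):
        return True   # name before a symbolic name starting with :
    return False
```

For example, needs_whitespace(("alnum", "x"), ("symbolic", ":=")) is True, while needs_whitespace(("alnum", "x"), ("symbolic", "+")) is False.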

By convention, most non-punctuation tokens except those beginning with . are separated by whitespace to increase readability, but this is not required. Whitespace is sometimes placed before or after punctuation as well, wherever it seems to increase readability.

Meta-Syntax

After lexical syntax, the syntax of Lunar is defined entirely by recursive descent parse methods written in the language.

All parsing is LL(1) with one small exception involving newlines and one large exception in the def statement; see Parsing the Def Statement for details.

Parsing is sensitive to known definitions in scope, which allows extensible operators and macros.

A parse method takes four parameters: the token-stream, the current indentation, the current scope, and a required? flag. If the parse method does not recognize its input, the result depends on the required? flag: the method returns false if required? is false or signals a parse error if required? is true. The required? flag is passed down in head-recursive calls with no alternatives, causing the error message to be generated at the most informative level. The required? flag is true after the first token in a construct since we are committed to that construct, being LL(1).
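The required? protocol can be sketched in Python. Everything here is illustrative: TokenStream, ParseError, parse_name, and parse_parenthesized_name are hypothetical, and the grammar (a name wrapped in parentheses) was chosen only to show where the flag turns true.

```python
class ParseError(Exception):
    pass


class TokenStream:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        token = self.peek()
        self.pos += 1
        return token


def parse_name(tokens, indentation, scope, required):
    token = tokens.peek()
    if isinstance(token, str) and token.isidentifier():
        return tokens.next()
    if required:
        raise ParseError(f"expected a name, found {token!r}")
    return False          # not recognized, and the caller did not insist


def parse_parenthesized_name(tokens, indentation, scope, required):
    if tokens.peek() != "(":
        # Nothing consumed yet, so the caller's flag decides the outcome.
        if required:
            raise ParseError(f"expected '(', found {tokens.peek()!r}")
        return False
    tokens.next()
    # Past the first token we are committed, so required is now true.
    name = parse_name(tokens, indentation, scope, True)
    if tokens.next() != ")":
        raise ParseError("expected ')'")
    return name
```

With required false, parse_parenthesized_name returns False for the tokens ["x"] because nothing has been consumed. For ["(", "42", ")"] it raises ParseError even with required false, since the opening parenthesis has already committed the parser to this construct.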

In the program syntax, there are two ways to reach a parse method. The first way is to parse a specific syntactic type. For example, the syntactic type named expression is the basis of programs, so the compiler would call parse_expression to get a piece of a program. By convention, the name of the function that contains a parse method is the name of the syntactic type preceded by "parse_".

The second way is macros. When the expression parse method sees a name that is defined as a macro, it invokes the parse method for that macro, also known as the macro expander. Macros are the basis of all idiosyncratic syntax in Lunar. In this way, the syntax of the language is defined within the language and can be changed by users. Each module could have its own syntax. If we don't want to use the standard expression parser at all in a given source file, the modifier keyword syntax: will direct the compiler to use a different parser.

Each module can replace all the statements and operators of Lunar, simply by defining names. More importantly, since the language is defined within itself, user-defined language extensions and embedded languages can do anything the base language can do. There are no magic constructs.

A prefix macro parse method takes five parameters: the token-stream, the current indentation, the current scope, the variable set of modifiers preceding the statement, and the hygienic context of the statement. The required? flag accepted by ordinary parse methods is always implicitly true for a macro parse method. The parse method parses as many tokens as it likes out of the token-stream and returns a result. If the result is an expression, it is the expansion of the macro. Otherwise the result must be a sequence of tokens, which are parsed before reading anything more from the token-stream.
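The two kinds of macro result can be sketched in Python. The sketch simplifies heavily: the macro parse method here receives only the token stream rather than all five parameters, an expression is modeled by a hypothetical Expression class, and the macros alias and quote are invented for the example.

```python
from collections import deque


class Expression:
    def __init__(self, text):
        self.text = text


def parse_expression(tokens, macros):
    """Parse one expression from a deque of tokens, expanding macros."""
    head = tokens.popleft()
    if head in macros:
        result = macros[head](tokens)
        if isinstance(result, Expression):
            return result                     # the expansion itself
        # Otherwise a token sequence: put it at the front of the stream so
        # it is parsed before anything more is read from the stream.
        tokens.extendleft(reversed(result))
        return parse_expression(tokens, macros)
    return Expression(head)


# Hypothetical macros: one returns tokens, the other returns an expression.
MACROS = {
    "alias": lambda tokens: ["real_name"],
    "quote": lambda tokens: Expression("quoted " + tokens.popleft()),
}
```

Parsing deque(["alias", "rest"]) yields an expression for real_name and leaves "rest" in the stream, which shows that the returned tokens were consumed ahead of the remaining input.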

An infix macro parse method takes six parameters: the left-hand-side expression (which precedes the operator and has already been parsed), the token-stream, the current indentation, the current scope, the hygienic context of the statement, and a method head flag which is false when parsing an expression and true when parsing a method head, which may have slightly different syntax on the right-hand side. The result of an infix macro parse method is the same as for a prefix macro when parsing an expression. When parsing a method head, the result must be a method_head or a call_expression.

For convenience, rather than writing a parse method directly in raw imperative form, you can write it in a more declarative, pattern-directed form using one of the macros defparser, defsyntax, defmacro, and defoperator. These Lunar macros translate the patterns into Lunar code to do the parsing. Lunar patterns are pretty powerful, although not powerful enough to write the parser for expressions, which must be written in imperative form mainly because of operator precedence.

To be usable in patterns, the name of a syntactic type must be alphanumeric and must not contain any underscores.

defparser defines a parse method for a syntactic type, using a pattern to specify what is to be parsed and arbitrary Lunar code to specify what object to return. The code generated by the defparser macro takes care of reading from the token-stream and handling the required? argument. Within the body of a defparser, the names lexer, scope, indentation, and required? are visibly bound to the parameters.

defsyntax is just like defparser except that the parser will also accept a single token that is of the data type with the same name as the syntactic type. This represents an already-parsed instance of the syntactic type. The defsyntax macro generates code to take care of this special processing. The body of a defsyntax must always return an instance of the data type with the same name as the syntactic type, or false. This allows a compound expression type to be defined as a Lunar class, and anywhere in the syntax that the syntactic type appears, the parser will accept either tokens that conform to the grammar of that syntactic type, or an expression that has already been parsed. This is very useful in connection with macros and templates.
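The extra behavior that defsyntax adds over defparser can be sketched in Python as a wrapper around a grammar-driven parser. The names accept_preparsed, Pair, and parse_pair_tokens are hypothetical, the token stream is modeled as a plain list, and the grammar (name -> name) is invented for the example.

```python
class Pair:
    def __init__(self, left, right):
        self.left, self.right = left, right


def parse_pair_tokens(tokens):
    """Grammar-driven parser for the invented syntax: left -> right."""
    if len(tokens) >= 3 and tokens[1] == "->":
        left, right = tokens[0], tokens[2]
        del tokens[:3]
        return Pair(left, right)
    return False


def accept_preparsed(node_type, grammar_parser):
    """Wrap a parser so that it also accepts one already-parsed instance."""
    def parser(tokens):
        if tokens and isinstance(tokens[0], node_type):
            return tokens.pop(0)      # already parsed: take it as a unit
        return grammar_parser(tokens)
    return parser


parse_pair = accept_preparsed(Pair, parse_pair_tokens)
```

Given the stream ["a", "->", "b"], parse_pair builds a new Pair from the tokens. Given a stream whose first element is an existing Pair object, as might happen when a template splices in a previously parsed expression, it returns that same object unchanged.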

defmacro defines a prefix macro. When the name of a macro appears in an expression, idiosyncratic syntax follows. defmacro defines the name of the macro to be a bundle with operator nature, which contains a parse method similar to those defined by defparser, except the name required? does not have a visible binding. The names context and previous_context are visibly bound to a new unique context for hygienic macros and the context of the macro name in the call, respectively. The result of the parse method must be an expression or a sequence of tokens. Often this result is produced by a template, introduced by the ` macro. A template evaluates to the sequence of tokens described by the template.

defoperator can define an infix macro. An infix operator macro that parses no tokens on the right-hand side is effectively a suffix macro.
