Lunar Programming Language

by David A. Moon
January 2017 - January 2018



Strings

A string is a constant sequence of characters. The length is the number of characters.

Strings implement the keyed sequence protocol. The keys are non-negative integers which increase monotonically but are not necessarily consecutive. The sequence and keyed sequence protocols use this key as the position. These string positions are not character indexes.

Strings implement the succession protocol, so a substring can be computed from the positions of the first and last characters.

The internal representation is UTF-8 in a multi-slot of 0..255 named utf8. A string position is actually a key of this keyed sequence of UTF-8 bytes. The primary constructor takes a sequence of bytes as its actual parameter. There are pseudo-constructor methods to construct a string from many object types. Note that unlike other sequence constructors, the string constructors take a single parameter, rather than taking each sequence member as a separate actual parameter.
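
To illustrate why string positions are not character indexes, here is a minimal sketch written in Python rather than Lunar. It models a string as its UTF-8 bytes and lists the byte offset at which each character starts. The helper name string_positions is hypothetical and not part of Lunar.

```python
def string_positions(text):
    """Return the UTF-8 byte offset at which each character of text starts."""
    positions = []
    offset = 0
    for ch in text:
        positions.append(offset)
        # A character occupies 1 to 4 bytes in UTF-8.
        offset += len(ch.encode("utf-8"))
    return positions


# 'a' is 1 byte, 'é' is 2 bytes, '€' is 3 bytes, 'b' is 1 byte.
print(string_positions("aé€b"))  # [0, 1, 3, 6]
```

The positions increase monotonically but skip values wherever a character takes more than one byte, which is the behavior described for string keys above.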

To build up a string incrementally, use a stack[character] as a variable "string buffer"; when finished, pass it to string to convert it to a string.

string could have been defined by:

sealed:
defclass string(utf8 sequence[0..255]) \
            constant_succession[character],
            reversible_sequence[character]
  utf8[utf8.length] = utf8

;; Implement sequence protocol

def iterate(s string)                      0          ; initial string position
def more?(s string, pos 0..max_length)     pos < s.utf8.length
def next(s string, pos 0..max_length-1)    utf8_to_character(s.utf8, pos)
def iterate(s string, pos 0..max_length-1) pos + utf8_character_length(s.utf8, pos)
def (s string).length                      utf8_length(s.utf8)

;; Implement the reversible_sequence protocol, same position
def reverse_iterate(s string)                      reverse_iterate(s, s.utf8.length)
def reverse_more?(s string, pos -1..max_length-1)  pos >= 0
def reverse_iterate(s string, pos 0..max_length)
  block exit: return
    for next_pos = pos - 1 then next_pos - 1
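      ;; stop before the start, or at any byte that is not a continuation byte (0x80..0xBF)
      if next_pos < 0 or s.utf8[next_pos] < 0x80 or s.utf8[next_pos] >= 0xC0 then return(next_pos)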

;; Implement keyed_sequence protocol

def keyed_iterate(s string)                      0
def next_key(s string, pos 0..max_length-1)      pos
def next_member(s string, pos 0..max_length-1)   next(s, pos)
def keyed_iterate(s string, pos 0..max_length-1) iterate(s, pos)
def (s string)[key 0..max_length-1]              utf8_to_character(s.utf8, key)

def (s string)[key, named: default]
  ;; key must be in range and must address the first byte of a character
  if key in 0..max_length and key < s.end_position and (s.utf8[key] < 0x80 or s.utf8[key] >= 0xC0)
    utf8_to_character(s.utf8, key)
  else
    default

;; Implement succession protocol

def (s string).end_position            s.utf8.length
def (s string)[r range[integer]]       subsuccession(s, r)

;; Pseudo-constructors

def string(x string)    x               ; already a string
def string(x name)      x.spelling
def string(x character) string(character_to_utf8(x))
def string(x false)     "false"
def string(x true)      "true"
def string(x float)     ; decimal floating-point representation
def string(x integer, named: base = 10 2..36) ; representation in base 'base'

;; Convert a sequence of characters to a string
def string(x sequence[character])
  def buffer = stack[0..255]()
  for c in x
    append!(buffer, character_to_utf8(c))
  string(buffer)

;; Every object can be converted to a string
require string(x everything) => string

;; Default method could show class name and values of selected slots
def string(x)
  def buffer = stack[character](class(x).name.spelling...)
  push!(buffer, '(')
  def slotcount := 0
  for slotname => slot in class(x).slots while slotcount < 5
    def value = _internal_slot_value(x, slot)
    if value in number | boolean | character | string | name
      if slotcount > 0 then append!(buffer, ", ")
      append!(buffer, string(slotname))
      append!(buffer, ": ")
      append!(buffer, string(value))
      slotcount := slotcount + 1
  push!(buffer, ')')
  string(buffer)
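
Reverse iteration has to step back over continuation bytes to reach the first byte of the preceding character. The following is a minimal sketch in Python, not Lunar, of that stepping rule over a byte sequence. The names is_character_start and previous_position are hypothetical, and the sketch assumes well-formed UTF-8, in which continuation bytes are exactly those in 0x80..0xBF.

```python
def is_character_start(byte):
    """True unless byte is a UTF-8 continuation byte (0x80..0xBF)."""
    return byte < 0x80 or byte >= 0xC0


def previous_position(utf8, pos):
    """Position of the character that precedes pos, or -1 if there is none."""
    pos -= 1
    while pos >= 0 and not is_character_start(utf8[pos]):
        pos -= 1
    return pos


data = "aé€b".encode("utf-8")  # bytes: 61, C3 A9, E2 82 AC, 62
print(previous_position(data, len(data)))  # 6
print(previous_position(data, 6))          # 3
print(previous_position(data, 3))          # 1
print(previous_position(data, 0))          # -1
```

Starting from the end position and repeating until the result is negative visits the same positions as forward iteration, in the opposite order.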

String-specific functions

The in operator with a string as the right-hand operand accepts as the left-hand operand either a character, which has the usual in sequence meaning, or a string, which tests whether the left-hand string is a substring of the right-hand string.

The position function is similar.

TBD

;; String equality
def (s1 string) = (s2 string)
  s1.length = s2.length and
    for c1 in s1, c2 in s2 using always
      always c1 = c2

;; String comparison
def (s1 string) < (s2 string)
  block exit: return
    for c1 in s1, c2 in s2
      if c1 < c2 then return(true)
      if c1 > c2 then return(false)
    ;; common prefix is equal, so shorter string is less
    return(s1.length < s2.length)

;; TBD > etc.
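
As an illustration of the comparison rule above, here is a minimal Python sketch (not Lunar) that compares two strings character by character and treats the shorter string as less when one is a prefix of the other. The name string_less is hypothetical. Python compares single characters by code point, which this sketch assumes is also the ordering of Lunar characters.

```python
def string_less(s1, s2):
    """True if s1 sorts before s2, comparing character by character."""
    for c1, c2 in zip(s1, s2):
        if c1 < c2:
            return True
        if c1 > c2:
            return False
    # The common prefix is equal, so the shorter string is less.
    return len(s1) < len(s2)


print(string_less("abc", "abd"))  # True
print(string_less("ab", "abc"))   # True
print(string_less("abc", "abc"))  # False
```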


downcase, upcase, alpha_char?, digit_char?

UTF-8 Utilities

The following functions are used in the implementation of strings:

;; The number of characters represented by a sequence of UTF-8 bytes
def utf8_length(utf8 sequence[0..255])
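  for byte in utf8 using count
    ;; count the first byte of each character, i.e. every byte that is not a continuation byte
    count byte < 0x80 or byte >= 0xC0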

;; The number of UTF-8 bytes that constitute the next character
def utf8_character_length(utf8 keyed_sequence[0..255], pos)
  utf8_character_length(utf8[pos])

;; The number of UTF-8 bytes that constitute the character starting with this byte
def utf8_character_length(byte 0..255)
  if byte < 0x80 then 1
  else if byte < 0xE0 then 2
  else if byte < 0xF0 then 3
  else 4

;; The number of UTF-8 bytes that constitute this character
def utf8_character_length(c character)
  def code = c.code
  if code < 128 then 1
  else if code < 2048 then 2
  else if code < 65536 then 3
  else 4

;; A sequence of UTF-8 bytes that encode just one character
def character_to_utf8(c character)
  character_utf8_sequence(c)

constant:
defclass character_utf8_sequence(char character) sequence[0..255]

def (s character_utf8_sequence).length      utf8_character_length(s.char)
def iterate(s character_utf8_sequence)      0
def iterate(s character_utf8_sequence, pos) pos + 1
def more?(s character_utf8_sequence, pos)   pos < utf8_character_length(s.char)
def next(s character_utf8_sequence, pos)
  def code = s.char.code
  if code < 128 then code
  else if code < 2048
    if pos = 0 then 0xC0 + code / 64
    else 0x80 + code & 0x3F
  else if code < 65536
    if pos = 0 then 0xE0 + code / 4096
    else if pos = 1 then 0x80 + (code / 64) & 0x3F
    else 0x80 + code & 0x3F
  else
    if pos = 0 then 0xF0 + code / 262144
    else if pos = 1 then 0x80 + (code / 4096) & 0x3F
    else if pos = 2 then 0x80 + (code / 64) & 0x3F
    else 0x80 + code & 0x3F

;; The next character from a sequence of UTF-8 bytes
def utf8_to_character(utf8 keyed_sequence[0..255], pos)
  def byte = utf8[pos]
  character(if byte < 0x80 then byte
            else if byte < 0xE0 then (byte - 0xC0) * 64 + utf8[pos + 1] - 0x80
            else if byte < 0xF0 then (byte - 0xE0) * 4096 + utf8[pos + 1] * 64 +
                                     utf8[pos + 2] - 0x2080
            else (byte - 0xF0) * 262144 + utf8[pos + 1] * 4096 +
                 utf8[pos + 2] * 64 + utf8[pos + 3] - 0x82080)

;; Make a string from its UTF-8 representation in zero or more bytes
;; This is defined as the constructor of the string class
;; def string(utf8 sequence[0..255])
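
The encoding and decoding arithmetic above can be checked with a small Python sketch (not Lunar). The names code_to_utf8 and utf8_to_code are hypothetical, and the sketch assumes well-formed input with no validation. The subtracted constants fold the 0x80 tag of each continuation byte into one term: 0x2080 = 0x80 * 64 + 0x80, and 0x82080 = 0x80 * 4096 + 0x80 * 64 + 0x80.

```python
def code_to_utf8(code):
    """Encode one code point as a list of UTF-8 byte values."""
    if code < 128:
        return [code]
    if code < 2048:
        return [0xC0 + code // 64, 0x80 + (code & 0x3F)]
    if code < 65536:
        return [0xE0 + code // 4096,
                0x80 + ((code // 64) & 0x3F),
                0x80 + (code & 0x3F)]
    return [0xF0 + code // 262144,
            0x80 + ((code // 4096) & 0x3F),
            0x80 + ((code // 64) & 0x3F),
            0x80 + (code & 0x3F)]


def utf8_to_code(utf8, pos):
    """Decode the code point whose first byte is at utf8[pos]."""
    byte = utf8[pos]
    if byte < 0x80:
        return byte
    if byte < 0xE0:
        return (byte - 0xC0) * 64 + utf8[pos + 1] - 0x80
    if byte < 0xF0:
        return ((byte - 0xE0) * 4096 + utf8[pos + 1] * 64
                + utf8[pos + 2] - 0x2080)
    return ((byte - 0xF0) * 262144 + utf8[pos + 1] * 4096
            + utf8[pos + 2] * 64 + utf8[pos + 3] - 0x82080)


encoded = code_to_utf8(0x20AC)          # the euro sign
print([hex(b) for b in encoded])        # ['0xe2', '0x82', '0xac']
print(hex(utf8_to_code(encoded, 0)))    # 0x20ac
```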

String Interpolation

The $ character in a string literal indicates an interpolation directive when it is not denatured by an immediately preceding \, it is not the last character in the string, and the immediately following character is alphabetic or one of _, ?, !, ¿, ¡, (, [, or {. Otherwise $ just represents itself.

String interpolation is an expression whose result is a string. Characters in the string literal that are not part of an interpolation directive are carried directly into the result.

The interpolation directives for string interpolation are as follows:

$name converts the value of name to a string and inserts it into the result. A character or string inserts itself. A number inserts its decimal value preceded by a minus sign if negative. A name inserts its spelling. A boolean inserts "true" or "false". A sequence inserts its members separated by commas. Anything else inserts the result of string applied to it.

$(expression) evaluates expression and inserts the result in that same way.

$(expression, parameters) evaluates expression and inserts the result with formatting controlled by the actual parameters. See Interpolation Parameters.

$[expression] and $[expression, parameters] are the same as with parentheses, except if the result of expression is false nothing is inserted.

${substring1 & substring2} is an iterating interpolation. If substring1 contains at least one interpolation directive whose value is a sequence, this iterates as many times as the longest such sequence. On each iteration, it inserts substring1 but for each interpolation directive whose value is a sequence, it uses the next member of the sequence, or nothing if the sequence is exhausted. On each iteration but the last, it inserts substring2 after substring1, with the same treatment of any sequences being interpolated.

Otherwise this just processes substring1 in the ordinary way and ignores substring2.

The & substring2 part is optional and can be omitted. If this part is present, spaces that directly precede and/or follow the & will be ignored.

${ cannot be nested inside ${...}.
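
The iteration rule for ${substring1 & substring2} can be modeled with a short Python sketch (not Lunar). Everything here is an assumption made for illustration: Python format fields such as {name} stand in for $name directives, only lists and tuples count as sequences, and the function name iterating_interpolation is hypothetical.

```python
from string import Formatter


def is_sequence(value):
    """Assumption: only lists and tuples are iterated; strings insert themselves."""
    return isinstance(value, (list, tuple))


def field_names(template):
    """Names of the interpolated fields that appear in template."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]


def iterating_interpolation(substring1, substring2, values):
    sequences = [values[n] for n in field_names(substring1)
                 if is_sequence(values[n])]
    if not sequences:
        # No sequence-valued directive: process substring1 once, ignore substring2.
        return substring1.format(**values)
    count = max(len(s) for s in sequences)
    parts = []
    for i in range(count):
        # Use the next member of each sequence, or nothing once it is exhausted.
        step = {k: ((v[i] if i < len(v) else "") if is_sequence(v) else v)
                for k, v in values.items()}
        parts.append(substring1.format(**step))
        if i < count - 1:
            parts.append(substring2.format(**step))
    return "".join(parts)


print(iterating_interpolation("{name}={value}", ", ",
                              {"name": ["a", "b", "c"], "value": [1, 2]}))
# a=1, b=2, c=
```

The last iteration shows both rules at once: the shorter sequence contributes nothing after it is exhausted, and substring2 is not inserted after the final substring1.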

See Templates for a similar feature for program code.

Interpolation Parameters

The formal parameter list that accepts the actual parameters in an interpolation directive is:

named: base = 10          2..36,
       separator = ", "   sequence[character]

base is the radix for conversion of integers to strings.

separator is the string that separates members of a list.
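
As a rough model of how base and separator affect an inserted value, here is a minimal Python sketch (not Lunar) of the conversion rules given for $name. The names insert_value and integer_to_string are hypothetical. Two details are assumptions the text does not settle: digits above 9 are written as lowercase letters, and members of a sequence are converted with the same base and separator as the sequence itself.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"  # assumption: lowercase digits


def integer_to_string(n, base=10):
    """Digits of n in the given base, preceded by a minus sign if negative."""
    if not 2 <= base <= 36:
        raise ValueError("base must be in 2..36")
    if n == 0:
        return "0"
    digits = []
    magnitude = abs(n)
    while magnitude:
        magnitude, remainder = divmod(magnitude, base)
        digits.append(DIGITS[remainder])
    if n < 0:
        digits.append("-")
    return "".join(reversed(digits))


def insert_value(value, base=10, separator=", "):
    """Text that an interpolation directive would insert for value."""
    if isinstance(value, bool):          # check before int: bool is an int in Python
        return "true" if value else "false"
    if isinstance(value, str):           # a string inserts itself
        return value
    if isinstance(value, int):
        return integer_to_string(value, base)
    if isinstance(value, (list, tuple)):
        return separator.join(insert_value(v, base, separator) for v in value)
    return str(value)                    # anything else: generic conversion


print(insert_value(255, base=16))                        # ff
print(insert_value([1, 2, 3]))                           # 1, 2, 3
print(insert_value([10, 11], base=2, separator=" | "))   # 1010 | 1011
```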

TODO: Add more parameters such as min and max width and alignment

TODO: Examples





Creative Commons License
Lunar by David A. Moon is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Please inform me if you find this useful, or use any of the ideas embedded in it.
Comments and criticisms to dave underscore moon atsign alum dot mit dot edu.