Programming Language for Old Timers

by David A. Moon
February 2006 .. September 2008

Comments and criticisms to dave underscore moon atsign alum dot mit dot edu.


Regular Expressions

If you wanted to provide a Regular Expressions library for PLOT, you could do it without making any changes to the lexical syntax (no s/foo/bar/g) and without any special magic global variables (no $*, $1, $2). Here is how:

First you have to distinguish the two use cases for Regular Expressions: matching and replacement.

For matching, define an =~ operator that takes two strings as arguments. It returns true if the right-hand string, understood as a regular expression, matches the left-hand string, and false otherwise. This is a complex string comparison.

When the regular expression uses the grouping construct (parentheses), a successful match returns the substring matched by the group instead of returning true. When there is more than one group in the regular expression, it returns a list of substrings, one per group. If there is no match, it returns false. This is a complex string selection.
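The return-value rules for =~ can be sketched in Python, which is not PLOT and is used here only to pin down the behavior. The function name match_op is hypothetical, and Python's re dialect stands in for whatever regular expression dialect a PLOT library would adopt. An optional group that takes no part in the match comes back as None in this sketch; the proposal does not say what should happen in that case.

```python
import re

def match_op(subject, pattern, flags=0):
    """Hypothetical stand-in for the proposed `subject =~ pattern`."""
    m = re.search(pattern, subject, flags)
    if m is None:
        return False          # no match
    groups = m.groups()
    if len(groups) == 0:
        return True           # no groups: a plain yes/no comparison
    if len(groups) == 1:
        return groups[0]      # one group: the substring it matched
    return list(groups)       # several groups: a list of substrings

print(match_op("sought_text here", "sought_text"))    # True
print(match_op("call 555-1234 now", r"[\)\s\-](\d{3}-\d{4})[\s\.\,\?]"))  # 555-1234
print(match_op("Time: 12:34:56", r"Time: (..):(..):(..)"))  # ['12', '34', '56']
print(match_op("nothing", r"Time: (..)"))              # False
```

Passing re.IGNORECASE as flags approximates the case-independent variants.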

To return sequences of all matching substrings, rather than just returning the first match found, define an =~* operator.
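A minimal Python sketch of =~* follows. The name match_all_op is hypothetical. The text does not say what each element of the result looks like, so the sketch assumes the whole matched substring when the pattern has no groups, the group's substring when it has one group, and a list of substrings when it has several.

```python
import re

def match_all_op(subject, pattern, flags=0):
    """Hypothetical stand-in for the proposed `subject =~* pattern`."""
    results = []
    for m in re.finditer(pattern, subject, flags):
        groups = m.groups()
        if len(groups) == 0:
            results.append(m.group(0))    # assumed: whole match when no groups
        elif len(groups) == 1:
            results.append(groups[0])
        else:
            results.append(list(groups))
    return results                        # empty list when nothing matches

print(match_all_op("a1 b22 c333", r"\d+"))           # ['1', '22', '333']
print(match_all_op("k1=v1;k2=v2", r"(\w+)=(\w+)"))   # [['k1', 'v1'], ['k2', 'v2']]
```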

For replacement, define a :=~ operator which takes a variable (or other L-value) on the left-hand side and one or more replacement rules on the right-hand side. It matches the replacement rules against the value of the variable, which must be a string, and sets the variable to a new string, the result of performing the replacements. Each replacement rule consists of an expression that evaluates to a string understood as a regular expression, an arrow, and a right-hand side that replaces the substring matched by the regular expression. The syntax would look like this:

defoperator :=~
  infix-macro: { ^ ?regexp { \=> | <=> } ?rhs }+ => ...

The right-hand side must be an expression that evaluates to either a string or a function. If a string, the matching substring is replaced with that string. If a function, it is called with one argument for each group in the regular expression; the value of the argument is the substring matched by that group. The function must return a string or false. If it returns false, no replacement occurs.

If the arrow in a replacement rule is => and the regular expression matches, the :=~ construct is finished after doing the replacement. If the arrow is <=>, do the replacement directed by the right-hand side and repeat until no further matches occur; this is the equivalent of Perl's global matching.
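The replacement rules can be sketched in Python as a function that returns the new string; assigning the result back to the variable is left to the caller. Everything named here is hypothetical: replace_op, and the encoding of a rule as a (pattern, arrow, rhs) tuple with the arrow given as a string. The sketch also makes three assumptions that the text leaves open: rules are tried in the order written, a <=> rule is a single left-to-right pass like Perl's /g, and the rules after a <=> rule are still tried.

```python
import re

def replace_op(subject, *rules, flags=0):
    """Hypothetical stand-in for `variable :=~ rule ...`; returns the new string."""
    for pattern, arrow, rhs in rules:
        def substitute(m, rhs=rhs):
            if callable(rhs):
                # One argument per group; False means "leave the text alone".
                result = rhs(*m.groups())
                return m.group(0) if result is False else result
            return rhs                    # plain string replacement
        if arrow == "=>":
            new, count = re.subn(pattern, substitute, subject, count=1, flags=flags)
            if count:
                return new                # a matching => rule ends the construct
        else:                             # "<=>": replace every match
            subject = re.sub(pattern, substitute, subject, flags=flags)
    return subject

print(replace_op("Bill Clinton spoke", ("Bill Clinton", "=>", "Al Gore")))
# Al Gore spoke
print(replace_op("hello world",
                 (r"^([^ ]*) *([^ ]*)", "=>", lambda w1, w2: w2 + " " + w1)))
# world hello
print(replace_op("a-b-c", ("-", "<=>", "+")))   # a+b+c
```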

For case-independent matching, use =~~ or =~~* or :=~~.

In multi-line mode, ^ and $ do not only match the start and end of the input string: ^ also matches after any newline within the input string, and $ also matches before any newline within the input string. For multi-line mode I suggest using "-^" at the beginning of the regular expression and "^-" at the end of the regular expression. They can be used together, so "-^-" matches a blank line within the input string. This extension works because ^ has no meaning except as the first character in a regular expression and inside brackets.
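One way to read the "-^" and "^-" markers is as a translation into a conventional dialect. The Python sketch below rewrites them into scoped multi-line anchors for the re module. The function name translate_multiline is hypothetical, and the sketch assumes the markers only ever appear at the very start and very end of the pattern; it does not check for escaping, so a pattern that ends in an escaped caret followed by a hyphen would be misread.

```python
import re

def translate_multiline(pattern):
    """Rewrite the proposed "-^" / "^-" markers into Python re syntax."""
    if pattern == "-^-":
        return "(?m:^$)"                  # both markers sharing one caret
    prefix = suffix = ""
    if pattern.startswith("-^"):
        prefix, pattern = "(?m:^)", pattern[2:]
    if pattern.endswith("^-"):
        suffix, pattern = "(?m:$)", pattern[:-2]
    return prefix + pattern + suffix

# "-^-" finds the blank line between "a" and "b".
print(re.search(translate_multiline("-^-"), "a\n\nb").span())   # (2, 2)
# An ordinary ^ does not match after the newline, but "-^" does.
print(re.search("^rem", "x\nrem y"))                             # None
print(re.search(translate_multiline("-^rem"), "x\nrem y").group(0))  # rem
```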

Here are some examples taken from Perl documentation and written using the above constructs:

; $string =~ m/sought_text/;
string =~ "sought_text"

; $string =~ m/\s*rem/i;   #true if the first printable text is rem or REM
string =~~ "\s*rem"

; if($string =~ m/^(Clinton|Bush|Reagan)/i)
;   {print "$string\n"};
if string =~~ "^(Clinton|Bush|Reagan)"
  print #"?string\n"

;Print every line with a valid phone number.
; if($string =~ m/[\)\s\-]\d{3}-\d{4}[\s\.\,\?]/)
;   {print "Phone line: $string\n"};
if string =~ "[\)\s\-]\d{3}-\d{4}[\s\.\,\?]"
  print #"Phone line: ?string\n"

;Similar but just print the phone number portion of the input
if def phone = string =~ "[\)\s\-](\d{3}-\d{4})[\s\.\,\?]"
  print #"Phone number: ?phone\n"

;$string =~ s/Bill Clinton/Al Gore/;
string :=~ "Bill Clinton" => "Al Gore"

; my($directory, $filename) = $text =~ m/(.*\/)(.*)$/;
; print "D=$directory, F=$filename\n";
def [directory, filename] = text =~ "(.*/)(.*)$"
print #"D=?directory, F=?filename\n"

;    if (/Time: (..):(..):(..)/) {
;        $hours = $1;
;        $minutes = $2;
;        $seconds = $3;
;    }
def [hours, minutes, seconds] = string =~ "Time: (..):(..):(..)"

; s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words
string :=~ "^([^ ]*) *([^ ]*)" => fun(w1, w2) w2 + " " + w1

It may be useful to optimize cases where the regular expression string is a literal or a constant expression and translate it into PLOT code to perform the requested operation directly. An easy way to do this is to define =~ as an operator macro. The macro can expand into an invocation of a generic regular expression interpreter or into code that performs the requested operation directly.
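Python has no macros, so the following is only a loose analogy for that optimization: the pattern is inspected once, ahead of use, and either specialized into direct code or handed to the general regular expression engine. The name specialize_match is hypothetical, and the sketch handles only the true/false form of matching, not groups.

```python
import re

# Characters with special meaning in Python's re dialect.
META = set(r".^$*+?{}[]\|()")

def specialize_match(pattern):
    """Return a one-argument function that tests `subject =~ pattern`."""
    if not (META & set(pattern)):
        # No metacharacters: the match is just a substring test,
        # so skip the regular expression engine entirely.
        return lambda subject: pattern in subject
    compiled = re.compile(pattern)        # general case: compile once, reuse
    return lambda subject: compiled.search(subject) is not None

has_text = specialize_match("sought_text")   # specialized to a substring test
has_digits = specialize_match(r"\d{3}")      # falls back to the interpreter
print(has_text("the sought_text"), has_digits("ab123"), has_digits("ab12"))
# True True False
```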
