Computing Systems
CS134b, Winter 2005

Programming languages and compilers



Introduction

This document describes the metalanguage used to define the OCaml syntax.  First we give a grammar for the metalanguage, written in the metalanguage itself.  Following that is a prose description, with a discussion of common usage and of how various aspects are meant to be interpreted.


Metalanguage Syntax

    Grammar =
	Rules


    Rules =
	Rules  NL+  Rule
	Rule

    Rule =
	Symbol  Separator                              NL  Alternative*
	Symbol  Separator  {  Character_set_lexeme  }  NL
	Symbol  Separator  Special_character_lexeme    NL


    Alternative* =
	Alternative+
	Empty

    Alternative+ =
	Alternative+  Alternative
	Alternative

    Alternative =
	Symbol+  NL


    Symbol+ =
	Symbol+  Symbol
	Symbol

    Symbol =
	Lexeme


    Separator =
	Lexeme


    Character_set_lexeme =
	Lexeme

    Special_character_lexeme =
	Lexeme


    Empty =


    Lexical_units  include:
	Lexeme
	{
	}
	NL
	White_space(ignored)

    Lexeme =
	Printable_character+

    Printable_character+ =
	Printable_character+  Printable_character
	Printable_character

    Printable_character =
	Letter
	Digit
	Other_printable

    Letter =
	Uppercase_letter
	Lowercase_letter

    Uppercase_letter  any_of:   { ABCDEFGHIJKLMNOPQRSTUVWXYZ }

    Lowercase_letter  any_of:   { abcdefghijklmnopqrstuvwxyz }

    Digit             any_of:   { 0123456789 }

    Other_printable   any_of:   { ,./;'[]-=\`<>?:"{}!@#$%^&*()_+|~ }


    NL+ =
	NL+  NL
	NL

    NL =
	CR  LF
	LF  CR
	CR
	LF


    White_space(ignored) =
	NULL
	Blank
	Tab

    NULL   \=  (ASCII:0)

    Blank  \=  (ASCII:32)

    Tab    \=  (ASCII:9)

    CR     \=  (ASCII:13)

    LF     \=  (ASCII:10)



Description and Discussion

A Grammar is a sequence of Rule's, with one or more blank lines (NL) between the Rule's.  Most Rule's are written with one or more alternative expansions for the non-terminal symbol (the first symbol in the first line of the Rule), with each Alternative being written by itself on a single line.
        The symbols in the metalanguage, both terminal symbols and nonterminal symbols, are simply contiguous sequences of non-blank characters.  The only characters that have any special meaning in the metalanguage are white space characters like Blank or Tab and "newlines" (which are various combinations of CR and LF).  As the grammar here is written, brace characters ({ and }) are also characters that have special meaning in the metalanguage, but notice that this isn't necessary.  We could give the definitions

    { =
	Lexeme

    } =
	Lexeme

and then any sequence of non-blank characters could be used for the type of Rule that uses { and }.
        Reading a single Alternative, there is no way to know which symbols are terminal symbols and which symbols are non-terminal symbols.  The rule is that symbols that appear on the left hand side (which is to say, are the first symbol on the first line of a Rule) are non-terminal symbols, and any other symbol is a terminal symbol that stands for itself.  In addition, most grammars expressed in the metalanguage will follow a convention to indicate which symbols are non-terminal symbols; here that convention is that non-terminal symbols start with a capital letter.
        The Separator that delimits a left hand side non-terminal symbol from its right hand sides is usually written here with '=', but the metalanguage grammar actually allows any symbol to be used.  This freedom is used to advantage within the metalanguage grammar to indicate set membership with 'any_of:' as the delimiting Separator, and '\=' to indicate various special characters.  These separators could just as well have been written as '::=' or '*:xyzzy:*';  both would fit the metalanguage grammar.  Allowing the separators to vary thus gives a certain amount of flexibility in writing grammars (which often are written with '::=' rather than '=') and a small degree of descriptive ability within the metalanguage.
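
        The conventions described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions, not part of the metalanguage definition: the function names split_rules and classify_symbols and the small Sum/Digit grammar are invented for the example, and NULL is not treated as white space here.

```python
def split_rules(text):
    """Return a list of rules; each rule is a list of lines, each line a list of symbols."""
    rules, current = [], []
    for line in text.splitlines():
        symbols = line.split()      # runs of blanks and tabs only separate symbols
        if symbols:
            current.append(symbols)
        elif current:               # a blank line closes the rule being collected
            rules.append(current)
            current = []
    if current:
        rules.append(current)
    return rules


def classify_symbols(rules):
    """Left hand side symbols are non-terminals; any other symbol in an Alternative is a terminal."""
    nonterminals = {rule[0][0] for rule in rules}
    terminals = set()
    for rule in rules:
        for alternative in rule[1:]:    # lines after the first are Alternatives
            terminals.update(s for s in alternative if s not in nonterminals)
    return nonterminals, terminals


# Hypothetical grammar, written with two different separators.
EXAMPLE = """
Sum ::=
    Sum  +  Digit
    Digit

Digit  any_of:  { 0123456789 }
"""

rules = split_rules(EXAMPLE)
nonterminals, terminals = classify_symbols(rules)
separators = [rule[0][1] for rule in rules]   # second symbol of each first line
print(sorted(nonterminals), sorted(terminals), separators)
```

Here '+' is the only terminal, because it never appears as the first symbol of a Rule, and the two separators are accepted without any special treatment.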


Types of Rule's

There are three kinds of Rule's that may be expressed, or four if the empty Rule is counted separately.

Most Rule's are written with one or more Alternative lines following the first line of the Rule, which gives the left hand side non-terminal; these are the usual kind of context free grammar rules.  The special case of no Alternatives (as exemplified here by Empty) means a non-terminal that may be generated from "nothing".  Some texts on formal languages refer to this case as "epsilon", sometimes written out and sometimes shown by the Greek letter.
        A few Rule's are used to express any single character out of a set of characters, all characters in the set being printable.  These Rule's are written in the single line format with { and } surrounding the set of single character alternatives.
        Finally, various special characters are written in a single line style that has exactly three Lexemes in the one line of the Rule.  The grammar here uses '\=' to indicate a kind of special definition (although, as before, that is just a convention intended to help readability).  The interpretation of the defining (third) Lexeme is not defined.  Clearly, the metalanguage grammar given here is intended to be suggestive of a numeric value in a particular character set (ASCII), but any single Lexeme indicator may be used, with an interpretation to be supplied externally.
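
        The kinds of Rule can be told apart from the shape of the first line alone.  The following Python sketch shows one way to do so; the function names and the kind labels are invented for this example, and the interpretation of the third Lexeme of a special character Rule is deliberately left alone.

```python
def rule_kind(rule_lines):
    """Classify one Rule, given as a list of its text lines, by the shape of its first line."""
    first = rule_lines[0].split()
    alternatives = [line for line in rule_lines[1:] if line.split()]
    if len(first) == 2:
        # Symbol Separator, followed by zero or more Alternative lines
        return "alternatives" if alternatives else "empty"
    if len(first) == 5 and first[2] == "{" and first[4] == "}":
        # Symbol Separator { Character_set_lexeme }
        return "character_set"
    if len(first) == 3:
        # Symbol Separator Special_character_lexeme
        return "special_character"
    raise ValueError("first line does not match any Rule form: " + rule_lines[0])


def character_set(rule_lines):
    """Return the set of single characters named by a character set Rule."""
    return set(rule_lines[0].split()[3])


print(rule_kind(["NL =", "\tCR  LF", "\tLF"]))
print(rule_kind(["Empty ="]))
print(rule_kind(["Digit  any_of:  { 0123456789 }"]))
print(rule_kind(["Tab  \\=  (ASCII:9)"]))
```

The four calls report alternatives, empty, character_set and special_character respectively.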


Parsing and Lexing

The first non-terminal of the grammar is always taken to be the goal symbol of the language being defined.  As a matter of good form the metalanguage mandates that the goal non-terminal not be referenced in any Alternative in the grammar.
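
        The requirement on the goal symbol is easy to check mechanically.  The sketch below assumes a grammar already split into rules, each rule being a list of lines and each line a list of symbols; that representation, the function name and both sample grammars are assumptions made for the example.

```python
def references_to_goal(rules):
    """Return the goal symbol and every (left hand side, Alternative) pair that mentions it."""
    goal = rules[0][0][0]               # first symbol of the first Rule
    offenders = [
        (rule[0][0], alternative)
        for rule in rules
        for alternative in rule[1:]     # lines after the first are Alternatives
        if goal in alternative
    ]
    return goal, offenders


# Follows the convention: Grammar is never referenced on a right hand side.
well_formed = [
    [["Grammar", "="], ["Rules"]],
    [["Rules", "="], ["Rules", "NL+", "Rule"], ["Rule"]],
]

# Breaks the convention: the goal Expr refers to itself.
ill_formed = [
    [["Expr", "="], ["Expr", "+", "Term"], ["Term"]],
]

print(references_to_goal(well_formed))
print(references_to_goal(ill_formed))
```

A grammar such as the second one can be brought into line by adding a new first Rule whose only Alternative is the old goal symbol.
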
        In addition, there is one other unreferenced non-terminal.  In the grammar here that non-terminal is Lexical_units.  This non-terminal identifies non-terminals that are returned as a single token by the lexer.  The usual rules apply for lexing, namely, the largest next lexical unit is assembled as the next token, and a specific terminal takes precedence over a more general class to which it may belong.  In a grammar for C, for example, we might have

    Statement ::=
	while  (  Expression  )  Statement

The 'while' terminal symbol is identified as itself and not as an Identifier, even though it matches the grammatical definition of Identifier.  If it is desired (which it isn't in most cases) that keywords also be allowed as Identifier's, that may be accomplished by something along these lines:

    Lexical_units ...::=
	Identifier

    Identifier ::=
	while
	Non-keywordIdentifier

    Non-keywordIdentifier ::=
	Non-keywordIdentifier  Letter|Digit|_
	Letter
	_

    Letter|Digit|_ ::=
	Letter
	Digit
	_

    Letter ::=
	UppercaseLetter
	LowercaseLetter

    UppercaseLetter ::=  { ABCDEFGHIJKLMNOPQRSTUVWXYZ }

    LowercaseLetter ::=  { abcdefghijklmnopqrstuvwxyz }

    Digit ::=  { 0123456789 }

The inclusion of the 'while' terminal symbol as an alternative for Identifier would allow it to be generated as an Identifier even though both 'while' and Identifier are lexical units.  Incidentally, notice the use of '...::=' as a separator to emphasize that the set of alternatives given is not exhaustive (ordinarily a Rule lists all of its alternatives).
        The Lexical_units non-terminal also specifies some other Alternative expansions that do not appear in the Alternatives reachable from the goal symbol.  These Alternatives are used to identify characters that serve to separate tokens but otherwise are ignored - for example, white space or comments.
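
        The lexing rules described in this section (longest match, a specific terminal winning over a general class, and ignored separators) can be sketched as follows.  The token tables are assumptions chosen to fit the C-like 'while' example, the Identifier pattern follows the Non-keywordIdentifier Rule above, and treating CR and LF as ignored is a choice made for this example only.

```python
import re

SPECIFIC_TERMINALS = ["while", "(", ")"]        # terminals that stand for themselves
TOKEN_CLASSES = [
    ("Identifier", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
]
IGNORED = re.compile(r"[ \t\r\n]+")             # separates tokens, otherwise dropped


def tokenize(source):
    """Return (kind, text) pairs using longest match with terminals preferred on ties."""
    tokens, pos = [], 0
    while pos < len(source):
        skipped = IGNORED.match(source, pos)
        if skipped:
            pos = skipped.end()
            continue
        candidates = []                         # entries are (length, priority, kind, text)
        for terminal in SPECIFIC_TERMINALS:
            if source.startswith(terminal, pos):
                candidates.append((len(terminal), 1, terminal, terminal))
        for kind, pattern in TOKEN_CLASSES:
            match = pattern.match(source, pos)
            if match:
                candidates.append((len(match.group()), 0, kind, match.group()))
        if not candidates:
            raise ValueError("unexpected character %r at position %d" % (source[pos], pos))
        # Longest candidate first; on equal length the specific terminal wins.
        length, _, kind, text = max(candidates, key=lambda c: (c[0], c[1]))
        tokens.append((kind, text))
        pos += length
    return tokens


print(tokenize("while (whilex)"))
```

In this input 'while' is returned as itself, while 'whilex' is returned as an Identifier because the longer match is preferred over the keyword prefix.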



Copyright (c) 2005 Caltech CS134 Course Administration.
Computer Science Dept., California Institute of Technology