akrantiain manual

2. The syntactic structure of a sentence sequence

2-0. Structure

A sentence sequence is made up of sentences. A sentence either defines an identifier, defines a conversion rule, or specifies a setting.
Akrantiain works by applying the conversion rules to the input in the order in which they are written in the .snoj file.
Thus, while the order of conversion rule definitions matters, the order and positions of identifier definitions and setting specifiers never affect akrantiain's behavior.
A semicolon or a newline terminates a sentence; thus, a single sentence cannot span multiple lines.
In other words, the semicolon that terminates a sentence can be omitted when it is at the end of a line or is followed only by a single-line comment. It can also be omitted immediately before a }.

2-1. Comments

Anything written after a # until the end of the line is ignored, unless the # is inside a string literal or a slash literal.
Multi-line comments do not exist.
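
The comment rule can be illustrated with a small Python sketch. This is not akrantiain's own implementation; the function name strip_comment is hypothetical, and the sketch assumes that outside of string literals a / only ever opens a slash literal.

```python
def strip_comment(line: str) -> str:
    """Return `line` without its trailing # comment, if it has one."""
    closing = None  # delimiter of the literal we are inside, or None
    i = 0
    while i < len(line):
        ch = line[i]
        if closing is not None:
            if ch == "\\":
                i += 2  # skip the backslash and the character it escapes
                continue
            if ch == closing:
                closing = None  # the literal ends here
        elif ch in ('"', "/"):
            closing = ch  # a string literal or slash literal starts here
        elif ch == "#":
            # A # outside any literal starts a comment.
            return line[:i].rstrip()
        i += 1
    return line
```

For instance, strip_comment('"#" -> /#/  # sharp') keeps both literals and drops only the trailing comment.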

2-2. Defining identifiers

Identifiers are defined using the following syntax:

identifier = 1 or more string literal | 1 or more string literal | 1 or more string literal | 1 or more string literal;

For example,

sample = "foo" | "bar" "baz" | "foobar";

defines the identifier sample as a pattern that matches either "foo", "bar" "baz", or "foobar".
An identifier is a letter followed by 0 or more of {either a letter, a digit, or an _}. An identifier cannot start with a digit or a _.
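
As a rough illustration, the naming rule above can be written as a regular expression in Python. The sketch assumes that a letter means an ASCII letter (A-Z, a-z), which this manual does not state explicitly; the names IDENTIFIER and is_identifier are hypothetical.

```python
import re

# Assumption: only ASCII letters count as letters here.
# First character: a letter. Rest: letters, digits, or underscores.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_identifier(name: str) -> bool:
    """Return True if `name` has the shape of an identifier."""
    return IDENTIFIER.fullmatch(name) is not None
```

Under this assumption, sample and v2_long are accepted, while 2nd and _x are rejected.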

An error occurs when you define a single identifier multiple times.

A special identifier named PUNCTUATION exists. For example, when you define PUNCTUATION = "." | "," | "!" | "?";, akrantiain recognizes these characters as punctuation.
This is necessary, for example, to let akrantiain know that the character n in the input "This is a pen." terminates a word.
Currently, using a string literal longer than one character (e.g. "foo") inside the definition of PUNCTUATION may result in unintended behavior.
This is a bug, and it is intended to be fixed in the future.
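
The effect of PUNCTUATION on word boundaries can be pictured with the following Python sketch. It is only a conceptual model and not how akrantiain is implemented: it assumes that whitespace and the listed punctuation characters are the only things that end a word, and split_words is a hypothetical name.

```python
import re


def split_words(text: str, punctuation: list[str]) -> list[str]:
    """Split `text` into words at whitespace and punctuation characters."""
    # Escape each character so that ".", "?" etc. are taken literally.
    separators = "".join(re.escape(c) for c in punctuation)
    pieces = re.split(rf"[\s{separators}]+", text)
    return [p for p in pieces if p]  # drop empty pieces at the edges
```

With the definition above, split_words("This is a pen.", [".", ",", "!", "?"]) yields the four words This, is, a and pen, so the final n is the last character of a word. With an empty punctuation list, the last piece would be pen. instead.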

2-3. Defining conversion rules

Let SELECT be ^, a single identifier, a single string literal or ( 1 or more string literal | 1 or more string literal | 1 or more string literal ).
Let CONDITION be ! followed by a SELECT.
Let PHONEME be either $ or a slash literal.
Conversion rules are defined using the following syntax:

0 or 1 CONDITION 1 or more SELECT 0 or 1 CONDITION -> 1 or more PHONEME;

There are two restrictions:

The number of {SELECTs (excluding ^) that make up 1 or more SELECT} must be equal to the number of {PHONEMEs that make up 1 or more PHONEME}; if not, an error occurs.
The 1 or more PHONEME part cannot consist fully of $s. Violating this also results in an error.
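
The two restrictions can be sketched in Python as follows. The sketch assumes the rule has already been split into a list of SELECTs and a list of PHONEMEs, each given as a string; rule_errors is a hypothetical name, and the messages are not akrantiain's actual error messages.

```python
def rule_errors(selects: list[str], phonemes: list[str]) -> list[str]:
    """Return a list of the restrictions that the rule violates."""
    errors = []
    # A ^ matches a word boundary and is not counted.
    counted = [s for s in selects if s != "^"]
    if len(counted) != len(phonemes):
        errors.append(
            f"{len(counted)} SELECT(s) but {len(phonemes)} PHONEME(s)"
        )
    # The right-hand side may contain $, but not only $.
    if all(p == "$" for p in phonemes):
        errors.append("right-hand side consists only of $")
    return errors
```

A rule with the SELECTs "n", ^, ("m"|"p") and the PHONEMEs /m/, $ passes both checks, because the ^ is excluded from the count.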

For example,

"u" "g" !vowel -> /u/ /u/;

defines a rule which tells akrantiain to do the following:
 "When a "u" is followed by a "g", which in turn is NOT followed by something that matches the pattern vowel,
  convert the "u" into /u/ and the "g" into /u/."

Whitespace characters are not allowed inside ->; you cannot split a -> into a - and a >.

A ^ is a pattern that matches a word boundary. A $ says: "This rule does not convert the corresponding left-hand side; let other rules written below handle it."
The following is an example which uses a ^ and a $:

"n" ^ ("m"|"p") -> /m/ $

This defines a rule that says:
 "When "n" is followed by a word boundary, which in turn is followed by something that matches the pattern ("m"|"p"),
  convert the "n" into /m/; whatever matched ("m"|"p") is left to the rules that follow this one."

An error occurs when an identifier used inside the definition of a conversion rule is not defined.

It is forbidden to use, directly or indirectly (through identifiers), a string literal containing punctuation characters (i.e. characters used in the definition of PUNCTUATION) or spaces inside a conversion rule definition.
An error occurs when you try to do so.

2-4. Specifying settings

A setting specifier has the following structure.

@ identifier ;

This sentence tells akrantiain to behave differently from what it normally does.
It is legal to specify the same setting multiple times.

Setting specifiers are case-insensitive; @ uSe_NFd ; behaves exactly the same as @ USE_NFD ;.

The following is the list of identifiers that currently have settings associated with them.

Identifiers	Meaning
`CASE_SENSITIVE`	Make string literals case-sensitive
`FALL_THROUGH` `FALLTHROUGH` `FALL_THRU` `FALLTHRU`	When parts of the input aren't consumed by the conversion rules, make them fall through instead of throwing an error
`USE_NFD`	Allow, for example, an accented letter to be analyzed as the combination of an unaccented letter and a combining diacritic
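
The case-insensitivity of setting specifiers and the alternative spellings in the table can be modeled with the Python sketch below. The names ALIASES, canonical_setting and active_settings are hypothetical. Returning None for an unknown identifier is an assumption of the sketch, since this manual does not say how akrantiain treats one.

```python
# Every accepted spelling, mapped to one canonical name per setting.
ALIASES = {
    "CASE_SENSITIVE": "CASE_SENSITIVE",
    "FALL_THROUGH": "FALL_THROUGH",
    "FALLTHROUGH": "FALL_THROUGH",
    "FALL_THRU": "FALL_THROUGH",
    "FALLTHRU": "FALL_THROUGH",
    "USE_NFD": "USE_NFD",
}


def canonical_setting(identifier: str):
    """Return the canonical setting name, or None if it is not listed."""
    # Upper-casing makes the lookup case-insensitive.
    return ALIASES.get(identifier.strip().upper())


def active_settings(identifiers):
    """Collect settings into a set, so repeating one has no extra effect."""
    names = (canonical_setting(i) for i in identifiers)
    return {n for n in names if n is not None}
```

Here canonical_setting("uSe_NFd") and canonical_setting("USE_NFD") both give "USE_NFD", and specifying a setting twice leaves the resulting set unchanged.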

2-5. String literal

A string literal is a literal that describes a string to be converted from. A string literal begins with a " and ends with a ", and it does not contain a newline within itself.
In order to express the character " inside a string literal, you must write \".
In order to express the character \ inside a string literal, you must write \\.
A character in Unicode's Basic Multilingual Plane can also be represented inside a string literal by writing \u followed by the character's code point as four hexadecimal digits (0-9, A-F, a-f).

By default, string literals are case-insensitive.
You need to specify the setting CASE_SENSITIVE to make string literals case-sensitive.

2-6. Slash literal

A slash literal is a literal that describes a string to be converted to. A slash literal begins with a / and ends with a /, and it does not contain a newline within itself.
In order to express the character / inside a slash literal, you must write \/.
In order to express the character \ inside a slash literal, you must write \\.
A character in Unicode's Basic Multilingual Plane can also be represented inside a slash literal by writing \u followed by the character's code point as four hexadecimal digits (0-9, A-F, a-f).
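
The escape rules for string literals and slash literals can be summarized in one Python sketch. It is an illustration, not akrantiain's parser; decode_literal is a hypothetical name. Rejecting every escape other than the ones described in this manual is an assumption of the sketch.

```python
import string


def decode_literal(source: str) -> str:
    """Decode a complete string literal ("...") or slash literal (/.../)."""
    if len(source) < 2 or source[0] not in '"/' or source[-1] != source[0]:
        raise ValueError("not a string literal or slash literal")
    delim, body = source[0], source[1:-1]
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\n":
            raise ValueError("a literal cannot contain a newline")
        if ch == delim:
            raise ValueError("the delimiter must be escaped inside a literal")
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        nxt = body[i + 1:i + 2]  # empty if the backslash is the last char
        if nxt in ("\\", delim):
            out.append(nxt)  # \\ or an escaped delimiter
            i += 2
        elif nxt == "u":
            digits = body[i + 2:i + 6]
            if len(digits) != 4 or any(d not in string.hexdigits for d in digits):
                raise ValueError("\\u needs exactly four hexadecimal digits")
            out.append(chr(int(digits, 16)))
            i += 6
        else:
            # Assumption: escapes not described in the manual are rejected.
            raise ValueError("unsupported escape sequence")
    return "".join(out)
```

For example, decoding the string literal "a\"b" gives a"b, and decoding the slash literal /a\/b/ gives a/b.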