For people who don’t care about multilingual support in their grammars, a token type like this might be sufficient:
IDENTIFIER = <<[A-Za-z][A-Za-z0-9_]*>>

I wanted to expand this to support more than just the basic Latin alphabet, hoping it would be as easy as rewriting the expression with character classes, now that Grammatica 1.5 apparently supports Unicode regular expressions. First I tried something like:

IDENTIFIER = <<[[:alpha:]][[:alnum:]]*>>

However, it appears that Grammatica ignores this kind of structure entirely, treating it as a plain character set made up of colons and the letters a, l, p, h. So I went looking and found the \p{Class} notation for property classes…
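For what it’s worth, plain java.util.regex reads [[:alpha:]] the same way, which makes the behaviour easy to reproduce outside Grammatica. The sketch below is only an illustration: the class and method names are made up, and it talks to Java’s regex library directly rather than to Grammatica’s tokenizer, which may differ in other respects.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Grammatica.
class BracketClassDemo {
    // Java has no POSIX bracket syntax, so this is an outer set containing
    // a nested set of the literal characters ':', 'a', 'l', 'p', 'h'.
    static final Pattern BRACKET = Pattern.compile("[[:alpha:]]");

    static boolean matches(String s) {
        return BRACKET.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "p", ":", "z" }) {
            System.out.println("'" + s + "' -> " + matches(s));
        }
    }
}
```

A letter like z is rejected simply because it does not appear in the word "alpha", while a colon is accepted.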

IDENTIFIER = <<[\p{L&}][\p{L&}]*>>

Forms like \p{L&} weren’t working either, so I consulted Java’s own set of property classes, which lists Alpha and Alnum, and the token turned into:

IDENTIFIER = <<[\p{Alpha}][\p{Alnum}]*>>
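One caveat about these Java classes: in java.util.regex, \p{Alpha} and \p{Alnum} are POSIX classes that cover US-ASCII only unless the pattern is compiled with the UNICODE_CHARACTER_CLASS flag. Whether Grammatica sets that flag is not something I have verified, so treat the following as a sketch of the Java library’s behaviour only; the class and method names are made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it exercises java.util.regex, not Grammatica.
class PosixClassDemo {
    static final String EXPR = "\\p{Alpha}\\p{Alnum}*";

    // Default mode: POSIX classes match US-ASCII characters only.
    static final Pattern ASCII_ONLY = Pattern.compile(EXPR);

    // With this flag the POSIX classes follow Unicode properties instead.
    static final Pattern UNICODE_AWARE =
        Pattern.compile(EXPR, Pattern.UNICODE_CHARACTER_CLASS);

    static boolean asciiOnly(String s) {
        return ASCII_ONLY.matcher(s).matches();
    }

    static boolean unicodeAware(String s) {
        return UNICODE_AWARE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String word = "gr\u00f6\u00dfe"; // "groesse" spelled with o-umlaut and sharp s
        System.out.println("default flags: " + asciiOnly(word));
        System.out.println("unicode flag:  " + unicodeAware(word));
    }
}
```

So even where the Alpha/Alnum version is accepted, it may not buy any multilingual support on its own.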

This made it past the grammar build, targeting .NET for the tokenizer code. But when the compiler runs, from all appearances it is using .NET’s regular expression library, which does not accept {Alpha} and {Alnum} and expects Unicode general category names such as {Ll} and {Lu} instead.

After a prime moment of head scratching I came up with a solution (though it’s not as pretty as the formulations above):

IDENTIFIER = <<[\p{Ll}\p{Lu}\p{Lt}][\p{Ll}\p{Lu}\p{Lt}\p{Nd}]*>>

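This makes it through grammar compilation and parsing, and matches the input I expected. It accepts any cased letter (upper, lower, or title case) as the first character, and any cased letter or decimal digit for the rest. Note that, unlike the ASCII version, it no longer allows underscores, and letters without case (category Lo, which covers scripts such as Hebrew, Arabic, and CJK) are not included; those would have to be added to the sets if you need them.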
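To sanity-check what the category-based expression accepts, here is a minimal sketch that runs the same character classes through java.util.regex. It assumes Java’s handling of \p{Ll}, \p{Lu}, \p{Lt} and \p{Nd} is representative; the generated .NET tokenizer is not involved, and the class name, method name and sample strings are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it only mirrors the IDENTIFIER expression.
class UnicodeIdentifierDemo {
    // Same sets as the token definition, escaped for a Java string literal.
    static final Pattern IDENTIFIER = Pattern.compile(
        "[\\p{Ll}\\p{Lu}\\p{Lt}][\\p{Ll}\\p{Lu}\\p{Lt}\\p{Nd}]*");

    static boolean isIdentifier(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "count1",                               // ASCII letters and a digit
            "gr\u00f6\u00dfe",                      // German, lower-case letters
            "\u0395\u03bb\u03bb\u03ac\u03b4\u03b1", // Greek, upper then lower case
            "1abc",                                 // starts with a digit
            "snake_case",                           // underscore is not in the sets
            "\u53d8\u91cf"                          // CJK, category Lo (no case)
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isIdentifier(s));
        }
    }
}
```

The first three samples are accepted and the last three are rejected, which lines up with the description above: cased letters and decimal digits only.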

The inability to use the Java-style classes looks like a bug or oversight in the C# port of Grammatica, unless I’m missing a more elegant way to pull this off?

In any case it works, so that’s what I’m using. Grammatica has a very long release cycle and they *just* released version 1.5, so it’s unlikely there will be a new version for quite a while. For anyone who needs Unicode-level identifier tokens, you’re welcome!
