Skip to content

A Side Project – Words

January 19, 2014

words-smThis is the first of (hopefully) several posts about a scripting language I have been working on in my spare time. I’m going to keep these posts relatively short*It makes it more likely that I’ll write them. and with luck, they’ll be an interesting documentation of my progress.

An Introduction to Words

Words is the exciting new scripting language I am working on*The name has plenty of mileage, you’ll see!. I have been toying with the idea of writing one for a while – mainly because it’s something I’ve never done before and they’re just so useful – especially compared against hard coding everything. That said, there are some facets of the more commonly used scripting languages that I don’t like (perhaps because I am not a rather set in my ways C++ programmer), and if I’m going to go to the effort of making a new language, I’m going to make it appeal to me. With that said, here are the design goals:

  • Words must be a compiled language – there should be no room for typos or errors where types are misused.
  • The language syntax should be of the brace (C-style) family, rather than whitespace-delimited (like Python) or “begin”/”end”/whatever like Lua.
  • The design should be similar to other managed languages, no pointers or direct memory allocation allowed (or, perhaps better, it shouldn’t be needed).
  • It must support vector types and vector maths for games by default.
  • The runtime should be easy to integrate into C++

The first point is a pet peeve of mine  – and this is mostly due to my own carelessness, but I find it really hard to write a moderately sized (e.g.,) Python or Lua (things which I use semi-often) script without making a typo somewhere and not noticing it until it inevitably blows up*Usually just before a long running operation can output results. due to me making a retarded error like misspelling “print” as “pint” or something. The second point is purely personal preference and I’m not going to say that there is a right and a wrong way to do that.*Except, obviously, whitespace delimiting is a horrible idea because people just won’t stick with tabs, the Lord’s indentation of choice. The third point is really about usage – if you’re going to care about this sort of stuff, use C++, not a scripting language. Scripting languages should be friendly enough that it’s easy to get something working, the compromise of such a deal being some additional level of distance from “the metal.”

That said, let’s dive straight in and see what I’ve been up to!

A Foray into Parsing

I’ll come out and say it – I copped out. I never got a formal computer science education, so I missed out on a lot of the stuff that universities love teaching about programming (and languages) – I was far too preoccupied with trying to make convincing explosion effects with some seriously limited knowledge of OpenGL; and one of the things that I never got around to was formal grammars, or exactly how languages are parsed. I started this project trying to read up on the fundamentals of things like Chomsky’s Hierarchy and a discussion of regular vs everything else grammars from a variety of sources, but unfortunately I rapidly lost interested and abandoned ship. I had learnt enough from my dusty tomes on grammars to (I thought, at least), write a fair chunk of my language in BNF, which I could then plug into a parser generator that someone with much greater patience than I had prepared earlier.

A few things happened – one, it turns out that I am terrible at writing formal languages without much ambiguity. I know it’s my fault and not my language because: a) it’s clearly unambiguous and b) if I jiggle all the rules around it works. Second, it seems that most parser generators were written in the late Jurassic era, when “C” was a new, exciting (and possibly hipster), language – since my laptop isn’t an ancient UNIX machine made of vacuum tubes and I don’t have a tape drive attached, getting the damn things to install/run/generate anything remotely sane was an exercise in frustration. Also it boggles my mind that people seemingly never imagine that you don’t have make or autoconf installed – it’d be less galling if they were remotely decent tools.

Anyway, all this heartache was compounded by the fact that I didn’t really want to write my compiler in C++ – C# is my go-to language for anything that isn’t realtime, and it seemed to me that a compiler would need to deal with lots of, well, fiddly bits. By using C# I wouldn’t have to memory manage anything, everything could point to everything else, and there was a real chance I’d be able to autocomplete the whole thing using Intellisense’s suggestions. That settled it, I had to find something for C#.

As chance would have it, some Google searches (“C# parser for people who can’t write unambiguous rules”) pointed me to (like all good things in life) a Stackoverflow post which mentioned the “Hime” parser, which turned out to be exactly what I was looking for. Hime – which is hosted on Codeplex (a fact which allows the more cynical of us to divine a lot about it) – is a fast, efficient, native C# parser which is incredibly easy to set up. I thoroughly recommend it to anyone who is semi-interested in playing around with parsing, in fact. The grammars are specified in a simple but quite expressive form of BNF with alternation and repetition constructs, and it even comes with a bunch of sample languages to learn from, including CSharp4Ansi_C and ECMAScript.

Example Hime Grammar for parsing simple numeric expressions (MathExp.gram)

cf grammar MathExp
{
	options
	{
		Axiom = "exp";
		Separator = "SEPARATOR";
	}
	terminals {
		INTEGER	-> [1-9] [0-9]* | '0' ;
		REAL		-> INTEGER? '.' INTEGER  (('e' | 'E') ('+' | '-')? INTEGER)?
					|  INTEGER ('e' | 'E') ('+' | '-')? INTEGER ;
		NUMBER	-> INTEGER | REAL ;

		WHITE_SPACE	-> 0x0020 | 0x0009 | 0x000B | 0x000C ;
		SEPARATOR	-> WHITE_SPACE+;
	}
	rules {
		exp_atom-> NUMBER {OnNumber}
			| '('exp ')' ;

		exp_op0	-> exp_atom
			|  exp_op0 '*' exp_atom {OnMult}
			|  exp_op0 '/' exp_atom {OnDiv};

		exp_op1	-> exp_op0
			|  exp_op1 '+' exp_op0 {OnPlus}
			|  exp_op1 '-' exp_op0 {OnMinus};

		exp	-> exp_op1 ;
	}
}

It’s not perfect (Hime’s C# implementation parses about 40% of my random .cs files with a few small edits), but it’s a great place to start. Although, looking at what I just pasted up there, I have no idea what the labels in curly braces are – nor can I see any reference to them in the generated AST when I use that grammar. Hime’s internal parsing algorithm is the over-enthusiastically named RNGLALR(1) (“wranglaaaallllr”, is how I’ve been saying it to people), which is apparently some kind of optimized form of a generalized LR parser with lookahead. The upshot of which is that I can bash out some seriously fast and loose grammar rules and the parser just sort of munches away at it without telling me to go away and read a book first.

Also, keen-eyed readers will be able to see that the operator precedence is encoded in the expression tree, not using some kind of mad hax way, which means that, when you put binary, boolean, unary, cast and relational operators into your expression tree you end up with some amazingly deep AST chains, making something as innocuous as this:

// Multiply is literally the best function ever to test with
float Multiply(int a, int b)
{
	return a * b;
}

Expand into this eye-watering monster with the Words grammar as it currently stands:

Himebest

Next time, I’ll talk about my adventure into semantic analysis and converting the generated AST into something useful!

From → Programming, Words

One Comment

Trackbacks & Pingbacks

  1. Words Core Types | Eat/Play/Hate

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: