
Words Core Types

January 20, 2014


Words with Expressions

Welcome to part two of my adventures in the development of Words – this is another retrospective look at what I’ve done so far over the last week or so. I’m pretty sure I’ve subconsciously decided that blogging is ~basically~ documentation, so I can continue my development guilt-free! Today, we’re going to look at some of the simple stuff to do with built-in types, and the kind of simple operators that most people implement when they write languages (before they give up and use a proper one instead). Surprisingly, this was a lot more involved than I had expected, mainly because I wanted to come up with some kind of consistent rules (I don’t want to make another PHP-level catastrophe).

Firstly, I limited the scope – the grammar I intended to support in this initial phase was:

// Variable declarations (exp_local -> type exp_simple_name)
int testInt;

// Variable initializers (exp_assign -> exp_lvalue '=' exp_rvalue)
float initializedValue = 10.0f;

// Simple expressions (+, -, *, /, %) as well as parenthesized expressions
int example = ((10 + 1) * 35) % 6;

// Casting operators
int truncated = (int)(7.0f / 5.0f);

At this point, the individual Words file (*.wds, known henceforth as a page*But of course!) is basically just a series of these statements, and frankly I’m getting a little ahead of myself because those assignments actually didn’t do anything and I was really just looking at the generated code to see if it made any sense, BUT I DIGRESS: because first, we need to talk about types.

Simple Types

I started this project thinking that I would support a few types – integers (obviously), floats (but of course), vector types (this was one of my design goals after all), strings (because it’s a scripting language) and then everything else would be dealt with via my “object” type, which, under the hood would be a heap reference, so it would deal with all the difficult things like classes etc. Words would also support boxing and unboxing of simple types into objects for reasons of it being hugely useful. I was also thinking about the type of bytecode that would be generated, and decided that I didn’t really want to mess around with SSA or things like that, instead opting for generating a stack-based bytecode similar to CIL – after all, if it was good enough for C#, it would be good enough for anybody (an argument I would later keep bringing out).
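To give a flavour of what I mean by “stack-based”, the compound expression from the grammar sample above might lower to something along these lines (the opcodes here are purely illustrative, CIL-style placeholders rather than real Words bytecode):

int example = ((10 + 1) * 35) % 6;
// might become something like:
//   ldc.i4 10      // push 10
//   ldc.i4 1       // push 1
//   add            // 10 + 1 = 11
//   ldc.i4 35      // push 35
//   mul            // 11 * 35 = 385
//   ldc.i4 6       // push 6
//   rem            // 385 % 6 = 1
//   stloc example  // pop the result into the local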

At this point, these were just totally arbitrary decisions (“stack based, sure!”) that didn’t yet have consequences. But I was still struggling with my simple types, mainly because I felt that bytes and long (64-bit) integers were both important – I didn’t want Words to have issues reading from data sources. While an int is usually sufficient for logic, it’s completely plausible that you could be dealing with >2GiB of data in a single file, and trying to read that in a language with no native long integer type would be an exercise in hilarity. Equally, having no way to represent bytes, the building block of file storage, is another big shortcoming. I had already decided that I’d support boolean types as secret integers – the language would treat them as a proper type but under the hood it’d be 32-bit integers all the way down – so I decided to add bytes as another “secretly integer” type.

This, inevitably, was a slippery slope – “bytes” are, in my opinion at least, unsigned (with a value from 0-255), but I had only wanted to support signed integers. Since I was already polluting my type system with a narrow integer of a different sign interpretation, I thought “what the heck” and added shorts, ushorts, sbytes and uints. All these types would be represented in the stack machine as an integer slot. Longs were a whole other problem – not only were they important, but they were also not something I could fit in my beautiful 4-byte stack slot. This was already an issue of some kind (since my “pointer” type would have to work on x64), but it hammered home the fact that I couldn’t just have “ints” and “everything else”. I added longs, unsigned longs, and doubles*Not long doubles though (that is a stupid type), because I felt that floats were probably feeling all left out.

Name (specifier) | Type | Stack Format
Int8 (sbyte, s8) | 8-bit signed integer (-128 to 127) | Integral, 4 bytes
UInt8 (byte, u8) | 8-bit unsigned integer (0 to 255) | Integral, 4 bytes
Int16 (short, s16) | 16-bit signed integer (-32768 to 32767) | Integral, 4 bytes
UInt16 (ushort, u16) | 16-bit unsigned integer (0 to 65535) | Integral, 4 bytes
Int32 (int, s32) | 32-bit signed integer (-2147483648 to 2147483647) | Integral, 4 bytes
UInt32 (uint, u32) | 32-bit unsigned integer (0 to 4294967295) | Integral, 4 bytes
Int64 (long, s64) | 64-bit signed integer (-9223372036854775808 to 9223372036854775807) | Integral, 8 bytes
UInt64 (ulong, u64) | 64-bit unsigned integer (0 to 18446744073709551615, i.e., lots) | Integral, 8 bytes
Float32 (float, f32) | 32-bit floating point value (0 to ±10^38 with about 6 s.f. accuracy) | Floating-point, 4 bytes
Float64 (double, f64) | 64-bit floating point value (0 to ±10^308 with about 15 s.f. accuracy) | Floating-point, 8 bytes
bool | true or false boolean value | Integral, 4 bytes (0 defined as false)
string | A variable-length string of characters | String reference (4 or 8 bytes)
object | A reference to an object on the heap | Heap reference (4 or 8 bytes)

I didn’t get around to dealing with vector types because frankly that’s enough to do in one go, and they come with their own set of fun issues (alignment being chief of them) and need their own solution, which I will discuss when I actually make them. I got a little carried away with name variants (to specify a 32-bit signed integer you could do “int32”, “int”, “Int32” or “s32”), but since they each became terminals in the lexer, I was basically filling my language with surprising keywords. I stripped out the capitalised variants (they are now used internally only) and their lowercase equivalents (“int32” and friends) – so each type (except for bool, string and object) has exactly two declaring forms: a short, explicit one (“s32”, “f32”, “u16”) and a familiar, friendly one (“int”, “float”, “ushort”).
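For example, both members of each of these pairs declare exactly the same thing (a quick illustrative sketch, assuming the keyword set shakes out as described above):

int    counter = 0;       // friendly form
s32    counter2 = 0;      // short, explicit form

float  scale = 1.0f;      // friendly form
f32    scale2 = 1.0f;     // short, explicit form

ushort port = 8080;       // friendly form
u16    port2 = 8080;      // short, explicit form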

You may have also noticed that I have basically remade the integral types of C++*(with the addition of strings, lols) because, ultimately, this has to be a ~pretty fast~ language. Games can afford to spend time mucking around with scripting languages because of the huge benefits they give, but there’s no reason for you to actively make them maliciously eat your cycles. Hence, the types are grounded in what the machines can actually do relatively efficiently – something some languages *cough*Lua*cough* seem to eschew.

Literals and Initializers

It turns out that if you want to do some expression parsing, you need a way of feeding data into the system (no way!), and that means parsing literals – e.g., “10”, “-663.04”, or “false”. Words has to support four classes of literals – boolean literals (true, false and maybe*(just kidding)), integer literals, decimal/real literals, and string literals.

Floating point literals are first – they look like this:

tkRealLiteral	-> '-'? ([0-9]+)? '.' ([0-9]+)  (('e' | 'E') ('+' | '-')? ([0-9]+))? ('f'|'d')?
		|  '-'? [0-9]+ ('e' | 'E') ('+' | '-')? ([0-9]+) ('f'|'d')?
		|  '-'? [0-9]+ '.'? ('f'|'d') ;

// This supports the obvious kinds:
// 10.05f, 1.05e1, -8e10, 2.d

Internally, I parse these into a double-precision floating point number for later use, keeping track of whether it is single or double precision. Traditionally, the suffix ‘f’ is for single precision values, but since doubles are essentially the Devil’s own format, I default to assuming you are trying to specify a single-precision value. I find it deeply irritating that certain C-like languages (looking at you, GLSL) don’t accept ‘f’ suffixes on floats, so I support them because it’s a habit I just can’t break. There’s one exotic form of specifying floats, as a hexadecimal literal – an abomination introduced by C99*A dialect used solely by people who are so bitter about modern languages that they decided to make up a bunch of bizarre changes to solidify their clique, which I have absolutely no desire to support.
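To illustrate how the suffix (or lack of one) shakes out, here’s a quick sketch – illustrative rather than gospel, since the compiler is still in flux:

float  a = 10.05f;   // 'f' suffix: single precision, as tradition dictates
float  b = 1.5;      // no suffix: still single precision by default
double c = 2.5d;     // 'd' suffix: explicitly double precision
double d = 1.5;      // single-precision literal, implicitly widened to a double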

For integer literals, I went for a fairly straightforward setup. Integer literals are specified in three ways, with two optional suffixes. I support normal ones (0-9)+, hexadecimal ones (0x…), and binary ones (because that may actually be legit useful) that can be specified by a leading “0b” followed by a series of ones and zeros (similar to hexadecimal specifiers).

tkIntegerDigits -> [0-9]+;
tkIntegerValue -> '-'? tkIntegerDigits;

tkIntegerLiteral -> tkIntegerValue ('U' | 'L' | 'UL')?;
tkHexadecimalLiteral -> '0' 'x' [a-fA-F0-9]+ ('U' | 'L' | 'UL')?;
tkBinaryLiteral -> '0' 'b' ('0' | '1')+ ('U' | 'L' | 'UL')?;

// These rules support normal things like 10, -5, 0 etc
// As well as hexadecimal values like 0xFACE
// And binary values like 0b10001, which makes you feel like you're talking to the machine directly
// Also no octal values, because I am making this language for people who are not time travelers from the past

I went for a default of specifying a signed integer, although hexadecimal and binary constants are all parsed as unsigned (note that the negative sign is only accepted for “normal” ints, which was some arbitrary rule I came up with). By specifying a suffix, you control the literal type – ‘U’ makes it unsigned, ‘L’ makes it a long integer, and ‘UL’ an unsigned long. All integers are stored in the AST as longs/ulongs to simplify the compiler code.
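A few examples of how that plays out (again, purely illustrative):

int   a = 42;             // no suffix: a plain signed integer
int   b = -5;             // the minus sign is only allowed on "normal" decimal literals
uint  c = 0xFACE;         // hex (and binary) literals parse as unsigned
uint  d = 42U;            // 'U' makes a decimal literal unsigned
long  e = 0x1FFFFFFFFL;   // 'L' makes it a 64-bit signed value
ulong f = 0b1010UL;       // 'UL' makes it a 64-bit unsigned value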

Bools are easy – boolean values can be true or false, no exceptions. There will be rules for converting non-boolean values to bools, but when it comes to specifying literals, you don’t get any choice in the matter. Strings are somewhat more involved, as it’s a little frustrating to specify strings without resorting to escape characters in C++, at least. Until C++11, if you wanted to write a string with newlines or backslashes (e.g., you’re assembling a string to be printed – and escaped – elsewhere), you’d have to go crazy with the backslash party. C++11 introduced raw strings, a truly horrific language “feature” which you can just tell was the result of some late nights in the standards committee where that one guy kept coming up with ever more insane corner cases that they had to deal with. I’ve gone for a simple system, a bit like C#. There are two types of string specifiers – normal ones, which are encased in quote marks (“”) and have the following values escaped (examples after the list):

\n -> newline
\t -> tab
\xH(HHH) -> encodes a character with a value specified from 1 to 4 hexadecimal digits
\0 -> null - this is NOT an octal escape, just a shorthand!
\uHHHH -> encodes a unicode value specified with 4 hexadecimal digits
\UHHHHHHHH -> encodes a unicode value specified with 8 hexadecimal digits
\" -> escapes a quote character
\\ -> escapes a backslash character
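For instance (an illustrative handful, not an exhaustive test suite):

string a = "Line one\nLine two";     // embedded newline
string b = "Column\tseparated";      // embedded tab
string c = "She said \"hi\"";        // escaped quotes
string d = "One backslash: \\";      // escaped backslash
string e = "DEL is \x7F";            // hexadecimal escape (1 to 4 digits)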

Writing string escaping stuff is fairly trivial so I won’t go into it, suffice it to say that doing the CPPGM for a bit made me a dab hand at that sort of thing. I also got rid of all the rest of the stupid escapes like form feed and “alert” (seriously, what?) because I don’t think people really think of using string literals to control the PC speaker anymore. Besides, you can access all that ~weird stuff~ with the hexadecimal specifiers. For strings that are full of backslashes (or span multiple lines), I went for a bastardized version of C# – string literals can be specified with an @ prefix and suffix. So you can have one that has multiple lines or Windows paths without wearing out your backslash key:

@"C:\An\extended\"string"\literal"@

If you want to encode a quote followed by an at symbol in the string, you’ll have to escape the at symbol (e.g., @"Annoying "\@ string"@) – yes, it’s a bit ew, but it’s a pretty unlikely situation. Adjacent string literals (e.g., "hi" " there") are automatically glued together in C++, but I’m not actually convinced that Words should support that, so I’m not adding it.

Casting and Conversions

It turns out that it’s actually relatively involved to come up with some sane rules for casting/conversions. Let’s take literals to start with – we support the following “flavours” of literals (once they’ve been parsed):

  • Boolean literals (true/false)
  • String literals (normal or extended)
  • Integer literals
  • Unsigned integer literals (‘U’ suffix)
  • Long integer literals (‘L’ suffix)
  • Unsigned long integer literals (‘UL’ suffix)
  • Single precision floating point literals
  • Double precision floating point literals (‘d’ suffix)
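In literal form, those flavours look like this (one illustrative example of each):

bool   b = true;      // boolean
string s = "hello";   // string (normal; extended uses the @ form)
int    i = -42;       // integer
uint   u = 42U;       // unsigned integer ('U' suffix)
long   l = 42L;       // long integer ('L' suffix)
ulong  x = 42UL;      // unsigned long integer ('UL' suffix)
float  f = 1.5f;      // single precision (also the no-suffix default)
double d = 2.5d;      // double precision ('d' suffix)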

Conspicuously absent are narrow integer types, such as bytes/shorts etc. These have to be natural to specify. There are also other issues – what if I specify a number greater than the size of an int (e.g., 0xFFFFFFFFF)? What should happen? Strings and bools are fine, and floats are easy – you have to cast a double to a float, but putting a float into a double is fine. Ideally, any operation that isn’t going to destroy your data should be doable without thinking. I’ve come up with the following rules (with a few worked examples after the list):

  • When assigning a literal with no suffix to a narrow integer type, check if the value fits and generate a compile time error if it does not.
  • When assigning a literal with the ‘U’ suffix to a narrow type, the destination must also be unsigned and the value must fit – generate an error if either requirement isn’t met
  • Integer literals with the long suffix (signed or unsigned)
    • If the suffix is ‘L’ the destination must be a long or a compile-time error will be generated
    • If the suffix is ‘UL’, the destination must be a ulong or a compile-time error will be generated
    • In the event that the value is explicitly cast, the literal must be truncated at the casting location. (1)
  • Any integer value can be implicitly converted to a floating point value. Any float value can be implicitly converted to a double.
  • If the integer literal does not have a long suffix but the value is not representable as a 32 bit integer (signed or unsigned), a compile time error will be generated (2)
  • If the integer has a ‘U’ suffix, but the literal starts with a minus sign, a warning should be generated and the type converted to signed
  • If the storage type is unsigned but the literal is negative, an error should be generated
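To make those rules a bit more concrete, here’s roughly how they shake out (a sketch, assuming the compiler enforces them exactly as written above):

byte   a = 200;     // fits in 0-255: size check passes
byte   b = 300;     // compile-time error: doesn't fit in a byte
short  c = -5;      // fits in -32768 to 32767: fine
ushort d = -5;      // error: unsigned storage, negative literal
int    e = -1U;     // warning: 'U' suffix with a minus sign, treated as signed
long   f = 5;       // a plain int literal upcasts to a long happily
long   g = 5UL;     // error: a 'UL' literal must be stored in a ulong
float  h = 10;      // any integer converts implicitly to floating point
double i = 1.5f;    // any float converts implicitly to a double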

Complicated, hey! For my two notes, 1) is because you could get some seriously whacked out stuff happening otherwise (remember all integer literals are stored in the AST as 64-bit values):

// This should be a value of 0x100000000, not 0x200000000
ulong x = (uint)0x1FFFFFFFFUL + 1;

Point 2) is because I don’t think it should be quite ~so~ magic – making your literal extended-width should be a conscious choice. Anyway, I distilled those rules into a delicious table:

Storage \ Initializer Format | Bool | Int | UInt | Long | ULong | Float | Double
bool | Native | Implicit | Implicit | Implicit | Implicit | Error | Error
s8 | Error | Size Check | Error | Error | Error | Error | Error
u8 | Error | Size Check | Size Check | Error | Error | Error | Error
s16 | Error | Size Check | Error | Error | Error | Error | Error
u16 | Error | Size Check | Size Check | Error | Error | Error | Error
s32 | Error | Native | Error | Error | Error | Error | Error
u32 | Error | Sign Check | Native | Error | Error | Error | Error
s64 | Error | Upcast | Upcast | Native | Error | Error | Error
u64 | Error | Sign Check | Upcast | Error | Native | Error | Error
f32 | Error | Upcast | Upcast | Upcast | Upcast | Native | Error
f64 | Error | Upcast | Upcast | Upcast | Upcast | Upcast | Native

All those “implicit” values up in the boolean section of the table are in there because I want to allow the following shorthand:

if (myInteger)
	DoStuff();

Wow. This has been a long post about nothing except literals, which just goes to show what a serious business creating a language is. I haven’t even got to talking about simple numeric operators, so maybe I’ll leave that for another post.

Edit: While making sure the actual compiler did what was expected, I removed the conversion from floating-point types to booleans, because it’s pretty dangerous anyway and unlikely to be what you want!

