
The Humble Spin Lock

October 18, 2013

Today, we’ll be looking at one of my favourite implements in the toolbox of programming – the spin lock. These have a bit of a bad reputation, as the term often refers to less cultured variants of the structure, but I’m here to tell you when, why and how to use them.

Multithreaded codebases will inevitably suffer from contention at some point – and traditionally, we turn to kernel primitives to serialize the threads that are fighting over a resource. The main primitive is the mutex, or mutual-exclusion structure. It takes the form of a “token” or object (generally called ‘the’ mutex for convenience) which can only be held by one thread at a time. When a second thread asks to hold the mutex, the operating system suspends it and will not reschedule it for execution until the mutex has been released. Because of this connection to the scheduler, mutexes are owned and maintained by the underlying operating system, and can be used to build up many other useful structures.

The Devil in the Details

The biggest issue with a mutex has to do with its “weight” – mutex ownership is acquired, released and transferred via methods that communicate with the operating system kernel. At the very least, this precludes any form of inlining (guaranteeing a function call), and in most systems, it will introduce a very slow operation called a mode switch. You see, your program will generally be executing in a virtualized environment (on x86, the CPU has several “rings” of protection/virtualization, and most client programs execute in so-called ring 3) – your view of memory is not what another program will see, your view of the CPU is always from the perspective of client code (no restoring of register blocks), and an interrupt handler is something that someone else deals with – this is necessary so multiple applications can exist side-by-side without issues. The kernel, on the other hand, executes in a mode where it knows, for example, that the memory is laid out just so, and it handles hardware interrupts, paging data in and out of RAM, and scheduling processes. The kernel runs with an accurate view of the underlying hardware in the system – at a “higher” level (in x86 Windows, it is in ring 0 – rings 1 and 2 exist as well but aren’t used in Windows). This is why when an application crashes, you get a dialogue box, but when a driver crashes, you get a blue-screen.*

So, back to locks. Calling a kernel method involves a transition called a mode switch, and these transitions are slow – on the order of thousands of clock cycles. Mode switches are more of an issue than you might expect – they were part of the reason that DirectX 9 was so slow (compared to OpenGL at the time, with the same hardware/data/methods), and they are such an issue with locks that Windows ships with something called a critical section: a multi-level lock which, when there is no contention, doesn’t require a mode switch to acquire. This is where spin locks come into their own – they exist entirely in user code and don’t communicate with the kernel at all. This means there is no mode switch, there is a high chance of inlining, and they don’t consume any special kernel resources.

They’re not a magic bullet, however – and this is where their reputation comes from – because spin locks manage to be fast and horribly inefficient. You see, when a resource is busy and another thread comes along to acquire a mutex, the thread gets suspended – the OS stops executing it and will not try running it again until the mutex is transferred. A spin lock, however, just keeps on trying (“spinning”) until it acquires the lock. In the era of portable devices, carefully tuned thermals and improved battery life, this sort of busy work is not a great idea. However, if the contention is likely to be over quickly, or the contention is reasonably unlikely, spin locks can be a very good structure to use – they also, because of their lightweight nature, allow you to have much more fine-grained locks over your code.

* With virtualization, ring-0 has actually become in itself another protection level (since it’s not representative of the real hardware), so the “hypervisor” that runs the VM is actually executing at an even higher level, called “ring -1”

Stop Trying to Sell it and Show Me the Code!

The simplest spin lock looks like this:

bool g_Busy;

void ProtectedFunction()
{
	while (g_Busy)
	{
		// do nothing (this is the spinning bit)
	}
	g_Busy = true;

	// Do some stuff here

	g_Busy = false;
}

This spin lock illustrates the general concept, but it is one of those amazing examples (common in computer science) that won’t actually work in practice and is full of horrible, horrible issues. Nevertheless, the concept is there – we have a variable that encodes whether the lock is in use or available, and we keep checking it until it opens up. Then we mark the lock as used, do our work, and mark it as available before returning. We will start by putting the state into a class, so a single “lock” object can be used wherever it is applicable. While we’re at it, we should also create an RAII scoped-lock structure, because it is a useful construct in most cases.

class SpinLock
{
protected:
	bool m_Busy;

public:
	SpinLock() : m_Busy(false) { }

	void Lock()
	{
		while (m_Busy)
			;
		m_Busy = true;
	}

	bool TryLock()
	{
		if (!m_Busy)
		{
			m_Busy = true;
			return true;
		}
		return false;
	}

	void Unlock()
	{
		m_Busy = false;
	}
};

class ScopedLock
{
protected:
	SpinLock& m_Lock;

public:
	ScopedLock(SpinLock& lock) : m_Lock(lock)
	{
		m_Lock.Lock();
	}
	~ScopedLock()
	{
		m_Lock.Unlock();
	}
};

Now that we have wrapped the broken lock logic into a class, we can go about fixing it up. First, we have to fix the glaring bug in this code. If two threads both try to acquire the lock at the same time, they will both test whether the lock is busy (“no”), happily continue to the next part of the locking code, mark it as “in use” (twice), and finally return from the Lock method. The protected area is now all kinds of not protected, as both threads are executing it concurrently.

Enter the CAS

Luckily, CPU manufacturers foresaw this sort of issue, and developed instructions to solve it. On x86/x64, this operation is called Compare-and-Swap or, in Windows-specific parlance, InterlockedCompareExchange. In a single atomic step, we can check a value and, if it is what we expect, set it to a new value – this fixes the problem because we can guarantee that the thread that saw the lock as free really is the one that marks it as “in use.” In our example above, one of the threads will discover that its compare-and-swap failed, and will return to spinning. First, we need to change the status variable to an integer (because that is the format the CPU instruction prefers), and then change the rest of the class to look like this:

class SpinLock
{
protected:
	enum
	{
		Available,
		Busy
	};
	int m_Status;

public:
	SpinLock() : m_Status(Available) { }

	void Lock()
	{
		// ICE returns the "old" value, so we check against that
		while (InterlockedCompareExchange(
			(LONG*)&m_Status, Busy, Available)
			!= Available)
		;
	}

	bool TryLock()
	{
		if (InterlockedCompareExchange((LONG*)&m_Status, Busy, Available) != Available)
			return false;
		return true;
	}

	void Unlock()
	{
		m_Status = Available;
	}
};

ICE returns the original value – so if it was what we expected (‘Available’/0), we know we changed it to ‘Busy’ in one step and gained ownership of the lock. ICE, unfortunately, is platform- and architecture-specific, so what works above on Windows isn’t going to work on Linux or ARM. I’m going to keep going with Windows code for these examples, but for reference, this is an ARM version of ICE:

// ARM doesn't return the value in the stored pointer, only whether it worked.
bool ARMCAS(int* dest, int exchange, int compare)
{
	// ARM uses something called "load-exclusive" and "store-exclusive".
	// These fix something called the ABA problem (which we don't care about
	// at the moment), but may cause false 'negatives' when trying to do a CAS.
	// Also note how horrible GCC asm is.
	int result = 1; // assume failure; strexeq writes 0 on success
	asm volatile (
		"ldrex		r1, [%1]	\n\t"
		"cmp		r1, %2		\n\t"
		"strexeq	%0, %3, [%1]	\n\t"
		: "+&r" (result)
		: "r" (dest), "r" (compare), "r" (exchange)
		: "r1", "cc", "memory"
		);
	return result == 0;
}

// GCC - use __sync_val_compare_and_swap, note the reversed last arguments!!!
int __sync_val_compare_and_swap(int* ptr, int compare, int exchange);

// Bonus: on MSVC, the leading-underscore version is a compiler intrinsic
// (no function call), available on both x86 and x64
LONG _InterlockedCompareExchange(LONG volatile* value, LONG exchange, LONG compare);

Re-entrancy, Barriers and Busy Waits

We have a (mostly – ARM is still trouble) working spin lock now, but an issue which is likely to be a big concern is the lack of re-entrancy. If a thread owns the lock, and then attempts to lock it again, it will sit there, spinning in vain until you terminate the process. This happens fairly regularly, and most OSes provide a ‘re-entrant mutex’ (such as the critical section in Windows). We should be able to modify our lock to support re-entrancy by keeping a “lock count” around, although, instead of just a value for whether the lock is busy or not, we will need a defined owner, so we can identify if we are re-entering a lock already held by the thread. Luckily, again, the platform authors have considered this, and we can usually identify threads through an integer ID – in Windows, we can get this using the method GetCurrentThreadId, and in pthread-based systems (iOS/Android/Linux), we can use the method pthread_self.

The code will need to be altered slightly – we will need to test if we own the lock, if so, we will increment a value that identifies the recursion count. If we do not, we will use the standard locking method. When unlocking, we will decrease the recursion count – and if it goes to zero, we need to relinquish the lock entirely.

class SpinLock
{
protected:
	int m_Owner;
	int m_Recursion;

public:
	SpinLock() : m_Owner(-1), m_Recursion(0) { }

	void Lock()
	{
		int id = GetCurrentThreadId();
		if (m_Owner == id)
		{
			m_Recursion++;
			return;
		}
		while (InterlockedCompareExchange((LONG*)&m_Owner, id, -1) != -1)
			;
		m_Recursion = 1;
	}

	bool TryLock()
	{
		int id = GetCurrentThreadId();
		if (m_Owner == id)
		{
			m_Recursion++;
			return true;
		}
		if (InterlockedCompareExchange((LONG*)&m_Owner, id, -1) != -1)
			return false;
		m_Recursion = 1;
		return true;
	}

	void Unlock()
	{
		if (--m_Recursion == 0)
		{
			MemoryBarrier();
			m_Owner = -1;
		}
	}
};

Astute readers will see that I have snuck a little bonus into the code snippet above – a call to the (not yet defined) function MemoryBarrier. This is here because we are potentially in a little bit of hot water when we’re dealing with real CPUs – while we are safe when dealing only with the owner variable (it is always set via CAS), the recursion variable has no special protection, and on some (many!) CPUs its write could become visible after the other thread has taken the lock, which would completely break the re-entrancy count (specifically, imagine the new owner calling Unlock while the count still reads 0). As usual, this is implemented differently on different platforms:

// GCC on x86 (SSE2 intrinsic; requires <emmintrin.h>)
#define MemoryBarrier() _mm_mfence()

// Windows – note that _ReadWriteBarrier is a compiler barrier only,
// which is enough here thanks to x86's strong memory model
#define MemoryBarrier() _ReadWriteBarrier()

// ARM
#define MemoryBarrier() asm volatile ("dmb" ::: "memory")

But wait! There’s one more thing we can do – we can make these spin locks a little less offensive to our delicate sensibilities by telling the CPU that it is in a busy-loop, thus (hopefully) reducing the power draw. Simply change the spin part (the empty while body) to include the following:

// GCC on x86 ("pause" is an x86 instruction, so this isn't truly general)
#define YieldCPU() __asm__ volatile ("pause" ::: "memory")

// Windows
#define YieldCPU() YieldProcessor()

// ARM
#define YieldCPU() asm volatile ("yield" ::: "memory")

Spinning All Day Long

Spin locks are fantastic and you should not be afraid to use them when you have a lightweight function that needs protection. Hopefully you have got a good basis to write your own well behaved lock now, and have also learned just how hideously ugly GCC style assembler code is. Obviously a spin lock isn’t necessarily the best lock for all situations, but you may be surprised how much mileage you can get out of multithreaded code with some judicious sprinkling of this structure.

Finally, I leave you with a SUPER ULTRA FAST way of getting the thread ID in Windows, which should make your spinlock class even more awesome.

#if defined(WIN32) && defined(_M_IX86)
	#define GetThreadIDFast __readfsdword(0x24)
#elif defined(WIN32) && defined(_M_X64)
	#define GetThreadIDFast (*(DWORD*)(((BYTE*)__readgsqword(0x30)) + 0x48))
#else
	#define GetThreadIDFast GetCurrentThreadId()
#endif