
Programmers, virtual functions and interviews

January 1, 2011

Google image search can be quite unexpected sometimes.

Programmers!

Who needs them anyway? Well, it turns out that we in the games industry can sometimes find a use for them. One of my duties (perhaps surprisingly, especially to those who have met me) is to interview prospective games programmers and disabuse them of any illusions they have about the job.

In my experience, applicants tend to come in one of the following flavours:

  • computer science students (making up about 90%), who can’t program their way out of a singly-linked cardboard box. They have just graduated from university (or sometimes, inexplicably, have been working as programmers for several years). They have heard the term “algorithm” perhaps once, and almost all of them have come out of a Computer Science course with obscene marks (furthering my belief that CS courses at uni are taught predominantly by lunatics);
  • software engineers, people for whom boost and the STL are the second coming of Jesus, people who seriously believe that someone has a use for a hash_multidequeue. They are mercifully rare and generally don’t last long because the games industry doesn’t fit their ideas of a well run system;
  • finally, we have the dungeon masters, people who dissolve caramel sweets into orange juice – while lecturing their senior programmers about the length of the integer ALU on the Motorola 68060, and how we should really be “thinking more about pipelining”. While the first group of applicants invariably becomes a group that doesn’t understand how much better other programmers are, this group becomes the set which doesn’t understand how much worse they personally are. Obviously I am part of this group.

Occasionally, you get a hire-able person from any of the three above, or, even more rarely, you find someone actually worth their salt (yes, we still pay people in salt in the games industry; times are hard). However, I am not here to talk about any of them – that was one of my tangents. I am here to talk about the people who read Gamedev or Gamasutra and believe everything in it – to the point where they repeat tired old maxims perpetuated by the generation of programmers that refuses to use C++ because “all that OO stuff makes it slow”.

So, we lurch inexorably towards the point of this post. Virtual functions. I have told all of my team to get over it and use them where it makes sense (which is quite a common occurrence), but I keep hearing people in interviews telling me that virtual functions are slow. So, like the scientist I am, I decided to find out once and for all what the deal truly is.

Let’s get down to business: why are virtual functions slow? Apart from the obvious “they aren’t”, theoretically they are slow because instead of the compiler emitting something like this:

mov ecx, object_address
call some_address

The compiler has to generate code to find out what function to call:

mov ecx, object_address
mov eax, [ecx]
mov eax, [eax + vtable_index]
call eax

The instructions themselves are not particularly expensive, the “slow” part (at least, according to received knowledge) is the potential cache miss caused by locating the real function address from the vtable (stored in the first few bytes of the object). Now, back in the days of the 486, this could be a problem – the extra size of the vtable, the slowness of memory access and the small CPU cache added up to a measurable performance impact. But what about now? My phone has about 100x the power of the first PC I ever owned (and I’m not including the ancient BBC Micro I inherited) – surely things have changed.

So I constructed a test: a base class, a derived class, and two functions – one virtual, one not. The functions do some trivial maths on an argument and access a member variable (we have to do some actual work or the optimizer will simply remove the whole thing). They all follow this pattern:

int Test(int i)
{
    return m_Test * (i - 1);
}

I am using MSVC 2010 in stock release mode (minus link-time code generation, you’ll see why eventually). The functions are being timed by the high performance counter (RDTSC with multicore fixes via QueryPerformanceCounter) and the operation is being run 10,000,000 times. Here are the results:

  • Virtual function: 0.063s
  • Normal function: 0.018s

Holy balls! The virtual function is 3x slower! Because it’s impossible for me to be wrong, I looked at the generated assembly code to find out what was really going on.

// Virtual function
mov	edx, DWORD PTR [edi]
mov	eax, DWORD PTR [edx]
push	esi
mov	ecx, edi
call	eax
// Normal function
add	DWORD PTR _value$[ebp], eax
add	DWORD PTR $T68661[ebp], edx
add	DWORD PTR $T68662[ebp], esi
add	DWORD PTR $T68663[ebp], edi
add	edx, ecx
add	eax, ecx
add	esi, ecx
add	edi, ecx

While the first bit of code is plainly a function call, the second one has been inlined to buggery. Inlining is an optimization which basically copies and pastes the function body into the caller’s code, avoiding the (apparently costly) call instruction and the function prologue and epilogue (which usually consist of a few stack push and pop instructions).

I modified the code to add a control function call (the Win32 API call GetTickCount), and a non-virtual call to a function which was defined in another module (rather than just the header). By putting code in another module, we’re denying the compiler the opportunity to inline the code (at least, when link-time code generation is turned off). I’ve left the virtual function in the header, because it is clearly not being inlined. These were the results:

Method Type                   Time (seconds)
Control                       0.075
Virtual Function              0.063
Normal Function               0.018
Non-inlined Normal Function   0.047

Suddenly, the gap has shrunk to roughly 25%. Still significant, but a lot better than 70%. But… there’s more. In this case, the function is being called via a member variable of the testing class. The virtual method call is, as far as I am concerned, not strictly required – the type of the object is concrete at the call site, but the optimizer doesn’t appear to realize this. However, things change if we call the method through a local variable rather than a member variable. The optimizer, recognizing that the type is known, devirtualizes and inlines the function – at 0.005s, somehow over 3x faster than the normal inlined function. The asm is totally different (no idea why; optimizers are unpredictable), which explains the discrepancy.

As is often the problem with benchmarks, they don’t represent real-world situations. Recognizing this, I’ve changed the function payload (and the control) to look more like a real function than a getter, making them like this:

int Test(int i)
{
    return m_Test * (i - 1) * GetTickCount();
}

So, how does this fare?

Method Type                   Time (seconds)
Control                       0.088
Virtual Function              0.095
Normal Function               0.077
Non-inlined Normal Function   0.095
Concrete Virtual (inlined)    0.077

What a surprise. Suddenly the whole virtual/nonvirtual thing gets blown out of the water. It turns out, unsurprisingly, that when you have a function that actually does work (or, for example, calls another function), the virtual call overhead just melts away into the background. I admit, virtual calls cost slightly more than nonvirtual calls, especially if the function itself is very cheap. But more importantly, inlining is really awesome and the compiler should do it whenever it can; unfortunately, the optimizer isn’t that great at recognizing when a type is concrete.

There are a few extra considerations about inlining. First, LTCG isn’t a silver bullet: my non-inlined function didn’t get inlined when I turned on link-time code generation until I moved the definition into the same module as the call site. Second, library calls defined in headers and .lib files don’t get inlined properly – in fact, the only function that seemed to be inlined when I moved the definitions to a static library was the concrete virtual one (which makes no sense, especially since the library didn’t even export the other functions).

But it doesn’t end there! What about C#? It’s always been my belief that C# is what happens when someone who actually has a vague idea of what they’re doing decides to write a programming language. Remembering one of the oddities of C# (that all method calls are compiled down to the MSIL instruction callvirt) I repeated the tests, expecting it to do pretty well. It turns out that the JIT optimizer, while formidable, is even less effective at recognizing concrete types – all virtual functions were called through a vcall, even for locally defined objects. Interestingly, if the function is not marked virtual (or it is a struct member), the JIT compiler will inline it. And it will do an amazing job, too – it can inline any method from any class – even one in a different DLL!

Time for a verdict: virtual calls are marginally slower than nonvirtual calls, but any real performance difference probably comes from inlining rather than the virtual call overhead. Conclusion: get over it and stop mentioning it in interviews. Alright, I’ll put in a caveat – x86 (and x64) are both very forgiving architectures, and vcalls cost more on more rubbish platforms – but, they’re necessary constructs for well designed code, and you can take them from my cold, dead hands.

4 Comments
  1. LongTimeFan permalink

    Hello. You didn’t actually measure the cache hit, which is what all the cool kids are talking about these days.

    My test: create N classes, an instance of each class, call the virtual method on each instance in a tight loop (so, sort of simulating an update method called on an array of game objects). With non-virtual calls, each method call is ~1ns; with 12 classes, virtual calls are between 15 and 30ns.

    There should be two methods in the vtable (destructor and my test method), which is only 8 bytes … hardly enough to fill the cache with 12 vtables I would’ve thought, so they must have terrible locality (or alignment requirements).

    (MacBook Pro gcc -O3). Inlining in my case made little difference, either because my test is different, or gcc, or different hardware …

    • Hah, sorry man but all that proves is that GCC is a piece of junk. I just tried that exact test (12 classes, 128 instances, 100,000 calls to the whole array) with a cheap payload (subtract, then multiply by a member), all the objects allocated on the heap via operator new().

      No difference. In fact, when I ran it multiple times to try to coax a difference out of it, the vcall was faster (because the optimizer didn’t hoist out a constant register in the first loop, which happened to be the control). Secondly, I think that’s a bogus test – there’s almost no way to design a program like that without having a huge number of homogeneous object arrays (ew), or stupid classes that do everything. In that case, even if the virtual call was significantly slower, I don’t see any decent alternatives.

      Finally, you’ll be pleased to know that the 12 vtables were stored in contiguous memory. Which makes sense, because they are loaded from the program’s data section in the exe. The real question, as is so often the case at work, is “what utter lunatics wrote gcc?”

    • Actually, cocks, I made a mistake with the test – the vcall was marginally slower (same kind of ratio as before 0.07s vs 0.06s). That should teach me to always look at the assembly =)

      Also, LTCG worked when I did the test – I had to turn it off or it inlined a method across two modules (I really wish I knew what made it work in this particular instance…), with inlining, the nonvirtual call took 0.02s. Typical.

      Edit: Also the vtable size is 12 bytes as it includes the RTTI info.

      Edit 2: If you make the functions themselves longer, it seems that you hit a point where the cache miss is on the function level rather than the vtable level. The performance plummets. Moral of the story, cache misses upset things, vtables don’t upset the cache that much, inlining is great, and optimizers are erratic.

      Edit 3: The linker can tell when functions are the same even when they have different symbols!

  2. It’s important to remember that if you need virtual functions you can’t compare to a non-virtual route. You would need to compare to calling via a function pointer. That is, you only pay for the virtual overhead if you are actually using virtual, which you may not need.

    Your last point is more important though, that the cost virtually disappears when compared to the actual workload of the function. I do a lot of low-latency programming and still use virtual functions. Again though, I only use virtual in places where it is needed and thus a function pointer approach would still cost the same.

    For those interested, check out my article about how virtual functions work: http://mortoray.wordpress.com/2011/08/09/how_polymorphism_works_part1/
