Lowlevel Optimizations

I'm currently doing memory optimizations in Drakensang, and together with the new ideas from the asynchronous rendering code in N3 I'm going to do a few low-level optimizations in the memory subsystem over the next time in Nebula3. Here's what I'm planning to do:

Add a thread-safe flag to the Heap class, currently a heap is always thread-safe, but there will be quite a few cases now where it makes sense to not have the additional thread-safety-overhead in the allocation routines.
Add some useful higher-level allocators:

FixedSizeAllocator: This optimizes the allocation of many small same-size objects from the heap, it will pre-allocate big pages, and manage the memory within the pages itself. The main-advantage comes from the fact that all blocks in the page are guaranteed to be the same size.
BucketAllocator: This is a general allocator which holds a number of buckets for e.g. 16, 32, 48, ...256 byte blocks (the buckets are just normal FixedSizeAllocators). Small allocations can be satisified from the buckets, larger allocation go directly through the heap as usual.

Overwrite the new-operator in all RefCounted-derived classes to use the a BucketAllocator (however, I'll actually do some profiling whether this is actually faster then Windows' Low Fragmentation Heaps). This stuff will be behind the scenes in the DeclareClass()/ImplementClass()-macros, so there are no changes necessary to the class source code.

The biggest new feature (which depends on all of the above) is that I want to split RefCounted classes into thread-safe and thread-local classes. The idea is that a thread-local class promises that creation, manipulation and destruction of its objects happens from the same thread. A thread-local class can do a few important things with less overhead:

thread-local classes would create their instances from a thread-local BucketAllocator which doesn't have to be thread-safe
thread-local classes could use normal increment and decrement operations for their refcounting instead of Interlocked::Increment() and Interlocked::Decrement(). Since every Ptr<> assignment changes the refcount of an object this may add up quite a bit.

By far most classes in Nebula3 can be thread-local, only message objects which are used for thread-communication, and some low-level classes need to be thread-safe. I'm planning to enforce thread-locallity in Debug mode by storing a pointer to the local thread-name in each instance, and checking in AddRef, Release and other RefCounted methods whether the current thread-context is the same (that's a simple pointer-comparision, and it happens only in Debug mode).

The general strategy is to get away as far as possible from the general Alloc()/Free() calls, and to make memory management more "context sensitive" and also more static (in the sense that the memory layout doesn't change wildly over the runtime of the game). It's extremely important for a game application to set aside fixed amounts of memory right from the beginning of the project (and to let the game fail hard if the limits are violated), otherwise everything will grow without control, and towards the end of the project much time must be invested to cut everything back to reasonable limits.