XNAMath
With all the focus on the console platforms I didn’t notice one very cool addition to the March DirectX SDK: XNAMath. This is basically the traditional Xbox360 vector math lib, ported to the PC with SSE2 and inlining support. The N3 math classes are now running from the same code base on top of XNAMath for the PC and Xbox360 platforms. Maik has spent a few days to analyze the generated code and after some tweaking the improvements for our simple math benchmarks are absolutely dramatic, up to 4x faster on the PC side!
We had to change our memory allocation routines on the PC to always return 16-byte aligned memory, without this, XNAMath isn’t really useful since the aligned load/store functions can’t be used on vectors residing in heap buffers. Really strange that there isn’t a way to do this through the Win32 heap functions directly (or is there?).
Other then that I’m currently deep into “jobifiying” the render thread, in order to free the PS3-PPU from the mundane number-crunching tasks. Properly jobified code will also “automatically” run about 2x faster on a 2-core PC, and about 3..4x faster on the Xbox360, since even single jobs will be split and processing will be distributed to worker threads. The actual speedup may even be higher, since the data must be re-organized into small independent chunks (“slices”) of about 16..32 kByte each in order to make the best use of the SPU local memory, and this improved spatial locality is also extremely beneficial for CPU caches on the other platforms (I think I’m starting to sound like a record, but I can’t stress enough how good this data-reorganization will be for N3 on ALL platforms :)