A Radeon Fix and More

The Nebula3/emscripten demos (http://n3emscripten.appspot.com) had a serious performance problem on Macs with Radeon GPUs in the instancing demos. The problem was that my pseudo-instancing code used an additional vertex buffer with 1-dimensional ubyte vertex components as fake InstanceIds. This worked fine on nVidia and Intel GPUs, but triggered a horrible slow-path in the OSX Radeon driver. After replacing this with ubyte4 components everything worked fine on Radeons, but I wasn't happy that the InstanceId buffer would now be 4 times as large, with 3/4 of the size dead weight. Then today on the train from Hamburg back to Berlin the embarrassingly obvious solution occurred to me: stash the InstanceId in the unused w-component of the vertex normals. These are in packed ubyte4 format, with the last byte unused. And with this simple fix I could get rid of the second vertex buffer completely and actually throw away most of the pseudo-instancing code. Win-Win!
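To make the w-component trick concrete, here is a minimal sketch of how an instance id could be stashed in the fourth byte of a packed ubyte4 normal. This is not the actual Nebula3 code: the names PackedNormal, PackUnit, PackNormalWithInstanceId and ReplicateForInstances are made up for illustration, and the [-1, 1] to [0, 255] mapping is an assumption about the packing scheme.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical packed normal: x, y, z are mapped from [-1, 1] to [0, 255],
// the otherwise unused fourth byte (w) carries the fake instance id.
struct PackedNormal {
    std::uint8_t x, y, z, w;
};
static_assert(sizeof(PackedNormal) == 4, "must stay a 4-byte vertex component");

// Assumed packing: clamp to [-1, 1], then scale and bias into one byte.
static std::uint8_t PackUnit(float v) {
    const float clamped = std::min(1.0f, std::max(-1.0f, v));
    return static_cast<std::uint8_t>(std::lround((clamped * 0.5f + 0.5f) * 255.0f));
}

PackedNormal PackNormalWithInstanceId(float nx, float ny, float nz, std::uint8_t instanceId) {
    return PackedNormal{ PackUnit(nx), PackUnit(ny), PackUnit(nz), instanceId };
}

// Pseudo-instancing duplicates the mesh once per instance; each copy gets
// its own id in w. A single byte allows at most 256 instances per batch.
std::vector<PackedNormal> ReplicateForInstances(const std::vector<PackedNormal>& src, int numInstances) {
    assert(numInstances >= 0 && numInstances <= 256);
    std::vector<PackedNormal> out;
    out.reserve(src.size() * static_cast<std::size_t>(numInstances));
    for (int i = 0; i < numInstances; ++i) {
        for (PackedNormal n : src) {
            n.w = static_cast<std::uint8_t>(i);
            out.push_back(n);
        }
    }
    return out;
}
```

On the shader side the w value would then be used as the lookup index for the per-instance data, exactly as the separate InstanceId stream was used before.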

And now on to the actual issue: I didn't really pay attention to the code path which is used if the GL vertex array object extension isn't available, and I was shocked when I discovered that the dsomapviewer demo performs 7000 GL calls per frame (not draw-calls, but all types of GL calls). Then I was astonished that Javascript+WebGL crunches through those 7k calls without a problem even on my puny laptop. But something had to be done about that of course.

OpenGL / WebGL without extensions is very verbose even compared to Direct3D9. To prepare the geometry for rendering, you need to bind a vertex buffer (or several), bind an index buffer, and for each vertex component call glEnableVertexAttribArray() and glVertexAttribPointer(), aaaand each unused vertex attribute must be disabled with glDisableVertexAttribArray(). Depending on the max number of vertex attributes supported in the engine, this can add up to dozens of calls just to switch geometry. And whenever a different vertex buffer is bound, at least the glVertexAttribPointer() functions must be called again, and if the vertex specification has changed, vertex attribute arrays must be enabled or disabled accordingly.
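To put a rough number on "dozens of calls", the following back-of-the-envelope sketch counts the calls for one unfiltered geometry switch. The function name and the example numbers (16 attribute slots, 4 of them used) are assumptions for illustration, not measurements from the demos.

```cpp
#include <cassert>

// Counts the GL calls needed for one geometry switch without vertex array
// objects and without any redundancy filtering (hypothetical helper).
int CountGeometrySwitchCalls(int numVertexBuffers, int usedAttribs, int maxAttribs) {
    assert(numVertexBuffers >= 1);
    assert(usedAttribs >= 0 && usedAttribs <= maxAttribs);
    const int bufferBinds = numVertexBuffers + 1;     // vertex buffer(s) + index buffer
    const int enableAndPointer = 2 * usedAttribs;     // glEnableVertexAttribArray + glVertexAttribPointer
    const int disables = maxAttribs - usedAttribs;    // glDisableVertexAttribArray for unused slots
    return bufferBinds + enableAndPointer + disables;
}
```

With one vertex buffer, 4 used attributes and 16 attribute slots this comes to 2 + 8 + 12 = 22 calls before the draw call is even issued.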

With the vertex array object extension all of this can be combined into a single call.

This particular part of defining the vertex layout is by far the least elegant area of the OpenGL spec, and even the vertex array object stuff could be nicer. To me it doesn't make a lot of sense to include the buffer binding in the vertex attribute state; keeping the buffer separate from the vertex layout would make more sense IMHO. But enough with the ranting.

Other high-frequency calls are the glUniformXXX() functions to update shader variables, and the whole process of assigning textures to shaders. Un-extended WebGL doesn't provide functions to bundle these static shader updates into some sort of buffers.

These types of high-frequency calls are exactly what we don't want in Javascript and WebGL. In a native OpenGL app, these calls are usually extremely cheap, so it doesn't matter that much. But when calling a WebGL function from emscripten, there's quite a lot of overhead (at least compared to a native GL app). First, emscripten maintains some lookup tables to associate numeric GL ids with Javascript objects. Then the WebGL JS functions are called. In Chrome, these calls are serialized into a command buffer which is transferred to another process. In this GPU process the commands are unpacked and validated, and the actual GL function is called. But it doesn't end there. On Windows, the ANGLE wrapper translates the OpenGL calls to Direct3D9 calls. So what's an extremely cheap GL call in a native app comes with some serious overhead in a WebGL app. Considering all this it is really mind-blowing that WebGL is still so fast!

All this means though, that it really makes a lot of sense to filter redundant GL calls, especially in a WebGL application, and every GL extension which helps to reduce the number of API calls is many times more valuable under WebGL!
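A low-level redundancy filter of this kind is basically a shadow copy of the GL state that is checked before each call. The sketch below shows the idea under some assumptions: FakeGL is a stand-in that only counts calls (so the example runs without a GL context), StateCache and kMaxAttribs are made-up names, and the real engine code may look quite different.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kMaxAttribs = 16;   // assumed engine-side attribute limit

// Stand-in for the real GL entry points; only counts calls that get through.
struct FakeGL {
    int callCount = 0;
    void UseProgram(std::uint32_t) { ++callCount; }
    void BindArrayBuffer(std::uint32_t) { ++callCount; }
    void EnableVertexAttribArray(int) { ++callCount; }
    void DisableVertexAttribArray(int) { ++callCount; }
};

// Shadow state, initialized to the GL defaults: no program, no buffer
// bound, all vertex attribute arrays disabled.
class StateCache {
public:
    explicit StateCache(FakeGL& backend) : gl(backend) { attribEnabled.fill(false); }

    void UseProgram(std::uint32_t program) {
        if (program != curProgram) { gl.UseProgram(program); curProgram = program; }
    }
    void BindArrayBuffer(std::uint32_t buffer) {
        if (buffer != curArrayBuffer) { gl.BindArrayBuffer(buffer); curArrayBuffer = buffer; }
    }
    void EnableVertexAttribArray(int index) {
        assert(index >= 0 && index < kMaxAttribs);
        if (!attribEnabled[index]) { gl.EnableVertexAttribArray(index); attribEnabled[index] = true; }
    }
    void DisableVertexAttribArray(int index) {
        assert(index >= 0 && index < kMaxAttribs);
        if (attribEnabled[index]) { gl.DisableVertexAttribArray(index); attribEnabled[index] = false; }
    }

private:
    FakeGL& gl;
    std::uint32_t curProgram = 0;
    std::uint32_t curArrayBuffer = 0;
    std::array<bool, kMaxAttribs> attribEnabled;
};
```

The one thing to watch out for is that the shadow state must never go out of sync with the real GL state, for instance when a currently bound buffer is deleted, or when some code path calls GL directly and bypasses the cache.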

So my mission on the train from Berlin to Hamburg and back today was to filter out those redundant GL calls.

First I wanted to know what calls are actually the problem. The OSX OpenGL Profiler tool can help with this. It records a trace of all OpenGL calls, can create a quick stat of the most-called functions, and the sequence of calls with their arguments reveals which calls suffer most from redundancy.

In the dsomapviewer demo these are: glEnableVertexAttribArray(), glDisableVertexAttribArray(), glBindBuffer() and glUseProgram().

Apart from filtering those lowlevel calls I also implemented a separate high-level filter which skips complete mesh assignment operations (that whole call sequence of buffer bindings and vertex attribute specification I talked about before).
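The high-level filter can be pictured as a single "which mesh was applied last" check in front of the whole binding sequence. The following is only a sketch with hypothetical names (Mesh, MeshBinder, Apply, Invalidate); instead of calling GL it returns the number of low-level calls it would have issued, so it runs standalone.

```cpp
#include <cassert>

// Hypothetical mesh description: a unique id plus the number of vertex
// attributes that must be specified when the mesh is applied.
struct Mesh {
    int id;
    int numAttribs;
};

class MeshBinder {
public:
    // Applies a mesh and returns how many low-level calls were issued.
    // If the same mesh is already applied, the whole sequence is skipped.
    int Apply(const Mesh& mesh) {
        assert(mesh.id >= 0);
        if (mesh.id == currentMeshId) {
            return 0;
        }
        currentMeshId = mesh.id;
        const int bufferBinds = 2;                    // one vertex buffer + one index buffer
        const int attribCalls = 2 * mesh.numAttribs;  // enable + pointer per attribute
        return bufferBinds + attribCalls;
    }

    // Must be called when the applied mesh is destroyed or its buffers change.
    void Invalidate() { currentMeshId = -1; }

private:
    int currentMeshId = -1;
};
```

How much this saves depends on draw order: the filter only kicks in when consecutive draws use the same mesh, so sorting draws by mesh makes it more effective.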

All in all the results were encouraging: per-frame GL calls dropped from 7k down to 4k. In comparison: when using the vertex array object extension the number of GL calls goes down to about 3k.

This could be improved even more by reducing the number of vertex buffers, and bundling the vertex data of many graphics objects into one or a few big vertex buffers, since then far fewer buffer binds and vertex attribute specification calls would be needed (at least if they occur in the right sequence). But for this I would either need the glDrawElementsBaseVertex() function, which is not available in WebGL, or I would need to fix up lots of indices whenever vertex data is created or destroyed (but since un-extended WebGL only has 16-bit indices, this would limit the size of one compound vertex buffer to 64k vertices, and limit the efficiency of the bundling, hmm...).
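The index fix-up variant would look roughly like this: when a mesh is appended to the compound buffer, each of its indices is offset by the number of vertices already in the buffer. This is an untested idea rather than something the demos do; CompoundBuffer and AppendMesh are made-up names, and only the index side is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical compound buffer: tracks how many vertices are already stored
// and holds the rebased 16-bit indices of all appended meshes.
struct CompoundBuffer {
    std::size_t numVertices = 0;
    std::vector<std::uint16_t> indices;
};

// Appends a mesh by rebasing its indices onto the current end of the
// compound vertex data. Returns false if the result could no longer be
// addressed with 16-bit indices (more than 65536 vertices).
bool AppendMesh(CompoundBuffer& dst, std::size_t meshVertexCount,
                const std::vector<std::uint16_t>& meshIndices) {
    const std::size_t kMaxVertices = 65536;
    if (dst.numVertices + meshVertexCount > kMaxVertices) {
        return false;
    }
    const std::size_t base = dst.numVertices;
    for (std::uint16_t index : meshIndices) {
        assert(index < meshVertexCount);
        dst.indices.push_back(static_cast<std::uint16_t>(base + index));
    }
    dst.numVertices += meshVertexCount;
    return true;
}
```

Appending is the easy part. Destroying a mesh in the middle leaves a hole, and closing that hole means rebasing the indices of every mesh stored behind it, which is where the "lots of indices" cost comes from.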

Anyway, to wrap this up, Chrome already exposes the OES_vertex_array_object extension, and an ANGLE_instanced_arrays extension seems to be on the way. Both should help a lot to reduce GL calls already. Then the only remaining problem is texture assignment and uniform updates in scenes with many different materials.

But I think before working on reducing GL calls even more I'll try to do something about the stuttering when new graphics assets are streamed in.

Over & Out,
-Floh.