A Tour of Oryol's Metal Renderer

The Metal rendering backend has about 2.8k lines of code (with about 20% comments) in ~26 files. All source files are written in Objective-C++, and most headers are included both by Objective-C++ and C++ code, with some macros to hide Objective-C++ types from the C++ side.

NOTE: all class names starting with MTL or MTK are OSX/iOS classes, while all names starting with mtl or Gfx:: are Oryol classes or methods. The Oryol classes mentioned here (those starting with ‘mtl’) are private to the Gfx module; the methods starting with Gfx:: are static methods of the public Gfx facade.

Window Management and Main Loop

MetalKit’s MTKView is used on both platforms (iOS and OSX) to present the rendered frame and to drive the entire Oryol application loop in its draw-callback. On iOS the differences between the GL and Metal backends are minimal: basically an MTKView is used instead of a GLKView.

On OSX the difference to the GL version is bigger, since GLFW is not used for window management. In general, the OSX application wrapper code is much closer to the iOS version when Metal is used.

The main differences to the GL version on OSX are:

no dependency on GLFW
the whole Oryol application frame is executed inside the -drawInMTKView callback (same as on iOS)
the OSX application immediately creates a hidden application window, a Metal device, MTKView and MTKViewDelegate at startup, which are only reconfigured when the Gfx module is initialized
a lot of boring input-handling code which previously was provided by GLFW

Gfx Module Setup

At Gfx module setup time, the (still hidden) application window is resized and the MTKView’s contentScaleFactor is configured to either render in full Retina resolution or upscaled from half-res (which is the default). Finally the application window is made visible.

Gfx Resources

Oryol Gfx resources are minimal wrapper objects around one or more Metal objects. They are just ‘dumb’ data objects without any functionality, living in object pools.

Meshes

mtlMesh objects own one or more MTLBuffer objects, used as vertex- and index-buffers. For dynamic meshes there are multiple (usually 2) vertex- and index-buffers which are rotated through so that CPU writes and GPU reads don’t collide.

There is no fine-grained locking or partial updating for dynamic resources. Each dynamic resource can only be updated once per frame, and must be completely overwritten with new data.
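The rotation and the once-per-frame rule can be sketched in plain C++. This is only a model under assumptions: DynamicBufferSet is a hypothetical name, byte vectors stand in for MTLBuffers, and the ‘completely overwritten’ rule is interpreted strictly as ‘the update must have the full buffer size’.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a dynamic mesh that owns one CPU-visible
// buffer per in-flight frame (byte vectors instead of MTLBuffers).
struct DynamicBufferSet {
    std::vector<std::vector<std::uint8_t>> slots;
    int activeSlot = 0;                 // slot the GPU reads this frame
    std::int64_t lastUpdateFrame = -1;  // frame index of the last update

    DynamicBufferSet(int numSlots, std::size_t byteSize)
        : slots(numSlots, std::vector<std::uint8_t>(byteSize)) {
        assert(numSlots > 0);
    }

    // Overwrites the next slot completely. Returns false if the resource
    // was already updated in this frame or the data size does not match.
    bool update(std::int64_t frameIndex, const void* data, std::size_t byteSize) {
        if (frameIndex == lastUpdateFrame) return false;
        if (byteSize != slots[0].size()) return false;
        activeSlot = (activeSlot + 1) % static_cast<int>(slots.size());
        std::memcpy(slots[activeSlot].data(), data, byteSize);
        lastUpdateFrame = frameIndex;
        return true;
    }
};
```

With two slots, the CPU always writes into the slot that the GPU finished reading one frame ago, which is why no locking is needed.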

Mesh creation happens in the mtlMeshFactory class, which only calls the methods [MTLDevice newBufferWithLength:options:] or [MTLDevice newBufferWithBytes:length:options:].

NOTE: it probably makes a lot of sense to only allocate a single buffer per Mesh object, since Metal doesn’t care either way, and all buffers of a Mesh object will be destroyed at the same time anyway

Textures

mtlTexture objects own one or more MTLTexture objects and one MTLSamplerState object. Immutable textures always have one MTLTexture object, while dynamic textures (where the entire content is replaced each frame by the CPU) have multiple MTLTexture objects to rotate through, following the same principle as dynamic meshes.

If the mtlTexture object is used as render-target it can optionally own an additional MTLTexture object used as depth-buffer.

Texture creation happens in the mtlTextureFactory class and calls the following Metal methods:

[[MTLTextureDescriptor alloc] init]
[MTLDevice newTextureWithDescriptor]
[MTLTexture replaceRegion] (only if texture is initialized with data)
[[MTLSamplerDescriptor alloc] init]
[MTLDevice newSamplerStateWithDescriptor]

NOTE: Sampler state objects are currently not re-used even if they contain the same state. This, together with a state cache in the mtlRenderer class would be a very useful optimization.
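A sampler cache along those lines could look roughly like this. It is a sketch, not Oryol code: SamplerDesc, Filter, Wrap and SamplerCache are hypothetical names, the description is reduced to four fields, and an integer id stands in for the shared MTLSamplerState object.

```cpp
#include <cassert>
#include <map>
#include <tuple>

// Hypothetical, reduced sampler description. The real MTLSamplerDescriptor
// has more fields; these enums are illustrative only.
enum class Filter { Nearest, Linear };
enum class Wrap { Clamp, Repeat, Mirror };

struct SamplerDesc {
    Filter minFilter, magFilter;
    Wrap wrapU, wrapV;
    bool operator<(const SamplerDesc& rhs) const {
        return std::tie(minFilter, magFilter, wrapU, wrapV) <
               std::tie(rhs.minFilter, rhs.magFilter, rhs.wrapU, rhs.wrapV);
    }
};

// Hands out one sampler id per distinct description. In a Metal backend
// the id would be replaced by a shared MTLSamplerState object.
class SamplerCache {
public:
    int getOrCreate(const SamplerDesc& desc) {
        auto it = cache.find(desc);
        if (it != cache.end()) return it->second;
        const int id = numCreated++;  // stands in for the creation call
        cache.emplace(desc, id);
        return id;
    }
    int created() const { return numCreated; }
private:
    std::map<SamplerDesc, int> cache;
    int numCreated = 0;
};
```

A shared sampler would also need some form of reference counting before it can be handed to the deferred release queue, which this sketch leaves out.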

Shaders

mtlShader objects own one MTLLibrary object and multiple MTLFunction objects pointing to vertex- and fragment-shader functions.

There is one MTLLibrary per Oryol shader object which can contain multiple vertex- and fragment-shader entry points.

Shaders are created in the mtlShaderFactory class from precompiled byte code embedded into the executable through static C arrays generated by the Oryol shader code generator. The shader code generator also writes C structs for ‘shader uniform blocks’ which have the right member alignment so that they can be transferred to Metal with a simple memcpy.
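To illustrate what ‘the right member alignment’ means, here is a hand-written sketch of such a struct. ExampleParams and its members are made up for illustration and are not actual generator output; the layout assumes the Metal shading language rules that float4x4 and float4 are 16-byte aligned and that a float3 occupies 16 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical uniform block, padded explicitly so that the C++ layout
// matches the assumed shader-side layout.
struct alignas(16) ExampleParams {
    float mvp[16];       // float4x4, offset 0
    float lightDir[3];   // float3,   offset 64
    float pad0;          // padding so the next member starts at 80
    float color[4];      // float4,   offset 80
    float time;          // float,    offset 96
    float pad1[3];       // tail padding up to a multiple of 16
};

static_assert(offsetof(ExampleParams, lightDir) == 64, "float3 offset");
static_assert(offsetof(ExampleParams, color) == 80, "float4 offset");
static_assert(offsetof(ExampleParams, time) == 96, "float offset");
static_assert(sizeof(ExampleParams) == 112, "size is a multiple of 16");

// Copies the block into a destination buffer, as a renderer would do
// with a slot in its uniform buffer.
inline void copyUniforms(void* dst, const ExampleParams& params) {
    std::memcpy(dst, &params, sizeof(params));
}
```

The static_asserts are the useful part: if the C++ and shader layouts ever drift apart, the build breaks instead of the rendering.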

The following Metal methods are called during shader creation:

[MTLDevice newLibraryWithData]
[MTLLibrary newFunctionWithName]

DrawStates

mtlDrawState objects own one MTLRenderPipelineState and one MTLDepthStencilState Metal object.

DrawStates are created in the mtlDrawStateFactory class with the Metal methods:

[[MTLDepthStencilDescriptor alloc] init]
[[MTLStencilDescriptor alloc] init] (only if StencilEnabled)
[MTLDevice newDepthStencilStateWithDescriptor]
[MTLVertexDescriptor vertexDescriptor]
[[MTLRenderPipelineDescriptor alloc] init]
[MTLDevice newRenderPipelineStateWithDescriptor]

One minor inconvenience is that the pipeline state object needs to know about some render target attributes (color and depth/stencil format and MSAA sample count), which means a specific DrawState object is only compatible with specific render targets. This limitation is present in all new 3D-APIs though.
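The compatibility rule boils down to comparing three attributes. A minimal sketch, with hypothetical names throughout (PixelFormat, TargetAttrs and DrawStateSketch are neither Metal nor Oryol types):

```cpp
#include <cassert>

// Illustrative pixel formats, not the Metal or Oryol enums.
enum class PixelFormat { None, RGBA8, RGBA16F, Depth32F, Depth24Stencil8 };

// The render target attributes that are baked into a pipeline state.
struct TargetAttrs {
    PixelFormat colorFormat;
    PixelFormat depthStencilFormat;
    int sampleCount;
};

// Stand-in for a draw state which remembers what it was created for.
struct DrawStateSketch {
    TargetAttrs builtFor;
};

// A draw state can only be used while a matching render target is active.
inline bool canRenderTo(const DrawStateSketch& ds, const TargetAttrs& target) {
    return ds.builtFor.colorFormat == target.colorFormat &&
           ds.builtFor.depthStencilFormat == target.depthStencilFormat &&
           ds.builtFor.sampleCount == target.sampleCount;
}
```

A check like this is cheap enough to run as a debug-mode assertion when a draw state is applied.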

Resource Destruction

Oryol uses ‘unretained’ command buffers to eliminate refcounting overhead inside Metal. Resources that need to be released are added to a ‘deferred release queue’ and are actually destroyed a few frames later when it is guaranteed that the GPU no longer uses those resources.

This generally happens for all Metal objects created by the Oryol resource factory classes.
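A deferred release queue can be sketched as a list of destroy callbacks tagged with the frame at which they become safe to run. The names and the fixed frame delay are assumptions for illustration; the actual Oryol queue may be organized differently.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical deferred release queue: resources are destroyed a fixed
// number of frames after they were handed over.
class DeferredReleaseQueue {
public:
    explicit DeferredReleaseQueue(int frameDelay) : delay(frameDelay) {
        assert(frameDelay > 0);
    }

    // Schedules 'destroy' to run 'delay' frames after currentFrame.
    void release(std::int64_t currentFrame, std::function<void()> destroy) {
        items.push_back({currentFrame + delay, std::move(destroy)});
    }

    // Call once per frame. Destroys everything that is due and returns
    // the number of destroyed entries.
    int garbageCollect(std::int64_t currentFrame) {
        int numDestroyed = 0;
        while (!items.empty() && items.front().dueFrame <= currentFrame) {
            items.front().destroy();
            items.pop_front();
            numDestroyed++;
        }
        return numDestroyed;
    }

private:
    struct Item {
        std::int64_t dueFrame;
        std::function<void()> destroy;
    };
    std::deque<Item> items;
    int delay;
};
```

Since entries are appended in frame order, the queue stays sorted by due frame and only the front needs to be inspected.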

The Render Loop

Everything rendering-related happens in the internal mtlRenderer class. At setup-time, a frame-synchronization semaphore and a single MTLCommandQueue are created, as well as (usually) 2 big global ‘uniform buffers’ (these are standard MTLBuffers used to store a frame’s worth of shader uniform updates).

There is a compile-time constant GfxConfig::MtlMaxInflightFrames which defines how many frames of work the CPU can queue up before it needs to wait for the GPU. All dynamic resources, as well as the global uniform buffer have as many copies to rotate through as the number of GfxConfig::MtlMaxInflightFrames (default is 2, which means double-buffering).

The size of the global uniform buffer must be defined up front and cannot grow. Slots in the uniform buffer have a minimal size of 256 bytes (this is an alignment requirement), thus the size of the global uniform buffer must be at least (max number of Gfx::ApplyUniformBlock() calls per frame) * 256 bytes.
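As a worked example of that lower bound (the function names and the figure of 1024 calls per frame are made up for illustration):

```cpp
#include <cassert>
#include <cstddef>

// Minimal slot size in bytes, as stated in the text.
constexpr std::size_t kSlotSize = 256;

// Lower bound for one uniform buffer, assuming every uniform block
// fits into a single 256-byte slot.
constexpr std::size_t minUniformBufferSize(std::size_t maxApplyCallsPerFrame) {
    return maxApplyCallsPerFrame * kSlotSize;
}

// Memory for all copies, one per in-flight frame.
constexpr std::size_t totalUniformMemory(std::size_t maxApplyCallsPerFrame,
                                         std::size_t inflightFrames) {
    return minUniformBufferSize(maxApplyCallsPerFrame) * inflightFrames;
}

// 1024 calls per frame need at least 256 KB per buffer, and 512 KB in
// total with double-buffering.
static_assert(minUniformBufferSize(1024) == 256 * 1024, "per-buffer size");
static_assert(totalUniformMemory(1024, 2) == 512 * 1024, "total size");
```

Uniform blocks larger than 256 bytes occupy more than one slot, so this really is only a lower bound.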

Gfx::ApplyRenderTarget()

Oryol doesn’t have Metal’s concept of render passes but still maps quite well to it. The Gfx facade method Gfx::ApplyRenderTarget() and the final Gfx::CommitFrame() serve as pass boundaries.

Gfx::ApplyRenderTarget() finishes any previous pass (in the same frame) and starts a new pass, Gfx::CommitFrame() finishes the last pass of the frame.

This is what happens during ApplyRenderTarget():

the very first call in the current frame creates a new MTLCommandBuffer via [MTLCommandQueue commandBufferWithUnretainedReferences] (this is the magic method which gives us a command buffer that doesn’t incur refcounting overhead), the command buffer will be valid for the entire frame until Gfx::CommitFrame()
if a previous pass in the same frame exists, it is finished with [MTLRenderCommandEncoder endEncoding]
a new MTLRenderPassDescriptor is requested either from the MTKView (if rendering to the default render target), or via [MTLRenderPassDescriptor renderPassDescriptor] (if rendering to an offscreen render target)
the color, depth and stencil attachments are set on the pass descriptor (if rendering to offscreen render target)
the ‘LoadAction’ is defined: if the render target should be cleared, this will be MTLLoadActionClear, otherwise MTLLoadActionDontCare
a new MTLRenderCommandEncoder is created with [MTLCommandBuffer renderCommandEncoderWithDescriptor]
the global uniform buffer is bound to 4 reserved vertex- and fragment-shader bind-slots (this binding only needs to happen once on the command encoder, so we do it right here)
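The steps above form a small state machine. The following sketch only records the order of the calls as strings, so it can run without Metal; PassTracker is a hypothetical name and the commitFrame() part is reduced to its bare minimum.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records the order of Metal calls described in the text as strings.
class PassTracker {
public:
    std::vector<std::string> log;

    void applyRenderTarget(bool offscreen, bool clear) {
        if (!frameStarted) {
            // only the first call in a frame creates the command buffer
            log.push_back("commandBufferWithUnretainedReferences");
            frameStarted = true;
        }
        if (passOpen) {
            log.push_back("endEncoding");  // finish the previous pass
        }
        log.push_back(offscreen ? "renderPassDescriptor"
                                : "pass descriptor from MTKView");
        log.push_back(clear ? "MTLLoadActionClear" : "MTLLoadActionDontCare");
        log.push_back("renderCommandEncoderWithDescriptor");
        log.push_back("bind global uniform buffer");
        passOpen = true;
    }

    void commitFrame() {
        if (passOpen) log.push_back("endEncoding");
        log.push_back("commit");
        passOpen = false;
        frameStarted = false;
    }

    // Counts how often an entry appears in the log.
    int count(const std::string& entry) const {
        int n = 0;
        for (const auto& e : log) {
            if (e == entry) n++;
        }
        return n;
    }

private:
    bool frameStarted = false;
    bool passOpen = false;
};
```

Two passes in one frame thus produce one command buffer, two encoders and two endEncoding calls, with the second endEncoding issued by commitFrame().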

Metal’s render-pass concept required moving the formerly separate Gfx::Clear() method into the Gfx::ApplyRenderTarget() method. No big deal, in fact it makes the Gfx interface simpler.

NOTE: there are actually 2 public methods to apply the next render target: Gfx::ApplyRenderTarget() for offscreen-rendertargets, and Gfx::ApplyDefaultRenderTarget() to render to the default render target which is made visible at the end of a frame. Internally, both call the same mtlRenderer::applyRenderTarget() method.

Gfx::ApplyDrawState()

The Gfx::ApplyDrawState() method sets the entire static render state and all textures required for the following draw calls. It is not possible to change the texture assignment between draw calls without also setting a new draw-state. This is an Oryol convention which was not dictated by Metal, but rather by the more recent D3D12 backend.

The following Metal methods are called:

[MTLRenderCommandEncoder setBlendColor]
[MTLRenderCommandEncoder setCullMode]
[MTLRenderCommandEncoder setStencilReferenceValue]
[MTLRenderCommandEncoder setRenderPipelineState]
[MTLRenderCommandEncoder setDepthStencilState]
[MTLRenderCommandEncoder setVertexBuffer] (this might be called multiple times to bind several buffers with vertex and index data to the vertex shader stage)

Followed by methods to assign textures and samplers to the vertex- and fragment-shader stages:

[MTLRenderCommandEncoder setVertexTexture]
[MTLRenderCommandEncoder setVertexSamplerState]
[MTLRenderCommandEncoder setFragmentTexture]
[MTLRenderCommandEncoder setFragmentSamplerState]

Gfx::ApplyUniformBlock()

This copies a small chunk of memory with shader uniform data into the next free location of the global uniform buffer, records the new location in the command buffer, and advances the uniform buffer offset with an alignment of 256 bytes:

a memcpy() to copy the uniform data
either [MTLRenderCommandEncoder setVertexBufferOffset:atIndex:] or [MTLRenderCommandEncoder setFragmentBufferOffset:atIndex:] to bind the buffer location to a shader stage bind slot
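In other words, the uniform buffer is used as a simple bump allocator. A sketch of the two steps, with a byte vector standing in for the MTLBuffer and hypothetical names (UniformBufferSketch, applyUniformBlock); the returned offset is what would be passed to the setVertexBufferOffset or setFragmentBufferOffset call:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kUniformAlign = 256;

// Models one of the global uniform buffers as a bump allocator.
class UniformBufferSketch {
public:
    explicit UniformBufferSketch(std::size_t byteSize) : storage(byteSize) {
        assert(byteSize % kUniformAlign == 0);
    }

    // Start writing at the beginning again for a new frame.
    void beginFrame() { offset = 0; }

    // Copies the uniform data to the next free location and returns the
    // offset to bind, or -1 if the buffer is full.
    std::int64_t applyUniformBlock(const void* data, std::size_t size) {
        if (offset + size > storage.size()) return -1;
        std::memcpy(storage.data() + offset, data, size);
        const std::size_t boundOffset = offset;
        // advance to the next multiple of 256
        offset = (offset + size + kUniformAlign - 1) & ~(kUniformAlign - 1);
        return static_cast<std::int64_t>(boundOffset);
    }

private:
    std::vector<std::uint8_t> storage;
    std::size_t offset = 0;
};
```

Because the buffer cannot grow, running out of space has to be reported to the caller in some way; returning -1 is just one option chosen for this sketch.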

Gfx::Draw() and Gfx::DrawInstanced()

Metal has no separation between instanced and non-instanced rendering: the draw method always accepts an instance count.

Depending on whether this is an indexed or non-indexed draw, one of the following Metal methods is called:

[MTLRenderCommandEncoder drawIndexedPrimitives]
[MTLRenderCommandEncoder drawPrimitives]

No additional operations happen here, all static state has been applied in Gfx::ApplyDrawState() and shader uniforms have been updated in Gfx::ApplyUniformBlock().

Gfx::UpdateVertices() and Gfx::UpdateIndices()

These are straight memory copies into the next free MTLBuffer of a dynamic mesh resource object (‘free’ means: not currently accessed by the GPU). Only one complete update is allowed per mesh resource and frame. On OSX, the method [MTLBuffer didModifyRange] is called.

Gfx::UpdateTexture()

Same basic rules as the vertex/index update, but instead of a straight memcpy, [MTLTexture replaceRegion] is called once per mipmap surface that should be updated.
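For each of those calls the size and row pitch of the mipmap surface must be known. The helper below is a sketch for uncompressed, tightly packed pixel data only; MipSurface and mipSurfaces are hypothetical names, and the computed bytesPerRow is the kind of value one would pass to replaceRegion:mipmapLevel:withBytes:bytesPerRow:.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Dimensions and sizes of one mipmap surface.
struct MipSurface {
    int level;
    int width;
    int height;
    std::size_t bytesPerRow;
    std::size_t byteSize;
};

// Lists the surfaces of a mipmap chain. Each level halves the previous
// one, but never drops below 1 pixel in either dimension.
std::vector<MipSurface> mipSurfaces(int width, int height, int numMips,
                                    int bytesPerPixel) {
    assert(width > 0 && height > 0 && numMips > 0 && bytesPerPixel > 0);
    std::vector<MipSurface> result;
    for (int level = 0; level < numMips; level++) {
        const int w = std::max(1, width >> level);
        const int h = std::max(1, height >> level);
        const std::size_t rowBytes = static_cast<std::size_t>(w) * bytesPerPixel;
        result.push_back({level, w, h, rowBytes, rowBytes * h});
    }
    return result;
}
```

Compressed formats work in blocks rather than pixels and would need a different row-pitch computation.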

Gfx::CommitFrame()

This is the most interesting method because it contains the only place in the entire frame where the GPU and CPU need to sync up.

The first thing that happens though (only on OSX) is a call to [MTLBuffer didModifyRange] on the global uniform buffer over the range written by shader uniform updates.

Next, the render pass is finished via [MTLRenderCommandEncoder endEncoding] and the command buffer is marked as ready for presenting with [MTLCommandBuffer presentDrawable].

Then before the MTLCommandBuffer is committed, a completion handler block (basically an Objective-C lambda function) is added to signal the frame-sync semaphore ‘after the device has completed the execution of the command buffer’. After that the frame’s command buffer is committed for execution with [MTLCommandBuffer commit].

Finally the CPU waits on the frame-sync semaphore until the frame before (not the one we just committed) has finished (or an even older frame, depending on the GfxConfig::MtlMaxInflightFrames constant).
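The counting behind this can be modeled without threads. The sketch below is a single-threaded model, not the actual implementation: InflightFrames is a hypothetical name, and where the real renderer blocks on the semaphore, the model only reports whether the CPU would have to wait.

```cpp
#include <cassert>

// Single-threaded model of the frame-sync counting. commit() is the CPU
// side, complete() stands for the command buffer's completion handler.
class InflightFrames {
public:
    explicit InflightFrames(int maxInflightFrames)
        : maxInflight(maxInflightFrames) {
        assert(maxInflightFrames > 0);
    }

    // CPU: the frame's command buffer was committed.
    void commit() { committed++; }

    // GPU: the oldest pending frame has finished executing.
    void complete() {
        assert(completed < committed);
        completed++;
    }

    // True if the CPU would have to block before starting the next frame.
    bool cpuMustWait() const {
        return (committed - completed) >= maxInflight;
    }

private:
    int maxInflight;
    long committed = 0;
    long completed = 0;
};
```

With a limit of 2, committing the first frame never blocks; after committing the second frame, the CPU has to wait until the first one has completed.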

Conclusion

And that’s it mostly. There are a few free-standing state-update functions that are not worth mentioning and which map directly to Metal functions (namely Gfx::ApplyViewPort() and Gfx::ApplyScissorRect()).

In general, I find the Metal API very clean and a joy to work with, it really makes life easier, which is an essential quality of any API. Unlike most other 3D APIs I never had a ‘WTF were they thinking?’ moment. I almost forgot to mention the API Validation layer which really provides helpful feedback. A good validation message is worth a thousand documentation pages ;)

The only flaw of Metal is that it isn’t a plain C API, but that’s of course IMHO. Mixing Objective-C and C++ is possible but ugly, and the hidden lifetime-management of Objective-C objects is a constant source of pain (Automatic Reference Counting, @autoreleasepools and all that jazz). Also: just how expensive is _objc_msgSend() really?!

Oryol doesn’t make use of the more advanced Metal features like the whole compute stuff or full-blown C++ shaders. One reason for this is that the Oryol Gfx module needs to find a balance between WebGL and OpenGL ES2 on the low-end, while still mapping somewhat well to the modern 3D APIs, if only for the sake of simpler and saner code. From a pure performance perspective, it will be hard to beat D3D11 or the better desktop OpenGL drivers for an API wrapper that still needs to support GLES2, but at least on iOS, using Metal is a no-brainer. It is better in every single aspect than GLES2 or GLES3.