Multithreading in emscripten with HTML5 WebWorkers
Multithreading in emscripten is different from what we C/C++ coders are used to. There is no concept of threads with shared memory state in JavaScript, so emscripten can't simply offer a pthreads wrapper like NaCl does. Instead it uses HTML5 WebWorkers and a high-level message-passing API to spread work across several CPU cores.
You basically pass a memory buffer over to the worker thread as input data; the worker thread does its processing and passes a memory buffer with the result data back to the main thread.
The downsides are (1) you can't simply port your existing multi-threaded code over to emscripten, (2) it is (currently) somewhat expensive to pass data around since it involves copying, and (3) you cannot express all multithreading patterns in emscripten. The upside, though, is that it's really hard to shoot yourself in the foot, since there's no shared state, and all the multithreading primitives you love to hate (like mutexes, semaphores, cond-vars, atomic-ops) simply don't exist.
Let's have a quick look at emscripten's worker API. Only 4 API functions and 2 user-provided functions are necessary:
worker_handle emscripten_create_worker(const char* url);
This creates a new worker object. It takes the URL of a separate emscripten-generated JavaScript file.
The worker file must export at least one C function. The name doesn't matter, but the function must be explicitly exported using emscripten's new "-s EXPORTED_FUNCTIONS" switch so that it isn't removed by dead-code elimination. The worker function prototype looks like this:
void dowork(char* data, int size);
The arguments define the location and size of the input data.
The function to invoke the worker is:
void emscripten_call_worker(worker_handle worker, const char *funcname, char *data, int size, void (*callback)(char *, int, void*), void *arg);
This takes the worker handle returned by emscripten_create_worker(), the name of the worker function (in our case "dowork"), a pointer to and size of the input data, a completion callback function pointer, and finally a custom argument which is passed through to the completion callback to associate the completion call with the invocation call.
At some point after emscripten_call_worker() is called, the dowork function will be called in the worker thread with a data pointer and size. Since the worker has its own address space, the pointer value it receives will of course differ from the pointer value passed to emscripten_call_worker().
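To make the call side a bit more concrete, here is a minimal sketch of how an invocation could be wrapped. The job struct, on_done() and submit_job() are names I made up for illustration. The #else branch is only a host-side stand-in so the snippet also compiles outside the browser: its fake worker simply echoes the input back synchronously, which a real WebWorker does not do.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#ifdef __EMSCRIPTEN__
#include <emscripten.h>
#else
/* Host-side stand-ins (assumptions, for testing outside the browser).
   The fake worker echoes its input back, and does so synchronously. */
typedef int worker_handle;
static worker_handle emscripten_create_worker(const char* url) {
    (void)url;
    return 1;
}
static void emscripten_call_worker(worker_handle worker, const char* funcname,
                                   char* data, int size,
                                   void (*callback)(char*, int, void*), void* arg) {
    (void)worker; (void)funcname;
    /* Copy the input to mimic the worker's separate address space. */
    char* copy = malloc(size > 0 ? (size_t)size : 1);
    if (copy == NULL) return;
    if (size > 0) memcpy(copy, data, (size_t)size);
    if (callback != NULL) callback(copy, size, arg);
    free(copy);
}
#endif

/* Hypothetical per-call bookkeeping, passed through as the custom argument. */
typedef struct {
    int job_id;       /* identifies which invocation has completed */
    int done;         /* set to 1 by the completion callback */
    int result_size;  /* size of the result reported by the worker */
} job;

/* Completion callback: only records that the job has finished. */
static void on_done(char* data, int size, void* arg) {
    (void)data;
    job* j = (job*)arg;
    j->result_size = size;
    j->done = 1;
}

/* Sends one input buffer to the "dowork" function of the given worker.
   The job must stay alive until the completion callback has run. */
void submit_job(worker_handle worker, job* j, char* input, int size) {
    assert(j != NULL && size >= 0);
    j->done = 0;
    j->result_size = 0;
    emscripten_call_worker(worker, "dowork", input, size, on_done, j);
}
```

In a real build the callback fires some time later on the main thread, so the job object should live on the heap or in some long-lived pool, not on the stack of the calling function.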
The worker function now uses this input data to compute a result, and (optionally) hands this result back to the main thread using this function:
void emscripten_worker_respond(char* data, int size);
The return data will be copied inside the function, so if the worker function allocated a result buffer, it remains the owner of that buffer and is responsible for releasing it.
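A worker function following these ownership rules might look roughly like the sketch below. The byte-inverting "work" is just a placeholder I picked, and the #else branch is a host-side stand-in (an assumption, not part of emscripten) that records the response so the function can be tried outside the browser.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#ifdef __EMSCRIPTEN__
#include <emscripten.h>
#else
/* Host-side stand-in (an assumption, for testing outside the browser):
   keeps a copy of the last response, like the real call copies the data. */
static char last_response[64];
static int last_response_size = -1;
static void emscripten_worker_respond(char* data, int size) {
    last_response_size = size;
    if (data != NULL && size > 0 && size <= (int)sizeof(last_response)) {
        memcpy(last_response, data, (size_t)size);
    }
}
#endif

/* Worker entry point. It must be kept alive through EXPORTED_FUNCTIONS,
   where C symbols are listed with a leading underscore ("_dowork"). */
void dowork(char* data, int size) {
    assert(size >= 0 && (data != NULL || size == 0));

    /* The worker allocates the result buffer and stays its owner. */
    unsigned char* result = malloc(size > 0 ? (size_t)size : 1);
    if (result == NULL) {
        /* Assumed to be accepted as an empty response. */
        emscripten_worker_respond(NULL, 0);
        return;
    }

    /* Placeholder work: invert every input byte. */
    for (int i = 0; i < size; i++) {
        result[i] = (unsigned char)(255 - (unsigned char)data[i]);
    }

    /* The response is copied inside the call, so freeing afterwards is safe. */
    emscripten_worker_respond((char*)result, size);
    free(result);
}
```

As far as I know the worker file is then built with something like "-s BUILD_AS_WORKER=1" plus the EXPORTED_FUNCTIONS switch, but check the exact flags against the emscripten version you are using.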
Finally, once the worker has finished, the completion callback will be called on the main thread with the result data, and the custom arg given in the emscripten_call_worker() call:
void completion_callback(char* data, int size, void* arg);
The callback does not gain ownership of the data buffer, so it may read or copy the received data, but it must not write to or free the buffer.
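Since the buffer only belongs to the callback for the duration of the call, anything that has to survive longer must be copied out. A minimal sketch of such a callback follows; the result struct and result_free() are hypothetical helpers, not part of the emscripten API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for a result that must outlive the callback. */
typedef struct {
    char* bytes;  /* owned copy of the result, released by result_free() */
    int size;     /* number of valid bytes in 'bytes' */
    int ready;    /* set to 1 once the callback has run */
} result;

/* Matches the completion callback signature; 'arg' is expected to be a result*. */
void completion_callback(char* data, int size, void* arg) {
    result* r = (result*)arg;
    assert(r != NULL && size >= 0);

    r->bytes = NULL;
    r->size = 0;
    if (data != NULL && size > 0) {
        /* Copy the data: the incoming buffer is only read, never kept or freed. */
        r->bytes = malloc((size_t)size);
        if (r->bytes != NULL) {
            memcpy(r->bytes, data, (size_t)size);
            r->size = size;
        }
    }
    r->ready = 1;
}

/* Releases the copy made by completion_callback(). */
void result_free(result* r) {
    free(r->bytes);
    r->bytes = NULL;
    r->size = 0;
    r->ready = 0;
}
```

The main loop can then poll the ready flag once per frame and consume the copied bytes whenever it suits.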
Finally there's a function to destroy a worker:
void emscripten_destroy_worker(worker_handle worker);
As with threads, creating and destroying workers is not cheap, so you should create a couple of workers at the start of the application and keep them around instead of creating and destroying workers repeatedly. It's also wise to batch as much work as possible per worker invocation to offset the call overhead (don't call a worker many times per frame, ideally only once), but this is all pretty much common sense.
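Keeping workers around could be organized as a small fixed-size pool, roughly like this. The worker_pool struct, the pool_* functions and the pool size of 4 are my own illustrative choices, and the #else branch is again just a host-side stand-in that counts live workers so the logic can be checked outside the browser.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __EMSCRIPTEN__
#include <emscripten.h>
#else
/* Host-side stand-ins (assumptions, for testing outside the browser). */
typedef int worker_handle;
static int stub_live_workers = 0;
static worker_handle emscripten_create_worker(const char* url) {
    (void)url;
    return ++stub_live_workers;
}
static void emscripten_destroy_worker(worker_handle worker) {
    (void)worker;
    --stub_live_workers;
}
#endif

#define POOL_SIZE 4

/* Hypothetical pool: workers are created once and reused for every job. */
typedef struct {
    worker_handle handles[POOL_SIZE];
    int pending[POOL_SIZE];  /* calls sent to each worker that have not completed yet */
} worker_pool;

/* Creates all workers up front, at application start. */
void pool_init(worker_pool* pool, const char* url) {
    assert(pool != NULL && url != NULL);
    for (int i = 0; i < POOL_SIZE; i++) {
        pool->handles[i] = emscripten_create_worker(url);
        pool->pending[i] = 0;
    }
}

/* Returns the index of the worker with the fewest pending calls
   (the lowest index wins on a tie). */
int pool_pick(const worker_pool* pool) {
    assert(pool != NULL);
    int best = 0;
    for (int i = 1; i < POOL_SIZE; i++) {
        if (pool->pending[i] < pool->pending[best]) {
            best = i;
        }
    }
    return best;
}

/* Destroys all workers, at application shutdown. */
void pool_shutdown(worker_pool* pool) {
    assert(pool != NULL);
    for (int i = 0; i < POOL_SIZE; i++) {
        emscripten_destroy_worker(pool->handles[i]);
        pool->pending[i] = 0;
    }
}
```

The caller would increment pending[i] right before emscripten_call_worker() and decrement it again in the completion callback, so that pool_pick() always prefers the least busy worker.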
The worker JavaScript file must be created as a separate compilation unit. It's a bit like on the PS3, where the SPU code also must be compiled into small, complete "SPU executables". To keep the code size small I decided to keep the runtime environment in the worker scripts as slim as possible: there's no complete Nebula3 environment, only a minimal C runtime environment. But this is not a limitation of emscripten, only a decision on my part. Most of the time the workers will contain simple math code which loops over arrays of data instead of high-level object-oriented code. To avoid downloading redundant code it might also make sense to put several worker functions into a single JS file.
The updated Nebula3/emscripten demos at http://n3emscripten.appspot.com now decompress the downloaded asset files in up to 4 WebWorker threads in parallel to the main thread. This speeds up asset loading tremendously and avoids the excessive frame hiccups which happened before. This is important, since real-world Nebula3 apps stream asset data on demand while the render loop is running. The whole thing took me about half a day, but unfortunately I stumbled across a Chrome bug which required a small workaround (see here: http://code.google.com/p/chromium/issues/detail?id=169705).
It's not completely perfect yet. There's data copying happening on the main thread, and there's also some expensive stuff going on when creating the WebGL resources (for instance vertex and index data is unrolled for the instanced rendering hack). The ultimate goal is to move as much resource creation work as possible off the main thread in order to guarantee smooth rendering while resources are created.
There are also browser improvements in sight which will make WebWorkers more efficient in the future, mainly to avoid extra data copies by transferring ownership of the passed data over to the web worker, basically a move instead of a copy.
And that's it for today :)