Monday, January 23, 2012

Emscripten Standard Library Support, Now With More C++

I just landed much more comprehensive support for the C++ standard library in Emscripten, which now allows you to compile pretty much any C++ code using the standard C++ library to JavaScript. So I figured it was a good time to write up an overview of how Emscripten handles standard libraries.

As background, one of the initial design decisions in Emscripten was to focus on generating good code, even when that has some potential downsides elsewhere. Good code means both fast code and small code, both of which are particularly important on the web: While fast code is important everywhere, JavaScript is not yet as fast as native code, so to counter that we need to really focus on generating efficient code, and regarding code size, you might not care much about linking to a 5MB shared library on your desktop, but downloading and parsing a 5MB script is something significant.

For that reason, it didn't seem like a good idea to build C and C++ standard libraries, ship them with your code, and link them at runtime: The standard libraries are quite large. Furthermore, Emscripten doesn't have a single ABI, it has several code generation modes, as one example there are two typed array modes and one mode without typed arrays, and code compiled with one is not interface-compatible with another. Not having a stable ABI lets us generate more specialized and efficient code, but it is another reason for not shipping separate linkable standard libraries.

Instead, when compiling code with Emscripten we build the standard library along with your project's code. Everything is then shipped as a single file. This gives the advantages mentioned before: Smaller code size since we know which parts of the standard library you actually need, and faster code since we can specialize both the standard library and your own code, not just your own.

However, there are two disadvantages of this approach. The first is that combining the standard libraries with your own code means they form a single "unit". When normally you build your code and then link to the LGPL-licensed GNU standard libraries, it's clear that your code does not need to comply with the LGPL (you just need to comply with the LGPL regarding the LGPL'd library itself). But if you build your project together with the standard library, intertwining them in an optimized way, it is less clear how the LGPL applies here. I actually don't think there is a problem - it seems equivalent to the former "normal" case to me, despite the differences between them - however, I am not a lawyer, and also it is better to avoid any possible confusion and concern. In addition, even if the LGPL still applies just to the library, you would be shipping the library yourself, meaning you need to comply with the LGPL for it (which I don't think is a problem myself, but it is a concern for other people). For those reasons, Emscripten doesn't include any LGPL code. That ruled out using the GNU standard C and C++ libraries, which would otherwise be the first choice because of their familiarity and compatibility with existing code.

Given that decision, I looked at the other options and decided to use the Newlib C library. There are then two options: Use just the Newlib headers, or use the existing Newlib implementation code, porting it to the new platform. I decided to use just the headers, because (1) Newlib is already not 100% compatible with the GNU C library, so there would anyhow be inconsistencies and missing parts we would need to work around, and easier to do so in our own new code, (2) By implementing the C standard library in JavaScript, we can optimize it using the existing capabilities of the web platform (for example, we use JavaScript's sort inside qsort), and (3) Porting Newlib to the web would mean writing new C code inside Newlib, and interfacing that with JavaScript code that hooks into the web platform APIs themselves, which is a little less convenient than writing just JavaScript, and finally (4) Porting a C standard library means working with the internals of that library, as opposed to implementing the familiar C standard library interface which is higher-level. So, Emscripten has an implementation of the C standard library written in JavaScript, primarily using the Newlib headers. (The only exception is malloc and free, which we compile from dlmalloc, because writing an effective malloc implementation is not trivial. However, we should implement malloc and free in JavaScript eventually since we could optimize it quite a bit.)

For C++, again we couldn't use the GNU C++ standard library. Instead we started out with the libc++ headers. Just using the headers was enough to get a lot of code to run, because a lot of the functionality is in the headers themselves. We did need to introduce some ugly hacks in the headers though, as well as implement some bits in JavaScript. This was enough for a lot of projects to work, but was still missing a lot of stuff, almost everything that wasn't implemented in a header.

That problem is what I worked on fixing last week: As of Saturday, we will build the libc++ sources, if they are needed by your project, and include them. To get this to be efficient, I also enabled LLVM's global dead code elimination, so that while we link in the entire C++ standard library, we immediately eliminate all the parts you don't actually need before proceeding to compile to JavaScript. This helps quite a lot with the size of the generated code (and also is nice for compilation times). Aside from that, the main challenge here was getting libc++ to build using the Newlib C standard library headers. The end result is that all of our hacks in the libc++ headers are now removed, we now build stock libc++ and include it if necessary, which means pretty much any C++ program that uses the C++ standard library should work. (With the obvious caveats of no multithreading with shared state and so forth, which are general limitations of compiling to JavaScript.)

I mentioned before that there were two downsides to building the standard libraries with your project, the first of which was licensing, which led us to avoid LGPL code. The other disadvantage is build time: You compile the standard library into JavaScript when you build your project, as opposed to just building your own project and linking it with prebuilt standard libraries. While I believe this is definitely worth the advantages of the approach (faster and smaller generated code), it is a concern. We get around a lot of the problem by (1) having the C standard library implemented in JavaScript, so there is no compilation time for it, (2) using LLVM's dead code elimination to quickly get rid of parts of the C++ standard library (and dlmalloc, as mentioned before that we also build, if it is used) early in the compilation process, (3) only linking in the C++ standard library (and dlmalloc) if they are actually used, and (4) caching the bitcode result of compiling the C++ standard library (and dlmalloc), so that we only compile it once to bitcode. With these in place, while emcc is still slower than gcc, it isn't very significant, except for the very first time you compile libc++ from source into bitcode (which as mentioned before, is done once and then cached).