# Benchmarking Ruby, Python, JavaScript, Lua, Java, C++ and Assembly language

I admit it: I am a performance fetishist. Maybe that is why I have always looked down on the web a bit, and why real-time 3D visual simulation is still my favourite field of development. But I also admit that CPU cycles are cheap and still getting cheaper, while human brainpower is expensive and will become more expensive in the future, even with millions of Indian and Chinese programmers, because they will soon demand a life with good food, good doctors, a car, holidays and air conditioning. The CPUs will be happy with just the air conditioning. So substituting CPU power for brainpower is very desirable, at least for those who pay the bills.

This is where rapid development, middleware, frameworks and all kinds of web technology enter the scene. Digging deeper, behind all these systems there are programming languages of all sorts that promise exactly this: through late binding and dynamic typing you can glue things together, and even build a decent amount of business logic, much faster than in C++. Java lurks somewhere in between: it is not a dynamic “scripting” language, but it has a mighty isolation layer in the form of the VM and provides introspection and dynamic linkage.

While I do not yet have a simple way to quantify how much faster you can program or solve problems with a particular language, there is a fairly easy way to determine the cost in terms of CPU cycles.

To make things even easier, I am not interested in real-world scenarios, because there are far too many of them and too many ways to program them. Instead, my idea was to get a feeling for the absolute upper limit of what can be achieved with a particular programming language. For this purpose, I devised three simple tests:

1. LOOP: A counting loop that yields the upper limit of simple instructions I can execute:
``` n = 1000000; i = 0; while (i < n) { i += 1; } ```
2. CALL: A counting loop with a simple function call with two arguments, which yields the upper limit of function calls I can make:
``` n = 1000000; i = 0; while (i < n) { i = add(i, 1); } ```
3. MAT4: A loop that calls a function that carries out a full 4×4 matrix multiplication in this language:
``` n = 100000; i = 0; a = [...]; b = [...]; while (i < n) { c = multiplyMatrix(a, b); i += 1; } ```
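
As a rough illustration, the three tests could look like this in Python 3. This is only a sketch and not the original benchmark code (which ran on Python 2.4.1); the names `loop_test`, `call_test`, `multiply_matrix` and `mat4_test` are made up for this example, and the flat, row-major 16-element list used as a matrix is an assumption.

```python
def add(a, b):
    """Trivial two-argument function used by the CALL test."""
    return a + b


def loop_test(n):
    """LOOP: count from 0 to n with a plain while loop."""
    i = 0
    while i < n:
        i += 1
    return i


def call_test(n):
    """CALL: the same loop, but the increment goes through a function call."""
    i = 0
    while i < n:
        i = add(i, 1)
    return i


def multiply_matrix(a, b):
    """Multiply two 4x4 matrices stored as flat, row-major lists of 16 numbers."""
    c = [0.0] * 16
    for row in range(4):
        for col in range(4):
            total = 0.0
            for k in range(4):
                total += a[row * 4 + k] * b[k * 4 + col]
            c[row * 4 + col] = total
    return c


def mat4_test(n, a, b):
    """MAT4: call multiply_matrix n times and return the last product."""
    c = None
    i = 0
    while i < n:
        c = multiply_matrix(a, b)
        i += 1
    return c
```

Each function returns its result so that the work cannot be skipped silently and so that a quick correctness check is possible before timing anything.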

I also tried to optimize the code for each language and used the best-performing version; in Ruby, for example, it is about 20% faster to use `n.times` than `while`, so I used it. I am quite sure I did not always find the fastest way of doing things, but I used what I was able to come up with in a reasonable amount of time. So here are the results. The numbers indicate the number of loop cycles or function calls performed per second on a Dell M70 with a 2.13 GHz Pentium M processor. The OS is Windows XP SP2; the language versions are Ruby 1.8.1-12, Python 2.4.1, Lua 5.0.2, SpiderMonkey (1998 vintage), Java SDK 1.4.2.08 and C++ with VS.NET 2003. Spike-A and Spike-B are simple experimental virtual machines based on expression trees, written in C++ by me in an attempt to find the performance limits of an interpreted dynamic language. I ran every benchmark several times for a few seconds each, took the best result, and rounded it to two significant digits.

| Test | Ruby | JavaScript | Python | Lua | Spike-A | Spike-B | Java -Xint | Java JITC | C++ DBG | C++ OPT | x86-Asm |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LOOP | 2.7 Mio. | 3.1 Mio. | 4.5 Mio. | 8.7 Mio. | 30 Mio. | 112 Mio. | 87 Mio. | 260 Mio. | 260 Mio. | 1000 Mio. | 2000 Mio. |
| CALL | 1.4 Mio. | 1.8 Mio. | 2.1 Mio. | 4.3 Mio. | 6.5 Mio. | 19 Mio. | 22 Mio. | 520 Mio. | 10 Mio. | 340 Mio. | n.n. |
| MAT4 | 12000 | 20000 | 52000 | 23000 | n.a. | 2.5 Mio. | 160000 | 2.3 Mio. | 4.2 Mio. | 14 Mio. | n.n. |
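
The measurement procedure described above (several runs, best result, rounded to two significant digits) could be sketched as follows. This is a minimal illustration in Python 3, not the harness that produced the numbers in the table; `round_sig`, `best_rate` and the stand-in workload `count_up` are hypothetical names, and `time.perf_counter` is simply a convenient modern timer.

```python
import math
import time


def round_sig(x, digits=2):
    """Round x to the given number of significant digits."""
    if x == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(x)))
    factor = 10 ** (exponent - digits + 1)
    return round(x / factor) * factor


def best_rate(work, iterations, runs=5):
    """Run work(iterations) several times; return the best iterations per second."""
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        work(iterations)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, iterations / elapsed)
    return round_sig(best)


def count_up(n):
    """Stand-in workload: a plain counting loop (hypothetical example)."""
    i = 0
    while i < n:
        i += 1
    return i


if __name__ == "__main__":
    print("LOOP:", best_rate(count_up, 1_000_000), "iterations/s")
```

Taking the best run rather than the average is a deliberate choice here: the goal is an upper limit, and slower runs mostly reflect interference from the rest of the system.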

Here are some remarks about the languages and the results:

• Ruby: I was told it is slow, and indeed it came out slowest in every benchmark, but it is not an order of magnitude slower than the other popular scripting languages. Writing the benchmark was quite straightforward; everything worked as expected.
• JavaScript: I am familiar with it, having used it for a while now. The SpiderMonkey engine is a very old, mature interpreter and very reliable, but not very beautiful. Its performance is not exceptional and lies in between Ruby and Python.
• Python: Strangely, Python had the most small syntactic pitfalls for me as a non-Python user, but this is probably a matter of habituation. On the other hand, the documentation for Python is not very good; it took some time to find out which functions to use for timing. Interestingly, the performance stands out when it comes to array access and numerical expression evaluation, even beating Lua, but it is still two orders of magnitude away from C++.
• Lua: I was told it should be fast, and indeed it is faster than Python in loops, but not in the matrix test. I also did not find a timer in Lua's standard library with better than one-second resolution, so I had to run long benchmarks (>20 sec.) here to get more precision.
• Spike-A: When I was unsatisfied with SpiderMonkey's performance last year, I took a shot at building an interpreter in C++ for a dynamic, JavaScript-like language. To make a long story short, I used all the tricks in the book (and a few new ones), and after a lot of benchmarking, optimizing and selecting the fastest techniques, I reached a point with not much headroom left for further optimization. I found that interpreting P-code is slower than executing an expression tree, and I think that this is inevitable with current processor architectures. At the heart of every P-code interpreter is a huge “switch” block that processes a token stream, dispatching depending on the token values. An expression tree traverser just needs to perform function calls through pointers that were created when the tree was built, so no lookup is required. The Spike-A benchmarks were executed on an expression tree that was generated at runtime by a C++ program, and as you can see, it is 3-10 times faster than anything else out there. I am quite sure I could run Ruby, Python or JavaScript on this expression tree with the same performance as in the Spike-A benchmark, but I am still not sure it is worth the effort if you compare the performance to Java JIT compilation.
• Spike-B: This is the same expression tree engine, but with a less dynamic approach; it benefits from a closer coupling to C++ and a large number of library functions and “precompiled” idioms. It would have to be a specially tailored language with a more static type system and a fat syntax, but the performance figures represent what I think is close to the theoretical limit of an interpreter written in C++ without using just-in-time compilation or assembler-level optimization. To make it clearer: the MAT4 test here is so fast because matrix and vector types are native types of the interpreter. It is possible to make bindings to all kinds of native types for all the other scripting languages, but, for example, a Python wrapper for a C++ 4×4 matrix multiplier also yields only a few hundred thousand multiplies per second; with SWIG, it will possibly be even slower than the native multiply.
• Java -Xint: The Java P-code interpreter runs 5-10 times faster than Lua; it is actually surprisingly fast. I think a lot of engineering effort went into it, and it probably also marks the limit of what can be achieved with an interpreter. It is funny that the numbers are close to my Spike-B, which seems to confirm that you cannot get much faster with interpreting.
• Java: The Java just-in-time compiler does a decent job and generally runs as fast as unoptimized C++, sometimes even faster. However, you still pay a price for the isolation layer.
• C++ DBG: Unoptimized, debug-compiled C++ can be slower than you think, but the bad CALL test results seem to be an anomaly, probably caused by some excessive checks generated by the VC compiler; I do not think they will be present with gcc, but I need to check that.
• C++ OPT: The C++ optimizer required me to modify the benchmark because 1) the optimizer throws away code if the result of a computation is not used and 2) the optimizer is able to replace a number of additions in a loop with one multiplication. So I had to add a j^=i (xor) statement both to the loop body in LOOP and to the function body in CALL; otherwise the execution time was independent of the loop count n. After making sure the loops were not optimized away, C++ turns out to be 4-6 times faster than the Java JITC, except for CALL, where the Java compiler is faster.
• x86-Asm: I learned some interesting lessons when I tried to find out how fast the processor can be, so I took a shot at seeing how fast I can loop in x86 assembly language. The interesting thing was that the naive approach with a loop of just three instructions (dec, test, jne) is not as fast as optimized C++; to get a loop that runs two billion times per second I had to unroll the loop, keep the pipeline filled and use the proper type of jump to play nicely with the branch prediction. The interesting thing here is that I can really execute an average of almost three instructions per clock cycle in a simple loop, which actually allows a loop that runs at clock speed.
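
To make the structural difference described under Spike-A more concrete, here is a minimal sketch of the two interpreter styles. It is written in Python purely for illustration and is not Spike code (Spike is written in C++ and its sources are not shown here); the opcodes and the helper names `run_pcode`, `const`, `add` and `mul` are invented for this example. The sketch only shows the shape of the two designs and says nothing about their relative speed.

```python
# --- P-code style: a token stream dispatched in one central loop ---
PUSH, ADD, MUL = 0, 1, 2


def run_pcode(code):
    """Evaluate a list of (opcode, operand) pairs on a small stack machine."""
    stack = []
    for op, arg in code:
        # This if/elif chain plays the role of the big "switch" block.
        if op == PUSH:
            stack.append(arg)
        elif op == ADD:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == MUL:
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        else:
            raise ValueError("unknown opcode: %r" % (op,))
    return stack.pop()


# --- Expression tree: each node is a callable fixed when the tree is built ---
def const(value):
    return lambda: value


def add(left, right):
    return lambda: left() + right()


def mul(left, right):
    return lambda: left() * right()


# The expression (2 + 3) * 4 in both representations
pcode = [(PUSH, 2), (PUSH, 3), (ADD, None), (PUSH, 4), (MUL, None)]
tree = mul(add(const(2), const(3)), const(4))
```

Calling `run_pcode(pcode)` inspects every token at run time, whereas calling `tree()` only follows references that were wired up when the tree was constructed.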

OK, that is what I have gathered so far. It is not exactly scientific, but it is quite interesting; I would not have guessed this outcome, and I would still like to fill some gaps in the table. I would also like to try C#, but all this has already taken me several days, and at some point I have to come to an end; it is already unusually long for just a blog entry. If someone is interested, I will also make a tar archive of all the benchmark programs available. I learned a great number of details, and all this testing has fueled my interest in rolling my own scripting language, but now I will go on holiday first.