Google benchmark

Our new results

These few pages describe a new step (our first results are still available here) in our participation in the programming exercise proposed by R. Hundt in his paper Loop Recognition in C++/Java/Go/Scala, where he presented the implementation of a well-specified benchmark in four different languages: Go, Java, Scala and C++…

Introduction

When, among R. Hundt's conclusions, we read this one:

We find that in regards to performance, C++ wins out by a large margin. However, it also required the most extensive tuning efforts, many of which were done at a level of sophistication that would not be available to the average programmer.

we decided to enter Cawen for this friendly derby…

If you are interested in reproducing our results or having a look at the C++ sources, the Cawen sources and the C99/C++ sources they generated, please download our archive.

The only prerequisite: the Judy arrays library.

Versions overview

The four C++ versions
Robert Hundt’s original version (rhundt_1)

You can download it using :

svn checkout http://multi-language-bench.googlecode.com/svn/trunk/ multi-language-bench-read-only
Robert Hundt’s optimized version (rhundt_2)

R. Hundt did not open-source the C++ "pro" version.

But he gave some hints to significantly improve the V0 version. From this list of changes we picked the easiest to implement (see the C++ sketch after the list), namely:

  • Additional gain for getting rid of BasicBlockMap and storing dfs_numbers on the BasicBlocks themselves.
  • Converted BasicBlockSet from set<> to vector<>.
  • Changed node_pool from a list<> to a vector<>. Moved definition out of the loop to recycle it.
  • Changed LoopSet from set<> to vector<>.
  • Changed LoopList from list<> to vector<>.
  • Changed EdgeList from list to vector.
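
As an illustration of the flavour of these changes, here is a C++ sketch (our reading of the hints above, not R. Hundt's actual code):

#include <vector>

// 1. Store the DFS number on the node itself instead of looking it
//    up through a separate BasicBlockMap.
struct BasicBlock {
    int dfs_number;                  // was: an entry in BasicBlockMap
    std::vector<BasicBlock*> preds;  // hypothetical edge lists, only here
    std::vector<BasicBlock*> succs;  // to make the type self-contained
};

// 2. Replace node-based containers with contiguous ones: fewer
//    allocations, better locality, faster traversal.
typedef std::vector<BasicBlock*> BasicBlockSet;  // was: std::set<BasicBlock*>
typedef std::vector<BasicBlock*> NodeList;       // was: std::list<BasicBlock*>

Note that switching from set<> to vector<> gives up automatic deduplication and ordering, so callers must tolerate duplicates or check membership themselves.
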
Russel Cox’s version (rcox)

You will find it here

Anthony C. Hay’s version (anthay)

is available in this article: Loop counting: Google's multi-language-bench

The 8 Cawen versions

For each C++ version we coded two Cawen versions:

the first is a literal translation (line by line, container by container) of the corresponding C++ version; the second is an optimization, for speed only, of the first.

The input

Robert Hundt chose to test his code on a particular graph and to run his search function 50 times.

Russel Cox ran the same experiment (with some differences noticed by A. Hay; we changed Cox's code to get closer to R. Hundt's version).

Anthony Hay tested his code on R. Hundt's graph, but he also wanted to know how the different versions would behave on randomly generated input. He built a sequence of random graphs with at most 10000 nodes and another set of graphs with at most 50000 nodes.

We tested the 4 C++ versions and the 8 Cawen versions with these three different kinds of input, which we named bench, random_10000 and random_50000.

Platforms

We ran the benchmarks on five platforms, which we named:

  • freebsd 1: gcc 4.2.2 / FreeBSD 8.3 / Intel Celeron / 32-bit / 2.66 GHz / 1 GB RAM
  • freebsd 2: gcc 4.2.1 / FreeBSD 8.2 / Intel Core i5-2300 / 64-bit / 2.80 GHz / 16 GB RAM
  • cygwin: gcc 4.5.3 / Cygwin 6.0 / Intel Pentium Dual-Core T4200 / 64-bit / 2 GHz / 4 GB RAM
  • mac osx: gcc 4.0.1 / Mac OS X 10.5.8 / Intel Core 2 Duo / 32-bit / 2.4 GHz / 2 GB RAM
  • cygwin2: gcc 3.4.4 / Cygwin DLL 1.7.16-1 / Intel Core 2 Duo P8600 / 32-bit / 2.40 GHz / 3.45 GB RAM (results collected by François Bluteau, thanks Fanch!)

The criteria and their measurements

Run-time

We calculated the average of (sys + user) from the time command over seven runs.

Memory footprint

To be consistent with R. Hundt's paper, we ran the top command once every second throughout the run and calculated the average.

This method is too imprecise, particularly when execution times are very short. Used twice on the same process, it can give quite different results.

We tried to solve this problem:

  • by using Valgrind's massif tool (with --pages-as-heap=yes, see http://valgrind.org/docs/manual/ms-manual.html) where Valgrind was available (freebsd 1 & freebsd 2);
  • on the other platforms, by running each binary 7 times and taking a top measurement every second. For the "bench" test, where memory naturally reaches a plateau, the average was calculated over the coherent measurements only. We took the RES figures from the Cygwin top command and RSIZE from the Mac OS X top command.

This is surely the least reliable measurement we report here. Differences of a few percent are not significant.

Code Size

We were surprised not to find, for each language under evaluation, a tool that counts only "useful" lines; that is to say, a tool that knows enough about a given language to tell the difference between a comment line, a blank line and a statement.

We may not have searched hard enough.

We used:

wc -l

To stay fair with C++, we copied comments and even indentation mistakes…
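
For illustration, a naive counter of "useful" lines for C-family sources could look like this in C++ (a sketch of the kind of tool we mean, not one we actually used; block comments are deliberately left unhandled):

#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: uloc FILE\n"; return 1; }
    std::ifstream in(argv[1]);
    std::string line;
    long count = 0;
    while (std::getline(in, line)) {
        std::string::size_type i = line.find_first_not_of(" \t\r");
        if (i == std::string::npos) continue;          // blank line
        if (line.compare(i, 2, "//") == 0) continue;   // comment-only line
        ++count;                                       // statement (or mixed) line
    }
    std::cout << count << "\n";
    return 0;
}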

Compilation time

Our preprocessor is installed only on platform freebsd 1. On the other platforms we compiled the generated C99 code with gcc.

Results

All results are available in a table.

Code sizes and compilation times
Code sizes [number of lines] – Bench

Code sizes [number of lines] – Random

Compilation times [seconds] – Bench

Compilation times [seconds] – Random

Cawen seems a little less verbose than C++ (we will come back to this point later), but Cawen compilation times are huge, even on a rather powerful server.

"We are working hard on it" (-30% last month! ;-)

Binary sizes
Binary size [bytes] – FBSD 1 – Bench

Binary size [bytes] – FBSD 1 – Random

Binary size [bytes] – FBSD 2 – Bench

Binary size [bytes] – FBSD 2 – Random

Binary size [bytes] – OS X – Bench

Binary size [bytes] – OS X – Random

Binary size [bytes] – Cygwin – Bench

Binary size [bytes] – Cygwin – Random

C++ binaries are smaller than Cawen binaries for Rhundt 1 & 2. It is the opposite for the anthay and rcox versions.

The C99 source files we produced are accepted by g++, but all the results we present here were obtained with binaries compiled with gcc…

One thing we can note for the moment is that, on freebsd 2, Cawen binaries obtained with a g++ compilation are smaller than those obtained with a gcc compilation (we'll investigate), and are even, most of the time, smaller than the original C++ binaries.

We will see later what can be the impact of a g++ compilation on runtime performance.

Performance analysis

The first tests we ran on freebsd 2 confirmed what A. C. Hay had noticed before: the finishing order varies with the input type…

Memory [Mbytes] / execution time [sec.] – Bench

Memory [Mbytes] / execution time [sec.] – Random 10000

Memory [Mbytes] / execution time [sec.] – Random 50000

rhundt_1, rhundt_2, rcox, anthay… We let you designate the winning version…

Since we are trying here to evaluate Cawen, the following diagrams may be more relevant. We gathered data from all tests on every machine and calculated the differences between the original Cawen version and C++.

The original Cawen version

Execution time variation x [%] / Memory variation y [%], cwn compared to cpp

In 73% of the tests (44 of the 60 tests: 5 machines * 4 algorithms * 3 input types), the original Cawen version is faster and consumes the same or less memory than C++.

In 96% of the tests (58 of 60), Cawen consumes the same or less memory than C++.

Results for Robert Hundt’s fixed input

In 100% of the original benchmark tests, the Cawen version is faster and consumes the same or less memory than C++. Observed time reductions range from 2% to 77%, and memory reductions from 3% to 32%. That is to say, R. Hundt's race has a new winner! Champagne for everyone!

Results for random input

The speed and memory gains are particularly impressive on R. Hundt's versions 1 & 2 with random input. For rhundt_2/random_50000, Cawen uses 89.73% less time and 93.02% less memory than C++.

We certainly have to thank Doug Baskins and his Judy arrays for these great results.

By contrast, Cawen gets into trouble with random input on the R. Cox and A. C. Hay versions.

Performance tuning
Unrolling loops

With random inputs on Cox’s and Hay’s versions, Cawen is beaten. Gprof shows clearly that Cawen‘s “find value in an array” function is to blame.

                0.05   15.57   23216774/23216774     FindLoops_LoopFinder [4]
[5]    78.3     0.05   15.57   23216774              govel_find_all1 [5]
               15.57    0.00   23216774/23216774         govel_find_in_container_contiguous1

78.3% of the execution time is spent in this function, which is nothing but a loop over all the elements of an array. One way to accelerate this kind of loop is to unroll it.

Changing:

if (!@contains(yprime,@pAt(w,&non_back_preds)))

to:

if (!@contains{unrollCount=16}(yprime,@pAt(w,&non_back_preds)))

we get a 44% time reduction and become faster than C++.
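
For readers unfamiliar with the technique, here is a minimal C++ sketch of a linear search unrolled by a factor of 4 (unrollCount=16 above plays the same role on the Cawen side; names and signature are ours):

#include <cstddef>

// Returns the index of `value` in `data`, or `n` if it is absent.
std::size_t find_unrolled(const int* data, std::size_t n, int value) {
    std::size_t i = 0;
    // Main unrolled loop: four comparisons per iteration cut down
    // loop-control overhead and give the compiler room to schedule.
    for (; i + 4 <= n; i += 4) {
        if (data[i]     == value) return i;
        if (data[i + 1] == value) return i + 1;
        if (data[i + 2] == value) return i + 2;
        if (data[i + 3] == value) return i + 3;
    }
    for (; i < n; ++i)            // tail loop: the remaining 0-3 elements
        if (data[i] == value) return i;
    return n;
}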

One thing is for sure, Cawen vectors need a little reengineering…

Allocators

Most of the C++ versions repeatedly construct and destroy the same objects. In this kind of situation, allocators can be of great help.

We used two kinds of allocators.

For R. Hundt 1 & 2, we redirected all memory allocations of 64 bytes or less to the same allocator. The Cawen code for that is:

#typeDef(aggPool_t, aggHmgPool, M4, maxSize = 64, chunkCount=5000);
aggPool_t usedAlloc;
#paramDef(
   memory::reserve, 
   memory::release,
   memory::resize,
   govel::del,
   govel::new, 
   govel::newDecl  
         $pAllocator &usedAlloc);
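
For readers more at home in C++, here is a rough analogue of such a small-object pool, with the same 64-byte cap and 5000-block chunks (our sketch, not Cawen's aggHmgPool implementation): memory is carved out of large chunks and freed blocks are recycled through an intrusive free list.

#include <cstddef>
#include <vector>

class SmallObjectPool {
public:
    SmallObjectPool() : free_list_(0) {}
    ~SmallObjectPool() {
        for (std::size_t i = 0; i < chunks_.size(); ++i)
            delete[] chunks_[i];
    }

    void* allocate(std::size_t size) {
        if (size > kMaxSize)              // too big for the pool: plain heap
            return ::operator new(size);
        if (free_list_ == 0)
            refill();
        void* block = free_list_;         // pop the head of the free list
        free_list_ = *static_cast<void**>(block);
        return block;
    }

    void deallocate(void* p, std::size_t size) {
        if (size > kMaxSize) { ::operator delete(p); return; }
        *static_cast<void**>(p) = free_list_;   // push back on the free list
        free_list_ = p;
    }

private:
    static const std::size_t kMaxSize = 64;      // same cap as the Cawen code
    static const std::size_t kChunkCount = 5000; // blocks per chunk, idem

    void refill() {
        char* chunk = new char[kMaxSize * kChunkCount];
        chunks_.push_back(chunk);
        // Thread every block of the fresh chunk onto the free list.
        for (std::size_t i = 0; i < kChunkCount; ++i) {
            void* block = chunk + i * kMaxSize;
            *static_cast<void**>(block) = free_list_;
            free_list_ = block;
        }
    }

    std::vector<char*> chunks_;   // owned chunks, released in the destructor
    void* free_list_;             // head of the intrusive free list
};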

For the Hay and Cox versions, we used type-specific allocators to handle our "new" instructions.

#typeDef(hmgPool1_t, hmgPool, Flag, exactSize, M4, chunkByteCount = sizeof(Loop), chunkCount = 10000); 
hmgPool1_t loopAlloc; 
// (...) 
@newDecl(Loop,l,&loopAlloc);

Changing vector density

Govel (our tiny STL) allows changing the way the memory allocated for a vector grows. The density template parameter indicates the minimum density (number of used elements / number of allocated elements) that must hold after a memory extension.

#typeDef(Edge_Vector,heapArray,M4,valType=Edge,density=30);
// (...)
#typeDef(LoopBlock_Vector,heapArray,M4,valType=LoopBlock,density=100);

Default density is 50%.
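
To make the arithmetic concrete, here is a minimal C++ sketch of such a growth policy (our reading of the parameter, not govel's actual code):

#include <cstddef>

// Smallest capacity such that needed / capacity >= density_percent / 100,
// i.e. the vector is at least density_percent full after the extension.
std::size_t next_capacity(std::size_t needed, unsigned density_percent) {
    return (needed * 100 + density_percent - 1) / density_percent;
}

// density = 30:  300 used elements reserve a capacity of 1000
//                (a lot of slack, few reallocations)
// density = 100: capacity tracks size exactly
//                (no slack, a reallocation on every growth)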

The optimized Cawen version

This graph shows the difference between the original Cawen version and the optimized Cawen version, that is to say, the results of our tuning efforts.

Execution time variation x [%] / Memory variation y [%], cwn optimized compared to cwn

The only difference between anthay and anthay_opt is the unrolling of the search loop and, as we can check on the plot, there is no impact on memory consumption. The use of allocators increased the memory footprint on rhundt 1 & 2, but left it unchanged on rcox.

The comparison with the cpp versions becomes:

Execution time variation x [%] / Memory variation y [%], cwn optimized compared to cpp

In 91% of the tests the optimized Cawen versions are faster and use the same or less memory than the corresponding C++ versions.

In 100% of the tests the optimized versions are faster than the corresponding C++ versions.

Speed gains range from 4% to 98.7%.

What the benchmark tells us about coding in Cawen

Code sizes

According to our different implementations of the benchmark, Cawen seems to be a little less verbose than C++ (by 5 to 9%). This is mainly because, in this exercise:

  • we used "foreach" (available in C++ only since C++11; see the C++ sketch after this list)

@foreach(elements,v,back_preds_w) {
    if (v:=>val != w) @append(find_root_union_find(v:=>val,&preorder_set),&p);
    else @at(w,&type) = self;
};

  • the writing functions of our containers accept any type of data as the "from" argument

// Copy node_pool to worklist.
@here(HavlakLoopFinder_NodeList,worklist);
@append(&node_pool,&worklist);

  • #include directives are only necessary in the main source file.
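
For comparison, here is a rough C++11 analogue of the Cawen foreach in the first point above, with plain ints standing in for the benchmark's node and type values (placeholder declarations, not the real ones):

#include <vector>

void process(const std::vector<int>& back_preds_w, int w,
             std::vector<int>& p, std::vector<int>& type) {
    for (int v : back_preds_w) {     // C++11 range-based for
        if (v != w)
            p.push_back(v);          // stands in for the union-find append
        else
            type[w] = 0;             // stands in for "self"
    }
}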

To give a better idea of Cawen's expressiveness, we can add that govel, our baby STL, is 10081 Cawen lines long; it includes a stack array, a heap array, a doubly-linked list, wrappers for Judy1 and JudyL, their associated iterators, different foreach constructs (runtime & preprocessing) and a copying/cleaning policy.

Automatic cleanup

In Cawen one can express things like "each time you enter a function body, do this" ("this" being any precompilation processing), "after/before each typedef declaration, do this", or "each time a variable leaves a scope, do this".

We used this last possibility to code automatic cleanup at scope ends. In our source files, it's activated by:

#defCodePointOn goodbyeVar;

This freeing policy, which is 330 Cawen lines long, belongs to govel (our container collection) and is not part of the language itself.
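
For C++ readers, the effect is comparable to RAII, where a destructor runs automatically when a variable leaves its scope. A minimal illustration (ours, not code generated from Cawen):

#include <cstddef>
#include <cstdio>
#include <cstdlib>

struct Buffer {
    char* data;
    explicit Buffer(std::size_t n) : data(static_cast<char*>(std::malloc(n))) {}
    ~Buffer() { std::free(data); }   // runs automatically at scope exit
};

int main() {
    {
        Buffer b(1024);
        // ... use b.data ...
    }                                // b.data is freed here, no explicit call
    std::printf("buffer released at end of scope\n");
    return 0;
}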

Conclusion

The debates about languages, their respective advantages and disadvantages, very often turn into religious wars.

We would like to thank R. Hundt for reminding us that the main purpose of a language is to express possibly very complex processing simply and clearly, and then to execute it efficiently.

There must therefore be objective criteria to estimate a language's usefulness in solving a given problem.

Following this very general principle, we hope we have helped to show that there is still a lot of room for new solutions.

We are very conscious that Cawen remains very perfectible and that we have a lot of work ahead of us… Although we don't know how far little Cawen will go, will you agree that its first steps seem promising?

Comments

  1. Hello,
    I am taking the liberty of replying to you following a message left on our website.
    I refer you to the Green Code Lab association to discuss further the eco-design topics we are working on.
    Our purpose is not to debate the language used, even if we have our preferences (and even if little Cawen does indeed seem very fast!). Second point: we are interested in the whole software life cycle, not only in the usage phase. During that phase, consumption is one of the aspects we study, through static code analysis and through measuring the energy delta between two versions of an application integrating fixes from the eco-design reference framework we have developed. We are interested in performance because it is "often" correlated with lower resource consumption in the end (but not always!).
    I think your work and ours go in the same direction. I invite you to join us at the national eco-design gathering in Nantes on October 18th at the École des Mines de Nantes. There should be a few seats left!
    An exchange with the GCL could also be worthwhile in the coming months (perhaps working on Cawen's energy consumption?).
    See you soon,
    Thierry

    • Gwenaël Chailleu:

      Hello Thierry!

      We certainly have a lot to learn about the various aspects of eco-design.

      "The whole software life cycle", "static code analysis", and the "partial" correlation between performance and final resource consumption are points that intrigue us and therefore interest us particularly.

      In the short term, we would obviously be very keen to measure the consumption of the different versions whose sources we have put online…

      Don't hesitate to contact us at our individual email addresses, and see you soon at Green Code Lab!

      Thomas & Gwen
