Self Benchmarking

I was asked on twitter about the current speed of the Self implementation. The request was for a method send benchmark so I wrote a simple one and compared against Pharo, a Smalltalk implementation.

The implementation and technology behind Self is quite old in comparison to modern compiler implementations but at the time it was state of the art. I hoped it would hold up reasonably well. The test I wrote in Self was:

(|
  doSomething = ( ^self ).
  test = ( |n <- 0|
           [ n < 100000000 ] whileTrue: [ doSomething. n: n + 1 ]
         )
|)

Running this in the Self shell shows:

"Self 1" _AddSlots: ...code snippet from above...
shell
"Self 2" [ test ] time.
2587

2.5 seconds seems a bit slow to me but I tested in Pharo to confirm and to see how it compares. The Pharo code looks almost exactly like the Self code:

doSomething = ^self
test = |count|
       count := 0.
       [ count < 100000000 ] whileTrue: [
         count := count + 1.
         doSomething.
       ].

[ MyObject new test ] timeToRun
  => 0:00:00:00.239

That's 239ms vs 2,587ms, a factor of over 10x. Further investigation revealed that calling 'time' in Self seems to cause the code to run slower. If I call the 'test' method first, and then call 'time' then it's much faster:

"Self 2" [ test ] time.
2587
"Self 3" [ test ] time.
2579
"Self 4" test.
nil
"Self 5" [ test ] time.
650
"Self 6" [ test ] time.
628

At 650ms it is about 2.7x slower than Pharo, an improvement over 10x. More investigation is needed to see if there is room for other improvements.

The Self implementation has some primitives that can be changed to show debugging information from the JIT. All primitives can be listed with:

primitives primitiveList do: [ | :e | e printLine ].

Looking through this shows some interesting ones prefixed with _Print that can be set to output debug data. One is _PrintCompiledCode. Setting this to true allows viewing the generated assembler code on the Self console.

"Self 16" _PrintCompiledCode: true.
false
"Self 17" 40 + 2.
...
  // loadOop2
movl $0xa0 (40), #-16(%ebp)
  // loadOop2
movl $0x8 (2), #-20(%ebp)
  // loadArg
movl #-20(%ebp), %ebx
movl %ebx, #4(%esp)
  // selfCall
movl #-16(%ebp), %ebx
movl %ebx, (%esp)
nop
nop
nop
call 0x8186597 <SendMessage_stub> (bp)
  // begin SendDesc
jmp L7f
  .data 3
jmp L9f
  .data 0
  .data 0
  .data 0x4578341 ('+')
  .data 4
L7: 
L8: 
  // end SendDesc
movl %eax, #-16(%ebp)
  // epilogue
movl #-16(%ebp), %eax
  // restore_frame_and_return
leave
ret

Others, like _PrintInlining display debug information related to inlining code.

"Self 18" _PrintInlining: true
fales
"Self 19" test.
*inlining size, cost 0/size 0 (0x8b7e864)
*PIC-type-predicting - (1 maps)
*type-casing -
*inlining - (smallInt.self:153), cost 1/size 0 (0x8b7ee38)*
 *inlining asSmallInteger (number.self:108), cost 1/size 0 (0x8b7fa94)*
  *inlining raiseError, cost 0/size 0 (0x8b80530)*
  *inlining asSmallIntegerIfFail: (smallInt.self:302), cost 0/size 0 (0x8b808fc)*
 *inlining TSubCC:
 *cannot inline value:With:, cost = 10 (rejected)
 *marking value:With: send ReceiverStatic
 *sending value:With:
*sending -
*inlining size:, cost 0/size 0 (0x8b8434c)
*inlining rep, cost 0/size 0 (0x8b846a8)
*PIC-type-predicting removeFirstLink (1 maps)
*type-casing removeFirstLink
*inlining removeFirstLink (list.self:300), cost 2/size 0 (0x8b84b48)*
 *inlining next, cost 0/size 0 (0x8b85628)
 *PIC-type-predicting remove (1 maps)
 *type-casing remove
 *cannot inline remove, cost = 9 (rejected)
 *sending remove
*sending removeFirstLink
*PIC-type-predicting value (1 maps)
*type-casing value
*inlining value, cost 0/size 0 (0x8b86570)*
*sending value
*inlining asSmallInteger (number.self:108), cost 1/size 0 (0x8b7e5b0)*
 *inlining raiseError, cost 0/size 0 (0x8b7f074)*
 *inlining asSmallIntegerIfFail: (smallInt.self:302), cost 0/size 0 (0x8b7f440)*
*inlining TSubCC:
*cannot inline value:With:, cost = 10 (rejected)
*marking value:With: send ReceiverStatic
*sending value:With:
nil

For more involved benchmarks there is some code shipped with the Self source. It can be loaded with:

"Self 28" bootstrap read: 'allTests' From: 'tests'.
reading ./tests/allTests.self...
reading ./tests/tests.self...
reading ./tests/programmingTests.self...
reading ./tests/debugTests.self...
reading ./tests/lowLevelTests.self...
reading ./tests/numberTests.self...
reading ./tests/deltablue.self...
reading ./tests/sicTests.self...
reading ./tests/branchTests.self...
reading ./tests/nicTests.self...
reading ./tests/testSuite.self...
reading ./tests/languageTests.self...
reading ./tests/cons.self...
reading ./tests/benchmarks.self...
reading ./tests/richards.self...
reading ./tests/parser.self...
reading ./tests/parseNodes.self...
modules allTests

There are methods on the bootstrap object for running the tests and printing results. For example:

"Self 32" benchmarks measurePerformance

                 compile    mean       C    mean/C       %
recur:                 5       0
sumTo:                 2       7
sumFromTo:             2       7
fastSumTo:             2       6
nestedLoop:            2      10
...

There is also measurePerformance2 and measurePerformance3 methods. The code comments for the measure2 and measure3 methods explain the differences.

Self 2 was well known for generating very fast code that compared favourably with C. This implementation of this was described in Craig Chamber's thesis. Compilation was slow however so in Self 3 and 4 two new compilers were created. These were 'nic' and 'sic'. I believe this is covered in Urs Holzle's thesis The 'nic' compiler is the 'Non Inlining Compiler' and is simpler to implement. It's the compiler you write to get Self bootstrapped and running on new platforms fairly quickly. There is no inlining and no type feedback so performance is slower as shown by the benchmarking when changing the compiler used, as described below. The 'sic', or 'Single Inlining Compiler', generates better code through more optimisations. While neither is as fast as the Self 2 compiler it is faster to compile code and makes for a better interactive system. You can read more about this in the Merlintec Self FAQ.

There is a defaultCompiler slot in the benchmark object that can be set to nic or sic to compare the different JIT compilers that Self implements. Comparing the 'nic' compiler vs the 'sic' compiler shows a speedup of about 6x in the 'richards' benchmark when using 'sic'.

There's probably a fair bit of low hanging fruit to improve run times. I don't think the x86 backend has had as much work on it as the Sparc or PPC backends. The downside is much of the compiler code is written in C++ so for people interested in 'Self the language' it's not as fun to hack on. Klein was an attempt to write a Self VM in Self and includes a compiler and assembler which might make a more interesting project for those that want to use Self itself to implement compiler code.