An image based programming language like Factor stores executable code in an image file. When Factor starts up it loads the image and starts executing the code stored in it.
I wanted to explore some ideas with images and dynamic code generation so wrote a simple example on how to generate assembly, store it in an image and later load and run it. The same principle is used to generate machine code at runtime and execute it.
As a start I wanted to generate the equivalent of a simple function in the x86 architecture that returns an integer value. The equivalent of:
int func(void) { return 42; }
To see the assembly code this generates I used gcc to compile it and saved the temporary files:
gcc -c -save-temps test.c
The relevant part of the output was:
_func:
pushl %ebp
movl %esp, %ebp
movl $42, %eax
popl %ebp
ret
Using objdump I looked at the generated machine code with:
objdump -D test.o
Giving:
00000000 <_func>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: b8 2a 00 00 00 mov $0x2a,%eax
8: 5d pop %ebp
9: c3 ret
For the first simple test I just wrote some code that output the relevant bytes into a file called 'image.x86'. Now I needed a program to load and execute the code.
Allocating a memory area with 'malloc', loading the image into it, and then casting the memory pointer to a function pointer and calling it seems the obvious step but it won't work.
Modern CPUs distinguish between code and data. Code cannot be executed from within a memory area which is not marked as executable. We need to allocate a memory area specifically marked as being executable. The way to do this is different between operating systems. For this example I'm running on Windows and VirtualAlloc is the needed call:
void* allocate_code_buffer(int size) {
return VirtualAlloc(NULL, size, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
}
void free_code_buffer(void* addr) {
VirtualFree(addr, 0, MEM_RELEASE);
}
void flush_icache() {
FlushInstructionCache(GetCurrentProcess(), 0, 0);
}
Also needed is VirtualFree to free the memory area when we no longer need it. Modern CPUs also cache some memory areas. If the memory area where the generated code is stored is cached then our new code in that area won't be seen by the CPU unless we flush the cache. That's what 'FlushInstructionCache' does. This isn't actually needed on x86 machines but on ARM it is. Windows on x86 provides the function anyway and it's good practice to call it so the code is portable.
The following pseudo-ish code loads the image, and calls it:
void* load_image(FILE* file) {
void* image = 0;
image = allocate_code_buffer(1024);
fread(image, 1024, 1, file);
flush_icache();
return image;
}
typedef int (*entry_func)(void);
void run_image(void* image) {
entry_func func = (entry_func)image;
printf("Result is: %d\n", func());
}
FILE* file = fopen("image.x86", "rb");
void* image = load_image(file);
fclose(file);
run_image(image);
free_image(image);
Once the small C loader is written it can be used with any generated image files.
Real code is a bit more complicated than that simple function. It has branches and jumps. To handle this I made the image file format more structured. At the beginning is the size of the generated machine code, the machine code data itself and then some relocation records.
The runtime loaded in C loads the image data first, then walks through the relocation records patching up any references in the code to the address where the image was loaded. This needs to be done if there are any absolute jumps or references in the image to other parts of the image as the image could be loaded at a different address from where it was first generated.
The relocation records exist as a number of 'labels'. Each label contains the offset in the image file for that label, and a number of addresses that reference the label. Those references are updated to be the actual address of the label. Something like this:
void relocate_image(FILE* file, void* image) {
unsigned long num_labels = 0;
int i = 0;
fread(&num_labels, sizeof(num_labels), 1, file);
for(i=0;i<num_labels;++i) {
unsigned long offset = 0;
unsigned long num_references = 0;
int j = 0;
fread(&offset, sizeof(offset), 1, file);
fread(&num_references, sizeof(num_references), 1, file);
for(j=0;j<num_references;++j) {
unsigned long ref = 0;
unsigned char* bimage = (unsigned char*)image;
fread(&ref, sizeof(ref), 1, file);
*(bimage+ref)=bimage+offset;
}
}
}
I didn't fully implement jumps in this simple example but to prove the concept I do relocations for 'near relative' jumps. These could actually be computed before writing the image but I did it as part of the image relocation as I'd need to add that for absolute jumps later anyway.
I wrote some Javascript code to generate the machine code. It's not a full X86 assembler in Javascript, but just enough to generate the 3-4 instructions I use as a proof of concept. The Javascript to generate the machine code looks like this:
var image = new Image(LITTLEENDIAN);
var assembler = new X86Assembler(image);
assembler.move(42, EAX)
.move(5, EBX)
.jmpNear("end")
.adc(EBX,EAX)
.label("end")
.ret();
image.save("image.x86");
This snippet generates code to load EAX with 42, EBX with 5, and jumps to the label 'end' which results in returning from the function. If you remove the jump, or comment it out, it adds the two numbers together before returning. The resulting image file can be run with the C loaded and prints out the integer value returned by this snippet (which is the result stored in the EAX register).
This assembler is very simple. It basically stores the generated bytes for each instruction in an array in the Image object. When saved the Image object writes the generated bytes and the relocation data if there is any.
The javascript code is in image.js, x86assembler.js, armassembler.js and generate.js. It can be run with the Rhino Javascript interpreter. The full distribution is in image1.tar.gz and includes the Rhino Javascript interpreter JAR file. A darcs repository with the code if you want to play with it and send patches is at:
darcs get http://www.bluishcoder.co.nz/repos/image1
A simple test run is:
java -jar js.jar generate.js
gcc -o run_image run.c
run_image image.x86
I also did an ARM version which can be tested on a Windows CE machine. I tested it on the Microsoft Windows Mobile 5 Emulator with run.c compiled with 'cegcc'. In that case I ran the test.arm image which is simple code generated with:
var image = new Image(LITTLEENDIAN);
var assembler = new ARMAssembler(image);
assembler.moveRsRd(arm.SP, arm.IP)
.stmfd(arm.SP, [1,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0])
.sub(arm.IP, arm.FP, 4)
.moveImmediate(42, arm.R3)
.moveRsRd(arm.R3, arm.R0)
.ldmfd(arm.SP, [1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0]);
image.save("image.arm");
This code returns '42' in the R0 register. It's exactly the code generated with the test function described before on x86, but for ARM, and I worked it out using the same compile/objdump method but using cegcc for the compiler.
It's an interesting exercise to generate this sort of thing from scratch to get an idea of how systems like Factor do it. An interesting approach might be to use Factor to generate small images for use on memory constrained devices with a tiny C runtime loader.
Although I haven't tried it this might work on Symbian devices as well. Symbian devices are ARM based but their applications are really DLLs that are loaded by the main phone application. By writing a DLL C loader that reads an executes the image file, or different images for each exposed virtual function from the DLL, you could bootstrap a simple Symbian program.
The code I wrote here is very much 'hack away until it worked' code. The assemblers in particular are ugly but it should be just enough to give you an idea of how it works if you want to go off and do something similar.