The libtheora API returns each decoded frame as YCbCr data. Displaying that data in Firefox currently requires converting it to RGBA, and this YCbCr to RGBA conversion turns out to be a bottleneck.
In current Firefox builds we use the conversion routines provided by liboggplay. The Ogg backend is being modified to reduce third-party library usage (work started by me and now continued by Chris Pearce), so the liboggplay dependency will go away. I looked at some of the available colour space conversion routines to get an indication of their relative performance.
I tested the following colour space conversion routines:
- Using integer math and lookup tables.
- A basic C version using floating point math.
- liboggplay. BSD license.
- FrameWave (also known as the AMD Performance Library). Apache License 2.0.
- [Intel Integrated Performance Primitives](http://software.intel.com/en-us/intel-ipp/). Commercial license.
- Moonlight's colour space conversion routines. I tested the C and MMX implementation from Moonlight. LGPL2 license.
- The conversion routines from Chromium. BSD license.
- libswscale. LGPL license.
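To make the first two entries concrete, here is a minimal sketch of a floating point conversion and an integer, lookup-table conversion for a single pixel, plus a loop over a planar 4:2:0 frame. It is not the code I benchmarked. It assumes BT.601 video-range coefficients (Y in 16..235, Cb/Cr centred on 128) and the common 8.8 fixed point approximation of them; all function and table names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an intermediate result to the 0..255 range of one channel. */
static uint8_t clamp_u8(int v) {
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Floating point conversion of one pixel (assumed BT.601 video range). */
static void ycbcr_to_rgba_float(uint8_t y, uint8_t cb, uint8_t cr,
                                uint8_t out[4]) {
    double c = 1.164 * (y - 16);
    double d = cb - 128.0;
    double e = cr - 128.0;
    out[0] = clamp_u8((int)(c + 1.596 * e + 0.5));
    out[1] = clamp_u8((int)(c - 0.391 * d - 0.813 * e + 0.5));
    out[2] = clamp_u8((int)(c + 2.018 * d + 0.5));
    out[3] = 255; /* opaque alpha */
}

/* Lookup tables holding each product in 8.8 fixed point (hypothetical names). */
static int tab_y[256], tab_cr_r[256], tab_cb_g[256], tab_cr_g[256], tab_cb_b[256];

/* Fill the tables once, before any integer conversion is done. */
static void init_tables(void) {
    for (int i = 0; i < 256; i++) {
        tab_y[i]    = 298 * (i - 16) + 128; /* +128 rounds to nearest */
        tab_cr_r[i] = 409 * (i - 128);
        tab_cb_g[i] = -100 * (i - 128);
        tab_cr_g[i] = -208 * (i - 128);
        tab_cb_b[i] = 516 * (i - 128);
    }
}

/* Integer conversion of one pixel: table lookups, adds and a divide. */
static void ycbcr_to_rgba_int(uint8_t y, uint8_t cb, uint8_t cr,
                              uint8_t out[4]) {
    /* Divide instead of shifting: right-shifting a negative int is
       implementation-defined, and negative results clamp to 0 anyway. */
    out[0] = clamp_u8((tab_y[y] + tab_cr_r[cr]) / 256);
    out[1] = clamp_u8((tab_y[y] + tab_cb_g[cb] + tab_cr_g[cr]) / 256);
    out[2] = clamp_u8((tab_y[y] + tab_cb_b[cb]) / 256);
    out[3] = 255;
}

/* Convert a planar 4:2:0 frame: each Cb/Cr sample covers a 2x2 luma block.
   The RGBA output is assumed to be tightly packed (width * 4 bytes per row). */
static void convert_frame_420(const uint8_t *yp, int y_stride,
                              const uint8_t *cbp, const uint8_t *crp,
                              int c_stride, int width, int height,
                              uint8_t *rgba) {
    assert(yp && cbp && crp && rgba);
    for (int row = 0; row < height; row++) {
        const uint8_t *yrow = yp + (size_t)row * y_stride;
        const uint8_t *cbrow = cbp + (size_t)(row / 2) * c_stride;
        const uint8_t *crrow = crp + (size_t)(row / 2) * c_stride;
        uint8_t *dst = rgba + (size_t)row * width * 4;
        for (int col = 0; col < width; col++)
            ycbcr_to_rgba_int(yrow[col], cbrow[col / 2], crrow[col / 2],
                              dst + (size_t)col * 4);
    }
}
```

Call `init_tables()` once at startup, then `convert_frame_420()` for every decoded frame. The optimized libraries in the list typically go further by processing several pixels per instruction with SIMD, which a scalar sketch like this does not attempt.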
For testing I modified plogg to decode a Theora file and time only the colour space conversion portion of the process. A series of command line switches picks which implementation of the colour space conversion to use. I used a movie trailer Theora file I had handy (480x260) and these were the results:
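| Implementation | Total Time | Frame Time | Relative |
|---|---|---|---|
| Intel IPP (optimized) | | | |
| Intel IPP (default) | | | |
| Floating Point C | | | |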
'Total Time' is the time spent in the conversion code summed over every frame in the file (3,000 frames). 'Frame Time' is the average time per frame. 'Relative' is the time taken relative to the liboggplay implementation. Since liboggplay is what we currently use, this makes it easy to see what sort of improvement we could get by switching. The testing was done on a 1.83 GHz Core Duo laptop running Linux.
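The three columns can be computed with a small harness along the following lines. This is a sketch rather than the actual modified plogg code: `convert_fn`, `time_conversion` and `relative_to` are hypothetical names, `counting_convert` is only a stand-in for a real converter, and I am assuming 'Relative' is a plain ratio of total times. It uses the standard C `clock()` function, which measures CPU time; a monotonic wall clock such as POSIX `clock_gettime` would be a reasonable alternative.

```c
#include <time.h>

/* Hypothetical signature: converts one already-decoded frame. */
typedef void (*convert_fn)(void *frame);

struct timing {
    double total_s;  /* 'Total Time': seconds spent converting all frames */
    double frame_ms; /* 'Frame Time': average milliseconds per frame */
};

/* Stand-in converter for demonstration: just counts how often it is called. */
static void counting_convert(void *frame) {
    (*(int *)frame)++;
}

/* Accumulate only the time spent inside convert(), not in decoding. */
static struct timing time_conversion(convert_fn convert, void *frame,
                                     int frames) {
    clock_t ticks = 0;
    for (int i = 0; i < frames; i++) {
        /* A real harness would decode the next frame here, outside the
           timed region. */
        clock_t start = clock();
        convert(frame);
        ticks += clock() - start;
    }
    struct timing t;
    t.total_s = (double)ticks / CLOCKS_PER_SEC;
    t.frame_ms = frames > 0 ? t.total_s * 1000.0 / frames : 0.0;
    return t;
}

/* 'Relative': total time as a fraction of the baseline's total time
   (assumed here to be a simple ratio, with liboggplay as the baseline). */
static double relative_to(struct timing t, struct timing baseline) {
    return baseline.total_s > 0.0 ? t.total_s / baseline.total_s : 0.0;
}
```

Running `time_conversion()` once per implementation and passing each result to `relative_to()` together with the liboggplay timing gives one table row per implementation. Timing each call separately keeps decode time out of the numbers, at the cost of some clock overhead and limited clock resolution per frame.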
From the results it seems that FrameWave and Chromium are the fastest and very close in performance. The license for the Chromium colour space conversion code is probably a better fit for our usage, though, and the code is smaller and easier to integrate if we were to choose to use it. The 'default' Intel IPP version uses the non-processor-specific implementation, whereas the 'optimized' version uses routines optimized for the particular processor in the machine I used for testing.
I'm interested in any comments on the libraries listed and recommendations for other libraries that I could try. I'll put the source for the test program on GitHub shortly and update this post with the link. Results from different machines would be useful.