My first scientific contribution has just been published. What an amazing moment. (“Massively Parallel Multiclass Object Recognition”, together with F. Deger, H. Dammertz, J. Bouecke and H. P. A. Lensch, in Proc. of VMV 2010)
A citing paper (Orchard et al., 2013) appears in IEEE Transactions on Neural Networks and Learning Systems, mentioning us as their predecessor and as a performance reference. They bring our approach to the next level of integration (to FPGAs).
How is this possible? Usually, to speed up an algorithm, you invent a more sophisticated algorithm with lower complexity. But if you are already using the best known algorithm and the best-suited processor instructions, how can you get any faster? Can we run it on faster hardware? Computers get quicker, roughly doubling their speed every 18 months, so you could simply wait 10 years and buy a shiny new PC to gain a factor of about 100. But after waiting that long, you probably won't need the results anymore.
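The factor of 100 can be checked with a quick back-of-the-envelope calculation. This is only a sketch: the function name `speedup_after` is made up for illustration, and the steady 18-month doubling is the rule of thumb quoted above, not a measurement.

```python
# Back-of-the-envelope check of the "wait 10 years" estimate.
# Assumption: speed doubles steadily every 18 months (1.5 years).

def speedup_after(years: float, doubling_period_years: float = 1.5) -> float:
    """Return the speed factor gained after `years` of steady doubling."""
    return 2.0 ** (years / doubling_period_years)

if __name__ == "__main__":
    # 10 years are about 6.67 doubling periods, so the factor is about 2**6.67.
    print(f"{speedup_after(10):.1f}x after 10 years")  # 101.6x after 10 years
```

Ten years hold a little under seven doublings, which lands almost exactly on the factor of 100.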
But how, then, can you speed up calculations that much on the PC you already have? You probably saw it coming: with the calculation units on your graphics processing unit (GPU).
The GPU has a vast number of processing units, about 500 of them. They have a very simple instruction set, and this is their advantage. With little cruft from the past and no need to support general-purpose computation, these units dedicate all their processing power to one task: your algorithm.
And this is where your brain gets involved again: how can you redesign your algorithm to fit the GPU? A GPU works best on data-parallel instructions, where all processing units perform the same calculation at the same time. So you need to lay out your data so that it can be munched through in lockstep. Imagine a bar mower: flowers, grass, and weeds are all treated equally when the mower cuts a row. But the mower only works efficiently if it cuts with the full length of its bar. So you need to find the path that best covers your field of grass. Only such an algorithm keeps all processing units fully utilized.
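To make the mower picture a bit more concrete, here is a toy model in Python. It is not a GPU simulator and not code from the paper: the function `utilization` and the cost numbers are invented for illustration. The assumption is that a group of `width` units runs in lockstep until its slowest item is done, so idle units still hold their slot.

```python
# Toy model of lockstep execution (an illustration, not a GPU simulator).
# Assumption: a group of `width` units stays busy until its slowest item
# finishes, so the group occupies max(cost) * width unit-steps.

def utilization(costs, width):
    """Fraction of unit-steps doing useful work when the non-empty list
    `costs` is processed in consecutive groups of `width` items."""
    useful = sum(costs)
    occupied = 0
    for start in range(0, len(costs), width):
        group = costs[start:start + width]
        occupied += max(group) * width  # idle units still hold their slot
    return useful / occupied

if __name__ == "__main__":
    mixed = [1, 8, 1, 8, 1, 8, 1, 8]   # cheap and expensive items interleaved
    grouped = sorted(mixed)            # same work, similar items side by side
    print(utilization(mixed, width=4))    # 0.5625
    print(utilization(grouped, width=4))  # 1.0
    # Five equal items on a bar of width 4: the second pass cuts with
    # only one of four units, like a mower at the edge of the field.
    print(utilization([2] * 5, width=4))  # 0.625
```

The total amount of work is identical in the first two cases. Only the arrangement of the data decides whether the units stay busy.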
What do we do with this power? Find a hard problem, find a suitable algorithm, tailor it to run on the GPU, and compare it with other important existing solutions.
This is what we did. We asked our neighbors in the computer vision group for interesting problems. This is how we found work on multiclass object recognition, which classifies image content into known categories. As the basis for our parallelized implementation, we used a visual-cortex-like model by Jim Mutch and David G. Lowe. Such biologically motivated models compute in feed-forward layers, which naturally lend themselves to parallel computation. Additionally, the graphical nature of an image recognition system suits the architecture of GPUs quite well.
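Why do feed-forward layers parallelize so well? A small sketch may help. It assumes a max-pooling step of the kind that cortex-like models typically alternate with filtering steps. The helper `local_max_layer` and the sample values are hypothetical and are not taken from our implementation.

```python
# Sketch of one feed-forward pooling step (hypothetical helper, not the
# paper's code). Each output cell reads only the previous layer and never
# another output cell, so all cells could be computed at the same time.

def local_max_layer(prev, size=2):
    """Max over non-overlapping `size` x `size` blocks of a 2D list.
    Blocks at the right and bottom border are clipped to the grid."""
    rows, cols = len(prev), len(prev[0])
    return [
        [
            max(prev[r + dr][c + dc]
                for dr in range(size) for dc in range(size)
                if r + dr < rows and c + dc < cols)
            for c in range(0, cols, size)
        ]
        for r in range(0, rows, size)
    ]

if __name__ == "__main__":
    responses = [
        [1, 3, 0, 2],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 3],
    ]
    print(local_max_layer(responses))  # [[4, 2], [2, 7]]
```

On a GPU, each output cell of such a layer could be handed to its own processing unit, since none of them has to wait for a neighbor.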
Suffice it to say, the implementation went smoothly. Moreover, it opened up new possibilities: object recognition becomes usable even on low-end mobile computers. A short demonstration video shows this in action:
Compared with other implementations of the selected object recognition model, ours was faster while still achieving the same quality of results. That means our work not only ran well but also compared very well with other published results.
This is why we spent the following weeks in the lab to further optimize our implementation, measure results, write our first drafts, correct them, correct them even more, send the paper in, and finally get accepted at VMV 2010. Hard work, great results, and a little bit of luck in having the right people at the right place. I thank you all for these amazing moments.
Helmut Sedding; Ferdinand Deger; Holger Dammertz; Jan Bouecke; Hendrik P. A. Lensch:
Massively Parallel Multiclass Object Recognition
Proceedings of the 15th Vision, Modeling and Visualization Workshop 2010, pp. 251-257
[paper] [poster] [source code] [BibTeX] [video] [doi:10.2312/PE/VMV/VMV10/251-257]