Humans use both their eyes and ears to understand the world around them. In this project with Honda Research Institute Japan, I wrote a program to perform audio-visual musical instrument classification. The HEARBO robot, equipped with a thermal camera and microphone, was trained to fuse the two modalities using a Gaussian Mixture Model (my explanation on SlideShare).
It could distinguish 12 instruments, including the very similar-sounding shakuhachi, ocarina, recorder, classical flute and Japanese traditional flute. It was developed in C++ using HARK, OpenCV and ROS. Here's the full description.