For the project, Anna Rohrbach and her husband Marcus set up a kitchen and equipped it with video cameras; the filmed cooking scenes were later to be described automatically. REHACARE.com asked how exactly the software works and whether it could soon become part of every home.
Mrs. Rohrbach, you are working on a computer program that is meant to automatically recognize images. How could it help visually impaired people in their daily routines?
Anna Rohrbach: The goal of our work is to develop software that is able to automatically recognize what happens in a scene and communicate it to a person using natural language. Many people could benefit from such a technology, in particular blind and visually impaired people. One application that we consider in our work is generating descriptions for movies. This would let visually impaired people follow the storyline and understand what happens in a movie along with their non-visually impaired peers. Another direction that we have studied targets human daily routines, such as cooking. A computer assistant that understands the process of preparing a meal and is able to recognize the objects involved can be useful for a visually impaired person. While so far we have focused on generating descriptions for cooking videos, the system could be extended to help the person locate certain objects or answer questions that require visual inspection, for example: "What is in this jar?". There is currently great interest in the research community in integrating Computer Vision, Machine Learning, and Natural Language Processing techniques. With the rapid progress in the field, it is likely that mobile devices will soon provide support for visually impaired people during shopping and other daily routines.
How does it work?
Rohrbach: The technologies that we rely on in our work are Machine Learning methods. The general idea is to supply the software with training data, such as videos and associated sentence descriptions, and teach it to describe new videos. During the training process the software learns which motions and poses are discriminative for specific human actions. It also needs to recognize the manipulated objects, for example a knife and an orange. Finally, we feed the predictions of our visual recognizers into the next part of the pipeline, which produces sentences. We have experimented with two different sentence generation approaches. The first is a Machine Translation approach, similar to translating from one language to another. In our case we translate the predicted labels (for example: cut, knife and orange) into a natural language sentence: "The person cuts an orange". Our second approach makes use of an artificial recurrent neural network, which directly generates a sentence from the visual features.
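To make the two-stage pipeline more concrete, here is a minimal, hypothetical Python sketch, not the authors' actual system: a stand-in visual recognizer produces label scores for the activity, tool and object, and a single template then "translates" the best labels into a sentence, in the spirit of the Machine Translation approach described above. The recurrent-network variant would instead learn to generate the sentence word by word directly from the visual features. All names and scores below are illustrative assumptions.

# Illustrative sketch of the label-to-sentence pipeline described in the interview.
# predict_labels and labels_to_sentence are hypothetical stand-ins, not the real system.

from typing import Dict, List


def predict_labels(video_features: List[float]) -> Dict[str, Dict[str, float]]:
    # In the real system, confidence scores would come from classifiers trained
    # on motion, pose and object features; here they are fixed example values.
    return {
        "activity": {"cut": 0.85, "peel": 0.10},
        "tool":     {"knife": 0.90, "peeler": 0.05},
        "object":   {"orange": 0.80, "apple": 0.15},
    }


def best(labels: Dict[str, float]) -> str:
    # Pick the highest-scoring label for one semantic role.
    return max(labels, key=labels.get)


def labels_to_sentence(predictions: Dict[str, Dict[str, float]]) -> str:
    # Translation-style generation reduced to a single template:
    # the predicted semantic labels are "translated" into a sentence.
    activity = best(predictions["activity"])
    tool = best(predictions["tool"])
    obj = best(predictions["object"])
    return f"The person {activity}s an {obj} with a {tool}."


if __name__ == "__main__":
    predictions = predict_labels(video_features=[])  # visual features omitted in this sketch
    print(labels_to_sentence(predictions))
    # Prints: The person cuts an orange with a knife.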