For the project, Anna Rohrbach and her husband Marcus set up a kitchen and equipped it with video cameras; the filmed cooking scenes were later described automatically by software. REHACARE.com asked how exactly the software works and whether it could soon be found in every home.
Mrs. Rohrbach, you are working on a computer program that automatically recognizes images. How could it help visually impaired people in their daily routines?
Anna Rohrbach: The goal of our work is to develop software that can automatically recognize what happens in a scene and communicate it to a person in natural language. Many people could benefit from such a technology, in particular blind and visually impaired people. One application that we consider in our work is generating descriptions for movies. This would let visually impaired people follow the storyline and understand what happens in a movie along with their non-visually impaired peers. Another direction that we have studied targets human daily routines, such as cooking. A computer assistant that understands the process of preparing a meal and can recognize the objects involved could be useful to a visually impaired person. While so far we have focused on generating descriptions for cooking videos, the system could be extended to help the person locate certain objects or answer questions that require visual inspection, for example: “What is in this jar?”. There is currently great interest in the research community in integrating Computer Vision, Machine Learning, and Natural Language Processing techniques. With the rapid progress in the field, it is likely that mobile devices will soon provide support for visually impaired people during shopping and other daily routines.
How does it work?
Rohrbach: The technologies that we rely on in our work are Machine Learning methods. The general idea is to supply the software with training data, such as videos and associated sentence descriptions, and teach it to describe new videos. During the training process the software learns which motions and poses are discriminative for specific human actions. It also needs to recognize the manipulated objects, for example: knife and orange. Finally, we feed the predictions of our visual recognizers into the next part of the pipeline, which produces sentences. We have experimented with two different sentence generation approaches. The first is a Machine Translation approach, similar to translating from one language to another. In our case we translate the predicted labels (for example: cut, knife and orange) into a natural language sentence: “The person cuts an orange”. Our second approach makes use of an artificial recurrent neural network, which directly generates a sentence from the visual features.
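To illustrate the translation idea in the first approach, here is a toy sketch in Python that maps predicted labels to a sentence. The function name, the verb table, and the template are assumptions made for illustration only, not part of the actual research system, which learns this mapping from data rather than using hand-written rules:

```python
def labels_to_sentence(activity, tool, obj):
    """Turn predicted visual labels into a simple English sentence.

    The verb table, article rule, and sentence template below are
    illustrative assumptions, not the published pipeline.
    """
    # Conjugate the activity label for a third-person subject.
    third_person = {"cut": "cuts", "peel": "peels", "wash": "washes"}
    verb = third_person.get(activity, activity + "s")
    # Pick the indefinite article for the object.
    article = "an" if obj[0] in "aeiou" else "a"
    return f"The person {verb} {article} {obj} with a {tool}."

print(labels_to_sentence("cut", "knife", "orange"))
# The person cuts an orange with a knife.
```

A learned system replaces these hand-coded rules with a statistical model trained on video–sentence pairs, so it can generalize to label combinations it has never seen spelled out.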
When do you expect the program to have arrived in every home?
Rohrbach: Our movie description project is still at a rather early stage of development. The final system has to solve various visual recognition challenges as well as understand the movie plot, etc. It will probably take five to ten years to fully automate this process. As for the kitchen computer assistant, I imagine that the first such systems could appear in about five years, while within ten years we will see some impressive new technologies coming to our homes.
What does inclusion mean to you?
Rohrbach: For me, inclusion is the recognition that everyone has equal rights to be part of society, while support should be provided for those who require it. I believe that technologies specifically developed for visually impaired people will eliminate some of the daily-life challenges they have to face, help them be more independent, and let them realize their full potential.