Speech recognition by a humanoid robot in real environments

Japan's National Institute of Advanced Industrial Science and Technology (AIST), an independent administrative institution, has developed a speech recognition function for real environments using an array of microphones, successfully extending the sensing capability of the humanoid robot HRP-2 "Prométhée" of the Humanoid Robotic Project.

The microphone array consists of eight omnidirectional microphones mounted around the robot's head (Fig. 1 left). The sound source is located from the differences in arrival time at the individual microphones, while a camera mounted on the robot's head detects, tracks and locates the person giving the vocal instruction. Stable speech recognition is obtained by combining the information derived from the microphone array and the camera and by isolating and eliminating noise. Hardware that eliminates noise in real time has been developed and built into the robot, making it possible for a human operator to give the robot vocal instructions, and to control IT appliances through the robot, even in a setting with multiple noise sources such as a TV.

The present study has been carried out as a part of AIST Project "Development of Humanoid Robot Type Intelligence Booster Platform" (fiscal years 2003-05).


Fig. 1. (Left) The head of a humanoid robot equipped with a microphone array. Arrows show the positions of the mounted microphones. (Right) The multi-channel signal processing hardware built into the robot.


It is therefore expected that natural communication between a human operator and a humanoid robot can be realized in the living environment through the robot's auditory function.

Since Honda Motor Co., Ltd. announced its humanoid robot P2 in 1996, R&D on humanoid robots has grown vigorously, not only in Japan but around the world. The technology strategy map for robotics drafted by the Ministry of Economy, Trade and Industry (METI) plans for the practical use, by 2025, of robots that support human labor in the living environment, for example through household work, self-reliance support, and assistance and nursing care for the elderly.

Previous R&D on humanoid robot technology has focused on locomotion, aiming at safe and stable walking and behavior, and on robot vision. Little full-scale technological development has been done on the robot's hearing function, which plays an important role in establishing natural communication between humans and robots.

In the living environment, where practical use of next-generation robots is expected, direct human-robot interaction through the voice channel is becoming one of the key perceptive functions of a robot.

The living environment in which next-generation robots are expected to work in the near future contains many sound sources, such as a TV. Under such circumstances, natural communication between a human being and a robot through the voice channel, just as in human-to-human interaction, is an essential function for robots working in that environment. The present study has made it possible to install a voice interface on a humanoid robot that remains operable in an environment with many sound sources. The humanoid robot "HRP-2 Prométhée" was used in this work.

The voice interface developed in this study consists of the following components:

  • A microphone array system consisting of 8 omnidirectional microphones embedded around the head of the HRP-2.
  • Software that identifies the position of a human being in an image taken by a wide-field camera mounted on the head of the HRP-2.
  • Software that determines the position of the sound source from the differences in arrival time of the voice signal at each microphone in the array, detects utterance segments, and isolates sound sources by combining this with the visual information on the person's position supplied by the camera, separating and eliminating noise other than human voices.
  • Small-sized hardware for multi-channel signal processing that executes these software functions in real time (Fig. 1 Right).

Feeding the human voice, with noise eliminated by the voice interface, into the speech recognition software "Julian" allows a humanoid robot to recognize voice instructions stably in an environment where a TV and other noise sources exist, without requiring the human operator to wear a headset, thereby establishing a robot hearing function.
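
The idea of locating a sound source from differences in arrival time can be illustrated with a minimal Python sketch. The article does not say which algorithm the AIST system uses, so this sketch assumes plain cross-correlation over a single pair of microphones and a far-field source. The function names, the 343 m/s speed of sound, and the sample rate and microphone spacing passed in are all illustrative assumptions, not details of the HRP-2 hardware or software.

```python
import math

# Assumed speed of sound in air at room temperature (m/s); not from the article.
SPEED_OF_SOUND = 343.0


def estimate_delay(ref, sig, max_lag):
    """Return the lag in samples at which sig best matches ref.

    Uses brute-force cross-correlation over lags in [-max_lag, max_lag].
    A positive result means sig is a delayed copy of ref.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, r in enumerate(ref):
            m = n + lag
            if 0 <= m < len(sig):
                score += r * sig[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def bearing_from_delay(delay_samples, sample_rate, mic_spacing):
    """Far-field bearing in degrees from broadside for one microphone pair.

    delay_samples: arrival-time difference between the two microphones
    sample_rate:   sampling frequency in Hz (hypothetical parameter)
    mic_spacing:   distance between the two microphones in metres (hypothetical)
    """
    path_difference = delay_samples / sample_rate * SPEED_OF_SOUND
    # Clamp so that measurement error cannot push asin() out of its domain.
    ratio = max(-1.0, min(1.0, path_difference / mic_spacing))
    return math.degrees(math.asin(ratio))
```

As an example, a 3-sample delay at an assumed 16 kHz sample rate and 0.2 m spacing corresponds to a bearing of roughly 19 degrees from broadside. With an eight-microphone array, several such pairwise estimates could in principle be combined, and the result cross-checked against the person's position reported by the camera.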

In addition, a set of software has been developed that lets the robot act on the perceived vocal instructions and operate a TV and other IT appliances through the network, verifying the usefulness of the voice interface.

