Disclosure of Invention
An object of the present disclosure is to provide a robot, and a voice recognition apparatus and method thereof, capable of accurately performing voice recognition in various scenes.
According to a first embodiment of the present disclosure, there is provided a voice recognition apparatus applied to a robot, including: a distributed microphone array comprising a first microphone array located on a front face of the robot and a second microphone array located on a back face of the robot, for acquiring a first voice signal and a second voice signal, respectively; and a voice processor configured to fuse the first voice signal and the second voice signal to perform voice recognition.
Optionally, the first microphone array and the second microphone array are each one of: a linear microphone array, an annular microphone array, and a spherical microphone array.
Optionally, the first microphone array is located on a chest of the robot and the second microphone array is located on a back of the robot.
Optionally, the voice processor comprises: a sound source direction determining unit configured to determine a first sound source direction based on the first voice signal and a second sound source direction based on the second voice signal; a beamforming unit configured to perform beamforming on the first voice signal for which the first sound source direction has been determined, and on the second voice signal for which the second sound source direction has been determined; a signal-to-noise ratio calculation unit configured to calculate the signal-to-noise ratio of the beamformed first voice signal and the signal-to-noise ratio of the beamformed second voice signal; a noise reduction processing unit configured to use the voice signal with the higher signal-to-noise ratio as a noise reference signal and to perform noise reduction processing on the voice signal with the lower signal-to-noise ratio by using the noise reference signal; and a voice recognition unit configured to perform voice recognition based on the voice signal after the noise reduction processing.
Optionally, the beamforming unit is configured to: calculate a first spatial delay of the first voice signal by using a first area array corresponding to the first microphone array, and calculate a second spatial delay of the second voice signal by using a second area array corresponding to the second microphone array; and calculate the weight of the direction vector of the first voice signal according to the first spatial delay and update the corresponding blocking matrix, and calculate the weight of the direction vector of the second voice signal according to the second spatial delay and update the corresponding blocking matrix.
Optionally, the voice processor further includes a final sound source direction determining unit configured to determine the sound source direction of the voice signal with the higher signal-to-noise ratio as the final sound source direction.
Optionally, the voice processor further comprises an echo cancellation unit configured to select the microphone array farther from the loudspeaker to perform echo cancellation before beamforming is performed.
According to a second embodiment of the present disclosure, there is provided a robot including the voice recognition apparatus according to the first embodiment of the present disclosure.
According to a third embodiment of the present disclosure, there is provided a voice recognition method applied to a robot, including: acquiring a first voice signal by a first microphone array located on a front side of the robot and acquiring a second voice signal by a second microphone array located on a back side of the robot; and fusing the first voice signal and the second voice signal for voice recognition.
Optionally, fusing the first voice signal and the second voice signal for voice recognition includes: determining a first sound source direction based on the first voice signal, and determining a second sound source direction based on the second voice signal; performing beamforming on the first voice signal for which the first sound source direction has been determined, and performing beamforming on the second voice signal for which the second sound source direction has been determined; calculating the signal-to-noise ratio of the beamformed first voice signal and the signal-to-noise ratio of the beamformed second voice signal; using the voice signal with the higher signal-to-noise ratio as a noise reference signal, and performing noise reduction processing on the voice signal with the lower signal-to-noise ratio by using the noise reference signal; and performing voice recognition based on the voice signal after the noise reduction processing.
Optionally, the performing beamforming on the first voice signal with the first sound source direction determined and performing beamforming on the second voice signal with the second sound source direction determined includes: calculating a first spatial delay of the first voice signal by using a first annular area array corresponding to the first microphone array, and calculating a second spatial delay of the second voice signal by using a second annular area array corresponding to the second microphone array; and calculating the weight of the direction vector of the first voice signal according to the first spatial delay and updating the corresponding blocking matrix, and calculating the weight of the direction vector of the second voice signal according to the second spatial delay and updating the corresponding blocking matrix.
Optionally, the method further comprises: determining the sound source direction of the voice signal with the higher signal-to-noise ratio as the final sound source direction.
Optionally, the method further comprises: selecting the microphone array farther from the loudspeaker to perform echo cancellation before beamforming is performed.
With the above technical solution, the voice recognition apparatus and the voice recognition method according to the embodiments of the present disclosure use the distributed microphone arrays on the front and back of the robot to pick up voice, and fuse the first voice signal and the second voice signal to perform voice recognition. As a result, 360-degree localization and sound pickup can be performed in strongly noisy environments (such as exhibitions and business halls) and while the robot is moving, voice recognition can be performed accurately, and the robustness of voice interaction is enhanced.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
Fig. 1 shows a schematic block diagram of a voice recognition apparatus applied to a robot according to an embodiment of the present disclosure. As shown in fig. 1, the voice recognition apparatus includes: a distributed microphone array 1 comprising a first microphone array 11 located on the front face of the robot and a second microphone array 12 located on the back face of the robot, for acquiring a first voice signal and a second voice signal, respectively; and a voice processor 2 configured to fuse the first voice signal and the second voice signal to perform voice recognition.
The first microphone array 11 may be arranged at at least one location on the front of the robot, such as the chest or the front of the legs, and is preferably arranged on the chest. The second microphone array 12 may be arranged at at least one location on the rear of the robot, such as the back, the back of the head, or the back of the legs, and is preferably arranged on the back.
The first microphone array 11 and the second microphone array 12 may each be one of: a linear microphone array, an annular microphone array, and a spherical microphone array. For example, the first microphone array 11 and the second microphone array 12 may both be implemented by annular microphone arrays, or the first microphone array 11 may be implemented by a linear microphone array and the second microphone array 12 by an annular microphone array, and so on. In addition, in order to achieve 360-degree voice recognition, the linear microphone array may be an array of n rows and m columns, where n and m are both positive integers greater than 2; the annular microphone array may be a j-microphone array, where j is a positive integer greater than or equal to 4, such as a 4-microphone annular microphone array, a 5-microphone annular microphone array, an 8-microphone annular microphone array, and so forth.
By means of the first microphone array 11, three-dimensional localization and sound pickup are achieved in the space in front of the robot; by means of the second microphone array 12, three-dimensional localization and sound pickup are achieved in the space behind the robot. Combining the first microphone array 11 and the second microphone array 12 therefore enables spatial localization and sound pickup all around the robot with no dead angle, allows more focused beamforming, and improves the noise reduction effect. Moreover, the distributed microphone array arrangement addresses two problems: inconsistent microphone aperture depths caused by the uneven, undulating surface of the robot body, and the inability, owing to the posture of the robot product, to deploy microphones so that they effectively receive voice from every direction.
Fig. 2 shows a schematic diagram of a first microphone array 11 and a second microphone array 12, which are located on the chest and back, respectively, of a robot and which are each an 8-microphone annular microphone array. The double arrow in fig. 2 indicates that the 8-microphone annular microphone array indicated by reference numeral 12 is located on the back of the robot. Accordingly, the first voice signal acquired by the first microphone array 11 is an 8-channel voice signal, and the second voice signal acquired by the second microphone array 12 is also an 8-channel voice signal.
With the above technical solution, because the voice recognition apparatus according to the embodiment of the present disclosure comprises the distributed microphone arrays on the front and back of the robot, and the voice processor 2 performs voice recognition by fusing the first voice signal and the second voice signal, 360-degree localization and sound pickup can be performed in strongly noisy environments (such as exhibitions and business halls) and while the robot is moving, voice recognition can be performed accurately, and the robustness of voice interaction is enhanced.
In one embodiment, the speech processor 2 may include a sound source direction determining unit, a beam forming unit, a signal-to-noise ratio calculating unit, a noise reduction processing unit, and a speech recognition unit.
The sound source direction determining unit is configured to determine a first sound source direction based on the first voice signal and a second sound source direction based on the second voice signal. The sound source direction may be determined using, for example, a direction of arrival (DOA) estimation algorithm.
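The disclosure does not specify which DOA estimation algorithm is used. As a minimal sketch only, the following Python code estimates the arrival angle for a single pair of microphones from the time delay that maximizes their cross-correlation. The function names, the speed of sound value, and the far-field two-microphone model are assumptions made for illustration; a practical implementation would use all channels of each array.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def estimate_delay_samples(x, y, max_lag):
    """Return the lag (in samples) by which y trails x, chosen to
    maximize the cross-correlation between the two channels."""
    best_lag, best_score = 0, float("-inf")
    n = len(x)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += x[i] * y[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def doa_from_pair(x, y, mic_distance, sample_rate):
    """Far-field arrival angle in degrees, measured from the broadside
    of the microphone pair (0 degrees means equal arrival time)."""
    # Physically possible lags are bounded by the microphone spacing.
    max_lag = int(math.ceil(mic_distance / SPEED_OF_SOUND * sample_rate))
    lag = estimate_delay_samples(x, y, max_lag)
    tau = lag / sample_rate
    # Clamp to the valid asin domain to absorb rounding of the lag.
    s = max(-1.0, min(1.0, SPEED_OF_SOUND * tau / mic_distance))
    return math.degrees(math.asin(s))
```

In this sketch the same routine would be run once on channels of the first voice signal and once on channels of the second voice signal, giving the first and second sound source directions.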
The beam forming unit is used for carrying out beam forming on a first voice signal which determines the first sound source direction and carrying out beam forming on a second voice signal which determines the second sound source direction.
The signal-to-noise ratio calculating unit is configured to calculate the signal-to-noise ratio of the beamformed first voice signal and the signal-to-noise ratio of the beamformed second voice signal.
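The disclosure does not state how the signal-to-noise ratio is computed. One common approach, sketched below under stated assumptions, compares the mean power of a segment containing speech with the mean power of a noise-only segment. The function names and the availability of a noise-only segment (for example, from a voice activity detector) are assumptions not found in the source.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a block of samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech_segment, noise_segment, eps=1e-12):
    """Estimate the SNR in dB of a beamformed signal.

    speech_segment: samples where speech plus noise is present.
    noise_segment:  samples assumed to contain noise only.
    """
    p_noisy = mean_power(speech_segment)
    p_noise = mean_power(noise_segment)
    # Speech power is approximated as noisy power minus noise power.
    p_speech = max(p_noisy - p_noise, eps)
    return 10.0 * math.log10(p_speech / max(p_noise, eps))
```

The two resulting values, one per beamformed signal, can then be compared directly to decide which signal has the higher signal-to-noise ratio.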
The noise reduction processing unit is configured to use the voice signal with the higher signal-to-noise ratio as a noise reference signal, and to perform noise reduction processing on the voice signal with the lower signal-to-noise ratio by using the noise reference signal. For example, if the signal-to-noise ratio of the first voice signal is higher than that of the second voice signal, the noise reduction processing unit may use the first voice signal as the noise reference signal. That is, in the post-filtering performed after beamforming, the first voice signal is used as the noise spectrum input of the post-filter, and the stationary noise in the second voice signal is then removed by, for example, Wiener filtering, a statistical model, or another method. In an actual application scenario, owing to the posture of the robot, one of the arrays necessarily faces away from the actual sound source during interaction, so the microphone array facing the actual sound source can be used for sound pickup and noise reduction, and the signal of the microphone array facing away from the actual sound source can be used as the reference signal. The noise reduction processing unit may be implemented using various suitable filters.
The voice recognition unit is configured to perform voice recognition based on the voice signal after the noise reduction processing. Continuing the above example, when the first voice signal is used as the noise spectrum input to eliminate stationary noise in the second voice signal, the voice recognition unit performs voice recognition based on the noise-reduced second voice signal.
In the prior art, only a single microphone array is used for sound pickup, so only one sound source direction needs to be located, and a noise spectrum obtained from a statistical model has to be used as the noise reference signal during noise reduction processing. In the present disclosure, the distributed microphone arrays pick up voice signals from the front and the back of the robot respectively, so a sound source direction needs to be located separately for the voice signal picked up by each microphone array; and in the noise reduction processing, the voice signal with the higher signal-to-noise ratio is used as the noise reference signal, which is then used to perform noise reduction processing on the voice signal with the lower signal-to-noise ratio.
In one embodiment, the voice processor 2 further includes a final sound source direction determining unit configured to determine the sound source direction of the voice signal with the higher signal-to-noise ratio as the final sound source direction. This improves the accuracy of target tracking while the robot is moving.
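As an illustration of using one signal as the noise spectrum input of a post-filter, the following sketch computes a Wiener-type gain per frequency bin from two power spectra and applies it to the spectrum of the signal being denoised. The function names, the gain floor, and the assumption that both signals are already available as per-bin spectra are hypothetical; the disclosure names Wiener filtering only as one possible method.

```python
def wiener_gains(target_psd, noise_ref_psd, gain_floor=0.1):
    """Per-bin Wiener-type gains.

    target_psd:    power spectrum of the signal to be denoised.
    noise_ref_psd: power spectrum of the noise reference signal.
    gain_floor:    lower bound on the gain (an assumed tuning value)
                   that limits musical-noise artifacts.
    """
    gains = []
    for p_target, p_noise in zip(target_psd, noise_ref_psd):
        if p_target <= 0.0:
            gains.append(gain_floor)
            continue
        # Wiener gain S/(S+N) with S estimated as (target - noise).
        gains.append(max(1.0 - p_noise / p_target, gain_floor))
    return gains


def apply_gains(target_spectrum, gains):
    """Scale each complex frequency bin by its real-valued gain."""
    return [g * bin_value for g, bin_value in zip(gains, target_spectrum)]
```

Bins in which the reference power approaches or exceeds the target power are attenuated down to the floor, while bins dominated by the target pass almost unchanged.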
In the prior art, planar microphone arrays, annular microphone arrays and the like are all laid flat, so a linear array or annular array model is used for calculation during beamforming. In the present disclosure, by contrast, the microphone arrays are mounted upright on the robot body. Fig. 3a and 3b show schematic diagrams of an annular microphone array laid flat and mounted upright, respectively. The inventor has found that the conventional linear array and annular array calculation is no longer suitable for an upright array and, if applied, yields inaccurate beamforming results. The existing beamforming therefore needs to be adapted to process the voice signals picked up by the upright microphone arrays. That is, the beamforming unit is configured to: calculate a first spatial delay of the first voice signal by using a first area array corresponding to the first microphone array 11, and calculate a second spatial delay of the second voice signal by using a second area array corresponding to the second microphone array 12 (for example, when the first microphone array 11 and the second microphone array 12 are both annular microphone arrays, the first area array and the second area array are both annular area arrays); and calculate the weight of the direction vector of the first voice signal according to the first spatial delay and update the corresponding blocking matrix, and calculate the weight of the direction vector of the second voice signal according to the second spatial delay and update the corresponding blocking matrix. With this technical solution, the beamforming result is more accurate and the accuracy of voice recognition is higher.
In one embodiment, the voice processor 2 further comprises an echo cancellation unit configured to select the microphone array farther from the loudspeaker to perform echo cancellation before beamforming is performed. In venues such as exhibitions and business halls, the sound played by the loudspeaker fills the whole venue, so the echo cancellation effect is basically the same whichever microphone array is selected. In principle, however, the microphone array farther from the loudspeaker is selected for echo cancellation, because it is least affected by vibration and nonlinear distortion of the loudspeaker cavity, and beamforming can then be used to better advantage.
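The disclosure gives no formulas for the spatial delay, the direction vector weights, or the blocking matrix. The following sketch shows one way these quantities could be computed for an upright annular array, assuming a far-field source, a ring lying in the vertical x-z plane with its normal along +y, and a generalized sidelobe canceller structure (suggested by the term "blocking matrix" but not named in the source). All function names and the coordinate convention are assumptions.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def vertical_ring_positions(num_mics, radius):
    """Microphone coordinates for a ring mounted upright: the ring
    lies in the x-z plane and its normal points along +y."""
    step = 2.0 * math.pi / num_mics
    return [(radius * math.cos(k * step), 0.0, radius * math.sin(k * step))
            for k in range(num_mics)]


def spatial_delays(positions, azimuth_deg, elevation_deg):
    """Far-field arrival delay of each microphone relative to the
    array center, in seconds (negative means earlier arrival)."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    u = (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))
    return [-(p[0] * u[0] + p[1] * u[1] + p[2] * u[2]) / SPEED_OF_SOUND
            for p in positions]


def steering_vector(delays, freq_hz):
    """Direction vector at one frequency for the given delays."""
    return [cmath.exp(-2j * math.pi * freq_hz * t) for t in delays]


def delay_and_sum_weights(steering):
    """Weights w such that w^H a = 1 for the steering vector a."""
    return [a / len(steering) for a in steering]


def blocking_matrix(steering):
    """(M-1) x M matrix whose rows cancel the look direction:
    each row subtracts two phase-aligned neighboring channels."""
    m = len(steering)
    rows = []
    for i in range(m - 1):
        row = [0j] * m
        row[i] = steering[i].conjugate()
        row[i + 1] = -steering[i + 1].conjugate()
        rows.append(row)
    return rows
```

Because the ring is upright, the delays in this model depend on the elevation of the source as well as its azimuth, which is why a flat-ring delay formula would give different (and here incorrect) values. Whenever the estimated sound source direction changes, the delays, weights and blocking matrix would be recomputed.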
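The echo cancellation algorithm itself is not specified in the disclosure. As a hedged sketch, the code below first picks the microphone array farther from the loudspeaker by Euclidean distance and then runs a normalized least-mean-squares (NLMS) adaptive filter, a widely used echo cancellation technique chosen here only for illustration. The function names, the position dictionary, and the filter parameters are assumptions.

```python
import math


def pick_array_for_aec(array_positions, speaker_position):
    """Return the name of the array farthest from the loudspeaker.

    array_positions: dict mapping an array name to (x, y, z) in meters.
    """
    return max(array_positions,
               key=lambda name: math.dist(array_positions[name], speaker_position))


def nlms_echo_cancel(mic, far_end, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated loudspeaker echo from mic.

    mic:     samples captured by one microphone channel.
    far_end: samples sent to the loudspeaker (the echo source).
    Returns the residual signal after echo removal.
    """
    w = [0.0] * taps
    residual = []
    for n in range(len(mic)):
        # Most recent far-end samples, newest first, zero-padded.
        x = [far_end[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_estimate = sum(wk * xk for wk, xk in zip(w, x))
        e = mic[n] - echo_estimate
        norm = sum(xk * xk for xk in x) + eps
        # Normalized LMS update of the echo-path estimate.
        w = [wk + mu * e * xk / norm for wk, xk in zip(w, x)]
        residual.append(e)
    return residual
```

In this sketch each channel of the selected array would be passed through the canceller, and the residual channels would then be handed to the beamforming stage.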
According to still another embodiment of the present disclosure, there is provided a robot including a voice recognition apparatus according to an embodiment of the present disclosure.
Fig. 4 shows a flowchart of a voice recognition method applied to a robot according to an embodiment of the present disclosure. As shown in fig. 4, the method includes:
in step S41, acquiring a first voice signal by a first microphone array located on the front face of the robot, and acquiring a second voice signal by a second microphone array located on the back face of the robot;
in step S42, the first speech signal and the second speech signal are fused for speech recognition.
With the above technical solution, the voice recognition method according to the embodiment of the present disclosure uses the distributed microphone arrays on the front and back of the robot to pick up voice, and fuses the first voice signal and the second voice signal to perform voice recognition. As a result, 360-degree localization and sound pickup can be performed in strongly noisy environments (such as exhibitions and business halls) and while the robot is moving, voice recognition can be performed accurately, and the robustness of voice interaction is enhanced.
Fig. 5 shows a flow chart of how a first speech signal and a second speech signal are fused for speech recognition.
As shown in fig. 5, the fusing includes:
in step S42a, determining a first sound source direction based on the first voice signal and a second sound source direction based on the second voice signal;
in step S42b, beamforming is performed on the first voice signal in which the first sound source direction is determined, and beamforming is performed on the second voice signal in which the second sound source direction is determined;
in step S42c, calculating the signal-to-noise ratio of the beamformed first voice signal and the signal-to-noise ratio of the second voice signal respectively;
in step S42d, the voice signal with the higher signal-to-noise ratio is used as a noise reference signal, and the noise reference signal is used to perform noise reduction processing on the voice signal with the lower signal-to-noise ratio; and
in step S42e, speech recognition is performed based on the speech signal after the noise reduction processing.
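Steps S42a to S42e can be sketched as a single function that wires the stages together. The stage implementations are passed in as callables because the disclosure leaves their algorithms open; the function name, the parameter names, and the tie-breaking rule when both signal-to-noise ratios are equal are assumptions made for illustration. The sketch also returns the sound source direction of the higher-SNR signal as the final direction.

```python
def fuse_and_recognize(first, second, estimate_doa, beamform,
                       estimate_snr, denoise, recognize):
    """Fuse two array signals for recognition.

    first, second: multi-channel signals from the front and back arrays.
    The remaining arguments are caller-supplied stage implementations.
    Returns (recognition result, final sound source direction).
    """
    # S42a: locate a sound source direction for each array.
    doa_1 = estimate_doa(first)
    doa_2 = estimate_doa(second)
    # S42b: beamform each signal toward its own direction.
    bf_1 = beamform(first, doa_1)
    bf_2 = beamform(second, doa_2)
    # S42c: compare signal-to-noise ratios after beamforming.
    snr_1 = estimate_snr(bf_1)
    snr_2 = estimate_snr(bf_2)
    # S42d: the higher-SNR signal is the noise reference and the
    # lower-SNR signal is denoised (a tie favors the first signal,
    # which is an assumption of this sketch).
    if snr_1 >= snr_2:
        reference, target, final_doa = bf_1, bf_2, doa_1
    else:
        reference, target, final_doa = bf_2, bf_1, doa_2
    cleaned = denoise(target, reference)
    # S42e: recognize the noise-reduced signal.
    return recognize(cleaned), final_doa
```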
Optionally, performing beamforming on the first voice signal for which the first sound source direction has been determined and performing beamforming on the second voice signal for which the second sound source direction has been determined in step S42b includes: calculating a first spatial delay of the first voice signal by using a first annular area array corresponding to the first microphone array, and calculating a second spatial delay of the second voice signal by using a second annular area array corresponding to the second microphone array; and calculating the weight of the direction vector of the first voice signal according to the first spatial delay and updating the corresponding blocking matrix, and calculating the weight of the direction vector of the second voice signal according to the second spatial delay and updating the corresponding blocking matrix.
Optionally, the method according to the embodiment of the present disclosure further includes: determining the sound source direction of the voice signal with the higher signal-to-noise ratio as the final sound source direction.
Optionally, the method according to the embodiment of the present disclosure further includes: selecting the microphone array farther from the loudspeaker to perform echo cancellation before beamforming is performed.
Specific implementation manners of the steps involved in the speech recognition method according to the embodiment of the present disclosure have been described in detail in the apparatus according to the embodiment of the present disclosure, and are not described herein again.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings. However, the present disclosure is not limited to the specific details of the above embodiments. Various simple modifications may be made to the technical solution of the present disclosure within its technical concept, and all such simple modifications fall within the protection scope of the present disclosure.
It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. In order to avoid unnecessary repetition, various possible combinations will not be separately described in this disclosure.
In addition, the various embodiments of the present disclosure may be combined in any manner, and such combinations should likewise be regarded as content disclosed by the present disclosure, as long as they do not depart from the spirit of the present disclosure.