CN114596848A - Robot and voice recognition method and device for same - Google Patents

Robot and voice recognition method and device for same

Info

Publication number
CN114596848A
CN114596848A CN202011420360.9A CN202011420360A CN114596848A CN 114596848 A CN114596848 A CN 114596848A CN 202011420360 A CN202011420360 A CN 202011420360A CN 114596848 A CN114596848 A CN 114596848A
Authority
CN
China
Prior art keywords
information
voice information
target object
robot
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011420360.9A
Other languages
Chinese (zh)
Inventor
庄伟基
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202011420360.9A
Publication of CN114596848A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W4/00: Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/02: Services making use of location information
    • H04W4/029: Location-based management or tracking services

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Manipulator (AREA)

Abstract

The present disclosure relates to a voice recognition method for a robot, comprising the steps of: acquiring multi-channel voice information collected by a robot; acquiring position information of a target object; enhancing the multi-channel voice information according to the position information of the target object to generate enhanced voice information; and performing voice recognition on the enhanced voice information to generate a voice recognition result. According to the embodiments of the disclosure, the acquired position information of the target object allows beam enhancement to be performed on the multi-channel voice information collected by the robot, so that the voice signal is enhanced, the accuracy of voice recognition is improved, and the subsequent voice interaction process is facilitated.

Description

Robot and voice recognition method and device for same
Technical Field
The present disclosure relates to the field of robots, and in particular, to a method and an apparatus for speech recognition of a robot, and a storage medium.
Background
With the continuous development of robots, robotic pets are becoming increasingly popular. However, robotic pets, such as legged robots, are constantly moving during voice interaction with humans. Unlike traditional stationary smart devices (such as a smart speaker), a robotic pet generates considerable noise through its continuous movement, such as drive motor noise and mechanical transmission noise from the joints during movement, and this noise can strongly interfere with voice recognition.
In addition, because the robot is in motion, it may be very far away from the target object. In that case, due to the influence of its own noise and environmental noise, voice recognition of the target object may be inaccurate, so that the robot may not respond accurately to the target object's instructions.
Disclosure of Invention
The present disclosure provides a voice recognition method and apparatus for a robot, and a robot, so as to solve at least the problem in the related art of inaccurate voice recognition caused by noise. The technical solution of the disclosure is as follows:
According to a first aspect of embodiments of the present disclosure, there is provided a voice recognition method for a robot, including the steps of: acquiring multi-channel voice information collected by a robot; acquiring position information of a target object; enhancing the multi-channel voice information according to the position information of the target object to generate enhanced voice information; and performing voice recognition on the enhanced voice information to generate a voice recognition result.
In an embodiment of the disclosure, enhancing the multi-channel voice information according to the position information of the target object to generate enhanced voice information includes: selecting base voice information and reference voice information from the multi-channel voice information according to the position information of the target object; adjusting the reference voice information according to the base voice information; and generating the enhanced voice information according to the base voice information and the adjusted reference voice information.
In one embodiment of the present disclosure, the robot includes a plurality of microphones, and selecting base voice information and reference voice information from the multi-channel voice information according to the position information of the target object includes: determining a microphone facing the position of the target object according to the position of the target object and the topological positions of the plurality of microphones; and using the voice information collected by the microphone facing the position of the target object as the base voice information, and using the other voice information in the multi-channel voice information as the reference voice information.
In an embodiment of the disclosure, adjusting the reference voice information according to the base voice information includes: taking the microphone facing the position of the target object as a base microphone and the other microphones among the plurality of microphones as reference microphones; generating time delay information of each reference microphone relative to the base microphone according to topological position information of the plurality of microphones; and adjusting the reference voice information according to the time delay information of the reference microphones so as to align the reference voice information with the base voice information.
In an embodiment of the disclosure, generating the enhanced voice information according to the base voice information and the adjusted reference voice information includes: adding the base voice information and the aligned reference voice information to generate the enhanced voice information.
In an embodiment of the disclosure, the acquiring the position information of the target object includes: receiving a UWB (Ultra Wide Band) signal transmitted by the smart device of the target object through an array antenna to generate a plurality of UWB receiving signals; and generating position information of the target object from the plurality of UWB reception signals.
In one embodiment of the present disclosure, the array antenna includes a plurality of antennas, and generating the position information of the target object from the plurality of UWB reception signals includes: acquiring times of flight or time differences of arrival of the plurality of UWB reception signals received by the plurality of antennas; and generating the position information of the target object according to the positions of the plurality of antennas and the times of flight or time differences of arrival of the plurality of UWB reception signals.
According to a second aspect of embodiments of the present disclosure, there is provided a voice recognition apparatus for a robot, including: an acquisition module for acquiring the multi-channel voice information collected by the robot; a position information acquiring module for acquiring the position information of the target object; an enhancement module for enhancing the multi-channel voice information according to the position information of the target object so as to generate enhanced voice information; and a recognition module for performing voice recognition on the enhanced voice information to generate a voice recognition result.
In one embodiment of the disclosure, the enhancement module includes: a selection sub-module for selecting base voice information and reference voice information from the multi-channel voice information according to the position information of the target object; an adjusting sub-module for adjusting the reference voice information according to the base voice information; and an enhancement sub-module for generating the enhanced voice information according to the base voice information and the adjusted reference voice information.
In an embodiment of the disclosure, the acquisition module includes a plurality of microphones, and the selection sub-module determines a microphone facing the position of the target object according to the position of the target object and the topological positions of the plurality of microphones, uses the voice information collected by the microphone facing the position of the target object as the base voice information, and uses the other voice information in the multi-channel voice information as the reference voice information.
In an embodiment of the disclosure, the adjusting sub-module takes the microphone facing the position of the target object as a base microphone and the other microphones of the plurality of microphones as reference microphones, generates time delay information of each reference microphone relative to the base microphone according to topological position information of the plurality of microphones, and adjusts the reference voice information according to the time delay information of the reference microphones so as to align the reference voice information with the base voice information.
In an embodiment of the present disclosure, the enhancement sub-module adds the base voice information and the aligned reference voice information to generate the enhanced voice information.
In one embodiment of the present disclosure, the location information acquiring module includes: an array antenna for receiving a UWB signal transmitted by the smart device of the target object to generate a plurality of UWB reception signals; and a generation sub-module for generating position information of the target object from the plurality of UWB reception signals.
In one embodiment of the present disclosure, the array antenna includes a plurality of antennas, and the generation sub-module acquires times of flight or time differences of arrival of the plurality of UWB reception signals received by the plurality of antennas, and generates the position information of the target object based on the positions of the plurality of antennas and the times of flight or time differences of arrival of the plurality of UWB reception signals.
According to a third aspect of the embodiments of the present disclosure, there is also provided a robot including: a plurality of microphones; a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the speech recognition method for a robot as described above.
According to a fourth aspect of embodiments of the present disclosure, there is also provided a storage medium having instructions that, when executed by a processor of a voice recognition apparatus for a robot, enable the voice recognition apparatus for the robot to perform the voice recognition method for the robot as described above.
The technical solutions provided by the embodiments of the disclosure bring at least the following beneficial effects:
According to the embodiments of the disclosure, the acquired position information of the target object allows beam enhancement to be performed on the multi-channel voice information collected by the robot, so that the voice signal is enhanced, the accuracy of voice recognition is improved, and the subsequent voice interaction process is facilitated.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow diagram illustrating a method of speech recognition for a robot in accordance with an exemplary embodiment;
FIG. 2 is a schematic view of a robot work scenario;
FIG. 3 is a flow chart of a method for generating enhanced speech information according to an embodiment of the present disclosure;
FIG. 4 is a flowchart illustrating a method for adjusting reference voice information according to an embodiment of the present disclosure;
FIG. 5 is a block diagram illustrating a voice recognition apparatus for a robot according to an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In one embodiment of the present disclosure, because a robot, particularly a legged robot, is often in motion, it may, compared with other smart devices, be very far away from a target object, and the position of the target object relative to the robot is not fixed. In addition, while the robot is running, especially outdoors, the environmental noise is very loud and the noise generated by the robot's own motion is also very loud, which makes it very difficult for the robot to recognize the voice signal of the target object. In view of this, the embodiments of the present disclosure provide a method for performing beam enhancement on the target object's voice according to the target object's position, so as to improve the accuracy of voice recognition.
Fig. 1 is a flowchart illustrating a voice recognition method for a robot according to an exemplary embodiment. As shown in Fig. 1, the voice recognition method for the robot includes the following steps:
Step 110, acquiring multi-channel voice information collected by the robot.
In one embodiment of the present disclosure, the robot may be a legged robot, such as a quadruped robot or a biped robot, among others. In other embodiments of the present disclosure, multiple microphones may be disposed in the robot. For example, one microphone is provided at the front and rear of the robot, while two microphones are provided at both sides of the robot. Thus, no matter which direction of the robot the target object is in, good voice collection can be carried out. In one embodiment of the disclosure, the target object may be a user, for example, a user bound to the robot.
Further, in one embodiment of the present disclosure, the plurality of microphones are all omni-directional microphones. In an embodiment of the present disclosure, multiple paths of voice information are collected by multiple microphones on the robot. The installation positions of the plurality of microphones on the robot body are different, so that the topological positions of the plurality of microphones can be formed according to the installation positions of the plurality of microphones.
Step 130, position information of the target object is acquired.
In some human-computer interaction scenarios, the robot sometimes cannot use a camera because of privacy concerns, and therefore cannot determine the position of the target object. Further, as described above, since both the environmental noise around the robot and the noise generated by the robot itself are very loud, the direction of the speaker cannot be accurately determined by the microphone array. Therefore, an embodiment of the present disclosure proposes a scheme for accurately positioning the target object by using a UWB (Ultra Wide Band) chip. In one embodiment of the present disclosure, UWB communication is performed between the robot and the smart device of the target object, thereby determining the location of the smart device. In an embodiment of the present disclosure, the smart device of the target object may be a smart wearable device or a mobile terminal such as a mobile phone. In an embodiment of the disclosure, an array antenna may be disposed in the robot and a UWB chip may be disposed in the smart device of the target object, so that the robot has a stronger spatial perception capability and can accurately determine the direction of the target object carrying the wearable device or mobile terminal.
In the embodiments of the present disclosure, far-field voice wake-up and voice recognition rely on the microphone array first performing voice enhancement. A beamforming-based multi-channel voice enhancement algorithm is effective only if the direction of arrival (DOA) is estimated accurately: only when the position of the speaker is accurately estimated can the target voice direction be effectively enhanced and noise from other directions be attenuated by the subsequent beam enhancement. Therefore, in the embodiments of the present disclosure, the position of the target object may be detected by the UWB chip and the array antenna, thereby achieving directional beam enhancement.
In one embodiment of the present disclosure, since the wearable device or mobile terminal of the target object carries a UWB chip, it can transmit a UWB signal through that chip. The robot can detect the UWB signal emitted by the target node through the antenna array, and a UWB positioning algorithm can be applied to the signals received by the individual antennas in the array to determine the direction of the target object.
In one embodiment of the present disclosure, since the robot may be operated in an outdoor noisy environment, voice information of many people may be collected by the robot at the same time. Fig. 2 is a schematic diagram of a working scene of the robot. In this embodiment, the intelligent robot receives the voice information from the target object (carrying the wearable device or the mobile terminal) and also receives the voice information of other people in the environment, but because the robot can accurately position the target object, the beam enhancement can be performed on the voice information in this direction, thereby achieving the effect of improving the accuracy of voice recognition.
In the above embodiments, the UWB signal is used to locate the target object; however, in other embodiments of the disclosure, other ways may be used to locate the target object, all of which fall within the scope of the disclosure. Of course, in the embodiments of the present disclosure, the target object can be accurately located by means of the UWB signal, which further improves the effect of beam enhancement.
In one embodiment of the present disclosure, an ultra-wideband UWB signal transmitted by a smart device of a target object may be received through an array antenna to generate a plurality of UWB reception signals, and position information of the target object may be generated from the plurality of UWB reception signals.
In one embodiment of the present disclosure, the array antenna includes a plurality of antennas. First, the time of flight (TOF) or the time difference of arrival (TDOA) of the plurality of UWB reception signals received by the plurality of antennas is obtained; then, the position information of the target object is generated according to the positions of the plurality of antennas and the times of flight or time differences of arrival of the plurality of UWB reception signals.
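As a minimal sketch of how a position might be derived from times of flight and antenna positions, the following assumes a 2-D layout with exactly three non-collinear antennas, one-way flight times, and ideal noise-free measurements. The function name `locate_from_tof`, the antenna coordinates, and the linearized trilateration approach are assumptions made for illustration; the disclosure does not specify a particular positioning algorithm.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def locate_from_tof(antennas, tofs):
    """Estimate a 2-D position from one-way times of flight to three antennas.

    antennas: three (x, y) antenna positions in metres (hypothetical layout).
    tofs: one-way flight times in seconds, one per antenna.
    """
    (x0, y0), (x1, y1), (x2, y2) = antennas
    d0, d1, d2 = (SPEED_OF_LIGHT * t for t in tofs)
    # Subtracting the first range equation from the other two removes the
    # quadratic terms and leaves a 2x2 linear system in (x, y).
    a11, a12 = 2 * (x1 - x0), 2 * (y1 - y0)
    a21, a22 = 2 * (x2 - x0), 2 * (y2 - y0)
    b1 = d0**2 - d1**2 + x1**2 - x0**2 + y1**2 - y0**2
    b2 = d0**2 - d2**2 + x2**2 - x0**2 + y2**2 - y0**2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("antennas must not be collinear")
    # Solve the system with Cramer's rule.
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y


# Usage: simulate flight times for a target at (3, 4) metres.
antennas = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2)]
target = (3.0, 4.0)
tofs = [math.dist(a, target) / SPEED_OF_LIGHT for a in antennas]
print(locate_from_tof(antennas, tofs))  # approximately (3.0, 4.0)
```

With real measurements the ranges are noisy and a closely spaced antenna array makes this system sensitive to error, so a practical implementation would more likely use additional antennas and a least-squares or angle-of-arrival formulation.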
It should be noted that, in the above embodiments, although UWB is taken as an example to locate the target object, in other embodiments of the present disclosure, other location technologies may also be selected to locate the target object, for example, WIFI location technology, bluetooth location technology, RF (radio frequency) location technology, Zigbee location technology, and the like, all of which can implement location of the target object.
Step 150, enhancing the multi-channel voice information according to the position information of the target object to generate enhanced voice information.
In one embodiment of the present disclosure, beam enhancement may be used to enhance the multi-channel voice information according to the position of the target object, thereby generating enhanced voice information. In the embodiments of the disclosure, beamforming may use the spatial information carried by the microphone array to perform frequency domain filtering on the multiple voice signals and finally output a single enhanced signal. The output enhanced signal is expected to retain as much of the target object's voice signal as possible while removing as much of the noise signal as possible.
Step 170, performing speech recognition on the enhanced speech information to generate a speech recognition result.
In the embodiment of the disclosure, because the enhanced voice information enhances the voice signal therein and attenuates the noise signal, the signal-to-noise ratio is improved, and accordingly, the accuracy of voice recognition can be improved, thereby providing a good basis for subsequent voice interaction.
Fig. 3 is a flowchart of a method for generating enhanced voice information according to an embodiment of the present disclosure. This embodiment proposes a method for performing voice enhancement based on the position information of the target object, the method comprising the following steps:
Step 310, selecting base voice information and reference voice information from the multi-channel voice information according to the position information of the target object.
In one embodiment of the present disclosure, since the positions of the plurality of microphones on the robot are fixed, once the position information of the target object is obtained, the position of the target object relative to the robot can be derived, so that the microphone facing the position of the target object can be selected from among the plurality of microphones. In an embodiment of the present disclosure, the topological positions of the plurality of microphones may be pre-stored in the robot, so that, after obtaining the position information of the target object, the robot can determine the microphone facing the target object according to the position of the target object and the topological positions of the plurality of microphones. The voice information collected by the microphone facing the position of the target object is used as the base voice information, and the other voice information in the multi-channel voice information is used as reference voice information. Since the microphone facing the target object is closest to the target object, it picks up the target object's voice earliest; therefore, in the embodiments of the present disclosure, the voice information collected by this microphone is used as the base voice information, and the voice information collected by the other microphones is used as reference voice information. In one embodiment of the present disclosure, the base voice information is received by the robot first, and the other voice information arrives later than the base voice information and is therefore referred to as reference voice information.
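The selection in step 310 can be sketched as follows. This is only an illustration under stated assumptions: the microphone facing the target is taken to be the one closest to it (as the paragraph above suggests), positions are 2-D coordinates in the robot's own frame, and the name `select_base_microphone` and the example layout are hypothetical.

```python
import math


def select_base_microphone(mic_positions, target_position):
    """Split microphones into one base microphone and the reference microphones.

    mic_positions: list of (x, y) microphone positions in the robot frame,
        standing in for the pre-stored topological positions.
    target_position: (x, y) position of the target object in the same frame.
    Returns (base_index, reference_indices).
    """
    distances = [math.dist(pos, target_position) for pos in mic_positions]
    # The closest microphone is treated as the one facing the target object.
    base_index = min(range(len(distances)), key=distances.__getitem__)
    reference_indices = [i for i in range(len(mic_positions)) if i != base_index]
    return base_index, reference_indices


# Usage with a hypothetical layout: front, rear, left, right microphones.
mics = [(0.3, 0.0), (-0.3, 0.0), (0.0, 0.15), (0.0, -0.15)]
base, references = select_base_microphone(mics, (2.0, 0.5))
print(base, references)
```

The voice channel recorded by the base microphone would then serve as the channel that the other channels are aligned to.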
Step 330, adjusting the reference voice information according to the base voice information.
In one embodiment of the present disclosure, as described above, there is a time difference of arrival between the base voice information and the reference voice information; that is, the base voice information and the several pieces of reference voice information are not aligned. In view of this, an embodiment of the present disclosure provides a method for adjusting the reference voice information according to the base voice information, which is described in detail in the following embodiments and is not repeated here.
Step 350, generating enhanced voice information according to the base voice information and the adjusted reference voice information.
In an embodiment of the present disclosure, since the reference voice information has been adjusted according to the base voice information in step 330 so that the two are aligned, the base voice information and the adjusted reference voice information may be added to generate the enhanced voice information.
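The summation in step 350 can be written compactly. The notation here is an assumption introduced for illustration and does not appear in the original text: $x_0[n]$ is the channel from the microphone facing the target object, $x_i[n]$ for $i = 1, \dots, M-1$ are the other channels, and $\tau_i$ is the delay, in samples, of channel $i$ relative to $x_0$.

```latex
y[n] = x_0[n] + \sum_{i=1}^{M-1} x_i[n + \tau_i]
```

Under the further idealized assumption that, after alignment, every channel carries the same voice component with power $P_s$ plus noise of equal power $\sigma^2$ that is uncorrelated across channels, the voice power in $y$ grows by a factor of $M^2$ while the noise power grows only by a factor of $M$:

```latex
\mathrm{SNR}_{\text{out}} = \frac{M^2 P_s}{M \sigma^2} = M \cdot \mathrm{SNR}_{\text{in}}
```

This corresponds to a gain of $10 \log_{10} M$ dB; real robot noise is rarely fully uncorrelated across microphones, so the gain in practice would be expected to be smaller.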
Fig. 4 is a flowchart of a method for adjusting reference voice information according to an embodiment of the present disclosure, where the method includes the following steps:
Step 410, taking the microphone facing the position of the target object as a base microphone, and taking the other microphones among the plurality of microphones as reference microphones.
Step 430, generating time delay information of each reference microphone relative to the base microphone according to the topological position information of the plurality of microphones.
In one embodiment of the present disclosure, since the positions of the plurality of microphones on the robot are fixed, the topological position information is also fixed, so that the time delay information between the microphones can be calculated. In this embodiment, since the microphone facing the position of the target object is used as the base microphone, the time delay information of the reference microphones relative to the base microphone can be obtained from the topological position information.
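One way the delay calculation of step 430 might look is sketched below. It assumes that the target position from the positioning step is used together with the microphone positions, that sound travels at roughly 343 m/s, and that delays are rounded to whole samples; the function name `reference_delays` and these choices are illustrative assumptions rather than details given in the disclosure.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, assumed value for air at about 20 °C


def reference_delays(mic_positions, base_index, target_position, sample_rate):
    """Return each other microphone's delay, in samples, relative to the base one.

    mic_positions: list of (x, y) microphone positions in the robot frame.
    base_index: index of the microphone facing the target object.
    target_position: (x, y) position of the target object in the same frame.
    sample_rate: audio sampling rate in Hz.
    """
    base_distance = math.dist(mic_positions[base_index], target_position)
    delays = {}
    for index, position in enumerate(mic_positions):
        if index == base_index:
            continue
        # Extra acoustic path length compared with the base microphone.
        extra_path = math.dist(position, target_position) - base_distance
        delays[index] = round(extra_path / SPEED_OF_SOUND * sample_rate)
    return delays


# Usage with a hypothetical layout: front, rear, left, right microphones.
mics = [(0.3, 0.0), (-0.3, 0.0), (0.0, 0.15), (0.0, -0.15)]
print(reference_delays(mics, 0, (2.0, 0.5), 16000))
```

Because the microphone spacing on a robot body is small, the delays amount to only a few samples at typical speech sampling rates, which is why sub-sample (fractional) delays are often used in practice.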
Step 450, adjusting the reference voice information according to the time delay information of the reference microphones, so that the reference voice information is aligned with the base voice information.
In an embodiment of the present disclosure, the reference voice information may be adjusted according to the time delay information of the reference microphones, for example by advancing each piece of reference voice information by its corresponding delay, so that the reference voice information is aligned with the base voice information. Once they are aligned, the voice signal and the noise signal in the base voice information and the reference voice information are also aligned, and the several pieces of reference voice information can then be superimposed on the base voice information to enhance the voice signal.
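The alignment and superposition described above correspond to what is commonly called delay-and-sum beamforming. A minimal sketch, assuming integer sample delays and channels stored as plain Python lists, is shown here; the function name `delay_and_sum` and the data layout are hypothetical.

```python
def delay_and_sum(base_signal, other_signals, delays):
    """Advance each delayed channel by its lag and add it to the base channel.

    base_signal: samples from the microphone facing the target object.
    other_signals: dict mapping microphone index to that channel's samples.
    delays: dict mapping microphone index to its lag in samples relative to
        the base channel (positive when the channel arrives later).
    """
    enhanced = list(base_signal)
    for index, signal in other_signals.items():
        lag = delays[index]
        for t in range(len(enhanced)):
            source = t + lag  # read ahead, which advances the channel by `lag`
            if 0 <= source < len(signal):
                enhanced[t] += signal[source]
    return enhanced


# Usage: the same pulse arrives two samples later on microphone 1.
base = [0.0, 1.0, 0.0, 0.0, 0.0]
others = {1: [0.0, 0.0, 0.0, 1.0, 0.0]}
print(delay_and_sum(base, others, {1: 2}))  # [0.0, 2.0, 0.0, 0.0, 0.0]
```

Samples that would have to be read from beyond the end of a channel are simply skipped, which is one of several possible edge-handling choices.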
Fig. 5 is a block diagram illustrating a voice recognition apparatus for a robot according to an exemplary embodiment. In an embodiment of the present disclosure, the voice recognition apparatus 500 for a robot includes an acquisition module 510, a position information acquiring module 520, an enhancement module 530, and a recognition module 540. The acquisition module 510 is configured to acquire the multi-channel voice information collected by the robot. The position information acquiring module 520 is configured to acquire the position information of the target object. The enhancement module 530 is configured to enhance the multi-channel voice information according to the position information of the target object to generate enhanced voice information. The recognition module 540 is configured to perform voice recognition on the enhanced voice information to generate a voice recognition result.
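The module structure of the apparatus can be pictured as four pluggable stages. The sketch below is purely illustrative: the class name `VoiceRecognitionApparatus`, its field names, and the callable signatures are assumptions, and the stages in the usage example are stand-in stubs rather than real acquisition, positioning, or recognition code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Channels = Sequence[Sequence[float]]
Position = Tuple[float, float]


@dataclass
class VoiceRecognitionApparatus:
    acquire: Callable[[], Channels]                            # module 510
    locate: Callable[[], Position]                             # module 520
    enhance: Callable[[Channels, Position], Sequence[float]]   # module 530
    recognize: Callable[[Sequence[float]], str]                # module 540

    def run(self) -> str:
        """Run the four stages in the order given by the method steps."""
        channels = self.acquire()
        position = self.locate()
        enhanced = self.enhance(channels, position)
        return self.recognize(enhanced)


# Usage with stub stages that only demonstrate the data flow.
apparatus = VoiceRecognitionApparatus(
    acquire=lambda: [[1.0, 2.0], [3.0, 4.0]],
    locate=lambda: (1.0, 0.0),
    enhance=lambda chans, pos: [sum(column) for column in zip(*chans)],
    recognize=lambda audio: f"{len(audio)} samples",
)
print(apparatus.run())  # 2 samples
```

Keeping each stage behind a callable makes it straightforward to swap, for example, the UWB-based positioning stage for another positioning technology.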
In one embodiment of the present disclosure, the robot may be a legged robot, such as a quadruped robot or a biped robot, among others. In other embodiments of the present disclosure, multiple microphones may be disposed in the robot. For example, one microphone is provided at the front and rear of the robot, while two microphones are provided at both sides of the robot. Thus, no matter which direction of the robot the target object is in, good voice collection can be carried out. Further, in one embodiment of the present disclosure, the plurality of microphones are all omni-directional microphones. In an embodiment of the present disclosure, multiple paths of voice information are collected by multiple microphones on the robot. The installation positions of the plurality of microphones on the robot body are different, so that the topological positions of the plurality of microphones can be formed according to the installation positions of the plurality of microphones.
In some human-computer interaction scenarios, due to privacy concerns, sometimes the robot cannot use the camera, and therefore cannot know the position of the target object. Further, as described above, since both the environmental noise of the robot and the noise generated by the robot are very large, the direction of the speaker cannot be accurately determined by the microphone array. Therefore, in an embodiment of the present disclosure, a scheme for accurately positioning a target object by using an UWB (Ultra Wide Band) chip is proposed. In one embodiment of the present disclosure, UWB communication is performed between the robot and the smart device of the target object, thereby determining the location of the smart device of the target object. In an embodiment of the present disclosure, the smart device of the target object may be a smart wearable device, or may be a mobile terminal, such as a mobile phone. In an embodiment of the disclosure, an array antenna may be disposed in a robot, and a UWB chip is disposed in an intelligent terminal of a target object, so that the robot has a stronger spatial perception capability, and a direction of the target object carrying a wearable device or a mobile terminal may be accurately determined.
In the embodiments of the present disclosure, far-field voice wake-up and voice recognition rely on the microphone array to first perform voice enhancement processing. A multi-channel voice enhancement algorithm based on beamforming is effective only if the direction of arrival (DOA) is estimated accurately: only when the position of the speaker is accurately estimated can the target voice direction be effectively enhanced and noise from other directions be attenuated by the subsequent beam enhancement. Therefore, in the embodiments of the present disclosure, the position of the target object may be detected by the UWB chip and the array antenna, thereby achieving directional beam enhancement.
In one embodiment of the present disclosure, since the wearable device or the mobile terminal of the target object carries a UWB chip, a UWB signal can be transmitted through the UWB chip. The robot can detect the UWB signal emitted by the target node through the antenna array, and a UWB signal positioning algorithm can be applied to the signals received by each antenna in the antenna array to determine the direction of the target object.
In one embodiment of the present disclosure, since the robot may be operated in an outdoor noisy environment, voice information of many people may be collected by the robot at the same time. Fig. 2 is a schematic diagram of a working scene of the robot. In this embodiment, the intelligent robot receives the voice information from the target object (carrying the wearable device or the mobile terminal) and also receives the voice information of other people in the environment, but because the robot can accurately position the target object, the beam enhancement can be performed on the voice information in this direction, thereby achieving the effect of improving the accuracy of voice recognition.
In the above embodiments, the positioning of the target object by using the UWB signal is proposed, however, in other embodiments of the present disclosure, other methods may be used for positioning, and these methods are all within the protection scope of the present disclosure. Of course, in the embodiments of the present disclosure, the target object can be accurately located by means of the UWB signal, so as to further improve the effect of beam enhancement.
In one embodiment of the present disclosure, an ultra-wideband UWB signal transmitted by a smart device of a target object may be received through an array antenna to generate a plurality of UWB reception signals, and position information of the target object may be generated from the plurality of UWB reception signals.
In one embodiment of the present disclosure, a beam enhancement mode may be used to enhance multiple channels of speech information according to the position of a target object, thereby generating enhanced speech information. In the embodiment of the disclosure, the beamforming may perform frequency domain filtering on multiple voice signals by using spatial information carried by a microphone array, and finally output one path of enhanced signal, where the output enhanced signal is expected to be able to retain the voice signal of the target object as much as possible, and simultaneously remove the noise signal as much as possible.
In the embodiment of the disclosure, because the enhanced voice information enhances the voice signal therein and attenuates the noise signal, the signal-to-noise ratio is improved, and accordingly, the accuracy of voice recognition can be improved, thereby providing a good basis for subsequent voice interaction.
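The signal-to-noise improvement can be made concrete under simplifying assumptions that the disclosure itself does not state: each of the N aligned microphone channels carries the same voice signal s(t) with power \sigma_s^2, plus noise n_m(t) of equal power \sigma_n^2 that is uncorrelated between channels. Summing the aligned channels then gives, as a rough sketch:

```latex
y(t) = \sum_{m=1}^{N} \bigl( s(t) + n_m(t) \bigr), \qquad
\mathrm{SNR}_{\text{out}} = \frac{N^{2}\,\sigma_s^{2}}{N\,\sigma_n^{2}} = N \cdot \mathrm{SNR}_{\text{in}}
```

Under these assumptions the gain is 10 log10(N) dB, for example about 6 dB for four microphones. Real robot noise is often partly correlated between microphones, so the gain in practice may be smaller.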
In one embodiment of the present disclosure, the enhancement module 530 includes a selection submodule 531, an adjustment submodule 532, and an enhancement submodule 533. The selection submodule 531 is configured to select base voice information and reference voice information from the multiple paths of voice information according to the position information of the target object. The adjustment submodule 532 is configured to adjust the reference voice information according to the base voice information. The enhancement submodule 533 is configured to generate enhanced voice information according to the base voice information and the adjusted reference voice information.
In one embodiment of the present disclosure, the acquisition module 510 includes a plurality of microphones. The selection submodule 531 determines a microphone facing the position of the target object according to the position of the target object and the topological positions of the plurality of microphones, uses the voice information collected by the microphone facing the position of the target object as the base voice information, and uses the other voice information among the multiple paths of voice information as the reference voice information.
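The selection step can be sketched as follows. This is only an illustration: the disclosure does not define how "facing" is computed, so the sketch assumes it means the microphone whose bearing from the robot body centre is closest to the bearing of the target. The layout `MIC_POSITIONS`, the function name `select_base_microphone`, and the 2-D body frame are all hypothetical.

```python
import math

# Hypothetical microphone layout in the robot body frame (metres):
# x points forward, y points to the robot's left.
MIC_POSITIONS = {
    "front": (0.30, 0.00),
    "rear": (-0.30, 0.00),
    "left": (0.00, 0.15),
    "right": (0.00, -0.15),
}


def select_base_microphone(target_xy, mic_positions):
    """Pick the microphone facing the target; the rest become reference mics.

    Assumption: "facing" means the smallest angular gap between the
    microphone's bearing and the target's bearing, both measured from
    the body centre.
    """
    target_bearing = math.atan2(target_xy[1], target_xy[0])

    def angular_gap(name):
        x, y = mic_positions[name]
        diff = math.atan2(y, x) - target_bearing
        # Wrap the difference into [-pi, pi] before taking its magnitude.
        return abs(math.atan2(math.sin(diff), math.cos(diff)))

    base = min(mic_positions, key=angular_gap)
    references = [name for name in mic_positions if name != base]
    return base, references


base, references = select_base_microphone((2.0, 0.5), MIC_POSITIONS)
```

With the target slightly to the left of straight ahead, the front microphone is chosen as the base and the remaining three serve as references. A nearest-distance criterion would be an equally plausible reading of the text.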
In an embodiment of the disclosure, the adjustment submodule 532 takes the microphone facing the position of the target object as a base microphone and the other microphones of the plurality of microphones as reference microphones, generates time delay information of each reference microphone relative to the base microphone according to topological position information of the plurality of microphones, and adjusts the reference voice information according to the time delay information of the reference microphone so as to align the reference voice information with the base voice information.
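One way to picture the time delay computation is the sketch below. It assumes a direct sound path and a fixed speed of sound, neither of which is specified in the disclosure, and it takes the microphone facing the target as the one the other channels are aligned to. The names `delay_samples` and `align`, the 343 m/s constant, and the integer-sample shift are illustrative choices.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at about 20 °C (assumed)


def delay_samples(base_xy, other_xy, target_xy, sample_rate):
    """Delay of another microphone relative to the base microphone, in samples.

    A positive value means the sound reaches the other microphone later
    than it reaches the base microphone.
    """
    extra_path = math.dist(other_xy, target_xy) - math.dist(base_xy, target_xy)
    return round(extra_path / SPEED_OF_SOUND * sample_rate)


def align(signal, delay):
    """Shift `signal` earlier by `delay` samples (later if negative).

    The output has the same length as the input; vacated samples are
    filled with zeros.
    """
    n = len(signal)
    if delay >= 0:
        return signal[delay:] + [0.0] * min(delay, n)
    return [0.0] * min(-delay, n) + signal[:max(n + delay, 0)]


# Front microphone as base, rear microphone as the other channel,
# target 3 m straight ahead, 16 kHz sampling.
lag = delay_samples((0.30, 0.0), (-0.30, 0.0), (3.0, 0.0), 16000)
```

Here the rear microphone is 0.6 m farther from the target than the front one, which at 16 kHz corresponds to a lag of 28 samples. A practical system would likely use fractional delays or frequency-domain phase shifts rather than whole-sample shifts.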
In one embodiment of the present disclosure, the enhancement submodule 533 adds the base voice information and the aligned reference voice information to generate the enhanced voice information.
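The addition step corresponds to what is commonly called delay-and-sum beamforming. The sketch below is a minimal version under the assumption that each of the other channels lags the target-facing channel by a known whole number of samples; the function name `delay_and_sum` and the toy signals are hypothetical, and the disclosure does not say whether the sum is normalised.

```python
def delay_and_sum(primary, others, delays):
    """Add time-aligned channels to the primary (target-facing) channel.

    `delays[k]` is the lag, in samples, of `others[k]` relative to
    `primary`. Sample i of the primary channel is paired with sample
    i + delays[k] of channel k; samples outside a channel are skipped.
    """
    out = list(primary)
    for channel, delay in zip(others, delays):
        for i in range(len(out)):
            j = i + delay
            if 0 <= j < len(channel):
                out[i] += channel[j]
    return out


# Toy example: the same unit pulse arrives 1 and 2 samples later
# on the two other channels.
primary = [0.0, 1.0, 0.0, 0.0, 0.0]
others = [
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
enhanced = delay_and_sum(primary, others, [1, 2])
```

After alignment the three pulses coincide, so the summed pulse has three times the amplitude of any single channel, while sound arriving from other directions would not line up and would add less coherently.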
In one embodiment of the present disclosure, the position information acquisition module 520 includes an array antenna 521 and a generation sub-module 522. The array antenna 521 is used for receiving a UWB signal transmitted by a smart device of a target object to generate a plurality of UWB receiving signals. The generation sub-module 522 is configured to generate position information of the target object according to the plurality of UWB receiving signals.
In one embodiment of the present disclosure, the array antenna 521 includes a plurality of antennas, and the generation sub-module 522 obtains the Time of Flight (TOF) or Time Difference of Arrival (TDOA) of the plurality of UWB receiving signals received by the plurality of antennas, and generates the position information of the target object according to the positions of the plurality of antennas and the time of flight or time difference of arrival of the plurality of UWB receiving signals.
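The disclosure does not specify the positioning algorithm. As one hedged illustration of how a time difference of arrival can yield a direction, the sketch below uses the simplest case: two antennas, a far-field (plane-wave) assumption, and a bearing measured from the broadside of the antenna pair. The function name `bearing_from_tdoa` and the 5 cm spacing are hypothetical.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def bearing_from_tdoa(tdoa_s, antenna_spacing_m):
    """Estimate the bearing of a UWB transmitter from one antenna pair.

    `tdoa_s` is the arrival time at antenna 2 minus the arrival time at
    antenna 1, in seconds. The result is in degrees from broadside and
    is positive when the transmitter is on antenna 1's side.
    Assumes a plane wave, i.e. the transmitter is far away compared
    with the antenna spacing.
    """
    ratio = SPEED_OF_LIGHT * tdoa_s / antenna_spacing_m
    # Measurement noise can push the ratio slightly outside [-1, 1].
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))


spacing = 0.05  # metres between the two antennas (assumed)
true_bearing = 30.0
tdoa = spacing * math.sin(math.radians(true_bearing)) / SPEED_OF_LIGHT
estimate = bearing_from_tdoa(tdoa, spacing)
```

The simulated time difference is a fraction of a nanosecond, which illustrates why such estimates depend on very fine timing or phase resolution. With more than two antennas, several pairwise estimates could be combined, and time-of-flight ranging could add distance to the bearing.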
According to a third aspect of the embodiments of the present disclosure, there is also provided a robot including: a plurality of microphones; a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the speech recognition method for a robot as described above.
According to a fourth aspect of embodiments of the present disclosure, there is also provided a storage medium having instructions that, when executed by a processor of a voice recognition apparatus for a robot, enable the voice recognition apparatus for the robot to perform the voice recognition method for the robot as described above.
According to the embodiment of the disclosure, through the acquired position information of the target object, the beam enhancement can be performed on the multi-path voice information acquired by the robot, so that the voice signal is enhanced, the accuracy of voice recognition is improved, and the follow-up voice interaction process is facilitated.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (16)

1. A speech recognition method for a robot, comprising the steps of:
acquiring multi-channel voice information acquired by a robot;
acquiring position information of a target object;
enhancing the multi-channel voice information according to the position information of the target object to generate enhanced voice information; and
performing voice recognition on the enhanced voice information to generate a voice recognition result.
2. The method of claim 1, wherein said enhancing the multi-channel voice information according to the position information of the target object to generate enhanced voice information comprises:
selecting base voice information and reference voice information from the multi-channel voice information according to the position information of the target object;
adjusting the reference voice information according to the base voice information; and
generating the enhanced voice information according to the base voice information and the adjusted reference voice information.
3. The method of claim 2, wherein the robot includes a plurality of microphones, and wherein the selecting base voice information and reference voice information from the multi-channel voice information according to the position information of the target object comprises:
determining a microphone facing the position of the target object according to the position of the target object and the topological positions of the plurality of microphones; and
using the voice information collected by the microphone facing the position of the target object as the base voice information, and using other voice information in the multi-channel voice information as the reference voice information.
4. The method of claim 3, wherein said adjusting the reference voice information according to the base voice information comprises:
taking the microphone facing the position of the target object as a base microphone and the other microphones among the plurality of microphones as reference microphones;
generating time delay information of the reference microphone relative to the base microphone according to topological position information of the plurality of microphones; and
adjusting the reference voice information according to the time delay information of the reference microphone so as to align the reference voice information with the base voice information.
5. The method of claim 4, wherein said generating the enhanced voice information according to the base voice information and the adjusted reference voice information comprises:
adding the base voice information and the aligned reference voice information to generate the enhanced voice information.
6. The method of claim 1, wherein the obtaining location information of the target object comprises:
receiving ultra-wideband UWB signals sent by the intelligent device of the target object through an array antenna to generate a plurality of UWB receiving signals; and
generating the position information of the target object according to the plurality of UWB receiving signals.
7. The method of claim 6, wherein said array antenna comprises a plurality of antennas, said generating position information of said target object from said plurality of UWB receive signals comprising:
acquiring flight times or arrival time differences of the plurality of UWB receiving signals received by the plurality of antennas;
and generating the position information of the target object according to the positions of the plurality of antennas and the time of flight or the time difference of arrival of the plurality of UWB receiving signals.
8. A speech recognition apparatus for a robot, comprising:
the acquisition module is used for acquiring the multi-channel voice information acquired by the robot;
the position information acquisition module is used for acquiring the position information of the target object;
the enhancement module is used for enhancing the multi-channel voice information according to the position information of the target object so as to generate enhanced voice information; and
the recognition module is used for carrying out voice recognition on the enhanced voice information to generate a voice recognition result.
9. The apparatus of claim 8, wherein the enhancement module comprises:
the selection submodule is used for selecting base voice information and reference voice information from the multi-channel voice information according to the position information of the target object;
the adjustment submodule is used for adjusting the reference voice information according to the base voice information; and
the enhancement submodule is used for generating the enhanced voice information according to the base voice information and the adjusted reference voice information.
10. The apparatus of claim 9, wherein the acquisition module includes a plurality of microphones, wherein the selection submodule determines a microphone facing the position of the target object according to the position of the target object and topological positions of the plurality of microphones, uses voice information collected by the microphone facing the position of the target object as the base voice information, and uses other voice information in the multi-channel voice information as the reference voice information.
11. The apparatus of claim 10, wherein the adjustment submodule takes the microphone facing the position of the target object as a base microphone and the other microphones of the plurality of microphones as reference microphones, generates time delay information of the reference microphone relative to the base microphone according to topological position information of the plurality of microphones, and adjusts the reference voice information according to the time delay information of the reference microphone to align the reference voice information with the base voice information.
12. The apparatus of claim 11, wherein the enhancement submodule adds the base voice information and the aligned reference voice information to generate the enhanced voice information.
13. The apparatus of claim 8, wherein the location information acquisition module comprises:
an array antenna for receiving a UWB signal transmitted by the smart device of the target object to generate a plurality of UWB reception signals; and
a generating sub-module for generating position information of the target object from the plurality of UWB receiving signals.
14. The apparatus of claim 13, wherein said array antenna includes a plurality of antennas, and said generating sub-module acquires times of flight or time differences of arrival of said plurality of UWB receiving signals received by said plurality of antennas and generates the position information of said target object according to the positions of said plurality of antennas and the times of flight or time differences of arrival of said plurality of UWB receiving signals.
15. A robot, comprising:
a plurality of microphones;
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the speech recognition method for a robot of any of claims 1 to 7.
16. A storage medium in which instructions, when executed by a processor of a voice recognition apparatus for a robot, enable the voice recognition apparatus for a robot to perform the voice recognition method for a robot according to any one of claims 1 to 7.
CN202011420360.9A 2020-12-07 2020-12-07 Robot and voice recognition method and device for same Pending CN114596848A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011420360.9A CN114596848A (en) 2020-12-07 2020-12-07 Robot and voice recognition method and device for same

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011420360.9A CN114596848A (en) 2020-12-07 2020-12-07 Robot and voice recognition method and device for same

Publications (1)

Publication Number Publication Date
CN114596848A true CN114596848A (en) 2022-06-07

Family

ID=81802537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011420360.9A Pending CN114596848A (en) 2020-12-07 2020-12-07 Robot and voice recognition method and device for same

Country Status (1)

Country Link
CN (1) CN114596848A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024000834A1 (en) * 2022-06-30 2024-01-04 歌尔股份有限公司 Beam-forming function implementation method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102324237A (en) * 2011-05-30 2012-01-18 深圳市华新微声学技术有限公司 Microphone array voice wave beam formation method, speech signal processing device and system
CN103544959A (en) * 2013-10-25 2014-01-29 华南理工大学 Verbal system and method based on voice enhancement of wireless locating microphone array
CN106346487A (en) * 2016-08-25 2017-01-25 威仔软件科技(苏州)有限公司 Interactive VR sand table show robot
CN106710601A (en) * 2016-11-23 2017-05-24 合肥华凌股份有限公司 Voice signal de-noising and pickup processing method and apparatus, and refrigerator
CN110010147A (en) * 2019-03-15 2019-07-12 厦门大学 A kind of method and system of Microphone Array Speech enhancing
CN110223686A (en) * 2019-05-31 2019-09-10 联想(北京)有限公司 Audio recognition method, speech recognition equipment and electronic equipment

Similar Documents

Publication Publication Date Title
JP7158806B2 (en) Audio recognition methods, methods of locating target audio, their apparatus, and devices and computer programs
US9408011B2 (en) Automated user/sensor location recognition to customize audio performance in a distributed multi-sensor environment
Ishi et al. Evaluation of a MUSIC-based real-time sound localization of multiple sound sources in real noisy environments
US9432770B2 (en) Method and device for localizing sound sources placed within a sound environment comprising ambient noise
JP7004609B2 (en) Radar image processing methods, equipment and systems
Nakamura et al. Intelligent sound source localization for dynamic environments
JP6240995B2 (en) Mobile object, acoustic source map creation system, and acoustic source map creation method
EP2449798B2 (en) A system and method for estimating the direction of arrival of a sound
CN111025233A (en) Sound source direction positioning method and device, voice equipment and system
CN108962272A (en) Sound pick-up method and system
CN110495185B (en) Voice signal processing method and device
CN105679328A (en) Speech signal processing method, device and system
JP2004507767A (en) System and method for processing a signal emitted from a target signal source into a noisy environment
US10015588B1 (en) Beamforming optimization for receiving audio signals
CN112098942B (en) Positioning method of intelligent equipment and intelligent equipment
US9990939B2 (en) Methods and apparatus for broadened beamwidth beamforming and postfiltering
CN208956308U (en) Distinguishable sound bearing is to promote the audio signal reception device of reception
Ayllón et al. Indoor blind localization of smartphones by means of sensor data fusion
US11657829B2 (en) Adaptive noise cancelling for conferencing communication systems
CN114596848A (en) Robot and voice recognition method and device for same
CN115331692A (en) Noise reduction method, electronic device and storage medium
CN112859000B (en) Sound source positioning method and device
Brendel et al. Localization of multiple simultaneously active sources in acoustic sensor networks using ADP
CN109672465B (en) Method, equipment and system for adjusting antenna gain
Marquardt et al. Performance comparison of bilateral and binaural MVDR-based noise reduction algorithms in the presence of DOA estimation errors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination