CN111009239A - Echo cancellation method, echo cancellation device and electronic equipment

Info

Publication number: CN111009239A
Application number: CN201911129309.XA
Authority: CN
Inventors: 刘华航; 刘坚强; 钱庄
Original assignee: Beijing Xiaomi Mobile Software Co Ltd
Current assignee: Beijing Xiaomi Mobile Software Co Ltd
Priority date: 2019-11-18
Filing date: 2019-11-18
Publication date: 2020-04-14

Abstract

The present disclosure relates to an echo cancellation method, an echo cancellation device, an electronic device, and a computer-readable storage medium. The echo cancellation method is applied to a cloud server and comprises the following steps: collecting a user voice instruction; generating a corresponding response voice based on the voice instruction, and controlling a corresponding device to play the response voice; and controlling a voice acquisition device to ignore the response voice, so that the response voice is not captured by the voice acquisition device and mistaken for a user voice instruction. Because the generated and played response voice is ignored, it is not recognized again, which avoids misjudging the user's actual intent, reduces the computation spent on voice recognition, and helps keep multi-round voice interaction reliable and accurate.

Description

Echo cancellation method, echo cancellation device and electronic equipment

Technical Field

The present disclosure relates to the field of intelligent voice control, and in particular, to an echo cancellation method, an echo cancellation apparatus, an electronic device, and a computer-readable storage medium.

Background

With the development of smart homes, a user can conveniently control smart devices through interaction between various devices and a cloud server. The user can issue a voice command to a smart speaker, and the smart speaker gives voice prompt feedback via the server; alternatively, when multiple devices are linked, the server can give voice prompt feedback through another device. Through multiple rounds of interaction, the user is provided with the corresponding information or the user's needs are met.

However, when the smart speaker continuously picks up sound through its microphone, invalid captured audio is also sent to the server, and the server still performs voice recognition and judgment on it. This causes logic confusion and misjudgment, lowers the accuracy of multi-round interaction, increases transmission cost, and wastes a large amount of computation on voice recognition.

Disclosure of Invention

To overcome the problems in the related art, the present disclosure provides an echo cancellation method, an echo cancellation device, an electronic device, and a computer-readable storage medium.

According to a first aspect of the embodiments of the present disclosure, there is provided an echo cancellation method applied to a cloud server, the method including: collecting a user voice instruction; generating a corresponding response voice based on the voice instruction, and controlling a corresponding device to play the response voice; and controlling a voice acquisition device to ignore the response voice, so that the response voice is not captured by the voice acquisition device and mistaken for a user voice instruction.

In one embodiment, controlling the voice acquisition device to ignore the response voice comprises: controlling the voice acquisition device to close its microphone while the response voice is being played.

In one embodiment, controlling the voice acquisition device to close its microphone while the response voice is being played includes: determining the playing duration of the response voice; and controlling the voice acquisition device to keep the microphone closed from the start of playback of the response voice until the playing duration has elapsed.

In one embodiment, controlling the voice acquisition device to ignore the response voice comprises: in response to the response voice being captured, controlling the voice acquisition device not to respond to the response voice.

In one embodiment, in response to the response voice being captured, controlling the voice acquisition device not to respond to the response voice includes: caching the response voice; receiving audio captured by the voice acquisition device; performing a voiceprint comparison between the response voice and the audio; in response to the voiceprint similarity between the response voice and the audio reaching a preset threshold, not responding to the audio; and in response to the voiceprint similarity between the response voice and the audio not reaching the preset threshold, performing voice recognition on the audio.

In an embodiment, the method further comprises: determining the elapsed time from sending the response voice to receiving the audio; in response to the elapsed time exceeding a preset threshold, performing voice recognition on the audio; and in response to the elapsed time not exceeding the preset threshold, performing the voiceprint comparison step.

According to a second aspect of the embodiments of the present disclosure, there is provided an echo cancellation device applied to a cloud server, the device including: an acquisition unit configured to collect a user voice instruction; a response unit configured to generate a corresponding response voice based on the voice instruction and to control a corresponding device to play the response voice; and a processing unit configured to control a voice acquisition device to ignore the response voice, so that the response voice is not captured by the voice acquisition device and mistaken for a user voice instruction.

In an embodiment, the processing unit is configured to: control the voice acquisition device to close its microphone while the response voice is being played.

In an embodiment, the processing unit is further configured to: determine the playing duration of the response voice; and control the voice acquisition device to keep the microphone closed from the start of playback of the response voice until the playing duration has elapsed.

In an embodiment, the processing unit is configured to: in response to the response voice being captured, control the voice acquisition device not to respond to the response voice.

In an embodiment, the processing unit further comprises: a buffer unit configured to cache the response voice; a receiving unit configured to receive audio captured by the voice acquisition device; and a voiceprint comparison unit configured to perform a voiceprint comparison between the response voice and the audio. In response to the voiceprint similarity between the response voice and the audio reaching a preset threshold, the processing unit does not respond to the audio; in response to the voiceprint similarity not reaching the preset threshold, the processing unit performs voice recognition on the audio.

In an embodiment, the processing unit is further configured to: determine the elapsed time from sending the response voice to receiving the audio; in response to the elapsed time exceeding a preset threshold, perform voice recognition on the audio; and in response to the elapsed time not exceeding the preset threshold, perform the voiceprint comparison step.

According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a memory to store instructions; and a processor for invoking the memory-stored instructions to perform the echo cancellation method of the first aspect.

According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by a processor, perform the echo cancellation method of the first aspect.

The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects: by rejecting or discarding the generated and played response voice, invalid captured audio is not recognized again, which avoids misjudging the user's actual intent, reduces the computation spent on voice recognition, and helps keep multi-round voice interaction reliable and accurate.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a flow diagram illustrating an echo cancellation method according to an exemplary embodiment;

FIG. 2 is a flow diagram illustrating another echo cancellation method in accordance with an exemplary embodiment;

FIG. 3 is a flow diagram illustrating another echo cancellation method in accordance with an exemplary embodiment;

FIG. 4 is a flow diagram illustrating a method of not responding to the response voice in an echo cancellation method according to an exemplary embodiment;

FIG. 5 is a flow diagram illustrating another method of not responding to the response voice in an echo cancellation method according to an exemplary embodiment;

FIG. 6 is a schematic block diagram illustrating an echo cancellation device in accordance with an exemplary embodiment;

FIG. 7 is a schematic block diagram illustrating another echo cancellation device in accordance with an exemplary embodiment;

FIG. 8 is a schematic block diagram illustrating an apparatus in accordance with an exemplary embodiment;

FIG. 9 is a schematic block diagram illustrating an electronic device in accordance with an exemplary embodiment.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

The present disclosure provides an echo cancellation method 10 applied to a cloud server, where the cloud server and other devices may belong to the same account and maintain a communication connection. Referring to fig. 1, the echo cancellation method 10 includes steps S11 to S13, described in detail below:

Step S11, collecting a user voice instruction.

The server may acquire the user's instruction through the voice acquisition device in order to respond. The user voice instruction may be obtained in various ways; for example, the user may send an instruction to the server through a mobile terminal such as a mobile phone. In some multi-round voice interaction scenarios, obtaining the user instruction may include: receiving user voice sent by the voice acquisition device, the user voice being captured through a microphone of the voice acquisition device; and performing voice recognition on the user voice to obtain the user instruction.

Step S12, generating a corresponding response voice based on the voice instruction, and controlling a corresponding device to play the response voice.

After receiving the user instruction, the server generates a response voice that provides information or prompts an operation, and may send the response voice to a corresponding device for playback as feedback to the user voice instruction. The playback device may be any device with a loudspeaker, such as a smart speaker or a smart television. In some scenarios, the response voice is sent to the smart speaker and played by it, thereby interacting with the user. In other scenarios, the smart speaker serves as the entry point for capturing user instructions while the smart television serves as another interactive device that displays related information or content as the user requires; the server can then send the generated response voice to the smart television, which plays it through its loudspeaker.

Step S13, controlling the voice acquisition device to ignore the response voice, so that the response voice is not captured by the voice acquisition device and mistaken for a user voice instruction.

The response voice is intended as feedback to the user. It is therefore ignored, so that it is not captured and recognized again and does not cause misjudgment, which safeguards the accuracy and reliability of the voice interaction process.

In one embodiment, as shown in fig. 2, step S13 may include: step S131, controlling the voice acquisition device to close its microphone while the response voice is being played.

Closing the microphone of the voice acquisition device that serves as the entry point for user instructions cuts off sound pickup, so the played response voice is not captured, transmitted, and recognized again. This avoids misrecognition and also saves the transmission cost and the voice recognition computation.

In an embodiment, step S131 may include: determining the playing duration of the response voice; and controlling the voice acquisition device to keep the microphone closed from the start of playback of the response voice until the playing duration has elapsed. In this embodiment, the voice acquisition device is controlled according to the playing duration of the response voice, so the microphone is reopened in time to receive any voice instruction the user issues in reply to the response voice, which improves the reliability of multi-round voice interaction.
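
The duration-based microphone control described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the disclosure: the names MuteWindow, playback_duration and mute_window_for are hypothetical, the response voice is assumed to be PCM audio whose duration follows from its per-channel sample count and sample rate, and times are plain seconds.

```python
from dataclasses import dataclass


@dataclass
class MuteWindow:
    """Hypothetical interval during which the microphone stays closed."""
    start: float  # playback start time, in seconds
    end: float    # playback start plus playing duration, in seconds

    def mic_open(self, now: float) -> bool:
        # The microphone is closed only while the response voice is playing.
        return not (self.start <= now < self.end)


def playback_duration(num_samples: int, sample_rate_hz: int) -> float:
    """Playing duration in seconds (assumption: per-channel PCM sample count)."""
    if sample_rate_hz <= 0:
        raise ValueError("sample_rate_hz must be positive")
    return num_samples / sample_rate_hz


def mute_window_for(play_start: float, num_samples: int,
                    sample_rate_hz: int) -> MuteWindow:
    """Build the mute window from the start of playback to its end."""
    duration = playback_duration(num_samples, sample_rate_hz)
    return MuteWindow(start=play_start, end=play_start + duration)
```

In this sketch the window is half-open, so the microphone counts as open again at the exact instant the playing duration has elapsed, which matches the aim of reopening it in time for the user's next instruction.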

In an embodiment, as shown in fig. 3, step S13 may also include: step S132, in response to the response voice being captured, controlling the voice acquisition device not to respond to the response voice. This embodiment keeps capturing audio, so no user voice instruction is missed, while the captured response voice is not acted on and is not misjudged as an instruction, which ensures the reliability of multi-round interaction.

In one embodiment, as shown in fig. 4, step S132 may include: step S1321, caching the response voice; step S1322, receiving audio captured by the voice acquisition device; step S1323, performing a voiceprint comparison between the response voice and the audio; in response to the voiceprint similarity between the response voice and the audio reaching a preset threshold, not responding to the audio; and in response to the voiceprint similarity not reaching the preset threshold, performing voice recognition on the audio.

In this embodiment, the server caches the generated response voice. After the response voice is played, the voice acquisition device captures audio and sends it to the server. Instead of performing voice recognition directly, the server performs a voiceprint comparison between the received audio and the cached response voice; the voiceprint comparison requires less computation than voice recognition. Whether the two are the same sound is judged by whether the voiceprint similarity reaches the preset threshold. If they are the same sound, the audio is discarded without voice recognition, which saves computation; if they are not, voice recognition is performed, so that no user instruction is missed.
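
The threshold decision described above can be sketched as follows. The disclosure does not specify how a voiceprint is extracted or which similarity measure is used; this sketch assumes, purely for illustration, that each voiceprint is already available as a fixed-length feature vector and that cosine similarity is the measure. The names cosine_similarity and handle_audio, and the default threshold 0.8, are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length feature vectors (assumed measure)."""
    if len(a) != len(b):
        raise ValueError("voiceprint vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a silent or empty voiceprint matches nothing
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def handle_audio(cached_print: Sequence[float],
                 audio_print: Sequence[float],
                 threshold: float = 0.8) -> str:
    """Decide what to do with received audio.

    Returns "discard" when the similarity reaches the preset threshold
    (the audio is treated as the played response voice), otherwise
    "recognize" (the audio is passed on to voice recognition).
    """
    if cosine_similarity(cached_print, audio_print) >= threshold:
        return "discard"
    return "recognize"
```

Only audio that returns "recognize" would reach the more expensive voice recognition stage, which is the computational saving the embodiment aims at.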

In another embodiment, as shown in fig. 5, step S132 may further include: step S1324, determining the elapsed time from sending the response voice to receiving the audio; in response to the elapsed time exceeding a preset threshold, performing voice recognition on the audio; and in response to the elapsed time not exceeding the preset threshold, performing the voiceprint comparison step. Once a certain time has passed since the server sent the response voice, audio received afterwards is very unlikely to be a recapture of that response voice. In this embodiment, therefore, a preset duration threshold is used: audio received within the threshold undergoes voiceprint comparison, which avoids misjudgment caused by the captured response voice, while audio received after the threshold skips voiceprint comparison and goes directly to voice recognition, which ensures that the user instruction is answered.
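
The elapsed-time check of step S1324 can be sketched as follows. This is an illustrative sketch only; the function name route_audio, the returned labels, and the use of timestamps in seconds are assumptions that do not appear in the disclosure.

```python
def route_audio(sent_at: float, received_at: float,
                max_echo_delay_s: float) -> str:
    """Choose the next processing step for received audio.

    sent_at:          time the response voice was sent, in seconds
    received_at:      time the captured audio was received, in seconds
    max_echo_delay_s: the preset duration threshold (hypothetical name)
    """
    elapsed = received_at - sent_at
    if elapsed < 0:
        raise ValueError("audio cannot be received before the response is sent")
    if elapsed > max_echo_delay_s:
        # Too late to be a recapture of the response voice: recognize directly.
        return "voice_recognition"
    # Within the threshold: the audio may be the response voice, so compare first.
    return "voiceprint_comparison"
```

An elapsed time exactly equal to the threshold does not exceed it, so in this sketch such audio still goes through the voiceprint comparison.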

In an embodiment, after the voiceprint comparison has been performed, or after the elapsed time exceeds the preset threshold, the cached response voice may be deleted, which reduces storage cost and avoids performance degradation caused by excessive storage.
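
The cache cleanup described above can be sketched as follows. The class ResponseVoiceCache, its method names, and the use of a response identifier as the cache key are hypothetical and serve only to illustrate the two deletion conditions: removal once the comparison has used an entry, and removal once an entry is older than the preset threshold.

```python
from typing import Dict, Optional, Tuple


class ResponseVoiceCache:
    """Hypothetical in-memory cache of response voices awaiting comparison."""

    def __init__(self, max_age_s: float) -> None:
        self.max_age_s = max_age_s
        # response id -> (time sent in seconds, cached response voice)
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def put(self, response_id: str, voice: bytes, sent_at: float) -> None:
        self._entries[response_id] = (sent_at, voice)

    def take(self, response_id: str) -> Optional[Tuple[float, bytes]]:
        """Fetch an entry for comparison and delete it in the same step."""
        return self._entries.pop(response_id, None)

    def evict_expired(self, now: float) -> int:
        """Delete entries older than the threshold; return how many were removed."""
        expired = [key for key, (sent_at, _) in self._entries.items()
                   if now - sent_at > self.max_age_s]
        for key in expired:
            del self._entries[key]
        return len(expired)

    def __len__(self) -> int:
        return len(self._entries)
```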

Based on the same inventive concept, fig. 6 shows an echo cancellation device 100. As shown in fig. 6, the echo cancellation device 100 is applied to a cloud server and includes: an acquisition unit 110 configured to collect a user voice instruction; a response unit 120 configured to generate a corresponding response voice based on the voice instruction and to control a corresponding device to play the response voice; and a processing unit 130 configured to control a voice acquisition device to ignore the response voice, so that the response voice is not captured by the voice acquisition device and mistaken for a user voice instruction.

In one embodiment, the processing unit 130 is configured to: control the voice acquisition device to close its microphone while the response voice is being played.

In an embodiment, the processing unit 130 is further configured to: determine the playing duration of the response voice; and control the voice acquisition device to keep the microphone closed from the start of playback of the response voice until the playing duration has elapsed.

In one embodiment, the processing unit 130 is configured to: in response to the response voice being captured, control the voice acquisition device not to respond to the response voice.

In one embodiment, as shown in fig. 7, the processing unit 130 further includes: a buffer unit 131 configured to cache the response voice; a receiving unit 132 configured to receive audio captured by the voice acquisition device; and a voiceprint comparison unit 133 configured to perform a voiceprint comparison between the response voice and the audio. In response to the voiceprint similarity between the response voice and the audio reaching a preset threshold, the processing unit 130 does not respond to the audio; in response to the voiceprint similarity not reaching the preset threshold, the processing unit 130 performs voice recognition on the audio.

In an embodiment, the processing unit 130 is further configured to: determine the elapsed time from sending the response voice to receiving the audio; in response to the elapsed time exceeding a preset threshold, perform voice recognition on the audio; and in response to the elapsed time not exceeding the preset threshold, perform the voiceprint comparison step.

With respect to the echo cancellation device 100 in the above embodiments, the specific manner in which each unit performs its operations has been described in detail in the method embodiments and is not elaborated here.
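
The division of the device into an acquisition unit, a response unit, and a processing unit can be sketched as follows. This is a structural illustration only: the class EchoCancellationDevice, its field names, and the representation of each unit as a callable are assumptions, and the real units would wrap network and device control that is outside the scope of this sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EchoCancellationDevice:
    """Hypothetical composition of the three units of device 100."""
    # Acquisition unit 110: returns the collected user voice instruction.
    acquire: Callable[[], str]
    # Response unit 120: generates the response voice and triggers playback.
    respond: Callable[[str], bytes]
    # Processing unit 130: makes the voice acquisition device ignore it.
    ignore: Callable[[bytes], None]

    def run_once(self) -> bytes:
        """Run one round: collect, respond, then ignore the response voice."""
        instruction = self.acquire()
        response_voice = self.respond(instruction)
        self.ignore(response_voice)
        return response_voice
```

A possible use, with stand-in callables, is EchoCancellationDevice(lambda: "weather", lambda s: ("reply:" + s).encode(), print), where the last callable would in practice close the microphone or register the response voice for voiceprint comparison.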

Fig. 8 is a schematic block diagram illustrating an apparatus of any of the previous embodiments in accordance with an exemplary embodiment. For example, the apparatus 300 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.

Referring to fig. 8, the apparatus 300 may include one or more of the following components: a processing component 302, a memory 304, a power component 306, a multimedia component 308, an audio component 310, an input/output (I/O) interface 312, a sensor component 314, and a communication component 316.

The processing component 302 generally controls overall operation of the device 300, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 302 may include one or more processors 320 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 302 can include one or more modules that facilitate interaction between the processing component 302 and other components. For example, the processing component 302 may include a multimedia module to facilitate interaction between the multimedia component 308 and the processing component 302.

The memory 304 is configured to store various types of data to support operations at the apparatus 300. Examples of such data include instructions for any application or method operating on device 300, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 304 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

Power components 306 provide power to the various components of device 300. The power components 306 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the apparatus 300.

The multimedia component 308 includes a screen that provides an output interface between the device 300 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 308 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 300 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 310 is configured to output and/or input audio signals. For example, audio component 310 includes a Microphone (MIC) configured to receive external audio signals when apparatus 300 is in an operating mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 304 or transmitted via the communication component 316. In some embodiments, audio component 310 also includes a speaker for outputting audio signals.

The I/O interface 312 provides an interface between the processing component 302 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor assembly 314 includes one or more sensors for providing various aspects of status assessment for the device 300. For example, sensor assembly 314 may detect an open/closed state of device 300, the relative positioning of components, such as a display and keypad of device 300, the change in position of device 300 or a component of device 300, the presence or absence of user contact with device 300, the orientation or acceleration/deceleration of device 300, and the change in temperature of device 300. Sensor assembly 314 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 314 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 314 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 316 is configured to facilitate wired or wireless communication between the apparatus 300 and other devices. The device 300 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 316 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 316 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the apparatus 300 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a computer-readable storage medium comprising instructions, such as the memory 304 comprising instructions, executable by the processor 320 of the apparatus 300 to perform the above-described method is also provided. For example, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

Fig. 9 is a block diagram illustrating an electronic device 400 according to an example embodiment. For example, the apparatus 400 may be provided as a server. Referring to fig. 9, apparatus 400 includes a processing component 422, which further includes one or more processors, and memory resources, represented by memory 432, for storing instructions, such as applications, that are executable by processing component 422. The application programs stored in memory 432 may include one or more modules that each correspond to a set of instructions. Further, the processing component 422 is configured to execute instructions to perform the above-described methods.

The apparatus 400 may also include a power component 426 configured to perform power management of the apparatus 400, a wired or wireless network interface 450 configured to connect the apparatus 400 to a network, and an input/output (I/O) interface 458. The apparatus 400 may operate based on an operating system stored in the memory 432, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims

1. An echo cancellation method applied to a cloud server, the method comprising:

collecting a user voice instruction;

generating a corresponding response voice based on the voice instruction, and controlling a corresponding device to play the response voice; and

controlling a voice acquisition device to ignore the response voice, so that the response voice is not captured by the voice acquisition device and mistaken for a user voice instruction.

2. The echo cancellation method of claim 1, wherein controlling the voice acquisition device to ignore the response voice comprises:

controlling the voice acquisition device to close a microphone while the response voice is being played.

3. The echo cancellation method according to claim 2, wherein controlling the voice acquisition device to close the microphone while the response voice is being played comprises:

determining the playing duration of the response voice; and

controlling the voice acquisition device to keep the microphone closed from the start of playback of the response voice until the playing duration has elapsed.

4. The echo cancellation method according to claim 1, wherein controlling the voice acquisition device to ignore the response voice comprises:

in response to the response voice being captured, controlling the voice acquisition device not to respond to the response voice.

5. The echo cancellation method according to claim 4, wherein, in response to the response voice being captured, controlling the voice acquisition device not to respond to the response voice comprises:

caching the response voice;

receiving audio captured by the voice acquisition device;

performing a voiceprint comparison between the response voice and the audio;

in response to the voiceprint similarity between the response voice and the audio reaching a preset threshold, not responding to the audio; and

in response to the voiceprint similarity between the response voice and the audio not reaching the preset threshold, performing voice recognition on the audio.

6. The echo cancellation method of claim 5, wherein the method further comprises:

determining the elapsed time from sending the response voice to receiving the audio;

in response to the elapsed time exceeding a preset threshold, performing voice recognition on the audio; and

in response to the elapsed time not exceeding the preset threshold, performing the voiceprint comparison step.

7. An echo cancellation device applied to a cloud server, the device comprising:

an acquisition unit configured to collect a user voice instruction;

a response unit configured to generate a corresponding response voice based on the voice instruction and to control a corresponding device to play the response voice; and

a processing unit configured to control a voice acquisition device to ignore the response voice, so that the response voice is not captured by the voice acquisition device and mistaken for a user voice instruction.

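8. The echo cancellation device of claim 7, wherein the processing unit is configured to:

control the voice acquisition device to close a microphone while the response voice is being played.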

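9. The echo cancellation device of claim 8, wherein the processing unit is further configured to:

determine the playing duration of the response voice; and

control the voice acquisition device to keep the microphone closed from the start of playback of the response voice until the playing duration has elapsed.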

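10. The echo cancellation device of claim 7, wherein the processing unit is configured to:

in response to the response voice being captured, control the voice acquisition device not to respond to the response voice.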

11. The echo cancellation device of claim 10, wherein the processing unit further comprises:

a buffer unit configured to cache the response voice;

a receiving unit configured to receive audio captured by the voice acquisition device; and

a voiceprint comparison unit configured to perform a voiceprint comparison between the response voice and the audio;

wherein, in response to the voiceprint similarity between the response voice and the audio reaching a preset threshold, the processing unit does not respond to the audio; and in response to the voiceprint similarity between the response voice and the audio not reaching the preset threshold, the processing unit performs voice recognition on the audio.

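12. The echo cancellation device of claim 11, wherein the processing unit is further configured to:

determine the elapsed time from sending the response voice to receiving the audio;

in response to the elapsed time exceeding a preset threshold, perform voice recognition on the audio; and

in response to the elapsed time not exceeding the preset threshold, perform the voiceprint comparison step.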

13. An electronic device, comprising:

a memory to store instructions; and

a processor for invoking the memory-stored instructions to perform the echo cancellation method of any of claims 1-6.

14. A computer-readable storage medium storing instructions which, when executed by a processor, perform the echo cancellation method of any one of claims 1 to 6.