WO2022083502A1

WO2022083502A1 - Voice interaction method and related apparatus, and method for establishing correspondence

Info

Publication number: WO2022083502A1
Application number: PCT/CN2021/123913
Authority: WO
Inventors: 谢家晖
Original assignee: 广东美的白色家电技术创新中心有限公司; 美的集团股份有限公司
Priority date: 2020-10-22
Filing date: 2021-10-14
Publication date: 2022-04-28
Also published as: CN112581972A

Abstract

Provided by the present application are a voice interaction method and a related apparatus, and a method for establishing a correspondence. The voice interaction method comprises: calculating the echo cancellation suppression amount of an original recorded audio and a recorded audio of a voice device to obtain a live echo cancellation suppression amount, wherein the original recorded audio is an audio recorded by the voice device while an original audio is played back by itself, and the recorded audio is an audio obtained after the original recorded audio is processed by an echo cancellation algorithm; and performing voice interaction by using a voice interaction strategy corresponding to the live echo cancellation suppression amount. The present application may improve the effect of voice interaction.

Description

Voice interaction method, related device, and method for establishing corresponding relationship

【Technical field】

The present application relates to the technical field of voice interaction, and in particular, to a voice interaction method, a related device, and a method for establishing a corresponding relationship.

【Background】

With the maturity of voice recognition technology, the voice interaction function of voice devices is rapidly strengthening and improving. At present, voice devices can use half-duplex voice or full-duplex voice interaction to interact with people or other devices, but the interaction effect is still insufficient.

【Summary of the invention】

The present application provides a voice interaction method, a related device, and a method for establishing a corresponding relationship, which can improve the interaction effect.

In order to achieve the above object, the present application provides a voice interaction method, the method includes:

Perform an echo cancellation suppression amount calculation on the original recorded audio and the recorded audio of the voice device to obtain the live echo cancellation suppression amount, where the original recorded audio is the audio recorded by the voice device while it plays the original audio itself, and the recorded audio is the audio obtained after the original recorded audio is processed by the echo cancellation algorithm;

Voice interaction is carried out with the voice interaction strategy corresponding to the amount of live echo cancellation suppression.

Wherein, the corresponding relationship between the amount of live echo cancellation suppression and the voice interaction strategy is preset;

And the smaller the suppression amount of live echo cancellation, the lower the voice interaction performance index defined by the voice interaction strategy.

Among them, the corresponding relationship between the amount of live echo cancellation suppression and the voice interaction strategy includes:

In response to the live echo cancellation suppression amount being greater than the first preset value, the voice interaction is performed in a full-duplex mode;

In response to the live echo cancellation suppression amount being less than or equal to the first preset value, the voice interaction is performed in half-duplex mode.

The full-duplex mode includes a first full-duplex mode and a second full-duplex mode, and the first full-duplex mode has more command words than the second full-duplex mode; performing the voice interaction in full-duplex mode in response to the live echo cancellation suppression amount being greater than the first preset value includes:

in response to the live echo cancellation suppression amount being greater than the second preset value, performing the voice interaction in the first full-duplex mode;

In response to the live echo cancellation suppression amount being greater than the first preset value and less than or equal to the second preset value, the voice interaction is performed in the second full-duplex mode.

Wherein, the activation time of the first full-duplex mode is longer than the activation time of the second full-duplex mode.

Wherein, obtaining the suppression amount of live echo cancellation includes:

Calculate the energy matrix of the original recorded audio and the recorded audio separately;

Convert the difference between the energy matrix of the original recorded audio and the energy matrix of the recorded audio to obtain difference data;

Perform frame-by-frame processing on the difference data to determine the peak value of each frame in the difference data;

The mean value of the peak values of at least some frames in the difference data is calculated to obtain the suppression amount of live echo cancellation.

Among them, calculating the mean value of the peak value of at least some frames in the difference data, including:

Calculate the mean of the peaks of all frames in the difference data after the echo cancellation algorithm has converged.

Among them, the duration of the original audio is longer than 10s.

In order to achieve the above purpose, the present application provides a method for establishing a corresponding relationship, the method comprising:

Test the live echo cancellation suppression amount of the voice device under different acoustic echo channels;

Define the voice interaction strategy matched with the live echo cancellation suppression amount under different acoustic echo channels, and save the correspondence between the voice interaction strategy and the live echo cancellation suppression amount.

Among them, testing the live echo cancellation suppression amount of the voice device under different acoustic echo channels includes: testing the live echo cancellation suppression amount and interruption index of the voice device under different acoustic echo channels;

Define the voice interaction strategy matched by the live echo cancellation suppression amount under different acoustic echo channels, including: defining the voice interaction strategy matched by the live echo cancellation suppression amount based on the interruption index.

Among them, the voice interaction strategy matched by the amount of live echo cancellation suppression is defined with the interruption index as the measurement standard, including:

Taking the minimum live echo cancellation suppression amount for which the interruption index is qualified as the first preset value, or the highest live echo cancellation suppression amount for which the interruption index is unqualified as the first preset value;

Wherein, the first preset value is the highest standard value for the voice device to activate the half-duplex mode.

The full-duplex mode includes a first full-duplex mode and a second full-duplex mode, and the first full-duplex mode has more command words than the second full-duplex mode. Defining, with the interruption index as the measurement standard, the voice interaction strategy matched by the live echo cancellation suppression amount includes:

Taking the minimum live echo cancellation suppression amount for which the interruption index is excellent as the second preset value;

Wherein, the second preset value is the highest standard value for the voice device to activate the second full-duplex mode;

A live echo cancellation suppression amount for which the difference between each interruption index and its respective qualified value exceeds the corresponding first threshold is a live echo cancellation suppression amount with an excellent interruption index, the first threshold being greater than 0; or, a live echo cancellation suppression amount for which the ratio of each interruption index to its respective qualified value exceeds the corresponding second threshold is a live echo cancellation suppression amount with an excellent interruption index, the second threshold being greater than 1.

Among them, the interruption indicators include interruption precision rate and interruption recall rate;

The live echo cancellation suppression amount for which both the interruption precision rate and the interruption recall rate exceed their respective qualified values is the live echo cancellation suppression amount for which the interruption index is qualified;

The live echo cancellation suppression amount for which either the interruption precision rate or the interruption recall rate does not exceed its respective qualified value is the live echo cancellation suppression amount for which the interruption index is unqualified.
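The definitions above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Measurement` record, the function names, the use of the difference variant with one shared margin for both indices, and all numeric values are hypothetical and do not come from the application.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    suppression_db: float  # live echo cancellation suppression amount for one acoustic echo channel
    precision: float       # interruption precision rate measured on that channel
    recall: float          # interruption recall rate measured on that channel


def is_qualified(m, min_precision, min_recall):
    # Qualified: both indices exceed their respective qualified values.
    return m.precision > min_precision and m.recall > min_recall


def is_excellent(m, min_precision, min_recall, margin):
    # Difference variant: each index exceeds its qualified value by more than
    # a margin greater than 0 (one shared margin is an assumption made here).
    return (m.precision - min_precision > margin
            and m.recall - min_recall > margin)


def derive_presets(measurements, min_precision, min_recall, margin):
    """Return (first_preset, second_preset); None where no measurement matches."""
    qualified = [m.suppression_db for m in measurements
                 if is_qualified(m, min_precision, min_recall)]
    excellent = [m.suppression_db for m in measurements
                 if is_excellent(m, min_precision, min_recall, margin)]
    first = min(qualified) if qualified else None
    second = min(excellent) if excellent else None
    return first, second


# Hypothetical test results for four acoustic echo channels.
SAMPLE = [
    Measurement(25.0, 0.80, 0.70),
    Measurement(30.0, 0.92, 0.88),
    Measurement(33.0, 0.93, 0.89),
    Measurement(36.0, 0.99, 0.97),
]
```

With the hypothetical sample, `derive_presets(SAMPLE, 0.90, 0.85, 0.05)` takes the smallest qualified suppression amount as the first preset value and the smallest excellent one as the second.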

In order to achieve the above purpose, the present application provides a voice device. The voice device includes a processor, a playback device, a recording device and an echo cancellation circuit; the playback device, the recording device and the echo cancellation circuit are all coupled to the processor, and a memory stores a computer program that the processor executes to implement the above voice interaction method.

To achieve the above object, the present application provides a computer-readable storage medium for storing instructions/program data, and the instructions/program data can be executed to implement the above voice interaction method.

The method of the present application is as follows: first, perform an echo cancellation suppression amount calculation on the original recorded audio of the voice device and the recorded audio output by the echo cancellation algorithm to obtain a live echo cancellation suppression amount; then perform voice interaction using the voice interaction strategy corresponding to the live echo cancellation suppression amount. In this way, the live echo cancellation suppression amount serves as the measure for accurately selecting a voice interaction strategy that is consistent with the actual use environment, which achieves a better balance between recognition rate and interaction experience and improves the interaction effect.

【Description of drawings】

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the drawings that are used in the description of the embodiments. Obviously, the drawings in the following description are only some embodiments of the present application. For those of ordinary skill in the art, other drawings can also be obtained from these drawings without creative effort.

FIG. 1 is a schematic flowchart of a first embodiment of the voice interaction method of the present application;

FIG. 2 is a schematic flowchart of a second embodiment of the voice interaction method of the present application;

FIG. 3 is a schematic flowchart of a first embodiment of the method for establishing a corresponding relationship of the present application;

FIG. 4 is a schematic flowchart of a second embodiment of the method for establishing a corresponding relationship of the present application;

FIG. 5 is a schematic diagram comparing sweep-frequency spectra in a high-reverberation environment in the first embodiment of the method for establishing a corresponding relationship of the present application;

FIG. 6 is a schematic structural diagram of an embodiment of a voice device of the present application;

FIG. 7 is a schematic structural diagram of an embodiment of a computer-readable storage medium of the present application.

【Detailed description】

In order for those skilled in the art to better understand the technical solutions of the present application, the voice interaction method, related devices, and method for establishing a corresponding relationship provided by the present application are further described in detail below with reference to the accompanying drawings and specific embodiments.

The terms "first", "second" and "third" in this application are only used for descriptive purposes, and should not be construed as indicating or implying relative importance or implying the number of indicated technical features. Thus, a feature defined as "first", "second", "third" may expressly or implicitly include at least one of that feature. In the description of the present application, "a plurality of" means at least two, such as two, three, etc., unless otherwise expressly and specifically defined.

Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment can be included in at least one embodiment of the present application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor a separate or alternative embodiment that is mutually exclusive with other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments without conflict.

The voice interaction method of the present application applies to scenarios in which a voice device performs voice interaction with a person or with other devices. Taking the field of home appliances as an example, many home appliances such as refrigerators, microwave ovens, air conditioners and rice cookers now have a voice interaction function and can serve as voice devices. These voice devices generally need to record audio and perform wake-word and command-word recognition on the recorded audio to determine the voice command issued by the user or by another device, and then broadcast relevant content based on the voice command to achieve voice interaction. At present, voice devices can perform voice interaction in half-duplex mode or full-duplex mode. When the voice device adopts the half-duplex mode, it cannot broadcast while it is receiving the user's voice command, and it cannot receive the user's command while it is broadcasting, so either party often has to wait, and the experience deviates considerably from voice interaction between real people. When the voice device adopts the full-duplex mode, it can broadcast and receive voice commands at the same time. However, due to interference from the external environment, the voice device may not be able to accurately cancel the echo in the original recording, resulting in a self-excited response, so that misrecognition happens frequently. In actual use, the voice device can only use a preset voice interaction strategy to interact with the user or other devices, and therefore cannot achieve a good balance between recognition rate and interaction experience. With the voice interaction method of the present application, a suitable voice interaction strategy can be selected automatically according to the actual use environment.

The voice interaction method of the present application performs an echo cancellation suppression amount calculation on the original recorded audio of the voice device and the recorded audio output by the echo cancellation algorithm to obtain the live echo cancellation suppression amount, and then performs voice interaction using the voice interaction strategy corresponding to the live echo cancellation suppression amount, so that a voice interaction strategy consistent with the actual use environment can be accurately selected based on the live echo cancellation suppression amount.

As shown in FIG. 1 , FIG. 1 is a schematic flowchart of the first embodiment of the voice interaction method of the present application. The voice interaction method of the present application may include the following steps.

S110: Perform an echo cancellation suppression amount calculation on the original recorded audio and the recorded audio of the voice device to obtain a live echo cancellation suppression amount.

The live echo cancellation suppression amount can first be calculated from the original recorded audio of the voice device and the recorded audio after echo cancellation processing, so that the voice interaction strategy corresponding to the live echo cancellation suppression amount can be used for subsequent voice interaction.

The original recorded audio is the audio recorded by the voice device while it plays the original audio itself. To avoid an inaccurate live echo cancellation suppression amount caused by original audio that is too short, the duration of the original audio can be more than 10s. Preferably, the duration of the original audio is 20-30s. The original audio can be music or a long broadcast; for example, it can be boot music. The recorded audio is the audio obtained by processing the original recorded audio through an echo cancellation algorithm.

It can be understood that any method can be used in the present application to calculate the suppression amount of live echo cancellation.

In one implementation, the energy matrices of the original recorded audio and of the recorded audio after echo cancellation processing may be calculated separately; the live echo cancellation suppression amount is then determined based on the difference between the energy matrix of the original recorded audio and the energy matrix of the recorded audio. In addition, the unit of the live echo cancellation suppression amount can be unified so that it can be measured more consistently. Exemplarily, the unit can be unified into decibels. Specifically, in the process of calculating the live echo cancellation suppression amount, the difference between the energy matrix of the original recorded audio and the energy matrix of the recorded audio can be converted into a decibel representation to obtain the difference data; the difference data is then processed, for example by a mean operation, to obtain the live echo cancellation suppression amount.

In addition, in the quiet parts of the audio the energy difference itself is relatively small. If the difference value at every point were used to calculate the live echo cancellation suppression amount, the values of the quiet parts would greatly pull down the result, so that the live echo cancellation suppression amount of relatively stable original audio could not be calculated accurately. The present application therefore processes the difference data frame by frame, takes the peak value of each frame, and calculates the mean of the peaks to obtain the live echo cancellation suppression amount. This prevents the values of a large number of quiet parts in stable original audio from pulling down the live echo cancellation suppression amount and improves the accuracy of the calculation. The frame length used in the frame-by-frame processing may correspond to the original audio. Specifically, the frame length for each original audio may be preset according to the length and/or fluctuation of that original audio, so as to improve the accuracy of the calculation. In addition, to prevent the part of the recorded audio before the echo cancellation algorithm converges from affecting the live echo cancellation suppression amount of the voice device, the mean of the peak values of all frames in the difference data after the echo cancellation algorithm has converged can be calculated to obtain the live echo cancellation suppression amount, ensuring the accuracy of the calculation.
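The effect of the quiet parts can be illustrated with a small Python sketch. The difference values, the frame length and the function names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def pointwise_mean(diff_db):
    # Mean over every point of the difference data, quiet parts included.
    return sum(diff_db) / len(diff_db)


def framewise_peak_mean(diff_db, frame_len):
    # Split the difference data into frames, keep each frame's peak,
    # then average the peaks.
    n_frames = len(diff_db) // frame_len
    peaks = [max(diff_db[k * frame_len:(k + 1) * frame_len])
             for k in range(n_frames)]
    return sum(peaks) / len(peaks)


# Hypothetical difference data in dB: each frame of 4 points holds one loud
# point with 30 dB of suppression and three quiet points with only 2 dB.
DIFF_DB = [30.0, 2.0, 2.0, 2.0] * 5
```

For this data the pointwise mean is 9 dB, while the mean of the per-frame peaks stays at 30 dB, which is the behaviour the frame-by-frame peak approach is meant to preserve.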

Exemplarily, the process of calculating the suppression amount of live echo cancellation is as follows:

(1) For the original recorded audio mic, convolve its squared samples with a fixed averaging vector to obtain the energy matrix of the original recorded audio;

rawEchoEng=conv2(mic.^2,ones(wlen,1)/wlen,'same');

(2) For the recorded audio aec after echo cancellation processing, convolve its squared samples with the same fixed vector to obtain the energy matrix of the recorded audio;

resEchoEng=conv2(aec.^2,ones(wlen,1)/wlen,'same');

(3) Convert the energy difference between the two into a dB representation;

ERL=-10*log10(resEchoEng./rawEchoEng+10^-9);

(4) Perform frame-by-frame processing and set an appropriate frame size Lk;

Blocks=floor(signal_length/Lk);

(5) Calculate the maximum dB difference point of each frame, that is, the peak value, frame by frame;

ERL_max=zeros(Blocks,1);

for kk=1:Blocks

ERL_max(kk)=max(ERL(((kk-1)*Lk+1):kk*Lk));

end

(6) Perform the mean value operation on the peak values of all frames after the algorithm converges (assuming that 30 frames of convergence time are required) to obtain the final live echo cancellation suppression amount of the audio segment;

ERLmean=mean(ERL_max(30:end));
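For readers who prefer Python, steps (1) to (6) can be approximated as below. This is a sketch rather than a port: the function names and parameters are hypothetical, the moving average only roughly mirrors the 'same' convolution, and the small floor added to the denominator is an extra guard against silent windows that the steps above do not include.

```python
import math


def moving_average_energy(signal, wlen):
    """Mean of squared samples over a centred window, zero-padded at the edges."""
    squared = [s * s for s in signal]
    half = wlen // 2
    energy = []
    for i in range(len(squared)):
        lo = max(0, i - half)
        hi = min(len(squared), i - half + wlen)
        energy.append(sum(squared[lo:hi]) / wlen)
    return energy


def live_suppression_db(mic, aec, wlen, frame_len, skip_frames, eps=1e-9):
    """Mean of per-frame peak suppression (dB) after skipping the first frames."""
    raw = moving_average_energy(mic, wlen)   # energy of the original recording
    res = moving_average_energy(aec, wlen)   # energy after echo cancellation
    # Added guard (assumption): 1e-12 keeps the division defined when a
    # window of the original recording is completely silent.
    diff_db = [-10.0 * math.log10(r / (w + 1e-12) + eps)
               for r, w in zip(res, raw)]
    n_frames = len(diff_db) // frame_len
    peaks = [max(diff_db[k * frame_len:(k + 1) * frame_len])
             for k in range(n_frames)]
    converged = peaks[skip_frames:]          # drop frames before convergence
    if not converged:
        raise ValueError("audio too short: no frames left after convergence")
    return sum(converged) / len(converged)
```

As a sanity check, a residual signal whose amplitude is one tenth of the original recording has one hundredth of its energy, so the function should report roughly 20 dB.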

For another example, the root mean square of the original recorded audio of the voice device and of the recorded audio output by the echo cancellation algorithm can be calculated to obtain the live echo cancellation suppression amount.
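One possible reading of the root-mean-square variant is sketched below in Python. The text does not say how the two RMS values are combined; expressing their ratio in decibels is an assumption made here, and the function names are hypothetical.

```python
import math


def rms(signal):
    # Root mean square of a sequence of samples.
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def rms_suppression_db(mic, aec, floor=1e-12):
    # Assumption: suppression is the ratio of the two RMS values in dB.
    # The floor avoids division by zero when the residual is fully silent.
    return 20.0 * math.log10(rms(mic) / max(rms(aec), floor))
```

A residual at one tenth of the original amplitude again corresponds to about 20 dB under this reading.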

S120: Perform voice interaction by using the voice interaction strategy corresponding to the live echo cancellation suppression amount.

After the live echo cancellation suppression amount is calculated, the voice interaction strategy corresponding to it can be used for subsequent voice interaction, so that, with the live echo cancellation suppression amount as the measure, a voice interaction strategy that matches the actual use environment can be selected.

It can be understood that there is a certain corresponding relationship between the live echo cancellation suppression amount and the voice interaction strategy, and this corresponding relationship may be pre-configured, so that after the live echo cancellation suppression amount is calculated through step S110, the voice interaction strategy corresponding to it can be used directly for voice interaction. The smaller the live echo cancellation suppression amount, the lower the voice interaction performance index defined by the voice interaction strategy. For example, as the live echo cancellation suppression amount decreases, the voice interaction strategy changes from a full-duplex voice interaction strategy to a half-duplex voice interaction strategy, or the activation time and/or the number of command words in the full-duplex voice interaction strategy decreases. Of course, it is not excluded that the live echo cancellation suppression amount becomes smaller while the voice interaction performance index defined by the voice interaction strategy remains unchanged or increases.

In an implementation manner, the live echo cancellation suppression amount may be divided into at least two intervals in advance, with different intervals corresponding to different voice interaction strategies. In this case, after the live echo cancellation suppression amount is calculated through step S110, first confirm which interval the current live echo cancellation suppression amount of the voice device falls in, and then select the voice interaction strategy corresponding to that interval for voice interaction. Optionally, the live echo cancellation suppression amount can be divided into two intervals, one corresponding to the voice interaction strategy in the full-duplex mode and the other corresponding to the voice interaction strategy in the half-duplex mode.

In another implementation manner, different live echo cancellation suppression amounts correspond to different voice interaction strategies. Exemplarily, the smaller the live echo cancellation suppression amount, the shorter the activation time in the corresponding full-duplex voice interaction process; and/or, the smaller the live echo cancellation suppression amount, the fewer the command words in the corresponding full-duplex voice interaction process.

In step S120, the activation time and/or the command words in the full-duplex voice interaction process corresponding to the live echo cancellation suppression amount may be confirmed first, and the voice interaction is then performed based on the confirmed activation time and/or command words. Optionally, the activation time and/or command words in the full-duplex voice interaction process corresponding to the live echo cancellation suppression amount can be confirmed based on a correspondence table between the live echo cancellation suppression amount and the activation time, and/or a correspondence table between the live echo cancellation suppression amount and the command words. In other embodiments, the activation time and/or the number of command words in the full-duplex voice interaction process corresponding to the live echo cancellation suppression amount can be calculated based on an activation time calculation formula and/or a formula for the number of command words; the corresponding number of command words is then selected from the command thesaurus, and voice interaction is performed based on the corresponding activation time and command words. The command words in the command thesaurus may be sorted according to their correlation with the voice device, so that the command words with the highest correlation are selected from the command thesaurus.
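A correspondence-table lookup of this kind might look like the following Python sketch. The table rows, the command words and the function name are all hypothetical; the text only states that such tables and a correlation-sorted command thesaurus may exist.

```python
from bisect import bisect_left

# Hypothetical table: (lower bound in dB, activation time in s, command words).
# A row applies when the suppression amount is strictly greater than its bound.
CORRESPONDENCE_TABLE = [
    (30.0, 5.0, 3),
    (32.0, 10.0, 5),
    (35.0, 20.0, 8),
]

# Hypothetical command thesaurus, sorted from most to least correlated
# with the voice device.
COMMAND_THESAURUS = [
    "power on", "power off", "set temperature", "start timer",
    "stop timer", "increase fan speed", "what is the weather", "play music",
]


def full_duplex_settings(suppression_db):
    """Return (activation_time_s, command_words), or None below every bound."""
    bounds = [row[0] for row in CORRESPONDENCE_TABLE]
    index = bisect_left(bounds, suppression_db) - 1  # last bound strictly below
    if index < 0:
        return None
    _, activation_time_s, word_count = CORRESPONDENCE_TABLE[index]
    # Take the most correlated command words first.
    return activation_time_s, COMMAND_THESAURUS[:word_count]
```

Because the thesaurus is sorted by correlation, slicing it by the looked-up count keeps the most relevant command words when the suppression amount is low.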

In this embodiment, the echo cancellation suppression amount calculation is first performed on the original recorded audio of the voice device and the recorded audio output by the echo cancellation algorithm to obtain the live echo cancellation suppression amount; voice interaction is then performed using the voice interaction strategy corresponding to the live echo cancellation suppression amount. In this way, a voice interaction strategy consistent with the actual use environment can be accurately selected based on the live echo cancellation suppression amount, achieving a better balance between recognition rate and interaction experience and improving the effect of voice interaction.

Optionally, the voice interaction method of the present application may be executed after the voice device is powered on. Specifically, the voice device is powered on; audio is recorded while the boot music is played to obtain the original recorded audio; echo cancellation is performed on the original recorded audio to obtain the recorded audio output by the echo cancellation algorithm; the echo cancellation suppression amount is calculated from the original recorded audio and the recorded audio to obtain the live echo cancellation suppression amount; the voice interaction strategy corresponding to the live echo cancellation suppression amount is then used for voice interaction.

In other implementation manners, the voice interaction method of the present application may be executed when an instruction to modify the voice interaction strategy issued by the user is received.

As shown in FIG. 2 , FIG. 2 is a schematic flowchart of the second embodiment of the voice interaction method of the present application. The voice interaction method of the present application may include the following steps.

S210: Perform an echo cancellation suppression amount operation on the original recorded audio of the voice device and the recorded audio output after the echo cancellation algorithm, to obtain a live echo cancellation suppression amount.

Please refer to step S110, which is not repeated here.

S220: Confirm the echo cancellation suppression amount interval to which the live echo cancellation suppression amount belongs.

After the live echo cancellation suppression amount is calculated, the echo cancellation suppression amount interval to which it belongs can be determined, and the voice interaction strategy corresponding to that interval can be selected for voice interaction, so that an appropriate voice interaction strategy is selected according to the actual use environment.

Optionally, at least two echo cancellation suppression amount intervals may be preconfigured, and different echo cancellation suppression amount intervals correspond to different voice interaction strategies.

Illustratively, three echo cancellation suppression amount intervals may be configured. The three echo cancellation suppression amount intervals may be respectively: less than or equal to the first preset value; greater than the first preset value and less than or equal to the second preset value; greater than the second preset value. Wherein, the second preset value is greater than the first preset value. The first preset value may be pre-configured, which may be adjusted according to the actual situation, or may be adjusted according to the user's instruction. Optionally, the first preset value may be 27dB, 30dB, 32dB and the like. The second preset value may be pre-configured, which may be adjusted according to the actual situation, or may be adjusted according to the user's instruction. Optionally, the second preset value may be 34dB, 35dB, 38dB and the like.

In response to confirming that the live echo cancellation suppression amount is less than or equal to the first preset value, proceed to step S230; in response to confirming that the live echo cancellation suppression amount is greater than the first preset value and less than or equal to the second preset value, proceed to step S240; in response to confirming that the live echo cancellation suppression amount is greater than the second preset value, proceed to step S250.
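The three-way branch can be sketched in Python as follows. The default preset values of 30 dB and 35 dB are taken from the optional examples given above; the function name and the returned labels are hypothetical.

```python
def select_strategy(suppression_db, first_preset_db=30.0, second_preset_db=35.0):
    """Map a live echo cancellation suppression amount to an interaction mode."""
    if second_preset_db <= first_preset_db:
        raise ValueError("second preset value must exceed the first")
    if suppression_db <= first_preset_db:
        return "half_duplex"          # corresponds to step S230
    if suppression_db <= second_preset_db:
        return "second_full_duplex"   # corresponds to step S240
    return "first_full_duplex"        # corresponds to step S250
```

Both boundaries are inclusive on the lower mode, so a value exactly equal to a preset value selects the more conservative strategy.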

S230: Perform voice interaction in half-duplex mode.

After confirming that the live echo cancellation suppression amount is less than or equal to the first preset value, the voice interaction can be performed in half-duplex mode. This avoids using the full-duplex mode when echo cancellation is insufficient under the influence of the actual environment, which would result in an excessively high false recognition rate, and thus ensures the normal operation of voice interaction.

When performing voice interaction in half-duplex mode, the voice device is generally not allowed to record and play at the same time, so as to avoid self-excited reactions caused by insufficient echo cancellation.

S240: Perform voice interaction in the second full-duplex mode.

The full-duplex mode may include a first full-duplex mode and a second full-duplex mode, wherein the number of command words in the first full-duplex mode may be greater than the number of command words in the second full-duplex mode. The first full-duplex mode may include, in addition to command words strongly related to the voice device, command words that are weakly related to the voice device. The second full-duplex mode may only include command words that are strongly related to the voice device, of course not limited thereto. For example, the second full-duplex mode may also include some command words that are frequently used but weakly related to the voice device. It can be understood that, the command words of the second full-duplex mode can also be the command words of the first full-duplex mode.

In addition, the activation time of the first full-duplex mode may also be longer than the activation time of the second full-duplex mode. The voice device can recognize the command word in the recording content within the activation time after receiving the wake-up word.

After confirming that the live echo cancellation suppression amount is greater than the first preset value and less than or equal to the second preset value, the voice interaction can be performed in the second full-duplex mode. In this mode, the command words that can be recognized and/or the activation time in full-duplex voice interaction are set more strictly, to ensure that a more natural full-duplex dialogue experience can be provided within the scope of some of the most frequently used command words.

S250: Perform voice interaction in the first full-duplex mode.

After confirming that the live echo cancellation suppression amount is greater than the second preset value, the voice interaction can be performed in the first full-duplex mode. Since the echo cancellation suppression amount is sufficient, there is almost no self-excited misrecognition. This allows long-duration voice interaction with users or other devices and the recognition of more command words, so that, when the echo cancellation suppression amount is sufficient, the voice interaction of the voice device can come close to voice interaction between real people.
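The half-duplex, second full-duplex and first full-duplex branches described above amount to a three-way threshold comparison. The sketch below is one way to express it; the names `Mode` and `select_mode` are hypothetical, and the preset values passed in are placeholders rather than values fixed by this application.

```python
from enum import Enum


class Mode(Enum):
    HALF_DUPLEX = "half-duplex"
    SECOND_FULL_DUPLEX = "second full-duplex"
    FIRST_FULL_DUPLEX = "first full-duplex"


def select_mode(suppression_db: float,
                first_preset_db: float,
                second_preset_db: float) -> Mode:
    """Map a live echo cancellation suppression amount (dB) to a mode."""
    if first_preset_db > second_preset_db:
        raise ValueError("first preset value must not exceed the second")
    if suppression_db <= first_preset_db:
        # Echo cancellation is insufficient for any full-duplex mode.
        return Mode.HALF_DUPLEX
    if suppression_db <= second_preset_db:
        # Sufficient for the restricted full-duplex mode only.
        return Mode.SECOND_FULL_DUPLEX
    return Mode.FIRST_FULL_DUPLEX
```

For example, with placeholder presets of 30 dB and 35 dB, `select_mode(32.0, 30.0, 35.0)` returns `Mode.SECOND_FULL_DUPLEX`.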

It can be understood that, so that the voice device can perform voice interaction using the voice interaction strategy corresponding to the live echo cancellation suppression amount, the correspondence between the live echo cancellation suppression amount and the voice interaction strategy can be established by debugging before the voice device leaves the factory or before the above voice interaction method is used. Specifically, as shown in FIG. 3, the method for establishing a corresponding relationship in this embodiment may include the following steps.

S310: Test the live echo cancellation suppression amount of the voice device under different acoustic echo channels.

The voice device can be placed under different acoustic echo channels, and the live echo cancellation suppression amount of the voice device under different acoustic echo channels can be tested, so as to establish the corresponding relationship between the live echo cancellation suppression amount and the voice interaction strategy.

The different acoustic echo channels may be, for example, a high-reverberation environment simulated in the laboratory, the resonance of the manufacturing equipment, the nonlinear distortion of the playback channel, and the like.

It can be understood that, during the test, the acoustic echo channel is the only variable: the original audio played, the calculation method of the live echo cancellation suppression amount, and the other test conditions are kept the same, so that the live echo cancellation suppression amount of the voice device under each acoustic echo channel can be obtained.

S320: Define the voice interaction strategy matched with the live echo cancellation suppression amount under different acoustic echo channels, and save the correspondence between the voice interaction strategy and the live echo cancellation suppression amount.

After testing the live echo cancellation suppression amount of the voice device under different acoustic echo channels, the voice interaction strategy that matches the live echo cancellation suppression amount under each acoustic echo channel can be defined, and the correspondence between the live echo cancellation suppression amount and the voice interaction strategy can then be saved, so that the voice device autonomously selects a voice interaction strategy that matches the actual use environment in subsequent use.

In an implementation manner, the interruption index can be used as a measurement standard to establish the corresponding relationship between the live echo cancellation suppression amount and the voice interaction strategy. Specifically, in step S310, the live echo cancellation suppression amount and the interruption index of the voice device under different acoustic echo channels can be tested, so as to establish the corresponding relationship between the live echo cancellation suppression amount and the voice interaction strategy using the interruption index as a measurement standard.

In another implementation manner, the command word recognition rate may be used as a measurement standard to establish the corresponding relationship between the live echo cancellation suppression amount and the voice interaction strategy. Specifically, in step S310, the live echo cancellation suppression amount and the command word recognition rate of the voice device under different acoustic echo channels can be tested, so as to establish the corresponding relationship between the live echo cancellation suppression amount and the voice interaction strategy using the command word recognition rate as a measurement standard.

In this implementation manner, the live echo cancellation suppression amount of the voice device under different acoustic echo channels is tested first, the voice interaction strategy matched with the live echo cancellation suppression amount under each acoustic echo channel is then defined, and the corresponding relationship between the live echo cancellation suppression amount and the voice interaction strategy is then saved, so that the voice device autonomously selects a voice interaction strategy that matches the actual use environment in subsequent use.

FIG. 4 is a schematic flowchart of the second embodiment of the method for establishing a corresponding relationship of the present application. The method for establishing a corresponding relationship of this embodiment may include the following steps.

S410: Test the live echo cancellation suppression amount and the interruption index of the voice device under different acoustic echo channels.

The live echo cancellation suppression amount and the interruption index of the voice device under different acoustic echo channels can be tested, so that the corresponding relationship between the live echo cancellation suppression amount and the voice interaction strategy can be established using the interruption index as a measurement standard.

The interruption index may include interruption precision and/or interruption recall. Here, interruption refers to semantic interruption in full-duplex voice interaction: during playback, the voice device receives an instruction that falls within its recognizable semantic range, immediately stops playback, and updates its response.

Interruption recall is the ratio of the number of times the voice device receives a valid interruption and correctly recognizes the command word to the number of times the voice device should interrupt. It can be understood that the number of times the voice device should interrupt is the number of times a command word is actually input to the voice device. Interruption recall can represent how reliably input command words are recognized during full-duplex recognition.

Interruption precision is the ratio of the number of times the voice device receives a valid interruption and correctly recognizes the command word to the total number of valid interruptions received by the voice device. That total includes the number of valid interruptions in which the command word is correctly recognized and the number of valid interruptions in which the command word is incorrectly recognized. Interruption precision can be used to measure the proportion of correct interruptions relative to false interruptions caused by echo self-excitation or external interference during full-duplex recognition.

Together, interruption precision and interruption recall express the overall correctness of full-duplex dialogue recognition, so that whether to use full-duplex mode, and which full-duplex mode to use, can be determined accurately from these two measurements.

Interruption precision and interruption recall are calculated as follows:

Under the corresponding acoustic echo channel, the number of command words actually input is A;

Under the corresponding acoustic echo channel, the total number of valid interruptions received by the voice device is B;

Under the corresponding acoustic echo channel, the number of times that the voice device receives a valid interruption and the command word is correctly recognized is C;

Interruption precision = C/B;

Interruption recall = C/A.
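The two ratios can be computed directly from the counts A, B and C defined above. The following is a minimal sketch; the function name `interruption_metrics` and the example counts are hypothetical and do not come from this application's test data.

```python
def interruption_metrics(a: int, b: int, c: int) -> tuple[float, float]:
    """Return (interruption precision, interruption recall).

    a: number of command words actually input (A)
    b: total number of valid interruptions received (B)
    c: valid interruptions with a correctly recognized command word (C)
    """
    if a <= 0 or b <= 0:
        raise ValueError("A and B must be positive")
    if not 0 <= c <= min(a, b):
        raise ValueError("C must lie between 0 and min(A, B)")
    precision = c / b   # share of interruptions that were correct
    recall = c / a      # share of input command words that were caught
    return precision, recall


# Assumed counts for illustration: 100 commands spoken, 90 interruptions
# triggered, 81 of them with the correct command word.
precision, recall = interruption_metrics(a=100, b=90, c=81)
```

With these assumed counts the precision is 81/90 and the recall is 81/100.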

S420: Using the interruption index as a measurement standard, define the voice interaction strategy matched with the live echo cancellation suppression amount under different acoustic echo channels, and save the correspondence between the voice interaction strategy and the live echo cancellation suppression amount.

After testing the live echo cancellation suppression amount and the interruption index of the voice device under different acoustic echo channels, the corresponding relationship between the live echo cancellation suppression amount and the voice interaction strategy can be established using the interruption index as a measurement standard.

Optionally, it can first be confirmed whether the interruption index under each acoustic echo channel is qualified or excellent. Then, the minimum live echo cancellation suppression amount for which the interruption index is qualified is taken as the first preset value, or the highest live echo cancellation suppression amount for which the interruption index is unqualified is taken as the first preset value, wherein the first preset value is the highest standard value for the voice device to start the half-duplex mode. The lowest live echo cancellation suppression amount for which the interruption index far exceeds the qualified standard is taken as the second preset value, wherein the second preset value is the highest standard value for the voice device to start the second full-duplex mode.

In addition, the minimum live echo cancellation suppression amount with excellent interruption index can also be used as the second preset value, wherein the second preset value is the highest standard value for the voice device to activate the second full-duplex mode.

A live echo cancellation suppression amount at which the difference between each interruption index and its respective qualified value exceeds the corresponding first threshold is a live echo cancellation suppression amount with an excellent interruption index, wherein the first threshold is greater than 0.

Exemplarily, when the interruption index includes the interruption precision rate and the interruption recall rate, assume that the qualified value of the interruption precision rate is 80%, the first threshold of the interruption precision rate is 8%, the qualified value of the interruption recall rate is 70%, and the first threshold of the interruption recall rate is 12%. When the live echo cancellation suppression amount is 32dB, the interruption precision rate is 84% and the interruption recall rate is 89%. Because the difference between the interruption precision rate and its qualified value is 4%, which is less than the first threshold of the interruption precision rate, the 32dB live echo cancellation suppression amount is not one with an excellent interruption index. When the live echo cancellation suppression amount is 35dB, the interruption precision rate is 90% and the interruption recall rate is 93%. Because the difference between the interruption precision rate and its qualified value is 10%, which is greater than the first threshold of the interruption precision rate, and the difference between the interruption recall rate and its qualified value is 23%, which is greater than the first threshold of the interruption recall rate, the 35dB live echo cancellation suppression amount is one with an excellent interruption index.

Alternatively, a live echo cancellation suppression amount at which the ratio of each interruption index to its respective qualified value exceeds the corresponding second threshold is a live echo cancellation suppression amount with an excellent interruption index, wherein the second threshold is greater than 1.

Exemplarily, when the interruption index includes the interruption precision rate and the interruption recall rate, assume that the qualified value of the interruption precision rate is 80%, the second threshold of the interruption precision rate is 1.12, the qualified value of the interruption recall rate is 80%, and the second threshold of the interruption recall rate is 1.1. When the live echo cancellation suppression amount is 32dB, the interruption precision rate is 84% and the interruption recall rate is 89%. The ratio of the interruption recall rate to its qualified value is 1.112, which is greater than the second threshold of the interruption recall rate, but the ratio of the interruption precision rate to its qualified value is 1.05, which is less than the second threshold of the interruption precision rate, so the 32dB live echo cancellation suppression amount is not one with an excellent interruption index. When the live echo cancellation suppression amount is 35dB, the interruption precision rate is 90% and the interruption recall rate is 93%. Because the ratio of the interruption precision rate to its qualified value is 1.125, which is greater than the second threshold of the interruption precision rate, and the ratio of the interruption recall rate to its qualified value is 1.162, which is greater than the second threshold of the interruption recall rate, the 35dB live echo cancellation suppression amount is one with an excellent interruption index.

Understandably, the first threshold of each interruption index is greater than 0, and the second threshold of each interruption index is greater than 1. The qualified value, the first threshold, and the second threshold of each interruption index may be preset.
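The two excellence criteria, by difference and by ratio, can be checked with a few lines of code. The sketch below replays the two worked examples above with all rates expressed in percent; the function names `excellent_by_difference` and `excellent_by_ratio` and the dictionary layout are hypothetical.

```python
def excellent_by_difference(measured, qualified, first_threshold):
    """True if every index exceeds its qualified value by more than
    the corresponding first threshold (all values in percent)."""
    return all(measured[name] - qualified[name] > first_threshold[name]
               for name in qualified)


def excellent_by_ratio(measured, qualified, second_threshold):
    """True if every index divided by its qualified value exceeds
    the corresponding second threshold."""
    return all(measured[name] / qualified[name] > second_threshold[name]
               for name in qualified)


# Measurements used in both worked examples.
at_32db = {"precision": 84.0, "recall": 89.0}
at_35db = {"precision": 90.0, "recall": 93.0}

# Difference example: qualified values 80% / 70%, first thresholds 8% / 12%.
diff_qualified = {"precision": 80.0, "recall": 70.0}
first_threshold = {"precision": 8.0, "recall": 12.0}

# Ratio example: qualified values 80% / 80%, second thresholds 1.12 / 1.1.
ratio_qualified = {"precision": 80.0, "recall": 80.0}
second_threshold = {"precision": 1.12, "recall": 1.1}
```

Under both criteria the 32dB measurement fails on the precision rate alone, while the 35dB measurement passes on both indices, which matches the conclusions of the worked examples.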

If the interruption index includes the interruption precision rate and the interruption recall rate, then a live echo cancellation suppression amount at which both the interruption precision rate and the interruption recall rate exceed their respective qualified values is a live echo cancellation suppression amount for which the interruption index is qualified; a live echo cancellation suppression amount at which either the interruption precision rate or the interruption recall rate does not exceed its qualified value is a live echo cancellation suppression amount for which the interruption index is unqualified.

In this implementation manner, the live echo cancellation suppression amount and the interruption index of the voice device under different acoustic echo channels are tested first, and the interruption index is then used as the measurement standard to define the voice interaction strategy matched with the live echo cancellation suppression amount under each acoustic echo channel. A more appropriate corresponding relationship can thus be formulated with the interruption index as the measurement standard, so that the voice device autonomously selects a voice interaction strategy that matches the actual use environment in subsequent use.
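The qualified/unqualified rule, together with the two options for choosing the first preset value, can be sketched as follows. The function names are hypothetical, and the two data points are taken from the experimental example given later in this description (80% standard for both indices); with only two measurements the resulting candidates are necessarily coarse.

```python
def is_qualified(precision, recall, qualified_precision, qualified_recall):
    """Qualified only if BOTH indices exceed their qualified values."""
    return precision > qualified_precision and recall > qualified_recall


def first_preset_candidates(measurements, qualified_precision, qualified_recall):
    """Return (lowest qualified, highest unqualified) suppression in dB.

    `measurements` is an iterable of (suppression_db, precision, recall).
    Either element of the result is None if that group is empty.
    """
    qualified, unqualified = [], []
    for suppression_db, precision, recall in measurements:
        if is_qualified(precision, recall, qualified_precision, qualified_recall):
            qualified.append(suppression_db)
        else:
            unqualified.append(suppression_db)
    lowest_qualified = min(qualified) if qualified else None
    highest_unqualified = max(unqualified) if unqualified else None
    return lowest_qualified, highest_unqualified


# (suppression in dB, precision in %, recall in %)
measurements = [(28.0, 82.32, 75.25), (36.0, 91.20, 97.44)]
candidates = first_preset_candidates(measurements, 80.0, 80.0)
```

Here the 28dB measurement is unqualified because its recall rate is below 80%, so the two candidates for the first preset value are 36dB (lowest qualified) and 28dB (highest unqualified); more test points would narrow the gap between them.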

In order to better illustrate the method for establishing the corresponding relationship in the present application, the following specific examples for establishing the corresponding relationship are provided for illustrative description:

The following uses experimental data to show how severe environmental reverberation in the acoustic echo channel affects the live echo cancellation suppression amount and thereby reduces the interruption precision rate and the interruption recall rate, and proposes a way of dynamically adjusting the voice interaction strategy accordingly. The experimental environment is as follows:

(1) Set the interrupt precision rate standard of the voice device as 80%, and the interrupt recall rate standard as 80%.

(2) A small, narrow, strongly reflective conference room (about 10 square meters) and a standard analog recording studio (about 25 square meters, with sound-absorbing materials to reduce echo reflection) are chosen as the contrasting echo channel environments.

(3) The acoustic structure of the voice equipment is guaranteed to be normal, and the two acoustic echo channels are guaranteed to use the same voice equipment and live echo cancellation and suppression algorithm.

Experimental results:

(1) In the small, narrow, strongly reflective conference room, the frequency sweep test shows that the environmental reverberation is severe and the frequency points interfere strongly with each other, as shown in FIG. 5 (the upper part is the acoustic echo channel of the small conference room, and the lower part is the standard circuit echo reference channel):

At this time, the calculated live echo cancellation suppression amount is 28dB, the corresponding measured interruption precision rate is 82.32%, and the interruption recall rate is 75.25%, which is below the 80% standard. The half-duplex recognition rate test can still be passed normally. Therefore, at a suppression level of 28dB, the half-duplex mode should be selected for voice interaction.

(2) In the standard analog recording studio, the frequency sweep test shows that the reverberation performance of the environment is normal. The calculated live echo cancellation suppression amount is 36dB, the corresponding measured interruption precision rate is 91.20%, and the interruption recall rate is 97.44%; the test is passed and the interruption index is significantly higher than the standard. At a live echo cancellation suppression amount of 36dB, accurate responses can be obtained during full-duplex dialogue, and the first full-duplex mode should be selected for voice interaction.

(3) The rest of the tests will not be repeated. The voice device finally defines the corresponding relationship between the specific interval of the live echo cancellation suppression amount and the voice interaction strategy as follows:

The first interval of the live echo cancellation suppression amount: live echo cancellation suppression amount ≥ 35dB; the first full-duplex mode, with a wide skill field and a long activation time, is used for voice interaction.

The second interval of the live echo cancellation suppression amount: 30dB ≤ live echo cancellation suppression amount < 35dB; the second full-duplex mode, with a limited skill field and a short activation time, is used for voice interaction.

The third interval of the live echo cancellation suppression amount: live echo cancellation suppression amount ≤ 30dB; the half-duplex mode is used for voice interaction.
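A saved correspondence of this kind can be represented as a small table that the voice device loads and consults at run time. The sketch below is an assumption about storage and lookup, not the application's format: the table layout, the use of JSON, and the names `CORRESPONDENCE`, `lookup_strategy`, `save_table` and `load_table` are all hypothetical. A value of exactly 30dB is assigned to half-duplex here, following the rule that half-duplex applies when the suppression amount is less than or equal to the first preset value.

```python
import json

# Rows are checked from the highest interval down; the first match wins.
CORRESPONDENCE = [
    {"min_db": 35.0, "inclusive": True, "strategy": "first full-duplex"},
    {"min_db": 30.0, "inclusive": False, "strategy": "second full-duplex"},
]
DEFAULT_STRATEGY = "half-duplex"   # applies below every listed interval


def lookup_strategy(suppression_db, table=CORRESPONDENCE):
    """Return the strategy whose interval contains `suppression_db`."""
    for row in table:
        if row["inclusive"]:
            matches = suppression_db >= row["min_db"]
        else:
            matches = suppression_db > row["min_db"]
        if matches:
            return row["strategy"]
    return DEFAULT_STRATEGY


def save_table(table) -> str:
    """Serialize the correspondence, e.g. at factory debugging time."""
    return json.dumps(table)


def load_table(text: str):
    """Restore a previously saved correspondence."""
    return json.loads(text)
```

For the two measured environments, `lookup_strategy(28.0)` selects half-duplex and `lookup_strategy(36.0)` selects the first full-duplex mode, in line with the experimental conclusions above.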

Please refer to FIG. 6, which is a schematic structural diagram of an embodiment of a voice device of the present application. The voice device 10 includes a processor 12, a playback device 13, a recording device 14 and an echo cancellation circuit 15; the playback device 13, the recording device 14 and the echo cancellation circuit 15 are all coupled to the processor 12, and the processor 12 is configured to execute instructions to implement the above voice interaction method.

The processor 12 may also be referred to as a CPU (Central Processing Unit). The processor 12 may be an integrated circuit chip with signal processing capability. The processor 12 may also be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general purpose processor may be a microprocessor, or the processor 12 may be any conventional processor or the like.

The voice device 10 may further include a memory 11 for storing instructions and data required for the operation of the processor 12.

The processor 12 is configured to execute the instructions to implement the method provided by any of the above-mentioned embodiments of the voice interaction method of the present application and any non-conflicting combination.

Please refer to FIG. 7, which is a schematic structural diagram of a computer-readable storage medium in an embodiment of the present application. The computer-readable storage medium 20 of the embodiment of the present application stores instruction/program data 21, and the instruction/program data 21, when executed, implements the method provided by any embodiment of the voice interaction method of the present application and any non-conflicting combination. The instruction/program data 21 can be stored in the above-mentioned storage medium 20 as a program file in the form of a software product, so that a computer device (which may be a personal computer, a server, a network device, etc.) or a processor performs all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium 20 includes media such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, or terminal devices such as computers, servers, mobile phones, and tablets.

In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the device embodiments described above are only illustrative. For example, the division of units is only a logical function division, and there may be other division methods in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection of devices or units through some interfaces, and may be in electrical, mechanical or other forms.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.

The above are only embodiments of the present application and are not intended to limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made by using the contents of the description and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included in the scope of patent protection of the present application.

Claims

A voice interaction method, characterized in that the method comprises:

Performing an echo cancellation suppression amount calculation on the original recorded audio and the recorded audio of the voice device to obtain a live echo cancellation suppression amount, wherein the original recorded audio is the audio recorded by the voice device while playing the original audio by itself, and the recorded audio is the audio obtained after the original recorded audio is processed by an echo cancellation algorithm;

Performing voice interaction according to the voice interaction strategy corresponding to the live echo cancellation suppression amount.
The method of claim 1, wherein:

The corresponding relationship between the live echo cancellation suppression amount and the voice interaction strategy is preset;

And the smaller the suppression amount of the live echo cancellation, the lower the voice interaction performance index defined by the voice interaction strategy.
The method according to claim 2, wherein the corresponding relationship between the live echo cancellation suppression amount and the voice interaction strategy comprises:

In response to the live echo cancellation suppression amount being greater than a first preset value, performing voice interaction in a full-duplex mode;

in response to the live echo cancellation suppression amount being less than or equal to a first preset value, performing voice interaction in half-duplex mode;

Wherein, the first preset value is the highest standard value for enabling the half-duplex mode of the voice device.
The method according to claim 3, wherein the full-duplex mode includes a first full-duplex mode and a second full-duplex mode, the first full-duplex mode has more command words than the second full-duplex mode, and the performing voice interaction in a full-duplex mode in response to the live echo cancellation suppression amount being greater than a first preset value comprises:

In response to the live echo cancellation suppression amount being greater than a second preset value, performing voice interaction in a first full-duplex mode;

In response to the live echo cancellation suppression amount being greater than a first preset value and less than or equal to a second preset value, performing voice interaction in a second full-duplex mode;

Wherein, the second preset value is the highest standard value at which the voice device starts the second full-duplex mode.
The method of claim 4, wherein the activation time of the first full-duplex mode is longer than the activation time of the second full-duplex mode.
The method according to claim 1, wherein the obtaining a live echo cancellation suppression amount comprises:

Calculate the energy matrix of the original recorded audio and the recorded audio respectively;

Convert the difference between the energy matrix of the original recorded audio and the energy matrix of the recorded audio to obtain difference data;

Perform frame-by-frame processing on the difference data, and determine the peak value of each frame in the difference data;

The mean value of the peak values of at least some frames in the difference data is calculated to obtain the live echo cancellation suppression amount.
The method according to claim 6, wherein the calculating the mean value of the peak values of at least some frames in the difference data comprises:

Calculate the mean value of the peak values of all frames in the difference data after the echo cancellation algorithm converges.
The method according to claim 1, wherein the duration of the original audio is longer than 10s.
A method for establishing a corresponding relationship, characterized in that the method comprises:

Testing the live echo cancellation suppression amount of a voice device under different acoustic echo channels;

Defining a voice interaction strategy matched with the live echo cancellation suppression amount under the different acoustic echo channels, and saving the correspondence between the voice interaction strategy and the live echo cancellation suppression amount.
The method for establishing a corresponding relationship according to claim 9, wherein,

The testing the live echo cancellation suppression amount of the voice device under different acoustic echo channels comprises: testing the live echo cancellation suppression amount and the interruption index of the voice device under the different acoustic echo channels;

The defining the voice interaction strategy matched with the live echo cancellation suppression amount under the different acoustic echo channels comprises: using the interruption index as a measurement standard, defining the voice interaction strategy matched with the live echo cancellation suppression amount.
The method for establishing a corresponding relationship according to claim 10, wherein the using the interruption index as a measurement standard, defining the voice interaction strategy matched with the live echo cancellation suppression amount comprises:

taking the minimum live echo cancellation suppression amount for which the interruption index is qualified as the first preset value, or the highest live echo cancellation suppression amount for which the interruption index is unqualified as the first preset value;

Wherein, the first preset value is the highest standard value for enabling the half-duplex mode of the voice device.
The method for establishing a corresponding relationship according to claim 11, wherein the full-duplex mode includes a first full-duplex mode and a second full-duplex mode, the first full-duplex mode has more command words than the second full-duplex mode, and the using the interruption index as a measurement standard, defining the voice interaction strategy matched with the live echo cancellation suppression amount comprises:

Taking the minimum live echo cancellation suppression amount with an excellent interruption index as the second preset value;

Wherein, the second preset value is the highest standard value at which the voice device starts the second full-duplex mode,

A live echo cancellation suppression amount at which the difference between each interruption index and its respective qualified value exceeds the corresponding first threshold is a live echo cancellation suppression amount with an excellent interruption index, and the first threshold is greater than 0; or, a live echo cancellation suppression amount at which the ratio of each interruption index to its respective qualified value exceeds the corresponding second threshold is a live echo cancellation suppression amount with an excellent interruption index, and the second threshold is greater than 1.
The method according to claim 11, wherein the interruption index comprises interruption precision rate and interruption recall rate;

A live echo cancellation suppression amount at which the interruption precision rate and the interruption recall rate both exceed their respective qualified values is a live echo cancellation suppression amount for which the interruption index is qualified;

A live echo cancellation suppression amount at which either the interruption precision rate or the interruption recall rate does not exceed its respective qualified value is a live echo cancellation suppression amount for which the interruption index is unqualified.
A voice device, characterized in that the voice device comprises a processor, a playback device, a recording device and an echo cancellation circuit; the playback device, the recording device and the echo cancellation circuit are all coupled to the processor, and the processor is configured to execute instructions to implement the voice interaction method according to any one of claims 1-8.
A computer-readable storage medium, characterized in that the computer-readable storage medium is used for storing instructions/program data, and the instructions/program data can be executed to implement the voice interaction method according to any one of claims 1-8.