WO2022083502A1 - Voice interaction method and related apparatus, and method for establishing correspondence

Publication number
WO2022083502A1
WO2022083502A1 PCT/CN2021/123913 CN2021123913W WO2022083502A1 WO 2022083502 A1 WO2022083502 A1 WO 2022083502A1 CN 2021123913 W CN2021123913 W CN 2021123913W WO 2022083502 A1 WO2022083502 A1 WO 2022083502A1
Authority
WO
WIPO (PCT)
Prior art keywords
echo cancellation
suppression amount
live
voice interaction
interruption
Prior art date
Application number
PCT/CN2021/123913
Other languages
French (fr)
Chinese (zh)
Inventor
谢家晖
Original Assignee
广东美的白色家电技术创新中心有限公司
美的集团股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广东美的白色家电技术创新中心有限公司 and 美的集团股份有限公司
Publication of WO2022083502A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L21/003: Changing voice quality, e.g. pitch or formants
    • G10L21/007: Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L2021/02082: Noise filtering the noise being echo, reverberation of the speech

Definitions

  • the present application relates to the technical field of voice interaction, and in particular, to a voice interaction method, a related device, and a method for establishing a corresponding relationship.
  • voice interaction function of voice devices is rapidly strengthening and improving.
  • voice devices can use half-duplex voice or full-duplex voice interaction to interact with people or other devices, but the interaction effect is still insufficient.
  • the present application provides a voice interaction method, a related device, and a method for establishing a corresponding relationship, which can improve the interaction effect.
  • the present application provides a voice interaction method, the method includes:
  • Voice interaction is carried out with the voice interaction strategy corresponding to the amount of live echo cancellation suppression.
  • the corresponding relationship between the amount of live echo cancellation suppression and the voice interaction strategy includes:
  • the voice interaction is performed in a full-duplex mode
  • the voice interaction is performed in half-duplex mode.
  • the full-duplex mode includes a first full-duplex mode and a second full-duplex mode, and the first full-duplex mode has more command words than the second full-duplex mode; when the live echo cancellation suppression amount is greater than the first preset value, performing voice interaction in full-duplex mode includes:
  • the voice interaction is performed in the second full-duplex mode.
  • the activation time of the first full-duplex mode is longer than the activation time of the second full-duplex mode.
  • obtaining the suppression amount of live echo cancellation includes:
  • the mean value of the peak values of at least some frames in the difference data is calculated to obtain the suppression amount of live echo cancellation.
  • calculating the mean value of the peak value of at least some frames in the difference data including:
  • the duration of the original audio is longer than 10s.
  • the present application provides a method for establishing a corresponding relationship, the method comprising:
  • testing the live echo cancellation suppression amount of the voice device under different acoustic echo channels includes: testing the live echo cancellation suppression amount and interruption index of the voice device under different acoustic echo channels;
  • Define the voice interaction strategy matched by the live echo cancellation suppression amount under different acoustic echo channels including: defining the voice interaction strategy matched by the live echo cancellation suppression amount based on the interruption index.
  • the voice interaction strategy matched by the amount of live echo cancellation suppression is defined with the interruption index as the measurement standard, including:
  • the first preset value is the highest standard value for the voice device to activate the half-duplex mode.
  • the full-duplex mode includes a first full-duplex mode and a second full-duplex mode.
  • the first full-duplex mode has more command words than the second full-duplex mode, and is defined by the interruption index as a measure.
  • Voice interaction strategies matched by the amount of live echo cancellation suppression including:
  • the second preset value is the highest standard value for the voice device to activate the second full-duplex mode
  • the live echo cancellation suppression amount for which the difference between each interruption index and its respective qualified value exceeds the corresponding first threshold value is the live echo cancellation suppression amount with an excellent interruption index, and the first threshold value is greater than 0; or, the live echo cancellation suppression amount for which the ratios of the interruption indexes to their respective qualified values all exceed the corresponding second threshold value is the live echo cancellation suppression amount with an excellent interruption index, and the second threshold value is greater than 1.
  • the interruption indicators include interruption precision rate and interruption recall rate
  • the live echo cancellation suppression amount for which both the interruption precision rate and the interruption recall rate exceed their respective qualified values is the live echo cancellation suppression amount for which the interruption index is qualified;
  • the live echo cancellation suppression amount for which either the interruption precision rate or the interruption recall rate does not exceed the respective pass values is the live echo cancellation suppression amount for which the interruption index fails.
  • the application provides a kind of voice equipment
  • the voice equipment includes a processor, a playback device, a recording device and an echo cancellation circuit; the playback device, the recording device and the echo cancellation circuit are all coupled to the processor, and the memory stores a computer program that the processor executes to implement the above voice interaction method.
  • the present application provides a computer-readable storage medium for storing instructions/program data, and the instructions/program data can be executed to implement the above voice interaction method.
  • the method of the present application is as follows: first, perform an echo cancellation suppression amount calculation on the original recorded audio of the voice device and the recorded audio output after the echo cancellation algorithm to obtain a live echo cancellation suppression amount; then perform voice interaction with the voice interaction strategy corresponding to the live echo cancellation suppression amount. With the live echo cancellation suppression amount as the measure, a voice interaction strategy that is consistent with the actual use environment can be accurately selected, so as to achieve a better balance between recognition rate and interaction experience and improve the interaction effect.
  • FIG. 1 is a schematic flowchart of a first implementation method of the voice interaction method of the present application
  • FIG. 2 is a schematic flowchart of a second implementation method of the voice interaction method of the present application.
  • FIG. 3 is a schematic flowchart of a first implementation method of a method for establishing a corresponding relationship in the present application
  • FIG. 4 is a schematic flowchart of a second implementation method of a method for establishing a corresponding relationship in the present application
  • FIG. 5 is a schematic diagram of a comparison of sweep frequency spectrums in a high reverberation environment in a first embodiment of a method for establishing a corresponding relationship of the present application;
  • FIG. 6 is a schematic structural diagram of an embodiment of a voice device of the present application.
  • FIG. 7 is a schematic structural diagram of an embodiment of a computer-readable storage medium of the present application.
  • “first”, “second” and “third” in this application are only used for descriptive purposes, and should not be construed as indicating or implying relative importance or implying the number of indicated technical features. Thus, a feature defined as “first”, “second”, “third” may expressly or implicitly include at least one of that feature.
  • “a plurality of” means at least two, such as two, three, etc., unless otherwise expressly and specifically defined.
  • the voice interaction method of the present application is applied to a situation where a voice device performs voice interaction with a person or other devices.
  • many home appliances such as refrigerators, microwave ovens, air conditioners and rice cookers have the function of voice interaction and can be used as voice devices.
  • These voice devices generally need to record audio and perform wake word and command word recognition on the recorded audio to determine the voice command issued by the user or other devices, and then broadcast relevant content based on the voice command to achieve voice interaction.
  • voice devices can perform voice interaction in half-duplex mode or full-duplex mode.
  • When the voice device adopts the half-duplex mode, it cannot broadcast while it is receiving the user's voice command, and it cannot receive the user's command while it is broadcasting, so each party often has to wait for the other, and the experience deviates considerably from voice interaction between real people.
  • When the voice device adopts the full-duplex mode, it can broadcast and receive voice commands at the same time.
  • the voice device may not be able to accurately cancel the echo of the original recording, resulting in self-excited responses, so that misrecognition happens frequently.
  • the voice device can only use the preset voice interaction strategy to interact with the user or other devices, and so cannot strike a good balance between recognition rate and interaction experience.
  • a suitable voice interaction strategy can be independently selected according to the actual use environment.
  • the voice interaction method of the present application can perform the echo cancellation suppression amount calculation on the original recorded audio of the voice device and the recorded audio output after the echo cancellation algorithm to obtain the live echo cancellation suppression amount, and then perform voice interaction with the voice interaction strategy corresponding to the live echo cancellation suppression amount, so that the voice interaction strategy consistent with the actual use environment can be accurately selected based on the live echo cancellation suppression amount.
  • FIG. 1 is a schematic flowchart of the first embodiment of the voice interaction method of the present application.
  • the voice interaction method of the present application may include the following steps.
  • S110 Perform an echo cancellation suppression amount calculation on the original recorded audio and the recorded audio of the voice device to obtain a live echo cancellation suppression amount.
  • the original recorded audio of the voice device and the recorded audio after echo cancellation processing can be calculated first to obtain the real echo cancellation suppression amount, so that the voice interaction strategy corresponding to the real echo cancellation suppression amount can be used for subsequent voice interaction.
  • the original recorded audio is the audio recorded by the voice device while playing the original audio by itself.
  • the duration of the original audio can be more than 10s.
  • the duration of the original audio is 20-30s.
  • the original audio can be music or a long-term broadcast, for example, it can be boot music.
  • the recorded audio is the audio obtained by processing the original recorded audio through an echo cancellation algorithm.
  • the energy matrices of the original recorded audio and the recorded audio after echo cancellation processing may be calculated separately; then the live echo cancellation suppression amount is determined based on the difference between the energy matrix of the original recorded audio and the energy matrix of the recorded audio.
  • the unit of the live echo cancellation suppression amount can be unified to better measure the live echo cancellation suppression amount.
  • the unit of the echo cancellation suppression amount can be unified into decibels. Specifically, in the process of calculating the live echo cancellation suppression amount, the difference between the energy matrix of the original recorded audio and the energy matrix of the recorded audio can be converted into a decibel representation to obtain the difference data; the difference data is then processed, for example by a mean operation, to obtain the live echo cancellation suppression amount.
  • in the quiet parts of the original audio, the energy difference itself is relatively small. If the difference value of every point is used to calculate the live echo cancellation suppression amount, the values of the quiet parts will greatly reduce the live echo cancellation suppression amount.
  • the present application therefore processes the difference data frame by frame, takes the peak value of each frame and calculates the mean of the peaks to obtain the live echo cancellation suppression amount. This prevents the values of the many quiet parts in the stable original audio from pulling down the live echo cancellation suppression amount and improves the accuracy of its calculation.
  • the length of the frame in the frame-by-frame processing may correspond to the original audio.
  • the frame length of each original audio may be preset according to the length and/or fluctuation of each original audio, so as to improve the accuracy of the live echo cancellation suppression amount calculation.
  • the average value of the peak values of all frames in the difference data after the echo cancellation algorithm has converged can be calculated to obtain the live echo cancellation suppression amount, so as to ensure the accuracy of the live echo cancellation suppression amount calculation.
  • rawEchoEng = conv2(mic.^2, ones(wlen,1)/wlen, 'same');
  • ERL = -10*log10(resEchoEng./rawEchoEng + 10^-9);
  • ERL_max(kk) = max(ERL(((kk-1)*Lk+1):kk*Lk));
  • ERLmean = mean(ERL_max(30:end));
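  • The frame-by-frame calculation described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the applicant's implementation: the function names, the `skip_frames` parameter (standing in for the frames before the echo cancellation algorithm converges) and the `tiny` guard against division by zero are all hypothetical, and the moving average only approximates a 'same'-size convolution.

```python
import math


def moving_energy(signal, wlen):
    """Moving average of squared samples over a centred window of wlen samples.

    Samples outside the signal are treated as zero.
    """
    prefix = [0.0]
    for v in signal:
        prefix.append(prefix[-1] + v * v)
    n = len(signal)
    half = wlen // 2
    energy = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i - half + wlen)
        energy.append((prefix[hi] - prefix[lo]) / wlen)
    return energy


def live_suppression_db(mic, res, wlen, frame_len, skip_frames, eps=1e-9):
    """Mean of per-frame peak suppression (dB) after skipping the first frames.

    mic: original recorded audio; res: recorded audio after echo cancellation.
    """
    raw_eng = moving_energy(mic, wlen)
    res_eng = moving_energy(res, wlen)
    tiny = 1e-12  # assumption: avoids dividing by zero during digital silence
    erl = [-10.0 * math.log10(r / max(m, tiny) + eps)
           for r, m in zip(res_eng, raw_eng)]
    n_frames = len(erl) // frame_len
    peaks = [max(erl[k * frame_len:(k + 1) * frame_len]) for k in range(n_frames)]
    converged = peaks[skip_frames:]
    if not converged:
        raise ValueError("not enough frames after the skipped convergence period")
    return sum(converged) / len(converged)
```

  • For instance, if the processed recording is the original recording scaled by 0.1 in amplitude, the energy ratio is 0.01 and the sketch reports a suppression amount of about 20 dB.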
  • the root mean square of the original recorded audio of the voice device and the recorded audio output after the echo cancellation algorithm can be calculated to obtain the actual echo cancellation suppression amount.
  • S120 Perform voice interaction by using the voice interaction strategy corresponding to the live echo cancellation suppression amount.
  • after the live echo cancellation suppression amount is obtained, the corresponding voice interaction strategy can be used for subsequent voice interaction, so that, with the live echo cancellation suppression amount as the metric, a voice interaction strategy that matches the actual usage environment can be selected.
  • the live echo cancellation suppression amount is calculated through step S110.
  • the voice interaction strategy corresponding to the actual echo cancellation suppression amount can be directly used for voice interaction.
  • for example, as the live echo cancellation suppression amount becomes smaller, the voice interaction strategy changes from a full-duplex voice interaction strategy to a half-duplex voice interaction strategy, or the activation time and/or the number of command words in the full-duplex voice interaction strategy decreases.
  • the suppression amount of live echo cancellation becomes smaller, but the voice interaction performance index defined by the voice interaction strategy remains unchanged or increases.
  • the live echo cancellation suppression amount may be divided into at least two intervals in advance, and different intervals correspond to different voice interaction strategies.
  • after the live echo cancellation suppression amount is calculated through step S110, first confirm which interval the current live echo cancellation suppression amount of the voice device falls in, and then select the voice interaction strategy corresponding to that interval for voice interaction.
  • the live echo cancellation suppression amount can be divided into two sections, one section corresponds to the voice interaction strategy in the full-duplex mode, and the other section corresponds to the voice interaction strategy in the half-duplex mode.
  • different live echo cancellation suppression amounts correspond to different voice interaction strategies.
  • the smaller the live echo cancellation suppression amount, the shorter the activation time in the corresponding full-duplex voice interaction process; and/or, the smaller the live echo cancellation suppression amount, the fewer the command words in the corresponding full-duplex voice interaction process.
  • in step S120, the activation time and/or the command words in the full-duplex voice interaction process corresponding to the live echo cancellation suppression amount may be confirmed first, and then the voice interaction is performed based on the determined activation time and/or command words.
  • the voice interaction is performed based on the correspondence table between the live echo cancellation suppression amount and the activation time, and/or the correspondence table between the live echo cancellation suppression amount and the command word.
  • the activation time and/or the number of command words in the full-duplex voice interaction process corresponding to the live echo cancellation suppression amount can be calculated based on the activation time calculation formula and/or the calculation formula of the number of command words; the command words matching that number are then found; voice interaction is then performed based on the corresponding activation time and command words.
  • the command words in the command thesaurus may be sorted according to the correlation between the command words and the voice device, so that the command words with the highest correlation are confirmed from the command thesaurus.
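  • The table lookup and the ranking of command words described above could look roughly like the Python sketch below. The table rows, relevance scores and function names are hypothetical illustrations; the application does not disclose concrete values or formulas.

```python
def full_duplex_parameters(suppression_db, table):
    """Look up (activation_seconds, word_count) for a suppression amount.

    table holds hypothetical rows (min_suppression_db, activation_seconds,
    word_count); the row with the largest minimum that suppression_db
    exceeds is used. Returns None when no row applies.
    """
    chosen = None
    for min_db, activation_s, word_count in sorted(table):
        if suppression_db > min_db:
            chosen = (activation_s, word_count)
    return chosen


def select_command_words(thesaurus, count):
    """Return the `count` command words with the highest relevance score.

    thesaurus maps each command word to a hypothetical relevance score
    describing how strongly the word relates to the voice device.
    """
    ranked = sorted(thesaurus, key=lambda word: thesaurus[word], reverse=True)
    return ranked[:max(0, count)]
```

  • With such a table, a smaller suppression amount yields a shorter activation time and fewer command words, and the command words kept are always the ones most strongly related to the device.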
  • the echo cancellation suppression amount calculation is first performed on the original recorded audio of the voice device and the recorded audio output after the echo cancellation algorithm to obtain the live echo cancellation suppression amount; voice interaction is then performed with the voice interaction strategy corresponding to the live echo cancellation suppression amount. In this way, based on the live echo cancellation suppression amount, the voice interaction strategy that is consistent with the actual use environment can be accurately selected, so as to achieve a better balance between the recognition rate and the interactive experience and improve the effect of voice interaction.
  • the voice interaction method of the present application may be executed after the voice device is powered on. Specifically, the voice device is powered on; the audio is recorded while the boot music is played to obtain the original recorded audio; echo cancellation is performed on the original recorded audio to obtain the recorded audio output by the echo cancellation algorithm; the echo cancellation suppression amount calculation is performed on the original recorded audio and the recorded audio to obtain the live echo cancellation suppression amount; then the voice interaction strategy corresponding to the live echo cancellation suppression amount is used for voice interaction.
  • the voice interaction method of the present application may be executed when an instruction to modify the voice interaction strategy issued by the user is received.
  • FIG. 2 is a schematic flowchart of the second embodiment of the voice interaction method of the present application.
  • the voice interaction method of the present application may include the following steps.
  • S210 Perform an echo cancellation suppression amount operation on the original recorded audio of the voice device and the recorded audio output after the echo cancellation algorithm, to obtain a live echo cancellation suppression amount.
  • Please refer to step S110; the details are not repeated here.
  • S220 Confirm the echo cancellation suppression amount interval to which the live echo cancellation suppression amount belongs.
  • the echo cancellation suppression amount interval to which the live echo cancellation suppression amount belongs can be determined, and the voice interaction strategy corresponding to that interval can be selected for voice interaction, so as to select the appropriate voice interaction strategy according to the actual use environment.
  • At least two echo cancellation suppression amount intervals may be preconfigured, and different echo cancellation suppression amount intervals correspond to different voice interaction strategies.
  • three echo cancellation suppression amount intervals may be configured.
  • the three echo cancellation suppression amount intervals may be respectively: less than or equal to the first preset value; greater than the first preset value and less than or equal to the second preset value; greater than the second preset value.
  • the second preset value is greater than the first preset value.
  • the first preset value may be pre-configured, which may be adjusted according to the actual situation, or may be adjusted according to the user's instruction.
  • the first preset value may be 27dB, 30dB, 32dB and the like.
  • the second preset value may be pre-configured, which may be adjusted according to the actual situation, or may be adjusted according to the user's instruction.
  • the second preset value may be 34dB, 35dB, 38dB and the like.
  • In response to confirming that the live echo cancellation suppression amount is less than or equal to the first preset value, proceed to step S230; in response to confirming that it is greater than the first preset value and less than or equal to the second preset value, proceed to step S240; in response to confirming that it is greater than the second preset value, proceed to step S250.
  • In step S230, the voice interaction can be performed in half-duplex mode. This avoids using the full-duplex mode when echo cancellation is insufficient under the influence of the actual environment, which would make the false recognition rate too high, and so ensures the normal operation of voice interaction.
  • S240 Perform voice interaction in the second full-duplex mode.
  • the full-duplex mode may include a first full-duplex mode and a second full-duplex mode, wherein the number of command words in the first full-duplex mode may be greater than the number of command words in the second full-duplex mode.
  • the first full-duplex mode may include, in addition to command words strongly related to the voice device, command words that are weakly related to the voice device.
  • the second full-duplex mode may only include command words that are strongly related to the voice device, of course not limited thereto.
  • the second full-duplex mode may also include some command words that are frequently used but weakly related to the voice device. It can be understood that, the command words of the second full-duplex mode can also be the command words of the first full-duplex mode.
  • the activation time of the first full-duplex mode may also be longer than the activation time of the second full-duplex mode.
  • the voice device can recognize the command word in the recording content within the activation time after receiving the wake-up word.
  • the voice interaction can be performed in the second full-duplex mode, in which the command words that can be recognized in the full-duplex voice interaction and the activation time are set more strictly, to ensure that a more natural full-duplex dialogue experience can be provided within the category of the most frequently used command words.
  • S250 Perform voice interaction in the first full-duplex mode.
  • the voice interaction can be performed in the first full-duplex mode. Since the echo cancellation suppression amount is sufficient, there is almost no self-excited misrecognition, which allows long-duration voice interaction with users or other devices and the recognition of more command words, so that the voice interaction of the voice device can come sufficiently close to voice interaction between real people.
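  • The interval check of steps S220 to S250 can be summarised by the Python sketch below. The default values of 30 dB and 35 dB are taken from the example values mentioned for the first and second preset values; the enum and function names are hypothetical.

```python
from enum import Enum


class Strategy(Enum):
    HALF_DUPLEX = "half-duplex"                  # step S230
    SECOND_FULL_DUPLEX = "second full-duplex"    # step S240
    FIRST_FULL_DUPLEX = "first full-duplex"      # step S250


def select_strategy(suppression_db, first_preset_db=30.0, second_preset_db=35.0):
    """Map a live echo cancellation suppression amount (dB) to a strategy."""
    if second_preset_db <= first_preset_db:
        raise ValueError("the second preset value must exceed the first")
    if suppression_db <= first_preset_db:
        return Strategy.HALF_DUPLEX
    if suppression_db <= second_preset_db:
        return Strategy.SECOND_FULL_DUPLEX
    return Strategy.FIRST_FULL_DUPLEX
```

  • Both preset values are parameters because the text allows them to be adjusted according to the actual situation or the user's instruction.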
  • the voice interaction strategy corresponding to the live echo cancellation suppression amount can be used for voice interaction, and the correspondence between the live echo cancellation suppression amount and the voice interaction strategy can be established by debugging before the voice device leaves the factory or before the above voice interaction method is used.
  • the method for establishing a corresponding relationship in this embodiment may include the following steps.
  • the voice device can be placed under different acoustic echo channels, and the live echo cancellation suppression amount of the voice device under different acoustic echo channels can be tested, so as to establish the corresponding relationship between the live echo cancellation suppression amount and the voice interaction strategy.
  • the different acoustic echo channels can be the high reverberation environment simulated in the laboratory, the resonance of the manufacturing equipment and the nonlinear distortion of the playback channel, etc.
  • the acoustic echo channel can be used as the variable while the original audio played, the calculation method of the live echo cancellation suppression amount and the other test conditions are kept the same, so that the live echo cancellation suppression amount of the voice device under different acoustic echo channels can be obtained.
  • S320 Define the voice interaction strategy matched with the live echo cancellation suppression amount under different acoustic echo channels, and save the correspondence between the voice interaction strategy and the live echo cancellation suppression amount.
  • the interruption index can be used as a measurement standard to establish a corresponding relationship between the amount of live echo cancellation suppression and the voice interaction strategy.
  • the live echo cancellation suppression amount and the interruption index of the voice device under different acoustic echo channels can be tested, so as to establish a corresponding relationship between the live echo cancellation suppression amount and the voice interaction strategy using the interruption index as a measurement standard .
  • the command word recognition rate may be used as a measure to establish a corresponding relationship between the amount of live echo cancellation suppression and the speech interaction strategy.
  • the live echo cancellation suppression amount and the command word recognition rate of the voice device under different acoustic echo channels can be tested, so as to use the command word recognition rate as a measure to establish the correspondence between the live echo cancellation suppression amount and the voice interaction strategy.
  • FIG. 4 is a schematic flowchart of the second embodiment of the method for establishing a corresponding relationship of the present application.
  • the method for establishing a corresponding relationship of this embodiment may include the following steps.
  • the real-time echo cancellation suppression amount and the interruption index of the voice device under different acoustic echo channels can be tested, so that the corresponding relationship between the real-time echo cancellation inhibition amount and the voice interaction strategy can be established by using the interruption index as a measurement standard.
  • the interruption metrics may include interruption precision and/or interruption recall.
  • interruption refers to the semantic interruption in full-duplex voice interaction, which means that during the broadcast process, the voice device receives an instruction that conforms to the recognition semantic range, and immediately stops the playback and responds to the update.
  • Interruption recall rate refers to the ratio of the number of times that the voice device receives valid interruptions and the command word is correctly recognized to the number of times that the voice device should interrupt. It can be understood that the number of times the voice device should interrupt refers to the number of times the command word is actually input to the voice device. Interruption recall rate can represent the accurate recognition in the full-duplex recognition process.
  • Interruption precision rate refers to the ratio of the number of times the voice device receives valid interruptions and the command word is correctly recognized to the total number of times the voice device receives valid interruptions.
  • the total number of valid interruptions received by the voice device includes: the number of times that the voice device effectively interrupts and the command word is correctly recognized; and the number of times that the voice device effectively interrupts and the command word is incorrectly recognized.
  • the interruption precision rate can be used to measure the proportion of correct interruptions against false interruptions caused by echo self-excitation or external interference in the full-duplex recognition process.
  • the combination of interruption precision and interruption recall can express the overall correctness of full-duplex dialogue recognition, so that whether to use the full-duplex mode, and which type of full-duplex mode to use, can be accurately determined by measuring interruption precision and interruption recall.
  • the number of successful interruptions received by the voice device is B;
  • the number of times that the voice device receives valid interruptions and the command word is correctly recognized is C;
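The two metrics defined above can be sketched as a small helper. This is a minimal illustration, not the applicant's implementation: `B` and `C` follow the counts named in the text, while `A` (the number of times the device should interrupt, i.e. the number of command words actually input) and the function name are assumed labels, and the sample counts are invented.

```python
def interruption_metrics(a_should_interrupt, b_valid_interruptions, c_correct):
    """Return (precision, recall) for semantic interruption.

    a_should_interrupt    -- A: times a command word was actually input (assumed label)
    b_valid_interruptions -- B: total valid interruptions received by the device
    c_correct             -- C: valid interruptions whose command word was recognized correctly
    """
    if a_should_interrupt <= 0 or b_valid_interruptions <= 0:
        raise ValueError("A and B must be positive")
    precision = c_correct / b_valid_interruptions  # C / B
    recall = c_correct / a_should_interrupt        # C / A
    return precision, recall


# Invented counts, for illustration only.
precision, recall = interruption_metrics(100, 90, 81)
```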
  • S420 Define the voice interaction strategy matched with the live echo cancellation suppression amount under different acoustic echo channels using the interruption index as a measurement standard, and save the correspondence between the voice interaction strategy and the live echo cancellation suppression amount.
  • the correspondence between the live echo cancellation suppression amount and the voice interaction strategy can be established using the interruption index as the measurement standard.
  • the minimum live echo cancellation suppression amount for which the interruption index is qualified is taken as the first preset value, or the highest live echo cancellation suppression amount for which the interruption index is unqualified is taken as the first preset value, wherein the first preset value is the highest standard value for the voice device to start the half-duplex mode; the lowest live echo cancellation suppression amount for which the interruption index far exceeds the qualified standard is taken as the second preset value.
  • the second preset value is the highest standard value for the voice device to activate the second full-duplex mode.
  • the minimum live echo cancellation suppression amount with excellent interruption index can also be used as the second preset value, wherein the second preset value is the highest standard value for the voice device to activate the second full-duplex mode.
  • a live echo cancellation suppression amount for which the difference between each interruption index and its qualified value exceeds the corresponding first threshold is a live echo cancellation suppression amount with an excellent interruption index, wherein the first threshold is greater than 0.
  • the interruption index includes interruption precision rate and interruption recall rate
  • the qualified value of interruption precision rate is 80%
  • the first threshold of interruption precision rate is 8%
  • the qualified value of interruption recall rate is 70%
  • the first threshold of interruption recall rate is 12%.
  • the suppression amount of live echo cancellation is 32dB
  • the interruption precision rate is 84%
  • the interruption recall rate is 89%, because the difference between interruption precision rate and qualified value is 4%, which is less than the first threshold of interruption precision rate. Therefore, the 32dB live echo cancellation suppression amount is not an excellent live echo cancellation suppression amount for the interruption index.
  • the interruption precision rate is 90%, and the interruption recall rate is 93%, because the difference between interruption precision rate and qualified value is 10%, which is greater than the first threshold of interruption precision rate. And the difference between the interrupt recall rate and the qualified value is 23%, which is greater than the first threshold of the interrupt recall rate, so the 35dB live echo cancellation suppression amount is an excellent live echo cancellation suppression amount for the interrupt index.
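The difference-based test in the example above can be written out as a short check. This is a sketch only: the qualified values (80% and 70%) and first thresholds (8% and 12%) are the ones quoted in the example, while the function and dictionary names are assumptions made for illustration.

```python
# Values quoted in the example above, in percentage points.
QUALIFIED = {"precision": 80.0, "recall": 70.0}
FIRST_THRESHOLD = {"precision": 8.0, "recall": 12.0}


def is_excellent_by_difference(metrics, qualified=QUALIFIED, thresholds=FIRST_THRESHOLD):
    """True only if every metric exceeds its qualified value by more than its first threshold."""
    return all(metrics[name] - qualified[name] > thresholds[name] for name in qualified)


at_32_db = is_excellent_by_difference({"precision": 84.0, "recall": 89.0})
at_35_db = is_excellent_by_difference({"precision": 90.0, "recall": 93.0})
```

With the figures from the text, the 32dB case fails because the precision margin is 4 points (below 8), while the 35dB case passes with margins of 10 and 23 points.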
  • a live echo cancellation suppression amount for which the ratio of each interruption index to its qualified value exceeds the corresponding second threshold is a live echo cancellation suppression amount with an excellent interruption index, wherein the second threshold is greater than 1.
  • the interruption index includes interruption precision rate and interruption recall rate
  • the qualified value of interruption precision rate is 80%
  • the second threshold of interruption precision rate is 1.12
  • the qualified value of interruption recall rate is 80%.
  • the second threshold for interrupt recall is 1.1.
  • the live echo cancellation suppression amount is 32dB
  • the interruption precision rate is 84%
  • the interruption recall rate is 89%; because the ratio of the interruption precision rate to its qualified value is 1.05, which is less than the second threshold of the interruption precision rate, the 32dB live echo cancellation suppression amount is not a live echo cancellation suppression amount with an excellent interruption index, even though the ratio of the interruption recall rate to its qualified value is 1.112, which is greater than the second threshold of the interruption recall rate.
  • the interruption precision rate is 90%, and the interruption recall rate is 93%; because the ratio of the interruption precision rate to its qualified value is 1.125, which is greater than the second threshold of the interruption precision rate, and the ratio of the interruption recall rate to its qualified value is 1.162, which is greater than the second threshold of the interruption recall rate, the 35dB live echo cancellation suppression amount is a live echo cancellation suppression amount with an excellent interruption index.
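The ratio-based variant can be sketched the same way. The second thresholds (1.12 and 1.1) and the 80% precision qualified value come from the example; the 80% recall qualified value is an inference from the stated ratios (89/80 ≈ 1.112 and 93/80 ≈ 1.162), and all names are assumptions.

```python
# Precision qualified value is quoted in the text; the recall value of 80 is
# inferred from the stated ratios and should be treated as an assumption.
QUALIFIED = {"precision": 80.0, "recall": 80.0}
SECOND_THRESHOLD = {"precision": 1.12, "recall": 1.1}


def is_excellent_by_ratio(metrics, qualified=QUALIFIED, thresholds=SECOND_THRESHOLD):
    """True only if every metric's ratio to its qualified value exceeds its second threshold."""
    return all(metrics[name] / qualified[name] > thresholds[name] for name in qualified)


at_32_db = is_excellent_by_ratio({"precision": 84.0, "recall": 89.0})
at_35_db = is_excellent_by_ratio({"precision": 90.0, "recall": 93.0})
```

Because every index must pass, the 32dB case is rejected on precision alone (84/80 = 1.05) even though its recall ratio clears the threshold.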
  • the first threshold of each interruption indicator exceeds 0.
  • the second threshold for each interruption metric exceeds 1.
  • the qualified value of each interruption index, the first threshold and the second threshold may be preset.
  • the interruption index includes interruption precision rate and interruption recall rate
  • a live echo cancellation suppression amount for which the interruption precision rate and the interruption recall rate both exceed their respective qualified values is a live echo cancellation suppression amount for which the interruption index is qualified;
  • a live echo cancellation suppression amount for which either the interruption precision rate or the interruption recall rate does not exceed its qualified value is a live echo cancellation suppression amount for which the interruption index is unqualified.
  • the live echo cancellation suppression amount and the interruption index of the voice device under different acoustic echo channels are tested first, and then the interruption index is used as the measurement standard to define the voice interaction strategy matched with the live echo cancellation suppression amount under different acoustic echo channels, so that a more appropriate correspondence can be formulated with the interruption index as the measurement standard and the voice device can later autonomously select a voice interaction strategy consistent with the actual use environment.
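One way to picture the derivation of the two preset values from test results is the sketch below. It assumes the difference-based definition of "excellent" with the example values quoted earlier (80%/70% qualified, 8%/12% first thresholds), takes the first preset value as the highest unqualified suppression amount, and uses entirely invented measurements; the function names are hypothetical.

```python
def is_qualified(precision, recall, q_precision=80.0, q_recall=70.0):
    """Qualified: both metrics exceed their qualified values."""
    return precision > q_precision and recall > q_recall


def is_excellent(precision, recall, q_precision=80.0, q_recall=70.0,
                 t_precision=8.0, t_recall=12.0):
    """Excellent: both metrics exceed their qualified values by more than the first thresholds."""
    return precision - q_precision > t_precision and recall - q_recall > t_recall


def derive_presets(measurements):
    """measurements: iterable of (suppression_db, precision_pct, recall_pct).

    Returns (first_preset, second_preset); either is None if no channel qualifies for it.
    """
    unqualified = [db for db, p, r in measurements if not is_qualified(p, r)]
    excellent = [db for db, p, r in measurements if is_excellent(p, r)]
    first_preset = max(unqualified) if unqualified else None   # highest unqualified amount
    second_preset = min(excellent) if excellent else None      # lowest excellent amount
    return first_preset, second_preset


# Invented test results for five acoustic echo channels (illustration only).
channels = [(26, 70.0, 60.0), (30, 78.0, 72.0), (32, 84.0, 89.0),
            (35, 90.0, 93.0), (38, 93.0, 96.0)]
first_preset, second_preset = derive_presets(channels)
```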
  • the following uses experimental data to show how severe environmental reverberation in the acoustic echo channel affects the live echo cancellation suppression amount and thereby reduces the interruption precision rate and the interruption recall rate, and proposes dynamically adjusting the voice interaction strategy accordingly.
  • the experimental environment is as follows:
  • the acoustic structure of the voice equipment is guaranteed to be normal, and the two acoustic echo channels are guaranteed to use the same voice equipment and live echo cancellation and suppression algorithm.
  • the calculated real-time echo cancellation suppression amount is 28dB
  • the corresponding test interruption accuracy rate is 82.32%
  • the interruption recall rate is 75.25%.
  • the half-duplex recognition rate test can be passed normally. Under the suppression level of 28dB, the half-duplex mode should be selected for voice interaction.
  • the frequency sweep test shows that the reverberation performance of the environment is normal.
  • the calculated live echo cancellation suppression amount is 36dB
  • the corresponding test interruption accuracy rate is 91.20%
  • the interruption recall rate is 97.44%.
  • the test passed and the interruption index was significantly higher than the standard.
  • at a live echo cancellation suppression amount of 36dB, an accurate response can be obtained during the full-duplex dialogue process, and the first full-duplex mode should be selected for voice interaction.
  • the voice device finally defines the corresponding relationship between the specific interval of the live echo cancellation suppression amount and the voice interaction strategy as follows:
  • the first interval of the live echo cancellation suppression amount: live echo cancellation suppression amount > 35dB; the first full-duplex mode, with a wide skill field and a long activation time, is used for voice interaction.
  • the second interval of the live echo cancellation suppression amount: 30dB < live echo cancellation suppression amount ≤ 35dB; the second full-duplex mode, with a limited skill field and a short activation time, is used for voice interaction.
  • the third interval of the live echo cancellation suppression amount: live echo cancellation suppression amount ≤ 30dB; voice interaction is performed in the half-duplex mode.
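The three intervals amount to a simple threshold lookup, sketched below. The 30dB and 35dB defaults are the preset values from this embodiment, and the boundary handling (strictly greater than, otherwise less than or equal) follows the wording of the method; the function name and the returned strategy labels are hypothetical.

```python
def select_strategy(suppression_db, first_preset=30.0, second_preset=35.0):
    """Map a live echo cancellation suppression amount (dB) to a voice interaction strategy."""
    if suppression_db > second_preset:
        return "first_full_duplex"   # wide skill field, long activation time
    if suppression_db > first_preset:
        return "second_full_duplex"  # limited skill field, short activation time
    return "half_duplex"


# The two measured channels from the experiment above.
strategy_28 = select_strategy(28.0)
strategy_36 = select_strategy(36.0)
```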
  • FIG. 6 is a schematic structural diagram of an embodiment of a voice device of the present application.
  • the voice device 10 includes a processor 12, a playback device 13, a recording device 14 and an echo cancellation circuit 15; the playback device 13, the recording device 14 and the echo cancellation circuit 15 are all coupled to the processor 12, and the processor 12 is used for executing instructions to implement the above voice interaction method.
  • the processor 12 may also be referred to as a CPU (Central Processing Unit, central processing unit).
  • the processor 12 may be an integrated circuit chip with signal processing capability.
  • the processor 12 may also be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.
  • a general purpose processor may be a microprocessor, or the processor 12 may be any conventional processor or the like.
  • the voice device 10 may further include a memory 11 for storing instructions and data required for the operation of the processor 12.
  • the processor 12 is configured to execute the instructions to implement the method provided by any of the above-mentioned embodiments of the voice interaction method of the present application and any non-conflicting combination.
  • FIG. 7 is a schematic structural diagram of a computer-readable storage medium in an embodiment of the present application.
  • the computer-readable storage medium 20 of the embodiment of the present application stores the instruction/program data 21, and when the instruction/program data 21 is executed, implements the method provided by any embodiment of the voice interaction method of the present application and any non-conflicting combination.
  • the instruction/program data 21 can be stored in the above-mentioned storage medium 20 as a program file in the form of a software product, so that a computer device (which may be a personal computer, a server, a network device, etc.) or a processor performs all or part of the steps of the methods of the various embodiments of the present application.
  • the aforementioned storage medium 20 includes media such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, or terminal devices such as computers, servers, mobile phones, and tablets.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the device embodiments described above are only illustrative.
  • the division of units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in electrical, mechanical or other forms.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.

Abstract

Provided by the present application are a voice interaction method and a related apparatus, and a method for establishing a correspondence. The voice interaction method comprises: calculating the echo cancellation suppression amount of an original recorded audio and a recorded audio of a voice device to obtain a live echo cancellation suppression amount, wherein the original recorded audio is an audio recorded by the voice device while an original audio is played back by itself, and the recorded audio is an audio obtained after the original recorded audio is processed by an echo cancellation algorithm; and performing voice interaction by using a voice interaction strategy corresponding to the live echo cancellation suppression amount. The present application may improve the effect of voice interaction.

Description

Voice interaction method, related apparatus, and method for establishing a correspondence
【Technical Field】
The present application relates to the technical field of voice interaction, and in particular to a voice interaction method, a related apparatus, and a method for establishing a correspondence.
【Background】
As speech recognition technology matures, the voice interaction functions of voice devices are being rapidly strengthened and improved. At present, voice devices can use half-duplex or full-duplex voice interaction to interact with people or other devices, but the interaction effect is still unsatisfactory.
【Summary】
The present application provides a voice interaction method, a related apparatus, and a method for establishing a correspondence, which can improve the interaction effect.
To achieve the above object, the present application provides a voice interaction method, the method comprising:
performing an echo cancellation suppression amount calculation on the original recorded audio and the recorded audio of a voice device to obtain a live echo cancellation suppression amount, wherein the original recorded audio is the audio recorded by the voice device while it plays the original audio itself, and the recorded audio is the audio obtained by processing the original recorded audio with an echo cancellation algorithm;
performing voice interaction using the voice interaction strategy corresponding to the live echo cancellation suppression amount.
Wherein, the correspondence between the live echo cancellation suppression amount and the voice interaction strategy is preset;
and the smaller the live echo cancellation suppression amount, the lower the voice interaction performance index defined by the voice interaction strategy.
Wherein, the correspondence between the live echo cancellation suppression amount and the voice interaction strategy includes:
in response to the live echo cancellation suppression amount being greater than a first preset value, performing voice interaction in a full-duplex mode;
in response to the live echo cancellation suppression amount being less than or equal to the first preset value, performing voice interaction in a half-duplex mode.
Wherein, the full-duplex mode includes a first full-duplex mode and a second full-duplex mode, and the first full-duplex mode has more command words than the second full-duplex mode; performing voice interaction in the full-duplex mode if the live echo cancellation suppression amount is greater than the first preset value includes:
in response to the live echo cancellation suppression amount being greater than a second preset value, performing voice interaction in the first full-duplex mode;
in response to the live echo cancellation suppression amount being greater than the first preset value and less than or equal to the second preset value, performing voice interaction in the second full-duplex mode.
Wherein, the activation time of the first full-duplex mode is longer than the activation time of the second full-duplex mode.
Wherein, obtaining the live echo cancellation suppression amount includes:
calculating the energy matrix of the original recorded audio and the energy matrix of the recorded audio separately;
converting the difference between the energy matrix of the original recorded audio and the energy matrix of the recorded audio to obtain difference data;
dividing the difference data into frames and determining the peak value of each frame in the difference data;
calculating the mean of the peak values of at least some frames in the difference data to obtain the live echo cancellation suppression amount.
Wherein, calculating the mean of the peak values of at least some frames in the difference data includes:
calculating the mean of the peak values of all frames in the difference data after the echo cancellation algorithm has converged.
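The steps above can be sketched in plain Python under several stated assumptions. The source does not define the "energy matrix" or the conversion precisely, so this sketch reads the energy matrix as the mean squared amplitude of short non-overlapping blocks, and reads the conversion as expressing the energy difference in decibels via 10·log10 of the energy ratio. The names `block_size`, `blocks_per_frame` and `skip_frames` are hypothetical; `skip_frames` stands in for "after the echo cancellation algorithm has converged".

```python
import math


def block_energies(samples, block_size):
    """Mean squared amplitude of consecutive, non-overlapping blocks (assumed 'energy matrix')."""
    usable = len(samples) - len(samples) % block_size
    return [
        sum(x * x for x in samples[start:start + block_size]) / block_size
        for start in range(0, usable, block_size)
    ]


def live_suppression_db(raw, processed, block_size=160, blocks_per_frame=10,
                        skip_frames=0, eps=1e-12):
    """Estimate the live echo cancellation suppression amount in dB.

    raw       -- original recorded audio (before echo cancellation)
    processed -- recorded audio (after echo cancellation)
    """
    e_raw = block_energies(raw, block_size)
    e_out = block_energies(processed, block_size)
    # Difference data: energy difference expressed in decibels (assumed conversion).
    diff_db = [10.0 * math.log10((a + eps) / (b + eps)) for a, b in zip(e_raw, e_out)]
    # Framing: group blocks into frames and take the peak of each frame.
    usable = len(diff_db) - len(diff_db) % blocks_per_frame
    peaks = [max(diff_db[i:i + blocks_per_frame]) for i in range(0, usable, blocks_per_frame)]
    # Average only the frames after the assumed convergence period.
    converged = peaks[skip_frames:]
    if not converged:
        raise ValueError("not enough audio after the convergence period")
    return sum(converged) / len(converged)


# Synthetic check: a sine wave attenuated by exactly 30 dB.
raw = [math.sin(2 * math.pi * n / 16) for n in range(16000)]
processed = [x * 10 ** (-30 / 20) for x in raw]
suppression = live_suppression_db(raw, processed, skip_frames=2)
```

A real device would replace the synthetic signals with the microphone capture before and after its echo canceller, and would need its own rule for deciding when the canceller has converged.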
Wherein, the duration of the original audio is longer than 10 s.
To achieve the above object, the present application provides a method for establishing a correspondence, the method comprising:
testing the live echo cancellation suppression amount of a voice device under different acoustic echo channels;
defining the voice interaction strategy matched with the live echo cancellation suppression amount under different acoustic echo channels, and saving the correspondence between the voice interaction strategy and the live echo cancellation suppression amount.
Wherein, testing the live echo cancellation suppression amount of the voice device under different acoustic echo channels includes: testing the live echo cancellation suppression amount and the interruption index of the voice device under different acoustic echo channels;
defining the voice interaction strategy matched with the live echo cancellation suppression amount under different acoustic echo channels includes: defining, with the interruption index as the measurement standard, the voice interaction strategy matched with the live echo cancellation suppression amount.
Wherein, defining, with the interruption index as the measurement standard, the voice interaction strategy matched with the live echo cancellation suppression amount includes:
taking the minimum live echo cancellation suppression amount for which the interruption index is qualified as the first preset value, or taking the highest live echo cancellation suppression amount for which the interruption index is unqualified as the first preset value;
wherein the first preset value is the highest standard value for the voice device to start the half-duplex mode.
Wherein, the full-duplex mode includes a first full-duplex mode and a second full-duplex mode, and the first full-duplex mode has more command words than the second full-duplex mode; defining, with the interruption index as the measurement standard, the voice interaction strategy matched with the live echo cancellation suppression amount includes:
taking the minimum live echo cancellation suppression amount with an excellent interruption index as the second preset value,
wherein the second preset value is the highest standard value for the voice device to start the second full-duplex mode,
and a live echo cancellation suppression amount for which the difference between each interruption index and its qualified value exceeds the corresponding first threshold is a live echo cancellation suppression amount with an excellent interruption index, the first threshold being greater than 0; or, a live echo cancellation suppression amount for which the ratio of each interruption index to its qualified value exceeds the corresponding second threshold is a live echo cancellation suppression amount with an excellent interruption index, the second threshold being greater than 1.
Wherein, the interruption index includes an interruption precision rate and an interruption recall rate;
a live echo cancellation suppression amount for which the interruption precision rate and the interruption recall rate both exceed their respective qualified values is a live echo cancellation suppression amount for which the interruption index is qualified;
a live echo cancellation suppression amount for which either the interruption precision rate or the interruption recall rate does not exceed its qualified value is a live echo cancellation suppression amount for which the interruption index is unqualified.
To achieve the above object, the present application provides a voice device comprising a processor, a playback device, a recording device and an echo cancellation circuit; the playback device, the recording device and the echo cancellation circuit are all coupled to the processor, a computer program is stored in a memory, and the processor is configured to execute instructions to implement the above voice interaction method.
To achieve the above object, the present application provides a computer-readable storage medium for storing instructions/program data, and the instructions/program data can be executed to implement the above voice interaction method.
The method of the present application is as follows: first, an echo cancellation suppression amount calculation is performed on the original recorded audio of the voice device and the recorded audio output by the echo cancellation algorithm to obtain a live echo cancellation suppression amount; then voice interaction is performed using the voice interaction strategy corresponding to the live echo cancellation suppression amount. With the live echo cancellation suppression amount as the measurement standard, a voice interaction strategy consistent with the actual use environment can be selected accurately, so as to strike a better balance between recognition rate and interaction experience and improve the interaction effect.
【Description of Drawings】
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a first embodiment of the voice interaction method of the present application;
FIG. 2 is a schematic flowchart of a second embodiment of the voice interaction method of the present application;
FIG. 3 is a schematic flowchart of a first embodiment of the method for establishing a correspondence of the present application;
FIG. 4 is a schematic flowchart of a second embodiment of the method for establishing a correspondence of the present application;
FIG. 5 is a schematic diagram comparing frequency-sweep spectra in a high-reverberation environment in an embodiment of the method for establishing a correspondence of the present application;
FIG. 6 is a schematic structural diagram of an embodiment of the voice device of the present application;
FIG. 7 is a schematic structural diagram of an embodiment of the computer-readable storage medium of the present application.
【Detailed Description】
To help those skilled in the art better understand the technical solutions of the present application, the voice interaction method, related apparatus, and method for establishing a correspondence provided by the present application are described in further detail below with reference to the accompanying drawings and specific embodiments.
The terms "first", "second" and "third" in this application are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature defined as "first", "second" or "third" may expressly or implicitly include at least one such feature. In the description of the present application, "a plurality of" means at least two, such as two or three, unless otherwise expressly and specifically defined.
Reference herein to an "embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of the phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to a separate or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments where no conflict arises.
The voice interaction method of the present application applies where a voice device performs voice interaction with a person or another device. Taking the field of home appliances as an example, many appliances such as refrigerators, microwave ovens, air conditioners and rice cookers now have voice interaction functions and can serve as voice devices. These voice devices generally need to record audio and perform wake-word and command-word recognition on the recorded audio to determine the voice instruction issued by the user or another device, and then broadcast relevant content based on the voice instruction to achieve voice interaction. At present, voice devices can perform voice interaction in a half-duplex mode or a full-duplex mode. In the half-duplex mode, the voice device cannot broadcast while it is receiving a user's voice instruction, and cannot receive user instructions while it is broadcasting, so either party often has to wait, and the experience differs considerably from real person-to-person voice interaction. In the full-duplex mode, the voice device can broadcast and receive voice instructions at the same time, but interference from the external environment may prevent the voice device from accurately cancelling the echo in the original recording, producing a self-excited response and causing frequent misrecognition. In actual use, a voice device can only use a preset voice interaction strategy to interact with users or other devices; it cannot autonomously select a suitable voice interaction strategy according to the actual use environment, and so cannot strike a good balance between recognition rate and interaction experience. With the voice interaction method of the present application, a suitable voice interaction strategy can be selected autonomously according to the actual use environment.
本申请的语音交互方法可以对语音设备的原始录制音频与回声消除算法后输出的录音音频进行回声消除抑制量运算,获得实况回声消除抑制量;然后使用实况回声消除抑制量对应的语音交互策略进行语音交互,这样以实况回声消除抑制量为衡量标准可以准确地选择出与实际使用环境相符合的语音交互策略。The voice interaction method of the present application can perform the echo cancellation suppression amount calculation on the original recorded audio of the voice device and the recorded audio output after the echo cancellation algorithm to obtain the real echo cancellation suppression amount; and then use the voice interaction strategy corresponding to the live echo cancellation suppression amount to perform Voice interaction, so that the voice interaction strategy consistent with the actual use environment can be accurately selected based on the amount of live echo cancellation and suppression.
如图1所示,图1为本申请语音交互方法第一实施方式的流程示意图,本申请的语音交互方法可以包括以下步骤。As shown in FIG. 1 , FIG. 1 is a schematic flowchart of the first embodiment of the voice interaction method of the present application. The voice interaction method of the present application may include the following steps.
S110:对语音设备的原始录制音频及录音音频进行回声消除抑制量运算,获得实况回声消除抑制量。S110: Perform an echo cancellation suppression amount calculation on the original recorded audio and the recorded audio of the voice device to obtain a live echo cancellation suppression amount.
可以先对语音设备的原始录制音频与回声消除处理后的录音音频进行计算,以得到实况回声消除抑制量,以便后续采用实况回声消除抑制量对应的语音交互策略进行语音交互。The original recorded audio of the voice device and the recorded audio after echo cancellation processing can be calculated first to obtain the real echo cancellation suppression amount, so that the voice interaction strategy corresponding to the real echo cancellation suppression amount can be used for subsequent voice interaction.
The original recorded audio is the audio that the voice device records while it is itself playing the original audio. To avoid an inaccurate live echo cancellation suppression amount caused by original audio that is too short, the original audio may last 10 s or longer, preferably 20 s to 30 s. The original audio may be music or a relatively long broadcast, for example the power-on music. The recorded audio is the audio obtained by processing the original recorded audio with the echo cancellation algorithm.
It can be understood that the present application may use any method to calculate the live echo cancellation suppression amount.
In one implementation, the energy matrix of the original recorded audio and the energy matrix of the echo-cancelled recorded audio may first be calculated separately, and the live echo cancellation suppression amount is then determined from the difference between the two energy matrices. In addition, the unit of the live echo cancellation suppression amount may be standardized so that it can be measured more consistently. For example, the unit may be the decibel: when calculating the live echo cancellation suppression amount, the difference between the energy matrix of the original recorded audio and that of the recorded audio is converted to decibels to obtain difference data, and the difference data is then processed, for example by averaging, to obtain the live echo cancellation suppression amount.
Furthermore, the energy difference in quiet portions of the audio is inherently small. If the live echo cancellation suppression amount were calculated from the difference at every sample point, the values of the quiet portions would pull it down considerably, and the live echo cancellation suppression amount of relatively steady original audio could not be calculated accurately. The present application instead divides the difference data into frames, takes the peak of each frame, and averages the peaks to obtain the live echo cancellation suppression amount. This prevents the values of the quiet portions of steady original audio from pulling the result down and improves the accuracy of the calculation. The frame length may correspond to the original audio; specifically, a frame length may be preset for each original audio according to its length and/or fluctuation, to improve the accuracy of the calculation. Moreover, to prevent the portion of the recorded audio produced before the echo cancellation algorithm converges from affecting the live echo cancellation suppression amount of the voice device, the mean of the peaks of all frames of the difference data after the algorithm has converged may be calculated to obtain the live echo cancellation suppression amount, thereby ensuring the accuracy of the calculation.
Exemplarily, the live echo cancellation suppression amount is calculated as follows:
(1) Convolve the squared original recorded audio mic with a fixed vector to obtain the energy matrix of the original recorded audio;
rawEchoEng=conv2(mic.^2,ones(wlen,1)/wlen,'same');
(2) Convolve the squared echo-cancelled recorded audio aec with the fixed vector to obtain the energy matrix of the recorded audio;
resEchoEng=conv2(aec.^2,ones(wlen,1)/wlen,'same');
(3) Convert the energy difference between the two to dB;
ERL=-10*log10(resEchoEng./rawEchoEng+10^-9);
(4) Divide the data into frames, setting a suitable frame size;
Blocks=floor(signal_length/Lk);
(5) For each frame, calculate the maximum dB difference in that frame, that is, the peak;
ERL_max=zeros(Blocks,1);
for kk=1:Blocks
    ERL_max(kk)=max(ERL(((kk-1)*Lk+1):kk*Lk));
end
(6) Average the peaks of all frames after the algorithm has converged (assuming a convergence time of 30 frames) to obtain the final live echo cancellation suppression amount of the audio segment;
ERLmean=mean(ERL_max(30:end));
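The six steps above can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions, not the applicant's implementation: the function names, the default window, frame and convergence sizes, and the `1e-12` floor that guards against division by zero are all hypothetical, and the centered moving average only approximates the `'same'`-shaped convolution of the listing.

```python
import math


def smoothed_energy(samples, wlen):
    """Centered moving average of the squared samples, zero-padded at the edges."""
    squared = [s * s for s in samples]
    half = wlen // 2
    energy = []
    for i in range(len(squared)):
        lo = max(0, i - half)
        hi = min(len(squared), i - half + wlen)
        energy.append(sum(squared[lo:hi]) / wlen)
    return energy


def live_suppression_db(mic, aec, wlen=4, frame_len=8, converge_frames=2):
    """Mean of the per-frame peak suppression (dB) over frames after convergence."""
    if len(mic) != len(aec):
        raise ValueError("mic and aec must have the same length")
    raw = smoothed_energy(mic, wlen)
    res = smoothed_energy(aec, wlen)
    # The 1e-12 floor only avoids division by zero on silent input (an assumption,
    # not part of the original listing); 1e-9 mirrors the listing's log guard.
    erl = [-10.0 * math.log10(r / max(w, 1e-12) + 1e-9) for r, w in zip(res, raw)]
    blocks = len(erl) // frame_len
    peaks = [max(erl[k * frame_len:(k + 1) * frame_len]) for k in range(blocks)]
    settled = peaks[converge_frames:]  # skip frames recorded before convergence
    if not settled:
        raise ValueError("not enough frames after the convergence period")
    return sum(settled) / len(settled)
```

For instance, if the echo-cancelled signal is the raw signal scaled by 0.01 in amplitude, the energy ratio is about 1e-4 and the sketch reports a suppression of roughly 40 dB.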
As another example, the root mean square of the original recorded audio of the voice device and of the recorded audio output by the echo cancellation algorithm may be calculated to obtain the live echo cancellation suppression amount.
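The text does not say how the two root-mean-square values are combined. One plausible reading, shown here purely as an assumption, is to express their ratio in decibels; the function names and the `1e-12` floor are hypothetical.

```python
import math


def rms(samples):
    """Root mean square of a non-empty sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def rms_suppression_db(mic, aec):
    """Assumed formula: 20*log10(rms(mic) / rms(aec)); not stated in the source."""
    # The floor avoids division by zero when the echo-cancelled signal is silent.
    return 20.0 * math.log10(rms(mic) / max(rms(aec), 1e-12))
```

Unlike the frame-peak method, this variant weights quiet and loud passages alike, so it would be affected by the quiet portions discussed earlier.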
S120: Perform voice interaction using the voice interaction strategy corresponding to the live echo cancellation suppression amount.
After the live echo cancellation suppression amount has been calculated, voice interaction may be performed using the voice interaction strategy that corresponds to it. With the live echo cancellation suppression amount as the measure, a voice interaction strategy matching the actual usage environment can thus be selected.
It can be understood that there is a correspondence between the live echo cancellation suppression amount and the voice interaction strategy, and that this correspondence may be preconfigured, so that once the live echo cancellation suppression amount has been calculated in step S110, the corresponding voice interaction strategy can be used directly. The smaller the live echo cancellation suppression amount, the lower the voice interaction performance index defined by the voice interaction strategy. For example, as the live echo cancellation suppression amount decreases, the strategy changes from a full-duplex voice interaction strategy to a half-duplex one, or the activation time and/or the number of command words in the full-duplex strategy decreases. This does not exclude cases in which the live echo cancellation suppression amount decreases while the voice interaction performance index defined by the strategy stays unchanged or increases.
In one implementation, the live echo cancellation suppression amount may be divided in advance into at least two intervals, with different intervals corresponding to different voice interaction strategies. In this case, after the live echo cancellation suppression amount has been calculated in step S110, the interval containing the current live echo cancellation suppression amount of the voice device is determined first, and the voice interaction strategy corresponding to that interval is then selected. Optionally, the live echo cancellation suppression amount may be divided into two intervals, one corresponding to a full-duplex voice interaction strategy and the other to a half-duplex voice interaction strategy.
In another implementation, different live echo cancellation suppression amounts correspond to different voice interaction strategies. Exemplarily, the smaller the live echo cancellation suppression amount, the shorter the activation time in the corresponding full-duplex voice interaction; and/or the smaller the live echo cancellation suppression amount, the fewer the command words in the corresponding full-duplex voice interaction.
In step S120, the activation time and/or command words of the full-duplex voice interaction corresponding to the live echo cancellation suppression amount may first be determined, and voice interaction is then performed based on them. Optionally, they may be determined from a table mapping live echo cancellation suppression amounts to activation times and/or a table mapping live echo cancellation suppression amounts to command words. In other embodiments, the activation time and/or the number of command words corresponding to the live echo cancellation suppression amount may be calculated using an activation time formula and/or a command word count formula; that number of command words may then be retrieved from a command lexicon; and voice interaction is performed based on the resulting activation time and command words. The command words in the lexicon may be sorted by their relevance to the voice device, so that the most relevant command words, up to the calculated number, can be identified.
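A table lookup of this kind might look like the sketch below. Everything concrete in it is hypothetical and chosen only for illustration: the table rows, the activation times, the lexicon entries, and the rule that a row applies from its lower bound upward.

```python
import bisect

# Hypothetical correspondence table: (lower bound in dB, activation time in s,
# number of command words). Rows are sorted by lower bound.
CORRESPONDENCE = [
    (0.0, 3.0, 2),
    (30.0, 8.0, 4),
    (35.0, 15.0, 6),
]

# Hypothetical command lexicon, already sorted from most to least relevant.
LEXICON = [
    "power on", "power off", "set temperature",
    "set timer", "play music", "tell a joke",
]


def lookup_strategy(suppression_db):
    """Return (activation time, command words) for a live suppression amount."""
    bounds = [row[0] for row in CORRESPONDENCE]
    # Index of the last row whose lower bound does not exceed the measured value.
    idx = max(bisect.bisect_right(bounds, suppression_db) - 1, 0)
    _, activation_seconds, word_count = CORRESPONDENCE[idx]
    return activation_seconds, LEXICON[:word_count]
```

Because the lexicon is sorted by relevance, slicing its first `word_count` entries yields the most relevant command words, which matches the sorting described above.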
In this embodiment, an echo cancellation suppression amount calculation is first performed on the original recorded audio of the voice device and the recorded audio output by the echo cancellation algorithm to obtain the live echo cancellation suppression amount, and voice interaction is then performed using the corresponding voice interaction strategy. With the live echo cancellation suppression amount as the measure, a voice interaction strategy that matches the actual usage environment can be selected accurately, striking a better balance between recognition rate and interaction experience and improving the voice interaction.
Optionally, the voice interaction method of the present application may be executed after the voice device is powered on. Specifically, the voice device is powered on; audio is recorded while the power-on music is played, yielding the original recorded audio; echo cancellation is performed on the original recorded audio, yielding the recorded audio output by the echo cancellation algorithm; the echo cancellation suppression amount is calculated from the original recorded audio and the recorded audio, yielding the live echo cancellation suppression amount; and voice interaction is then performed using the voice interaction strategy corresponding to the live echo cancellation suppression amount.
In other implementations, the voice interaction method of the present application may be executed when an instruction from the user to revise the voice interaction strategy is received.
As shown in FIG. 2, which is a schematic flowchart of a second embodiment of the voice interaction method of the present application, the voice interaction method may include the following steps.
S210: Perform an echo cancellation suppression amount calculation on the original recorded audio of the voice device and the recorded audio output by the echo cancellation algorithm to obtain a live echo cancellation suppression amount.
Refer to step S110; details are not repeated here.
S220: Determine the echo cancellation suppression amount interval to which the live echo cancellation suppression amount belongs.
After the live echo cancellation suppression amount has been calculated, the echo cancellation suppression amount interval to which it belongs may be determined, and the voice interaction strategy corresponding to that interval is used for voice interaction, so that a suitable strategy is selected according to the actual usage environment.
Optionally, at least two echo cancellation suppression amount intervals may be preconfigured, with different intervals corresponding to different voice interaction strategies.
Exemplarily, three echo cancellation suppression amount intervals may be configured: less than or equal to a first preset value; greater than the first preset value and less than or equal to a second preset value; and greater than the second preset value. The second preset value is greater than the first preset value. The first preset value may be preconfigured, and may be adjusted according to actual conditions or according to a user instruction. Optionally, the first preset value may be 27 dB, 30 dB, 32 dB or the like. The second preset value may likewise be preconfigured, and may be adjusted according to actual conditions or according to a user instruction. Optionally, the second preset value may be 34 dB, 35 dB, 38 dB or the like.
If the live echo cancellation suppression amount is determined to be less than or equal to the first preset value, the method proceeds to step S230; if it is determined to be greater than the first preset value and less than or equal to the second preset value, the method proceeds to step S240; if it is determined to be greater than the second preset value, the method proceeds to step S250.
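The three-way branch of step S220 can be sketched as a small function. The function name and the returned labels are hypothetical; the default thresholds of 30 dB and 35 dB are taken from the optional values listed above and would be tuned per device.

```python
def select_mode(suppression_db, first_preset_db=30.0, second_preset_db=35.0):
    """Map a live suppression amount (dB) to one of the three interaction modes."""
    if first_preset_db >= second_preset_db:
        raise ValueError("the second preset value must exceed the first")
    if suppression_db <= first_preset_db:
        return "half_duplex"          # step S230
    if suppression_db <= second_preset_db:
        return "second_full_duplex"   # step S240
    return "first_full_duplex"        # step S250
```

Note that both boundaries are inclusive on the lower interval, so a measurement exactly equal to a preset value falls into the more conservative mode.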
S230: Perform voice interaction in the half-duplex mode.
After it has been determined that the live echo cancellation suppression amount is less than or equal to the first preset value, voice interaction may be performed in the half-duplex mode. This avoids the excessive misrecognition rate that the full-duplex mode would cause when echo cancellation is insufficient under the influence of the actual environment, and ensures that voice interaction proceeds normally.
When voice interaction is performed in the half-duplex mode, the voice device is generally not allowed to record and play at the same time, which avoids the self-excitation caused by insufficient echo suppression.
S240: Perform voice interaction in the second full-duplex mode.
The full-duplex mode may include a first full-duplex mode and a second full-duplex mode, where the first full-duplex mode may have more command words than the second full-duplex mode. In addition to command words strongly related to the voice device, the first full-duplex mode may include command words weakly related to the voice device. The second full-duplex mode may include only command words strongly related to the voice device, although it is not limited to this; for example, it may also include some command words that are used frequently but are only weakly related to the voice device. It can be understood that the command words of the second full-duplex mode may also be command words of the first full-duplex mode.
In addition, the activation time of the first full-duplex mode may be longer than that of the second full-duplex mode. The voice device recognizes command words in the audio recorded within the activation time after the wake word is received.
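The relationship between the two full-duplex modes can be illustrated with a small data structure. This is a sketch only: the class, the command words and the activation times are hypothetical examples, chosen so that the second mode's vocabulary is a subset of the first and its activation window is shorter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FullDuplexMode:
    name: str
    command_words: frozenset
    activation_seconds: float

    def accepts(self, word, seconds_since_wake):
        """True if the word is in the vocabulary and arrives within the activation time."""
        in_window = 0.0 <= seconds_since_wake <= self.activation_seconds
        return in_window and word in self.command_words


# Hypothetical vocabularies for a hypothetical air conditioner.
STRONG = frozenset({"power on", "power off", "set temperature"})
WEAK = frozenset({"what is the weather", "tell a joke"})

FIRST_MODE = FullDuplexMode("first_full_duplex", STRONG | WEAK, 20.0)
SECOND_MODE = FullDuplexMode("second_full_duplex", STRONG, 8.0)
```

With these values, a weakly related command is accepted only in the first mode, and a strongly related command spoken 9 s after the wake word is accepted in the first mode but not in the second.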
After it has been determined that the live echo cancellation suppression amount is greater than the first preset value and less than or equal to the second preset value, voice interaction may be performed in the second full-duplex mode. In this mode, the command words that can be recognized and/or the activation time are set more strictly, so that a reasonably natural full-duplex dialogue experience is still provided for the most frequently used command words.
S250: Perform voice interaction in the first full-duplex mode.
After it has been determined that the live echo cancellation suppression amount is greater than the second preset value, voice interaction may be performed in the first full-duplex mode. Because the echo cancellation suppression amount is sufficient, almost no misrecognition due to self-excitation occurs, so long voice interactions with users or other devices can be allowed and more command words can be recognized. When the echo cancellation suppression amount is sufficient, the voice interaction of the voice device can thus come close to real voice interaction between people.
It can be understood that, so that the voice device can perform voice interaction using the voice interaction strategy corresponding to the live echo cancellation suppression amount when in use, the correspondence between live echo cancellation suppression amounts and voice interaction strategies may be established through tuning before the voice device leaves the factory or before the above voice interaction method is used. Specifically, as shown in FIG. 3, the method for establishing the correspondence in this embodiment may include the following steps.
S310: Test the live echo cancellation suppression amount of the voice device under different acoustic echo channels.
The voice device may be placed under different acoustic echo channels and its live echo cancellation suppression amount measured under each, in order to establish the correspondence between live echo cancellation suppression amounts and voice interaction strategies.
The different acoustic echo channels may be, for example, a high-reverberation environment simulated in a laboratory, or conditions created to induce device resonance and nonlinear distortion of the playback channel.
It can be understood that, during testing, the acoustic echo channel may be the only variable, while other test conditions, such as the original audio played and the method of calculating the live echo cancellation suppression amount, are kept the same, to obtain the live echo cancellation suppression amount of the voice device under the different acoustic echo channels.
S320: Define the voice interaction strategy matching the live echo cancellation suppression amount under each acoustic echo channel, and save the correspondence between voice interaction strategies and live echo cancellation suppression amounts.
After the live echo cancellation suppression amount of the voice device has been measured under the different acoustic echo channels, the voice interaction strategy matching each live echo cancellation suppression amount may be defined, and the correspondence between live echo cancellation suppression amounts and voice interaction strategies is then saved, so that in subsequent use the voice device autonomously selects a voice interaction strategy that matches the actual usage environment.
In one implementation, an interruption index may be used as the measure for establishing the correspondence between live echo cancellation suppression amounts and voice interaction strategies. Specifically, in step S310, both the live echo cancellation suppression amount and the interruption index of the voice device may be measured under the different acoustic echo channels, so that the correspondence is established with the interruption index as the measure.
In another implementation, the command word recognition rate may be used as the measure for establishing the correspondence. Specifically, in step S310, both the live echo cancellation suppression amount and the command word recognition rate of the voice device may be measured under the different acoustic echo channels, so that the correspondence is established with the command word recognition rate as the measure.
In this embodiment, the live echo cancellation suppression amount of the voice device is first measured under different acoustic echo channels, the voice interaction strategy matching each live echo cancellation suppression amount is then defined, and the correspondence between live echo cancellation suppression amounts and voice interaction strategies is saved, so that in subsequent use the voice device autonomously selects a voice interaction strategy that matches the actual usage environment.
As shown in FIG. 4, which is a schematic flowchart of a second embodiment of the method for establishing the correspondence of the present application, the method for establishing the correspondence may include the following steps.
S410: Test the live echo cancellation suppression amount and the interruption index of the voice device under different acoustic echo channels.
The live echo cancellation suppression amount and the interruption index of the voice device may be measured under the different acoustic echo channels, so that the correspondence between live echo cancellation suppression amounts and voice interaction strategies is established with the interruption index as the measure.
The interruption index may include interruption precision and/or interruption recall. Here, interruption means semantic interruption in full-duplex voice interaction: while broadcasting, the voice device receives an instruction within its recognized semantic range and immediately stops playback and updates its response. Interruption recall is the ratio of the number of times the voice device receives a valid interruption with the command word correctly recognized to the number of times the voice device should have been interrupted. It can be understood that the number of times the voice device should have been interrupted is the number of times a command word was actually input to the voice device. Interruption recall reflects how accurately commands are recognized during full-duplex recognition. Interruption precision is the ratio of the number of times the voice device receives a valid interruption with the command word correctly recognized to the total number of valid interruptions the voice device receives. The total number of valid interruptions comprises the number of valid interruptions with the command word correctly recognized and the number of valid interruptions with the command word incorrectly recognized. Interruption precision indicates the proportion of correct interruptions, as opposed to false interruptions caused by echo self-excitation or external interference, during full-duplex recognition. Together, interruption precision and interruption recall express the overall correctness of full-duplex dialogue recognition, so using them as the measure allows an accurate decision on whether to use a full-duplex mode and which full-duplex mode to use.
Interruption precision and interruption recall are calculated as follows:
Under a given acoustic echo channel, the number of times a command word is actually input is A;
Under the same acoustic echo channel, the number of valid interruptions received by the voice device is B;
Under the same acoustic echo channel, the number of valid interruptions received by the voice device with the command word correctly recognized is C;
Interruption precision = C/B;
Interruption recall = C/A.
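The two ratios can be computed directly from the counts A, B and C. In the sketch below the function name and the input checks are assumptions added for illustration; in particular, returning a precision of 0.0 when no valid interruption was received is a convention chosen here, not something the text specifies.

```python
def interruption_metrics(commands_input, valid_interruptions, correct_interruptions):
    """Return (precision, recall) from the counts A, B and C defined above."""
    a, b, c = commands_input, valid_interruptions, correct_interruptions
    if a <= 0:
        raise ValueError("at least one command word must have been input")
    if c > b or c > a or min(b, c) < 0:
        raise ValueError("C cannot exceed A or B, and counts cannot be negative")
    precision = c / b if b else 0.0  # assumed convention when B is zero
    recall = c / a
    return precision, recall
```

For example, with A = 100 command words input, B = 90 valid interruptions and C = 81 of them correctly recognized, precision is 81/90 = 0.9 and recall is 81/100 = 0.81.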
S420: With the interruption index as the measure, define the voice interaction strategy matching the live echo cancellation suppression amount under each acoustic echo channel, and save the correspondence between voice interaction strategies and live echo cancellation suppression amounts.
After the live echo cancellation suppression amount and the interruption index of the voice device have been measured under the different acoustic echo channels, the correspondence between echo cancellation suppression amounts and voice interaction strategies may be established with the interruption index as the measure.
可选地,可以先确认各个声学回声通道下的打断指标是否合格以及是否优秀。然后将打断指标合格的最低实况回声消除抑制量作为第一预设值,或将打断指标不合格的最高实况回声消除抑制量作为第一预设值,其中,第一预设值为语音设备启动半双工模式的最高标准值;将打断指标远超于合格标准的最低实况回声消除抑制量作为第二预设值。其中,第二预设值为语音设备启动第二全双工模式的最高标准值。Optionally, it is possible to first confirm whether the interruption index under each acoustic echo channel is qualified and excellent. Then, the minimum live echo cancellation suppression amount for which the interruption index is qualified is taken as the first preset value, or the highest live echo cancellation suppression amount for which the interruption index is unqualified is used as the first preset value, wherein the first preset value is the voice The highest standard value for the device to start the half-duplex mode; the lowest real-time echo cancellation suppression amount for which the interruption index far exceeds the qualified standard is taken as the second preset value. Wherein, the second preset value is the highest standard value for the voice device to activate the second full-duplex mode.
另外,还可将打断指标优秀的最低实况回声消除抑制量作为第二预设值,其中,第二预设值为语音设备启动第二全双工模式的最高标准值。In addition, the minimum live echo cancellation suppression amount with excellent interruption index can also be used as the second preset value, wherein the second preset value is the highest standard value for the voice device to activate the second full-duplex mode.
Under a difference-based definition, a live echo cancellation suppression amount is one at which the interruption index is excellent if, for every interruption index, the difference between the index and its qualified value exceeds the corresponding first threshold, where each first threshold is greater than 0.
For example, when the interruption index includes interruption precision and interruption recall, assume that the qualified value of the interruption precision is 80%, the first threshold of the interruption precision is 8%, the qualified value of the interruption recall is 70%, and the first threshold of the interruption recall is 12%. When the live echo cancellation suppression amount is 32 dB, the interruption precision is 84% and the interruption recall is 89%. The difference between the interruption precision and its qualified value is 4%, which is less than the first threshold of the interruption precision, so 32 dB is not a live echo cancellation suppression amount at which the interruption index is excellent. When the live echo cancellation suppression amount is 35 dB, the interruption precision is 90% and the interruption recall is 93%. The difference between the interruption precision and its qualified value is 10%, which is greater than the first threshold of the interruption precision, and the difference between the interruption recall and its qualified value is 23%, which is greater than the first threshold of the interruption recall, so 35 dB is a live echo cancellation suppression amount at which the interruption index is excellent.
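The difference-based criterion can be illustrated with a minimal Python sketch. The function name and the dictionary layout are assumptions made for illustration only; the numeric values are those of the example above, expressed in percentage points.

```python
def is_excellent_by_difference(metrics, qualified, first_thresholds):
    """Return True if every index exceeds its qualified value by more than
    its first threshold (all values in percentage points)."""
    return all(
        metrics[name] - qualified[name] > first_thresholds[name]
        for name in qualified
    )


# Values from the example: qualified values and first thresholds.
QUALIFIED = {"precision": 80.0, "recall": 70.0}
FIRST_THRESHOLDS = {"precision": 8.0, "recall": 12.0}

# 32 dB case: the precision difference is 4, below the threshold of 8.
print(is_excellent_by_difference(
    {"precision": 84.0, "recall": 89.0}, QUALIFIED, FIRST_THRESHOLDS))

# 35 dB case: differences of 10 and 23 both exceed their thresholds.
print(is_excellent_by_difference(
    {"precision": 90.0, "recall": 93.0}, QUALIFIED, FIRST_THRESHOLDS))
```

The first call evaluates to False and the second to True, in line with the conclusions of the example.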
Alternatively, under a ratio-based definition, a live echo cancellation suppression amount is one at which the interruption index is excellent if, for every interruption index, the ratio of the index to its qualified value exceeds the corresponding second threshold, where each second threshold is greater than 1.
For example, when the interruption index includes interruption precision and interruption recall, assume that the qualified value of the interruption precision is 80%, the second threshold of the interruption precision is 1.12, the qualified value of the interruption recall is 80%, and the second threshold of the interruption recall is 1.1. When the live echo cancellation suppression amount is 32 dB, the interruption precision is 84% and the interruption recall is 89%. The ratio of the interruption precision to its qualified value is 1.05, which is less than the second threshold of the interruption precision, although the ratio of the interruption recall to its qualified value is approximately 1.112, which is greater than the second threshold of the interruption recall. Because not every index satisfies its threshold, 32 dB is not a live echo cancellation suppression amount at which the interruption index is excellent. When the live echo cancellation suppression amount is 35 dB, the interruption precision is 90% and the interruption recall is 93%. The ratio of the interruption precision to its qualified value is 1.125, which is greater than the second threshold of the interruption precision, and the ratio of the interruption recall to its qualified value is approximately 1.162, which is greater than the second threshold of the interruption recall, so 35 dB is a live echo cancellation suppression amount at which the interruption index is excellent.
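The ratio-based criterion can be sketched in the same way. Again, the function name and data layout are illustrative assumptions; the numbers are those of the example above.

```python
def is_excellent_by_ratio(metrics, qualified, second_thresholds):
    """Return True if, for every index, the ratio of the index to its
    qualified value exceeds the corresponding second threshold."""
    return all(
        metrics[name] / qualified[name] > second_thresholds[name]
        for name in qualified
    )


# Values from the example: qualified values (percent) and second thresholds.
QUALIFIED = {"precision": 80.0, "recall": 80.0}
SECOND_THRESHOLDS = {"precision": 1.12, "recall": 1.1}

# 32 dB case: 84 / 80 = 1.05 does not exceed 1.12.
print(is_excellent_by_ratio(
    {"precision": 84.0, "recall": 89.0}, QUALIFIED, SECOND_THRESHOLDS))

# 35 dB case: 90 / 80 = 1.125 and 93 / 80 = 1.1625 both exceed their thresholds.
print(is_excellent_by_ratio(
    {"precision": 90.0, "recall": 93.0}, QUALIFIED, SECOND_THRESHOLDS))
```

The first call evaluates to False and the second to True.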
It can be understood that the first threshold of each interruption index is greater than 0 and the second threshold of each interruption index is greater than 1. The qualified value, the first threshold, and the second threshold of each interruption index may be set in advance.
If the interruption index includes interruption precision and interruption recall, a live echo cancellation suppression amount at which both the interruption precision and the interruption recall exceed their respective qualified values is one at which the interruption index is qualified; a live echo cancellation suppression amount at which either the interruption precision or the interruption recall does not exceed its qualified value is one at which the interruption index is unqualified.
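The qualified/unqualified rule can be sketched as follows. The function name is an assumption for illustration, and the sample figures (qualified values of 80% for both indices, and the two measured pairs) are borrowed from the experimental example of this embodiment purely as inputs.

```python
def is_qualified(metrics, qualified):
    """Return True only if every index strictly exceeds its qualified value;
    a single index at or below its qualified value makes the result unqualified."""
    return all(metrics[name] > qualified[name] for name in qualified)


# Qualified values assumed for illustration (percent).
QUALIFIED = {"precision": 80.0, "recall": 80.0}

# Recall of 75.25 does not exceed 80, so the index is unqualified.
print(is_qualified({"precision": 82.32, "recall": 75.25}, QUALIFIED))

# Both indices exceed 80, so the index is qualified.
print(is_qualified({"precision": 91.20, "recall": 97.44}, QUALIFIED))
```

The first call evaluates to False and the second to True.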
In this embodiment, the live echo cancellation suppression amount and the interruption index of the voice device are first measured under different acoustic echo channels. Then, with the interruption index as the criterion, the voice interaction strategy that matches the live echo cancellation suppression amount under each acoustic echo channel is defined. Using the interruption index as the criterion yields an appropriate correspondence, so that in subsequent use the voice device can autonomously select the voice interaction strategy that suits the actual operating environment.
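As a rough sketch of how the two preset values might be derived from a table of test results, the following Python code takes the lowest qualified suppression amount as the first preset value and the lowest excellent suppression amount (difference-based definition) as the second preset value. The function names, the data layout, and the measurement table are hypothetical: the table is assembled for illustration and is not a reported test series, so the resulting presets are not the values adopted for any actual device.

```python
def is_qualified(metrics, qualified):
    """Every index must strictly exceed its qualified value."""
    return all(metrics[n] > qualified[n] for n in qualified)


def is_excellent(metrics, qualified, first_thresholds):
    """Every index must exceed its qualified value by more than its first threshold."""
    return all(metrics[n] - qualified[n] > first_thresholds[n] for n in qualified)


def derive_presets(measurements, qualified, first_thresholds):
    """Return (first_preset, second_preset) in dB, or None where no level qualifies.

    measurements: list of (suppression_db, metrics) pairs, one per echo channel.
    """
    qualified_levels = [db for db, m in measurements if is_qualified(m, qualified)]
    excellent_levels = [
        db for db, m in measurements if is_excellent(m, qualified, first_thresholds)
    ]
    first = min(qualified_levels) if qualified_levels else None
    second = min(excellent_levels) if excellent_levels else None
    return first, second


# Hypothetical measurement table (suppression in dB, indices in percent).
MEASUREMENTS = [
    (28.0, {"precision": 82.32, "recall": 75.25}),
    (32.0, {"precision": 84.0, "recall": 89.0}),
    (35.0, {"precision": 90.0, "recall": 93.0}),
    (36.0, {"precision": 91.20, "recall": 97.44}),
]
QUALIFIED = {"precision": 80.0, "recall": 80.0}
FIRST_THRESHOLDS = {"precision": 8.0, "recall": 12.0}

print(derive_presets(MEASUREMENTS, QUALIFIED, FIRST_THRESHOLDS))
```

With this hypothetical table the sketch returns 32.0 dB as the first preset value and 35.0 dB as the second. Choosing the highest unqualified suppression amount as the first preset value instead would be an equally valid variant of the method.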
To better illustrate the method of establishing the correspondence in the present application, the following specific example is provided:
The experimental data below show how severe reverberation in the acoustic echo channel reduces the live echo cancellation suppression amount and thereby lowers the interruption precision and the interruption recall, and they motivate a dynamic adjustment of the voice interaction strategy at that suppression level. The experimental setup is as follows:
(1) The interruption precision standard of the voice device is set to 80%, and the interruption recall standard is set to 80%.
(2) A narrow, highly reflective small conference room (about 10 square meters) and a standard simulated recording studio (about 25 square meters, fitted with sound-absorbing material to reduce echo reflections) are chosen as contrasting echo channel environments.
(3) The acoustic structure of the voice device is verified to be normal, and the same voice device and the same algorithm for computing the live echo cancellation suppression amount are used under both acoustic echo channels.
Experimental results:
(1) In the narrow, highly reflective small conference room, a frequency sweep test shows that the reverberation of the environment is severe and that the frequency points interfere strongly with one another, as shown in FIG. 5 (top: the acoustic echo channel of the narrow, highly reflective small conference room; bottom: the standard circuit echo reference channel).
In this case, the calculated live echo cancellation suppression amount is 28 dB, the corresponding measured interruption precision is 82.32%, and the interruption recall is 75.25%. Echo self-excitation occurs frequently, so the interruption recall fails to meet the standard. The half-duplex recognition rate test, however, passes normally, so at a suppression level of 28 dB the half-duplex mode should be selected for voice interaction.
(2) In the standard simulated recording studio, a frequency sweep test shows that the reverberation of the environment is normal. The calculated live echo cancellation suppression amount is 36 dB, the corresponding measured interruption precision is 91.20%, and the interruption recall is 97.44%; the test passes, and the interruption index is clearly above the standard. With a live echo cancellation suppression amount of 36 dB, accurate responses can be obtained during a full-duplex dialogue, so the first full-duplex mode should be selected for voice interaction.
(3) The remaining tests are not described in detail. The correspondence finally defined for this voice device between the intervals of the live echo cancellation suppression amount and the voice interaction strategies is as follows:
First interval of the live echo cancellation suppression amount: live echo cancellation suppression amount ≥ 35 dB; the first full-duplex mode, with a wide skill domain and a long activation time, is used for voice interaction.
Second interval of the live echo cancellation suppression amount: 30 dB ≤ live echo cancellation suppression amount < 35 dB; the second full-duplex mode, with a restricted skill domain and a short activation time, is used for voice interaction.
Third interval of the live echo cancellation suppression amount: live echo cancellation suppression amount < 30 dB; the half-duplex mode is used for voice interaction.
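The interval table above can be sketched as a simple lookup. The constant and function names are hypothetical, and assigning a suppression amount of exactly 30 dB to the second interval is an assumption of this sketch.

```python
# Interval boundaries taken from the table above (dB).
FIRST_PRESET_DB = 30.0   # boundary between half-duplex and second full-duplex
SECOND_PRESET_DB = 35.0  # boundary between second and first full-duplex


def select_strategy(suppression_db):
    """Map a live echo cancellation suppression amount (dB) to a strategy name."""
    if suppression_db >= SECOND_PRESET_DB:
        # Wide skill domain, long activation time.
        return "first full-duplex"
    if suppression_db >= FIRST_PRESET_DB:
        # Restricted skill domain, short activation time.
        return "second full-duplex"
    return "half-duplex"


for level in (28.0, 32.0, 36.0):
    print(level, select_strategy(level))
```

For the two measured environments, 28 dB maps to the half-duplex mode and 36 dB maps to the first full-duplex mode, consistent with the experimental conclusions.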
Please refer to FIG. 6, which is a schematic structural diagram of an embodiment of the voice device of the present application. The voice device 10 includes a processor 12, a playback device 13, a recording device 14, and an echo cancellation circuit 15. The playback device 13, the recording device 14, and the echo cancellation circuit 15 are all coupled to the processor 12, and the processor 12 is configured to execute instructions to implement the voice interaction method described above.
The processor 12 may also be referred to as a CPU (Central Processing Unit). The processor 12 may be an integrated circuit chip with signal processing capability. The processor 12 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor 12 may be any conventional processor.
The voice device 10 may further include a memory 11 for storing the instructions and data required for the operation of the processor 12.
The processor 12 is configured to execute instructions to implement the method provided by any embodiment of the voice interaction method of the present application, or by any non-conflicting combination of those embodiments.
Please refer to FIG. 7, which is a schematic structural diagram of a computer-readable storage medium in an embodiment of the present application. The computer-readable storage medium 20 of this embodiment stores instructions/program data 21 which, when executed, implement the method provided by any embodiment of the voice interaction method of the present application, or by any non-conflicting combination of those embodiments. The instructions/program data 21 may form a program file stored in the storage medium 20 in the form of a software product, so that a computer device (which may be a personal computer, a server, a network device, or the like) or a processor executes all or some of the steps of the methods of the embodiments of the present application. The storage medium 20 includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, or a terminal device such as a computer, a server, a mobile phone, or a tablet.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. Multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The above are only embodiments of the present application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (15)

  1. A voice interaction method, characterized in that the method comprises:
    performing an echo cancellation suppression amount calculation on original recorded audio and recorded audio of a voice device to obtain a live echo cancellation suppression amount, wherein the original recorded audio is audio recorded by the voice device while the voice device itself plays original audio, and the recorded audio is audio obtained by processing the original recorded audio with an echo cancellation algorithm;
    performing voice interaction using the voice interaction strategy corresponding to the live echo cancellation suppression amount.
  2. The method according to claim 1, characterized in that
    the correspondence between the live echo cancellation suppression amount and the voice interaction strategy is preset;
    and the smaller the live echo cancellation suppression amount, the lower the voice interaction performance index defined by the voice interaction strategy.
  3. The method according to claim 2, characterized in that the correspondence between the live echo cancellation suppression amount and the voice interaction strategy comprises:
    in response to the live echo cancellation suppression amount being greater than a first preset value, performing voice interaction in a full-duplex mode;
    in response to the live echo cancellation suppression amount being less than or equal to the first preset value, performing voice interaction in a half-duplex mode;
    wherein the first preset value is the highest standard value at which the voice device enables the half-duplex mode.
  4. The method according to claim 3, characterized in that the full-duplex mode comprises a first full-duplex mode and a second full-duplex mode, the first full-duplex mode has more command words than the second full-duplex mode, and the performing voice interaction in the full-duplex mode in response to the live echo cancellation suppression amount being greater than the first preset value comprises:
    in response to the live echo cancellation suppression amount being greater than a second preset value, performing voice interaction in the first full-duplex mode;
    in response to the live echo cancellation suppression amount being greater than the first preset value and less than or equal to the second preset value, performing voice interaction in the second full-duplex mode;
    wherein the second preset value is the highest standard value at which the voice device enables the second full-duplex mode.
  5. The method according to claim 4, characterized in that the activation time of the first full-duplex mode is longer than the activation time of the second full-duplex mode.
  6. The method according to claim 1, characterized in that the obtaining a live echo cancellation suppression amount comprises:
    calculating an energy matrix of the original recorded audio and an energy matrix of the recorded audio, respectively;
    converting the difference between the energy matrix of the original recorded audio and the energy matrix of the recorded audio to obtain difference data;
    dividing the difference data into frames and determining the peak value of each frame in the difference data;
    calculating the mean of the peak values of at least some of the frames in the difference data to obtain the live echo cancellation suppression amount.
  7. The method according to claim 6, characterized in that the calculating the mean of the peak values of at least some of the frames in the difference data comprises:
    calculating the mean of the peak values of all frames in the difference data after the echo cancellation algorithm has converged.
  8. The method according to claim 1, characterized in that the duration of the original audio is longer than 10 s.
  9. A method for establishing a correspondence, characterized in that the method comprises:
    testing the live echo cancellation suppression amount of a voice device under different acoustic echo channels;
    defining the voice interaction strategy matched by the live echo cancellation suppression amount under the different acoustic echo channels, and saving the correspondence between the voice interaction strategy and the live echo cancellation suppression amount.
  10. The method for establishing a correspondence according to claim 9, characterized in that
    the testing the live echo cancellation suppression amount of a voice device under different acoustic echo channels comprises: testing the live echo cancellation suppression amount and an interruption index of the voice device under the different acoustic echo channels;
    the defining the voice interaction strategy matched by the live echo cancellation suppression amount under the different acoustic echo channels comprises: defining, with the interruption index as the criterion, the voice interaction strategy matched by the live echo cancellation suppression amount.
  11. The method for establishing a correspondence according to claim 10, characterized in that the defining, with the interruption index as the criterion, the voice interaction strategy matched by the live echo cancellation suppression amount comprises:
    taking the lowest live echo cancellation suppression amount at which the interruption index is qualified as a first preset value, or taking the highest live echo cancellation suppression amount at which the interruption index is unqualified as the first preset value;
    wherein the first preset value is the highest standard value at which the voice device enables the half-duplex mode.
  12. The method for establishing a correspondence according to claim 11, characterized in that the full-duplex mode comprises a first full-duplex mode and a second full-duplex mode, the first full-duplex mode has more command words than the second full-duplex mode, and the defining, with the interruption index as the criterion, the voice interaction strategy matched by the live echo cancellation suppression amount comprises:
    taking the lowest live echo cancellation suppression amount at which the interruption index is excellent as a second preset value;
    wherein the second preset value is the highest standard value at which the voice device enables the second full-duplex mode,
    a live echo cancellation suppression amount at which the difference between each interruption index and its respective qualified value exceeds the corresponding first threshold is a live echo cancellation suppression amount at which the interruption index is excellent, the first threshold being greater than 0; or, a live echo cancellation suppression amount at which the ratio of each interruption index to its respective qualified value exceeds the corresponding second threshold is a live echo cancellation suppression amount at which the interruption index is excellent, the second threshold being greater than 1.
  13. The method according to claim 11, characterized in that the interruption index comprises interruption precision and interruption recall;
    a live echo cancellation suppression amount at which both the interruption precision and the interruption recall exceed their respective qualified values is a live echo cancellation suppression amount at which the interruption index is qualified;
    a live echo cancellation suppression amount at which either the interruption precision or the interruption recall does not exceed its respective qualified value is a live echo cancellation suppression amount at which the interruption index is unqualified.
  14. A voice device, characterized in that the voice device comprises a processor, a playback device, a recording device, and an echo cancellation circuit; the playback device, the recording device, and the echo cancellation circuit are all coupled to the processor, and the processor is configured to execute instructions to implement the voice interaction method according to any one of claims 1-8.
  15. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store instructions/program data, and the instructions/program data can be executed to implement the voice interaction method according to any one of claims 1-8.
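
As an informal illustration of the calculation recited in claims 6 and 7 (not part of the claims), the following Python sketch computes a suppression amount from two equal-length sample sequences. Several details are assumptions made only for this sketch: the "energy matrix" is taken to be the short-time energy of fixed-length blocks, the "conversion" is taken to be an expression of the energy difference in decibels, and the parameter `first_converged_frame` stands in for whatever convergence detection an implementation would use. All function and parameter names are hypothetical.

```python
import math


def block_energies(signal, block_len):
    """Short-time energy of consecutive, non-overlapping blocks."""
    return [
        sum(s * s for s in signal[i:i + block_len])
        for i in range(0, len(signal) - block_len + 1, block_len)
    ]


def live_suppression_db(raw, processed, block_len=32, blocks_per_frame=8,
                        first_converged_frame=0, eps=1e-12):
    """Mean of per-frame peak energy differences (dB) between the original
    recorded audio (raw) and the echo-cancelled audio (processed)."""
    if len(raw) != len(processed):
        raise ValueError("signals must have equal length")
    raw_e = block_energies(raw, block_len)
    proc_e = block_energies(processed, block_len)
    # Energy difference expressed in dB; eps avoids log of zero.
    diff_db = [
        10.0 * math.log10((r + eps) / (p + eps))
        for r, p in zip(raw_e, proc_e)
    ]
    # Split the difference data into frames and take the peak of each frame.
    frames = [
        diff_db[i:i + blocks_per_frame]
        for i in range(0, len(diff_db) - blocks_per_frame + 1, blocks_per_frame)
    ]
    # Keep only frames after the (assumed) convergence point.
    peaks = [max(frame) for frame in frames[first_converged_frame:]]
    if not peaks:
        raise ValueError("no frames available after convergence")
    return sum(peaks) / len(peaks)


# Synthetic check: a 440 Hz tone attenuated by a factor of 100 in amplitude.
raw = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
processed = [0.01 * x for x in raw]
print(round(live_suppression_db(raw, processed), 2))
```

For the synthetic tone, an amplitude attenuation by a factor of 100 corresponds to an energy reduction of about 40 dB, which is the value the sketch should report.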
PCT/CN2021/123913 2020-10-22 2021-10-14 Voice interaction method and related apparatus, and method for establishing correspondence WO2022083502A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011142062.8 2020-10-22
CN202011142062.8A CN112581972A (en) 2020-10-22 2020-10-22 Voice interaction method, related device and corresponding relation establishing method

Publications (1)

Publication Number Publication Date
WO2022083502A1 true WO2022083502A1 (en) 2022-04-28

Family

ID=75119927

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/123913 WO2022083502A1 (en) 2020-10-22 2021-10-14 Voice interaction method and related apparatus, and method for establishing correspondence

Country Status (2)

Country Link
CN (1) CN112581972A (en)
WO (1) WO2022083502A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112581972A (en) * 2020-10-22 2021-03-30 广东美的白色家电技术创新中心有限公司 Voice interaction method, related device and corresponding relation establishing method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105070290A (en) * 2015-07-08 2015-11-18 苏州思必驰信息科技有限公司 Man-machine voice interaction method and system
US20180090152A1 (en) * 2016-09-28 2018-03-29 Panasonic Intellectual Property Corporation Of America Parameter prediction device and parameter prediction method for acoustic signal processing
CN108847226A (en) * 2017-04-12 2018-11-20 声音猎手公司 The agency managed in human-computer dialogue participates in
CN109994108A (en) * 2017-12-29 2019-07-09 微软技术许可有限责任公司 Full-duplex communication technology for the session talk between chat robots and people
WO2020063798A1 (en) * 2018-09-27 2020-04-02 深圳市冠旭电子股份有限公司 Echo cancellation method, device and intelligent loudspeaker box
CN111445918A (en) * 2020-03-23 2020-07-24 深圳市友杰智新科技有限公司 Method and device for reducing false awakening of intelligent voice equipment and computer equipment
CN111696569A (en) * 2020-06-29 2020-09-22 美的集团武汉制冷设备有限公司 Echo cancellation method for household appliance, terminal and storage medium
CN112581972A (en) * 2020-10-22 2021-03-30 广东美的白色家电技术创新中心有限公司 Voice interaction method, related device and corresponding relation establishing method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130332156A1 (en) * 2012-06-11 2013-12-12 Apple Inc. Sensor Fusion to Improve Speech/Audio Processing in a Mobile Device
DK2822263T3 (en) * 2013-07-05 2019-06-17 Sennheiser Communications As Communication device with echo cancellation
CN109389979B (en) * 2018-12-05 2022-05-20 广东美的制冷设备有限公司 Voice interaction method, voice interaction system and household appliance
CN110211599B (en) * 2019-06-03 2021-07-16 Oppo广东移动通信有限公司 Application awakening method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN112581972A (en) 2021-03-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21881920

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.09.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21881920

Country of ref document: EP

Kind code of ref document: A1